Archive for the Bad Statistics Category

Why Universities should ignore League Tables

Posted in Bad Statistics, Education on January 12, 2017 by telescoper

Very busy day today but I couldn’t resist a quick post to draw attention to a new report by an independent think tank called the Higher Education Policy Institute (PDF available here; high-level summary there). It says a lot of things that I’ve discussed on this blog already and I agree strongly with most of the conclusions. The report is focused on the international league tables, but much of what it says (in terms of methodological criticism) also applies to the national tables. Unfortunately, I doubt if this will make much difference to the behaviour of the bean-counters who have now taken control of higher education, for whom strategies intended to ‘game’ position in these largely bogus tables seem to be the main focus of their policy rather than the pursuit of teaching and scholarship, which is what universities should actually be for.

Here is the introduction to the high-level summary:

Rankings of global universities, such as the THE World University Rankings, the QS World University Rankings and the Academic Ranking of World Universities claim to identify the ‘best’ universities in the world and then list them in rank order. They are enormously influential, as universities and even governments alter their policies to improve their position.

The new research shows the league tables are based almost exclusively on research-related criteria and the data they use are unreliable and sometimes worse. As a result, it is unwise and undesirable to give the league tables so much weight.

Later on we find some recommendations:

The report considers the inputs for the various international league tables and discusses their overall weaknesses before considering some improvements that could be made. These include:

  • ranking bodies should audit and validate data provided by universities;
  • league table criteria should move beyond research-related measures;
  • surveys of reputation should be dropped, given their methodological flaws;
  • league table results should be published in more complex ways than simple numerical rankings; and
  • universities and governments should not exaggerate the importance of rankings when determining priorities.

No doubt the purveyors of these rankings – I’ll refrain from calling them “rankers” – will mount a spirited defence of their business, but I agree with the view expressed in this report that as they stand these league tables are at best meaningless and at worst damaging.

Do you have Confidence in the Teaching Excellence Framework?

Posted in Bad Statistics on January 4, 2017 by telescoper

The  Teaching Excellence Framework (TEF) is, along with a number of other measures in the 2016 Higher Education and Research Bill, causing a lot of concern in academic circles (see, e.g., this piece by Stephen Curry). One of the intentions of the TEF is to use relatively simple metrics to gauge “teaching quality” in higher education institutions. On top of the fundamental questions of exactly what “teaching quality” means and how it might be measured in any reliable way, there is now another worry: the whole TEF system is to be run by people who are statistically illiterate.

To demonstrate this assertion I refer you to this excerpt from the official TEF documentation:


The highlighted “explanation” of what a confidence interval means is false. It’s not slightly misleading. It’s not poorly worded. It’s just false.

I don’t know who from HEFCE wrote the piece above, but it’s clearly someone who does not understand the basic concepts of statistics.

I can’t imagine what kind of garbled nonsense will come out of the TEF if this is the level of understanding displayed by the people running it.  That garbage will also be fed into the university league tables with potentially devastating effects on individuals, departments and institutions, so my gripe is not just about semantics – this level of statistical illiteracy could have very serious consequences for Higher Education in the UK.

Perhaps HEFCE should call in some experts in statistics to help? Oh, no. I forgot. This country has had enough of experts…





Straw Poll on Statistical Computing

Posted in Bad Statistics, The Universe and Stuff on December 20, 2016 by telescoper

The abstract of my previous (reblogged) post claims that R is “the premier language of statistical computing”. That may be true for the wider world of statistics, and I like R very much, but in my experience astronomers and cosmologists are much more likely to do their coding in Python.  It’s certainly the case that astronomers and physicists are much more likely to be taught Python than R. There may well even be some oldies out there still using other languages like Fortran, or perhaps  relying on books of statistical tables!

Out of interest therefore I’ve decided to run the following totally biased and statistically meaningless poll of my immense readership:


If you choose “something else”, please let me know through the comments box what your alternative is. I can then add additional options.


LIGO Echoes, P-values and the False Discovery Rate

Posted in Astrohype, Bad Statistics, The Universe and Stuff on December 12, 2016 by telescoper

Today is our staff Christmas lunch so I thought I’d get into the spirit by posting a grumbly article about a paper I found on the arXiv. In fact I came to this piece via a News item in Nature. Anyway, here is the abstract of the paper – which hasn’t been refereed yet:

In classical General Relativity (GR), an observer falling into an astrophysical black hole is not expected to experience anything dramatic as she crosses the event horizon. However, tentative resolutions to problems in quantum gravity, such as the cosmological constant problem, or the black hole information paradox, invoke significant departures from classicality in the vicinity of the horizon. It was recently pointed out that such near-horizon structures can lead to late-time echoes in the black hole merger gravitational wave signals that are otherwise indistinguishable from GR. We search for observational signatures of these echoes in the gravitational wave data released by advanced Laser Interferometer Gravitational-Wave Observatory (LIGO), following the three black hole merger events GW150914, GW151226, and LVT151012. In particular, we look for repeating damped echoes with time-delays of 8MlogM (+spin corrections, in Planck units), corresponding to Planck-scale departures from GR near their respective horizons. Accounting for the “look elsewhere” effect due to uncertainty in the echo template, we find tentative evidence for Planck-scale structure near black hole horizons at 2.9σ significance level (corresponding to false detection probability of 1 in 270). Future data releases from LIGO collaboration, along with more physical echo templates, will definitively confirm (or rule out) this finding, providing possible empirical evidence for alternatives to classical black holes, such as in firewall or fuzzball paradigms.

I’ve highlighted some of the text in bold because, as written, it’s wrong.

I’ve blogged many times before about this type of thing. The “significance level” quoted corresponds to a “p-value” of 0.0037 (or about 1/270). If I had my way we’d ban p-values and significance levels altogether because they are so often presented in a misleading fashion, as it is here.
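For anyone who wants to check the arithmetic, here is a minimal sketch of the conversion from a number of sigma to a p-value. It assumes the usual two-sided Gaussian convention, which is my assumption about how the figure was quoted; the function name is just for illustration.

```python
import math


def two_sided_p_value(n_sigma: float) -> float:
    """Probability that a standard normal variable lies more than
    n_sigma standard deviations from zero, in either direction."""
    # For a standard normal Z, P(|Z| > x) = erfc(x / sqrt(2)).
    return math.erfc(n_sigma / math.sqrt(2.0))


p = two_sided_p_value(2.9)
print(f"p-value for 2.9 sigma: {p:.4f} (about 1 in {1.0 / p:.0f})")
```

Under that convention 2.9σ gives a p-value of roughly 0.0037. A one-sided convention would halve it, so the quoted 1 in 270 is consistent with the two-sided reading.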

What is wrong is that the significance level is not the same as the false detection probability.  While it is usually the case that the false detection probability (which is often called the false discovery rate) will decrease the lower your p-value is, these two quantities are not the same thing at all. Usually the false detection probability is much higher than the p-value. The physicist John Bahcall summed this up when he said, based on his experience, “about half of all 3σ  detections are false”. You can find a nice (and relatively simple) explanation of why this is the case here (which includes various references that are worth reading), but basically it’s because the p-value relates to the probability of seeing a signal at least as large as that observed under a null hypothesis (e.g.  detector noise) but says nothing directly about the probability of it being produced by an actual signal. To answer this latter question properly one really needs to use a Bayesian approach, but if you’re not keen on that I refer you to this (from David Colquhoun’s blog):

One problem with all of the approaches mentioned above was the need to guess at the prevalence of real effects (that’s what a Bayesian would call the prior probability). James Berger and colleagues (Sellke et al., 2001) have proposed a way round this problem by looking at all possible prior distributions and so coming up with a minimum false discovery rate that holds universally. The conclusions are much the same as before. If you claim to have found an effect whenever you observe a P value just less than 0.05, you will come to the wrong conclusion in at least 29% of the tests that you do. If, on the other hand, you use P = 0.001, you’ll be wrong in only 1.8% of cases.

Of course the actual false detection probability can be much higher than these limits, but they provide a useful rule of thumb.
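The numbers in that quotation can be reproduced with a few lines of code. This is a sketch of the Sellke et al. calibration as I understand it: for p < 1/e the Bayes factor in favour of the null hypothesis is bounded below by −e p ln p. Turning that into a minimum false discovery rate needs prior odds; I have assumed equal prior probability for "real effect" and "no effect", which is the choice that recovers the 29% and 1.8% quoted above. The function and parameter names are mine.

```python
import math


def min_false_discovery_rate(p: float, prior_real: float = 0.5) -> float:
    """Lower bound on the probability that the null hypothesis is true,
    given a p-value p, using the -e * p * ln(p) bound on the Bayes factor.

    prior_real is the assumed prior probability that the effect is real
    (0.5 by default, an assumption made for this illustration).
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    if not 0.0 < prior_real < 1.0:
        raise ValueError("prior_real must lie strictly between 0 and 1")
    # Smallest possible Bayes factor in favour of the null hypothesis.
    min_bayes_factor = -math.e * p * math.log(p)
    # Posterior odds of the null = Bayes factor * prior odds of the null.
    prior_odds_null = (1.0 - prior_real) / prior_real
    posterior_odds_null = min_bayes_factor * prior_odds_null
    return posterior_odds_null / (1.0 + posterior_odds_null)


for p in (0.05, 0.001, 0.0037):
    rate = min_false_discovery_rate(p)
    print(f"p = {p}: false discovery rate is at least {rate:.1%}")
```

On the same assumptions, the p-value of 0.0037 from the echoes paper corresponds to a minimum false discovery rate of about 5%, which is a long way from 1 in 270.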

To be fair the Nature item puts it more accurately:

The echoes could be a statistical fluke, and if random noise is behind the patterns, says Afshordi, then the chance of seeing such echoes is about 1 in 270, or 2.9 sigma. To be sure that they are not noise, such echoes will have to be spotted in future black-hole mergers. “The good thing is that new LIGO data with improved sensitivity will be coming in, so we should be able to confirm this or rule it out within the next two years.”

Unfortunately, however, the LIGO background noise is rather complicated so it’s not even clear to me that this calculation based on “random noise”  is meaningful anyway.

The idea that the authors are trying to test is of course interesting, but it needs a more rigorous approach before any evidence (even “tentative”) can be claimed. This is rather reminiscent of the problems interpreting apparent “anomalies” in the Cosmic Microwave Background, which is something I’ve been interested in over the years.

In summary, I’m not convinced. Merry Christmas.



The Neyman-Scott ‘Paradox’

Posted in Bad Statistics, Cute Problems on November 25, 2016 by telescoper

I just came across this interesting little problem recently and thought I’d share it here. It’s usually called the ‘Neyman-Scott’ paradox. Before going on it’s worth mentioning that Elizabeth Scott (the second half of Neyman-Scott) was an astronomer by background. Her co-author was Jerzy Neyman. As has been the case for many astronomers, she contributed greatly to the development of the field of statistics. Anyway, I think this example provides another good illustration of the superiority of Bayesian methods for estimating parameters, but I’ll let you make your own mind up about what’s going on.

The problem is fairly technical so I’ve done a quick version in LaTeX that you can download here, but I’ve also copied it into this post so you can read it below:




I look forward to receiving Frequentist Flak or Bayesian Benevolence through the comments box below!

The Worthless University Rankings

Posted in Bad Statistics, Education on September 23, 2016 by telescoper

The Times Higher World University Rankings were released this week. The main table can be found here and the methodology used to concoct them here.

Here I wish to reiterate the objection I made last year to the way these tables are manipulated year on year to create an artificial “churn” that renders them unreliable and impossible to interpret in an objective way. In other words, they’re worthless. This year, editor Phil Baty has written an article entitled Standing still is not an option in which he makes a statement that “the overall rankings methodology is the same as last year”. Actually it isn’t. In the page on methodology you will find this:

In 2015-16, we excluded papers with more than 1,000 authors because they were having a disproportionate impact on the citation scores of a small number of universities. This year, we have designed a method for reincorporating these papers. Working with Elsevier, we have developed a new fractional counting approach that ensures that all universities where academics are authors of these papers will receive at least 5 per cent of the value of the paper, and where those that provide the most contributors to the paper receive a proportionately larger contribution.

So the methodology just isn’t “the same as last year”. In fact every year that I’ve seen these rankings there’s been some change in methodology. The change above at least attempts to improve on the absurd decision taken last year to eliminate from the citation count any papers arising from large collaborations. In my view, membership of large world-wide collaborations is in itself an indicator of international research excellence, and such papers should if anything be given greater not lesser weight. But whether you agree with the motivation for the change or not is beside the point.

The real question is how we can be sure that any change in league table position for an institution from year to year is caused by changes in “performance” rather than by methodological tweaks, i.e. by changes in the metrics rather than by changes in the way they are combined. Would you trust the outcome of a medical trial in which the responses of two groups of patients (e.g. one given medication and the other placebo) were assessed with two different measurement techniques?

There is an obvious and easy way to test for the size of this effect, which is to construct a parallel set of league tables, with this year’s input data but last year’s methodology, which would make it easy to isolate changes in methodology from changes in the performance indicators. The Times Higher – along with other purveyors of similar statistical twaddle – refuses to do this. No scientifically literate person would accept the result of this kind of study unless the systematic effects can be shown to be under control. There is a very easy way for the Times Higher to address this question: all they need to do is publish a set of league tables using, say, the 2015/16 methodology and the 2016/17 data, for comparison with those constructed using this year’s methodology on the 2016/17 data. Any differences between these two tables will give a clear indication of the reliability (or otherwise) of the rankings.
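To show how little work this comparison involves, here is a toy sketch. Everything in it is invented for illustration: the institutions, the indicator names, the scores and both sets of weights are hypothetical, and the weighted sum is a stand-in rather than the Times Higher’s actual method. The point is only that the same input data are ranked twice, once with each methodology.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]


def rank_table(scores: Scores, weights: Dict[str, float]) -> List[str]:
    """Return institution names ordered from best to worst weighted total."""
    totals = {
        name: sum(weights[key] * indicators[key] for key in weights)
        for name, indicators in scores.items()
    }
    # Sort by descending total; break any ties alphabetically.
    return sorted(totals, key=lambda name: (-totals[name], name))


def rank_changes(scores: Scores,
                 old_weights: Dict[str, float],
                 new_weights: Dict[str, float]) -> Dict[str, int]:
    """Places gained (positive) or lost (negative) by each institution
    when only the methodology changes and the data are held fixed."""
    old = rank_table(scores, old_weights)
    new = rank_table(scores, new_weights)
    return {name: old.index(name) - new.index(name) for name in scores}


# Hypothetical indicator scores for this year (made up for this sketch).
scores = {
    "A": {"teaching": 80.0, "research": 60.0, "citations": 90.0},
    "B": {"teaching": 70.0, "research": 85.0, "citations": 60.0},
    "C": {"teaching": 75.0, "research": 70.0, "citations": 75.0},
}
# Hypothetical weightings standing in for last year's and this year's methods.
old_weights = {"teaching": 0.3, "research": 0.3, "citations": 0.4}
new_weights = {"teaching": 0.2, "research": 0.5, "citations": 0.3}

print("Old methodology:", rank_table(scores, old_weights))
print("New methodology:", rank_table(scores, new_weights))
print("Change due to methodology alone:",
      rank_changes(scores, old_weights, new_weights))
```

In this made-up case the order changes from A, C, B to B, A, C without a single indicator having moved, which is exactly the kind of artificial churn a parallel table would expose.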

I challenged the Times Higher to do this last year, and they refused. You can draw your own conclusions about why.

Bayes Factors via Savage-Dickey Supermodels [IMA]

Posted in Bad Statistics, The Universe and Stuff on September 12, 2016 by telescoper

How could I possibly resist reblogging an arXiver post about “Savage-Dickey Supermodels”?


We outline a new method to compute the Bayes Factor for model selection which bypasses the Bayesian Evidence. Our method combines multiple models into a single, nested, Supermodel using one or more hyperparameters. Since the models are now nested the Bayes Factors between the models can be efficiently computed using the Savage-Dickey Density Ratio (SDDR). In this way model selection becomes a problem of parameter estimation. We consider two ways of constructing the supermodel in detail: one based on combined models, and a second based on combined likelihoods. We report on these two approaches for a Gaussian linear model for which the Bayesian evidence can be calculated analytically and a toy nonlinear problem. Unlike the combined model approach, where a standard Monte Carlo Markov Chain (MCMC) struggles, the combined-likelihood approach fares much better in providing a reliable estimate of the log-Bayes Factor. This scheme potentially opens the way to…
