Archive for the Bad Statistics Category

Yellow Stars, Red Stars and Bayesian Inference

Posted in The Universe and Stuff, Bad Statistics with tags , , , , , , on May 25, 2017 by telescoper

I came across a paper on the arXiv yesterday with the title `Why do we find ourselves around a yellow star instead of a red star?’.  Here’s the abstract:

M-dwarf stars are more abundant than G-dwarf stars, so our position as observers on a planet orbiting a G-dwarf raises questions about the suitability of other stellar types for supporting life. If we consider ourselves as typical, in the anthropic sense that our environment is probably a typical one for conscious observers, then we are led to the conclusion that planets orbiting in the habitable zone of G-dwarf stars should be the best place for conscious life to develop. But such a conclusion neglects the possibility that K-dwarfs or M-dwarfs could provide more numerous sites for life to develop, both now and in the future. In this paper we analyze this problem through Bayesian inference to demonstrate that our occurrence around a G-dwarf might be a slight statistical anomaly, but only the sort of chance event that we expect to occur regularly. Even if M-dwarfs provide more numerous habitable planets today and in the future, we still expect mid G- to early K-dwarfs stars to be the most likely place for observers like ourselves. This suggests that observers with similar cognitive capabilities as us are most likely to be found at the present time and place, rather than in the future or around much smaller stars.

Although astrobiology is not really my province, I was intrigued enough to read on, until I came to the following paragraph in which the authors attempt to explain how Bayesian Inference works:

We approach this problem through the framework of Bayesian inference. As an example, consider a fair coin that is tossed three times in a row. Suppose that all three tosses turn up Heads. Can we conclude from this experiment that the coin must be weighted? In fact, we can still maintain our hypothesis that the coin is fair because the chances of getting three Heads in a row is 1/8. Many events with a probability of 1/8 occur every day, and so we should not be concerned about an event like this indicating that our initial assumptions are flawed. However, if we were to flip the same coin 70 times in a row with all 70 turning up Heads, we would readily conclude that the experiment is fixed. This is because the probability of flipping 70 Heads in a row is about 10⁻²², which is an exceedingly unlikely event that has probably never happened in the history of the universe. This
informal description of Bayesian inference provides a way to assess the probability of a hypothesis in light of new evidence.

Obviously I agree with the statement right at the end that `Bayesian inference provides a way to assess the probability of a hypothesis in light of new evidence’. That’s certainly what Bayesian inference does, but this `informal description’ is really a frequentist rather than a Bayesian argument, in that it only mentions the probability of given outcomes, not the probability of different hypotheses…
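To illustrate the distinction, here is a minimal sketch of a genuinely Bayesian treatment of the coin example. It is my own toy model, not taken from the paper: the alternative hypothesis (a coin biased to give Heads 90% of the time) and the prior odds of 999:1 in favour of fairness are purely illustrative choices. The point is that a Bayesian computes the probability of each hypothesis given the data, not just the probability of the data under one hypothesis:

```python
def posterior_fair(n_heads, prior_fair=0.999, p_biased=0.9):
    """Posterior probability that the coin is fair after n_heads
    consecutive Heads, in a simple two-hypothesis model."""
    like_fair = 0.5 ** n_heads          # P(data | fair coin)
    like_biased = p_biased ** n_heads   # P(data | coin biased to Heads)
    num = prior_fair * like_fair
    return num / (num + (1 - prior_fair) * like_biased)

# Three Heads in a row barely dents our belief in fairness,
# but seventy in a row demolishes it:
print(posterior_fair(3))
print(posterior_fair(70))
```

Note that the conclusion depends explicitly on the prior and on what the alternative hypothesis is, which is exactly the information the paper’s `informal description’ leaves out.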

Anyway, I was so unconvinced by this `description’ that I stopped reading at that point and went and did something else. Since I didn’t finish the paper I won’t comment on the conclusions, although I am more than usually sceptical. You might disagree of course, so read the paper yourself and form your own opinion! For me, it goes in the file marked Bad Statistics!

Polls Apart

Posted in Bad Statistics, Politics with tags , , , , , , , on May 9, 2017 by telescoper

Time for some random thoughts about political opinion polls, in the light of Sunday’s French Presidential Election result.

We all know that Emmanuel Macron beat Marine Le Pen in the second round ballot: he won 66.1% of the votes cast to Le Pen’s 33.9%. That doesn’t count the very large number of spoilt ballots or abstentions (25.8% in total). The turnout was down on previous elections, but at 74.2% it’s still a lot higher than we can expect in the UK at the forthcoming General Election.

The French opinion polls were very accurate in predicting the first round results, getting the vote shares for the four top candidates right to within a percentage point or two, which is as good as it gets for typical poll sample sizes.

Harry Enten has written a post on Nate Silver’s FiveThirtyEight site claiming that the French opinion polls for the second round “runoff” were inaccurate. He bases this on the observation that the “average poll” in between the two rounds of voting gave Macron a lead of about 22% (61%-39%). That’s true, but it assumes that opinions did not shift in the latter stages of the campaign. In particular it ignores Marine Le Pen’s terrible performance in the one-on-one TV debate against Macron on 4th May. Polls conducted after that debate (especially a big one with a sample of 5331 by IPSOS) gave a figure more like 63-37, i.e. a 26 point lead.

In any case it can be a bit misleading to focus on the difference between the two vote shares. In a two-horse race, if you’re off by +3 for one candidate you will be off by -3 for the other. In other words, underestimating Macron’s vote automatically means over-estimating Le Pen’s. A ‘normal’ sampling error therefore looks twice as bad if you frame it in terms of differences like this. The last polls, which put Macron at 63%, were only off by 3%, which is a normal sampling error…

The polls were off by more than they have been in previous years, when they typically predicted the spread to within 4%. There’s also the question of how the big gap between the two candidates may have influenced voter behaviour, increasing the number of no-shows.

So I don’t think the French opinion polls did as badly as all that. What still worries me, though, is that the different polls consistently agreed with each other to within 1% or so, when sampling fluctuations should have produced a larger scatter between them. Fishy.
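The scatter one should expect between honestly independent polls is easy to estimate. Here’s a minimal simulation, with illustrative numbers of my own choosing (a true share of 63% and a dozen polls of 1000 respondents each, roughly matching the French second-round situation):

```python
import random
from math import sqrt

random.seed(42)
p_true, n, n_polls = 0.63, 1000, 12   # hypothetical true share and poll sizes

# Each poll is n independent Bernoulli draws from the electorate
shares = [100 * sum(random.random() < p_true for _ in range(n)) / n
          for _ in range(n_polls)]

mean = sum(shares) / n_polls
sd = sqrt(sum((s - mean) ** 2 for s in shares) / (n_polls - 1))

# Theoretical standard error of a single poll's percentage
se_theory = 100 * sqrt(p_true * (1 - p_true) / n)

print(f"theoretical poll-to-poll scatter: about ±{se_theory:.1f} points")
print(f"observed scatter across {n_polls} simulated polls: ±{sd:.1f} points")
```

With these numbers the standard error per poll is about 1.5 percentage points, so a set of genuinely independent polls should straggle over a range of several points. Agreement to within 1% across many polls is tighter than pure sampling error allows, which is why it smells of herding.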

By way of a contrast, consider a couple of recent opinion polls conducted by YouGov in Wales. The first, conducted in April, gave the following breakdown of likely votes:


The apparent ten-point lead for the Conservatives over Labour (which is traditionally dominant in Wales) created a lot of noise in the media as it showed the Tories up 12% on the previous such poll taken in January (and Labour down 3%); much of the Conservative increase was due to a collapse in the UKIP share. Here’s the long-term picture from YouGov:


As an aside I’ll mention that ‘barometer’ surveys like this are sometimes influenced by changes in weightings and other methodological factors that can artificially produce different outcomes. I don’t know if anything changed in this regard between January 2017 and May 2017 that might have contributed to the large swing to the Tories, so let’s just assume that it’s “real”.

This “sensational” result gave pundits (e.g. Cardiff’s own Roger Scully) the opportunity to construct various narratives about its implications for the forthcoming General Election.

Note, however, the sample size (1029), which implies an uncertainty of ±3% or so in the result. It came as no surprise to me, then, to see that the next poll by YouGov was a bit different: Conservatives on 41% (+1), but Labour on 35% (+5). That’s still grim for Labour, of course, but not quite as grim as being 10 points behind.
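The ±3% figure is just the standard 95% margin of error for a sample proportion, which you can check in a couple of lines (the worst case is a 50% share):

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error (in percentage points) for a sample
    proportion from a simple random sample of size n."""
    return 100 * z * sqrt(p * (1 - p) / n)

print(f"n = 1029: ±{margin_of_error(1029):.1f} points")  # about ±3.1
```

And remember that the uncertainty on the gap between two parties is roughly double that on either share, so a 10-point lead and a 6-point lead are entirely consistent with each other at this sample size.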

So what happened in the two weeks between these two polls? Well, one thing is that many places had local elections which resulted in lots of campaigning. In my ward, at least, that made a big difference: Labour increased its share of the vote compared to the 2012 elections (on a 45% turnout, which is high for local elections). Maybe then it’s true that Labour has been “fighting back” since the end of April.

Alternatively, and to my mind more probably, what we’re seeing is just the consequence of very large sampling errors. I think it’s likely that the Conservatives are in the lead, but by an extremely uncertain margin.

But why didn’t we see fluctuations of this magnitude in the French opinion polls of similar size?

Answers on a postcard, or through the comments box, please.

Why Universities should ignore League Tables

Posted in Bad Statistics, Education with tags , , , , , on January 12, 2017 by telescoper

Very busy day today, but I couldn’t resist a quick post to draw attention to a new report by an independent think tank called the Higher Education Policy Institute (PDF available here; high-level summary there). It says a lot of things that I’ve discussed on this blog already, and I agree strongly with most of its conclusions. The report is focused on the international league tables, but much of what it says (in terms of methodological criticism) also applies to the national tables. Unfortunately, I doubt it will make much difference to the behaviour of the bean-counters who have now taken control of higher education, for whom strategies intended to ‘game’ position in these largely bogus tables seem to be the main focus of policy, rather than the pursuit of teaching and scholarship, which is what universities should actually be for.

Here is the introduction to the high-level summary:

Rankings of global universities, such as the THE World University Rankings, the QS World University Rankings and the Academic Ranking of World Universities claim to identify the ‘best’ universities in the world and then list them in rank order. They are enormously influential, as universities and even governments alter their policies to improve their position.

The new research shows the league tables are based almost exclusively on research-related criteria and the data they use are unreliable and sometimes worse. As a result, it is unwise and undesirable to give the league tables so much weight.

Later on we find some recommendations:

The report considers the inputs for the various international league tables and discusses their overall weaknesses before considering some improvements that could be made. These include:

  • ranking bodies should audit and validate data provided by universities;
  • league table criteria should move beyond research-related measures;
  • surveys of reputation should be dropped, given their methodological flaws;
  • league table results should be published in more complex ways than simple numerical rankings; and
  • universities and governments should not exaggerate the importance of rankings when determining priorities.

No doubt the purveyors of these rankings – I’ll refrain from calling them “rankers” – will mount a spirited defence of their business, but I agree with the view expressed in this report that as they stand these league tables are at best meaningless and at worst damaging.

Do you have Confidence in the Teaching Excellence Framework?

Posted in Bad Statistics with tags , , , , on January 4, 2017 by telescoper

The  Teaching Excellence Framework (TEF) is, along with a number of other measures in the 2016 Higher Education and Research Bill, causing a lot of concern in academic circles (see, e.g., this piece by Stephen Curry). One of the intentions of the TEF is to use relatively simple metrics to gauge “teaching quality” in higher education institutions. On top of the fundamental questions of exactly what “teaching quality” means and how it might be measured in any reliable way, there is now another worry: the whole TEF system is to be run by people who are statistically illiterate.

To demonstrate this assertion I refer you to this excerpt from the official TEF documentation:


The highlighted “explanation” of what a confidence interval means is false. It’s not slightly misleading. It’s not poorly worded. It’s just false.

I don’t know who from HEFCE wrote the piece above, but it’s clearly someone who does not understand the basic concepts of statistics.

I can’t imagine what kind of garbled nonsense will come out of the TEF if this is the level of understanding displayed by the people running it.  That garbage will also be fed into the university league tables with potentially devastating effects on individuals, departments and institutions, so my gripe is not just about semantics – this level of statistical illiteracy could have very serious consequences for Higher Education in the UK.

Perhaps HEFCE should call in some experts in statistics to help? Oh, no. I forgot. This country has had enough of experts…





Straw Poll on Statistical Computing

Posted in Bad Statistics, The Universe and Stuff with tags , , on December 20, 2016 by telescoper

The abstract of my previous (reblogged) post claims that R is “the premier language of statistical computing”. That may be true for the wider world of statistics, and I like R very much, but in my experience astronomers and cosmologists are much more likely to do their coding in Python.  It’s certainly the case that astronomers and physicists are much more likely to be taught Python than R. There may well even be some oldies out there still using other languages like Fortran, or perhaps  relying on books of statistical tables!

Out of interest therefore I’ve decided to run the following totally biased and statistically meaningless poll of my immense readership:


If you choose “something else”, please let me know through the comments box what your alternative is. I can then add additional options.


LIGO Echoes, P-values and the False Discovery Rate

Posted in Astrohype, Bad Statistics, The Universe and Stuff with tags , , , , on December 12, 2016 by telescoper

Today is our staff Christmas lunch so I thought I’d get into the spirit by posting a grumbly article about a paper I found on the arXiv. In fact I came to this piece via a News item in Nature. Anyway, here is the abstract of the paper – which hasn’t been refereed yet:

In classical General Relativity (GR), an observer falling into an astrophysical black hole is not expected to experience anything dramatic as she crosses the event horizon. However, tentative resolutions to problems in quantum gravity, such as the cosmological constant problem, or the black hole information paradox, invoke significant departures from classicality in the vicinity of the horizon. It was recently pointed out that such near-horizon structures can lead to late-time echoes in the black hole merger gravitational wave signals that are otherwise indistinguishable from GR. We search for observational signatures of these echoes in the gravitational wave data released by advanced Laser Interferometer Gravitational-Wave Observatory (LIGO), following the three black hole merger events GW150914, GW151226, and LVT151012. In particular, we look for repeating damped echoes with time-delays of 8 M log M (+spin corrections, in Planck units), corresponding to Planck-scale departures from GR near their respective horizons. Accounting for the “look elsewhere” effect due to uncertainty in the echo template, we find tentative evidence for Planck-scale structure near black hole horizons at 2.9σ significance level (corresponding to false detection probability of 1 in 270). Future data releases from LIGO collaboration, along with more physical echo templates, will definitively confirm (or rule out) this finding, providing possible empirical evidence for alternatives to classical black holes, such as in firewall or fuzzball paradigms.

I’ve highlighted some of the text in bold because, as written, it’s wrong.

I’ve blogged many times before about this type of thing. The “significance level” quoted corresponds to a “p-value” of 0.0037 (or about 1/270). If I had my way we’d ban p-values and significance levels altogether because they are so often presented in a misleading fashion, as it is here.
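For reference, the conversion between a “sigma” level and a two-sided p-value is a simple tail integral of the standard normal distribution. This little sketch (my own, using only the standard library) reproduces the paper’s quoted figure of roughly 1 in 270:

```python
from math import erfc, sqrt

def two_sided_p(sigma):
    """Two-sided tail probability of a standard normal beyond ±sigma."""
    return erfc(sigma / sqrt(2))

p = two_sided_p(2.9)
print(f"2.9 sigma -> p = {p:.4f}, i.e. about 1 in {1 / p:.0f}")
```

This gives p ≈ 0.0037, or about 1 in 268, consistent with the abstract’s “1 in 270”. But, as explained below, that number is a tail probability under the noise-only hypothesis, not a false detection probability.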

What is wrong is that the significance level is not the same as the false detection probability.  While it is usually the case that the false detection probability (which is often called the false discovery rate) will decrease the lower your p-value is, these two quantities are not the same thing at all. Usually the false detection probability is much higher than the p-value. The physicist John Bahcall summed this up when he said, based on his experience, “about half of all 3σ  detections are false”. You can find a nice (and relatively simple) explanation of why this is the case here (which includes various references that are worth reading), but basically it’s because the p-value relates to the probability of seeing a signal at least as large as that observed under a null hypothesis (e.g.  detector noise) but says nothing directly about the probability of it being produced by an actual signal. To answer this latter question properly one really needs to use a Bayesian approach, but if you’re not keen on that I refer you to this (from David Colquhoun’s blog):

One problem with all of the approaches mentioned above was the need to guess at the prevalence of real effects (that’s what a Bayesian would call the prior probability). James Berger and colleagues (Sellke et al., 2001) have proposed a way round this problem by looking at all possible prior distributions and so coming up with a minimum false discovery rate that holds universally. The conclusions are much the same as before. If you claim to have found an effect whenever you observe a P value just less than 0.05, you will come to the wrong conclusion in at least 29% of the tests that you do. If, on the other hand, you use P = 0.001, you’ll be wrong in only 1.8% of cases.

Of course the actual false detection probability can be much higher than these limits, but they provide a useful rule of thumb.
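The Sellke et al. bound quoted above is easy to compute for yourself. Their result is that, for p < 1/e, the Bayes factor in favour of the null hypothesis is at least −e·p·ln(p); assuming 50:50 prior odds on the null (the assumption behind the quoted figures), this translates into a minimum false discovery rate. A quick sketch:

```python
from math import e, log

def min_false_discovery_rate(p):
    """Sellke-Berger lower bound on the false discovery rate for a
    p-value p (valid for p < 1/e), assuming 50:50 prior odds."""
    bayes_factor = -e * p * log(p)   # minimum Bayes factor for the null
    return bayes_factor / (1 + bayes_factor)

print(f"p = 0.05  -> at least {100 * min_false_discovery_rate(0.05):.0f}% false")   # 29%
print(f"p = 0.001 -> at least {100 * min_false_discovery_rate(0.001):.1f}% false")  # 1.8%
```

These reproduce exactly the 29% and 1.8% figures in the quotation, and make vivid just how far a false discovery rate can sit above the p-value itself.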

To be fair the Nature item puts it more accurately:

The echoes could be a statistical fluke, and if random noise is behind the patterns, says Afshordi, then the chance of seeing such echoes is about 1 in 270, or 2.9 sigma. To be sure that they are not noise, such echoes will have to be spotted in future black-hole mergers. “The good thing is that new LIGO data with improved sensitivity will be coming in, so we should be able to confirm this or rule it out within the next two years.”

Unfortunately, however, the LIGO background noise is rather complicated so it’s not even clear to me that this calculation based on “random noise”  is meaningful anyway.

The idea that the authors are trying to test is of course interesting, but it needs a more rigorous approach before any evidence (even “tentative” evidence) can be claimed. This is rather reminiscent of the problems interpreting apparent “anomalies” in the Cosmic Microwave Background, which is something I’ve been interested in over the years.

In summary, I’m not convinced. Merry Christmas.



The Neyman-Scott ‘Paradox’

Posted in Bad Statistics, Cute Problems with tags , , , , on November 25, 2016 by telescoper

I just came across this interesting little problem recently and thought I’d share it here. It’s usually called the ‘Neyman-Scott’ paradox. Before going on it’s worth mentioning that Elizabeth Scott (the second half of Neyman-Scott) was an astronomer by background. Her co-author was Jerzy Neyman. As has been the case for many astronomers, she contributed greatly to the development of the field of statistics. Anyway, I think this example provides another good illustration of the superiority of Bayesian methods for estimating parameters, but I’ll let you make your own mind up about what’s going on.

The problem is fairly technical so I’ve done a quick version in LaTeX that you can download here, but I’ve also copied it into this post so you can read it below:
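In case the embedded version doesn’t display, here is a minimal numerical sketch of the standard setup (my own summary, not the LaTeX note itself). One observes n pairs of measurements, each pair drawn from a normal distribution with its own unknown mean μᵢ but a common variance σ². Because the number of nuisance parameters grows with the sample size, the maximum-likelihood estimate of σ² converges to σ²/2 rather than σ², however much data you collect:

```python
import random

random.seed(1)
sigma, n_pairs = 1.0, 100_000

# Each pair shares an unknown mean mu_i; sigma^2 is what we want.
sq_diffs = []
for _ in range(n_pairs):
    mu = random.gauss(0, 10)        # arbitrary nuisance mean for this pair
    x1 = random.gauss(mu, sigma)
    x2 = random.gauss(mu, sigma)
    sq_diffs.append((x1 - x2) ** 2)

# MLE of sigma^2 after profiling out each mu_i:
# (1/2n) * sum_i sum_j (x_ij - xbar_i)^2  =  (1/4n) * sum_i (x_i1 - x_i2)^2
mle = sum(sq_diffs) / (4 * n_pairs)       # converges to sigma^2 / 2
corrected = sum(sq_diffs) / (2 * n_pairs)  # consistent estimator: double it

print(f"true sigma^2 = {sigma**2:.2f}, MLE -> {mle:.3f}, corrected -> {corrected:.3f}")
```

The simulation shows the MLE hovering around 0.5 when the true variance is 1: the ‘paradox’ is that the frequentist maximum-likelihood estimator is inconsistent here, whereas (as the note argues) a Bayesian treatment that marginalises over the nuisance means handles the problem gracefully.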




I look forward to receiving Frequentist Flak or Bayesian Benevolence through the comments box below!