Archive for p-value

LIGO Echoes, P-values and the False Discovery Rate

Posted in Astrohype, Bad Statistics, The Universe and Stuff with tags , , , , on December 12, 2016 by telescoper

Today is our staff Christmas lunch so I thought I’d get into the spirit by posting a grumbly article about a paper I found on the arXiv. In fact I came to this piece via a News item in Nature. Anyway, here is the abstract of the paper – which hasn’t been refereed yet:

In classical General Relativity (GR), an observer falling into an astrophysical black hole is not expected to experience anything dramatic as she crosses the event horizon. However, tentative resolutions to problems in quantum gravity, such as the cosmological constant problem, or the black hole information paradox, invoke significant departures from classicality in the vicinity of the horizon. It was recently pointed out that such near-horizon structures can lead to late-time echoes in the black hole merger gravitational wave signals that are otherwise indistinguishable from GR. We search for observational signatures of these echoes in the gravitational wave data released by advanced Laser Interferometer Gravitational-Wave Observatory (LIGO), following the three black hole merger events GW150914, GW151226, and LVT151012. In particular, we look for repeating damped echoes with time-delays of 8MlogM (+spin corrections, in Planck units), corresponding to Planck-scale departures from GR near their respective horizons. Accounting for the “look elsewhere” effect due to uncertainty in the echo template, we find tentative evidence for Planck-scale structure near black hole horizons at 2.9σ significance level (corresponding to false detection probability of 1 in 270). Future data releases from LIGO collaboration, along with more physical echo templates, will definitively confirm (or rule out) this finding, providing possible empirical evidence for alternatives to classical black holes, such as in firewall or fuzzball paradigms.

I’ve highlighted some of the text in bold. I’ve highlighted this because as written its wrong.

I’ve blogged many times before about this type of thing. The “significance level” quoted corresponds to a “p-value” of 0.0037 (or about 1/270). If I had my way we’d ban p-values and significance levels altogether because they are so often presented in a misleading fashion, as it is here.

What is wrong is that the significance level is not the same as the false detection probability.  While it is usually the case that the false detection probability (which is often called the false discovery rate) will decrease the lower your p-value is, these two quantities are not the same thing at all. Usually the false detection probability is much higher than the p-value. The physicist John Bahcall summed this up when he said, based on his experience, “about half of all 3σ  detections are false”. You can find a nice (and relatively simple) explanation of why this is the case here (which includes various references that are worth reading), but basically it’s because the p-value relates to the probability of seeing a signal at least as large as that observed under a null hypothesis (e.g.  detector noise) but says nothing directly about the probability of it being produced by an actual signal. To answer this latter question properly one really needs to use a Bayesian approach, but if you’re not keen on that I refer you to this (from David Colquhoun’s blog):

One problem with all of the approaches mentioned above was the need to guess at the prevalence of real effects (that’s what a Bayesian would call the prior probability). James Berger and colleagues (Sellke et al., 2001) have proposed a way round this problem by looking at all possible prior distributions and so coming up with a minimum false discovery rate that holds universally. The conclusions are much the same as before. If you claim to have found an effects whenever you observe a P value just less than 0.05, you will come to the wrong conclusion in at least 29% of the tests that you do. If, on the other hand, you use P = 0.001, you’ll be wrong in only 1.8% of cases.

Of course the actual false detection probability can be much higher than these limits, but they provide a useful rule of thumb,

To be fair the Nature item puts it more accurately:

The echoes could be a statistical fluke, and if random noise is behind the patterns, says Afshordi, then the chance of seeing such echoes is about 1 in 270, or 2.9 sigma. To be sure that they are not noise, such echoes will have to be spotted in future black-hole mergers. “The good thing is that new LIGO data with improved sensitivity will be coming in, so we should be able to confirm this or rule it out within the next two years.

Unfortunately, however, the LIGO background noise is rather complicated so it’s not even clear to me that this calculation based on “random noise”  is meaningful anyway.

The idea that the authors are trying to test is of course interesting, but it needs a more rigorous approach before any evidence (even “tentative” can be claimed). This is rather reminiscent of the problems interpreting apparent “anomalies” in the Cosmic Microwave Background, which is something I’ve been interested in over the years.

In summary, I’m not convinced. Merry Christmas.

 

 

Still Not Significant

Posted in Bad Statistics with tags , on May 27, 2015 by telescoper

I just couldn’t resist reblogging this post because of the wonderful list of meaningless convoluted phrases people use when they don’t get a “statistically significant” result. I particularly like:

“a robust trend toward significance”.

It’s scary to think that these were all taken from peer-reviewed scientific journals…

Probable Error

Image

What to do if your p-value is just over the arbitrary threshold for ‘significance’ of p=0.05?

You don’t need to play the significance testing game – there are better methods, like quoting the effect size with a confidence interval – but if you do, the rules are simple: the result is either significant or it isn’t.

So if your p-value remains stubbornly higher than 0.05, you should call it ‘non-significant’ and write it up as such. The problem for many authors is that this just isn’t the answer they were looking for: publishing so-called ‘negative results’ is harder than ‘positive results’.

The solution is to apply the time-honoured tactic of circumlocution to disguise the non-significant result as something more interesting. The following list is culled from peer-reviewed journal articles in which (a) the authors set themselves the threshold of 0.05 for significance, (b) failed to achieve that threshold value for…

View original post 2,779 more words