## Frequentism: the art of probably answering the wrong question

Popped into the office for a spot of lunch in between induction events and discovered that Jon Butterworth has posted an item on his *Grauniad* blog about how particle physicists use statistics, and the ‘5σ rule’ that is usually employed as a criterion for the detection of, e.g. a new particle. I couldn’t resist bashing out a quick reply, because I believe that actually the fundamental issue is not whether you choose 3σ or 5σ or 27σ but what these statistics mean or don’t mean.

As was the case with a Nature piece I blogged about some time ago, Jon’s article focuses on the p-value, a *frequentist* concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a particular *null hypothesis*. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient *r* obtained from a set of bivariate data. If the data were uncorrelated then *r* would have a known probability distribution, and if the value measured from the sample would be exceeded under that distribution with a probability of 0.05 then the p-value (or significance level) is 0.05. This is usually called a ‘2σ’ result because for Gaussian statistics a variable has a probability of about 95% of lying within 2σ of the mean value.
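To make the correlation example concrete: the null distribution of *r* can be built numerically by shuffling one of the variables, which destroys any correlation while preserving the marginal distributions. Here is a minimal, stdlib-only Python sketch (the function names are my own, purely illustrative):

```python
import math
import random

def pearson_r(x, y):
    """Sample correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=10_000, seed=42):
    """Two-sided p-value under the null of no correlation: the fraction of
    random shufflings of y whose |r| is at least the observed |r|
    (with the usual +1 correction so the estimate is never exactly zero)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y)
        if abs(pearson_r(x, y)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

For strongly correlated data the returned p-value is tiny; for genuinely uncorrelated data it is roughly uniform between 0 and 1, which is exactly what makes the 0.05 threshold an arbitrary convention rather than a statement about the hypothesis itself.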

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that large under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is *actually* a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
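The distinction is easy to see in a simulation. The sketch below (my own toy illustration, using a one-sided z-test with known variance for simplicity) estimates both quantities: run it with the null true and you get the Type I error rate; run it with an effect present and you get the power.

```python
import math
import random

def z_test_rejects(sample, mu0=0.0, sigma=1.0):
    """One-sided z-test: reject H0 (mu = mu0) if the sample mean is
    improbably large under H0, at a significance level of 0.05."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return z > 1.645  # one-sided critical value for alpha = 0.05

def error_rates(true_mu, n=25, trials=20_000, seed=1):
    """Fraction of trials in which the test rejects H0 when the data are
    actually drawn with mean true_mu.  This is the Type I error rate if
    true_mu = 0, and the power (1 minus the Type II rate) otherwise."""
    rng = random.Random(seed)
    rejections = sum(
        z_test_rejects([rng.gauss(true_mu, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return rejections / trials
```

With these settings `error_rates(0.0)` comes out near the nominal 0.05, while `error_rates(0.5)` gives a power of about 0.8. Note that neither number is the probability that the null hypothesis is true.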

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Jon’s piece demonstrates that he does, so this is not meant as a personal criticism, but it is a pervasive problem that results quoted in such a way are intrinsically confusing.

The Nature story mentioned above argues that in fact results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are typically rather small.

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05 or, in the case of particle physics, a 5σ standard (which translates to a p-value of about 3×10⁻⁷). While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem, which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question quite different from the one a scientist actually wants to ask, which is what the data have to say about the probability of a specific hypothesis being true, or sometimes whether the data imply one hypothesis more strongly than another. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.
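For reference, the conversion between ‘sigmas’ and p-values is a one-liner; note that the particle-physics convention is one-sided, which is why 5σ corresponds to roughly 3×10⁻⁷ rather than twice that. A quick Python check:

```python
import math

def p_value_from_sigma(n_sigma, two_sided=True):
    """Tail probability of a standard Gaussian beyond n_sigma.
    Particle physics usually quotes the one-sided value."""
    one_tail = 0.5 * math.erfc(n_sigma / math.sqrt(2))
    return 2 * one_tail if two_sided else one_tail

for s in (2, 3, 5):
    print(s, p_value_from_sigma(s, two_sided=False))
    # one-sided tails: ~0.023 for 2 sigma, ~0.0013 for 3 sigma,
    # and ~2.9e-7 for 5 sigma
```

The numbers drop rapidly with σ, but however strict the threshold, the quantity being computed is still a probability of the data given the null, not of the null given the data.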

I feel so strongly about this that if I had my way I’d ban p-values altogether…

Not that it’s always easy to implement a Bayesian approach. It’s especially difficult when the data are affected by complicated noise statistics and selection effects, and/or when it is difficult to formulate a hypothesis test rigorously because one does not have a clear alternative hypothesis in mind. Experimentalists (including experimental particle physicists) seem to prefer accepting the limitations of the frequentist approach to tackling the admittedly very challenging problems of going Bayesian. In fact, in my experience it seems that those scientists who approach data from a theoretical perspective are almost exclusively Bayesian, while those of an experimental or observational bent stick to their frequentist guns.

Coincidentally a paper on the arXiv not long ago discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.

This paradox isn’t a paradox at all; the different approaches give different answers because they ask different questions. Both could be right, but I firmly believe that one of them answers the wrong question.
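The arithmetic behind the paradox is easy to reproduce. In the toy version below (my own sketch, with a Gaussian prior of width τ standing in for the composite alternative), a ‘3σ’ result keeps a small two-sided p-value throughout, while the posterior probability of the point null climbs towards unity as the prior on the alternative is made broader:

```python
import math

def normal_pdf(x, mu, var):
    """Gaussian probability density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_null(z, se, tau, prior_null=0.5):
    """P(H0 | data) for a point null mu = 0 versus the composite
    alternative mu ~ N(0, tau^2), when the estimate lies z standard
    errors (se) away from zero."""
    xbar = z * se
    like_null = normal_pdf(xbar, 0.0, se ** 2)
    # Marginal likelihood under the alternative: the prior width adds
    # in quadrature to the measurement error.
    like_alt = normal_pdf(xbar, 0.0, se ** 2 + tau ** 2)
    b01 = like_null / like_alt  # Bayes factor in favour of the null
    return prior_null * b01 / (prior_null * b01 + (1 - prior_null))

# A '3 sigma' result (two-sided p ~ 0.003) with progressively wider priors:
for tau in (1.0, 10.0, 1000.0):
    print(tau, posterior_null(z=3.0, se=0.1, tau=tau))
```

With these numbers the posterior probability of the null rises from about 0.10 at τ = 1 to about 0.99 at τ = 1000, even though the p-value never changes: the ‘paradox’ is just the two approaches answering different questions.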

September 15, 2014 at 3:14 pm

From the Guardian:

If anyone still believes that P(A|B) = P(B|A) [probability of A given B = probability of B given A], remind them that the probability of being pregnant, given that the person is female, is ∼3%, while the probability of being female, given that they are pregnant, is considerably larger.

Now, where have I read that before? 🙂

September 15, 2014 at 5:38 pm

The problem, of course, is that while the Bayesian approach may get the question right it frequently (hah) fails to answer it.

September 15, 2014 at 5:44 pm

If a problem is ill-posed then it’s better to know about it than to accept the answer to a different one…

September 15, 2014 at 5:56 pm

Yes of course. But quite often with Bayesian statistics the problematic choice of the prior is glossed over or ignored. It’s not clear to me that – in cosmology at least – this is any less common than people thinking their frequentist approach has answered a question it hasn’t. Or perhaps it is less common, but it isn’t any less annoying.

September 15, 2014 at 8:13 pm

Sesh: the fact that we don’t know how to assign a prior in every question is a matter for research, not for dismissal of methods that are demonstrably wrong.

Those who refuse to assign a prior are doomed to have one assigned by default – by the method they use – which is not a good idea. As I’ve said before, suppose you actually know the value of a parameter (in Bayesian language, a delta-function prior) and are measuring it only because you have been ordered to. Your sampling-theoretical result will put a lot of probability where you KNOW it can’t be.
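This scenario is straightforward to simulate (a stdlib-only sketch with illustrative names, not taken from any paper discussed here). A Bayesian with a delta-function prior keeps all posterior probability at the known value whatever the data say, while the standard 95% sampling-theory interval, by construction, excludes that known value in about one repetition in twenty:

```python
import random

def confidence_interval(xbar, se):
    """Standard 95% interval from a single Gaussian measurement."""
    return (xbar - 1.96 * se, xbar + 1.96 * se)

def fraction_missing_known_value(mu_known=5.0, se=1.0, trials=50_000, seed=7):
    """Repeat the measurement many times and count how often the 95%
    confidence interval excludes the value we KNOW the parameter has.
    The delta-function-prior posterior, by contrast, never moves off
    mu_known at all."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        lo, hi = confidence_interval(rng.gauss(mu_known, se), se)
        if not (lo <= mu_known <= hi):
            misses += 1
    return misses / trials
```

The returned fraction is close to 0.05: a twentieth of the reported intervals confidently place the parameter where it is known not to be.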

September 15, 2014 at 10:04 pm

Ooops… spot the deliberate error? I meant “not for dismissal of Bayesian methods in favour of others that are demonstrably wrong”.

September 16, 2014 at 9:59 am

Anton: frequentist methods give you a well-defined and sensible answer to a particular question. So long as you keep in mind exactly what that question is and what it isn’t, you’re fine. There are some situations in which a Bayesian approach allows you to answer a different (and probably more interesting) question. However in many situations, particularly in cosmology, it doesn’t. I’m certainly not advocating dismissing Bayesian methods (who would be so silly?), just pointing out that they can be misused just like any other method.

September 16, 2014 at 10:02 am

I disagree strongly with your assertion that “in cosmology, it doesn’t”. What is your justification for this statement?

September 16, 2014 at 1:01 pm

Well, to take an example, we may want to learn about sterile neutrinos. Within a well-defined, restrictive class of models (only those models which have the standard model plus a sterile neutrino and nothing else), given a further reasonable assumption of a prior on some parameter (say on the mass of the sterile neutrino), we can use say the CMB data and perform a Bayesian analysis to obtain a nice posterior distribution for the mass of that sterile neutrino. In some such cases the precise choice of prior on the parameter space happily becomes unimportant to the final result.

But suppose the question we want to answer is actually whether sterile neutrinos exist at all or not (e.g. whether apparent slight tension between CMB and LSS measurements is an indication of the existence of sterile neutrinos). In a case like that we need to somehow quantify a probability of the model itself, i.e. what is our prior belief that sterile neutrinos (of any mass) exist? Without the ability to meaningfully quantify that prior on theory space, I don’t see how a Bayesian approach adds anything to a simple statement of P(D|M).

Now, it may be because of a personal bias in the types of papers that I read on the arXiv, but it seems to me that in cosmology the second type of question (comparing different classes of models) comes up more frequently than the first type (parameter estimation within a class of models).

September 16, 2014 at 2:26 pm

For me, cosmology *is* “parameter estimation within a class of models”; as Sandage said, “a search for two numbers”. (OK, as Mr Lambda I am searching for three, but still.) All of the “what do the CMB papers tell us” papers are estimating parameters within a model.

I suppose, though, that the line is fuzzy between a broader model with more parameters and an additional class of models.

September 16, 2014 at 4:41 pm

Sesh: You need to identify a parameter (say, m) which is zero in the case that sterile neutrinos don’t exist (hypothesis H1), and which is to be estimated from the data in the case H2 that they do exist. Then the analysis is a special case of Bayesian hypothesis testing in which a theory is tested against its own generalisation.

H1: has all its eggs in one basket: delta-fn prior for m at m=0

H2: prior probability for m is smeared over m-space

If the likelihood for the data is peaked wrt m near the delta-fn, we prefer H1 (because H2 wastes much of the prior prob for m by strewing it where the data suggest m isn’t).

If the likelihood for the data is peaked wrt m far from the delta-fn, prefer H2 (because H1 put all its eggs in the wrong basket for m).

So there is a trade-off between goodness of fit to the data, and simplicity of theory. This is a realisation of Ockham’s Razor principle. (It is not the only one!) We can always estimate m conditioned on H2 and the data; obviously this is worth doing if H2 is clearly preferred.
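A toy version of this trade-off (my own sketch, with a uniform prior standing in for the ‘smeared’ H2, and assuming the likelihood width is much smaller than the prior range):

```python
import math

def bayes_factor_h1_h2(m_hat, se, m_range=10.0):
    """B12 = P(data|H1) / P(data|H2) for H1: m = 0 exactly (delta-function
    prior) versus H2: m uniform on [-m_range, m_range], given a Gaussian
    likelihood peaked at the estimate m_hat with width se."""
    like_h1 = math.exp(-m_hat ** 2 / (2 * se ** 2)) / (se * math.sqrt(2 * math.pi))
    # Marginal likelihood under H2: the likelihood averaged over the
    # uniform prior.  For se << m_range (peak well inside the range)
    # the integral of the likelihood is ~1, so the average is 1/(2*m_range).
    like_h2 = 1.0 / (2 * m_range)
    return like_h1 / like_h2

# Likelihood peaked near m = 0: H1 preferred, since H2 wastes prior
# probability where the data say m isn't.
print(bayes_factor_h1_h2(m_hat=0.1, se=0.5))
# Likelihood peaked far from m = 0: H2 preferred, since H1 put all its
# eggs in the wrong basket.
print(bayes_factor_h1_h2(m_hat=3.0, se=0.5))
```

With these illustrative numbers the Bayes factor favours H1 by a factor of about 16 in the first case and H2 overwhelmingly in the second, which is the Ockham trade-off between fit and simplicity in action.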

September 29, 2014 at 7:32 am

Ha ha! Right you are, Sesh, in my experience. I don’t know anything about cosmology, other than residual recall of quantum physics classes at Swarthmore College, so I should not even be commenting here. Rest assured, though, that your observation 😉 is applicable to many fields of study:

September 29, 2014 at 7:45 am

Quite the opposite, actually. You can’t do a proper Bayesian study without specifying a prior, so you need to put all your assumptions on the table. This is not the case with frequentist studies which often claim to be “model free” but aren’t.

Of course it’s true that Bayesian inference can be done incorrectly, but that’s true of any type of analysis and at least Bayesians don’t set out to be wrong…

September 15, 2014 at 7:13 pm

It seems to me that the prior is not such a great problem as is often assumed. It is possible to argue that observation of P=0.045 gives a false discovery rate of at least 30% (and up to 70% for under-powered experiments) and these arguments involve no assumptions about the prior and no subjective probabilities. See, for example, http://arxiv.org/abs/1407.5296

September 27, 2014 at 4:19 am

David: It’s only possible to argue this manifestly incorrectly, using make-believe priors and dichotomous tests (that declare significance at just this value based on one test), while assuming all manner of QRPs and biases. Oh, and to top it off, it requires abusing power in a Bayesian computation (not at all what it’s intended for).

September 29, 2014 at 8:30 am

I don’t believe that’s true. All you have to do is simulate a lot of t tests and look only at the results that produce a P value close to 0.05. The false discovery rate is disastrous. There is an R script that will do this for you in http://arxiv.org/abs/1407.5296

And you can reach a very similar conclusion on the basis of the work of Sellke, T., Bayarri, M. J., and Berger, J. O. (2001), American Statistician, 55, 62–71.
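For anyone who doesn’t want to run R, the same exercise can be sketched in a few lines of Python (a stdlib-only illustration of the idea, using z-tests rather than the paper’s t-tests, with an assumed 50% prevalence of real effects and power of about 0.8):

```python
import math
import random

def two_sided_p(z):
    """Two-sided Gaussian p-value for a test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def fdr_near_005(prevalence=0.5, effect_z=2.80, trials=200_000, seed=3):
    """Simulate many tests: a fraction `prevalence` have a real effect
    (test statistic shifted by effect_z, i.e. power ~0.8 at alpha = 0.05);
    the rest are true nulls.  Among results with p close to 0.05, count
    the fraction that are false discoveries."""
    rng = random.Random(seed)
    null_hits = real_hits = 0
    for _ in range(trials):
        is_real = rng.random() < prevalence
        z = rng.gauss(effect_z if is_real else 0.0, 1.0)
        if 0.04 <= two_sided_p(z) <= 0.06:  # 'just significant' results
            if is_real:
                real_hits += 1
            else:
                null_hits += 1
    return null_hits / (null_hits + real_hits)
```

With these assumptions the false discovery rate among results with p near 0.05 comes out close to 30%, consistent with the figure quoted above.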

September 29, 2014 at 2:10 pm

They all have spiked priors, based on non-exhaustive hypotheses, and misuse of power. I’ve been through this before and can’t review it just now. But check those references from my blog.

September 30, 2014 at 9:19 pm

I guess there is nothing wrong with describing the assumption, in Bayesian language, as a spiked prior. But it is equally correct to say that the simulated t tests describe repeated t tests on samples in which the means (a) are identical (H0), or (b) differ by a specified amount. These are the standard assumptions for any comparison of two means, and I see no necessity to use Bayesian language at all. It’s true that you need to specify the prevalence of cases in which H0 is true/untrue. That’s an ordinary objective probability, though in practice it will rarely be possible to estimate its value. What changed my mind about the problem is the realisation that you can get similar values for the false discovery rate without having to make assumptions about this prior.

(a) If you consider only P values that are close to 0.05 (rather than P < 0.05) the false discovery rate is at least 30% regardless of what you assume about the prevalence (prior), and

(b) the Sellke et al. approach also gives (similar) results that are independent of the prior.

In the light of these approaches, I don't think it's possible to deny that we have been badly misled by null-hypothesis testing as it is almost universally practised, at least in biomedical sciences.

September 15, 2014 at 10:12 pm

Reblogged this on jauntytraveller and commented:

In my future blog this delightful piece would go to section “everyone must know this”, hence reblog!

September 16, 2014 at 9:54 am

Observers know very well that 2 sigma is not a strong enough result. So what is the problem? Anyway, the probability is always testing one hypothesis against another. The test is not whether H0 is right or wrong, but whether H0 is more likely than a particular alternative H1. Aren’t you here attacking an incomplete version of the p statistics? I have no opinion on the dispute but am always suspicious of one-sided debates!

September 16, 2014 at 10:16 am

Similarly, one should be suspicious of one-sided probability distributions. 🙂

September 18, 2014 at 6:18 am

Phil: No love for the “one-sided” exponential distribution? 😉

September 18, 2014 at 6:14 am

Albert: Constructing a p-value does not involve an alternative hypothesis, as Peter states in the OP. Now one could use a likelihood ratio as the test statistic, but that isn’t required.

September 17, 2014 at 3:34 am

Jake Vanderplas wrote a blog post a few months ago along similar lines, focusing on the difference between frequentist “confidence intervals” and Bayesian “credible regions”. He works out a couple examples where the difference in interpretation leads to substantive differences in the results.

His main point is essentially that “…scientists often turn the crank of frequentism hoping for useful answers, but in the process overlook the fact that in science, frequentism is generally answering the wrong question.” Specifically, “Many scientists operate as if the confidence interval is a Bayesian credible region, *but it demonstrably is not*.”

Worth a look: https://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/

September 27, 2014 at 4:14 am

There are several confusions in your post, but I’ll just note one: you say that to get a probability that the null hypothesis is actually a correct description of the data “you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is correct”. You mean “incorrect”, but that’s not the problem: the power doesn’t give you any kind of posterior, nor should it. Abusing it as such is another common misinterpretation of tests.

September 27, 2014 at 12:20 pm

Thanks for pointing out the typo, which I have corrected.

The rest of your comment is irrelevant as my post does not say what you claim it does.

My criticism – that frequentist techniques usually ask the wrong question – stands.

September 27, 2014 at 6:05 pm

telescoper: I don’t see your post anymore, but my recollection was that you were seeing power as giving you a posterior probability. It is a common but entirely erroneous claim. No one could demonstrate against your insistence that we’re asking the “wrong” question, but if you care about avoiding erroneous interpretations of data, or are in the mood for methods that can critically scrutinize the frauds and fallacies of others, then you might find yourself in need of these tools, correctly interpreted.

September 27, 2014 at 8:28 pm

I didn’t say that. And I am fully aware of the correct (ie Bayesian) way to deal consistently with the frauds and fallacies of others (ie frequentists).

September 27, 2014 at 11:41 pm

Well if I have the time, I’ll find where I believe you used power as the basis for a posterior.

That our fallacies are open to criticism is a mark of their scientific credentials. What some fraud busters I know have been lamenting is the lack of ways to pin down Bayesian mistakes.

September 28, 2014 at 12:46 am

Perhaps because they’re not mistakes?

September 28, 2014 at 1:21 am

You say that no one could ever commit QRPs or fraud if they used Bayesian methods? You should have told Diederik, Potti, Förster.

September 28, 2014 at 1:33 am

Do I? Where do I say that?

September 28, 2014 at 10:27 am

[…] gets some people excited¹, especially, it seems, when Bayes is mentioned. There was even a rapid response blog from (Bayesian) cosmologist at Sussex, Peter […]

September 29, 2014 at 11:45 pm

Albert says “Observers know very well that 2 sigma is not a strong enough result”

Sadly that isn’t true at all.

For example, very recently, *Science* trumpeted on Twitter “Non-invasive stimulation of the brain can improve memories . . .”. The paper was behind a paywall, so most tweeters would not have read it. In fact most of the paper was about fMRI; the bit about memory was a subsection of one figure, and it had P = 0.043. The evidence was pathetically poor, and this sort of innumerate behaviour by glamour journals does great harm to science.

September 30, 2014 at 12:26 pm

Here’s another high-profile but very dodgy statistical analysis I blogged about a few years ago:

https://telescoper.wordpress.com/2008/11/12/cerebral-asymmetry-is-it-all-in-the-mind/

October 25, 2014 at 7:44 am

[…] Popped into the office for a spot of lunch in between induction events and discovered that Jon Butterworth has posted an item on his Grauniad blog about how particle physicists use statistics, and … […]
