## The Curse of P-values

Yesterday evening I noticed a news item in Nature that argues that inappropriate statistical methodology may be undermining the reporting of scientific results. The article focuses on lack of “reproducibility” of results.

The article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value of a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient *r* obtained from a set of bivariate data. If the data were uncorrelated then *r* would have a known probability distribution, and if the value measured from the sample were such that a value at least that extreme would arise with a probability of 0.05 under that distribution, then the p-value (or significance level) is 0.05.
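To make the recipe concrete, here is a small sketch (my own illustrative numbers, nothing to do with the Nature article) that estimates the p-value for a sample correlation coefficient by permutation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data (toy numbers): x and y are generated independently,
# so the null hypothesis of zero correlation is actually true here.
n = 50
x = rng.normal(size=n)
y = rng.normal(size=n)

# Observed sample correlation coefficient r
r_obs = np.corrcoef(x, y)[0, 1]

# Null distribution of r by permutation: shuffling y destroys any
# association, so each shuffled r is a draw under the null hypothesis.
n_perm = 10_000
r_null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                   for _ in range(n_perm)])

# Two-sided p-value: the fraction of null draws at least as extreme
# as the observed value
p_value = np.mean(np.abs(r_null) >= np.abs(r_obs))
```

The permutation approach sidesteps the need to know the analytic null distribution of *r*, but note that it answers exactly the same (frequentist) question.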

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is *actually* a correct description of the data. To make that sort of statement you would need to specify an alternative hypothesis, calculate the distribution of the test statistic under it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
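To see the distinction in action, here is a toy simulation (my own construction, assuming a simple two-sided z-test on the mean with known unit variance):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a two-sided z-test of H0: mean = 0 on samples of size n,
# drawn with known unit variance.
alpha = 0.05           # the Type I error rate, fixed by construction
z_crit = 1.96          # critical value for a two-sided test at alpha = 0.05
n, n_trials = 30, 5000

# Type I error: H0 is TRUE (true mean 0) -- how often do we wrongly reject?
means_h0 = rng.normal(0.0, 1.0, size=(n_trials, n)).mean(axis=1)
type_i = np.mean(np.abs(means_h0) * np.sqrt(n) > z_crit)   # close to alpha

# Power: H0 is FALSE (true mean 0.5) -- how often do we rightly reject?
means_h1 = rng.normal(0.5, 1.0, size=(n_trials, n)).mean(axis=1)
power = np.mean(np.abs(means_h1) * np.sqrt(n) > z_crit)

# Type II error: failing to reject the false null
type_ii = 1.0 - power
```

Note that the power depends entirely on the alternative you chose to simulate (here a true mean of 0.5); the p-value alone tells you nothing about it.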

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean.

The Nature story mentioned above reports that results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are typically rather small.
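To see how a violated normality assumption can corrupt a p-value, here is a toy simulation of my own (not from the Nature piece): the classical F-test for equality of variances, which leans heavily on normality, applied to heavy-tailed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_trials = 30, 4000
rejections = 0
for _ in range(n_trials):
    # Both samples have the SAME variance, so the null hypothesis of
    # equal variances is true -- but the data are heavy-tailed
    # (Student's t with 5 degrees of freedom), not normal.
    a = rng.standard_t(5, size=n)
    b = rng.standard_t(5, size=n)
    f = np.var(a, ddof=1) / np.var(b, ddof=1)
    # Two-sided F-test p-value, computed ASSUMING normality
    p = 2 * min(stats.f.cdf(f, n - 1, n - 1), stats.f.sf(f, n - 1, n - 1))
    rejections += (p < 0.05)

# The fraction of (wrong) rejections of a true null: nominally 5%,
# but far larger because the normality assumption fails
actual_size = rejections / n_trials
```

The quoted p-values here are simply wrong: the test rejects a true null far more often than its nominal 5% level.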

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05. While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem, which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question quite different from the one a scientist would actually want to ask, namely what the data have to say about a given hypothesis. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis. If I had my way I’d ban p-values altogether.
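To illustrate the Bayesian question (the probability of a hypothesis given the data) in the simplest possible setting, here is a toy calculation with two simple hypotheses; all the numbers are hypothetical:

```python
from math import comb

# Toy data (hypothetical): 62 heads observed in 100 coin tosses.
k, n = 62, 100

# Two simple hypotheses with equal prior probabilities:
# H0: the coin is fair (p = 0.5); H1: the coin is biased with p = 0.7.
lik0 = comb(n, k) * 0.5**k * 0.5**(n - k)   # P(data | H0)
lik1 = comb(n, k) * 0.7**k * 0.3**(n - k)   # P(data | H1)

# Bayes' theorem with equal priors: the posterior probability of H0
# given the data -- the quantity a scientist actually wants
post0 = lik0 / (lik0 + lik1)
```

The output is a direct probability statement about the hypothesis itself, rather than about hypothetical repetitions of the data.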

Not that it’s always easy to implement a Bayesian approach. Coincidentally a recent paper on the arXiv discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.
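The paradox is easy to reproduce numerically. Here is a toy version (my own numbers, not taken from the paper): a coin tossed many times, with a point null tested against a composite alternative carrying a uniform prior:

```python
from scipy import stats

# Toy numbers: a coin tossed n times, with point null H0: theta = 1/2
# versus the composite alternative H1: theta uniform on [0, 1].
n = 100_000
k = 50_316                    # roughly 2 standard deviations above n/2

# Frequentist side: two-sided p-value under H0 (normal approximation)
z = (k - n / 2) / (n**0.5 / 2)
p_value = 2 * stats.norm.sf(abs(z))      # below 0.05, so "reject H0"

# Bayesian side: the Bayes factor B01 = P(data|H0) / P(data|H1).
# With a uniform prior on theta, P(k|H1) = 1/(n+1) exactly.
b01 = stats.binom.pmf(k, n, 0.5) * (n + 1)

# Posterior probability of the null with even prior odds:
# the same data leave H0 overwhelmingly probable.
post_null = b01 / (1 + b01)
```

The same data set thus "rejects" the null at the 5% level while leaving its posterior probability above 90%, which is the paradox in a nutshell.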

Rather than tell you what I think about this paradox, I thought I’d invite discussion through the comments box…

November 12, 2013 at 3:09 pm

I haven’t read this paper in detail, but at first glance I don’t understand why the fact that Bayesian and frequentist analyses give different answers should be called a “paradox.” The two analyses are designed to answer different questions, so we shouldn’t be surprised that they sometimes give different answers. It’s more surprising when they give the same answers!

November 12, 2013 at 3:33 pm

They’re not *designed* to answer different questions; that’s what they actually do, but they are intended to answer the same question. The Bayesian way answers it coherently, the sampling-theoretical way incoherently. As Peter wrote above, “If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is.”

And also:

“If I had my way I’d ban p-values altogether.”

You are about to have your way in the new e-journal that you are setting up. Ban p-values – make my day! They actually import some of the arbitrariness of decision theory into probability theory.

“I noticed a news item in Nature that argues that inappropriate statistical methodology may be undermining the reporting of scientific results. The article focuses on lack of “reproducibility” of results… Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are typically rather small.”

I suspect it matters most today in the testing of pharmaceutical drugs.

November 12, 2013 at 4:22 pm

yes, all papers that involve p-values will be rejected by the People’s Revolutionary Journal of Astrophysics and all authors thereof sent for compulsory re-education…

November 12, 2013 at 5:07 pm

That was interesting to read; thanks Peter. It’s a funny sort of paradox, as at one level it’s sort of obvious that delta functions in your prior will do nasty things, but at another level it looks such a tempting way to try to ‘match’ the frequentist analysis.

November 12, 2013 at 5:10 pm

Delta-function priors are the outstanding way to demonstrate to recalcitrant sampling-theorists/frequentists that the taking into account of prior information is a strength of Bayesian analysis, not a weakness.

November 12, 2013 at 10:51 pm

I made it about a third of the way through before wondering how Cousins managed to find the time to write this all up?! One take-away point I think he’s got right: when testing scientific hypotheses we need to think about whether or not we have a genuine point null. E.g. when testing whether the fine structure constant varies at late times, *any* variation does indeed violate the null, but when asking whether aspirin helps fight heart disease, extremely small (i.e. practically irrelevant) effects can probably be subsumed within the null.

(We need to think about the same when deciding whether or not the Bayesian model averaging of cosmological models with, e.g., fixed and varying spectral indices is of any scientific value?)

November 13, 2013 at 5:13 am

Many thanks for highlighting the issues with frequentist statistics. I’ve often felt ill at ease with the assumptions inherent in P-values when applying them to biomechanical data, particularly the arbitrary nature of the conventional 5% significance level and the ‘fixes’ used to tailor the sensitivity of the analysis.

I must admit to presently perceiving Bayesian statistics as being beyond my understanding but given the arguments presented here it’s time for some study.

November 13, 2013 at 12:24 pm

You’ll find it’s easier done Bayesian – the hard part is unlearning the parts of what you’ve been taught that are wrong.

Forget ‘probability’ for a moment and consider a number p(A|B) that represents how strongly B implies A, where A and B are binary propositions (i.e., True or False); more formally, a measure of how strongly A is implied to be true upon supposing that B is true, according to relations known between their referents.

Degree of implication is what you actually want in every problem involving uncertainty.

If propositions obey Boolean algebra then it can be shown that the degrees of implication for them obey corresponding algebraic relations. These relations turn out to be the sum and product rules (and hence Bayes’ theorem, an immediate corollary).
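Written out explicitly (in standard notation; this gloss is mine, not part of the original comment), the rules and their corollary are:

```latex
% Sum and product rules for degrees of implication p(A|C):
p(A \mid C) + p(\bar{A} \mid C) = 1
  \qquad \text{(sum rule)}

p(A, B \mid C) = p(A \mid B, C)\, p(B \mid C)
  \qquad \text{(product rule)}

% Bayes' theorem follows from the symmetry p(A, B | C) = p(B, A | C):
p(A \mid B, C) = \frac{p(B \mid A, C)\, p(A \mid C)}{p(B \mid C)}
```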

So let’s call degree of implication ‘probability’.

But if frequentists (or any other school) object to that naming then simply bypass them. Calculate the degree of implication in each problem, because it’s what you want in order to solve any problem, and call it whatever you like. In choosing a definition of probability there are no higher criteria than consistency and universality.

In this viewpoint there are no worries over ‘belief’ or imaginary ensembles, and all probabilities are automatically conditional. The confusing word ‘random’ is downplayed.

The two laws of probability – long known – were first derived from Boolean algebra in 1946 by RT Cox. (Cox did not use the rhetorical trick that I am advocating about the name of the quantity.) Today the best derivation from Boolean algebra is by Kevin Knuth. The crucial aspect of Boolean algebra is associativity, which leads to a functional equation for the probabilities.

NB There are many books with ‘Bayesian’ in the title that don’t understand that the arguments of probabilities are binary propositions, which is the core point; you can check quickly by seeing if RT Cox is mentioned in the index and looking at the pages mentioned. Bayesian Logical Data Analysis for the Physical Sciences by Phil Gregory and Data Analysis: A Bayesian Tutorial by Devinder Sivia are both based on this approach.

Bayes’ theorem tells you how to incorporate new information – phrased propositionally – into a probability. You do this when you get experimental data. But how to assign the probability that you update, in the first place? This is the ‘prior probability’ issue. Sampling theorists/frequentists have used it to beat Bayesians over the head with, without realising that the neglect of prior information is a weakness of their own methods, not a strength – as shown by delta-function priors mentioned above. We don’t know how to encode (prior) information in general cases, but that is a matter for research, not castigation, and only Bayesians stand a chance of finding out. Symmetry is a key principle: in the simple case of the probability for the location of a bead on a horizontal circle of wire, the probability is uniform.

November 13, 2013 at 1:23 pm

Aren’t the main issues with p-values related to unclear assumptions (which hypothesis is tested against which) and unspecified uncertainties on the p-values? The latter can be calculated with bootstrapping, and the former by using several different methods. Banning p-values might go a tad far.

Bins are very useful tools as they allow for quantitative uncertainties. Sure, badly chosen bins can hide or create things, and bins should be chosen before the data are known, and not adjusted to highlight clustered points.
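A rough sketch of what bootstrapping the p-value might look like (a toy construction of my own: the data, sample size and use of scipy’s `pearsonr` for the correlation test are all just assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical sample with a weak real correlation built in
n = 40
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

# Bootstrap the p-value itself: resample (x, y) PAIRS with replacement
# and recompute the correlation test on each replicate.
n_boot = 1000
p_boot = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)
    p_boot[i] = stats.pearsonr(x[idx], y[idx])[1]

# A rough interval showing how unstable the quoted p-value can be
spread = np.percentile(p_boot, [16, 84])
```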

November 13, 2013 at 1:28 pm

Albert,

I don’t think so. Even if you specify everything fully and calculate the p-value correctly then it still tells you the answer to the wrong question.

Peter

November 13, 2013 at 4:15 pm

I suppose it depends on whether you want a prior-independent answer to the wrong question, or a prior-dependent answer to the correct one.

November 14, 2013 at 6:49 pm

I’ll grant you that you can sometimes find the answer you want by asking the wrong question, but I think it’s generally a better approach to ask the right question.

November 14, 2013 at 9:13 pm

A question can’t be ‘wrong’. And the answer isn’t wrong either, as long as the numbers are correct. But if the question is misunderstood, the answer can be misinterpreted. Is the prior always ‘right’?

November 14, 2013 at 9:29 pm

Of course a question can be wrong. If you want to know the capital of Peru then “What is the capital of Brazil?” is the wrong question. Occasionally, however, someone will accidentally give you the answer to the question you wanted to ask but didn’t.

The prior is always stated, which is in itself valuable.

November 15, 2013 at 10:30 am

In your example, there is nothing wrong with the question, but it is a classic case of ‘it is just what I asked for but not what I want’. I have no opinion either way, by the way (I will use whatever tool I am told to use to convince myself and others), but you seem to be harder on one than on the other. You are happy with stating an assumed prior probability, while for p-stats you are not happy with stating which hypothesis is tested against which. Are you?

August 8, 2014 at 2:55 am

[…] The Curse of P-values […]

September 15, 2014 at 1:32 pm

[…] was the case with a Nature piece I blogged about some time ago, Jon’s article focuses on the p-value, a frequentist concept that corresponds to the […]

September 22, 2014 at 10:19 am

[…] The bottom line is of course that the polarized emission from Galactic dust is much larger in the BICEP2 field than had been anticipated in the BICEP2 analysis of their data (now published in Physical Review Letters after being refereed). Indeed, as the abstract states, the actual dust contamination in the BICEP2 field is subject to considerable statistical and systematic uncertainties, but seems to be around the same level as BICEP2’s claimed detection. In other words the Planck analysis shows that the BICEP2 result is completely consistent with what is now known about polarized dust emission. I remind you that the original BICEP2 result was spun as a ‘7σ’ detection of a primordial polarization signal associated with gravitational waves. To put it bluntly, the Planck analysis shows that this claim was false. I’m going to resist (for the time being) another rant about p-values… […]

May 20, 2015 at 10:38 am

[…] it says nothing about the “probability of no correlation”. This is an error which is sadly commonplace throughout the scientific literature, not just astronomy. The point is that the p-value represents the probability that the given […]

October 11, 2022 at 4:42 pm

[…] was the case with a Nature piece I blogged about some time ago, this article focuses on the p-value, a frequentist concept that corresponds to the probability […]