Why we should abandon “statistical significance”

Posted in Bad Statistics on September 27, 2017 by telescoper

So a nice paper by McShane et al. has appeared on the arXiv with the title Abandon Statistical Significance and abstract:

In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration–often scant–given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.

This piece is in part a reaction to a paper by Benjamin et al. in Nature Human Behaviour that argues for the adoption of a standard threshold of p=0.005 rather than the more usual p=0.05. This latter paper has generated a lot of interest, but I think it misses the point entirely. The fundamental problem is not what number is chosen for the threshold p-value, but what this statistic does (and does not) mean. It seems to me the p-value is usually an answer to a question which is quite different from that which a scientist would want to ask, which is what the data have to say about a given hypothesis. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.

While I generally agree with the arguments given in McShane et al., I don’t think the paper goes far enough. I think p-values are so misleading that, if I had my way, I’d ban them altogether!

One More for the Bad Statistics in Astronomy File…

Posted in Bad Statistics, The Universe and Stuff on May 20, 2015 by telescoper

It’s been a while since I last posted anything in the file marked Bad Statistics, but I can remedy that this morning with a comment or two on the following paper by Robertson et al. which I found on the arXiv via the Astrostatistics Facebook page. It’s called Stellar activity mimics a habitable-zone planet around Kapteyn’s star and the abstract is as follows:

Kapteyn’s star is an old M subdwarf believed to be a member of the Galactic halo population of stars. A recent study has claimed the existence of two super-Earth planets around the star based on radial velocity (RV) observations. The innermost of these candidate planets–Kapteyn b (P = 48 days)–resides within the circumstellar habitable zone. Given recent progress in understanding the impact of stellar activity in detecting planetary signals, we have analyzed the observed HARPS data for signatures of stellar activity. We find that while Kapteyn’s star is photometrically very stable, a suite of spectral activity indices reveals a large-amplitude rotation signal, and we determine the stellar rotation period to be 143 days. The spectral activity tracers are strongly correlated with the purported RV signal of “planet b,” and the 48-day period is an integer fraction (1/3) of the stellar rotation period. We conclude that Kapteyn b is not a planet in the Habitable Zone, but an artifact of stellar activity.

It’s not really my area of specialism but it seemed an interesting conclusion so I had a skim through the rest of the paper. Here’s the pertinent figure, Figure 3,

It looks like difficult data to do a correlation analysis on and there are lots of questions to be asked about the form of the errors and how the bunching of the data is handled, to give just two examples. I’d like to have seen a much more comprehensive discussion of this in the paper. In particular the statistic chosen to measure the correlation between variates is the Pearson product-moment correlation coefficient, which is intended to measure linear association between variables. There may indeed be correlations in the plots shown above, but it doesn’t look to me as if a straight-line fit characterizes them very well. It looks to me in some of the cases that there are simply two groups of data points…

However, that’s not the real reason for flagging this one up. The real reason is the following statement in the text:

Aargh!

No matter how the p-value is arrived at (see comments above), it says nothing about the “probability of no correlation”. This is an error which is sadly commonplace throughout the scientific literature, not just astronomy. The point is that the p-value relates to the probability that a value of the test statistic (in this case the Pearson product-moment correlation coefficient, r) at least as extreme as the one observed would arise by chance in the sample if the null hypothesis H (in this case that the two variates are uncorrelated) were true. In other words it relates to P(r|H). It does not tell us anything directly about the probability of H. That would require the use of Bayes’ Theorem. If you want to say anything at all about the probability of a hypothesis being true or not you should use a Bayesian approach. And if you don’t want to say anything about the probability of a hypothesis being true or not then what are you trying to do anyway?
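To make the distinction between P(r|H) and P(H|r) concrete, here is a minimal sketch in Python of how Bayes’ Theorem turns one into the other. The numbers are invented purely for illustration and have nothing to do with the Robertson et al. data; the function name and its arguments are my own hypothetical choices.

```python
def posterior_null(p_data_given_null, p_data_given_alt, prior_null):
    """Posterior probability of the null hypothesis H via Bayes' theorem.

    All three inputs are assumptions the analyst must supply; none of
    them is the p-value.
    """
    prior_alt = 1.0 - prior_null
    evidence = p_data_given_null * prior_null + p_data_given_alt * prior_alt
    return p_data_given_null * prior_null / evidence


# Hypothetical numbers: the data are 5 times more probable under the
# alternative than under the null, and the two are equally likely a priori.
print(posterior_null(p_data_given_null=0.02, p_data_given_alt=0.10, prior_null=0.5))
```

With these made-up inputs the posterior probability of H is 1/6, even though the probability of the data under H is only 0.02. Change the prior to 0.9 and the posterior rises to about 0.64. The answer depends on ingredients that a p-value simply does not contain.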

If I had my way I would ban p-values altogether, but if people are going to use them I do wish they would be more careful about the statements they make about them.

Frequentism: the art of probably answering the wrong question

Posted in Bad Statistics on September 15, 2014 by telescoper

Popped into the office for a spot of lunch in between induction events and discovered that Jon Butterworth has posted an item on his Grauniad blog about how particle physicists use statistics, and the ‘5σ rule’ that is usually employed as a criterion for the detection of, e.g. a new particle. I couldn’t resist bashing out a quick reply, because I believe that actually the fundamental issue is not whether you choose 3σ or 5σ or 27σ but what these statistics mean or don’t mean.

As was the case with a Nature piece I blogged about some time ago, Jon’s article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a particular null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05. This is usually called a ‘2σ’ result because for Gaussian statistics a variable has a probability of 95% of lying within 2σ of the mean value.
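For anyone who wants to see this in action, here is a rough Python sketch of the correlation example. Instead of using the known analytic distribution of r, it estimates the null distribution by shuffling one of the variates, which is a permutation test and my own substitution for the sake of a short self-contained example. The data are randomly generated and purely illustrative.

```python
import random
import statistics


def pearson_r(x, y):
    """Sample Pearson product-moment correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Fraction of shuffled data sets whose r is at least as large as observed.

    Shuffling y destroys any association with x, so the shuffled values
    of r sample the distribution of r under the null hypothesis.
    """
    rng = random.Random(seed)
    r_obs = pearson_r(x, y)
    y_shuffled = list(y)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(y_shuffled)
        if pearson_r(x, y_shuffled) >= r_obs:
            count += 1
    return count / n_shuffles


# Made-up illustrative data, not taken from any paper.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.4 * a + rng.gauss(0, 1) for a in x]
print(pearson_r(x, y), permutation_p_value(x, y))
```

Note what the second number is: the fraction of uncorrelated (shuffled) data sets that give an r at least as large as the one observed. It is a statement about the data given the null hypothesis, and nothing more.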

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that large under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
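The two kinds of error can be illustrated with a quick simulation. The sketch below assumes a deliberately simple setup of my own choosing (a one-sided z-test of “mean = 0” on Gaussian data with known unit variance, and an alternative in which the true mean is 0.5); none of these numbers come from Jon’s article.

```python
import random
from statistics import NormalDist


def rejection_rate(true_mean, n=25, alpha=0.05, n_trials=4000, seed=0):
    """Fraction of simulated experiments in which a one-sided z-test rejects
    the null hypothesis 'mean = 0' (unit variance assumed known)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    rejections = 0
    for _ in range(n_trials):
        sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
        z = sample_mean * n ** 0.5
        if z > z_crit:
            rejections += 1
    return rejections / n_trials


type_i_rate = rejection_rate(true_mean=0.0)   # null is true
power = rejection_rate(true_mean=0.5)         # null is false
print(type_i_rate, power, 1.0 - power)        # last value: Type II error rate
```

The first number should come out close to the chosen significance level of 0.05, while the power depends entirely on the alternative you assumed. Neither number is the probability that the null hypothesis is true.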

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Jon’s piece demonstrates that he does, so this is not meant as a personal criticism, but it is a pervasive problem that results quoted in such a way are intrinsically confusing.

The Nature story mentioned above argues that in fact results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are typically rather small.
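Even when the p-value is calculated correctly, a simple piece of arithmetic shows how a sizeable fraction of “significant” results can be wrong. The sketch below is my own back-of-envelope illustration, not the argument made in the Nature story: it assumes a hypothetical population of tests in which a fraction prior_true of the effects are real, each tested at level alpha with a given power.

```python
def false_positive_fraction(alpha, power, prior_true):
    """Fraction of 'significant' results for which the null is actually true."""
    false_pos = alpha * (1.0 - prior_true)
    true_pos = power * prior_true
    return false_pos / (false_pos + true_pos)


# Hypothetical inputs: 5% threshold, 80% power, varying share of real effects.
for prior_true in (0.5, 0.2, 0.1):
    print(prior_true, false_positive_fraction(alpha=0.05, power=0.8, prior_true=prior_true))
```

With these assumed values, if one in five tested effects is real then 20% of the results that pass the 0.05 threshold are false positives; if only one in ten is real the figure is 36%.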

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05 or, in the case of particle physics, a 5σ standard (which translates to a p-value of about 0.0000003!). While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem, which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question which is quite different from that which a scientist would actually want to ask, which is what the data have to say about the probability of a specific hypothesis being true or sometimes whether the data imply one hypothesis more strongly than another. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.
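For reference, the translation between “number of sigma” and p-value for Gaussian statistics is a one-liner. The following Python sketch uses the complementary error function from the standard library; the function names are my own.

```python
import math


def one_sided_p(n_sigma):
    """Probability that a standard Gaussian variable exceeds n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def two_sided_p(n_sigma):
    """Probability that a standard Gaussian variable lies more than
    n_sigma from the mean in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))


for n_sigma in (2, 3, 5):
    print(n_sigma, one_sided_p(n_sigma), two_sided_p(n_sigma))
```

For 2σ the two-sided value is a little under 0.05, which is where the ‘two-sigma’ label comes from. For 5σ the one-sided value is roughly 3 × 10⁻⁷ and the two-sided value about twice that. Whichever convention you pick, the number still describes the data given the null hypothesis.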

Not that it’s always easy to implement a Bayesian approach. It’s especially difficult when the data are affected by complicated noise statistics and selection effects, and/or when it is difficult to formulate a hypothesis test rigorously because one does not have a clear alternative hypothesis in mind. Experimentalists (including experimental particle physicists) seem to prefer to accept the limitations of the frequentist approach rather than tackle the admittedly very challenging problems of going Bayesian. In fact in my experience it seems that those scientists who approach data from a theoretical perspective are almost exclusively Bayesian, while those of an experimental or observational bent stick to their frequentist guns.

Coincidentally a paper on the arXiv not long ago discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.

This paradox isn’t a paradox at all; the different approaches give different answers because they ask different questions. Both could be right, but I firmly believe that one of them answers the wrong question.
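A toy calculation shows the two answers diverging. The Python sketch below is my own illustration under assumed conditions, not the model used in the paper: a point null θ = 0, a Gaussian prior of width tau on θ under the alternative, and a sample mean of n Gaussian measurements. The z-score (and hence the p-value) is held fixed while n grows.

```python
import math


def posterior_null_fixed_z(z, n, tau=1.0, sigma=1.0, prior_null=0.5):
    """Posterior probability of a point null (theta = 0) given a z-score.

    Assumed toy model: the sample mean of n points is Gaussian with
    standard deviation sigma / sqrt(n), and under the alternative
    theta has a Gaussian prior with mean 0 and standard deviation tau.
    """
    k = n * tau ** 2 / sigma ** 2
    # Bayes factor in favour of the null: ratio of marginal likelihoods.
    bayes_factor = math.sqrt(1.0 + k) * math.exp(-0.5 * z ** 2 * k / (1.0 + k))
    odds = bayes_factor * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


z = 3.0  # a "3 sigma" result, whatever the sample size
for n in (10, 1_000, 100_000, 10_000_000):
    print(n, posterior_null_fixed_z(z, n))
```

In this toy model the p-value is the same in every row, yet the posterior probability of the null climbs from a few per cent at n = 10 to well over 90% at the largest n. The same z-score is answering one question, and the posterior is answering another.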

The Curse of P-values

Posted in Bad Statistics on November 12, 2013 by telescoper

Yesterday evening I noticed a news item in Nature that argues that inappropriate statistical methodology may be undermining the reporting of scientific results. The article focuses on lack of “reproducibility” of results.

The article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under the null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05.

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean.

The Nature story mentioned above argues that in fact results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are typically rather small.

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05. While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem, which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question which is quite different from that which a scientist would want to ask, which is what the data have to say about a given hypothesis. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis. If I had my way I’d ban p-values altogether.

Not that it’s always easy to implement a Bayesian approach. Coincidentally a recent paper on the arXiv discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.

Science, Religion and Henry Gee

Posted in Bad Statistics, Books, Talks and Reviews, Science Politics, The Universe and Stuff on September 23, 2013 by telescoper

Last week a piece appeared on the Grauniad website by Henry Gee, who is a Senior Editor at the magazine Nature. I was prepared to get a bit snarky about the article when I saw the title, as it reminded me of an old rant about science being just a kind of religion by Simon Jenkins that got me quite annoyed a few years ago. Henry Gee’s article, however, is actually rather more coherent than that and not really deserving of some of the invective being flung at it.

For example, here’s an excerpt that I almost agree with:

One thing that never gets emphasised enough in science, or in schools, or anywhere else, is that no matter how fancy-schmancy your statistical technique, the output is always a probability level (a P-value), the “significance” of which is left for you to judge – based on nothing more concrete or substantive than a feeling, based on the imponderables of personal or shared experience. Statistics, and therefore science, can only advise on probability – they cannot determine The Truth. And Truth, with a capital T, is forever just beyond one’s grasp.

I’ve made the point on this blog many times that, although statistical reasoning lies at the heart of the scientific method, we don’t do anywhere near enough to teach students how to use probability properly; nor do scientists do enough to explain the uncertainties in their results to decision makers and the general public. I also agree with the concluding thought, that science isn’t about absolute truths. Unfortunately, Gee undermines his credibility by equating statistical reasoning with p-values which, in my opinion, are a frequentist aberration that contributes greatly to the public misunderstanding of science. Worse, he even gets the wrong statistics wrong…

But the main thing that bothers me about Gee’s article is that he blames scientists for promulgating the myth of “science-as-religion”. I don’t think that’s fair at all. Most scientists I know are perfectly well aware of the limitations of what they do. It’s really the media that want to portray everything in simple black and white terms. Some scientists play along, of course, as I comment upon below, but most of us are not priests but pragmatists.

Anyway, this episode gives me the excuse to point out that I ended a book I wrote in 1998 with a discussion of the image of science as a kind of priesthood which it seems apt to repeat here. The book was about the famous eclipse expedition of 1919 that provided some degree of experimental confirmation of Einstein’s general theory of relativity and which I blogged about at some length last year, on its 90th anniversary.

I decided to post the last few paragraphs here to show that I do think there is a valuable point to be made out of the scientist-as-priest idea. It’s to do with the responsibility scientists have to be honest about the limitations of their research and the uncertainties that surround any new discovery. Science has done great things for humanity, but it is fallible. Too many scientists are too certain about things that are far from proven. This can be damaging to science itself, as well as to the public perception of it. Bandwagons proliferate, stifling original ideas and leading to the construction of self-serving cartels. This is a fertile environment for conspiracy theories to flourish.

To my mind the thing that really separates science from religion is that science is an investigative process, not a collection of truths. Each answer simply opens up more questions. The public tends to see science as a collection of “facts” rather than a process of investigation. The scientific method has taught us a great deal about the way our Universe works, not through the exercise of blind faith but through the painstaking interplay of theory, experiment and observation.

This is what I wrote in 1998:

Science does not deal with ‘rights’ and ‘wrongs’. It deals instead with descriptions of reality that are either ‘useful’ or ‘not useful’. Newton’s theory of gravity was not shown to be ‘wrong’ by the eclipse expedition. It was merely shown that there were some phenomena it could not describe, and for which a more sophisticated theory was required. But Newton’s theory still yields perfectly reliable predictions in many situations, including, for example, the timing of total solar eclipses. When a theory is shown to be useful in a wide range of situations, it becomes part of our standard model of the world. But this doesn’t make it true, because we will never know whether future experiments may supersede it. It may well be the case that physical situations will be found where general relativity is supplanted by another theory of gravity. Indeed, physicists already know that Einstein’s theory breaks down when matter is so dense that quantum effects become important. Einstein himself realised that this would probably happen to his theory.

Putting together the material for this book, I was struck by the many parallels between the events of 1919 and coverage of similar topics in the newspapers of 1999. One of the hot topics for the media in January 1999, for example, has been the discovery by an international team of astronomers that distant exploding stars called supernovae are much fainter than had been predicted. To cut a long story short, this means that these objects are thought to be much further away than expected. The inference then is that not only is the Universe expanding, but it is doing so at a faster and faster rate as time passes. In other words, the Universe is accelerating. The only way that modern theories can account for this acceleration is to suggest that there is an additional source of energy pervading the very vacuum of space. These observations therefore hold profound implications for fundamental physics.

As always seems to be the case, the press present these observations as bald facts. As an astrophysicist, I know very well that they are far from unchallenged by the astronomical community. Lively debates about these results occur regularly at scientific meetings, and their status is far from established. In fact, only a year or two ago, precisely the same team was arguing for exactly the opposite conclusion based on their earlier data. But the media don’t seem to like representing science the way it actually is, as an arena in which ideas are vigorously debated and each result is presented with caveats and careful analysis of possible error. They prefer instead to portray scientists as priests, laying down the law without equivocation. The more esoteric the theory, the further it is beyond the grasp of the non-specialist, the more exalted is the priest. It is not that the public want to know – they want not to know but to believe.

Things seem to have been the same in 1919. Although the results from Sobral and Principe had then not received independent confirmation from other experiments, just as the new supernova experiments have not, they were still presented to the public at large as being definitive proof of something very profound. That the eclipse measurements later received confirmation is not the point. This kind of reporting can elevate scientists, at least temporarily, to the priesthood, but does nothing to bridge the ever-widening gap between what scientists do and what the public think they do.

As we enter a new Millennium, science continues to expand into areas still further beyond the comprehension of the general public. Particle physicists want to understand the structure of matter on tinier and tinier scales of length and time. Astronomers want to know how stars, galaxies and life itself came into being. But not only is the theoretical ambition of science getting bigger. Experimental tests of modern particle theories require methods capable of probing objects a tiny fraction of the size of the nucleus of an atom. With devices such as the Hubble Space Telescope, astronomers can gather light that comes from sources so distant that it has taken most of the age of the Universe to reach us from them. But extending these experimental methods still further will require yet more money to be spent. At the same time that science reaches further and further beyond the general public, the more it relies on their taxes.

Many modern scientists themselves play a dangerous game with the truth, pushing their results one-sidedly into the media as part of the cut-throat battle for a share of scarce research funding. There may be short-term rewards, in grants and TV appearances, but in the long run the impact on the relationship between science and society can only be bad. The public responded to Einstein with unqualified admiration, but Big Science later gave the world nuclear weapons. The distorted image of scientist-as-priest is likely to lead only to alienation and further loss of public respect. Science is not a religion, and should not pretend to be one.

PS. You will note that I was voicing doubts about the interpretation of the early results from supernovae in 1998 that suggested the universe might be accelerating and that dark energy might be the reason for its behaviour. Although more evidence supporting this interpretation has since emerged from WMAP and other sources, I remain sceptical that we cosmologists are on the right track about this. Don’t get me wrong – I think the standard cosmological model is the best working hypothesis we have – I just think we’re probably missing some important pieces of the puzzle. I don’t apologise for that. I think sceptical is what a scientist should be.