## Frequentism: the art of probably answering the wrong question

Posted in Bad Statistics with tags , , , , , , on September 15, 2014 by telescoper

Popped into the office for a spot of lunch in between induction events and discovered that Jon Butterworth has posted an item on his Grauniad blog about how particle physicists use statistics, and the ‘5σ rule’ that is usually employed as a criterion for the detection of, e.g. a new particle. I couldn’t resist bashing out a quick reply, because I believe that actually the fundamental issue is not whether you choose 3σ or 5σ or 27σ but what these statistics mean or don’t mean.

As was the case with a Nature piece I blogged about some time ago, Jon’s article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a particular null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample is one that would be exceeded with a probability of 0.05 under the null, then the p-value (or significance level) is 0.05. This is usually called a ‘2σ’ result because for Gaussian statistics a variable has a probability of about 95% of lying within 2σ of the mean value.
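To make the logic concrete, here is a minimal Monte Carlo sketch of the correlation example. The observed value of r and the sample size are made up for illustration; the point is simply that the p-value is the frequency with which uncorrelated data would produce a correlation at least as large as the one measured.

```python
# A minimal Monte Carlo sketch of the p-value in the correlation example.
# The observed r and sample size are hypothetical, chosen for illustration.
import math
import random

def pearson_r(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(42)
n, trials = 20, 20000
r_observed = 0.45  # hypothetical measured correlation

# Sampling distribution of r under the null hypothesis (uncorrelated Gaussians)
null_rs = []
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]
    null_rs.append(pearson_r(x, y))

# One-sided p-value: how often the null produces an r at least as large
# as the observed value
p_value = sum(r >= r_observed for r in null_rs) / trials
print(f"one-sided p-value for r = {r_observed}, n = {n}: {p_value:.3f}")
```

Note that nothing in this calculation involves the probability that the null hypothesis is actually true; it describes only the behaviour of the data under that hypothesis.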

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that large under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null hypothesis if it were correct. This is what is called making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative hypothesis, calculate the distribution of the test statistic under it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
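The distinction between the Type I error rate and the power can be illustrated with a little simulation. The setup here (a one-sided test of a Gaussian mean, with an assumed alternative of μ = 0.5) is purely illustrative:

```python
# Illustrative simulation of Type I error rate and power for a one-sided
# Gaussian mean test. The alternative mu = 0.5 is an assumed value.
import math
import random

random.seed(1)
n, trials = 25, 20000
z_crit = 1.645                 # one-sided 5% critical value
se = 1 / math.sqrt(n)          # standard error of the mean (unit variance)

def sample_mean(mu):
    return sum(random.gauss(mu, 1) for _ in range(n)) / n

# Type I error: rejecting the null (mu = 0) when it is actually true
type1 = sum(sample_mean(0.0) > z_crit * se for _ in range(trials)) / trials

# Power: rejecting the null when the alternative (mu = 0.5) is true
power = sum(sample_mean(0.5) > z_crit * se for _ in range(trials)) / trials

print(f"Type I error rate ~ {type1:.3f} (close to the nominal 0.05)")
print(f"power ~ {power:.3f}")
```

The crucial point is that the Type I rate depends only on the null hypothesis, while the power cannot even be defined without committing to an alternative.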

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Jon’s piece demonstrates that he does, so this is not meant as a personal criticism, but it is a pervasive problem that results quoted in such a way are intrinsically confusing.

The Nature story mentioned above argues that results quoted with a p-value of 0.05 in fact turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but it is particularly prevalent in the social sciences, where samples are typically rather small.

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05 or, in the case of particle physics, a 5σ standard (which translates to a one-sided p-value of about 0.0000003). While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem, which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question quite different from the one a scientist would actually want to ask, namely what the data have to say about the probability of a specific hypothesis being true, or sometimes whether the data favour one hypothesis over another. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.
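For reference, the mapping between ‘nσ’ thresholds and tail probabilities is easy to compute with the standard library; this sketch uses the one-sided Gaussian tail, which is the convention usually quoted in particle physics:

```python
# One-sided Gaussian tail probability for an n-sigma threshold,
# using only the standard library.
import math

def one_sided_p(nsigma):
    # P(X > nsigma) for a standard Gaussian variate X
    return 0.5 * math.erfc(nsigma / math.sqrt(2))

for nsigma in (2, 3, 5):
    print(f"{nsigma} sigma -> p = {one_sided_p(nsigma):.2e}")
```

The 5σ figure comes out at about 2.9 × 10⁻⁷, i.e. roughly one chance in 3.5 million of the null producing so extreme a result, which is precisely the kind of statement that is so easily misread as a probability about the hypothesis itself.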

I feel so strongly about this that if I had my way I’d ban p-values altogether…

Not that it’s always easy to implement a Bayesian approach. It’s especially difficult when the data are affected by complicated noise statistics and selection effects, and/or when it is difficult to formulate a hypothesis test rigorously because one does not have a clear alternative hypothesis in mind. Experimentalists (including experimental particle physicists) seem to prefer to accept the limitations of the frequentist approach rather than tackle the admittedly very challenging problems of going Bayesian. In fact in my experience it seems that those scientists who approach data from a theoretical perspective are almost exclusively Bayesian, while those of an experimental or observational bent stick to their frequentist guns.

Coincidentally a paper on the arXiv not long ago discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.

This paradox isn’t a paradox at all; the different approaches give different answers because they ask different questions. Both could be right, but I firmly believe that one of them answers the wrong question.

## Scotland Should Decide…

Posted in Bad Statistics, Politics, Science Politics with tags , , , , , , , , , on September 9, 2014 by telescoper

There being less than two weeks to go before the forthcoming referendum on Scottish independence, a subject on which I have so far refrained from commenting, I thought I would write something on it from the point of view of an English academic. I was finally persuaded to take the plunge because of incoming traffic to this blog from  pro-independence pieces here and here and a piece in Nature News on similar matters.

I’ll say at the outset that this is an issue for the Scots themselves to decide. I’m a believer in democracy and think that the wishes of the Scottish people as expressed through a referendum should be respected. I’m not qualified to express an opinion on the wider financial and political implications so I’ll just comment on the implications for science research, which is directly relevant to at least some of the readers of this blog. What would happen to UK research if Scotland were to vote yes?

Before going on I’ll just point out that the latest opinion poll by YouGov puts the “Yes” (i.e. pro-independence) vote ahead of “No” at 51%-49%. As the sample size for this survey was only just over a thousand, it has a margin of error of ±3%. On that basis I’d call the race neck-and-neck to within the resolution of the survey statistics. It does annoy me that pollsters never bother to state their margin of error in press releases. Nevertheless, the current picture is a lot closer than it looked just a month ago, which is interesting in itself, as it is not clear to me as an outsider why it has changed so dramatically and so quickly.

Anyway, according to a Guardian piece from not long ago:

Scientists and academics in Scotland would lose access to billions of pounds in grants and the UK’s world-leading research programmes if it became independent, the Westminster government has warned.

David Willetts, the UK science minister, said Scottish universities were “thriving” because of the UK’s generous and highly integrated system for funding scientific research, winning far more funding per head than the UK average.

Unveiling a new UK government paper on the impact of independence on scientific research, Willetts said that despite its size the UK was second only to the United States for the quality of its research.

“We do great things as a single, integrated system and a single integrated system brings with it great strengths,” he said.

Overall spending on scientific research and development in Scottish universities from government, charitable and industry sources was more than £950m in 2011, giving a per capita spend of £180 compared to just £112 per head across the UK as a whole.

It is indeed notable that Scottish universities outperform those in the rest of the United Kingdom when it comes to research, but it always struck me that using this as an argument against independence is difficult to sustain. In fact it’s rather similar to the argument that the UK does well out of European funding schemes so that is a good argument for remaining in the European Union. The point is that, whether or not a given country benefits from the funding system, it still has to do so by following an agenda that isn’t necessarily its own. Scotland benefits from UK Research Council funding, but their priorities are set by the Westminster government, just as the European Research Council sets (sometimes rather bizarre) policies for its schemes. Who’s to say that Scotland wouldn’t do even better than it does currently by taking control of its own research funding rather than forcing its institutions to pander to Whitehall?

It’s also interesting to look at the flipside of this argument. If Scotland were to become independent, would the “billions” of research funding it would lose (according to the statement by Willetts, who is no longer the Minister in charge) benefit science in what’s left of the United Kingdom? There are many in England and Wales who think the existing research budget is already spread far too thinly and who would welcome an increase south of the border. If this did happen you could argue that, from a very narrow perspective, Scottish independence would be good for science in the rest of what is now the United Kingdom, but that depends on whether the Westminster government would keep the science budget at its present level.

This all depends on how research funding would be redistributed if and when Scotland secedes from the Union, which could be done in various ways. The simplest would be for Scotland to withdraw from RCUK entirely. Because of the greater effectiveness of Scottish universities at winning funding compared to the rest of the UK, Scotland would have to spend more per capita to maintain its current level of resource, which is why many Scottish academics will be voting “no”. On the other hand, it has been suggested (by the “yes” campaign) that Scotland could buy back into RCUK from its own revenue at the current effective per capita rate and thus maintain its present infrastructure and research expenditure at no extra cost. This, to me, sounds like wanting to have your cake and eat it, and it’s by no means obvious that Westminster could or should agree to such a deal. All the soundings I have taken suggest that an independent Scotland should expect no such generosity, and would actually get zilch from RCUK.

If full separation is the way ahead, science in Scotland would be heading into uncharted waters. Among the questions that would need to be answered are:

•  what will happen to RCUK funded facilities and staff currently situated in Scotland, such as those at the UKATC?
•  would Scottish researchers lose access to facilities located in England, Wales or Northern Ireland?
•  would Scotland have to pay its own subscriptions to CERN, ESA and ESO?

These are complicated issues and there’s no question that a lengthy process of negotiation would be needed to settle them. In the meantime, why should RCUK risk investing further funds in programmes and facilities that may end up outside the UK (or what remains of it)? This is a recipe for planning blight on an enormous scale.

And then there’s the issue of EU membership. Would Scotland be allowed to join the EU immediately on independence? If not, what would happen to EU funded research?

I’m not saying these things will necessarily work out badly in the long run for Scotland, but they are certainly questions I’d want to have answered before I were convinced to vote “yes”. I don’t have a vote so my opinion shouldn’t count for very much, but I wonder if there are any readers of this blog from across the Border who feel like expressing an opinion?

## Politics, Polls and Insignificance

Posted in Bad Statistics, Politics with tags , , , , , on July 29, 2014 by telescoper

In between various tasks I had a look at the news and saw a story about opinion polls that encouraged me to make another quick contribution to my bad statistics folder.

The piece concerned (in the Independent) includes the following statement:

A ComRes survey for The Independent shows that the Conservatives have dropped to 27 per cent, their lowest in a poll for this newspaper since the 2010 election. The party is down three points on last month, while Labour, now on 33 per cent, is up one point. Ukip is down one point to 17 per cent, with the Liberal Democrats up one point to eight per cent and the Green Party up two points to seven per cent.

The link added to ComRes is mine; the full survey can be found here. Unfortunately, the report, as is sadly almost always the case in surveys of this kind, neglects any mention of the statistical uncertainty in the poll. In fact the poll is based on a telephone sample of just 1001 respondents. Suppose the fraction of the population having the intention to vote for a particular party is $p$. For a sample of size $n$ with $x$ respondents indicating that they intend to vote for the party concerned, one can straightforwardly estimate $p \simeq x/n$. So far so good, as long as there is no bias induced by the form of the question asked nor in the selection of the sample, which for a telephone poll is doubtful.

A  little bit of mathematics involving the binomial distribution yields an answer for the uncertainty in this estimate of p in terms of the sampling error:

$\sigma = \sqrt{\frac{p(1-p)}{n}}$

For the sample size given, and a value $p \simeq 0.33$ this amounts to a standard error of about 1.5%. About 95% of samples drawn from a population in which the true fraction is $p$ will yield an estimate within $p \pm 2\sigma$, i.e. within about 3% of the true figure. In other words the typical variation between two samples drawn from the same underlying population is about 3%.
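The calculation is a one-liner; here is a quick sanity check of the numbers quoted above:

```python
# Sanity check of the sampling error for the poll figures quoted above
# (p ~ 0.33, n = 1001).
import math

def sampling_error(p, n):
    """Standard error of a binomial proportion estimate."""
    return math.sqrt(p * (1 - p) / n)

sigma = sampling_error(0.33, 1001)
print(f"standard error ~ {sigma:.4f}; ~95% margin ~ +/- {2 * sigma:.4f}")
```

This reproduces the standard error of about 1.5%, and hence a two-sigma margin of about 3%, quoted in the text.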

If you don’t believe my calculation then you could use ComRes’ own “margin of error calculator”. The UK electorate as of 2012 numbered 46,353,900 and a sample size of 1001 returns a margin of error of 3.1%. This figure is not quoted in the report, however.

Looking at the figures quoted in the report will tell you that all of the changes reported since last month’s poll are within the sampling uncertainty and are therefore consistent with no change at all in underlying voting intentions over this period.

A summary of the report posted elsewhere states:

A ComRes survey for the Independent shows that Labour have jumped one point to 33 per cent in opinion ratings, with the Conservatives dropping to 27 per cent – their lowest support since the 2010 election.

No! There’s no evidence of support for Labour having “jumped one point”, even if you could describe such a marginal change as a “jump” in the first place.

Statistical illiteracy is as widespread amongst politicians as it is amongst journalists, but the fact that silly reports like this are commonplace doesn’t make them any less annoying. After all, the idea of sampling uncertainty isn’t all that difficult to understand. Is it?

And with so many more important things going on in the world that deserve better press coverage than they are getting, why does a “quality” newspaper waste its valuable column inches on this sort of twaddle?

## Time for a Factorial Moment…

Posted in Bad Statistics with tags , , on July 22, 2014 by telescoper

Another very busy and very hot day so no time for a proper blog post. I suggest we all take a short break and enjoy a Factorial Moment:

I remember many moons ago spending ages calculating the factorial moments of the Poisson-Lognormal distribution, only to find that they were well known. If only I’d had Google then…

## The Power Spectrum and the Cosmic Web

Posted in Bad Statistics, The Universe and Stuff with tags , , , , , , on June 24, 2014 by telescoper

One of the things that makes this conference different from most cosmology meetings is that it is focussing on the large-scale structure of the Universe as a topic in itself, rather than as a source of statistical information about, e.g., cosmological parameters. This means that we’ve been hearing about a set of statistical methods that is somewhat different from those usually used in the field (which are primarily based on second-order quantities).

One of the challenges cosmologists face is how to quantify the patterns we see in galaxy redshift surveys. In the relatively recent past the small size of the available data sets meant that only relatively crude descriptors could be used; anything sophisticated would be rendered useless by noise. For that reason, statistical analysis of galaxy clustering tended to be limited to the measurement of autocorrelation functions, usually constructed in Fourier space in the form of power spectra; you can find a nice review here.

Because it is so robust and contains a great deal of important information, the power spectrum has become ubiquitous in cosmology. But I think it’s important to realise its limitations.

Take a look at these two N-body computer simulations of large-scale structure:

The one on the left is a proper simulation of the “cosmic web”, which is at least qualitatively realistic in that it contains filaments, clusters and voids pretty much like those observed in galaxy surveys.

To make the picture on the right I first took the Fourier transform of the original simulation. This approach follows the best advice I ever got from my thesis supervisor: “if you can’t think of anything else to do, try Fourier-transforming everything.”

Anyway each Fourier mode is complex and can therefore be characterized by an amplitude and a phase (the modulus and argument of the complex quantity). What I did next was to randomly reshuffle all the phases while leaving the amplitudes alone. I then performed the inverse Fourier transform to construct the image shown on the right.

What this procedure does is to produce a new image which has exactly the same power spectrum as the first. You might be surprised by how little the pattern on the right resembles that on the left, given that they share this property; the distribution on the right is much fuzzier. In fact, the sharply delineated features  are produced by mode-mode correlations and are therefore not well described by the power spectrum, which involves only the amplitude of each separate mode. In effect, the power spectrum is insensitive to the part of the Fourier description of the pattern that is responsible for delineating the cosmic web.
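For anyone who wants to try the phase-shuffling experiment themselves, here is a rough sketch in Python with NumPy, using a Gaussian random field as a stand-in for the actual N-body density field. One detail worth noting: for the reshuffled field to come out real, the new phases have to respect the Hermitian symmetry of the transform of a real field; the sketch arranges this by borrowing the phases of the transform of another random real field.

```python
# Sketch of the phase-shuffling procedure: keep each Fourier amplitude,
# randomise each phase, and transform back. A Gaussian random field stands
# in here for the real N-body density field.
import numpy as np

rng = np.random.default_rng(0)
field = rng.standard_normal((64, 64))

modes = np.fft.fft2(field)
amplitudes = np.abs(modes)

# Borrow the (automatically Hermitian-symmetric) phases of another random
# real field, so that the inverse transform comes out real
other = np.fft.fft2(rng.standard_normal((64, 64)))
unit_phases = other / np.abs(other)

shuffled = np.fft.ifft2(amplitudes * unit_phases)
shuffled_field = shuffled.real  # imaginary part is numerical noise only

# The two fields have identical power spectra, mode by mode
power_original = np.abs(np.fft.fft2(field)) ** 2
power_shuffled = np.abs(np.fft.fft2(shuffled_field)) ** 2
print(np.allclose(power_original, power_shuffled))  # True
```

Plotting `field` and `shuffled_field` side by side reproduces the effect described above: identical power spectra, but any sharp features in the original are smeared out once the phase correlations are destroyed.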

If you’re confused by this, consider the Fourier transforms of (a) white noise and (b) a Dirac delta-function. Both produce flat power spectra, but they look very different in real space because in (b) all the Fourier modes are correlated in such a way that they are in phase at the one location where the pattern is non-zero; everywhere else they interfere destructively. In (a) the phases are distributed randomly.
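That contrast is easy to verify numerically; a quick sketch with NumPy:

```python
# White noise and a delta function both have flat power spectra,
# despite looking completely different in real space.
import numpy as np

rng = np.random.default_rng(1)
n = 256

white_noise = rng.standard_normal(n)
delta = np.zeros(n)
delta[n // 2] = 1.0  # a single spike

power_noise = np.abs(np.fft.fft(white_noise)) ** 2
power_delta = np.abs(np.fft.fft(delta)) ** 2

# The delta function's spectrum is exactly flat: every mode has power 1,
# since the modes share the same amplitude and differ only in phase
print(power_delta.min(), power_delta.max())

# White noise is flat only on average; individual modes scatter about the mean
print(power_noise.mean(), power_noise.std())
```

All the information distinguishing the spike from the noise lives in the phases, which is exactly the part the power spectrum throws away.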

The moral of this is that there is much more to the pattern of galaxy clustering than meets the power spectrum…

## Published BICEP2 paper admits “Unquantifiable Uncertainty”..

Posted in Bad Statistics, The Universe and Stuff with tags , , , , , , on June 20, 2014 by telescoper

Just a quick post to pass on the news that the BICEP2 results that excited so much press coverage earlier this year have now been published in Physical Review Letters. A free PDF version of the paper can be found here. The published version incorporates a couple of important caveats that have arisen since the original release of the results prior to peer review. In particular, in the abstract (discussing models of the dust foreground emission):

However, these models are not sufficiently constrained by external public data to exclude the possibility of dust emission bright enough to explain the entire excess signal. Cross correlating BICEP2 against 100 GHz maps from the BICEP1 experiment, the excess signal is confirmed with 3σ significance and its spectral index is found to be consistent with that of the CMB, disfavoring dust at 1.7 σ.

Since the primary question-mark over the original result was whether the signal was due to dust or CMB, this corresponds to an admission that the detection is really at very low significance. I’ll set aside my objection to the frequentist language used in this statement!

There is an interesting comment in the footnotes too:

In the preprint version of this paper an additional DDM2 model was included based on information taken from Planck conference talks. We noted the large uncertainties on this and the other dust models presented. In the Planck dust polarization paper [96] which has since appeared the maps have been masked to include only regions “where the systematic uncertainties are small, and where the dust signal dominates total emission.” This mask excludes our field. We have concluded the information used for the DDM2 model has unquantifiable uncertainty. We look forward to performing a cross-correlation analysis against the Planck 353 GHz polarized maps in a future publication.

The emphasis is mine. The phrase made me think of this:

The paper concludes:

More data are clearly required to resolve the situation. We note that cross-correlation of our maps with the Planck 353 GHz maps will be more powerful than use of those maps alone in our field. Additional data are also expected from many other experiments, including Keck Array observations at 100 GHz in the 2014 season.

In other words, what I’ve been saying from the outset.

## Uncertain Attitudes

Posted in Bad Statistics, Politics with tags , , , , on May 28, 2014 by telescoper

It’s been a while since I posted anything in the bad statistics file, but an article in today’s Grauniad has now given me an opportunity to rectify that omission.

The piece concerned, entitled Racism on the rise in Britain, is based on some new data from the British Social Attitudes survey; the full report can be found here (PDF). The main result is shown in this graph:

The version of this plot shown in the Guardian piece has the smoothed long-term trend (the blue curve, based on a five-year moving average of the data and clearly generally downward since 1986) removed.

In any case the report, as is sadly almost always the case in surveys of this kind, neglects any mention of the statistical uncertainty in the survey. In fact the last point is based on a sample of 2149 respondents. Suppose the fraction of the population describing themselves as having some prejudice is $p$. For a sample of size $n$ with $x$ respondents indicating that they describe themselves as “very prejudiced or a little prejudiced” then one can straightforwardly estimate $p \simeq x/n$. So far so good, as long as there is no bias induced by the form of the question asked nor in the selection of the sample…

However, a little bit of mathematics involving the binomial distribution yields an answer for the uncertainty in this estimate of p in terms of the sampling error:

$\sigma = \sqrt{\frac{p(1-p)}{n}}$

For the sample size given, and a value $p \simeq 0.35$ this amounts to a standard error of about 1%. About 95% of samples drawn from a population in which the true fraction is $p$ will yield an estimate within $p \pm 2\sigma$, i.e. within about 2% of the true figure. This is consistent with the “noise” on the unsmoothed curve and it shows that the year-on-year variation shown in the unsmoothed graph is largely attributable to sampling uncertainty; note that the sample sizes vary from year to year too. The results for 2012 and 2013 are 26% and 30% exactly, which differ by 4% and are therefore explicable solely in terms of sampling fluctuations.

I don’t know whether racial prejudice is on the rise in the UK or not, nor even how accurately such attitudes are measured by such surveys in the first place, but there’s no evidence in these data of any significant change over the past year. Given the behaviour of the smoothed data however, there is evidence that in the very long term the fraction of population identifying themselves as prejudiced is actually falling.

Newspapers however rarely let proper statistics get in the way of a good story, even to the extent of removing evidence that contradicts their own prejudice.