Archive for Bayesian statistics

Cosmological Dipole Controversy

Posted in Astrohype, Bad Statistics, The Universe and Stuff on October 11, 2022 by telescoper

I’ve just finished reading an interesting paper by Secrest et al. which has attracted some attention recently. It’s published in the Astrophysical Journal Letters but is also available on the arXiv here. I blogged about earlier work by some of these authors here.

The abstract of the current paper is:

We present the first joint analysis of catalogs of radio galaxies and quasars to determine if their sky distribution is consistent with the standard ΛCDM model of cosmology. This model is based on the cosmological principle, which asserts that the universe is statistically isotropic and homogeneous on large scales, so the observed dipole anisotropy in the cosmic microwave background (CMB) must be attributed to our local peculiar motion. We test the null hypothesis that there is a dipole anisotropy in the sky distribution of radio galaxies and quasars consistent with the motion inferred from the CMB, as is expected for cosmologically distant sources. Our two samples, constructed respectively from the NRAO VLA Sky Survey and the Wide-field Infrared Survey Explorer, are systematically independent and have no shared objects. Using a completely general statistic that accounts for correlation between the found dipole amplitude and its directional offset from the CMB dipole, the null hypothesis is independently rejected by the radio galaxy and quasar samples with p-value of 8.9×10^−3 and 1.2×10^−5, respectively, corresponding to 2.6σ and 4.4σ significance. The joint significance, using sample size-weighted Z-scores, is 5.1σ. We show that the radio galaxy and quasar dipoles are consistent with each other and find no evidence for any frequency dependence of the amplitude. The consistency of the two dipoles improves if we boost to the CMB frame assuming its dipole to be fully kinematic, suggesting that cosmologically distant radio galaxies and quasars may have an intrinsic anisotropy in this frame.
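As a quick check on the numbers quoted in the abstract, here is a minimal sketch of how a p-value maps onto a Gaussian "number of sigma" and how independent Z-scores can be combined. It assumes the two-sided convention (under which the quoted 2.6σ and 4.4σ are recovered) and uses Stouffer's weighted method for the combination. The function names and the equal weights are illustrative assumptions; the sample-size weights actually used in the paper are not reproduced here, so the joint figure printed below is not expected to match the quoted 5.1σ.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_sigma(p_value: float) -> float:
    """Two-sided Gaussian-equivalent significance for a p-value."""
    # Assumption: p is the probability of lying more than z sigma
    # from the mean in either direction.
    return -STD_NORMAL.inv_cdf(p_value / 2.0)


def weighted_stouffer(z_scores, weights):
    """Combine independent Z-scores with Stouffer's weighted method."""
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / sqrt(sum(w * w for w in weights))


z_radio = p_to_sigma(8.9e-3)   # radio galaxy sample
z_quasar = p_to_sigma(1.2e-5)  # quasar sample
print(f"radio: {z_radio:.1f} sigma, quasars: {z_quasar:.1f} sigma")

# Hypothetical equal weights, NOT the paper's sample-size weights.
z_joint = weighted_stouffer([z_radio, z_quasar], weights=[1.0, 1.0])
print(f"joint (equal weights): {z_joint:.2f} sigma")
```

Whatever weights are chosen, the combined value cannot exceed the quadrature sum of the two individual Z-scores.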

I can summarize the paper in the form of this well-worn meme:

My main reaction to the paper – apart from finding it interesting – is that if I were doing this I wouldn’t take the frequentist approach used by the authors as this doesn’t address the real question of whether the data prefer some alternative model over the standard cosmological model.

As was the case with a Nature piece I blogged about some time ago, this article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a particular null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05. This is usually called a ‘2σ’ result because for Gaussian statistics a variable has a probability of 95% of lying within 2σ of the mean value.
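The correlation example can be sketched numerically. Rather than relying on the known sampling distribution of r, this version builds the null distribution by shuffling one of the variates, which destroys any association between the two. The data, the seeds and the function names are all hypothetical, chosen only to make the sketch self-contained.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Fraction of shuffled data sets whose r is at least the observed r."""
    rng = random.Random(seed)
    r_observed = pearson_r(xs, ys)
    shuffled = list(ys)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # shuffling breaks any x-y association
        if pearson_r(xs, shuffled) >= r_observed:
            exceed += 1
    return r_observed, exceed / n_shuffles


# Hypothetical data: 30 points with a weak built-in correlation.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(30)]
y = [0.4 * xi + rng.gauss(0.0, 1.0) for xi in x]

r, p = permutation_p_value(x, y)
print(f"r = {r:.3f}, one-sided p-value = {p:.4f}")
```

The number that comes out is a statement about how often uncorrelated data would give an r this large, and nothing more.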

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that large under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
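To make the distinction between the two kinds of error concrete, here is a minimal sketch for a one-sided z-test on the mean of n Gaussian measurements. The null hypothesis is that the mean is zero. The alternative (a true mean of 0.5), the sample size and the noise level are hypothetical choices, and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_power(true_mean, sigma, n, alpha):
    """Power of a one-sided z-test of H0: mean = 0 against mean > 0."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)  # reject H0 if z > z_crit
    shift = true_mean * sqrt(n) / sigma       # mean of z under the alternative
    return 1.0 - STD_NORMAL.cdf(z_crit - shift)


alpha = 0.05                # Type I error rate, fixed by the test itself
power = z_test_power(true_mean=0.5, sigma=1.0, n=20, alpha=alpha)
type_ii = 1.0 - power       # probability of missing a real effect

print(f"Type I rate  = {alpha:.3f}")
print(f"power        = {power:.3f}")
print(f"Type II rate = {type_ii:.3f}")
```

Note that the power cannot be computed at all until a specific alternative has been written down, which is exactly the point made above.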

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. In fact I feel so strongly about this that if I had my way I’d ban p-values altogether…

This is not an objection to the value of the p-value chosen, whether that is 0.005 rather than 0.05 or a 5σ standard (which translates to about 0.000001!). While it is true that a stricter threshold would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question which is quite different from that which a scientist would actually want to ask, which is what the data have to say about the probability of a specific hypothesis being true or sometimes whether the data imply one hypothesis more strongly than another. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.

Not that it’s always easy to implement the (better) Bayesian approach. It’s especially difficult when the data are affected by complicated noise statistics and selection effects, and/or when it is difficult to formulate a hypothesis test rigorously because one does not have a clear alternative hypothesis in mind. That’s probably why many scientists prefer to accept the limitations of the frequentist approach rather than tackle the admittedly very challenging problems of going Bayesian.

But having indulged in that methodological rant, I certainly have an open mind about departures from isotropy on large scales. The correct scientific approach is now to reanalyze the data used in this paper to see if the result presented stands up, which it very well might.

The Neyman-Scott ‘Paradox’

Posted in Bad Statistics, Cute Problems on November 25, 2016 by telescoper

I just came across this interesting little problem recently and thought I’d share it here. It’s usually called the ‘Neyman-Scott’ paradox. Before going on it’s worth mentioning that Elizabeth Scott (the second half of Neyman-Scott) was an astronomer by background. Her co-author was Jerzy Neyman. As has been the case for many astronomers, she contributed greatly to the development of the field of statistics. Anyway, I think this example provides another good illustration of the superiority of Bayesian methods for estimating parameters, but I’ll let you make your own mind up about what’s going on.

The problem is fairly technical so I’ve done a quick version in LaTeX that you can download here, but I’ve also copied it into this post so you can read it below:

 

[Images of the typeset problem: neyman-scott1, neyman-scott2]
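For readers who cannot see the images, here is a minimal simulation of the problem in its usual textbook form, which I am assuming is what the typeset version sets out: n pairs of measurements, each pair drawn from a Gaussian with its own unknown mean but a common variance σ². Maximising the likelihood jointly over every mean and σ² gives an estimate of σ² that converges to half the true value, however many pairs are collected. Integrating the means out with flat priors first gives an estimate that converges to the right answer. The function names, the range of the nuisance means and the seed are hypothetical.

```python
import random


def simulate_pairs(n_pairs, sigma, seed=0):
    """Draw n_pairs of measurements, each pair sharing its own mean."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        mu = rng.uniform(-10.0, 10.0)  # a different nuisance mean per pair
        pairs.append((rng.gauss(mu, sigma), rng.gauss(mu, sigma)))
    return pairs


def variance_estimates(pairs):
    """Return (joint maximum-likelihood, marginalised) estimates of sigma^2."""
    n = len(pairs)
    sum_sq_diff = sum((a - b) ** 2 for a, b in pairs)
    joint_mle = sum_sq_diff / (4.0 * n)  # maximise over all means and sigma
    marginal = sum_sq_diff / (2.0 * n)   # integrate the means out first
    return joint_mle, marginal


true_sigma = 2.0
joint_mle, marginal = variance_estimates(simulate_pairs(20000, true_sigma))
print(f"true variance       = {true_sigma ** 2:.2f}")
print(f"joint MLE           = {joint_mle:.2f}")
print(f"after marginalising = {marginal:.2f}")
```

The trouble comes from the number of nuisance parameters growing in step with the number of data points, so more data never helps the joint maximum.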

I look forward to receiving Frequentist Flak or Bayesian Benevolence through the comments box below!

Frequentism: the art of probably answering the wrong question

Posted in Bad Statistics on September 15, 2014 by telescoper

Popped into the office for a spot of lunch in between induction events and discovered that Jon Butterworth has posted an item on his Grauniad blog about how particle physicists use statistics, and the ‘5σ rule’ that is usually employed as a criterion for the detection of, e.g. a new particle. I couldn’t resist bashing out a quick reply, because I believe that actually the fundamental issue is not whether you choose 3σ or 5σ or 27σ but what these statistics mean or don’t mean.

As was the case with a Nature piece I blogged about some time ago, Jon’s article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a particular null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05. This is usually called a ‘2σ’ result because for Gaussian statistics a variable has a probability of 95% of lying within 2σ of the mean value.

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that large under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Jon’s piece demonstrates that he does, so this is not meant as a personal criticism, but it is a pervasive problem that results quoted in such a way are intrinsically confusing.

The Nature story mentioned above argues that in fact results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences where samples are typically rather small.
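There is also a purely arithmetical reason why a 5% threshold need not mean that 5% of claimed results are wrong. The sketch below is not the calculation behind the Nature figure; it is a simplified bookkeeping exercise for results that pass p < 0.05, and every input (the fraction of tested hypotheses that are real effects, and the power of the tests) is a hypothetical number chosen for illustration.

```python
def false_discovery_fraction(prior_true, power, alpha):
    """Fraction of 'significant' results for which the null is actually true."""
    false_pos = alpha * (1.0 - prior_true)  # true nulls rejected anyway
    true_pos = power * prior_true           # real effects correctly detected
    return false_pos / (false_pos + true_pos)


# Hypothetical inputs, chosen only for illustration.
for prior_true in (0.5, 0.2, 0.1):
    frac = false_discovery_fraction(prior_true, power=0.6, alpha=0.05)
    print(f"prior fraction of real effects {prior_true:.1f}: "
          f"{100 * frac:.0f}% of p < 0.05 results are false")
```

The answer depends entirely on quantities that the p-value itself knows nothing about.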

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05 or, in the case of particle physics, a 5σ standard (which translates to about 0.000001!). While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question which is quite different from that which a scientist would actually want to ask, which is what the data have to say about the probability of a specific hypothesis being true or sometimes whether the data imply one hypothesis more strongly than another. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.

I feel so strongly about this that if I had my way I’d ban p-values altogether…

Not that it’s always easy to implement a Bayesian approach. It’s especially difficult when the data are affected by complicated noise statistics and selection effects, and/or when it is difficult to formulate a hypothesis test rigorously because one does not have a clear alternative hypothesis in mind. Experimentalists (including experimental particle physicists) seem to prefer to accept the limitations of the frequentist approach rather than tackle the admittedly very challenging problems of going Bayesian. In fact in my experience it seems that those scientists who approach data from a theoretical perspective are almost exclusively Bayesian, while those of an experimental or observational bent stick to their frequentist guns.

Coincidentally a paper on the arXiv not long ago discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.

This paradox isn’t a paradox at all; the different approaches give different answers because they ask different questions. Both could be right, but I firmly believe that one of them answers the wrong question.
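A minimal numerical sketch shows the two questions coming apart. It assumes the simplest textbook version of the setup: a sample mean of n Gaussian measurements with known σ, a point null θ = 0, and an alternative in which θ has a Gaussian prior of width τ centred on zero. The prior width, the equal prior odds and the function name are my assumptions, not anything taken from the paper. The measured mean is held at a fixed 3 standard errors from zero while n grows.

```python
from math import exp, sqrt
from statistics import NormalDist


def posterior_null(z, n, tau=1.0, sigma=1.0, prior_null=0.5):
    """P(H0 | data) for H0: theta = 0 versus H1: theta ~ N(0, tau^2)."""
    r = n * tau**2 / sigma**2
    # Bayes factor in favour of the null when the sample mean
    # lies z standard errors away from zero.
    bf01 = sqrt(1.0 + r) * exp(-0.5 * z**2 * r / (1.0 + r))
    odds = bf01 * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


z = 3.0
p_value = 2.0 * (1.0 - NormalDist().cdf(z))  # two-sided, the same for every n
for n in (10, 1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,d}  p-value = {p_value:.4f}  "
          f"P(H0|data) = {posterior_null(z, n):.3f}")
```

The p-value never changes, yet the posterior probability of the null climbs towards one as n increases, because a fixed 3σ offset corresponds to an ever smaller effect that the broad alternative predicts ever more poorly.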

The Curse of P-values

Posted in Bad Statistics on November 12, 2013 by telescoper

Yesterday evening I noticed a news item in Nature that argues that inappropriate statistical methodology may be undermining the reporting of scientific results. The article focuses on lack of “reproducibility” of results.

The article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under the null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05.

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean.

The Nature story mentioned above argues that in fact results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences where samples are typically rather small.

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05. While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question which is quite different from that which a scientist would want to ask, which is what the data have to say about a given hypothesis. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis. If I had my way I’d ban p-values altogether.

Not that it’s always easy to implement a Bayesian approach. Coincidentally a recent paper on the arXiv discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.

Rather than tell you what I think about this paradox, I thought I’d invite discussion through the comments box…

Oh what a tangled web we weave…

Posted in Bad Statistics on March 11, 2013 by telescoper

..when first we practice frequentist statistics!

I couldn’t resist a quick post directing you to a short paper on the arXiv with the following abstract:

I use archival data to measure the mass of the central black hole in NGC 4526, M = (4.70 +- 0.14) X 10^8 Msun. This 3% error bar is the most precise for an extra-galactic black hole and is close to the precision obtained for Sgr A* in the Milky Way. The factor 7 improvement over the previous measurement is entirely due to correction of a mathematical error, an error that I suggest may be common among astronomers.

The “mathematical error” quoted in the abstract involves using chi-squared-per-degree-of-freedom instead of chi-squared instead of the full likelihood function instead of the proper, Bayesian, posterior probability. The best way to avoid such confusion is to do things properly in the first place. That way you can also fold in errors on the distance to the black hole, etc etc…
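To see how the first substitution in that chain can inflate an error bar, here is a toy sketch. It does not reproduce the analysis in the paper. It assumes the simplest possible case, estimating a single mean from n measurements with known Gaussian errors σ, and it assumes the error bar is read off from where the fit statistic rises by one above its minimum. Both function names and the choice n = 50 are hypothetical.

```python
from math import sqrt


def error_bar_delta_chi2(sigma, n):
    """Half-width in mu at which chi^2 rises by 1 above its minimum."""
    # chi^2(mu) - chi^2_min = n * (mu - sample_mean)^2 / sigma^2
    return sigma / sqrt(n)


def error_bar_delta_reduced_chi2(sigma, n):
    """Half-width at which chi^2 / dof rises by 1 (the mistaken criterion)."""
    dof = n - 1  # one fitted parameter
    return sigma * sqrt(dof / n)


sigma, n = 1.0, 50
correct = error_bar_delta_chi2(sigma, n)
inflated = error_bar_delta_reduced_chi2(sigma, n)
ratio = inflated / correct
print(f"delta chi^2 = 1       : +/- {correct:.3f}")
print(f"delta chi^2 / dof = 1 : +/- {inflated:.3f}")
print(f"ratio                 : {ratio:.1f}")
```

In this toy case the error bar is overestimated by a factor of the square root of the number of degrees of freedom, so the mistake gets worse as the data get better.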