## Misplaced Confidence

From time to time I’ve been posting items about the improper use of statistics. My colleague Ant Whitworth just showed me an astronomical example drawn from his own field of star formation and found in a recent paper by Matthew Bate from the University of Exeter.

The paper is a lengthy and complicated one involving the use of extensive numerical calculations to figure out the effect of radiative feedback on the process of star formation. The theoretical side of this subject is fiendishly difficult, to the extent that it is hard to make any progress with pencil-and-paper techniques, and Matthew is one of the leading experts in the use of computational methods to tackle problems in this area.

One of the main issues Matthew was investigating was whether radiative feedback had any effect on the initial mass function of the stars in his calculations. The key results are shown in the picture below (Figure 8 from the paper) in terms of cumulative distributions of the star masses in various different situations.

The question that arises from such data is whether these empirical distributions differ significantly from each other or whether they are consistent with the variations that would naturally arise in different samples drawn from the same distribution. The most interesting ones are the two distributions to the right of the plot that appear to lie almost on top of each other.

Because the samples are very small (only 13 and 15 objects respectively) one can’t reasonably test for goodness-of-fit using the standard chi-squared test because of discreteness effects and because not much is known about the error distribution. To do the statistics, therefore, Matthew uses a popular non-parametric method called the Kolmogorov-Smirnov test which uses the maximum deviation *D* between the two cumulative distributions as a figure of merit to decide whether they match. If *D* is very large then such a deviation is unlikely to have arisen between two samples drawn from the same distribution. If it is smaller then it might have. As for what happens if it is very small then you’ll have to wait a bit.

This is an example of a standard (frequentist) hypothesis test in which the null hypothesis is that the empirical distributions are calculated from independent samples drawn from the same underlying form. The probability of a value of *D* arising at least as large as the measured one can be calculated assuming the null is true, and this is then the significance level of the test. If there’s only a 1% chance of it being as large as the measured value then the significance level is 1%.
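To make the recipe concrete, here is a minimal sketch of the two-sample test using only the Python standard library. It computes *D* as the largest gap between the two empirical cumulative distributions and estimates the tail probability by brute-force simulation under the null rather than from the usual analytic formula. The function names and the stand-in data are my own assumptions for illustration; none of the numbers come from the paper.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    # Both CDFs are step functions, so the largest gap occurs at a data point.
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def null_distribution(n, m, trials=5000, seed=0):
    """Simulated values of D for independent samples from ONE distribution.

    D is distribution-free for continuous data, so uniform draws suffice.
    """
    rng = random.Random(seed)
    return [ks_statistic([rng.random() for _ in range(n)],
                         [rng.random() for _ in range(m)])
            for _ in range(trials)]


def tail_probability(d_obs, null_ds):
    """Fraction of null simulations with D at least as large as d_obs."""
    # Small tolerance because D takes discrete values stored as floats.
    return sum(d >= d_obs - 1e-12 for d in null_ds) / len(null_ds)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical stand-in data, NOT the stellar masses from the paper.
    sample_a = [rng.lognormvariate(0, 1) for _ in range(13)]
    sample_b = [rng.lognormvariate(0, 1) for _ in range(15)]
    d_obs = ks_statistic(sample_a, sample_b)
    p = tail_probability(d_obs, null_distribution(13, 15))
    print(f"D = {d_obs:.3f}, P(D >= D_obs | null) ~ {p:.3f}")
```

Note what the printed number is: the probability of a deviation at least this large *given* the null, not the probability of the null given the data.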

So far, so good.

But then, in describing the results of the K-S test the paper states

> A Kolmogorov-Smirnov (K-S) test on the …. distributions gives a 99.97% probability that the two IMFs were drawn from the same underlying distribution (i.e. they are statistically indistinguishable).

Agh! No it doesn’t! What it gives is a probability of 99.97% that the chance deviation between the two distributions would be *larger* than that actually measured. In other words, the two distributions are surprisingly close to each other. But the significance level merely specifies the probability that you would reject the null hypothesis if it were correct. It says *nothing at all* about the probability that the null hypothesis is correct. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution of *D* based on it, and hence determine the statistical *power* of the test. Without specifying an alternative hypothesis all you can say is that you have failed to reject the null hypothesis.
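The point about power can be illustrated with a rough simulation. The sketch below assumes, purely for illustration, an alternative in which the second sample comes from a unit Gaussian shifted by some amount relative to the first; the choice of Gaussians, the shift of 0.5 and the name `estimate_power` are all my assumptions and have nothing to do with the physics in the paper. It estimates how often a K-S test at a given significance level would reject the null when that alternative is actually true.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def simulate_d(n, m, shift, trials, rng):
    """D for samples from N(0, 1) and N(shift, 1); shift = 0 is the null."""
    return [ks_statistic([rng.gauss(0.0, 1.0) for _ in range(n)],
                         [rng.gauss(shift, 1.0) for _ in range(m)])
            for _ in range(trials)]


def estimate_power(n, m, shift, alpha=0.05, trials=5000, seed=1):
    """Estimated probability of rejecting the null when the shift is real."""
    rng = random.Random(seed)
    null_ds = sorted(simulate_d(n, m, 0.0, trials, rng))
    # Critical value: reject when D exceeds the (1 - alpha) quantile under
    # the null, so the false-rejection rate is at most about alpha.
    critical = null_ds[int((1 - alpha) * len(null_ds))]
    alt_ds = simulate_d(n, m, shift, trials, rng)
    return sum(d > critical for d in alt_ds) / len(alt_ds)


if __name__ == "__main__":
    # Hypothetical alternative: the populations differ by half a sigma.
    power = estimate_power(13, 15, shift=0.5)
    print(f"Estimated power for n=13, m=15, shift=0.5: {power:.2f}")
```

If the estimated power is low for an alternative you care about, then failing to reject the null tells you very little, because the test would probably have failed to reject even if the distributions really did differ.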

Or better still, if you have an alternative hypothesis you can forget about power and significance and instead work out the relative probability of the two hypotheses using a proper Bayesian approach.

You might also reasonably ask why *D* might be so very small. If you find an improbably low value of chi-squared then it usually means either that somebody has cheated or that the data are not independent (independence being an assumption on which the test is based). Qualitatively the same thing happens with a K-S test.

In fact these two distributions can’t be thought of as independent samples anyway as they are computed from the same initial conditions but with various knobs turned on or off to include different physics. They are not “samples” drawn from the same population but slightly different versions of the same sample. The probability emerging from the K-S machinery is therefore meaningless anyway in this context.
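A toy simulation shows why shared origins make *D* small. The sketch below is only a caricature under assumptions of my own: the "population" is lognormal, and the two "versions" are built by taking one base sample, keeping 13 and 15 of its members, and multiplying each value by a small random factor (the `jitter` parameter, also hypothetical). It has nothing to do with the actual simulations in the paper, but it compares the typical *D* for such near-copies with the typical *D* for genuinely independent samples.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def independent_d(rng, n=13, m=15):
    """D for two genuinely independent samples from one population."""
    return ks_statistic([rng.lognormvariate(0, 1) for _ in range(n)],
                        [rng.lognormvariate(0, 1) for _ in range(m)])


def shared_origin_d(rng, n=13, m=15, jitter=0.05):
    """D for two slightly perturbed versions of ONE underlying sample."""
    base = [rng.lognormvariate(0, 1) for _ in range(max(n, m))]
    # Each version keeps a subset of the base objects and nudges their values.
    version_a = [x * rng.lognormvariate(0, jitter) for x in rng.sample(base, n)]
    version_b = [x * rng.lognormvariate(0, jitter) for x in rng.sample(base, m)]
    return ks_statistic(version_a, version_b)


def compare(trials=2000, seed=0):
    """Mean D for independent samples and for shared-origin versions."""
    rng = random.Random(seed)
    indep = [independent_d(rng) for _ in range(trials)]
    shared = [shared_origin_d(rng) for _ in range(trials)]
    return sum(indep) / trials, sum(shared) / trials


if __name__ == "__main__":
    mean_indep, mean_shared = compare()
    print(f"mean D, independent samples: {mean_indep:.3f}")
    print(f"mean D, shared origin:       {mean_shared:.3f}")
```

Feeding a shared-origin *D* into the null distribution for independent samples will routinely return tail probabilities close to 100%, which reflects the lack of independence rather than any evidence about the underlying population.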

So a correct statement of the result would be that the deviation between the two computed distributions is much smaller than one would expect to arise from two independent samples of the same size drawn from the same population.

That’s a much less dramatic statement than is contained in the paper, but has the advantage of not being bollocks.

December 10, 2008 at 5:05 pm

If I was paid a dollar (even the weak Australian variety) every time I had this argument with someone . . . then my bank balance would differ from its actual value at considerably more than the 99.97% confidence level.

December 10, 2008 at 5:46 pm

Spot on Peter. Working scientists developed probability theory and got it right, and today it is once again a case of DIY; don’t trust professional “statisticians”. Or, above all, philosophers.

Anton

September 16, 2016 at 12:41 pm

Hi Peter,

I know I’m dragging this up from a while ago but I’m working with a similar problem and would be very interested to know more about Bayesian approaches to this type of problem. I’ve been looking at John Kruschke’s BEST (Bayesian Estimation Supersedes the t Test) methodology, but would be interested in hearing your preferred approach.

Many thanks,

Rich