## Evidence, Absence, and the Type II Monster

I was just having a quick lunchtime shufty at Dave Steele's blog. His latest post is inspired by the quotation "Absence of Evidence isn't Evidence of Absence", which can apparently be traced back to Carl Sagan. I never knew that. Anyway, I was muchly enjoying the piece when I suddenly stumbled into this paragraph, which I quote without permission because I'm too shy to ask:

> In a scientific experiment, the null hypothesis refers to a general or default position that there is no relationship between two measured phenomena. For example a well thought out point in an article by James Delingpole. Rejecting or disproving the null hypothesis is the primary task in any scientific research. If an experiment rejects the null hypothesis, it concludes that there are grounds greater than chance for believing that there is a relationship between the two (or more) phenomena being observed. Again the null hypothesis itself can never be proven. If participants treated with a medication are compared with untreated participants and there is found no statistically significant difference between the two groups, it does not prove that there really is no difference. Or if we say there is a monster in a Loch but cannot find it. The experiment could only be said to show that the results were not sufficient to reject the null hypothesis.

I’m going to pick up the trusty sword of Bayesian probability and have yet another go at the dragon of frequentism, but before doing so I’ll just correct the first sentence. The “null hypothesis” in a frequentist hypothesis test is not necessarily of the form described here: it could be of virtually any form, possibly quite different from the stated one of no correlation between two variables. All that matters is that (a) it has to be well-defined in terms of a model and (b) you have to be content to accept it as true unless and until you find evidence to the contrary. It’s true to say that there’s nowt as well-specified as nowt so nulls are often of the form “there is no correlation” or something like that, but the point is that they don’t have to be.

I note that the Wikipedia page on "null hypothesis" uses the same wording as the first sentence of the quoted paragraph, but this is not what you'll find in most statistics textbooks. In their compendious three-volume work *The Advanced Theory of Statistics*, Kendall & Stuart even go so far as to say that the word "null" is misleading precisely because the hypothesis under test might be quite complicated, e.g. of composite nature.

Anyway, whatever the null hypothesis happens to be, the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.
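As a minimal sketch of that recipe (with an invented setup: a null hypothesis under which the measurement is drawn from a standard Gaussian, and a made-up observed value):

```python
import math

# Hypothetical setup: the null hypothesis says our measurement x is drawn
# from a standard Gaussian, N(0, 1). We reject the null if x lands so far
# out in the upper tail that only 1% of measurements would be that big.

def standard_normal_tail(x):
    """P(X > x) for X ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

significance = 0.01      # chosen level of significance
x_observed = 2.8         # the actual measurement (made up for illustration)

p_value = standard_normal_tail(x_observed)
reject_null = p_value < significance

print(f"p-value = {p_value:.4f}, reject null: {reject_null}")
```

Here the p-value is about 0.0026, below the 1% level, so the null would be rejected; a measurement of, say, 1.5 would leave it standing.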

But the significance level merely specifies the probability that you would reject the null hypothesis if it were correct. This is what you would call a Type I error. It says nothing at all about the probability that the null hypothesis is actually correct. To make that sort of statement you would need to specify an alternative hypothesis, calculate the distribution of measurements under it, and hence determine the statistical *power* of the test, i.e. the probability that you would actually reject the null hypothesis when it is false. To fail to reject the null hypothesis when it's actually incorrect is to make a Type II error.
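A toy Monte Carlo illustration of these two error rates (the null and alternative distributions here are assumptions chosen purely for the example):

```python
import random

# Hypothetical setup: the null says measurements ~ N(0, 1); the alternative
# (which must be specified before "power" means anything at all) says
# N(1.5, 1). We reject the null whenever a measurement exceeds the upper
# 1% critical value of the null distribution.

random.seed(42)
critical_value = 2.326       # upper 1% point of N(0, 1)
n_trials = 200_000

# Type I error rate: how often we reject when the null is actually true.
type_1 = sum(random.gauss(0, 1) > critical_value
             for _ in range(n_trials)) / n_trials

# Power: how often we reject when the alternative is actually true.
power = sum(random.gauss(1.5, 1) > critical_value
            for _ in range(n_trials)) / n_trials

# Type II error rate: failing to reject the null when it is false.
type_2 = 1 - power

print(f"Type I ~ {type_1:.3f}, power ~ {power:.3f}, Type II ~ {type_2:.3f}")
```

With these made-up numbers the Type I rate comes out near the chosen 1%, but the power is only about 20%: the test would miss this particular alternative roughly four times out of five.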

If all this stuff about significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. So is the notion, which stems from this frequentist formulation, that all a scientist can ever hope to do is refute their null hypothesis. You’ll find this view echoed in the philosophical approach of Karl Popper and it has heavily influenced the way many scientists see the scientific method, unfortunately.

The asymmetrical way that the null and alternative hypotheses are treated in the frequentist framework is not helpful, in my opinion. Far better to adopt a Bayesian framework in which probability represents the extent to which measurements or other data support a given theory. New statistical evidence can make two hypotheses either more or less probable relative to each other. The focus is not just on rejecting a specific model, but on comparing two or more models in a mutually consistent way. The key notion is not falsifiability, but testability. Data that fail to reject a hypothesis *can* properly be interpreted as supporting it, i.e. by making it *more probable*, but such reasoning can only be done consistently within the Bayesian framework.

What remains true, however, is that the null hypothesis (or indeed any other hypothesis) can never be proven with certainty; that is so whenever the reasoning is probabilistic. Sometimes, though, the weight of supporting evidence is so strong that inductive logic compels us to regard our theory or model or hypothesis as virtually certain. That applies whether the evidence consists of actual measurements or of non-detections; to a Bayesian, absence of evidence can be (and indeed often is) evidence of absence. The sun rises every morning and sets every evening; it is silly to argue that this provides us with no grounds for believing that it will do so tomorrow. Likewise, the sonar surveys and other investigations in Loch Ness provide us with evidence that favours the hypothesis that there isn't a Monster over virtually every hypothetical Monster that has been suggested.
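A toy Bayesian sketch of that last point, with numbers invented purely for illustration, shows how repeated non-detections steadily drive down the probability of the Monster:

```python
# Made-up numbers: suppose each sonar survey of the Loch would detect a
# Monster (if one existed) with probability 0.3, and that we start from a
# generous prior of 0.5 that the Monster is there. Every survey comes
# back empty.

prior = 0.5
p_detect_given_monster = 0.3
n_surveys = 10               # ten surveys, all non-detections

posterior = prior
for _ in range(n_surveys):
    # Likelihood of a non-detection under each hypothesis:
    like_monster = 1 - p_detect_given_monster   # Monster there, but missed
    like_no_monster = 1.0                       # nothing there, nothing seen
    # Bayes' theorem: update P(Monster) on one more non-detection.
    numerator = like_monster * posterior
    posterior = numerator / (numerator + like_no_monster * (1 - posterior))

print(f"P(Monster | {n_surveys} non-detections) = {posterior:.4f}")
```

Each empty survey multiplies the odds on the Monster by 0.7, so after ten of them the posterior probability has fallen from 0.5 to under 3% — the non-detections are themselves evidence, exactly as the Bayesian account says.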

It is perfectly sensible to use this reasoning to infer that there is no Loch Ness Monster. Probably.

June 25, 2013 at 9:59 am

Well, you can also be a symmetric frequentist à la Neyman–Pearson. It depends what your statistics are supposed to achieve for you. For example, Bayesian statistics aren't particularly suited to discovering unsuspected new phenomena for which you have no meaningful prior.

June 25, 2013 at 10:40 am

Hear, hear!

Of course a hypothesis, to be testable, has to make specific predictions about variables that you can perform experiments to measure. I can test y=kx but if the alternative is that y isn’t equal to kx then I am stuck (as is Bayesian analysis) until someone thinks of an alternative relation. We had only Newton’s nonrelativistic mechanics prior to Einstein, for instance.

If my datapoints, when plotted, don’t look at all linear then I will reject y=kx, and probably so will the null hypothesis formalism. So can it do something that Bayesianism can’t do, ie reject in isolation? Not really. What actually happened is that the plot of my data inspired me on the spot to subconsciously think of alternatives and I am subconsciously comparing y=kx with those alternatives (eg y=kx+bx^n, with several ‘floating’ parameters). In that case I should dredge those empirical alternatives out of my subconscious and do a formal Bayesian test of them vs y=kx, in which case Bayes will replicate intelligent intuition. In contrast there are situations in which the null hypothesis formalism will crash spectacularly – and you have no way of knowing if your situation is one of them. Bayesians have had fun setting up such counterexamples.

Popper’s notion of falsifiability is asymmetric between True and False (ie, it favours False) whereas the Bayesian notion of testability is symmetrical: the data drive the probability of your hypothesis/theory up or down. Yet it is true that the history of science is littered with superseded theories, such as nonrelativistic mechanics; you have to expect your pet theory to have a finite shelf life. The reason for the asymmetry is that theories to throw into the ring are continually invented, so we have ever more of them: even if nobody takes phlogiston seriously today it does not have to be reinvented, but can stay in the ring at an incredibly low probability based on the data. There is an “arrow of time” due to human ingenuity, and that is the source of the asymmetry that spurred Popper. Too bad he got it wrong.


June 26, 2013 at 1:43 am

Frequentist inference is not synonymous with hypothesis testing.

June 26, 2013 at 9:09 am

Agreed. They get parameter estimation wrong too.

June 26, 2013 at 1:23 pm

Well, not what I meant.

June 26, 2013 at 9:48 am

[…] was thinking about this a little yesterday because I came across a post called evidence, absence and the Type II monster on a blog called In The Dark. I found this quite an interesting post, but should acknowledge that […]
