## Evidence, Absence, and the Type II Monster

Posted in Bad Statistics on June 24, 2013 by telescoper

I was just having a quick lunchtime shufty at Dave Steele’s blog. His latest post is inspired by the quotation “Absence of Evidence isn’t Evidence of Absence”, which can apparently be traced back to Carl Sagan. I never knew that. Anyway, I was muchly enjoying the piece when I suddenly stumbled into this paragraph, which I quote without permission because I’m too shy to ask:

> In a scientific experiment, the null hypothesis refers to a general or default position that there is no relationship between two measured phenomena. For example a well thought out point in an article by James Delingpole. Rejecting or disproving the null hypothesis is the primary task in any scientific research. If an experiment rejects the null hypothesis, it concludes that there are grounds greater than chance for believing that there is a relationship between the two (or more) phenomena being observed. Again the null hypothesis itself can never be proven. If participants treated with a medication are compared with untreated participants and there is found no statistically significant difference between the two groups, it does not prove that there really is no difference. Or if we say there is a monster in a Loch but cannot find it. The experiment could only be said to show that the results were not sufficient to reject the null hypothesis.

I’m going to pick up the trusty sword of Bayesian probability and have yet another go at the dragon of frequentism, but before doing so I’ll just correct the first sentence. The “null hypothesis” in a frequentist hypothesis test is not necessarily of the form described here: it could be of virtually any form, possibly quite different from the stated one of no correlation between two variables. All that matters is that (a) it has to be well-defined in terms of a model and (b) you have to be content to accept it as true unless and until you find evidence to the contrary. It’s true to say that there’s nowt as well-specified as nowt so nulls are often of the form “there is no correlation” or something like that, but the point is that they don’t have to be.

I note that the Wikipedia page on “null hypothesis” uses the same wording as in the first sentence of the quoted paragraph, but this is not what you’ll find in most statistics textbooks. In their compendious three-volume work The Advanced Theory of Statistics Kendall & Stuart even go so far as to say that the word “null” is misleading precisely because the hypothesis under test might be quite complicated, e.g. of composite nature.

Anyway, whatever the null hypothesis happens to be, the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the significance level merely specifies the probability that you would reject the null hypothesis if it were correct. This is what you would call a Type I error. It says nothing at all about the probability that the null hypothesis is actually correct. To make that sort of statement you would need to specify an alternative hypothesis, calculate the distribution of measurements based on it, and hence determine the statistical power of the test, i.e. the probability that you would reject the null hypothesis when it is actually false. To fail to reject the null hypothesis when it is actually false is to make a Type II error.
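The relationship between significance and power can be made concrete with a short calculation. This is a minimal sketch of my own, not anything from the quoted material: it assumes a one-sided test on the mean of Gaussian measurements with known scatter, and the sample size and alternative mean are invented for illustration.

```python
from statistics import NormalDist

# One-sided test on the mean of n Gaussian measurements with known
# scatter sigma. All the numbers here are invented for illustration.
n = 25
sigma = 1.0
alpha = 0.01                 # significance level: P(reject null | null true)
mu_null, mu_alt = 0.0, 0.5   # null and (hypothetical) alternative means

sem = sigma / n ** 0.5       # standard error of the sample mean
crit = mu_null + NormalDist().inv_cdf(1 - alpha) * sem  # rejection threshold

# Power = P(reject null | alternative true); its complement is the
# Type II error rate.
power = 1 - NormalDist(mu=mu_alt, sigma=sem).cdf(crit)

print(f"reject the null if the sample mean exceeds {crit:.3f}")
print(f"power against mu = {mu_alt}: {power:.3f}")
```

Note that the significance level is fixed by the null alone, whereas the power cannot even be computed until an alternative is specified, which is precisely the asymmetry complained about above.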

If all this stuff about significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. So is the notion, which stems from this frequentist formulation, that all a scientist can ever hope to do is refute their null hypothesis. You’ll find this view echoed in the philosophical approach of Karl Popper and it has heavily influenced the way many scientists see the scientific method, unfortunately.

The asymmetrical way that the null and alternative hypotheses are treated in the frequentist framework is not helpful, in my opinion. Far better to adopt a Bayesian framework in which probability represents the extent to which measurements or other data support a given theory. New statistical evidence can make two hypotheses either more or less probable relative to each other. The focus is not just on rejecting a specific model, but on comparing two or more models in a mutually consistent way. The key notion is not falsifiability, but testability. Data that fail to reject a hypothesis can properly be interpreted as supporting it, i.e. by making it more probable, but such reasoning can only be done consistently within the Bayesian framework.

What remains true, however, is that the null hypothesis (or indeed any other hypothesis) can never be proven with certainty; that is the case whenever reasoning is probabilistic. Sometimes, though, the weight of supporting evidence is so strong that inductive logic compels us to regard our theory or model or hypothesis as virtually certain. That applies whether the evidence consists of actual measurements or of non-detections; to a Bayesian, absence of evidence can be (and indeed often is) evidence of absence. The sun rises every morning and sets every evening; it is silly to argue that this provides us with no grounds for arguing that it will do so tomorrow. Likewise, the sonar surveys and other investigations in Loch Ness provide us with evidence that favours the hypothesis that there isn’t a Monster over virtually every hypothetical Monster that has been suggested.
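The Monster example can be made quantitative. The sketch below is my own illustration rather than anything from the post: the prior and the per-survey detection probability are invented numbers, but the logic is the Bayesian point being made here, namely that repeated non-detections progressively lower the posterior probability of the Monster.

```python
# Bayesian update for the Loch Ness Monster. The prior and the chance
# that a single survey would detect a real Monster are invented numbers.
def posterior_monster(prior, p_detect, n_null_surveys):
    """P(Monster | n surveys, all null) via Bayes' theorem."""
    like_monster = (1 - p_detect) ** n_null_surveys  # all-null results despite a Monster
    like_no_monster = 1.0                            # all-null results certain if no Monster
    k = prior * like_monster + (1 - prior) * like_no_monster
    return prior * like_monster / k

p = posterior_monster(prior=0.5, p_detect=0.3, n_null_surveys=10)
print(f"P(Monster | 10 null surveys) = {p:.4f}")
```

Even starting from an even-handed prior, ten null surveys leave the Monster hypothesis with a posterior probability of under three per cent: absence of evidence doing honest work as evidence of absence.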

It is perfectly sensible to use this reasoning to infer that there is no Loch Ness Monster. Probably.

## Bayes in the dock (again)

Posted in Bad Statistics on February 28, 2013 by telescoper

This morning on Twitter there appeared a link to a blog post reporting that the Court of Appeal had rejected the use of Bayesian probability in legal cases. I recommend anyone interested in probability to read it, as it gives a fascinating insight into how poorly the concept is understood.

Although this is a new report about a new case, it’s actually not an entirely new conclusion. I blogged about a similar case a couple of years ago, in fact. The earlier story concerned an erroneous argument given during a trial about the significance of a match found between a footprint found at a crime scene and footwear belonging to a suspect. The judge took exception to the fact that the figures being used were not known sufficiently accurately to make a reliable assessment, and thus decided that Bayes’ theorem shouldn’t be used in court unless the data involved in its application were “firm”.

If you read the Guardian article to which I’ve provided a link you will see that there’s a lot of reaction from the legal establishment and statisticians about this, focussing on the forensic use of probabilistic reasoning. This all reminds me of the tragedy of the Sally Clark case and what a disgrace it is that nothing has been done since then to improve the misrepresentation of statistical arguments in trials. Some of my Bayesian colleagues have expressed dismay at the judge’s opinion.

My reaction to this affair is more muted than you would probably expect. First thing to say is that this is really not an issue relating to the Bayesian versus frequentist debate at all. It’s about a straightforward application of Bayes’ theorem which, as its name suggests, is a theorem; actually it’s just a straightforward consequence of the sum and product laws of the calculus of probabilities. No-one, not even the most die-hard frequentist, would argue that Bayes’ theorem is false. What happened in this case is that an “expert” applied Bayes’ theorem to unreliable data and by so doing obtained misleading results. The  issue is not Bayes’ theorem per se, but the application of it to inaccurate data. Garbage in, garbage out. There’s no place for garbage in the courtroom, so in my opinion the judge was quite right to throw this particular argument out.

But while I’m on the subject of using Bayesian logic in the courts, let me add a few wider comments. First, I think that Bayesian reasoning provides a rigorous mathematical foundation for the process of assessing quantitatively the extent to which evidence supports a given theory or interpretation. As such it describes accurately how scientific investigations proceed by updating probabilities in the light of new data. It describes how a criminal investigation works, too.

What Bayesian inference is not good at is achieving closure in the form of a definite verdict. There are two sides to this. One is that the maxim “innocent until proven guilty” cannot be incorporated in Bayesian reasoning. If one assigns a zero prior probability of guilt then no amount of evidence will be able to change this into a non-zero posterior probability; the required burden is infinite. On the other hand, there is the problem that the jury must decide guilt in a criminal trial “beyond reasonable doubt”. But how much doubt is reasonable, exactly? And will a jury understand a probabilistic argument anyway?
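The zero-prior problem is easy to demonstrate numerically. A minimal sketch, with likelihood values invented for illustration:

```python
# Bayes' theorem for a binary guilty/innocent hypothesis.
def posterior_guilt(prior, like_guilty, like_innocent):
    """Posterior probability of guilt given the evidence."""
    k = prior * like_guilty + (1 - prior) * like_innocent
    return prior * like_guilty / k

# Strong evidence: data 999 times more probable if guilty (invented figures).
print(posterior_guilt(0.5, 0.999, 0.001))  # a non-zero prior responds to evidence
print(posterior_guilt(0.0, 0.999, 0.001))  # a zero prior stays at zero forever
```

However large the likelihood ratio, multiplying it into a prior of exactly zero returns zero: a literal reading of “innocent until proven guilty” makes proof of guilt impossible.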

In pure science we never really need to achieve this kind of closure, collapsing the broad range of probability into a simple “true” or “false”, because this is a process of continual investigation. It’s a reasonable inference, for example, based on Supernovae and other observations that the Universe is accelerating. But is it proven that this is so? I’d say “no”,  and don’t think my doubts are at all unreasonable…

So what I’d say is that while statistical arguments are extremely important for investigating crimes – narrowing down the field of suspects, assessing the reliability of evidence, establishing lines of inquiry, and so on – I don’t think they should ever play a central role once the case has been brought to court unless there’s much clearer guidance given to juries and stricter monitoring of so-called “expert” witnesses.

I’m sure various readers will wish to express diverse opinions on this case so, as usual, please feel free to contribute through the box below!

## Bayes in the Dock

Posted in Bad Statistics on October 6, 2011 by telescoper

A few days ago John Peacock sent me a link to an interesting story about the use of Bayes’ theorem in legal proceedings and I’ve been meaning to post about it but haven’t had the time. I get the distinct feeling that John, who is of the frequentist persuasion,  feels a certain amount of delight that the beastly Bayesians have got their comeuppance at last.

The story in question concerns an erroneous argument given during a trial about the significance of a match found between a footprint found at a crime scene and footwear belonging to a suspect.  The judge took exception to the fact that the figures being used were not known sufficiently accurately to make a reliable assessment, and thus decided that Bayes’ theorem shouldn’t be used in court unless the data involved in its application were “firm”.

If you read the Guardian article you will see that there’s a lot of reaction from the legal establishment and statisticians about this, focussing on the forensic use of probabilistic reasoning. This all reminds me of the tragedy of the Sally Clark case and what a disgrace it is that nothing has been done since then to improve the misrepresentation of statistical arguments in trials. Some of my Bayesian colleagues have expressed dismay at the judge’s opinion, which no doubt pleases Professor Peacock no end.

My reaction to this affair is more muted than you would probably expect. First thing to say is that this is really not an issue relating to the Bayesian versus frequentist debate at all. It’s about a straightforward application of Bayes’ theorem which, as its name suggests, is a theorem; actually it’s just a straightforward consequence of the sum and product laws of the calculus of probabilities. No-one, not even the most die-hard frequentist, would argue that Bayes’ theorem is false. What happened in this case is that an “expert” applied Bayes’ theorem to unreliable data and by so doing obtained misleading results. The  issue is not Bayes’ theorem per se, but the application of it to inaccurate data. Garbage in, garbage out. There’s no place for garbage in the courtroom, so in my opinion the judge was quite right to throw this particular argument out.

But while I’m on the subject of using Bayesian logic in the courts, let me add a few wider comments. First, I think that Bayesian reasoning provides a rigorous mathematical foundation for the process of assessing quantitatively the extent to which evidence supports a given theory or interpretation. As such it describes accurately how scientific investigations proceed by updating probabilities in the light of new data. It describes how a criminal investigation works, too.

What Bayesian inference is not good at is achieving closure in the form of a definite verdict. There are two sides to this. One is that the maxim “innocent until proven guilty” cannot be incorporated in Bayesian reasoning. If one assigns a zero prior probability of guilt then no amount of evidence will be able to change this into a non-zero posterior probability; the required burden is infinite. On the other hand, there is the problem that the jury must decide guilt in a criminal trial “beyond reasonable doubt”. But how much doubt is reasonable, exactly? And will a jury understand a probabilistic argument anyway?

In pure science we never really need to achieve this kind of closure, collapsing the broad range of probability into a simple “true” or “false”, because this is a process of continual investigation. It’s a reasonable inference, for example, based on Supernovae and other observations that the Universe is accelerating. But is it proven that this is so? I’d say “no”,  and don’t think my doubts are at all unreasonable…

So what I’d say is that while statistical arguments are extremely important for investigating crimes – narrowing down the field of suspects, assessing the reliability of evidence, establishing lines of inquiry, and so on – I don’t think they should ever play a central role once the case has been brought to court unless there’s much clearer guidance given to juries on how to use them and stricter monitoring of so-called “expert” witnesses.

I’m sure various readers will wish to express diverse opinions on this case so, as usual, please feel free to contribute through the box below!

## A Little Bit of Bayes

Posted in Bad Statistics, The Universe and Stuff on November 21, 2010 by telescoper

I thought I’d start a series of occasional posts about Bayesian probability. This is something I’ve touched on from time to time, but it’s perhaps worth covering this relatively controversial topic in a slightly more systematic fashion, especially with regard to how it works in cosmology.

I’ll start with Bayes’ theorem which for three logical propositions (such as statements about the values of parameters in theory) A, B and C can be written in the form

$P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)$

where

$K=P(A|C).$

This is (or should be!)  uncontroversial as it is simply a result of the sum and product rules for combining probabilities. Notice, however, that I’ve not restricted it to two propositions A and B as is often done, but carried throughout an extra one (C). This is to emphasize the fact that, to a Bayesian, all probabilities are conditional on something; usually, in the context of data analysis this is a background theory that furnishes the framework within which measurements are interpreted. If you say this makes everything model-dependent, then I’d agree. But every interpretation of data in terms of parameters of a model is dependent on the model. It has to be. If you think it can be otherwise then I think you’re misguided.

In the equation, P(B|C) is the probability of B being true, given that C is true. The information C need not be definitely known, but perhaps assumed for the sake of argument. The left-hand side of Bayes’ theorem denotes the probability of B given both A and C, and so on. The presence of C has not changed anything, but is just there as a reminder that it all depends on what is being assumed in the background. The equation states a theorem that can be proved to be mathematically correct so it is – or should be – uncontroversial.
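Since the theorem follows from the sum and product rules, it can be checked mechanically on any toy joint distribution. The numbers below are arbitrary inventions; only the consistency between the two sides matters.

```python
# Check P(B|AC) = K^{-1} P(B|C) P(A|BC), with K = P(A|C), on a toy joint
# distribution for A and B (everything implicitly conditioned on C).
# The four joint probabilities are arbitrary; they just have to sum to 1.
joint = {(True, True): 0.2, (True, False): 0.3,
         (False, True): 0.1, (False, False): 0.4}

p_a = joint[(True, True)] + joint[(True, False)]   # P(A|C), the constant K
p_b = joint[(True, True)] + joint[(False, True)]   # P(B|C)
p_a_given_b = joint[(True, True)] / p_b            # P(A|BC), by the product rule
p_b_given_a = joint[(True, True)] / p_a            # P(B|AC), by the product rule

# Both sides of Bayes' theorem agree:
assert abs(p_b_given_a - p_b * p_a_given_b / p_a) < 1e-12
print(f"P(B|AC) = {p_b_given_a:.2f}")
```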

Now comes the controversy. In the “frequentist” interpretation of probability, the entities A, B and C would be interpreted as “events” (e.g. the coin is heads) or “random variables” (e.g. the score on a dice, a number from 1 to 6) attached to which is their probability, indicating their propensity to occur in an imagined ensemble. These things are quite complicated mathematical objects: they don’t have specific numerical values, but are represented by a measure over the space of possibilities. They are sort of “blurred-out” in some way, the fuzziness representing the uncertainty in the precise value.

To a Bayesian, the entities A, B and C have a completely different character to what they represent for a frequentist. They are not “events” but  logical propositions which can only be either true or false. The entities themselves are not blurred out, but we may have insufficient information to decide which of the two possibilities is correct. In this interpretation, P(A|C) represents the degree of belief that it is consistent to hold in the truth of A given the information C. Probability is therefore a generalization of the “normal” deductive logic expressed by Boolean algebra: the value “0” is associated with a proposition which is false and “1” denotes one that is true. Probability theory extends  this logic to the intermediate case where there is insufficient information to be certain about the status of the proposition.

A common objection to Bayesian probability is that it is somehow arbitrary or ill-defined. “Subjective” is the word that is often bandied about. This is only fair to the extent that different individuals may have access to different information and therefore assign different probabilities. Given different information C and C′ the probabilities P(A|C) and P(A|C′) will be different. On the other hand, the same precise rules for assigning and manipulating probabilities apply as before. Identical results should therefore be obtained whether these are applied by any person, or even a robot, so that part isn’t subjective at all.

In fact I’d go further. I think one of the great strengths of the Bayesian interpretation is precisely that it does depend on what information is assumed. This means that such information has to be stated explicitly. The essential assumptions behind a result can be – and, regrettably, often are – hidden in frequentist analyses. Being a Bayesian forces you to put all your cards on the table.

To a Bayesian, probabilities are always conditional on other assumed truths. There is no such thing as an absolute probability, hence my alteration of the form of Bayes’s theorem to represent this. A probability such as P(A) has no meaning to a Bayesian: there is always conditioning information. For example, if  I blithely assign a probability of 1/6 to each face of a dice, that assignment is actually conditional on me having no information to discriminate between the appearance of the faces, and no knowledge of the rolling trajectory that would allow me to make a prediction of its eventual resting position.

In the Bayesian framework, probability theory becomes not a branch of experimental science but a branch of logic. Like any branch of mathematics it cannot be tested by experiment but only by the requirement that it be internally self-consistent. This brings me to what I think is one of the most important results of twentieth century mathematics, but which is unfortunately almost unknown in the scientific community. In 1946, Richard Cox derived the unique generalization of Boolean algebra under the assumption that such a logic must involve associating a single number with any logical proposition. The result he got is beautiful and anyone with any interest in science should make a point of reading his elegant argument. It turns out that the only way to construct a consistent logic of uncertainty incorporating this principle is by using the standard laws of probability. There is no other way to reason consistently in the face of uncertainty than probability theory. Accordingly, probability theory always applies when there is insufficient knowledge for deductive certainty. Probability is inductive logic.

This is not just a nice mathematical property. This kind of probability lies at the foundations of a consistent methodological framework that not only encapsulates many common-sense notions about how science works, but also puts at least some aspects of scientific reasoning on a rigorous quantitative footing. This is an important weapon that should be used more often in the battle against the creeping irrationalism one finds in society at large.

I posted some time ago about an alternative way of deriving the laws of probability from consistency arguments.

To see how the Bayesian approach works, let us consider a simple example. Suppose we have a hypothesis H (some theoretical idea that we think might explain some experiment or observation). We also have access to some data D, and we also adopt some prior information I (which might be the results of other experiments or simply working assumptions). What we want to know is how strongly the data D support the hypothesis H given our background assumptions I. To keep it easy, we assume that the choice is between whether H is true or H is false. In the latter case, “not-H” or H′ (for short) is true. If our experiment is at all useful we can construct P(D|HI), the probability that the experiment would produce the data set D if both our hypothesis and the conditional information are true.

The probability P(D|HI) is called the likelihood; to construct it we need to have   some knowledge of the statistical errors produced by our measurement. Using Bayes’ theorem we can “invert” this likelihood to give P(H|DI), the probability that our hypothesis is true given the data and our assumptions. The result looks just like we had in the first two equations:

$P(H|DI) = K^{-1}P(H|I)P(D|HI) .$

Now we can expand the “normalising constant” K because we know that either H or H′ must be true. Thus

$K=P(D|I)=P(H|I)P(D|HI)+P(H^{\prime}|I) P(D|H^{\prime}I)$

The P(H|DI) on the left-hand side of the first expression is called the posterior probability; the right-hand side involves P(H|I), which is called the prior probability, and the likelihood P(D|HI). The principal controversy surrounding Bayesian inductive reasoning involves the prior and how to define it, which is something I’ll comment on in a future post.
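Putting these pieces together for the binary case is a one-liner. The prior and likelihood values below are invented stand-ins for a real experiment:

```python
# P(H|DI) = K^{-1} P(H|I) P(D|HI), with K expanded over H and H'.
# Prior and likelihood values are invented stand-ins for a real experiment.
prior_h = 0.5     # P(H|I)
like_h = 0.8      # P(D|HI): how probable the data are if H is true
like_not_h = 0.2  # P(D|H'I): how probable the data are if H is false

k = prior_h * like_h + (1 - prior_h) * like_not_h  # K = P(D|I)
posterior_h = prior_h * like_h / k

print(f"P(H|DI) = {posterior_h:.2f}")
```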

The Bayesian recipe for testing a hypothesis assigns a large posterior probability to a hypothesis for which the product of the prior probability and the likelihood is large. It can be generalized to the case where we want to pick the best of a set of competing hypotheses, say H1, …, Hn. Note that this need not be the set of all possible hypotheses, just those that we have thought about. We can only choose from what is available. The hypotheses may be relatively simple, such as that some particular parameter takes the value x, or they may be composite, involving many parameters and/or assumptions. For instance, the Big Bang model of our universe is a very complicated hypothesis, or in fact a combination of hypotheses joined together, involving at least a dozen parameters which can’t be predicted a priori but which have to be estimated from observations.

The required result for multiple hypotheses is pretty straightforward: the sum of the two alternatives involved in K above simply becomes a sum over all possible hypotheses, so that

$P(H_i|DI) = K^{-1}P(H_i|I)P(D|H_iI),$

and

$K=P(D|I)=\sum_j P(H_j|I)P(D|H_jI)$

If the hypothesis concerns the value of a parameter – in cosmology this might be, e.g., the mean density of the Universe expressed by the density parameter Ω0 – then the allowed space of possibilities is continuous. The sum in the denominator should then be replaced by an integral, but conceptually nothing changes. Our “best” hypothesis is the one that has the greatest posterior probability.
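A sketch of the continuous case, with the integral approximated by a sum over a grid. The Gaussian likelihood centred on 0.3 and the flat prior are invented for illustration, not a real cosmological analysis:

```python
import math

# Posterior over a continuous parameter omega on a grid; the sum in K
# becomes (a Riemann-sum approximation to) an integral. The Gaussian
# likelihood centred on 0.3 and the flat prior are illustrative inventions.
def likelihood(omega, centre=0.3, sigma=0.05):
    """P(D|omega, I): pretend the data constrain omega to 0.3 +/- 0.05."""
    return math.exp(-0.5 * ((omega - centre) / sigma) ** 2)

grid = [i / 1000 for i in range(1001)]           # omega in [0, 1]
unnorm = [1.0 * likelihood(w) for w in grid]     # flat prior times likelihood
k = sum(unnorm)                                  # normalising constant
posterior = [u / k for u in unnorm]

best = grid[posterior.index(max(posterior))]     # most probable value of omega
print(f"most probable omega = {best:.3f}")
```

With a flat prior the peak of the posterior coincides with the peak of the likelihood, which is why the distinction discussed next only bites when the prior is informative.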

From a frequentist stance the procedure is often instead to just maximize the likelihood. According to this approach the best theory is the one that makes the data most probable. This can coincide with the most probable theory, but only if the prior probability is constant; in general, the probability of a model given the data is not the same as the probability of the data given the model. I’m amazed how many practising scientists make this error on a regular basis.
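The distinction is easy to demonstrate with invented numbers: a hypothesis can make the data most probable while being far from the most probable hypothesis.

```python
# Maximum likelihood versus maximum posterior, with invented numbers:
# H1 makes the data most probable, but H2 is far more probable a priori.
hypotheses = {
    "H1": {"prior": 0.01, "likelihood": 0.90},
    "H2": {"prior": 0.99, "likelihood": 0.20},
}

ml_best = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])
map_best = max(hypotheses,
               key=lambda h: hypotheses[h]["prior"] * hypotheses[h]["likelihood"])

print(f"maximum likelihood picks {ml_best}; maximum posterior picks {map_best}")
```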

The following figure might serve to illustrate the difference between the frequentist and Bayesian approaches. In the former case, everything is done in “data space” using likelihoods, and in the other we work throughout with probabilities of hypotheses, i.e. we think in hypothesis space. I find it interesting to note that most theorists that I know who work in cosmology are Bayesians and most observers are frequentists!

As I mentioned above, it is the presence of the prior probability in the general formula that is the most controversial aspect of the Bayesian approach. The attitude of frequentists is often that this prior information is completely arbitrary or at least “model-dependent”. Being empirically-minded people, by and large, they prefer to think that measurements can be made and interpreted without reference to theory at all.

Assuming we can assign the prior probabilities in an appropriate way what emerges from the Bayesian framework is a consistent methodology for scientific progress. The scheme starts with the hardest part – theory creation. This requires human intervention, since we have no automatic procedure for dreaming up hypotheses from thin air. Once we have a set of hypotheses, we need data against which theories can be compared using their relative probabilities. The experimental testing of a theory can happen in many stages: the posterior probability obtained after one experiment can be fed in, as prior, into the next. The order of experiments does not matter. This all happens in an endless loop, as models are tested and refined by confrontation with experimental discoveries, and are forced to compete with new theoretical ideas. Often one particular theory emerges as most probable for a while, such as in particle physics where a “standard model” has been in existence for many years. But this does not make it absolutely right; it is just the best bet amongst the alternatives. Likewise, the Big Bang model does not represent the absolute truth, but is just the best available model in the face of the manifold relevant observations we now have concerning the Universe’s origin and evolution. The crucial point about this methodology is that it is inherently inductive: all the reasoning is carried out in “hypothesis space” rather than “observation space”.  The primary form of logic involved is not deduction but induction. Science is all about inverse reasoning.
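The order-independence of sequential updating is straightforward to verify. In this sketch the likelihood pair for each experiment is invented; feeding each posterior in as the next prior gives the same final answer whichever way round the data arrive.

```python
# Sequential Bayesian updating: each posterior becomes the prior for the
# next experiment. The likelihood pairs (P(D|H), P(D|H')) are invented.
def update(prior, like_h, like_not_h):
    k = prior * like_h + (1 - prior) * like_not_h
    return prior * like_h / k

experiments = [(0.7, 0.4), (0.9, 0.5), (0.3, 0.6)]

forward = 0.5
for lh, ln in experiments:
    forward = update(forward, lh, ln)

backward = 0.5
for lh, ln in reversed(experiments):
    backward = update(backward, lh, ln)

# The order of the experiments does not matter.
print(f"forward {forward:.4f}, backward {backward:.4f}")
```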

For comments on induction versus deduction in another context, see here.

So what are the main differences between the Bayesian and frequentist views?

First, I think it is fair to say that the Bayesian framework is enormously more general than is allowed by the frequentist notion that probabilities must be regarded as relative frequencies in some ensemble, whether that is real or imaginary. In the latter interpretation, a proposition is at once true in some elements of the ensemble and false in others. It seems to me to be a source of great confusion to substitute a logical AND for what is really a logical OR. The Bayesian stance is also free from problems associated with the failure to incorporate in the analysis any information that can’t be expressed as a frequency. Would you really trust a doctor who said that 75% of the people she saw with your symptoms required an operation, but who did not bother to look at your own medical files?

As I mentioned above, frequentists tend to talk about “random variables”. This takes us into another semantic minefield. What does “random” mean? To a Bayesian there are no random variables, only variables whose values we do not know. A random process is simply one about which we only have sufficient information to specify probability distributions rather than definite values.

More fundamentally, it is clear from the fact that the combination rules for probabilities were derived by Cox uniquely from the requirement of logical consistency, that any departure from these rules will generally speaking involve logical inconsistency. Many of the standard statistical data analysis techniques – including the simple “unbiased estimator” mentioned briefly above – used when the data consist of repeated samples of a variable having a definite but unknown value, are not equivalent to Bayesian reasoning. These methods can, of course, give good answers, but they can all be made to look completely silly by suitable choice of dataset.

By contrast, I am not aware of any example of a paradox or contradiction that has ever been found using the correct application of Bayesian methods, although the methods can of course be applied incorrectly. Furthermore, in order to deal with unique events like the weather, frequentists are forced to introduce the notion of an ensemble, a perhaps infinite collection of imaginary possibilities, to allow them to retain the notion that probability is a proportion. Provided the calculations are done correctly, the results of these calculations should agree with the Bayesian answers. On the other hand, frequentists often talk about the ensemble as if it were real, and I think that is very dangerous…

## Science’s Dirtiest Secret?

Posted in Bad Statistics, The Universe and Stuff on March 19, 2010 by telescoper

My attention was drawn yesterday to an article, in a journal I never read called American Scientist, about the role of statistics in science. Since this is a theme I’ve blogged about before I had a quick look at the piece and quickly came to the conclusion that the article was excruciating drivel. However, looking at it again today, my opinion of it has changed. I still don’t think it’s very good, but it didn’t make me as cross second time around. I don’t know whether this is because I was in a particularly bad mood yesterday, or whether the piece has been edited. But although it didn’t make me want to scream, I still think it’s a poor article.

> For better or for worse, science has long been married to mathematics. Generally it has been for the better. Especially since the days of Galileo and Newton, math has nurtured science. Rigorous mathematical methods have secured science’s fidelity to fact and conferred a timeless reliability to its findings.
>
> During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos. Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.

In terms of historical accuracy, the author, Tom Siegfried, gets off to a very bad start. Science didn’t get “seduced” by statistics.  As I’ve already blogged about, scientists of the calibre of Gauss and Laplace – and even Galileo – were instrumental in inventing statistics.

And what were the “modes of calculation that had served it so faithfully” anyway? Scientists have long  recognized the need to understand the behaviour of experimental errors, and to incorporate the corresponding uncertainty in their analysis. Statistics isn’t a “mutant form of math”, it’s an integral part of the scientific method. It’s a perfectly sound discipline, provided you know what you’re doing…

And that's where, despite the sloppiness of his argument, I do have some sympathy with some of what Siegfried says. What has happened, in my view, is that too many people use statistical methods "off the shelf" without thinking about what they're doing. The result is that bad use of statistics is widespread. This is particularly true in disciplines that don't have a well-developed mathematical culture, such as some parts of the biosciences and medicine, although the physical sciences have their own share of horrors too.

I’ve had a run-in myself with the authors of a paper in neurobiology who based extravagant claims on an inappropriate statistical analysis.

What is wrong is therefore not the use of statistics per se, but the fact that too few people understand – or probably even think about – what they’re trying to do (other than publish papers).

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions. Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

Quite, but what does this mean for "science's dirtiest secret"? Not that it involves statistical reasoning, but that large numbers of scientists haven't a clue what they're doing when they do a statistical test. And if this is the case with practising scientists, how can we possibly expect the general public to make sense of what is being said by the experts? No wonder people distrust scientists when so many results, confidently announced on the basis of totally spurious arguments, turn out to be wrong.

The problem is that the "standard" statistical methods shouldn't be "standard". It's true that there are many methods that work in a wide range of situations, but simply assuming they will work in any particular one without thinking about it very carefully is a very dangerous strategy. Siegfried discusses examples where the use of "p-values" leads to incorrect results. It doesn't surprise me that such examples can be found, as the misinterpretation of p-values is rife even in numerate disciplines, and matters get worse for those practitioners who combine p-values from different studies using meta-analysis, a method which has no mathematical motivation whatsoever and which should be banned. So indeed should a whole host of other frequentist methods which offer limitless opportunities to make a complete botch of the data arising from a research project.
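One common misinterpretation is worth spelling out (a toy example of my own, not one of Siegfried's): a p-value is computed *assuming the null hypothesis is true*, so even when there is no effect whatsoever, about 5% of tests will come out "significant" at the conventional threshold. A minimal simulation, assuming a z-test with known unit variance:

```python
import random
from statistics import NormalDist

random.seed(0)
n, experiments = 30, 2000
false_alarms = 0
for _ in range(experiments):
    # Data drawn under the null: no effect at all, mean exactly zero.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    z = mean * n ** 0.5          # z-statistic with known sigma = 1
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    if p < 0.05:
        false_alarms += 1

rate = false_alarms / experiments
print(f"'significant' results under a true null: {rate:.1%}")
```

Run enough independent tests on pure noise and "significant" findings arrive on schedule, which is one reason a single study with p just under 0.05 tells you far less than it appears to.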

Siegfried goes on:

Nobody contends that all of science is wrong, or that it hasn’t compiled an impressive array of truths about the natural world. Still, any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical.

Any single scientific study alone is quite likely to be incorrect. Really? Well, yes, if it is done incorrectly. But the point is not that such studies are incorrect because they use statistics; they are incorrect because the statistics are used incorrectly. Many scientists don't even understand the statistics well enough to realise that what they're doing is wrong.

If I had my way, scientific publications – especially in disciplines that impact directly on everyday life, such as medicine – should adopt a much more rigorous policy on statistical analysis and on the way statistical significance is reported. I favour the setting up of independent panels whose responsibility is to do the statistical data analysis on behalf of those scientists who can’t be trusted to do it correctly themselves.

Having started badly, and lost its way in the middle, the article ends disappointingly too. Having led us through a wilderness of failed frequentist analyses, he finally arrives at a discussion of the superior Bayesian methodology, but in irritatingly half-hearted fashion.

But Bayesian methods introduce a confusion into the actual meaning of the mathematical concept of “probability” in the real world. Standard or “frequentist” statistics treat probabilities as objective realities; Bayesians treat probabilities as “degrees of belief” based in part on a personal assessment or subjective decision about what to include in the calculation. That’s a tough placebo to swallow for scientists wedded to the “objective” ideal of standard statistics….

Conflict between frequentists and Bayesians has been ongoing for two centuries. So science’s marriage to mathematics seems to entail some irreconcilable differences. Whether the future holds a fruitful reconciliation or an ugly separation may depend on forging a shared understanding of probability.

The difficulty with this piece as a whole is that it reads as an anti-science polemic: “Some science results are based on bad statistics, therefore statistics is bad and science that uses statistics is bogus.” I don’t know whether that’s what the author intended, or whether it was just badly written.

I'd say the true state of affairs is different. A lot of bad science is published, and a lot of it is bad because it uses statistical reasoning badly. You wouldn't, however, argue that a screwdriver is useless because some idiot tries to hammer in a nail with one.

Only a bad craftsman blames his tools.

## A Mountain of Truth

Posted in Bad Statistics, The Universe and Stuff with tags , , , , on August 1, 2009 by telescoper

I spent the last week at a conference in a beautiful setting amidst the hills overlooking the small town of Ascona by Lake Maggiore in the canton of Ticino, the Italian-speaking part of Switzerland. To be more precise, we were located in a conference centre called the Centro Stefano Franscini on Monte Verità. The meeting was COSMOSTATS, which aimed

… to bring together world-class leading figures in cosmology and particle physics, as well as renowned statisticians, in order to exchange knowledge and experience in dealing with large and complex data sets, and to meet the challenge of upcoming large cosmological surveys.

Although I didn't know much about the location beforehand, it turns out to have an extremely interesting history, going back about a hundred years. The first people to settle there, around the end of the 19th Century, were anarchists who had sought refuge there during times of political upheaval. The Locarno region had long been a popular place for people with "alternative" lifestyles. Monte Verità ("The Mountain of Truth") was eventually bought by Henri Oedenkoven, the son of a rich industrialist, who set up a sort of commune there at which the residents practised vegetarianism, naturism, free love and other forms of behaviour intended as a reaction against the scientific and technological progress of the time. From about 1904 onward the centre became a sanatorium where the discipline of psychoanalysis flourished, and it later attracted many artists. In 1927, Baron Eduard von der Heydt took the place over. He was a great connoisseur of Oriental philosophy and an art collector, and he established a large collection at Monte Verità, much of which is still there: when the Baron died in 1956 he left Monte Verità to the local canton.

Given the bizarre collection of anarchists, naturists, theosophists (and even vegetarians) that used to live in Monte Verità, it is by no means out of keeping with the tradition that it should eventually play host to a conference of cosmologists and statisticians.

The  conference itself was interesting, and I was lucky enough to get to chair a session with three particularly interesting talks in it. In general, though, these dialogues between statisticians and physicists don’t seem to be as productive as one might have hoped. I’ve been to a few now, and although there’s a lot of enjoyable polemic they don’t work too well at changing anyone’s opinion or providing new insights.

We may now have mountains of new data in cosmology and particle physics, but that hasn't always translated into a corresponding mountain of truth. Intervening between our theories and observations lies the vexed question of how best to analyse the data and what the results actually mean. As always, lurking in the background was the long-running conflict between adherents of the Bayesian and frequentist interpretations of probability. It appears that cosmologists, at least those represented at this meeting, tend to be Bayesian while particle physicists are almost exclusively frequentist. I'll refrain from commenting on what this might mean. However, I was perplexed by various comments made during the conference about the issue of coverage, which is discussed rather nicely in some detail here. To me the question of whether a Bayesian method has good frequentist coverage properties is completely irrelevant. Bayesian methods ask different questions (actually, ones to which scientists want to know the answers), so it is not surprising that they give different answers. Measuring a Bayesian method according to a frequentist criterion is completely pointless whichever camp you belong to.
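For readers unfamiliar with the term: "coverage" asks how often an interval procedure traps the true parameter over repeated experiments. A toy sketch of my own (deliberately a case where the two camps coincide): for a Gaussian mean with known variance and a flat prior, the 95% Bayesian credible interval is numerically identical to the frequentist confidence interval, so it has 95% coverage. The point, though, is that this frequentist property is not the question the Bayesian analysis was answering in the first place.

```python
import random

random.seed(1)
mu_true, sigma, n = 3.0, 1.0, 25
z95 = 1.96  # approximate two-sided 95% point of the standard normal
trials = 4000
covered = 0
for _ in range(trials):
    data = [random.gauss(mu_true, sigma) for _ in range(n)]
    xbar = sum(data) / n
    half = z95 * sigma / n ** 0.5
    # With a flat prior on mu, the 95% credible interval is xbar +/- half,
    # numerically the same as the frequentist confidence interval here.
    if xbar - half <= mu_true <= xbar + half:
        covered += 1

coverage = covered / trials
print(f"long-run coverage of the 95% credible interval: {coverage:.1%}")
```

In less convenient models (asymmetric likelihoods, informative priors) the two intervals part company, and judging the Bayesian one by its coverage is exactly the category error complained about above.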

The irrelevance of coverage was one thing that the previous residents knew better than some of the conference guests.

I'd like to thank Uros Seljak, Roberto Trotta and Martin Kunz for organizing the meeting in such a picturesque and intriguing place.