Archive for Bayesian

One More for the Bad Statistics in Astronomy File…

Posted in Bad Statistics, The Universe and Stuff with tags , , , , , on May 20, 2015 by telescoper

It’s been a while since I last posted anything in the file marked Bad Statistics, but I can remedy that this morning with a comment or two on the following paper by Robertson et al. which I found on the arXiv via the Astrostatistics Facebook page. It’s called Stellar activity mimics a habitable-zone planet around Kapteyn’s star and its abstract is as follows:

Kapteyn’s star is an old M subdwarf believed to be a member of the Galactic halo population of stars. A recent study has claimed the existence of two super-Earth planets around the star based on radial velocity (RV) observations. The innermost of these candidate planets–Kapteyn b (P = 48 days)–resides within the circumstellar habitable zone. Given recent progress in understanding the impact of stellar activity in detecting planetary signals, we have analyzed the observed HARPS data for signatures of stellar activity. We find that while Kapteyn’s star is photometrically very stable, a suite of spectral activity indices reveals a large-amplitude rotation signal, and we determine the stellar rotation period to be 143 days. The spectral activity tracers are strongly correlated with the purported RV signal of “planet b,” and the 48-day period is an integer fraction (1/3) of the stellar rotation period. We conclude that Kapteyn b is not a planet in the Habitable Zone, but an artifact of stellar activity.

It’s not really my area of specialism but it seemed an interesting conclusion so I had a skim through the rest of the paper. Here’s the pertinent figure, Figure 3:

[Figure 3 of Robertson et al.: the correlation plots discussed below]

It looks like difficult data to do a correlation analysis on and there are lots of questions to be asked about the form of the errors and how the bunching of the data is handled, to give just two examples. I’d like to have seen a much more comprehensive discussion of this in the paper. In particular the statistic chosen to measure the correlation between variates is the Pearson product-moment correlation coefficient, which is intended to measure linear association between variables. There may indeed be correlations in the plots shown above, but it doesn’t look to me as though a straight-line fit characterizes them very well. It looks to me in some of the cases that there are simply two groups of data points…

However, that’s not the real reason for flagging this one up. The real reason is the following statement in the text:

[Excerpt from the paper, quoting the p-value as the “probability of no correlation”]

Aargh!

No matter how the p-value is arrived at (see comments above), it says nothing about the “probability of no correlation”. This is an error which is sadly commonplace throughout the scientific literature, not just astronomy. The point is that the p-value relates to the probability that the given value of the test statistic (in this case the Pearson product-moment correlation coefficient, r) would arise by chance in the sample if the null hypothesis H (in this case that the two variates are uncorrelated) were true. In other words it relates to P(r|H). It does not tell us anything directly about the probability of H. That would require the use of Bayes’ Theorem. If you want to say anything at all about the probability of a hypothesis being true or not you should use a Bayesian approach. And if you don’t want to say anything about the probability of a hypothesis being true or not then what are you trying to do anyway?
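
To make the distinction concrete, here’s a minimal Python sketch (the data are invented, and I’m just using the standard scipy Pearson test) of what a p-value does and does not tell you:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented data purely for illustration: a weak linear trend plus noise.
x = rng.normal(size=50)
y = 0.2 * x + rng.normal(size=50)

r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3f}")

# The p-value is P(a correlation at least this strong | H), where H is the
# null hypothesis of no correlation: a statement about the data given H.
# It is NOT P(H | data). To get that you need Bayes' theorem,
#     P(H | data) = P(data | H) P(H) / P(data),
# which requires a prior P(H) and an explicit alternative hypothesis in order
# to evaluate the denominator P(data).
```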

If I had my way I would ban p-values altogether, but if people are going to use them I do wish they would be more careful about the statements they make about them.

German Tanks, Traffic Wardens, and the End of the World

Posted in Bad Statistics, The Universe and Stuff with tags , , , , , , , on November 18, 2014 by telescoper

The other day I was looking through some documents relating to the portfolio of courses and modules offered by the Department of Mathematics here at the University of Sussex when I came across a reference to the German Tank Problem. Not knowing what this was I did a google search and found a quite comprehensive wikipedia page on the subject which explains the background rather well.

It seems that during the latter stages of World War 2 the Western Allies made sustained efforts to determine the extent of German tank production, and approached this in two major ways, namely conventional intelligence gathering and statistical estimation, with the latter approach often proving the more accurate and reliable, as was the case in the estimation of the production of Panther tanks just prior to D-Day. The allied command structure had thought the heavy Panzer V (Panther) tanks, with their high velocity, long barreled 75 mm/L70 guns, were uncommon, and would only be encountered in northern France in small numbers. The US Army was confident that the Sherman tank would perform well against the Panzer III and IV tanks that they expected to meet but would struggle against the Panzer V. Shortly before D-Day, rumours began to circulate that large numbers of Panzer V tanks had been deployed in Normandy.

To ascertain whether this was true the Allies attempted to estimate the number of Panzer V tanks being produced. To do this they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell into two unbroken sequences; chassis and engine numbers, as well as those of various other components, were also used. The question to be asked is how accurately one can infer the total number of tanks from a sample of just a few serial numbers. So accurate did this analysis prove to be that, in the statistical theory of estimation, the general problem of estimating the maximum of a discrete uniform distribution from sampling without replacement is now known as the German tank problem. I’ll leave the details to the wikipedia discussion, which in my opinion is yet another demonstration of the advantages of a Bayesian approach to this kind of problem.
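
For the curious, here’s a quick sketch of the two standard approaches in Python. The serial numbers are made up, the frequentist formula is the usual minimum-variance unbiased estimator, and the Bayesian version assumes a roughly uniform prior on the total with an arbitrary cutoff just to keep the sums finite:

```python
import math

# Invented serial numbers from captured tanks (illustration only).
serials = [19, 40, 42, 60]
k, m = len(serials), max(serials)

# Classical (minimum-variance unbiased) estimate of the total: m + m/k - 1.
print(f"Frequentist estimate of total production: {m + m / k - 1:.0f}")

# Bayesian sketch: with a roughly uniform prior on the total N, the chance of
# drawing this particular set of k distinct serials is 1 / C(N, k) for N >= m,
# so the posterior is proportional to 1 / C(N, k).
N_cut = 2000   # arbitrary prior cutoff, purely for the numerics
weights = {N: 1 / math.comb(N, k) for N in range(m, N_cut + 1)}
norm = sum(weights.values())
posterior_mean = sum(N * w for N, w in weights.items()) / norm
print(f"Posterior mean of the total:              {posterior_mean:.0f}")
```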

This problem is a more general version of a problem that I first came across about 30 years ago. I think it was devised in the following form by Steve Gull, but I can’t be sure of that.

Imagine you are a visitor in an unfamiliar, but very populous, city. For the sake of argument let’s assume that it is in China. You know that this city is patrolled by traffic wardens, each of whom carries a number on their uniform.  These numbers run consecutively from 1 (smallest) to T (largest) but you don’t know what T is, i.e. how many wardens there are in total. You step out of your hotel and discover traffic warden number 347 sticking a ticket on your car. What is your best estimate of T, the total number of wardens in the city? I hope the similarity to the German Tank Problem is obvious, except in this case it is much simplified by involving just one number rather than a sample.

I gave a short lunchtime talk about this many years ago when I was working at Queen Mary College, in the University of London. Every Friday, over beer and sandwiches, a member of staff or research student would give an informal presentation about their research, or something related to it. I decided to give a talk about bizarre applications of probability in cosmology, and this problem was intended to be my warm-up. I was amazed at the answers I got to this simple question. The majority of the audience denied that one could make any inference at all about T based on a single observation like this, other than that it  must be at least 347.

Actually, a single observation like this can lead to a useful inference about T, using Bayes’ theorem. Suppose we have really no idea at all about T before making our observation; we can then adopt a uniform prior probability. Of course there must be an upper limit on T. There can’t be more traffic wardens than there are people, for example. Although China has a large population, the prior probability of there being, say, a billion traffic wardens in a single city must surely be zero. But let us take the prior to be effectively constant. Suppose the actual number of the warden we observe is t. Now we have to assume that we have an equal chance of coming across any one of the T traffic wardens outside our hotel. Each value of t (from 1 to T) is therefore equally likely. I think this is the reason that my astronomers’ lunch audience thought there was no information to be gleaned from an observation of any particular value, i.e. t=347.

Let us simplify this argument further by allowing two alternative “models” for the frequency of Chinese traffic wardens. One has T=1000, and the other (just to be silly) has T=1,000,000. If I find number 347, which of these two alternatives do you think is more likely? Think about the kind of numbers that occupy the range from 1 to T. In the first case, most of the numbers have 3 digits. In the second, most of them have 6. If there were a million traffic wardens in the city, it is quite unlikely you would find a random individual with a number as small as 347. If there were only 1000, then 347 is just a typical number. There are strong grounds for favouring the first model over the second, simply based on the number actually observed. To put it another way, we would be surprised to encounter number 347 if T were actually a million. We would not be surprised if T were 1000.

One can extend this argument to the entire range of possible values of T, and ask a more general question: if I observe traffic warden number t what is the probability I assign to each value of T? The answer is found using Bayes’ theorem. The prior, as I assumed above, is uniform. The likelihood is the probability of the observation given the model. If I assume a value of T, the probability P(t|T) of each value of t (up to and including T) is just 1/T (since each of the wardens is equally likely to be encountered). Bayes’ theorem can then be used to construct a posterior probability of P(T|t). Without going through all the nuts and bolts, I hope you can see that this probability will tail off for large T. Our observation of a (relatively) small value for t should lead us to suspect that T is itself (relatively) small. Indeed it’s a reasonable “best guess” that T=2t. This makes intuitive sense because the observed value of t then lies right in the middle of its range of possibilities.

Before going on, it is worth mentioning one other point about this kind of inference: that it is not at all powerful. Note that the likelihood just varies as 1/T. That of course means that small values are favoured over large ones. But note that this probability is uniform in logarithmic terms. So although T=1000 is more probable than T=1,000,000, the range between 1000 and 10,000 is roughly as likely as the range between 1,000,000 and 10,000,000, assuming there is no prior information. So although it tells us something, it doesn’t actually tell us very much. Just like any probabilistic inference, there’s a chance that it is wrong, perhaps very wrong.
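
A few lines of Python illustrate both points. The upper cutoff is an arbitrary stand-in for the vague prior; the posterior falls off as 1/T, but each full decade of T carries about the same probability:

```python
import numpy as np

t = 347            # the warden number actually observed
T_cut = 10**7      # arbitrary upper cutoff standing in for the vague prior

T = np.arange(t, T_cut + 1)
posterior = 1.0 / T            # uniform prior times likelihood P(t|T) = 1/T
posterior /= posterior.sum()

for lo, hi in [(10**3, 10**4), (10**6, 10**7)]:
    in_range = (T >= lo) & (T < hi)
    print(f"P({lo:,} <= T < {hi:,}) = {posterior[in_range].sum():.3f}")
# Each full decade of T carries about the same probability, so the single
# observation tells us something, but not very much.
```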

Which brings me to an extrapolation of this argument to an argument about the end of the World. Now I don’t mind admitting that as I get older I get more and more pessimistic about the prospects for humankind’s survival into the distant future. Unless there are major changes in the way this planet is governed, our Earth may indeed become barren and uninhabitable through war or environmental catastrophe. But I do think the future is in our hands, and disaster is, at least in principle, avoidable. In this respect I have to distance myself from a very strange argument that has been circulating among philosophers and physicists for a number of years. It is called the Doomsday argument, and it even has a sizeable wikipedia entry, to which I refer you for more details and variations on the basic theme. As far as I am aware, it was first introduced by the mathematical physicist Brandon Carter and subsequently developed and expanded by the philosopher John Leslie (not to be confused with the TV presenter of the same name). It also re-appeared in slightly different guise through a paper in the serious scientific journal Nature by the eminent physicist Richard Gott. Evidently, for some reason, some serious people take it very seriously indeed.

So what can Doomsday possibly have to do with Panzer tanks or traffic wardens? Instead of traffic wardens, we want to estimate N, the number of humans that will ever be born. Following the same logic as in the example above, I assume that I am a “randomly” chosen individual drawn from the sequence of all humans to be born, in past, present and future. For the sake of argument, assume I am number n in this sequence. The logic I explained above should lead me to conclude that the total number N is not much larger than my number, n. For the sake of argument, assume that I am the one-billionth human to be born, i.e. n=1,000,000,000. There should not be many more than a few billion humans ever to be born. At the rate of current population growth, this means that not many more generations of humans remain to be born. Doomsday is nigh.

Richard Gott’s version of this argument is logically similar, but is based on timescales rather than numbers. If whatever thing we are considering begins at some time t_begin and ends at a time t_end, and if we observe it at a “random” time between these two limits, then our best estimate for its future duration is of order how long it has lasted up until now. Gott gives the example of Stonehenge, which was built about 4,000 years ago: we should expect it to last a few thousand years into the future. Actually, Stonehenge is a highly dubious example. It hasn’t really survived 4,000 years. It is a ruin, and nobody knows its original form or function. However, the argument goes that if we come across a building put up about twenty years ago, presumably we should think it will come down again (whether by accident or design) in about twenty years time. If I happen to walk past a building just as it is being finished, presumably I should hang around and watch its imminent collapse….

But I’m being facetious.

Following this chain of thought, we would argue that, since humanity has been around a few hundred thousand years, it is expected to last a few hundred thousand years more. Doomsday is not quite as imminent as previously, but in any case humankind is not expected to survive sufficiently long to, say, colonize the Galaxy.

You may reject this type of argument on the grounds that you do not accept my logic in the case of the traffic wardens. If so, I think you are wrong. I would say that if you accept all the assumptions entering into the Doomsday argument then it is an equally valid example of inductive inference. The real issue is whether it is reasonable to apply this argument at all in this particular case. There are a number of related examples that should lead one to suspect that something fishy is going on. Usually the problem can be traced back to the glib assumption that something is “random” when it is not, or when it is not clearly stated what that is supposed to mean.

There are around sixty million British people on this planet, of whom I am one. In contrast there are 3 billion Chinese. If I follow the same kind of logic as in the examples I gave above, I should be very perplexed by the fact that I am not Chinese. After all, the odds are 50:1 against me being British, aren’t they?

Of course, I am not at all surprised by the observation of my non-Chineseness. My upbringing gives me access to a great deal of information about my own ancestry, as well as the geographical and political structure of the planet. This data convinces me that I am not a “random” member of the human race. My self-knowledge is conditioning information and it leads to such a strong prior knowledge about my status that the weak inference I described above is irrelevant. Even if there were a million million Chinese and only a hundred British, I have no grounds to be surprised at my own nationality given what else I know about how I got to be here.

This kind of conditioning information can be applied to history, as well as geography. Each individual is generated by its parents. Its parents were generated by their parents, and so on. The genetic trail of these reproductive events connects us to our primitive ancestors in a continuous chain. A well-informed alien geneticist could look at my DNA and categorize me as an “early human”. I simply could not be born later in the story of humankind, even if it does turn out to continue for millennia. Everything about me – my genes, my physiognomy, my outlook, and even the fact that I am bothering to spend time discussing this so-called paradox – is contingent on my specific place in human history. Future generations will know so much more about the universe and the risks to their survival that they won’t even discuss this simple argument. Perhaps we just happen to be living at the only epoch in human history in which we know enough about the Universe for the Doomsday argument to make some kind of sense, but too little to resolve it.

To see this in a slightly different light, think again about Gott’s timescale argument. The other day I met an old friend from school days. It was a chance encounter, and I hadn’t seen the person for over 25 years. In that time he had married, and when I met him he was accompanied by a baby daughter called Mary. If we were to take Gott’s argument seriously, this was a random encounter with an entity (Mary) that had existed for less than a year. Should I infer that this entity should probably only endure another year or so? I think not. Again, bare numerological inference is rendered completely irrelevant by the conditioning information I have. I know something about babies. When I see one I realise that it is an individual at the start of its life, and I assume that it has a good chance of surviving into adulthood. Human civilization is a baby civilization. Like any youngster, it has dangers facing it. But it is not doomed by the mere fact that it is young.

John Leslie has developed many different variants of the basic Doomsday argument, and I don’t have the time to discuss them all here. There is one particularly bizarre version, however, that I think merits a final word or two because it raises an interesting red herring. It’s called the “Shooting Room”.

Consider the following model for human existence. Souls are called into existence in groups representing each generation. The first generation has ten souls. The next has a hundred, the next after that a thousand, and so on. Each generation is led into a room, at the front of which is a pair of dice. The dice are rolled. If the score is double-six then everyone in the room is shot and it’s the end of humanity. If any other score is shown, everyone survives and is led out of the Shooting Room to be replaced by the next generation, which is ten times larger. The dice are rolled again, with the same rules. You find yourself called into existence and are led into the room along with the rest of your generation. What should you think is going to happen?

Leslie’s argument is the following. Each generation not only has more members than the previous one, but also contains more souls than have ever existed to that point. For example, the third generation has 1000 souls; the previous two had 10 and 100 respectively, i.e. 110 altogether. Roughly 90% of all humanity lives in the last generation. Whenever the last generation happens, there are bound to be more people in that generation than in all generations up to that point. When you are called into existence you should therefore expect to be in the last generation. You should consequently expect that the dice will show double six and the celestial firing squad will take aim. On the other hand, if you think the dice are fair then each throw is independent of the previous one and a throw of double-six should have a probability of just one in thirty-six. On this basis, you should expect to survive. The odds are against the fatal score.
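
Both halves of the argument can be checked with a toy simulation. The cap on the number of generations is my own assumption, made purely so the simulation terminates; as I discuss below, that cap is really where the trouble lies:

```python
import random

def run_world(max_generations=15, seed=None):
    """One Shooting Room 'world': generations of 10, 100, 1000, ... souls,
    each shot only if the dice come up double six (probability 1/36)."""
    rng = random.Random(seed)
    total = 0
    for g in range(max_generations):
        size = 10 ** (g + 1)
        total += size
        if rng.randint(1, 6) == 6 and rng.randint(1, 6) == 6:
            return size, total, True      # doomsday for this generation
    return size, total, False             # cap reached without a double six

doomed_shares = []
for i in range(10_000):
    last, total, doomed = run_world(seed=i)
    if doomed:
        doomed_shares.append(last / total)

print(f"worlds ending in doom within the cap: {len(doomed_shares)} / 10000")
print(f"average share of all souls sitting in the doomed generation: "
      f"{sum(doomed_shares) / len(doomed_shares):.2f}")
# Each generation faces only a 1/36 risk, yet whenever doom does arrive it
# catches roughly 90% of everyone ever called into existence.
```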

This apparent paradox seems to suggest that it matters a great deal whether the future is predetermined (your presence in the last generation requires the double-six to fall) or “random” (in which case there is the usual probability of a double-six). Leslie argues that if everything is pre-determined then we’re doomed. If there’s some indeterminism then we might survive. This isn’t really a paradox at all, simply an illustration of the fact that assuming different models gives rise to different probability assignments.

While I am on the subject of the Shooting Room, it is worth drawing a parallel with another classic puzzle of probability theory, the St Petersburg Paradox. This is an old chestnut to do with a purported winning strategy for Roulette. It was first proposed by Nicolas Bernoulli but famously discussed at greatest length by Daniel Bernoulli in the pages of Transactions of the St Petersburg Academy, hence the name. It works just as well for a simple toss of a coin as for Roulette, provided that in the latter game one bets only on red or black rather than on individual numbers.

Imagine you decide to bet such that you win by throwing heads. Your original stake is £1. If you win, the bank pays you at even money (i.e. you get your stake back plus another £1). If you lose, i.e. get tails, your strategy is to play again but bet double. If you win this time you get £4 back but have bet £2+£1=£3 up to that point. If you lose again you bet £8. If you win this time, you get £16 back but have paid in £8+£4+£2+£1=£15 to that point. Clearly, if you carry on the strategy of doubling your previous stake each time you lose, when you do eventually win you will be ahead by £1. It’s a guaranteed winner. Isn’t it?

The answer is yes, as long as you can guarantee that the number of losses you will suffer is finite. But in tosses of a fair coin there is no limit to the number of tails you can throw before getting a head. To get the correct probability of winning you have to allow for all possibilities. So what is your expected stake to win this £1? The answer is the root of the paradox. The probability that you win straight off is ½ (you need to throw a head), and your stake is £1 in this case so the contribution to the expectation is £0.50. The probability that you win on the second go is ¼ (you must lose the first time and win the second so it is ½ times ½) and your stake this time is £2 so this contributes the same £0.50 to the expectation. A moment’s thought tells you that each throw contributes the same amount, £0.50, to the expected stake. We have to add this up over all possibilities, and there are an infinite number of them. The result of summing them all up is therefore infinite. If you don’t believe this just think about how quickly your stake grows after only a few losses: £1, £2, £4, £8, £16, £32, £64, £128, £256, £512, £1024, etc. After only ten losses you are staking over a thousand pounds just to get your pound back. Sure, you can win £1 this way, but you need to expect to stake an infinite amount to guarantee doing so. It is not a very good way to get rich.
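
If you want to check the arithmetic, here’s a little sketch following the accounting above, in which each possible winning throw contributes £0.50 to the expected stake:

```python
# On attempt k (counting from 1) you stake 2**(k-1) pounds, and the game ends
# on that attempt with probability (1/2)**k (k-1 tails followed by a head).

def expected_stake(max_attempts):
    """Expected stake on the winning throw, truncated after max_attempts."""
    return sum((0.5 ** k) * 2 ** (k - 1) for k in range(1, max_attempts + 1))

for cap in (10, 100, 1000):
    # Every term contributes exactly £0.50, so the truncated sum is cap / 2.
    print(f"allowing up to {cap:4d} attempts: expected stake = £{expected_stake(cap):.1f}")

# Remove the cap and the expectation grows without limit, even though the
# eventual profit, whenever the head finally arrives, is always exactly £1.
```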

The relationship of all this to the Shooting Room is that it shows it is dangerous to pre-suppose a finite value for a number which in principle could be infinite. If the number of souls that could be called into existence is allowed to be infinite, then any individual has no chance at all of being called into existence in any generation!

Amusing as they are, the thing that makes me most uncomfortable about these Doomsday arguments is that they attempt to determine a probability of an event without any reference to underlying mechanism. For me, a valid argument about Doomsday would have to involve a particular physical cause for the extinction of humanity (e.g. asteroid impact, climate change, nuclear war, etc). Given this physical mechanism one should construct a model within which one can estimate probabilities for the model parameters (such as the rate of occurrence of catastrophic asteroid impacts). Only then can one make a valid inference based on relevant observations and their associated likelihoods. Such calculations may indeed lead to alarming or depressing results. I fear that the greatest risk to our future survival is not from asteroid impact or global warming, where the chances can be estimated with reasonable precision, but self-destructive violence carried out by humans themselves. Science has no way of being able to predict what atrocities people are capable of so we can’t make any reliable estimate of the probability we will self-destruct. But the absence of any specific mechanism in the versions of the Doomsday argument I have discussed robs them of any scientific credibility at all.

There are better grounds for worrying about the future than simple-minded numerology.


Evidence, Absence, and the Type II Monster

Posted in Bad Statistics with tags , , , , , , on June 24, 2013 by telescoper

I was just having a quick lunchtime shufty at Dave Steele‘s blog. His latest post is inspired by the quotation “Absence of Evidence isn’t Evidence of Absence” which can apparently be traced back to Carl Sagan. I never knew that. Anyway I was muchly enjoying the piece when I suddenly stumbled into this paragraph, which I quote without permission because I’m too shy to ask:

In a scientific experiment, the null hypothesis refers to a general or default position that there is no relationship between two measured phenomena. For example a well thought out point in an article by James Delingpole. Rejecting or disproving the null hypothesis is the primary task in any scientific research. If an experiment rejects the null hypothesis, it concludes that there are grounds greater than chance for believing that there is a relationship between the two (or more) phenomena being observed. Again the null hypothesis itself can never be proven. If participants treated with a medication are compared with untreated participants and there is found no statistically significant difference between the two groups, it does not prove that there really is no difference. Or if we say there is a monster in a Loch but cannot find it. The experiment could only be said to show that the results were not sufficient to reject the null hypothesis.

I’m going to pick up the trusty sword of Bayesian probability and have yet another go at the dragon of frequentism, but before doing so I’ll just correct the first sentence. The “null hypothesis” in a frequentist hypothesis test is not necessarily of the form described here: it could be of virtually any form, possibly quite different from the stated one of no correlation between two variables. All that matters is that (a) it has to be well-defined in terms of a model and (b) you have to be content to accept it as true unless and until you find evidence to the contrary. It’s true to say that there’s nowt as well-specified as nowt so nulls are often of the form “there is no correlation” or something like that, but the point is that they don’t have to be.

I note that the wikipedia page on “null hypothesis” uses the same wording as in the first sentence of the quoted paragraph, but this is not what you’ll find in most statistics textbooks. In their compendious three-volume work The Advanced Theory of Statistics Kendall & Stuart even go so far as to say that the word “null” is misleading precisely because the hypothesis under test might be quite complicated, e.g. of composite nature.

Anyway, whatever the null hypothesis happens to be, the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the significance level merely specifies the probability that you would reject the null hypothesis if it were correct; rejecting it when it is in fact correct is what you would call a Type I error. The significance level says nothing at all about the probability that the null hypothesis is actually correct. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution of the test statistic based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
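
For concreteness, here’s a toy example (a one-sided test of a Gaussian mean with known noise; all the numbers are invented) showing that significance, power and the Type II error rate are all statements about the data given one hypothesis or another, not about the hypotheses themselves:

```python
from scipy import stats

# Toy setup: n measurements with known noise sigma; null H0: mean = 0,
# alternative H1: mean = 0.5. One-sided test at 1% significance.
n, sigma, alpha = 25, 1.0, 0.01
se = sigma / n ** 0.5

# The Type I error rate is fixed by construction: reject H0 whenever the
# sample mean exceeds this critical value, which happens 1% of the time
# when H0 is actually true.
critical = stats.norm.ppf(1 - alpha, loc=0.0, scale=se)

# Power = P(reject H0 | H1 true); the Type II error rate is 1 - power.
power = 1 - stats.norm.cdf(critical, loc=0.5, scale=se)
print(f"critical sample mean:     {critical:.3f}")
print(f"power against mean = 0.5: {power:.3f}")
print(f"Type II error rate:       {1 - power:.3f}")
# None of these numbers is P(H0 is true | data): they are all statements
# about the data given one hypothesis or the other.
```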

If all this stuff about significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. So is the notion, which stems from this frequentist formulation, that all a scientist can ever hope to do is refute their null hypothesis. You’ll find this view echoed in the philosophical approach of Karl Popper and it has heavily influenced the way many scientists see the scientific method, unfortunately.

The asymmetrical way that the null and alternative hypotheses are treated in the frequentist framework is not helpful, in my opinion. Far better to adopt a Bayesian framework in which probability represents the extent to which measurements or other data support a given theory. New statistical evidence can make two hypotheses either more or less probable relative to each other. The focus is not just on rejecting a specific model, but on comparing two or more models in a mutually consistent way. The key notion is not falsifiability, but testability. Data that fail to reject a hypothesis can properly be interpreted as supporting it, i.e. by making it more probable, but such reasoning can only be done consistently within the Bayesian framework.

What remains true, however, is that the null hypothesis (or indeed any other hypothesis) can never be proven with certainty; that is true whenever probabilistic reasoning is involved. Sometimes, though, the weight of supporting evidence is so strong that inductive logic compels us to regard our theory or model or hypothesis as virtually certain. That applies whether the evidence consists of actual measurements or of non-detections; to a Bayesian, absence of evidence can be (and indeed often is) evidence of absence. The sun rises every morning and sets every evening; it is silly to argue that this provides us with no grounds for arguing that it will do so tomorrow. Likewise, the sonar surveys and other investigations in Loch Ness provide us with evidence that favours the hypothesis that there isn’t a Monster over virtually every specific hypothetical Monster that has been suggested.

It is perfectly sensible to use this reasoning to infer that there is no Loch Ness Monster. Probably.
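
In case it helps, here’s the Bayesian book-keeping for a string of non-detections in a toy version of the Monster problem; the per-survey detection probability and the even prior odds are entirely made up for illustration:

```python
# A toy Bayesian version of "absence of evidence". Assume (purely for
# illustration) that if a Monster exists each sonar survey detects it with
# probability p_detect, and that a false detection is impossible.
p_detect = 0.3
p_monster = 0.5        # start from even prior odds

for survey in range(1, 11):
    # Likelihood of yet another non-detection under each hypothesis:
    like_monster = 1 - p_detect     # Monster there, but missed again
    like_no_monster = 1.0           # nothing there, nothing to see
    p_monster = (p_monster * like_monster /
                 (p_monster * like_monster + (1 - p_monster) * like_no_monster))
    print(f"after survey {survey:2d} with no detection: P(Monster) = {p_monster:.3f}")
# Each non-detection counts as evidence, and the probability of a Monster
# falls steadily, though it never reaches exactly zero.
```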

Science’s Dirtiest Secret?

Posted in Bad Statistics, The Universe and Stuff with tags , , , on March 19, 2010 by telescoper

My attention was drawn yesterday to an article, in a journal I never read called American Scientist, about the role of statistics in science. Since this is a theme I’ve blogged about before I had a quick look at the piece and quickly came to the conclusion that the article was excruciating drivel. However, looking at it again today, my opinion of it has changed. I still don’t think it’s very good, but it didn’t make me as cross second time around. I don’t know whether this is because I was in a particularly bad mood yesterday, or whether the piece has been edited. But although it didn’t make me want to scream, I still think it’s a poor article.

Let me start with the opening couple of paragraphs:

For better or for worse, science has long been married to mathematics. Generally it has been for the better. Especially since the days of Galileo and Newton, math has nurtured science. Rigorous mathematical methods have secured science’s fidelity to fact and conferred a timeless reliability to its findings.

During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos. Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.

In terms of historical accuracy, the author, Tom Siegfried, gets off to a very bad start. Science didn’t get “seduced” by statistics.  As I’ve already blogged about, scientists of the calibre of Gauss and Laplace – and even Galileo – were instrumental in inventing statistics.

And what were the “modes of calculation that had served it so faithfully” anyway? Scientists have long  recognized the need to understand the behaviour of experimental errors, and to incorporate the corresponding uncertainty in their analysis. Statistics isn’t a “mutant form of math”, it’s an integral part of the scientific method. It’s a perfectly sound discipline, provided you know what you’re doing…

And that’s where, despite the sloppiness of his argument,  I do have some sympathy with some of what  Siegfried says. What has happened, in my view, is that too many people use statistical methods “off the shelf” without thinking about what they’re doing. The result is that the bad use of statistics is widespread. This is particularly true in disciplines that don’t have a well developed mathematical culture, such as some elements of biosciences and medicine, although the physical sciences have their own share of horrors too.

I’ve had a run-in myself with the authors of a paper in neurobiology who based extravagant claims on an inappropriate statistical analysis.

What is wrong is therefore not the use of statistics per se, but the fact that too few people understand – or probably even think about – what they’re trying to do (other than publish papers).

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions. Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

Quite, but what does this mean for “science’s dirtiest secret”? Not that it involves statistical reasoning, but that large numbers of scientists haven’t a clue what they’re doing when they do a statistical test. And if this is the case with practising scientists, how can we possibly expect the general public to make sense of what is being said by the experts? No wonder people distrust scientists when so many results, confidently announced on the basis of totally spurious arguments, turn out to be wrong.

The problem is that the “standard” statistical methods shouldn’t be “standard”. It’s true that there are many methods that work in a wide range of situations, but simply assuming they will work in any particular one without thinking about it very carefully is a very dangerous strategy. Siegfried discusses examples where the use of “p-values” leads to incorrect results. It doesn’t surprise me that such examples can be found, as the misinterpretation of p-values is rife even in numerate disciplines, and matters get worse for those practitioners who combine p-values from different studies using meta-analysis, a method which has no mathematical motivation whatsoever and which should be banned. So indeed should a whole host of other frequentist methods which offer limitless opportunities to make a complete botch of the data arising from a research project.
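
A quick simulation shows why “p < 0.05” cannot be read as “a 95% chance the effect is real”. The numbers below (10% of studies chasing a real effect, modest sample sizes) are invented purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# All numbers invented: 10,000 studies, each a two-sample t-test with n = 30
# per arm, and only 10% of the studies are chasing a real effect (size 0.5).
n_studies, n, effect = 10_000, 30, 0.5
is_real = rng.random(n_studies) < 0.1

false_pos = true_pos = 0
for real in is_real:
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect if real else 0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        true_pos += real
        false_pos += not real

share_false = false_pos / (false_pos + true_pos)
print(f"'significant' results that are actually false positives: {share_false:.1%}")
# A p-value below 0.05 does not mean a 95% chance the effect is real; the
# answer depends on the prior fraction of true effects and on the power.
```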

Siegfried goes on

Nobody contends that all of science is wrong, or that it hasn’t compiled an impressive array of truths about the natural world. Still, any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical.

Any single scientific study done alone is quite likely to be incorrect. Really? Well, yes, if it is done incorrectly. But the point is not that studies are incorrect because they use statistics; they are incorrect because they are done incorrectly. Many scientists don’t even understand the statistics well enough to realise that what they’re doing is wrong.

If I had my way, scientific publications – especially in disciplines that impact directly on everyday life, such as medicine – should adopt a much more rigorous policy on statistical analysis and on the way statistical significance is reported. I favour the setting up of independent panels whose responsibility is to do the statistical data analysis on behalf of those scientists who can’t be trusted to do it correctly themselves.

Having started badly, and lost its way in the middle, the article ends disappointingly too. Having led us through a wilderness of failed frequentist analyses, he finally arrives at a discussion of the superior Bayesian methodology, in irritatingly half-hearted fashion.

But Bayesian methods introduce a confusion into the actual meaning of the mathematical concept of “probability” in the real world. Standard or “frequentist” statistics treat probabilities as objective realities; Bayesians treat probabilities as “degrees of belief” based in part on a personal assessment or subjective decision about what to include in the calculation. That’s a tough placebo to swallow for scientists wedded to the “objective” ideal of standard statistics….

Conflict between frequentists and Bayesians has been ongoing for two centuries. So science’s marriage to mathematics seems to entail some irreconcilable differences. Whether the future holds a fruitful reconciliation or an ugly separation may depend on forging a shared understanding of probability.

The difficulty with this piece as a whole is that it reads as an anti-science polemic: “Some science results are based on bad statistics, therefore statistics is bad and science that uses statistics is bogus.” I don’t know whether that’s what the author intended, or whether it was just badly written.

I’d say the true state of affairs is different. A lot of bad science is published, and a lot of that science is bad because it uses statistical reasoning badly. You wouldn’t however argue that a screwdriver is no use because some idiot tries to hammer a nail in with one.

Only a bad craftsman blames his tools.

Test Odds

Posted in Cricket with tags , , on August 24, 2009 by telescoper

I’m very grateful to Daniel Mortlock for sending me this fascinating plot. It comes from the cricket pages of The Times Online and it shows how the probability of the various possible outcomes of the Final Ashes Test at the Oval evolved with time according to their “Hawk-Eye Analysis”.

[Hawk-Eye plot from The Times Online showing how the probabilities of the three possible outcomes evolved during the match]

 I think I should mention that Daniel is an Australian supporter, so this graph must make painful viewing for him! Anyway, it’s a fascinating plot, which I read as an application of Bayesian probability.

At the beginning of the match, a prior probability is assigned to each of the three possible outcomes: England win (blue); Australia win (yellow); and Draw (grey). It looks like these are roughly in the ratio 1:2:2. No details are given as to how these were arrived at, but it must have taken into account the fact that Australia thrashed England in the previous match at Headingley. Information from previous Tests at the Oval was presumably also included. I don’t know if the fact that England won the toss and decided to bat first altered the prior odds significantly, but it should have.

Anyway, what happens next depends on how sophisticated a model is used to determine the subsequent evolution of the probabilities. In good Bayesian fashion, information is incorporated in a likelihood function determined by the model and this is used to update the prior to produce a posterior probability, which is then passed on as the prior for the next time step. And so it goes on until the end of the match where, regardless of what prior is chosen, the data force the model to the correct conclusion.
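
We don’t know what Hawk-Eye actually does, but the general scheme is easy to caricature. Everything numerical below (the prior, the likelihood factors for each passage of play) is invented purely to show the mechanics of the sequential updating:

```python
import numpy as np

# Not the Hawk-Eye model (which is not public), just the mechanics: a prior
# over the three outcomes is multiplied by a likelihood for each new passage
# of play and re-normalised, and the posterior becomes the next prior.
outcomes = ["England win", "Australia win", "Draw"]
prob = np.array([0.2, 0.4, 0.4])      # rough 1:2:2 prior read off the plot

def update(prob, likelihood):
    """One Bayesian step: posterior proportional to prior times P(play | outcome)."""
    posterior = prob * np.array(likelihood)
    return posterior / posterior.sum()

# Invented likelihood factors: e.g. quick wickets falling are more probable
# if the bowling side is on course to win.
events = [
    ("England reach 332 fairly quickly", [1.2, 1.0, 0.7]),
    ("Australia 70 without loss",        [0.8, 1.3, 1.0]),
    ("Broad takes quick wickets",        [1.8, 0.5, 0.8]),
    ("Australian partnership builds",    [0.8, 1.3, 1.0]),
]
for label, likelihood in events:
    prob = update(prob, likelihood)
    summary = ", ".join(f"{o}: {p:.2f}" for o, p in zip(outcomes, prob))
    print(f"{label:34s} -> {summary}")
```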

The red dots show the fall of wickets, but the odds fluctuate continually in accord with variables such as scoring rate, number of wickets,  and, presumably, the weather. Some form of difference equation is clearly being used, but we don’t know the details.

England got off to a pretty good start, so their probability to win started to creep up, but not by all that much, presumably because the model didn’t think their first-innings total of 332 was enough against a good batting side like Australia. However, the odds of a draw fell more significantly as a result of fairly quick scoring and the lack of any rain delays.

When the Australians batted they were going well at the start so England’s probability to win started to fall and theirs to rise. But when they started to lose quick wickets (largely to Stuart Broad), the blue and yellow trajectories swap over and England became favourites by a large margin. Despite a wobble when they lost 3 early wickets and some jitters when Australia’s batsmen put healthy partnerships together, England remained the more probable to win from that point to the end.

Although it all basically makes some sense, there are some curiosities. Daniel Mortlock asked, for example, whether Australia were really as likely to win at about 200 for 2 on the fourth day as England were when Australia were 70 without loss in the first innings? That’s what the graph seems to say. His reading of this is that too much stock is placed in the difficulty of breaking a big (100+ runs) partnership, as the curves seem to “accelerate” when the batsmen seem to be doing well.

I wonder how new information is included in general terms. Australia’s poor first innings batting (160 all out) in any case only reduced their win probability to about the level that England started at. How was their batting in the first innings balanced against their performance in the last match?

I’d love to know more about the algorithm used in this analysis, but I suspect it is copyright. There may be a good reason for not disclosing it. I have noticed in recent years that bookmakers have been setting extremely parsimonious odds for cricket outcomes. Gone are the days (Headingley 1981) when bookmakers offered 500-1 against England to beat Australia, which they then proceeded to do. In those days the bookmakers relied on expert advisors to fix their odds. I believe it was the late Godfrey Evans who persuaded them to offer 500-1. I’m not sure if they ever asked him again!

The system on which Hawk-Eye is based is much more conservative. Even on the last day of the Test, the odds against an Australian victory remained around 4-1 until they were down to their last few wickets. Notice also that the odds against a draw were never as long as they should have been, even when that outcome was clearly all but impossible. On the morning of the final day I could only find 10-1 against the draw, which I think is remarkably ungenerous. However, even with an England victory a near certainty you could still find odds like 1-4. It seems like the system doesn’t like to produce extremely long or extremely short odds.

Perhaps the bookies are now using analyses like this to set their odds, which explains why betting on cricket isn’t as much fun as it used to be. On the other hand, if the system is predisposed against very short odds then maybe that’s the kind of bet to make in order to win. Things like this may be why the algorithm behind Hawkeye isn’t published…

A Mountain of Truth

Posted in Bad Statistics, The Universe and Stuff with tags , , , , on August 1, 2009 by telescoper

I spent the last week at a conference in a beautiful setting amidst the hills overlooking the small town of Ascona by Lake Maggiore in the canton of Ticino, the Italian-speaking part of Switzerland. To be more precise, we were located in a conference centre called the Centro Stefano Franscini on Monte Verità. The meeting was COSMOSTATS, which aimed

… to bring together world-class leading figures in cosmology and particle physics, as well as renowned statisticians, in order to exchange knowledge and experience in dealing with large and complex data sets, and to meet the challenge of upcoming large cosmological surveys.

Although I didn’t know much about the location beforehand it turns out to have an extremely interesting history, going back about a hundred years. The first people to settle there, around the end of the 19th Century, were anarchists who had sought refuge there during times of political upheaval. The Locarno region had long been a popular place for people with “alternative” lifestyles. Monte Verità (“The Mountain of Truth”) was eventually bought by Henri Oedenkoven, the son of a rich industrialist, and he set up a sort of commune there at which the residents practised vegetarianism, naturism, free love and other forms of behaviour that were intended as a reaction against the scientific and technological progress of the time. From about 1904 onward the centre became a sanatorium where the discipline of psychoanalysis flourished, and it later attracted many artists. In 1927, Baron Eduard von der Heydt took the place over. He was a great connoisseur of Oriental philosophy and an art collector, and he established a large collection at Monte Verità, much of which is still there: when the Baron died in 1956 he left Monte Verità to the local Canton.

Given the bizarre collection of anarchists, naturists, theosophists (and even vegetarians) that used to live in Monte Verità, it is by no means out of keeping with the tradition that it should eventually play host to a conference of cosmologists and statisticians.

The  conference itself was interesting, and I was lucky enough to get to chair a session with three particularly interesting talks in it. In general, though, these dialogues between statisticians and physicists don’t seem to be as productive as one might have hoped. I’ve been to a few now, and although there’s a lot of enjoyable polemic they don’t work too well at changing anyone’s opinion or providing new insights.

We may now have mountains of new data in cosmology and particle physics but that hasn’t always translated into a corresponding mountain of truth. Intervening between our theories and observations lies the vexed question of how best to analyse the data and what the results actually mean. As always, lurking in the background, was the long-running conflict between adherents of the Bayesian and frequentist interpretations of probability. It appears that cosmologists (at least those represented at this meeting) tend to be Bayesian while particle physicists are almost exclusively frequentist. I’ll refrain from commenting on what this might mean. However, I was perplexed by various comments made during the conference about the issue of coverage, which is discussed rather nicely in some detail here. To me the question of whether a Bayesian method has good frequentist coverage properties is completely irrelevant. Bayesian methods ask different questions (actually, ones to which scientists want to know the answer) so it is not surprising that they give different answers. Measuring a Bayesian method according to a frequentist criterion is completely pointless whichever camp you belong to.
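
To see what “coverage” means operationally, here’s a toy experiment (nothing to do with any real cosmological analysis): a 95% Bayesian credible interval for a Gaussian mean, built with an informative prior, judged by the frequentist criterion of how often it contains one fixed true value:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy experiment: repeat a measurement of a Gaussian mean many times with one
# fixed true value, and ask how often a 95% Bayesian credible interval built
# with an informative prior contains that true value.
true_mu, sigma, n = 3.0, 1.0, 5
prior_mu, prior_sigma = 0.0, 1.0      # informative prior, deliberately off-centre

hits, trials = 0, 10_000
for _ in range(trials):
    data = rng.normal(true_mu, sigma, n)
    # Conjugate normal-normal update for the unknown mean:
    post_var = 1.0 / (1.0 / prior_sigma**2 + n / sigma**2)
    post_mu = post_var * (prior_mu / prior_sigma**2 + data.sum() / sigma**2)
    half_width = 1.96 * post_var**0.5
    hits += (post_mu - half_width) <= true_mu <= (post_mu + half_width)

print(f"long-run coverage of the 95% credible interval: {hits / trials:.2f}")
# The interval answers "what should I believe about mu given these data and
# this prior?"; there is no reason its coverage at one fixed mu should be 95%,
# which is why judging a Bayesian method by that criterion misses the point.
```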

The irrelevance of coverage was one thing that the previous residents knew better than some of the conference guests:

[Archive photograph of Monte Verità’s naturist residents]

I’d like to thank  Uros Seljak, Roberto Trotta and Martin Kunz for organizing the meeting in such a  picturesque and intriguing place.

The Doomsday Argument

Posted in Bad Statistics, The Universe and Stuff with tags , , , , , on April 29, 2009 by telescoper

I don’t mind admitting that as I get older I get more and  more pessimistic about the prospects for humankind’s survival into the distant future.

Unless there are major changes in the way this planet is governed, our Earth may become barren and uninhabitable through war or environmental catastrophe. But I do think the future is in our hands, and disaster is, at least in principle, avoidable. In this respect I have to distance myself from a very strange argument that has been circulating among philosophers and physicists for a number of years. It is called the Doomsday argument, and it even has a sizeable wikipedia entry, to which I refer you for more details and variations on the basic theme. As far as I am aware, it was first introduced by the mathematical physicist Brandon Carter and subsequently developed and expanded by the philosopher John Leslie (not to be confused with the TV presenter of the same name). It also re-appeared in slightly different guise through a paper in the serious scientific journal Nature by the eminent physicist Richard Gott. Evidently, for some reason, some serious people take it very seriously indeed.

The Doomsday argument uses the language of probability theory, but it is such a strange argument that I think the best way to explain it is to begin with a more straightforward problem of the same type.

 Imagine you are a visitor in an unfamiliar, but very populous, city. For the sake of argument let’s assume that it is in China. You know that this city is patrolled by traffic wardens, each of whom carries a number on their uniform.  These numbers run consecutively from 1 (smallest) to T (largest) but you don’t know what T is, i.e. how many wardens there are in total. You step out of your hotel and discover traffic warden number 347 sticking a ticket on your car. What is your best estimate of T, the total number of wardens in the city?

 I gave a short lunchtime talk about this when I was working at Queen Mary College, in the University of London. Every Friday, over beer and sandwiches, a member of staff or research student would give an informal presentation about their research, or something related to it. I decided to give a talk about bizarre applications of probability in cosmology, and this problem was intended to be my warm-up. I was amazed at the answers I got to this simple question. The majority of the audience denied that one could make any inference at all about T based on a single observation like this, other than that it  must be at least 347.

 Actually, a single observation like this can lead to a useful inference about T, using Bayes’ theorem. Suppose we have really no idea at all about T before making our observation; we can then adopt a uniform prior probability. Of course there must be an upper limit on T. There can’t be more traffic wardens than there are people, for example. Although China has a large population, the prior probability of there being, say, a billion traffic wardens in a single city must surely be zero. But let us take the prior to be effectively constant. Suppose the actual number of the warden we observe is t. Now we have to assume that we have an equal chance of coming across any one of the T traffic wardens outside our hotel. Each value of t (from 1 to T) is therefore equally likely. I think this is the reason that my astronomers’ lunch audience thought there was no information to be gleaned from an observation of any particular value, i.e. t=347.

 Let us simplify this argument further by allowing two alternative “models” for the frequency of Chinese traffic wardens. One has T=1000, and the other (just to be silly) has T=1,000,000. If I find number 347, which of these two alternatives do you think is more likely? Think about the kind of numbers that occupy the range from 1 to T. In the first case, most of the numbers have 3 digits. In the second, most of them have 6. If there were a million traffic wardens in the city, it is quite unlikely you would find a random individual with a number as small as 347. If there were only 1000, then 347 is just a typical number. There are strong grounds for favouring the first model over the second, simply based on the number actually observed. To put it another way, we would be surprised to encounter number 347 if T were actually a million. We would not be surprised if T were 1000.

 One can extend this argument to the entire range of possible values of T, and ask a more general question: if I observe traffic warden number t what is the probability I assign to each value of T? The answer is found using Bayes’ theorem. The prior, as I assumed above, is uniform. The likelihood is the probability of the observation given the model. If I assume a value of T, the probability P(t|T) of each value of t (up to and including T) is just 1/T (since each of the wardens is equally likely to be encountered). Bayes’ theorem can then be used to construct a posterior probability of P(T|t). Without going through all the nuts and bolts, I hope you can see that this probability will tail off for large T. Our observation of a (relatively) small value for t should lead us to suspect that T is itself (relatively) small. Indeed it’s a reasonable “best guess” that T=2t. This makes intuitive sense because the observed value of t then lies right in the middle of its range of possibilities.

Before going on, it is worth mentioning one other point about this kind of inference: that it is not at all powerful. Note that the likelihood just varies as 1/T. That of course means that small values are favoured over large ones. But note that this probability is uniform in logarithmic terms. So although T=1000 is more probable than T=1,000,000, the range between 1000 and 10,000 is roughly as likely as the range between 1,000,000 and 10,000,000, assuming there is no prior information. So although it tells us something, it doesn’t actually tell us very much. Just like any probabilistic inference, there’s a chance that it is wrong, perhaps very wrong.

What does all this have to do with Doomsday? Instead of traffic wardens, we want to estimate N, the number of humans that will ever be born. Following the same logic as in the example above, I assume that I am a “randomly” chosen individual drawn from the sequence of all humans to be born, in past, present and future. For the sake of argument, assume I am number n in this sequence. The logic I explained above should lead me to conclude that the total number N is not much larger than my number, n. For the sake of argument, assume that I am the one-billionth human to be born, i.e. n=1,000,000,000. There should not be many more than a few billion humans ever to be born. At the rate of current population growth, this means that not many more generations of humans remain to be born. Doomsday is nigh.

Richard Gott’s version of this argument is logically similar, but is based on timescales rather than numbers. If whatever thing we are considering begins at some time t_begin and ends at a time t_end, and if we observe it at a “random” time between these two limits, then our best estimate for its future duration is of order how long it has lasted up until now. Gott gives the example of Stonehenge, which was built about 4,000 years ago: we should expect it to last a few thousand years into the future. Actually, Stonehenge is a highly dubious example. It hasn’t really survived 4,000 years. It is a ruin, and nobody knows its original form or function. However, the argument goes that if we come across a building put up about twenty years ago, presumably we should think it will come down again (whether by accident or design) in about twenty years time. If I happen to walk past a building just as it is being finished, presumably I should hang around and watch its imminent collapse….

But I’m being facetious.

Following this chain of thought, we would argue that, since humanity has been around a few hundred thousand years, it is expected to last a few hundred thousand years more. Doomsday is not quite as imminent as previously, but in any case humankind is not expected to survive sufficiently long to, say, colonize the Galaxy.

You may reject this type of argument on the grounds that you do not accept my logic in the case of the traffic wardens. If so, I think you are wrong. I would say that if you accept all the assumptions entering into the Doomsday argument then it is an equally valid example of inductive inference. The real issue is whether it is reasonable to apply this argument at all in this particular case. There are a number of related examples that should lead one to suspect that something fishy is going on. Usually the problem can be traced back to the glib assumption that something is “random” when it is not, or when it is not clearly stated what that is supposed to mean.

There are around sixty million British people on this planet, of whom I am one. In contrast there are well over a billion Chinese. If I follow the same kind of logic as in the examples I gave above, I should be very perplexed by the fact that I am not Chinese. After all, the odds are more than 20:1 against me being British, aren’t they?

 Of course, I am not at all surprised by the observation of my non-Chineseness. My upbringing gives me access to a great deal of information about my own ancestry, as well as the geographical and political structure of the planet. This data convinces me that I am not a “random” member of the human race. My self-knowledge is conditioning information and it leads to such a strong prior knowledge about my status that the weak inference I described above is irrelevant. Even if there were a million million Chinese and only a hundred British, I have no grounds to be surprised at my own nationality given what else I know about how I got to be here.

This kind of conditioning information can be applied to history, as well as geography. Each individual is generated by its parents. Its parents were generated by their parents, and so on. The genetic trail of these reproductive events connects us to our primitive ancestors in a continuous chain. A well-informed alien geneticist could look at my DNA and categorize me as an “early human”. I simply could not be born later in the story of humankind, even if it does turn out to continue for millennia. Everything about me – my genes, my physiognomy, my outlook, and even the fact that I am bothering to spend time discussing this so-called paradox – is contingent on my specific place in human history. Future generations will know so much more about the universe and the risks to their survival that they won’t even discuss this simple argument. Perhaps we just happen to be living at the only epoch in human history in which we know enough about the Universe for the Doomsday argument to make some kind of sense, but too little to resolve it.

To see this in a slightly different light, think again about Gott’s timescale argument. The other day I met an old friend from school days. It was a chance encounter, and I hadn’t seen the person for over 25 years. In that time he had married, and when I met him he was accompanied by a baby daughter called Mary. If we were to take Gott’s argument seriously, this was a random encounter with an entity (Mary) that had existed for less than a year. Should I infer that this entity will probably only endure another year or so? I think not. Again, bare numerological inference is rendered completely irrelevant by the conditioning information I have. I know something about babies. When I see one I realise that it is an individual at the start of its life, and I assume that it has a good chance of surviving into adulthood. Human civilization is a baby civilization. Like any youngster, it has dangers facing it. But it is not doomed by the mere fact that it is young.

John Leslie has developed many different variants of the basic Doomsday argument, and I don’t have the time to discuss them all here. There is one particularly bizarre version, however, that I think merits a final word or two because it raises an interesting red herring. It’s called the “Shooting Room”.

 Consider the following model for human existence. Souls are called into existence in groups representing each generation. The first generation has ten souls. The next has a hundred, the next after that a thousand, and so on. Each generation is led into a room, at the front of which is a pair of dice. The dice are rolled. If the score is double-six then everyone in the room is shot and it’s the end of humanity. If any other score is shown, everyone survives and is led out of the Shooting Room to be replaced by the next generation, which is ten times larger. The dice are rolled again, with the same rules. You find yourself called into existence and are led into the room along with the rest of your generation. What should you think is going to happen?

Leslie’s argument is the following. Each generation not only has more members than the previous one, but also contains more souls than have ever existed to that point. For example, the third generation has 1000 souls; the previous two had 10 and 100 respectively, i.e. 110 altogether. Roughly 90% of all humanity lives in the last generation. Whenever the last generation happens, there are bound to be more people in that generation than in all generations up to that point. When you are called into existence you should therefore expect to be in the last generation. You should consequently expect that the dice will show double six and the celestial firing squad will take aim. On the other hand, if you think the dice are fair then each throw is independent of the previous one and a throw of double-six should have a probability of just one in thirty-six. On this basis, you should expect to survive. The odds are against the fatal score.

 This apparent paradox seems to suggest that it matters a great deal whether the future is predetermined (your presence in the last generation requires the double-six to fall) or “random” (in which case there is the usual probability of a double-six). Leslie argues that if everything is pre-determined then we’re doomed. If there’s some indeterminism then we might survive. This isn’t really a paradox at all, simply an illustration of the fact that assuming different models gives rise to different probability assignments.
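To make the tension concrete, here is a small simulation of my own (not Leslie’s) of the fair-dice version of the Shooting Room. Any given generation survives its roll with probability 35/36, yet in a completed run of the game the overwhelming majority of the souls ever created sit in the final, doomed generation.

```python
import random

def doomed_generation(p=1/36):
    """Roll fair dice until a double six; return the index of the generation shot."""
    g = 1
    while random.random() >= p:
        g += 1
    return g

def fraction_in_last_generation(g):
    # Generation k contains 10**k souls, so the last one holds
    # 10**g / (10 + 100 + ... + 10**g) = 9 / (10 - 10**(1 - g)) of everyone.
    return 9.0 / (10.0 - 10.0 ** (1 - g))

runs = [doomed_generation() for _ in range(10000)]
print(sum(map(fraction_in_last_generation, runs)) / len(runs))   # about 0.9
```

Both statements are true at once: the dice are kind to any particular generation, but almost everyone who is ever called into existence nevertheless ends up facing the firing squad.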

While I am on the subject of the Shooting Room, it is worth drawing a parallel with another classic puzzle of probability theory, the St Petersburg Paradox. This is an old chestnut to do with a purported winning strategy for Roulette. It was first proposed by Nicolas Bernoulli but famously discussed at greatest length by Daniel Bernoulli in the pages of Transactions of the St Petersburg Academy, hence the name. It works just as well for a simple toss of a coin as for Roulette, since in the latter game one bets only on red or black rather than on individual numbers.

Imagine you decide to bet such that you win by throwing heads. Your original stake is £1. If you win, the bank pays you at even money (i.e. you get your stake back plus another £1). If you lose, i.e. get tails, your strategy is to play again but bet double. If you win this time you get £4 back but have bet £2+£1=£3 up to that point. If you lose again you bet £4, and if that loses too you bet £8. If you win at that stage you get £16 back but have paid in £8+£4+£2+£1=£15 to that point. Clearly, if you carry on the strategy of doubling your previous stake each time you lose, when you do eventually win you will be ahead by £1. It’s a guaranteed winner. Isn’t it?

 The answer is yes, as long as you can guarantee that the number of losses you will suffer is finite. But in tosses of a fair coin there is no limit to the number of tails you can throw before getting a head. To get the correct probability of winning you have to allow for all possibilities. So what is your expected stake to win this £1? The answer is the root of the paradox. The probability that you win straight off is ½ (you need to throw a head), and your stake is £1 in this case so the contribution to the expectation is £0.50. The probability that you win on the second go is ¼ (you must lose the first time and win the second so it is ½ times ½) and your stake this time is £2 so this contributes the same £0.50 to the expectation. A moment’s thought tells you that each throw contributes the same amount, £0.50, to the expected stake. We have to add this up over all possibilities, and there are an infinite number of them. The result of summing them all up is therefore infinite. If you don’t believe this just think about how quickly your stake grows after only a few losses: £1, £2, £4, £8, £16, £32, £64, £128, £256, £512, £1024, etc. After only ten losses you are staking over a thousand pounds just to get your pound back. Sure, you can win £1 this way, but you need to expect to stake an infinite amount to guarantee doing so. It is not a very good way to get rich.
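Here is a short sketch of that expectation sum (my own illustration): the stake on the n-th toss is 2^(n-1) and the probability of winning on that toss is (1/2)^n, so every toss contributes exactly £0.50 and the sum has no finite limit.

```python
def expected_stake(max_tosses):
    """Sum of the per-toss contributions, truncated at max_tosses."""
    return sum((0.5 ** n) * (2 ** (n - 1)) for n in range(1, max_tosses + 1))

for n in (10, 100, 1000):
    print(n, expected_stake(n))   # 5.0, 50.0, 500.0 -- grows as n/2 without bound
```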

The relationship of all this to the Shooting Room is that it shows it is dangerous to pre-suppose a finite value for a number which in principle could be infinite. If the number of souls that could be called into existence is allowed to be infinite, then any individual has no chance at all of being called into existence in any generation!

Amusing as they are, the thing that makes me most uncomfortable about these Doomsday arguments is that they attempt to determine the probability of an event without any reference to an underlying mechanism. For me, a valid argument about Doomsday would have to involve a particular physical cause for the extinction of humanity (e.g. asteroid impact, climate change, nuclear war, etc). Given this physical mechanism one should construct a model within which one can estimate probabilities for the model parameters (such as the rate of occurrence of catastrophic asteroid impacts). Only then can one make a valid inference based on relevant observations and their associated likelihoods. Such calculations may indeed lead to alarming or depressing results. I fear that the greatest risk to our future survival is not from asteroid impact or global warming, where the chances can be estimated with reasonable precision, but self-destructive violence carried out by humans themselves. Science has no way of predicting what atrocities people are capable of, so we can’t make any reliable estimate of the probability that we will self-destruct. But the absence of any specific mechanism in the versions of the Doomsday argument I have discussed robs them of any scientific credibility at all.

There are better grounds for worrying about the future than mere numerology.

Arrows and Demons

Posted in The Universe and Stuff with tags , , , , , on April 12, 2009 by telescoper

My recent post about randomness and non-randomness spawned a lot of comments over on cosmic variance about the nature of entropy. I thought I’d add a bit about that topic here, mainly because I don’t really agree with most of what is written in textbooks on this subject.

The connection between thermodynamics (which deals with macroscopic quantities) and statistical mechanics (which explains these in terms of microscopic behaviour) is a fascinating but troublesome area. Although James Clerk Maxwell did much to establish the microscopic meaning of the first law of thermodynamics, he never tried to develop the second law from the same standpoint. Those that did were faced with a conundrum.

 

The behaviour of a system of interacting particles, such as the particles of a gas, can be expressed in terms of a Hamiltonian H which is constructed from the positions and momenta of its constituent particles. The resulting equations of motion are quite complicated because every particle, in principle, interacts with all the others. They do, however, possess a simple yet important property. Everything is reversible, in the sense that the equations of motion remain the same if one changes the direction of time and changes the direction of motion for all the particles. Consequently, one cannot tell whether a movie of atomic motions is being played forwards or backwards.

Hamiltonian evolution also preserves volumes in phase space (Liouville’s theorem), so the probability distribution describing our incomplete knowledge of the microstate is simply carried along by the flow. This means that the Gibbs entropy constructed from that distribution is actually a constant of the motion: it neither increases nor decreases during Hamiltonian evolution.

But what about the second law of thermodynamics? This tells us that the entropy of a system tends to increase. Our everyday experience tells us this too: we know that physical systems tend to evolve towards states of increased disorder. Heat never passes from a cold body to a hot one. Pour milk into coffee and everything rapidly mixes. How can this directionality in thermodynamics be reconciled with the completely reversible character of microscopic physics?

The answer to this puzzle is surprisingly simple, as long as you use a sensible interpretation of entropy that arises from the idea that its probabilistic nature represents not randomness (whatever that means) but incompleteness of information. I’m talking, of course, about the Bayesian view of probability.

 First you need to recognize that experimental measurements do not involve describing every individual atomic property (the “microstates” of the system), but large-scale average things like pressure and temperature (these are the “macrostates”). Appropriate macroscopic quantities are chosen by us as useful things to use because they allow us to describe the results of experiments and measurements in a  robust and repeatable way. By definition, however, they involve a substantial coarse-graining of our description of the system.

Suppose we perform an idealized experiment that starts from some initial macrostate. In general this will be consistent with a number – probably a very large number – of initial microstates. As the experiment continues the system evolves along a Hamiltonian path so that the initial microstate will evolve into a definite final microstate. This is perfectly symmetrical and reversible. But the point is that we can never have enough information to predict exactly where in the final phase space the system will end up, because we haven’t specified all the details of which initial microstate we were in. Determinism does not in itself allow predictability; you need information too.

If we choose macro-variables so that our experiments are reproducible it is inevitable that the set of microstates consistent with the final macrostate will usually be larger than the set of microstates consistent with the initial macrostate, at least  in any realistic system. Our lack of knowledge means that the probability distribution of the final state is smeared out over a larger phase space volume at the end than at the start. The entropy thus increases, not because of anything happening at the microscopic level but because our definition of macrovariables requires it.

[Figure: sets of initial and final microstates connected by Hamiltonian evolution.]

This is illustrated in the Figure. Each individual microstate in the initial collection evolves into one state in the final collection: the narrow arrows represent Hamiltonian evolution.

 

However, given only a finite amount of information about the initial state, these trajectories can’t be as well defined as this. The set of final microstates has to acquire a sort of “buffer zone” around the strictly Hamiltonian core; this is the only way to ensure that measurements on such systems will be reproducible.

The “theoretical” Gibbs entropy remains exactly constant during this kind of evolution, and it is precisely this property that requires the experimental entropy to increase. There is no microscopic explanation of the second law. It arises from our attempt to shoe-horn microscopic behaviour into a framework furnished by macroscopic experiments.
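To illustrate the point, here is a toy example of my own (nothing from the original discussion): the microstates are integers, the “dynamics” is a fixed random permutation – deterministic and perfectly reversible – and the macrostates are blocks of consecutive microstates. The fine-grained (Gibbs) entropy of the probability distribution is exactly conserved by the dynamics, while the coarse-grained entropy computed from the macrostate probabilities is not.

```python
import random
from math import log

random.seed(1)
N, N_BINS = 10000, 10
BLOCK = N // N_BINS
step = list(range(N))
random.shuffle(step)                      # one deterministic, reversible time-step

def entropy(p):
    return -sum(x * log(x) for x in p if x > 0)

def coarse(p):                            # macrostate (bin) probabilities
    return [sum(p[b * BLOCK:(b + 1) * BLOCK]) for b in range(N_BINS)]

# Initial knowledge: only the macrostate, i.e. uniform over the first bin.
p = [1.0 / BLOCK if i < BLOCK else 0.0 for i in range(N)]
print(entropy(p), entropy(coarse(p)))     # fine-grained ~ log(1000), coarse-grained = 0

p_new = [0.0] * N
for i, x in enumerate(p):
    p_new[step[i]] = x                    # evolve the distribution one step
print(entropy(p_new), entropy(coarse(p_new)))  # fine-grained unchanged, coarse-grained ~ log(10)
```

The permutation could be run backwards just as easily, but an observer who only ever measures which bin the system occupies still sees the coarse-grained entropy go up.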

Another, perhaps even more compelling demonstration of the so-called subjective nature of probability (and hence entropy) is furnished by Maxwell’s demon. This little imp first made its appearance in 1867 or thereabouts and subsequently led a very colourful and influential life. The idea is extremely simple: imagine we have a box divided into two partitions, A and B. The wall dividing the two sections contains a tiny door which can be opened and closed by a “demon” – a microscopic being “whose faculties are so sharpened that he can follow every molecule in its course”. The demon wishes to play havoc with the second law of thermodynamics so he looks out for particularly fast moving molecules in partition A and opens the door to allow them (and only them) to pass into partition B. He does the opposite thing with partition B, looking out for particularly sluggish molecules and opening the door to let them into partition A when they approach.

The net result of the demon’s work is that the fast-moving particles are preferentially moved from A into B while the slower particles gradually drift from B into A, so the average kinetic energy of the A molecules steadily decreases while that of the B molecules increases. In effect, heat is transferred from a cold body to a hot body, something that is forbidden by the second law.

All this talk of demons probably makes this sound rather frivolous, but it is a serious paradox that puzzled many great minds until it was resolved in 1929 by Leo Szilard. He showed that the second law of thermodynamics would not actually be violated if the entropy of the entire system (i.e. box + demon) increased by an amount (of order k log 2, it turns out) every time the demon measured the speed of a molecule in order to decide whether to let it through from one side of the box to the other. This amount of entropy is precisely enough to balance the apparent decrease in entropy caused by the gradual migration of fast molecules from A into B. This illustrates very clearly that there is a real connection between the demon’s state of knowledge and the physical entropy of the system.

By now it should be clear why there is some sense of the word subjective that does apply to entropy. It is not subjective in the sense that anyone can choose entropy to mean whatever they like, but it is subjective in the sense that it is something to do with the way we manage our knowledge about nature rather than about nature itself. I know from experience, however, that many physicists feel very uncomfortable about the idea that entropy might be subjective even in this sense.

On the other hand, I feel completely comfortable about the notion; I even think it’s obvious. To see why, consider the example I gave above about pouring milk into coffee. We are all used to the idea that the nice swirly pattern you get when you first pour the milk in is a state of relatively low entropy. The parts of the phase space of the coffee + milk system that contain such nice separations of black and white are few and far between. It’s much more likely that the system will end up as a “mixed” state. But then how well mixed the coffee is depends on your ability to resolve the size of the milk droplets. An observer with good eyesight would see less mixing than one with poor eyesight. And an observer who couldn’t perceive the difference between milk and coffee would see perfect mixing. In this case entropy, like beauty, is definitely in the eye of the beholder.

The refusal of many physicists to accept the subjective nature of entropy arises, as do so many misconceptions in physics, from the wrong view of probability.

False Positives

Posted in Bad Statistics with tags , , on February 27, 2009 by telescoper

There was an interesting article on the BBC website this week that, for once, contains an example of a reasonable discussion of statistics in the mass media. I’m indebted to my friend Anton for pointing it out to me. I’ve filed it along with examples of Bad Statistics because the issue is usually poorly explained. I don’t think the article itself is bad. In fact, it’s rather good.

The question is all about cancer screening, specifically for breast cancer, but the lesson could apply to a host of other situations. In the original context, the question goes as follows:

Say that routine screening is 90% accurate. Say you have a positive test. What’s the chance that your positive test is accurate and you really have cancer?

Presumably there will be many of you that think the answer is 90%. Hands up if you think this!

If you don’t think it’s 90% then what do you think it is?

The correct answer is that you have no idea. I haven’t given you enough information.

To see why, imagine that the prevalence of cancer in the population is such that 1% of a randomly selected sample will have it. Out of a thousand people one would expect that, on average, ten would have cancer. If the test is 90% accurate then 9 of these will show positive signs and only one won’t.

However, 990 people out of the original thousand don’t have cancer. If the test is only 90% accurate then 10% of them, i.e. 99, will show a false positive.

Thus the total number of positive tests is 108, and only 9 of the individuals concerned actually have cancer. The probability that you have cancer, given your positive test, is therefore 9/108 – only about a 1-in-12 chance.

But that depends on my assumption about the overall rate in the population. If that number is different it changes the odds. Without this information, the problem is ill-posed.

The more general way of looking at this is in terms of conditional probabilities. What you are given is that P(positive test| cancer)=P(+|C)=0.9 and P(negative test|no cancer)=0.9, while P(negative test|cancer)= 0.1 and P(positive test|no cancer)=P(+|N)=0.1. What you want to know is P(cancer|positive test)=P(C|+). This can be obtained from Bayes’ Theorem but only if you know P(cancer)=P(C)=1-P(N), since people either have cancer or they don’t.

The answer is given by P(C|+) = P(C)P(+|C)/[P(C)P(+|C)+P(N)P(+|N)], which for the numbers I gave above is 0.01 x 0.9/[0.01 x 0.9 + 0.99 x 0.1] = 0.009/[0.009 + 0.099] ≈ 0.083, the same answer as before.
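Here is the same calculation as a short sketch of mine, using the illustrative numbers above (1% prevalence, and a test that is 90% accurate both for those with and without the condition):

```python
def prob_cancer_given_positive(prevalence, sensitivity, specificity):
    """P(C|+) from Bayes' theorem."""
    p_pos_c = sensitivity               # P(+|C)
    p_pos_n = 1.0 - specificity         # P(+|N)
    p_c, p_n = prevalence, 1.0 - prevalence
    return p_c * p_pos_c / (p_c * p_pos_c + p_n * p_pos_n)

print(prob_cancer_given_positive(0.01, 0.9, 0.9))   # about 0.083, i.e. roughly 1 in 12
```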

So the moral is that you shouldn’t panic if you get a positive result from a screening test of this type. As long as the condition being tested for is rare compared with the test’s error rate, the chances are high that you’ve got nothing to worry about. But of course you should follow it up with more detailed tests.

The Bayesian way is the easy way!

On the Cards

Posted in Uncategorized with tags , , , , , , , on January 27, 2009 by telescoper

After an interesting chat yesterday with a colleague about the difficulties involved in teaching probabilities, I thought it might be fun to write something about card games. Actually, much of science is intimately concerned with statistical reasoning and if any one activity was responsible for the development of the theory of probability, which underpins statistics, it was the rise of games of chance in the 16th and 17th centuries. Card, dice and lottery games still provide great examples of how to calculate probabilities, a skill which is very important for a physicist.

For those of you who did not misspend your youth playing with cards like I did, I should remind you that a standard pack of playing cards has 52 cards. There are 4 suits: clubs (♣), diamonds (♦), hearts (♥) and spades (♠). Clubs and spades are coloured black, while diamonds and hearts are red. Each suit contains thirteen cards, including an Ace (A), the plain numbered cards (2, 3, 4, 5, 6, 7, 8, 9 and 10), and the face cards: Jack (J), Queen (Q), and King (K). In most games the most valuable is the Ace, followed by King, Queen and Jack and then from 10 down to 2.

I’ll start with Poker, because it seems to be one of the simplest ways of losing money these days. Imagine I start with a well-shuffled pack of 52 cards. In a game of five-card draw poker, the players essentially bet on who has the best hand made from five cards drawn from the pack. In more complicated versions of poker, such as Texas hold’em, one has, say, two “private” cards in one’s hand and, say, five on the table in plain view. These community cards are usually revealed in stages, allowing a round of betting at each stage. One has to make the best hand one can using five cards from one’s private cards and those on the table. The existence of community cards makes this very interesting because it gives some additional information about other players’ holdings. For the present discussion, however, I will just stick to individual hands and their probabilities.

How many different five-card poker hands are possible?

To answer this question we need to know about permutations and combinations. Imagine constructing a poker hand from a standard deck. The deck is full when you start, which gives you 52 choices for the first card of your hand. Once that is taken you have 51 choices for the second, and so on down to 48 choices for the last card. One might think the answer is therefore 52×51×50×49×48 = 311,875,200, but that’s not right because it doesn’t actually matter which order your five cards are dealt to you.

Suppose you have 4 aces and the 2 of clubs in your hand; the sequences (A♣, A♥, A♦, A♠, 2♣) and (A♥, 2♣, A♠, A♦, A♣) are counted as distinct hands among the number I obtained above. There are many other possibilities like this where the cards are the same but the order is different. In fact there are 5×4×3×2×1 = 120 such permutations. Mathematically this is denoted 5!, or five-factorial. Dividing the number above by this gives the actual number of possible five-card poker hands: 2,598,960. This number is important because it describes the size of the “possibility space”. Each of these hands is a possible poker deal, and each is assumed to be “equally likely”, unless the dealer is cheating.

This calculation is an example of a mathematical combination as opposed to a permutation. The number of combinations one can make of r things chosen from a set of n is usually denoted Cn,r. In the example above, r=5 and n=52. Note that 52×51×50×49×48 can be written 52!/47!. The general result for the number of combinations can likewise be written Cn,r = n!/[(n−r)! r!].
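As a quick check of the arithmetic, here is a sketch of mine using Python’s standard library:

```python
import math

ordered = 52 * 51 * 50 * 49 * 48             # ordered five-card deals: 311,875,200
print(ordered // math.factorial(5))          # divide out the 5! orderings: 2,598,960
print(math.comb(52, 5))                      # the same number via C(n,r) = n!/((n-r)! r!)
```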

Poker hands are characterized by the occurrence of particular events of varying degrees of probability. For example, a flush is five cards of the same suit but not in sequence (e.g. 2♠, 4♠, 7♠, 9♠, Q♠). A numerical sequence of cards regardless of suit (e.g. 7♣, 8♠, 9♥, 10♦, J♥) is called a straight. A sequence of cards of the same suit is called a straight flush. One can also have a pair of cards of the same value, or two pairs, or three of a kind, or four of a kind, or a full house which is three of one kind and two of another. One can also have nothing at all, i.e. not even a pair.

The relative value of the different hands is determined by how probable they are, and to work that out takes quite a bit of effort.

Consider the probability of getting, say, 5 spades (in other words, a spade flush). To do this we have to calculate the number of distinct hands that have this composition. There are 13 spades in the deck to start with, so there are 13×12×11×10×9 permutations of 5 spades drawn from the pack, but, because of the possible internal rearrangements, we have to divide again by 5! The result is that there are 1287 possible hands containing 5 spades. Not all of these are mere flushes, however. Some of them will include sequences too, e.g. 8♠, 9♠, 10♠, J♠, Q♠, which makes them straight flushes. There are only 10 possible straight flushes in spades (the lowest card of the sequence can be the Ace – counted low – or anything from 2 up to 10), so only 1277 of the possible hands counted above are just flushes. This logic can apply to any of the suits, so in all there are 1277×4=5108 flush hands and 10×4=40 straight flush hands.
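The same counts can be reproduced directly (again, just an illustrative sketch of my own):

```python
import math

per_suit = math.comb(13, 5)                        # five cards from one suit: 1287
straight_flushes_per_suit = 10                     # lowest card A (counted low) up to 10
print(4 * (per_suit - straight_flushes_per_suit))  # plain flushes: 5108
print(4 * straight_flushes_per_suit)               # straight flushes: 40
```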

I won’t go through the details of calculating the probability of the other types of hand, but I’ve included a table showing their probabilities obtained by dividing the relevant number of possibilities by the total number of hands (given at the bottom of the middle column).

TYPE OF HAND       Number of Possible Hands    Probability
Straight Flush                        40          0.000015
Four of a Kind                       624          0.000240
Full House                         3,744          0.001441
Flush                              5,108          0.001965
Straight                          10,200          0.003925
Three of a Kind                   54,912          0.021129
Two Pair                         123,552          0.047539
One Pair                       1,098,240          0.422569
Nothing                        1,302,540          0.501177
TOTALS                         2,598,960          1.000000

Poker involves rounds of betting in which the players, amongst other things, try to assess how likely their hand is to win compared with the others involved in the game. If your hand is weak, you can fold and allow the accumulated bets to be given to your opponents. Alternatively, you can  bluff and bet strongly on a poor hand (even if you have “nothing”) to convince your opponents that your hand is strong. This tactic can be extremely successful in the right circumstances. In the words of the late great Paul Newman in the film Cool Hand Luke,  “sometimes nothing can be a real cool hand”.

If you bet heavily on your hand, the opponent may well think it is strong even if it contains nothing, and fold even if his hand has a higher value. To bluff successfully requires a good sense of timing – it depends crucially on who gets to bet first – and extremely cool nerves. To spot when an opponent is bluffing requires real psychological insight. These aspects of the game are in many ways more interesting than the basic hand probabilities, and they are difficult to reduce to mathematics.

Another card game that serves as a source for interesting problems in probability is Contract Bridge. This is one of the most difficult card games to play well because it is a game of logic that also involves chance to some degree. Bridge is a game for four people, arranged in two teams of two. The four sit at a table with members of each team facing opposite each other. Traditionally the different positions are called North, South, East and West although you don’t actually need a compass to play. North and South are partners, as are East and West.

For each hand of Bridge an ordinary pack of cards is shuffled and dealt out by one of the players, the dealer. Let us suppose that the dealer in this case is South. The pack is dealt out one card at a time to each player in turn, starting with West (to dealer’s immediate left) then North and so on in a clockwise direction. Each player ends up with thirteen cards when all the cards are dealt.

Now comes the first phase of the game, the auction. Each player looks at their cards and makes a bid, which is essentially a coded message that gives information to their partner about how good their hand is. A bid is basically an undertaking to win a certain number of tricks with a certain suit as trumps (or with no trumps). The meaning of tricks and trumps will become clear later. For example, dealer might bid “one spade” which is a suggestion that perhaps he and his partner could win one more trick than the opposition with spades as the trump suit. This means winning seven tricks, as there are always thirteen to be won in a given deal. The next to bid – in this case West – can either pass (saying “no bid”) or bid higher, like an auction. The value of the suits increases in the sequence clubs, diamonds, hearts and spades. So to outbid one spade (1S), West has to bid at least two hearts (2H), say, if hearts is the best suit for him but if South had opened 1C then 1H would have been sufficient to overcall . Next to bid is South’s partner, North. If he likes spades as trumps he can raise the original bid. If he likes them a lot he can jump to a much higher contract, such as four spades (4S).

This is the most straightforward level of Bridge bidding, but in reality there are many bids that don’t mean what they might appear to mean at first sight. Examples include conventional bids  (such as Stayman or Blackwood),  splinter and transfer bids and the rest of the complex lexicon of Bridge jargon. There are some bids to which partner must respond (forcing bids), and those to which a response is discretionary. And instead of overcalling a bid, one’s opponents could “double” either for penalties in the hope that the contract will fail or as a “take-out” to indicate strength in a suit other than the one just bid.

Bidding carries on in a clockwise direction until nobody dares take it higher. Three successive passes will end the auction, and the contract is then established. Whichever player opened the bidding in the suit that was finally chosen for trumps becomes “declarer”. If we suppose our example ended in 4S, then it was South that becomes declarer because he opened the bidding with 1S. If West had overcalled 2 Hearts (2H) and this had passed round the table, West would be declarer.

The scoring system for Bridge encourages teams to go for high contracts rather than low ones, so if one team has the best cards it doesn’t necessarily get an easy ride; it should undertake an ambitious contract rather than stroll through a simple one. In particular there are extra points for making “game” (a contract of four spades, four hearts, five clubs, five diamonds, or three no trumps). There is a huge bonus available for bidding and making a grand slam (an undertaking to win all thirteen tricks, i.e. seven of something) and a smaller but still impressive bonus for a small slam (six of something). This encourages teams to push for a valuable contract: tricks bid and made count a lot more than overtricks even without the slam bonus.

The second phase of the game now starts. The person to the left of declarer plays a card of their choice, possibly following yet another convention, such as “fourth highest of the longest suit”. The player opposite declarer puts all his cards on the table and becomes “dummy”, playing no further part in this particular hand. Dummy’s cards are then entirely under the control of the declarer. All three players can see the cards in dummy, but only declarer can see his own hand. Apart from the role of dummy, the card play is then similar to whist.

Each trick consists of four cards played in clockwise sequence from whoever leads. Each player, including dummy, must follow the suit led if he has a card of that suit in his hand. If a player doesn’t have a card of that suit he may “ruff”, i.e. play a trump card, or simply discard some card (probably of low value) from another suit. Good Bridge players keep a careful track of all discards to improve their knowledge of the cards held by their  opponents. Discards can also be used by the defence (i.e. East and West in this case) to signal to each other. Declarer can see dummy’s cards but the defenders don’t get to see each other’s.

One can win a trick in one of two ways. Either one plays a higher card of the same suit, e.g. K♥ beats 10♥ or anything else in hearts below the King. Aces are high, by the way. Alternatively, if one has no cards of the suit that has been led, one can play a trump (or “ruff”). A trump always beats a card of the original suit, but more than one player may ruff and in that case the highest trump played carries the trick. For instance, East may ruff only to be over-ruffed by South if both have none of the suit led. Of course one may not have any trumps at all, making a ruff impossible. If one has neither the original suit nor a trump one has to discard something from another suit. The possibility of winning a trick by a ruff also does not exist if the contract is of the no-trumps variety.

Whoever wins a given trick leads to start the next one. This carries on until thirteen tricks have been played. Then comes the reckoning of whether the contract has been made. If so, points are awarded to declarer’s team. If not, penalty points are awarded to the defenders which are higher if the contract has been doubled. Then it’s time for another hand, probably another drink, and very possibly an argument about how badly declarer played the hand.

I’ve gone through the game in some detail in an attempt to make it clear why this is such an interesting game for probabilistic reasoning. During the auction, partial information is given about every player’s holding. It is vital to interpret this information correctly if the contract is to be made. The auction can reveal which of the defending team holds important high cards, or whether the trump suit is distributed strangely. Because the cards are played in strict clockwise sequence this matters a lot. On the other hand, even with very firm knowledge about where the important cards lie, one still often has a difficult logical puzzle to solve if all the potential winners in one’s hand are actually to be made into tricks. It can be a very subtle game.

I only have space-time for one illustration of this kind of thing, but it’s another one that is fun to work out. As is true to a lesser extent in poker, one is not really interested in the initial probabilities of the different hands but rather how to update these probabilities using conditional information as it may be revealed through the auction and card play. In poker this updating is done largely by interpreting the bets one’s opponents are making.

Let us suppose that I am South, and I have been daring enough to bid a grand slam in spades (7S). West leads, and North lays down dummy. I look at my hand and dummy, and realise that we have 11 trumps between us, missing only the King (K) and the 2. I have all other suits covered, and enough winners to make the contract provided I can make sure I win all the trump tricks. The King, however, poses a problem. The Ace of Spades will beat the King, but if I just lead the Ace, it may be that one of East or West has both the K and the 2. In this case he would simply play the two to my Ace. The King would be an automatic winner then: as the highest remaining trump it must win a trick eventually. The contract is then doomed.

On the other hand if the spades split 1-1 between East and West then the King drops when I lead the Ace, so that strategy makes the contract. It all depends how the cards split.

But there is a different way to play this situation. Suppose, for example, that A♠ and Q♠ are on the table (in dummy’s hand) and I, as declarer, have managed to win the first trick in my hand. If I think the K♠ lies in West’s hand, I lead a spade. West has to follow suit if he can. If he has the King, and plays it, I can cover it with the Ace so it doesn’t win. If, however, West plays low I can play Q♠. This will win if I am right about the location of the King. Next time I can lead the A♠ from dummy and the King will fall. This play is called a finesse.

But is this better than the previous strategy, playing for the drop? It’s all a question of probabilities, and this in turn boils down to the number of possible deals that allow each strategy to work.

To start with, we need the total number of possible bridge hands. This is quite easy: it’s the number of combinations of 13 objects taken from 52, i.e. C52,13. This is a truly enormous number: over 600 billion. You have to play a lot of games to expect to be dealt the same hand twice!

What we now have to do is evaluate the probability of each possible arrangement of the missing King and two. Dummy and declarer’s hands are known to me. There are 26 remaining cards whose location I do not know. The relevant space of possibilities is now smaller than the original one. I have 26 cards to assign between East and West. There are C26,13 ways of assigning West’s 13 cards, but once I have done this the remaining 13 must be in East’s hand.

Suppose West has the 2 but not the K. Conditional on this assumption, I know one of his cards, but there are 12 others remaining to be assigned. There are therefore C24,12 hands with this possible arrangement of the trumps. Obviously the K has to be with East in this case. The finesse would not work as East would cover the Q with the K, but the K would drop if the A were played.

The opposite situation, with West having the K but not the 2 has the same number of possibilities associated with it. Here West must play the K when a spade is led so it will inevitably lose to the A. South abandons the idea of finessing when West rises and just covers it with the higher card.

Suppose instead West doesn’t have any trumps. There are C24,13 ways of constructing such a hand: 13 cards from the 24 remaining non-trumps. Here the finesse fails because the K is with East but the drop fails too. East plays the 2 on the A and the K becomes a winner.

The remaining possibility is that West has both trumps: this can happen in C24,11 ways. Here the finesse works but the drop fails. If West plays low on the South lead, declarer calls for the Q from dummy to hold the trick. Next lead he plays the A to drop the K.

To turn these counts into probabilities we just divide by the total number of different ways I can construct the hands of East and West, which is C26,13. The results are summarized in the table here.

Spades in West’s hand    Number of hands    Probability    Drop    Finesse
None                         C24,13             0.24        0        0
K                            C24,12             0.26        0.26     0.26
2                            C24,12             0.26        0.26     0
K2                           C24,11             0.24        0        0.24
Total                        C26,13             1.00        0.52     0.50

The last two columns show the contributions of each arrangement to the probability of success of either playing for the drop or the finesse. You can see that the drop is slightly more likely to work than the finesse in this case.

Note, however, that this ignores any information gleaned from the auction, which could be crucial. For example, if West had made a bid then it is more likely that he had cards of some value so this might suggest the K might be in his hand. Note also that the probability of the drop and the probability of the finesse do not add up to one. This is because there are situations where both could work or both could fail.
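The numbers in the table are easy to verify (a quick sketch of my own):

```python
import math

total = math.comb(26, 13)                    # ways to deal West 13 of the 26 unseen cards
p_none = math.comb(24, 13) / total           # West holds neither missing spade
p_k = p_two = math.comb(24, 12) / total      # West holds exactly one of them
p_both = math.comb(24, 11) / total           # West holds both

print(round(p_none, 2), round(p_k, 2), round(p_both, 2))   # 0.24 0.26 0.24
print("drop:   ", round(p_k + p_two, 2))     # works whenever the spades split 1-1: 0.52
print("finesse:", round(p_k + p_both, 2))    # works whenever West has the K: 0.50
```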

This calculation does not mean that the finesse is never the right tactic. It sometimes has much higher probability than the drop, and is often strongly motivated by information the auction has revealed. Calculating the odds precisely, however, gets more complicated the more cards are missing from declarer’s holding. For those of you too lazy to compute the probabilities, the book On Gambling, by Oswald Jacoby contains tables of the odds for just about any bridge situation you can think of.

Finally on the subject of Bridge, I wanted to mention a fact that many people think is paradoxical but which isn’t really. Looking at the table shows that the odds of a 1-1 split in spades here are 0.52:0.48 or 13: 12. This comes from how many cards are in East and West’s hand when the play is attempted. There is a much quicker way of getting this answer than the brute force method I used above. Consider the hand with the spade two in it. There are 12 remaining opportunities in that hand that the spade K might fill, but there are 13 available slots for it in the other. The odds on a 1-1 split must therefore be 13:12. Now suppose instead of going straight for the trumps, I play off a few winners in the side suits (risking that they might be ruffed, of course). Suppose I lead out three Aces in the three suits other than spades and they all win. Now East and West have only 20 cards between them and by exactly the same reasoning as before, the odds of a 1-1 split have become 10:9 instead of 13:12. Playing out seemingly irrelevant suits has increased the probability of the drop working. Although I haven’t touched the spades, my assessment of the probability of the spade distribution has changed significantly.
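The 13:12 and 10:9 figures can be checked the same way (another illustrative sketch): if each defender holds n unknown cards, the probability of a 1-1 spade split is 2·C(2n−2, n−1)/C(2n, n).

```python
import math

def one_one_split(n):
    """P(the two missing spades sit in different hands), with n unknown cards each side."""
    return 2 * math.comb(2 * n - 2, n - 1) / math.comb(2 * n, n)

for n in (13, 10):
    p = one_one_split(n)
    print(n, round(p, 3), round(p / (1 - p), 3))   # odds 13/12 ~ 1.083 and 10/9 ~ 1.111
```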

This sort of thing is a major reason why I always think of probabilities in a Bayesian way. As information is gradually revealed one updates the assessment of the probability of the remaining unknowns.

But probability is only a part of Bridge; the best players don’t actually leave very much  to chance…