Archive for the Bad Statistics Category

German Tanks, Traffic Wardens, and the End of the World

Posted in Bad Statistics, The Universe and Stuff on November 18, 2014 by telescoper

The other day I was looking through some documents relating to the portfolio of courses and modules offered by the Department of Mathematics here at the University of Sussex when I came across a reference to the German Tank Problem. Not knowing what this was I did a Google search and found a quite comprehensive Wikipedia page on the subject which explains the background rather well.

It seems that during the latter stages of World War 2 the Western Allies made sustained efforts to determine the extent of German tank production, and approached this in two major ways, namely conventional intelligence gathering and statistical estimation, with the latter approach often proving the more accurate and reliable, as was the case in the estimation of the production of Panther tanks just prior to D-Day. The Allied command structure had thought the heavy Panzer V (Panther) tanks, with their high-velocity, long-barrelled 75 mm/L70 guns, were uncommon, and would only be encountered in northern France in small numbers. The US Army was confident that the Sherman tank would perform well against the Panzer III and IV tanks that they expected to meet but would struggle against the Panzer V. Shortly before D-Day, rumours began to circulate that large numbers of Panzer V tanks had been deployed in Normandy.

To ascertain whether this was true the Allies attempted to estimate the number of Panzer V tanks being produced. To do this they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell in two unbroken sequences; chassis and engine numbers, and those of various other components, were also used. The question to be asked is how accurately one can infer the total number of tanks from a sample of just a few serial numbers. So accurate did this analysis prove to be that, in the statistical theory of estimation, the general problem of estimating the maximum of a discrete uniform distribution from sampling without replacement is now known as the German tank problem. I’ll leave the details to the Wikipedia discussion, which in my opinion is yet another demonstration of the advantages of a Bayesian approach to this kind of problem.

This problem is a more general version of a problem that I first came across about 30 years ago. I think it was devised in the following form by Steve Gull, but can’t be sure of that.

Imagine you are a visitor in an unfamiliar, but very populous, city. For the sake of argument let’s assume that it is in China. You know that this city is patrolled by traffic wardens, each of whom carries a number on their uniform.  These numbers run consecutively from 1 (smallest) to T (largest) but you don’t know what T is, i.e. how many wardens there are in total. You step out of your hotel and discover traffic warden number 347 sticking a ticket on your car. What is your best estimate of T, the total number of wardens in the city? I hope the similarity to the German Tank Problem is obvious, except in this case it is much simplified by involving just one number rather than a sample.

I gave a short lunchtime talk about this many years ago when I was working at Queen Mary College, in the University of London. Every Friday, over beer and sandwiches, a member of staff or research student would give an informal presentation about their research, or something related to it. I decided to give a talk about bizarre applications of probability in cosmology, and this problem was intended to be my warm-up. I was amazed at the answers I got to this simple question. The majority of the audience denied that one could make any inference at all about T based on a single observation like this, other than that it  must be at least 347.

Actually, a single observation like this can lead to a useful inference about T, using Bayes’ theorem. Suppose we have really no idea at all about T before making our observation; we can then adopt a uniform prior probability. Of course there must be an upper limit on T. There can’t be more traffic wardens than there are people, for example. Although China has a large population, the prior probability of there being, say, a billion traffic wardens in a single city must surely be zero. But let us take the prior to be effectively constant. Suppose the actual number of the warden we observe is t. Now we have to assume that we have an equal chance of coming across any one of the T traffic wardens outside our hotel. Each value of t (from 1 to T) is therefore equally likely. I think this is the reason that my astronomers’ lunch audience thought there was no information to be gleaned from an observation of any particular value, i.e. t=347.

Let us simplify this argument further by allowing two alternative “models” for the frequency of Chinese traffic wardens. One has T=1000, and the other (just to be silly) has T=1,000,000. If I find number 347, which of these two alternatives do you think is more likely? Think about the kind of numbers that occupy the range from 1 to T. In the first case, most of the numbers have 3 digits. In the second, most of them have 6. If there were a million traffic wardens in the city, it is quite unlikely you would find a random individual with a number as small as 347. If there were only 1000, then 347 is just a typical number. There are strong grounds for favouring the first model over the second, simply based on the number actually observed. To put it another way, we would be surprised to encounter number 347 if T were actually a million. We would not be surprised if T were 1000.

One can extend this argument to the entire range of possible values of T, and ask a more general question: if I observe traffic warden number t what is the probability I assign to each value of T? The answer is found using Bayes’ theorem. The prior, as I assumed above, is uniform. The likelihood is the probability of the observation given the model. If I assume a value of T, the probability P(t|T) of each value of t (up to and including T) is just 1/T (since each of the wardens is equally likely to be encountered). Bayes’ theorem can then be used to construct the posterior probability P(T|t). Without going through all the nuts and bolts, I hope you can see that this probability will tail off for large T. Our observation of a (relatively) small value for t should lead us to suspect that T is itself (relatively) small. Indeed it’s a reasonable “best guess” that T=2t. This makes intuitive sense because the observed value of t then lies right in the middle of its range of possibilities.

Before going on, it is worth mentioning one other point about this kind of inference: it is not at all powerful. Note that the likelihood just varies as 1/T. That of course means that small values are favoured over large ones. But note that this probability is uniform in logarithmic terms. So although T=1000 is more probable than T=1,000,000, the range between 1000 and 10,000 is roughly as likely as the range between 1,000,000 and 10,000,000, assuming there is no prior information. So although it tells us something, it doesn’t actually tell us very much. Just like any probabilistic inference, there’s a chance that it is wrong, perhaps very wrong.
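If you want to see that quantitatively, here is a minimal Python sketch of the posterior for the traffic-warden example, assuming a uniform prior with an arbitrary cutoff at ten million wardens (the cutoff is my choice, purely for illustration):

```python
import numpy as np

t = 347                       # the warden number actually observed
T_max = 10_000_000            # arbitrary prior cutoff, for illustration only

# Posterior P(T|t) is proportional to prior(T) * likelihood(t|T) = constant * 1/T for T >= t
T = np.arange(t, T_max + 1)
posterior = 1.0 / T
posterior /= posterior.sum()

# The posterior tails off like 1/T, which is uniform in logarithmic terms,
# so equal logarithmic ranges carry roughly equal probability:
print("P(1000 <= T < 10,000):         ", posterior[(T >= 1_000) & (T < 10_000)].sum())
print("P(1,000,000 <= T < 10,000,000):", posterior[(T >= 1_000_000) & (T < 10_000_000)].sum())
```

Both ranges come out at a little over twenty per cent with this cutoff, which is the sense in which the single observation tells us something, but not very much.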

Which brings me to an extrapolation of this argument to an argument about the end of the World. Now I don’t mind admitting that as I get older I get more and more pessimistic about the prospects for humankind’s survival into the distant future. Unless there are major changes in the way this planet is governed, our Earth may indeed become barren and uninhabitable through war or environmental catastrophe. But I do think the future is in our hands, and disaster is, at least in principle, avoidable. In this respect I have to distance myself from a very strange argument that has been circulating among philosophers and physicists for a number of years. It is called the Doomsday argument, and it even has a sizeable Wikipedia entry, to which I refer you for more details and variations on the basic theme. As far as I am aware, it was first introduced by the mathematical physicist Brandon Carter and subsequently developed and expanded by the philosopher John Leslie (not to be confused with the TV presenter of the same name). It also re-appeared in slightly different guise through a paper in the serious scientific journal Nature by the eminent physicist Richard Gott. Evidently, for some reason, some serious people take it very seriously indeed.

So what can Doomsday possibly have to do with Panzer tanks or traffic wardens? Instead of traffic wardens, we want to estimate N, the number of humans that will ever be born. Following the same logic as in the example above, I assume that I am a “randomly” chosen individual drawn from the sequence of all humans to be born, in past, present and future. For the sake of argument, assume I am number n in this sequence. The logic I explained above should lead me to conclude that the total number N is not much larger than my number, n. For the sake of argument, assume that I am the one-billionth human to be born, i.e. n=1,000,000,000. There should not be many more than a few billion humans ever to be born. At the rate of current population growth, this means that not many more generations of humans remain to be born. Doomsday is nigh.

Richard Gott’s version of this argument is logically similar, but is based on timescales rather than numbers. If whatever thing we are considering begins at some time t_begin and ends at a time t_end, and if we observe it at a “random” time between these two limits, then our best estimate for its future duration is of order how long it has lasted up until now. Gott gives the example of Stonehenge, which was built about 4,000 years ago: we should expect it to last a few thousand years into the future. Actually, Stonehenge is a highly dubious example. It hasn’t really survived 4,000 years. It is a ruin, and nobody knows its original form or function. However, the argument goes that if we come across a building put up about twenty years ago, presumably we should think it will come down again (whether by accident or design) in about twenty years’ time. If I happen to walk past a building just as it is being finished, presumably I should hang around and watch its imminent collapse….

But I’m being facetious.

Following this chain of thought, we would argue that, since humanity has been around a few hundred thousand years, it is expected to last a few hundred thousand years more. Doomsday is not quite as imminent as in the previous version of the argument, but in any case humankind is not expected to survive sufficiently long to, say, colonize the Galaxy.

You may reject this type of argument on the grounds that you do not accept my logic in the case of the traffic wardens. If so, I think you are wrong. I would say that if you accept all the assumptions entering into the Doomsday argument then it is an equally valid example of inductive inference. The real issue is whether it is reasonable to apply this argument at all in this particular case. There are a number of related examples that should lead one to suspect that something fishy is going on. Usually the problem can be traced back to the glib assumption that something is “random” when it is not, or when it is not clearly stated what that is supposed to mean.

There are around sixty million British people on this planet, of whom I am one. In contrast there are well over a billion Chinese. If I follow the same kind of logic as in the examples I gave above, I should be very perplexed by the fact that I am not Chinese. After all, the odds are more than 20:1 against me being British, aren’t they?

Of course, I am not at all surprised by the observation of my non-Chineseness. My upbringing gives me access to a great deal of information about my own ancestry, as well as the geographical and political structure of the planet. This data convinces me that I am not a “random” member of the human race. My self-knowledge is conditioning information and it leads to such a strong prior knowledge about my status that the weak inference I described above is irrelevant. Even if there were a million million Chinese and only a hundred British, I have no grounds to be surprised at my own nationality given what else I know about how I got to be here.

This kind of conditioning information can be applied to history, as well as geography. Each individual is generated by its parents. Its parents were generated by their parents, and so on. The genetic trail of these reproductive events connects us to our primitive ancestors in a continuous chain. A well-informed alien geneticist could look at my DNA and categorize me as an “early human”. I simply could not be born later in the story of humankind, even if it does turn out to continue for millennia. Everything about me – my genes, my physiognomy, my outlook, and even the fact that I am bothering to spend time discussing this so-called paradox – is contingent on my specific place in human history. Future generations will know so much more about the universe and the risks to their survival that they won’t even discuss this simple argument. Perhaps we just happen to be living at the only epoch in human history in which we know enough about the Universe for the Doomsday argument to make some kind of sense, but too little to resolve it.

To see this in a slightly different light, think again about Gott’s timescale argument. The other day I met an old friend from school days. It was a chance encounter, and I hadn’t seen the person for over 25 years. In that time he had married, and when I met him he was accompanied by a baby daughter called Mary. If we were to take Gott’s argument seriously, this was a random encounter with an entity (Mary) that had existed for less than a year. Should I infer that this entity should probably only endure another year or so? I think not. Again, bare numerological inference is rendered completely irrelevant by the conditioning information I have. I know something about babies. When I see one I realise that it is an individual at the start of its life, and I assume that it has a good chance of surviving into adulthood. Human civilization is a baby civilization. Like any youngster, it has dangers facing it. But it is not doomed by the mere fact that it is young.

John Leslie has developed many different variants of the basic Doomsday argument, and I don’t have the time to discuss them all here. There is one particularly bizarre version, however, that I think merits a final word or two because it raises an interesting red herring. It’s called the “Shooting Room”.

Consider the following model for human existence. Souls are called into existence in groups representing each generation. The first generation has ten souls. The next has a hundred, the next after that a thousand, and so on. Each generation is led into a room, at the front of which is a pair of dice. The dice are rolled. If the score is double-six then everyone in the room is shot and it’s the end of humanity. If any other score is shown, everyone survives and is led out of the Shooting Room to be replaced by the next generation, which is ten times larger. The dice are rolled again, with the same rules. You find yourself called into existence and are led into the room along with the rest of your generation. What should you think is going to happen?

Leslie’s argument is the following. Each generation not only has more members than the previous one, but also contains more souls than have ever existed up to that point. For example, the third generation has 1000 souls; the previous two had 10 and 100 respectively, i.e. 110 altogether. Roughly 90% of all humanity lives in the last generation. Whenever the last generation happens, there are bound to be more people in that generation than in all generations up to that point. When you are called into existence you should therefore expect to be in the last generation. You should consequently expect that the dice will show double-six and the celestial firing squad will take aim. On the other hand, if you think the dice are fair then each throw is independent of the previous one and a throw of double-six should have a probability of just one in thirty-six. On this basis, you should expect to survive. The odds are against the fatal score.
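If the dice really are fair, the “roughly 90% of souls” arithmetic is easy to check with a toy simulation. Here is a minimal Python sketch of the set-up (my own toy version, not anything taken from Leslie):

```python
import random

def shooting_room_run(rng):
    """One run of the Shooting Room: generations of 10, 100, 1000, ...
    souls enter in turn until the (fair) dice show double-six."""
    generation, total_souls = 0, 0
    while True:
        generation += 1
        size = 10 ** generation          # this generation's population
        total_souls += size
        if rng.randint(1, 6) == 6 and rng.randint(1, 6) == 6:
            return size, total_souls     # this generation is shot

rng = random.Random(42)
fractions = [last / total for last, total in
             (shooting_room_run(rng) for _ in range(10_000))]

# In every run roughly 90% (or more) of all souls belong to the final generation,
# even though each individual generation only faces a 1-in-36 chance of doom.
print("average fraction of souls in the shot generation:",
      sum(fractions) / len(fractions))
```

The simulation just makes the tension explicit: the population-weighted view and the per-throw dice probability pull in opposite directions.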

This apparent paradox seems to suggest that it matters a great deal whether the future is predetermined (your presence in the last generation requires the double-six to fall) or “random” (in which case there is the usual probability of a double-six). Leslie argues that if everything is pre-determined then we’re doomed. If there’s some indeterminism then we might survive. This isn’t really a paradox at all, simply an illustration of the fact that assuming different models gives rise to different probability assignments.

While I am on the subject of the Shooting Room, it is worth drawing a parallel with another classic puzzle of probability theory, the St Petersburg Paradox. This is an old chestnut to do with a purported winning strategy for Roulette. It was first proposed by Nicolas Bernoulli but famously discussed at greatest length by Daniel Bernoulli in the pages of the Transactions of the St Petersburg Academy, hence the name. It works just as well for a simple toss of a coin as for Roulette, as long as in the latter game one bets only on red or black rather than on individual numbers.

Imagine you decide to bet such that you win by throwing heads. Your original stake is £1. If you win, the bank pays you at even money (i.e. you get your stake back plus another £1). If you lose, i.e. get tails, your strategy is to play again but bet double. If you win this time you get £4 back but have bet £2+£1=£3 up to that point. If you lose again you bet £8. If you win this time, you get £16 back but have paid in £8+£4+£2+£1=£15 to that point. Clearly, if you carry on the strategy of doubling your previous stake each time you lose, when you do eventually win you will be ahead by £1. It’s a guaranteed winner. Isn’t it?

The answer is yes, as long as you can guarantee that the number of losses you will suffer is finite. But in tosses of a fair coin there is no limit to the number of tails you can throw before getting a head. To get the correct probability of winning you have to allow for all possibilities. So what is your expected stake to win this £1? The answer is the root of the paradox. The probability that you win straight off is ½ (you need to throw a head), and your stake is £1 in this case so the contribution to the expectation is £0.50. The probability that you win on the second go is ¼ (you must lose the first time and win the second so it is ½ times ½) and your stake this time is £2 so this contributes the same £0.50 to the expectation. A moment’s thought tells you that each throw contributes the same amount, £0.50, to the expected stake. We have to add this up over all possibilities, and there are an infinite number of them. The result of summing them all up is therefore infinite. If you don’t believe this just think about how quickly your stake grows after only a few losses: £1, £2, £4, £8, £16, £32, £64, £128, £256, £512, £1024, etc. After only ten losses you are staking over a thousand pounds just to get your pound back. Sure, you can win £1 this way, but you need to expect to stake an infinite amount to guarantee doing so. It is not a very good way to get rich.
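Here is a minimal Python sketch of that bookkeeping for the doubling strategy, just to make the divergence explicit (the cap on the number of throws is mine, purely so the loop terminates):

```python
# The doubling strategy: after k-1 tails you stake 2**(k-1) on the k-th throw,
# and you first win on throw k with probability (1/2)**k.
def contribution(k):
    prob_first_win = 0.5 ** k         # k-1 tails followed by a head
    stake_on_throw = 2 ** (k - 1)     # £1, £2, £4, ... doubling each time
    return prob_first_win * stake_on_throw   # = £0.50 for every k

for max_throws in (10, 100, 1000):
    expected = sum(contribution(k) for k in range(1, max_throws + 1))
    print(f"expected stake allowing up to {max_throws:4d} throws: £{expected:.2f}")

# Each extra possible throw adds another £0.50, so with no cap the sum is infinite,
# and after only ten losses the next bet is already £1024.
```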

The relationship of all this to the Shooting Room is that it shows it is dangerous to pre-suppose a finite value for a number which in principle could be infinite. If the number of souls that could be called into existence is allowed to be infinite, then any individual has no chance at all of being called into existence in any generation!

Amusing as they are, the thing that makes me most uncomfortable about these Doomsday arguments is that they attempt to determine the probability of an event without any reference to an underlying mechanism. For me, a valid argument about Doomsday would have to involve a particular physical cause for the extinction of humanity (e.g. asteroid impact, climate change, nuclear war, etc). Given this physical mechanism one should construct a model within which one can estimate probabilities for the model parameters (such as the rate of occurrence of catastrophic asteroid impacts). Only then can one make a valid inference based on relevant observations and their associated likelihoods. Such calculations may indeed lead to alarming or depressing results. I fear that the greatest risk to our future survival is not from asteroid impact or global warming, where the chances can be estimated with reasonable precision, but from self-destructive violence carried out by humans themselves. Science has no way of predicting what atrocities people are capable of, so we can’t make any reliable estimate of the probability that we will self-destruct. But the absence of any specific mechanism in the versions of the Doomsday argument I have discussed robs them of any scientific credibility at all.

There are better grounds for worrying about the future than simple-minded numerology.


Bayes, Laplace and Bayes’ Theorem

Posted in Bad Statistics on October 1, 2014 by telescoper

A  couple of interesting pieces have appeared which discuss Bayesian reasoning in the popular media. One is by Jon Butterworth in his Grauniad science blog and the other is a feature article in the New York Times. I’m in early today because I have an all-day Teaching and Learning Strategy Meeting so before I disappear for that I thought I’d post a quick bit of background.

One way to get to Bayes’ Theorem is by starting with

P(A|C)P(B|AC)=P(B|C)P(A|BC)=P(AB|C)

where I refer to three logical propositions A, B and C and the vertical bar “|” denotes conditioning, i.e. P(A|B) means the probability of A being true given the assumed truth of B; “AB” means “A and B”, etc. This basically follows from the fact that “A and B” must always be equivalent to “B and A”.  Bayes’ theorem  then follows straightforwardly as

P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)

where

K=P(A|C).

Many versions of this, including the one in Jon Butterworth’s blog, exclude the third proposition and refer to A and B only. I prefer to keep an extra one in there to remind us that every statement about probability depends on information either known or assumed to be known; any proper statement of probability requires this information to be stated clearly and used appropriately but sadly this requirement is frequently ignored.

Although this is called Bayes’ theorem, the general form of it as stated here was actually first written down not by Bayes, but by Laplace. What Bayes did was derive the special case of this formula for “inverting” the binomial distribution. This distribution gives the probability of x successes in n independent “trials” each having the same probability of success, p; each “trial” has only two possible outcomes (“success” or “failure”). Trials like this are usually called Bernoulli trials, after Jacob Bernoulli. If we ask the question “what is the probability of exactly x successes from the possible n?”, the answer is given by the binomial distribution:

P_n(x|n,p)= C(n,x) p^x (1-p)^{n-x}

where

C(n,x)= \frac{n!}{x!(n-x)!}

is the number of distinct combinations of x objects that can be drawn from a pool of n.

You can probably see immediately how this arises. The probability of x consecutive successes is p multiplied by itself x times, or p^x. The probability of (n-x) successive failures is similarly (1-p)^{n-x}. The last two terms basically therefore tell us the probability that we have exactly x successes (since there must be n-x failures). The combinatorial factor in front takes account of the fact that the ordering of successes and failures doesn’t matter.

The binomial distribution applies, for example, to repeated tosses of a coin, in which case p is taken to be 0.5 for a fair coin. A biased coin might have a different value of p, but as long as the tosses are independent the formula still applies. The binomial distribution also applies to problems involving drawing balls from urns: it works exactly if the balls are replaced in the urn after each draw, but it also applies approximately without replacement, as long as the number of draws is much smaller than the number of balls in the urn. I leave it as an exercise to calculate the expectation value of the binomial distribution, but the result is not surprising: E(X)=np. If you toss a fair coin ten times the expectation value for the number of heads is 10 times 0.5, which is five. No surprise there. After another bit of maths, the variance of the distribution can also be found. It is np(1-p).
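For anyone who wants a quick numerical check of those two formulae, here is a short Python sketch using the ten tosses of a fair coin mentioned above:

```python
import numpy as np
from math import comb

n, p = 10, 0.5                       # ten tosses of a fair coin
x = np.arange(n + 1)
pmf = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in x])

mean = (x * pmf).sum()
variance = ((x - mean) ** 2 * pmf).sum()
print("E(X)   =", mean, "    np      =", n * p)
print("Var(X) =", variance, "    np(1-p) =", n * p * (1 - p))
```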

So this gives us the probability of x given a fixed value of p. Bayes was interested in the inverse of this result, the probability of p given x. In other words, Bayes was interested in the answer to the question “If I perform n independent trials and get x successes, what is the probability distribution of p?”. This is a classic example of inverse reasoning, in that it involved turning something like P(A|BC) into something like P(B|AC), which is what is achieved by the theorem stated at the start of this post.
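To see what that inversion looks like in practice, here is a minimal sketch with a uniform prior on p and purely illustrative numbers (say 7 successes in 10 trials, my choice rather than anything Bayes considered):

```python
import numpy as np
from math import comb

n, x = 10, 7                          # illustrative: 7 successes in 10 trials
p_grid = np.linspace(0.0, 1.0, 1001)

# Posterior P(p|x,n) is proportional to prior(p) * P(x|n,p); the prior is taken uniform
likelihood = comb(n, x) * p_grid**x * (1 - p_grid)**(n - x)
posterior = likelihood / likelihood.sum()

print("posterior peaks at p =", p_grid[np.argmax(posterior)])   # x/n = 0.7
print("posterior mean of p  =", (p_grid * posterior).sum())     # close to (x+1)/(n+2)
```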

Bayes got the correct answer for his problem, eventually, but by very convoluted reasoning. In my opinion it is quite difficult to justify the name Bayes’ theorem based on what he actually did, although Laplace did specifically acknowledge this contribution when he derived the general result later, which is no doubt why the theorem is always named in Bayes’ honour.


This is not the only example in science where the wrong person’s name is attached to a result or discovery. Stigler’s Law of Eponymy strikes again!

So who was the mysterious mathematician behind this result? Thomas Bayes was born in 1702, son of Joshua Bayes, who was a Fellow of the Royal Society (FRS) and one of the very first nonconformist ministers to be ordained in England. Thomas was himself ordained and for a while worked with his father in the Presbyterian Meeting House in Leather Lane, near Holborn in London. In 1720 he was a minister in Tunbridge Wells, in Kent. He retired from the church in 1752 and died in 1761. Thomas Bayes didn’t publish a single paper on mathematics in his own name during his lifetime but was elected a Fellow of the Royal Society (FRS) in 1742.

The paper containing the theorem that now bears his name was published posthumously in the Philosophical Transactions of the Royal Society of London in 1763. In his great Philosophical Essay on Probabilities Laplace wrote:

Bayes, in the Transactions Philosophiques of the Year 1763, sought directly the probability that the possibilities indicated by past experiences are comprised within given limits; and he has arrived at this in a refined and very ingenious manner, although a little perplexing.

The reasoning in the 1763 paper is indeed perplexing, and I remain convinced that the general form we now refer to as Bayes’ Theorem should really be called Laplace’s Theorem. Nevertheless, Bayes did establish an extremely important principle that is reflected in the title of the New York Times piece I referred to at the start of this post. In a nutshell this is that probabilities of future events can be updated on the basis of past measurements or, as I prefer to put it, “one person’s posterior is another’s prior”.


BICEP2 bites the dust.. or does it?

Posted in The Universe and Stuff, Bad Statistics, Science Politics, Open Access on September 22, 2014 by telescoper

Well, it’s come about three weeks later than I suggested – you should know that you can never trust anything you read in a blog – but the long-awaited Planck analysis of polarized dust emission from our Galaxy has now hit the arXiv. Here is the abstract, which you can click on to make it larger:

[Image: abstract of the Planck paper on polarized Galactic dust emission]

My twitter feed was already alive with reactions to the paper when I woke up at 6am, so I’m already a bit late on the story, but I couldn’t resist a quick comment or two.

The bottom line is of course that the polarized emission from Galactic dust is much larger in the BICEP2 field than had been anticipated in the BICEP2 analysis of their data (now published in Physical Review Letters after being refereed). Indeed, as the abstract states, the actual dust contamination in the BICEP2 field is subject to considerable statistical and systematic uncertainties, but seems to be around the same level as BICEP2’s claimed detection. In other words the Planck analysis shows that the BICEP2 result is completely consistent with what is now known about polarized dust emission. To put it bluntly, the Planck analysis shows that the claim that primordial gravitational waves had been detected was premature, to say the least. I remind you that the original BICEP2 result was spun as a ‘7σ’ detection of a primordial polarization signal associated with gravitational waves. This level of confidence is now known to have been false. I’m going to resist (for the time being) another rant about p-values…

Although it is consistent with being entirely dust, the Planck analysis does not entirely kill off the idea that there might be a primordial contribution to the BICEP2 measurement, which could be of similar amplitude to the dust signal. However, identifying and extracting that signal will require the much more sophisticated joint analysis alluded to in the final sentence of the abstract above. Planck and BICEP2 have differing strengths and weaknesses and a joint analysis will benefit from considerable complementarity. Planck has wider spectral coverage, and has mapped the entire sky; BICEP2 is more sensitive, but works at only one frequency and covers only a relatively small field of view. Between them they may be able to identify an excess source of polarization over and above the foreground, so it is not impossible that a gravitational-wave component may be isolated. That will be a tough job, however, and there’s by no means any guarantee that it will work. We will just have to wait and see.

In the mean time let’s see how big an effect this paper has on my poll:


Note also that the abstract states:

We show that even in the faintest dust-emitting regions there are no “clean” windows where primordial CMB B-mode polarization could be measured without subtraction of dust emission.

It is as I always thought. Our Galaxy is a rather grubby place to live. Even the windows are filthy. It’s far too dusty for fussy cosmologists, who need to have everything just so, but probably fine for astrophysicists who generally like mucking about and getting their hands dirty…

This discussion suggests that a confident detection of B-modes from primordial gravitational waves (if there is one to detect) may have to wait for a sensitive all-sky experiment, which would have to be done in space. On the other hand, Planck has identified some regions which appear to be significantly less contaminated than the BICEP2 field (which is outlined in black):

[Image: Planck map of polarized dust emission showing regions less contaminated than the BICEP2 field, which is outlined in black]

Could it be possible to direct some of the ongoing ground- or balloon-based CMB polarization experiments towards the cleaner region (the dark blue area in the right-hand panel) just south of the BICEP2 field?

From a theorist’s perspective, I think this result means that all the models of the early Universe that we thought were dead because they couldn’t produce the high level of primordial gravitational waves detected by BICEP2 have now come back to life, and those that came to life to explain the BICEP2 result may soon be read the last rites if the signal turns out to be predominantly dust.

Another important thing that remains to be seen is the extent to which the extraordinary media hype surrounding the announcement back in March will affect the credibility of the BICEP2 team itself and indeed the cosmological community as a whole. On the one hand, there’s nothing wrong with what has happened from a scientific point of view: results get scrutinized, tested, and sometimes refuted.  To that extent all this episode demonstrates is that science works.  On the other hand most of this stuff usually goes on behind the scenes as far as the public are concerned. The BICEP2 team decided to announce their results by press conference before they had been subjected to proper peer review. I’m sure they made that decision because they were confident in their results, but it now looks like it may have backfired rather badly. I think the public needs to understand more about how science functions as a process, often very messily, but how much of this mess should be out in the open?


UPDATE: Here’s a piece by Jonathan Amos on the BBC Website about the story.

ANOTHER UPDATE: Here’s the Physics World take on the story.

ANOTHER OTHER UPDATE: A National Geographic story

Frequentism: the art of probably answering the wrong question

Posted in Bad Statistics on September 15, 2014 by telescoper

Popped into the office for a spot of lunch in between induction events and discovered that Jon Butterworth has posted an item on his Grauniad blog about how particle physicists use statistics, and the ‘5σ rule’ that is usually employed as a criterion for the detection of, e.g. a new particle. I couldn’t resist bashing out a quick reply, because I believe that actually the fundamental issue is not whether you choose 3σ or 5σ or 27σ but what these statistics mean or don’t mean.

As was the case with a Nature piece I blogged about some time ago, Jon’s article focuses on the p-value, a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a particular null hypothesis. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient r obtained from a set of bivariate data. If the data were uncorrelated then r would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05. This is usually called a ‘2σ’ result because for Gaussian statistics a variable has a probability of 95% of lying within 2σ of the mean value.
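For the correlation example, the frequentist recipe can be spelled out in a few lines of Python: simulate lots of uncorrelated datasets, build up the null distribution of r, and count how often it exceeds the measured value. The sample size and observed r below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_data, r_observed = 20, 0.45        # illustrative sample size and measured correlation

# Null distribution of the sample correlation coefficient for uncorrelated Gaussian data
r_null = np.array([
    np.corrcoef(rng.normal(size=n_data), rng.normal(size=n_data))[0, 1]
    for _ in range(20_000)
])

# Two-sided p-value: how often uncorrelated data give |r| at least as large as observed
p_value = np.mean(np.abs(r_null) >= r_observed)
print("p-value under the null hypothesis of zero correlation:", p_value)
```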

Anyway, whatever the null hypothesis happens to be, you can see that the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that large under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null-hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is actually a correct description of the data. To make that sort of statement you would need to specify an alternative distribution, calculate the distribution based on it, and hence determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when it is incorrect. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Jon’s piece demonstrates that he does, so this is not meant as a personal criticism, but it is a pervasive problem that results quoted in such a way are intrinsically confusing.

The Nature story mentioned above argues that in fact that results quoted with a p-value of 0.05 turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true; a widespread example is assuming that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences when samples are typically rather small.

While I agree with the Nature piece that there’s a problem, I don’t agree with the suggestion that it can be solved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05 or, in the case of particle physics, a 5σ standard (which translates to a p-value of about 0.000001)! While it is true that this would throw out a lot of flaky ‘two-sigma’ results, it doesn’t alter the basic problem, which is that the frequentist approach to hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach. In particular, most of the time the p-value is an answer to a question which is quite different from that which a scientist would actually want to ask, which is what the data have to say about the probability of a specific hypothesis being true or sometimes whether the data imply one hypothesis more strongly than another. I’ve banged on about Bayesian methods quite enough on this blog so I won’t repeat the arguments here, except to say that such approaches focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis.

I feel so strongly about this that if I had my way I’d ban p-values altogether…

Not that it’s always easy to implement a Bayesian approach. It’s especially difficult when the data are affected by complicated noise statistics and selection effects, and/or when it is difficult to formulate a hypothesis test rigorously because one does not have a clear alternative hypothesis in mind. Experimentalists (including experimental particle physicists) seem to prefer to accept the limitations of the frequentist approach rather than tackle the admittedly very challenging problems of going Bayesian. In fact in my experience it seems that those scientists who approach data from a theoretical perspective are almost exclusively Bayesian, while those of an experimental or observational bent stick to their frequentist guns.

Coincidentally a paper on the arXiv not long ago discussed an interesting apparent paradox in hypothesis testing that arises in the context of high energy physics, which I thought I’d share here. Here is the abstract:

The Jeffreys-Lindley paradox displays how the use of a p-value (or number of standard deviations z) in a frequentist hypothesis test can lead to inferences that are radically different from those of a Bayesian hypothesis test in the form advocated by Harold Jeffreys in the 1930’s and common today. The setting is the test of a point null (such as the Standard Model of elementary particle physics) versus a composite alternative (such as the Standard Model plus a new force of nature with unknown strength). The p-value, as well as the ratio of the likelihood under the null to the maximized likelihood under the alternative, can both strongly disfavor the null, while the Bayesian posterior probability for the null can be arbitrarily large. The professional statistics literature has many impassioned comments on the paradox, yet there is no consensus either on its relevance to scientific communication or on the correct resolution. I believe that the paradox is quite relevant to frontier research in high energy physics, where the model assumptions can evidently be quite different from those in other sciences. This paper is an attempt to explain the situation to both physicists and statisticians, in hopes that further progress can be made.

This paradox isn’t a paradox at all; the different approaches give different answers because they ask different questions. Both could be right, but I firmly believe that one of them answers the wrong question.

Scotland Should Decide…

Posted in Bad Statistics, Science Politics, Politics on September 9, 2014 by telescoper

There being less than two weeks to go before the forthcoming referendum on Scottish independence, a subject on which I have so far refrained from commenting, I thought I would write something on it from the point of view of an English academic. I was finally persuaded to take the plunge because of incoming traffic to this blog from  pro-independence pieces here and here and a piece in Nature News on similar matters.

I’ll say at the outset that this is an issue for the Scots themselves to decide. I’m a believer in democracy and think that the wishes of the Scottish people as expressed through a referendum should be respected. I’m not qualified to express an opinion on the wider financial and political implications so I’ll just comment on the implications for science research, which is directly relevant to at least some of the readers of this blog. What would happen to UK research if Scotland were to vote yes?

Before going on I’ll just point out that the latest opinion poll by YouGov puts the “Yes” (i.e. pro-independence) vote ahead of “No” at 51%-49%. As the sample size for this survey was only just over a thousand, it has a margin of error of ±3%. On that basis I’d call the race neck-and-neck to within the resolution of the survey statistics. It does annoy me that pollsters never bother to state their margin of error in press releases. Nevertheless, the current picture is a lot closer than it looked just a month ago, which is interesting in itself, as it is not clear to me as an outsider why it has changed so dramatically and so quickly.

Anyway, according to a Guardian piece not long ago:

Scientists and academics in Scotland would lose access to billions of pounds in grants and the UK’s world-leading research programmes if it became independent, the Westminster government has warned.

David Willetts, the UK science minister, said Scottish universities were “thriving” because of the UK’s generous and highly integrated system for funding scientific research, winning far more funding per head than the UK average.

Unveiling a new UK government paper on the impact of independence on scientific research, Willetts said that despite its size the UK was second only to the United States for the quality of its research.

“We do great things as a single, integrated system and a single integrated system brings with it great strengths,” he said.

Overall spending on scientific research and development in Scottish universities from government, charitable and industry sources was more than £950m in 2011, giving a per capita spend of £180 compared to just £112 per head across the UK as a whole.

It is indeed notable that Scottish universities outperform those in the rest of the United Kingdom when it comes to research, but it always struck me that using this as an argument against independence is difficult to sustain. In fact it’s rather similar to the argument that the UK does well out of European funding schemes so that is a good argument for remaining in the European Union. The point is that, whether or not a given country benefits from the funding system, it still has to do so by following an agenda that isn’t necessarily its own. Scotland benefits from UK Research Council funding, but their priorities are set by the Westminster government, just as the European Research Council sets (sometimes rather bizarre) policies for its schemes. Who’s to say that Scotland wouldn’t do even better than it does currently by taking control of its own research funding rather than forcing its institutions to pander to Whitehall?

It’s also interesting to look at the flipside of this argument. If Scotland were to become independent, would the “billions” of research funding it would lose (according to the statement by Willetts, who is no longer the Minister in charge) benefit science in what’s left of the United Kingdom? There are many in England and Wales who think the existing research budget is already spread far too thinly and who would welcome an increase south of the border. If this did happen you could argue that, from a very narrow perspective, Scottish independence would be good for science in the rest of what is now the United Kingdom, but that depends on how the Westminster government would then set the science budget.

This all depends on how research funding would be redistributed if and when Scotland secedes from the Union, which could be done in various ways. The simplest would be for Scotland to withdraw from RCUK entirely. Because of the greater effectiveness of Scottish universities at winning funding compared to the rest of the UK, Scotland would have to spend more per capita to maintain its current level of resource, which is why many Scottish academics will be voting “no”. On the other hand, it has been suggested (by the “yes” campaign) that Scotland could buy back into RCUK from its own revenue at the current effective per capita rate and thus maintain its present infrastructure and research expenditure at no extra cost. This, to me, sounds like wanting to have your cake and eat it, and it’s by no means obvious that Westminster could or should agree to such a deal. All the soundings I have taken suggest that an independent Scotland should expect no such generosity, and would actually get zilch from RCUK.

If full separation is the way ahead, science in Scotland would be heading into uncharted waters. Among the questions that would need to be answered are:

  •  what will happen to RCUK funded facilities and staff currently situated in Scotland, such as those at the UKATC?
  •  would Scottish researchers lose access to facilities located in England, Wales or Northern Ireland?
  •  would Scotland have to pay its own subscriptions to CERN, ESA and ESO?

These are complicated issues to resolve and there’s no question that a lengthy process of negotiation would be needed to resolve them. In the meantime, why should RCUK risk investing further funds in programmes and facilities that may end up outside the UK (or what remains of it)? This is a recipe for planning blight on an enormous scale.

And then there’s the issue of EU membership. Would Scotland be allowed to join the EU immediately on independence? If not, what would happen to EU funded research?

I’m not saying these things will necessarily work out badly in the long run for Scotland, but they are certainly questions I’d want to have answered before I were convinced to vote “yes”. I don’t have a vote so my opinion shouldn’t count for very much, but I wonder if there are any readers of this blog from across the Border who feel like expressing an opinion?


Politics, Polls and Insignificance

Posted in Bad Statistics, Politics on July 29, 2014 by telescoper

In between various tasks I had a look at the news and saw a story about opinion polls that encouraged me to make another quick contribution to my bad statistics folder.

The piece concerned (in the Independent) includes the following statement:

A ComRes survey for The Independent shows that the Conservatives have dropped to 27 per cent, their lowest in a poll for this newspaper since the 2010 election. The party is down three points on last month, while Labour, now on 33 per cent, is up one point. Ukip is down one point to 17 per cent, with the Liberal Democrats up one point to eight per cent and the Green Party up two points to seven per cent.

The link added to ComRes is mine; the full survey can be found here. Unfortunately, the report, as is sadly almost always the case in surveys of this kind, neglects any mention of the statistical uncertainty in the poll. In fact the last point is based on a telephone poll of a sample of just 1001 respondents. Suppose the fraction of the population having the intention to vote for a particular party is p. For a sample of size n with x respondents indicating that they intend to vote for that party, one can straightforwardly estimate p \simeq x/n. So far so good, as long as there is no bias induced by the form of the question asked nor in the selection of the sample, which for a telephone poll is doubtful.

A  little bit of mathematics involving the binomial distribution yields an answer for the uncertainty in this estimate of p in terms of the sampling error:

\sigma = \sqrt{\frac{p(1-p)}{n}}

For the sample size given, and a value p \simeq 0.33 this amounts to a standard error of about 1.5%. About 95% of samples drawn from a population in which the true fraction is p will yield an estimate within p \pm 2\sigma, i.e. within about 3% of the true figure. In other words the typical variation between two samples drawn from the same underlying population is about 3%.
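Plugging the numbers from the poll into that formula takes only a couple of lines of Python:

```python
from math import sqrt

p, n = 0.33, 1001                      # Labour's share and the ComRes sample size
sigma = sqrt(p * (1 - p) / n)          # sampling error on the estimated fraction
print(f"standard error:            {100 * sigma:.1f}%")   # about 1.5%
print(f"95% margin of error (2 sigma): {200 * sigma:.1f}%")   # about 3%
```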

If you don’t believe my calculation then you could use ComRes’ own “margin of error calculator“. The UK electorate as of 2012 numbered 46,353,900 and a sample size of 1001 returns a margin of error of 3.1%. This figure is not quoted in the report however.

Looking at the figures quoted in the report will tell you that all of the changes reported since last month’s poll are within the sampling uncertainty and are therefore consistent with no change at all in underlying voting intentions over this period.

A summary of the report posted elsewhere states:

A ComRes survey for the Independent shows that Labour have jumped one point to 33 per cent in opinion ratings, with the Conservatives dropping to 27 per cent – their lowest support since the 2010 election.

No! There’s no evidence of support for Labour having “jumped one point”, even if you could describe such a marginal change as a “jump” in the first place.

Statistical illiteracy is as widespread amongst politicians as it is amongst journalists, but the fact that silly reports like this are commonplace doesn’t make them any less annoying. After all, the idea of sampling uncertainty isn’t all that difficult to understand. Is it?

And with so many more important things going on in the world that deserve better press coverage than they are getting, why does a “quality” newspaper waste its valuable column inches on this sort of twaddle?

Time for a Factorial Moment…

Posted in Bad Statistics on July 22, 2014 by telescoper

Another very busy and very hot day so no time for a proper blog post. I suggest we all take a short break and enjoy a Factorial Moment:

[Image: Factorial Moment]

I remember many moons ago spending ages calculating the factorial moments of the Poisson-Lognormal distribution, only to find that they were well known. If only I’d had Google then…
