## Bayes, Laplace and Bayes’ Theorem

Posted in Bad Statistics with tags , , , , , , , , on October 1, 2014 by telescoper

A  couple of interesting pieces have appeared which discuss Bayesian reasoning in the popular media. One is by Jon Butterworth in his Grauniad science blog and the other is a feature article in the New York Times. I’m in early today because I have an all-day Teaching and Learning Strategy Meeting so before I disappear for that I thought I’d post a quick bit of background.

One way to get to Bayes’ Theorem is by starting with

$P(A|C)P(B|AC)=P(B|C)P(A|BC)=P(AB|C)$

where I refer to three logical propositions A, B and C and the vertical bar “|” denotes conditioning, i.e. $P(A|B)$ means the probability of A being true given the assumed truth of B; “AB” means “A and B”, etc. This basically follows from the fact that “A and B” must always be equivalent to “B and A”.  Bayes’ theorem  then follows straightforwardly as

$P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)$

where

$K=P(A|C).$

Many versions of this, including the one in Jon Butterworth’s blog, exclude the third proposition and refer to A and B only. I prefer to keep an extra one in there to remind us that every statement about probability depends on information either known or assumed to be known; any proper statement of probability requires this information to be stated clearly and used appropriately but sadly this requirement is frequently ignored.

Although this is called Bayes’ theorem, the general form of it as stated here was actually first written down not by Bayes, but by Laplace. What Bayes did was derive the special case of this formula for “inverting” the binomial distribution. This distribution gives the probability of x successes in n independent “trials” each having the same probability of success, p; each “trial” has only two possible outcomes (“success” or “failure”). Trials like this are usually called Bernoulli trials, after Daniel Bernoulli. If we ask the question “what is the probability of exactly x successes from the possible n?”, the answer is given by the binomial distribution:

$P_n(x|n,p)= C(n,x) p^x (1-p)^{n-x}$

where

$C(n,x)= \frac{n!}{x!(n-x)!}$

is the number of distinct combinations of x objects that can be drawn from a pool of n.

You can probably see immediately how this arises. The probability of x consecutive successes is p multiplied by itself x times, or px. The probability of (n-x) successive failures is similarly (1-p)n-x. The last two terms basically therefore tell us the probability that we have exactly x successes (since there must be n-x failures). The combinatorial factor in front takes account of the fact that the ordering of successes and failures doesn’t matter.

The binomial distribution applies, for example, to repeated tosses of a coin, in which case p is taken to be 0.5 for a fair coin. A biased coin might have a different value of p, but as long as the tosses are independent the formula still applies. The binomial distribution also applies to problems involving drawing balls from urns: it works exactly if the balls are replaced in the urn after each draw, but it also applies approximately without replacement, as long as the number of draws is much smaller than the number of balls in the urn. I leave it as an exercise to calculate the expectation value of the binomial distribution, but the result is not surprising: E(X)=np. If you toss a fair coin ten times the expectation value for the number of heads is 10 times 0.5, which is five. No surprise there. After another bit of maths, the variance of the distribution can also be found. It is np(1-p).

So this gives us the probability of x given a fixed value of p. Bayes was interested in the inverse of this result, the probability of p given x. In other words, Bayes was interested in the answer to the question “If I perform n independent trials and get x successes, what is the probability distribution of p?”. This is a classic example of inverse reasoning, in that it involved turning something like P(A|BC) into something like P(B|AC), which is what is achieved by the theorem stated at the start of this post.

Bayes got the correct answer for his problem, eventually, but by very convoluted reasoning. In my opinion it is quite difficult to justify the name Bayes’ theorem based on what he actually did, although Laplace did specifically acknowledge this contribution when he derived the general result later, which is no doubt why the theorem is always named in Bayes’ honour.

This is not the only example in science where the wrong person’s name is attached to a result or discovery. Stigler’s Law of Eponymy strikes again!

So who was the mysterious mathematician behind this result? Thomas Bayes was born in 1702, son of Joshua Bayes, who was a Fellow of the Royal Society (FRS) and one of the very first nonconformist ministers to be ordained in England. Thomas was himself ordained and for a while worked with his father in the Presbyterian Meeting House in Leather Lane, near Holborn in London. In 1720 he was a minister in Tunbridge Wells, in Kent. He retired from the church in 1752 and died in 1761. Thomas Bayes didn’t publish a single paper on mathematics in his own name during his lifetime but was elected a Fellow of the Royal Society (FRS) in 1742.

The paper containing the theorem that now bears his name was published posthumously in the Philosophical Transactions of the Royal Society of London in 1763. In his great Philosophical Essay on Probabilities Laplace wrote:

Bayes, in the Transactions Philosophiques of the Year 1763, sought directly the probability that the possibilities indicated by past experiences are comprised within given limits; and he has arrived at this in a refined and very ingenious manner, although a little perplexing.

The reasoning in the 1763 paper is indeed perplexing, and I remain convinced that the general form we now we refer to as Bayes’ Theorem should really be called Laplace’s Theorem. Nevertheless, Bayes did establish an extremely important principle that is reflected in the title of the New York Times piece I referred to at the start of this piece. In a nutshell this is that probabilities of future events can be updated on the basis of past measurements or, as I prefer to put it, “one person’s posterior is another’s prior”.

## Politics, Polls and Insignificance

Posted in Bad Statistics, Politics with tags , , , , , on July 29, 2014 by telescoper

In between various tasks I had a look at the news and saw a story about opinion polls that encouraged me to make another quick contribution to my bad statistics folder.

The piece concerned (in the Independent) includes the following statement:

A ComRes survey for The Independent shows that the Conservatives have dropped to 27 per cent, their lowest in a poll for this newspaper since the 2010 election. The party is down three points on last month, while Labour, now on 33 per cent, is up one point. Ukip is down one point to 17 per cent, with the Liberal Democrats up one point to eight per cent and the Green Party up two points to seven per cent.

The link added to ComRes is mine; the full survey can be found here. Unfortunately, the report, as is sadly almost always the case in surveys of this kind, neglects any mention of the statistical uncertainty in the poll. In fact the last point is based on a telephone poll of a sample of just 1001 respondents. Suppose the fraction of the population having the intention to vote for a particular party is $p$. For a sample of size $n$ with $x$ respondents indicating that they hen one can straightforwardly estimate $p \simeq x/n$. So far so good, as long as there is no bias induced by the form of the question asked nor in the selection of the sample, which for a telephone poll is doubtful.

A  little bit of mathematics involving the binomial distribution yields an answer for the uncertainty in this estimate of p in terms of the sampling error:

$\sigma = \sqrt{\frac{p(1-p)}{n}}$

For the sample size given, and a value $p \simeq 0.33$ this amounts to a standard error of about 1.5%. About 95% of samples drawn from a population in which the true fraction is $p$ will yield an estimate within $p \pm 2\sigma$, i.e. within about 3% of the true figure. In other words the typical variation between two samples drawn from the same underlying population is about 3%.

If you don’t believe my calculation then you could use ComRes’ own “margin of error calculator“. The UK electorate as of 2012 numbered 46,353,900 and a sample size of 1001 returns a margin of error of 3.1%. This figure is not quoted in the report however.

Looking at the figures quoted in the report will tell you that all of the changes reported since last month’s poll are within the sampling uncertainty and are therefore consistent with no change at all in underlying voting intentions over this period.

A summary of the report posted elsewhere states:

A ComRes survey for the Independent shows that Labour have jumped one point to 33 per cent in opinion ratings, with the Conservatives dropping to 27 per cent – their lowest support since the 2010 election.

No! There’s no evidence of support for Labour having “jumped one point”, even if you could describe such a marginal change as a “jump” in the first place.

Statistical illiteracy is as widespread amongst politicians as it is amongst journalists, but the fact that silly reports like this are commonplace doesn’t make them any less annoying. After all, the idea of sampling uncertainty isn’t all that difficult to understand. Is it?

And with so many more important things going on in the world that deserve better press coverage than they are getting, why does a “quality” newspaper waste its valuable column inches on this sort of twaddle?

## A Keno Game Problem

Posted in Cute Problems with tags , , , , on July 25, 2014 by telescoper

It’s been a while since I posted anything in the Cute Problems category so, given that I’ve got an unexpected gap of half an hour today, I thought I’d return to one of my side interests, the mathematics and games and gambling.

There is a variety of gambling games called Keno games in which a player selects (or is given) a set of numbers, some or all of which the player hopes to match with numbers drawn without replacement from a larger set of numbers. A common example of this type of game is Bingo. These games mostly originate in the 19th Century when travelling carnivals and funfairs often involved booths in which customers could gamble in various ways; similar things happen today, though perhaps with more sophisticated games.

In modern Casino Keno (sometimes called Race Horse Keno) a player receives a card with the numbers from 1 to 80 marked on it. He or she then marks a selection between 1 and 15 numbers and indicates the amount of a proposed bet; if n numbers are marked then the game is called `n-spot Keno’. Obviously, in 1-spot Keno, only one number is marked. Twenty numbers are then drawn without replacement from a set comprising the integers 1 to 80, using some form of randomizing device. If an appropriate proportion of the marked numbers are in fact drawn the player gets a payoff calculated by the House. Below you can see the usual payoffs for 10-spot Keno:

If fewer than five of your numbers are drawn, you lose your £1 stake. The expected gain on a £1 bet can be calculated by working out the probability of each of the outcomes listed above multiplied by the corresponding payoff, adding these together and then subtracting the probability of losing your stake (which corresponds to a gain of -£1). If this overall expected gain is negative (which it will be for any competently run casino) then the expected loss is called the house edge. In other words, if you can expect to lose £X on a £1 bet then X is the house edge.

What is the house edge for 10-spot Keno?

## Time for a Factorial Moment…

Posted in Bad Statistics with tags , , on July 22, 2014 by telescoper

Another very busy and very hot day so no time for a proper blog post. I suggest we all take a short break and enjoy a Factorial Moment:

I remember many moons ago spending ages calculating the factorial moments of the Poisson-Lognormal distribution, only to find that they were well known. If only I’d had Google then…

## Uncertain Attitudes

Posted in Bad Statistics, Politics with tags , , , , on May 28, 2014 by telescoper

It’s been a while since I posted anything in the bad statistics file, but an article in today’s Grauniad has now given me an opportunity to rectify that omission.
The piece concerned, entitled Racism on the rise in Britain is based on some new data from the British Social Attitudes survey; the full report can be found here (PDF). The main result is shown in this graph:

The version of this plot shown in the Guardian piece has the smoothed long-term trend (the blue curve, based on a five-year moving average of the data and clearly generally downward since 1986) removed.

In any case the report, as is sadly almost always the case in surveys of this kind, neglects any mention of the statistical uncertainty in the survey. In fact the last point is based on a sample of 2149 respondents. Suppose the fraction of the population describing themselves as having some prejudice is $p$. For a sample of size $n$ with $x$ respondents indicating that they describe themselves as “very prejudiced or a little prejudiced” then one can straightforwardly estimate $p \simeq x/n$. So far so good, as long as there is no bias induced by the form of the question asked nor in the selection of the sample…

However, a little bit of mathematics involving the binomial distribution yields an answer for the uncertainty in this estimate of p in terms of the sampling error:

$\sigma = \sqrt{\frac{p(1-p)}{n}}$

For the sample size given, and a value $p \simeq 0.35$ this amounts to a standard error of about 1%. About 95% of samples drawn from a population in which the true fraction is $p$ will yield an estimate within $p \pm 2\sigma$, i.e. within about 2% of the true figure. This is consistent with the “noise” on the unsmoothed curve and it shows that the year-on-year variation shown in the unsmoothed graph is largely attributable to sampling uncertainty; note that the sample sizes vary from year to year too. The results for 2012 and 2013 are 26% and 30% exactly, which differ by 4% and are therefore explicable solely in terms of sampling fluctuations.

I don’t know whether racial prejudice is on the rise in the UK or not, nor even how accurately such attitudes are measured by such surveys in the first place, but there’s no evidence in these data of any significant change over the past year. Given the behaviour of the smoothed data however, there is evidence that in the very long term the fraction of population identifying themselves as prejudiced is actually falling.

Newspapers however rarely let proper statistics get in the way of a good story, even to the extent of removing evidence that contradicts their own prejudice.

## Galaxies, Glow-worms and Chicken Eyes

Posted in Bad Statistics, The Universe and Stuff with tags , , , , , , , , on February 26, 2014 by telescoper

I just came across a news item based on a research article in Physical Review E by Jiao et al. with the abstract:

Optimal spatial sampling of light rigorously requires that identical photoreceptors be arranged in perfectly regular arrays in two dimensions. Examples of such perfect arrays in nature include the compound eyes of insects and the nearly crystalline photoreceptor patterns of some fish and reptiles. Birds are highly visual animals with five different cone photoreceptor subtypes, yet their photoreceptor patterns are not perfectly regular. By analyzing the chicken cone photoreceptor system consisting of five different cell types using a variety of sensitive microstructural descriptors, we find that the disordered photoreceptor patterns are “hyperuniform” (exhibiting vanishing infinite-wavelength density fluctuations), a property that had heretofore been identified in a unique subset of physical systems, but had never been observed in any living organism. Remarkably, the patterns of both the total population and the individual cell types are simultaneously hyperuniform. We term such patterns “multihyperuniform” because multiple distinct subsets of the overall point pattern are themselves hyperuniform. We have devised a unique multiscale cell packing model in two dimensions that suggests that photoreceptor types interact with both short- and long-ranged repulsive forces and that the resultant competition between the types gives rise to the aforementioned singular spatial features characterizing the system, including multihyperuniformity. These findings suggest that a disordered hyperuniform pattern may represent the most uniform sampling arrangement attainable in the avian system, given intrinsic packing constraints within the photoreceptor epithelium. In addition, they show how fundamental physical constraints can change the course of a biological optimization process. Our results suggest that multihyperuniform disordered structures have implications for the design of materials with novel physical properties and therefore may represent a fruitful area for future research.

The point made in the paper is that the photoreceptors found in the eyes of chickens possess a property called disordered hyperuniformity which means that the appear disordered on small scales but exhibit order over large distances. Here’s an illustration:

It’s an interesting paper, but I’d like to quibble about something it says in the accompanying news story. The caption with the above diagram states

Left: visual cell distribution in chickens; right: a computer-simulation model showing pretty much the exact same thing. The colored dots represent the centers of the chicken’s eye cells.

Well, as someone who has spent much of his research career trying to discern and quantify patterns in collections of points – in my case they tend to be galaxies rather than photoreceptors – I find it difficult to defend the use of the phrase “pretty much the exact same thing”. It’s notoriously difficult to look at realizations of stochastic point processes and decided whether they are statistically similar or not. For that you generally need quite sophisticated mathematical analysis.  In fact, to my eye, the two images above don’t look at all like “pretty much the exact same thing”. I’m not at all sure that the model works as well as it is claimed, as the statistical analysis presented in the paper is relatively simple: I’d need to see some more quantitative measures of pattern morphology and clustering, especially higher-order correlation functions, before I’m convinced.

Anyway, all this reminded me of a very old post of mine about the difficulty of discerning patterns in distributions of points. Take the two (not very well scanned)  images here as examples:

You will have to take my word for it that one of these is a realization of a two-dimensional Poisson point process (which is, in a well-defined sense completely “random”) and the other contains spatial correlations between the points. One therefore has a real pattern to it, and one is a realization of a completely unstructured random process.

I sometimes show this example in popular talks and get the audience to vote on which one is the random one. The vast majority usually think that the one on the right is the one that is random and the left one is the one with structure to it. It is not hard to see why. The right-hand pattern is very smooth (what one would naively expect for a constant probability of finding a point at any position in the two-dimensional space) , whereas the  left one seems to offer a profusion of linear, filamentary features and densely concentrated clusters.

In fact, it’s the left picture that was generated by a Poisson process using a Monte Carlo random number generator. All the structure that is visually apparent is imposed by our own sensory apparatus, which has evolved to be so good at discerning patterns that it finds them when they’re not even there!

The right process is also generated by a Monte Carlo technique, but the algorithm is more complicated. In this case the presence of a point at some location suppresses the probability of having other points in the vicinity. Each event has a zone of avoidance around it; the points are therefore anticorrelated. The result of this is that the pattern is much smoother than a truly random process should be. In fact, this simulation has nothing to do with galaxy clustering really. The algorithm used to generate it was meant to mimic the behaviour of glow-worms (a kind of beetle) which tend to eat each other if they get too close. That’s why they spread themselves out in space more uniformly than in the random pattern. In fact, the tendency displayed in this image of the points to spread themselves out more smoothly than a random distribution is in in some ways reminiscent of the chicken eye problem.

The moral of all this is that people are actually pretty hopeless at understanding what “really” random processes look like, probably because the word random is used so often in very imprecise ways and they don’t know what it means in a specific context like this. The point about random processes, even simpler ones like repeated tossing of a coin, is that coincidences happen much more frequently than one might suppose. By the same token, people are also pretty hopeless at figuring out whether two distributions of points resemble each other in some kind of statistical sense, because that can only be made precise if one defines some specific quantitative measure of clustering pattern, which is not easy to do.

## Double Indemnity – Statistics Noir

Posted in Film with tags , , , , on February 20, 2014 by telescoper

The other day I decided to treat myself by watching a DVD of the  film  Double Indemnity. It’s a great movie for many reasons, not least because when it was released in 1944 it immediately established much of the language and iconography of the genre that has come to be known as film noir, which I’ve written about on a number of occasions on this blog; see here for example. Like many noir movies the plot revolves around the destructive relationship between a femme fatale and male anti-hero and, as usual for the genre, the narrative strategy involves use of flashbacks and a first-person voice-over. The photography is done in such a way as to surround the protagonists with dark, threatening shadows. In fact almost every interior in the film (including the one shown in the clip below) has Venetian blinds for this purpose. These chiaroscuro lighting effects charge even the most mundane encounters with psychological tension or erotic suspense.

To the left is an example still from Double Indemnity which shows a number of trademark features. The shadows cast by venetian blinds on the wall, the cigarette being smoked by Barbara Stanwyck and the curious construction of the mise en scene are all very characteristic of the style. What is even more wonderful about this particular shot however is the way the shadow of Fred McMurray’s character enters the scene before he does. The Barbara Stanwyck character is just about to shoot him with a pearl-handled revolver; this image suggests that he is already on his way to the underworld as he enters the room.

I won’t repeat any more of the things I’ve already said about this great movie, but I will say a couple of things that struck me watching it again at the weekend. The first is that even after having seen it dozens of times of the year I still found it intense and gripping. The other is that I think one of the contributing factors to its greatness which is not often discussed is a wonderful cameo by Edward G Robinson , who steals every scene he appears in as the insurance investigator Barton Keyes. Here’s an example, which I’ve chosen because it provides an interesting illustration of the the scientific use of statistical information, another theme I’ve visited frequently on this blog: