## A Vaccination Fallacy

Posted in Bad Statistics, Covid-19 on June 27, 2021 by telescoper

I have been struck by the number of people upset by the latest analysis of SARS-Cov-2 “variants of concern” by Public Health England. In particular, the report states that over 40% of those dying from the so-called Delta Variant have had both vaccine jabs. I even saw some comments on social media from people saying that this proves that the vaccines are useless against this variant and that as a consequence they weren’t going to bother getting their second jab.

This is dangerous nonsense and I think it stems – as much dangerous nonsense does – from a misunderstanding of basic probability which comes up in a number of situations, including the Prosecutor’s Fallacy. I’ll try to clarify it here with a bit of probability theory. The same logic as the following applies if you specify serious illness or mortality, but I’ll keep it simple by just talking about contracting Covid-19. When I write about probabilities you can think of these as proportions within the population so I’ll use the terms probability and proportion interchangeably in the following.

Denote by P[C|V] the conditional probability that a fully vaccinated person becomes ill from Covid-19. That is considerably smaller than P[C| not V] (by a factor of ten or so given the efficacy of the vaccines). Vaccines do not however deliver perfect immunity so P[C|V]≠0.

Let P[V|C] be the conditional probability of a person with Covid-19 having been fully vaccinated. Or, if you prefer, the proportion of people with Covid-19 who are fully vaccinated.

Now the first thing to point out is that these conditional probabilities are emphatically not equal. The probability of a female person being pregnant is not the same as the probability of a pregnant person being female.

We can find the relationship between P[C|V] and P[V|C] using the joint probability P[V,C]=P[C,V] of a person having been fully vaccinated and contracting Covid-19. This can be decomposed in two ways: P[V,C]=P[V]P[C|V]=P[C]P[V|C]=P[C,V], where P[V] is the proportion of people fully vaccinated and P[C] is the proportion of people who have contracted Covid-19. This gives P[V|C]=P[V]P[C|V]/P[C].

This result is nothing more than the famous Bayes’ Theorem.

Now P[C] is difficult to know exactly because of variable testing rates and other selection effects but is presumably quite small. The total number of positive tests since the pandemic began in the UK is about 5M which is less than 10% of the population. The proportion of the population fully vaccinated on the other hand is known to be about 50% in the UK. We can be pretty sure therefore that P[V]»P[C]. This in turn means that P[V|C]»P[C|V].

In words this means that there is nothing to be surprised about in the fact that the proportion of people infected with Covid-19 who have been fully vaccinated is significantly larger than the probability of a vaccinated person catching Covid-19. As vaccination coverage rises, an ever larger share of the people catching Covid-19 will have been fully vaccinated.

(As a commenter below points out, in the limit when everyone has been vaccinated 100% of the people who catch Covid-19 will have been vaccinated. The point is that the number of people getting ill and dying will be lower than in an unvaccinated population.)
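To put rough numbers on this, here is a minimal sketch of the calculation. The two risk values are purely illustrative assumptions (chosen so that vaccination cuts the risk by the factor of ten mentioned above), not figures from the report, and the function name is hypothetical. P[C] is expanded over the vaccinated and unvaccinated groups, P[C]=P[V]P[C|V]+(1−P[V])P[C|not V].

```python
def share_of_cases_vaccinated(p_v, p_c_given_v, p_c_given_not_v):
    """Return P[V|C] from Bayes' theorem, expanding P[C] over V and not-V."""
    p_c = p_v * p_c_given_v + (1.0 - p_v) * p_c_given_not_v
    return p_v * p_c_given_v / p_c


# Assumed, illustrative risks: vaccination reduces the risk tenfold.
P_C_GIVEN_NOT_V = 0.05
P_C_GIVEN_V = 0.005

for coverage in (0.5, 0.9, 0.99, 1.0):
    share = share_of_cases_vaccinated(coverage, P_C_GIVEN_V, P_C_GIVEN_NOT_V)
    print(f"P[V] = {coverage:.2f}  ->  P[V|C] = {share:.3f}")
```

With these assumed inputs P[V|C] is always far larger than P[C|V]=0.005, and it climbs towards 100% as coverage approaches 100%, even though the vaccine is just as effective throughout.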

The proportion of those dying of Covid-19 who have been fully vaccinated will also be high, a point also made here.

It’s difficult to be quantitatively accurate here because there are other factors involved in the risk of becoming ill with Covid-19, chiefly age. The reason this poses a problem is that in many countries vaccinations have been given preferentially to those deemed to be at high risk. Younger people are at relatively low risk of serious illness or death from Covid-19 whether or not they are vaccinated compared to older people, but the latter are also more likely to have been vaccinated. To factor this into the calculation above requires an additional piece of conditioning information. We could express this crudely in terms of a binary condition High Risk (H) or Low Risk (L) and construct P(V|L,H) etc but I don’t have the time or information to do this.

So please don’t be taken in by this fallacy. Vaccines do work. Get your second jab (or your first if you haven’t done it yet). It might save your life.

## A Virus Testing Probability Puzzle

Posted in Cute Problems, mathematics on April 13, 2020 by telescoper

Here is a topical puzzle for you.

A test is designed to show whether or not a person is carrying a particular virus.

The test has only two possible outcomes, positive or negative.

If the person is carrying the virus the test has a 95% probability of giving a positive result.

If the person is not carrying the virus the test has a 95% probability of giving a negative result.

A given individual, selected at random, is tested and obtains a positive result. What is the probability that they are carrying the virus?

Update 1: the comments so far have correctly established that the answer is not what you might naively think (i.e. 95%) and that it depends on the fraction of people in the population actually carrying the virus. Suppose this is f. Now what is the answer?

Update 2: OK so we now have the probability for a fixed value of f. Suppose we know nothing about f in advance. Can we still answer the question?
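For the fixed-f case, the calculation can be sketched as follows. The names `sensitivity` and `specificity` are my own labels for the two 95% figures in the puzzle, and the values of f tried are arbitrary illustrations. The question in Update 2 is left open.

```python
from fractions import Fraction


def prob_carrier_given_positive(f,
                                sensitivity=Fraction(95, 100),
                                specificity=Fraction(95, 100)):
    """Bayes' theorem for P(carrier | positive test) at a fixed prevalence f."""
    true_positive = sensitivity * f                # carrier and tests positive
    false_positive = (1 - specificity) * (1 - f)   # non-carrier but tests positive
    return true_positive / (true_positive + false_positive)


for f in (Fraction(1, 1000), Fraction(1, 100), Fraction(1, 20), Fraction(1, 2)):
    p = prob_carrier_given_positive(f)
    print(f"f = {float(f):.3f}  ->  P(carrier | positive) = {float(p):.3f}")
```

When f is small most positive results are false positives, which is why the naive answer of 95% fails.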

## The First Bookie

Posted in Football, mathematics, Sport on April 24, 2019 by telescoper

I read an interesting piece in Sunday’s Observer which is mainly about the challenges facing the modern sports betting industry but which also included some interesting snippets about the history of gambling. One thing that I didn’t know before reading this article was that it is generally accepted that the first ever bookmaker was a chap called Harry Ogden who started business in the late 18th century on Newmarket Heath. Organized horse-racing had been going on for over a century by then, and gambling had co-existed with it, not always legally. Before Harry Ogden, however, the types of wager were very different from what we have nowadays. For one thing bets would generally be offered on one particular horse (the Favourite), against the field. There being only two outcomes these were generally even-money bets, and the wagers were made between individuals rather than being administered by a ‘turf accountant’.

Then up stepped Harry Ogden, who introduced the innovation of laying odds on every horse in a race. He set the odds based on his knowledge of the form of the different horses (i.e. on their results in previous races), using this data to estimate probabilities of success for each one. This kind of ‘book’, listing odds for all the runners in a race, rapidly became very popular and is still with us today. The way of specifying odds as fractions (e.g. 6/1 against, 7/1 on) derives from this period.

Ogden wasn’t interested in merely facilitating other people’s wagers: he wanted to make a profit out of this process and the system he put in place to achieve this survives to this day. In particular he introduced a version of the overround, which works as follows. I’ll use a simple example from football rather than horse-racing because I was thinking about it the other day while I was looking at the bookies’ odds on relegation from the Premiership.

Suppose there is a football match, which can result either in a HOME win, an AWAY win or a DRAW. Suppose the bookmaker’s expert analysts – modern bookmakers employ huge teams of these – judge the odds of these three outcomes to be: 1-1 (evens) on a HOME win, 2-1 against the DRAW and 5-1 against the AWAY win. The corresponding probabilities are: 1/2 for the HOME win, 1/3 for the DRAW and 1/6 for the AWAY win. Note that these add up to 100%, as they are meant to be probabilities and these are the only three possible outcomes. These are ‘true odds’.

Offering these probabilities as odds to punters would not guarantee a return for the bookie, who would instead change the odds so that the implied probabilities add up to more than 100%. In the case above the bookie’s odds might be: 4-6 for the HOME win; 6-4 for the DRAW and 4-1 against the AWAY win. The implied probabilities here are 3/5, 2/5 and 1/5 respectively, which add up to 120%, not 100%. The excess is the overround or ‘bookmaker’s margin’ – in this case 20%.
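The arithmetic above can be checked with a short sketch. It assumes odds written as "a-b" mean a-to-b against the outcome, so that the implied probability is b/(a+b). The function and variable names are hypothetical.

```python
from fractions import Fraction


def implied_probability(against, on):
    """Odds of `against`-to-`on` against an outcome imply probability on/(against+on)."""
    return Fraction(on, against + on)


# Odds from the football example, stored as (against, on) pairs.
true_odds = {"HOME": (1, 1), "DRAW": (2, 1), "AWAY": (5, 1)}
book_odds = {"HOME": (4, 6), "DRAW": (6, 4), "AWAY": (4, 1)}


def book_total(odds):
    """Sum of the implied probabilities over all outcomes in the book."""
    return sum(implied_probability(a, b) for a, b in odds.values())


overround = book_total(book_odds) - 1

print(book_total(true_odds))  # 1
print(book_total(book_odds))  # 6/5
print(overround)              # 1/5
```

The same `book_total` function works for a race with any number of runners, which is the situation Ogden actually faced.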

This is quite the opposite to the Dutch Book case I discussed here.

Harry Ogden applied his method to horse races with many more possible outcomes, but the principle is the same: work out your best estimate of the true odds then apply your margin to calculate the odds offered to the punter.

One thing this means is that you have to be careful if you want to estimate the probability of an event from a bookie’s odds. If they offer you even money then that does not mean that you have a 50-50 chance!

## A Problem of Sons

Posted in Cute Problems on January 31, 2019 by telescoper

I’m posting this in the Cute Problems folder, but I’m mainly putting it up here as a sort of experiment. This little puzzle was posted on Twitter by someone I follow and it got a huge number of responses (>25,000). I was fascinated by the replies, and I’m really interested to see whether the distribution of responses from readers of this blog is different.

Anyway, here it is, exactly as posted on Twitter:

Assume there is a 50:50 chance of any child being male or female.

Now assume four generations, all other things being equal.

What are the odds of a son being a son of a son of a son?

## The Problem with Odd Moments

Posted in Bad Statistics, Cute Problems, mathematics on July 9, 2018 by telescoper

Last week, realizing that it had been a while since I posted anything in the cute problems folder, I did a quick post before going to a meeting. Unfortunately, as a couple of people pointed out almost immediately, there was a problem with the question (a typo in the form of a misplaced bracket). I took the post offline until I could correct it and then promptly forgot about it. I remembered it yesterday so have now corrected it. I also added a useful integral as a hint at the end, because I’m a nice person. I suggest you start by evaluating the expectation value (i.e. the first-order moment). Answers to parts (2) and (3) through the comments box please!

SOLUTION: I’ll leave you to draw your own sketch but, as Anton correctly points out, this is a distribution that is asymmetric about its mean but has all odd-order moments (including the skewness) equal to zero. It therefore provides a counter-example to common assertions, e.g. that asymmetric distributions must have non-zero skewness. The function shown in the problem was originally given by Stieltjes, but a general discussion can be found in E. Churchill (1946) Information given by odd moments, Ann. Math. Statist. 17, 244-6. The paper is available online here.

## Joseph Bertrand and the Monty Hall Problem

Posted in Bad Statistics, History, mathematics on October 4, 2017 by telescoper

The death a few days ago of Monty Hall reminded me of something I was going to write about the Monty Hall Problem, as it did with another blogger I follow, namely that (unsurprisingly) Stigler’s Law of Eponymy applies to this problem.

The earliest version of the problem now called the Monty Hall Problem dates from a book, first published in 1889, called Calcul des probabilités written by Joseph Bertrand. It’s a very interesting book, containing much of specific interest to astronomers as well as general things for other scientists. You can read it all online here, if you can read French.

As it happens, I have a copy of the book and here is the relevant problem. If you click on the image it should be legible. It’s actually Problem 2 of Chapter 1, suggesting that it’s one of the easier, introductory questions. Interesting that it has endured so long, even if it has evolved slightly!

I won’t attempt a full translation into English, but the problem is worth describing as it is actually more interesting than the Monty Hall Problem (with the three doors). In the Bertrand version there are three apparently identical boxes (coffrets) each of which has two drawers (tiroirs). In each drawer of each box there is a medal. In the first box there are two gold medals. The second box contains two silver medals. The third box contains one gold and one silver.

The boxes are shuffled, and you pick a box ‘at random’ and open one drawer ‘randomly chosen’ from the two. What is the probability that the other drawer of the same box contains a medal that differs from the first?

Now the probability that you select a box with two different medals in the first place is just 1/3, as it has to be the third box: the other two contain identical medals.

However, once you open one drawer and find (say) a silver medal then the probability of the other one being different (i.e. gold) changes because the knowledge gained by opening the drawer eliminates (in this case) the possibility that you selected the first box (which has only gold medals in it). The probability of the two medals being different is therefore 1/2.

That’s a very rough translation of the part of Bertrand’s discussion on the first page. I leave it as an exercise for the reader to translate the second part!
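Readers who would rather check the argument than translate it can enumerate the possibilities directly. The sketch below is my own and is not taken from Bertrand's text. It assumes that each of the six drawers is equally likely to be the one opened. Under that assumption the count gives 1/3 rather than 1/2 once a silver medal has been seen, because the box with two silver medals can show silver in two ways while the mixed box can do so in only one.

```python
from fractions import Fraction

# The three boxes, each listed by the medals in its two drawers.
boxes = [("gold", "gold"), ("silver", "silver"), ("gold", "silver")]

# Every (box, drawer) choice is assumed equally likely.
# Each outcome records (medal seen, medal in the other drawer).
outcomes = [(box[i], box[1 - i]) for box in boxes for i in (0, 1)]


def prob_other_differs(seen):
    """P(other drawer holds a different medal | the opened drawer shows `seen`)."""
    matching = [pair for pair in outcomes if pair[0] == seen]
    differing = [pair for pair in matching if pair[0] != pair[1]]
    return Fraction(len(differing), len(matching))


print(prob_other_differs("silver"))  # 1/3
print(prob_other_differs("gold"))    # 1/3
```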

I just remembered that this is actually the same as the three-card problem I posted about here.

## Fear, Risk, Uncertainty and the European Union

Posted in Politics, Science Politics, The Universe and Stuff on April 11, 2016 by telescoper

I’ve been far too busy with work and other things to contribute as much as I’d like to the ongoing debate about the forthcoming referendum on Britain’s membership of the European Union. Hopefully I’ll get time for a few posts before June 23rd, which is when the United Kingdom goes to the polls.

For the time being, however, I’ll just make a quick comment about one phrase that is being bandied about in this context, namely Project Fear. As far as I am aware this expression first came up in the context of the 2014 referendum on Scottish independence, but it’s now being used by the “leave” campaign to describe some of the arguments used by the “remain” campaign. I’ve met this phrase myself rather often on social media such as Twitter, usually in use by a BrExit campaigner accusing me of scaremongering because I think there’s a significant probability that leaving the EU will cause the UK serious economic problems.

Can I prove that this is the case? No, of course not. Nobody will know unless and until we try leaving the EU. But my point is that there’s definitely a risk. It seems to me grossly irresponsible to argue – as some clearly are doing – that there is no risk at all.

This is all very interesting for those of us who work in university science departments because “Risk Assessments” are one of the things we teach our students to do as a matter of routine, especially in advance of experimental projects. In case you weren’t aware, a risk assessment is

…. a systematic examination of a task, job or process that you carry out at work for the purpose of; Identifying the significant hazards that are present (a hazard is something that has the potential to cause someone harm or ill health).

Perhaps we should change the name of our “Project Risk Assessments” to “Project Fear”?

I think this all demonstrates how very bad most people are at thinking rationally about uncertainty, to such an extent that even thinking about potential hazards is verboten. I’ve actually written a book about uncertainty in the physical sciences, partly in an attempt to counter the myth that science deals with absolute certainties. And if physics doesn’t, economics definitely can’t.

In this context it is perhaps worth mentioning the definitions of “uncertainty” and “risk” suggested by Frank Hyneman Knight in a book on economics called Risk, Uncertainty and Profit which seem to be in standard use in the social sciences. The distinction made there is that “risk” is “randomness” with “knowable probabilities”, whereas “uncertainty” involves “randomness” with “unknowable probabilities”.

I don’t like these definitions at all. For one thing they both involve a reference to “randomness”, a word which I don’t know how to define anyway; I’d be much happier to use “unpredictability”. In the context of BrExit there is unpredictability because we don’t have any hard information on which to base a prediction. Even more importantly, perhaps, I find the distinction between “knowable” and “unknowable” probabilities very problematic. One always knows something about a probability distribution, even if that something means that the distribution has to be very broad. And in any case these definitions imply that the probabilities concerned are “out there”, rather than being statements about a state of knowledge (or lack thereof). Sometimes we know what we know and sometimes we don’t, but there are more than two possibilities. As the great American philosopher and social scientist Donald Rumsfeld (Shurely Shome Mishtake? Ed) put it:

“…as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”

There may be a proper Bayesian formulation of the distinction between “risk” and “uncertainty” that involves a transition between prior-dominated (uncertain) and posterior-dominated (risky), but basically I don’t see any qualitative difference between the two from such a perspective.

The problem when it comes to the EU referendum is that probabilities of different outcomes are difficult to calculate because of the complexity of economics generally and the dynamics of trade within and beyond the European Union in particular. Moreover, probabilities need to be updated using quantitative evidence and we don’t actually have any of that. But it seems absurd to try to argue that there is neither any risk nor any uncertainty. Frankly, anyone who argues this is just being irrational.

Whether a risk is worth taking depends on the likely profit. Nobody has convinced me that the country as a whole will gain anything concrete if we leave the European Union, so the risk seems pointless. Cui Bono? I think you’ll find the answer to that among the hedge fund managers who are bankrolling the BrExit campaign…

## Life as a Condition of Cosmology

Posted in The Universe and Stuff on November 7, 2015 by telescoper

Trigger Warnings: Bayesian Probability and the Anthropic Principle!

Once upon a time I was involved in setting up a cosmology conference in Valencia (Spain). The principal advantage of being among the organizers of such a meeting is that you get to invite yourself to give a talk and to choose the topic. On this particular occasion, I deliberately abused my privilege and put myself on the programme to talk about the “Anthropic Principle”. I doubt if there is any subject more likely to polarize a scientific audience than this. About half the participants present in the meeting stayed for my talk. The other half ran screaming from the room. Hence the trigger warnings on this post. Anyway, I noticed a tweet this morning from Jon Butterworth advertising a new blog post of his on the very same subject so I thought I’d while away a rainy November afternoon with a contribution of my own.

In case you weren’t already aware, the Anthropic Principle is the name given to a class of ideas arising from the suggestion that there is some connection between the material properties of the Universe as a whole and the presence of human life within it. The name was coined by Brandon Carter in 1974 as a corrective to the “Copernican Principle” that man does not occupy a special place in the Universe. A naïve application of this latter principle to cosmology might lead us to think that we could have evolved in any of the myriad possible Universes described by the system of Friedmann equations. The Anthropic Principle denies this, because life could not have evolved in all possible versions of the Big Bang model. There are however many different versions of this basic idea that have different logical structures and indeed different degrees of credibility. It is not really surprising to me that there is such a controversy about this particular issue, given that so few physicists and astronomers take time to study the logical structure of the subject, and this is the only way to assess the meaning and explanatory value of propositions like the Anthropic Principle. My former PhD supervisor, John Barrow (who is quoted in Jon Butterworth’s post) wrote the definitive text on this topic together with Frank Tipler to which I refer you for more background. What I want to do here is to unpick this idea from a very specific perspective and show how it can be understood quite straightforwardly in terms of Bayesian reasoning. I’ll begin by outlining this form of inferential logic.

I’ll start with Bayes’ theorem which for three logical propositions (such as statements about the values of parameters in a theory) A, B and C can be written in the form $P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)$

where $K=P(A|C).$

This is (or should be!) uncontroversial as it is simply a result of the sum and product rules for combining probabilities. Notice, however, that I’ve not restricted it to two propositions A and B as is often done, but carried throughout an extra one (C). This is to emphasize the fact that, to a Bayesian, all probabilities are conditional on something; usually, in the context of data analysis this is a background theory that furnishes the framework within which measurements are interpreted. If you say this makes everything model-dependent, then I’d agree. But every interpretation of data in terms of parameters of a model is dependent on the model. It has to be. If you think it can be otherwise then I think you’re misguided.

In the equation, P(B|C) is the probability of B being true, given that C is true. The information C need not be definitely known, but perhaps assumed for the sake of argument. The left-hand side of Bayes’ theorem denotes the probability of B given both A and C, and so on. The presence of C has not changed anything, but is just there as a reminder that it all depends on what is being assumed in the background. The equation states a theorem that can be proved to be mathematically correct so it is – or should be – uncontroversial.

To a Bayesian, the entities A, B and C are logical propositions which can only be either true or false. The entities themselves are not blurred out, but we may have insufficient information to decide which of the two possibilities is correct. In this interpretation, P(A|C) represents the degree of belief that it is consistent to hold in the truth of A given the information C. Probability is therefore a generalization of the “normal” deductive logic expressed by Boolean algebra: the value “0” is associated with a proposition which is false and “1” denotes one that is true. Probability theory extends this logic to the intermediate case where there is insufficient information to be certain about the status of the proposition.

A common objection to Bayesian probability is that it is somehow arbitrary or ill-defined. “Subjective” is the word that is often bandied about. This is only fair to the extent that different individuals may have access to different information and therefore assign different probabilities. Given different information C and C′ the probabilities P(A|C) and P(A|C′) will be different. On the other hand, the same precise rules for assigning and manipulating probabilities apply as before. Identical results should therefore be obtained whether these are applied by any person, or even a robot, so that part isn’t subjective at all.

In fact I’d go further. I think one of the great strengths of the Bayesian interpretation is precisely that it does depend on what information is assumed. This means that such information has to be stated explicitly. The essential assumptions behind a result can be – and, regrettably, often are – hidden in frequentist analyses. Being a Bayesian forces you to put all your cards on the table.

To a Bayesian, probabilities are always conditional on other assumed truths. There is no such thing as an absolute probability, hence my alteration of the form of Bayes’ theorem to represent this. A probability such as P(A) has no meaning to a Bayesian: there is always conditioning information. For example, if I blithely assign a probability of 1/6 to each face of a dice, that assignment is actually conditional on me having no information to discriminate between the appearance of the faces, and no knowledge of the rolling trajectory that would allow me to make a prediction of its eventual resting position.

In the Bayesian framework, probability theory becomes not a branch of experimental science but a branch of logic. Like any branch of mathematics it cannot be tested by experiment but only by the requirement that it be internally self-consistent. This brings me to what I think is one of the most important results of twentieth century mathematics, but which is unfortunately almost unknown in the scientific community. In 1946, Richard Cox derived the unique generalization of Boolean algebra under the assumption that such a logic must involve associating a single number with any logical proposition. The result he got is beautiful and anyone with any interest in science should make a point of reading his elegant argument. It turns out that the only way to construct a consistent logic of uncertainty incorporating this principle is by using the standard laws of probability. There is no other way to reason consistently in the face of uncertainty than probability theory. Accordingly, probability theory always applies when there is insufficient knowledge for deductive certainty. Probability is inductive logic.

This is not just a nice mathematical property. This kind of probability lies at the foundations of a consistent methodological framework that not only encapsulates many common-sense notions about how science works, but also puts at least some aspects of scientific reasoning on a rigorous quantitative footing. This is an important weapon that should be used more often in the battle against the creeping irrationalism one finds in society at large.

To see how the Bayesian approach provides a methodology for science, let us consider a simple example. Suppose we have a hypothesis H (some theoretical idea that we think might explain some experiment or observation). We also have access to some data D, and we also adopt some prior information I (which might be the results of other experiments and observations, or other working assumptions). What we want to know is how strongly the data D supports the hypothesis H given our background assumptions I. To keep it easy, we assume that the choice is between whether H is true or H is false. In the latter case, “not-H” or H′ (for short) is true. If our experiment is at all useful we can construct P(D|HI), the probability that the experiment would produce the data set D if both our hypothesis and the conditional information are true.

The probability P(D|HI) is called the likelihood; to construct it we need to have some knowledge of the statistical errors produced by our measurement. Using Bayes’ theorem we can “invert” this likelihood to give P(H|DI), the probability that our hypothesis is true given the data and our assumptions. The result looks just like what we had in the first two equations: $P(H|DI) = K^{-1}P(H|I)P(D|HI) .$

Now we can expand the “normalising constant” K because we know that either H or H′ must be true. Thus $K=P(D|I)=P(H|I)P(D|HI)+P(H^{\prime}|I) P(D|H^{\prime}I)$

The P(H|DI) on the left-hand side of the first expression is called the posterior probability; the right-hand side involves P(H|I), which is called the prior probability and the likelihood P(D|HI). The principal controversy surrounding Bayesian inductive reasoning involves the prior and how to define it, which is something I’ll comment on in a future post.
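The two-hypothesis case is simple enough to sketch in a few lines. The prior and likelihood values below are invented for illustration and the function name is hypothetical. The only ingredients are the prior P(H|I) and the two likelihoods P(D|HI) and P(D|H′I).

```python
def posterior(prior_h, like_h, like_not_h):
    """P(H|DI) when the choice is between H and not-H."""
    # Normalising constant K = P(D|I), expanded over H and not-H.
    k = prior_h * like_h + (1.0 - prior_h) * like_not_h
    return prior_h * like_h / k


# The same data (four times more probable under H) with two different priors.
print(posterior(0.5, 0.8, 0.2))
print(posterior(0.01, 0.8, 0.2))
```

The second call shows how a small prior probability can leave the posterior small even when the data favour H, which is why the choice of prior attracts so much attention.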

The Bayesian recipe for testing a hypothesis assigns a large posterior probability to a hypothesis for which the product of the prior probability and the likelihood is large. It can be generalized to the case where we want to pick the best of a set of competing hypotheses, say H1 … Hn. Note that this need not be the set of all possible hypotheses, just those that we have thought about. We can only choose from what is available. The hypotheses may be relatively simple, such as that some particular parameter takes the value x, or they may be composite involving many parameters and/or assumptions. For instance, the Big Bang model of our universe is a very complicated hypothesis, or in fact a combination of hypotheses joined together, involving at least a dozen parameters which can’t be predicted a priori but which have to be estimated from observations.

The required result for multiple hypotheses is pretty straightforward: the sum of the two alternatives involved in K above simply becomes a sum over all possible hypotheses, so that $P(H_i|DI) = K^{-1}P(H_i|I)P(D|H_iI),$

and $K=P(D|I)=\sum_j P(H_j|I)P(D|H_jI).$

If the hypothesis concerns the value of a parameter – in cosmology this might be, e.g., the mean density of the Universe expressed by the density parameter Ω0 – then the allowed space of possibilities is continuous. The sum in the denominator should then be replaced by an integral, but conceptually nothing changes. Our “best” hypothesis is the one that has the greatest posterior probability.

From a frequentist stance the procedure is often instead to just maximize the likelihood. According to this approach the best theory is the one that makes the data most probable. This can be the same as the most probable theory, but only if the prior probability is constant; in general the probability of a model given the data is not the same as the probability of the data given the model. I’m amazed how many practising scientists make this error on a regular basis.
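A toy example with three competing hypotheses makes the distinction concrete. The numbers are made up and the names are hypothetical. The point is only that the hypothesis with the largest likelihood need not be the one with the largest posterior unless the prior is flat.

```python
def posteriors(priors, likelihoods):
    """P(H_i|DI) for a discrete set of hypotheses, normalised by K = P(D|I)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    k = sum(joint)
    return [j / k for j in joint]


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


priors = [0.7, 0.2, 0.1]        # assumed P(H_i|I)
likelihoods = [0.1, 0.3, 0.4]   # assumed P(D|H_i I)

post = posteriors(priors, likelihoods)
flat_post = posteriors([1 / 3, 1 / 3, 1 / 3], likelihoods)

print("maximum likelihood picks H", argmax(likelihoods) + 1)
print("maximum posterior picks  H", argmax(post) + 1)
print("with a flat prior it is  H", argmax(flat_post) + 1)
```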

The following figure might serve to illustrate the difference between the frequentist and Bayesian approaches. In the former case, everything is done in “data space” using likelihoods, and in the other we work throughout with probabilities of hypotheses, i.e. we think in hypothesis space. I find it interesting to note that most theorists that I know who work in cosmology are Bayesians and most observers are frequentists! As I mentioned above, it is the presence of the prior probability in the general formula that is the most controversial aspect of the Bayesian approach. The attitude of frequentists is often that this prior information is completely arbitrary or at least “model-dependent”. Being empirically-minded people, by and large, they prefer to think that measurements can be made and interpreted without reference to theory at all.

Assuming we can assign the prior probabilities in an appropriate way what emerges from the Bayesian framework is a consistent methodology for scientific progress. The scheme starts with the hardest part – theory creation. This requires human intervention, since we have no automatic procedure for dreaming up hypotheses from thin air. Once we have a set of hypotheses, we need data against which theories can be compared using their relative probabilities. The experimental testing of a theory can happen in many stages: the posterior probability obtained after one experiment can be fed in, as prior, into the next. The order of experiments does not matter. This all happens in an endless loop, as models are tested and refined by confrontation with experimental discoveries, and are forced to compete with new theoretical ideas. Often one particular theory emerges as most probable for a while, such as in particle physics where a “standard model” has been in existence for many years. But this does not make it absolutely right; it is just the best bet amongst the alternatives. Likewise, the Big Bang model does not represent the absolute truth, but is just the best available model in the face of the manifold relevant observations we now have concerning the Universe’s origin and evolution. The crucial point about this methodology is that it is inherently inductive: all the reasoning is carried out in “hypothesis space” rather than “observation space”. The primary form of logic involved is not deduction but induction. Science is all about inverse reasoning.
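The claim that the order of experiments does not matter can be checked numerically. The sketch below uses invented numbers and hypothetical names, and it assumes that the two data sets are independent given each hypothesis, so that each experiment contributes its own likelihood factor.

```python
def update(prior, likelihoods):
    """One Bayesian update: turn a prior over hypotheses into a posterior."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    k = sum(joint)
    return [j / k for j in joint]


prior = [0.5, 0.3, 0.2]       # assumed P(H_i|I)
like_d1 = [0.2, 0.5, 0.9]     # assumed P(D1|H_i I)
like_d2 = [0.7, 0.4, 0.1]     # assumed P(D2|H_i I)

# Feed the posterior from one experiment in as the prior for the next.
d1_then_d2 = update(update(prior, like_d1), like_d2)
d2_then_d1 = update(update(prior, like_d2), like_d1)

print(d1_then_d2)
print(d2_then_d1)
```

Both orderings agree (up to floating-point rounding) because the final posterior depends only on the product of the prior and all the likelihood factors.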

Now, back to the anthropic principle. The point is that we can observe that life exists in our Universe and this observation must be incorporated as conditioning information whenever we try to make inferences about cosmological models if we are to reason consistently. In other words, the existence of life is a datum that must be incorporated in the conditioning information I mentioned above.

Suppose we have a model of the Universe M that contains various parameters which can be fixed by some form of observation. Let U be the proposition that these parameters take specific values U1, U2, and so on. Anthropic arguments revolve around the existence of life, so let L be the proposition that intelligent life evolves in the Universe. Note that the word “anthropic” implies specifically human life, but many versions of the argument do not necessarily accommodate anything more complicated than a virus.

Using Bayes’ theorem we can write $P(U|L,M)=K^{-1} P(U|M)P(L|U,M)$

The dependence of the posterior probability P(U|L,M) on the likelihood P(L|U,M) demonstrates that the values of U for which P(L|U,M) is larger correspond to larger values of P(U|L,M); K is just a normalizing constant for the purpose of this argument. Since life is observed in our Universe the model-parameters which make life more probable must be preferred to those that make it less so. To go any further we need to say something about the likelihood and the prior. Here the complexity and scope of the model makes it virtually impossible to apply in detail the symmetry principles usually exploited to define priors for physical models. On the other hand, it seems reasonable to assume that the prior is broad rather than sharply peaked; if our prior knowledge of which universes are possible were so definite then we wouldn’t really be interested in knowing what observations could tell us. If now the likelihood is sharply peaked in U then this will be projected directly into the posterior distribution.
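As a rough numerical illustration of that last sentence, the sketch below takes U to be a single parameter on a grid between 0 and 1 (a drastic simplification), with a very broad prior and a sharply peaked likelihood. The Gaussian shapes, widths and peak positions are all assumptions chosen for the example, not anything derived from physics.

```python
import math

def gaussian(x, mu, sigma):
    """Unnormalised Gaussian profile."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Grid of values for a single hypothetical parameter U in [0, 1].
grid = [i / 1000 for i in range(1001)]

# Broad prior P(U|M): almost flat across the grid (assumed width 10).
prior = [gaussian(u, 0.5, 10.0) for u in grid]

# Sharply peaked likelihood P(L|U,M): life is probable only near U = 0.3.
likelihood = [gaussian(u, 0.3, 0.01) for u in grid]

# Posterior P(U|L,M) = K^{-1} P(U|M) P(L|U,M), with K fixed by normalisation.
unnormalised = [p * l for p, l in zip(prior, likelihood)]
K = sum(unnormalised)
posterior = [x / K for x in unnormalised]

peak = grid[max(range(len(grid)), key=lambda i: posterior[i])]
print(peak)
```

With a prior this broad the posterior peak sits where the likelihood peaks, which is all the argument in the text requires.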

We have to assign the likelihood using our knowledge of how galaxies, stars and planets form, how planets are distributed in orbits around stars, what conditions are needed for life to evolve, and so on. There are certainly many gaps in this knowledge. Nevertheless if any one of the steps in this chain of knowledge requires very finely-tuned parameter choices then we can marginalize over the remaining steps and still end up with a sharp peak in the remaining likelihood and so also in the posterior probability. For example, there are plausible reasons for thinking that intelligent life has to be carbon-based, and therefore evolve on a planet. It is reasonable to infer, therefore, that P(U|L,M) should prefer some values of U. This means that there is a correlation between the propositions U and L in the sense that knowledge of one should, through Bayesian reasoning, enable us to make inferences about the other.

It is very difficult to make this kind of argument rigorously quantitative, but I can illustrate how the argument works with a simplified example. Let us suppose that the relevant parameters contained in the set U include such quantities as Newton’s gravitational constant G, the charge on the electron e, and the mass of the proton m. These are usually termed fundamental constants. The argument above indicates that there might be a connection between the existence of life and the value that these constants jointly take. Moreover, there is no reason why this kind of argument should not be used to find the values of fundamental constants in advance of their measurement. The ordering of experiment and theory is merely an historical accident; the process is cyclical. An illustration of this type of logic is furnished by the case of a plant whose seeds germinate only after prolonged rain. A newly-germinated (and intelligent) specimen could either observe dampness in the soil directly, or infer it using its own knowledge coupled with the observation of its own germination. This type of argument, used properly, can be predictive and explanatory.

This argument is just one example of a number of its type, and it has clear (but limited) explanatory power. Indeed it represents a fruitful application of Bayesian reasoning. The question is how surprised we should be that the constants of nature are observed to have their particular values? That clearly requires a probability based answer. The smaller the probability of a specific joint set of values (given our prior knowledge) then the more surprised we should be to find them. But this surprise should be bounded in some way: the values have to lie somewhere in the space of possibilities. Our argument has not explained why life exists or even why the parameters take their values but it has elucidated the connection between two propositions. In doing so it has reduced the number of unexplained phenomena from two to one. But it still takes our existence as a starting point rather than trying to explain it from first principles.

Arguments of this type have been called the Weak Anthropic Principle by Brandon Carter and I do not believe there is any reason for them to be at all controversial. They are simply Bayesian arguments that treat the existence of life as an observation about the Universe, handled in Bayes’ theorem in the same way as all other relevant data and whatever other conditioning information we have. If more scientists knew about the inductive nature of their subject, then this type of logic would not have acquired the suspicious status that it currently has.

## Albert, Bernard and Bell’s Theorem

Posted in The Universe and Stuff on April 15, 2015 by telescoper

You’ve probably all heard of the little logic problem involving the mysterious Cheryl and her friends Albert and Bernard that went viral on the internet recently. I decided not to post about it directly because it’s already been done to death. It did however make me think that if people struggle so much with “ordinary” logic problems of this type it’s no wonder they are so puzzled by the kind of logical issues raised by quantum mechanics. Hence the motivation of updating a post I did quite a while ago. The question we’ll explore does not concern the date of Cheryl’s birthday but the spin of an electron.

To begin with, let me give a bit of physics background. Spin is a concept of fundamental importance in quantum mechanics, not least because it underlies our most basic theoretical understanding of matter. The standard model of particle physics divides elementary particles into two types, fermions and bosons, according to their spin.  One is tempted to think of  these elementary particles as little cricket balls that can be rotating clockwise or anti-clockwise as they approach an elementary batsman. But, as I hope to explain, quantum spin is not really like classical spin.

Take the electron, for example. The amount of spin an electron carries is quantized, so that it always has a magnitude which is ±1/2 (in units of Planck’s constant; all fermions have half-integer spin). In addition, according to quantum mechanics, the orientation of the spin is indeterminate until it is measured. Any particular measurement can only determine the component of spin in one direction. Let’s take as an example the case where the measuring device is sensitive to the z-component, i.e. spin in the vertical direction. An experiment on a single electron will lead to a definite outcome, which might be either “up” or “down” relative to this axis.

However, until one makes a measurement the state of the system is not specified and the outcome is consequently not predictable with certainty; there will be a probability of 50% for each possible outcome. We could write the state of the system (expressed by the spin part of its wavefunction ψ) prior to measurement in the form

|ψ> = (|↑> + |↓>)/√2

This gives me an excuse to use the rather beautiful “bra-ket” notation for the state of a quantum system, originally due to Paul Dirac. The two possibilities are “up” (↑) and “down” (↓) and they are contained within a “ket” (written |>), which is really just a shorthand for a wavefunction describing that particular aspect of the system. A “bra” would be of the form <|; for the mathematicians this represents the Hermitian conjugate of a ket. The √2 is there to ensure that the total probability of the spin being either up or down is 1, remembering that the probability is the squared modulus of the wavefunction. When we make a measurement we will get one of these two outcomes, with a 50% probability of each.

At the point of measurement the state changes: if we get “up” it becomes purely |↑>  and if the result is  “down” it becomes |↓>. Either way, the quantum state of the system has changed from a “superposition” state described by the equation above to an “eigenstate” which must be either up or down. This means that all subsequent measurements of the spin in this direction will give the same result: the wave-function has “collapsed” into one particular state. Incidentally, the general term for a two-state quantum system like this is a qubit, and it is the basis of the tentative steps that have been taken towards the construction of a quantum computer.

Notice that what is essential about this is the role of measurement. The collapse of  ψ seems to be an irreversible process, but the wavefunction itself evolves according to the Schrödinger equation, which describes reversible, Hamiltonian changes.  To understand what happens when the state of the wavefunction changes we need an extra level of interpretation beyond what the mathematics of quantum theory itself provides,  because we are generally unable to write down a wave-function that sensibly describes the system plus the measuring apparatus in a single form.

So far this all seems rather similar to the state of a fair coin: it has a 50-50 chance of being heads or tails, but the doubt is resolved when its state is actually observed. Thereafter we know for sure what it is. But this resemblance is only superficial. A coin only has heads or tails, but the spin of an electron doesn’t have to be just up or down. We could rotate our measuring apparatus by 90° and measure the spin to the left (←) or the right (→). In this case we still have to get a result which is a half-integer times Planck’s constant. It will have a 50-50 chance of being left or right that “becomes” one or the other when a measurement is made.

Now comes the real fun. Suppose we do a series of measurements on the same electron. First we start with an electron whose spin we know nothing about. In other words it is in a superposition state like that shown above. We then make a measurement in the vertical direction. Suppose we get the answer “up”. The electron is now in the eigenstate with spin “up”.

We then pass it through another measurement, but this time it measures the spin to the left or the right. The process of selecting the electron to be one with  spin in the “up” direction tells us nothing about whether the horizontal component of its spin is to the left or to the right. Theory thus predicts a 50-50 outcome of this measurement, as is observed experimentally.

Suppose we do such an experiment and establish that the electron’s spin vector is pointing to the left. Now our long-suffering electron passes into a third measurement which this time is again in the vertical direction. You might imagine that since we have already measured this component to be in the up direction, it would be in that direction again this time. In fact, this is not the case. The intervening measurement seems to “reset” the up-down component of the spin; the results of the third measurement are back at square one, with a 50-50 chance of getting up or down.
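For anyone who wants to check the arithmetic of this up, then left, then up-or-down sequence, here is a minimal sketch. It represents each spin state by its two amplitudes in the up/down basis and uses the rule that the probability of an outcome is the squared modulus of the overlap between the outcome state and the current state. Which of the two horizontal states gets called “left” and which “right” is a labelling assumption on my part.

```python
import math

def inner(a, b):
    """Overlap <a|b> of two spin states given as (up, down) amplitudes."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def prob(outcome, state):
    """Probability of finding `outcome` when the system is in `state`."""
    return abs(inner(outcome, state)) ** 2

s = 1 / math.sqrt(2)
up = (1 + 0j, 0j)
down = (0j, 1 + 0j)
right = (s + 0j, s + 0j)   # assumed labelling of the two horizontal states
left = (s + 0j, -s + 0j)

# First measurement gave "up", so the electron is now in the state `up`.
p_left_after_up = prob(left, up)   # second (horizontal) measurement
# Second measurement gave "left", so the state is now `left`.
p_up_after_left = prob(up, left)   # third (vertical) measurement
# For comparison: repeating the vertical measurement with nothing in between.
p_up_after_up = prob(up, up)

print(p_left_after_up, p_up_after_left, p_up_after_up)
```

The first two numbers come out at one half and the last at one: it is the intervening horizontal measurement that wipes out the earlier “up” result.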

This is just one example of the kind of irreducible “randomness” that seems to be inherent in quantum theory. However, if you think this is what people mean when they say quantum mechanics is weird, you’re quite mistaken. It gets much weirder than this! So far I have focussed on what happens to the description of single particles when quantum measurements are made. Although there seem to be subtle things going on, it is not really obvious that anything happening is very different from systems in which we simply lack the microscopic information needed to make a prediction with absolute certainty.

At the simplest level, the difference is that quantum mechanics gives us a theory for the wave-function which somehow lies at a more fundamental level of description than the usual way we think of probabilities. Probabilities can be derived mathematically from the wave-function, but there is more information in ψ than there is in $|\psi|^2$; the wave-function is a complex entity whereas the square of its amplitude is entirely real. If one can construct a system of two particles, for example, the resulting wave-function is obtained by superimposing the wave-functions of the individual particles, and probabilities are then obtained by squaring this joint wave-function. This will not, in general, give the same probability distribution as one would get by adding the one-particle probabilities because, for complex entities A and B,

$A^2+B^2 \neq (A+B)^2$

in general. To put this another way, one can write any complex number in the form a+ib (real part plus imaginary part) or, generally more usefully in physics, as $Re^{i\theta}$, where R is the amplitude and θ is called the phase. The square of the amplitude gives the probability associated with the wavefunction of a single particle, but in this case the phase information disappears; the truly unique character of quantum physics and how it impacts on probabilities of measurements only reveals itself when the phase information is retained. This generally requires two or more particles to be involved, as the absolute phase of a single-particle state is essentially impossible to measure.
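A small numerical sketch may help show where the phase comes in. It takes two complex amplitudes of equal size with a relative phase θ (the amplitude 1/√2 is an arbitrary choice for illustration) and compares the sum of their squared moduli with the squared modulus of their sum.

```python
import cmath
import math

def combined(theta):
    """Return (|A|^2 + |B|^2, |A + B|^2) for a relative phase theta."""
    A = cmath.rect(1 / math.sqrt(2), 0.0)    # amplitude R, phase 0
    B = cmath.rect(1 / math.sqrt(2), theta)  # same amplitude, phase theta
    return abs(A) ** 2 + abs(B) ** 2, abs(A + B) ** 2

for theta in (0.0, math.pi / 2, math.pi):
    separate, together = combined(theta)
    print(theta, separate, together)
```

The first quantity does not depend on θ at all, whereas the second ranges from twice as large (phases aligned) down to essentially zero (phases opposed). That dependence on relative phase is exactly the information that is thrown away when one keeps only the squared amplitudes.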

Finding situations where the quantum phase of a wave-function is important is not easy. It seems to be quite easy to disturb quantum systems in such a way that the phase information becomes scrambled, so testing the fundamental aspects of quantum theory requires considerable experimental ingenuity. But it has been done, and the results are astonishing.

Let us think about a very simple example of a two-component system: a pair of electrons. All we care about for the purpose of this experiment is the spin of the electrons so let us write the state of this system in terms of states such as |↑↓>, which I take to mean that the first particle has spin up and the second one has spin down. Suppose we can create this pair of electrons in a state where we know the total spin is zero. The electrons are indistinguishable from each other so until we make a measurement we don’t know which one is spinning up and which one is spinning down. The state of the two-particle system might be this:

|ψ> = (|↑↓> – |↓↑>)/√2

squaring this up would give a 50% probability of “particle one” being up and “particle two” being down and 50% for the contrary arrangement. This doesn’t look too different from the example I discussed above, but this duplex state exhibits a bizarre phenomenon known as quantum entanglement.

Suppose we start the system out in this state and then separate the two electrons without disturbing their spin states. Before making a measurement we really can’t say what the spins of the individual particles are: they are in a mixed state that is neither up nor down but a combination of the two possibilities. When they’re up, they’re up. When they’re down, they’re down. But when they’re only half-way up they’re in an entangled state.

If one of them passes through a vertical spin-measuring device we will then know that particle is definitely spin-up or definitely spin-down. Since we know the total spin of the pair is zero, then we can immediately deduce that the other one must be spinning in the opposite direction because we’re not allowed to violate the law of conservation of angular momentum: if Particle 1 turns out to be spin-up, Particle 2  must be spin-down, and vice versa. It is known experimentally that passing two electrons through identical spin-measuring gadgets gives  results consistent with this reasoning. So far there’s nothing so very strange in this.

The problem with entanglement lies in understanding what happens in reality when a measurement is done. Suppose we have two observers, Albert and Bernard, who are bored with Cheryl’s little games and have decided to do something interesting with their lives by becoming physicists. Each is equipped with a device that can measure the spin of an electron in any direction they choose. Particle 1 emerges from the source and travels towards Albert whereas particle 2 travels in Bernard’s direction. Before any measurement, the system is in an entangled superposition state. Suppose Albert decides to measure the spin of electron 1 in the z-direction and finds it spinning up. Immediately, the wave-function for electron 2 collapses into the down direction. If Albert had instead decided to measure spin in the left-right direction and found it “left”, a similar collapse would have occurred for particle 2, but this time putting it in the “right” direction.

Whatever Albert does, the result of any corresponding measurement made by Bernard has a definite outcome – the opposite to Albert’s result. So Albert’s decision whether to make a measurement up-down or left-right instantaneously transmits itself to Bernard who will find a consistent answer, if he makes the same measurement as Albert.

If, on the other hand, Albert makes an up-down measurement but Bernard measures left-right then Albert’s answer has no effect on Bernard, who has a 50% chance of getting “left” and 50% chance of getting “right”. The point is that whatever Albert decides to do, it has an immediate effect on the wave-function at Bernard’s position; the collapse of the wave-function induced by Albert immediately collapses the state measured by Bernard. How can particle 1 and particle 2 communicate in this way?

This riddle is the core of a thought experiment by Einstein, Podolsky and Rosen in 1935 which has deep implications for the nature of the information that is supplied by quantum mechanics. The essence of the EPR paradox is that each of the two particles – even if they are separated by huge distances – seems to know exactly what the other one is doing. Einstein called this “spooky action at a distance” and went on to point out that this type of thing simply could not happen in the usual calculus of random variables. His argument was later tightened considerably by John Bell in a form now known as Bell’s theorem.

To see how Bell’s theorem works, consider the following roughly analogous situation. Suppose we have two suspects in prison, say Albert and Bernard (presumably Cheryl grassed them up and has been granted immunity from prosecution). The two are taken apart to separate cells for individual questioning. We can allow them to use notes, electronic organizers, tablets of stone or anything to help them remember any agreed strategy they have concocted, but they are not allowed to communicate with each other once the interrogation has started. Each question they are asked has only two possible answers – “yes” or “no” – and there are only three possible questions. We can assume the questions are asked independently and in a random order to the two suspects.

When the questioning is over, the interrogators find that whenever they asked the same question, Albert and Bernard always gave the same answer, but when the question was different they only gave the same answer 25% of the time. What can the interrogators conclude?

This is a simple illustration of what in quantum mechanics is known as a Bell inequality. Albert and Bernard can only keep the number of such false agreements down to the measured level of 25% by cheating.
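The reason 25% is suspicious can be checked by brute force. Because the suspects always agree when asked the same question, any strategy fixed in advance amounts to a shared answer sheet with a “yes” or “no” against each of the three questions. The sketch below, which assumes that each pair of different questions is equally likely to be asked, runs through all eight possible answer sheets and finds how often the two suspects would agree when asked different questions.

```python
from fractions import Fraction
from itertools import product

def agreement_rate(plan):
    """Fraction of different-question pairs on which the answers agree."""
    pairs = [(i, j) for i in range(3) for j in range(3) if i != j]
    agree = sum(1 for i, j in pairs if plan[i] == plan[j])
    return Fraction(agree, len(pairs))

# Every possible shared answer sheet for the three questions.
rates = {plan: agreement_rate(plan)
         for plan in product(("yes", "no"), repeat=3)}

for plan, rate in rates.items():
    print(plan, rate)

lowest = min(rates.values())
print("lowest possible agreement rate:", lowest)
```

No answer sheet does better than one third, and choosing among sheets at random only averages these rates, so an honest (non-communicating) pair cannot get the agreement rate on different questions below about 33%. An observed 25% therefore breaks the bound.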

This example is directly analogous to the behaviour of the entangled quantum state described above under repeated interrogations about its spin in three different directions. The result of each measurement can only be either “yes” or “no”. Each individual answer (for each particle) is equally probable in this case; the same question always produces the same answer for both particles, but the probability of agreement for two different questions is indeed ¼ and not larger as would be expected if the answers were random. For example one could ask particle 1 “are you spinning up” and particle 2 “are you spinning to the right”? The probability of both producing an answer “yes” is 25% according to quantum theory but would be higher if the particles weren’t cheating in some way.
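Here is a sketch of where a figure of ¼ can come from in the quantum case. It rests on two assumptions that are mine rather than anything stated above: that the three “questions” correspond to spin measurements along three axes lying in one plane at 120° to each other, and that “giving the same answer” in the prison analogy corresponds to the two electrons giving opposite spins (which is what happens for the zero-total-spin state when the axes coincide). The code just evaluates the overlap of the state (|↑↓> – |↓↑>)/√2 with the relevant pairs of single-particle states.

```python
import math

def spin_states(theta):
    """(+, -) states along an axis at angle theta to z, as (up, down) amplitudes."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return (c, s), (-s, c)

def amplitude(a, b):
    """Overlap of the product state |a>|b> with (|ud> - |du>)/sqrt(2)."""
    return (a[0] * b[1] - a[1] * b[0]) / math.sqrt(2)

def prob_opposite(theta_a, theta_b):
    """Probability that the two measurements give opposite results."""
    a_plus, a_minus = spin_states(theta_a)
    b_plus, b_minus = spin_states(theta_b)
    return amplitude(a_plus, b_minus) ** 2 + amplitude(a_minus, b_plus) ** 2

axes = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]  # assumed: 120 degrees apart

same_axis = prob_opposite(axes[0], axes[0])
different_axes = prob_opposite(axes[0], axes[1])
print(same_axis, different_axes)
```

Under these assumptions the same question always produces matching answers (probability 1) while different questions match with probability ¼, below the one-third floor that any pre-arranged strategy can reach.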

Probably the most famous experiment of this type was done in the 1980s, by Alain Aspect and collaborators, involving entangled pairs of polarized photons (which are bosons), rather than electrons, primarily because these are easier to prepare.

The implications of quantum entanglement greatly troubled Einstein long before the EPR paradox. Indeed the interpretation of single-particle quantum measurement (which has no entanglement) was already troublesome. Just exactly how does the wave-function relate to the particle? What can one really say about the state of the particle before a measurement is made? What really happens when a wave-function collapses? These questions take us into philosophical territory that I have set foot in already; the difficult relationship between epistemological and ontological uses of probability theory.

Thanks largely to the influence of Niels Bohr, in the relatively early stages of quantum theory a standard approach to this question was adopted. In what became known as the  Copenhagen interpretation of quantum mechanics, the collapse of the wave-function as a result of measurement represents a real change in the physical state of the system. Before the measurement, an electron really is neither spinning up nor spinning down but in a kind of quantum purgatory. After a measurement it is released from limbo and becomes definitely something. What collapses the wave-function is something unspecified to do with the interaction of the particle with the measuring apparatus or, in some extreme versions of this doctrine, the intervention of human consciousness.

I find it amazing that such a view could have been held so seriously by so many highly intelligent people. Schrödinger hated this concept so much that he invented a thought-experiment of his own to poke fun at it. This is the famous “Schrödinger’s cat” paradox.

In a closed box there is a cat. Attached to the box is a device which releases poison into the box when triggered by a quantum-mechanical event, such as radiation produced by the decay of a radioactive substance. One can’t tell from the outside whether the poison has been released or not, so one doesn’t know whether the cat is alive or dead. When one opens the box, one learns the truth. Whether the cat has collapsed or not, the wave-function certainly does. At this point one is effectively making a quantum measurement so the wave-function of the cat is either “dead” or “alive” but before opening the box it must be in a superposition state. But do we really think the cat is neither dead nor alive? Isn’t it certainly one or the other, but that our lack of information prevents us from knowing which? And if this is true for a macroscopic object such as a cat, why can’t it be true for a microscopic system, such as that involving just a pair of electrons?

As I learned at a talk a while ago by the Nobel prize-winning physicist Tony Leggett – who has been collecting data on this – most physicists think Schrödinger’s cat is definitely alive or dead before the box is opened. However, most physicists don’t believe that an electron definitely spins either up or down before a measurement is made. But where does one draw the line between the microscopic and macroscopic descriptions of reality? If quantum mechanics works for 1 particle, does it work also for 10, 1000? Or, for that matter, $10^{23}$?

Most modern physicists eschew the Copenhagen interpretation in favour of one or other of two modern interpretations. One involves the concept of quantum decoherence, which is basically the idea that the phase information that is crucial to the underlying logic of quantum theory can be destroyed by the interaction of a microscopic system with one of larger size. In effect, this hides the quantum nature of macroscopic systems and allows us to use a more classical description for complicated objects. This certainly happens in practice, but this idea seems to me merely to defer the problem of interpretation rather than solve it. The fact that a large and complex system tends to hide its quantum nature from us does not in itself give us the right to have different interpretations of the wave-function for big things and for small things.

Another trendy way to think about quantum theory is the so-called Many-Worlds interpretation. This asserts that our Universe comprises an ensemble – sometimes called a multiverse – and  probabilities are defined over this ensemble. In effect when an electron leaves its source it travels through infinitely many paths in this ensemble of possible worlds, interfering with itself on the way. We live in just one slice of the multiverse so at the end we perceive the electron winding up at just one point on our screen. Part of this is to some extent excusable, because many scientists still believe that one has to have an ensemble in order to have a well-defined probability theory. If one adopts a more sensible interpretation of probability then this is not actually necessary; probability does not have to be interpreted in terms of frequencies. But the many-worlds brigade goes even further than this. They assert that these parallel universes are real. What this means is not completely clear, as one can never visit parallel universes other than our own …

It seems to me that none of these interpretations is at all satisfactory and, in the gap left by the failure to find a sensible way to understand “quantum reality”, there has grown a pathological industry of pseudo-scientific gobbledegook. Claims that entanglement is consistent with telepathy, that parallel universes are scientific truths, that consciousness is a quantum phenomenon abound in the New Age sections of bookshops but have no rational foundation. Physicists may complain about this, but they have only themselves to blame.

But there is one remaining possibility for an interpretation that has been unfairly neglected by quantum theorists despite – or perhaps because of – the fact that it is the closest of all to commonsense. This is the view that quantum mechanics is just an incomplete theory, and the reason it produces only a probabilistic description is that it does not provide sufficient information to make definite predictions. This line of reasoning has a distinguished pedigree, but fell out of favour after the arrival of Bell’s theorem and related issues. Early ideas on this theme revolved around the idea that particles could carry “hidden variables” whose behaviour we could not predict because our fundamental description is inadequate. In other words two apparently identical electrons are not really identical; something we cannot directly measure marks them apart. If this works then we can simply use probability theory to deal with inferences made on the basis of information that’s not sufficient for absolute certainty.

After Bell’s work, however, it became clear that these hidden variables must possess a very peculiar property if they are to describe our quantum world. The property of entanglement requires the hidden variables to be non-local. In other words, two electrons must be able to communicate their values faster than the speed of light. Putting this conclusion together with relativity leads one to deduce that the chain of cause and effect must break down: hidden variables are therefore acausal. This is such an unpalatable idea that it seems to many physicists to be even worse than the alternatives, but to me it seems entirely plausible that the causal structure of space-time must break down at some level. On the other hand, not all “incomplete” interpretations of quantum theory involve hidden variables.

One can think of this category of interpretation as involving an epistemological view of quantum mechanics. The probabilistic nature of the theory has, in some sense, a subjective origin. It represents deficiencies in our state of knowledge. The alternative Copenhagen and Many-Worlds views I discussed above differ greatly from each other, but each is characterized by the mistaken desire to put quantum mechanics – and, therefore, probability –  in the realm of ontology.

The idea that quantum mechanics might be incomplete  (or even just fundamentally “wrong”) does not seem to me to be all that radical. Although it has been very successful, there are sufficiently many problems of interpretation associated with it that perhaps it will eventually be replaced by something more fundamental, or at least different. Surprisingly, this is a somewhat heretical view among physicists: most, including several Nobel laureates, seem to think that quantum theory is unquestionably the most complete description of nature we will ever obtain. That may be true, of course. But if we never look any deeper we will certainly never know…

With the gradual re-emergence of Bayesian approaches in other branches of physics a number of important steps have been taken towards the construction of a truly inductive interpretation of quantum mechanics. This programme sets out to understand  probability in terms of the “degree of belief” that characterizes Bayesian probabilities. Recently, Christopher Fuchs, amongst others, has shown that, contrary to popular myth, the role of probability in quantum mechanics can indeed be understood in this way and, moreover, that a theory in which quantum states are states of knowledge rather than states of reality is complete and well-defined. I am not claiming that this argument is settled, but this approach seems to me by far the most compelling and it is a pity more people aren’t following it up…

## Bayes, Laplace and Bayes’ Theorem

Posted in Bad Statistics on October 1, 2014 by telescoper

A  couple of interesting pieces have appeared which discuss Bayesian reasoning in the popular media. One is by Jon Butterworth in his Grauniad science blog and the other is a feature article in the New York Times. I’m in early today because I have an all-day Teaching and Learning Strategy Meeting so before I disappear for that I thought I’d post a quick bit of background.

One way to get to Bayes’ Theorem is by starting with $P(A|C)P(B|AC)=P(B|C)P(A|BC)=P(AB|C)$

where I refer to three logical propositions A, B and C and the vertical bar “|” denotes conditioning, i.e. $P(A|B)$ means the probability of A being true given the assumed truth of B; “AB” means “A and B”, etc. This basically follows from the fact that “A and B” must always be equivalent to “B and A”.  Bayes’ theorem  then follows straightforwardly as $P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)$

where $K=P(A|C).$
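As a quick sanity check of these relations, the sketch below uses a small made-up joint distribution for A and B (the background information C is taken to be whatever fixed the table in the first place) and verifies both the product rule and the resulting form of Bayes’ theorem using exact fractions. The numbers in the table are purely illustrative.

```python
from fractions import Fraction

# Hypothetical joint probabilities P(A, B | C), keyed by (A true?, B true?).
joint = {
    (True, True): Fraction(3, 20),
    (True, False): Fraction(5, 20),
    (False, True): Fraction(2, 20),
    (False, False): Fraction(10, 20),
}

p_a = sum(v for (a, b), v in joint.items() if a)   # P(A|C)
p_b = sum(v for (a, b), v in joint.items() if b)   # P(B|C)
p_ab = joint[(True, True)]                         # P(AB|C)

p_b_given_a = p_ab / p_a                           # P(B|AC)
p_a_given_b = p_ab / p_b                           # P(A|BC)

# Product rule: both decompositions give the joint probability.
product_rule_holds = (p_a * p_b_given_a == p_b * p_a_given_b == p_ab)

# Bayes' theorem with K = P(A|C).
bayes = p_b * p_a_given_b / p_a

print(product_rule_holds, p_b_given_a, bayes)
```

Both routes to P(B|AC) give the same fraction, as they must.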

Many versions of this, including the one in Jon Butterworth’s blog, exclude the third proposition and refer to A and B only. I prefer to keep an extra one in there to remind us that every statement about probability depends on information either known or assumed to be known; any proper statement of probability requires this information to be stated clearly and used appropriately but sadly this requirement is frequently ignored.

Although this is called Bayes’ theorem, the general form of it as stated here was actually first written down not by Bayes, but by Laplace. What Bayes did was derive the special case of this formula for “inverting” the binomial distribution. This distribution gives the probability of x successes in n independent “trials” each having the same probability of success, p; each “trial” has only two possible outcomes (“success” or “failure”). Trials like this are usually called Bernoulli trials, after Jacob Bernoulli. If we ask the question “what is the probability of exactly x successes from the possible n?”, the answer is given by the binomial distribution: $P_n(x|n,p)= C(n,x) p^x (1-p)^{n-x}$

where $C(n,x)= \frac{n!}{x!(n-x)!}$

is the number of distinct combinations of x objects that can be drawn from a pool of n.

You can probably see immediately how this arises. The probability of x consecutive successes is p multiplied by itself x times, or $p^x$. The probability of (n-x) successive failures is similarly $(1-p)^{n-x}$. The last two terms basically therefore tell us the probability that we have exactly x successes (since there must be n-x failures). The combinatorial factor in front takes account of the fact that the ordering of successes and failures doesn’t matter.
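For anyone who wants to play with the formula, here is a short Python sketch of it. The function name `binomial_pmf` is my own choice, and the brute-force count is only there to illustrate where the combinatorial factor comes from.

```python
from itertools import product
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Brute-force check of the combinatorial factor: list every possible
# sequence of n = 4 outcomes (1 = success, 0 = failure) and count the
# ones containing exactly x = 2 successes.
n, x = 4, 2
orderings = sum(1 for seq in product([0, 1], repeat=n) if sum(seq) == x)
assert orderings == comb(n, x)

# The probabilities for x = 0, 1, ..., n must add up to one.
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
assert abs(total - 1.0) < 1e-12

# Probability of exactly 5 heads in 10 tosses of a fair coin.
five_heads = binomial_pmf(5, 10, 0.5)
print(five_heads)
```

Each individual ordering with x successes has the same probability, $p^x(1-p)^{n-x}$, so multiplying by the number of orderings gives the probability of x successes in any order.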

The binomial distribution applies, for example, to repeated tosses of a coin, in which case p is taken to be 0.5 for a fair coin. A biased coin might have a different value of p, but as long as the tosses are independent the formula still applies. The binomial distribution also applies to problems involving drawing balls from urns: it works exactly if the balls are replaced in the urn after each draw, but it also applies approximately without replacement, as long as the number of draws is much smaller than the number of balls in the urn. I leave it as an exercise to calculate the expectation value of the binomial distribution, but the result is not surprising: E(X)=np. If you toss a fair coin ten times the expectation value for the number of heads is 10 times 0.5, which is five. No surprise there. After another bit of maths, the variance of the distribution can also be found. It is np(1-p).
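If you would rather check these claims numerically than do the maths, the following Python sketch sums over the distribution directly to get the mean and variance, and compares drawing with and without replacement. The urn of 10,000 balls, half of them red, is an invented example, and the function names are my own.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Exactly x successes in n independent trials (draws WITH replacement)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def urn_pmf(x, n, balls, red):
    """Exactly x red balls in n draws WITHOUT replacement
    from an urn holding `balls` balls, `red` of which are red."""
    return comb(red, x) * comb(balls - red, n - x) / comb(balls, n)

n, p = 10, 0.5
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Expectation value and variance by direct summation over x.
mean = sum(x * q for x, q in enumerate(pmf))
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))
assert abs(mean - n * p) < 1e-12
assert abs(variance - n * p * (1 - p)) < 1e-12

# Hypothetical urn: 10,000 balls, half of them red, and only 10 draws.
# With so few draws the without-replacement probabilities should be
# very close to the binomial ones.
max_gap = max(abs(urn_pmf(x, n, 10_000, 5_000) - pmf[x]) for x in range(n + 1))
print(mean, variance, max_gap)
```

Shrinking the urn towards the number of draws makes the gap grow, which is the sense in which the approximation needs the number of draws to be much smaller than the number of balls.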

So this gives us the probability of x given a fixed value of p. Bayes was interested in the inverse of this result, the probability of p given x. In other words, Bayes was interested in the answer to the question “If I perform n independent trials and get x successes, what is the probability distribution of p?”. This is a classic example of inverse reasoning, in that it involves turning something like P(A|BC) into something like P(B|AC), which is what is achieved by the theorem stated at the start of this post.
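Here is a rough numerical sketch of that inversion in Python. It assumes a flat prior on p between 0 and 1, which is an assumption made here for the sake of illustration, and it approximates the integrals by sums over a fine grid. The values n = 10 and x = 7 are invented.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, x = 10, 7          # hypothetical data: 7 successes in 10 trials
steps = 10_000        # number of grid cells covering 0 <= p <= 1
width = 1.0 / steps
grid = [(i + 0.5) * width for i in range(steps)]  # midpoint of each cell

# Flat prior (assumed): P(p | C) = 1 everywhere on [0, 1], so
# prior times likelihood is just the binomial formula read as a function of p.
prior_times_likelihood = [1.0 * binomial_pmf(x, n, p) for p in grid]

# The normalising constant K = P(x | n, C) is the integral over p.
K = sum(prior_times_likelihood) * width

# Posterior probability density of p given x.
posterior = [v / K for v in prior_times_likelihood]

posterior_mean = sum(p * d for p, d in zip(grid, posterior)) * width
print(K, posterior_mean)
```

With a flat prior the exact answers are known: the posterior is a Beta distribution with parameters x+1 and n-x+1, so K = 1/(n+1) and the posterior mean is (x+1)/(n+2), which is 2/3 for these numbers. The grid sums should agree with both to good accuracy.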

Bayes got the correct answer for his problem, eventually, but by very convoluted reasoning. In my opinion it is quite difficult to justify the name Bayes’ theorem based on what he actually did, although Laplace did specifically acknowledge this contribution when he derived the general result later, which is no doubt why the theorem is always named in Bayes’ honour.

This is not the only example in science where the wrong person’s name is attached to a result or discovery. Stigler’s Law of Eponymy strikes again! So who was the mysterious mathematician behind this result? Thomas Bayes was born in 1702, son of Joshua Bayes, who was a Fellow of the Royal Society (FRS) and one of the very first nonconformist ministers to be ordained in England. Thomas was himself ordained and for a while worked with his father in the Presbyterian Meeting House in Leather Lane, near Holborn in London. Around 1734 he became a minister in Tunbridge Wells, in Kent. He retired from the church in 1752 and died in 1761. Thomas Bayes didn’t publish a single paper on mathematics in his own name during his lifetime but was elected a Fellow of the Royal Society (FRS) in 1742.

The paper containing the theorem that now bears his name was published posthumously in the Philosophical Transactions of the Royal Society of London in 1763. In his great Philosophical Essay on Probabilities Laplace wrote:

Bayes, in the Transactions Philosophiques of the Year 1763, sought directly the probability that the possibilities indicated by past experiences are comprised within given limits; and he has arrived at this in a refined and very ingenious manner, although a little perplexing.

The reasoning in the 1763 paper is indeed perplexing, and I remain convinced that the general form we now refer to as Bayes’ Theorem should really be called Laplace’s Theorem. Nevertheless, Bayes did establish an extremely important principle that is reflected in the title of the New York Times piece I referred to at the start of this piece. In a nutshell this is that probabilities of future events can be updated on the basis of past measurements or, as I prefer to put it, “one person’s posterior is another’s prior”.