## Bayes, Laplace and Bayes’ Theorem

Posted in Bad Statistics with tags , , , , , , , , on October 1, 2014 by telescoper

A  couple of interesting pieces have appeared which discuss Bayesian reasoning in the popular media. One is by Jon Butterworth in his Grauniad science blog and the other is a feature article in the New York Times. I’m in early today because I have an all-day Teaching and Learning Strategy Meeting so before I disappear for that I thought I’d post a quick bit of background.

One way to get to Bayes’ Theorem is by starting with

$P(A|C)P(B|AC)=P(B|C)P(A|BC)=P(AB|C)$

where I refer to three logical propositions A, B and C and the vertical bar “|” denotes conditioning, i.e. $P(A|B)$ means the probability of A being true given the assumed truth of B; “AB” means “A and B”, etc. This basically follows from the fact that “A and B” must always be equivalent to “B and A”.  Bayes’ theorem  then follows straightforwardly as

$P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)$

where

$K=P(A|C).$

Many versions of this, including the one in Jon Butterworth’s blog, exclude the third proposition and refer to A and B only. I prefer to keep an extra one in there to remind us that every statement about probability depends on information either known or assumed to be known; any proper statement of probability requires this information to be stated clearly and used appropriately but sadly this requirement is frequently ignored.

Although this is called Bayes’ theorem, the general form of it as stated here was actually first written down not by Bayes, but by Laplace. What Bayes did was derive the special case of this formula for “inverting” the binomial distribution. This distribution gives the probability of x successes in n independent “trials” each having the same probability of success, p; each “trial” has only two possible outcomes (“success” or “failure”). Trials like this are usually called Bernoulli trials, after Daniel Bernoulli. If we ask the question “what is the probability of exactly x successes from the possible n?”, the answer is given by the binomial distribution:

$P_n(x|n,p)= C(n,x) p^x (1-p)^{n-x}$

where

$C(n,x)= \frac{n!}{x!(n-x)!}$

is the number of distinct combinations of x objects that can be drawn from a pool of n.

You can probably see immediately how this arises. The probability of x consecutive successes is p multiplied by itself x times, or px. The probability of (n-x) successive failures is similarly (1-p)n-x. The last two terms basically therefore tell us the probability that we have exactly x successes (since there must be n-x failures). The combinatorial factor in front takes account of the fact that the ordering of successes and failures doesn’t matter.

The binomial distribution applies, for example, to repeated tosses of a coin, in which case p is taken to be 0.5 for a fair coin. A biased coin might have a different value of p, but as long as the tosses are independent the formula still applies. The binomial distribution also applies to problems involving drawing balls from urns: it works exactly if the balls are replaced in the urn after each draw, but it also applies approximately without replacement, as long as the number of draws is much smaller than the number of balls in the urn. I leave it as an exercise to calculate the expectation value of the binomial distribution, but the result is not surprising: E(X)=np. If you toss a fair coin ten times the expectation value for the number of heads is 10 times 0.5, which is five. No surprise there. After another bit of maths, the variance of the distribution can also be found. It is np(1-p).

So this gives us the probability of x given a fixed value of p. Bayes was interested in the inverse of this result, the probability of p given x. In other words, Bayes was interested in the answer to the question “If I perform n independent trials and get x successes, what is the probability distribution of p?”. This is a classic example of inverse reasoning, in that it involved turning something like P(A|BC) into something like P(B|AC), which is what is achieved by the theorem stated at the start of this post.

Bayes got the correct answer for his problem, eventually, but by very convoluted reasoning. In my opinion it is quite difficult to justify the name Bayes’ theorem based on what he actually did, although Laplace did specifically acknowledge this contribution when he derived the general result later, which is no doubt why the theorem is always named in Bayes’ honour.

This is not the only example in science where the wrong person’s name is attached to a result or discovery. Stigler’s Law of Eponymy strikes again!

So who was the mysterious mathematician behind this result? Thomas Bayes was born in 1702, son of Joshua Bayes, who was a Fellow of the Royal Society (FRS) and one of the very first nonconformist ministers to be ordained in England. Thomas was himself ordained and for a while worked with his father in the Presbyterian Meeting House in Leather Lane, near Holborn in London. In 1720 he was a minister in Tunbridge Wells, in Kent. He retired from the church in 1752 and died in 1761. Thomas Bayes didn’t publish a single paper on mathematics in his own name during his lifetime but was elected a Fellow of the Royal Society (FRS) in 1742.

The paper containing the theorem that now bears his name was published posthumously in the Philosophical Transactions of the Royal Society of London in 1763. In his great Philosophical Essay on Probabilities Laplace wrote:

Bayes, in the Transactions Philosophiques of the Year 1763, sought directly the probability that the possibilities indicated by past experiences are comprised within given limits; and he has arrived at this in a refined and very ingenious manner, although a little perplexing.

The reasoning in the 1763 paper is indeed perplexing, and I remain convinced that the general form we now we refer to as Bayes’ Theorem should really be called Laplace’s Theorem. Nevertheless, Bayes did establish an extremely important principle that is reflected in the title of the New York Times piece I referred to at the start of this piece. In a nutshell this is that probabilities of future events can be updated on the basis of past measurements or, as I prefer to put it, “one person’s posterior is another’s prior”.

## Bunn on Bayes

Posted in Bad Statistics with tags , , , , on June 17, 2013 by telescoper

Just a quickie to advertise a nice blog post by Ted Bunn in which he takes down an article in Science by Bradley Efron, which is about frequentist statistics. I’ll leave it to you to read his piece, and the offending article, but couldn’t resist nicking his little graphic that sums up the matter for me:

The point is that as scientists we are interested in the probability of a model (or hypothesis)  given the evidence (or data) arising from an experiment (or observation). This requires inverse, or inductive, reasoning and it is therefore explicitly Bayesian. Frequentists focus on a different question, about the probability of the data given the model, which is not the same thing at all, and is not what scientists actually need. There are examples in which a frequentist method accidentally gives the correct (i.e. Bayesian) answer, but they are nevertheless still answering the wrong question.

I will make one further comment arising from the following excerpt from the Efron piece.

Bayes’ 1763 paper was an impeccable exercise in probability theory. The trouble and the subsequent busts came from overenthusiastic application of the theorem in the absence of genuine prior information, with Pierre-Simon Laplace as a prime violator.

I think this is completely wrong. There is always prior information, even if it is minimal, but the point is that frequentist methods always ignore it even if it is “genuine” (whatever that means). It’s not always easy to encode this information in a properly defined prior probability of course, but at least a Bayesian will not deliberately answer the wrong question in order to avoid thinking about it.

It is ironic that the pioneers of probability theory, such as Laplace, adopted a Bayesian rather than frequentist interpretation for his probabilities. Frequentism arose during the nineteenth century and held sway until recently. I recall giving a conference talk about Bayesian reasoning only to be heckled by the audience with comments about “new-fangled, trendy Bayesian methods”. Nothing could have been less apt. Probability theory pre-dates the rise of sampling theory and all the frequentist-inspired techniques that modern-day statisticians like to employ and which, in my opinion, have added nothing but confusion to the scientific analysis of statistical data.

## Bayes and his Theorem

Posted in Bad Statistics with tags , , , , , , on November 23, 2010 by telescoper

My earlier post on Bayesian probability seems to have generated quite a lot of readers, so this lunchtime I thought I’d add a little bit of background. The previous discussion started from the result

$P(B|AC) = K^{-1}P(B|C)P(A|BC) = K^{-1} P(AB|C)$

where

$K=P(A|C).$

Although this is called Bayes’ theorem, the general form of it as stated here was actually first written down, not by Bayes but by Laplace. What Bayes’ did was derive the special case of this formula for “inverting” the binomial distribution. This distribution gives the probability of x successes in n independent “trials” each having the same probability of success, p; each “trial” has only two possible outcomes (“success” or “failure”). Trials like this are usually called Bernoulli trials, after Daniel Bernoulli. If we ask the question “what is the probability of exactly x successes from the possible n?”, the answer is given by the binomial distribution:

$P_n(x|n,p)= C(n,x) p^x (1-p)^{n-x}$

where

$C(n,x)= n!/x!(n-x)!$

is the number of distinct combinations of x objects that can be drawn from a pool of n.

You can probably see immediately how this arises. The probability of x consecutive successes is p multiplied by itself x times, or px. The probability of (n-x) successive failures is similarly (1-p)n-x. The last two terms basically therefore tell us the probability that we have exactly x successes (since there must be n-x failures). The combinatorial factor in front takes account of the fact that the ordering of successes and failures doesn’t matter.

The binomial distribution applies, for example, to repeated tosses of a coin, in which case p is taken to be 0.5 for a fair coin. A biased coin might have a different value of p, but as long as the tosses are independent the formula still applies. The binomial distribution also applies to problems involving drawing balls from urns: it works exactly if the balls are replaced in the urn after each draw, but it also applies approximately without replacement, as long as the number of draws is much smaller than the number of balls in the urn. I leave it as an exercise to calculate the expectation value of the binomial distribution, but the result is not surprising: E(X)=np. If you toss a fair coin ten times the expectation value for the number of heads is 10 times 0.5, which is five. No surprise there. After another bit of maths, the variance of the distribution can also be found. It is np(1-p).

So this gives us the probability of x given a fixed value of p. Bayes was interested in the inverse of this result, the probability of p given x. In other words, Bayes was interested in the answer to the question “If I perform n independent trials and get x successes, what is the probability distribution of p?”. This is a classic example of inverse reasoning. He got the correct answer, eventually, but by very convoluted reasoning. In my opinion it is quite difficult to justify the name Bayes’ theorem based on what he actually did, although Laplace did specifically acknowledge this contribution when he derived the general result later, which is no doubt why the theorem is always named in Bayes’ honour.

This is not the only example in science where the wrong person’s name is attached to a result or discovery. In fact, it is almost a law of Nature that any theorem that has a name has the wrong name. I propose that this observation should henceforth be known as Coles’ Law.

So who was the mysterious mathematician behind this result? Thomas Bayes was born in 1702, son of Joshua Bayes, who was a Fellow of the Royal Society (FRS) and one of the very first nonconformist ministers to be ordained in England. Thomas was himself ordained and for a while worked with his father in the Presbyterian Meeting House in Leather Lane, near Holborn in London. In 1720 he was a minister in Tunbridge Wells, in Kent. He retired from the church in 1752 and died in 1761. Thomas Bayes didn’t publish a single paper on mathematics in his own name during his lifetime but despite this was elected a Fellow of the Royal Society (FRS) in 1742. Presumably he had Friends of the Right Sort. He did however write a paper on fluxions in 1736, which was published anonymously. This was probably the grounds on which he was elected an FRS.

The paper containing the theorem that now bears his name was published posthumously in the Philosophical Transactions of the Royal Society of London in 1764.

P.S. I understand that the authenticity of the picture is open to question. Whoever it actually is, he looks  to me a bit like Laurence Olivier…

## A Little Bit of Chaos

Posted in The Universe and Stuff with tags , , , , , , , , on November 21, 2009 by telescoper

The era of modern physics could be said to have begun in 1687 with the publication by Sir Isaac Newton of his great Philosophiae Naturalis Principia Mathematica, (Principia for short). In this magnificent volume, Newton presented a mathematical theory of all known forms of motion and, for the first time, gave clear definitions of the concepts of force and momentum. Within this general framework he derived a new theory of Universal Gravitation and used it to explain the properties of planetary orbits previously discovered but unexplained by Johannes Kepler. The classical laws of motion and his famous “inverse square law” of gravity have been superseded by more complete theories when dealing with very high speeds or very strong gravity, but they nevertheless continue supply a very accurate description of our everyday physical world.

Newton’s laws have a rigidly deterministic structure. What I mean by this is that, given precise information about the state of a system at some time then one can use Newtonian mechanics to calculate the precise state of the system at any later time. The orbits of the planets, the positions of stars in the sky, and the occurrence of eclipses can all be predicted to very high accuracy using this theory.

At this point it is useful to mention that most physicists do not use Newton’s laws in the form presented in the Principia, but in a more elegant language named after Sir William Rowan Hamilton. The point about Newton’s laws of motion is that they are expressed mathematically as differential equations: they are expressed in terms of rates of changes of things. For instance, the force on a body gives the rate of change of the momentum of the body. Generally speaking, differential equations are very nasty things to solve which is a shame because most a great deal of theoretical physics involves them. Hamilton realised that it was possible to express Newton’s laws in a way that did not involve clumsy mathematics of this type. His formalism was equivalent, in the sense that one could obtain the basic differential equations from it, but easier to use in general situations. The key concept he introduced – now called the Hamiltonian – is a single mathematical function that depends on both the positions q and momenta p of the particles in a system, say H(q,p). This function is constructed from the different forms of energy (kinetic and potential) in the system, and how they depend on the p’s and q’s, but the details of how this works out don’t matter. Suffice to say that knowing the Hamiltonian for a system is tantamount to a full classical description of its behaviour.

Hamilton was a very interesting character. He was born in Dublin in 1805 and showed an astonishing early flair for languages, speaking 13 of them by the time he was 13. He graduated from Trinity College aged 22, at which point he was clearly a whiz-kid at mathematics as well as languages. He was immediately made professor of astronomy at Dublin and Astronomer Royal for Ireland. However, he turned out to be hopeless at the practicalities of observational work. Despite employing three of his sisters to help him in the observatory he never produced much of astronomical interest. Mathematics and alcohol seem to have been the two real loves of his life.

It is a fascinating historical fact that the development of probability theory during the late 17th and early 18th century coincided almost exactly with the rise of Newtonian Mechanics. It may seem strange in retrospect that there was no great philosophical conflict between these two great intellectual achievements since they have mutually incompatible views of prediction. Probability applies in unpredictable situations; Newtonian Mechanics says that everything is predictable. The resolution of this conundrum may owe a great deal to Laplace, who contributed greatly to both fields. Laplace, more than any other individual, was responsible to elevated the deterministic world-view of Newton to a scientific principle in its own right. To quote:

We ought then to regard the present state of the Universe as the effect of its preceding state and as the cause of its succeeding state.

According to Laplace’s view, knowledge of the initial conditions pertaining at the instant of creation would be sufficient in order to predict everything that subsequently happened. For him, a probabilistic treatment of phenomena did not conflict with classical theory, but was simply a convenient approach to be taken when the equations of motion were too difficult to be solved exactly. The required probabilities could be derived from the underlying theory, perhaps using some kind of symmetry argument.

The s-called “randomizing” devices used in all traditional gambling games – roulette wheels, dice, coins, bingo machines, and so on – are in fact well described by Newtonian mechanics. We call them “random” because the motions involved are just too complicated to make accurate prediction possible. Nevertheless it is clear that they are just straightforward mechanical devices which are essentially deterministic. On the other hand, we like to think the weather is predictable, at least in principle, but with much less evidence that it is so!

But it is not only systems with large numbers of interacting particles (like the Earth’s atmosphere) that pose problems for predictability. Some deceptively simple systems display extremely erratic behaviour. The theory of these systems is less than fifty years old or so, and it goes under the general title of nonlinear dynamics. One of the most important landmarks in this field was a study by two astronomers, Michel Hénon and Carl Heiles in 1964. They were interested in what would happens if you take a system with a known analytical solutions and modify it.

In the language of Hamiltonians, let us assume that H0 describes a system whose evolution we know exactly and H1 is some perturbation to it. The Hamiltonian of the modified system is thus

$H(q_i,p_i)=H_0(q_i, p_i) + H_1 (q_i, p_i)$

What Hénon and Heiles did was to study a system whose unmodified form is very familiar to physicists: the simple harmonic oscillator. This is a system which, when displaced from its equilibrium, experiences a restoring force proportional to the displacement. The Hamiltonian description for a single simple harmonic oscillator system involves a function that is quadratic in both p and q:

$H=\frac{1}{2} \left( q_1^2+p_1^2\right)$

The solution of this system is well known: the general form is a sinusoidal motion and it is used in the description of all kinds of wave phenomena, swinging pendulums and so on.

The case Henon and Heiles looked at had two degrees of freedom, so that the Hamiltonian depends on q1, q2, p1 and p2:

$H=\frac{1}{2} \left( q_1^2+p_1^2 + q_2^2+p_2^2\right)$

However, in this example, the two degrees of freedom are independent, meaning that there is uncoupled motion in the two directions. The amplitude of the oscillations is governed by the total energy of the system, which is a constant of the motion. Other than this, the type of behaviour displayed by this system is very rich, as exemplified by the various Lissajous figures shown in the diagram below. Note that all these figures are produced by the same type of dynamical system of equations: the different shapes are consequences of different initial conditions and different coefficients (which I set to unity in the form above).

If the oscillations in each direction have the same frequency then one can get an orbit which is a line or an ellipse. If the frequencies differ then the orbits can be much more complicated, but still pretty. Note that in all these cases the orbit is just a line, i.e. a one-dimensional part of the two-dimensional space drawn on the paper.

More generally, one can think of this system as a point moving in a four-dimensional phase space defined by the coordinates q1, q2, p1 and p2; taking slices through this space reveals qualitatively similar types of orbit for, say, p2 and q2 as for p1 and p2. The motion of the system is confined to a lower-dimensional part of the phase space rather than filling up all the available phase space. In this particular case, because each degree of freedom moves in only one of its two available dimensions, the system as a whole moves in a two-dimensional part of the four-dimensional space.

This all applies to the original, unperturbed system. Hénon and Heiles took this simple model and modified by adding a term to the Hamiltonian that was cubic rather than quadratic and which coupled the two degrees of freedom together. For those of you interested in the details their Hamiltonian was of the form

$H=\frac{1}{2} \left( q_1^2+p_1^2 + q_2^2+p_2^2\right) +q_1^2q_2+ \frac{1}{3}q_2^3$

The first set of terms in the brackets is the unmodified form, describing a simple harmonic oscillator; the other two terms are new. The result of this simple alteration is really quite surprising. They found that, for low energies, the system continued to behave like two uncoupled oscillators; the orbits were smooth and well-behaved. This is not surprising because the cubic modifications are smaller than the original quadratic terms if the amplitude is small.  For higher energies the motion becomes a bit more complicated, but the phase space behaviour is still characterized by continuous lines, as shown in the left hand part of the following figure.

However, at higher values of the energy (right), the cubic terms become more important, and something very striking happens. A two-dimensional slice through the phase space no longer shows the continuous curves that typify the original system, but a seemingly disorganized scattering of dots. It is not possible to discern any pattern in the phase space structure of this system: it appear to be random.

Nowadays we describe the transition from these two types of behaviour as being accompanied by the onset of chaos. It is important to note that this system is entirely deterministic, but it generates a phase space pattern that is quite different from what one would naively expect from the behaviour usually associated with classical Hamiltonian systems. To understand how this comes about it is perhaps helpful to think about predictability in classical systems. It is true that precise knowledge of the state of a system allows one to predict its state at some future time.  For a single particle this means that precise knowledge of its position and momentum, and knowledge of the relevant H, will allow one to calculate the position and momentum at all future times.

But think a moment about what this means. What do we mean by precise knowledge of the particle’s position? How precise? How many decimal places? If one has to give the position exactly then that could require an infinite amount of information. Clearly we never have that much information. Everything we know about the physical world has to be coarse-grained to some extent, even if it is only limited by measurement error. Strict determinism in the form advocated by Laplace is clearly a fantasy. Determinism is not the same as predictability.

In “simple” Hamiltonian systems what happens is that two neighbouring phase-space paths separate from each other in a very controlled way as the system evolves. In fact the separation between paths usually grows proportionally to time. The coarse-graining with which the input conditions are specified thus leads to a similar level of coarse-graining in the output state. Effectively the system is predictable, since the uncertainty in the output is not much larger than in the input.

In the chaotic system things are very different. What happens here is that the non-linear interactions represented in the Hamiltonian play havoc with the initial coarse-graining. Phase-space orbits that start out close to each other separate extremely violently (typically exponentially) and in a way that varies from one part of the phase space to another.  What happens then is that particle paths become hopelessly scrambled and the mapping between initial and final states becomes too complex to handle. What comes out  the end is practically impossible to predict.