## A Little Bit of Bayes

I thought I’d start a series of occasional posts about Bayesian probability. This is something I’ve touched on from time to time, but it’s perhaps worth covering this relatively controversial topic in a slightly more systematic fashion, especially with regard to how it works in cosmology.

I’ll start with Bayes’ theorem, which for three logical propositions (such as statements about the values of parameters in a theory) *A*, *B* and *C* can be written in the form

$$P(B|AC) = K^{-1} P(B|C) P(A|BC) = K^{-1} P(AB|C),$$

where

$$K = P(A|C).$$

This is (or should be!) uncontroversial as it is simply a result of the sum and product rules for combining probabilities. Notice, however, that I’ve not restricted it to two propositions *A* and *B* as is often done, but carried throughout an extra one (*C*). This is to emphasize the fact that, to a Bayesian, all probabilities are conditional on something; usually, in the context of data analysis this is a background theory that furnishes the framework within which measurements are interpreted. If you say this makes everything model-dependent, then I’d agree. But every interpretation of data in terms of parameters of a model is dependent on the model. It has to be. If you think it can be otherwise then I think you’re misguided.

In the equation, *P(B|C)* is the probability of *B* being true, given that *C* is true. The information *C* need not be definitely known, but perhaps assumed for the sake of argument. The left-hand side of Bayes’ theorem denotes the probability of *B* given both *A* and *C*, and so on. The presence of *C* has not changed anything, but is just there as a reminder that it all depends on what is being assumed in the background. The equation states a *theorem* that can be proved to be mathematically correct so it is – or should be – uncontroversial.
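As a quick sanity check that the theorem really is just bookkeeping with the product rule, here is a minimal Python sketch that verifies *P(B|AC) = P(B|C) P(A|BC) / P(A|C)* numerically. The joint distribution over the truth values of *A*, *B* and *C* is made up purely for illustration, and the helper names `prob` and `cond` are my own.

```python
from itertools import product
import random

# Hypothetical joint distribution over the truth values of (A, B, C);
# the weights are arbitrary and only need to be positive.
random.seed(0)
weights = {abc: random.random() for abc in product([True, False], repeat=3)}
total = sum(weights.values())
joint = {abc: w / total for abc, w in weights.items()}

def prob(pred):
    """Probability that the predicate pred(a, b, c) holds."""
    return sum(p for (a, b, c), p in joint.items() if pred(a, b, c))

def cond(pred, given):
    """Conditional probability P(pred | given), via the product rule."""
    both = prob(lambda a, b, c: pred(a, b, c) and given(a, b, c))
    return both / prob(given)

A = lambda a, b, c: a
B = lambda a, b, c: b
C = lambda a, b, c: c
A_and_C = lambda a, b, c: a and c
B_and_C = lambda a, b, c: b and c

lhs = cond(B, A_and_C)                    # P(B|AC)
K = cond(A, C)                            # K = P(A|C)
rhs = cond(B, C) * cond(A, B_and_C) / K   # P(B|C) P(A|BC) / K

print(abs(lhs - rhs) < 1e-12)
```

Any other positive weights should give the same agreement, since nothing here depends on the particular numbers.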

Now comes the controversy. In the “frequentist” interpretation of probability, the entities *A*, *B* and *C* would be interpreted as “events” (e.g. the coin is heads) or “random variables” (e.g. the score on a dice, a number from 1 to 6) attached to which is their probability, indicating their propensity to occur in an imagined ensemble. These things are quite complicated mathematical objects: they don’t have specific numerical values, but are represented by a measure over the space of possibilities. They are sort of “blurred-out” in some way, the fuzziness representing the uncertainty in the precise value.

To a Bayesian, the entities *A*, *B* and *C* have a completely different character to what they represent for a frequentist. They are not “events” but logical propositions which can only be either true or false. The entities themselves are not blurred out, but we may have insufficient information to decide which of the two possibilities is correct. In this interpretation, *P(A|C)* represents the *degree of belief* that it is consistent to hold in the truth of *A* given the information *C*. Probability is therefore a generalization of the “normal” deductive logic expressed by Boolean algebra: the value “0” is associated with a proposition which is false and “1” denotes one that is true. Probability theory extends this logic to the intermediate case where there is insufficient information to be certain about the status of the proposition.

A common objection to Bayesian probability is that it is somehow arbitrary or ill-defined. “Subjective” is the word that is often bandied about. This is only fair to the extent that different individuals may have access to different information and therefore assign different probabilities. Given different information *C* and *C*′ the probabilities *P(A|C)* and *P(A|C′)* will be different. On the other hand, the same precise rules for assigning and manipulating probabilities apply as before. Identical results should therefore be obtained whether these are applied by any person, or even a robot, so that part isn’t subjective at all.

In fact I’d go further. I think one of the great strengths of the Bayesian interpretation is precisely that it *does* depend on what information is assumed. This means that such information has to be stated explicitly. The essential assumptions behind a result can be – and, regrettably, often are – hidden in frequentist analyses. Being a Bayesian forces you to put all your cards on the table.

To a Bayesian, probabilities are always conditional on other assumed truths. There is no such thing as an absolute probability, hence my alteration of the form of Bayes’ theorem to represent this. A probability such as *P(A)* has no meaning to a Bayesian: there is always conditioning information. For example, if I blithely assign a probability of 1/6 to each face of a dice, that assignment is actually conditional on my having no information to discriminate between the appearance of the faces, and no knowledge of the rolling trajectory that would allow me to make a prediction of its eventual resting position.

In the Bayesian framework, probability theory becomes not a branch of experimental science but a branch of logic. Like any branch of mathematics it cannot be tested by experiment but only by the requirement that it be internally self-consistent. This brings me to what I think is one of the most important results of twentieth century mathematics, but which is unfortunately almost unknown in the scientific community. In 1946, Richard Cox derived the unique generalization of Boolean algebra under the assumption that such a logic must involve associating a single number with any logical proposition. The result he got is beautiful and anyone with any interest in science should make a point of reading his elegant argument. It turns out that the only way to construct a consistent logic of uncertainty incorporating this principle is by using the standard laws of probability. There is no other way to reason consistently in the face of uncertainty than probability theory. Accordingly, probability theory always applies when there is insufficient knowledge for deductive certainty. Probability is inductive logic.

This is not just a nice mathematical property. This kind of probability lies at the foundations of a consistent methodological framework that not only encapsulates many common-sense notions about how science works, but also puts at least some aspects of scientific reasoning on a rigorous quantitative footing. This is an important weapon that should be used more often in the battle against the creeping irrationalism one finds in society at large.

I posted some time ago about an alternative way of deriving the laws of probability from consistency arguments.

To see how the Bayesian approach works, let us consider a simple example. Suppose we have a hypothesis *H* (some theoretical idea that we think might explain some experiment or observation). We also have access to some data *D*, and we also adopt some prior information *I* (which might be the results of other experiments or simply working assumptions). What we want to know is how strongly the data *D* supports the hypothesis *H* given our background assumptions *I*. To keep it easy, we assume that the choice is between whether *H* is true or *H* is false. In the latter case, “not-*H*” or *H′* (for short) is true. If our experiment is at all useful we can construct *P(D|HI)*, the probability that the experiment would produce the data set *D* if both our hypothesis and the conditioning information are true.

The probability *P(D|HI)* is called the *likelihood*; to construct it we need to have some knowledge of the statistical errors produced by our measurement. Using Bayes’ theorem we can “invert” this likelihood to give *P(H|DI)*, the probability that our hypothesis is true given the data and our assumptions. The result looks just like we had in the first two equations:

$$P(H|DI) = K^{-1} P(H|I) P(D|HI).$$

Now we can expand the “normalising constant” *K* because we know that either *H* or *H′* must be true. Thus

$$K = P(D|I) = P(H|I) P(D|HI) + P(H'|I) P(D|H'I).$$

The *P(H|DI)* on the left-hand side of the first expression is called the *posterior probability*; the right-hand side involves *P(H|I)*, which is called the *prior probability* and the likelihood *P(D|HI)*. The principal controversy surrounding Bayesian inductive reasoning involves the prior and how to define it, which is something I’ll comment on in a future post.
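For the two-alternative case the arithmetic is short enough to write down directly. The sketch below assumes the posterior is the prior times the likelihood, divided by a normalising constant *K* that sums over *H* and *H′*; the function name `posterior` and all the numbers are illustrative, not taken from any real experiment.

```python
def posterior(prior_h, like_h, like_not_h):
    """Return P(H|DI) given P(H|I), P(D|HI) and P(D|H'I)."""
    # K = P(D|I) = P(H|I) P(D|HI) + P(H'|I) P(D|H'I), with P(H'|I) = 1 - P(H|I)
    k = prior_h * like_h + (1.0 - prior_h) * like_not_h
    return prior_h * like_h / k

# Same data (same likelihoods), two different states of prior information.
even_prior = posterior(prior_h=0.5, like_h=0.8, like_not_h=0.2)
sceptical_prior = posterior(prior_h=0.01, like_h=0.8, like_not_h=0.2)

print(even_prior, sceptical_prior)
```

With the even prior the data lift *H* well above one half, while with the sceptical prior the same data still leave *H* improbable, which is the sense in which the answer depends on the information *I*.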

The Bayesian recipe for testing a hypothesis assigns a large posterior probability to a hypothesis for which the product of the prior probability and the likelihood is large. It can be generalized to the case where we want to pick the best of a set of competing hypotheses, say $H_1, \ldots, H_n$. Note that this need not be the set of all possible hypotheses, just those that we have thought about. We can only choose from what is available. The hypotheses may be relatively simple, such as that some particular parameter takes the value *x*, or they may be composite, involving many parameters and/or assumptions. For instance, the Big Bang model of our universe is a very complicated hypothesis, or in fact a combination of hypotheses joined together, involving at least a dozen parameters which can’t be predicted *a priori* but which have to be estimated from observations.

The required result for multiple hypotheses is pretty straightforward: the sum of the two alternatives involved in *K* above simply becomes a sum over all possible hypotheses, so that

$$P(H_i|DI) = K^{-1} P(H_i|I) P(D|H_i I),$$

and

$$K = P(D|I) = \sum_j P(H_j|I) P(D|H_j I).$$

If the hypothesis concerns the value of a parameter – in cosmology this might be, e.g., the mean density of the Universe expressed by the density parameter $\Omega_0$ – then the allowed space of possibilities is continuous. The sum in the denominator should then be replaced by an integral, but conceptually nothing changes. Our “best” hypothesis is the one that has the greatest posterior probability.
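In practice a continuous parameter is often handled by chopping its range into a fine grid, so that the integral becomes a sum again. Here is a minimal sketch of that idea, in which every detail is an assumption made for illustration: a generic parameter *x* confined to the range 0 to 2, a flat prior, and a single measurement *d* with Gaussian errors of width `sigma`.

```python
import math

def grid_posterior(grid, prior, likelihood):
    """Normalised posterior P(H_i|DI) over a discrete grid of hypotheses."""
    unnorm = [p * likelihood(x) for x, p in zip(grid, prior)]
    k = sum(unnorm)  # K = sum over j of P(H_j|I) P(D|H_j I)
    return [u / k for u in unnorm]

# Hypothetical setup: each grid point is the hypothesis "the parameter equals x".
grid = [i * 0.01 for i in range(201)]      # x from 0.00 to 2.00
prior = [1.0 / len(grid)] * len(grid)      # flat prior over the grid

# Illustrative measurement d of x with Gaussian error sigma (made-up values).
d, sigma = 1.0, 0.2
likelihood = lambda x: math.exp(-0.5 * ((d - x) / sigma) ** 2)

post = grid_posterior(grid, prior, likelihood)
best = grid[post.index(max(post))]         # hypothesis with the greatest posterior

print(best)
```

The constant factor in front of the Gaussian is left out of the likelihood because it cancels when dividing by *K*.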

From a frequentist stance the procedure is often instead just to maximize the likelihood. According to this approach the best theory is the one that makes the data most probable. This can be the same as the most probable theory, but only if the prior probability is constant. In general, the probability of a model given the data is not the same as the probability of the data given the model. I’m amazed how many practising scientists make this error on a regular basis.
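A toy example makes the distinction concrete. The three hypotheses and all the numbers below are invented; the only point is that the hypothesis with the largest likelihood need not be the one with the largest posterior unless the prior is flat.

```python
hypotheses = ["H1", "H2", "H3"]

# Made-up prior probabilities P(H_i|I) and likelihoods P(D|H_i I).
prior = {"H1": 0.70, "H2": 0.25, "H3": 0.05}
likelihood = {"H1": 0.10, "H2": 0.20, "H3": 0.40}

def normalise(prior, likelihood):
    """Posterior P(H_i|DI): prior times likelihood, divided by their sum K."""
    k = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / k for h in prior}

post = normalise(prior, likelihood)

max_likelihood = max(hypotheses, key=lambda h: likelihood[h])
max_posterior = max(hypotheses, key=lambda h: post[h])

# With a constant prior the two choices coincide.
flat = {h: 1.0 / len(hypotheses) for h in hypotheses}
flat_post = normalise(flat, likelihood)
max_posterior_flat = max(hypotheses, key=lambda h: flat_post[h])

print(max_likelihood, max_posterior, max_posterior_flat)
```

Here the data favour `H3` on likelihood alone, but the prior information makes `H1` the most probable hypothesis overall.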

The following figure might serve to illustrate the difference between the frequentist and Bayesian approaches. In the former case, everything is done in “data space” using likelihoods, and in the other we work throughout with probabilities of hypotheses, i.e. we think in hypothesis space. I find it interesting to note that most theorists that I know who work in cosmology are Bayesians and most observers are frequentists!

As I mentioned above, it is the presence of the prior probability in the general formula that is the most controversial aspect of the Bayesian approach. The attitude of frequentists is often that this prior information is completely arbitrary or at least “model-dependent”. Being empirically-minded people, by and large, they prefer to think that measurements can be made and interpreted without reference to theory at all.

Assuming we can assign the prior probabilities in an appropriate way, what emerges from the Bayesian framework is a consistent methodology for scientific progress. The scheme starts with the hardest part – theory creation. This requires human intervention, since we have no automatic procedure for dreaming up hypotheses from thin air. Once we have a set of hypotheses, we need data against which theories can be compared using their relative probabilities. The experimental testing of a theory can happen in many stages: the posterior probability obtained after one experiment can be fed in, as prior, into the next. The order of experiments does not matter. This all happens in an endless loop, as models are tested and refined by confrontation with experimental discoveries, and are forced to compete with new theoretical ideas. Often one particular theory emerges as most probable for a while, such as in particle physics where a “standard model” has been in existence for many years. But this does not make it absolutely right; it is just the best bet amongst the alternatives. Likewise, the Big Bang model does not represent the absolute truth, but is just the best available model in the face of the manifold relevant observations we now have concerning the Universe’s origin and evolution. The crucial point about this methodology is that it is inherently inductive: all the reasoning is carried out in “hypothesis space” rather than “observation space”. The primary form of logic involved is not deduction but *induction*. Science is all about *inverse* reasoning.
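The step in which one experiment’s posterior becomes the next experiment’s prior can be sketched in a few lines. The example below assumes three competing hypotheses and two experiments whose results are independent once a hypothesis is fixed, so that each experiment contributes its own likelihood factor; that independence is my simplifying assumption, and the numbers are made up.

```python
def update(prior, likelihoods):
    """One Bayesian update: multiply by the likelihoods and renormalise."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    k = sum(unnorm)
    return [u / k for u in unnorm]

prior = [0.5, 0.3, 0.2]        # P(H_i|I) for three hypothetical hypotheses
like_exp1 = [0.9, 0.5, 0.1]    # P(D1|H_i I), illustrative values
like_exp2 = [0.2, 0.6, 0.7]    # P(D2|H_i I), illustrative values

# Feed the posterior from one experiment in as the prior for the other.
post_12 = update(update(prior, like_exp1), like_exp2)
post_21 = update(update(prior, like_exp2), like_exp1)

print(all(abs(a - b) < 1e-12 for a, b in zip(post_12, post_21)))
```

Both orderings multiply the prior by the same two likelihood factors, so the final posterior is the same either way.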

For comments on induction versus deduction in another context, see here.

So what are the main differences between the Bayesian and frequentist views?

First, I think it is fair to say that the Bayesian framework is enormously more general than is allowed by the frequentist notion that probabilities must be regarded as relative frequencies in some ensemble, whether that is real or imaginary. In the latter interpretation, a proposition is at once true in some elements of the ensemble and false in others. It seems to me to be a source of great confusion to substitute a logical AND for what is really a logical OR. The Bayesian stance is also free from problems associated with the failure to incorporate in the analysis any information that can’t be expressed as a frequency. Would you really trust a doctor who said that 75% of the people she saw with your symptoms required an operation, but who did not bother to look at your own medical files?

As I mentioned above, frequentists tend to talk about “random variables”. This takes us into another semantic minefield. What does “random” mean? To a Bayesian there are no random variables, only variables whose values we do not know. A random process is simply one about which we only have sufficient information to specify probability distributions rather than definite values.

More fundamentally, it is clear from the fact that the combination rules for probabilities were derived by Cox uniquely from the requirement of logical consistency, that any departure from these rules will generally speaking involve logical inconsistency. Many of the standard statistical data analysis techniques – including the simple “unbiased estimator” mentioned briefly above – used when the data consist of repeated samples of a variable having a definite but unknown value, are not equivalent to Bayesian reasoning. These methods can, of course, give good answers, but they can all be made to look completely silly by suitable choice of dataset.

By contrast, I am not aware of any example of a paradox or contradiction that has ever been found using the correct application of Bayesian methods, although the methods can of course be applied incorrectly. Furthermore, in order to deal with unique events like the weather, frequentists are forced to introduce the notion of an ensemble, a perhaps infinite collection of imaginary possibilities, to allow them to retain the notion that probability is a proportion. Provided the calculations are done correctly, the results of these calculations should agree with the Bayesian answers. On the other hand, frequentists often talk about the ensemble as if it were real, and I think that is very dangerous…

November 21, 2010 at 8:11 pm

Totally agree Peter, great post.

The way I deal with frequentists is not to go into a shouting match about the meaning of the word ‘probability,’ but to introduce the idea of how strongly one thing implies another. In more words, this is how strongly one binary proposition ‘A’ is implied to be true if we take another (‘B’) to be true. Call it the *implicability* if you like (the name doesn’t matter – it’s the idea that counts). Cox’s analysis, plus one or two uncontroversial assumptions, now show that implicability obeys the sum and product rules, ie the ‘laws of probability’, and a little reflection shows that implicability is what you actually *want* in any problem where there is uncertainty. On those grounds I am happy to call it ‘the probability’ – but if any frequentist, or anybody else, takes offence, my reply is: fine, I’ll go ahead and find this quantity because it actually solves the problem, you go off and do whatever you like.

A nice thing about calling it implicability is that the name itself reminds you that it has two arguments A and B, since one (partially) implies the other, so that you never fall into the mistake of ‘unconditional probabilities’. Also, you need never get hung up over the psychological connotations of ‘degree of belief’.

What none of this says is where to start: if you are doing an (inevitably noisy) experiment in which you let the data educate you about the value of a physical quantity, what prior probability distribution should you adopt for the value of that quantity? This silence has been used by frequentists to attack Bayesianism, but all it really means is that further principles are needed. A powerful one is symmetry: if you don’t know where a bead is located on a circular horizontal piece of wire, you have to assign equal probability density to every position on the wire.

When discussing all this with someone else, I avoid certain words until I know what they mean to that person:

subjective

objective

random

probability

and even, sadly…

Bayesian

(of which there are various shades).

Anton

PS to philosophers: So far as I am concerned, this IS inductive logic.

November 22, 2010 at 12:34 am

Peter,

“frequentists often talk about the ensemble as if it were real, and I think that is very dangerous”

I’m sure that the barmy many-worlds interpretation of quantum mechanics would never have been taken seriously but for the plague that is frequentist statistics.

Where does the multiverse idea fit into this discussion, please?

Anton

November 22, 2010 at 8:26 am

I had a go at the multiverse a while ago..

https://telescoper.wordpress.com/2009/06/17/multiversalism/

November 22, 2010 at 9:21 am

I remember it. There, though, you were mainly talking physics, and mentioned the probability wars only briefly. You also said that

“some plausible models based on quantum field theory do admit the possibility that our observable Universe is part of collection of mini-universes, each of which “really” exists. It’s hard to explain precisely what I mean by that”

Please attempt that explanation. I am at present no wiser as to whether the multiverse

a) is a pompous way of saying that the ‘laws’ of physics differ in differing patches (in which case they are not the basic laws but approximations to them in the case that parameters vary slowly with epoch or region)

b) is a frequentist fallacy

or

c) is something else that I am presently unaware of.

Is the idea perhaps something like genetic algorithms, but in physics rather than biology?

Yours confusedly,

Anton

November 22, 2010 at 3:10 pm

Anton,

Will try to get around to revisiting the multiverse some time. Until then I’ll just say that I don’t think your three options are mutually exclusive…

Peter

November 22, 2010 at 7:33 pm

Boy, this post sure is generating hits! I never thought probability would grab so many people’s interest.

November 23, 2010 at 9:04 am

You’re probably right! But in that case the comments/hits ratio is less good than for other topics…

November 23, 2010 at 9:49 am

I don’t know why that is. It’s not a subject people are usually reticent about in my experience!

March 25, 2013 at 12:22 pm

I have a question about Bayes’ rule.

P(a|b,c) = (P(b,c|a)*P(a))/P(c)

P(b,c|a) is probability b and c given a

If so, Can we write P(b,c|a)

P(b,c|a)=P(b|a) * P(c|a) is it correct?

unlike P(a|b,c) =P(a|b intersect c) =P(a intersect b intersect c) /P(a intersect b)

which one is correct. Can you help me for above equation…

March 25, 2013 at 12:49 pm

P(b,c|a)=P(b|a) * P(c|a) is not in general correct: it assumes that b and c are independent given a….

March 25, 2013 at 12:54 pm

thanks for answer,

How can we calculate P(a|b,c)?

What is the meaning of a,b, and how can we simplify this equation to something like P(a|b)?