Bayes’ Razor
It’s been quite a while since I posted a little piece about Bayesian probability. That one and the others that followed it (here and here) proved to be surprisingly popular so I’ve been planning to add a few more posts whenever I could find the time. Today I find myself in the office after spending the morning helping out with a very busy UCAS visit day, and it’s raining, so I thought I’d take the opportunity to write something before going home. I think I’ll do a short introduction to a topic I want to give a more technical treatment of in due course.
A particularly important feature of Bayesian reasoning is that it gives precise motivation to things that we are generally taught as rules of thumb. The most important of these is Ockham’s Razor. This famous principle of intellectual economy is variously presented in Latin as Pluralitas non est ponenda sine necessitate or Entia non sunt multiplicanda praeter necessitatem. Either way, it means basically the same thing: the simplest theory which fits the data should be preferred.
William of Ockham, to whom this dictum is attributed, was an English Scholastic philosopher (probably) born at Ockham in Surrey in 1280. He joined the Franciscan order around 1300 and ended up studying theology in Oxford. He seems to have been an outspoken character, and was in fact summoned to Avignon in 1323 to account for his alleged heresies in front of the Pope, and was subsequently confined to a monastery from 1324 to 1328. He died in 1349.
In the framework of Bayesian inductive inference, it is possible to give precise reasons for adopting Ockham’s razor. To take a simple example, suppose we want to fit a curve to some data. In the presence of noise (or experimental error) which is inevitable, there is bound to be some sort of trade-off between goodness-of-fit and simplicity. If there is a lot of noise then a simple model is better: there is no point in trying to reproduce every bump and wiggle in the data with a new parameter or physical law because such features are likely to be features of the noise rather than the signal. On the other hand if there is very little noise, every feature in the data is real and your theory fails if it can’t explain it.
To go a bit further it is helpful to consider what happens when we generalize one theory by adding to it some extra parameters. Suppose we begin with a very simple theory, just involving one parameter a, but we fear it may not fit the data. We therefore add a couple more parameters, say b and c. These might be the coefficients of a polynomial fit, for example: the first model might be a straight line (with fixed intercept), the second a cubic. We don’t know the appropriate numerical values for the parameters at the outset, so we must infer them by comparison with the available data.

Quantities such as a, b and c are usually called “floating” parameters; there are as many as a dozen of these in the standard Big Bang model, for example.
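To make that concrete (the symbols a, b and c are just illustrative labels, and the fixed intercept is taken to be zero for simplicity), the two models could be written as something like

y = a x   (straight line: one floating parameter)

y = a x + b x² + c x³   (cubic: three floating parameters).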
Obviously, having three degrees of freedom with which to describe the data should enable one to get a closer fit than is possible with just one. The greater flexibility within the general theory can be exploited to match the measurements more closely than the original. In other words, such a model can improve the likelihood, i.e. the probability of the obtained data arising (given the noise statistics – presumed known) if the signal is described by whatever model we have in mind.
But Bayes’ theorem tells us that there is a price to be paid for this flexibility, in that each new parameter has to have a prior probability assigned to it. This probability will generally be smeared out over a range of values where the experimental results (contained in the likelihood) subsequently show that the parameters don’t lie. Even if the extra parameters allow a better fit to the data, this dilution of the prior probability may result in the posterior probability being lower for the generalized theory than the simple one. The more parameters are involved, the bigger the space of prior possibilities for their values, and the harder it is for the improved likelihood to win out. Arbitrarily complicated theories are simply improbable. The best theory is the most probable one, i.e. the one for which the product of likelihood and prior is largest.
To give a more quantitative illustration of this, consider a given model M which has a set of N floating parameters represented as a vector λ; in a sense each choice of parameters represents a different model or, more precisely, a member of the family of models labelled M.

Now assume we have some data D and can consequently form a likelihood function P(D|λ, M). In Bayesian reasoning we have to assign a prior probability P(λ|M) to the parameters of the model which, if we’re being honest, we should do in advance of making any measurements!
The interesting thing to look at now is not the best-fitting choice of model parameters but the extent to which the data support the model in general. This is encoded in a sort of average of the likelihood over the prior probability space:

P(D|M) = ∫ P(D|λ, M) P(λ|M) dλ.

This is just the normalizing constant usually found in statements of Bayes’ theorem which, in this context, takes the form

P(λ|D, M) = P(D|λ, M) P(λ|M) / P(D|M).

In statistical mechanics, quantities like P(D|M) are usually called partition functions, but in this setting P(D|M) is called the evidence, and it is used to form the so-called Bayes Factor, used in a technique known as Bayesian model selection of which more anon….
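A rough way to see where the penalty discussed below comes from (a standard back-of-the-envelope approximation rather than anything rigorous): for a single parameter λ with a flat prior of width Δλ, if the data confine λ to a region of width δλ around the best-fitting value, the integral above is approximately

P(D|M) ≈ P(D|λ_best, M) × (δλ/Δλ).

The first factor measures the best fit achievable; the second, often called the Occam factor, is at most one and contributes roughly another factor of δλ/Δλ for each additional well-constrained parameter. A more flexible model therefore has to improve the best fit by at least that much before its evidence goes up.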
The usefulness of the Bayesian evidence emerges when we ask the question whether our parameters are sufficient to get a reasonable fit to the data. Should we add another one to improve things a bit further? And why not another one after that? When should we stop?
The answer is that although adding an extra degree of freedom can increase the first term in the integral defining P(D|M) (the likelihood), it also imposes a penalty in the second factor, the prior, because the more parameters there are the more smeared out the prior probability must be. If the improvement in fit is marginal and/or the data are noisy, then the second factor wins and the evidence for a model with N+1 parameters is lower than that for the N-parameter version. Ockham’s razor has done its job.
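Here is a minimal numerical sketch of that trade-off (not something from the post itself; the true slope, noise level and prior widths are arbitrary choices for illustration). It generates noisy data from a straight line and computes the evidence, under flat priors on the coefficients, for the one-parameter (straight line) and three-parameter (cubic) models discussed above; because the models are linear in their coefficients the integral can be done analytically, and with these choices the simpler model should come out with the larger evidence:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: a straight line through the origin plus Gaussian noise of known width
x = np.linspace(0.0, 1.0, 20)
sigma = 0.1
y = 1.3 * x + rng.normal(0.0, sigma, x.size)

def log_evidence(powers, prior_width=10.0):
    """Log evidence for the linear model y = sum_i beta_i * x**powers[i], with Gaussian
    noise of width sigma and an independent flat prior of width prior_width on each
    coefficient. Because the model is linear in its coefficients the likelihood is a
    Gaussian in them, so the integral over the prior can be done analytically
    (assuming the prior box comfortably contains the likelihood peak)."""
    X = np.column_stack([x**p for p in powers])      # design matrix
    k = X.shape[1]
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # best-fitting coefficients
    chi2_min = np.sum((y - X @ beta_hat) ** 2)
    log_L_max = -0.5 * chi2_min / sigma**2 - 0.5 * y.size * np.log(2.0 * np.pi * sigma**2)
    # Occam factor: Gaussian integral over the coefficients divided by the prior volume
    _, logdet = np.linalg.slogdet(X.T @ X)
    log_occam = 0.5 * k * np.log(2.0 * np.pi * sigma**2) - 0.5 * logdet - k * np.log(prior_width)
    return log_L_max + log_occam

logZ1 = log_evidence(powers=[1])         # straight line: y = a*x
logZ3 = log_evidence(powers=[1, 2, 3])   # cubic:         y = a*x + b*x**2 + c*x**3

print("ln(evidence), 1-parameter model:", round(logZ1, 2))
print("ln(evidence), 3-parameter model:", round(logZ3, 2))
print("ln(Bayes factor), line vs cubic:", round(logZ1 - logZ3, 2))
```

The size of the gap depends entirely on the assumed prior widths and noise level, which is one reason the priors have to be thought about carefully (a point that comes up again in the comments below).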
This is a satisfying result that is in nice accord with common sense. But I think it goes much further than that. Many modern-day physicists are obsessed with the idea of a “Theory of Everything” (or TOE). Such a theory would entail the unification of all physical theories – all laws of Nature, if you like – into a single principle. An equally accurate description would then be available, in a single formula, of phenomena that are currently described by distinct theories with separate sets of parameters. Instead of textbooks on mechanics, quantum theory, gravity, electromagnetism, and so on, physics students would need just one book.
The physicist Stephen Hawking has described the quest for a TOE as like trying to read the Mind of God. I think that is silly. If a TOE is ever constructed it will be the most economical available description of the Universe. Not the Mind of God. Just the best way we have of saving paper.
February 19, 2011 at 4:43 pm
TOE, or as near as we can get to it:
Supernova star Kochab, verified as such by
senior researchers at Princeton University.
The message: 4.6.32.15.31.27….
February 19, 2011 at 5:02 pm
Huh?
February 19, 2011 at 6:30 pm
Some fill-out to the history: William did a midnight bunk in 1328 when it became clear that the fairly decadent papal court (then at Avignon) was on the point of declaring against the austerity favoured by the Franciscans, whose intellectual star he was. He fled with their leader Michael of Cesena but the ship they reached and hid in could not leave port for many days because of unfavourable winds; they nevertheless eventually reached the safe haven of Bavaria (where the ruler was in political conflict with the papacy). This is not the place to go into William’s theology (‘nominalism’) and why it was a change of spirit from the scholastics, but he was certainly smart because he wrote down de Morgan’s laws of binary logic in full generality 500 years before de Morgan – longhand in the absence of mathematical notation, and of course in Latin. (The logical product of two propositions is known as their ‘copulation’.) He is believed to have died months before the frightful Black Death swept through Bavaria and killed 1/3 – 1/2 of the population.
The first Latin aphorism which Peter quotes above (Pluralites…) is typical of William’s own writings; the second (Entia OR Essentia non sunt multiplicanda…) is a summary by the Irish scholar John Ponce several centuries later. The idea was not new with Ockham (Latin: Occamus), for in the 2nd century Ptolemy wrote this about changes in the earth’s solstices and equinoxes: “It is a good principle to explain the phenomena by the simplest hypotheses possible, in so far as there is nothing in the observations to provide a significant objection to such a procedure”. The name “Razor” appears to have been coined by the 19th century scholar William S. Hamilton (NOT the same as William Rowan Hamilton), possibly because it cuts away unnecessary complication. Perhaps he was inspired by the phrase “Rasoir des Nominaux” used obscurely by the French scholar Condillac a century earlier.
I’ve no idea why Hawking threw that mind-of-God line in. It alienates those theists who acknowledge a personal God, and it alienates atheists too. But even a theory of everything is not the end of the road: as well as experimental testing in ever wider regions of parameter space, you want the theory to have no floating parameters. Any theory with floating parameters would be superseded by one with none and which correctly predicts their observed values. A century ago, Max Planck had invented an early version of the quantum theory that explained a baffling phenomenon, the speed of electrons thrown off when light is shone at a metal. His equations called for a new physical constant (now named after him), whose value had to be found from the observations. But the same idea was also applied to explain the amount of radiation given off by a hot body, and furthermore the wavelengths of light absorbed by hydrogen atoms. These phenomena had been experimentally studied and each had required its own physical constant which had to be set from the observations. The new idea related these constants to Planck’s and gave their values accurately. A formal Ockham analysis massively prefers the early quantum theory to the combination of three separate empirical theories. The quantum concept was instantly accepted.
Anton
February 19, 2011 at 7:05 pm
Sales!
February 20, 2011 at 6:37 am
I do wish the media- errrm – seeking physicists would desist from the mind of god, the god particle, the god theory, the god cheesecake etc etc. It makes them look horribly desperate to up their sales.
February 20, 2011 at 1:28 pm
Mmmm, god cheesecake – who wants manna from heaven when you can have cheesecake?
It may look desperate but it’s effective. While it annoys me no end as a tactic I don’t think it alienates too many people; the sorts of people liable to read a book on physics are going to understand the hyperbole and some may agree that a TOE is reading the mind of god.
BTW the ‘god particle’ really should be referred to by its proper name: ‘that god-damn particle’.
February 20, 2011 at 8:43 am
There have been a few papers on the arxiv in the last few years that have argued against the idea or effectiveness of the Evidence (for example there was a recent one by John P and co-author that was actually published in MNRAS). At the risk of inciting more arguments on this subject, do you care to make any comment?
February 20, 2011 at 9:16 am
David,
That topic is what I was alluding to when I said I was going to do a subsequent post. I felt it was best to put up the basics first before going into the use of model selection in cosmology, which would be a much more specialised article. I imagine that might draw some comments from working cosmologists!
Peter
February 20, 2011 at 9:44 am
Some more examples of the regular Ockham analysis (ie, not ‘model selection’ in the technical sense used by Peter):
Is there an (N+1)th planet in the noisy data regarding the orbits of the first N?
Is there a 5th fundamental force?
Is the cosmological constant zero (all prior probability eggs in one basket) or nonzero and to be estimated from the data?
And one of that sort to which we now have the answer, hopefully derived via a Bayesian Ockham analysis rather than other (ie, wrong) statistical methods:
Do the data prefer a zero-mass neutrino or a nonzero, data-estimated mass?
Anton
February 20, 2011 at 10:39 am
These are all good examples. One of them – the cosmological constant – came up in my seminar at MSSL on Thursday. The data do require a non-zero cosmological constant, but that is a statement dependent on the model M which in this case entails GR being right and also that the Cosmological Principle holds, i.e. that the Universe is homogeneous and isotropic on large scales. Within this family of possible universes lives the concordance model.
Moreover, people often say we “observe” that the expansion of the universe is accelerating. Not so. What we observe are such things as that the luminosity distances of high-redshift supernovae are larger than one would expect at a given redshift in cosmologies without a cosmological constant. The acceleration is not observed directly, but is an inference made using this model.
I’m not rubbishing the concordance cosmology, but I think it’s good to get its status right. It is not a “fact”, but a plausible inference that could easily be superseded by more data and/or a change of theoretical framework.
February 20, 2011 at 1:42 pm
Anton,
In the standard model (the current, but ailing, model of particle physics) neutrinos were originally assumed to have zero mass (giving them mass would introduce further terms into the equations: Ockham again).

Measurements at the Homestake experiment showed a dearth of solar neutrinos; subsequent measurements at Super-Kamiokande and the Sudbury Neutrino Observatory (SNO) showed that neutrinos can oscillate between flavours. This flavour oscillation can only occur if neutrinos have mass; again, this follows directly from the mathematics of the standard model.

Unfortunately giving neutrinos mass introduces further free parameters which we can’t explain (in this case the parameters are the masses of the neutrinos). This is why there are so many ‘beyond the standard model’ theories (e.g. supersymmetry, extra dimensions, etc.; NOT string theory and its kin); these theories generally unify the observables in some way.
February 21, 2011 at 1:23 pm
Peter, getting my retaliation in before your technical post, I really disagree strongly with the common claim that Bayesian evidence incorporates Ockham’s razor. Take model M1, and add 1000 extra parameters to make model M2; let these 1000 extra parameters have negligible effect on observables. In that case, their priors integrate straight out of the evidence ratio, so we conclude that M1 and M2 are equally acceptable. The reason for this manifestly dumb conclusion is that we haven’t yet added a true Ockham penalty: if we can, we should fit the data with fewer parameters. Thus the prior on M should add a penalty for increasing N, the number of parameters. One might consider P(N)=1/2^N, but that doesn’t feel right: 1000 parameters is about as bad as 1010. Maybe P(N) = 1/N would be better, but that feels perhaps a bit too gentle. Room for debate, as usual with Bayes, but surely no-one could argue that P(N)=1 is correct?
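To spell out the algebra of that scenario (writing θ for the shared parameters and φ for the 1000 inert ones, with independent priors; the labels are introduced here purely for illustration): if the likelihood does not depend on φ at all, then

P(D|M2) = ∫∫ P(D|θ) π(θ) π(φ) dθ dφ = [∫ P(D|θ) π(θ) dθ] × [∫ π(φ) dφ] = P(D|M1),

since the prior π(φ) integrates to one. The evidence ratio is then exactly unity, so parameters that the data cannot constrain attract no penalty from the evidence itself.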
February 22, 2011 at 9:10 am
John, I don’t understand this comment at all. You have 1000 parameters but the observables don’t depend on them? Doesn’t that just mean that the model involves a constant term, obtained by adding 1000 constants together?
I don’t disagree that Bayesian logic is unable to decide whether 6 is better than 1+2+3, but it doesn’t seem to me to be a major objection! Or am I missing the point?
February 22, 2011 at 2:46 am
There have been a number of criteria proposed for deciding whether a more complicated model is preferred, but I have seen that most of these attempts fail to address the “simple” map-making problem for WMAP: N=10^11 data points fit by k=10^6.5 parameters. The Akaike Information Criterion, -2ln(L)+2k with L being the maximum likelihood, fails in this case, and the Bayesian IC, -2ln(L)+k*ln(N), is even worse. It prefers a flat sky to the actual map. But the old-fashioned rule that when adding m parameters to a model the distribution of Δ[-2ln(L)] follows the χ² distribution with m degrees of freedom is correct for this linear problem with Gaussian noise. The corresponding criterion is -2ln(L)+k, which is quite tolerant of extra parameters, but then adding more pixels to a map is not as big a qualitative step as a non-zero Lambda.
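For a rough sense of the scales being compared here, the penalty terms of the three criteria as quoted, evaluated at N = 10^11 and k = 10^6.5 (just the arithmetic; the actual Δ[-2ln(L)] improvements are not reproduced here):

```python
import numpy as np

N = 1e11       # number of data points, as quoted above
k = 10**6.5    # number of map parameters, as quoted above

penalties = {
    "AIC (2k)":       2 * k,
    "BIC (k ln N)":   k * np.log(N),
    "chi^2 rule (k)": k,   # expected Delta(-2 ln L) from k extra fitted parameters
}
for name, value in penalties.items():
    print(f"{name:>15s}: {value:.3g}")
```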
February 22, 2011 at 9:25 am
Not everything that calls itself Bayesian is kosher; the word has been hijacked. (For instance, I never buy a textbook just because it has ‘Bayesian’ in the title – I check whether RT Cox is given due honour in the index or references.) If you use the analysis Peter has set out, which is uniquely consistent with the sum and product rules, then I guarantee that the result will either coincide with your intuition or educate it. The hard part will be putting a prior on your parameters; to stand a chance of doing that you must know their ontological meaning. That the information criteria you quote can proceed without considering this issue shows that they are ultimately ad hoc.
February 22, 2011 at 6:46 pm
I’ve never thought that the “simple” map-making problem was simple at all!
February 23, 2011 at 8:08 am
Peter, let me try again. All astronomical observations are consistent with the ΛCDM model. They are also consistent with a model that is ΛCDM plus a pink elephant on the far side of each planet in the Milky Way. But all these elephants are an unnecessary complication to a simple model, so an Ockham-esque philosophy says that you should disfavour the elephantoverse model. But Bayesian Evidence would not reach such a conclusion: it penalises extra parameters only if a small part of their parameter space is consistent with observation, but exo-elephants don’t affect cosmological data, so they are not penalised.
February 23, 2011 at 8:36 am
I admit that we have to restrict this version of Ockham’s razor to scientific applications, i.e. cases involving quantitatively testable hypotheses.
Believe in pink elephants if you wish. Just don’t call it science!
February 23, 2011 at 1:51 pm
I think that John raises a point, but I think there’s a fairly simple answer to it concerning the prior for models themselves.
February 23, 2011 at 1:58 pm
I wonder what happened to bnton, cnton, dnton…inton?
February 23, 2011 at 4:48 pm
We’re all here, and often manifest when Anton goes travelling and has to retype his name into alien computers…
March 6, 2011 at 1:15 pm
[…] I’m aware that I still haven’t posted a follow-up to my introductory article about Bayesian Evidence, so I apologize to those of you out there that thought this was going to be it! In fact I’m […]
July 27, 2016 at 11:41 am
[…] the data. Bayesian model selection analysis however tends to reject such models on the grounds of Ockham’s Razor. In other words the price you pay for introducing an extra free parameter exceeds the benefit in […]
July 25, 2019 at 12:40 pm
[…] The point is that if you allow the equation of state parameter w to vary from the value of w=-1 that it has in the standard cosmology then you get a better fit. However, it is one of the features of Bayesian inference that if you introduce a new free parameter then you have to assign a prior probability over the space of values that parameter could hold. That prior penalty is carried through to the posterior probability. Unless the new model fits observational data significantly better than the old one, this prior penalty will lead to the new model being disfavoured. This is the Bayesian statement of Ockham’s Razor. […]