## Bayes’ Razor

Posted in Bad Statistics, The Universe and Stuff with tags , , , , , , , , , on February 19, 2011 by telescoper

It’s been quite while since I posted a little piece about Bayesian probability. That one and the others that followed it (here and here) proved to be surprisingly popular so I’ve been planning to add a few more posts whenever I could find the time. Today I find myself in the office after spending the morning helping out with a very busy UCAS visit day, and it’s raining, so I thought I’d take the opportunity to write something before going home. I think I’ll do a short introduction to a topic I want to do a more technical treatment of in due course.

A particularly important feature of Bayesian reasoning is that it gives precise motivation to things that we are generally taught as rules of thumb. The most important of these is Ockham’s Razor. This famous principle of intellectual economy is variously presented in Latin as Pluralites non est ponenda sine necessitate or Entia non sunt multiplicanda praetor necessitatem. Either way, it means basically the same thing: the simplest theory which fits the data should be preferred.

William of Ockham, to whom this dictum is attributed, was an English Scholastic philosopher (probably) born at Ockham in Surrey in 1280. He joined the Franciscan order around 1300 and ended up studying theology in Oxford. He seems to have been an outspoken character, and was in fact summoned to Avignon in 1323 to account for his alleged heresies in front of the Pope, and was subsequently confined to a monastery from 1324 to 1328. He died in 1349.

In the framework of Bayesian inductive inference, it is possible to give precise reasons for adopting Ockham’s razor. To take a simple example, suppose we want to fit a curve to some data. In the presence of noise (or experimental error) which is inevitable, there is bound to be some sort of trade-off between goodness-of-fit and simplicity. If there is a lot of noise then a simple model is better: there is no point in trying to reproduce every bump and wiggle in the data with a new parameter or physical law because such features are likely to be features of the noise rather than the signal. On the other hand if there is very little noise, every feature in the data is real and your theory fails if it can’t explain it.

To go a bit further it is helpful to consider what happens when we generalize one theory by adding to it some extra parameters. Suppose we begin with a very simple theory, just involving one parameter $p$, but we fear it may not fit the data. We therefore add a couple more parameters, say $q$ and $r$. These might be the coefficients of a polynomial fit, for example: the first model might be straight line (with fixed intercept), the second a cubic. We don’t know the appropriate numerical values for the parameters at the outset, so we must infer them by comparison with the available data.

Quantities such as $p$, $q$ and $r$ are usually called “floating” parameters; there are as many as a dozen of these in the standard Big Bang model, for example.

Obviously, having three degrees of freedom with which to describe the data should enable one to get a closer fit than is possible with just one. The greater flexibility within the general theory can be exploited to match the measurements more closely than the original. In other words, such a model can improve the likelihood, i.e. the probability  of the obtained data  arising (given the noise statistics – presumed known) if the signal is described by whatever model we have in mind.

But Bayes’ theorem tells us that there is a price to be paid for this flexibility, in that each new parameter has to have a prior probability assigned to it. This probability will generally be smeared out over a range of values where the experimental results (contained in the likelihood) subsequently show that the parameters don’t lie. Even if the extra parameters allow a better fit to the data, this dilution of the prior probability may result in the posterior probability being lower for the generalized theory than the simple one. The more parameters are involved, the bigger the space of prior possibilities for their values, and the harder it is for the improved likelihood to win out. Arbitrarily complicated theories are simply improbable. The best theory is the most probable one, i.e. the one for which the product of likelihood and prior is largest.

To give a more quantitative illustration of this consider a given model $M$ which has a set of $N$ floating parameters represented as a vector $\underline\lambda = (\lambda_1,\ldots \lambda_N)=\lambda_i$; in a sense each choice of parameters represents a different model or, more precisely, a member of the family of models labelled $M$.

Now assume we have some data $D$ and can consequently form a likelihood function $P(D|\underline{\lambda},M)$. In Bayesian reasoning we have to assign a prior probability $P(\underline{\lambda}|M)$ to the parameters of the model which, if we’re being honest, we should do in advance of making any measurements!

The interesting thing to look at now is not the best-fitting choice of model parameters $\underline{\lambda}$ but the extent to which the data support the model in general.  This is encoded in a sort of average of likelihood over the prior probability space:

$P(D|M) = \int P(D|\underline{\lambda},M) P(\underline{\lambda}|M) d^{N}\underline{\lambda}.$

This is just the normalizing constant $K$ usually found in statements of Bayes’ theorem which, in this context, takes the form

$P(\underline{\lambda}|DM) = K^{-1}P(\underline{\lambda}|M)P(D|\underline{\lambda},M).$

In statistical mechanics things like $K$ are usually called partition functions, but in this setting $K$ is called the evidence, and it is used to form the so-called Bayes Factor, used in a technique known as Bayesian model selection of which more anon….

The  usefulness of the Bayesian evidence emerges when we ask the question whether our $N$  parameters are sufficient to get a reasonable fit to the data. Should we add another one to improve things a bit further? And why not another one after that? When should we stop?

The answer is that although adding an extra degree of freedom can increase the first term in the integral defining $K$ (the likelihood), it also imposes a penalty in the second factor, the prior, because the more parameters the more smeared out the prior probability must be. If the improvement in fit is marginal and/or the data are noisy, then the second factor wins and the evidence for a model with $N+1$ parameters lower than that for the $N$-parameter version. Ockham’s razor has done its job.

This is a satisfying result that is in nice accord with common sense. But I think it goes much further than that. Many modern-day physicists are obsessed with the idea of a “Theory of Everything” (or TOE). Such a theory would entail the unification of all physical theories – all laws of Nature, if you like – into a single principle. An equally accurate description would then be available, in a single formula, of phenomena that are currently described by distinct theories with separate sets of parameters. Instead of textbooks on mechanics, quantum theory, gravity, electromagnetism, and so on, physics students would need just one book.

The physicist Stephen Hawking has described the quest for a TOE as like trying to read the Mind of God. I think that is silly. If a TOE is every constructed it will be the most economical available description of the Universe. Not the Mind of God.  Just the best way we have of saving paper.