Archive for Bayesian Model Selection

New Publication at the Open Journal of Astrophysics!

Posted in OJAp Papers, Open Access, The Universe and Stuff on June 23, 2020 by telescoper

Well, Maynooth University may still be (partially) closed as a result of the Covid-19 pandemic, but the Open Journal of Astrophysics is definitely fully open.

In fact we have just published another paper! This one is in the Astrophysics of Galaxies section and is entitled A Bayesian Approach to the Vertical Structure of the Disk of the Milky Way. The authors are Phillip S Dobbie and Stephen J Warren of Imperial College, London.

Here is a screen grab of the overlay:


You can find the arXiv version of the paper here.

I’d like to take this opportunity to thank the Editorial team and various referees for their efforts in keeping the Open Journal of Astrophysics going in these difficult times.

Why the Universe is (probably) not rotating

Posted in Cosmic Anomalies, The Universe and Stuff on October 1, 2013 by telescoper

Just a quick post to point you towards a nice blog post by Jason McEwen entitled Is the Universe rotating? It’s a general rule that if an article has a question for a title then the answer to that question is probably “no”, and “probably no” is indeed the answer in this case.

The item relates to a paper by McEwen et al whose abstract is given here:

We perform a definitive analysis of Bianchi VII_h cosmologies with WMAP observations of the cosmic microwave background (CMB) temperature anisotropies. Bayesian analysis techniques are developed to study anisotropic cosmologies using full-sky and partial-sky, masked CMB temperature data. We apply these techniques to analyse the full-sky internal linear combination (ILC) map and a partial-sky, masked W-band map of WMAP 9-year observations. In addition to the physically motivated Bianchi VII_h model, we examine phenomenological models considered in previous studies, in which the Bianchi VII_h parameters are decoupled from the standard cosmological parameters. In the two phenomenological models considered, Bayes factors of 1.7 and 1.1 units of log-evidence favouring a Bianchi component are found in full-sky ILC data. The corresponding best-fit Bianchi maps recovered are similar for both phenomenological models and are very close to those found in previous studies using earlier WMAP data releases. However, no evidence for a phenomenological Bianchi component is found in the partial-sky W-band data. In the physical Bianchi VII_h model we find no evidence for a Bianchi component: WMAP data thus do not favour Bianchi VII_h cosmologies over the standard Lambda Cold Dark Matter (LCDM) cosmology. It is not possible to discount Bianchi VII_h cosmologies in favour of LCDM completely, but we are able to constrain the vorticity of physical Bianchi VII_h cosmologies at $(\omega/H)_0 < 8.6 \times 10^{-10}$ with 95% confidence.

For non-experts the Bianchi cosmologies are based on exact solutions of Einstein’s equations for general relativity which obey the condition that they are spatially homogeneous but not necessarily isotropic. If you find that concept hard to understand, imagine a universe which looks the same everywhere but which is pervaded by a uniform magnetic field: that would be homogeneous (because every place is identical) but anisotropic (because there is a preferred direction – along the magnetic field lines). Another example would be a universe which is, for reasons known only to itself, rotating; the preferred direction here is the axis of rotation. The complete classification of all Bianchi space-times is discussed here. I also mentioned them and showed some pictures on this blog here.

As Jason’s post explains, observations of the cosmic microwave background by the Wilkinson Microwave Anisotropy Probe (WMAP) suggest that there is something a little bit fishy about it: it seems to have an anomalous large-scale asymmetry not expected in the standard cosmology. These suggestions seem to be confirmed by Planck, though the type of analysis done for WMAP has not yet been performed for Planck. The paper mentioned above investigates whether the WMAP asymmetry could be accounted for by one particular Bianchi cosmology, i.e. Bianchi VII_h. This is quite a complicated model which has negative spatial curvature, rotation (vorticity) and shear; formally speaking, it is the most general Bianchi model of any type that includes the standard Friedmann cosmology as a special case.

The question whether such a complicated model actually provides a better fit to the data than the much simpler standard model is one naturally answered by Bayesian techniques that trade off the increased complexity of a more sophisticated model  against the improvement in goodness-of-fit achieved by having more free parameters.  Using this approach McEwen et al. showed that, in simple  terms, while a slight improvement in fit is indeed gained by adding a Bianchi VII_h component to the model,  the penalty paid in terms of increased complexity means that the alternative model is not significantly more probable than the simple one. Ockham’s Razor strikes again! Although this argument does not definitively exclude the possibility that the Universe is rotating, it does put limits on how much rotation there can be. It also excludes one possible explanation of the  peculiar pattern  of the temperature fluctuations seen by WMAP.
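For readers who want a feel for the numbers quoted in the abstract: a Bayes factor expressed in "units of log-evidence" is the natural logarithm of the ratio of the two models' evidences, so (as a rough sketch, and assuming equal prior odds on the two models, which is my assumption rather than anything stated in the paper) it converts to betting odds by exponentiation:

```python
# Hypothetical illustration: converting the Delta ln(evidence) values
# quoted in the abstract into posterior odds, assuming equal prior
# odds on the two models.
import math

for delta_ln_evidence in (1.7, 1.1):
    odds = math.exp(delta_ln_evidence)  # evidence ratio (Bayes factor)
    print(f"Delta ln E = {delta_ln_evidence}: odds of about {odds:.1f} to 1")
```

Odds of a few to one either way are far short of decisive, which is consistent with the cautious conclusions the authors draw.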

So what does cause the anomalous behaviour of the cosmic microwave background?

I have no idea.

Bayes’ Razor

Posted in Bad Statistics, The Universe and Stuff on February 19, 2011 by telescoper

It’s been quite a while since I posted a little piece about Bayesian probability. That one and the others that followed it (here and here) proved to be surprisingly popular so I’ve been planning to add a few more posts whenever I could find the time. Today I find myself in the office after spending the morning helping out with a very busy UCAS visit day, and it’s raining, so I thought I’d take the opportunity to write something before going home. I think I’ll do a short introduction to a topic I want to give a more technical treatment of in due course.

A particularly important feature of Bayesian reasoning is that it gives precise motivation to things that we are generally taught as rules of thumb. The most important of these is Ockham’s Razor. This famous principle of intellectual economy is variously presented in Latin as Pluralitas non est ponenda sine necessitate or Entia non sunt multiplicanda praeter necessitatem. Either way, it means basically the same thing: the simplest theory which fits the data should be preferred.

William of Ockham, to whom this dictum is attributed, was an English Scholastic philosopher (probably) born at Ockham in Surrey in 1280. He joined the Franciscan order around 1300 and ended up studying theology in Oxford. He seems to have been an outspoken character, and was in fact summoned to Avignon in 1323 to account for his alleged heresies in front of the Pope, and was subsequently confined to a monastery from 1324 to 1328. He died in 1349.

In the framework of Bayesian inductive inference, it is possible to give precise reasons for adopting Ockham’s razor. To take a simple example, suppose we want to fit a curve to some data. In the presence of noise (or experimental error) which is inevitable, there is bound to be some sort of trade-off between goodness-of-fit and simplicity. If there is a lot of noise then a simple model is better: there is no point in trying to reproduce every bump and wiggle in the data with a new parameter or physical law because such features are likely to be features of the noise rather than the signal. On the other hand if there is very little noise, every feature in the data is real and your theory fails if it can’t explain it.

To go a bit further it is helpful to consider what happens when we generalize one theory by adding to it some extra parameters. Suppose we begin with a very simple theory, just involving one parameter p, but we fear it may not fit the data. We therefore add a couple more parameters, say q and r. These might be the coefficients of a polynomial fit, for example: the first model might be straight line (with fixed intercept), the second a cubic. We don’t know the appropriate numerical values for the parameters at the outset, so we must infer them by comparison with the available data.

Quantities such as p, q and r are usually called “floating” parameters; there are as many as a dozen of these in the standard Big Bang model, for example.

Obviously, having three degrees of freedom with which to describe the data should enable one to get a closer fit than is possible with just one. The greater flexibility within the general theory can be exploited to match the measurements more closely than the original. In other words, such a model can improve the likelihood, i.e. the probability  of the obtained data  arising (given the noise statistics – presumed known) if the signal is described by whatever model we have in mind.
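As a concrete sketch (my own illustration, not from the post): fit noisy data with the one-parameter line y = p x and with the three-parameter cubic y = p x + q x² + r x³. Because the line is a special case of the cubic, the cubic's best fit can never have a larger residual sum of squares, and hence never a lower maximum likelihood:

```python
# Illustrative sketch: more floating parameters always fit at least
# as well. Data are drawn from a straight line plus Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

# Model 1: y = p x (one floating parameter, intercept fixed at zero)
p = np.sum(x * y) / np.sum(x * x)            # least-squares slope
rss_line = np.sum((y - p * x) ** 2)

# Model 2: y = p x + q x^2 + r x^3 (three floating parameters)
A = np.vstack([x, x**2, x**3]).T
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
rss_cubic = np.sum((y - A @ coeffs) ** 2)

# For Gaussian noise the log-likelihood is -RSS/(2 sigma^2) + const,
# so the lower RSS of the cubic means a higher likelihood.
print(rss_line, rss_cubic)
```

The cubic "wins" on likelihood even though the underlying signal really is a straight line; the extra parameters are busy fitting the noise.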

But Bayes’ theorem tells us that there is a price to be paid for this flexibility, in that each new parameter has to have a prior probability assigned to it. This probability will generally be smeared out over a range of values where the experimental results (contained in the likelihood) subsequently show that the parameters don’t lie. Even if the extra parameters allow a better fit to the data, this dilution of the prior probability may result in the posterior probability being lower for the generalized theory than the simple one. The more parameters are involved, the bigger the space of prior possibilities for their values, and the harder it is for the improved likelihood to win out. Arbitrarily complicated theories are simply improbable. The best theory is the most probable one, i.e. the one for which the product of likelihood and prior is largest.

To give a more quantitative illustration of this, consider a given model M which has a set of N floating parameters represented as a vector \underline{\lambda} = (\lambda_1,\ldots, \lambda_N), with components \lambda_i; in a sense each choice of parameters represents a different model or, more precisely, a member of the family of models labelled M.

Now assume we have some data D and can consequently form a likelihood function P(D|\underline{\lambda},M). In Bayesian reasoning we have to assign a prior probability P(\underline{\lambda}|M) to the parameters of the model which, if we’re being honest, we should do in advance of making any measurements!

The interesting thing to look at now is not the best-fitting choice of model parameters \underline{\lambda} but the extent to which the data support the model in general.  This is encoded in a sort of average of likelihood over the prior probability space:

P(D|M) = \int P(D|\underline{\lambda},M) P(\underline{\lambda}|M) d^{N}\underline{\lambda}.

This is just the normalizing constant K usually found in statements of Bayes’ theorem which, in this context, takes the form

P(\underline{\lambda}|D,M) = K^{-1}P(\underline{\lambda}|M)P(D|\underline{\lambda},M).

In statistical mechanics things like K are usually called partition functions, but in this setting K is called the evidence, and it is used to form the so-called Bayes Factor, used in a technique known as Bayesian model selection of which more anon….
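A minimal numerical sketch of these two formulae (a toy example of my own, not from the post): a single parameter \lambda, a Gaussian likelihood with known noise, and a flat prior. The evidence K is obtained by integrating likelihood times prior over a grid, and dividing by K normalises the posterior, exactly as Bayes’ theorem requires:

```python
# Toy example: the evidence K = P(D|M) for a one-parameter model,
# computed by direct numerical integration on a grid.
import numpy as np

data = np.array([0.9, 1.1, 1.3, 0.8])   # D: noisy measurements of lambda
sigma = 0.2                              # noise level, presumed known

lam = np.linspace(-2.0, 4.0, 20001)      # grid over the parameter lambda
dlam = lam[1] - lam[0]

prior = np.full(lam.shape, 1.0 / 6.0)    # flat prior on [-2, 4]
loglike = -0.5 * np.sum((data[:, None] - lam[None, :]) ** 2, axis=0) / sigma**2
like = np.exp(loglike) / (np.sqrt(2.0 * np.pi) * sigma) ** data.size

K = np.sum(like * prior) * dlam          # the evidence P(D|M)
posterior = like * prior / K             # P(lambda|D,M), normalised by K

print(K, np.sum(posterior) * dlam)       # the posterior integrates to 1
```

With more parameters the integral is over a higher-dimensional grid (or, in practice, done by nested sampling or similar methods), but the structure is the same.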

The usefulness of the Bayesian evidence emerges when we ask whether our N parameters are sufficient to get a reasonable fit to the data. Should we add another one to improve things a bit further? And why not another one after that? When should we stop?

The answer is that although adding an extra degree of freedom can increase the first factor in the integrand defining K (the likelihood), it also imposes a penalty through the second factor (the prior), because the more parameters there are, the more smeared out the prior probability must be. If the improvement in fit is marginal and/or the data are noisy, then the second factor wins and the evidence for a model with N+1 parameters is lower than that for the N-parameter version. Ockham’s razor has done its job.
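Here is a hedged numerical sketch of that penalty (my own construction, with the data placed exactly on the simple model so the comparison is deterministic): model M1 fits y = a x, model M2 adds an intercept b with a broad prior. The extra parameter buys no improvement in fit at all here, so the dilution of the prior can only lower the evidence. The common Gaussian normalisation constant of the likelihood cancels in the ratio of evidences, so it is omitted:

```python
# Ockham's razor in action: the evidence for the 2-parameter model
# is lower than for the 1-parameter model when the extra parameter
# is not needed. Data lie exactly on y = 2x; assumed noise sigma.
import numpy as np

x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x
sigma = 0.1

# Model M1: y = a x, with prior a ~ Uniform(0, 4)
a = np.linspace(0.0, 4.0, 401)
da = a[1] - a[0]
rss1 = np.sum((y[None, :] - a[:, None] * x[None, :]) ** 2, axis=1)
K1 = np.sum(np.exp(-0.5 * rss1 / sigma**2)) * da / 4.0   # prior density 1/4

# Model M2: y = a x + b, with priors a ~ U(0, 4), b ~ U(-5, 5)
b = np.linspace(-5.0, 5.0, 1001)
db = b[1] - b[0]
pred = a[:, None, None] * x[None, None, :] + b[None, :, None]
rss2 = np.sum((y[None, None, :] - pred) ** 2, axis=2)
K2 = np.sum(np.exp(-0.5 * rss2 / sigma**2)) * da * db / 40.0  # density 1/40

print(K1, K2, K1 / K2)   # K1 > K2: the simpler model is favoured
```

The best-fitting likelihoods of the two models are identical (a = 2, b = 0), yet the evidence ratio strongly favours M1, purely because M2 spreads its prior probability over values of b that the data then rule out.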

This is a satisfying result that is in nice accord with common sense. But I think it goes much further than that. Many modern-day physicists are obsessed with the idea of a “Theory of Everything” (or TOE). Such a theory would entail the unification of all physical theories – all laws of Nature, if you like – into a single principle. An equally accurate description would then be available, in a single formula, of phenomena that are currently described by distinct theories with separate sets of parameters. Instead of textbooks on mechanics, quantum theory, gravity, electromagnetism, and so on, physics students would need just one book.

The physicist Stephen Hawking has described the quest for a TOE as like trying to read the Mind of God. I think that is silly. If a TOE is ever constructed it will be the most economical available description of the Universe. Not the Mind of God. Just the best way we have of saving paper.