Archive for statistics

Isotropic Random Fields in Astrophysics – Workshop Announcement!

Posted in The Universe and Stuff on May 8, 2017 by telescoper

We have a little workshop coming up in Cardiff at the end of June, sponsored via a “seedcorn” grant by the Data Innovation Research Institute.

This meeting is part of a series of activities aimed at bringing together world-leading experts in the analysis of big astrophysical data sets, specifically those arising from the (previous) Planck and (future) Euclid space missions, with mathematical experts in the spectral theory of scalar-, vector- or tensor-valued isotropic random fields. Our aim is to promote collaboration between mathematicians interested in probability theory and statistical analysis and theoretical and observational astrophysicists, both within Cardiff University and further afield.

The workshop page can be found here. We have a great list of invited speakers from as far afield as Japan and California (as well as some from much closer to home) and we’re also open for contributed talks. We’ll be publishing the full programme of titles and abstracts soon. Registration is free of charge, but you do need to register so we can be sure we have enough space, enough coffee and enough lunch! That goes whether you want to give a contributed talk, or just come along and listen!

It’s only a short (two-day) meeting and we are aiming for an informal atmosphere with plenty of time for discussions, with roughly a 50-50 blend of astrophysicists and mathematicians. To achieve that aim we’d particularly welcome a few more contributed talks from the mathematical side of the house, but we still have space for more astrophysics talks too! We’d also welcome more contributions from early career researchers, especially PhD students.

Please feel free to pass this around your colleagues.


The Neyman-Scott ‘Paradox’

Posted in Bad Statistics, Cute Problems on November 25, 2016 by telescoper

I just came across this interesting little problem recently and thought I’d share it here. It’s usually called the ‘Neyman-Scott’ paradox. Before going on it’s worth mentioning that Elizabeth Scott (the second half of Neyman-Scott) was an astronomer by background. Her co-author was Jerzy Neyman. As has been the case for many astronomers, she contributed greatly to the development of the field of statistics. Anyway, I think this example provides another good illustration of the superiority of Bayesian methods for estimating parameters, but I’ll let you make your own mind up about what’s going on.

The problem is fairly technical so I’ve done a quick version in LaTeX that you can download here, but I’ve also copied it into this post so you can read it below:




I look forward to receiving Frequentist Flak or Bayesian Benevolence through the comments box below!

What does “Big Data” mean to you?

Posted in The Universe and Stuff on April 7, 2016 by telescoper

On several occasions recently I’ve had to talk about Big Data for one reason or another. I’m always at a disadvantage when I do that because I really dislike the term. Clearly I’m not the only one who feels this way:


For one thing the term “Big Data” seems to me like describing the Ocean as “Big Water”. For another it’s not really just how big the data set is that matters. Size isn’t everything, after all. There is much truth in Stalin’s comment that “Quantity has a quality all its own”, in that very large data sets allow you to do things you wouldn’t even try with smaller ones, but it can be complexity rather than sheer size that requires new methods of analysis.


The biggest event in my own field of cosmology in the last few years has been the Planck mission. The data set is indeed huge: the above map of the temperature pattern in the cosmic microwave background has no fewer than 167 million pixels. That certainly caused some headaches in the analysis pipeline, but I think I would argue that this wasn’t really a Big Data project. I don’t mean that to be insulting to anyone, just that the main analysis of the Planck data was aimed at doing something very similar to what had been done (by WMAP), i.e. extracting the power spectrum of temperature fluctuations:

It’s a wonderful result of course that extends the measurements that WMAP made up to much higher frequencies, but Planck’s goals were phrased in similar terms to those of WMAP – to pin down the parameters of the standard model to as high accuracy as possible. For me, a real “Big Data” approach to cosmic microwave background studies would involve doing something that couldn’t have been done at all with a smaller data set. An example that springs to mind is looking for indications of effects beyond the standard model.

Moreover what passes for Big Data in some fields would be just called “data” in others. For example, the ATLAS detector on the Large Hadron Collider comprises about 150 million sensors delivering data 40 million times per second. There are about 600 million collisions per second, out of which perhaps one hundred per second are useful. The issue here is then one of dealing with an enormous rate of data in such a way as to be able to discard most of it very quickly. The same will be true of the Square Kilometre Array, which will acquire exabytes of data every day out of which perhaps one petabyte will need to be stored. Both these projects involve data sets much bigger and more difficult to handle than what might pass for Big Data in other arenas.

Books you can buy at airports about Big Data generally list the following four or five characteristics:

  1. Volume
  2. Velocity
  3. Variety
  4. Veracity
  5. Variability

The first two are about the size and acquisition rate of the data mentioned above but the others are more about qualitatively different matters. For example, in cosmology nowadays we have to deal with data sets which are indeed quite large, but also very different in form.  We need to be able to do efficient joint analyses of heterogeneous data structures with very different sampling properties and systematic errors in such a way that we get the best science results we can. Now that’s a Big Data challenge!


The Insignificance of ORB

Posted in Bad Statistics on April 5, 2016 by telescoper

A piece about opinion polls ahead of the EU Referendum which appeared in today’s Daily Torygraph has spurred me on to make a quick contribution to my bad statistics folder.

The piece concerned includes the following statement:

David Cameron’s campaign to warn voters about the dangers of leaving the European Union is beginning to win the argument ahead of the referendum, a new Telegraph poll has found.

The exclusive poll found that the “Remain” campaign now has a narrow lead after trailing last month, in a sign that Downing Street’s tactic – which has been described as “Project Fear” by its critics – is working.

The piece goes on to explain

The poll finds that 51 per cent of voters now support Remain – an increase of 4 per cent from last month. Leave’s support has decreased five points to 44 per cent.

This conclusion is based on the results of a survey by ORB in which the number of participants was 800. Yes, eight hundred.

How much can we trust this result on statistical grounds?

Suppose the fraction of the population having the intention to vote in a particular way in the EU referendum is p. For a sample of size n in which x respondents indicate that they intend to vote that way, one can straightforwardly estimate p \simeq x/n. So far so good, as long as there is no bias induced by the form of the question asked or by the selection of the sample which, given the fact that such polls have been all over the place, seems rather unlikely.

A little bit of mathematics involving the binomial distribution yields an answer for the uncertainty in this estimate of p in terms of the sampling error:

\sigma = \sqrt{\frac{p(1-p)}{n}}

For the sample size of 800 given, and an actual value p \simeq 0.5 this amounts to a standard error of about 2%. About 95% of samples drawn from a population in which the true fraction is p will yield an estimate within p \pm 2\sigma, i.e. within about 4% of the true figure. In other words the typical variation between two samples drawn from the same underlying population is about 4%. In other other words, the change reported between the two ORB polls mentioned above can be entirely explained by sampling variation and does not at all imply any systematic change of public opinion between the two surveys.
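The arithmetic above is easy to check for yourself. Here’s a minimal Python sketch of the calculation (the sample size and assumed fraction are taken from the post; the function name is mine):

```python
import math

def sampling_error(p, n):
    """Standard error of a proportion estimated from a simple random sample."""
    return math.sqrt(p * (1 - p) / n)

n = 800   # the ORB sample size
p = 0.5   # assumed true fraction; this choice maximises the error

sigma = sampling_error(p, n)
print(f"standard error: {sigma:.3f}")        # ~0.018, i.e. about 2%
print(f"95% interval: +/- {2 * sigma:.3f}")  # ~0.035, i.e. about 4%
```

Note that p(1-p) is largest at p = 0.5, so in a close race like this one the sampling error is as big as it gets.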

I need hardly point out that in a two-horse race (between “Remain” and “Leave”) an increase of 4% in the Remain vote corresponds to a decrease in the Leave vote by the same 4%, so a 50-50 population vote can easily generate a margin as large as 54-46 in such a small sample.

Why do pollsters bother with such tiny samples? With such a large margin of error they are basically meaningless.

I object to the characterization of the Remain campaign as “Project Fear” in any case. I think it’s entirely sensible to point out the serious risks that an exit from the European Union would generate for the UK in loss of trade, science funding, financial instability, and indeed the near-inevitable secession of Scotland. But in any case this poll doesn’t indicate that anything is succeeding in changing anything other than statistical noise.

Statistical illiteracy is as widespread amongst politicians as it is amongst journalists, but the fact that silly reports like this are commonplace doesn’t make them any less annoying. After all, the idea of sampling uncertainty isn’t all that difficult to understand. Is it?

And with so many more important things going on in the world that deserve better press coverage than they are getting, why does a “quality” newspaper waste its valuable column inches on this sort of twaddle?

The Essence of Cosmology is Statistics

Posted in The Universe and Stuff on September 8, 2015 by telescoper

I’m grateful to Licia Verde for sending this picture of me in action at last week’s conference in Castiglioncello.


The quote is one I use quite regularly, as the source is quite surprising. It is by George McVittie and appears in the Preface to the Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, which took place in 1956. It is surprising for two reasons. One is that McVittie is more strongly associated with theoretical cosmology than with statistics. In fact I have one of his books, the first edition of which was published in 1937:


There’s a bit in the book about observational cosmology, but basically it’s wall-to-wall Christoffel symbols!

The other surprising thing is that way back in 1956 there was precious little statistical information relevant to cosmology anyway, a far cry from the situation today with our plethora of maps and galaxy surveys. What he was saying though was that statistics is all about making inferences based on partial or incomplete data. Given that the subject of cosmology is the entire Universe, it is obvious we will never have complete data (i.e. we will never know everything). Hence cosmology is essentially statistical. This is true of other fields too, but in cosmology it is taken to an extreme. George McVittie passed away in 1988, so didn’t really live long enough to see this statement fulfilled, but it certainly has been over the last couple of decades!

P.S. Although he spent much of his working life in the East End of London (at Queen Mary College), George McVittie should not be confused with the even more famous, or rather infamous, Jack McVitie.

Adventures with the One-Point Distribution Function

Posted in Bad Statistics, Books, Talks and Reviews, The Universe and Stuff on September 1, 2015 by telescoper

As I promised a few people, here are the slides I used for my talk earlier today at the meeting I am attending. Actually I was given only 30 minutes and used up a lot of that time on two things that haven’t got much to do with the title. One was a quiz to identify the six famous astronomers (or physicists) who had made important contributions to statistics (Slide 2) and the other was on some issues that arose during the discussion session yesterday evening. I didn’t in the end talk much about the topic given in the title, which was about how, despite learning a huge amount about certain aspects of galaxy clustering, we are still far from a good understanding of the one-point distribution of density fluctuations. I guess I’ll get the chance to talk more about that in the near future!

P.S. I think the six famous faces should be easy to identify, so there are no prizes but please feel free to guess through the comments box!