What does “Big Data” mean to you?
On several occasions recently I’ve had to talk about Big Data for one reason or another. I’m always at a disadvantage when I do that because I really dislike the term. Clearly I’m not the only one who feels this way:
For one thing the term “Big Data” seems to me like describing the Ocean as “Big Water”. For another, it’s not really just how big the data set is that matters. Size isn’t everything, after all. There is much truth in Stalin’s comment that “Quantity has a quality all its own”, in that very large data sets allow you to do things you wouldn’t even try with smaller ones, but it can also be complexity, rather than sheer size, that requires new methods of analysis.
The biggest event in my own field of cosmology in the last few years has been the Planck mission. The data set is indeed huge: the above map of the temperature pattern in the cosmic microwave background has no fewer than 167 million pixels. That certainly caused some headaches in the analysis pipeline, but I think I would argue that this wasn’t really a Big Data project. I don’t mean that to be insulting to anyone, just that the main analysis of the Planck data was aimed at doing something very similar to what had been done (by WMAP), i.e. extracting the power spectrum of temperature fluctuations:
It’s a wonderful result of course that extends the measurements that WMAP made up to much higher frequencies, but Planck’s goals were phrased in similar terms to those of WMAP – to pin down the parameters of the standard model to as high accuracy as possible. For me, a real “Big Data” approach to cosmic microwave background studies would involve doing something that couldn’t have been done at all with a smaller data set. An example that springs to mind is looking for indications of effects beyond the standard model.
Moreover, what passes for Big Data in some fields would be just called “data” in others. For example, the ATLAS detector on the Large Hadron Collider represents about 150 million sensors delivering data 40 million times per second. There are about 600 million collisions per second, out of which perhaps one hundred per second are useful. The issue here is then one of dealing with an enormous rate of data in such a way as to be able to discard most of it very quickly. The same will be true of the Square Kilometre Array, which will acquire exabytes of data every day, out of which perhaps one petabyte will need to be stored. Both these projects involve data sets much bigger and more difficult to handle than what might pass for Big Data in other arenas.
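To put rough numbers on those rates, here is a back-of-the-envelope sketch in Python. The ATLAS figures are the ones quoted above. The Square Kilometre Array figures are an assumption made purely for illustration: "exabytes every day" is taken as one exabyte per day, and the one petabyte kept is assumed to be per day as well. The variable and function names are mine, not anything from either project.

```python
# Figures quoted in the text for the ATLAS detector.
SENSORS = 150_000_000               # about 150 million sensors
READOUTS_PER_SECOND = 40_000_000    # each delivering data 40 million times per second
COLLISIONS_PER_SECOND = 600_000_000
USEFUL_PER_SECOND = 100             # "perhaps one hundred per second are useful"

# Illustrative ASSUMPTION for the SKA: one exabyte acquired per day,
# one petabyte of that stored (the text only says "exabytes" per day).
EXABYTE = 10**18                    # bytes
PETABYTE = 10**15                   # bytes
SKA_ACQUIRED_PER_DAY = 1 * EXABYTE
SKA_STORED_PER_DAY = 1 * PETABYTE


def reduction_factor(raw, kept):
    """Number of raw items acquired for every item that is kept."""
    return raw / kept


# Total individual sensor readings per second across the detector.
sensor_readings_per_second = SENSORS * READOUTS_PER_SECOND

# Collisions seen for every collision worth keeping.
lhc_reduction = reduction_factor(COLLISIONS_PER_SECOND, USEFUL_PER_SECOND)

# Bytes acquired for every byte stored, under the assumption above.
ska_reduction = reduction_factor(SKA_ACQUIRED_PER_DAY, SKA_STORED_PER_DAY)

print(f"ATLAS sensor readings per second: {sensor_readings_per_second:.1e}")
print(f"ATLAS collisions seen per one kept: {lhc_reduction:,.0f}")
print(f"SKA bytes acquired per byte stored (assumed): {ska_reduction:,.0f}")
```

On these numbers only about one collision in six million survives, which is why the hard part is the speed of the discarding rather than the size of what is finally kept.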
Books you can buy at airports about Big Data generally list the following four or five characteristics:
The first two are about the size and acquisition rate of the data mentioned above, but the others are more about qualitatively different matters. For example, in cosmology nowadays we have to deal with data sets which are indeed quite large, but also very different in form. We need to be able to do efficient joint analyses of heterogeneous data structures with very different sampling properties and systematic errors in such a way that we get the best science results we can. Now that’s a Big Data challenge!