Archive for Big Data

Data-Intensive Physics and Astrophysics

Posted in Biographical, Education on March 27, 2017 by telescoper

One of the jobs I’ve got in my current position (which is divided between the School of Physics & Astronomy and the Data Innovation Research Institute) is to develop new teaching activities, focussing on interdisciplinary courses involving a Data Science component. Although I only started developing them in September last year, the first two such courses have been formally approved and are now open for admission of new students starting in September 2017. That is a very fast track for such things, as there are many hurdles to clear in preparing new courses. Meeting the deadlines hasn’t been easy, which is largely why I’ve been whingeing on here about workload issues, but we’re finally there!

The two new courses are both at Masters (MSc) level and are called Data-Intensive Physics and Data-Intensive Astrophysics and they are both taught jointly by staff in the School of Physics and Astronomy and the School of Computer Science and Informatics in a kind of major/minor combination.

The aim of these courses is twofold.

One is to provide specialist postgraduate training for students wishing to go into academic research in a ‘data-intensive’ area of physics or astrophysics, by which I mean a field that involves the analysis and manipulation of very large or complex data sets and/or the use of high-performance computing for, e.g., simulation work. There is a shortage of postgraduates with the necessary combination of skills to begin PhD programmes in such areas, and we plan to try to fill the gap with these courses.

The other aim is to cater for students who may not have made up their minds whether to go into academic research, but wish to keep their options open while pursuing a postgraduate course. The unique combination of physics/astrophysics and computer science will give those with these qualifications the option of either continuing into academic research or going into another sphere of data-intensive work in the wider world of Big Data.

We’ll be putting out some official promotional materials for these courses very soon, but I thought I’d mention them here partly because it might help with recruitment and partly because I’m so relieved that they’ve actually made it into the prospectus.


Science for the Citizen

Posted in Education, Open Access, The Universe and Stuff on March 20, 2017 by telescoper

I spent all day on Friday on business connected with my role in the Data Innovation Research Institute, attending an event to launch the new Data Justice Lab at Cardiff University. It was a fascinating day of discussions about all kinds of ethical, legal and political issues surrounding the “datafication” of society:

Our financial transactions, communications, movements, relationships, and interactions with government and corporations all increasingly generate data that are used to profile and sort groups and individuals. These processes can affect both individuals as well as entire communities that may be denied services and access to opportunities, or wrongfully targeted and exploited. In short, they impact on our ability to participate in society. The emergence of this data paradigm therefore introduces a particular set of power dynamics requiring investigation and critique.

As a scientist whose research is in an area (cosmology) which is extremely data-intensive, I have a fairly clear interpretation of the phrase “Big Data” and recognize the need for innovative methods to handle the scale and complexity of the data we use. This clarity comes largely from the fact that we are asking very well-defined questions which can be framed in quantitative terms within the framework of well-specified theoretical models. In this case, sophisticated algorithms can be constructed that extract meaningful information even when individual measurements are dominated by noise.
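The point about extracting meaningful information from noise-dominated measurements can be illustrated with a toy example. This is not any real CMB pipeline, and all the numbers are invented for illustration; it simply shows the basic statistical fact that averaging many independent measurements beats the noise down as one over the square root of their number:

```python
import numpy as np

rng = np.random.default_rng(42)

true_signal = 0.1      # weak signal amplitude (arbitrary units)
noise_sigma = 1.0      # per-measurement noise, ten times the signal
n_obs = 100_000        # number of independent measurements

# Each individual measurement is utterly dominated by noise...
measurements = true_signal + noise_sigma * rng.normal(size=n_obs)

# ...but the average beats the noise down as 1/sqrt(N),
# so the weak signal emerges with a well-defined uncertainty.
estimate = measurements.mean()
uncertainty = noise_sigma / np.sqrt(n_obs)

print(f"estimate = {estimate:.4f} +/- {uncertainty:.4f}")
```

Real algorithms are far more sophisticated than a straight average, of course, but they exploit the same principle: a well-specified statistical model of both signal and noise.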

The use of “Big Data” in civic society is much more problematic because the questions being asked are often ill-posed and there is rarely any compelling underlying theory. A naive belief exists in some quarters that harvesting more and more data necessarily leads to an increase in relevant information. Instead there is a danger that algorithms simply encode false assumptions and produce unintended consequences, often with disastrous results for individuals. We heard plenty of examples of this on Friday.

Although it is clearly the case that personal data can be – and indeed is – deliberately used for nefarious purposes, I think there’s a parallel danger that we increasingly tend to believe that just because something is based on numerical calculations it somehow must be “scientific”. In reality, any attempt to extract information from quantitative data relies on assumptions. If those assumptions are wrong, then you get garbage out no matter what you put in. Some applications of “data science” – those that don’t recognize these limitations – are in fact extremely unscientific.

I mentioned in discussions on Friday that there is a considerable push in astrophysics and cosmology for open science, by which I mean that not only are the published results openly accessible, but all the data and analysis algorithms are published too. Not all branches of science work this way, and we’re very far indeed from a society that applies such standards to the use of personal data.

Anyway, after the day’s discussion we adjourned to the School of Journalism, Media and Cultural Studies for a set of more formal presentations. The Head of School, Professor Stuart Allan, introduced this session with some quotes from a book called Science for the Citizen, written by Lancelot Hogben in 1938. I haven’t read the book, but it looks fascinating and prescient. I have just ordered it and look forward to reading it. You can get the full text free online here.

Here is the first paragraph of Chapter 1:

A MUCH abused writer of the nineteenth century said: up to the present philosophers have only interpreted the world, it is also necessary to change it. No statement more fittingly distinguishes the standpoint of humanistic philosophy from the scientific outlook. Science is organized workmanship. Its history is co-extensive with that of civilized living. It emerges so soon as the secret lore of the craftsman overflows the dam of oral tradition, demanding a permanent record of its own. It expands as the record becomes accessible to a widening personnel, gathering into itself and coordinating the fruits of new crafts. It languishes when the social incentive to new productive accomplishment is lacking, and when its custodians lose the will to share it with others. Its history, which is the history of the constructive achievements of mankind, is also the history of the democratization of positive knowledge. This book is written to tell the story of its growth as a record of human achievement, a story of the satisfaction of the common needs of mankind, disclosing as it unfolds new horizons of human wellbeing which lie before us, if we plan our new resources intelligently.

The phrase that struck me with particular force is “the democratization of positive knowledge”. That is what I believe science should do, but the closed culture of many fields of modern science makes it difficult to argue that’s what it actually does. Instead, there is an increasing tendency for scientific knowledge in many domains to be concentrated in a small number of people with access to the literature and the expertise needed to make sense of it.

In an increasingly technologically-driven society, the gap between the few who are in the know and the many who are not poses a grave threat to our existence as an open and inclusive democracy. The public needs to be better informed about science (as well as a great many other things). Two areas need attention.

In fields such as my own there’s a widespread culture of working very hard at outreach. This overarching term includes trying to get people interested in science and encouraging more kids to take it seriously at school and college, but also engaging directly with members of the public and institutions that represent them. Not all scientists take the same attitude, though, and we must try harder. Moves are being made to give more recognition to public engagement, but a drastic improvement is necessary if our aim is to make our society genuinely democratic.

But the biggest issue we have to confront is education. The quality of science education must improve, especially in state schools where pupils sometimes don’t have appropriately qualified teachers and so are unable to learn, e.g. physics, properly. The less wealthy are becoming systematically disenfranchised through their lack of access to the education they need to understand the complex issues relating to life in an advanced technological society.

If we improve school education, we may well get more graduates in STEM areas too although this government’s cuts to Higher Education make that unlikely. More science graduates would be good for many reasons, but I don’t think the greatest problem facing the UK is the lack of qualified scientists – it’s that too few ordinary citizens have even a vague understanding of what science is and how it works. They are therefore unable to participate in an informed way in discussions of some of the most important issues facing us in the 21st century.

We can’t expect everyone to be a science expert, but we do need higher levels of basic scientific literacy throughout our society. Unless this happens we will be increasingly vulnerable to manipulation by the dark forces of global capitalism via the media they control. You can see it happening already.

What does “Big Data” mean to you?

Posted in The Universe and Stuff on April 7, 2016 by telescoper

On several occasions recently I’ve had to talk about Big Data for one reason or another. I’m always at a disadvantage when I do that because I really dislike the term. Clearly I’m not the only one who feels this way:


For one thing, the term “Big Data” seems to me like describing the ocean as “Big Water”. For another, it’s not really just how big the data set is that matters. Size isn’t everything, after all. There is much truth in Stalin’s comment that “quantity has a quality all its own”, in that very large data sets allow you to do things you wouldn’t even try with smaller ones, but it can be complexity rather than sheer size that requires new methods of analysis.


The biggest event in my own field of cosmology in the last few years has been the Planck mission. The data set is indeed huge: the above map of the temperature pattern in the cosmic microwave background has no fewer than 167 million pixels. That certainly caused some headaches in the analysis pipeline, but I would argue that this wasn’t really a Big Data project. I don’t mean that to be insulting to anyone, just that the main analysis of the Planck data was aimed at doing something very similar to what had been done (by WMAP), i.e. extracting the power spectrum of temperature fluctuations:

It’s a wonderful result of course that extends the measurements that WMAP made up to much higher frequencies, but Planck’s goals were phrased in similar terms to those of WMAP – to pin down the parameters of the standard model to as high accuracy as possible. For me, a real “Big Data” approach to cosmic microwave background studies would involve doing something that couldn’t have been done at all with a smaller data set. An example that springs to mind is looking for indications of effects beyond the standard model.
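The idea of power-spectrum extraction can be sketched in one dimension. Real CMB analyses expand the sky map in spherical harmonics rather than Fourier modes, and this toy example (with entirely invented numbers) only illustrates the underlying principle: periodic structure stands out sharply in the spectrum even when the map itself looks like noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D "map": two oscillatory modes buried in much larger noise.
n = 4096
x = np.arange(n)
signal = (2.0 * np.sin(2 * np.pi * 50 * x / n)
          + 1.0 * np.sin(2 * np.pi * 120 * x / n))
data = signal + rng.normal(scale=3.0, size=n)

# Periodogram: squared amplitude of each Fourier mode.
power = np.abs(np.fft.rfft(data))**2 / n

# The two input modes dominate the spectrum despite the noise.
peaks = np.argsort(power[1:])[-2:] + 1
print(sorted(int(p) for p in peaks))   # the two strongest modes
```

The power in each signal mode grows with the length of the data set while the noise per mode does not, which is one reason huge maps like Planck’s are so valuable.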

Moreover, what passes for Big Data in some fields would be just called “data” in others. For example, the ATLAS detector on the Large Hadron Collider comprises about 150 million sensors delivering data 40 million times per second. There are about 600 million collisions per second, of which perhaps one hundred per second are useful. The issue here is one of dealing with an enormous rate of data in such a way as to be able to discard most of it very quickly. The same will be true of the Square Kilometre Array, which will acquire exabytes of data every day, of which perhaps one petabyte will need to be stored. Both these projects involve data sets much bigger and more difficult to handle than what might pass for Big Data in other arenas.
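The trigger problem – deciding in real time which events to keep – can be sketched as a streaming filter. The event model and threshold below are invented for illustration and bear no relation to the real ATLAS trigger; the point is simply that events are examined one at a time with a cheap test and almost all are discarded without ever being stored:

```python
import random

def event_stream(n):
    """Hypothetical detector readout: each event is just an 'energy' here."""
    for _ in range(n):
        yield random.expovariate(1.0)

def trigger(events, threshold):
    """Toy first-level trigger: a cheap per-event decision discards
    uninteresting events immediately, so only a tiny fraction is
    ever written to storage."""
    for energy in events:
        if energy > threshold:
            yield energy

random.seed(1)
kept = list(trigger(event_stream(1_000_000), threshold=10.0))
# Exponential tail: P(E > 10) = e^-10, so only a few dozen
# of the million simulated events survive.
print(len(kept))
```

Because both functions are generators, no more than one event is held in memory at a time – exactly the property a system ingesting an enormous rate of data needs.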

Books you can buy at airports about Big Data generally list the following four or five characteristics:

  1. Volume
  2. Velocity
  3. Variety
  4. Veracity
  5. Variability

The first two are about the size and acquisition rate of the data mentioned above, but the others concern qualitatively different matters. For example, in cosmology nowadays we have to deal with data sets which are indeed quite large, but also very different in form. We need to be able to do efficient joint analyses of heterogeneous data structures with very different sampling properties and systematic errors in such a way that we get the best science results we can. Now that’s a Big Data challenge!
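One standard ingredient of such joint analyses is inverse-variance weighting, which combines measurements of the same quantity made with different uncertainties. The numbers below are purely illustrative, and real joint analyses must also model correlated systematics between data sets rather than treating them as independent, as this minimal sketch does:

```python
import numpy as np

# Two hypothetical surveys measure the same parameter
# with different uncertainties.
values = np.array([0.70, 0.74])   # measured values
sigmas = np.array([0.05, 0.02])   # per-survey 1-sigma errors

# Inverse-variance weighting: the more precise survey
# dominates, and the combined error is smaller than either.
weights = 1.0 / sigmas**2
combined = np.sum(weights * values) / np.sum(weights)
combined_sigma = 1.0 / np.sqrt(np.sum(weights))

print(f"combined = {combined:.4f} +/- {combined_sigma:.4f}")
```

This is the optimal combination only when the two measurements are independent and Gaussian; the hard part of real heterogeneous analyses is precisely that those assumptions fail.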