Archive for Data Science

Is nothing > data?

Posted in Uncategorized on July 19, 2017 by telescoper

I got this yesterday from one of my office mates who suggested that I stick it somewhere. It’s an advert for a data science company called Pivigo. Logically, the statement on the sticker implies that data is less than nothing, which I don’t think is the point that they’re trying to make. On the other hand, I suppose that by posting this I’ve given Pivigo some free advertising so in some sense it is a successful promotional ploy!

Anyway, when I posted this on Twitter it sparked a little discussion about the vexed issue of whether the word `data’ is singular or plural, so I decided to bore my readers with thoughts on that – not that I’m pedantic or anything.

The word `data’ is formed from the Latin plural of the word `datum’ (itself formed from the past participle of the Latin verb `dare’, meaning `to give’) hence meaning `things given’ or words to that effect. The modern usage of `data’ (to refer to measurements or quantitative information) seems not to have been present in Roman or mediaeval times so some argue that it is a deliberate archaism to treat it as a Latin plural now. Moreover, some insist that `data’ in modern usage is a `mass noun’ so should on those grounds also be treated as singular.

For those of you who aren’t up with such things, English nouns can be of two forms: `count nouns’ and `non-count nouns’ (also known as `mass nouns’). Count nouns are those that can be enumerated and therefore have both plural and singular forms: one eye, two eyes, etc. Non-count nouns are those which describe something which is not enumerable, such as `furniture’ or `cutlery’. Such things can’t be counted so they don’t have different singular and plural forms: you can have two chairs (count noun) but can’t have two furnitures (non-count noun).

Count and non-count nouns require different grammatical treatment. You can ask `how much furniture do you have?’ but not how many. The answer to a `how much’ question usually requires a unit or measure word (e.g. `a vanload of furniture’) but the answer to a `how many’ question would be just a number. Next time you are in a supermarket queue where it says `ten items or less’ you will appreciate that the sign is grammatically incorrect. `Item’ is most definitely a count noun, so the correct form should be `ten items or fewer’.

In the specific case of `data’, it seems clear to me that there are (at least) two distinct uses of this word. One is the use of `data’ to describe an undifferentiated unspecified or unlimited quantity of information such as that stored on a computer disk. Of such stuff you might well ask `how much data do you have?’ and the answer would be in some units (e.g. Gbytes). This clearly identifies it as a mass noun.

But there is another meaning, which is that ascribed to specified pieces of information either given (as per the original Latin) or obtained from a measurement. Such things are precisely defined, enumerable and clearly therefore of count-noun form. Indeed one such entity could reasonably be called a datum and the plural would be data. This usage applies when the context defines the relevant quantum of information so no unit is required. This is the usage that arises in most scientific papers, as opposed to software manuals. `In Figure 1, the data are plotted…’ is correct. Although it sounds clumsy you could well ask in such a situation `how many data do you have?’ (meaning how many measurements do you have) and the answer would just be a number. I don’t find this archaic at all. It seems quite sensible.

To labour the point still further,  here are another two sentences that show the different uses:

“If I had less data my disk would have more free space on it.” (Non-count)

“If I had fewer data I would not be able to obtain an astrometric solution.” (Count).

It is not unusual for the same words (if they’re nouns) to have both count and non-count forms in different contexts. I give the example of `whisky’, as in `my glass is full of whisky’ (non-count) versus `two whiskies, please, barman’ (count).

There are countless other examples (pun intended) of words that can be count nouns or non-count nouns. `Fire’ can be a mass noun (`fire is dangerous’) but also a count noun (`the firemen were fighting three fires simultaneously’). Another nice one is `hair’ which is non-count when it is on someone’s head (`my hair is going grey’) but count when they, in the plural, are being split.

In the context of data science it seems to me that `data’ is almost always used as a non-count noun and can therefore reasonably be treated as singular. In the context of the statement that `nothing is > data’ it would appear that `nothing’ is also of non-count form, but whether this is the case or not, the statement seems to imply that `0>data’, which in turn implies that data is negative.

And there’s another question: what does `>’ mean? Wikipedia says `greater than’, but I think it means `is greater than’, much as `=’ means `equals’ or `is equal to’. So there’s a syntax error in the sticker too…

..or perhaps I might be reading a little too much into this?

Science for the Citizen

Posted in Education, Open Access, The Universe and Stuff on March 20, 2017 by telescoper

I spent all day on Friday on business connected with my role in the Data Innovation Research Institute, attending an event to launch the new Data Justice Lab at Cardiff University. It was a fascinating day of discussions about all kinds of ethical, legal and political issues surrounding the “datafication” of society:

Our financial transactions, communications, movements, relationships, and interactions with government and corporations all increasingly generate data that are used to profile and sort groups and individuals. These processes can affect both individuals as well as entire communities that may be denied services and access to opportunities, or wrongfully targeted and exploited. In short, they impact on our ability to participate in society. The emergence of this data paradigm therefore introduces a particular set of power dynamics requiring investigation and critique.

As a scientist whose research is in an area (cosmology) which is extremely data-intensive, I have a fairly clear interpretation of the phrase “Big Data” and recognize the need for innovative methods to handle the scale and complexity of the data we use. This clarity comes largely from the fact that we are asking very well-defined questions which can be framed in quantitative terms within the framework of well-specified theoretical models. In this case, sophisticated algorithms can be constructed that extract meaningful information even when individual measurements are dominated by noise.

The use of “Big Data” in civic society is much more problematic because the questions being asked are often ill-posed and there is rarely any compelling underlying theory. A naive belief exists in some quarters that harvesting more and more data necessarily leads to an increase in relevant information. Instead there is a danger that algorithms simply encode false assumptions and produce unintended consequences, often with disastrous results for individuals. We heard plenty of examples of this on Friday.

Although it is clearly the case that personal data can be – and indeed is – deliberately used for nefarious purposes, I think there’s a parallel danger that we increasingly tend to believe that just because something is based on numerical calculations it somehow must be “scientific”. In reality, any attempt to extract information from quantitative data relies on assumptions. If those assumptions are wrong, then you get garbage out no matter what you put in. Some applications of “data science” – those that don’t recognize these limitations – are in fact extremely unscientific.

I mentioned in discussions on Friday that there is a considerable push in astrophysics and cosmology for open science, by which I mean that not only are the published results openly accessible, but all the data and analysis algorithms are published too. Not all branches of science work this way, and we’re very far indeed from a society that applies such standards to the use of personal data.

Anyway, after the day’s discussion we adjourned to the School of Journalism, Media and Cultural Studies for a set of more formal presentations. The Head of School, Professor Stuart Allan, introduced this session with some quotes from a book called Science for the Citizen, written by Lancelot Hogben in 1938. I haven’t read the book, but it looks fascinating and prescient. I have just ordered it and look forward to reading it. You can get the full text free online here.

Here is the first paragraph of Chapter 1:

A MUCH abused writer of the nineteenth century said: up to the present philosophers have only interpreted the world, it is also necessary to change it. No statement more fittingly distinguishes the standpoint of humanistic philosophy from the scientific outlook. Science is organized workmanship. Its history is co-extensive with that of civilized living. It emerges so soon as the secret lore of the craftsman overflows the dam of oral tradition, demanding a permanent record of its own. It expands as the record becomes accessible to a widening personnel, gathering into itself and coordinating the fruits of new crafts. It languishes when the social incentive to new productive accomplishment is lacking, and when its custodians lose the will to share it with others. Its history, which is the history of the constructive achievements of mankind, is also the history of the democratization of positive knowledge. This book is written to tell the story of its growth as a record of human achievement, a story of the satisfaction of the common needs of mankind, disclosing as it unfolds new horizons of human wellbeing which lie before us, if we plan our new resources intelligently.

The phrase that struck me with particular force is “the democratization of positive knowledge”. That is what I believe science should do, but the closed culture of many fields of modern science makes it difficult to argue that’s what it actually does. Instead, there is an increasing tendency for scientific knowledge in many domains to be concentrated in a small number of people with access to the literature and the expertise needed to make sense of it.

In an increasingly technologically-driven society, the gap between the few in and the many out of the know poses a grave threat to our existence as an open and inclusive democracy. The public needs to be better informed about science (as well as a great many other things). Two areas need attention.

In fields such as my own there’s a widespread culture of working very hard at outreach. This overarching term includes trying to get people interested in science and encouraging more kids to take it seriously at school and college, but also engaging directly with members of the public and institutions that represent them. Not all scientists take the same attitude, though, and we must try harder. Moves are being made to give more recognition to public engagement, but a drastic improvement is necessary if our aim is to make our society genuinely democratic.

But the biggest issue we have to confront is education. The quality of science education must improve, especially in state schools where pupils sometimes don’t have appropriately qualified teachers and so are unable to learn, e.g. physics, properly. The less wealthy are becoming systematically disenfranchised through their lack of access to the education they need to understand the complex issues relating to life in an advanced technological society.

If we improve school education, we may well get more graduates in STEM areas too although this government’s cuts to Higher Education make that unlikely. More science graduates would be good for many reasons, but I don’t think the greatest problem facing the UK is the lack of qualified scientists – it’s that too few ordinary citizens have even a vague understanding of what science is and how it works. They are therefore unable to participate in an informed way in discussions of some of the most important issues facing us in the 21st century.

We can’t expect everyone to be a science expert, but we do need higher levels of basic scientific literacy throughout our society. Unless this happens we will be increasingly vulnerable to manipulation by the dark forces of global capitalism via the media they control. You can see it happening already.

Magnets, Data Science and the Intelligent Pig

Posted in Biographical, The Universe and Stuff on November 18, 2016 by telescoper

The other day I was talking to some colleagues in the pub (as one does). At one point the subject of conversation turned to the pressure we academics are under these days to collaborate more with the world of industry and commerce. That’s one of the things that the Cardiff University Data Innovation Research Institute – which currently pays half my wages  – is supposed to do, but there was general consternation when I mentioned that I have in the past spent quite a long time working in industry. I am, after all, Professor of Theoretical Astrophysics. Of what possible interest could that be to industry?

My time in industry was spent at one of the research stations of British Gas, called the On-Line Inspection Centre (“OLIC”) which was situated in Cramlington, Northumberland. I started work there in 1981, just after I’d finished my A-levels and the Cambridge Entrance Examination and I worked there for about 9 months, before leaving to start my undergraduate course in 1982. At that time British Gas was still state-owned, and one of the consequences of that was that I had to sign the Official Secrets Act when I joined the staff. Among other things that forbade me from making “unauthorized disclosures” of what I was working on for thirty years. I feel comfortable discussing that work now, partly because the thirty years passed some time ago and partly because OLIC no longer exists. I’m not sure exactly what happened to it, but I presume it got flogged off on the cheap when British Gas was privatized during the Thatcher regime.

The main activity of the On-Line Inspection Centre was developing and exploiting techniques for inspecting gas pipelines for various forms of faults. The UK’s gas transmission network comprises thousands of kilometres of pipelines, made from steel in sections joined together by seam welds. I always thought of it as like a road network: the motorways were made of 36″ diameter pipes; the A-roads were of smaller, 24″, diameter; and the minor roads were generally made of 12″ pipes. It’s interesting that despite the many failings of my memory now that I’ve reached middle age, I can still remember the names of some of the routes: “Huddersfield to Hopton Top” and “Seabank to Frampton Cotterell” spring immediately to mind.

Anyway, as part of the Mathematics Group at OLIC my job was to work on algorithms to analyse data from various magnetic inspection vehicles. These vehicles – known as “pigs” – were of different sizes to fit snugly in the various pipes. The term “pig” had originally been applied to simple devices used to clean the gunk from the inside of a pipe. They were just put in one end of the line and gas pressure would push them all the way to the other end, often tens of kilometres away. The pipeline could thus be cleaned without taking it out of service.

This basic idea was modified to produce the much more sophisticated “intelligent pig” which produced the data I worked on. You can read much more about this here. This looked very similar to the cleaning pig, but had a complicated assembly of magnets and sensors, shown schematically here:


The two sets of magnets are connected to the pipe wall by steel brushes to maintain good contact. The magnetic field applied by the front set of magnets is contained within the pipe wall and forms a kind of circuit with the rear set as shown, unless there is a variation in the thickness of the material. In that case magnetic flux leaks out and is detected by the sensors. The magnets and sensors are deployed in rings to cover the whole circumference of the pipe. A 24″ diameter pig would have 240 sensors, each recorded as a separate channel on the vehicle.

The actual system is fairly complicated so some of the work was experimental. Sections of pipe were made with defects of various sizes machined into them. The pig would then be pulled through these sections and the signals studied to build up an understanding of how the magnetic field would respond in different situations.

The actual pig (which could be several metres long and weigh a couple of tonnes) looks like this:


I always thought they looked a bit like spacecraft.

The pig usually travels at something like walking pace along the pipeline, and the sampling rate of the sensors was such that a reading would be taken every few millimetres. That sampling rate was necessary because corrosion pits as small as 1 cm across could be dangerous. The larger vehicles had “on-board thresholding” so that recordings of quiescent sections were discarded. Even so pipe surfaces (especially those of smaller bore) could be uneven for various reasons to do with their production rather than the effects of corrosion. Moreover, every few metres there would be a circumferential seam weld where two sections of pipe were joined together; these features would produce a large signal on all channels which the thresholding algorithm did not suppress. The net result was that a lot of data had to be stored on the vehicle. When I say “a lot”, I mean for that time. A full run might produce about 5 × 10⁷ readings. That seems like nothing now, but it was “Big Data” in those days!
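As an illustration only, the idea of on-board thresholding can be sketched in a few lines of Python. Everything specific here is an assumption made for the sketch: the function name, the layout of the readings (one list of channel values per sample) and the rule that a sample is kept if any single channel exceeds a fixed level. The post does not say how the real vehicles decided what to discard.

```python
def threshold_recording(readings, threshold):
    """Keep only the samples in which at least one channel is active.

    readings:  list of samples, each a list with one value per sensor channel
               (a hypothetical layout chosen for this sketch).
    threshold: absolute signal level above which a channel counts as active
               (an assumed, fixed value).

    Returns a list of (sample_index, sample) pairs so that the position of
    each retained sample along the run is not lost.
    """
    kept = []
    for i, sample in enumerate(readings):
        if any(abs(v) > threshold for v in sample):
            kept.append((i, sample))
    return kept


if __name__ == "__main__":
    run = [[0.1, 0.0], [0.2, 3.0], [0.0, 0.1], [4.0, 4.0]]
    for index, sample in threshold_recording(run, threshold=1.0):
        print(index, sample)
```

Note that a rule of this kind retains every seam weld, since a weld lights up all channels at once, which is consistent with the remark that the thresholding did not suppress those features.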

So how was all this data processed back at the station? You probably won’t believe this, but it was printed out on Versatec printers in the form of a chart recording for each channel. Operators then identified funny-looking signals by eye and we then pulled down the data from tape and had a further look, usually comparing the patterns visually with those obtained from “pull-through” experiments.

Among the things I worked on was an algorithm to recognize seam weld signals automatically. That was quite easy actually – because it just requires looking for simultaneous activity on all channels – although it had to be made robust enough to deal with the odd dead channel and other instrumental glitches. This algorithm proved to be useful because sometimes the on-board telemetry would go wrong and we had to locate the pig by counting the number of welds it had passed since the start of the run.
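A minimal sketch of that idea, in Python rather than the Fortran-77 of the time, might look like the following. It is not the OLIC code: the function name, the fixed threshold and the `min_fraction` tolerance are all assumptions introduced here to show how "simultaneous activity on all channels" can be tested while putting up with the odd dead channel.

```python
def detect_welds(readings, threshold, min_fraction=0.9, dead_channels=()):
    """Return the sample indices at which a weld-like event begins.

    readings:      list of samples, each a list with one value per channel
                   (hypothetical layout for this sketch).
    threshold:     absolute level above which a channel counts as active.
    min_fraction:  fraction of live channels that must be active at once;
                   values below 1.0 tolerate a few unresponsive sensors.
    dead_channels: indices of channels known to be dead, which are ignored.
    """
    dead = set(dead_channels)
    welds = []
    in_weld = False
    for i, sample in enumerate(readings):
        live = [v for ch, v in enumerate(sample) if ch not in dead]
        active = sum(1 for v in live if abs(v) > threshold)
        is_weld = bool(live) and active >= min_fraction * len(live)
        if is_weld and not in_weld:
            welds.append(i)  # record only the first sample of each event
        in_weld = is_weld
    return welds


if __name__ == "__main__":
    readings = [[0.0, 0.0, 0.0, 0.0] for _ in range(10)]
    readings[3] = [5.0, 5.0, 0.0, 5.0]  # weld, with channel 2 dead
    readings[4] = [5.0, 5.0, 0.0, 5.0]  # same weld, next sample
    readings[7] = [5.0, 0.0, 0.0, 0.0]  # isolated defect on one channel
    print(detect_welds(readings, threshold=1.0, dead_channels=(2,)))
```

The length of the returned list is the number of welds passed since the start of the run, which is the quantity needed to locate the vehicle when the telemetry fails.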

A far more difficult challenge was dealing with data from 12″ diameter pipe. These are manufactured in a way that’s completely different from that used to make pipes of larger diameter, which are made of rolled steel. The 12″ pipes were made from a solid plug of molten steel, the centre of which is bored out by a device that rotates as it goes along. The effect of this is that it imposes a peculiar form of variation on the pipe wall, in the form of a spirally modulated “noise”. Annoyingly, the pitch and amplitude of the spiral varied from one section of pipe to another. After many failed attempts, the group finally came up with an algorithm that used the weld detector as a starting point to establish the vehicle had entered a new section of pipe. It then used data from the start of each section to estimate the parameters of the spiral pattern for that section, and then applied a filter to remove it from the rest of the section. It wasn’t particularly elegant, but it certainly cleaned up the data massively and made it much easier to spot significant features.
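The estimate-then-subtract structure of that approach can be illustrated with a deliberately simplified sketch. The assumptions are considerable and are mine, not the group's: the spiral modulation is treated as a single sinusoid on one channel, its period is found by trying a list of candidate values, and the "filter" is just subtraction of the fitted sinusoid. The real pattern ran across all channels and the actual method is not described in enough detail to reproduce.

```python
import math


def estimate_sinusoid(samples, candidate_periods):
    """Fit a*cos(w*i) + b*sin(w*i) to samples and return (period, a, b).

    Each candidate period (in samples) is tried in turn and the one whose
    fitted sinusoid carries the most power is kept. The projection used is
    exact only when the fitting window spans a whole number of periods and
    the samples have zero mean (both assumed in this sketch).
    """
    n = len(samples)
    best = None
    for period in candidate_periods:
        w = 2.0 * math.pi / period
        a = 2.0 / n * sum(x * math.cos(w * i) for i, x in enumerate(samples))
        b = 2.0 / n * sum(x * math.sin(w * i) for i, x in enumerate(samples))
        power = a * a + b * b
        if best is None or power > best[0]:
            best = (power, period, a, b)
    return best[1], best[2], best[3]


def remove_spiral(section, n_fit, candidate_periods):
    """Estimate the modulation from the first n_fit samples of one pipe
    section, then subtract it from the whole section.

    Returns (cleaned_samples, estimated_period).
    """
    period, a, b = estimate_sinusoid(section[:n_fit], candidate_periods)
    w = 2.0 * math.pi / period
    cleaned = [x - (a * math.cos(w * i) + b * math.sin(w * i))
               for i, x in enumerate(section)]
    return cleaned, period


if __name__ == "__main__":
    # Synthetic section: spiral "noise" of period 20 plus one defect spike.
    section = [0.5 * math.cos(2 * math.pi * i / 20 + 0.3) for i in range(300)]
    section[250] += 2.0
    cleaned, period = remove_spiral(section, n_fit=100,
                                    candidate_periods=range(10, 41))
    print(period, round(cleaned[250], 3))
```

Because the parameters are re-estimated for every section, a change of pitch or amplitude from one section of pipe to the next does not carry over, which is the point of using the welds to mark where each section starts.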

You might ask why I’ve written at such length about this when it’s got nothing to do with my current research (or indeed, anything else I’ve done since I graduated from Cambridge in 1985). One reason is that, although I didn’t know it at the time, my time at OLIC was going to prepare me very well for when I started my PhD. That was the case because all the programming I did used VAX computers, which turned out to be the computers used by STARLINK. When I started my life as a research student I was already fluent in the command language (DCL) as well as the database software DATATRIEVE, which was a great advantage. Another reason is that working in this environment I had to learn to make my code (which, incidentally, was all in Fortran-77) conform to various very strict standards. I didn’t like some of the things we were forced to do, but I was shouted at sufficiently often that I gave up and did what I was told. I have never been particularly good at doing that in general, but in the context of software it is a lesson I’m glad I learned. Above all, though, I think working outside academia gave me a different perspective on research. As academics we are very lucky to be able – at least some of the time – to choose our own research problems, but I believe that in the long run it can be very good for your intellectual development to do something completely different every now and then.

We’re currently discussing a scheme whereby Physics and Astrophysics research students can interrupt their PhD for up to 6 months to undertake a (paid) work placement outside academia. I suspect many graduate students will not be keen on this, as they’ll see it as a distraction from their PhD topic, but I think it has many potential advantages as I hope I’ve explained.



That Was The Data Innovation Day That Was

Posted in Uncategorized on November 7, 2016 by telescoper

Time, methinks, for a quick work-related post. You may know that my current appointment is in association with Cardiff University’s Data Innovation Research Institute, and it’s that part of my job that is taking up most of my time at the moment. Last Friday (4th November) we had our first Data Innovation Day, the aim of which was to encourage collaboration between Schools and Research Institutes in the area of Data Science.

To this end, on Friday morning we had a dozen short(ish) talks on data science aspects of all kinds of subjects, from neuroimaging to gravitational wave research to healthcare to biosocial computing to statistical modelling and so on and so forth. It was a fascinating mixture of presentations and about 75 people attended, which was a pretty good audience. After lunch we broke into groups to develop specific research projects and establish what the Data Innovation Institute can do to help foster collaborations across disciplinary and administrative boundaries. That’s much harder than it might sound, and is certainly harder than it should be in modern universities. We had no shortage of ideas, and let’s hope we can turn them into concrete projects.

Anyway, one of my contributions to the day was to set up a Twitter account for the Data Innovation Research Institute together with a logo:


We currently have a princely 37 followers. Feel free to follow if you’re on Twitter and interested in Data Science!