LIGO and Open Science

I’ve just come from another meeting here at the Niels Bohr Institute between some members of the LIGO Scientific Collaboration and the authors of the `Danish Paper'. As with the other one I attended last week, it was both interesting and informative. I’m not going to divulge any of the details of the discussion, but I anticipate further developments that will put some of them into the public domain fairly soon, and I will comment on them as and when that happens.

I think an important aspect of the way science works is that when a given individual or group publishes a result, it should be possible for others to reproduce it (or not as the case may be). In normal-sized laboratory physics it suffices to explain the experimental set-up in the published paper in sufficient detail for another individual or group to build an equivalent replica experiment if they want to check the results. In `Big Science’, e.g. with LIGO or the Large Hadron Collider, it is not practically possible for other groups to build their own copy, so the best that can be done is to release the data coming from the experiment. A basic problem with reproducibility obviously arises when this does not happen.

In astrophysics and cosmology, results in scientific papers are often based on very complicated analyses of large data sets. This is also the case for gravitational wave experiments. Fortunately in astrophysics these days researchers are generally pretty good at sharing their data, but there are a few exceptions in that field. Particle physicists, by contrast, generally treat all their data as proprietary.

Even allowing open access to data doesn’t always solve the reproducibility problem. Often extensive numerical codes are needed to process the measurements and extract meaningful output. Without access to these pipeline codes it is impossible for a third party to check the path from input to output without writing their own version, assuming that there is sufficient information to do that in the first place. That researchers should publish their software as well as their results is quite a controversial suggestion, but I think it’s the best practice for science. In any case there are often intermediate stages between `raw’ data and scientific results, as well as ancillary data products of various kinds. I think these should all be made public. Doing that could well entail a great deal of effort, but I think in the long run that it is worth it.

I’m not saying that scientific collaborations should not have a proprietary period, just that this period should end when a result is announced, and that any such announcement should be accompanied by a release of the data products and software needed to subject the analysis to independent verification.

Now, if you are interested in trying to reproduce the analysis of data from the first detection of gravitational waves by LIGO, you can go here, where you can not only download the data but also find a helpful tutorial on how to analyse it.
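
For a flavour of what that tutorial involves, here is a minimal sketch of the kind of first step it walks you through: reading a stretch of the released strain data and band-pass filtering it. To be clear, this is my own illustrative version rather than an excerpt from the tutorial, and it assumes a locally downloaded strain file in the standard LOSC HDF5 layout (a `strain/Strain' dataset with an `Xspacing' attribute giving the sample spacing); the file name below is hypothetical.

    # Illustrative sketch only -- not the LOSC tutorial code itself.
    # Assumes a locally downloaded strain file in the standard LOSC HDF5 layout.
    import h5py
    from scipy.signal import butter, filtfilt

    with h5py.File("H-H1_LOSC_4_V2-1126259446-32.hdf5", "r") as f:  # hypothetical file name
        strain = f["strain"]["Strain"][...]           # strain time series
        dt = f["strain"]["Strain"].attrs["Xspacing"]  # sample spacing in seconds

    fs = 1.0 / dt                                     # sampling frequency (4096 Hz for these files)
    b, a = butter(4, [20.0, 300.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, strain)                 # zero-phase band-pass, 20-300 Hz

    # The tutorial also whitens the data using an estimate of the noise power
    # spectral density before looking for the chirp; that step is omitted here.

Even a toy example like this makes the point, though: you can download the strain and plot a filtered time series, but that is a long way from reproducing the published analysis.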

This seems at first sight to be fully in the spirit of open science, but if you visit that page you will find this disclaimer:

[disclaimer from the LOSC page, not reproduced here]

In other words, one can’t check the LIGO data analysis because not all the data and tools necessary to do that are publicly available. I know for a fact that this is the case because of the meetings going on here at NBI!

Given that the detection of gravitational waves is one of the most important breakthroughs ever made in physics, I think this is a matter of considerable regret. I also find it difficult to understand the reasoning that led the LIGO consortium to think it was a good plan only to go part of the way towards open science, by releasing only part of the information needed to reproduce the processing of the LIGO signals and their subsequent statistical analysis. There may be good reasons that I know nothing about, but at the moment it seems to me to represent a wasted opportunity.

I know I’m an extremist when it comes to open science, and there are probably many who disagree with me, so I thought I’d do a mini-poll on this issue:

Any other comments welcome through the box below!

23 Responses to “LIGO and Open Science”

  1. My only exception to releasing both code and data would be to say that it must be released, but only after a short embargo period, to enable the LIGO researchers to have first crack at the data. Once published, however, I agree that you must publish both code and data. (And the only reason I support an embargo period at all is that, otherwise, the scientists who make these large collaborations go – designing hardware and software and maintaining them – don’t get to publish, and consequently end up rather career-limited under the current system.)
    In other ways, I’d go further. You should not only be releasing your code, but also your unit tests and regression tests, particularly for a long-lived project such as LIGO. You want to be sure that the software version you’re running didn’t have bugs in.
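
    To make that concrete, here is a minimal sketch of the sort of regression test I have in mind: freeze the output of some pipeline step on a fixed input, so that a later software upgrade which silently changes the numbers fails the test. All file and function names below are made up for illustration; this is not LIGO code.

        # Illustrative sketch only; all names here are hypothetical.
        import numpy as np
        from scipy.signal import butter, filtfilt

        def bandpass(strain, fs):
            # stand-in for a real pipeline step
            b, a = butter(4, [20.0, 300.0], btype="bandpass", fs=fs)
            return filtfilt(b, a, strain)

        def test_bandpass_regression():
            rng = np.random.default_rng(42)          # fixed seed: identical input every run
            strain = rng.standard_normal(4096)
            out = bandpass(strain, fs=4096.0)
            ref = np.load("reference_output.npy")    # saved once from a trusted version
            assert np.allclose(out, ref, rtol=1e-10)

    Run under pytest, a test like this documents exactly which numerical behaviour a given software version is committed to.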

    • telescoper Says:

      I agree, but as soon as a scientific paper is published the data and tools needed to check the reproducibility of the results therein should be released.

  2. Code needs to be released. The geosciences have struggled with this for a long time, and over time there has been a gradual shift towards releasing code – many codes for models and data analysis are available online at different locations. Codes from my lab are released on Github (when the paper is published), although I have to admit that we don’t always publish unit and regression tests, though we should.

    The biggest problem I’ve found is in teaching students a) to code properly (i.e. not the way I was taught *cough* *cough* *splutter* *splutter* years ago) and b) to appreciate the importance of keeping up with tests. This makes the regular code reviews that I perform a real pain.

  3. Not so simple! One of the classes of signals LIGO is searching intensively for is GW pulsars, continuous signals so weak that they may need coherent integration times of up to a year. (These searches take up most of the LSC’s use of compute cycles.) Proprietary access then suggests LIGO should withhold all its data for 2 years to allow a full search. LIGO has compromised by releasing data around short events (like GW150914). That is good enough for studying the signal and its parameters (the subject of the Copenhagen discussions) but not for also estimating the true significance of the detection, which depends on being able to study the non-Gaussian “glitch” noise of each detector over many months in order to estimate the likely chance occurrence of a non-astronomical ‘event’. Data stretches that long are still proprietary for the pulsar searches, so it is not realistic to expect that full reproducibility can be tested immediately.

    LIGO’s software is open-source, but very complex and regularly upgraded. Enabling outside scientists to use it would require significant investment in extra staff to teach people how. LIGO initially was funded just to build and operate the detectors. Several years ago LIGO requested additional funds from NSF for data release staff because it anticipated these problems, but the request was declined. Given that LIGO is still a work in progress, with difficult technical upgrades still to come (along with concomitant changes in analysis software), and given current budget pressures at NSF, it seems very unlikely to me that funds to pay such people will be supplied by NSF any time soon.

    • telescoper Says:

      I understand what you say about the other sources, but I’m simply advocating releasing data when a discovery is announced.

      Unfortunately, it’s not true that the LIGO data released so far is `good enough' to study the GW150914 signal and its parameters. It’s not a question of the length of data being released: the 4096s record currently available is in principle adequate, but what is available is insufficient to reproduce the analysis presented in the discovery paper. In particular, the full template library is not available. Even the `raw' data record available on the LIGO website has been pre-cleaned and is not absolutely raw.

      • You *could* do the parameter estimation with the data that was released, but to reproduce the detection confidence results requires several weeks of data around that time. Releasing those data would not have impacted the pulsar searches (it’s a few weeks of data, rather than years’ worth), but Bernard’s point about staff and funding still holds. A grumpier summary would be: it’s easy to advocate open data when it’s someone else’s data!

      • telescoper Says:

        We haven’t looked at parameter estimation at all, as we don’t have access to the library of templates. I’d be quite interested in seeing what the full posterior probability regions look like in (M1, M2, S1, S2) space, as there must be a degeneracy between mass and spin….
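
        (For anyone following along, the reason a strong correlation is expected is standard textbook material rather than anything specific to the LIGO pipeline: to leading order the inspiral waveform pins down the chirp mass

            \mathcal{M} = \frac{(M_1 M_2)^{3/5}}{(M_1 + M_2)^{1/5}}

        far more precisely than any other combination of the parameters, so the posterior is narrow along \mathcal{M} but extended along the mass ratio, which in turn is partially degenerate with the spins S1 and S2. It’s exactly that structure I’d like to see mapped out in full.)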

      • The codes to do the parameter estimation, and produce the theoretical waveforms, are all publicly available. The posteriors for the masses and spins are in the LIGO papers, of course — there is a well-known degeneracy between mass-ratio and spin (most visible in GW151226) and also between the two spins. The summary of all of the O1 results is a good place to look, arXiv:1606.04856.

      • telescoper Says:

        It seems to me that the bigger problem with parameter estimation is that you need to understand the noise and systematics very well indeed to construct a realistic likelihood function.
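
        To spell that out a little, the standard likelihood used in gravitational-wave parameter estimation (a textbook expression, quoted here under the idealised assumption of stationary Gaussian noise rather than as a description of LIGO’s actual implementation) is built entirely from a noise-weighted inner product between the data d and a template h(\theta):

            \ln\mathcal{L}(d \mid \theta) = -\tfrac{1}{2}\,\langle d - h(\theta) \,|\, d - h(\theta) \rangle, \qquad \langle a \,|\, b \rangle = 4\,\mathrm{Re}\int_0^\infty \frac{\tilde{a}(f)\,\tilde{b}^{*}(f)}{S_n(f)}\,\mathrm{d}f.

        The one-sided noise power spectral density S_n(f) enters directly, and non-stationarity or non-Gaussian glitches mean this form is only ever an approximation, which is why detailed knowledge of the instrumental noise matters so much for realistic error bars.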

  4. Tommy Burch Says:

    on the subject of open science, there are some groups in my former field (lattice field theory) which do not share their numerical data or codes. there is even one which is so secretive that members of the collaboration are required to sign non-disclosure agreements before having access to the data/code (and they have a few Nature pubs from this stuff). this is not only tragic, but also rather ironic, seeing as this is the field from which the originator of the arXiv (ginsparg) comes.

    • telescoper Says:

      Interesting contrast with cosmology, in which the top numerical simulation codes are publicly available.

    • The results of lattice field theory can be checked in many other ways, most importantly against experiments. So keeping the procedures by which a result was derived secret does not hurt the reviewing process.

      In LIGO’s case, the claim is that no other eye or ear can detect its signal/event. Its conclusion is not supported by any known astrophysical theory of how to produce the 30 x 30 (solar mass) twin black holes. The twin black hole population density implied by its almost monthly discoveries is totally in conflict with all known observational data (dark mass, etc.).

      LIGO is much worse than the BICEP 2 fiasco.

  5. Interesting discussion… Widening it a little: what do people think about the trend for universities generally to put a lot of effort into protecting IP and creating NDAs – and then bidding for contracts (e.g. from the European Space Agency), often in competition with industry?

  6. Shantanu Says:

    There is an interesting comment at https://labcit.ligo.caltech.edu/~weekly/pastreports/weekly2017/weekly20170717.html which says, under Alan Weinstein, “Spend an inordinate amount of time defending the LOSC tutorial against charges that it contributed to the mistakes in Cresswell’s paper. It did not; there is nothing wrong with the tutorial, and if Cresswell had used it, they would have avoided many mistakes.” I don’t know what internal deliberations were going on, but I do hope that the time-series filtering described in the documentation is exactly the same as in the papers.

    • telescoper Says:

      There are actually SIX mistakes in the codes given on the LOSC tutorial page. The only errors I have seen have been in the LIGO analysis, not in that of the NBI group.

      • I think you need to be more careful with your words, Peter. There are no demonstrated analysis errors by the LSC. The diagram that was meant to illustrate the detection was slightly misleading but was not intended as a data product; the numbers provided in the discovery papers and follow-ons for parameter measurements certainly do not contain demonstrated mistakes. They come with error bars (posterior pdfs) as published. The Cresswell et al. analysis seems to employ slightly different values, but still within the quoted uncertainties. If there are errors in the tutorials (I first learned about this suggestion from your blog), those have nothing to do with the analysis, since the tutorial is a simplified version of what was done for the papers. The real version of the analysis is in the refereed papers, as one should expect – including all the dozens of previous ‘methods’ papers over several years that have passed the usual peer review.

      • telescoper Says:

        Bernard

        I think it is very clear that my comment referred to the tutorial page, not to the analysis published in the PRL detection paper (and the other papers). Even if there are errors in the other papers – and I have no reason to believe that there are – it would not be possible to demonstrate them, as the data needed to do so are not publicly available.

        Peter

  7. Shantanu Says:

    OK, I do hope an erratum is provided for each of the 6 mistakes in the tutorials.

  8. […] found. If it is a claimed detection then I hope that LIGO and VIRGO will release sufficient data to enable the analysis to be checked and verified. That’s what most of the respondents to my poll seem to hope […]

  9. […] on the phone some time ago to clarify some points I made in previous blog posts on this issue (e.g. this one). I even ended up being quoted in the […]
