Should Open Access Include Open Software?

Very busy today, so just time for a quick post (and associated poll) about Open Science.

As you all know I’ve been using this blog for a while to bang on about Open Access to scientific publications. I’m not going to repeat my position in detail here except to say that I’m in favour of Open Access but not at the immense cost envisaged by the Finch Report.

I thought, however, that it might be useful to float some opinions about wider issues related to open science. In particular, the question that often troubles me is whether open access to scientific results is actually enough, or whether we have to go a lot further.

I think an important aspect of the way science works is that when a given individual or group publishes a result, it should be possible for others to reproduce it (or not as the case may be). Traditional journal publications don’t always allow this. In my own field of astrophysics/cosmology, for example, results in scientific papers are often based on very complicated analyses of large data sets. This is increasingly the case in other fields too. A basic problem obviously arises when data are not made public. Fortunately in astrophysics these days researchers are pretty good at sharing their data, although this hasn’t always been the case.

However, even allowing open access to data doesn’t always solve the reproducibility problem. Often extensive numerical codes are needed to process the measurements and extract meaningful output. Without access to these pipeline codes, a third party cannot check the path from input to output except by writing their own version, assuming that there is sufficient information to do that in the first place. That researchers should publish their software as well as their results is quite a controversial suggestion, but I think it’s the best practice for science. There isn’t a uniform policy in astrophysics and cosmology, but I sense that quite a few people out there agree with me. Cosmological numerical simulations, for example, can be performed by anyone with a sufficiently big computer using GADGET, the source code of which is freely available. Likewise, for CMB analysis, there is the excellent CAMB code, which can be downloaded at will; this is in a long tradition of openly available numerical codes, including CMBFAST and HEALPix.

I suspect some researchers might be reluctant to share the codes they have written because they feel they won’t get sufficient credit for work done using them. I don’t think this is true, as researchers are generally very appreciative of such openness and publications describing the corresponding codes are generously cited. In any case I don’t think it’s appropriate to withhold such programs from the wider community, as doing so prevents them from being scrutinized, extended, or used to further scientific research. In other words, excessively proprietorial attitudes to data analysis software are detrimental to the spirit of open science.

Anyway, my views aren’t guaranteed to be representative of the community, so I’d like to ask for a quick show of hands via a poll…

…and you are of course welcome to comment via the usual box.


19 Responses to “Should Open Access Include Open Software?”

  1. Publishing code is a good idea. But don’t underestimate the work required to publish code… I bet a number of researchers would be unhappy to publish their code due to various crimes against Good Practice, such as
    a) lack of comments/documentation
    b) profanities in variable names/comments
    c) ugly, ugly hacks
    d) special cases
    e) source code no longer available, “but the executable is still fine”
    f) no error checking
    g) comments like “FIX ME” or “DON’T REMOVE” or “shouldn’t work, but does”

    • telescoper Says:

      I suspect this is very true.

    • Adrian Burd Says:

      These have all been raised when I have tried to push for code publication in my field of research. Many large scale codes in oceanography and climate are freely available and have been for many years. However, many “in-house” smaller but still substantial codes are not.

      Another frequently heard objection is that by making code freely available, some users expect support, and scientists are not in the business nor are they necessarily equipped to provide such support.

      My own feeling is that, at least in earth sciences, there needs to be a shift in administration towards giving credit for this type of activity. As Will says, it takes considerable time and effort to get a code (and its associated documentation) into reasonable shape to be released. That time and effort is not always recognized (by either administrators or one’s colleagues).

  2. Several issues here.

    First, I think that data used to make plots should be made available, to aid in comparison of results. See, for example, http://www.astro.multivax.de:8000/ceres/data_from_papers/papers.html . It goes without saying that the input data should be publicly available.

    Second, it is good to make reasonably clean codes (in style and documentation; of course in execution all should be clean) available. Perlmutter et al. used (and cited) some of my Fortran code in their Nobel-Prize-winning work (but nevertheless didn’t invite me to the festivities, even though I once showed Saul where the loo was in Manchester). ASCL is a good place to start: http://asterisk.apod.com/viewforum.php?f=35 . Of course some people who spend years developing code want to use it themselves for a few papers before making it available; I think this is understandable and perhaps even good (see below). Yes, getting citations for the paper describing the code is good but actual papers are better. (Of course, these pressures could change with a different employment system in astrophysics.)

    I think that any paper describing results generated by code should describe the algorithm in enough detail so that a reader with programming knowledge could reproduce it not just in theory but also in practice. However, it might be a bad thing to make all codes publicly available. Once a code is out there, there is less motivation to write a new code for the same job. This means that it is more difficult to find bugs in the original code. Writing another code from scratch and trying to reproduce someone’s results usually finds some bugs—maybe in your code, maybe in the other guy’s. Please read this very short paper and especially the acknowledgements: http://adsabs.harvard.edu/full/1996A%26A…313.1028K . (This raises the question of how to count errata on a publication list; should each erratum decrease the number of papers?) Another reason is that codes are often rewritten to add new features, but at some point a fresh start is a good idea (like CAMB in Fortran 90 as opposed to CMBFAST in Fortran 77, although not just new programming languages but also new algorithms, programming style etc can be used to advantage in new code). Leichter’s First Law of Computing states: If you don’t know how to do it, you don’t know how to do it on a computer. Helbig’s corollary states: If you know how to do it on a computer, you know how to do it. Thus, writing a code which produces correct results is a good tool to understand something, and if all code is quickly available, there will be fewer people who learn the ropes this way. There is no point in re-inventing the wheel, but for new stuff it is good to have agreement between codes. At some point, of course, it is a good idea to make them available, but requiring this at publication time might do more harm than good. (Also, of course, many codes are written quickly and would be of little use to other people in this form, though the author would have no problem with them).

    • Due to the strange ADS format, the link isn’t set properly. Here it is.

      • Adrian Burd Says:

        There was an interesting case in the biosciences a few years ago in which several high-level papers (5 or 6 if memory serves) in Science and PNAS had to be retracted after a bug was found in data-analysis code that had been developed in-house and was never questioned. The code sorted out protein structures and the bug flipped a sign, resulting in the wrong structures being given to some pretty important molecules.

  3. In general I’m in favour of free code, but it’s often not free to make it in a releasable state, as Will suggests. What works well enough for you and your special case may not be general enough for anyone else.

    You then open yourself up to people grabbing the code and then wanting to be hand held when it doesn’t work.

    It can also get out of control. With Gadget in particular it is hard to pin down a consistent version: it’s usually Gadget-2 + X’s routines to do Y, and Z’s changes to do A and my own changes to …

    Long-term storage is also an issue: a URL published in a paper may quickly go out of date.

  4. From the perspective of someone now on the outside of academia (but still doing science) I’m against this being a requirement.

    While I’d love all software to be as open as possible, a lot of companies enjoy the ability to publish papers that involve data pipelines, without having to publish the code for the pipeline itself.

    Such code can often be deeply integrated into the internally developed software, rely on external licences, include valuable analysis techniques, and so on. The net result would simply be that companies would stop publishing anything – and that might restrict academics who work with industry. I’d hate to see those sorts of ties severed, even if you’d not mourn the loss of papers coming out of industry (and I would).

    • telescoper Says:

      I was really talking about academic research – I can understand companies wanting to keep commercial software suites private.

  5. ‘Always’, no. If I’ve put a lot of effort into writing a code that lets me interpret data that couldn’t previously be interpreted properly (a situation I’ve been in several times), that is effort I’ve spent getting a tool that nobody else has. I want a reward for that effort other than the warm fuzzies of knowing that other people have used the code; i.e. I want to either write papers with it or collaborate with others who will do so, while I still have that edge. That should be my decision to make.

    Of course some people write code as a public service, like Phillip’s; other people want their code to become the gold standard for analysis in a particular area, which means it’s really got to be public; there are plenty of situations where publishing source code makes sense. But it should be up to the author to decide.

    • telescoper Says:

      Fair enough, I suppose, but I’d say that this exposes a flaw in the much-vaunted system of peer review. How can a referee actually decide whether your analysis is correct if you keep your code secret?

      • They can reasonably ask for a full description of what the code does, such that they could in principle replicate it if they want to; they can compare the results to what’s already out there to see if they make sense; and they can ask me to run more tests and put the results in the paper. In other words, they can do all the same things that you would do if you were a referee in an experimental discipline given a paper about a complicated experimental setup that you can’t easily replicate.

        (Of course, it’s not the referee’s job to decide whether the analysis is correct, anyway, but I’ll take that as shorthand for what the referee’s job actually is…)

      • I wouldn’t expect a referee to check the code. If something is obviously wrong, he can point that out without checking the code. If not, and the code is trivial, he can check it by writing his own code from the description of the algorithm (which, of course, should be in the paper). If the code is non-trivial and/or the CPU time is huge, then I don’t think that anyone expects a referee to check the calculations, neither through examining the code nor checking results against his own code.

        I worked for a while in climate research, back in the early 1990s. Disks were expensive then and I was told that one is moved into a higher category if one has more than 100 MB. I thought that was an unreasonably small amount, even 20 years ago, to qualify for needing more space. Until I realized that they were talking about 100 MB of source code. (Similarly, I remember thinking that 6 hours on a Cray wasn’t that much CPU time until I realized that was just for the compile.)

        I do remember a referee pointing out a typo (sign error) in Eq. 33 or whatever (not present in the corresponding code). This is probably more than what most referees check. (I remember thinking that it might be a good idea to deliberately introduce a typo in one of the higher-numbered equations just to see if the referee bothers to check all the equations.)

  6. I would say this is a no. People who make their code public should be prepared to update and support it, which is a huge effort (e.g. Cloudy). The argument that it allows the code to be checked does not fly. A public code with faults can be hugely more damaging than a private code with faults. A much better way to deal with code errors is to have a set of template problems for which code outputs can be compared.

    I am much more concerned about the move to commercial software. Starlink, Midas, etc are being replaced with IDL for which I need to pay. Is this money well spent?

    • I basically agree. However, I think there is a case for making codes public and at the same time explicitly saying that there is no support. This might make sense for people who have invested considerable work in codes then leave the field. This was one of the motivations for ASCL. A code there can and should have a link to the maintenance web page, if there is one, but even non-maintained codes should be deposited there in case someone wants them for something. Obviously, if one is still in the field, then making code public but not supporting it is probably not a good idea, but that’s not necessarily the case for someone who has left the field.

    • telescoper Says:

      That seems to me to be a non sequitur.

      I don’t think any code would have to be updated (although it would be helpful if it were). The point is to make the code used for a particular analysis available so that others can see it and/or use it.

      • Right, this is a different point, and as I mentioned is interesting (at least) for code written by people no longer active, since if it isn’t deposited somewhere one might never be able to get it. If someone leaves the field, there is also no issue with competition etc.

        Many codes, however, are maintained, CAMB, Gadget etc for example. I think Albert is referring to a code used by someone still active in the field, but not in a position to properly document it, fix bugs in public versions etc.

  7. […] Should Open Access Include Open Software? (telescoper.wordpress.com) […]

  8. […] The software can be downloaded here. It looks a very useful package that includes code to calculate many of the bits and pieces used by cosmologists working on the theory of large-scale structure and galaxy evolution. It is also, I hope, an example of a trend towards greater use of open-source software, for which I congratulate the author! I think this is an important part of the campaign to create truly open science, as I blogged about here. […]
