A Python Toolkit for Cosmology

The programming language Python has established itself as the industry standard for researchers in physics and astronomy (as well as in many other fields, including most of those covered by the Data Innovation Research Institute, which employs me part-time). It has also become the standard vehicle for teaching coding skills to undergraduates in many disciplines. In fact it looks like the first module I will be teaching in Maynooth next term is in Computational Physics, and that will be delivered using Python too. It’s been a while since I last did any significant hands-on programming, so this will provide me with a good refresher. The best way to learn something well is to have to teach it to others!

But I digress. This morning I noticed a paper by Benedikt Diemer on the arXiv with the title COLOSSUS: A python toolkit for cosmology, large-scale structure, and dark matter halos. Here is the abstract:

This paper introduces Colossus, a public, open-source python package for calculations related to cosmology, the large-scale structure of matter in the universe, and the properties of dark matter halos. The code is designed to be fast and easy to use, with a coherent, well-documented user interface. The cosmology module implements FLRW cosmologies including curvature, relativistic species, and different dark energy equations of state, and provides fast computations of the linear matter power spectrum, variance, and correlation function. The large-scale structure module is concerned with the properties of peaks in Gaussian random fields and halos in a statistical sense, including their peak height, peak curvature, halo bias, and mass function. The halo module deals with spherical overdensity radii and masses, density profiles, concentration, and the splashback radius. To facilitate the rapid exploration of these quantities, Colossus implements about 40 different fitting functions from the literature. I discuss the core routines in detail, with a particular emphasis on their accuracy. Colossus is available at bitbucket.org/bdiemer/colossus.

The software can be downloaded here. It looks a very useful package that includes code to calculate many of the bits and pieces used by cosmologists working on the theory of large-scale structure and galaxy evolution. It is also, I hope, an example of a trend towards greater use of open-source software, for which I congratulate the author! I think this is an important part of the campaign to create truly open science, as I blogged about here.
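
To give a flavour of the sort of quantity the cosmology module described in the abstract is concerned with, here is a minimal sketch in plain Python of the dimensionless Hubble rate and the comoving distance in an FLRW model with curvature, radiation and a cosmological constant. This is not Colossus code and does not use its interface: the function names, the default parameter values (a matter density of 0.3 and a Hubble constant of 70 km/s/Mpc) and the choice of Simpson's rule are all my own illustrative assumptions.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def hubble_ratio(z, omega_m=0.3, omega_r=0.0, omega_k=0.0):
    """Return E(z) = H(z)/H0 for an FLRW model with a cosmological constant.

    The dark energy density is fixed by requiring the density
    parameters (including curvature) to sum to one at z = 0.
    """
    omega_de = 1.0 - omega_m - omega_r - omega_k
    zp1 = 1.0 + z
    return math.sqrt(
        omega_r * zp1**4 + omega_m * zp1**3 + omega_k * zp1**2 + omega_de
    )


def comoving_distance(z, h0=70.0, n_steps=1000, **omegas):
    """Line-of-sight comoving distance in Mpc, integrating c/H(z) with Simpson's rule.

    h0 is in km/s/Mpc; extra keyword arguments are passed to hubble_ratio.
    """
    if n_steps % 2:  # Simpson's rule needs an even number of intervals
        n_steps += 1
    dz = z / n_steps
    total = 1.0 / hubble_ratio(0.0, **omegas) + 1.0 / hubble_ratio(z, **omegas)
    for i in range(1, n_steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight / hubble_ratio(i * dz, **omegas)
    return (C_KM_S / h0) * total * dz / 3.0


if __name__ == "__main__":
    # Illustrative parameters only, not a fit to any data set.
    print(comoving_distance(1.0, h0=70.0, omega_m=0.3))
```

A useful sanity check is the Einstein-de Sitter case (matter density equal to one), for which the integral can be done by hand: the comoving distance is (2c/H0)[1 - (1+z)^(-1/2)], so at z = 3 it is exactly c/H0. A dedicated package would of course use faster and more carefully validated routines than this.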

An important aspect of the way science works is that when a given individual or group publishes a result, it should be possible for others to reproduce it (or not, as the case may be). At present, this can’t always be done. In my own field of astrophysics/cosmology, for example, results in traditional scientific papers are often based on very complicated analyses of large data sets. This is increasingly the case in other fields too. A basic problem obviously arises when data are not made public. Fortunately in astrophysics these days researchers are pretty good at sharing their data, although this hasn’t always been the case.

However, even allowing open access to data doesn’t always solve the reproducibility problem. Often extensive numerical codes are needed to process the measurements and extract meaningful output. Without access to these pipeline codes it is impossible for a third party to check the path from input to output without writing their own version, assuming that there is sufficient information to do that in the first place. That researchers should publish their software as well as their results is quite a controversial suggestion, but I think it’s the best practice for science. There isn’t a uniform policy in astrophysics and cosmology, but I sense that quite a few people out there agree with me. Cosmological numerical simulations, for example, can be performed by anyone with a sufficiently big computer using GADGET, the source code of which is freely available. Likewise, for CMB analysis, there is the excellent CAMB code, which can be downloaded at will; this is in a long tradition of openly available numerical codes, including CMBFAST and HEALPix.

I suspect some researchers might be reluctant to share the codes they have written because they feel they won’t get sufficient credit for work done using them. I don’t think this is true, as researchers are generally very appreciative of such openness and publications describing the corresponding codes are generously cited. In any case I don’t think it’s appropriate to withhold such programs from the wider community, which prevents them being either scrutinized or extended as well as being used to further scientific research. In other words excessively proprietorial attitudes to data analysis software are detrimental to the spirit of open science.

Anyway, my views aren’t guaranteed to be representative of the community, so I’d like to ask for a quick show of hands via a poll…

…and you are of course welcome to comment via the usual box.


16 Responses to “A Python Toolkit for Cosmology”

  1. HEALPix is now also available as healpy, which is much easier to work with.
    In addition to CMBFAST and CAMB there is also CLASS.
    Not all Python codes are written purely in Python; many actually use
    C/C++ modules and a wrapper. Python can be slow but
    Cython is faster. There is a lot of freely available cosmology-related
    software on GitHub. It is all free but not bug-free!
    Sometimes it is easier to develop your own code.

  2. I think the other big reason people don’t share their code is that they don’t want to suddenly become *responsible* for maintaining and updating a codebase.

    • telescoper Says:

      Yes, that’s a valid point. It’s questionable whether it is useful to share raw code if you’re not going to maintain it or support it. But if someone is doing that job within a collaboration, they might also do it for open source software.

  3. Anton Garrett Says:

    How about if a researcher simply states *exactly* what algorithm the code is designed to implement, rather than give the code itself?

    • Edd Edmondson Says:

      Compiler bugs, subtleties of implementations, outright bugs… better not to where possible I think.

      I have had a bug in an old version of a language interpreter completely screw target selection when run on one particular system, for example. These things happen.

      Although perhaps having more than one implementation of an algorithm is more likely to show up these issues…

    • There are cases where one can literally write the relevant equation down in a few seconds and spend months developing the corresponding code. 😐

      • Anton Garrett Says:

        I don’t see what that has to do with it.

        As for replication, I’m just finishing throwing together my own version of the LHC in my garden shed from a few spare bits lying around, and I look forward to correcting their results imminently.

      • telescoper Says:

        ‘Little Horticultural Contraption’?

  4. It’s complicated… It may be the case that the “pipeline” codes have been commercially procured, so the organisations (industrial or academic) involved will view them as having a commercial value – especially if there are elements with multiple applications. If future competitions in a similar area are anticipated, then said organisations will (very reasonably) want to keep the codes private to maximise the chances of winning those future competitions. Even if they belong to a completely noncommercial blue-skies academic group, there will be pressures for privacy – how often have we seen “Group X must be involved in this project because they are the only ones with Capability Y – so they need funding”. Cunning software has as much of a value as a new hardware technology – where exploitation is not questioned.

    The only way round this is to eliminate any commercial pressures from everyone involved – and without infinite budgets that won’t happen.

    It’s not a new issue – I remember being unable to afford the NAG library and the later graphical libraries from various academic spinout companies – these codes were not published – and it’s always necessary to consider how deep to dig in the “full visibility” quest; do you need to know how the processor implements a multiplication? I recall a standard operating system-provided random number generator turning out to have structure in some analysis planes.

    That means, I think, that it’s not as simple as the question posed. Some level of commercial value will exist somewhere in any computational analysis and it’s a question of balancing the commercial interests with those of openness.

  5. John Peacock Says:

    This looks a good thing on the whole, although I still feel torn about the impact of such extensive black boxes, especially regarding the education of students. When getting new PhDs up to speed, I often give them some programming exercise, and I’m forever wondering if this is old-fashioned given that they can almost certainly just grab an off-the-shelf public solution for the problem. I justify this by saying that one needs the experience of painful debugging in order (a) to appreciate possible limitations of public black boxes; (b) to be capable of contributing black boxes of your own in due course. But maybe one should say on day 1 “here’s all these tools: slap them together without wondering how they work and make some results”. It’s a way of being ‘professional’ more quickly. To some extent, it’s what we have always done: which of us has ever gone through the code of the FFT they use? But I do see a growing dilemma in PhDs between educating the students and getting stuff done.

    • telescoper Says:

      I know what you mean. When I was doing my PhD research I relied too much on the old NAG libraries which worked OK until I found something that crashed one of the subroutines with a meaningless error message. In the end I had to write my own code, which also crashed (at first) but at least I could figure out why.

    • I think it depends on scale. No-one could or should expect someone doing radio astronomy to write his own stuff rather than using AIPS. Also, I don’t think it’s necessary to write one’s own compiler, device driver, operating system, etc. However, for stuff like numerical hydrodynamic simulations, where there are even conferences dedicated to comparing code, one can’t really judge the results without seeing code, and understanding what is seen is best learned by writing one’s own code. CMB power-spectrum calculations can probably be done with a black box, especially since more than one is available and they agree except for details. Foreground removal is a different matter; one really needs to understand what is going on.

  6. The situation in astrophysics is much better than in particle physics, where people refuse to make any data/codes/intermediate data products public. I wish particle physicists would learn from astrophysicists on this.

  7. Another benefit, beyond reproducible analysis, is the potential for peer-review Quality Assurance of the code itself. I know from experience that scientific code, while checked and tested with the best intentions, is often only seen by the coder. It’s easy to not see the wood for the trees and miss an error.

    Plus, the sharing of code could encourage better coding practice, and the sharing of knowledge, useful short-cuts, etc.

    Having more people look at your code can only be a good thing. Even if you’re a terrible coder, like me!

  8. I said no because if my job is to write code I want to be paid for it. Now if my company/funding agency were to stipulate that I will pay you to write this code and pay you to make it Open Source and pay you to fix bugs and maintain it, that’s a different story.

    As I tell my kids, “I was a socialist until I got my first good job.”
