Archive for NASA/ADS

Lognormality Revisited

Posted in Biographical, Science Politics, The Universe and Stuff on January 14, 2015 by telescoper

I was looking up the reference for an old paper of mine on ADS yesterday and was surprised to find that it is still attracting citations. Thinking about the paper reminded me of the fun time I had in Copenhagen while it was being written. I was invited there in 1990 by Bernard Jones, who used to work at the Niels Bohr Institute. I stayed there several weeks over the May/June period, which is the best time of year for Denmark; it’s sufficiently far north (about the same latitude as Aberdeen) that the summer days are very long, and when it’s light until almost midnight it’s very tempting to spend a lot of time out late at night.

As well as being great fun, that little visit also produced what has turned out to be my most-cited paper. In fact the whole project was conceived, the work done, and the paper written up and submitted in the space of a couple of months. I’ve never been very good at grabbing citations – I’m more likely to fall off bandwagons than jump onto them – but this little paper seems to keep getting cited. It hasn’t got that many citations by the standards of some papers, but it has carried on being referred to for almost twenty years, which I’m quite proud of; the citations-per-year statistics even seem to have increased recently. The model we proposed turned out to be extremely useful in a range of situations, which I suppose accounts for the citation longevity:


I don’t think this is my best paper, but it’s definitely the one I had most fun working on. I remember we had the idea of doing something with lognormal distributions over coffee one day,  and just a few weeks later the paper was  finished. In some ways it’s the most simple-minded paper I’ve ever written – and that’s up against some pretty stiff competition – but there you go.


The lognormal seemed an interesting idea to explore because it applies to non-linear processes in much the same way as the normal distribution does to linear ones. What I mean is that if you have a quantity Y which is the sum of n independent effects, Y = X1 + X2 + … + Xn, then the distribution of Y tends to be normal by virtue of the Central Limit Theorem, regardless of the distribution of the Xi. If, however, the process is multiplicative, so that Y = X1 × X2 × … × Xn, then since log Y = log X1 + log X2 + … + log Xn the Central Limit Theorem tends to make log Y normal – which is exactly what it means for Y to be lognormal.
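To see the Central Limit Theorem at work on a multiplicative process, here is a minimal numerical sketch (the uniform factors and sample sizes are illustrative choices of mine, not anything from the paper): multiply together many independent positive factors and check that the logarithm of the product looks normal, e.g. by its near-zero skewness.

```python
import math
import random

random.seed(42)

# Multiply n independent positive factors; by the Central Limit Theorem
# applied to log Y = log X1 + ... + log Xn, the product Y should be
# roughly lognormal, i.e. log Y roughly normal.
def multiplicative_log_samples(n_factors=100, n_samples=20000):
    samples = []
    for _ in range(n_samples):
        y = 1.0
        for _ in range(n_factors):
            y *= random.uniform(0.5, 1.5)  # any positive-valued factor will do
        samples.append(math.log(y))
    return samples

logs = multiplicative_log_samples()
mean = sum(logs) / len(logs)
var = sum((x - mean) ** 2 for x in logs) / len(logs)

# The skewness of log Y should be close to zero if log Y is near-normal,
# even though the individual log-factors are noticeably skewed.
skew = sum((x - mean) ** 3 for x in logs) / (len(logs) * var ** 1.5)
print(f"mean={mean:.3f}, skew={skew:.3f}")
```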

The lognormal is a good distribution for things produced by multiplicative processes, such as hierarchical fragmentation or coagulation processes: the distribution of sizes of the pebbles on Brighton beach  is quite a good example. It also crops up quite often in the theory of turbulence.

I’ll mention one other thing  about this distribution, just because it’s fun. The lognormal distribution is an example of a distribution that’s not completely determined by knowledge of its moments. Most people assume that if you know all the moments of a distribution then that has to specify the distribution uniquely, but it ain’t necessarily so.
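For the curious, the standard counterexample is due to Heyde (1963): perturbing the standard lognormal density f(x) by a factor 1 + a sin(2π ln x), for |a| ≤ 1, leaves every integer moment unchanged. A quick numerical check of this (a sketch – the quadrature grid and parameter values are my own choices):

```python
import math

# Heyde's example: the standard lognormal density and the perturbed density
# f(x) * (1 + a*sin(2*pi*ln x)) share every integer moment.  Verify a few
# moments by midpoint quadrature, integrating over y = ln x (so dx = e^y dy).
def lognormal_moment(n, a=0.0, y_min=-15.0, y_max=15.0, steps=60000):
    """n-th moment of the (possibly perturbed) standard lognormal."""
    dy = (y_max - y_min) / steps
    total = 0.0
    for i in range(steps):
        y = y_min + (i + 0.5) * dy
        # x^n * f(x) * dx rewritten in terms of y = ln x
        base = math.exp(n * y - 0.5 * y * y) / math.sqrt(2 * math.pi)
        total += base * (1.0 + a * math.sin(2 * math.pi * y)) * dy
    return total

for n in range(1, 5):
    exact = math.exp(n * n / 2)  # E[X^n] = exp(n^2/2) for the lognormal
    print(n, exact, lognormal_moment(n, a=0.5))
```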

If you’re wondering why I mentioned citations, it’s because it looks like they’re going to play a big part in the Research Excellence Framework, yet another new bureaucratic exercise attempting to measure the quality of research done in UK universities. Unfortunately, using citations isn’t straightforward. Different disciplines have hugely different citation rates, for one thing. Should one count self-citations? And how do you apportion citations to multi-author papers? Suppose a paper with a thousand citations has 25 authors. Does each of them get the thousand citations, or should each get 1000/25? Or, to put it another way, how does a single-author paper with 100 citations compare to a 50-author paper with 101?
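The two apportionment conventions in that example can be made concrete in a couple of lines (the numbers are just the hypothetical ones from the paragraph above):

```python
# Two common conventions for crediting a multi-author paper: full counting
# (every author is credited with all the citations) and fractional counting
# (the citations are divided by the number of authors).
def full_count(citations, n_authors):
    return citations  # n_authors is ignored: each author gets the lot

def fractional_count(citations, n_authors):
    return citations / n_authors

# A 25-author paper with 1000 citations:
print(full_count(1000, 25))        # 1000 per author
print(fractional_count(1000, 25))  # 40.0 per author

# The comparison above: a single-author paper with 100 citations
# versus a 50-author paper with 101.
print(fractional_count(100, 1))    # 100.0
print(fractional_count(101, 50))   # 2.02
```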

Or perhaps the REF panels should use the logarithm of the number of citations instead?

The H-index is Redundant…

Posted in Bad Statistics, Science Politics on January 28, 2012 by telescoper

An interesting paper appeared on the arXiv last week by astrophysicist Henk Spruit on the subject of bibliometric indicators, and specifically the Hirsch index (or H-index) which has been the subject of a number of previous blog posts on here. The author’s surname is pronounced “sprout”, by the way.

The H-index is defined to be the largest number H such that the author has written at least H papers each having at least H citations. It can easily be calculated by looking up all papers by a given author on a database such as NASA/ADS, sorting them by (decreasing) number of citations, and working down the list to the point where the number of citations of a paper falls below its position in the list. Normalized quantities – obtained by dividing the number of citations each paper receives by the number of authors of that paper – can be used to form an alternative measure.
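The procedure just described is easy to code up; here is a straightforward sketch (the normalized variant simply divides each paper’s citations by its author count, as above):

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    # Walk down the sorted list until a paper's citation count falls
    # below its position in the list.
    for position, c in enumerate(counts, start=1):
        if c >= position:
            h = position
        else:
            break
    return h

def normalized_h_index(citations, author_counts):
    """Same calculation on citations divided by each paper's author count."""
    normalized = [c / n for c, n in zip(citations, author_counts)]
    return h_index(normalized)

print(h_index([10, 8, 5, 4, 3]))  # → 4
```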

Here is the abstract of the paper:

Here are a couple of graphs which back up the claim of a near-perfect correlation between H-index and total citations:

The figure shows both total citations (right) and normalized citations (left); the latter being, in my view, a much more sensible measure of individual contributions. The basic problem, of course, is that people don’t get citations, papers do. Apportioning appropriate credit for a multi-author paper is therefore extremely difficult. Does each author of a 100-author paper that gets 100 citations really deserve the same credit as the single author of a paper that also gets 100 citations? Clearly not, yet that’s what happens if you count total citations.

The correlation between H index and the square root of total citation numbers has been remarked upon before, but it is good to see it confirmed for the particular field of astrophysics.

Although I’m a bit unclear as to how the “sample” was selected, I think this paper is a valuable contribution to the discussion, and I hope it helps counter the growing, and in my opinion already excessive, reliance on the H-index by grants panels and the like. Trying to condense all the available information about an applicant into a single number is clearly a futile task, and this paper shows that using the H-index alongside total citation numbers doesn’t add anything, as they are both measuring exactly the same thing.

A very interesting question emerges from this, however, which is why the relationship between total citation numbers and H-index has the form it does: the latter is always roughly half the square root of the former. This suggests to me that there might be some sort of universal scaling law onto which the distribution of cites-per-paper can be mapped for any individual. It would be interesting to construct a mathematical model of citation behaviour that could reproduce this apparently universal property…
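As a very crude illustration of the kind of model one might construct – this toy is entirely my own assumption, not anything from Spruit’s paper – suppose each author’s papers receive citations drawn from an exponential distribution. The ratio of H-index to the square root of total citations then comes out in the right ballpark:

```python
import math
import random

random.seed(1)

def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for position, c in enumerate(counts, start=1):
        if c >= position:
            h = position
        else:
            break
    return h

# Toy model: simulate many authors with varying numbers of papers and
# varying mean citation rates, citations per paper being exponentially
# distributed, and look at h / sqrt(total citations) for each.
ratios = []
for _ in range(200):
    n_papers = random.randint(50, 200)
    mean_cites = random.uniform(5, 40)
    cites = [int(random.expovariate(1.0 / mean_cites)) for _ in range(n_papers)]
    total = sum(cites)
    if total > 0:
        ratios.append(h_index(cites) / math.sqrt(total))

print(f"mean ratio = {sum(ratios) / len(ratios):.2f}")
```

With these parameters the mean ratio hovers not far from the observed value of about one half, though how robust that is to the assumed citation distribution is exactly the sort of question a proper model would have to address.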

If it ain’t open, it ain’t science

Posted in Open Access, Science Politics, The Universe and Stuff on May 16, 2011 by telescoper

Last Friday (13th May) the Royal Society launched a study into “openness in science”, as part of which they are inviting submissions from individuals and organizations. According to the Royal Society website:

“Science has always been about open debate. But incidents such as the UEA email leaks have prompted the Royal Society to look at how open science really is. With the advent of the Internet, the public now expect a greater degree of transparency. The impact of science on people’s lives, and the implications of scientific assessments for society and the economy are now so great that people won’t just believe scientists when they say ‘trust me, I’m an expert.’ It is not just scientists who want to be able to see inside scientific datasets, to see how robust they are and ask difficult questions about their implications. Science has to adapt.”

I think this is a timely and important study which, at the very least, will reveal how attitudes to this issue differ between science disciplines. At one extreme we have fields like astronomy, where the practice of making all data publicly available is increasingly common and where most scientific publications are available free of charge through the arXiv. At the other there are fields where experimental data are generally regarded as the private property of the scientists who collected the measurements or did the experiments.

I have quite a simple view on this, which is that data resulting from publicly funded research should by default be in the public domain. I accept that this will not always be possible owing to ethical issues, such as when human subjects are involved, but that should be the default position. I have two reasons for thinking this way. One is that it’s public money that funds us, so we have a moral responsibility to be as open as possible with the public. The other is that the scientific method only works when analyses can be fully scrutinized and, if necessary, replicated by other researchers. In other words, to seek to prevent one’s data becoming freely available is profoundly unscientific.

I’m actually both surprised and depressed at the reluctance of some scientists to make their data available for scrutiny by other scientists, let alone members of the general public. I can give an example from my own experience of hitting a brick wall when trying to find out more about the statistics behind a study in the field of neuroscience. Other branches of physics are also way behind astronomy and cosmology in opening up their research.

If scientists are reluctant to share their data with other scientists it’s very difficult to believe they will be happy to put it all in the public domain. But I think they should. And I don’t mean just chucking terabytes of complicated unsorted data onto a website in such a way that it’s impossible in practice to make use of them. I mean fully documented, carefully maintained databases with raw data, analysis tools and data products. An exemplar is the excellent LAMBDA site, a repository for data arising from research into the Cosmic Microwave Background.

I’ve ranted before (and will no doubt do so again) about the extremely negative effect the academic publishing industry has on the dissemination of results. At our latest Board of Studies meeting, the prospect of further cuts to our library budget was raised and the suggestion made that we might have to cancel some of our journal subscriptions. I, and most of my astronomy colleagues, frankly don’t really care if we cancel astronomy journals. All our relevant papers can be found on the arXiv and/or via the NASA/ADS system. My physics colleagues, on the other hand, are still in hock to the old-fashioned and ruinously expensive academic journal racket.

One of the questions the Royal Society study will ask is:

How do we make information more accessible and who will pay to do it?

I’m willing to hazard a guess that if we worked out how much universities and research laboratories are spending on pointless journal subscriptions, then we’d find that it’s more than enough to pay for the construction and maintenance of  sufficient  open access repositories.  The current system of publishing could easily be scrapped, and replaced by something radically different, but it won’t be easy to change to a new approach more suited to the era of the internet.  For example, at present  we are forced to  publish in “proper journals” for the purposes of research assessments, so that academic publishers wield immense power over university researchers. These vested interests will be difficult to overthrow, but I think there’s a growing realization that they are actively preventing science adjusting properly to the digital age.

Anyway, whether or not you agree with me, I hope you’ll agree that the Royal Society study is an important one so please take a look and contribute if you can.


