Archive for citations

Lognormality Revisited (Again)

Posted in Biographical, Science Politics, The Universe and Stuff on May 10, 2016 by telescoper

Today provided me with a (sadly rare) opportunity to join in our weekly Cosmology Journal Club at the University of Sussex. I don’t often get to go because of meetings and other commitments. Anyway, one of the papers we looked at (by Clerkin et al.) was entitled Testing the Lognormality of the Galaxy Distribution and weak lensing convergence distributions from Dark Energy Survey maps. This provides yet more examples of the unreasonable effectiveness of the lognormal distribution in cosmology. Here’s one of the diagrams, just to illustrate the point:

Log_galaxy_counts

The points here are from MICE simulations. Not simulations of mice, of course, but simulations of MICE (Marenostrum Institut de Ciencies de l’Espai). Note how well the curves from a simple lognormal model fit calculations that need a supercomputer to perform!

The lognormal model used in the paper is basically the same as the one I developed in 1990 with Bernard Jones in what has turned out to be my most-cited paper. In fact the whole project was conceived, the work done, and the paper written up and submitted in the space of a couple of months during a lovely visit to the fine city of Copenhagen. I’ve never been very good at grabbing citations – I’m more likely to fall off bandwagons than jump onto them – but this little paper seems to keep getting citations. It hasn’t got that many by the standards of some papers, but it has carried on being referred to for over a quarter of a century, which I’m quite proud of; the citations-per-year statistics even seem to have increased recently. The model we proposed turned out to be extremely useful in a range of situations, which I suppose accounts for the citation longevity:

nph-ref_history

Citations die away for most papers, but this one is actually attracting more interest as time goes on! I don’t think this is my best paper, but it’s definitely the one I had most fun working on. I remember we had the idea of doing something with lognormal distributions over coffee one day, and just a few weeks later the paper was finished. In some ways it’s the most simple-minded paper I’ve ever written – and that’s up against some pretty stiff competition – but there you go.

Lognormal_abstract

The lognormal seemed an interesting idea to explore because it applies to non-linear processes in much the same way as the normal distribution does to linear ones. What I mean is that if you have a quantity Y which is the sum of n independent effects, Y=X1+X2+…+Xn, then the distribution of Y tends to be normal by virtue of the Central Limit Theorem, regardless of the distribution of the Xi. If, however, the process is multiplicative, so that Y=X1×X2×…×Xn, then log Y = log X1 + log X2 + …+ log Xn, and the Central Limit Theorem tends to make log Y normal instead – which is exactly what it means for Y to be lognormal.
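
Just to make this concrete, here’s a little toy simulation in Python (my own sketch, nothing to do with the paper itself): multiply together lots of independent positive random factors, none of which is remotely Gaussian, and the logarithm of the product comes out looking normal anyway.

```python
# Toy check: the product of many independent positive factors is roughly
# lognormal, because the CLT acts on the sum of their logarithms.
import numpy as np

rng = np.random.default_rng(42)
n_factors = 50         # number of multiplicative effects per realisation
n_samples = 100_000    # number of realisations of Y

# Each X_i is uniform on (0.5, 1.5): positive, but nothing like lognormal.
X = rng.uniform(0.5, 1.5, size=(n_samples, n_factors))
logY = np.log(X.prod(axis=1))

# If log Y is close to Gaussian, its skewness and excess kurtosis vanish.
mean, std = logY.mean(), logY.std()
skew = ((logY - mean) ** 3).mean() / std**3
kurt = ((logY - mean) ** 4).mean() / std**4 - 3.0
print(f"log Y: skew = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```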

The lognormal is a good distribution for things produced by multiplicative processes, such as hierarchical fragmentation or coagulation: the distribution of sizes of the pebbles on Brighton beach is quite a good example. It also crops up quite often in the theory of turbulence.

I’ll mention one other thing about this distribution, just because it’s fun. The lognormal distribution is an example of a distribution that’s not completely determined by knowledge of its moments. Most people assume that if you know all the moments of a distribution then that has to specify the distribution uniquely, but it ain’t necessarily so.
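
In case you don’t believe that, there’s a classic construction due to Heyde (1963): take the standard lognormal density and perturb it by a factor 1 + ε sin(2π log x). For |ε| ≤ 1 this is still a perfectly good probability density, but a short calculation shows every integer moment is unchanged. Here’s a quick numerical check in Python (my own addition, just for fun):

```python
# Numerical check of Heyde's example: perturbing the standard lognormal
# density by eps*sin(2*pi*ln x) changes the distribution but leaves every
# integer moment exactly the same.
import numpy as np
from scipy.integrate import quad

def moment(n, eps):
    # n-th moment of f(x)*(1 + eps*sin(2*pi*ln x)), computed in u = ln x,
    # where the standard lognormal density becomes a unit Gaussian.
    integrand = lambda u: (np.exp(n * u) * np.exp(-0.5 * u**2)
                           / np.sqrt(2.0 * np.pi)
                           * (1.0 + eps * np.sin(2.0 * np.pi * u)))
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

for n in range(5):
    print(n, round(moment(n, 0.0), 6), round(moment(n, 0.5), 6),
          round(np.exp(0.5 * n**2), 6))
# The two densities give identical moments, exp(n^2/2) in both cases.
```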

If you’re wondering why I mentioned citations, it’s because they’re playing an increasing role in attempts to measure the quality of research done in UK universities. Citations definitely contain some information, but interpreting them isn’t at all straightforward. Different disciplines have hugely different citation rates, for one thing. Should one count self-citations? And how do you apportion citations to multi-author papers? Suppose a paper with a thousand citations has 25 authors. Does each of them get the thousand citations, or should each get 1000/25 = 40? Or, to put it another way, how does a single-author paper with 100 citations compare to a 50-author paper with 101?

Or perhaps a better metric would be the logarithm of the number of citations?

A scientific paper with 5000 authors is absurd, but does science need “papers” at all?

Posted in History, Open Access, Science Politics, The Universe and Stuff on May 17, 2015 by telescoper

Nature News has reported on what appears to be the paper with the longest author list on record. This article has so many authors – 5,154 altogether – that 24 of its 33 pages are devoted just to listing them, and only 9 to the actual science. Not surprisingly, the field concerned is experimental particle physics and the paper emanates from the Large Hadron Collider; it involves combining data from the CMS and ATLAS detectors to estimate the mass of the Higgs Boson. In my own fields of astronomy and cosmology, large consortia such as the Planck collaboration are becoming the rule rather than the exception for observational work. Large collaborations have achieved great things not only in physics and astronomy but also in other fields. A paper in genomics with over a thousand authors has recently been published, and the trend towards ever-increasing size of collaboration seems set to continue.

I’ve got nothing at all against large collaborative projects. Quite the opposite, in fact. They’re enormously valuable not only because frontier research can often only be done that way, but also because of the wider message they send out about the benefits of international cooperation.

Having said that, one thing these large collaborations do is expose the absurdity of the current system of scientific publishing. The existence of a paper with over 5,000 authors is a reductio ad absurdum proof that the system is broken. Papers simply do not have 5,000 “authors”. In fact, I would bet that no more than a handful of the “authors” listed on the record-breaking paper have even read the article, never mind written any of it. Despite this, scientists continue to insist that contributions to scientific research can only be measured by co-authorship of a paper. The LHC collaboration that kicked off this piece includes all kinds of scientists – technicians, engineers, physicists and programmers – at all kinds of levels, from PhD students to full Professors. Why should we insist that this huge range of contributions can only be recognized by shoe-horning the individuals concerned into the author list? The idea of a 100-author paper is palpably absurd, never mind one with fifty times that number.

So how can we assign credit to individuals who belong to large teams of researchers working in collaboration?

For the time being let us assume that we are stuck with authorship as the means of indicating a contribution to a project. Significant issues then arise about how to apportion credit in bibliometric analyses, e.g. through citations. Here is an example of the difficulty: (i) if paper A is cited 100 times and has 100 authors, should each author get the same credit? and (ii) if paper B is also cited 100 times but has only one author, should that author get the same credit as each of the authors of paper A?

An interesting suggestion over on the e-astronomer a while ago addressed the first question by proposing that authors be assigned weights depending on their position in the author list. If there are N authors, the lead author gets weight N, the next N-1, and so on down to the last author, who gets weight 1. If there are 4 authors, the lead therefore gets 4 times as much weight as the last one.

This proposal has some merit, but it does not take account of the possibility that the author list is merely alphabetical, which was actually the case in all the Planck publications, for example. Still, it’s less draconian than another suggestion I have heard, which is that the first author gets all the credit and the rest get nothing. At the other extreme there’s the suggestion of using normalized citations, i.e. dividing the citations equally among the authors and giving each a fraction 1/N. I think I prefer this last one, in fact, as it seems both more democratic and more rational. I don’t have many publications with large numbers of authors, so it doesn’t make much difference to me which measure you happen to pick; I come out as mediocre on all of them.
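
For what it’s worth, here’s a little Python sketch (mine, not anyone’s official proposal) comparing the three schemes just mentioned for a paper with 1000 citations and 4 authors:

```python
# Toy comparison of three ways to share the credit for a paper's
# citations among its N authors.
def credit_shares(n_authors):
    total_weight = n_authors * (n_authors + 1) / 2  # N + (N-1) + ... + 1
    return {
        # Draconian: the first author gets everything.
        "first_author_takes_all": [1.0 if i == 0 else 0.0
                                   for i in range(n_authors)],
        # Rank-weighted: lead gets weight N, next N-1, ..., last gets 1.
        "rank_weighted": [(n_authors - i) / total_weight
                          for i in range(n_authors)],
        # Normalized: everyone gets 1/N.
        "equal_split": [1.0 / n_authors] * n_authors,
    }

citations = 1000
for scheme, shares in credit_shares(4).items():
    print(scheme, [round(citations * s, 1) for s in shares])
# rank_weighted gives [400, 300, 200, 100]: the lead author gets four times
# the credit of the last one, as in the example above.
```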

No suggestion is ever going to be perfect, however, because the different contributions and roles within a large collaboration simply cannot be compressed into a single number algorithmically. For example, the way things work in astronomy is that instrument builders – essential to all observational work and to all work based on analysing observations – usually get appended to the author lists even if they play no role in analysing the final data. This is one of the reasons the resulting papers have such long author lists, and why the bibliometric issues are so complex in the first place.

Having thousands of authors who didn’t write a single word of the paper seems absurd, but it’s the only way our current system can acknowledge the contributions made by instrumentalists, technical assistants and all the rest. Without doing this, what can such people have on their CV that shows the value of the work they have done?

What is really needed is a system of credits more like that used in television or film. Writer credits would be assigned quite separately from those given to the “director” of the project (who may or may not have written the final papers), as would those for the people who got the funding together and helped with the logistics (production credits). Sundry smaller but still vital technical roles could also be credited, such as special effects (i.e. simulations) or lighting (photometric calibration). There might even be a best boy. Many theoretical papers would be classified as “shorts”, so they would often be written and directed by one person, with no technical credits.

The point I’m trying to make is that we seem to want to use citations to measure everything all at once but often we want different things. If you want to use citations to judge the suitability of an applicant for a position as a research leader you want someone with lots of directorial credits. If you want a good postdoc you want someone with a proven track-record of technical credits. But I don’t think it makes sense to appoint a research leader on the grounds that they reduced the data for umpteen large surveys. Imagine what would happen if you made someone director of a Hollywood blockbuster on the grounds that they had made the crew’s tea for over a hundred other films.

Another question I’d like to raise is one that has been bothering me for some time. When did it happen that everyone participating in an observational programme expected to be an author of a paper? It certainly hasn’t always been like that.

For example, go back about 90 years to one of the most famous astronomical studies of all time, Eddington’s measurement of the bending of light by the gravitational field of the Sun. The paper that came out of this work was:

A Determination of the Deflection of Light by the Sun’s Gravitational Field, from Observations made at the Total Eclipse of May 29, 1919.

Sir F.W. Dyson, F.R.S., Astronomer Royal, Prof. A.S. Eddington, F.R.S., and Mr C. Davidson.

Philosophical Transactions of the Royal Society of London, Series A., Volume 220, pp. 291-333, 1920.

This particular result didn’t involve a collaboration on the same scale as many of today’s, but it did entail two expeditions (one to Sobral, in Brazil, and another to the island of Principe, off the West African coast). Over a dozen people took part in the planning, the preparation of calibration plates, the taking of the eclipse measurements themselves, and so on. And that’s not counting all the people who helped locally in Sobral and Principe.

But notice that the final paper – one of the most important scientific papers of all time – has only 3 authors: Dyson did a great deal of background work getting the funds and organizing the show, but didn’t go on either expedition; Eddington led the Principe expedition and was central to much of the analysis; Davidson was one of the observers at Sobral. Andrew Crommelin, something of an eclipse expert who played a big part in the Sobral measurements, received no credit, and neither did Eddington’s main assistant at Principe.

I don’t know if there was a lot of conflict behind the scenes in arriving at this authorship policy but, as far as I know, it was normal policy at the time to do things this way. It’s an interesting socio-historical question why and when that changed.

I’ve rambled off a bit, so I’ll return to the point I was trying to get to, which is that in my view the real problem is not so much the question of authorship as the idea of the paper itself. It seems quite clear to me that the academic journal is an anachronism. Digital technology enables us to communicate ideas far more rapidly than in the past and allows much greater levels of interaction between researchers. I agree with Daniel Shanahan that the future for many fields will be defined not in terms of “papers”, which purport to represent “final” research outcomes, but by living documents continuously updated in response to open scrutiny by the community of researchers. I’ve long argued that the modern academic publishing industry is not facilitating but hindering the communication of research. The arXiv has already made academic journals virtually redundant in many branches of physics and astronomy; other disciplines will inevitably follow. The age of the academic journal is drawing to a close. Now to rethink the concept of “the paper”…

Lognormality Revisited

Posted in Biographical, Science Politics, The Universe and Stuff on January 14, 2015 by telescoper

I was looking up the reference for an old paper of mine on ADS yesterday and was surprised to find that it is continuing to attract citations. Thinking about the paper reminded me of the fun time I had in Copenhagen while it was written. I was invited there in 1990 by Bernard Jones, who used to work at the Niels Bohr Institute. I stayed for several weeks over the May/June period, which is the best time of year for Denmark; it’s sufficiently far north (about the same latitude as Aberdeen) that the summer days are very long, and when it’s light until almost midnight it’s very tempting to spend a lot of time out late at night.

As well as being great fun, that little visit also produced what has turned out to be my most-cited paper. In fact the whole project was conceived, the work done, and the paper written up and submitted in the space of a couple of months. I’ve never been very good at grabbing citations – I’m more likely to fall off bandwagons than jump onto them – but this little paper seems to keep getting citations. It hasn’t got that many by the standards of some papers, but it has carried on being referred to for almost a quarter of a century, which I’m quite proud of; the citations-per-year statistics even seem to have increased recently. The model we proposed turned out to be extremely useful in a range of situations, which I suppose accounts for the citation longevity:

lognormal

I don’t think this is my best paper, but it’s definitely the one I had most fun working on. I remember we had the idea of doing something with lognormal distributions over coffee one day, and just a few weeks later the paper was finished. In some ways it’s the most simple-minded paper I’ve ever written – and that’s up against some pretty stiff competition – but there you go.

Picture1

The lognormal seemed an interesting idea to explore because it applies to non-linear processes in much the same way as the normal distribution does to linear ones. What I mean is that if you have a quantity Y which is the sum of n independent effects, Y=X1+X2+…+Xn, then the distribution of Y tends to be normal by virtue of the Central Limit Theorem, regardless of the distribution of the Xi. If, however, the process is multiplicative, so that Y=X1×X2×…×Xn, then log Y = log X1 + log X2 + …+ log Xn, and the Central Limit Theorem tends to make log Y normal instead – which is exactly what it means for Y to be lognormal.

The lognormal is a good distribution for things produced by multiplicative processes, such as hierarchical fragmentation or coagulation: the distribution of sizes of the pebbles on Brighton beach is quite a good example. It also crops up quite often in the theory of turbulence.

I’ll mention one other thing about this distribution, just because it’s fun. The lognormal distribution is an example of a distribution that’s not completely determined by knowledge of its moments. Most people assume that if you know all the moments of a distribution then that has to specify the distribution uniquely, but it ain’t necessarily so.

If you’re wondering why I mentioned citations, it’s because it looks like they’re going to play a big part in the Research Excellence Framework, yet another bureaucratic exercise attempting to measure the quality of research done in UK universities. Unfortunately, using citations isn’t straightforward. Different disciplines have hugely different citation rates, for one thing. Should one count self-citations? And how do you apportion citations to multi-author papers? Suppose a paper with a thousand citations has 25 authors. Does each of them get the thousand citations, or should each get 1000/25 = 40? Or, to put it another way, how does a single-author paper with 100 citations compare to a 50-author paper with 101?

Or perhaps the REF panels should use the logarithm of the number of citations instead?

The Impact X-Factor

Posted in Bad Statistics, Open Access on August 14, 2012 by telescoper

Just time for a quick (yet still rather tardy) post to direct your attention to an excellent polemical piece by Stephen Curry pointing out the pointlessness of Journal Impact Factors. For those of you in blissful ignorance about the statistical aberration that is the JIF, it’s basically a measure of the average number of citations attracted by a paper published in a given journal. The idea is that if you publish a paper in a journal with a large JIF then it’s in among a number of papers that are highly cited and therefore presumably high quality. Using a form of Proof by Association, your paper must therefore be excellent too, hanging around with tall people being a tried-and-tested way of becoming tall.

I won’t repeat all Stephen Curry’s arguments as to why this is bollocks – read the piece for yourself – but one of the most important is that the distribution of citations per paper is extremely skewed, so the average is dragged upwards by a few papers with huge numbers of citations. As a consequence, most papers published in a journal with a large JIF attract far fewer citations than the average for that journal. Moreover, modern bibliometric databases make it quite easy to extract citation information for individual papers, which is what is relevant if you’re trying to judge the impact of a particular piece of work, so why bother with the JIF at all?
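
To see how badly the mean misleads for a skewed distribution, here’s a toy simulation in Python (invented numbers, not real bibliometric data): draw citation counts from a lognormal and compare the JIF-like mean with the typical paper.

```python
# Toy illustration of why a skewed citation distribution makes the
# average misleading.
import numpy as np

rng = np.random.default_rng(1)
cites = rng.lognormal(mean=1.0, sigma=1.5, size=10_000)  # cites per paper

jif_like_mean = cites.mean()        # what a JIF-style average measures
median = np.median(cites)           # the typical paper
below_mean = (cites < jif_like_mean).mean()

print(f"mean: {jif_like_mean:.1f}, median: {median:.1f}")
print(f"fraction of papers below the mean: {below_mean:.0%}")
# Roughly three quarters of the papers fall below the mean: most papers in
# a high-JIF journal attract many fewer citations than the JIF suggests.
```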

I will however copy the summary, which is to the point:

So consider all that we know of impact factors and think on this: if you use impact factors you are statistically illiterate.

  • If you include journal impact factors in the list of publications in your cv, you are statistically illiterate.
  • If you are judging grant or promotion applications and find yourself scanning the applicant’s publications, checking off the impact factors, you are statistically illiterate.
  • If you publish a journal that trumpets its impact factor in adverts or emails, you are statistically illiterate. (If you trumpet that impact factor to three decimal places, there is little hope for you.)
  • If you see someone else using impact factors and make no attempt at correction, you connive at statistical illiteracy.

Statistical illiteracy is by no means as rare among scientists as we’d like to think, but at least I can say that I pay no attention whatsoever to Journal Impact Factors. In fact I don’t think many people in astronomy or astrophysics use them at all. I’d be interested to hear from anyone who does.

I’d like to add a little coda to Stephen Curry’s argument. I’d say that if you publish a paper in a journal with a large JIF (e.g. Nature) but the paper turns out to attract very few citations then the paper should be penalised in a bibliometric analysis, rather like the handicap system used in horse racing or golf. If, despite the press hype and other tedious trumpetings associated with the publication of a Nature paper, the work still attracts negligible interest then it must really be a stinker and should be rated as such by grant panels, etc. Likewise if you publish a paper in a less impactful journal which nevertheless becomes a citation hit then it should be given extra kudos because it has gained recognition by quality alone.

Of course citation numbers don’t necessarily mean quality. Many excellent papers are slow burners from a bibliometric point of view. However, if a journal markets itself as being a vehicle for papers that are intended to attract large citation counts and a paper published there flops then I think it should attract a black mark. Hoist it on its own petard, as it were.

So I suggest papers be awarded an Impact X-Factor, based on the difference between their citation count and the JIF of the journal in which they appeared. For most papers this will of course be negative, which would serve their authors right for mentioning the Impact Factor in the first place.
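
As a sketch, the metric is trivial to compute (the papers and numbers below are invented, of course):

```python
# A sketch of the Impact X-Factor proposed above: citation count minus the
# JIF of the journal the paper appeared in. All numbers here are invented.
def impact_x_factor(citations, journal_impact_factor):
    return citations - journal_impact_factor

papers = [
    ("hyped flop in a glossy journal", 3, 38.1),
    ("citation hit in a modest journal", 52, 4.8),
]
for name, cites, jif in papers:
    print(f"{name}: X-factor = {impact_x_factor(cites, jif):+.1f}")
# The flop scores -35.1 and the hit +47.2: the handicap system in action.
```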

PS. I chose the name “X-factor” as in the TV show precisely for its negative connotations.

The Quality of Physics

Posted in Science Politics on February 21, 2012 by telescoper

Just time for a quick post this lunchtime, in between meetings and exercise classes. My eye was drawn this morning to an article about a lengthy report from the Institute of Physics that gives an international comparison of citation impact in physics and related fields.

According to the IOP website…

Although the UK is ranked seventh in a list of key competitor countries for the quantity of its physics research output – measured by the number of papers published – the UK is second only to Canada, and now higher than the US, when ranked on the average quality of the UK’s physics research output – measured by the average number of times research papers are cited around the world.

The piece also goes on to note that the UK’s share of the total number of research papers written has decreased:

For the UK, however, its proportionate decrease in output – from 7.1% of the world’s physics research in 2001 to 6.4% in 2010 – has been accompanied by a celebratory increase in overall, average quality – with the average number of citations of UK research papers rising from 1.24 in 2001 to 1.72 in 2010.

This, of course, assumes that citations measure “quality”, but I’ve got no time to argue that point today. What I will do is put up a couple of interesting figures from the report. This one shows that Space Science in the UK (including Astronomy and Astrophysics) holds a much bigger share of the total world output of papers than other disciplines (by a factor of about three):

While this one shows that the “citation impact” for Physics and Space Science roughly track each other…

…apart from the downturn right at the end of the window for space sciences, which, one imagines, might be a result of decisions taken by the management of the Science and Technology Facilities Council over that period.

Our political leaders will be tempted to portray the steady increase of citation impact across fields as a sign of improved quality arising from the various research assessment exercises. But I don’t think it’s as simple as that. Many developing countries – especially China – are producing more and more scientific papers. This inevitably drives the UK’s share of world output down, because our capacity is not increasing; if anything it’s going down, in fact, owing to recent funding cuts. However, the more papers there are, the more reference lists there are, and the more citations there will be. The increase in citation rates may therefore be just a form of inflation.

Anyway, you can download the entire report here (PDF). I’m sure there will be other reactions to it so, as usual, please feel free to comment via the box below…

The H-index is Redundant…

Posted in Bad Statistics, Science Politics on January 28, 2012 by telescoper

An interesting paper appeared on the arXiv last week by astrophysicist Henk Spruit on the subject of bibliometric indicators, and specifically the Hirsch index (or H-index) which has been the subject of a number of previous blog posts on here. The author’s surname is pronounced “sprout”, by the way.

The H-index is defined to be the largest number H such that the author has written at least H papers each having at least H citations. It can easily be calculated by looking up all papers by a given author on a database such as NASA/ADS, sorting them in decreasing order of citation count, and working down the list to the point where the number of citations of a paper falls below its position in the list. Normalized quantities – obtained by dividing the number of citations each paper receives by that paper’s number of authors – can be used to form an alternative measure.
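
The recipe just described is easy to put into code. Here’s a short Python version, assuming you already have the citation counts to hand rather than querying ADS:

```python
# A sketch of the calculation described above, given a list of per-paper
# citation counts (e.g. already retrieved from NASA/ADS).
def h_index(citations):
    """Largest h such that at least h papers have at least h citations."""
    h = 0
    for position, cites in enumerate(sorted(citations, reverse=True),
                                     start=1):
        if cites >= position:
            h = position
        else:
            break
    return h

def normalized(citations, author_counts):
    """Divide each paper's citations by its number of authors."""
    return [c / n for c, n in zip(citations, author_counts)]

cites = [50, 30, 22, 15, 8, 6, 5, 3, 1, 0]
print(h_index(cites))  # 6: six papers with at least 6 citations each
print(h_index(normalized(cites, [1, 3, 2, 5, 1, 2, 1, 4, 1, 1])))  # 5
```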

Here is the abstract of the paper:

Here are a couple of graphs which back up the claim of a near-perfect correlation between H-index and total citations:

The figure shows both total citations (right) and normalized citations (left); the latter is, in my view, a much more sensible measure of individual contributions. The basic problem, of course, is that people don’t get citations, papers do. Apportioning appropriate credit for a multi-author paper is therefore extremely difficult. Does each author of a 100-author paper that gets 100 citations really deserve the same credit as the single author of a paper that also gets 100 citations? Clearly not, yet that’s what happens if you count total citations.

The correlation between H-index and the square root of total citation numbers has been remarked upon before, but it is good to see it confirmed for the particular field of astrophysics.

Although I’m a bit unclear as to how the “sample” was selected, I think this paper is a valuable contribution to the discussion, and I hope it helps counter the growing – and in my opinion already excessive – reliance on the H-index by grants panels and the like. Trying to condense all the available information about an applicant into a single number is clearly a futile task, and this paper shows that using both the H-index and total citation numbers adds nothing, as they are measuring exactly the same thing.

A very interesting question emerges from this, however, which is why the relationship between total citation numbers and H-index has the form it does: the latter is always roughly half the square root of the former. This suggests to me that there might be some sort of universal form onto which the distribution of cites-per-paper for any individual can be mapped. It would be interesting to construct a mathematical model of citation behaviour that could reproduce this apparently universal property…
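
As a first stab in that direction, here’s a toy model in Python (pure speculation on my part, not the paper’s analysis): give each synthetic author a random number of papers with exponentially distributed citations per paper, and see how h compares with the square root of the total.

```python
# A toy citation model: each synthetic author gets a random number of
# papers with exponentially distributed citation counts; we then compare
# the h-index with sqrt(total citations).
import numpy as np

rng = np.random.default_rng(7)

def h_index(citations):
    ranked = np.sort(citations)[::-1]
    return int(np.sum(ranked >= np.arange(1, len(ranked) + 1)))

ratios = []
for _ in range(200):
    n_papers = rng.integers(30, 300)
    cites = rng.exponential(scale=20.0, size=n_papers)  # skewed cites/paper
    ratios.append(h_index(cites) / np.sqrt(cites.sum()))

print(f"h / sqrt(C_total): mean = {np.mean(ratios):.2f}, "
      f"scatter = {np.std(ratios):.2f}")
# For this toy model the ratio comes out roughly constant (around 0.5-0.6)
# from author to author, mimicking the tight correlation in the paper.
```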

Citation-weighted Wordles

Posted in Uncategorized on December 12, 2011 by telescoper

Someone who clearly has too much time on his hands emailed me this morning with the results of an in-depth investigation into trends in the titles of highly cited astronomy papers from the past 30 years, and how this reflects the changing ‘hot-topics’.

The procedure adopted was to query ADS for the top 100 cited papers in each of three ten-year intervals: 1980-1990, 1990-2000 and 2000-2010. He then took all the words from the titles of these papers and weighted each word by the sum of the citations of all the articles in whose titles it appears… so if the word ‘galaxy’ appears in two papers with citations of 100 and 300, it gets a weighting of 400, and so on.

After getting these lists, he used the online ‘Wordle’ tool to generate word-clouds, using the citation weightings in the word-sizing calculation. Common words, numbers, etc. are excluded. There may be some cases where non-astronomy papers have crept in, but as much as possible was done to keep these to a minimum.

There’s probably some bias, since older papers have had longer to accumulate citations, but I think the fact that the hot topics change on roughly ten-year time-scales largely takes care of this.
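
For anyone who wants to play with the idea, here’s a minimal Python reconstruction of the weighting scheme (my sketch; the emailer’s actual script and ADS queries aren’t shown here):

```python
# A minimal reconstruction of the weighting scheme: each title word gets
# the summed citations of every paper whose title contains it.
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "in", "on", "for", "with", "from"}

def word_weights(papers):
    weights = Counter()
    for title, citations in papers:
        # Count each word once per title, as in the 'galaxy' example above.
        for word in set(title.lower().split()):
            if word.isalpha() and word not in STOPWORDS:
                weights[word] += citations
    return weights

papers = [
    ("Galaxy formation in the early universe", 100),
    ("Observations of galaxy clustering", 300),
]
print(word_weights(papers).most_common())
# 'galaxy' gets 100 + 300 = 400; the weights can then be fed to a tool such
# as Wordle to set the word sizes in the cloud.
```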

Anyway, here are the rather interesting results. First is 1980-1990

Followed by 1990-2000

and, lastly, we have 2000-2010

It’s especially interesting to see the extent to which cosmology has elbowed all the other less interesting stuff out of the way…and how the word “observations” has come to the fore in the last decade.

PS. Here’s the last one again with the WMAP papers taken out: