Archive for citations

Citation Analysis of Scientific Categories

Posted in Open Access, Science Politics on May 18, 2018 by telescoper

I stumbled across an interesting paper the other day with the title Citation Analysis of Scientific Categories. The title isn’t really accurate because not all the 236 categories covered by the analysis are `scientific’: they include many topics in the arts and humanities too. Anyway, the abstract is here:

Databases catalogue the corpus of research literature into scientific categories and report classes of bibliometric data such as the number of citations to articles, the number of authors, journals, funding agencies, institutes, references, etc. The number of articles and citations in a category are gauges of productivity and scientific impact but a quantitative basis to compare researchers between categories is limited. Here, we compile a list of bibliometric indicators for 236 science categories and citation rates of the 500 most cited articles of each category. The number of citations per paper vary by several orders of magnitude and are highest in multidisciplinary sciences, general internal medicine, and biochemistry and lowest in literature, poetry, and dance. A regression model demonstrates that citation rates to the top articles in each category increase with the square root of the number of articles in a category and decrease proportionately with the age of the references: articles in categories that cite recent research are also cited more frequently. The citation rate correlates positively with the number of funding agencies that finance the research. The category h-index correlates with the average number of cites to the top 500 ranked articles of each category (R2 = 0.997). Furthermore, only a few journals publish the top 500 cited articles in each category: four journals publish 60% (σ = ±20%) of these and ten publish 81% (σ = ±15%).

The paper is open access (I think) and you can find the whole thing here.

I had a discussion over lunch today with a couple of colleagues here in Maynooth about the use of citations. I think we agreed that citation analysis does convey some information about the impact of a person’s research, but that information is rather limited. One of the difficulties is that publication rates and citation activity are very discipline-dependent, so one can’t easily compare individuals working in different areas. The paper is interesting in this regard because it presents a table showing how various statistical citation measures vary across fields and sub-fields; physics, for example, is broken down into a number of distinct areas (e.g. Astronomy & Astrophysics, Particle Physics, Condensed Matter and Nuclear Physics) across which there is considerable variation. How best to use this information is still not clear.
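
One crude way to use a table like that would be to express each paper’s citation count relative to the typical rate in its own category before comparing individuals. Here is a minimal sketch of that idea in Python; the field averages below are invented for illustration, not taken from the paper:

```python
# Illustrative citations-per-paper averages by category. In practice these
# would come from a table like the one in the paper; the numbers here are
# made up purely to show the mechanics.
field_average = {
    "Astronomy & Astrophysics": 25.0,
    "Condensed Matter Physics": 15.0,
    "Literature": 1.5,
}

def normalised_citations(citations, field):
    """Express a paper's citation count relative to its own field's average,
    so papers from low- and high-citation disciplines can be compared."""
    return citations / field_average[field]

print(normalised_citations(50, "Astronomy & Astrophysics"))  # 2.0 x field average
print(normalised_citations(3, "Literature"))                 # 2.0 x field average
```

Whether such a simple rescaling is really fair across disciplines is, of course, exactly the question at issue.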

 

 


Metrics for `Academic Reputation’

Posted in Bad Statistics, Science Politics on April 9, 2018 by telescoper

This weekend I came across a provocative paper on the arXiv with the title Measuring the academic reputation through citation records via PageRank. Here is the abstract:

The objective assessment of the prestige of an academic institution is a difficult and hotly debated task. In the last few years, different types of University Rankings have been proposed to quantify the excellence of different research institutions in the world. Albeit met with criticism in some cases, the relevance of university rankings is being increasingly acknowledged: indeed, rankings are having a major impact on the design of research policies, both at the institutional and governmental level. Yet, the debate on what rankings are  exactly measuring is enduring. Here, we address the issue by measuring a quantitative and reliable proxy of the academic reputation of a given institution and by evaluating its correlation with different university rankings. Specifically, we study citation patterns among universities in five different Web of Science Subject Categories and use the PageRank algorithm on the five resulting citation networks. The rationale behind our work is that scientific citations are driven by the reputation of the reference so that the PageRank algorithm is expected to yield a rank which reflects the reputation of an academic institution in a specific field. Our results allow to quantifying the prestige of a set of institutions in a certain research field based only on hard bibliometric data. Given the volume of the data analysed, our findings are statistically robust and less prone to bias, at odds with ad hoc surveys often employed by ranking bodies in order to attain similar results. Because our findings are found to correlate extremely well with the ARWU Subject rankings, the approach we propose in our paper may open the door to new, Academic Ranking methodologies that go beyond current methods by reconciling the qualitative evaluation of Academic Prestige with its quantitative measurements via publication impact.

(The link to the description of the PageRank algorithm was added by me; I also corrected a few spelling mistakes in the abstract). You can find the full paper here (PDF).
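
For those curious about the mechanics, here is a rough sketch of the kind of calculation the authors describe: build a directed, weighted citation network between institutions and run PageRank on it. The institution names and citation counts below are made up, and the paper’s actual construction (per Web of Science category, with its own weighting choices) will differ:

```python
import networkx as nx

# Hypothetical aggregated citation counts between institutions in one subject
# category: an edge A -> B means papers from A cite papers from B.
citations = {
    ("Univ A", "Univ B"): 120,
    ("Univ A", "Univ C"): 30,
    ("Univ B", "Univ C"): 80,
    ("Univ B", "Univ A"): 60,
    ("Univ C", "Univ A"): 45,
}

G = nx.DiGraph()
for (src, dst), weight in citations.items():
    G.add_edge(src, dst, weight=weight)

# PageRank on the weighted, directed citation network; the scores are the
# proposed proxy for academic reputation in that category.
scores = nx.pagerank(G, alpha=0.85, weight="weight")
for inst, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{inst}: {score:.3f}")
```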

For what it’s worth, I think the paper contains some interesting ideas (e.g. treating citations as a `tree’ rather than a simple `list’) but the authors make some assumptions that I find deeply questionable (e.g. that being cited in a short reference list is somehow of higher value than being cited in a long one). The danger is that using such information in a metric could create an incentive for further bad behaviour (such as citation cartels).

I have blogged quite a few times about the uses and abuses of citations (see the tag here), and I won’t rehearse those arguments here. I will say, however, that I do agree with the idea of sharing citations among the authors of a paper rather than giving each and every author credit for the total. Many astronomers disagree with this point of view, but surely it is perverse to argue that the 100th author of a paper with 51 citations deserves more credit than the sole author of a paper with 49?

Above all, though, the problem with constructing a metric for `Academic Reputation’ is that the concept is so difficult to define in the first place…

Lognormality Revisited (Again)

Posted in Biographical, Science Politics, The Universe and Stuff on May 10, 2016 by telescoper

Today provided me with a (sadly rare) opportunity to join in our weekly Cosmology Journal Club at the University of Sussex. I don’t often get to go because of meetings and other commitments. Anyway, one of the papers we looked at (by Clerkin et al.) was entitled Testing the Lognormality of the Galaxy Distribution and weak lensing convergence distributions from Dark Energy Survey maps. This provides yet more examples of the unreasonable effectiveness of the lognormal distribution in cosmology. Here’s one of the diagrams, just to illustrate the point:

The points here are from MICE simulations. Not simulations of mice, of course, but simulations of MICE (Marenostrum Institut de Ciencies de l’Espai). Note how well the curves from a simple lognormal model fit the calculations that need a supercomputer to perform them!

The lognormal model used in the paper is basically the same as the one I developed in 1990 with Bernard Jones in what has turned out to be my most-cited paper. In fact the whole project was conceived, work done, written up and submitted in the space of a couple of months during a lovely visit to the fine city of Copenhagen. I’ve never been very good at grabbing citations – I’m more likely to fall off bandwagons than to jump onto them – but this little paper seems to keep getting citations. It hasn’t got that many by the standards of some papers, but it’s carried on being referred to for a quarter of a century, which I’m quite proud of; you can see that the citations-per-year statistics even seem to have increased recently. The model we proposed turned out to be extremely useful in a range of situations, which I suppose accounts for the citation longevity:

Citations die away for most papers, but this one is actually attracting more interest as time goes on! I don’t think this is my best paper, but it’s definitely the one I had most fun working on. I remember we had the idea of doing something with lognormal distributions over coffee one day, and just a few weeks later the paper was finished. In some ways it’s the most simple-minded paper I’ve ever written – and that’s up against some pretty stiff competition – but there you go.


The lognormal seemed an interesting idea to explore because it applies to non-linear processes in much the same way as the normal distribution does to linear ones. What I mean is that if you have a quantity Y which is the sum of n independent effects, Y=X1+X2+…+Xn, then the distribution of Y tends to be normal by virtue of the Central Limit Theorem, regardless of the distribution of the Xi. If, however, the process is multiplicative, so that Y=X1×X2×…×Xn, then log Y = log X1 + log X2 + …+ log Xn, and the Central Limit Theorem tends to make log Y normal, which is exactly what it means for Y to be lognormally distributed.
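
A quick numerical illustration (my own, not from the paper): multiply together a few dozen independent positive factors and the logarithm of the product comes out looking Gaussian, i.e. the product itself is approximately lognormal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Y is the product of n independent positive factors; the Central Limit Theorem
# applied to log Y = log X1 + ... + log Xn makes log Y approximately normal.
n, samples = 50, 100_000
X = rng.uniform(0.5, 1.5, size=(samples, n))   # any positive factors will do
Y = X.prod(axis=1)

print("skewness of Y:      ", stats.skew(Y))          # strongly skewed
print("skewness of log Y:  ", stats.skew(np.log(Y)))  # close to zero, i.e. near-Gaussian
```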

The lognormal is a good distribution for things produced by multiplicative processes, such as hierarchical fragmentation or coagulation processes: the distribution of sizes of the pebbles on Brighton beach  is quite a good example. It also crops up quite often in the theory of turbulence.

I’ll mention one other thing  about this distribution, just because it’s fun. The lognormal distribution is an example of a distribution that’s not completely determined by knowledge of its moments. Most people assume that if you know all the moments of a distribution then that has to specify the distribution uniquely, but it ain’t necessarily so.

If you’re wondering why I mentioned citations, it’s because they’re playing an increasing role in attempts to measure the quality of research done in UK universities. Citations definitely contain some information, but interpreting them isn’t at all straightforward. Different disciplines have hugely different citation rates, for one thing. Should one count self-citations? And how do you apportion citations to multi-author papers? Suppose a paper with a thousand citations has 25 authors. Does each of them get the thousand citations, or should each get 1000/25? Or, to put it another way, how does a single-author paper with 100 citations compare to a 50-author paper with 101?

Or perhaps a better metric would be the logarithm of the number of citations?

A scientific paper with 5000 authors is absurd, but does science need “papers” at all?

Posted in History, Open Access, Science Politics, The Universe and Stuff on May 17, 2015 by telescoper

Nature News has reported on what appears to be the paper with the longest author list on record. This article has so many authors – 5,154 altogether – that 24 pages (out of a total of 33 in the paper) are devoted just to listing them, and only 9 to the actual science. Not surprisingly, the field concerned is experimental particle physics and the paper emanates from the Large Hadron Collider; it involves combining data from the CMS and ATLAS detectors to estimate the mass of the Higgs Boson. In my own fields of astronomy and cosmology, large consortia such as the Planck collaboration are becoming the rule rather than the exception for observational work. Large collaborations have achieved great things not only in physics and astronomy but also in other fields. A paper in genomics with over a thousand authors has recently been published, and the trend towards ever-increasing size of collaboration seems set to continue.

I’ve got nothing at all against large collaborative projects. Quite the opposite, in fact. They’re enormously valuable not only because frontier research can often only be done that way, but also because of the wider message they send out about the benefits of international cooperation.

Having said that, one thing these large collaborations do is expose the absurdity of the current system of scientific publishing. The existence of a paper with 5000 authors is a reductio ad absurdum proof that the system is broken. Papers simply do not have 5000 “authors”. In fact, I would bet that no more than a handful of the “authors” listed on the record-breaking paper have even read the article, never mind written any of it. Despite this, scientists continue insisting that contributions to scientific research can only be measured by co-authorship of a paper. The LHC collaboration that kicked off this piece includes all kinds of scientists: technicians, engineers, physicists and programmers at all kinds of levels, from PhD students to full Professors. Why should we insist that this huge range of contributions can only be recognized by shoe-horning the individuals concerned into the author list? The idea of a 100-author paper is palpably absurd, never mind one with fifty times that number.

So how can we assign credit to individuals who belong to large teams of researchers working in collaboration?

For the time being let us assume that we are stuck with authorship as the means of indicating a contribution to the project. Significant issues then arise about how to apportion credit in bibliometric analyses, e.g. through citations. Here is an example of one of the difficulties: (i) if paper A is cited 100 times and has 100 authors, should each author get the same credit? And (ii) if paper B is also cited 100 times but has only one author, should this author get the same credit as each of the authors of paper A?

An interesting suggestion over on the e-astronomer a while ago addressed the first question by suggesting that authors be assigned weights depending on their position in the author list. If there are N authors the lead author gets weight N, the next N-1, and so on to the last author who gets a weight 1. If there are 4 authors, the lead gets 4 times as much weight as the last one.

This proposal has some merit, but it does not take account of the possibility that the author list is merely alphabetical, which was actually the case in all the Planck publications, for example. Still, it’s less draconian than another suggestion I have heard, which is that the first author gets all the credit and the rest get nothing. At the other extreme there’s the suggestion of using normalized citations, i.e. just dividing the citations equally among the authors and giving them a fraction 1/N each. I think I prefer this last one, in fact, as it seems more democratic and also more rational. I don’t have many publications with large numbers of authors, so it doesn’t make much difference to me which measure you happen to pick; I come out as mediocre on all of them.
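
To make the two schemes concrete, here is a toy sketch (purely illustrative, not anyone’s official algorithm) of positional weighting versus normalized (1/N) credit for a paper with 100 citations and 4 authors:

```python
def positional_weights(n_authors):
    """Arithmetic weighting as suggested on the e-astronomer: the lead author
    gets weight N, the next N-1, ..., the last gets 1 (normalised to sum to 1)."""
    raw = list(range(n_authors, 0, -1))
    total = sum(raw)
    return [w / total for w in raw]

def equal_weights(n_authors):
    """'Normalised citations': every author gets 1/N of the credit."""
    return [1.0 / n_authors] * n_authors

citations, n_authors = 100, 4
print([round(citations * w, 1) for w in positional_weights(n_authors)])  # [40.0, 30.0, 20.0, 10.0]
print([round(citations * w, 1) for w in equal_weights(n_authors)])       # [25.0, 25.0, 25.0, 25.0]
```

As the first line of output shows, with four authors the positional scheme gives the lead author four times the credit of the last one, exactly as described above.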

No suggestion is ever going to be perfect, however, because each amounts to an attempt to compress all the information about the different contributions and roles within a large collaboration into a single number, which clearly can’t be done algorithmically. For example, the way things work in astronomy is that instrument builders – essential to all observational work and all work based on analysing observations – usually get appended to the author lists even if they play no role in analysing the final data. This is one of the reasons the resulting papers have such long author lists, and why the bibliometric issues are so complex in the first place.

Having thousands of authors who didn’t write a single word of the paper seems absurd, but it’s the only way our current system can acknowledge the contributions made by instrumentalists, technical assistants and all the rest. Without doing this, what can such people have on their CV that shows the value of the work they have done?

What is really needed is a system of credits more like that used in television or film. Writer credits are assigned quite separately from those given to the “director” (of the project, who may or may not have written the final papers), as are those given to the people who got the funding together and helped with the logistics (production credits). Sundry smaller but still vital technical roles could also be credited, such as special effects (i.e. simulations) or lighting (photometric calibration). There might even be a best boy. Many theoretical papers would be classified as “shorts”, so they would often be written and directed by one person and with no technical credits.

The point I’m trying to make is that we seem to want to use citations to measure everything all at once but often we want different things. If you want to use citations to judge the suitability of an applicant for a position as a research leader you want someone with lots of directorial credits. If you want a good postdoc you want someone with a proven track-record of technical credits. But I don’t think it makes sense to appoint a research leader on the grounds that they reduced the data for umpteen large surveys. Imagine what would happen if you made someone director of a Hollywood blockbuster on the grounds that they had made the crew’s tea for over a hundred other films.

Another question I’d like to raise is one that has been bothering me for some time. When did it happen that everyone participating in an observational programme expected to be an author of a paper? It certainly hasn’t always been like that.

For example, go back about 90 years to one of the most famous astronomical studies of all time, Eddington‘s measurement of the bending of light by the gravitational field of the Sun. The paper that came out of this was the following:

A Determination of the Deflection of Light by the Sun’s Gravitational Field, from Observations made at the Total Eclipse of May 29, 1919.

Sir F.W. Dyson, F.R.S., Astronomer Royal, Prof. A.S. Eddington, F.R.S., and Mr C. Davidson.

Philosophical Transactions of the Royal Society of London, Series A., Volume 220, pp. 291-333, 1920.

This particular result didn’t involve a collaboration on the same scale as many of today’s, but it did entail two expeditions (one to Sobral, in Brazil, and another to the island of Principe, off the West African coast). Over a dozen people took part in the planning, in the preparation of calibration plates, in taking the eclipse measurements themselves, and so on. And that’s not counting all the people who helped locally in Sobral and Principe.

But notice that the final paper – one of the most important scientific papers of all time – has only 3 authors: Dyson did a great deal of background work getting the funds and organizing the show, but didn’t go on either expedition; Eddington led the Principe expedition and was central to much of the analysis; Davidson was one of the observers at Sobral. Andrew Crommelin, something of an eclipse expert who played a big part in the Sobral measurements, received no credit, and neither did Eddington’s main assistant at Principe.

I don’t know if there was a lot of conflict behind the scenes in arriving at this authorship policy but, as far as I know, it was normal at the time to do things this way. It’s an interesting socio-historical question why and when it changed.

I’ve rambled off a bit so I’ll return to the point that I was trying to get to, which is that in my view the real problem is not so much the question of authorship but the idea of the paper itself. It seems quite clear to me that the academic journal is an anachronism. Digital technology enables us to communicate ideas far more rapidly than in the past and allows much greater levels of interaction between researchers. I agree with Daniel Shanahan that the future for many fields will be defined not in terms of “papers”, which purport to represent “final” research outcomes, but by living documents continuously updated in response to open scrutiny by the community of researchers. I’ve long argued that the modern academic publishing industry is not facilitating but hindering the communication of research. The arXiv has already made academic journals virtually redundant in many branches of physics and astronomy; other disciplines will inevitably follow. The age of the academic journal is drawing to a close. Now to rethink the concept of “the paper”…

Lognormality Revisited

Posted in Biographical, Science Politics, The Universe and Stuff on January 14, 2015 by telescoper

I was looking up the reference for an old paper of mine on ADS yesterday and was surprised to find that it is continuing to attract citations. Thinking about the paper reminds me of the fun time I had in Copenhagen while it was written. I was invited there in 1990 by Bernard Jones, who used to work at the Niels Bohr Institute. I stayed there for several weeks over the May/June period, which is the best time of year for Denmark; it’s sufficiently far north (about the same latitude as Aberdeen) that the summer days are very long, and when it’s light until almost midnight it’s very tempting to spend a lot of time out late at night.

As well as being great fun, that little visit also produced what has turned out to be my most-cited paper. In fact the whole project was conceived, work done, written up and submitted in the space of a couple of months. I’ve never been very good at grabbing citations – I’m more likely to fall off bandwagons than to jump onto them – but this little paper seems to keep getting citations. It hasn’t got that many by the standards of some papers, but it’s carried on being referred to for a quarter of a century, which I’m quite proud of; you can see that the citations-per-year statistics even seem to have increased recently. The model we proposed turned out to be extremely useful in a range of situations, which I suppose accounts for the citation longevity:


I don’t think this is my best paper, but it’s definitely the one I had most fun working on. I remember we had the idea of doing something with lognormal distributions over coffee one day,  and just a few weeks later the paper was  finished. In some ways it’s the most simple-minded paper I’ve ever written – and that’s up against some pretty stiff competition – but there you go.


The lognormal seemed an interesting idea to explore because it applies to non-linear processes in much the same way as the normal distribution does to linear ones. What I mean is that if you have a quantity Y which is the sum of n independent effects, Y=X1+X2+…+Xn, then the distribution of Y tends to be normal by virtue of the Central Limit Theorem, regardless of the distribution of the Xi. If, however, the process is multiplicative, so that Y=X1×X2×…×Xn, then log Y = log X1 + log X2 + …+ log Xn, and the Central Limit Theorem tends to make log Y normal, which is exactly what it means for Y to be lognormally distributed.

The lognormal is a good distribution for things produced by multiplicative processes, such as hierarchical fragmentation or coagulation processes: the distribution of sizes of the pebbles on Brighton beach  is quite a good example. It also crops up quite often in the theory of turbulence.

I’ll mention one other thing  about this distribution, just because it’s fun. The lognormal distribution is an example of a distribution that’s not completely determined by knowledge of its moments. Most people assume that if you know all the moments of a distribution then that has to specify the distribution uniquely, but it ain’t necessarily so.
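
If you want to convince yourself of this numerically, a classic construction (due to Heyde) multiplies the standard lognormal density by 1 + ε sin(2π log x): for |ε| ≤ 1 this is a perfectly good probability density, visibly different from the lognormal, yet it has exactly the same moments of every integer order. Here is a quick check of that claim, working in the variable t = log x so the integrand is well behaved; the integration limits are just chosen generously enough to capture all the mass:

```python
import numpy as np
from scipy.integrate import quad

def moment(n, eps):
    """n-th moment of the density f_eps(x) = standard lognormal density * (1 + eps*sin(2*pi*log x)),
    computed in the variable t = log x."""
    integrand = lambda t: (np.exp(n * t) * np.exp(-0.5 * t * t) / np.sqrt(2 * np.pi)
                           * (1.0 + eps * np.sin(2 * np.pi * t)))
    value, _ = quad(integrand, -15, 20, limit=200)
    return value

# eps = 0 is the lognormal itself; eps = 0.5 is a genuinely different distribution,
# yet every integer moment agrees (and equals exp(n**2 / 2)).
for n in range(5):
    print(n, round(moment(n, 0.0), 4), round(moment(n, 0.5), 4))
```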

If you’re wondering why I mentioned citations, it’s because it looks like they’re going to play a big part in the Research Excellence Framework, yet another bureaucratic exercise that attempts to measure the quality of research done in UK universities. Unfortunately, using citations isn’t straightforward. Different disciplines have hugely different citation rates, for one thing. Should one count self-citations? And how do you apportion citations to multi-author papers? Suppose a paper with a thousand citations has 25 authors. Does each of them get the thousand citations, or should each get 1000/25? Or, to put it another way, how does a single-author paper with 100 citations compare to a 50-author paper with 101?

Or perhaps the REF panels should use the logarithm of the number of citations instead?

The Impact X-Factor

Posted in Bad Statistics, Open Access on August 14, 2012 by telescoper

Just time for a quick (yet still rather tardy) post to direct your attention to an excellent polemical piece by Stephen Curry pointing out the pointlessness of Journal Impact Factors. For those of you in blissful ignorance about the statistical aberration that is the JIF, it’s basically a measure of the average number of citations attracted by a paper published in a given journal. The idea is that if you publish a paper in a journal with a large JIF then it’s in among a number of papers that are highly cited and therefore presumably high quality. Using a form of Proof by Association, your paper must therefore be excellent too, hanging around with tall people being a tried-and-tested way of becoming tall.

I won’t repeat all Stephen Curry’s arguments as to why this is bollocks – read the piece for yourself – but one of the most important is that the distribution of citations per paper is extremely skewed, so the average is dragged upwards by a few papers with huge numbers of citations. As a consequence most papers published in a journal with a large JIF attract many fewer citations than the average. Moreover, modern bibliometric databases make it quite easy to extract citation information for individual papers, which is what is relevant if you’re trying to judge the quality impact of a particular piece of work, so why bother with the JIF at all?

I will however copy the summary, which is to the point:

So consider all that we know of impact factors and think on this: if you use impact factors you are statistically illiterate.

  • If you include journal impact factors in the list of publications in your cv, you are statistically illiterate.
  • If you are judging grant or promotion applications and find yourself scanning the applicant’s publications, checking off the impact factors, you are statistically illiterate.
  • If you publish a journal that trumpets its impact factor in adverts or emails, you are statistically illiterate. (If you trumpet that impact factor to three decimal places, there is little hope for you.)
  • If you see someone else using impact factors and make no attempt at correction, you connive at statistical illiteracy.

Statistical illiteracy is by no means as rare among scientists as we’d like to think, but at least I can say that I pay no attention whatsoever to Journal Impact Factors. In fact I don’t think many people in astronomy or astrophysics use them at all. I’d be interested to hear from anyone who does.

I’d like to add a little coda to Stephen Curry’s argument. I’d say that if you publish a paper in a journal with a large JIF (e.g. Nature) but the paper turns out to attract very few citations then the paper should be penalised in a bibliometric analysis, rather like the handicap system used in horse racing or golf. If, despite the press hype and other tedious trumpetings associated with the publication of a Nature paper, the work still attracts negligible interest then it must really be a stinker and should be rated as such by grant panels, etc. Likewise if you publish a paper in a less impactful journal which nevertheless becomes a citation hit then it should be given extra kudos because it has gained recognition by quality alone.

Of course citation numbers don’t necessarily mean quality. Many excellent papers are slow burners from a bibliometric point of view. However, if a journal markets itself as being a vehicle for papers that are intended to attract large citation counts and a paper published there flops then I think it should attract a black mark. Hoist it on its own petard, as it were.

So I suggest each paper be awarded an Impact X-Factor, based on the difference between its citation count and the JIF of the journal in which it appears. For most papers this will of course be negative, which would serve their authors right for mentioning the Impact Factor in the first place.
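
As a toy illustration of both points – the skewness of per-paper citation counts and the fact that most papers would end up with a negative X-Factor – here is a little simulation with an invented (and, appropriately enough, lognormal) citation distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up journal: per-paper citation counts drawn from a heavily skewed
# distribution, as real citation counts tend to be.
cites = rng.lognormal(mean=1.0, sigma=1.2, size=2000)

jif = cites.mean()              # roughly what a Journal Impact Factor measures
x_factor = cites - jif          # the proposed Impact X-Factor for each paper

print(f"mean (JIF-like): {jif:.1f}   median: {np.median(cites):.1f}")
print(f"papers with a negative X-Factor: {(x_factor < 0).mean():.0%}")
```

The mean comes out well above the median, and roughly three-quarters of the papers in this toy journal sit below the journal’s own “impact factor”.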

PS. I chose the name “X-factor” as in the TV show precisely for its negative connotations.

The Quality of Physics

Posted in Science Politics on February 21, 2012 by telescoper

Just time for a quick post this lunchtime,  in between meetings and exercise classes. My eye was drawn this morning to an article about a lengthy report from the Institute of Physics that gives an international comparison of citation impact in physics and related fields.

According to the IOP website..

Although the UK is ranked seventh in a list of key competitor countries for the quantity of its physics research output – measured by the number of papers published – the UK is second only to Canada, and now higher than the US, when ranked on the average quality of the UK’s physics research output – measured by the average number of times research papers are cited around the world.

The piece also goes on to note that the UK’s share of the total number of research papers written has decreased

For the UK, however, its proportionate decrease in output – from 7.1% of the world’s physics research in 2001 to 6.4% in 2010 – has been accompanied by a celebratory increase in overall, average quality – with the average number of citations of UK research papers rising from 1.24 in 2001 to 1.72 in 2010.

This, of course, assumes that citations measure “quality” but I’ve got no time to argue that point today. What I will do is put up a couple of interesting figures from the report.  This one shows that Space Science in the UK (including Astronomy and Astrophysics) holds a much bigger share of the total world output of papers than other disciplines (by a factor of about three):

While this one shows that the “citation impact” for Physics and Space Science roughly track each other…

..apart from the downturn right at the end of the window for space sciences, which, one imagines, might be a result of decisions taken by the management of the Science and Technology Facilities Council  over that period.

Our political leaders will be tempted to portray the steady increase of citation impact across fields as a sign of improved quality arising from the various research assessment exercises.  But I don’t think it’s as simple as that. It seems that many developing countries – especially China – are producing more and more scientific papers. This inevitably drives the UK’s share of world productivity down, because our capacity is not increasing. If anything it’s going down, in fact, owing to recent funding cuts. However, the more papers there are, the more reference lists there are, and the more citations there will be. The increase in citation rates may therefore just be a form of inflation.
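
A toy model makes the inflation point clearer. Suppose the number of papers published each year grows steadily and each new paper cites a fixed number of references drawn from the recent literature; then the average number of citations eventually received per paper is higher than it would be with no growth, even though nobody’s citing behaviour has changed. This little sketch (my own invention, not from the IOP report) illustrates the effect:

```python
import numpy as np

def mean_citations_per_paper(annual_growth, years=60, refs_per_paper=30, window=10):
    """Toy model: each year a cohort of papers appears, and every new paper cites
    refs_per_paper references drawn uniformly from papers published in the previous
    `window` years. Returns the average citations eventually received per paper,
    for cohorts old enough to have been cited over the full window."""
    papers = np.array([1000.0 * (1.0 + annual_growth) ** t for t in range(years)])
    citations = np.zeros(years)
    for t in range(window, years):
        pool = papers[t - window:t]                  # citable papers, by cohort
        citations[t - window:t] += refs_per_paper * papers[t] * pool / pool.sum()
    mature = slice(window, years - window)
    return (citations[mature] / papers[mature]).mean()

print(mean_citations_per_paper(0.00))   # ~30: no growth, no inflation
print(mean_citations_per_paper(0.05))   # ~39: same citing behaviour, 5% annual growth
```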

Anyway, you can download the entire report here (PDF). I’m sure there will be other reactions to it so, as usual, please feel free to comment via the box below…