Archive for ADS

Lognormality Revisited (Again)

Posted in Biographical, Science Politics, The Universe and Stuff on May 10, 2016 by telescoper

Today provided me with a (sadly rare) opportunity to join in our weekly Cosmology Journal Club at the University of Sussex. I don’t often get to go because of meetings and other commitments. Anyway, one of the papers we looked at (by Clerkin et al.) was entitled Testing the Lognormality of the Galaxy Distribution and weak lensing convergence distributions from Dark Energy Survey maps. This provides yet more examples of the unreasonable effectiveness of the lognormal distribution in cosmology. Here’s one of the diagrams, just to illustrate the point:

[Figure: Log_galaxy_counts]

The points here are from MICE simulations. Not simulations of mice, of course, but simulations of MICE (Marenostrum Institut de Ciencies de l’Espai). Note how well the curves from a simple lognormal model fit the calculations that need a supercomputer to perform them!

The lognormal model used in the paper is basically the same as the one I developed in 1990 with Bernard Jones in what has turned out to be my most-cited paper. In fact the whole project was conceived, work done, written up and submitted in the space of a couple of months during a lovely visit to the fine city of Copenhagen. I’ve never been very good at grabbing citations – I’m more likely to fall off bandwagons than jump onto them – but this little paper seems to keep getting citations. It hasn’t got that many by the standards of some papers, but it’s carried on being referred to for almost twenty years, which I’m quite proud of; you can see that the citations-per-year statistics even seem to have increased recently. The model we proposed turned out to be extremely useful in a range of situations, which I suppose accounts for the citation longevity:

[Figure: nph-ref_history]

Citations die away for most papers, but this one is actually attracting more interest as time goes on! I don’t think this is my best paper, but it’s definitely the one I had most fun working on. I remember we had the idea of doing something with lognormal distributions over coffee one day, and just a few weeks later the paper was finished. In some ways it’s the most simple-minded paper I’ve ever written – and that’s up against some pretty stiff competition – but there you go.

[Figure: Lognormal_abstract]

The lognormal seemed an interesting idea to explore because it applies to non-linear processes in much the same way as the normal distribution does to linear ones. What I mean is that if you have a quantity Y which is the sum of n independent effects, Y=X1+X2+…+Xn, then the distribution of Y tends to be normal by virtue of the Central Limit Theorem, regardless of what the distribution of the Xi is. If, however, the process is multiplicative, so that Y=X1×X2×…×Xn, then log Y = log X1 + log X2 + …+ log Xn and the Central Limit Theorem tends to make log Y normal, which is what the lognormal distribution means.
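For anyone who wants to see the multiplicative version of the Central Limit Theorem in action, here is a minimal sketch in Python. It is purely illustrative and has nothing to do with the calculations in the paper: the choice of uniform factors on (0.5, 1.5), the number of factors, the sample size and the function names are all assumptions made for the example. It multiplies together independent positive factors and compares the skewness of Y with that of log Y.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the third standardised moment (zero for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_products(n_factors=50, n_samples=5000, seed=42):
    """Draw n_samples values of Y = X1 * X2 * ... * Xn.

    Assumption for illustration: each Xi is uniform on (0.5, 1.5);
    any positive distribution with finite variance of log Xi would do.
    """
    rng = random.Random(seed)
    return [
        math.prod(rng.uniform(0.5, 1.5) for _ in range(n_factors))
        for _ in range(n_samples)
    ]


products = simulate_products()
log_products = [math.log(y) for y in products]

skew_y = skewness(products)          # strongly positive: a long tail to high values
skew_log_y = skewness(log_products)  # close to zero: log Y is roughly normal

print(f"skewness of Y     : {skew_y:.2f}")
print(f"skewness of log Y : {skew_log_y:.2f}")
```

The product itself is wildly asymmetric, while its logarithm is very nearly symmetric, which is just the statement that Y is approximately lognormal.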

The lognormal is a good distribution for things produced by multiplicative processes, such as hierarchical fragmentation or coagulation processes: the distribution of sizes of the pebbles on Brighton beach  is quite a good example. It also crops up quite often in the theory of turbulence.

I’ll mention one other thing  about this distribution, just because it’s fun. The lognormal distribution is an example of a distribution that’s not completely determined by knowledge of its moments. Most people assume that if you know all the moments of a distribution then that has to specify the distribution uniquely, but it ain’t necessarily so.
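The standard counterexample, usually attributed to Heyde, can be written down explicitly. The sketch below takes the standard lognormal (so that log X is a unit normal) and perturbs it by a factor that oscillates in log x; the symbols f, g and ε are just notation chosen for this illustration.

```latex
% Standard lognormal density (log X is a unit normal) and its moments
f(x) = \frac{1}{x\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(\ln x)^2\right], \qquad x > 0,
\qquad \mathbb{E}[X^k] = e^{k^2/2}.

% Perturbed family: a valid probability density for any -1 \le \epsilon \le 1
g_\epsilon(x) = f(x)\,\bigl[1 + \epsilon \sin(2\pi \ln x)\bigr].

% The perturbation contributes nothing to any integer moment
% (substitute t = \ln x, then s = t - k; the integrand is odd in s):
\int_0^\infty x^k f(x) \sin(2\pi \ln x)\,dx
  = \frac{e^{k^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-s^2/2} \sin(2\pi s)\,ds = 0,
  \qquad k = 0, 1, 2, \ldots
```

So every member of the family has exactly the same moments as the lognormal itself, even though the densities are plainly different.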

If you’re wondering why I mentioned citations, it’s because they’re playing an increasing role in attempts to measure the quality of research done in UK universities. Citations definitely contain some information, but interpreting them isn’t at all straightforward. Different disciplines have hugely different citation rates, for one thing. Should one count self-citations? Also how do you apportion citations to multi-author papers? Suppose a paper with a thousand citations has 25 authors. Does each of them get the thousand citations, or should each get 1000/25? Or, to put it another way, how does a single-author paper with 100 citations compare to a 50-author paper with 101?
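To make the arithmetic concrete, here is a tiny Python sketch comparing the two conventions in the paragraph above: giving every author the full count, or dividing the count equally among the authors. The function name and the equal-shares rule are assumptions for illustration only; I am not suggesting any assessment exercise actually works this way.

```python
def citations_per_author(citations, n_authors, fractional=True):
    """Credit assigned to each author of a single paper.

    fractional=True  : split the citations equally among the authors
    fractional=False : every author is credited with the full count
    """
    if n_authors < 1:
        raise ValueError("a paper needs at least one author")
    return citations / n_authors if fractional else float(citations)


# The examples from the text
big_team_full = citations_per_author(1000, 25, fractional=False)  # full credit each
big_team_share = citations_per_author(1000, 25)                   # 1000/25 each
solo_paper = citations_per_author(100, 1)                         # single author
fifty_authors = citations_per_author(101, 50)                     # 101/50 each

print(big_team_full, big_team_share, solo_paper, fifty_authors)
```

Under full counting the 50-author paper narrowly beats the single-author one; under equal shares it loses by a factor of about fifty, which is the whole problem in a nutshell.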

Or perhaps a better metric would be the logarithm of the number of citations?


Lognormality Revisited

Posted in Biographical, Science Politics, The Universe and Stuff on January 14, 2015 by telescoper

I was looking up the reference for an old paper of mine on ADS yesterday and was surprised to find that it is continuing to attract citations. Thinking about the paper reminds me of the fun time I had in Copenhagen while it was written. I was invited there in 1990 by Bernard Jones, who used to work at the Niels Bohr Institute. I stayed there several weeks over the May/June period, which is the best time of year for Denmark; it’s sufficiently far north (about the same latitude as Aberdeen) that the summer days are very long, and when it’s light until almost midnight it’s very tempting to spend a lot of time out late at night.

As well as being great fun, that little visit also produced what has turned out to be my most-cited paper. In fact the whole project was conceived, work done, written up and submitted in the space of a couple of months. I’ve never been very good at grabbing citations – I’m more likely to fall off bandwagons than jump onto them – but this little paper seems to keep getting citations. It hasn’t got that many by the standards of some papers, but it’s carried on being referred to for almost twenty years, which I’m quite proud of; you can see that the citations-per-year statistics even seem to have increased recently. The model we proposed turned out to be extremely useful in a range of situations, which I suppose accounts for the citation longevity:

[Figure: lognormal]

I don’t think this is my best paper, but it’s definitely the one I had most fun working on. I remember we had the idea of doing something with lognormal distributions over coffee one day,  and just a few weeks later the paper was  finished. In some ways it’s the most simple-minded paper I’ve ever written – and that’s up against some pretty stiff competition – but there you go.


The lognormal seemed an interesting idea to explore because it applies to non-linear processes in much the same way as the normal distribution does to linear ones. What I mean is that if you have a quantity Y which is the sum of n independent effects, Y=X1+X2+…+Xn, then the distribution of Y tends to be normal by virtue of the Central Limit Theorem, regardless of what the distribution of the Xi is. If, however, the process is multiplicative, so that Y=X1×X2×…×Xn, then log Y = log X1 + log X2 + …+ log Xn and the Central Limit Theorem tends to make log Y normal, which is what the lognormal distribution means.

The lognormal is a good distribution for things produced by multiplicative processes, such as hierarchical fragmentation or coagulation processes: the distribution of sizes of the pebbles on Brighton beach  is quite a good example. It also crops up quite often in the theory of turbulence.

I’ll mention one other thing  about this distribution, just because it’s fun. The lognormal distribution is an example of a distribution that’s not completely determined by knowledge of its moments. Most people assume that if you know all the moments of a distribution then that has to specify the distribution uniquely, but it ain’t necessarily so.

If you’re wondering why I mentioned citations, it’s because it looks like they’re going to play a big part in the Research Excellence Framework, yet another new bureaucratic exercise to attempt to measure the quality of research done in UK universities. Unfortunately, using citations isn’t straightforward. Different disciplines have hugely different citation rates, for one thing. Should one count self-citations? Also how do you apportion citations to multi-author papers? Suppose a paper with a thousand citations has 25 authors. Does each of them get the thousand citations, or should each get 1000/25? Or, to put it another way, how does a single-author paper with 100 citations compare to a 50-author paper with 101?

Or perhaps the REF panels should use the logarithm of the number of citations instead?

Article of the Day!

Posted in The Universe and Stuff on July 31, 2013 by telescoper

Back in the office today, the heatwave having given way to grey drizzle and cool breezes (at least for the time being). I’ve got stacks of paperwork to catch up on, but fortunately I’ve got time to post a quick congratulatory message to Ian Harrison, who is author of today’s NASA ADS Article of the Day! Ian is a PhD student in the School of Physics & Astronomy at Cardiff University and was supervised by me until I abandoned ship to come here to Sussex earlier this year; he’s got a postdoctoral research position lined up in the Midlands (Manchester) when he finishes his thesis. The other author, Shaun Hotchkiss, is coming to Sussex as a postdoctoral researcher in October.

Anyway, the paper is a nice one, called A consistent approach to falsifying ΛCDM with rare galaxy clusters. Here’s the abstract:

We consider methods with which to answer the question “is any observed galaxy cluster too unusual for ΛCDM?” After emphasising that many previous attempts to answer this question will overestimate the confidence level at which ΛCDM can be ruled out, we outline a consistent approach to these rare clusters, which allows the question to be answered. We define three statistical measures, each of which are sensitive to changes in cluster populations arising from different modifications to the cosmological model. We also use these properties to define the “equivalent mass at redshift zero” for a cluster — the mass of an equally unusual cluster today. This quantity is independent of the observational survey in which the cluster was found, which makes it an ideal proxy for ranking the relative unusualness of clusters detected by different surveys. These methods are then used on a comprehensive sample of observed galaxy clusters and we confirm that all are less than 2σ deviations from the ΛCDM expectation. Whereas we have only applied our method to galaxy clusters, it is applicable to any isolated, collapsed, halo. As motivation for future surveys, we also calculate where in the mass redshift plane the rarest halo is most likely to be found, giving information as to which objects might be the most fruitful in the search for new physics.

In case you’re wondering, the rather Popperian nature of the title is not the reason why I’m not among the authors. I’m just not the sort of supervisor who feels he should always be an author of papers done by his research students even when they had the idea and did all the work themselves. From what I’ve heard talking to others, we’re a dying breed!

Citation-weighted Wordles

Posted in Uncategorized on December 12, 2011 by telescoper

Someone who clearly has too much time on his hands emailed me this morning with the results of an in-depth investigation into trends in the titles of highly cited astronomy papers from the past 30 years, and how this reflects the changing ‘hot-topics’.

The procedure adopted was to query ADS for the top 100 cited papers in three ten-year intervals: 1980-1990, 1990-2000, and 2000-2010. He then took all the words from the titles of these papers and weighted them according to the sum of the number of citations of all the articles that word appears in… so if the word ‘galaxy’ appears in two papers with citations of 100 and 300, it gets a weighting of 400, and so-on.
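For the curious, the weighting step described above can be sketched in a few lines of Python. This is my own illustration, not the script my correspondent actually used: the two paper titles are made up, the stop-word list is a token one, and no real ADS query is involved (the list of titles and citation counts is simply typed in by hand).

```python
from collections import defaultdict

# Illustrative stop-word list only; a real analysis would use a much longer one.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to", "with"}


def citation_weighted_words(papers):
    """Weight each title word by the total citations of the papers it appears in.

    `papers` is an iterable of (title, citation_count) pairs. A word is counted
    once per paper, however many times it occurs in that paper's title.
    """
    weights = defaultdict(int)
    for title, citations in papers:
        words = {w.strip(".,:;()?!'\"").lower() for w in title.split()}
        for word in words - STOPWORDS:
            if word and not word.isdigit():  # drop empty strings and bare numbers
                weights[word] += citations
    return dict(weights)


# Hypothetical titles and citation counts, matching the example in the text
papers = [
    ("Galaxy formation in clusters", 100),
    ("The galaxy luminosity function", 300),
]
weights = citation_weighted_words(papers)

for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:12s} {weight}")
```

With these two made-up papers the word ‘galaxy’ ends up with a weighting of 400, as in the example above, and the resulting word-weight pairs are what would then be fed into the word-cloud tool.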

After getting these lists, he used the online ‘Wordle’ tool to generate word-clouds of these words, using those citation weightings in the word-sizing calculation. Common words, numbers, etc. are excluded. There may be some cases where non-astronomy papers have crept in, but as much as possible is done to keep these to a minimum.

There’s probably some bias, since older papers have longer to accumulate citations, but the changing hot-topics on ~10 year time-scales take care of this I think.

Anyway, here are the rather interesting results. First is 1980-1990

Followed by 1990-2000

and, lastly, we have 2000-2010

It’s especially interesting to see the extent to which cosmology has elbowed all the other less interesting stuff out of the way…and how the word “observations” has come to the fore in the last decade.

ps. Here’s the last one again with the WMAP papers taken out:

What Counts as Productivity?

Posted in Bad Statistics, Science Politics, The Universe and Stuff on March 18, 2011 by telescoper

Apparently last year the United Kingdom Infra-Red Telescope (UKIRT) beat its own personal best for scientific productivity. In fact here’s a  graphic showing the number of publications resulting from UKIRT to make the point:

The plot also demonstrates that a large part of the recent burst of productivity has been associated with UKIDSS (the UKIRT Infrared Deep Sky Survey), which a number of my colleagues are involved in. Excellent chaps. Great project. Lots of hard work done very well. Take a bow, the UKIDSS team!

Now I hope I’ve made it clear that  I don’t in any way want to pour cold water on the achievements of UKIRT, and particularly not UKIDSS, but this does provide an example of how difficult it is to use bibliometric information in a meaningful way.

Take the UKIDSS papers used in the plot above. There are 226 of these listed by Steve Warren at Imperial College. But what is a “UKIDSS paper”? Steve states the criteria he adopted:

A paper is listed as a UKIDSS paper if it is already published in a journal (with one exception) and satisfies one of the following criteria:

1. It is one of the core papers describing the survey (e.g. calibration, archive, data releases). The DR2 paper is included, and is the only paper listed not published in a journal.
2. It includes science results that are derived in whole or in part from UKIDSS data directly accessed from the archive (analysis of data published in another paper does not count).
3. It contains science results from primary follow-up observations in a programme that is identifiable as a UKIDSS programme (e.g. The physical properties of four ~600K T dwarfs, presenting Spitzer spectra of cool brown dwarfs discovered with UKIDSS).
4. It includes a feasibility study of science that could be achieved using UKIDSS data (e.g. The possiblity of detection of ultracool dwarfs with the UKIRT Infrared Deep Sky Survey by Deacon and Hambly).

Papers are identified by a full-text search for the string ‘UKIDSS’, and then compared against the above criteria.

That all seems to me to be quite reasonable, and it’s certainly one way of defining what a UKIDSS paper is. According to that measure, UKIDSS scores 226.

The Warren measure does, however, include a number of papers that don’t directly use UKIDSS data, and many written by people who aren’t members of the UKIDSS consortium. Being picky you might say that such papers aren’t really original UKIDSS papers, but are more like second-generation spin-offs. So how could you count UKIDSS papers differently?

I just tried one alternative way, which is to use ADS to identify all refereed papers with “UKIDSS” in the title, assuming – possibly incorrectly – that all papers written by the UKIDSS consortium would have UKIDSS in the title. The number returned by this search was 38.

Now I’m not saying that this is more reasonable than the Warren measure. It’s just different, that’s all. According to my criterion, however, UKIDSS measures 38 rather than 226. It sounds less impressive (if only because 38 is a smaller number than 226), but what does it mean about UKIDSS productivity in absolute terms?

Not very much, I think is the answer.

Yet another way you might try to judge UKIDSS using bibliometric means is to look at its citation impact. After all, any fool can churn out dozens of papers that no-one ever reads. I know that for a fact. I am that fool.

But citation data also provide another way of doing what Steve Warren was trying to measure. Presumably the authors of any paper that uses UKIDSS data in any significant way would cite the main UKIDSS survey paper led by Andy Lawrence (Lawrence et al. 2007). According to ADS, the number of times this has been cited since publication is 359. That’s higher than the Warren measure (226), and much higher than the UKIDSS-in-the-title measure (38).

So there we are, three different measures, all in my opinion perfectly reasonable measures of, er,  something or other, but each giving a very different numerical value. I am not saying any  is misleading or that any is necessarily better than the others. My point is simply that it’s not easy to assign a numerical value to something that’s intrinsically difficult to define.

Unfortunately, it’s a point few people in government seem to be prepared to acknowledge.

Andy Lawrence is 57.

