The First Digit Phenomenon
I thought it would be fun to put up this quirky example of how sometimes things that really ought to be random turn out not to be. It’s also an excuse to mention a strange connection between astronomy and statistics.
The astronomer Simon Newcomb (right) was born in 1835 in Nova Scotia (Canada). He had no real formal education at all, but since there wasn’t much else to do in Nova Scotia, he taught himself mathematics and astronomy and became very adept at performing astronomical calculations with great diligence. He began work in a lowly position at the US Nautical Almanac Office in 1857, and by 1877 he was director. He was professor of Mathematics and Astronomy at Johns Hopkins University from 1884 until 1893 and was made the first ever president of the American Astronomical Society in 1899; he died in 1909.
Newcomb was performing lengthy numerical calculations in an era long before the invention of the pocket calculator or desktop computer. In those days many such calculations, including virtually anything involving multiplication, had to be done using logarithms. The logarithm (to the base ten) of a number x is defined to be the number a such that x = 10^a. To multiply two numbers whose logarithms are a and b respectively involves simply adding the logarithms: 10^a times 10^b = 10^(a+b), which helps a lot because adding is a lot easier than multiplying if you have no calculator. The initial logarithms are simply looked up in a table; to find the answer you use different tables to find the “inverse” logarithm.
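To make the trick concrete, here is a minimal Python sketch that multiplies two numbers by adding their base-ten logarithms. The function name multiply_via_logs is made up for this example, and the library calls simply stand in for Newcomb’s printed tables.

```python
import math

def multiply_via_logs(x, y):
    """Multiply two positive numbers by adding their base-ten logarithms."""
    a = math.log10(x)      # "look up" the logarithm of x
    b = math.log10(y)      # "look up" the logarithm of y
    return 10 ** (a + b)   # "inverse" logarithm of the sum

print(multiply_via_logs(123.0, 456.0))  # close to 56088, up to rounding error
```

With a real table the accuracy is limited by the number of decimal places printed, which is one reason these calculations demanded such diligence.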
Newcomb was a heavy user of his book of mathematical tables for this type of calculation, and it became very grubby and worn. But he also noticed that the first pages of the logarithms seemed to have been used much more than the others. This puzzled him greatly. Logarithm tables are presented in order of the first digit of the number required: the first pages therefore contain logarithms for numbers beginning with the digit 1. Newcomb used the tables for a vast range of different calculations of different things. He expected the first digit of any number he had to look up to be equally likely to be anything from 1 to 9. Shouldn’t they be randomly distributed? Shouldn’t all the pages be equally used?
Once raised, this puzzle faded away until it was re-discovered in 1938 by the physicist Frank Benford and acquired the name of Benford’s law, or the first digit phenomenon. In virtually any list you can think of – street addresses, city populations, lengths of rivers, and so on – there are more entries beginning with the digit “1” than any other digit.
To give another example, although I admit this one is much harder to explain, in the American Physical Society’s list of fundamental constants, or at least the last version I happened to look at, no less than 40% begin with the digit 1. If you’ve been writing physics examination papers recently like I have, you will notice a similar behaviour. Out of the 16 physical constants listed in the rubric of a physics examination paper lying on my desk right now, 6 begin with the digit 1.
So what is going on?
There is a (relatively) simple answer, and a more complicated one. I’ll take the simple one first.
Consider street numbers in an address book as an example. Any street will be numbered from 1 to N. It doesn’t really matter what N is as long as it is finite (and nobody has ever built an infinitely long street). Now think about the first digits of the addresses. There are 9 possibilities, because we never start an address with 0. On the face of it, we might expect a fraction 1/9 (approximately 11%) of the addresses to start with 1. Suppose N is 200. What fraction actually starts with 1? The answer is more than 50%: everything from 100 to 199, plus 1, and 10 to 19, which makes 111 out of 200. Very few start with 9: only 9 itself, and 90-99 inclusive. If N is 300 then 1 is still (jointly with 2) the most common first digit, and there are no more that start with 9. One only gets close to an equal fraction of each starting number if the value of N is an exact power of 10, e.g. 1000.
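If you don’t fancy counting by hand, the tally for a single street is easy to check. The following is a minimal Python sketch; the function name first_digit_counts is just something I have made up for the purpose.

```python
from collections import Counter

def first_digit_counts(n):
    """Count the leading digits of the house numbers 1, 2, ..., n."""
    return Counter(int(str(address)[0]) for address in range(1, n + 1))

counts = first_digit_counts(200)
for digit in range(1, 10):
    print(digit, counts[digit], f"{counts[digit] / 200:.1%}")
```

For N = 200 this gives 111 addresses (55.5%) starting with 1 and only 11 (5.5%) starting with 9. Changing the argument lets you try other street lengths.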
Now you can see why pulling numbers out of an address book leads to a distribution of first digits that is not at all uniform. As long as the numbers are being drawn from a collection of streets each of which has a finite upper limit, then the result is bound to be biased towards low starting digits. Only if every street contained an exact power of ten addresses would the result be close to uniform. Every other possibility favours low digits at the start.
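The same effect shows up if you pool a whole address book. Here is a rough Python sketch of that experiment. The choice of 500 streets, with lengths drawn uniformly between 1 and 999, is purely my assumption for illustration; real streets will of course follow some other distribution.

```python
import random
from collections import Counter

def pooled_first_digits(num_streets=500, max_length=999, seed=1):
    """Pool the addresses of many streets and tally their leading digits.

    Street lengths are drawn uniformly from 1..max_length, which is an
    assumption made purely for illustration.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_streets):
        n = rng.randint(1, max_length)  # this street is numbered 1..n
        counts.update(int(str(address)[0]) for address in range(1, n + 1))
    return counts

counts = pooled_first_digits()
total = sum(counts.values())
for digit in range(1, 10):
    print(digit, f"{counts[digit] / total:.1%}")
```

Under these assumptions the share of each digit should never increase as the digit grows, with 1 comfortably above the naive 1/9 and 9 below it.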
The more complicated version involves a scaling argument and is a more suitable explanation for the appearance of this phenomenon in measured physical quantities. Lengths, heights and weights of things are usually measured with respect to some reference quantity. In the absence of any other information, one might imagine that the distribution of whatever is being measured possesses some sort of invariance or symmetry with respect to the scale being chosen. In this case the prior distribution p(x) can be taken to have the so-called Jeffreys form, which is uniform in the logarithm, i.e. p(x) is proportional to 1/x. There obviously must be a cut-off at some point, because the integral of 1/x doesn’t converge for large x, but this doesn’t really matter for the sake of this argument. We can suppose anyway that there are many powers of ten involved before this upper limit is reached.
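A quick numerical experiment shows what scale invariance means in practice. The Python sketch below draws numbers whose logarithm is uniform, tallies their first digits, then changes units and tallies again. The six decades of range, the inches-to-centimetres factor of 2.54 and the helper names are all assumptions of mine, chosen only to illustrate the point.

```python
import random
from collections import Counter

def leading_digit(x):
    """Return the first significant digit of a positive number."""
    return int(f"{x:.15e}"[0])  # scientific notation starts with that digit

def digit_fractions(values):
    """Fraction of the values that begin with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(0)
# p(x) proportional to 1/x means log10(x) is uniform; six decades are assumed.
lengths_in_inches = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
lengths_in_cm = [2.54 * x for x in lengths_in_inches]  # change of units

before = digit_fractions(lengths_in_inches)
after = digit_fractions(lengths_in_cm)
for d in range(1, 10):
    print(d, f"{before[d]:.3f}", f"{after[d]:.3f}")
```

The two columns should agree to within sampling noise: changing the unit of measurement shuffles individual first digits around but leaves their overall distribution alone.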
In this case the probability that the first digit is D is just given by the ratio of two terms: in the numerator we have the integral between D and D+1 of p(x) (that’s a measure of how much of the distribution represents numbers starting with the digit D) and in the denominator we have the integral between 1 and 10 of p(x) (the overall measure). The result, if we take p(x) to be proportional to 1/x, is just log(1+1/D), where the logarithm is to the base ten.
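Written out, the calculation takes one line. The constant of proportionality in p(x) cancels in the ratio, and every further decade (10 to 100, 100 to 1000, and so on) contributes in exactly the same proportion, so looking at the single decade from 1 to 10 is enough:

```latex
P(D) = \frac{\int_{D}^{D+1} \frac{dx}{x}}{\int_{1}^{10} \frac{dx}{x}}
     = \frac{\ln(D+1) - \ln D}{\ln 10}
     = \log_{10}\!\left(1 + \frac{1}{D}\right),
\qquad D = 1, 2, \ldots, 9 .
```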
The shape of this distribution is shown in the Figure. Note that about 30% of the first digits are expected to be 1. Of course I have made a number of simplifying assumptions that are unlikely to be exactly true, and the case of the physical constants is complicated by the fact that some are measured and some are defined, but I think this captures the essential reason for the curious behaviour of first digits.
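For anyone who wants the numbers rather than the picture, here is a short Python sketch that tabulates the formula log(1+1/D) for each digit. The function name benford_probability is my own label.

```python
import math

def benford_probability(d):
    """Expected fraction of first digits equal to d when p(x) goes as 1/x."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

This gives about 30.1% for the digit 1, falling to about 4.6% for the digit 9, and the nine values add up to one as they should.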
If nothing else, it provides a valuable lesson that you should be careful in what variables you assume are uniformly distributed!