The First Digit Phenomenon

I thought it would be fun to put up this quirky example of how sometimes things that really ought to be random turn out not to be. It’s also an excuse to mention a strange connection between astronomy and statistics.

The astronomer Simon Newcomb was born in 1835 in Nova Scotia (Canada). He had no real formal education at all, but since there wasn’t much else to do in Nova Scotia, he taught himself mathematics and astronomy and became very adept at performing astronomical calculations with great diligence. He began work in a lowly position at the US Nautical Almanac Office in 1857, and by 1877 he was director. He was professor of Mathematics and Astronomy at Johns Hopkins University from 1884 until 1893 and was made the first ever president of the American Astronomical Society in 1899; he died in 1909.

Newcomb was performing lengthy numerical calculations in an era long before the invention of the pocket calculator or desktop computer. In those days many such calculations, including virtually anything involving multiplication, had to be done using logarithms. The logarithm (to the base ten) of a number x is defined to be the number a such that x = 10^a. To multiply two numbers whose logarithms are a and b respectively involves simply adding the logarithms: 10^a times 10^b = 10^(a+b), which helps a lot because adding is a lot easier than multiplying if you have no calculator. The initial logarithms are simply looked up in a table; to find the answer you use different tables to find the “inverse” logarithm.
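For anyone who has never used log tables, here is a minimal sketch of the procedure in Python. The function name multiply_with_logs and the two sample numbers are my own, chosen purely for illustration, and math.log10 stands in for looking a value up in the printed table.

```python
import math

def multiply_with_logs(x: float, y: float) -> float:
    """Multiply two positive numbers the way a log-table user would."""
    a = math.log10(x)      # "look up" the logarithm of x
    b = math.log10(y)      # "look up" the logarithm of y
    return 10 ** (a + b)   # add, then take the "inverse" logarithm

# Hypothetical example values; any positive numbers would do.
product = multiply_with_logs(27.3, 418.0)
print(product, 27.3 * 418.0)
```

The two printed values agree up to floating-point rounding. A real table would of course limit the precision to four or five figures.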

Newcomb was a heavy user of his book of mathematical tables for this type of calculation, and it became very grubby and worn. But he also noticed that the first pages of the logarithms seemed to have been used much more than the others. This puzzled him greatly. Logarithm tables are presented in order of the first digit of the number required: the first pages therefore contain logarithms for numbers beginning with the digit 1. Newcomb used the tables for a vast range of different calculations of different things. He expected the first digit of the numbers he had to look up to be equally likely to be any of 1 to 9. Shouldn’t they be randomly distributed? Shouldn’t all the pages be equally used?

Once raised, this puzzle faded away until it was re-discovered in 1938 and acquired the name of Benford’s law, or the first digit phenomenon. In virtually any list you can think of – street addresses, city populations, lengths of rivers, and so on – there are more entries beginning with the digit “1” than any other digit.

To give another example, although I admit this one is much harder to explain, in the American Physical Society’s list of fundamental constants, or at least the last version I happened to look at, no less than 40% begin with the digit 1. If you’ve been writing physics examination papers recently like I have, you will notice a similar behaviour. Out of the 16 physical constants listed in the rubric of a physics examination paper lying on my desk right now, 6 begin with the digit 1.

So what is going on?

There is a (relatively) simple answer, and a more complicated one. I’ll take the simple one first.

Consider street numbers in an address book as an example. Any street will be numbered from 1 to N. It doesn’t really matter what N is as long as it is finite (and nobody has ever built an infinitely long street). Now think about the first digits of the addresses. There are 9 possibilities, because we never start an address with 0. On the face of it, we might expect a fraction 1/9 (approximately 11%) of the addresses to start with 1. Suppose N is 200. What fraction actually starts with 1? The answer is more than 50%: everything from 100 to 199, plus 1, and 10 to 19, which makes 111 of the 200 addresses. Very few start with 9: only 9 itself, and 90-99 inclusive. If N is 300 then 1 is still the most common first digit, though it is now tied with 2, and there are no more that start with 9. One only gets close to an equal fraction of each starting digit if the value of N is an exact power of 10, e.g. 1000.
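The counting is easy to check by brute force. This is just a sketch; the helper name first_digit_counts is mine, and it simply reads the first character of each address written out in decimal.

```python
from collections import Counter

def first_digit_counts(n: int) -> Counter:
    """Count how many of the addresses 1..n start with each digit."""
    return Counter(int(str(address)[0]) for address in range(1, n + 1))

for n in (200, 300, 1000):
    counts = first_digit_counts(n)
    shares = {d: round(counts[d] / n, 3) for d in range(1, 10)}
    print(n, shares)
```

For N = 200 there are 111 addresses starting with 1 and only 11 starting with 9. For N = 300 the digits 1 and 2 have 111 each, and for N = 1000 the counts are 112 for the digit 1 and 111 for every other digit.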

Now you can see why pulling numbers out of an address book leads to a distribution of first digits that is not at all uniform. As long as the numbers are being drawn from a collection of streets each of which has a finite upper limit, the result is bound to be biased towards low starting digits. Only if every street contained an exact power of ten addresses would the result be (nearly) uniform. Every other possibility favours low digits, and 1 above all, at the start.
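To see the pooling effect without any randomness, here is a rough sketch that assumes, purely for illustration, an address book containing exactly one street of every length from 1 to 999. That mix of street lengths is my own invention, not real data, and the function name is hypothetical.

```python
from collections import Counter

def pooled_first_digit_shares(max_length: int) -> dict:
    """Pool the addresses of one street of each length 1..max_length
    (a hypothetical mix) and return the share of addresses that
    start with each digit 1..9."""
    pooled = Counter()
    running = Counter()   # first-digit counts for the addresses 1..n
    for n in range(1, max_length + 1):
        running[int(str(n)[0])] += 1   # the street of length n also has address n
        pooled.update(running)         # add every address on that street to the pool
    total = sum(pooled.values())
    return {d: pooled[d] / total for d in range(1, 10)}

shares = pooled_first_digit_shares(999)
for digit, share in shares.items():
    print(digit, f"{share:.3f}")
```

Under this assumption the shares fall steadily from digit 1 down to digit 9, because on every individual street a lower first digit is at least as common as a higher one.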

The more complicated version involves a scaling argument and is a more suitable explanation for the appearance of this phenomenon in measured physical quantities. Lengths, heights and weights of things are usually measured with respect to some reference quantity. In the absence of any other information, one might imagine that the distribution of whatever is being measured possesses some sort of invariance or symmetry with respect to the scale being chosen. In this case the prior distribution p(x) can be taken to have the so-called Jeffreys form, which is uniform in the logarithm, i.e. p(x) is proportional to 1/x. There obviously must be a cut-off at some point, because the integral of 1/x diverges at large x and the distribution could not otherwise be normalised, but this doesn’t really matter for the sake of this argument. We can suppose anyway that there are many powers of ten involved before this upper limit is reached.
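The scale invariance can be illustrated numerically. The sketch below makes several assumptions of my own: the values span exactly six decades (1 to 10^6), they sit on a regular grid in log10(x) as a stand-in for p(x) proportional to 1/x, and the change of units is feet to metres (a factor of 0.3048). The helper names are hypothetical.

```python
import math

def first_digit(x: float) -> int:
    """Leading significant digit of a positive number."""
    return int(f"{x:.15e}"[0])   # scientific notation starts with that digit

def digit_shares(values):
    """Share of values whose first digit is 1, 2, ..., 9 (in that order)."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return [c / len(values) for c in counts[1:]]

DECADES = 6        # assumed range: 1 to 10^6
SAMPLES = 60_000   # grid points, evenly spaced in log10(x)

# Evenly spaced in the logarithm is the discrete analogue of p(x) ~ 1/x.
xs = [10 ** (DECADES * (i + 0.5) / SAMPLES) for i in range(SAMPLES)]

original = digit_shares(xs)
rescaled = digit_shares([0.3048 * x for x in xs])   # e.g. feet -> metres

for d in range(1, 10):
    print(d, f"{original[d - 1]:.3f}", f"{rescaled[d - 1]:.3f}")
```

The two columns agree closely: changing the unit of measurement shifts every value, but it leaves the first-digit shares essentially untouched. Because the assumed range covers a whole number of decades, the cut-off plays no role here.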

In this case the probability that the first digit is D is just given by the ratio of two terms: in the numerator we have the integral between D and D+1 of p(x) (that’s a measure of how much of the distribution represents numbers starting with the digit D) and in the denominator we have the integral between 1 and 10 of p(x) (the overall measure). The result, if we take p(x) to be proportional to 1/x, is just log(1+1/D), where the logarithm is to the base ten.
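Written out for a single decade, and assuming p(x) is proportional to 1/x as above (the constant of proportionality cancels in the ratio), the calculation is:

```latex
P(D) = \frac{\int_{D}^{D+1} \frac{dx}{x}}{\int_{1}^{10} \frac{dx}{x}}
     = \frac{\ln(D+1) - \ln D}{\ln 10}
     = \log_{10}\!\left(1 + \frac{1}{D}\right),
\qquad D = 1, 2, \ldots, 9 .
```

Every other decade (10 to 100, 100 to 1000, and so on) gives the same ratio, which is why the number of decades below the cut-off doesn’t matter.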

[Figure: probability that the first digit is D, for D from 1 to 9]

The shape of this distribution is shown in the Figure. Note that about 30% of the first digits are expected to be 1. Of course I have made a number of simplifying assumptions that are unlikely to be exactly true, and the case of the physical constants is complicated by the fact that some are measured and some are defined, but I think this captures the essential reason for the curious behaviour of first digits.
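For reference, the predicted values are easy to tabulate. This is a small sketch with a function name of my own choosing; it assumes nothing beyond the formula log(1+1/D) to the base ten.

```python
import math

def benford_probability(d: int) -> float:
    """Predicted probability that the first digit is d (1 to 9)."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(d, f"{benford_probability(d):.3f}")
```

The probabilities fall from about 0.301 for the digit 1 to about 0.046 for the digit 9, and they add up to exactly 1 because the sum of the logarithms telescopes to the logarithm of 10.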

If nothing else, it provides a valuable lesson that you should be careful in what variables you assume are uniformly distributed!

9 Responses to “The First Digit Phenomenon”

  1. Anton Garrett Says:

    It’s interesting to consider how the effect varies with the base of numbers used. 100% of numbers begin with a “1” in binary…

    BBC has been doing helpful stats again:

    http://news.bbc.co.uk/1/hi/magazine/7937382.stm

    Anton

  2. This phenomenon is useful for catching out fraudulent behaviour. Faked data that should display this aspect rarely does, as it’s not terribly well known.

  3. telescoper Says:

    Yes, I’ve heard that the US tax people check tax returns using this.

  4. [...] Looking around for other entries on Benford’s Law, I found this nice entry that attributes Benford’s Law to the astronomer Simon Newcomb, instead of Benford (who [...]

  5. [...] the Iranian election Following my previous post where I commented on Roukema’s use of Benford’s Law on the first digits of the counts, I saw on Andrew Gelman’s blog a pointer to a paper in the [...]

  6. [...] First Digits and Electoral Fraud in Iran An interesting issue has arisen recently about the possibility that the counting of the recent hotly contested Iranian election results might have been fraudulent. I mention it here because it involves  Benford’s Law – otherwise known as the First Digit Phenomenon – which I’ve blogged about before. [...]

  7. Thanks for the article. It was the 2nd Google hit when I searched for “distribution of starting digits” in an effort to remember the name of Benford’s Law.

    Regarding the house numbers’ example, the phenomenon is demonstrated by the house number section of any hardware store. Often, the 1’s are sold out while there are plenty of 9’s in the closeout section. BTW, I believe you meant “Only if every street contained an exact *power* of ten addresses” (rather than multiple). The preceding paragraph makes that same point.

  8. Thanks Dan. You’re right. I’ve fixed the error now.

  9. If there is a set of non-manipulated, naturally occurring numbers, the occurrence frequency of digits one through nine as the first digit should follow Benford’s Law.

    Hubble’s data (as used in the infamous “Hubble Diagram”) strays a long way from this “1st Digit Law”, even though both the Distance (values from 0.12Mpc to 10688Mpc) and the Velocity (values from -106km/sec to 1163202km/sec) cover several orders of magnitude and so might be expected to follow Benford’s Law. Why is this?
