## The Neyman-Scott ‘Paradox’

I just came across this interesting little problem recently and thought I'd share it here. It's usually called the 'Neyman-Scott' paradox. Before going on, it's worth mentioning that Elizabeth Scott (the second half of Neyman-Scott) was an astronomer by background; her co-author was Jerzy Neyman. Like many astronomers, she contributed greatly to the development of the field of statistics. Anyway, I think this example provides another good illustration of the superiority of Bayesian methods for estimating parameters, but I'll let you make your own mind up about what's going on.

The problem is fairly technical so I've done a quick version in LaTeX that you can download here, but I've also copied it into this post so you can read it below:

I look forward to receiving Frequentist Flak or Bayesian Benevolence through the comments box below!

November 26, 2016 at 1:51 am

That’s an interesting example of how maximum likelihood can give an incorrect answer in the large sample limit if the number of nuisance parameters increases with the number of observations. See

https://arxiv.org/abs/1301.6278

for a way of recasting the Neyman-Scott model so that maximum likelihood can be used.

November 26, 2016 at 8:13 am

In fairness to frequentists (and once the analysis has been debugged of typos) this isn’t quite comparing like with like. It would be illuminating to compare maximum-likelihood with maximum-posterior in which the posterior is maximised simultaneously wrt the mu’s and sigma, using flat priors for these, in the large-N limit.

The Bayesian secret weapon in this problem is marginalisation, which is a rigorous consequence of the (always correct, always applicable) sum and product rules of probability.
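What marginalisation buys here can be checked numerically. The following is a minimal sketch (all numbers illustrative), assuming the usual Neyman-Scott setup: N pairs x_{i1}, x_{i2} ~ N(mu_i, sigma^2), with true sigma^2 = 4. It compares joint maximisation over all parameters with maximisation after the means are integrated out under flat priors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_true = 1000, 2.0                    # sigma^2 = 4
mu = rng.normal(0.0, 10.0, size=N)           # one nuisance mean per pair
x1 = rng.normal(mu, sigma_true)
x2 = rng.normal(mu, sigma_true)
d = x1 - x2

# Joint maximisation over (mu_1..mu_N, sigma): each mu_i_hat = (x1+x2)/2,
# leaving sigma2_hat = mean((x - mu_hat)^2), which converges to sigma^2/2.
sigma2_joint = np.mean((d / 2) ** 2)

# Marginalisation: a flat prior integrates each mu_i out analytically,
# leaving log p(data | sigma) = -N log sigma - sum(d^2)/(4 sigma^2) + const.
sigmas = np.linspace(0.5, 4.0, 2000)
log_marg = -N * np.log(sigmas) - np.sum(d ** 2) / (4 * sigmas ** 2)
sigma_marg = sigmas[np.argmax(log_marg)]

print(sigma2_joint)     # close to 2.0 = sigma^2/2 (inconsistent)
print(sigma_marg ** 2)  # close to 4.0 = sigma^2   (consistent)
```

The joint maximiser lands on half the true variance no matter how large N is, while the marginalised answer concentrates on the truth.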

November 26, 2016 at 10:41 am

I put a couple of deliberate mistakes in to check who was paying attention.

November 26, 2016 at 2:28 pm

I don’t think that the comparison you propose would be very revealing. Maximum-posterior with uniform prior will always be the same as maximum-likelihood (since, by Bayes’s Theorem, the quantities being maximized are proportional).

In any case, from a Bayesian point of view, marginalizing, not maximizing, is the uniquely correct way to answer the question “what do I know about the variance?” The way to make the comparison fair is to try to figure out what a frequentist would actually use as an estimator in this situation. I don’t know the answer to that, but it’s certainly not the MLE.

The situation described in this paradox, although somewhat artificial, is essentially the same as the very common situation of estimating the variance from a set of measurements of unknown mean, which is worked out in every statistics textbook. If you have M iid normal random variables, and you find the values of mean and variance that jointly maximize the likelihood, you find that the estimated mean mu is the arithmetic mean of the measurements and the estimated variance is the mean of (x-mu)^2. This estimator is biased: on average it’s too low by a factor of M/(M-1). That’s the reason people always divide by M-1 instead of M when estimating the variance in this situation. In fact, dividing by M-1 instead of M seems to be the only statistical fact known by many physicists!

This paradox is just an example of this, with M = 2 and many repetitions of the experiment to make the paradoxical result appear sharper. Since frequentists are willing to correct for the bias in the standard case, presumably they’re willing to correct for it in this case, so a hypothetical frequentist would not naively use the biased MLE.
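The bias in the standard case is easy to verify by simulation. A quick sketch (sample sizes and seed arbitrary): average the variance MLE over many repetitions and check that it comes out at (M-1)/M times the true value, so that M = 2 halves it:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_true = 4.0
trials = 100000   # repetitions of the whole experiment

avg = {}
for M in (2, 5, 50):
    x = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, M))
    mu_hat = x.mean(axis=1, keepdims=True)          # per-experiment sample mean
    sigma2_mle = ((x - mu_hat) ** 2).mean(axis=1)   # MLE: divide by M, not M-1
    avg[M] = sigma2_mle.mean()                      # approximates E[sigma2_mle]
    print(M, avg[M], sigma2_true * (M - 1) / M)
```

The two printed numbers agree in each row: the expected MLE is (M-1)/M times the truth, and dividing by M-1 instead of M removes the bias exactly.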

As for what a frequentist would do, I don’t know. And that’s part of the problem with the frequentist approach. From a Bayesian point of view, there’s one right answer to the question “what do I know about the variance?”, namely, the posterior probability distribution. A frequentist has to make a choice of estimator, and that choice seems to be as much art as science.

November 26, 2016 at 3:04 pm

Yes, lazy of me not to see that about max-posterior, given flat priors. In my house you’d be preaching to the converted!

November 26, 2016 at 2:01 pm

While I cannot but concur with your conclusion, I rather dislike this example, which, like most paradoxes, is too artificial for my taste. (I am also wary of MAP and MMAP estimators, if only because of the impact of the dominating measure.) The choice of the prior has an impact on whether or not the difficulty vanishes; in particular, resorting to the Jeffreys family of non-informative priors will produce worse behaviour.

November 26, 2016 at 3:01 pm

We aren’t free to choose priors arbitrarily; it’s just that the principles of systematically assigning them from the prior information aren’t fully discovered yet.

It is a rather artificial example, but you never know when the world will throw an artificial example at you. Only Bayesian techniques are guaranteed to work correctly in all circumstances.

November 26, 2016 at 4:11 pm

I am afraid I disagree with the last sentence: when given the Neyman-Scott problem and only the problem, with no prior information, there is no reason any Bayesian analysis will produce a sensible output. Witness Jeffreys’ prior.

November 26, 2016 at 4:59 pm

Do you mean using a Jeffreys prior on the means?

November 26, 2016 at 5:04 pm

I would think the default choice being the Jeffreys prior on everything.

November 26, 2016 at 5:06 pm

Depends on the problem of course, but a uniform prior on location parameters is not unusual. The Jeffreys prior is more often used for scale parameters.

November 26, 2016 at 5:05 pm

As I said, we should choose priors according to the prior information, not arbitrarily.

November 26, 2016 at 5:25 pm

I don’t think it is all that artificial, actually. I can think of many situations in physics where you need to estimate a variance in the presence of an unknown offset. In astronomy the offset could be a sky background…

November 27, 2016 at 2:13 pm

A few comments:

1. The MLE is not the “frequentist solution.” The frequentist solution is whatever estimator has good frequentist properties. In this case, it is 2 times the MLE.

2. The idea that Bayesian inference will always return reasonable answers is incorrect. There are many examples where Bayesian inference yields poor estimators. Examples include: the Robins-Ritov paradox, Stone’s paradox, sampling to a foregone conclusion, etc.

I review many of these on my (now defunct) blog:

https://normaldeviate.wordpress.com/

November 27, 2016 at 3:48 pm

“The frequentist solution is whatever estimator has good frequentist properties.” That’s just a disguised way of saying that you choose the solution according to what you think looks good. The trouble is that this is not an objective criterion: your view of what looks good might differ from that of another frequentist, who prefers a different estimator. Frequentism gives no way of deciding who is right.

Genuine Bayesian inference knows nothing of estimators. Even to speak of a “Bayesian estimator” is to have gone wrong. The only decent approach is to consider a quantity p(A|B) (I’ve not yet called it a probability!) which is interpreted as a numerical measure of how strongly proposition B implies proposition A. [More formally: how strongly binary proposition A is implied to be true upon supposing the truth of binary proposition B, according to the relations known between their referents.] The Boolean calculus of propositions, which are the arguments of the p’s, now implies a calculus for the p’s. This calculus turns out to be (upon solving a few functional equations) the sum and product rules. On that basis I am happy to call the p’s probabilities, although if anybody disagrees I say simply that how strongly one proposition implies another is what you actually want in every problem involving uncertainties, and it obeys the laws of probability.

Bayes’ theorem is an immediate mathematical consequence of the sum and product rules. If it gives what you think is a wrong answer then either your intuition needs educating or you have failed to incorporate, in the calculation, some relevant information which your intuition is taking into account.

November 27, 2016 at 4:05 pm

Not “looks good”: we choose the optimal estimator, i.e. either the minimax estimator, or locally minimax, etc. Read about Le Cam theory. These are objective and well-defined procedures, not arbitrary fixes.

Before you dismiss the well-known criticisms of Bayesian inference, you should have a look at the literature. There are plenty of well-studied examples where the posterior concentrates nowhere near the true value. You might not be disturbed by this, but most statisticians are (for good reason).

I am not arguing against using Bayesian inference. But there are well-known situations where there are serious problems. Everyone who uses Bayesian methods should at least be familiar with these. Statisticians have studied these problems in great detail for 100 years. There are reasons why there are concerns about Bayes in some cases.

November 27, 2016 at 4:36 pm

What makes you think I haven’t familiarised myself with both sides of this argument?

I’m not surprised that you can find things wrong with “Bayesian estimators” because correct Bayesianism, as I said, does not involve the use of estimators. Any question to which you want the answer is phrased as a proposition (often of the form “the parameter has a value between z and z+dz”) and put into the formalism. I have outlined how anything inequivalent to strict Bayesianism is equivalent to a violation of Boolean algebra. Don’t you accept Boolean algebra?

Suppose we are estimating a parameter from observations, but we actually know the answer in advance and are doing the sampling only because our boss has told us to. Our prior would be a (discrete) δ-function at the answer. In Bayes’ theorem a δ-fn prior carries through unchanged to the posterior, because the posterior contains the prior as a factor. The variety of sampling-theoretical frequentist methods designed to ‘let the data speak for themselves’ while ignoring the prior info give impossible answers having nonzero probability elsewhere than at the δ-fn. These methods are therefore WRONG. Don’t trust any method in a general problem that fails in a simple problem!
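A toy version of the δ-function argument, on a discrete parameter space (the parameter values and the data point are chosen arbitrarily for illustration):

```python
import numpy as np

# Discrete parameter space for a mean theta; the prior is a (discrete)
# delta function at theta = 2, because we know the answer in advance.
thetas = np.array([0.0, 1.0, 2.0, 3.0])
prior = np.array([0.0, 0.0, 1.0, 0.0])

# One observation x ~ N(theta, 1), landing well away from theta = 2:
x = 0.3
likelihood = np.exp(-0.5 * (x - thetas) ** 2)

# Posterior = prior * likelihood, renormalised.
# The prior is a factor, so the delta function survives any data.
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)  # [0. 0. 1. 0.]
```

However surprising the data, the posterior remains concentrated at the known value, exactly as the argument above requires; any method that spreads probability elsewhere is ignoring the prior information.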

November 27, 2016 at 5:32 pm

“What makes you think I haven’t familiarised myself with both sides of this argument?”

That was my impression, based on the fact that you did not respond to my points about Stone’s paradox, the Robins-Ritov paradox, etc. But perhaps I am wrong. If so, I apologize.

Anyway, I was just trying to say that there are decades of technical research in statistics about this that hasn’t been mentioned.

If I, as a statistician, posted an opinion about string theory, a physicist might comment that there are decades of papers that I should mention in the discussion (and why I think they are wrong). And I would say: thanks for pointing that out. I thought it would be useful for a statistician to point out that there is a long technical literature that is relevant here. I was just trying to be helpful.

November 27, 2016 at 11:29 pm

I’m not interested in arguments from authority, but in case you are: I have spent 25 years working on Bayesian probability theory and its applications in physical science, including critical reading of all the schools of probability, and I am now writing a book on the subject. The tightest distillation of my viewpoint, and of what is wrong with methods not isomorphic to the sum and product rules, was set out above under my name. I accept in full that the method of assigning a probability distribution from propositions assumed to be true (‘prior’ distributions, if these are to be combined with data via Bayes’ theorem) is not a completed science and warrants further research – but that does not warrant unbacked criticism. I’d add that the term “Bayesian” (like ‘probability’) has unfortunately come to mean different things to different people; the view that I have outlined is, I believe, uniquely correct and unflawed, but anybody who advocates “Bayesian estimators” has already deviated far from it.

November 28, 2016 at 5:29 am

The mean parameters are nonidentifiable, which violates the conditions for likelihood inference.

If you reparameterise in terms of the differences of the observations you can obtain an identifiable model for the standard deviation. Now the MLE gives the correct answer. As pointed out by e.g. Xian, your Bayes method gets lucky but isn’t really trustworthy here.

November 28, 2016 at 10:11 am

Funny how often it’s lucky… a fact that frequentists above all should appreciate.

November 28, 2016 at 12:00 pm

Yes, what are the chances of the laws of probability giving the right answer?

November 28, 2016 at 6:26 pm

As in my comment: it depends on the question.

November 28, 2016 at 7:08 pm

Please outline in your own words a situation in which you believe that Bayesian methods, in the sense I have defined them above, fail.

November 28, 2016 at 9:17 pm

Bayes doesn’t reliably solve the Neyman-Scott problem.

Note that no matter how much data you have, the model structure is such that you only ever have two observations per mean parameter. Hence your prior does not ‘wash out’ in the infinite data limit and your answer depends explicitly on which prior you use no matter how much data you collect.

As Xian points out there are ‘reasonable’ prior choices which guarantee the wrong answer no matter how much data you collect.

That is the sense in which I define failure above. There are other ways in which I believe the dogma that ‘probability theory = logic of science’ fails, but that’s beside the point here.
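To make the prior sensitivity concrete, here is a sketch under my assumptions (simulated data; and, if I have computed it right, the joint Jeffreys prior for (mu_1, …, mu_N, sigma) in this model is proportional to sigma^-(N+1)). Even with a large sample, that prior’s extra powers of 1/sigma never wash out:

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma_true = 5000, 2.0                 # sigma^2 = 4, large sample
mu = rng.normal(0.0, 10.0, size=N)
d = rng.normal(mu, sigma_true) - rng.normal(mu, sigma_true)
S = np.sum(d ** 2)

sigmas = np.linspace(0.5, 4.0, 4000)
# Flat priors on everything: log marginal posterior for sigma.
flat = -N * np.log(sigmas) - S / (4 * sigmas ** 2)
# Joint Jeffreys prior ~ sigma^-(N+1): N+1 extra powers of 1/sigma,
# one per nuisance mean, so they grow with the data rather than washing out.
jeff = -(2 * N + 1) * np.log(sigmas) - S / (4 * sigmas ** 2)

print(sigmas[np.argmax(flat)] ** 2)  # close to 4.0, the true variance
print(sigmas[np.argmax(jeff)] ** 2)  # close to 2.0 = sigma^2/2, for any N
```

Two priors, both arguably “non-informative”, and one of them is inconsistent no matter how much data you collect.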

November 28, 2016 at 9:29 pm

If you don’t use a flat prior in the Neyman-Scott problem then that must reflect something you know about the means. It is entirely right that this should be reflected in the answer. How would you encode the same knowledge in a frequentist analysis?

November 28, 2016 at 9:37 pm

I disagree. See again the replies by Xian (and to some extent Wasserman).

The problem as formulated is ill-posed for jointly estimating means and the standard deviation reliably. Luckily you can transform the parameters to obtain a well-posed problem where the standard deviation parameter is identifiable and independent of the means. In this case the MLE is correct for the standard deviation and doesn’t rely on unreliable estimates of the means based on two data points and/or additional priors.
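The reparameterisation is easy to exhibit (a sketch, with simulated data): the pairwise sums carry all the information about the nuisance means, while the differences are free of them, and the MLE based on the differences alone is consistent.

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma_true = 2000, 1.5                 # sigma^2 = 2.25
mu = rng.normal(0.0, 5.0, size=N)
x1 = rng.normal(mu, sigma_true)
x2 = rng.normal(mu, sigma_true)

# Reparameterise each pair:
#   s_i = (x1 + x2) / 2  ~ N(mu_i, sigma^2 / 2)  -- estimates the means
#   d_i =  x1 - x2       ~ N(0,    2 sigma^2)    -- free of the means
s = (x1 + x2) / 2
d = x1 - x2

# MLE of sigma^2 from the identifiable difference model alone:
sigma2_hat = np.mean(d ** 2) / 2
print(sigma2_hat)  # close to 2.25 = sigma^2, consistent as N grows
```

No priors on the means are needed, because the difference model does not contain them.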

November 28, 2016 at 10:08 pm

But what if you have prior information about the means? What would you do with it?

November 28, 2016 at 10:34 pm

My first priority would always be to set up a well-posed problem so that Bayes or Likelihood or whatever had a chance of being reliable.

I have no problem using Bayes if applicable, but it is not unproblematically applicable in this case. You are expecting magic where there is none.

Why don’t you respecify the model in terms of the sums and differences of the observations and see what happens? If you want to include prior information after this re-specification then have at it. Unfortunately it won’t demonstrate your original point any more.

November 29, 2016 at 6:52 am

I’m afraid I don’t find the argument

November 29, 2016 at 7:32 am

Your loss I guess (I imagine you’d say the same to me). Last question – what is your estimate for one of the means as the sample size goes to infinity? What is the posterior variance for one of the means?

November 29, 2016 at 7:33 am

I’m afraid I don’t find the argument

November 29, 2016 at 7:40 am

My previous answer was accidentally truncated. I meant to say I don’t find the argument – that you can choose a prior that gives a different, possibly nonsensical answer – convincing at all.

I am surprised that you seem to think the laws of probability are “magic”, but reading your other comments perhaps I shouldn’t be.

I note your refusal to answer my question. My answer to yours is that I don’t care about the distribution of the means, which is why I marginalised over them.

November 29, 2016 at 7:47 am

I did answer your question.

What if I care about the means? What can you tell me about them?

Most statisticians, Bayesian or Frequentist or whatever, recognise that the problem is incorrectly formulated.

Another question – how many parameters does the problem have?

My definition of magic is the ability to take garbage in and give a reliable answer out. Probability theory can’t do that so it isn’t magic. It is useful within a well-defined domain.

Your pseudo-rational pose doesn’t fix that.

November 29, 2016 at 7:53 am

Sigh. My point about priors is *precisely* that garbage in is garbage out…

November 29, 2016 at 7:55 am

Sigh too. Some people just don’t want to think.

November 29, 2016 at 7:56 am

That does not merit a response.

November 29, 2016 at 7:58 am

Technically that is a response 😉