A Virus Testing Probability Puzzle
Here is a topical puzzle for you.
A test is designed to show whether or not a person is carrying a particular virus.
The test has only two possible outcomes, positive or negative.
If the person is carrying the virus the test has a 95% probability of giving a positive result.
If the person is not carrying the virus the test has a 95% probability of giving a negative result.
A given individual, selected at random, is tested and obtains a positive result. What is the probability that they are carrying the virus?
Update 1: the comments so far have correctly established that the answer is not what you might naively think (ie 95%) and that it depends on the fraction of people in the population actually carrying the virus. Suppose this is f. Now what is the answer?
Update 2: OK so we now have the probability for a fixed value of f. Suppose we know nothing about f in advance. Can we still answer the question?
Answers and/or comments through the comments box please.
April 13, 2020 at 11:56 pm
Surely there is insufficient information, since we do not know the prior probability that the patient has the virus (i.e., we don’t know how common the virus is in the general population).
April 14, 2020 at 10:46 am
Yes, it’s a nice illustration. Suppose the true prevalence of the virus is 1% of the population, then 95% of that 1% get a correct positive result (so about 1%), but about 5% of the 99% not infected also get a (false) positive result, so about 5%. The result would be 6% testing positive….
So in this case the probability that any one of the positive-testing subjects is actually infected, is about 1 in 6.
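The counting argument above can be sketched in a few lines of Python. The cohort size of 10,000 is an assumption chosen only to make the counts come out as whole numbers, and the function name is made up for the illustration; the 1% prevalence and 95% figures are the ones used in the comment.

```python
def count_positives(population=10_000, prevalence_pct=1, accuracy_pct=95):
    """Split a hypothetical cohort into true and false positives (whole-number counts)."""
    infected = population * prevalence_pct // 100
    uninfected = population - infected
    true_pos = infected * accuracy_pct // 100             # infected and testing positive
    false_pos = uninfected * (100 - accuracy_pct) // 100  # not infected but testing positive
    return true_pos, false_pos


true_pos, false_pos = count_positives()
share = true_pos / (true_pos + false_pos)
print(f"true positives:  {true_pos}")
print(f"false positives: {false_pos}")
print(f"fraction of positives actually infected: {share:.3f}")
```

Without the rounding to "about 1%" and "about 5%", the fraction comes out a little under 1 in 6.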
April 14, 2020 at 6:25 pm
We do at least know that f is between 0 and 1. Using this and no other information I think we can give an answer that doesn’t depend on f…
April 14, 2020 at 12:28 am
Depends on fraction with virus? If everyone has virus then your requested probability is 1. If no-one has virus then your probability=0.
Regretting writing this already!
April 14, 2020 at 12:56 am
Well, yes, as Tom says, a Bayesian problem depends on your prior.
April 14, 2020 at 2:26 am
Isn’t the answer just f? This is the total of 0.95f (from the 95% of positives that would be detected by the test) and 0.05f (from the 5% of negative tests that are really positive).
April 14, 2020 at 12:26 pm
No.
April 14, 2020 at 2:38 am
(apologies if this is a duplicate)
April 14, 2020 at 12:26 pm
I don’t think it’s a duplicate but it is definitely a graph with no labels on the axes.
April 14, 2020 at 1:54 pm
Argh. Stupid graphic formats. I’ve tried again; this time, axes and background and grid and all that should appear.
April 14, 2020 at 3:37 am
Let + indicate a positive test, and v indicate having the virus (and !v indicates not having the virus). We therefore want
P(v|+) = P(+|v) * f / (P(+|v) * f + P(+|!v) * (1-f))
= 0.95*f / (0.95*f + 0.05*(1-f))
= 0.95*f / (0.05 + 0.9*f)
For f = 0.1%, P(v|+) is only 1.9%
f = 1%, P(v|+) = 16.1%
f = 10%, P(v|+) = 67.9%
(Unless I’ve made an arithmetical error somewhere along the way.)
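For anyone who wants to check the arithmetic, here is a minimal sketch of the same formula in Python. The function name and the parameter names `sensitivity` and `specificity` are labels chosen for this illustration, standing for P(+|v) and P(-|!v); both are 0.95 in the puzzle.

```python
def posterior(f, sensitivity=0.95, specificity=0.95):
    """P(v|+) from Bayes' theorem, for a fraction f of the population carrying the virus."""
    true_pos = sensitivity * f                    # P(+|v) * f
    false_pos = (1.0 - specificity) * (1.0 - f)   # P(+|!v) * (1 - f)
    return true_pos / (true_pos + false_pos)


for f in (0.001, 0.01, 0.1):
    print(f"f = {f:.1%}: P(v|+) = {posterior(f):.1%}")
```

Changing the two default arguments shows how the result shifts for a test with different error rates.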
April 14, 2020 at 3:40 am
…the point being that even though the test looks good at first glance, the probability that a positive test is correct about the patient having the virus is quite small if only a small fraction of the tested population is expected to have the virus.
April 14, 2020 at 3:24 pm
I think you mean 0.05 on the second term in the denominator.
The answer is equal to 0.95 if and only if f=0.5.
April 14, 2020 at 6:28 pm
I still think you can give a sensible, Bayesian, answer to the original question that doesn’t depend on f.
April 14, 2020 at 9:02 pm
“Our weapon is our knowledge. But remember, it may be a knowledge we may not know that we possess.”
Agatha Christie, The A.B.C. Murders
April 14, 2020 at 10:32 pm
I think you need to show your working!
April 15, 2020 at 11:54 am
Well you need p(v|+) so you start with p(v|+, f) and use a uniform prior p(f) to create p(v, f|+) then marginalise over f to give p(v|+).
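As a rough numerical check of that recipe, the sketch below averages the fixed-f result 0.95*f / (0.05 + 0.9*f) over f between 0 and 1 with equal weight, which is what the uniform p(f) amounts to in this calculation. The midpoint rule and the number of steps are arbitrary choices made for the illustration; the closed form it is compared with is the one quoted elsewhere in this thread.

```python
import math


def posterior(f):
    """P(v|+, f) for the test in the puzzle (95% correct either way)."""
    return 0.95 * f / (0.95 * f + 0.05 * (1.0 - f))


def flat_prior_average(n=10_000):
    """Midpoint-rule average of posterior(f) over f in [0, 1] with uniform weight."""
    h = 1.0 / n
    return sum(posterior((i + 0.5) * h) for i in range(n)) * h


numeric = flat_prior_average()
closed_form = (19.0 / 324.0) * (18.0 - math.log(19.0))
print(f"numerical average: {numeric:.4f}")
print(f"closed form:       {closed_form:.4f}")
```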
April 14, 2020 at 10:25 pm
OK I actually got the D Henley solution before I posted last night – honest! But I didn’t get the last Bayesian step.
I am still wondering whether anyone else should take that last step either. In the end any prior is a subjective guess, including in this case a flat prior. So how do we judge whether 88% is reasonable if the prior is not based on any repeat trials? So this example is also nice because it brings out the fundamental problem with Bayesian statistics!
There – I’ve walked into the trap – now awaiting tutorials on Bayesian evidence etc!
April 14, 2020 at 10:30 pm
It’s not a problem with Bayesian reasoning, it’s a strength!
(And it’s reasonable precisely because it’s Bayesian!)
Of course if you have more information about f (say from samples) you can feed that into the prior. That’s actually difficult to do with Covid-19 because the testing is done in a complicated way: it’s certainly not a random subset of the population that is tested.
April 14, 2020 at 10:45 pm
“(And it’s reasonable precisely because it’s Bayesian!)”
Seems to depend on a strong prior that the Bayesian approach is reasonable!
I rest my case!
April 14, 2020 at 10:49 pm
What I meant is that Bayesian inference is *proven* to be the unique way of dealing with these problems in a logically consistent manner. If the result depends strongly on the prior then that tells you that you need more data.
April 14, 2020 at 10:46 pm
It reminds me of the problem of tossing a coin.
Suppose you know nothing about the coin. What is the probability that if you toss it you will get Heads?
Surely you would say 0.5.
Now you toss it 10 times and get Heads every time. What is the probability that you would get Heads on the 11th toss?
I think you shouldn’t say 0.5 because the previous throws give you information that suggests it might not be a fair coin…
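One way to put numbers on this, as a sketch only: assume (and this is an assumption added here, not part of the puzzle) that "knowing nothing about the coin" means a uniform prior on its probability p of landing Heads. The probability of Heads on the next toss is then a ratio of two integrals of the form p^a (1-p)^b over 0 to 1, each of which equals a! b! / (a+b+1)! for whole numbers a and b. The function names are made up for the illustration.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of p**a * (1 - p)**b over p in [0, 1], for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def prob_next_head(heads, tosses):
    """P(next toss is Heads | data), assuming a uniform prior on the coin's bias p."""
    tails = tosses - heads
    return beta_integral(heads + 1, tails) / beta_integral(heads, tails)


print("before any tosses:   ", prob_next_head(0, 0))
print("after 10 Heads in 10:", prob_next_head(10, 10))
```

Under that assumed prior the answer starts at 1/2 and moves to 11/12 after ten Heads in a row; a prior concentrated near p = 1/2 would move far less.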
April 14, 2020 at 10:51 pm
I recall ruining a holiday in Rio arguing remotely with Anton Garrett on your blog about this topic so may I refer you to that!
Meanwhile best wishes to you and Philip!
April 15, 2020 at 3:11 pm
What you want in any problem involving uncertainty is a number representing how strongly the things you think you know (ie, binary propositions which you take to be true) imply the truth of the binary proposition you are interested in. Let’s call this quantity the implicability, and it has two arguments: the proposition you are interested in, and the ‘conditioning’ proposition that you take to be true.
Of course these may be compound propositions, and you can also break down continuous spaces into propositions such as “the physical quantity measured lies in the range (x,x+dx)”.
Since implicability has binary propositions as its arguments, the calculus of binary propositions, ie Boolean logic, implies a calculus for implicabilities.
This calculus turns out to be the sum and product rules. The pioneer derivation of them in this way was by RT Cox in 1946, and it has since been improved by myself and then Kevin Knuth.
As implicability obeys the “laws of probability”, and as it is what you want in any problem involving uncertainty, I reckon it deserves to be called “probability”. But if anybody takes exception to this definition, I am not going to fight. I am just going to say “you define what you want; I’ll calculate my implicability and solve the problem”.
All of this has nothing to do with the question of priors. Priors enter when we get data with which we wish to update our implicability by incorporating that data in the conditioning information as a logical product together with our prior information. This is unavoidable – and a strength of Bayesian analysis, not a weakness. Suppose you are certain before doing a noisy experiment what the value of a parameter is, and you are doing the experiment only because your boss has commanded you to. Then your prior is a delta-function at the value you know and zero elsewhere. Bayes’ theorem, which follows from the sum and product rule, tells you that the posterior is equal to a normalised product of the prior and likelihood. So the posterior is a delta-function at that value: exactly as intuition demands. Use sampling-theoretical methods beloved of frequentists, though, and you get answers strewing probability where you know it can’t be! Never use a method in a complex problem that fails in a simple one.
So the use of prior information is a strength, not a weakness. If you find you need it, that is a flag. As to what prior to assign in any problem, this should be viewed as a matter for research rather than for denigrating Bayesianism (although I will concede that even of the word ‘Bayesianism’ there are too many meanings…)
Don’t wreck any holiday for this stuff, especially in Rio!
April 15, 2020 at 3:20 pm
If the answer depends strongly on the choice of prior that’s not in itself a problem: it just means that you need more data. Sometimes it’s good to be shown how little you know given the information available.
April 15, 2020 at 6:35 pm
Hi Anton, in Durham not Rio this time!
So at risk of devaluing the debate, do you agree with Peter that the correct answer to the above problem is 0.883?
April 15, 2020 at 11:23 pm
I would write the answer in terms of the prior for f, as Peter didn’t specify what “the probability” was to be conditioned upon. Give me some prior info re f and I’ll do my best to translate it into a prior and then marginalise over f to get the posterior.
April 16, 2020 at 9:02 am
Let me add some more about the prior. “Complete ignorance” would mean that we didn’t even understand the phrase “have a virus” and that viruses are contagious – complete ignorance would mean assigning each individual 50:50 of “having the virus” or not. (I don’t go with a log prior on f because the prior for complete ignorance should be invariant under f –> 1-f.) Once you know that viruses are contagious, you have to average over *how* contagious – unless you know how contagious, and then you need stats for the distribution of household sizes. And so it goes…
April 14, 2020 at 10:46 pm
How do you know the false positive and false negative accuracies without knowing something about f? It’s a bootstrapping problem, isn’t it? Agree with Philip’s comment above about flat prior.
April 14, 2020 at 10:51 pm
In this simplified problem you know because I told you! They might be determined in a controlled laboratory setting without reference to an infected population. But I agree that in a real situation, like the one we find ourselves in now, you could not justify using a flat prior.
April 16, 2020 at 10:25 am
Anton, I agree that you have to specify the prior for f. But if we assume no knowledge about f do we gain anything meaningful by then assuming a flat prior as is often done in these “no knowledge” situations?
April 16, 2020 at 12:15 pm
What do you mean by “no knowledge”, please? Even knowing the meaning of a variable is knowledge, as I suggested in one post.
April 16, 2020 at 12:30 pm
Anton, To quote Peter “Update 2: OK so we now have the probability for a fixed value of f. Suppose we know nothing about f in advance. Can we still answer the question?”
Would you answer “yes” or “no”?
April 16, 2020 at 3:10 pm
I’d like to ask Peter what *he* means by “no knowledge”, because even knowing that it is a virus conveyed from person to person is significant relevant prior information; but if you absolutely demand a Yes or No then I’d assign 50% probability to any individual having it and then treat it as an urn problem, which does mean that you can derive a unique numerical answer.
April 16, 2020 at 3:35 pm
What I meant was just to make it entirely abstract, with + and – being two outcomes of a test and ‘v’ and ‘not v’ just being two mutually exclusive states of the thing being tested.
April 14, 2020 at 11:28 pm
(19/324)(18-ln19) = 0.883, as Philip said. This is for a flat prior on the population infected fraction, f. A more committed Bayesian than me would probably say that a number known to lie between 0 and 1 should take a double Jeffreys prior p(f) propto 1 / [ f(1-f) ], but then all integrals of interest diverge and you have to cut the range to between a and 1-b where a and b are small. Then you can get any answer you like depending on how you take the limits of a and b tending to zero. But I don’t think this prior is reasonable here: it amounts to saying that f is either almost exactly 0 or 1, and it’s clear we wouldn’t be worrying about this problem if that was the case. So this is almost the anthropic principle applied to prior choice….
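The sensitivity to the cut-offs can be illustrated numerically. The sketch below averages the fixed-f result 0.95*f / (0.05 + 0.9*f) with weight 1 / [ f(1-f) ] over a < f < 1-b, in the same way the flat-prior number is obtained. It uses the substitution u = ln[ f/(1-f) ], under which this weight becomes flat in u. The particular values of a and b, the step count and the function names are arbitrary choices for the illustration.

```python
import math


def posterior(f):
    """P(v|+, f) for the test in the puzzle (95% correct either way)."""
    return 0.95 * f / (0.95 * f + 0.05 * (1.0 - f))


def truncated_average(a, b, n=20_000):
    """Average of posterior(f) with weight 1/[f(1-f)] restricted to a < f < 1-b.

    In the variable u = ln(f/(1-f)) the weight is uniform, so a plain
    midpoint rule in u is enough.
    """
    lo = math.log(a / (1.0 - a))
    hi = math.log((1.0 - b) / b)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * h
        f = 1.0 / (1.0 + math.exp(-u))  # invert the substitution
        total += posterior(f)
    return total / n


for a, b in [(1e-6, 1e-6), (1e-12, 1e-6), (1e-6, 1e-12)]:
    print(f"a = {a:.0e}, b = {b:.0e}: {truncated_average(a, b):.3f}")
```

Shrinking a faster than b pulls the result down, and shrinking b faster than a pushes it up, so the limit depends on how it is taken.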
April 15, 2020 at 2:35 pm
Can’t resist! So who agrees with Peter that the correct answer to the original problem is 0.883?
April 15, 2020 at 8:08 pm
I am a bit late to this discussion, due to all the on-line meetings. I can see the reason behind a flat prior but it seems a bit of a cop-out. As we know that a disease advances exponentially, wouldn’t a log distribution make more sense? A flat distribution is OK for large f but here it is unlikely that f exceeds 0.1 (not impossible, but not likely) because that phase doesn’t last very long. Is there a way to assign an uncertainty to Tom’s preferred answer of 0.883?
April 15, 2020 at 11:29 pm
As Philip says, there’s no doubt that the number 0.883 is correct. More on Bayesian philosophy now. Peter framed the problem as though some additional benefit accrues from the last Bayesian step, when a flat prior on the fraction of the population with the virus is applied. But does it?
April 16, 2020 at 2:33 pm
To put it bluntly: if you can’t assign an uncertainty, then your number has little validity. In that case the answer is no, you can’t get the answer without knowing something about f.
(And the rate of false positives may also be poorly known if you don’t know f.)
When in doubt, get better data.
April 16, 2020 at 2:34 pm
The rate of false positives is given in the question.
April 16, 2020 at 2:36 pm
OK. I toss a coin that looks fair. What’s the probability of it being Heads?
I think the answer is 1/2.
What’s the uncertainty on that answer?
April 16, 2020 at 3:09 pm
Indeed, the rate of false positives is given. But if it is known, it has been measured, and it is hard to measure it realistically if you don’t know f. Although I expect you have a Bayesian answer to that! As for the flat prior in f, if this is an infectious disease, it should progress as a logistic function, and it should have a finite recovery time. You didn’t specify that this was about an infectious disease, but in that case your solution is not really applicable to the current epidemic.
April 16, 2020 at 3:36 pm
As I said in the thread above, the test outcomes could in principle be quantified using controlled experiments in a laboratory rather than by sampling a population.
April 16, 2020 at 3:49 pm
I didn’t say it was!
April 16, 2020 at 12:25 pm
All I wanted from this was to draw attention to the fact that the answer wasn’t simply 0.95!
The arguments about Bayes have been an unexpected bonus!
April 16, 2020 at 12:32 pm
It’s beginning to feel like Rio! Is that a song?
April 16, 2020 at 12:34 pm
I should make people pay to post comments!
April 16, 2020 at 4:35 pm
To conclude(!), one can assume a flat prior but that is no less subjective than any other prior with our level of knowledge in this example. So looking for extra information on the probability of having the virus if testing positive from this Bayesian route is fruitless.
But the problem does bring out the main issue with Bayesian inference – the potential subjectivity of even a flat prior.
April 16, 2020 at 4:42 pm
Well done for packing so much non sequitur into a single comment!
April 16, 2020 at 4:49 pm
Hopefully “non sequitur” here means no need for follow-up questions!
April 17, 2020 at 11:27 pm
If I calculated correctly, the Jeffreys prior in this case is constant, where the aim is to choose the least-informative prior, so the inference comes as far as possible from the data rather than the prior. To an Objective Bayesian, this would justify the choice of a uniform prior. But it’s late…
April 17, 2020 at 11:30 pm
It’s a pity Jeffrey Sprior hasn’t himself commented yet.