A Probabilistic View on COVID Testing Using Bayes’ Theorem

COVID testing is a critical tool as we work to reduce the spread of SARS-CoV-2. However, testing is also misunderstood. An example that Professor Guikema used in class recently illustrates this. Does receiving a positive COVID test mean you certainly have COVID? No. Does receiving a negative COVID test mean you certainly do not have COVID? No. Test results provide information that can change our assessed probabilities of different outcomes, such as having COVID or not. Consider the following example from Dr. Guikema’s class.

Assume there are two tests for SARS-CoV-2. The more accurate and more expensive test has a false negative rate of 10% and a false positive rate of 1%. The less accurate and less expensive test has a false negative rate of 14% and a false positive rate of 5%. These are not meant to represent any particular test but instead to illustrate the points with reasonable approximations of some of the existing tests. Assume that the true prevalence in a community is 3%, that is, 3% of the members of that community are positive for SARS-CoV-2.

Question 1: If a person selected at random from the population tests positive with the more accurate test, what is their probability of being positive for SARS-CoV-2? What is their probability of being negative?

Let’s set up some notation. Let p(+) be the probability that an individual is positive for SARS-CoV-2 (meaning they are truly infected), and let p(-) = 1-p(+) be the probability that they are not. Let p(“+”) be the probability the test says they are positive. Let p(“+”|+) be the conditional probability that the test says they are positive given that they actually are positive, and let p(“+”|-) be the probability that the test says they are positive given that they actually are negative (the false positive rate of the test). Then 1-p(“+”|+) = p(“-”|+) is the false negative rate of the test.

We want to calculate p(+|“+”). This is the probability that the individual actually is positive given that they received a positive test result. For this we use Bayes’ Theorem. As an interesting aside, Thomas Bayes was both a Presbyterian minister and a mathematician. OK, so what is Bayes’ Theorem and how do we use it? Bayes says that p(+|“+”) = p(“+”|+)p(+)/[p(“+”|+)p(+)+p(“+”|-)p(-)]. If we plug in the numbers from our example we have p(+|“+”) = 0.9*0.03/(0.9*0.03+0.01*0.97) = 0.027/0.0367 = 0.736. That is, there is a probability of approximately 0.26 that this individual is not infected with SARS-CoV-2 even though they tested positive. Or to put it differently, out of every 1,000 positive results from randomly selected individuals in a population with a true prevalence of 3%, you would expect approximately 260 to be false positives (tests that come back positive for an individual who is not infected).
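The calculation for Question 1 can be checked with a few lines of code. What follows is a minimal sketch in Python, used here only for illustration (the simulation mentioned in this post is written in R). The function name posterior_positive and its parameter names are made up for this example and are not taken from Dr. Guikema’s materials.

```python
def posterior_positive(prevalence, false_negative_rate, false_positive_rate):
    """Return p(+|"+"): probability of infection given a positive test.

    Hypothetical helper for illustration; applies Bayes' Theorem directly.
    """
    sensitivity = 1.0 - false_negative_rate                # p("+"|+)
    true_pos = sensitivity * prevalence                    # p("+"|+) p(+)
    false_pos = false_positive_rate * (1.0 - prevalence)   # p("+"|-) p(-)
    return true_pos / (true_pos + false_pos)


# Question 1: more accurate test (10% false negatives, 1% false positives),
# person drawn at random from a population with 3% prevalence.
p_infected = posterior_positive(0.03, 0.10, 0.01)
print(round(p_infected, 3))      # 0.736
print(round(1 - p_infected, 3))  # 0.264
```

The second printed value is the answer to the second half of Question 1: the probability that the person is not infected despite the positive result.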

But what if the base rate were 30%, not 3%? For example, maybe you’re only testing symptomatic people. You can redo the math above, and you should see that p(+|“+”) = 0.9*0.30/(0.9*0.30+0.01*0.70) = 0.975. We’re much more sure this person is infected because the base rate is considerably higher.
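To see how strongly the answer depends on the base rate, one can repeat the same calculation over several prevalence values. The sketch below is in Python for illustration only; the function name, the dictionary name, and the choice of prevalence values other than 3% and 30% are assumptions made for this example.

```python
def posterior_positive(prevalence, sensitivity=0.90, false_positive_rate=0.01):
    """p(+|"+") for the more accurate test (hypothetical helper)."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Prevalence values to compare; 1% and 10% are added purely for illustration.
posterior_by_prevalence = {
    prevalence: posterior_positive(prevalence)
    for prevalence in (0.01, 0.03, 0.10, 0.30)
}

for prevalence, posterior in posterior_by_prevalence.items():
    print(f"prevalence {prevalence:.2f}: p(infected | positive test) = {posterior:.3f}")
```

The rows for 3% and 30% reproduce the 0.736 and 0.975 figures from the text, and the posterior rises steadily with the base rate even though the test itself is unchanged.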

Question 2: Same question as Question 1, but for the less accurate test.

We can redo the calculation above with the less accurate test, for which p(“+”|+) = 1-0.14 = 0.86 and p(“+”|-) = 0.05. We find that p(+|“+”) = 0.86*0.03/(0.86*0.03+0.05*0.97) = 0.0258/0.0743 = 0.347. That is, with the less accurate test, only about 35% of those testing positive for COVID would actually have COVID if they were drawn at random from a population in which only 3% of people actually are infected with SARS-CoV-2. Base rate matters!

Question 3: If a person selected at random from the population receives a positive test on the less accurate test followed by a positive test on the more accurate test what is their probability of being positive for SARS-CoV-2?

If the person receives a positive result on the less accurate test, we know from Question 2 that their probability of being infected is 0.347. The first test has given us information, so we use this as the prior probability for the second test. That is, we use Bayes’ Theorem again, but with p(+) = 0.347 rather than 0.03. This treats the two test results as independent given the person’s true infection status. We get p(+|“+”) = 0.9*0.347/(0.9*0.347+0.01*(1-0.347)) = 0.98, so the probability of being infected given a positive result on both tests is about 0.98. Running the sequence of tests has increased the probability relative to running just the more accurate test (0.98 versus 0.736).
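The two-step update can be sketched in code by feeding the posterior from the first test in as the prior for the second. This is an illustrative Python sketch, not the author’s code; the function name update is hypothetical, and the calculation assumes the two test results are independent given the person’s true infection status.

```python
def update(prior, sensitivity, false_positive_rate):
    """One Bayes' Theorem update after a positive result (hypothetical helper)."""
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1.0 - prior))


prior = 0.03  # community prevalence

# Positive on the less accurate test (14% false negatives, 5% false positives).
after_first = update(prior, sensitivity=0.86, false_positive_rate=0.05)

# Then positive on the more accurate test (10% false negatives, 1% false positives),
# using the first posterior as the new prior.
after_both = update(after_first, sensitivity=0.90, false_positive_rate=0.01)

print(f"after first positive:  {after_first:.3f}")
print(f"after second positive: {after_both:.3f}")
```

Under the independence assumption the order of the two tests does not change the final probability; only the intermediate value differs.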

Question 4: Now assume that you test 10,000 people (roughly the student population of the University of Michigan College of Engineering) with only the more accurate test. If they test positive, they go into quarantine. If they test negative, they do not go into quarantine. If you use this test, what is the expected number of people that are negative that end up in quarantine? What is the expected number of people that are positive that are not in quarantine? What is the expected number of people that are positive that are in quarantine?

You can solve this analytically easily, but given that Professor Guikema was using this to teach his students how to structure a simulation problem, he wrote the simulation code linked here. If you ran this testing protocol once (meaning you tested each of the 10,000 people once), you would expect to have 97 students who are not infected in quarantine (10,000*0.97*0.01), 270 students who are infected in quarantine (10,000*0.03*0.9), and 30 students who are infected but not in quarantine (10,000*0.03*0.1). That is, you would have found 270 infected students and removed them from the circulating population at the cost of quarantining 97 non-infected individuals, while missing 30 infected students. These are expectations (means), and there is uncertainty around these numbers. If you run the simulation you can get a sense of this uncertainty. Note that the simulation is coded in R, so you will need R installed. Also note that the simulation was developed as a tool to teach basic simulation. It is not designed to be fast code. Rather, it emphasizes understandability with loops rather than speed with vectorization.
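For readers without R, the idea behind such a simulation can be sketched as follows. This is not Professor Guikema’s code: it is a separate, minimal Python illustration, and the function name, the dictionary keys, the seed, and the choice of 100 replications are all assumptions made for this example. Like the original, it favors plain loops over speed.

```python
import random
import statistics


def simulate_once(rng, n_people=10_000, prevalence=0.03,
                  false_negative_rate=0.10, false_positive_rate=0.01):
    """Test every person once; return counts for the three groups of interest."""
    counts = {"uninfected_quarantined": 0,
              "infected_quarantined": 0,
              "infected_not_quarantined": 0}
    for _ in range(n_people):
        infected = rng.random() < prevalence
        if infected:
            # An infected person tests positive unless a false negative occurs.
            tests_positive = rng.random() >= false_negative_rate
        else:
            # An uninfected person tests positive only on a false positive.
            tests_positive = rng.random() < false_positive_rate
        if tests_positive and not infected:
            counts["uninfected_quarantined"] += 1
        elif tests_positive and infected:
            counts["infected_quarantined"] += 1
        elif infected:
            counts["infected_not_quarantined"] += 1
    return counts


rng = random.Random(2020)  # arbitrary seed so repeated runs match
runs = [simulate_once(rng) for _ in range(100)]

means = {}
for key in runs[0]:
    values = [run[key] for run in runs]
    means[key] = statistics.mean(values)
    print(f"{key}: mean {means[key]:.1f}, std dev {statistics.stdev(values):.1f}")
```

The means should land close to the analytical expectations of 97, 270, and 30, while the standard deviations give a sense of how much a single round of testing can vary around them.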