Created 19/07/2020 at 07:39PM

Probability 101

The following short series is inspired by two papers by Kruschke and Liddell:

  1. Bayesian data analysis for newcomers
  2. The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective

Bayesian inference in a nutshell can be boiled down to 3 things:

  1. Rules of probability (this post)
  2. Priors matter
  3. Think in distributions, not point estimates (coming soon)

The Data

My wife has taken a COVID19 antibody test. The test measures two antibodies, but for simplicity we will only look at the most predictive antibody, IgG. The sensitivity of IgM is only 69%, and I suspect that the IgM and IgG test results are correlated, breaking the IID (Independent and Identically Distributed) assumption necessary for performing simple Bayesian updating :).

According to Hoffman et al. the performance of the test is reported as follows:

|              | Cases | Healthy | Total |
|--------------|-------|---------|-------|
| IgG positive | 27    | 1       | 28    |
| IgG negative | 2     | 123     | 125   |
| Total        | 29    | 124     | 153   |

Rules of Inference

If we divide all the cells of the above table by the total (153), then we can turn the table of counts into a table of probabilities.

|                              | Cases ($Pr(\theta)$)      | Healthy ($Pr(\neg\theta)$) | Total                      |
|------------------------------|---------------------------|----------------------------|----------------------------|
| IgG positive ($Pr(y)$)       | $27/153\approx 0.176$     | $1/153\approx 0.007$       | $28/153\approx 0.183$      |
| IgG negative ($Pr(\neg y)$)  | $2/153\approx 0.013$      | $123/153\approx 0.804$     | $125/153\approx 0.817$     |
| Total                        | $29/153\approx 0.190$     | $124/153\approx 0.810$     | $153/153=1$                |
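As a quick sanity check, the conversion from counts to probabilities can be sketched in a few lines of Python (the counts are the ones from the Hoffman et al. table above):

```python
# Raw counts from the Hoffman et al. table:
# rows = IgG positive / IgG negative, columns = Cases / Healthy
counts = [[27, 1],
          [2, 123]]

# Dividing every cell by the grand total turns counts into probabilities
total = sum(sum(row) for row in counts)              # 153
joint = [[c / total for c in row] for row in counts]

print([[round(p, 3) for p in row] for row in joint])
# [[0.176, 0.007], [0.013, 0.804]]

# All cells of a probability table must sum to 1
print(sum(sum(row) for row in joint))  # 1.0
```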

Joint Probability

In the above table we have 4 joint probabilities, which are the probabilities of two joint (simultaneous) events, such as IgG positive ($y$) and COVID19 positive ($\theta$). This is written $Pr(\theta,y)$ and should be read as "the (joint) probability of $\theta$ and $y$".

Marginal Probability

In the margins of the table (the Total row/column) are the marginal probabilities, where one of the variables ($\theta$ or $y$) has been marginalized out. For dichotomous variables (variables with two outcomes) this is easily done by adding the joint probabilities together. For continuous variables, we need to resort to integration. For example:

$$ Pr(y) = \sum_{\theta}(Pr(\theta, y)) = Pr(y,\theta) + Pr(y,\neg\theta) = \frac{27 + 1}{153} = \frac{28}{153} $$
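The marginalization above can be reproduced exactly with Python's `fractions` module, which keeps the counts-over-153 form of the table intact:

```python
from fractions import Fraction

# Joint probabilities as exact fractions (counts over the grand total of 153)
pr_y_theta     = Fraction(27, 153)   # Pr(y, theta)
pr_y_not_theta = Fraction(1, 153)    # Pr(y, not theta)

# Marginalizing theta out: sum the joint probability over its possible values
pr_y = pr_y_theta + pr_y_not_theta
print(pr_y)         # 28/153
print(float(pr_y))  # ≈ 0.183
```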

Conditional Probability

A conditional probability is a probability where one thing is "given" (taken as true). For example, $Pr(y\mid{}\theta)$ is the probability of a positive test ($y$) if you actually have COVID19 ($\theta$). Conditional probabilities can be derived from a joint and a marginal probability as follows:

$$ \begin{aligned} Pr(y\mid{}\theta) &= \frac{Pr(\theta,y)}{Pr(\theta)} &= \frac{\frac{27}{153}}{\frac{29}{153}} &=\frac{27}{29} &\approx{}0.931 \newline Pr(\neg{}y\mid{}\neg\theta) &= \frac{Pr(\neg\theta,\neg{}y)}{Pr(\neg\theta)} &= \frac{\frac{123}{153}}{\frac{124}{153}} &=\frac{123}{124} &\approx{}0.992 \end{aligned} $$
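The same two conditional probabilities, computed as exact ratios of the joint and marginal fractions from the table:

```python
from fractions import Fraction

# Pr(y | theta) = Pr(theta, y) / Pr(theta): the 153s cancel, leaving 27/29
pr_y_given_theta = Fraction(27, 153) / Fraction(29, 153)

# Pr(not y | not theta) = Pr(not theta, not y) / Pr(not theta) = 123/124
pr_noty_given_nottheta = Fraction(123, 153) / Fraction(124, 153)

print(pr_y_given_theta, float(pr_y_given_theta))              # 27/29 ≈ 0.931
print(pr_noty_given_nottheta, float(pr_noty_given_nottheta))  # 123/124 ≈ 0.992
```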

Evaluate the Data

The two conditional probabilities above have special names:

  - $Pr(y\mid{}\theta)$ is the Sensitivity of the test
  - $Pr(\neg{}y\mid{}\neg\theta)$ is the Specificity of the test

There are also two other conditional probabilities which can be derived from the table:

  - $Pr(\theta\mid{}y)$, the Positive Predictive Value
  - $Pr(\neg\theta\mid{}\neg{}y)$, the Negative Predictive Value

At first glance these may seem more useful, because it is tempting to interpret $Pr(\theta\mid{}y)$ as the probability of COVID19 if I have a positive test. The correct interpretation, however, is "the probability of COVID19 if my prior probability of COVID19 is the one congruent with $Pr(y)=\frac{28}{153}\approx{}0.183$".

To make this more concrete, imagine I go out and redo the experiment, but instead of recruiting 124 healthy and 29 survivors, I recruit 124 healthy and 58 survivors. Then the table I would expect to get (ignoring random variation) would be one with the "Cases" column multiplied by 2, as follows:

|              | Cases | Healthy | Total |
|--------------|-------|---------|-------|
| IgG positive | 54    | 1       | 55    |
| IgG negative | 4     | 123     | 127   |
| Total        | 58    | 124     | 182   |

Now all the joint probabilities have changed, but how do the 4 conditional probabilities pan out? (stop and think)

$$ Pr(y\mid{}\theta) = \frac{Pr(\theta,y)}{Pr(\theta)} = \frac{\frac{54}{182}}{\frac{58}{182}} = \frac{54}{58} = \underline{\underline{\frac{27}{29}}} $$

$$ Pr(\neg{}y\mid{}\neg\theta) = \frac{Pr(\neg\theta,\neg{}y)}{Pr(\neg\theta)} = \frac{\frac{123}{182}}{\frac{124}{182}} = \underline{\underline{\frac{123}{124}}} $$

Sensitivity and Specificity are unchanged, which is why we use these to evaluate tests.

$$ Pr(\theta\mid{}y) = \frac{Pr(\theta,y)}{Pr(y)} = \frac{\frac{54}{182}}{\frac{55}{182}} = \frac{54}{55}\ne\frac{27}{28} $$

$$ Pr(\neg{}\theta\mid{}\neg{}y) = \frac{Pr(\neg\theta,\neg{}y)}{Pr(\neg{}y)} = \frac{\frac{123}{182}}{\frac{127}{182}} = \frac{123}{127}\ne\frac{123}{125} $$

The Positive and Negative Predictive Values have changed, which is why we do not use these to evaluate tests!
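The whole comparison can be checked in one go. This sketch computes all four conditional probabilities from a 2×2 table of counts and confirms that doubling the "Cases" column leaves Sensitivity and Specificity untouched while shifting both predictive values:

```python
from fractions import Fraction

def conditionals(pos_cases, neg_cases, pos_healthy, neg_healthy):
    """Return (sensitivity, specificity, PPV, NPV) from a 2x2 table of counts."""
    cases   = pos_cases + neg_cases        # column total: all with COVID19
    healthy = pos_healthy + neg_healthy    # column total: all without
    pos     = pos_cases + pos_healthy      # row total: all positive tests
    neg     = neg_cases + neg_healthy      # row total: all negative tests
    return (Fraction(pos_cases, cases),      # Pr(y | theta)
            Fraction(neg_healthy, healthy),  # Pr(not y | not theta)
            Fraction(pos_cases, pos),        # Pr(theta | y)
            Fraction(neg_healthy, neg))      # Pr(not theta | not y)

original = conditionals(27, 2, 1, 123)   # the Hoffman et al. table
doubled  = conditionals(54, 4, 1, 123)   # the "Cases" column times 2

print(original[:2] == doubled[:2])  # True:  sensitivity/specificity invariant
print(original[2:] == doubled[2:])  # False: predictive values changed
```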

Hey... wait a minute! While I now understand why $Pr(y\mid{}\theta)$ and $Pr(\neg{}y\mid{}\neg\theta)$ are what I should use for evaluating tests, my primary goal was not to figure out how good the test was, but what I should believe after taking it: "given a test result, what is the probability of COVID19?", or formally $Pr(\theta\mid{}y)$ or $Pr(\theta\mid{}\neg{}y)$... Well, that you can read in part 2.

This is part 1 of 2 in the "Good Bayesian" Series.
Click here for the next post

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
