Simple Assumptions
- • There are two populations (red and blue) that differ, on average, in some characteristic (CRS). The CRS can be any trait that is normally distributed within each population and whose mean differs between the two populations. For example, the red population might be individuals who will get a heart attack within the next ten years, the blue population might be individuals who will not, and the characteristic might be their current LDL cholesterol minus their HDL cholesterol. Side note: that's why lowering LDL cholesterol with a statin can save your life!
- • The Red Population Size (\(\pi\)) defaults to 15% of the total population, meaning that 15% of the whole population is red and 85% is blue.
- • The Population Gap (\(\mu_1 - \mu_0\)) defaults to 0.5 standard deviations (SDs), meaning that the characteristic level of the average red individual is 0.5 SDs higher than that of the average blue individual.
Implications
- • The probability that an individual is red (black dashed line), given their level of the characteristic, is simply the fraction of people at that level who are red: \(P(\text{red} \mid \text{CRS}) = \frac{1}{1 + \frac{1-\pi}{\pi}\exp(\frac{1}{2\sigma^2}[(\text{CRS}-\mu_1)^2 - (\text{CRS}-\mu_0)^2])}\)
- • The probability distribution of the whole population (grey line) equals the weighted sum of the two normal distributions: \(\phi(\text{CRS}) = (1-\pi)N(\mu_0, \sigma) + \pi N(\mu_1, \sigma)\)
- • The AUC (area under the ROC curve) measures how well the characteristic predicts whether someone is red or blue. Specifically, it is the probability that a randomly selected red individual has a higher CRS than a randomly selected blue individual. The AUC has a one-to-one relationship with the population gap: \(\text{AUC} = \Phi(\frac{\mu_1-\mu_0}{\sigma\sqrt{2}})\), where \(\Phi\) is the standard normal cumulative distribution function and \(\sigma = 1\) here.
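The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the page: the function names are hypothetical, \(\sigma = 1\) is the default, and the two means are placed so that the whole-population mean is zero (as stated in the Technical Addenda). The small simulation is only a sanity check of the "random red versus random blue" reading of the AUC.

```python
import math
import random


def normal_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def group_means(pi, gap):
    """Blue and red means, assuming the whole-population mean is zero.

    Solves mu0*(1 - pi) + mu1*pi = 0 together with mu1 - mu0 = gap.
    """
    return -pi * gap, (1.0 - pi) * gap


def prob_red(crs, pi=0.15, gap=0.5, sigma=1.0):
    """Probability of being red given the characteristic (black dashed line)."""
    mu0, mu1 = group_means(pi, gap)
    exponent = ((crs - mu1) ** 2 - (crs - mu0) ** 2) / (2.0 * sigma ** 2)
    return 1.0 / (1.0 + (1.0 - pi) / pi * math.exp(exponent))


def mixture_pdf(crs, pi=0.15, gap=0.5, sigma=1.0):
    """Density of the whole population (grey line): weighted sum of both groups."""
    mu0, mu1 = group_means(pi, gap)
    return (1.0 - pi) * normal_pdf(crs, mu0, sigma) + pi * normal_pdf(crs, mu1, sigma)


def auc(gap, sigma=1.0):
    """AUC implied by the population gap: Phi(gap / (sigma * sqrt(2)))."""
    z = gap / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulated_auc(gap, n=20000, seed=0):
    """Fraction of random (red, blue) pairs in which the red draw is higher."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(gap, 1.0) > rng.gauss(0.0, 1.0) for _ in range(n))
    return wins / n


if __name__ == "__main__":
    print("P(red | CRS = 1):", prob_red(1.0))
    print("AUC from formula:", auc(0.5))
    print("AUC from simulation:", simulated_auc(0.5))
```

The closed-form probability agrees with a direct application of Bayes' rule, `pi * normal_pdf(crs, mu1) / mixture_pdf(crs)`, which is one way to check the formula.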
Clinical Use
- When applied to polygenic disease prediction, the red population represents individuals with the disease (cases) and the blue population represents healthy individuals (controls). The continuous characteristic is a polygenic predictor: a weighted sum of genetic variants associated with disease risk. In fact, the default parameters (15% lifetime prevalence and an AUC of 0.64) correspond to the case-control distributions of males for a polygenic predictor of prostate cancer. That means that, at the default settings, the black dashed line shows a man's lifetime probability of developing prostate cancer given his polygenic risk score. (Refresh the page to see the default parameters.)
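To connect the two default numbers, the AUC formula can be inverted to recover the population gap, and the gap can then be used to read off a risk for a given score. The sketch below assumes \(\sigma = 1\), a whole-population mean of zero, and scores expressed in SD units; `gap_from_auc` and `lifetime_risk` are hypothetical helper names, not part of the page.

```python
import math
from statistics import NormalDist


def gap_from_auc(auc, sigma=1.0):
    """Invert AUC = Phi(gap / (sigma * sqrt(2))) to get the population gap."""
    return sigma * math.sqrt(2.0) * NormalDist().inv_cdf(auc)


def lifetime_risk(score, prevalence=0.15, auc=0.64):
    """Probability of being a case given a polygenic score (in SD units).

    Assumes equal SDs of 1 and a whole-population mean of zero, so the
    control mean is -prevalence * gap and the case mean is (1 - prevalence) * gap.
    """
    gap = gap_from_auc(auc)
    controls = NormalDist(mu=-prevalence * gap, sigma=1.0)
    cases = NormalDist(mu=(1.0 - prevalence) * gap, sigma=1.0)
    case_mass = prevalence * cases.pdf(score)
    control_mass = (1.0 - prevalence) * controls.pdf(score)
    return case_mass / (case_mass + control_mass)


if __name__ == "__main__":
    print("gap implied by AUC 0.64:", round(gap_from_auc(0.64), 3))
    for score in (-2.0, 0.0, 2.0):
        print(f"score {score:+.1f} SD -> risk {lifetime_risk(score):.3f}")
```

An AUC of 0.64 implies a gap of roughly 0.5 SDs, which is why the two defaults describe the same setting.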
Technical Addenda
- • Both populations have the same standard deviation: \(\sigma = \sigma_0 = \sigma_1 = 1\). If one population had a larger standard deviation, its distribution would be wider and would dominate both extreme tails. For disease prediction this is unrealistic, because the probability of being red would no longer rise steadily with the characteristic. In practice, the standard deviations of the two populations tend to be close in value.
- • The mean/average of the blue population is \(\mu_0\) and the mean of the red population is \(\mu_1\).
- • The mean of the whole population (red + blue) is arbitrarily defined to be zero: \(\mu = \mu_0(1-\pi) + \mu_1\pi = 0\).
- • The probability distribution of the whole population is approximately normal for small \(\pi\) and small \(\mu_1 - \mu_0\).
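One rough way to see how good the normal approximation is: compare the mixture density with a single normal density that has the same mean (zero) and the same variance. The sketch below uses the standard variance of a two-component mixture, \(\sigma^2 + \pi(1-\pi)(\mu_1-\mu_0)^2\), and reports the largest pointwise difference on a grid. The function name, grid range, and step count are illustrative assumptions.

```python
import math
from statistics import NormalDist


def max_density_error(pi, gap, sigma=1.0, lo=-6.0, hi=6.0, steps=1201):
    """Largest absolute gap between the mixture density and a matched normal."""
    # Means chosen so that the whole-population mean is zero.
    mu0, mu1 = -pi * gap, (1.0 - pi) * gap
    blue, red = NormalDist(mu0, sigma), NormalDist(mu1, sigma)

    # Single normal with the same mean (0) and variance as the mixture.
    matched_sd = math.sqrt(sigma ** 2 + pi * (1.0 - pi) * gap ** 2)
    matched = NormalDist(0.0, matched_sd)

    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        mixture = (1.0 - pi) * blue.pdf(x) + pi * red.pdf(x)
        worst = max(worst, abs(mixture - matched.pdf(x)))
    return worst


if __name__ == "__main__":
    print("defaults (pi=0.15, gap=0.5):", max_density_error(0.15, 0.5))
    print("large gap (pi=0.5, gap=3.0):", max_density_error(0.5, 3.0))
```

At the default settings the difference is small, while a large gap with evenly sized populations produces a visibly bimodal curve that a single normal cannot match.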