Consider Bayes Theorem in the context of medical testing. Let there be a medical test that is either positive (+) or negative (-).
\[Pr(infected|+)=\frac{Pr(+|infected)Pr(infected)}{Pr(+)}\]
Given that we are dealing with a binary outcome, we can rewrite the denominator as the weighted average of positive and negative tests. We’ll write this in terms of probabilities.
\[Pr(infected|+)=\frac{Pr(+|infected)Pr(infected)}{Pr(+|infected)Pr(infected) + Pr(+|uninfected)Pr(uninfected)}\]
Let’s take the following:
If you test positive, you’re infected 95% of the time. If you test positive, you’re not infected 1% of the time. 0.1% of the population is infected.
Pr_Positive_Infected <- 0.95
Pr_Positive_NotInfected <- 0.01
Pr_Infected <- 0.001
Pr_Positive <- Pr_Positive_Infected * Pr_Infected +
Pr_Positive_NotInfected * ( 1 - Pr_Infected )
( Pr_Infected_Positive <- Pr_Positive_Infected*Pr_Infected / Pr_Positive )
## [1] 0.0868
That is all.
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prob_p <- rep( 1 , 1000 )
prob_data <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- prob_data * prob_p
posterior <- posterior / sum(posterior)
ggplot(data.frame(p_grid, posterior), aes(p_grid, posterior)) +
geom_line()
Let’s do the same thing with samples from the posterior.
## R code 3.3
samples <- data.frame(x=sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )) %>%
rownames_to_column("id")
## R code 3.4
ggplot(samples, aes(id, x)) +
geom_hline(yintercept=.67, color="darkblue", size=2, alpha=.7) +
geom_point(size=.6, color="dodgerblue", alpha=.6) +
theme(axis.ticks.x=element_blank())
ggplot(samples, aes(x)) +
geom_density(color="dodgerblue", size=1.2, fill="lightblue", alpha=.3)
Why would we want to draw samples from the posterior distribution? We do this to conduct statistical inference. These could follow three main themes: (1) intervals of defined boundaries, and (2) questions about intervals of defined probability mass, and (3) questions about point estimates.
What’s the probability that the proportion of water on Earth is less than 50%?
Recall that the posterior is a probability distribution that sums to 1, so we can simply add up the posterior values along the grid that are less than 50%.
# add up posterior probability where p < 0.5
sum( posterior[ p_grid < 0.5 ] )
## [1] 0.172
More intuitive is to work with a sample of parameter estimates, which we interpret as representing the true probability distribution. So we simply look at what proportion of estimates are less than 50%, from the entire distribution of parameter estimates.
## R code 3.7
sum( samples < 0.5 ) / 1e4
## [1] 0.173
What proportion of estimates are between 50 and 75 percent?
## R code 3.8
sum( samples > 0.5 & samples < 0.75 ) / 1e4
## [1] 0.604
Let’s look at the range and endpoints defined by the probability mass, rather than a pre-defined interval of unknown mass.
What are the boundaries of the lower 80% of the posterior distribution?
quantile( samples$x , 0.8 )
## 80%
## 0.76
What interval defines the middle 80% of the posterior probability mass?
## R code 3.10
quantile( samples[,2] , c( 0.1 , 0.9 ) )
## 10% 90%
## 0.446 0.813
These intervals assign equal mass to each side, which doesn’t work as well for assymetric distributions.
set.seed(43)
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep(1,1000)
likelihood <- dbinom( 3 , size=3 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
samples <- data.frame(x = sample( p_grid , size=1e4 , replace=TRUE , prob=posterior ))
PI( samples , prob=0.5 )
## 25% 75%
## 0.0661 0.0661
## R code 3.13
HPDI( samples , prob=0.5 )
## |0.5 0.5|
## 0.843 1.000
## R code 3.14
p_grid[ which.max(posterior) ]
## [1] 1
## R code 3.15
chainmode( samples , adj=0.01 )
## Error in density.default(chain, ...): argument 'x' must be numeric
## R code 3.16
mean( samples )
## [1] NA
median( samples )
## Error in median.default(samples): need numeric data
## R code 3.17
sum( posterior*abs( 0.5 - p_grid ) )
## [1] 0.313
## R code 3.18
loss <- sapply( p_grid , function(d) sum( posterior*abs( d - p_grid ) ) )
## R code 3.19
p_grid[ which.min(loss) ]
## [1] 0.841
## R code 3.20
dbinom( 0:2 , size=2 , prob=0.7 )
## [1] 0.09 0.42 0.49
## R code 3.21
rbinom( 1 , size=2 , prob=0.7 )
## [1] 1
## R code 3.22
rbinom( 10 , size=2 , prob=0.7 )
## [1] 2 1 2 2 0 2 1 1 1 1
## R code 3.23
dummy_w <- rbinom( 1e5 , size=2 , prob=0.7 )
table(dummy_w)/1e5
## dummy_w
## 0 1 2
## 0.0885 0.4185 0.4930
## R code 3.24
dummy_w <- rbinom( 1e5 , size=9 , prob=0.7 )
simplehist( dummy_w , xlab="dummy water count" )
ggplot(data.frame(x=dummy_w), aes(x)) +
geom_histogram()
## R code 3.25
w <- rbinom( 1e4 , size=9 , prob=0.6 )
## R code 3.26
w <- rbinom( 1e4 , size=9 , prob=samples )
## Error in rbinom(10000, size = 9, prob = samples): invalid arguments
## R code 3.27
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep( 1 , 1000 )
likelihood <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
set.seed(100)
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )
birth1 <- c(1,0,0,0,1,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,1,0, 0,0,0,1,1,1,0,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0, 1,1,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,1,0, 1,0,1,1,1,0,1,1,1,1) birth2 <- c(0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,0,0,1,1,1,0,0,1,1,1,0, 1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,0,1,1,0,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,1,1, 0,0,0,1,1,1,0,0,0,0)
library(rethinking) data(homeworkch3)
sum(birth1) + sum(birth2)
```