Before diving into McElreath, let’s start with some background on joint, marginal, and conditional probability.

Two-way distributions (Kruschke Chapters 4-5)

Let’s say we’re interested in joint and marginal probabilities of two variables, eye and hair color. Here’s a frequency table.

# rows are eye colors; the final row "h" holds the hair-color marginals (column sums),
# and the final column "e" holds the eye-color marginals (row sums)
freqs <- data.frame(eye_color=c("brown","blue","hazel","green","h"),
                    black = c(.11,.03,.03,.01,.18),
                    brunette = c(.2,.14,.09,.05,.48),
                    red = c(.04,.03,.02,.02,.12),
                    blond = c(.01,.16,.02,.03,.21),
                    e = c(.37,.36,.16,.11,1))

freqs                    
eye_color black brunette red blond e
brown 0.11 0.20 0.04 0.01 0.37
blue 0.03 0.14 0.03 0.16 0.36
hazel 0.03 0.09 0.02 0.02 0.16
green 0.01 0.05 0.02 0.03 0.11
h 0.18 0.48 0.12 0.21 1.00

We denote the joint probability of eye color e and hair color h as \(p(e,h)\), and note that \(p(e, h) = p(h,e)\).

We obtain the marginal probabilities of hair or eye color by summing the joint probabilities across rows (eye color) or columns (hair color). We denote the marginal probability of eye color as \(p(e)\) and hair color as \(p(h)\), such that \(p(e)=\sum_h{p(e,h)}\) and \(p(h)=\sum_e{p(e,h)}\).
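As a quick check on these sums, here is a small sketch using the freqs table above, whose first four rows hold the joint cells:

# summing the joints across hair colors should recover the eye-color marginals (the e column),
# and summing across eye colors the hair-color marginals (the h row), up to rounding
rowSums(freqs[1:4, 2:5])
colSums(freqs[1:4, 2:5])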

Generalizing this, consider a row variable r and column variable c. When the variables are continuous rather than discrete, then we take an integral rather than a sum. So, \(p(r)=\int{dc\hspace{1mm} p(r,c)}\) and \(p(c)=\int{dr\hspace{1mm}p(r,c)}\).
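The same idea can be sketched numerically for a continuous case. Purely for illustration, assume r and c are independent standard normals, so that \(p(r,c)\) factors into dnorm(r) * dnorm(c):

r <- 1
integrate(function(c) dnorm(r) * dnorm(c), lower=-Inf, upper=Inf)$value  # integrate c out of p(r, c)
dnorm(r)                                                                 # both are ~0.242, the marginal p(r)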

Suppose we know an individual has blue eyes. Conditional on that information, what are the probabilities for hair color? Blue-eyed individuals make up 36 percent of this sample. By dividing the joint probabilities of blue eyes and each hair color by the marginal probability of blue eyes, we rescale the probabilities to sum to one, where one now represents the sub-sample of people with blue eyes. These probabilities of hair color, conditional on blue eye color, are denoted \(p(h|e=blue)=\frac{p(e=blue,h)}{p(e=blue)}\).

h_blue <- freqs[2,2:5] / freqs[2,6]

h_blue
black brunette red blond
2 0.083 0.389 0.083 0.444

Blue-eyed people are most likely to be blond or brunette.

And these probabilities should sum to one.

sum(h_blue)
## [1] 1

In general form, we say that \(p(h|e)=\frac{p(e,h)}{p(e)}\), which can also be expressed as \(p(h|e)=\frac{p(e,h)}{\sum_{h*}p(e,h*)}\), where h* denotes the full set of hair color values to sum over and e denotes a specific value of eye color. More generally, we can go back to the row and column variables: \[p(c|r=x)=\frac{p(r=x,c)}{\sum_{c*}{p(r=x,c*)}}\]

Or:

\[p(c|r=x)=\frac{p(r=x,c)}{p(r=x)}\] Or if you can maintain in your head that the conditioned variable takes on a definite value, we have:

\[p(c|r)=\frac{p(r,c)}{p(r)}\] Again, just bearing in mind that the variable r takes on a specific value.
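Before moving on, here is a small sketch applying the same conditioning to every eye color at once: divide each row of joint cells by its row sum (using the row sums as marginals sidesteps the rounding in the stored e column). The blue row reproduces h_blue from above.

joint_eh <- freqs[1:4, 2:5]
h_given_e <- joint_eh / rowSums(joint_eh)   # divide each row of joints by its marginal
round(h_given_e, 3)
rowSums(h_given_e)                          # each conditional distribution sums to one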

To get to Bayes’ Rule, go back to \(p(c|r)=\frac{p(r,c)}{p(r)}\) and multiply both sides by \(p(r)\). \[p(c|r)p(r)=p(r,c)\]

We can do the same thing with \(p(r|c)\):

\(p(r|c)=\frac{p(c,r)}{p(c)}\) and multiply both sides by \(p(c)\)

\[ p(r|c)p(c)=p(c,r) \]

Recall the symmetry with joint probabilities \(p(r,c)=p(c,r)\). We can then set the conditional expressions equal to each other.

\[ p(c|r)p(r)=p(r|c)p(c) \] Now divide by \(p(r)\)

\[ p(c|r)=\frac{p(r|c)p(c)}{p(r)} \] And this is Bayes’ Theorem. Recall that we can express this as

\[p(c|r) = \frac{p(r|c)p(c)}{\sum_{c*}p(r|c*)p(c*)}\]
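We can sanity-check this numerically with the eye/hair table, recovering \(p(h=blond|e=blue)\) from the reversed conditional (the indexing below just pulls the relevant cells out of freqs):

p_blue_given_blond <- freqs[2, "blond"] / freqs[5, "blond"]  # p(e = blue | h = blond)
p_blond <- freqs[5, "blond"]                                 # marginal p(h = blond)
p_blue <- freqs[2, "e"]                                      # marginal p(e = blue)

p_blue_given_blond * p_blond / p_blue                        # ~0.444, matching h_blue above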

2.1 Garden of forking data

The small world is the self-contained logical world of the model. All possibilities are enumerated and there are no surprises. Bayesian models have reasonable claims to optimality: no alternative model could make better use of the information in the data and support better decisions, assuming the small world is an accurate representation of the large world.

The large world is the reality in which we deploy a generative model. In the large world, there may be events that were not imagined in the small world. The logical consistency of a model in the small world is no guarantee that it is optimal for the large world. Even correctly specified models will generate data that mimics the reality of the large world, but with error.

Bayesian models help us understand what happened, only by correctly specifying everything else that could have happened. As we learn about what did happen, we are left with only those possibilities of what could happen that are consistent with what did happen in the data.

Consider a bag containing four marbles that are either blue or white. We don’t know the color of each marble, but we know there could be five possibilities: (1) zero blue, (2) one blue, (3) two blue, (4) three blue, or (5) four blue. Consider each possibility as a conjecture about the actual state of the world. Without evidence, we can only assign equal probabilities to each conjecture, so 20 percent.

We take a marble from the bag, record its color, return it to the bag, and shake it up. We repeat this until we get three draws. The resulting sequence is blue, white, blue (b,w,b). This is sampling with replacement to produce data that can inform our conjectures.

Figure 2.2 in the text shows the garden of forking data - the enumeration of all possible paths the marble draws may take to produce the draw that was observed, assuming that the bag contains one blue and three white marbles. Figure 2.4 shows the same garden of forking data, for each of the possible conjectures.

By visually inspecting Figure 2.4, we can count up how many ways the observed draw may arise for each conjecture. For conjecture (1) of zero blue, we know immediately that there are zero paths, because the observed sequence contains blue draws. The same goes for conjecture (5) of four blue, because the sequence contains a white draw.

For conjecture (2) of one blue, there are three paths through the garden of forking data that produce our observed draw.

For conjecture (3) of two blue, there are eight paths through the garden of forking data.

For conjecture (4) of three blue, there are nine paths.
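We can also get these counts arithmetically rather than by reading the figure. A quick sketch: with k blue marbles in a bag of four, each blue draw can occur k ways and each white draw 4 - k ways, so the product rule across the three draws gives the path counts.

k <- 0:4                 # number of blue marbles under each conjecture
k * (4 - k) * k          # paths producing the sequence blue, white, blue
## [1] 0 3 8 9 0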

2.1.3 From counts to probabilities

We multiply the number of ways the observed data could arise under each conjecture by the prior probability of that conjecture. With equal priors, the prior has no effect, and the resulting products are the relative plausibilities of the conjectures: 0, 3, 8, 9, 0. So three blue becomes our most plausible conjecture, but only slightly more plausible than two blue. These counts of ways are also what we will call ‘likelihoods’.

More formally, let’s define the following:

plausibility \(\propto\) ways a conjecture can produce observed data \(\times\) prior plausibility of the conjecture

Or more succinctly:

plausibility \(\propto\) likelihood \(\times\) prior

Once we get the plausibilities of each conjecture, we normalize them in order to generate a set of probabilities.

ways <- c(0,3,8,9,0)

(ways / sum(ways)) %>%   # parentheses needed: %>% binds more tightly than /
  round(3)
## [1] 0.00 0.15 0.40 0.45 0.00

Because we assigned equal prior probabilities to each conjecture, these probabilities are just a rescaling of what we saw from the plausibility counts: conjecture (3) is most favored, but only by a tiny amount over conjecture (2). One might consider how different two conjectures should be, in order to say one conjecture is more likely than another. With this set of probabilities, we’d probably want to say that conjectures (2) and (3) are about equally likely.

These plausibilities are also probabilities. In model terms, we have

  • the parameter p, which is the proportion of blue marbles. It’s what we want to know from the model.

  • the likelihood, or relative number of ways that a candidate parameter could produce the observed data. Another way to think of it is to say, “what is the probability of observing the data that we did, for a given candidate parameter value?”

  • the prior plausibility of any candidate parameter value. This comes from previously observed data, theory, a principled guess to start from, etc.

  • the posterior plausibility of any candidate parameter value, produced by inputting the prior probability and likelihoods into Bayes’ Theorem or the Bayesian engine more generally

Randomization: What do we mean when we say a set of data are randomized? It means we have no specific information about the ordering of the data. An ordering about which we know nothing has high information entropy.

2.2 Building a model

We made it through our first example, which was designed (somewhat tortuously) to show that Bayesian analysis is fundamentally just a counting procedure that arises from our assumptions. Given a set of assumptions, we simply compare the number of ways the data could arise under each.

We start with a set of observed data.

# tibble() comes from the tidyverse; frq() is from the sjmisc package
d <- tibble(toss=c("w","l","w","w","w","l","w","l","w"))
frq(d)
## toss <character> 
## # total N=9 valid N=9 mean=1.67 sd=0.50
## 
## Value | N | Raw % | Valid % | Cum. %
## ------------------------------------
## l     | 3 | 33.33 |   33.33 |  33.33
## w     | 6 | 66.67 |   66.67 | 100.00
## <NA>  | 0 |  0.00 |    <NA> |   <NA>

Six out of nine tosses (67 percent) were water. Under a frequentist paradigm, this is our estimate of the proportion of water on Earth. A Bayesian analysis (and in some respects generative modeling more broadly) involves the following steps:

  • formulate a narrative about how the data might arise (data story)

  • educate your model by feeding it the observed data, and seeing what new data it generates (model updating)

  • evaluate your model to see how well it is able to generate the data you observed

The data story behind your model may be descriptive (specifying associations that can be used to predict outcomes) or causal (a theory of how some events produce other events). Any causal story should also be descriptive, but it’s usually unwarranted to interpret a descriptive story causally.

In the deliberately simple model we currently have, the data story goes like this:

  • the true proportion of water covering the globe is p, the parameter we wish to estimate

  • a toss of the globe has a probability p of producing water, and 1-p for producing land

  • each toss is independent of the other tosses

This story is sufficient to power a probability model that generates data that should mimic the data you’ve already observed. Ideally, the model of the small world generates data that is similar to the data generating process in the large world. You then use the data generated by the model to make inferences about the data generating process in the real world.
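As a small check that this story really is generative, here is a minimal simulation sketch; the value 0.7 used for p is only an assumption for illustration, not something we have learned yet.

set.seed(11)
sim_tosses <- rbinom(1e4, size=9, prob=0.7)   # 10,000 simulated sets of nine independent tosses

table(sim_tosses) / 1e4                       # how often each count of water appears in the small world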

The value of storytelling: Many different stories are consistent with your model. Storytelling remains valuable because, in trying to develop the story of the data, the researcher may develop additional questions that help revise the model and distinguish it from other models.

2.2.2 Bayesian updating

Let’s explore a specific part of Bayesian analysis, namely the process of updating our posterior beliefs that arise out of the interaction of our prior beliefs and observed data. For the sequence of tosses described above, we’ll look at the Bayesian estimate of the proportion of water on Earth, after each individual observation.

n_1 <- data.frame(toss=rep("w", 50),
                  trials=rep(1,50),
                  success=rep(1,50),
                  p_water = seq(0,1,length.out=50),  # grid of candidate proportions of water
                  prior=rep(.02,50)) %>%             # flat prior: 1/50 at each grid value
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))                 # normalized, i.e. the posterior after this toss

n_1g <- ggplot(n_1, aes(p_water)) + 
  geom_vline(xintercept=1, size=1.2, color="darkgoldenrod2", alpha=.3) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W",
       caption="n=1\nW\nobserved water 100%") + 
  theme(plot.caption = element_text(vjust=10))

n_2 <- data.frame(toss=rep("l", 50),
                  trials=rep(2,50),
                  success=rep(1,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_1$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_2g <- ggplot(n_2, aes(p_water)) + 
  geom_vline(xintercept=.5, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L",
       caption="n=2\nW L\nobserved water 50%") + 
  theme(plot.caption = element_text(vjust=10))

n_3 <- data.frame(toss=rep("w", 50),
                  trials=rep(3,50),
                  success=rep(2,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_2$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_3g <- ggplot(n_3, aes(p_water)) + 
  geom_vline(xintercept=.67, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L W",
       caption="n=3\nW L W\nobserved water 67%") + 
  theme(plot.caption = element_text(vjust=10))

n_4 <- data.frame(toss=rep("w", 50),
                  trials=rep(4,50),
                  success=rep(3,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_3$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_4g <- ggplot(n_4, aes(p_water)) + 
  geom_vline(xintercept=.75, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L W W",
       caption="n=4\nW L W W\nobserved water 75%") + 
  theme(plot.caption = element_text(vjust=10))

n_5 <- data.frame(toss=rep("w", 50),
                  trials=rep(5,50),
                  success=rep(4,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_4$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_5g <- ggplot(n_5, aes(p_water)) + 
  geom_vline(xintercept=.8, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L W W W",
       caption="n=5\nW L W W W\nobserved water 80%") + 
  theme(plot.caption = element_text(vjust=10))

n_6 <- data.frame(toss=rep("l", 50),
                  trials=rep(6,50),
                  success=rep(4,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_5$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_6g <- ggplot(n_6, aes(p_water)) + 
  geom_vline(xintercept=.67, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L W W W L",
       caption="n=6\nW L W W W L\nobserved water 67%") + 
  theme(plot.caption = element_text(vjust=10))

n_7 <- data.frame(toss=rep("w", 50),
                  trials=rep(7,50),
                  success=rep(5,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_6$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_7g <- ggplot(n_7, aes(p_water)) + 
  geom_vline(xintercept=.71, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L W W W L W",
       caption="n=7\nW L W W W L W\nobserved water 71%") + 
  theme(plot.caption = element_text(vjust=10))

n_8 <- data.frame(toss=rep("l", 50),
                  trials=rep(8,50),
                  success=rep(5,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_7$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_8g <- ggplot(n_8, aes(p_water)) + 
  geom_vline(xintercept=.625, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L W W W L W L",
       caption="n=8\nW L W W W L W L\nobserved water 63%") + 
  theme(plot.caption = element_text(vjust=10))

n_9 <- data.frame(toss=rep("w", 50),
                  trials=rep(9,50),
                  success=rep(6,50),
                  p_water = seq(0,1,length.out=50),
                  prior=n_8$lik) %>%
  mutate(lik_raw = dbinom(success,trials,p_water),
         lik=lik_raw / sum(lik_raw))

n_9g <- ggplot(n_9, aes(p_water)) + 
  geom_vline(xintercept=.67, size=1.2, color="darkgoldenrod2", alpha=.4) + 
  geom_line(aes(y=prior), size=1.2, color="grey60", linetype="dotdash") +
  geom_line(aes(y=lik), size=1, color="dodgerblue2") + 
  scale_x_continuous(breaks=c(.25,.75),
                     labels=percent) + 
  theme(axis.text.y=element_blank()) +
  labs(x="",
       y="",
       #title="W L W W W L W L W",
       caption="n=9
       W L W W W L W L W
       observed water 67%") + 
  theme(plot.caption = element_text(vjust=10))

# the nine panels are combined with the patchwork package
out <- n_1g + n_2g + n_3g + n_4g + n_5g + n_6g + n_7g + n_8g + n_9g

# ggsave(out,
#        file="viz/Bayesian updating.png",
#        device="png",
#        type="cairo",
#        height=6,
#        width=8)

out

In graphs 3-5, each new trial hits water so the prior and posterior get closer to each other. For graphs six and eight, the trials do not land on water, so the prior and posterior get a bit further apart relative to their preceding graphs.

2.2.3 Evaluate the model

Your Bayesian model is the best machinery for learning from the interaction between information and assumptions, and it will make more confident assertions as it is fed more data. Just remember that the model can be both confident and wrong: the ‘truth’ of its assertions is conditional on the ‘truth’ of the model itself. Model performance can be investigated, and different models can be compared against each other.

2.3 Components of the model

Consider what we have counted so far:

  1. The number of ways each conjecture could produce an observation
  2. The accumulated number of ways each conjecture could produce the entire data
  3. The initial plausibility of each conjectured cause of the data

To proceed with the Bayesian model, we need a way to specify how likely each observation is, for a candidate parameter value. For example, if we consider that the earth consists of equal parts land and water (p = .5), how likely was it that we observed six observations of water out of nine tosses? In conventional statistics, a distribution function assigned to an observed variable is called the likelihood.

In R, we assign distributions (densities) through the prefix ‘d’, as in dnorm, dbinom, etc.

## R code 2.2
dbinom(6, size=9, prob=0.5 )
## [1] 0.164

If the true proportion of water was 50 percent, there was a 16.4 percent probability of observing six water tosses out of nine. In McElreath’s wording, taking the likelihood of the data (assigning a distribution to the data according to a candidate parameter value) performs the service of counting the relative number of ways to observe the data that we did, at a given candidate parameter value.

Here is the actual likelihood function for a binomial distribution:

\[Pr(W,L|p)=\frac{(W+L)!}{(W!L!)}p^W(1-p)^L\] Let’s replicate the calculation.

w <- 6
l <- 3
n <- w + l
p <- .5
num <- factorial(n)
den <- factorial(w)*factorial(l)
prob <- p^w*(1-p)^l

( num/den ) * prob 
## [1] 0.164

Exactly!

We construct a posterior distribution of plausibilities by mapping the likelihood of the data, weighted by the prior, across a meaningful range of candidate parameter values.
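As a quick preview of that mapping, here is the same binomial likelihood evaluated over a coarse, arbitrary grid of candidate values (p_grid is just a throwaway name):

p_grid <- seq(0, 1, by=0.1)
round(dbinom(6, size=9, prob=p_grid), 3)   # peaks near 6/9
## [1] 0.000 0.000 0.003 0.021 0.074 0.164 0.251 0.267 0.176 0.045 0.000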

2.3.3 A model is born

So we have observed variables water (W) and land (L), with some probability p of observing water. The distribution function that we assign to these counts (the likelihood) is the binomial distribution.

\[ W \sim Binomial(N, p) \]

Where N is the total number of tosses (L + W).

The unobserved parameter p is assumed to lie anywhere from zero to one with equal probability, so the prior we assign to the proportion of water is a uniform distribution.

\[ p \sim Uniform(0,1) \]

Consider that we know there is more water than land on the surface of the earth, so p > .5. Incorporating that substantive prior information would give us what is known as an informative prior. But for the moment, we’ll assume we know absolutely nothing about the proportion of water, so we make any value between zero and one as likely as any other.

2.4 Making the model go

We now use Bayes Theorem to power the model. Bayes Theorem combines your prior plausibilities with the assigned likelihoods to produce a distribution of posterior plausibilities - the posterior distribution contains the relative plausibilities of different parameter values, conditional on the data and model. In this example, we’d notate that as Pr(p|W,L).

For any candidate parameter value p, the joint distribution of W, L, and p is Pr(W,L,p) = Pr(W,L|p)Pr(p).

Narratively: the joint probability of W, L, and p can be read as the probability of observing W, L within the slice of possibilities where p holds, scaled by how plausible (or how well represented) that value of p is.

To see this explicitly, let’s calculate an example of joint, conditional, and marginal probabilities directly.

library(palmerpenguins)
data(penguins)

penguins <- penguins %>%
  mutate(bill = case_when(bill_length_mm < 40 ~ 40,   # bin bill length into three groups:
                          bill_length_mm < 45 ~ 45,   # under 40mm, 40-45mm, 45mm and up
                          bill_length_mm >= 45 ~ 50))

p <- penguins %>%
  select(species, bill) %>%
  na.omit
head(p)
species bill
Adelie 40
Adelie 40
Adelie 45
Adelie 40
Adelie 40
Adelie 40

Get the joint probabilities

# adorn_totals() below comes from the janitor package
joint <- p %>%
  group_by(species, bill) %>%
  count() %>%
  ungroup() %>%
  mutate(joint=n/sum(n)) %>%
  select(-3) %>%
  pivot_wider(names_from=bill,
              values_from=joint) %>%
  adorn_totals(where=c("row","col")) %>%
  rename(spec_marg = Total) 

joint$species[4] <- "bill_marg"

joint
species 40 45 50 spec_marg
Adelie 0.292 0.140 0.009 0.442
Chinstrap NA 0.018 0.181 0.199
Gentoo NA 0.064 0.295 0.360
bill_marg 0.292 0.222 0.485 1.000

Get the conditional frequencies

cond <- p %>%
  group_by(species) %>%
  count(bill) %>%
  mutate(cond_freq = n/sum(n)) %>%
  select(-3) %>%
  pivot_wider(names_from=bill,
              values_from=cond_freq)
cond
species 40 45 50
Adelie 0.662 0.318 0.020
Chinstrap NA 0.088 0.912
Gentoo NA 0.179 0.821

Compare to joint

joint
species 40 45 50 spec_marg
Adelie 0.292 0.140 0.009 0.442
Chinstrap NA 0.018 0.181 0.199
Gentoo NA 0.064 0.295 0.360
bill_marg 0.292 0.222 0.485 1.000

Pr(bill, species) = Pr(bill|species)Pr(species)

Pr(bill=45, species=Adelie) = Pr(bill=45|species=Adelie)Pr(species=Adelie) = 0.141

.318 * .442
## [1] 0.141

Which matches the frequency in the joint table. So we get joint frequencies by multiplying a conditional with a marginal frequency. And we get a conditional frequency by dividing a joint by a marginal frequency.
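Checking the reverse direction with the same cell (the small discrepancy from .318 comes from rounding in the displayed tables):

.140 / .442
## [1] 0.317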

Having shown that \[ Pr(W,L,p) = Pr(W,L|p)Pr(p) \], note that by definition we can factor the joint probability the other way as well. So:

\[ Pr(W,L,p) = Pr(p|W,L)Pr(W,L) \]

And since these two factorizations describe the same joint probability, we can set them equal to each other.

\[ Pr(p|W,L)Pr(W,L) = Pr(W,L|p)Pr(p) \]

And then solve for the parameter p, given the data we observe:

\[ Pr(p|W,L) = \frac{Pr(W,L|p)Pr(p)}{Pr(W,L)} \]

And that’s Bayes Theorem. Note that \(Pr(W,L|p)\) is the likelihood that we explored above. It does the work of counting the possible ways to get a result in the garden of forking data. \(Pr(p)\) is our prior probability of the outcome. \(Pr(W,L)\) is the average probability of the data, which is nothing more than a scaling factor to transform the posterior into a probability distribution that sums to 1.
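To make the ‘average probability of the data’ concrete, here is a small numerical sketch for this flat-prior case (the closed form 1/(N+1) is specific to a uniform prior):

# average the binomial likelihood over the flat prior
integrate(function(p) dbinom(6, 9, p) * dunif(p), lower=0, upper=1)$value
## [1] 0.1
1 / (9 + 1)
## [1] 0.1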

The terminology here is specific to the example of the proportions of land and water. More generally, we might consider the parameter to be some scientific hypothesis, and data to be the evidence relating to that hypothesis. So Bayes Theorem could then be expressed as:

\[ Pr(Hypothesis|Evidence) = \frac{Pr(Evidence|Hypothesis)Pr(Hypothesis)}{Pr(Evidence)} \]

The denominator of Bayes Theorem is an integral. Unfortunately, for most models of interest the integral has no closed-form solution. Instead, there are three common numerical techniques to approximate it:

  1. Grid approximation
  2. Quadratic approximation
  3. Markov chain Monte Carlo (MCMC) sampling

Grid approximation

We’ll start off with grid approximation, in which we use a finite set of candidate values to approximate the infinite number of values that are summed in the integral. For each candidate parameter value on the grid, and following Bayes Theorem, we multiply the prior probability of that candidate value by the likelihood of the observed data given that candidate value.

set.seed(54)

bin_mod <- data.frame(prior=1,
                      param = round(seq(0,1,length.out=20),3)) %>%
  mutate(lik = dbinom(6, 9, param),
         unstd.posterior = lik*prior,
         posterior = unstd.posterior / sum(unstd.posterior))

bin_mod
prior param lik unstd.posterior posterior
1 0.000 0.000 0.000 0.000
1 0.053 0.000 0.000 0.000
1 0.105 0.000 0.000 0.000
1 0.158 0.001 0.001 0.000
1 0.211 0.004 0.004 0.002
1 0.263 0.011 0.011 0.006
1 0.316 0.027 0.027 0.014
1 0.368 0.053 0.053 0.028
1 0.421 0.091 0.091 0.048
1 0.474 0.139 0.139 0.073
1 0.526 0.189 0.189 0.100
1 0.579 0.236 0.236 0.124
1 0.632 0.267 0.267 0.140
1 0.684 0.271 0.271 0.143
1 0.737 0.245 0.245 0.129
1 0.789 0.190 0.190 0.100
1 0.842 0.118 0.118 0.062
1 0.895 0.050 0.050 0.026
1 0.947 0.009 0.009 0.005
1 1.000 0.000 0.000 0.000
ggplot(bin_mod, aes(param, posterior)) + 
  geom_vline(xintercept=.67, color="darkgoldenrod2", size=1.2, alpha=.4) +
  geom_point(color="dodgerblue", size=2) +
  geom_line(color="dodgerblue", size=1)

Now let’s look at how choice of prior affects the posterior.

bin_mod2 <- data.frame(prior=1,
                      param = round(seq(0,1,length.out=20),3)) %>%
  mutate(prior = ifelse(param < 0.5 , 0 , 1),
         lik = dbinom(6, 9, param),
         unstd.posterior = lik*prior,
         posterior = unstd.posterior / sum(unstd.posterior),
         prev_posterior = bin_mod$posterior)

bin_mod2
prior param lik unstd.posterior posterior prev_posterior
0 0.000 0.000 0.000 0.000 0.000
0 0.053 0.000 0.000 0.000 0.000
0 0.105 0.000 0.000 0.000 0.000
0 0.158 0.001 0.000 0.000 0.000
0 0.211 0.004 0.000 0.000 0.002
0 0.263 0.011 0.000 0.000 0.006
0 0.316 0.027 0.000 0.000 0.014
0 0.368 0.053 0.000 0.000 0.028
0 0.421 0.091 0.000 0.000 0.048
0 0.474 0.139 0.000 0.000 0.073
1 0.526 0.189 0.189 0.120 0.100
1 0.579 0.236 0.236 0.150 0.124
1 0.632 0.267 0.267 0.169 0.140
1 0.684 0.271 0.271 0.172 0.143
1 0.737 0.245 0.245 0.155 0.129
1 0.789 0.190 0.190 0.121 0.100
1 0.842 0.118 0.118 0.075 0.062
1 0.895 0.050 0.050 0.032 0.026
1 0.947 0.009 0.009 0.006 0.005
1 1.000 0.000 0.000 0.000 0.000
ggplot(bin_mod2, aes(param, posterior)) + 
  geom_vline(xintercept=bin_mod2[bin_mod2$posterior==max(bin_mod2$posterior),2],
             color="darkgoldenrod2", size=1, alpha=.4) +
  geom_point(color="dodgerblue", size=2) +
  geom_line(color="dodgerblue", size=1) +
  geom_vline(xintercept=.67, color="grey60", size=1, alpha=.4, linetype="dotdash") +
  geom_line(aes(y=prev_posterior), color="grey60", size=1, linetype="dotdash")

Truncating the prior at 0.5 zeroes out the posterior below 0.5 and rescales the remaining mass upward: the peak is taller and the posterior mean shifts higher, even though the location of the posterior maximum is unchanged.

bin_mod3 <- data.frame(prior=1,
                      param = round(seq(0,1,length.out=20),3)) %>%
  mutate(prior = exp( -5*abs(param - 0.5)),
         lik = dbinom(6, 9, param),
         unstd.posterior = lik*prior,
         posterior = unstd.posterior / sum(unstd.posterior),
         prev_posterior = bin_mod2$posterior,
         prev_posterior2 = bin_mod$posterior)

bin_mod3
prior param lik unstd.posterior posterior prev_posterior prev_posterior2
0.082 0.000 0.000 0.000 0.000 0.000 0.000
0.107 0.053 0.000 0.000 0.000 0.000 0.000
0.139 0.105 0.000 0.000 0.000 0.000 0.000
0.181 0.158 0.001 0.000 0.000 0.000 0.000
0.236 0.211 0.004 0.001 0.001 0.000 0.002
0.306 0.263 0.011 0.003 0.004 0.000 0.006
0.399 0.316 0.027 0.011 0.011 0.000 0.014
0.517 0.368 0.053 0.027 0.029 0.000 0.028
0.674 0.421 0.091 0.061 0.065 0.000 0.048
0.878 0.474 0.139 0.122 0.129 0.000 0.073
0.878 0.526 0.189 0.166 0.176 0.120 0.100
0.674 0.579 0.236 0.159 0.168 0.150 0.124
0.517 0.632 0.267 0.138 0.146 0.169 0.140
0.399 0.684 0.271 0.108 0.114 0.172 0.143
0.306 0.737 0.245 0.075 0.079 0.155 0.129
0.236 0.789 0.190 0.045 0.047 0.121 0.100
0.181 0.842 0.118 0.021 0.023 0.075 0.062
0.139 0.895 0.050 0.007 0.007 0.032 0.026
0.107 0.947 0.009 0.001 0.001 0.006 0.005
0.082 1.000 0.000 0.000 0.000 0.000 0.000

What does this prior look like?

ggplot(bin_mod3, aes(param, prior)) + 
  geom_line()

This prior is peaked at 0.5 and falls off exponentially on either side; compared with a normal of similar spread, it has a sharper peak and somewhat heavier tails.

And the results

ggplot(bin_mod3, aes(param, posterior)) + 
  geom_point(color="dodgerblue", size=2) +
  geom_line(color="dodgerblue", size=1) +
  geom_line(aes(y=prev_posterior), color="grey60", size=1, linetype="dotdash") +
  geom_line(aes(y=prev_posterior2), color="darkgoldenrod2", size=1, linetype="dotdash")

Now the posterior max has been pushed lower.

Quadratic approximation

This method approximates the posterior with a normal distribution, using the location of the posterior maximum and the curvature of the log posterior immediately around that maximum. Here’s how it works.

library(rethinking)
globe.qa <- quap(
     alist(
          w ~ dbinom(w+l,p) ,  # binomial likelihood
          p ~ dunif(0,1)     # uniform prior
     ) ,
     data=list(w=6, l=3) )

# display summary of quadratic approximation
precis( globe.qa )
   mean    sd  5.5% 94.5%
p 0.667 0.157 0.416 0.918
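To see where those two numbers come from mechanically, here is a hand-rolled sketch of the same idea, assuming a flat prior so that the log posterior is the log likelihood up to a constant (this is only a sketch of the idea, not a peek inside quap()):

# negative log posterior; with a flat prior this is just the negative log likelihood
neg_log_post <- function(p) -dbinom(6, size=9, prob=p, log=TRUE)

fit <- optim(par=0.5, fn=neg_log_post, method="L-BFGS-B",
             lower=1e-4, upper=1-1e-4, hessian=TRUE)

fit$par                # location of the posterior maximum, ~0.667
sqrt(1 / fit$hessian)  # sd implied by the curvature at the maximum, ~0.157 (matches precis)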

The quadratic posterior mean (0.667) matches the observed proportion of water, and precis also reports the curvature-based standard deviation (0.157) along with an 89% compatibility interval. To see the whole approximate posterior, we can extract samples from it.

globe.samp <- extract.samples(globe.qa)
head(globe.samp)
p
0.735
0.531
0.730
0.743
0.875
0.951
ggplot(globe.samp, aes()) + 
  geom_density(aes(x=p))

How does the quadratic approximation compare to the analytic solution? Let’s look.

set.seed(4332)

# nine draws from the quadratic (normal) approximation, kernel-smoothed, against the
# analytic posterior, which under the flat prior is Beta(W+1, L+1) = Beta(7, 4)
n_9 <- data.frame(x = rnorm(9, .67,.16))

ggplot(n_9, aes(x)) + 
  geom_density(color="darkgoldenrod2", size=1, alpha=.4) +
  geom_function(fun=dbeta, args=list(7,4), color="dodgerblue", size=1, alpha=.4) +
  scale_x_continuous(limits=c(0,1))

set.seed(4332)

# more draws from the same Normal(0.67, 0.16) give a smoother picture of the approximation
n_18 <- data.frame(x = rnorm(18, .67,.16))

ggplot(n_18, aes(x)) + 
  geom_density(color="darkgoldenrod2", size=1, alpha=.4) +
  geom_function(fun=dbeta, args=list(7,4), color="dodgerblue", size=1, alpha=.4) +
  scale_x_continuous(limits=c(0,1))

set.seed(4332)

n_36 <- data.frame(x = rnorm(36, .67,.16))

ggplot(n_36, aes(x)) + 
  geom_density(color="darkgoldenrod2", size=1, alpha=.4) +
  geom_function(fun=dbeta, args=list(7,4), color="dodgerblue", size=1, alpha=.4) +
  scale_x_continuous(limits=c(0,1))
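The third technique on the list above, MCMC, isn’t used until later chapters, but a minimal Metropolis sketch for this same globe-tossing posterior fits in a few lines (this again assumes the flat prior; proposals that step outside [0, 1] are reflected back in, which keeps the proposal distribution symmetric):

set.seed(100)
n_samples <- 1e4
p_samp <- rep(NA, n_samples)
p_samp[1] <- 0.5                               # arbitrary starting value
W <- 6 ; L <- 3

for (i in 2:n_samples) {
  p_new <- rnorm(1, p_samp[i-1], 0.1)          # propose a nearby candidate value
  if (p_new < 0) p_new <- abs(p_new)           # reflect proposals off the boundaries
  if (p_new > 1) p_new <- 2 - p_new
  q0 <- dbinom(W, W+L, p_samp[i-1])            # (unnormalized) posterior at the current value
  q1 <- dbinom(W, W+L, p_new)                  # ... and at the proposal
  p_samp[i] <- ifelse(runif(1) < q1/q0, p_new, p_samp[i-1])
}

mean(p_samp) ; sd(p_samp)                      # compare with the quadratic approximation above

With a symmetric proposal and a flat prior, accepting with probability q1/q0 is the Metropolis rule, so the chain’s samples settle into the same posterior that the grid and quadratic methods target.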

2.6 Practice

2M1

Recall the globe tossing model from the chapter. Compute and plot the grid approximate posterior distribution for each of the following sets of observations. In each case, assume a uniform prior for p. 

  1. W,W,W
set.seed(54)

mod1_2M1 <- data.frame(prior=1,
                      param = round(seq(0,1,length.out=20),3)) %>%
  mutate(lik = dbinom(3, 3, param),
         unstd.posterior = lik*prior,
         posterior = unstd.posterior / sum(unstd.posterior))

mod1_2M1
prior param lik unstd.posterior posterior
1 0.000 0.000 0.000 0.000
1 0.053 0.000 0.000 0.000
1 0.105 0.001 0.001 0.000
1 0.158 0.004 0.004 0.001
1 0.211 0.009 0.009 0.002
1 0.263 0.018 0.018 0.003
1 0.316 0.032 0.032 0.006
1 0.368 0.050 0.050 0.009
1 0.421 0.075 0.075 0.014
1 0.474 0.106 0.106 0.020
1 0.526 0.146 0.146 0.028
1 0.579 0.194 0.194 0.037
1 0.632 0.252 0.252 0.048
1 0.684 0.320 0.320 0.061
1 0.737 0.400 0.400 0.076
1 0.789 0.491 0.491 0.093
1 0.842 0.597 0.597 0.113
1 0.895 0.717 0.717 0.136
1 0.947 0.849 0.849 0.161
1 1.000 1.000 1.000 0.190
ggplot(mod1_2M1, aes(param, posterior)) + 
  #geom_vline(xintercept=.67, color="darkgoldenrod2", size=1.2, alpha=.4) +
  geom_point(color="dodgerblue", size=2) +
  geom_line(color="dodgerblue", size=1)

  2. W, W, W, L
set.seed(54)

mod2_2m1 <- data.frame(prior=1,
                      param = round(seq(0,1,length.out=20),3)) %>%
  mutate(lik = dbinom(3, 4, param),
         unstd.posterior = lik*prior,
         posterior = unstd.posterior / sum(unstd.posterior))

mod2_2m1
prior param lik unstd.posterior posterior
1 0.000 0.000 0.000 0.000
1 0.053 0.001 0.001 0.000
1 0.105 0.004 0.004 0.001
1 0.158 0.013 0.013 0.004
1 0.211 0.030 0.030 0.008
1 0.263 0.054 0.054 0.014
1 0.316 0.086 0.086 0.023
1 0.368 0.126 0.126 0.033
1 0.421 0.173 0.173 0.046
1 0.474 0.224 0.224 0.059
1 0.526 0.276 0.276 0.073
1 0.579 0.327 0.327 0.086
1 0.632 0.372 0.372 0.098
1 0.684 0.404 0.404 0.107
1 0.737 0.421 0.421 0.111
1 0.789 0.415 0.415 0.110
1 0.842 0.377 0.377 0.100
1 0.895 0.301 0.301 0.080
1 0.947 0.180 0.180 0.048
1 1.000 0.000 0.000 0.000
ggplot(mod2_2m1, aes(param, posterior)) + 
  #geom_vline(xintercept=.67, color="darkgoldenrod2", size=1.2, alpha=.4) +
  geom_point(color="dodgerblue", size=2) +
  geom_line(color="dodgerblue", size=1)

  3. L, W, W, L, W, W, W
set.seed(54)

mod3_2m1 <- data.frame(prior=1,
                      param = round(seq(0,1,length.out=20),3)) %>%
  mutate(lik = dbinom(5, 7, param),
         unstd.posterior = lik*prior,
         posterior = unstd.posterior / sum(unstd.posterior))

mod3_2m1
prior param lik unstd.posterior posterior
1 0.000 0.000 0.000 0.000
1 0.053 0.000 0.000 0.000
1 0.105 0.000 0.000 0.000
1 0.158 0.001 0.001 0.001
1 0.211 0.005 0.005 0.002
1 0.263 0.014 0.014 0.006
1 0.316 0.031 0.031 0.013
1 0.368 0.057 0.057 0.024
1 0.421 0.093 0.093 0.039
1 0.474 0.139 0.139 0.059
1 0.526 0.190 0.190 0.080
1 0.579 0.242 0.242 0.102
1 0.632 0.287 0.287 0.121
1 0.684 0.314 0.314 0.132
1 0.737 0.316 0.316 0.133
1 0.789 0.286 0.286 0.120
1 0.842 0.222 0.222 0.093
1 0.895 0.133 0.133 0.056
1 0.947 0.045 0.045 0.019
1 1.000 0.000 0.000 0.000
ggplot(mod3_2m1, aes(param, posterior)) + 
  #geom_vline(xintercept=.67, color="darkgoldenrod2", size=1.2, alpha=.4) +
  geom_point(color="dodgerblue", size=2) +
  geom_line(color="dodgerblue", size=1)

2M3

Suppose there are two globes, one for Earth and one for Mars. The Earth globe is 70% covered in water. The Mars globe is 100% land. Further suppose that one of these globes – you don’t know which – was tossed in the air and produced a “land” observation. Assume that each globe was equally likely to be tossed. Show that the posterior probability that the globe was the Earth, conditional on seeing “land” (Pr(Earth|land)), is 23%.

\[Pr(Earth|land) = \frac{Pr(land|Earth)Pr(Earth)}{Pr(land)}\]

lnd_erth <- .3               # Pr(land | Earth)
p_erth <- .5                 # Pr(Earth)
p_lnd <- 1.3/2               # Pr(land) = Pr(land|Earth)Pr(Earth) + Pr(land|Mars)Pr(Mars)
.3*.5 + 1*.5                 # checking that marginal directly
## [1] 0.65
(lnd_erth*p_erth) / p_lnd    # Bayes: Pr(Earth | land)
## [1] 0.231

Homework 2022 (4)

Suppose there is bias in sampling so that Land is more likely than Water to be recorded. Specifically, assume that 1-in-5 (20%) of Water samples are accidentally recorded instead as “Land”. First, write a generative simulation of this sampling process. Assuming the true proportion of Water is 70%, what proportion does your simulation tend to produce instead? Second, using a simulated sample of 20 tosses, compute the unbiased posterior distribution of the true proportion of water.

The true proportion of Water is 70%

set.seed(43)
N <- 1e6
#trueW <- rbinom(N, 20, .7)
trueW <- rbinom(N, 1, .7)   # one million individual tosses; 1 = truly Water

mean(trueW)
## [1] 0.7
obsW <- rep(NA, N)

for(i in 1:length(obsW)) {
  obsW[i] = ifelse(trueW[i]==1, rbinom(1,1,.8), trueW[i])  # a true Water is recorded as Water only 80% of the time
}

mean(obsW)
## [1] 0.56
#obsW2 <- replicate(N, ifelse(trueW==1, rbinom(1,1,.8), trueW))

#mean(obsW2)

Here’s how McElreath does it:

\[P(W_{obs}|W_{true})=.8\] \[P(L_{obs}|W_{true})=.2\] - a true Water is recorded as Land 20 percent of the time (this is the line I had written as \(P(W|L)\) and didn’t get) \[P(W_{obs})=.7 \times .8=.56\] - the observed proportion, not the true one \[P(W_{true})=.7\] - the true proportion

trueW_MC <- rbinom(N, 20, .7) # true number of Water observations in each set of 20 tosses
mean(trueW_MC)/20
## [1] 0.7
obsW_MC <- rbinom(N,trueW_MC, .8) # each true Water is recorded as Water with probability .8
mean(obsW_MC)/20
## [1] 0.56
obsW_MC2 <- rbinom(N, 20, .8*.7) # same as above in one step
mean(obsW_MC2) / 20
## [1] 0.56
# now analyze
# Pr(p | W, N) = Pr(W | p, N) Pr(p) / Z
# Pr(obs W | N, p) uses p * 0.8: a toss counts as observed Water only if it is truly Water (p) and correctly recorded (.8)

W <- rbinom(1,size=20,prob=0.7*0.8)   # one biased sample of 20 recorded tosses
grid_p <- seq(from=0,to=1,len=100)    # grid of candidate true proportions
pr_p <- dbeta(grid_p,1,1)             # flat prior
prW <- dbinom(W,20,grid_p*0.8)        # likelihood uses p * 0.8 to undo the recording bias
post <- data.frame(p=grid_p,
                   posterior=prW*pr_p / sum(prW*pr_p))

ggplot(post, aes(p, posterior)) + 
  geom_line()

# NB: "lenth.out" is misspelled, so seq() silently ignores it and the grid collapses to
# c(0, 1), where the likelihood is zero; hence the NaN posterior printed below.
# length.out = 20 would give the intended 20-point grid.
biased <- data.frame(prior=1,
                  param=round(seq(0,1,lenth.out=20),3)) %>%
  mutate(lik=dbinom(W, 20, param),
         unstd.post = lik*prior,
         post = unstd.post/sum(unstd.post))

biased
prior param lik unstd.post post
1 0 0 0 NaN
1 1 0 0 NaN
tes <- ifelse(trueW==1, rbinom(1,1,.8), trueW)  # NB: rbinom(1,1,.8) is a single draw recycled across the whole vector
head(obsW)
## [1] 1 0 0 0 1 0
rbinom(1,1,.8)
## [1] 1
a <- data.frame(a=1:100) %>%
  mutate(b=rbinom(1,1,.8))   # same recycling issue: one draw reused for all 100 rows, as frq() shows below

head(a)
a b
1 1
2 1
3 1
4 1
5 1
6 1
frq(a$b)
## x <integer> 
## # total N=100 valid N=100 mean=1.00 sd=0.00
## 
## Value |   N | Raw % | Valid % | Cum. %
## --------------------------------------
##     1 | 100 |   100 |     100 |    100
##  <NA> |   0 |     0 |    <NA> |   <NA>
out <- data.frame(trueW, obsW)

head(out)
trueW obsW
1 1
0 0
1 0
0 0
1 1
1 0
mean(trueW)
## [1] 0.7
mean(obsW)
## [1] 0.56