# Bayesian Baby Steps: Intro with JuJu

## First steps

I was originally thinking of writing a blog post about multilevel models (aka hierachical, mixed, random effects) because of how useful they are for measuring player performance in sports1 (shameless self promotion for nflWAR here!). But the more I thought about it, the more I realized how ill-minded of an idea that was. Instead, I want to build up the intuition for how and why one would want to use a full Bayesian multilevel model. My brother recommended that I checkout Statistical Rethinking by Richard McElreath. It’s an amazing introduction to Bayesian data analysis that I recommend to anyone interested in learning more. It’s incredibly intuitive and paced very well, for instance MCMC isn’t covered until halfway through the book so you understand why you’re using it. Taking what I have learned from the book, I’m writing a series of posts that will hopefully build the basic understanding of how Bayesian inference works starting with simple updating and gradually working up towards multilevel modeling. This series is called Bayesian Baby Steps for a reason, and there will be What About Bob? GIFs. Because of my work with nflscrapR, all my examples will be using NFL data which is ripe for Bayesian data analysis.

## Catch rate example

You’re trying to evaluate a receiver’s ability to catch a football. Let’s pretend you can take the following (completely unrealistic) strategy: you tell your quarterback to repeatedly throw the ball to your receiver in practice, recording each time whether or not they caught the ball. We’ll let C stand for catch and D stand for drop. We could carry out this procedure any number of times, for instance we could record the following ten pass attempts to our receiver:

sample(c("C", "D"), size = 10, replace = TRUE)
##  [1] "D" "D" "D" "D" "D" "C" "C" "D" "D" "C"

We can describe the story for this data generating process quite easily, informing us how to simulate new data. In this case:

1. Our receiver’s true catch rate is p.
2. Each single pass attempt has a probability p of being caught, thus meaning a pass has a probability of 1 - p of being dropped by our receiver.
3. We assume each pass attempt is independent of one another.

We’re now ready to explore this probability model and how to handle observing data to update our beliefs about our receiver’s catch rate p using Bayesian inference.

In a Bayesian model, there are three things we need to choose to ultimately represent the number of ways our data can be generated:

1. Likelihood - plausibility of our observed data given a receiver’s catch rate p
2. Parameter - our quantity of interest p, which we want to learn about from our data
3. Prior - our initial belief regarding different values for p

Based on the story above, it’s pretty easy to see that we can just use the binomial distribution for our likelihood since we’re assuming each of our n pass attempts is independent and that the receiver’s catch probability p is the same for every attempt. We can then write the probability of observing c receptions in n pass attempts with a catch probability p as:

$\text{Pr}(c | n, p) = \frac{n!}{c!(n - c)!}p^c (1 - p)^{n-c}$

We then use good old Bayes’ theorem to provide us with the probability of a value for the receiver’s catch probability p given our observed data:

$\text{Pr}(p | c) = \frac{\text{Pr}(c|p) \text{Pr}(p)}{\text{Pr}(c)}$

I personally like the way McElreath refers to the denonimator, $\text{Pr}(c)$, as the average likelihood of the observed data over p. All it is is an expectation over all our possible values for p:

$\text{Pr}(c) = \text{E}\big[\text{Pr}(c|p)\big] = \int \text{Pr}(c|p) \text{Pr}(p)dp$

This denonimator is known as the normalizing constant because it ensures that our posterior density integrates to one, i.e. our posterior is actually a probability. What often happens is that this term is extremely difficult to compute. We usually consider the posterior up to the normalizing constant as the product between the likelihood and prior. Or if this form isn’t tractable, we resort to some approximation technique such as grid-based which we cover below, as well as quadratic and Markov Chain Monte Carlo (MCMC) methods. However we’ll cover those in a post later, remember baby steps people!

One notion that I completely glossed over above is how McElreath motivates the mechanism for Bayesian analysis. He provides a walkthrough of the garden of forking data describing how, in formulating the posterior above, we’re really just counting paths. We simply use multiplication as a quick way to count all the ways from our prior number of paths through our new number. McElreath also makes the excellent remark that Bayesian inference is not defined by Bayes’ theorem. Everyone learns about Bayes theorem with some trivial examples like smoking and lung cancer, but that is not the point of Bayesian approaches. The point is to appropriately measure and account for the uncertainty in our models and parameters. We’re not satisfied with stating, “Oh our receiver caught 5 of 5 passes - he never drops a pass!” - we want to capture the uncertainty in his reception probability given what we already knew, such as the belief it’s pretty unlikely his catch rate is a perfect 100%.

## Grid approximation with JuJu

Even though from probability theory we know that every combination of prior, likelihood, and data constitutes a unique posterior - in many cases we cannot derive the resulting posterior. This leads to the use of approximation techniques such as grid approximation, quadratic approximation, and MCMC to name a few. In this post we’ll cover grid approximation because of its simplicity, which makes it a great way to learn the basics behind Bayesian updating.

Let’s expand on our catch rate model a bit by going back in time, and assume we’re working the Pittsburgh Steelers heading into the 2017 NFL season. We drafted JuJu Smith-Schuster to add another option to the offense and want to evaluate his performance as the season goes on based on his catch rate. Using the nflscrapR package we can easily get all plays from the 2017 season. You can use the season_play_by_play() function from the package to scrape the data, or you can access the files I saved here in my nflscrapR-data repository. The following code will access the 2017 play-by-play data, filter down only to the pass attempts to JuJu, then create a dataset that summarizes his performance in each game he played along with cumulative totals after each game:

# Access the tidyverse (install if you don't have it!)
# install.packages("tidyverse")
library(tidyverse)

# Load the 2017 play-by-play data from my repository:
juju_games <-
# Filter down only to the pass attempts to JuJu based on his GSIS ID 00-0033857:
PassAttempt == 1) %>%
# Only select the GameID and PassOutcome columns:
select(GameID, PassOutcome) %>%
# Calculate the number of receptions, targets, and catch rate in each game:
group_by(GameID) %>%
summarise(receptions = length(which(PassOutcome == "Complete")),
targets = n(),
catch_rate = receptions / targets) %>%
# Calculate cumulative stats:
mutate(total_receptions = cumsum(receptions),
total_targets = cumsum(targets),
total_catch_rate = total_receptions / total_targets,
# Columns to be used later:
index = 1:n(),
game_index = paste("game_", index, sep = ""),
game_index = fct_relevel(factor(game_index),
"game_1", "game_2", "game_3",
"game_4", "game_5", "game_6",
"game_7", "game_8", "game_9",
"game_10", "game_11", "game_12",
"game_13"))

Grid approximation provides us with a very simple set of steps to update our evaluation of JuJu’s performance. The first step is to initialize our grid of values for JuJu’s p we’re going to estimate the posterior probability for, which in this case will be increments of 5% from 0 to 1:

p_grid <- seq(from = 0, to = 1, by = .05)

Then we define our prior belief for each of the possible values in the grid. Just to demonstrate, we’ll use a flat prior which means we initially believe each of the grid values for p are equally likely. I expand on the prior below, but for now we use the following:

prior <- rep(1, 21)

Next we’ll calculate the likelihood for each of the grid values for p using the binomial distribution from above. We’ll calculate the likelihood as if only one game was played (just grabbing the receptions and targets values from the first row of the data):

likelihood <- dbinom(x = juju_games$receptions[1], size = juju_games$targets[1],
prob = p_grid)

We’re then able to calculate the numerator of Bayes’ theorem by multiplying the prior by the likelihood, providing the unstandardized posterior for each of the grid values:

bayes_numerator <- likelihood * prior

To arrive at the posterior estimates, we just follow Bayes’ theorem and take these products and divide them by the sum of the numerators for each value on the grid. This is of course easy to do with in R with vectorized operations:

posterior <- bayes_numerator / sum(bayes_numerator)

Our grid approximation for the posterior of JuJu’s catch rate after one game can easily be viewed:

data.frame(p_grid = p_grid, p_posterior = posterior) %>%
ggplot(aes(x = p_grid, y = p_posterior)) +
geom_point(size = 3, color = "darkblue") +
geom_line(color = "darkblue") +
# Add a vertical line for JuJu's observed catch rate:
geom_vline(xintercept = juju_games$catch_rate[1], color = "darkorange", linetype = "dashed", size = 3, alpha = .5) + # Label! labs(x = "Catch rate", y = "Posterior probability", title = "Posterior approximation for\nJuJu's catch rate after one game") + # Clean it up: theme_bw() + theme(axis.text = element_text(size = 10), title = element_text(size = 10))  This gives us a full posterior distribution for JuJu’s catch rate, rather than just his observed value of 0.75 after one game (indicated by the orange dashed line). We can also carry out this grid approximation procedure to generate posterior distributions for his catch rate after every game based on his running totals: # Create a data frame by applying the grid approximation steps to each row # of juju_games: game_posteriors <- map_dfc(c(1:nrow(juju_games)), function(x) { p_grid <- seq(from = 0, to = 1, by = .05) prior <- rep(1, 21) likelihood <- dbinom(x = juju_games$total_receptions[x],
size = juju_games$total_targets[x], prob = p_grid) bayes_numerator <- likelihood * prior posterior <- bayes_numerator / sum(bayes_numerator) # Return this as a data frame: result <- data.frame(posterior) colnames(result) <- paste("game_", x, sep = "") return(result) }) # Join these columns with p_grid and column for the prior probability: data.frame(p_grid = p_grid, prior = rep(1 / 21, 21)) %>% bind_cols(game_posteriors) %>% # Gather the columns so the data is long, one row for each week and grid value gather(key = "game_index", value = "posterior_prob", -p_grid) %>% # Relevel the game_index variable: mutate(game_index = fct_relevel(factor(game_index), "prior", "game_1", "game_2", "game_3", "game_4", "game_5", "game_6", "game_7", "game_8", "game_9", "game_10", "game_11", "game_12", "game_13")) %>% # Visualize the posteriors for each game: ggplot(aes(x = p_grid, y = posterior_prob)) + geom_point(size = 2, color = "darkblue") + geom_line(color = "darkblue") + facet_wrap(~ game_index) + # Add vertical lines for each cumulative observed rate geom_vline(data = juju_games, aes(xintercept = total_catch_rate), color = "darkorange", linetype = "dashed", size = 1, alpha = .5) + geom_text(data = juju_games, size = 3, x = .25, y = .3, aes(label = paste("Caught", receptions, "of", targets, sep = " "))) + # Label! labs(x = "Catch rate", y = "Posterior probability", title = "Posterior approximation for JuJu's catch rate after each game") + # Clean it up: theme_bw() + theme(axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 6), title = element_text(size = 10))  This plot gives us a pretty clear view of Bayesian updating. We start with a flat prior, believing any catch rate between 0 and 1 is equally likely. Then we keep updating our belief or understanding of what JuJu’s catch is likely to be given his in game performances. Each dashed line marks his cumulative catch rate which, should not come as a surprise, is the most likely value after each game. After his second game where he only caught two of six passes we see a shift in the distribution. But as we keep observing games, the distributions move and become more concentrated around his final season catch rate of roughly 0.73. This exact same process holds and yields the same results if we updated after each individual target, after observing one event the resulting posterior becomes the prior for the next target. Which means that, although I generated the plot above using the flat prior andcumulative stats after each game, I could’ve also generated the same figure using the previous game’s posterior as the prior for the following single game performance. It’s just simpler from a coding point of view to present all of the data, but breaking it up individually provides us with the true step-wise view of updating. ## Prior knowledge You’re probably thinking this example is dumb since I started with a flat prior. And you’re right! Why would anyone believe a receiver’s catch is equally likely to be 0 as is 1? You might’ve noticed the fact that the posteriors are even just proportional to the likelihood, so we’re really not taking advantage of the purpose of a prior distribution. We can use priors to incorporate expert knowledge into our model, helping us limit parameter values to a range that makes sense. If you have ever learned about regularized regression, you can think of priors as accomplishing a similar task. In fact Lasso regression can be intepreted this way. People often complain that the choice of prior is subjective, but if you think data analysis is entirely objective then you’re ignorant of the truth. To choose a reasonable prior for JuJu’s catch rate, we’re going to take a data-driven approach or use Empirical Bayes. David Robinson has an excellent series of posts and book on Empirical Bayes using baseball batting averages as an example. We’re going to use the same approach here but in the context of our football problem. We’ll estimate our prior using the distribution of catch rates by receivers in the 2016 season. With my nflWAR package it’s pretty easy to grab statistics by players for certain positions. Here we only grab catch rates by WRs in 2016 with at least 25 targets: # install.packages("devtools") # devtools::install_github("ryurko/nflWAR") library(nflWAR) # Ignore any messages you see wr_catch_rates <- get_pbp_data(2016) %>% add_positions(2016) %>% add_model_variables() %>% prepare_model_data() %>% add_position_tables() %>% join_position_statistics() %>% pluck("WR_table") %>% select(Player_ID_Name, Rec_Perc, Targets) %>% filter(Targets >= 25) Next we display our distribution of catch rates in the 2016 season: wr_catch_rates %>% ggplot(aes(x = Rec_Perc)) + geom_density(color = "darkblue") + geom_rug(color = "darkblue") + theme_bw() + labs(x = "Catch rate", y = "Density", title = "Distribution of WR catch rates in 2016 (minimum 25 targets)") We see concentration of catch rates between 0.5 and 0.6 with no receiver having a catch rate less than 0.4 or above 0.8. Because of the shape of this distribution and from the fact a receiver’s catch rate must be between 0 and 1, it makes sense to use a beta distribution estimated from this data as our prior. This means we’re assuming $p \sim \text{Beta}(\alpha_0, \beta_0)$ where we have to estimate the hyper-parameters (parameters for priors), $\alpha_0$ and $\beta_0$, from our data. A very simple way to do this in R is with the method of moments using the mean and variance following the example in Robinson’s post: # Use the fitdistr function from the MASS library: prior_beta_model <- MASS::fitdistr(wr_catch_rates$Rec_Perc, dbeta,
start = list(shape1 = 10, shape2 = 10))

# Extract the approximated  priors
alpha0 <- prior_beta_model$estimate[1] beta0 <- prior_beta_model$estimate[2]

We’ll now follow our grid approximation steps from before but instead use the approximated beta prior:

# Create a data frame by applying the grid approximation steps to each row
# of juju_games:
new_game_posteriors <- map_dfc(c(1:nrow(juju_games)),
function(x) {
p_grid <- seq(from = 0, to = 1, by = .05)
prior <- dbeta(p_grid, alpha0, beta0)
likelihood <- dbinom(x = juju_games$total_receptions[x], size = juju_games$total_targets[x],
prob = p_grid)
bayes_numerator <- likelihood * prior
posterior <- bayes_numerator / sum(bayes_numerator)
# Return this as a data frame:
result <- data.frame(posterior)
colnames(result) <- paste("game_", x, sep = "")
return(result)
})

# Join these columns with p_grid and column for the prior probability:
data.frame(p_grid = p_grid,
prior = dbeta(p_grid, alpha0, beta0) / sum(dbeta(p_grid, alpha0, beta0))) %>%
bind_cols(new_game_posteriors) %>%
# Gather the columns so the data is long, one row for each week and grid value
gather(key = "game_index", value = "posterior_prob", -p_grid) %>%
# Relevel the game_index variable:
mutate(game_index = fct_relevel(factor(game_index),
"prior", "game_1", "game_2", "game_3",
"game_4", "game_5", "game_6",
"game_7", "game_8", "game_9",
"game_10", "game_11", "game_12",
"game_13")) %>%
# Visualize the posteriors for each game:
ggplot(aes(x = p_grid, y = posterior_prob)) +
geom_point(size = 2, color = "darkblue") +
geom_line(color = "darkblue") +
facet_wrap(~ game_index) +
# Add vertical lines for each cumulative observed rate
geom_vline(data = juju_games,
aes(xintercept = total_catch_rate), color = "darkorange",
linetype = "dashed", size = 1, alpha = .5) +
geom_text(data = juju_games, size = 3,
x = .25, y = .3, aes(label = paste("Caught",
receptions, "of",
targets, sep = " "))) +
# Label!
labs(x = "Catch rate", y = "Posterior probability",
title = "Posterior approximation for JuJu's catch rate after each game") +
# Clean it up:
theme_bw() + theme(axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 6),
title = element_text(size = 10)) 

Now with this informed prior we see a clear difference in the updating procedure compared to before. With the flat prior the catch rate with the peak posterior probability was always JuJu’s cumulative catch rate after the game. Now we see how the prior distribution applies some resistance to the approximated posterior. For instance, after his second game where he only caught two of six targets, JuJu’s catch rate dropped to 0.5 but the posterior doesn’t shift as far to the left because of the prior pulling it back. This process continues the whole way, with the most likely value after the last game slightly less than JuJu’s catch rate after the whole season - hence providing some regularization for our likely value of p.

All I’ve talked about so far is using grid-based approach to reach this posterior approximation. But we actually know the analytical calculation2 for this example’s posterior since the beta distribution is the conjugate prior to the binomial resulting in a posterior that also follows the beta distribution (not the focus of this post but here’s more info). We can calculate this known result quite easily, displaying the density curve using JuJu’s full season statistics in comparison to the final grid approximation:

# Compuate the known posterior for 1000 points from 0 to 1:
known_posterior <- data.frame(p_density = dbeta(seq(0, 1, length.out = 1000),
# Add receptions to prior alpha
juju_games\$total_receptions[nrow(juju_games)] + alpha0,
# Add incompletions to prior beta
with(juju_games,
(total_targets[nrow(juju_games)] -
total_receptions[nrow(juju_games)]) + beta0)),
p_grid = seq(0, 1, length.out = 1000)) %>%
ggplot(aes(x = p_grid, y = p_density)) +
geom_line(color = "darkorange") +
# Label!
labs(x = "Catch rate", y = "Posterior density",
title = "Known posterior distribution using beta prior") +
# Clean it up:
theme_bw() + theme(axis.text = element_text(size = 10),
title = element_text(size = 10))

grid_posterior <- new_game_posteriors %>%
select(game_13) %>%
bind_cols(data.frame(p_grid = p_grid)) %>%
ggplot(aes(x = p_grid, y = game_13)) +
geom_point(size = 2, color = "darkblue") +
geom_line(color = "darkblue") +
# Label!
labs(x = "Catch rate", y = "Posterior probability",
title = "Grid approximation") +
# Clean it up:
theme_bw() + theme(axis.text = element_text(size = 10),
title = element_text(size = 10))

# Install cowplot if you don't have it!
# install.packages("cowplot")
cowplot::plot_grid(known_posterior, grid_posterior)

We see that our grid approximation with twenty points does a pretty good job of capturing the posterior distribution, and if we keep increasing the number of points in the grid it would follow this posterior exactly. While this is simple and incredibly easy for this example, when you are fitting a multilevel model with many parameters this just isn’t practical. Hence the need for more efficient approaches like quadratic approximation or MCMC. But it’s a great way to learn how Bayesian updating works.

## Next steps

The basic mechanics of Bayesian inference are intuitive and I hope this first post keeps you interested in learning more. We’re estimating full posterior distributions for our values of interest, not just point estimates. This means we’re always acknowledging the uncertainty in our values, something that is often ignored in sports statistics in particular. Obviously, this catch rate example is too simplistic - we’re ignoring who is throwing the ball, the opposing defense, game situation, and also are not distinguishing dropped balls from bad throws3. We’ll keep working with this example as we move towards a full Bayesian multilevel model to account for what we can (Roger Goodell doesn’t like to share data). You can take the exact same approach in this example for any binary outcome: catch/drop, win/lose, success/failure. Someone could easily take this code above and update their belief regarding a running back’s Success Rate (% rushes with positive expected points added) as the season progresses using nflscrapR.

Next post in this series will either be about sampling from the posterior or quadratic approximation and Bayesian regression. We’ll get to the fun stuff like MCMC and multilevel models eventually, but it’s better to build our understanding first.

## References

1. My first exposure to multilevel models was reading Jonathan Judge’s excellent work at Baseball Prospectus about their catcher framing model.

2. “Analytical” is just a fancy way to say it follows from real math with proofs.

3. If only there was some technology to track ball location relative to the receiver and account for this distance in a catch probability model… oh wait, there is…