## First steps

I was originally thinking of writing a blog post about multilevel models (aka
hierachical, mixed, random effects) because of how useful they are for measuring
player performance in sports^{1} (shameless self promotion for nflWAR here!).
But the more I thought about it, the more I realized how ill-minded of an idea
that was. Instead, I want to build up the intuition for how and why one would
want to use a full Bayesian multilevel model. My brother recommended that I checkout Statistical Rethinking
by Richard McElreath. It’s an amazing introduction to
Bayesian data analysis that I recommend to anyone interested in learning more.
It’s incredibly intuitive and paced very well, for instance MCMC isn’t covered
until halfway through the book so you understand why you’re using it.
Taking what I have learned from the book, I’m writing a series of posts that will
hopefully build the basic understanding of how Bayesian inference works starting
with simple updating and gradually working up towards multilevel modeling. This
series is called Bayesian Baby Steps for a reason, and there will be *What About Bob?*
GIFs. Because of my work with `nflscrapR`

,
all my examples will be using NFL data which is ripe for Bayesian data analysis.

## Catch rate example

You’re trying to evaluate a receiver’s ability to catch a football. Let’s pretend you can take the following (completely unrealistic) strategy: you tell your quarterback to repeatedly throw the ball to your receiver in practice, recording each time whether or not they caught the ball. We’ll let C stand for catch and D stand for drop. We could carry out this procedure any number of times, for instance we could record the following ten pass attempts to our receiver:

`sample(c("C", "D"), size = 10, replace = TRUE)`

`## [1] "C" "D" "C" "C" "C" "C" "C" "D" "C" "D"`

We can describe the *story* for this *data generating process* quite easily,
informing us how to simulate new data. In this case:

- Our receiver’s true catch rate is
`p`

. - Each single pass attempt has a probability
`p`

of being caught, thus meaning a pass has a probability of`1 - p`

of being dropped by our receiver. - We assume each pass attempt is independent of one another.

We’re now ready to explore this probability model and how to handle observing
data to update our beliefs about our receiver’s catch rate `p`

using Bayesian
inference.

In a Bayesian model, there are three things we need to choose to ultimately represent the number of ways our data can be generated:

- Likelihood - plausibility of our observed data given a receiver’s catch rate
`p`

- Parameter - our quantity of interest
`p`

, which we want to learn about from our data - Prior - our initial belief regarding different values for
`p`

Based on the *story* above, it’s pretty easy to see that we can just use the
binomial distribution
for our likelihood since we’re assuming each of our `n`

pass attempts is
independent and that the receiver’s catch probability `p`

is the same for every
attempt. We can then write the probability of observing `c`

receptions in
`n`

pass attempts with a catch probability `p`

as:

\[ \text{Pr}(c | n, p) = \frac{n!}{c!(n - c)!}p^c (1 - p)^{n-c} \]

We then use good old Bayes’ theorem
to provide us with the probability of a value for the receiver’s catch
probability `p`

given our observed data:

\[ \text{Pr}(p | c) = \frac{\text{Pr}(c|p) \text{Pr}(p)}{\text{Pr}(c)} \]

I personally like the way McElreath refers to the denonimator,
\(\text{Pr}(c)\), as the *average* likelihood of the observed data over `p`

. All
it is is an expectation over all our possible values for `p`

:

\[ \text{Pr}(c) = \text{E}\big[\text{Pr}(c|p)\big] = \int \text{Pr}(c|p) \text{Pr}(p)dp \]

This denonimator is known as the *normalizing constant*
because it ensures that our posterior density integrates to one, i.e. our
posterior is actually a probability. What often happens is that this
term is extremely difficult to compute. We usually consider the
posterior up to the normalizing constant as the product between the likelihood
and prior. Or if this form isn’t tractable, we resort to some approximation
technique such as grid-based which we cover below, as well as quadratic and Markov
Chain Monte Carlo (MCMC) methods. However we’ll cover those in a post later,
remember baby steps people!

One notion that I completely glossed over above is how McElreath motivates the
mechanism for Bayesian analysis. He provides a walkthrough of the *garden of
forking data* describing how, in formulating the posterior above, we’re really
just counting paths. We simply use multiplication as a quick way to count all
the ways from our prior number of paths through our new number. McElreath also
makes the excellent remark that **Bayesian inference is not defined by Bayes’
theorem**. Everyone learns about Bayes theorem with some trivial examples like
smoking and lung cancer,
but that is not the point of Bayesian approaches. The point is to **appropriately
measure and account for the uncertainty in our models and parameters**. We’re not
satisfied with stating, “Oh our receiver caught 5 of 5 passes - he never drops
a pass!” - we want to capture the uncertainty in his reception probability
given what we already knew, such as the belief it’s pretty unlikely his catch
rate is a perfect 100%.

## Grid approximation with JuJu

Even though from probability theory we know that every combination of prior, likelihood, and data constitutes a unique posterior - in many cases we cannot derive the resulting posterior. This leads to the use of approximation techniques such as grid approximation, quadratic approximation, and MCMC to name a few. In this post we’ll cover grid approximation because of its simplicity, which makes it a great way to learn the basics behind Bayesian updating.

Let’s expand on our catch rate model a bit by going back in time, and assume
we’re working the Pittsburgh Steelers heading into the 2017 NFL season. We
drafted JuJu Smith-Schuster to add another option to the offense and want to
evaluate his performance as the season goes on based on his catch rate.
Using the `nflscrapR`

package we
can easily get all plays from the 2017 season. You can use the `season_play_by_play()`

function from the package to scrape the data, or you can access the files I
saved here in my `nflscrapR`

-data repository.
The following code will access the 2017 play-by-play data, filter down only
to the pass attempts to JuJu, then create a dataset that summarizes his
performance in each game he played along with cumulative totals after each game:

```
# Access the tidyverse (install if you don't have it!)
# install.packages("tidyverse")
library(tidyverse)
# Load the 2017 play-by-play data from my repository:
juju_games <-
read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_play_by_play/pbp_2017.csv") %>%
# Filter down only to the pass attempts to JuJu based on his GSIS ID 00-0033857:
filter(Receiver_ID == "00-0033857",
PassAttempt == 1) %>%
# Only select the GameID and PassOutcome columns:
select(GameID, PassOutcome) %>%
# Calculate the number of receptions, targets, and catch rate in each game:
group_by(GameID) %>%
summarise(receptions = length(which(PassOutcome == "Complete")),
targets = n(),
catch_rate = receptions / targets) %>%
# Calculate cumulative stats:
mutate(total_receptions = cumsum(receptions),
total_targets = cumsum(targets),
total_catch_rate = total_receptions / total_targets,
# Columns to be used later:
index = 1:n(),
game_index = paste("game_", index, sep = ""),
game_index = fct_relevel(factor(game_index),
"game_1", "game_2", "game_3",
"game_4", "game_5", "game_6",
"game_7", "game_8", "game_9",
"game_10", "game_11", "game_12",
"game_13"))
```

Grid approximation provides us with a very simple set of steps to update our
evaluation of JuJu’s performance. The first step is to initialize our grid of
values for JuJu’s `p`

we’re going to estimate the posterior probability for,
which in this case will be increments of 5% from 0 to 1:

`p_grid <- seq(from = 0, to = 1, by = .05)`

Then we define our prior belief for each of the possible values in the grid.
Just to demonstrate, we’ll use a flat prior which means we initially believe
each of the grid values for `p`

are equally likely. I expand on the prior below,
but for now we use the following:

`prior <- rep(1, 21)`

Next we’ll calculate the likelihood for each of the grid values for `p`

using
the binomial distribution from above. We’ll calculate the likelihood as if only
one game was played (just grabbing the receptions and targets values from the
first row of the data):

```
likelihood <- dbinom(x = juju_games$receptions[1],
size = juju_games$targets[1],
prob = p_grid)
```

We’re then able to calculate the numerator of Bayes’ theorem by multiplying the prior by the likelihood, providing the unstandardized posterior for each of the grid values:

`bayes_numerator <- likelihood * prior`

To arrive at the posterior estimates, we just follow Bayes’ theorem and take
these products and divide them by the sum of the numerators for each value on
the grid. This is of course easy to do with in `R`

with vectorized operations:

`posterior <- bayes_numerator / sum(bayes_numerator)`

Our grid approximation for the posterior of JuJu’s catch rate after one game can easily be viewed:

```
data.frame(p_grid = p_grid, p_posterior = posterior) %>%
ggplot(aes(x = p_grid, y = p_posterior)) +
geom_point(size = 3, color = "darkblue") +
geom_line(color = "darkblue") +
# Add a vertical line for JuJu's observed catch rate:
geom_vline(xintercept = juju_games$catch_rate[1], color = "darkorange",
linetype = "dashed", size = 3, alpha = .5) +
# Label!
labs(x = "Catch rate", y = "Posterior probability",
title = "Posterior approximation for\nJuJu's catch rate after one game") +
# Clean it up:
theme_bw() + theme(axis.text = element_text(size = 10),
title = element_text(size = 10))
```

This gives us a full posterior distribution for JuJu’s catch rate, rather than just his observed value of 0.75 after one game (indicated by the orange dashed line). We can also carry out this grid approximation procedure to generate posterior distributions for his catch rate after every game based on his running totals:

```
# Create a data frame by applying the grid approximation steps to each row
# of juju_games:
game_posteriors <- map_dfc(c(1:nrow(juju_games)),
function(x) {
p_grid <- seq(from = 0, to = 1, by = .05)
prior <- rep(1, 21)
likelihood <- dbinom(x = juju_games$total_receptions[x],
size = juju_games$total_targets[x],
prob = p_grid)
bayes_numerator <- likelihood * prior
posterior <- bayes_numerator / sum(bayes_numerator)
# Return this as a data frame:
result <- data.frame(posterior)
colnames(result) <- paste("game_", x, sep = "")
return(result)
})
# Join these columns with p_grid and column for the prior probability:
data.frame(p_grid = p_grid, prior = rep(1 / 21, 21)) %>%
bind_cols(game_posteriors) %>%
# Gather the columns so the data is long, one row for each week and grid value
gather(key = "game_index", value = "posterior_prob", -p_grid) %>%
# Relevel the game_index variable:
mutate(game_index = fct_relevel(factor(game_index),
"prior", "game_1", "game_2", "game_3",
"game_4", "game_5", "game_6",
"game_7", "game_8", "game_9",
"game_10", "game_11", "game_12",
"game_13")) %>%
# Visualize the posteriors for each game:
ggplot(aes(x = p_grid, y = posterior_prob)) +
geom_point(size = 2, color = "darkblue") +
geom_line(color = "darkblue") +
facet_wrap(~ game_index) +
# Add vertical lines for each cumulative observed rate
geom_vline(data = juju_games,
aes(xintercept = total_catch_rate), color = "darkorange",
linetype = "dashed", size = 1, alpha = .5) +
geom_text(data = juju_games, size = 3,
x = .25, y = .3, aes(label = paste("Caught",
receptions, "of",
targets, sep = " "))) +
# Label!
labs(x = "Catch rate", y = "Posterior probability",
title = "Posterior approximation for JuJu's catch rate after each game") +
# Clean it up:
theme_bw() + theme(axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 6),
title = element_text(size = 10))
```

This plot gives us a pretty clear view of Bayesian updating. We start with a flat prior, believing any catch rate between 0 and 1 is equally likely. Then we keep updating our belief or understanding of what JuJu’s catch is likely to be given his in game performances. Each dashed line marks his cumulative catch rate which, should not come as a surprise, is the most likely value after each game. After his second game where he only caught two of six passes we see a shift in the distribution. But as we keep observing games, the distributions move and become more concentrated around his final season catch rate of roughly 0.73. We could do this same process at the individual catch level, and although I generated the plot above using the flat prior and cumulative stats after each game I could’ve also generated the same figure using the previous game’s posterior as the prior for the following single game performance. It’s just simpler to present all of the data however, but breaking it up this way provides us with the step-wise view of updating.

## Prior knowledge

You’re probably thinking this example is dumb since I started with a flat prior. And you’re right! Why would anyone believe a receiver’s catch is equally likely to be 0 as is 1? You might’ve noticed the fact that the posteriors are even just proportional to the likelihood, so we’re really not taking advantage of the purpose of a prior distribution. We can use priors to incorporate expert knowledge into our model, helping us limit parameter values to a range that makes sense. If you have ever learned about regularized regression, you can think of priors as accomplishing a similar task. In fact Lasso regression can be intepreted this way. People often complain that the choice of prior is subjective, but if you think data analysis is entirely objective then you’re ignorant of the truth.

To choose a reasonable prior for JuJu’s catch rate, we’re going to take a data-driven
approach or use Empirical Bayes.
David Robinson has an excellent series of posts
and book
on Empirical Bayes using baseball batting averages as an example. We’re going to
use the same approach here but in the context of our football problem. We’ll estimate our
prior using the distribution of catch rates by receivers in the 2016 season.
With my `nflWAR`

package it’s pretty
easy to grab statistics by players for certain positions. Here
we only grab catch rates by WRs in 2016 with at least 25 targets:

```
# install.packages("devtools")
# devtools::install_github("ryurko/nflWAR")
library(nflWAR)
# Ignore any messages you see
wr_catch_rates <- get_pbp_data(2016) %>%
add_positions(2016) %>%
add_model_variables() %>%
prepare_model_data() %>%
add_position_tables() %>%
join_position_statistics() %>%
pluck("WR_table") %>%
select(Player_ID_Name, Rec_Perc, Targets) %>%
filter(Targets >= 25)
```

Next we display our distribution of catch rates in the 2016 season:

```
wr_catch_rates %>%
ggplot(aes(x = Rec_Perc)) +
geom_density(color = "darkblue") +
geom_rug(color = "darkblue") +
theme_bw() +
labs(x = "Catch rate", y = "Density",
title = "Distribution of WR catch rates in 2016 (minimum 25 targets)")
```

We see concentration of catch rates between 0.5 and 0.6 with no receiver having a catch rate less than 0.4 or above 0.8. Because of the shape of this distribution and from the fact a receiver’s catch rate must be between 0 and 1, it makes sense to use a beta distribution estimated from this data as our prior. This means we’re assuming

\[ p \sim \text{Beta}(\alpha_0, \beta_0) \]

where we have to estimate the hyper-parameters (parameters for priors), \(\alpha_0\)
and \(\beta_0\), from our data. A very simple way to do this in `R`

is with the
method of moments using the mean and variance following the example in Robinson’s post:

```
# Use the fitdistr function from the MASS library:
prior_beta_model <- MASS::fitdistr(wr_catch_rates$Rec_Perc, dbeta,
start = list(shape1 = 10, shape2 = 10))
# Extract the approximated priors
alpha0 <- prior_beta_model$estimate[1]
beta0 <- prior_beta_model$estimate[2]
```

We’ll now follow our grid approximation steps from before but instead use the approximated beta prior:

```
# Create a data frame by applying the grid approximation steps to each row
# of juju_games:
new_game_posteriors <- map_dfc(c(1:nrow(juju_games)),
function(x) {
p_grid <- seq(from = 0, to = 1, by = .05)
prior <- dbeta(p_grid, alpha0, beta0)
likelihood <- dbinom(x = juju_games$total_receptions[x],
size = juju_games$total_targets[x],
prob = p_grid)
bayes_numerator <- likelihood * prior
posterior <- bayes_numerator / sum(bayes_numerator)
# Return this as a data frame:
result <- data.frame(posterior)
colnames(result) <- paste("game_", x, sep = "")
return(result)
})
# Join these columns with p_grid and column for the prior probability:
data.frame(p_grid = p_grid,
prior = dbeta(p_grid, alpha0, beta0) / sum(dbeta(p_grid, alpha0, beta0))) %>%
bind_cols(new_game_posteriors) %>%
# Gather the columns so the data is long, one row for each week and grid value
gather(key = "game_index", value = "posterior_prob", -p_grid) %>%
# Relevel the game_index variable:
mutate(game_index = fct_relevel(factor(game_index),
"prior", "game_1", "game_2", "game_3",
"game_4", "game_5", "game_6",
"game_7", "game_8", "game_9",
"game_10", "game_11", "game_12",
"game_13")) %>%
# Visualize the posteriors for each game:
ggplot(aes(x = p_grid, y = posterior_prob)) +
geom_point(size = 2, color = "darkblue") +
geom_line(color = "darkblue") +
facet_wrap(~ game_index) +
# Add vertical lines for each cumulative observed rate
geom_vline(data = juju_games,
aes(xintercept = total_catch_rate), color = "darkorange",
linetype = "dashed", size = 1, alpha = .5) +
geom_text(data = juju_games, size = 3,
x = .25, y = .3, aes(label = paste("Caught",
receptions, "of",
targets, sep = " "))) +
# Label!
labs(x = "Catch rate", y = "Posterior probability",
title = "Posterior approximation for JuJu's catch rate after each game") +
# Clean it up:
theme_bw() + theme(axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 6),
title = element_text(size = 10))
```

Now with this informed prior we see a clear difference in the updating procedure
compared to before. With the flat prior the catch rate with the peak posterior
probability was always JuJu’s cumulative catch rate after the game. Now we see
how the prior distribution applies some resistance to the approximated posterior.
For instance, after his second game where he only caught two of six targets,
JuJu’s catch rate dropped to 0.5 but the posterior doesn’t shift as far to the
left because of the prior pulling it back. This process continues the whole way,
with the most likely value after the last game slightly less than JuJu’s catch
rate after the whole season - hence providing some regularization for our
likely value of `p`

.

All I’ve talked about so far is using grid-based approach to reach this posterior
approximation. But we actually know the analytical calculation^{2} for this example’s posterior
since the beta distribution is the conjugate prior to the binomial resulting
in a posterior that also follows the beta distribution (not the
focus of this post but here’s more info).
We can calculate this known result quite easily, displaying the density curve
using JuJu’s full season statistics in comparison to the final grid approximation:

```
# Compuate the known posterior for 1000 points from 0 to 1:
known_posterior <- data.frame(p_density = dbeta(seq(0, 1, length.out = 1000),
# Add receptions to prior alpha
juju_games$total_receptions[nrow(juju_games)] + alpha0,
# Add incompletions to prior beta
with(juju_games,
(total_targets[nrow(juju_games)] -
total_receptions[nrow(juju_games)]) + beta0)),
p_grid = seq(0, 1, length.out = 1000)) %>%
ggplot(aes(x = p_grid, y = p_density)) +
geom_line(color = "darkorange") +
# Label!
labs(x = "Catch rate", y = "Posterior density",
title = "Known posterior distribution using beta prior") +
# Clean it up:
theme_bw() + theme(axis.text = element_text(size = 10),
title = element_text(size = 10))
grid_posterior <- new_game_posteriors %>%
select(game_13) %>%
bind_cols(data.frame(p_grid = p_grid)) %>%
ggplot(aes(x = p_grid, y = game_13)) +
geom_point(size = 2, color = "darkblue") +
geom_line(color = "darkblue") +
# Label!
labs(x = "Catch rate", y = "Posterior probability",
title = "Grid approximation") +
# Clean it up:
theme_bw() + theme(axis.text = element_text(size = 10),
title = element_text(size = 10))
# Install cowplot if you don't have it!
# install.packages("cowplot")
cowplot::plot_grid(known_posterior, grid_posterior)
```

We see that our grid approximation with twenty points does a pretty good job of capturing the posterior distribution, and if we keep increasing the number of points in the grid it would follow this posterior exactly. While this is simple and incredibly easy for this example, when you are fitting a multilevel model with many parameters this just isn’t practical. Hence the need for more efficient approaches like quadratic approximation or MCMC. But it’s a great way to learn how Bayesian updating works.

## Next steps

The basic mechanics of Bayesian inference are intuitive and I hope this first
post keeps you interested in learning more. We’re estimating full posterior
distributions for our values of interest, not just point estimates. This means
we’re always acknowledging the uncertainty in our values, something that is
often ignored sports statistics in particular. Obviously, this catch rate
example is too simplistic - we’re ignoring who is throwing the ball, the
opposing defense, game situation, and also are not distinguishing dropped
balls from bad throws^{3}. We’ll keep working with this example as we move towards
a full Bayesian multilevel model to account for what we can (Roger Goodell doesn’t
like to share data). You can take the exact same approach in this example for
any binary outcome: catch/drop, win/lose, success/failure. Someone could easily
take this code above and update their belief regarding a running back’s
Success Rate (% rushes with positive expected points added) as the season
progresses using `nflscrapR`

.

Next post in this series will either be about sampling from the posterior or quadratic approximation and Bayesian regression. We’ll get to the fun stuff like MCMC and multilevel models eventually, but it’s better to build our understanding first.

## References

- Richard McElreath’s Statistical Rethinking
- David Robinson’s blog posts on Empirical Bayes

My first exposure to multilevel models was reading Jonathan Judge’s excellent work at Baseball Prospectus about their catcher framing model.↩

“Analytical” is just a fancy way to say it follows from

*real*math with proofs.↩If only there was some technology to track ball location relative to the receiver and account for this distance in a catch probability model… oh wait, there is…↩