Bayesian Baby Steps: Normal Next Steps

Enter marquis de Laplace

In my first post on Bayesian data analysis, I did a brief overview of how Bayesian updating works using grid approximation to arrive at posterior distributions for our parameters of interest, such as a wide receiver’s catch rate. While the grid-based approach is simple and easy to follow, it’s just not practical. Before we turn to MCMC, in this post we’ll cover the popular approach known as Laplace approximation1, aka quadratic approximation.

As I said before, in many cases we’re not able to derive the posterior distribution so some sort of approximation method is required. The one and only Pierre-Simon Laplace2 had the realization you can use a Gaussian (normal) distribution (good old bell curve) for approximating the posterior if its roughly symmetric and unimodal. One of the great benefits of working with the Gaussian distribution is that you only need two numbers to describe it: the mean \(\mu\) for the center, and the variance \(\sigma^2\) for the spread.

The reason it’s often called quadratic approximation is because we use a quadratic function for approximating the logarithm of the posterior density, which we’ll denote as \(p(\theta |x)\) where \(x\) is our observed data and \(\theta\) represents our parameter(s) of interest. This approximation uses the fact that the log of a Gaussian is a parabola - a quadratic function. You simply take a Taylor series expansion of the log of the posterior density centered at the posterior mode \(\hat{\theta}\),

\[ \text{log}\ p(\theta|x) = \text{log}\ p(\hat{\theta}|x) + \frac{1}{2}(\theta - \hat{\theta})^T \Big[ \frac{d^2}{d\theta^2} \text{log} p(\theta | x) \Big]_{\theta = \hat{\theta}}(\theta - \hat{\theta}) +\ ... \]

Let me break this down. The first thing you might be thinking is, “Wait how do we find the posterior mode???” - well it’s actually pretty simple with a standard optimization procedure (as you’ll see below). This mode represents the center of the Gaussian that we’ll use to approximate the posterior. The posterior mode is also referred to by its Latin name3 maximum a-posteriori, denoted by \(\hat{\theta}_{MAP}\). The first term in the expansion above is merely a constant, but the second term provides us an estimate of the curvature near the posterior’s peak and is proportional to the log of a Gaussian density. This provides us with the variance of the Gaussian to approximate the posterior as:

\[ p(\theta|x) \approx \text{N}\big( \hat{\theta}, [I(\hat{\theta})]^{-1} \big) \]

where \(I(\hat{\theta})\) is an estimate for \(I(\theta)\), the observed information,

\[ I(\theta) = -\frac{d^2}{d\theta^2} \text{log}\ p(\theta|x). \]

That’s all we need for this approximation, the posterior mode and the curvature of the posterior density and BAM we have the entire posterior distribution. Well really an approximation of it - but from asymptotic theory the posterior distribution approaches a Gaussian under some conditions. I won’t get into the details, so definitely check out more of the theory yourself. I also recommend chapter four of the classic Bayesian Data Analysis by Gelman et al.

The excellent blog by Rasmus Bååth also has a post on Laplace approximation with a binomial example, like the catch rate model in the previous post, showing how the approximation using a Gaussian improves with more data despite the fact the posterior is actually a beta! I’ll use a different example for the rest of this post and demonstrate the usage of quadratic approximation for Bayesian linear regression.

Modeling NFL score differential

Keeping with my theme of working with American football data4, I’ll demonstrate the use of quadratic approximation with a model for an individual team’s score differential. The first step is to gather the score differential for each team in each season, which can be done using the nflscrapR package. An easier route however is to access the data from my nflscrapR-data repository. Rather than scavenging through those files, in my nflWAR package there’s a function called get_season_summary() that takes in a vector of years and returns a data frame with rows for each team’s individual season, and has columns for their number of wins and total score differential. The code chunk below demonstrates how to use gather this data for NFL seasons from 2009 to 2017:

# Load the tidyverse
# install.packages("tidyverse")
library(tidyverse)

# Use the nflWAR package to get summaries of team performances for each season,
# first install devtools if you don't have then install nflWAR:
# install.packages("devtools")
# devtools::install_github("ryurko/nflWAR")
library(nflWAR)

# Use the get_season_summary() function to create a dataset that has a row
# for team-season with a column for their score differential from 2009 to 2017:
team_summary_df <- get_season_summary(2009:2017)

Before going into the regression example with a predictor, it’s worthwhile to first demonstrate quadratic approximation by just modeling the score differential with a Gaussian. First step is to visualize the score differential distribution:

# Create a histogram displaying the score differential distribution using
# the the density instead so it's consistent for later: (also why I'm assigning
# it to a variable name, so we can add our approximated posterior layer later)
score_diff_hist <- team_summary_df %>%
  ggplot(aes(x = Total_Score_Diff)) +
  geom_histogram(aes(y = ..density..), bins = 20) + theme_bw() + 
  scale_x_continuous(limits = c(-300, 300)) +
  # Always label:
  labs(x = "Score differential", y = "Density",
       title = "Distribution of individual team score differential in each season from 2009-17 ",
       caption = "Data accessed with nflscrapR") +
  # Just some cosmetic settings: (really should do a custom theme at the 
  # beginning instead of repeating this)
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 14))

# Display the plot
score_diff_hist

From this histogram we have support for using a Gaussian, since it’s roughly symmetric and unimodal. Also, since an observed season score differential is the sum of the score differentials in each game, we can think of these values as following a random walk. Some teams will have large positive differentials, which means some team will have large negative differentials. I won’t go into the details but theoretically this process of summing together random values, assumed to be generated from the same distribution, will follow a Gaussian distribution. In other words - the Gaussian distribution is a reasonable choice here.

We assume then that the general model for an observed score differential \(S_i\), where \(i = 1,...,\) 288 is the index for an observed team-season score differential, follows a Gaussian distribution:

\[ S_i \sim \text{N}(\mu, \sigma^2) \]

Because we’re approaching this problem from a Bayesian perspective, we really want to consider an infinite number of possible Guassians based on all of the possible combinations for our parameters \(\mu\) and \(\sigma\). Of course we’re not going to actually look at every possible value for these parameters, laplace approximation will accomplish what we need. But, as I pointed out in the previous post, we’re interested in the entire posterior distribution - which in this case is a posterior distribution of distributions. So for a full model of score differential, we need priors for both \(\mu\) and \(\sigma\):

\[ S_i \sim \text{N}(\mu, \sigma^2) \\ \mu \sim \text{N}(0, 100^2) \\ \sigma \sim \text{Uniform}(0,200) \]

This is a really naive model, just to demonstrate Laplace approximation. Next we’ll implement the same code in the Bååth blog post which uses the optim function in R to find the mode of the posterior and also returns the Hessian matrix for the curvature (following the observed information definition above. For ease, we’ll roughly follow his example code which also uses the mvtnorm packages for sampling from the approximated posterior (this sampling isn’t really necessary but it’s good practice to start sampling from the posterior). We’re working with two parameters here, so the resulting posterior approximation is a multivariate Gaussian, hence why we use the rmvnorm() function since we’re not only working with each parameter’s mean and variance, but their covariance as well (in this example the covariance is essentially 0 but that won’t typically be the case).

# install.packages("mvtnorm")

#' Function to perform simple Laplace approximation
#' @param log_posterior_n Function that returns the log posterior given a vector
#' of parameter values and observed data.
#' @param init_points Vector of starting values for parameters.
#' @param n_samples Number of samples to draw from resulting approximated 
#' posterior distribution for then visualizing.
#' @param ... Any additional arguments passed to the log_posterior_fun during 
#' the optimization.
#' @return Samples from approximated posterior distribution.
laplace_approx <- function(log_posterior_fn, init_points, n_samples, ...) {
  
  # Use the optim function with the input starting points and function to
  # evaluate the log posterior to find the mode and curvature with hessian = TRUE
  # (note that using fnscale = -1 means to maximize). We use the ... here to 
  # allow for flexible name of the data in log_posterior_fn:
  fit <- optim(init_points, log_posterior_fn, 
               # Use the same quasi-Newton optimization method as McElreath's
               # map function in his rethinking package:
               method = "BFGS",
               # using fnscale = -1 means to maximize
               control = list(fnscale = -1), 
               # need the hessian for the curvature
               hessian = TRUE, 
               # additional arguments for the function we're optimizing
               ...)
  
  # Store the mean values for the parameters and curvature to then generate
  # samples from the approximated normal posterior given the number of samples
  # for both of the parameters, returning as a data frame:
  param_mean <- fit$par
  # inverse of the negative hessian for the covariance - same as the use 
  # of the observed information in the intro
  param_cov_mat <- solve(-fit$hessian)
  # sample from the resulting joint posterior to get posterior distributions
  # for each parameter:
  mvtnorm::rmvnorm(n_samples, param_mean, param_cov_mat) %>%
    data.frame()
}

#' Function to calculate log posterior for simple score differential model
#' @param params Vector of parameter values that are named "mu" and "sigma".
#' @param score_diff_values Vector of observed score differential values
score_diff_model <- function(params, score_diff_values) {
  
  # Log likelihood:
  sum(dnorm(score_diff_values, params["mu"], params["sigma"], log = TRUE)) +
    # plus the log priors results in log posterior:
    dnorm(params["mu"], 0, 100, log = TRUE) + 
    dunif(params["sigma"], 0, 200, log = TRUE)

}

With these functions we’ll now approximate the posterior, considering very naive starting values for our optimization with \(\mu = 200\) and \(\sigma = 10\) (this is just to demonstrate, I have no reason for choosing these values).

init_samples <- laplace_approx(log_posterior_fn = score_diff_model,
                               init_points = c(mu = 200, sigma = 10),
                               n_samples = 10000,
                               # Specify the data for the optimization!
                               score_diff_values = team_summary_df$Total_Score_Diff)

We can easily view the joint distribution of our posterior samples with their respective marginals on the side (code courtesy of Claus Wilke using cowplot):

# install.packages("cowplot")
library(cowplot)

# First the joint distribution plot with points for the values:
joint_plot <- ggplot(init_samples, aes(x = mu, y = sigma)) + 
  geom_point(alpha = .3, color = "darkblue")  +
  labs(title = "Posterior distribution for model parameters with marginals") +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 16))

# Marginal density for mu along x-axis using the cowplot function axis_canvas:
mu_dens <- axis_canvas(joint_plot, axis = "x") +
  geom_density(data = init_samples, aes(x = mu), fill = "darkblue",
               alpha = 0.5, size = .2)

# Same thing for sigma but along y and with coord_flip = TRUE to make it vertical:
sigma_dens <- axis_canvas(joint_plot, axis = "y", coord_flip = TRUE) +
  geom_density(data = init_samples, aes(x = sigma), fill = "darkblue",
               alpha = 0.5, size = .2) +
  coord_flip()

# Now generate by adding these objects to the main plot:
# Need grid:
# install.packages("grid")
joint_plot_1 <- insert_xaxis_grob(joint_plot, mu_dens, 
                                  grid::unit(.2, "null"), position = "top")
joint_plot_2 <- insert_yaxis_grob(joint_plot_1, sigma_dens, 
                                  grid::unit(.2, "null"), position = "right")
ggdraw(joint_plot_2)

From this we can clearly see the Gaussian approximations for both of our parameters with the distribution of \(\mu\) centered around 0 and the distribution for \(\sigma\) centered around 100. Remember this represents the distributions for the parameters of our score differential model. So sampling from this posterior means we are generating different Gaussian distributions for the score differential. My next post will focus on sampling from the posterior, but to give you a taste of what I mean the code below uses these 10000 values from init_samples for each parameter, and then samples 10000 values from distributions using these combinations of values to give us our approximate score differential distribution.

# R is vectorized so can just give it the vector of values for the Gaussian
# distribution parameters to create the simulated score-diff distribution:
sim_score_diff_data <- data.frame(Total_Score_Diff = 
                                    rnorm(10000, mean = init_samples$mu,
                                          sd = init_samples$sigma))

# Create the histogram from before but now add this posterior density on top:
score_diff_hist + geom_density(data = sim_score_diff_data,
                               aes(x = Total_Score_Diff), 
                               fill = NA, color = "darkblue")

Not bad, but then again this was a pretty easy example. You could also use the grid approximation from the last post to accomplish the same task but it starts to be annoying to work with even with just two parameters, quickly becoming unbearable with more and more parameters.

How to win in the NFL

Step 1: avoid this man

Step 1: avoid this man

Ok, so we’ve seen how easy it is to use Laplace approximation for modeling score differential… but that didn’t tell us anything interesting. What we’re really interested in, is the relationship between some predictor variable(s) and a team’s score differential. For instance, how is a team’s passing performance related to their score differential? What about their pass defense? Is this more or less related to score differential than rushing?

In order to address these questions, I’m going to define two statistics to summarise each team’s performance in a season. These statistics will be entirely based on a team’s expected points added (\(EPA\)). If you’re not familiar with \(EPA\), it’s built on the fundamental principle in football that not all yards are created equal. A 3 yard rush on 3rd down with 2 yards to go, is a lot more valuable than a 3 yard rush on 3rd down with 15 yards to go. I’m not going to do a full dive on how to calculate expected points or its history in this post, but it gives us a number describing how many points a team is expected to score given their current situation. \(EPA\) is merely the change in this expected points state between two plays. It was recently popularized by Brian Burke, and in nflscrapR we use a multinomial logistic regression model to provide probabilities for each of the possible scoring events resulting in freely available \(EPA\) values for every play, full model details available here.

For each combination of team, \(t\), and season, \(y\), we’ll calculate both its total offensive \(EPA\) from passing and rushing as \(EPA^{pass}_{off, t, y}\) and \(EPA^{rush}_{off, t, y}\) respectively. As well as it’s defensive statistics representing the expected points allowed, \(EPA^{pass}_{def, t, y}\) and \(EPA^{rush}_{def, t, y}\). We’ll capture the efficiency of their performances by dividing these totals each by their respective number of attempts, which we’ll denote as \(Att^{rush}_{off, t,y}\), \(Att^{rush}_{def, t,y}\), \(Att^{pass}_{off, t,y}\), and \(Att^{pass}_{def, t,y}\). Finally, we’ll summarise their overall performances with differentials between their offensive and defensive efficiency stats for both passing and rushing:

\[ EPA\ diff^{pass}_{t,y} = \frac{EPA^{pass}_{off, t, y}}{Att^{pass}_{off, t,y}} - \frac{EPA^{pass}_{def, t, y}}{Att^{pass}_{def, t,y}}\\ EPA\ diff^{rush}_{t,y} = \frac{EPA^{rush}_{off, t, y}}{Att^{rush}_{off, t,y}} - \frac{EPA^{rush}_{def, t, y}}{Att^{rush}_{def, t,y}}\\ \]

These two stats tell us effectively tell us how much more efficient a team’s passing (rushing) offense was compared to their opponents. We’re then going to look at the relationship between score differential and each of these efficiency differentials separately for simplicity to give us a sense of knowing which is more related to their overall performance as captured by score differential. First let’s get this data and join it to our team_summary_df above by grabbing and selecting the columns from my nflscrapR-data repository:

# Load in the four datasets summarising team performance for passing and rushing
# both from offensive and defensive perspectives, selecting only the necessary
# columns and renaming them:

team_passing_off <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_season_passing_df.csv") %>%
  dplyr::select(Season, Team, EPA_per_Att) %>%
  rename(pass_epa_att_off = EPA_per_Att) %>%
  filter(!is.na(Team))

team_passing_def <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_def_season_passing_df.csv") %>%
  dplyr::select(Season, Team, EPA_per_Att) %>%
  rename(pass_epa_att_def = EPA_per_Att) %>%
  filter(!is.na(Team))

team_rushing_off <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_season_rushing_df.csv") %>%
  dplyr::select(Season, Team, EPA_per_Car) %>%
  rename(rush_epa_att_off = EPA_per_Car) %>%
  filter(!is.na(Team))

team_rushing_def <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_def_season_rushing_df.csv") %>%
  dplyr::select(Season, Team, EPA_per_Car) %>%
  rename(rush_epa_att_def = EPA_per_Car) %>%
  filter(!is.na(Team))

# Join the data to the team_summary_df and calculate the differential columns:

team_summary_df <- team_summary_df %>%
  inner_join(team_passing_off, by = c("Team", "Season")) %>%
  inner_join(team_passing_def, by = c("Team", "Season")) %>%
  inner_join(team_rushing_off, by = c("Team", "Season")) %>%
  inner_join(team_rushing_def, by = c("Team", "Season")) %>%
  mutate(pass_epa_diff = pass_epa_att_off - pass_epa_att_def,
         rush_epa_diff = rush_epa_att_off - rush_epa_att_def)

Next, let’s look at the relationship between each of these efficiency differentials with the team’s score differential visually along with simple linear regression fits (just for assistance, we’re going into Bayesian linear regression anyway):

epa_pass_plot <- ggplot(team_summary_df,
                        aes(x = pass_epa_diff, y = Total_Score_Diff)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Passing EPA/Attempt differential",
       y = "Regular season score differential",
       title = "Relationship between score differential\nand passing efficiency differential") +
  theme_bw() + 
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 13))

epa_rush_plot <- ggplot(team_summary_df,
                        aes(x = rush_epa_diff, y = Total_Score_Diff)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Rushing EPA/Attempt differential",
       y = "Regular season score differential",
       title = "Relationship between score differential\n and rushing efficiency differential") +
  theme_bw() + 
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 13))

plot_grid(epa_pass_plot, epa_rush_plot, rel_widths = c(1, 1))

Ok, I’d say it’s pretty obvious that the passing efficiency differential has a strong positive relationship with score differential. The rushing differential also displays a positive relationship, but not quite as strong. We can also of course view the relationship between these variables (or lack thereof), and color the points by the team’s regular season score differential to give us an idea of how its related to both of these variables. And just because we can, we’ll include the marginal distributions on the same plot.

# First generate the scattterplot with the points
joint_score_plot <- ggplot(team_summary_df,
       aes(x = pass_epa_diff, y = rush_epa_diff)) +
  geom_point(aes(color = Total_Score_Diff), alpha = 0.5, size = 4) +
  labs(x = "Passing EPA/Attempt differential",
       y = "Rushing EPA/Attempt differential",
       color = "Regular season\nscore differential",
       title = "Joint distribution of passing and rushing efficiency differentials\ncolored by regular season score differential") +
  # Make the color scale go from darkorange (bad) to midpoint gray (usually 
  # associate white with missing) to darkblue (good):
  scale_color_gradient2(low = "darkorange", mid = "gray", high = "darkblue",
                        midpoint = 0) +
  # Add dashed red lines through the origin:
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  # Add some text for the different quadrants for ease of reading:
  annotate("text", label = "Better pass,\nbetter run", x = .30, y = .28,
           size = 5, color = "darkred") +
  annotate("text", label = "Better pass,\nworse run", x = .30, y = -.15,
           size = 5, color = "darkred") +
  annotate("text", label = "Worse pass,\nbetter run", x = -.30, y = .28,
           size = 5, color = "darkred") +
  annotate("text", label = "Worse pass,\nworse run", x = -.30, y = -.15,
           size = 5, color = "darkred") +
  theme_bw() + 
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 16))

# Make passing marginal along x axis:
pass_dens <- axis_canvas(joint_score_plot, axis = "x") +
  geom_density(data = team_summary_df, aes(x = pass_epa_diff), 
               fill = "darkblue",
               alpha = 0.5, size = .2)

# Same thing for rushing but along y and with coord_flip = TRUE to make it vertical:
rush_dens <- axis_canvas(joint_score_plot, axis = "y", coord_flip = TRUE) +
  geom_density(data = team_summary_df, aes(x = rush_epa_diff), 
               fill = "darkblue",
               alpha = 0.5, size = .2) +
  coord_flip()

# Now generate by adding these objects to the main plot:
joint_score_plot_1 <- insert_xaxis_grob(joint_score_plot, pass_dens, 
                                  grid::unit(.2, "null"), position = "top")
joint_score_plot_2 <- insert_yaxis_grob(joint_score_plot_1, rush_dens, 
                                  grid::unit(.2, "null"), position = "right")
ggdraw(joint_score_plot_2)

Now it looks pretty clear that you’re probably going to need a better passing offense than your opponent to have a positive score differential. There doesn’t appear to be many points on this plot with positive score differential in the top-left quadrant, which corresponds to rushing better than your opponents but having an inferior passing game. After accounting for the passing efficiency differential, maybe a team’s rushing efficiency differential doesn’t matter? Let’s tackle this with a Bayesian model.

Returning to the score differential model, we’re now going use both efficiency differentials as predictors. Remember, all we need to fully describe a Gaussian distribution is the mean and variance. So in order to incorporate these into the our model we’ll define the mean, \(\mu\), as a function of both predictors explicitly. For simplicity we’ll return to using \(i\) to denote a single combination of team \(t\) and season \(y\) for the 286 team-season combinations that we have, and we’ll also let \(P_i = EPA\ diff^{pass}_{t,y}\) and \(R_i = EPA\ diff^{rush}_{t,y}\) be simpler representations for our differentials:

\[ S_i \sim \text{N}(\mu_i, \sigma^2) \\ \mu_i = \alpha + \beta_{P}P_i + \beta_R R_i \\ \alpha \sim \text{N}(0, 100^2) \\ \beta_P \sim \text{N}(0, 100^2) \\ \beta_R \sim \text{N}(0, 100^2) \\ \sigma \sim \text{Uniform}(0,200) \]

Now the mean of the score differential distribution depends on the predictors as denoted with the subscript \(i\). By denoting the relationship between \(\mu_i\) and each of the predictors with an \(=\) sign, we’re saying that \(\mu_i\) is no longer a parameter, its just a function of our actual parameters of interest. The parameters we really care about here are \(\alpha\), \(\beta_P\), and \(\beta_R\). Of course all of this should look familiar if you’ve seen linear regression before. The \(\alpha\) is our model’s intercept, representing the expected score differential for when both the passing and rushing efficiency differentials are 0. And each of the \(\beta\)s represent the change in expected score differential when the pass/rush efficiency differential increases by 1 unit, after accounting for the other type of efficiency differential (rush/pass). So each of these are given priors (along with \(\sigma\) again), and we don’t need to specify a prior for \(\mu_i\) since it is explained completely by these parameters. These priors for now are pretty weak, but both \(\beta\) priors centered at 0 are just saying, “I don’t know if there’s a relationship between these effiency differentials and score differential, good or bad are equally likely”.

To actually approximate this model, we’ll use that same laplace_approx() function from before - but this time define a new function for calculating the log posterior for this regression model:

#' Function to calculate log posterior for score differential model regression
#' model that also takes in the prior inputs for the regression parameters.
#' @param reg_params Vector of parameter values that are named "alpha", 
#' "beta_p", "beta_r", and "sigma".
#' @param hyperparams List of hyperparameter values that are named "alpha_mu",
#' "alpha_sd", "beta_p_mu", "beta_p_sd", "beta_r_mu", "beta_p_sd", "sigma_a",
#' and "sigma_b".
#' @param score_diff_values Vector of observed score differential values.
#' @param pass_eff_values Vector of pass efficiency differential values.
#' @param rush_eff_values Vector of rush efficiency differential values.

score_diff_reg_eff_model <- function(reg_params, hyperparams, 
                                        score_diff_values, pass_eff_values,
                                        rush_eff_values) {
  # Log likelihood that now uses the function form for mu using the two
  # predictors and their respective values:
  sum(dnorm(score_diff_values, 
            reg_params["alpha"] + 
              reg_params["beta_p"]*pass_eff_values +
              reg_params["beta_r"]*rush_eff_values, 
            reg_params["sigma"], log = TRUE)) +
    # plus the log priors for each parameter results in log posterior:
    dnorm(reg_params["alpha"], 
          hyperparams$alpha_mu, hyperparams$alpha_sd, log = TRUE) + 
    dnorm(reg_params["beta_p"], 
          hyperparams$beta_p_mu, hyperparams$beta_p_sd, log = TRUE) + 
    dnorm(reg_params["beta_r"], 
          hyperparams$beta_r_mu, hyperparams$beta_r_sd, log = TRUE) + 
    dunif(reg_params["sigma"], hyperparams$sigma_a, 
          hyperparams$sigma_b, log = TRUE)

}

Now we’ll just optimize this function in the same way as before, using the hyperparameters in the model outlined above, and generate 10000 samples from the resulting multivariate Gaussian posterior distribution:

init_reg_samples <- laplace_approx(log_posterior_fn = score_diff_reg_eff_model,
                               init_points = c(alpha = 0, beta_p = 0, 
                                               beta_r = 0, sigma = 10),
                               n_samples = 10000,
                               # Vector of hyperparameters:
                               hyperparams = list(alpha_mu = 0, alpha_sd = 100,
                                                beta_p_mu = 0, beta_p_sd = 100,
                                                beta_r_mu = 0, beta_r_sd = 100,
                                                sigma_a = 0, sigma_b = 200),
                               # Specify the data for the optimization!
                               score_diff_values = team_summary_df$Total_Score_Diff,
                               pass_eff_values = team_summary_df$pass_epa_diff,
                               rush_eff_values = team_summary_df$rush_epa_diff)

Now let’s view the distributions for each \(\beta\) using another one of Claus Wilke’s packages ggridges:

# We'll use the latex2exp package here:
# install.packages("latex2exp")
library(latex2exp)
# and ggridges:
# install.packages("ggridges") 
library(ggridges)

# First use the gather function to make this simpler to work with:
init_reg_samples %>%
  gather(param, value) %>%
  # only use the beta parameters:
  filter(param %in% c("beta_p", "beta_r")) %>%
  # visualize the distributions for each with density curves:
  ggplot(aes(x = value, y = param)) +
  geom_density_ridges(alpha = 0.7, fill = "darkblue",
                      # add the rugs underneath as well:
                      jittered_points = TRUE,
                      position = position_points_jitter(width = 0.05, height = 0),
                      point_shape = '|', point_size = 3, point_color = "darkblue",
                      point_alpha = 0.7) +
  # properly label and also change the y axis spacing so the Passing density
  # is along the bottom of the axis
  scale_y_discrete(labels = c("Passing", "Rushing"), expand = c(0.01, 0.01)) +
  # label using the latex symbols:
  labs(x = TeX("$\\beta$ value"),
       y = TeX("$\\beta$ parameter"),
       title = TeX("Sampled posterior distributions for $\\beta_P$ and $\\beta_R$")) +
  # theme settings:
  theme_bw() + 
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 14))

From this we can clearly see that both passing and rushing efficiency differentials have positive relationships with score differential, even while accounting for each other. Unlike my initial guess from the previous plot, improving your rushing efficiency differential is associated with an increase in your team’s score differential even after adjusting for passing - so it’s not entirely useless. But we do see that passing has a greater effect, in fact there’s no overlap between the passing and rushing distributions. Now these \(\beta\) values are quite large due to the actual values for the \(EPA/Att\) differentials, with the medians for \(\beta_P\) and \(\beta_R\) at 490.4779 and 314.4028 respectively. To make this simpler and more realistic from point of view of gaining an advantage: a 0.05 increase in a team’s passing \(EPA/Att\) differential is associated with a 24.5239 increase in their expected score differential (using the posterior distribution median). Meanwhile a 0.05 increase in a team’s rushing \(EPA/Att\) differential is associated with a 15.7201 increase in their expected score differential. So if you’re in a front office and wondering how to allocate your resources, improving your pass offense/defense is more worthwhile than improving in your run offense/defense.

But my prior says run the ball!

I completely glossed over the prior distributions in setting up this model. The ones I used above are really weak priors, that give you results pretty close to just using good old lm() for your standard linear regression. Also, the prior I’m using for \(\sigma\) is extremely naive, a more appropriate choice is to actually model the log(\(\sigma\)) instead with a Gaussian prior then exponeniate to get strictly positive values - but I’ll let the motivated reader adjust this. Instead I’m going to conclude this post with an example of the power of prior distributions.

Let’s pretend you’re an analyst for the Seattle Seahawks. Your team has just hired Brian Schottenheimer to be the offensive coordinator. Brian loves to run the football5, in fact he tells you that from his 20 years of professional coaching experience, he knows that the run game is much more important than the passing game. So how do we account for his prior knowledge (or to be blunt, his bias)?

A pretty cool feature from using the Gaussian distribution as a prior is how the variance represents the amount of previous data from using \(\sigma = \frac{1}{\sqrt{n}}\). Given Brian’s 20 years of professional experience, this equates to using in our priors $= $ 0.2236, which is obviously much smaller than the priors from above. We’ll also specify the \(\mu\) for each of the \(\beta\)s to say that rushing has a stronger relationship than passing, by just flipping our posterior median values in the example above:

\[ S_i \sim \text{N}(\mu_i, \sigma^2) \\ \mu_i = \alpha + \beta_{P}P_i + \beta_R R_i \\ \alpha \sim \text{N}(0, 100^2) \\ \beta_P \sim \text{N}(336, 0.2236^2) \\ \beta_R \sim \text{N}(492, 0.2236^2) \\ \sigma \sim \text{Uniform}(0,200) \]

Let’s use our functions from before and visualize the new posteriors given Brian’s prior beliefs:

brian_reg_samples <- laplace_approx(log_posterior_fn = score_diff_reg_eff_model,
                               init_points = c(alpha = 0, beta_p = 0, 
                                               beta_r = 0, sigma = 10),
                               n_samples = 10000,
                               # Vector of hyperparameters:
                               hyperparams = list(alpha_mu = 0, alpha_sd = 100,
                                                beta_p_mu = 336, beta_p_sd = 0.2236,
                                                beta_r_mu = 492, beta_r_sd = 0.2236,
                                                sigma_a = 0, sigma_b = 200),
                               # Specify the data for the optimization!
                               score_diff_values = team_summary_df$Total_Score_Diff,
                               pass_eff_values = team_summary_df$pass_epa_diff,
                               rush_eff_values = team_summary_df$rush_epa_diff)

# First use the gather function to make this simpler to work with:
brian_reg_samples %>%
  gather(param, value) %>%
  # only use the beta parameters:
  filter(param %in% c("beta_p", "beta_r")) %>%
  # visualize the distributions for each with density curves:
  ggplot(aes(x = value, y = param)) +
  geom_density_ridges(alpha = 0.7, fill = "darkblue",
                      # add the rugs underneath as well:
                      jittered_points = TRUE,
                      position = position_points_jitter(width = 0.05, height = 0),
                      point_shape = '|', point_size = 3, point_color = "darkblue",
                      point_alpha = 0.7) +
  # properly label and also change the y axis spacing so the Passing density
  # is along the bottom of the axis
  scale_y_discrete(labels = c("Passing", "Rushing"), expand = c(0.01, 0.01)) +
  # label using the latex symbols:
  labs(x = TeX("$\\beta$ value"),
       y = TeX("$\\beta$ parameter"),
       title = TeX("Sampled posterior distributions for $\\beta_P$ and $\\beta_R$ accounting for Brian's prior")) +
  # theme settings:
  theme_bw() + 
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 14))

This shouldn’t be a suprise, as those were some very strong priors about the relationships for each of these efficiency differentials. Priors play an important role in Bayesian data analysis precisely for this reason, they can have a major impact on your results (hence why many people prefer to avoid working with them). So if you’re that lucky Seahawks analyst that gets to show Brian these results, you’ve just reaffirmed his beliefs - you’ll need a lot more data to move these posterior distributions away from such strong priors.

Discussion and next steps

Laplace approximation is a useful tool that is pretty easy to implement, assuming you’ve met the necessary assumptions. You’re not just using the mode found from optimization techniques as single estimates, but rather approximating the full posterior distribution with a Gaussian. And from this score differential example, it’s really easy to implement Bayesian linear regression. From the football perspective we’ve seen that the ability to pass the football and prevent others from passing is more valuable in terms of score differential than running the ball. But strong priors can lead us to basically ignore the data, just confirming to someone like Brian Schottenheimer that his personal bias is “correct”. Obviously this analysis was pretty simple, and much more work would need to be done for really assessing the importance of passing and rushing in the NFL. But even with this simple approach, we can quantify in terms of points how much more valuable passing is relative to rushing.

The next post in this series is going to focus on sampling from the posterior distribution - something I’ve done multiple times but deserves more of deep dive on its own before moving into MCMC.

Let me know if there are any errors in this post/code, all feedback is welcome!

You actually read this whole thing?

You actually read this whole thing?

References


  1. My rocket scientist - turned nuclear engineer - turned self taught Bayesian data scientist for a brother loves Laplace approximation.

  2. Ridiculously bad but awesome idea - Lin-Manuel Miranda reads Stephen Stigler’s History of Statistics and creates a Broadway musical with Daveed Diggs as Laplace…

  3. I don’t see the point of ever referring to something by its Latin name, so I’m going to call it what it is: the posterior mode.

  4. Shameless self promotion for my football data package fcscrapR!

  5. I’m sorry Ben and Sean.