## Enter marquis de Laplace

In my first post on Bayesian data analysis,
I did a brief overview of how Bayesian updating works using grid approximation
to arrive at posterior distributions for our parameters of interest, such as a
wide receiver’s catch rate. While the grid-based approach is simple and easy to
follow, it’s just not practical. Before we turn to MCMC, in this post we’ll
cover the popular approach known as Laplace approximation^{1}, aka quadratic approximation.

As I said before, in many cases we’re not able to derive the posterior
distribution so some sort of approximation method is required. The one and only
Pierre-Simon Laplace^{2}
had the realization you can use a Gaussian (normal) distribution (good old bell curve) for approximating the
posterior if its roughly symmetric and unimodal. One of the great benefits of
working with the Gaussian distribution is that you only need two numbers to
describe it: the mean \(\mu\) for the center, and the variance \(\sigma^2\) for the
spread.

The reason it’s often called quadratic approximation is because we use a
*quadratic* function for approximating the logarithm of the posterior density,
which we’ll denote as \(p(\theta |x)\) where \(x\) is our observed data and \(\theta\)
represents our parameter(s) of interest. This approximation uses the fact that
the log of a Gaussian is a parabola - a quadratic function. You simply take a
Taylor series expansion
of the log of the posterior density centered at the posterior mode \(\hat{\theta}\),

\[ \text{log}\ p(\theta|x) = \text{log}\ p(\hat{\theta}|x) + \frac{1}{2}(\theta - \hat{\theta})^T \Big[ \frac{d^2}{d\theta^2} \text{log} p(\theta | x) \Big]_{\theta = \hat{\theta}}(\theta - \hat{\theta}) +\ ... \]

Let me break this down. The first thing you might be thinking is, “Wait how do we
find the posterior mode???” - well it’s actually pretty simple with a standard
optimization procedure (as you’ll see below). This mode represents the center of
the Gaussian that we’ll use to approximate the posterior. The posterior mode is also
referred to by its Latin name^{3} *maximum a-posteriori*, denoted by
\(\hat{\theta}_{MAP}\). The first term in the expansion above is merely a constant,
but the second term provides us an estimate of the curvature near the posterior’s
peak and is proportional to the log of a Gaussian density. This provides us with
the variance of the Gaussian to approximate the posterior as:

\[ p(\theta|x) \approx \text{N}\big( \hat{\theta}, [I(\hat{\theta})]^{-1} \big) \]

where \(I(\hat{\theta})\) is an estimate for \(I(\theta)\), the *observed information*,

\[ I(\theta) = -\frac{d^2}{d\theta^2} \text{log}\ p(\theta|x). \]

That’s all we need for this approximation, the posterior mode and the curvature of the posterior density and BAM we have the entire posterior distribution. Well really an approximation of it - but from asymptotic theory the posterior distribution approaches a Gaussian under some conditions. I won’t get into the details, so definitely check out more of the theory yourself. I also recommend chapter four of the classic Bayesian Data Analysis by Gelman et al.

The excellent blog by Rasmus Bååth also has a post on Laplace approximation with a binomial example, like the catch rate model in the previous post, showing how the approximation using a Gaussian improves with more data despite the fact the posterior is actually a beta! I’ll use a different example for the rest of this post and demonstrate the usage of quadratic approximation for Bayesian linear regression.

## Modeling NFL score differential

Keeping with my theme of working with American football data^{4}, I’ll demonstrate
the use of quadratic approximation with a model for an individual team’s
score differential. The first step is to gather the score differential for each
team in each season, which can be done using the `nflscrapR`

package. An easier route however is to access the data from my `nflscrapR`

-data repository. Rather than scavenging
through those files, in my `nflWAR`

package
there’s a function called `get_season_summary()`

that takes in a vector of years
and returns a data frame with rows for each team’s individual season, and has columns
for their number of wins and total score differential. The code chunk below
demonstrates how to use gather this data for NFL seasons from 2009 to 2017:

```
# Load the tidyverse
# install.packages("tidyverse")
library(tidyverse)
# Use the nflWAR package to get summaries of team performances for each season,
# first install devtools if you don't have then install nflWAR:
# install.packages("devtools")
# devtools::install_github("ryurko/nflWAR")
library(nflWAR)
# Use the get_season_summary() function to create a dataset that has a row
# for team-season with a column for their score differential from 2009 to 2017:
team_summary_df <- get_season_summary(2009:2017)
```

Before going into the regression example with a predictor, it’s worthwhile to first demonstrate quadratic approximation by just modeling the score differential with a Gaussian. First step is to visualize the score differential distribution:

```
# Create a histogram displaying the score differential distribution using
# the the density instead so it's consistent for later: (also why I'm assigning
# it to a variable name, so we can add our approximated posterior layer later)
score_diff_hist <- team_summary_df %>%
ggplot(aes(x = Total_Score_Diff)) +
geom_histogram(aes(y = ..density..), bins = 20) + theme_bw() +
scale_x_continuous(limits = c(-300, 300)) +
# Always label:
labs(x = "Score differential", y = "Density",
title = "Distribution of individual team score differential in each season from 2009-17 ",
caption = "Data accessed with nflscrapR") +
# Just some cosmetic settings: (really should do a custom theme at the
# beginning instead of repeating this)
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14))
# Display the plot
score_diff_hist
```

From this histogram we have support for using a Gaussian, since it’s roughly
symmetric and unimodal. Also, since an observed *season* score differential is the
sum of the score differentials in *each game*, we can think of these values as
following a random walk. Some teams will have large positive differentials, which
means some team will have large negative differentials. I won’t go into the details
but theoretically this process of summing together random values, assumed to be
generated from the same distribution, will follow a Gaussian distribution.
In other words - the Gaussian distribution is a reasonable choice here.

We assume then that the general model for an observed score differential \(S_i\), where \(i = 1,...,\) 288 is the index for an observed team-season score differential, follows a Gaussian distribution:

\[ S_i \sim \text{N}(\mu, \sigma^2) \]

Because we’re approaching this problem from a Bayesian perspective, we really
want to consider an infinite number of possible Guassians based on all of the
possible combinations for our parameters \(\mu\) and \(\sigma\). Of course we’re
not going to actually look at every possible value for these parameters,
laplace approximation will accomplish what we need. But, as I pointed out in the
previous post, we’re interested in the entire posterior distribution - which in this case
is a posterior distribution *of distributions*. So for a full model of score
differential, we need priors for both \(\mu\) and \(\sigma\):

\[ S_i \sim \text{N}(\mu, \sigma^2) \\ \mu \sim \text{N}(0, 100^2) \\ \sigma \sim \text{Uniform}(0,200) \]

This is a really naive model, just to demonstrate Laplace approximation. Next
we’ll implement the same code in the Bååth blog post
which uses the `optim`

function in `R`

to find the mode of the posterior and
also returns the Hessian matrix
for the curvature (following the observed information definition above.
For ease, we’ll roughly follow his example code which also uses
the `mvtnorm`

packages for sampling from the approximated posterior (this sampling isn’t
really necessary but it’s good practice to start sampling from the posterior).
We’re working with two parameters here, so the resulting posterior approximation
is a multivariate Gaussian, hence why we use the `rmvnorm()`

function since we’re
not only working with each parameter’s mean and variance, but their covariance
as well (in this example the covariance is essentially 0 but that
won’t typically be the case).

```
# install.packages("mvtnorm")
#' Function to perform simple Laplace approximation
#' @param log_posterior_n Function that returns the log posterior given a vector
#' of parameter values and observed data.
#' @param init_points Vector of starting values for parameters.
#' @param n_samples Number of samples to draw from resulting approximated
#' posterior distribution for then visualizing.
#' @param ... Any additional arguments passed to the log_posterior_fun during
#' the optimization.
#' @return Samples from approximated posterior distribution.
laplace_approx <- function(log_posterior_fn, init_points, n_samples, ...) {
# Use the optim function with the input starting points and function to
# evaluate the log posterior to find the mode and curvature with hessian = TRUE
# (note that using fnscale = -1 means to maximize). We use the ... here to
# allow for flexible name of the data in log_posterior_fn:
fit <- optim(init_points, log_posterior_fn,
# Use the same quasi-Newton optimization method as McElreath's
# map function in his rethinking package:
method = "BFGS",
# using fnscale = -1 means to maximize
control = list(fnscale = -1),
# need the hessian for the curvature
hessian = TRUE,
# additional arguments for the function we're optimizing
...)
# Store the mean values for the parameters and curvature to then generate
# samples from the approximated normal posterior given the number of samples
# for both of the parameters, returning as a data frame:
param_mean <- fit$par
# inverse of the negative hessian for the covariance - same as the use
# of the observed information in the intro
param_cov_mat <- solve(-fit$hessian)
# sample from the resulting joint posterior to get posterior distributions
# for each parameter:
mvtnorm::rmvnorm(n_samples, param_mean, param_cov_mat) %>%
data.frame()
}
#' Function to calculate log posterior for simple score differential model
#' @param params Vector of parameter values that are named "mu" and "sigma".
#' @param score_diff_values Vector of observed score differential values
score_diff_model <- function(params, score_diff_values) {
# Log likelihood:
sum(dnorm(score_diff_values, params["mu"], params["sigma"], log = TRUE)) +
# plus the log priors results in log posterior:
dnorm(params["mu"], 0, 100, log = TRUE) +
dunif(params["sigma"], 0, 200, log = TRUE)
}
```

With these functions we’ll now approximate the posterior, considering very naive starting values for our optimization with \(\mu = 200\) and \(\sigma = 10\) (this is just to demonstrate, I have no reason for choosing these values).

```
init_samples <- laplace_approx(log_posterior_fn = score_diff_model,
init_points = c(mu = 200, sigma = 10),
n_samples = 10000,
# Specify the data for the optimization!
score_diff_values = team_summary_df$Total_Score_Diff)
```

We can easily view the joint distribution of our posterior samples with their
respective marginals on the side (code courtesy of Claus Wilke using `cowplot`

):

```
# install.packages("cowplot")
library(cowplot)
# First the joint distribution plot with points for the values:
joint_plot <- ggplot(init_samples, aes(x = mu, y = sigma)) +
geom_point(alpha = .3, color = "darkblue") +
labs(title = "Posterior distribution for model parameters with marginals") +
theme_bw() +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 16))
# Marginal density for mu along x-axis using the cowplot function axis_canvas:
mu_dens <- axis_canvas(joint_plot, axis = "x") +
geom_density(data = init_samples, aes(x = mu), fill = "darkblue",
alpha = 0.5, size = .2)
# Same thing for sigma but along y and with coord_flip = TRUE to make it vertical:
sigma_dens <- axis_canvas(joint_plot, axis = "y", coord_flip = TRUE) +
geom_density(data = init_samples, aes(x = sigma), fill = "darkblue",
alpha = 0.5, size = .2) +
coord_flip()
# Now generate by adding these objects to the main plot:
# Need grid:
# install.packages("grid")
joint_plot_1 <- insert_xaxis_grob(joint_plot, mu_dens,
grid::unit(.2, "null"), position = "top")
joint_plot_2 <- insert_yaxis_grob(joint_plot_1, sigma_dens,
grid::unit(.2, "null"), position = "right")
ggdraw(joint_plot_2)
```

From this we can clearly see the Gaussian approximations for both of our
parameters with the distribution of \(\mu\) centered around 0 and the distribution
for \(\sigma\) centered around 100. Remember this represents the distributions
for the parameters of our score differential model. So sampling from this posterior
means we are generating different Gaussian distributions for the score differential.
My next post will focus on sampling from the posterior, but to give you a taste
of what I mean the code below uses these 10000 values from `init_samples`

for
each parameter, and then samples 10000 values from distributions using these
combinations of values to give us our approximate score differential distribution.

```
# R is vectorized so can just give it the vector of values for the Gaussian
# distribution parameters to create the simulated score-diff distribution:
sim_score_diff_data <- data.frame(Total_Score_Diff =
rnorm(10000, mean = init_samples$mu,
sd = init_samples$sigma))
# Create the histogram from before but now add this posterior density on top:
score_diff_hist + geom_density(data = sim_score_diff_data,
aes(x = Total_Score_Diff),
fill = NA, color = "darkblue")
```

Not bad, but then again this was a pretty easy example. You could also use the grid approximation from the last post to accomplish the same task but it starts to be annoying to work with even with just two parameters, quickly becoming unbearable with more and more parameters.

## How to win in the NFL

Ok, so we’ve seen how easy it is to use Laplace approximation for modeling score differential… but that didn’t tell us anything interesting. What we’re really interested in, is the relationship between some predictor variable(s) and a team’s score differential. For instance, how is a team’s passing performance related to their score differential? What about their pass defense? Is this more or less related to score differential than rushing?

In order to address these questions, I’m going to define two statistics to
summarise each team’s performance in a season. These statistics will be entirely
based on a team’s expected points added (\(EPA\)). If you’re not familiar with \(EPA\),
it’s built on the fundamental principle in football that not all yards are created
equal. A 3 yard rush on 3rd down with 2 yards to go, is a lot more valuable than
a 3 yard rush on 3rd down with 15 yards to go. I’m not going to do a full dive
on how to calculate expected points or its history in this post,
but it gives us a number describing how many points a team is expected to score
given their current situation. \(EPA\) is merely the change in this expected points
state between two plays. It was recently popularized by Brian Burke,
and in `nflscrapR`

we use a
multinomial logistic regression model to provide probabilities for each of the
possible scoring events resulting in freely available \(EPA\) values for every
play, full model details available here.

For each combination of team, \(t\), and season, \(y\), we’ll calculate both its
total offensive \(EPA\) from passing and rushing as \(EPA^{pass}_{off, t, y}\) and
\(EPA^{rush}_{off, t, y}\) respectively. As well as it’s defensive statistics
representing the expected points **allowed**, \(EPA^{pass}_{def, t, y}\) and
\(EPA^{rush}_{def, t, y}\). We’ll capture the efficiency of their performances
by dividing these totals each by their respective number of attempts, which we’ll
denote as \(Att^{rush}_{off, t,y}\), \(Att^{rush}_{def, t,y}\), \(Att^{pass}_{off, t,y}\),
and \(Att^{pass}_{def, t,y}\). Finally, we’ll summarise their overall performances
with differentials between their offensive and defensive efficiency stats for
both passing and rushing:

\[ EPA\ diff^{pass}_{t,y} = \frac{EPA^{pass}_{off, t, y}}{Att^{pass}_{off, t,y}} - \frac{EPA^{pass}_{def, t, y}}{Att^{pass}_{def, t,y}}\\ EPA\ diff^{rush}_{t,y} = \frac{EPA^{rush}_{off, t, y}}{Att^{rush}_{off, t,y}} - \frac{EPA^{rush}_{def, t, y}}{Att^{rush}_{def, t,y}}\\ \]

These two stats tell us effectively tell us how much more efficient a team’s
passing (rushing) offense was compared to their opponents. We’re then going to
look at the relationship between score differential and each of these efficiency
differentials separately for simplicity to give us a sense of knowing which
is more related to their overall performance as captured by score differential.
First let’s get this data and join it to our `team_summary_df`

above by grabbing
and selecting the columns from my `nflscrapR`

-data repository:

```
# Load in the four datasets summarising team performance for passing and rushing
# both from offensive and defensive perspectives, selecting only the necessary
# columns and renaming them:
team_passing_off <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_season_passing_df.csv") %>%
dplyr::select(Season, Team, EPA_per_Att) %>%
rename(pass_epa_att_off = EPA_per_Att) %>%
filter(!is.na(Team))
team_passing_def <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_def_season_passing_df.csv") %>%
dplyr::select(Season, Team, EPA_per_Att) %>%
rename(pass_epa_att_def = EPA_per_Att) %>%
filter(!is.na(Team))
team_rushing_off <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_season_rushing_df.csv") %>%
dplyr::select(Season, Team, EPA_per_Car) %>%
rename(rush_epa_att_off = EPA_per_Car) %>%
filter(!is.na(Team))
team_rushing_def <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/data/season_team_stats/team_def_season_rushing_df.csv") %>%
dplyr::select(Season, Team, EPA_per_Car) %>%
rename(rush_epa_att_def = EPA_per_Car) %>%
filter(!is.na(Team))
# Join the data to the team_summary_df and calculate the differential columns:
team_summary_df <- team_summary_df %>%
inner_join(team_passing_off, by = c("Team", "Season")) %>%
inner_join(team_passing_def, by = c("Team", "Season")) %>%
inner_join(team_rushing_off, by = c("Team", "Season")) %>%
inner_join(team_rushing_def, by = c("Team", "Season")) %>%
mutate(pass_epa_diff = pass_epa_att_off - pass_epa_att_def,
rush_epa_diff = rush_epa_att_off - rush_epa_att_def)
```

Next, let’s look at the relationship between each of these efficiency differentials with the team’s score differential visually along with simple linear regression fits (just for assistance, we’re going into Bayesian linear regression anyway):

```
epa_pass_plot <- ggplot(team_summary_df,
aes(x = pass_epa_diff, y = Total_Score_Diff)) +
geom_point(alpha = 0.5, color = "darkblue") +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "Passing EPA/Attempt differential",
y = "Regular season score differential",
title = "Relationship between score differential\nand passing efficiency differential") +
theme_bw() +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 13))
epa_rush_plot <- ggplot(team_summary_df,
aes(x = rush_epa_diff, y = Total_Score_Diff)) +
geom_point(alpha = 0.5, color = "darkblue") +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "Rushing EPA/Attempt differential",
y = "Regular season score differential",
title = "Relationship between score differential\n and rushing efficiency differential") +
theme_bw() +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 13))
plot_grid(epa_pass_plot, epa_rush_plot, rel_widths = c(1, 1))
```

Ok, I’d say it’s pretty obvious that the passing efficiency differential has a strong positive relationship with score differential. The rushing differential also displays a positive relationship, but not quite as strong. We can also of course view the relationship between these variables (or lack thereof), and color the points by the team’s regular season score differential to give us an idea of how its related to both of these variables. And just because we can, we’ll include the marginal distributions on the same plot.

```
# First generate the scattterplot with the points
joint_score_plot <- ggplot(team_summary_df,
aes(x = pass_epa_diff, y = rush_epa_diff)) +
geom_point(aes(color = Total_Score_Diff), alpha = 0.5, size = 4) +
labs(x = "Passing EPA/Attempt differential",
y = "Rushing EPA/Attempt differential",
color = "Regular season\nscore differential",
title = "Joint distribution of passing and rushing efficiency differentials\ncolored by regular season score differential") +
# Make the color scale go from darkorange (bad) to midpoint gray (usually
# associate white with missing) to darkblue (good):
scale_color_gradient2(low = "darkorange", mid = "gray", high = "darkblue",
midpoint = 0) +
# Add dashed red lines through the origin:
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
# Add some text for the different quadrants for ease of reading:
annotate("text", label = "Better pass,\nbetter run", x = .30, y = .28,
size = 5, color = "darkred") +
annotate("text", label = "Better pass,\nworse run", x = .30, y = -.15,
size = 5, color = "darkred") +
annotate("text", label = "Worse pass,\nbetter run", x = -.30, y = .28,
size = 5, color = "darkred") +
annotate("text", label = "Worse pass,\nworse run", x = -.30, y = -.15,
size = 5, color = "darkred") +
theme_bw() +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 16))
# Make passing marginal along x axis:
pass_dens <- axis_canvas(joint_score_plot, axis = "x") +
geom_density(data = team_summary_df, aes(x = pass_epa_diff),
fill = "darkblue",
alpha = 0.5, size = .2)
# Same thing for rushing but along y and with coord_flip = TRUE to make it vertical:
rush_dens <- axis_canvas(joint_score_plot, axis = "y", coord_flip = TRUE) +
geom_density(data = team_summary_df, aes(x = rush_epa_diff),
fill = "darkblue",
alpha = 0.5, size = .2) +
coord_flip()
# Now generate by adding these objects to the main plot:
joint_score_plot_1 <- insert_xaxis_grob(joint_score_plot, pass_dens,
grid::unit(.2, "null"), position = "top")
joint_score_plot_2 <- insert_yaxis_grob(joint_score_plot_1, rush_dens,
grid::unit(.2, "null"), position = "right")
ggdraw(joint_score_plot_2)
```

Now it looks pretty clear that you’re probably going to need a better passing offense than your opponent to have a positive score differential. There doesn’t appear to be many points on this plot with positive score differential in the top-left quadrant, which corresponds to rushing better than your opponents but having an inferior passing game. After accounting for the passing efficiency differential, maybe a team’s rushing efficiency differential doesn’t matter? Let’s tackle this with a Bayesian model.

Returning to the score differential model, we’re now going use both efficiency differentials as predictors. Remember, all we need to fully describe a Gaussian distribution is the mean and variance. So in order to incorporate these into the our model we’ll define the mean, \(\mu\), as a function of both predictors explicitly. For simplicity we’ll return to using \(i\) to denote a single combination of team \(t\) and season \(y\) for the 286 team-season combinations that we have, and we’ll also let \(P_i = EPA\ diff^{pass}_{t,y}\) and \(R_i = EPA\ diff^{rush}_{t,y}\) be simpler representations for our differentials:

\[ S_i \sim \text{N}(\mu_i, \sigma^2) \\ \mu_i = \alpha + \beta_{P}P_i + \beta_R R_i \\ \alpha \sim \text{N}(0, 100^2) \\ \beta_P \sim \text{N}(0, 100^2) \\ \beta_R \sim \text{N}(0, 100^2) \\ \sigma \sim \text{Uniform}(0,200) \]

Now the mean of the score differential distribution *depends on the predictors*
as denoted with the subscript \(i\). By denoting the relationship between \(\mu_i\)
and each of the predictors with an \(=\) sign, we’re saying that \(\mu_i\) is no
longer a parameter, its just a function of our actual parameters of interest.
The parameters we really care about here are \(\alpha\), \(\beta_P\), and \(\beta_R\).
Of course all of this should look familiar if you’ve seen linear regression
before. The \(\alpha\) is our model’s intercept, representing the *expected* score
differential for when both the passing and rushing efficiency differentials
are 0. And each of the \(\beta\)s represent the change in *expected* score
differential when the pass/rush efficiency differential increases by 1 unit, after
accounting for the other type of efficiency differential (rush/pass). So each of
these are given priors (along with \(\sigma\) again), and we don’t need to specify
a prior for \(\mu_i\) since it is explained completely by these parameters. These
priors for now are pretty weak, but both \(\beta\) priors centered at 0 are just
saying, “I don’t know if there’s a relationship between these effiency
differentials and score differential, good or bad are equally likely”.

To actually approximate this model, we’ll use that same `laplace_approx()`

function
from before - but this time define a new function for calculating the log posterior
for this regression model:

```
#' Function to calculate log posterior for score differential model regression
#' model that also takes in the prior inputs for the regression parameters.
#' @param reg_params Vector of parameter values that are named "alpha",
#' "beta_p", "beta_r", and "sigma".
#' @param hyperparams List of hyperparameter values that are named "alpha_mu",
#' "alpha_sd", "beta_p_mu", "beta_p_sd", "beta_r_mu", "beta_p_sd", "sigma_a",
#' and "sigma_b".
#' @param score_diff_values Vector of observed score differential values.
#' @param pass_eff_values Vector of pass efficiency differential values.
#' @param rush_eff_values Vector of rush efficiency differential values.
score_diff_reg_eff_model <- function(reg_params, hyperparams,
score_diff_values, pass_eff_values,
rush_eff_values) {
# Log likelihood that now uses the function form for mu using the two
# predictors and their respective values:
sum(dnorm(score_diff_values,
reg_params["alpha"] +
reg_params["beta_p"]*pass_eff_values +
reg_params["beta_r"]*rush_eff_values,
reg_params["sigma"], log = TRUE)) +
# plus the log priors for each parameter results in log posterior:
dnorm(reg_params["alpha"],
hyperparams$alpha_mu, hyperparams$alpha_sd, log = TRUE) +
dnorm(reg_params["beta_p"],
hyperparams$beta_p_mu, hyperparams$beta_p_sd, log = TRUE) +
dnorm(reg_params["beta_r"],
hyperparams$beta_r_mu, hyperparams$beta_r_sd, log = TRUE) +
dunif(reg_params["sigma"], hyperparams$sigma_a,
hyperparams$sigma_b, log = TRUE)
}
```

Now we’ll just optimize this function in the same way as before, using the hyperparameters in the model outlined above, and generate 10000 samples from the resulting multivariate Gaussian posterior distribution:

```
init_reg_samples <- laplace_approx(log_posterior_fn = score_diff_reg_eff_model,
init_points = c(alpha = 0, beta_p = 0,
beta_r = 0, sigma = 10),
n_samples = 10000,
# Vector of hyperparameters:
hyperparams = list(alpha_mu = 0, alpha_sd = 100,
beta_p_mu = 0, beta_p_sd = 100,
beta_r_mu = 0, beta_r_sd = 100,
sigma_a = 0, sigma_b = 200),
# Specify the data for the optimization!
score_diff_values = team_summary_df$Total_Score_Diff,
pass_eff_values = team_summary_df$pass_epa_diff,
rush_eff_values = team_summary_df$rush_epa_diff)
```

Now let’s view the distributions for each \(\beta\) using another
one of Claus Wilke’s packages `ggridges`

:

```
# We'll use the latex2exp package here:
# install.packages("latex2exp")
library(latex2exp)
# and ggridges:
# install.packages("ggridges")
library(ggridges)
# First use the gather function to make this simpler to work with:
init_reg_samples %>%
gather(param, value) %>%
# only use the beta parameters:
filter(param %in% c("beta_p", "beta_r")) %>%
# visualize the distributions for each with density curves:
ggplot(aes(x = value, y = param)) +
geom_density_ridges(alpha = 0.7, fill = "darkblue",
# add the rugs underneath as well:
jittered_points = TRUE,
position = position_points_jitter(width = 0.05, height = 0),
point_shape = '|', point_size = 3, point_color = "darkblue",
point_alpha = 0.7) +
# properly label and also change the y axis spacing so the Passing density
# is along the bottom of the axis
scale_y_discrete(labels = c("Passing", "Rushing"), expand = c(0.01, 0.01)) +
# label using the latex symbols:
labs(x = TeX("$\\beta$ value"),
y = TeX("$\\beta$ parameter"),
title = TeX("Sampled posterior distributions for $\\beta_P$ and $\\beta_R$")) +
# theme settings:
theme_bw() +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14))
```

From this we can clearly see that both passing and rushing efficiency differentials
have positive relationships with score differential, even while accounting for
each other. Unlike my initial guess from the previous plot, improving your rushing
efficiency differential is associated with an increase in your team’s score
differential even after adjusting for passing - so it’s not entirely useless.
But we do see that passing has a greater effect, in fact **there’s no overlap
between the passing and rushing distributions**.
Now these \(\beta\) values are quite large due to the actual values
for the \(EPA/Att\) differentials, with the medians for \(\beta_P\) and \(\beta_R\) at
490.4779 and
314.4028 respectively. To make this
simpler and more realistic from point of view of gaining an advantage: a 0.05
increase in a team’s passing \(EPA/Att\) differential is associated with a
24.5239 increase in their expected
score differential (using the posterior distribution median). Meanwhile a 0.05
increase in a team’s rushing \(EPA/Att\) differential is associated with a
15.7201 increase in their expected
score differential. So if you’re in a front office and wondering how to allocate
your resources, improving your pass offense/defense is more worthwhile
than improving in your run offense/defense.

## But my prior says run the ball!

I completely glossed over the prior distributions in setting up this model.
The ones I used above are really weak priors, that give you results
pretty close to just using good old `lm()`

for your standard linear regression.
Also, the prior I’m using for \(\sigma\) is extremely naive, a more appropriate
choice is to actually model the log(\(\sigma\)) instead with a Gaussian prior then
exponeniate to get strictly positive values - but I’ll let the motivated reader
adjust this. Instead I’m going to conclude this post with an example of the power
of prior distributions.

Let’s pretend you’re an analyst for the Seattle Seahawks. Your team has just
hired Brian Schottenheimer
to be the offensive coordinator. Brian loves to run the football^{5}, in fact he tells
you that from his 20 years of professional coaching experience, he knows that
the run game is much more important than the passing game. So how do we account
for his prior knowledge (or to be blunt, his bias)?

A pretty cool feature from using the Gaussian distribution as a prior is how the variance represents the amount of previous data from using \(\sigma = \frac{1}{\sqrt{n}}\). Given Brian’s 20 years of professional experience, this equates to using in our priors $= $ 0.2236, which is obviously much smaller than the priors from above. We’ll also specify the \(\mu\) for each of the \(\beta\)s to say that rushing has a stronger relationship than passing, by just flipping our posterior median values in the example above:

\[ S_i \sim \text{N}(\mu_i, \sigma^2) \\ \mu_i = \alpha + \beta_{P}P_i + \beta_R R_i \\ \alpha \sim \text{N}(0, 100^2) \\ \beta_P \sim \text{N}(336, 0.2236^2) \\ \beta_R \sim \text{N}(492, 0.2236^2) \\ \sigma \sim \text{Uniform}(0,200) \]

Let’s use our functions from before and visualize the new posteriors given Brian’s prior beliefs:

```
brian_reg_samples <- laplace_approx(log_posterior_fn = score_diff_reg_eff_model,
init_points = c(alpha = 0, beta_p = 0,
beta_r = 0, sigma = 10),
n_samples = 10000,
# Vector of hyperparameters:
hyperparams = list(alpha_mu = 0, alpha_sd = 100,
beta_p_mu = 336, beta_p_sd = 0.2236,
beta_r_mu = 492, beta_r_sd = 0.2236,
sigma_a = 0, sigma_b = 200),
# Specify the data for the optimization!
score_diff_values = team_summary_df$Total_Score_Diff,
pass_eff_values = team_summary_df$pass_epa_diff,
rush_eff_values = team_summary_df$rush_epa_diff)
# First use the gather function to make this simpler to work with:
brian_reg_samples %>%
gather(param, value) %>%
# only use the beta parameters:
filter(param %in% c("beta_p", "beta_r")) %>%
# visualize the distributions for each with density curves:
ggplot(aes(x = value, y = param)) +
geom_density_ridges(alpha = 0.7, fill = "darkblue",
# add the rugs underneath as well:
jittered_points = TRUE,
position = position_points_jitter(width = 0.05, height = 0),
point_shape = '|', point_size = 3, point_color = "darkblue",
point_alpha = 0.7) +
# properly label and also change the y axis spacing so the Passing density
# is along the bottom of the axis
scale_y_discrete(labels = c("Passing", "Rushing"), expand = c(0.01, 0.01)) +
# label using the latex symbols:
labs(x = TeX("$\\beta$ value"),
y = TeX("$\\beta$ parameter"),
title = TeX("Sampled posterior distributions for $\\beta_P$ and $\\beta_R$ accounting for Brian's prior")) +
# theme settings:
theme_bw() +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14))
```

This shouldn’t be a suprise, as those were some very strong priors about the relationships for each of these efficiency differentials. Priors play an important role in Bayesian data analysis precisely for this reason, they can have a major impact on your results (hence why many people prefer to avoid working with them). So if you’re that lucky Seahawks analyst that gets to show Brian these results, you’ve just reaffirmed his beliefs - you’ll need a lot more data to move these posterior distributions away from such strong priors.

## Discussion and next steps

Laplace approximation is a useful tool that is pretty easy to implement, assuming
you’ve met the necessary assumptions. You’re not just using the mode found from
optimization techniques as single estimates, but rather approximating the full
posterior distribution with a Gaussian. And from this score differential example,
it’s really easy to implement Bayesian linear regression. From the football
perspective we’ve seen that the ability to pass the football and prevent others
from passing is more valuable in terms of score differential than running the ball.
But strong priors can lead us to basically ignore the data, just confirming to
someone like Brian Schottenheimer that his personal bias is “correct”. Obviously
this analysis was pretty simple, and much more work would need to be done for
really assessing the importance of passing and rushing in the NFL. But even
with this simple approach, **we can quantify in terms of points how much more
valuable passing is relative to rushing**.

The next post in this series is going to focus on sampling from the posterior distribution - something I’ve done multiple times but deserves more of deep dive on its own before moving into MCMC.

Let me know if there are any errors in this post/code, all feedback is welcome!

## References

- Richard McElreath’s Statistical Rethinking,
his
`rethinking`

package uses a function`map()`

, which is basically a wrapper function for the`laplace_approx()`

function we defined above, to carry out Laplace approximation. I’m just personally annoyed by the fact it’s`map()`

since that’s the same name as one of the most useful functions in the`purrr`

package. - Bayesian Data Analysis by Gelman et al.
- Rasmus Bååth’s blog post on Laplace approximation
- Jim Albert’s
`LearnBayes`

package also has similar functions for carrying out this approximation.

My rocket scientist - turned nuclear engineer - turned self taught Bayesian data scientist for a brother loves Laplace approximation.↩

Ridiculously bad but awesome idea - Lin-Manuel Miranda reads Stephen Stigler’s History of Statistics and creates a Broadway musical with Daveed Diggs as Laplace…↩

I don’t see the point of ever referring to something by its Latin name, so I’m going to call it what it is: the posterior mode.↩

Shameless self promotion for my football data package

`fcscrapR`

!↩