Predicting Plays, Revaluing Rushers
Introduction
In the NFL, the ability to anticipate whether an offense will run or pass is critical to defensive success. Games are often decided not just by on-field execution, but by preparation and strategic play-calling from the sidelines. This project aims to quantify this strategic element by building a model to predict offensive plays based on pre-snap context.
This project has two primary goals:
- Build a model that predicts the likelihood of a run or pass using pre-snap context.
- Use this model to evaluate pass rushers, with an emphasis on their ability to generate pressure in unexpected passing situations, where disruption is harder to achieve.
Data
This project combines two key NFL datasets:
NFL Play-by-Play Data (2016–2023): This dataset forms the foundation of our predictive model. It includes core situational features for each play, such as down, distance, quarter, yardline, score differential, and time remaining. It also provides information on offensive formation and team-specific lagged run rates.
NFL Player Tracking Data (2022): This dataset provides rich spatial information. It contains the (x, y) coordinates for all 22 players and the ball on every frame of a play, along with player kinematics (speed, acceleration, direction) and key event timestamps like the ball snap, pass release, and tackle. This data is crucial for analyzing player movement and evaluating pass-rusher disruption.
Exploratory Analysis: The Innate Patterns of Play-Calling
Before building any models, an initial exploration of the play-by-play data reveals strong, intuitive patterns in offensive strategy. These predictable tendencies, driven by game situation and personnel, form the basis of our ability to predict play calls.
League-Wide Trends & Game Script
At a macro level, the total number of offensive plays per season has remained remarkably stable, and the league-wide pass rate hovers consistently around 58-60%. This provides a stable, pass-leaning environment for our model. Clock management and score also heavily influence play selection, with pass rates spiking dramatically in the final two minutes of each half as teams are forced to throw.
Down, Distance, and Field Position
Arguably the most powerful pre-snap signals are the down and distance. As expected, the likelihood of a pass increases significantly on later downs and in long-yardage situations. Field position and the current score differential also play a major role, with teams passing more when trailing and in the open field.
Personnel as a Predictor
Finally, the offensive personnel grouping provides a clear signal of intent. There is a direct and powerful relationship between the number of wide receivers on the field and the likelihood of a pass. Conversely, “heavy” personnel packages (three or more running backs and tight ends combined) are a strong indicator of a run play.
Methods
Modeling the Pre-Snap Pass Probability
The primary modeling goal is to predict, before the ball is snapped, whether an offensive play will be a run or a pass. We frame this as a binary classification problem. For each play i, let Y_i be a binary random variable where:
Y_i = \begin{cases} 1 & \text{if the play is a pass} \\ 0 & \text{if the play is a run} \end{cases}
Let \mathbf{X}_i be a vector of k pre-snap features available for play i, such as down, distance, time remaining, and player locations. Our objective is to build a model that accurately estimates the conditional probability of a pass, p_i = P(Y_i = 1 | \mathbf{X}_i = \mathbf{x}_i).
We developed our model iteratively:
A Note on the Target Variable: For the purpose of this model, we define a “pass” (Y_i=1) as any play categorized as a quarterback dropback. This is a crucial distinction: our goal is to predict the designed play call from a pre-snap perspective. Therefore, plays where the quarterback drops back to pass but ends up scrambling are labeled as ‘pass plays’ for our model, even though they are officially recorded as runs. This approach aligns with predicting the offense’s intent at the snap, not the final play outcome.
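The labeling rule in this note can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the field names qb_dropback and qb_scramble are assumptions modeled on common play-by-play schemas, and the example plays are hypothetical.

```python
def label_play(play):
    """Return 1 for a designed pass (any QB dropback), 0 for a designed run."""
    # A scramble begins as a dropback, so it is labeled as a pass here
    # even though the box score records it as a rush.
    return 1 if play["qb_dropback"] == 1 else 0


# Hypothetical plays; field names are assumed, not taken from the source data.
plays = [
    {"desc": "completed pass", "qb_dropback": 1, "qb_scramble": 0},
    {"desc": "sack", "qb_dropback": 1, "qb_scramble": 0},
    {"desc": "QB scramble", "qb_dropback": 1, "qb_scramble": 1},
    {"desc": "handoff", "qb_dropback": 0, "qb_scramble": 0},
]

labels = [label_play(p) for p in plays]
```

Under this rule the scramble receives the same label as the completed pass and the sack, which is the intent-based definition described above.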
Baseline and Formation-Aware Models (XGBoost): Our initial two models used XGBoost, a gradient boosting framework. This method builds an ensemble of decision trees sequentially, where each new tree is trained to correct the errors of the previous ones. The model estimates the log-odds of a pass, and the final probability is given by the sigmoid function: \hat{p}_i = \sigma(f(\mathbf{x}_i)), where f(\mathbf{x}_i) is the final ensemble score. The first model used only basic situational data, while the second used situational, personnel, and basic formation data.
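To make the link between the ensemble score and the probability concrete, here is a small sketch of the final step only. It does not train a model: the per-tree outputs and the base score are hypothetical numbers chosen for illustration, standing in for the leaf values a fitted boosted ensemble would return for one play.

```python
import math


def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_score(tree_outputs, base_score=0.0):
    """f(x): the sum of each tree's contribution on the log-odds scale."""
    return base_score + sum(tree_outputs)


# Hypothetical contributions from three trees for a single play.
tree_outputs = [0.40, 0.25, -0.10]

f_x = ensemble_score(tree_outputs)  # log-odds of a pass
p_hat = sigmoid(f_x)                # estimated pass probability
```

Each additional tree nudges the log-odds up or down, and the sigmoid converts the running total into a probability between 0 and 1.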
Tracking Model (Generalized Additive Models): To incorporate player tracking data, which contains complex, non-linear spatial patterns, we transitioned to Generalized Additive Models (GAMs). A GAM models the log-odds of a pass as a sum of smooth functions of the predictors, allowing for more flexible relationships. The model has the form:
\text{logit}(\hat{p}_i) = \log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right) = \beta_0 + \sum_{j=1}^{k} s_j(x_{ij})
Here, each s_j is a non-parametric smooth function (e.g., a spline) that is learned from the data. This framework allowed us to capture the nuanced effects of player spacing and motion on play-calling without assuming a linear relationship, boosting accuracy by 5-10%.
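The additive structure can be illustrated with a toy evaluation of the formula. This is only a sketch under stated assumptions: the smooth functions are piecewise-linear stand-ins for fitted splines, and the knots, values, intercept, and the feature name ydstogo are all hypothetical rather than estimates from the project.

```python
import math
from bisect import bisect_right


def make_smooth(knots, values):
    """Piecewise-linear stand-in for a fitted smooth s_j (flat outside the knots)."""
    def s(x):
        if x <= knots[0]:
            return values[0]
        if x >= knots[-1]:
            return values[-1]
        k = bisect_right(knots, x) - 1
        t = (x - knots[k]) / (knots[k + 1] - knots[k])
        return values[k] + t * (values[k + 1] - values[k])
    return s


# Hypothetical smooths on the log-odds scale, one per feature.
smooths = {
    "ydstogo": make_smooth([1, 5, 10, 20], [-1.0, 0.0, 0.4, 1.2]),
    "off_hull_area": make_smooth([50, 150, 300], [-0.5, 0.0, 0.6]),
}
beta_0 = 0.3  # hypothetical intercept


def gam_pass_probability(x):
    """logit(p) = beta_0 + sum_j s_j(x_j), then invert the logit."""
    logit = beta_0 + sum(s(x[name]) for name, s in smooths.items())
    return 1.0 / (1.0 + math.exp(-logit))


p = gam_pass_probability({"ydstogo": 10, "off_hull_area": 150})
```

Because each feature enters through its own curve, the contribution of a single predictor can be read off and plotted independently of the others.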
Feature Engineering: Quantifying the Pre-Snap Environment
Our model’s strength comes from a comprehensive feature set designed to capture every aspect of the pre-snap environment. We group our 34 final features into four main categories.
1. Situational and Personnel Features
These foundational features describe the game’s context. They are the most direct and powerful predictors of play-calling.
- Game State: We capture the score with score_differential and wp (win probability), and the time with flags like two_minute_warning.
- Down and Distance: The most critical context is captured by yardline_100, flags for third_down and fourth_down, and interaction terms like down_x_distance and red_zone_x_down.
- Personnel and Formation: We use simple counts of players (n_rb, n_te, n_wr) and a flag for trips_formation to understand offensive intent. We also include flags for no_huddle and the interaction shotgun_x_down.
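A few of the flags and interaction terms listed above could be derived along these lines. The exact definitions are not given in the text, so this sketch assumes the interactions are simple products and that the red zone means 20 or fewer yards from the end zone; the input names ydstogo and shotgun are also assumptions.

```python
def situational_features(down, ydstogo, yardline_100, shotgun):
    """Build a handful of pre-snap flags and interaction terms for one play."""
    # Assumed definition: red zone = 20 or fewer yards from the end zone.
    red_zone = 1 if yardline_100 <= 20 else 0
    return {
        "third_down": int(down == 3),
        "fourth_down": int(down == 4),
        # Assumed definition: interactions are plain products.
        "down_x_distance": down * ydstogo,
        "red_zone_x_down": red_zone * down,
        "shotgun_x_down": shotgun * down,
    }


# Hypothetical play: 3rd-and-8 from the opponent's 15, in shotgun.
feats = situational_features(down=3, ydstogo=8, yardline_100=15, shotgun=1)
```

Interaction terms like these let a model treat, say, shotgun on first down differently from shotgun on third down without having to discover the combination on its own.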
2. Player Kinematics (Basic Tracking)
Using the player tracking data, we calculate basic metrics of player movement at the snap.
- Speed and Direction: We capture the max_speed of any player on the field and the direction_variance, which measures how much players’ headings differ. High variance can indicate a complex post-snap plan.
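Headings are angles, so a naive variance can mislead: 359 degrees and 1 degree are nearly identical directions. One reasonable way to compute a direction-spread feature is the circular variance, sketched below. This is an assumed definition for illustration; the text does not specify how direction_variance is actually calculated, and the example headings and speeds are hypothetical.

```python
import math


def circular_variance(headings_deg):
    """1 minus the mean resultant length: 0 = identical headings, 1 = fully dispersed."""
    n = len(headings_deg)
    c = sum(math.cos(math.radians(h)) for h in headings_deg) / n
    s = sum(math.sin(math.radians(h)) for h in headings_deg) / n
    return 1.0 - math.hypot(c, s)


def max_speed(speeds):
    """Fastest player at the snap."""
    return max(speeds)


aligned = circular_variance([88, 90, 92])  # nearly the same heading
opposed = circular_variance([0, 180])      # opposite headings
wrapped = circular_variance([359, 1])      # close together across the 0/360 seam
fastest = max_speed([0.4, 1.8, 3.1, 0.0])  # hypothetical speeds in yards/second
```

The wrapped case is the reason for using a circular statistic: an ordinary variance of 359 and 1 would be large, while the circular variance is close to zero.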
3. Formation Geometry (Advanced Spatial)
These features translate the (x, y) coordinates of all 22 players into a holistic description of the offensive and defensive shapes.
- Offensive Shape: We measure the wr_spread (how far apart the receivers are), backfield_depth (distance from QB to RB), and te_alignment.
- Defensive Shape: We calculate the db_spread (how spread out the defensive backs are), the overall def_depth of the defense, and the def_coverage_depth (average depth of the secondary).
- Formation-level Concepts: We compute formation_compactness (how tightly clustered players are), formation_density (players per square yard), and metrics for formation_symmetry and formation_balance to see if a formation is lopsided.
4. Advanced Spatial Concepts: Control and Structure
This is the most innovative part of our feature set, where we use advanced concepts to quantify spatial dominance and defensive cohesion.
- Convex Hulls: By stretching a “digital rubber band” around all offensive players, we calculate the off_hull_area. A larger area suggests a more spread-out, pass-oriented formation.
- Defensive Graph Theory: We model the defense as a network to measure its structure. Features like def_graph_clustering_coef tell us if the defense is arranged in tight pods, while def_graph_avg_betweenness identifies how layered or connected the structure is.
- Voronoi Diagrams & Pitch Control: To measure which team controls more space, we use Voronoi diagrams. Imagine drawing a cell around each player that includes all points on the field closer to them than to anyone else. We can then calculate the off_voronoi_area_mean and def_voronoi_area_mean to see who is responsible for more space. A key summary metric is the Pitch Control Ratio, which quantifies the spatial dominance between the two units:
\text{Pitch Control} = \frac{\text{Total Area Controlled by Offense}}{\text{Total Area Controlled by Offense + Defense}}
A value greater than 0.5 suggests the offense is more spread out and controls more of the field, which is often a strong indicator of a pass. This single feature elegantly summarizes the spatial battle at the line of scrimmage.
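The hull area and the pitch control ratio can both be sketched in plain Python. This is an illustrative approximation, not the project's implementation: the hull uses a standard monotone-chain construction, and the Voronoi areas are approximated by assigning each cell of a grid to its nearest player rather than by building exact Voronoi polygons. The function names, player coordinates, and field window are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def pitch_control_ratio(offense, defense, x_range, y_range, step=1.0):
    """Share of grid cells whose nearest player is on offense (discrete Voronoi)."""
    nx = int(round((x_range[1] - x_range[0]) / step))
    ny = int(round((y_range[1] - y_range[0]) / step))
    off_cells = 0
    for i in range(nx):
        x = x_range[0] + (i + 0.5) * step  # cell center
        for j in range(ny):
            y = y_range[0] + (j + 0.5) * step
            d_off = min((x - px) ** 2 + (y - py) ** 2 for px, py in offense)
            d_def = min((x - px) ** 2 + (y - py) ** 2 for px, py in defense)
            if d_off < d_def:
                off_cells += 1
    return off_cells / (nx * ny)


# Hypothetical coordinates in yards.
off_hull_area = polygon_area(convex_hull([(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]))
ratio = pitch_control_ratio([(3.0, 5.0)], [(9.0, 5.0)], (0, 10), (0, 10))
```

In the toy example the interior point (2, 1) does not affect the hull, and the offense's single player controls every cell on its side of the midpoint between the two players. A smaller grid step gives a closer approximation to exact Voronoi areas at the cost of more computation.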
Model Analysis
This section provides a detailed look at the diagnostic plots for our predictive models, organized by modeling stage.
The following diagnostics cover the initial XGBoost models, which were built using situational and personnel data before the inclusion of advanced tracking features.
These plots summarize the overall performance of the non-tracking model. The ROC curves show the progression from a basic model to one aware of personnel, and the performance across years demonstrates the model’s consistency.
These plots explore the features driving the model. The first chart shows the most important individual predictors, while the second shows how different categories of features contribute to the model’s overall predictive power.
This feature, representing a team’s historical tendency in similar situations, was a powerful predictor. These plots show its distribution and its strong linear relationship with the actual pass rate on a given play.
The following diagnostics cover the final Generalized Additive Model (GAM), which incorporates the full set of advanced spatial features derived from player tracking data.
This section summarizes the model’s top-level performance. The summary shows key classification metrics, the confusion matrix breaks down the specific prediction accuracy for runs and dropbacks, and the accuracy heatmap shows how performance varies by game situation.
These plots visualize the model’s discriminative ability across different decision thresholds. The ROC and Precision-Recall curves are standard measures of a classifier’s power, especially for imbalanced datasets.
These plots dive into what drives the model’s predictions. We can see the statistical importance of our top features, how the model’s output changes with those features (Partial Dependence), and how the features work together to separate the two classes (PCA).
This final set of plots validates the model’s health. The calibration plot confirms that its predicted probabilities are reliable. The probability distribution plot shows a strong separation between predicted runs and passes. Finally, the residual plots confirm the absence of systematic bias in the model’s errors.
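For readers unfamiliar with calibration plots, the underlying table can be sketched as follows: predictions are grouped into probability bins, and the mean predicted probability in each bin is compared with the observed pass rate. The bin count and the example predictions are hypothetical, and this is not the code that produced the plots described above.

```python
def calibration_table(probs, labels, n_bins=5):
    """Return (mean predicted prob, observed pass rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            rows.append((round(mean_p, 3), round(observed, 3), len(b)))
    return rows


# Hypothetical predictions and outcomes (1 = pass, 0 = run).
probs = [0.1, 0.1, 0.3, 0.7, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1]
rows = calibration_table(probs, labels)
```

A well-calibrated model has the first two columns close to each other in every bin, which corresponds to points lying near the diagonal of a calibration plot.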
The Surprisal Pass-Rusher Metric
With a model that produces a pass probability \hat{p}_i for each play, we can now evaluate pass-rusher performance. We use surprisal, a concept from information theory that quantifies the “unexpectedness” of an event.
For any given play i, the surprisal of observing a pass, measured in bits, is the negative base-2 logarithm of its estimated probability:
S_i(\text{Pass}) = -\log_2(\hat{p}_i)
A high surprisal score for a pass play indicates that it was difficult to anticipate based on pre-snap context (e.g., a pass on 3rd-and-1). Conversely, the surprisal of observing a run is:
S_i(\text{Run}) = -\log_2(1 - \hat{p}_i)
Importantly, the average level of surprisal across the league is remarkably stable year-over-year, suggesting that it captures a fundamental and consistent property of NFL game strategy. This stability supports its use as a robust metric for situational difficulty.
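A short worked example shows how the surprisal values behave. The sketch assumes a base-2 logarithm, consistent with the unit of bits stated above; the probabilities are hypothetical.

```python
import math


def surprisal_bits(p_pass, was_pass):
    """Surprisal of the observed play type, given the model's pass probability."""
    p_event = p_pass if was_pass else 1.0 - p_pass
    return -math.log2(p_event)


coin_flip = surprisal_bits(0.5, was_pass=True)          # 1 bit
unexpected_pass = surprisal_bits(0.25, was_pass=True)   # 2 bits
unexpected_run = surprisal_bits(0.75, was_pass=False)   # 2 bits
obvious_pass = surprisal_bits(0.95, was_pass=True)      # well under 1 bit
```

Halving the predicted probability of the observed event adds exactly one bit, so a pass the model gave a 25% chance carries twice the surprisal of a 50/50 call.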
Aggregating Performance: The Surprisal-Weighted Disruption Rate
We use play-specific surprisal to create a context-aware evaluation metric for pass rushers. This metric, the Surprisal-Weighted Disruption Rate (SWDR), rewards players for generating pressure (sacks or QB hits) on plays that were harder to predict.
First, we define a player’s total weighted production. Let \mathcal{S} be the set of all pass plays where a player recorded a sack, and \mathcal{H} be the set of plays where they recorded a QB hit.
- Weighted Sacks (WS): The sum of surprisal scores for every play where the player recorded a sack.
\text{WS} = \sum_{i \in \mathcal{S}} S_i(\text{Pass})
- Weighted Hits (WH): The sum of surprisal scores for every play where the player recorded a QB hit.
\text{WH} = \sum_{i \in \mathcal{H}} S_i(\text{Pass})
The sum of these two values, \text{WS} + \text{WH}, gives a player’s total disruption, weighted by situational difficulty.
Next, to create a rate, we must define the player’s total opportunity, also weighted by difficulty. Let \mathcal{D} be the set of all pass snaps a player participated in.
- Total Pass-Play Surprisal (TotalS): This term normalizes a player’s production by their total pass-rushing opportunities, with each opportunity weighted by its surprisal.
\text{TotalS} = \sum_{j \in \mathcal{D}} S_j(\text{Pass})
Finally, the Surprisal-Weighted Disruption Rate (SWDR) is the ratio of a player’s total weighted disruption to their total weighted opportunity:
\text{SWDR} = \frac{\text{WS} + \text{WH}}{\text{TotalS}} = \frac{\sum_{i \in \mathcal{S}} S_i(\text{Pass}) + \sum_{i \in \mathcal{H}} S_i(\text{Pass})}{\sum_{j \in \mathcal{D}} S_j(\text{Pass})}
This rate provides a more nuanced measure of a pass rusher’s impact than raw sack counts or pressure rates by accounting for the context of every single pass rush snap.
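The SWDR calculation can be sketched for a single player as follows. The snap records, field names, and probabilities are hypothetical, and the surprisal uses a base-2 logarithm to match the unit of bits. A play flagged as both a sack and a hit would contribute to both WS and WH, as in the WS + WH definition.

```python
import math


def surprisal_pass(p_pass):
    """Surprisal (in bits) of a pass, given the predicted pass probability."""
    return -math.log2(p_pass)


def swdr(pass_snaps):
    """Surprisal-Weighted Disruption Rate over one player's pass-rush snaps."""
    ws = sum(surprisal_pass(s["p_pass"]) for s in pass_snaps if s["sack"])
    wh = sum(surprisal_pass(s["p_pass"]) for s in pass_snaps if s["qb_hit"])
    total_s = sum(surprisal_pass(s["p_pass"]) for s in pass_snaps)
    return (ws + wh) / total_s if total_s > 0 else 0.0


# Hypothetical pass snaps for one player.
snaps = [
    {"p_pass": 0.50, "sack": True, "qb_hit": False},   # 1 bit, sack
    {"p_pass": 0.25, "sack": False, "qb_hit": True},   # 2 bits, hit
    {"p_pass": 0.50, "sack": False, "qb_hit": False},  # 1 bit, no disruption
    {"p_pass": 0.50, "sack": False, "qb_hit": False},  # 1 bit, no disruption
]

rate = swdr(snaps)
raw_rate = sum(s["sack"] or s["qb_hit"] for s in snaps) / len(snaps)
```

In this toy case the raw disruption rate is 2 of 4 snaps (0.5), while the SWDR is 3 of 5 bits (0.6), because one of the two disruptions came on a play the model considered unlikely to be a pass.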
Results: Surprisal Reveals a New View of Performance
Our Surprisal-Weighted Disruption Rate (SWDR) provides a powerful new lens through which to view defensive performance. By prioritizing context and situational difficulty over raw volume, our rankings surface a different class of disruptor and challenge conventional wisdom about who the most effective pass rushers are.
The New Top Tier: Elite and Unexpected
Our overall rankings reveal a fascinating mix of established superstars and surprising overachievers. While elite edge rushers like Nick Bosa, Maxx Crosby, and Myles Garrett remain in the top tier, they are joined by players not typically found in top-10 discussions.
Key Takeaways:
- James Smith-Williams and Montez Sweat of Washington, along with veteran Jerry Hughes, rank in the top three. Their high placement suggests they are exceptionally skilled at generating pressure on early downs or in situations where the offense successfully disguises its intent. They don’t just win their matchups—they win them when it’s hardest to do so.
- Nick Bosa, while ranking 4th in our weighted metric, has a raw disruption rate nearly double that of anyone else in the top 8. This indicates that while Bosa is an undeniably dominant force, a significant portion of his production comes in obvious passing situations where his job is easier. Our metric still values his performance highly, but it adjusts for that favorable context.
The Risers: Rewarding the Unconventional Disruptors
The most powerful illustration of our metric’s value comes from identifying the “risers”—players whose rank improves most dramatically when switching from a raw disruption rate to our SWDR. These are the players who specialize in creating chaos on plays that look like runs.
Key Takeaways:
- Interior Linemen Shine: The list is populated by interior defenders like Logan Hall, D.J. Jones, Aaron Donald, and Matt Ioannidis. This is intuitive: interior rushers are often tasked with run-stopping duties on early downs. When they generate pressure on a play-action pass, they are succeeding in a truly unexpected and disruptive way.
- Aaron Donald’s Value: While Donald’s raw sack numbers in 2022 might have seemed low by his historic standards, his +32 rank change confirms his unique value. He consistently beats offensive linemen on plays where they have the schematic advantage, a testament to his elite skill that raw numbers can obscure.
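The rank change used to identify risers can be computed as shown in this sketch. The players and rates are hypothetical placeholders, not values from our rankings; a positive change means the player ranks higher under SWDR than under the raw disruption rate.

```python
def rank_by(players, key):
    """Rank players from 1 (best) downward on the given metric."""
    ordered = sorted(players, key=lambda p: p[key], reverse=True)
    return {p["name"]: i + 1 for i, p in enumerate(ordered)}


# Hypothetical players and rates for illustration only.
players = [
    {"name": "Player A", "raw_rate": 0.12, "swdr": 0.08},
    {"name": "Player B", "raw_rate": 0.10, "swdr": 0.11},
    {"name": "Player C", "raw_rate": 0.08, "swdr": 0.10},
]

raw_rank = rank_by(players, "raw_rate")
swdr_rank = rank_by(players, "swdr")

# Positive = riser (better rank under SWDR), negative = faller.
rank_change = {name: raw_rank[name] - swdr_rank[name] for name in raw_rank}
```

Here Player A leads on raw rate but falls two places under SWDR, while Players B and C each rise by one.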
The Fallers: When Context Matters
Conversely, the “fallers” are often highly productive players whose raw statistics are inflated by the situations they play in. These players are still effective, but our metric suggests their impact is more dependent on favorable, predictable passing downs.
Key Takeaways:
- Situational Specialists? Players like Josh Sweat (dropping from 4th to 50th) and Gregory Rousseau (13th to 47th) see the most significant drops. This suggests they are elite performers on 3rd-and-long but may be less disruptive on 1st and 2nd down play-action passes. They feast when the defense has the advantage, but our metric penalizes this lack of “surprising” disruption.
- A Tale of Two Linemen: Washington’s defensive line tells a compelling story. While James Smith-Williams and Montez Sweat are near the top of our overall rankings, their teammates Daron Payne and Jonathan Allen are among the biggest fallers. This highlights the different roles within a single unit: some players excel in the trenches on any down, while others are deployed as pass-rush specialists in obvious passing situations.
Zooming out from individual players, we can use these same concepts to analyze team-level offensive strategy. By plotting play-calling unpredictability (Average Surprisal) against offensive efficiency (EPA per Play), we can map the “Offensive DNA” of each team.
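The two axes of such a plot could be computed per team along these lines. This is a sketch with hypothetical plays and field names; it assumes surprisal is taken for the play type that actually occurred and is measured in bits.

```python
import math
from collections import defaultdict


def team_profile(plays):
    """Return {team: (average surprisal in bits, average EPA per play)}."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])  # surprisal sum, EPA sum, play count
    for play in plays:
        p_event = play["p_pass"] if play["is_pass"] else 1.0 - play["p_pass"]
        entry = acc[play["team"]]
        entry[0] += -math.log2(p_event)
        entry[1] += play["epa"]
        entry[2] += 1
    return {team: (s / n, e / n) for team, (s, e, n) in acc.items()}


# Hypothetical plays for a single hypothetical team.
plays = [
    {"team": "Team X", "p_pass": 0.50, "is_pass": True, "epa": 0.2},
    {"team": "Team X", "p_pass": 0.75, "is_pass": False, "epa": -0.1},
]

profile = team_profile(plays)
avg_surprisal, avg_epa = profile["Team X"]
```

Each team then becomes one point, with average surprisal on one axis and EPA per play on the other.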
Discussion
The central finding of this project is that context-aware metrics like the Surprisal-Weighted Disruption Rate (SWDR) can fundamentally reshape our understanding of defensive performance. Traditional statistics, such as raw sack counts, often fail to distinguish between a pressure generated on an obvious passing down (e.g., 3rd and long) and one created on a well-disguised play-action pass. Our SWDR metric directly addresses this gap by rewarding players who generate disruption when it is least expected, providing a truer measure of their diagnostic skill and athletic ability.
Our results empirically validate this approach. The emergence of players like Dorance Armstrong and Rasheem Green in our rankings demonstrates that valuable contributions are being overlooked by conventional analysis. These “quiet disruptors” specialize in winning their matchups on early downs or in balanced formations where offensive linemen cannot simply anticipate a pass rush. Conversely, some “headline names” see their rankings fall, suggesting their production may be partially inflated by playing in favorable, high-probability passing situations. This methodology provides teams with a tool to identify potentially undervalued assets in free agency or the draft.
Limitations
Despite its promising results, our approach has several key limitations that must be acknowledged:
Identity-Blind Modeling: The model is “identity-blind,” meaning it treats all teams, coaches, and players as equal. It does not know that a 3rd and 5 for the Kansas City Chiefs with Patrick Mahomes at quarterback is fundamentally different from the same situation for a team with a rookie QB. In reality, offensive tendencies and defensive expectations are heavily shaped by personnel and coaching philosophy. A truly robust model would incorporate these priors.
Limited Time Scope: Our tracking data analysis is confined to the first nine weeks of the 2022 NFL season. This is a relatively small sample size that may not reflect a player’s full-season performance. Schemes evolve, players return from injury, and tendencies shift over a 17-game season. Therefore, these results should be interpreted as an insightful snapshot rather than a definitive season-long evaluation.
Single Pre-Snap Snapshot: The model’s predictions are based exclusively on the pre-snap alignment. It is blind to crucial post-snap information, such as run-pass options (RPOs), where the final play type is determined by the quarterback’s read after the snap. It also fails to account for quarterback scrambles, broken plays, or complex line protections that are not apparent before the play begins. This means our estimate of a play’s “difficulty” is based on incomplete information.
Future Directions
Addressing these limitations points toward several exciting avenues for future research:
Developing Talent-Aware Models: The next logical step is to move from an identity-blind to a talent-aware model. This could be achieved by incorporating player-specific ratings (e.g., PFF grades) for quarterbacks and offensive linemen as features. A more advanced approach would involve building a hierarchical model with team- and coach-level parameters that adjust the baseline pass probabilities, allowing the model to learn specific opponent tendencies.
Incorporating Richer Pass-Rush Context: A more complete evaluation of pass rushers requires analyzing not just if they generated pressure, but how. By leveraging player tracking data, we can identify one-on-one matchups, double teams, slide protections, and chips from running backs or tight ends. The SWDR could then be calculated for specific situations (e.g., “true pass sets” or one-on-one rushes), providing an even clearer signal of a player’s individual dominance.
Expanding the Scope to Offensive Evaluation: This methodology can easily be flipped to evaluate offensive play-callers. A coach who consistently calls plays with high surprisal values and maintains high offensive efficiency could be considered a master of deception and strategy. This would provide a quantitative measure of a coach’s ability to keep defenses off-balance.
Ultimately, this work serves as a proof-of-concept for a more data-driven and nuanced approach to football analytics, moving beyond the box score to understand the strategic heart of the game.