NBA_315_final

Author

Cedric Tuttle, Ainika Hou, Sophia Urabe

Building a Champion Level NBA Team!

Building an NBA team capable of going all the way is the ultimate goal. However, there are several ways to accomplish this. The history of the NBA would have you believe that the only way to construct a championship-level roster is with a one-of-a-kind, freak-of-nature, MVP-level superstar like Nikola Jokic, Luka Doncic, Giannis Antetokounmpo, SGA, or even Wembanyama. This can be disheartening for teams with no generational player capable of leading them to a championship. For those teams there is good news! While some would argue that Jayson Tatum is an MVP-level player, having earned his fourth straight All-NBA First Team and fifth All-NBA selection, it has to be acknowledged that he did not play up to his standards in the 2024 NBA Finals. During those Finals the Celtics’ best player was playing some of his worst basketball, yet they dominated a team constructed around Luka Doncic, an international superstar and MVP candidate. The Celtics showed parity among their players as they dominated the NBA postseason through a team effort, with contributions from players all the way down the depth chart. A lot of this was thanks to their postseason experience, so we will attempt to answer some questions regarding the optimal way for general managers to build their contender. To gain insight into what to prioritize when building an NBA contender, we start with defense and our first research question: what are the indicators of star defense, and how do they vary by player position? Next, we move to the offensive side of the ball and look at how shot selection and player type affect points scored. Finally, we combine our analysis of defense and offense with a look at how age and fouls affect net rating.

Data:

library(readr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggridges)
library(corrplot)
library(GGally)
library(gridExtra)
nba.ds <- read_csv("https://data.scorenetwork.org/data/nba-player-stats-2021.csv")

We created several new variables and describe each one where it is introduced. It is worth noting that the dataset was filtered in two different ways, to include only players who played more than 30 games or more than 5 minutes per game, as we identified quite a few outliers below those thresholds.

Defense Top Performers

Variables Indicating Defense

To begin, because our dataset includes a plethora of statistics for each player, we homed in on those that could best explain defensive ability. The dataset includes the variable drtg (defensive rating), an estimate of points allowed per 100 possessions, so the lower a player’s drtg, the better their defensive ability.

To identify which variables were most associated with defensive ability, we looked at the correlation between a subset of the quantitative variables with drtg.

[We note here that because of some redundant variables in the dataset (for example, fg, fga, and fgpercent are quite similar in what they measure), we did not include all of them: for the “fg”, “x2”, “x3”, and “ft” variables, we only included the “percent” measurement; of “orb”, “drb”, and “trb”, we only included “drb” (since looking at defense, it made sense to consider defensive rebounds more than the other options).]

# Select numeric columns and remove rows with NAs
basketball_quant <- nba.ds |> 
  select(fgpercent, x2ppercent, x3ppercent, ftpercent, drb, ast, stl, blk, tov, pf, ortg, drtg) |> 
  na.omit()

# Get correlation matrix for all numeric variables
cor_matrix <- cor(basketball_quant, use = "complete.obs")
library(ggcorrplot)

# Get correlation vector for drtg and drop drtg itself
drtg_corr <- cor_matrix["drtg", ]
drtg_corr <- drtg_corr[names(drtg_corr) != "drtg"]

# Convert to data frame and order low to high
drtg_df <- data.frame(var = names(drtg_corr), corr = as.numeric(drtg_corr))
drtg_df <- drtg_df[order(drtg_df$corr), ]

# Reorder factor levels to reflect ordering
drtg_df$var <- factor(drtg_df$var, levels = drtg_df$var)

# Turn into a matrix with one row (drtg)
drtg_mat <- matrix(drtg_df$corr, nrow = 1)
colnames(drtg_mat) <- as.character(drtg_df$var)
rownames(drtg_mat) <- "drtg"

# Plot the drtg correlations as a heatmap
ggcorrplot(
  t(drtg_mat),
  method = "square",
  type = "full",
  lab = TRUE,
  lab_size = 4,
  show.legend = TRUE,
  legend.title = "",
  colors = c("blue", "white", "red")
) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 30, hjust = 0.7),
    legend.position = "top",
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(title = "Correlations with Defensive Rating (drtg)", 
       x = "", 
       y = "")

From the correlation matrix, we can observe that the variables most associated with defensive rating (drtg) are defensive rebounds (drb, r = -0.55), steals (stl, r = -0.5), and blocks (blk, r = -0.49). Moreover, from the sign of the correlation values, we know that higher values of these player statistics associate with a lower defensive rating, which is desirable. We can further see this behavior through scatterplots of each of these variables with drtg; additionally, the scatterplots can also give us insight into how these variables differ across the various player positions. [We note here that we first filtered the data to exclude rows where tm was “TOT” since these rows were redundant with the separate teams the player played for. Moreover, we only included players who played more than 30 games during the season, eliminating players who may have had a very good defensive rating, but only in a small handful of games.]

# Filter out "TOT" team rows; keep only players with more than 30 games
nba.ds_filtered <-  nba.ds |> 
  filter(tm != "TOT") |> 
  filter(g > 30)

# Create dataframe with just defense variables and position
basketball_defense <- nba.ds_filtered |> 
  select(drtg, drb, stl, blk, pos) |> 
  na.omit() 

# scatterplot of drb with drtg
scatter_drb <- basketball_defense |> 
  ggplot(aes(x = drb, y = drtg)) +
  geom_point(alpha = 0.5, aes(color = pos)) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  scale_color_discrete(labels = c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")) +
  labs(x = "Number of Defensive Rebounds", 
       y = "Defensive Rating", 
       color = "Position")

# scatterplot of stl with drtg
scatter_stl <- basketball_defense |> 
  ggplot(aes(x = stl, y = drtg)) +
  geom_point(alpha = 0.5, aes(color = pos)) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  scale_color_discrete(labels = c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")) +
  labs(x = "Number of Steals", 
       y = "Defensive Rating", 
       color = "Position")

# scatterplot of blk with drtg
scatter_blk <- basketball_defense |> 
  ggplot(aes(x = blk, y = drtg)) +
  geom_point(alpha = 0.5, aes(color = pos)) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  scale_color_discrete(labels = c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")) +
  labs(x = "Number of Blocks", 
       y = "Defensive Rating",
       color = "Position")

library(patchwork)
# combine scatterplots into one plot
scatter_drb + scatter_blk + scatter_stl + 
  plot_layout(guides = 'collect', 
              ncol = 2) +
# add shared title, shared subtitle
  plot_annotation(title = "Scatterplots of Top Variables with Defensive Rating (drtg)",
                  subtitle = "By player position")

Defensive Performance by Player Position

Upon observing the general clustering of the points by player position for each of these variables, we became curious about whether certain player positions are indeed more defensively strong than others for each of these measures. To explore this observation, we first explored the distributions of drtg for each of the positions to get an initial sense of the position with the lowest defensive rating; we followed this up by utilizing boxplots and ANOVA tests to investigate for each of drb, blk, and stl whether the means of these variables actually differed across the player positions.

library(ggplot2)
library(ggridges)
library(dplyr)

# Calculate mean drtg per position
mean_drtg <- nba.ds_filtered |>
  group_by(pos) |>
  summarize(mean_drtg = mean(drtg, na.rm = TRUE)) |>
  mutate(y_pos = as.numeric(factor(pos, levels = rev(sort(unique(pos))))))

# ggridges plot to illustrate distribution of drtg for each position with means 
nba.ds_filtered |> 
  ggplot(aes(x = drtg, y = pos, fill = pos)) +
  geom_density_ridges(alpha = 0.7, scale = 1.1) +
  # Dashed segments across each ridge at mean
  geom_segment(data = mean_drtg,
               aes(x = mean_drtg, xend = mean_drtg, 
                   y = y_pos, yend = y_pos + 0.7),
               linetype = "dashed", color = "black", linewidth = 0.7,
               inherit.aes = FALSE) +
  # Add text labels near the lines
  geom_text(data = mean_drtg,
            aes(x = mean_drtg, y = y_pos + 0.35, 
                label = paste0("mean = ", round(mean_drtg, 1))),
            color = "black", size = 3.5, hjust = -0.1,
            inherit.aes = FALSE) +
  scale_y_discrete(
    limits = rev(sort(unique(nba.ds_filtered$pos))),
    labels = c(
      "PG" = "Point Guard",
      "SG" = "Shooting Guard",
      "SF" = "Small Forward",
      "PF" = "Power Forward",
      "C" = "Center"
    )
  ) +
  theme_minimal() +
  labs(
    title = "Distribution of Defensive Rating by Position",
    x = "Defensive Rating (drtg)",
    y = "Position"
  ) +
  theme(legend.position = "none")

box_drb <- nba.ds_filtered |> 
  ggplot(aes(x = pos, y = drb)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  theme_minimal() +
  annotate("text", x = 3, y = 15, label = "ANOVA test \n p-value < 2e-16", size = 2.5) + 
  scale_x_discrete(labels = c("Center", "Power\nForward", "Point\nGuard", "Small\nForward", "Shooting\nGuard")) +
  labs(
    title = "Defensive Rebounds by Position",
    x = "Position",
    y = "Defensive Rebounds"
  )

box_stl <- nba.ds_filtered |> 
  ggplot(aes(x = pos, y = stl)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  theme_minimal() +
  annotate("text", x = 4, y = 4, label = "ANOVA test \n p-value = 9.46e-05", size = 2.5) + 
  scale_x_discrete(labels = c("Center", "Power\nForward", "Point\nGuard", "Small\nForward", "Shooting\nGuard")) +
  labs(
    title = "Total Steals by Position",
    x = "Position",
    y = "Total Steals"
  ) 

box_blk <- nba.ds_filtered |> 
  ggplot(aes(x = pos, y = blk)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  theme_minimal() +
  annotate("text", x = 3, y = 3.5, label = "ANOVA test \n p-value < 2e-16", size = 2.5) + 
  scale_x_discrete(labels = c("Center", "Power\nForward", "Point\nGuard", "Small\nForward", "Shooting\nGuard")) +
  labs(
    title = "Total Blocks by Position",
    x = "Position",
    y = "Total Blocks"
  ) 

box_drb + box_blk + box_stl + 
  plot_layout(ncol = 2) +
# add shared title, shared subtitle, shared caption
  plot_annotation(title = "Boxplots of Top Variables",
                  subtitle = "Across player position")

The boxplots of these variables offered some preliminary observations: for all three variables, there appeared to be a significant difference in means across player positions, although visually this difference was more evident for Defensive Rebounds and Blocks than for Total Steals. Moreover, because Centers had the lowest mean defensive rating and the highest means for Defensive Rebounds and Blocks, we continued our analysis by investigating the relationship between these three variables.
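The p-values annotated on the boxplots come from one-way ANOVA tests whose code is not shown above. As a minimal sketch of how such a test could be run, here is an example on simulated data. The data frame `toy` and all of its values are made up for illustration; with the real data one would presumably use `nba.ds_filtered` and the columns `drb`, `stl`, and `blk` in turn.

```r
# Simulated stand-in for nba.ds_filtered (hypothetical values, not real NBA data)
set.seed(315)
toy <- data.frame(
  pos = rep(c("C", "PF", "PG", "SF", "SG"), each = 40),
  drb = c(rnorm(40, 9, 2), rnorm(40, 7, 2), rnorm(40, 4.5, 2),
          rnorm(40, 5.5, 2), rnorm(40, 4.5, 2))
)

# One-way ANOVA: does mean drb differ across positions?
fit <- aov(drb ~ pos, data = toy)
anova_table <- summary(fit)[[1]]

# Extract the p-value for the pos term (first row of the ANOVA table)
p_value <- anova_table[["Pr(>F)"]][1]
print(anova_table)
```

Extracting `p_value` from the table, rather than typing it into `annotate()` by hand, keeps the plot label in sync if the filtering thresholds change.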

Top Defense Players

Our observation that high numbers of defensive rebounds and blocks are associated with low defensive ratings led us to wonder which players in the 2021-2022 NBA cohort had the best of these stats. Thus, we looked to identify the players above the 95th percentile for both defensive rebounds and blocks; in the plot, the dashed gray lines mark the 95th percentile cutoffs for each variable.

library(ggrepel)
# Calculate the 95th percentiles for drb and blk
drb_95th <- quantile(nba.ds_filtered$drb, 0.95, na.rm = TRUE)
blk_95th <- quantile(nba.ds_filtered$blk, 0.95, na.rm = TRUE)


# Filter to only keep players playing > 30 games 
# who are above the 95th percentile in both drb and blk
top_players <- nba.ds_filtered |>
  filter(drb > drb_95th, blk > blk_95th)


# Plot with highlighted players BY POSITION
nba.ds_filtered |> 
  filter(g > 30) |> # only looking at players who played more than 30 games
  ggplot(aes(x = drb, y = blk)) +
  geom_point(alpha = 0.3, aes(color = pos)) +
  geom_point(data = top_players, 
             aes(x = drb, y = blk, color = pos),
             size = 2.25, 
             alpha = 0.7,
             show.legend = FALSE) +
  geom_text_repel(data = top_players, 
                  aes(label = player), 
                  size = 4.4, 
                  color = "firebrick") +
  scale_color_discrete(labels = c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")) +
  geom_vline(xintercept = drb_95th, linetype = "dashed", color = "gray") +
  geom_hline(yintercept = blk_95th, linetype = "dashed", color = "gray") +
  labs(
    title = "Top Rebounders and Shot Blockers",
    subtitle = "Players above the 95th percentile in both DRB and BLK highlighted",
    x = "Defensive Rebounds (drb)",
    y = "Blocks (blk)", 
    color = "Position"
  ) +
  theme_minimal()

Since these players have a high number of defensive rebounds and number of blocks, we would expect that their defensive ratings would be on the lower end. Indeed, among players that competed in more than 30 games, three out of the four of these players appear in the lowest 12 defensive ratings as illustrated by the figure below. This analysis suggests that these players may be valuable additions to our “dream” team if we are looking to maximize defensive power.

library(gghighlight)
# Step 1: Count how many times each player appears
player_counts <- nba.ds |>
  filter(g > 30) |>
  count(player)

# Step 2: Join that count back to the original data
basketball_with_counts <- nba.ds |>
  left_join(player_counts, by = "player")

# Step 3: Keep only:
# - the "TOT" row if player played for multiple teams (n > 1)
# - OR the only row if player played for one team (n == 1)
basketball_cleaned <- basketball_with_counts |>
  filter((n > 1 & tm == "TOT") | n == 1)

basketball_cleaned |> 
  arrange((drtg)) |>
  head(12)  |> # Select the 12 players with the lowest drtg
  ggplot(aes(x = fct_reorder(player, drtg, .desc = TRUE), y = drtg)) +
  geom_bar(stat = "identity", fill = "salmon", width = 0.8) +
  geom_text(aes(label = round(drtg, 1)), 
            hjust = -0.1, vjust = 0.5, color = "black", size = 3) +  # Adjust `vjust` and `size` to position and size text 
  gghighlight(player %in% top_players$player, 
              label_key = player, 
              use_direct_label = TRUE) +
  coord_flip() +  
  labs(x = "Player", 
       y = "Defensive Rating", 
       title = "Top 12 Player Defensive Ratings") + 
  theme_classic() + 
  theme(legend.position = "none")

Creating Dynamic Offense

Player Efficiency

All NBA teams have identities that vary in uniqueness but remain a part of how well teams perform. Part of this team identity is coaching and play style. This is extremely interesting for us as we look to build a team that will have the highest success. The interesting part of an identity and play style is shot selection: what shots does a team like, and which position does the majority of the scoring for winning teams? It has long been the case that high-percentage two-point shots were how teams looked to win. Legends like Shaq dominated the paint and scored nearly all their points from inside. Other legends like Jordan perfected the mid-range shot and drove constantly. Only in more recent years have we seen the rapid growth of the three-point shot. Steph Curry is credited with revolutionizing the game by drastically increasing his 3-point shot attempts and making them consistently.

Initially, we look at true shooting percentage by position to see whether one position converts its shot attempts into points more efficiently than the others:

Let’s build our true shooting variable using the following formula:

nba.ds <- nba.ds %>%
  mutate(true_shooting = pts / (2 * (fga + 0.44 * fta)))
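As a quick sanity check of the formula, here is a worked example on a made-up stat line. The numbers are hypothetical and not taken from the dataset. The 0.44 multiplier is the conventional approximation for the share of free throw attempts that use up a shooting possession.

```r
# Hypothetical stat line: 25 points on 18 field goal attempts and 6 free throw attempts
pts <- 25
fga <- 18
fta <- 6

# Estimated shooting possessions: 18 + 0.44 * 6 = 20.64
shooting_possessions <- fga + 0.44 * fta

# True shooting: points divided by twice the estimated shooting possessions
true_shooting <- pts / (2 * shooting_possessions)

round(true_shooting, 3)  # 0.606
```

A value of 0.606 means the player scores about 1.21 points per estimated shooting possession, which is why true shooting is often read as an efficiency measure that accounts for twos, threes, and free throws together.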

Now let’s look at how position affects true shooting:

# Step 1: Count players per position
position_counts <- nba.ds %>%
  group_by(pos) %>%
  summarise(n = n()) %>%
  filter(n >= 5)  # Keep positions with at least 5 players

# Step 2: Filter the main dataset
filtered_nba <- nba.ds %>%
  filter(g >= 15) %>%
  filter(pos %in% position_counts$pos)

ggplot(filtered_nba, aes(x = pos, y = true_shooting, fill = pos)) +
  geom_violin(trim = FALSE, alpha = 0.7) +
  geom_boxplot(width = 0.1, fill = "white", outlier.size = 0.5) +
  labs(title = "Distribution of True Shooting % by Position",
       x = "Position",
       y = "True Shooting %") +
  theme_minimal() +
  theme(legend.position = "none")

This violin plot shows that centers have a higher true shooting percentage than the other positions, suggesting that if we want a high-scoring offense, running the scoring through a center will yield the most points per attempted shot. However, the wrinkle introduced to the NBA when Curry started attempting threes in double digits is that players can now make a smaller percentage of their shots and make up for it with the extra point granted by the three-point line. In essence, we have to take into consideration that high-percentage shot makers might not actually be scoring more despite making more of their shot attempts. This is the well-known trade-off of shooting more threes: missing a higher percentage of total shots yet scoring more. To address this complexity, we created another new variable, “scoring_type”, to account for each player’s shot selection. Instead of looking at points per game by position, we can now look at points per game by scoring type.

Players Scoring Type Affects Output

To create our variable we used k-means clustering:

# Step 1: Create three_point_ratio
filtered_nba <- filtered_nba %>%
  mutate(three_point_ratio = `x3pa` / `fga`)

# Step 2: Standardize features
nba_scaled <- filtered_nba %>%
  select(three_point_ratio, `x3pa`, `fga`) %>%
  scale()

# Step 3: K-means clustering
set.seed(123)
kmeans_result <- kmeans(nba_scaled, centers = 3)

# Step 4: Add cluster assignments back to filtered_nba
filtered_nba <- filtered_nba %>%
  mutate(cluster = kmeans_result$cluster)

# Step 5: Label the clusters
filtered_nba <- filtered_nba %>%
  mutate(scoring_type = case_when(
    cluster == 1 ~ "3-Point Sniper",
    cluster == 2 ~ "Inside Scorer",
    cluster == 3 ~ "Balanced Scorer",
    TRUE ~ "Unknown"
  ))

# You can also convert this into a data frame for easier readability
cluster_centers <- as.data.frame(kmeans_result$centers)
print(cluster_centers)
  three_point_ratio       x3pa        fga
1         0.8488366  0.4226362 -0.5103212
2        -1.0829398 -1.1329857 -0.4978258
3         0.1339371  0.7050967  1.1546997

Thanks to the k-means clustering, we created three categories of players: “3-Point Sniper”, “Inside Scorer”, and “Balanced Scorer”. We did this using a three-point ratio, the ratio of 3-point attempts to field goal attempts, which identifies how many of a player’s shot attempts were 3-pointers. A quick clarification: there is no such thing as a negative field goal attempt. The cluster centers printed above are negative in places because we scaled the data to standardize the variables for the k-means distance calculation, so a negative value simply means below the sample average.
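One caveat with the labeling step above is that k-means cluster numbers are arbitrary: a different seed or a different filter can swap which cluster is 1, 2, or 3. A possible safeguard, sketched here on simulated data, is to derive the labels from the cluster centers instead of from the cluster index. The data frame `toy`, its values, and the labeling rule (lowest three-point ratio is the inside scorer, highest shot volume among the rest is the balanced scorer) are assumptions for illustration, not the method used in this report.

```r
set.seed(123)
# Simulated players (hypothetical values): three rough shot profiles
toy <- data.frame(
  three_point_ratio = c(rnorm(50, 0.60, 0.05), rnorm(50, 0.10, 0.05), rnorm(50, 0.40, 0.05)),
  fga               = c(rnorm(50, 12, 1.5),    rnorm(50, 12, 1.5),    rnorm(50, 25, 1.5))
)

# Standardize so both features contribute equally to the distance calculation
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Derive labels from the centers rather than from the arbitrary cluster numbers
centers     <- as.data.frame(km$centers)
inside_id   <- which.min(centers$three_point_ratio)          # fewest threes
remaining   <- setdiff(1:3, inside_id)
balanced_id <- remaining[which.max(centers$fga[remaining])]  # most shots among the rest
sniper_id   <- setdiff(remaining, balanced_id)

labels <- character(3)
labels[inside_id]   <- "Inside Scorer"
labels[balanced_id] <- "Balanced Scorer"
labels[sniper_id]   <- "3-Point Sniper"

toy$scoring_type <- labels[km$cluster]
table(toy$scoring_type)
```

With this approach the label follows the shot profile of each cluster, so rerunning the clustering with another seed should not silently relabel the groups.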

Now that we have our scoring type groups, we can look at how they affect points scored per game. For this, let’s analyze the following ridgeline plot comparing the distribution of points per game by scoring type.

# Calculate means per group
mean_labels <- filtered_nba %>%
  group_by(scoring_type) %>%
  summarise(mean_pts = mean(pts, na.rm = TRUE))

# Create plot with "mean = " in labels
ggplot(filtered_nba, aes(x = pts, y = scoring_type, fill = scoring_type)) +
  geom_density_ridges(scale = 1.2, alpha = 0.7) +
  geom_text(data = mean_labels, 
            aes(x = mean_pts, y = scoring_type, label = paste0("mean = ", round(mean_pts, 1))), 
            inherit.aes = FALSE,
            vjust = -0.5, 
            color = "black", 
            size = 3.5) +
  labs(title = "Distribution of Points Per Game by Scoring Type",
       x = "Points Per Game",
       y = "Scoring Type") +
  theme_minimal() +
  theme(legend.position = "none")

Looking at the ridgeline plot, it is clear that balanced scorers tend to score more points per game, with an average of 26.9 points per game. This is higher than the averages of the other two scoring types: inside scorers average 19 points and three-point snipers average 16.7.

To check these findings, we ran an ANOVA test to determine whether the differences in mean points per game were statistically significant:

anova_result <- aov(pts ~ scoring_type, data = filtered_nba)
summary(anova_result)
              Df Sum Sq Mean Sq F value Pr(>F)    
scoring_type   2  10400    5200   231.9 <2e-16 ***
Residuals    566  12690      22                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our p-value is extremely small, well below conventional significance levels (alpha = 0.05 or 0.01), indicating that the differences in mean points per game across scoring types are statistically significant.

There are a couple of explanations for this. The most obvious one is that the NBA stars who score the most tend to have the skill set to score at all three levels (the 3-point line, mid-range, and at the rim). As stars, they are better than the average player and will therefore make more of their shots. For example, Steph Curry is a star, and despite being known for his three-point shot he is a very capable inside scorer with a good mid-range shot, drive, and floater. He is also an exceptional shooter, and therefore his fantastic scoring average is reflected in the balanced scorer type and not in the 3-point sniper type. The other reason is that stars who balance their shot selection are given more shot attempts. The bottom line is that the more shots a player takes, the more points they will score even if they are less efficient, and stars are allowed to shoot slightly lower percentages while attempting plenty of shots.

Other Confounding Explanations Exploration

To tackle this last concern, we can look at field goal attempts by scoring type to see whether balanced scorers attempt more field goals, which would suggest that their superior points per game comes simply from taking more shots than players in the other two scoring types.

# Calculate means per group
mean_labels <- filtered_nba %>%
  group_by(scoring_type) %>%
  summarise(mean_pts = mean(fga, na.rm = TRUE))

# Create a boxplot
ggplot(filtered_nba, aes(x = scoring_type, y = fga, fill = scoring_type)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.shape = 1) +  # boxplot with outliers
  geom_point(data = mean_labels,
             aes(x = scoring_type, y = mean_pts),
             color = "black",
             size = 3,
             shape = 18) +  # plot mean points
  geom_text(data = mean_labels,
            aes(x = scoring_type, y = mean_pts, label = paste0("mean = ", round(mean_pts, 1))),
            vjust = -1.5,
            size = 3.5) +
  labs(title = "Field Goals Attempted by Scoring Type",
       x = "Scoring Type",
       y = "Field Goals Attempted") +
  theme_minimal() +
  theme(legend.position = "none")

As we suspected, part of the reason for the higher points per game is the higher number of shots attempted by balanced scorers. What is very interesting is that inside scorers and 3-point snipers attempt essentially the same number of field goals, 14.4 and 14.3 respectively. So if they attempt the same number of shots, how do inside scorers score more points per game on average? As we hinted earlier when we showed that centers have the highest true shooting percentage, we might expect inside scorers to make more of their shots than 3-point snipers. However, we were surprised, as we expected the less efficient 3-point snipers to make up for their lower efficiency by taking more shots worth 3 points and to end up scoring the same as or more than the inside scorers. This is not what we found: inside scorers score more points than 3-point snipers on the same number of shots.

We can visualize this using a ridgeline plot of true shooting by scoring type, which we expect to show that inside scorers have a higher true shooting percentage:

# Calculate means per group
mean_labels <- filtered_nba %>%
  group_by(scoring_type) %>%
  summarise(mean_pts = mean(true_shooting, na.rm = TRUE))

# Create plot with "mean = " in labels
ggplot(filtered_nba, aes(x = true_shooting, y = scoring_type, fill = scoring_type)) +
  geom_density_ridges(scale = 1.2, alpha = 0.7) +
  geom_text(data = mean_labels, 
            aes(x = mean_pts, y = scoring_type, label = paste0("mean = ", round(mean_pts, 3))), 
            inherit.aes = FALSE,
            vjust = -0.5, 
            color = "black", 
            size = 3.5) +
  labs(title = "Distribution of true shooting by Scoring Type",
       x = "true shooting",
       y = "Scoring Type") +
  theme_minimal() +
  theme(legend.position = "none")

And as we suspected, inside scorers have the highest true shooting of the three groups, confirming that their higher points per game is due, in part, to better efficiency. To check whether this difference is statistically significant, we can run another ANOVA test:

anova_result <- aov(true_shooting ~ scoring_type, data = filtered_nba)
summary(anova_result)
              Df Sum Sq Mean Sq F value   Pr(>F)    
scoring_type   2  0.161 0.08052   20.38 2.84e-09 ***
Residuals    566  2.236 0.00395                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is sufficiently low to conclude that mean true shooting percentage differs across the scoring types. Note that ANOVA alone shows only that at least one group mean differs; it does not by itself single out inside scorers.
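To find out which pairs of scoring types actually differ, a common follow-up to a significant ANOVA is Tukey's HSD. The sketch below runs it on simulated data. The data frame `toy` and its values are hypothetical; with the real data, one would presumably pass the `aov` fit of `true_shooting ~ scoring_type` on `filtered_nba` to `TukeyHSD()`.

```r
set.seed(42)
# Simulated true shooting values by scoring type (hypothetical, not real NBA data)
toy <- data.frame(
  scoring_type  = factor(rep(c("3-Point Sniper", "Balanced Scorer", "Inside Scorer"), each = 60)),
  true_shooting = c(rnorm(60, 0.55, 0.05), rnorm(60, 0.56, 0.05), rnorm(60, 0.61, 0.05))
)

fit <- aov(true_shooting ~ scoring_type, data = toy)

# Pairwise comparisons with family-wise error control
tukey <- TukeyHSD(fit, "scoring_type")

# One row per pair of groups; columns are diff, lwr, upr, and p adj
pairwise <- tukey$scoring_type
print(pairwise)
```

Each row gives the estimated difference in means for one pair of scoring types, a confidence interval, and an adjusted p-value, so the claim that inside scorers differ from each of the other two groups can be checked pair by pair.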

We say “in part” because there is one final confounding variable that we cannot measure with our dataset but that helps explain why inside scorers score more points per game than three-point snipers beyond what efficiency explains: shooting fouls. Shooting fouls on 3-pointers are relatively uncommon, estimated to account for roughly 5% of total shooting fouls. This means that the majority of shooting fouls occur on inside shots and therefore benefit inside scorers more than three-point snipers. It is important to note that true shooting tries to adjust for free throws, but points per game still reflects that inside scorers get more chances to score off fouls.
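To make the distinction between points per game and true shooting concrete, here is a small worked example with two made-up players who take the same number of field goal attempts and score the same points from the field, but draw different numbers of free throws. All names and numbers are hypothetical.

```r
# Two hypothetical players with identical field goal volume and scoring from the field
players <- data.frame(
  player = c("Perimeter shooter", "Foul-drawing inside scorer"),
  fga    = c(14, 14),
  fg_pts = c(16, 16),   # points from made field goals
  fta    = c(1, 6),
  ft_pts = c(1, 4)      # points from made free throws
)

players$pts           <- players$fg_pts + players$ft_pts
players$pts_per_fga   <- players$pts / players$fga
players$true_shooting <- players$pts / (2 * (players$fga + 0.44 * players$fta))

print(players[, c("player", "pts", "pts_per_fga", "true_shooting")])
```

The second player scores three more points on the same number of field goal attempts. True shooting narrows the gap between the two because it counts the extra free throw trips as shooting possessions, whereas raw points per game does not.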

So what happens when a player is fouled in the act of shooting? If the shot goes in, it counts as a made field goal and the player gets one additional free throw. If the shot misses, no field goal attempt (fga) is recorded and the player gets 2 free throws (3 on a three-point attempt). These free throws most likely result in points, so the player adds to their points per game without making a shot and without being charged a miss. We can measure how many free throws are attempted per scoring type and see this discrepancy:

# Calculate mean FTA per scoring type
mean_labels <- filtered_nba %>%
  group_by(scoring_type) %>%
  summarise(mean_pts = mean(fta, na.rm = TRUE))

# Create a bar plot
ggplot(mean_labels, aes(x = scoring_type, y = mean_pts, fill = scoring_type)) +
  geom_col(alpha = 0.7) +  
  geom_text(aes(label = paste0("mean = ", round(mean_pts, 1))),
            vjust = -0.5,
            color = "black",
            size = 3.5) +
  labs(title = "Average Free Throw Attempts by Scoring Type",
       x = "Scoring Type",
       y = "Average Free Throw Attempts") +
  theme_minimal() +
  theme(legend.position = "none")

And this is a big part of how inside scorers remain competitive despite the boom of the three-point shot, not to mention the other benefits of drawing fouls: putting opposing players in foul trouble, getting your team into the bonus, and slowing the game down by stopping the clock.

Blending Defense and Offense

Variables Affecting Net Rating

Connecting the offensive and defensive sides of basketball, we investigated the net rating of players and how it changes based on certain factors. Specifically, we focused on personal fouls and age. Personal fouls can reflect a player’s discipline on the court: players who foul often can be detrimental to team success, as fouling impedes efficiency on both offense and defense. Age is a valuable variable to explore because it may correlate with both net rating and foul rate. Thus, to build a “dream team” that maximizes team performance, teams should look to acquire players at an age that is associated with low personal fouls and high net rating. This motivates the next research question: “How does net rating relate to player age and foul rate, and which combinations of these two variables correspond to high net ratings?”

To examine this, we analyzed the variables:

  • age

  • pf

  • net_rt : a new variable calculated by subtracting the defensive rating from the offensive rating.

Given that a low defensive rating suggests better defense, a higher net rating reflects better performance. Additionally, we created a new variable called ‘mpg’ which refers to minutes played per game by a player during the 2021-2022 NBA season (discussed in more detail later in the report). We only included players that we have data for on their age, personal fouls, offensive and defensive rating, as well as players who have played more than 5 minutes per game during the 2021-2022 season. This ensures that the values that are analyzed are meaningful, reliable, and complete.

nba_filtered <- nba.ds |> filter(tm != "TOT")
# The code below filters out any "NA's" in the dataset, creates net_rt and mpg
nba_filterednoNAs <- nba_filtered |>
  filter(!is.na(age), !is.na(pf), !is.na(ortg), !is.na(drtg)) |>
  mutate(net_rt = ortg - drtg) |>
  mutate(mpg = mp / g) |>
  filter(mpg > 5)

Exploring Age, Personal Fouls and Net Rating

Before determining if there are combinations of age and foul rate that correspond with high net ratings, we will explore the distribution of age, foul rate, and net rating separately. This provides initial background knowledge of the variables before conducting more intricate data analysis.

# The code below produces density plots of age, personal fouls, and net rating.
# 'patchwork' was found using the 'help' function of R and 
# used to consolidate the histograms into one grid.
library(patchwork)

p1 <- nba_filterednoNAs |>
  ggplot(aes(x = age)) + 
  geom_density(linewidth = 1) + 
  labs(title = "Density of Player Age", 
       x = "Player's Age on 2/1/2022",
       y = "Density")

p2 <- nba_filterednoNAs |>
  ggplot(aes(x = pf)) + 
  geom_density(linewidth = 1) + 
  labs(title = "Density of Personal Fouls", 
       x = "Number of Personal Fouls per 100 Team Possessions",
       y = "Density")

p3 <- nba_filterednoNAs |>
  ggplot(aes(x = net_rt)) + 
  geom_density(linewidth = 1) + 
  labs(title = "Density of Net Rating", 
       x = "Player's Net Rating",
       y = "Density")

p1 + p2 + p3

The density curve of player age reveals that the distribution is unimodal and right-skewed, meaning that most NBA players are young, with a peak at around 24 years old. Although ages range from 19 to 41, the dataset is dominated by younger players, perhaps because of the relatively short careers in the NBA. Similarly, the density curve of personal fouls reveals a generally right-skewed, unimodal distribution, peaking at around 4 personal fouls per 100 team possessions. There is an initial dip, suggesting that it is very rare for players to have fewer than 1 personal foul, but the majority of players have fewer than 5 personal fouls. The density curve of net rating is symmetric, with a peak at 0 points, but the range is quite large, from around -120 points to around 75 points. To summarize, many NBA players in the dataset are in their mid-20s, commit fewer than 5 personal fouls per 100 team possessions, and have a net rating close to 0 points.

Relationship between Personal Fouls and Age with Net Rating

Before determining if there are certain combinations of age and foul rate that correspond to high net ratings, it is important to establish whether age and foul rate even have a relationship with net rating, and whether there is an interaction between age and foul rate. To determine this, we first created a scatterplot of personal fouls against age, with each point representing a player colored by their net rating (yellow is low net rating, blue is high net rating). The plot also includes a linear regression line of personal fouls on age.

nba_filterednoNAs |> 
  ggplot(aes(x = age, y = pf, color = net_rt)) +
  geom_point(alpha = 0.5, size = 3) +
  scale_color_gradient2(low = "yellow", mid = "green", high = "blue", midpoint = 0, name = "Net Rating") +
  geom_smooth(method = "lm", se = TRUE) + 
  labs(title = "Net Rating by Age and Fouls",
       x = "Player's Age on 02/01/2022",
       y = "Personal Fouls per 100 Possessions")

From the scatterplot, we can see that there is no clear pattern in net rating, suggesting that net rating does not strongly depend on combinations of age and personal fouls. Instead, many players have a net rating around 0, as seen by the overwhelming number of green points, which corroborates the density curve of net rating above. Additionally, there may be a slightly negative linear relationship between age and personal fouls, as seen from the slightly negative slope of the regression line. However, the variability increases with player age, so the estimated foul rates for older players are more uncertain, perhaps because there are fewer observations.

To complement this graphic, we fit a linear regression of net rating on age, personal fouls, and their interaction:

lmmodel <- lm(net_rt ~ age * pf, data = nba_filterednoNAs)
summary(lmmodel)

Call:
lm(formula = net_rt ~ age * pf, data = nba_filterednoNAs)

Residuals:
     Min       1Q   Median       3Q      Max 
-118.595   -8.015    0.893    8.702   74.667 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) -36.38287   11.39747  -3.192  0.00148 **
age           1.11319    0.42932   2.593  0.00973 **
pf            5.80495    2.29723   2.527  0.01175 * 
age:pf       -0.17652    0.08667  -2.037  0.04209 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.14 on 637 degrees of freedom
Multiple R-squared:  0.0275,    Adjusted R-squared:  0.02292 
F-statistic: 6.004 on 3 and 637 DF,  p-value: 0.0004892

Based on the regression output, the intercept of -36.4 is the predicted net rating when age and personal fouls are both 0, which is not meaningful in the context of this report as there are no players who are 0 years old. Because the model includes an interaction, the coefficients on age and personal fouls are conditional slopes rather than overall effects. The age coefficient means that, for a player with 0 personal fouls per 100 team possessions, a one-year increase in age is associated with an increase of about 1.11 points in predicted net rating, on average. Similarly, the personal fouls coefficient means that, at an age of 0, one additional personal foul per 100 team possessions is associated with an increase of about 5.8 points, which again lies outside the range of the data. The estimates -36.4, 1.11, and 5.8 are all statistically significant at a significance level of 0.05, with p-values of 0.00148, 0.00973, and 0.01175, respectively. Importantly, the interaction term between age and personal fouls is also statistically significant, with a p-value of 0.042. Its estimate of -0.17652 means that for every one-year increase in age, the slope of net rating on personal fouls decreases by about 0.177 points. So, as age increases, the association between fouls and net rating becomes less positive. Thus, we have to consider both variables and cannot drop one for the other, because the relationship of each with net rating depends on the level of the other. That said, the model explains less than 3% of the variation in net rating (multiple R-squared of 0.0275), so these relationships are weak.
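With an interaction in the model, the slope of net rating on personal fouls is a function of age (and vice versa), so it can help to evaluate it inside the observed age range. The following is a minimal sketch that copies the coefficients from the regression output above; the helper names and the illustrative ages are our own assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(lmmodel) output above.
b_age    <- 1.11319
b_pf     <- 5.80495
b_age_pf <- -0.17652

# Hypothetical helper: slope of net rating on personal fouls at a given age.
pf_slope_at_age <- function(age) b_pf + b_age_pf * age

# Hypothetical helper: slope of net rating on age at a given foul rate.
age_slope_at_pf <- function(pf) b_age + b_age_pf * pf

# Illustrative ages (assumed), chosen to span the observed range of 19 to 41.
ages <- c(19, 24, 27, 33, 41)
print(data.frame(age = ages, pf_slope = round(pf_slope_at_age(ages), 2)))

# Age at which the fouls slope crosses zero, and the foul rate at which
# the age slope crosses zero.
age_zero <- -b_pf / b_age_pf
pf_zero  <- -b_age / b_age_pf
print(round(c(age_zero = age_zero, pf_zero = pf_zero), 1))
```

Under these estimates, the fouls slope stays positive until roughly age 33 and the age slope stays positive until roughly 6.3 personal fouls per 100 team possessions. Both crossing points carry the uncertainty of the underlying coefficients, so they should be read as rough guides only.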

Optimal Combination of Age and Personal Fouls

Now that we know there is an interaction between age and personal fouls, we include both variables when determining the combination of age and personal fouls that corresponds with a high net rating. To do so, we created a heat map in which each hexagon represents the group of players who fall within a particular range of age and personal fouls. The color is determined by the average net rating of the players within that hexagonal bin. The bluer the hexagon, the higher the average net rating.

nba_filterednoNAs |>
  ggplot(aes(x = age, y = pf, z = net_rt)) + 
  stat_summary_hex(fun = mean, bins = 20, na.rm = TRUE, color = "black") + 
  scale_fill_gradient2(low = "yellow", mid = "green", high = "blue", midpoint = -30, name = "Avg Net Rating") + 
  geom_text(x = 26.5, y = 1, label = "Sweet Spot", color = "black", size = 4) + 
  labs(x = "Player's Age on 2/01/2022",
       y = "Personal fouls per 100 team possessions", 
       title = "Average Net Rating by Age and Personal Fouls") + 
  theme_bw()

From the heat map, we observed that many players have a net rating of around 0, which supports the results from the density curve and scatterplot. We also infer that the mean of personal fouls per 100 team possessions is around 5, since many hexagons are concentrated around the y-axis level of 5 personal fouls. Importantly, the heat map suggests that the optimal combination of personal fouls and age is around 1-2 personal fouls per 100 team possessions and around the age of 26-27. This is because, in this area of the graph, the hexagons are bluer than other hexagons nearby. That is, among players with a low foul rate of 1-2 personal fouls per 100 team possessions, the bluest hexagons are around the age of 26-27. As such, teams should look to acquire players around the age of 26-27 who commit 1-2 personal fouls per 100 team possessions, as they have high average net ratings and low foul rates. This suggests that this combination of age and discipline is a performance “sweet spot”, which is annotated on the heat map. There are presumably a few outliers, even with the preventative measures described above, which appear as very blue hexagons. These indicate very high average net ratings, perhaps because there are few observations in those hexagonal bins.
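Because a bin's color is an average, bins with only a handful of players can look extreme. One way to check is to tabulate the number of players behind each average. The following is a minimal sketch, not the report's method: it swaps the hexagons for rectangular age and foul bands so that plain dplyr suffices, it uses simulated stand-in data (in the report, `nba_filterednoNAs` would take its place), and the threshold of 5 players per cell is an arbitrary assumption.

```r
library(dplyr)

# Simulated stand-in data (hypothetical); replace with nba_filterednoNAs.
set.seed(315)
n_sim <- 600
players <- tibble(
  age    = sample(19:41, n_sim, replace = TRUE),
  pf     = runif(n_sim, min = 0, max = 10),
  net_rt = rnorm(n_sim, mean = 0, sd = 18)
)

# Assumed minimum number of players for a cell average to be trusted.
min_players <- 5

# Average net rating and player count per (age band, foul band) cell.
cell_summary <- players |>
  mutate(
    age_band = cut(age, breaks = seq(18, 42, by = 2)),
    pf_band  = cut(pf, breaks = 0:10, include.lowest = TRUE)
  ) |>
  group_by(age_band, pf_band) |>
  summarise(n_players = n(), avg_net_rt = mean(net_rt), .groups = "drop")

# Cells with enough players, ranked by average net rating.
reliable <- cell_summary |>
  filter(n_players >= min_players) |>
  arrange(desc(avg_net_rt))

# Cells whose averages rest on too few players to interpret.
sparse <- cell_summary |>
  filter(n_players < min_players)

print(head(reliable))
print(nrow(sparse))
```

If the cells with the highest averages in the real data turn out to be in `sparse`, that would support reading the bluest hexagons as small-sample artifacts rather than genuine sweet spots.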

To conclude, although the most common player profile in the dataset is about 24 years old with about 4 personal fouls per 100 team possessions, the heat map suggests that it is preferable to acquire players who are slightly older with fewer personal fouls. Specifically, to answer the research question, the optimal combination of age and personal fouls with regard to net rating is 26-27 years old with 1-2 personal fouls per 100 team possessions. These players are associated with a high average net rating, which is consistent with disciplined, high-performing play. A team that targets players at this age and foul rate may improve its performance, although this analysis shows an association rather than a causal effect.

Conclusion and Future Analysis

In our first research question, we explored indicators of lower (better) defensive ratings and how they vary across player positions. Our analysis offered us a few useful insights for crafting our “dream” team: first, high numbers of defensive rebounds, blocks, and steals were correlated with low defensive ratings; second, the means of these variables differed significantly across player positions, most clearly for defensive rebounds and blocks; and third, Centers tended to have the highest numbers of defensive rebounds and blocks as well as the lowest defensive ratings, suggesting that these player stats are good predictors of defensive ability.

For the second research question, we analyzed how different positions contribute to efficiency and scoring types, and how those scoring types contribute to points per game. We found that balanced scorers average the most points per game because of their higher number of attempts and their versatility. Inside scorers outperform three-point snipers even though both have similar shot attempts, because inside scorers are more efficient and draw more fouls, which yield additional points from free throws.

Finally, the third research question connects the offensive and defensive sides discussed in the previous questions by asking, “Which combination of age and number of personal fouls corresponds to a high net rating?” We found that the optimal combination is an age of 26-27 years with a foul rate of 1-2 personal fouls per 100 team possessions, as these players tend to have high average net ratings, suggesting efficient and disciplined performance on the court.

In the process of answering these three questions, several more emerged. Future work could explore questions such as, “are players with high net ratings more likely to have a better offensive rating or a better defensive rating, and does that change based on their position or age?” This could be answered using multiple linear regression with possible interaction terms, ideally with more data points at each age, given that there were still outlier performers at certain ages as seen in the third question. Another question could be, “is there a relationship between the colleges that draftees attended and their offensive and defensive performance?” The answer could tell teams drafting players out of college which colleges to target if they want players who meet a certain performance threshold, such as high average defensive rebounds and blocks, as suggested by the results from our first research question. This question would require data on the colleges that draftees attended, which is not provided in the initial data set.

Finally, while our analysis went in-depth into the 2021-2022 NBA season, times change and sports can evolve rapidly; thus, we would be curious to see how a similar analysis may pan out using more recent NBA seasons – would the same players land on our “dream” team or would there be new rising stars making themselves known? 

In conclusion, although we explored the three initial research questions, there are still many more questions to be analyzed in the future regarding how to construct a winning “dream” team in the NBA.