The tidyverse

For today’s workshop, we will be working within the tidyverse, which consists of several R packages for data manipulation, exploration, and visualization. They are all based on a common design philosophy, mostly developed by Hadley Wickham (whose name you will encounter a lot as you gain more experience with R). To access all of these packages, you first need to install them (if you have not already) with the following code:
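Installing the entire suite takes a single command:

```r
# Install the tidyverse suite of packages (this only needs to be done once)
install.packages("tidyverse")
```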

The install.packages() function is one way of installing packages in R. You can also click on the Packages tab in RStudio and then click Install to type in the name of the package you want to install.

Now with the tidyverse suite of packages installed, we can load them with the following code:
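Loading works the same way, with library() in place of install.packages():

```r
# Load the tidyverse packages for this R session
library(tidyverse)
```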

When you do that, you’ll see a lot of output to the console, most of which you can safely ignore for now.

Reading in data

Within the tidyverse, the standard way to store and manipulate tabular data is to use what is known as a tbl (pronounced "tibble"), the tidyverse's analogue of a spreadsheet or data.frame. At a high level, a tbl is a two-dimensional table whose columns can be of different data types: the first column might be character (e.g. the names of teams) while the second is numeric (e.g. the number of yards gained).
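As a quick illustration, the tibble() function builds a small tbl by hand (the team names and yard totals below are made up for the example):

```r
library(tibble)

# A small example tbl: columns can hold different data types
example_tbl <- tibble(
  team  = c("CLE", "NE"),  # character column
  yards = c(159, 250)      # numeric column (made-up values)
)
example_tbl
```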

All of the datasets we will be using today are saved on the workshop's website, with the necessary links provided in the code chunks below. These were all generated with the nflscrapR package and are saved as comma-separated value files, which have the extension .csv.

We can use the read_csv() function to read in a csv file from the workshop website containing play-by-play data from the recent Browns vs. Patriots game (the Patriots won 27-13, but more importantly the Browns lost):
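A sketch of what that call looks like; the URL below is a placeholder, since the actual link is the one provided on the workshop website:

```r
library(tidyverse)

# Placeholder URL -- substitute the link from the workshop website
ne_cle_pbp_data <- read_csv("https://example.com/ne_cle_pbp_data.csv")
```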

Chaining commands together with pipes

When dealing with data, we often want to chain multiple commands together. People will often describe the entire process of reading in data, wrangling it with commands like group_by and summarize, then creating visuals and models, as the data analysis pipeline. The pipe operator %>% is a convenient tool in the tidyverse that allows us to create a sequence of code to analyze data, such as the following code to group_by the team and count their number of plays:
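The pipeline just described looks like this as code (assuming the play-by-play data has already been read into ne_cle_pbp_data):

```r
library(tidyverse)

# Count the number of plays for each team and play type
ne_cle_pbp_data %>%
  group_by(posteam, play_type) %>%
  count()
```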

Let’s break down what’s happening here. First, R “pipes” the tbl ne_cle_pbp_data into group_by to tell R to perform operations at the posteam and play_type level. Then it pipes the result of this group_by into count to simply count the number of rows corresponding to each combination of posteam and play_type.

The sequence of analysis flows naturally from top to bottom and puts the emphasis on the actions carried out by the analyst (the functions group_by and count) and on the final output, rather than on a series of temporary tbls that may not be of much interest.

We will be using the pipe operator %>% for the remainder of today, and you will see how convenient it is when making visualizations.

Filtering data

We just want to focus on run and pass plays. The filter() function is used to pull out subsets of observations that satisfy some logical condition, like posteam == "NE". To make such comparisons in R, we have the following operators at our disposal:
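In brief, with example values so each line can be run directly:

```r
# Comparison operators return logical (TRUE/FALSE) values
5 == 5                    # equal to:              TRUE
5 != 3                    # not equal to:          TRUE
2 <  3                    # less than:             TRUE
2 <= 2                    # less than or equal:    TRUE
3 >  2                    # greater than:          TRUE
3 >= 4                    # greater than or equal: FALSE
"NE" %in% c("NE", "CLE")  # membership:            TRUE

# Logical operators combine or negate conditions
!TRUE                     # negation: FALSE
TRUE & FALSE              # and:      FALSE
TRUE | FALSE              # or:       TRUE
```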

The code below filters to only look at pass or run plays, then groups by the team, play type, and down:
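A sketch of that chain, assuming play_type codes plays as "pass" and "run" (the nflscrapR convention):

```r
library(tidyverse)

# Keep only pass and run plays, then group by team, play type, and down
run_pass_plays <- ne_cle_pbp_data %>%
  filter(play_type %in% c("pass", "run")) %>%
  group_by(posteam, play_type, down)
```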

Introduction to ggplot2

Enough with printing tables! We will use visualizations to answer some questions about the data. Specifically, we will be using the popular ggplot2 package (again created by Hadley Wickham, and part of the tidyverse) for all of our data visualizations. The gg stands for the grammar of graphics, an intuitive framework for data visualization. Given a dataset, such as ne_cle_pbp_data, we want to map its columns to certain aesthetics of a visualization, such as the x-axis, y-axis, size, color, etc. Then a geometric object, such as a barchart or scatterplot, is used to represent the aesthetics visually. This framework separates the process of visualization into distinct components: data, aesthetic mappings, and geometric objects. These components are then added together (or layered) to produce the final graph. The ggplot2 package is the most popular implementation of the grammar of graphics and is relatively easy to use.

We’ll start with making a barchart of the types of plays by each team:
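A minimal version of that barchart, assuming ne_cle_pbp_data with nflscrapR-style columns; note how the three components (data, aesthetics, geometric object) are layered with +:

```r
library(tidyverse)

# Side-by-side bars: number of pass vs. run plays for each team
ne_cle_pbp_data %>%
  filter(play_type %in% c("pass", "run")) %>%
  ggplot(aes(x = play_type, fill = posteam)) +
  geom_bar(position = "dodge")
```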

What about the performance of these plays?
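One way to look at performance is average yards gained by play type (yards_gained is the column name assumed here from the nflscrapR output):

```r
library(tidyverse)

# Average yards gained per play type for each team
ne_cle_pbp_data %>%
  filter(play_type %in% c("pass", "run")) %>%
  group_by(posteam, play_type) %>%
  summarize(avg_yards = mean(yards_gained, na.rm = TRUE)) %>%
  ggplot(aes(x = play_type, y = avg_yards, fill = posteam)) +
  geom_col(position = "dodge")
```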

What about the distribution of these plays instead? Just one summary point doesn't tell us the full story:
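A violin plot is one way to show the full distribution rather than a single summary value (again assuming the nflscrapR yards_gained column):

```r
library(tidyverse)

# Distribution of yards gained per play, by play type and team
ne_cle_pbp_data %>%
  filter(play_type %in% c("pass", "run")) %>%
  ggplot(aes(x = play_type, y = yards_gained, fill = posteam)) +
  geom_violin()
```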

We can improve upon this plot with another layer, a beeswarm plot, which displays the individual points in our data:
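One way to add that layer is with the ggbeeswarm package (an assumption here; the workshop may use a different helper package):

```r
library(tidyverse)
library(ggbeeswarm)  # provides geom_beeswarm()

# Violin outlines with the individual plays overlaid as a beeswarm
ne_cle_pbp_data %>%
  filter(play_type %in% c("pass", "run")) %>%
  ggplot(aes(x = play_type, y = yards_gained, color = posteam)) +
  geom_violin(fill = NA) +
  geom_beeswarm(dodge.width = 0.9)
```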

Expected points and win probability

All yards are not created equal! We should really be looking at impact in terms of expected points added (EPA) or win probability added (WPA) to get a better understanding of what affected the game.
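Swapping the y aesthetic is all it takes (epa is the column name assumed here from the nflscrapR output; wpa would work the same way):

```r
library(tidyverse)

# Distribution of expected points added (EPA) per play, by play type and team
ne_cle_pbp_data %>%
  filter(play_type %in% c("pass", "run")) %>%
  ggplot(aes(x = play_type, y = epa, fill = posteam)) +
  geom_violin()
```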

Now we see a big difference between the Browns and the Patriots. Which plays were the most costly for the Browns?
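Sorting by EPA surfaces the most damaging plays (the team abbreviation "CLE" and the epa column are nflscrapR conventions assumed here):

```r
library(tidyverse)

# The five most costly Browns plays, ranked by lowest EPA
ne_cle_pbp_data %>%
  filter(posteam == "CLE") %>%
  arrange(epa) %>%
  select(posteam, play_type, epa) %>%
  head(5)
```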

To contrast, what went right for the Patriots?
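The same chain with the sort reversed and the team switched to "NE" (again an assumed nflscrapR abbreviation) picks out the Patriots' best plays:

```r
library(tidyverse)

# The five best Patriots plays, ranked by highest EPA
ne_cle_pbp_data %>%
  filter(posteam == "NE") %>%
  arrange(desc(epa)) %>%
  select(posteam, play_type, epa) %>%
  head(5)
```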

More resources