I. Introduction

One of the best-known frameworks for constructing all-in-one player performance metrics is the plus-minus model, which in its most rudimentary form has been applied in hockey since the 1950s. The plus-minus model considers the number of goals scored minus the number of goals conceded while a given player is in the game. A major problem with this approach is that it does not control for the impact of teammates or opponents: every player on the pitch contributes, directly or indirectly, to the team’s overall performance. Several academic studies have used linear regression as an adjusted plus-minus (APM) framework that accounts for other players’ influence on an individual’s rating. APM and its variations have most commonly been applied in basketball and hockey, where they have achieved substantial improvements; a widely known example is ESPN’s real plus-minus (RPM).

Soccer has several inherent disadvantages when it comes to APM, especially compared with basketball or hockey. It is a low-scoring game with few substitutions, which means a traditional APM for the sport suffers from collinearity and an infrequent response variable. The collinearity comes from the low number of substitutions: some players share the pitch in almost every segment, which makes their individual contributions indistinguishable. Of these three sports, basketball is the best suited to APM, and although hockey is also low-scoring, it has an extremely high number of substitutions every game. Several scholars have tried to address this challenge, notably a paper from the Department of Statistics at Carnegie Mellon University, which introduces the use of video game ratings from FIFA as a prior in the APM model.

This paper aims to build on the foundations of calculating individual player ratings using a plus-minus framework. Our procedure incorporates the traditional box-score rating into the APM model, so that the one-number statistic for a soccer player is intended to reflect both the individual’s skill level and their team contribution. Our approach also uses expected goals instead of actual goals, as we believe this better measures a team’s performance within a match. The remainder of this paper is organized as follows. Section 2 describes the datasets that we use. Section 3 details the stages of our method. Sections 4 and 5 summarize the results, discuss the project’s limitations, and propose next steps for further research. The last section acknowledges the people who helped us publish this work.

II. Dataset

1. Prior Stage

In constructing individual ratings, we use two datasets. First, we collected box-score statistics for each player in the 2020-21 and 2021-22 Premier League seasons from FBref.com, the soccer section of the “Sports Reference Website”. This dataset contains all the players from the five major leagues in Europe, with each observation being a player along with over 180 variables describing their information and box-score statistics. The variables considered describe different actions in soccer, such as scoring, creating chances, dribbling, passing, and defensive actions. We include only players who played at least 900 minutes last season, in order to reduce the bias whereby players who usually start on the bench benefit from the team’s results.

The second dataset is the FIFA ratings from 2021, collected from the website “SoFIFA.com”. The website provides many attributes for each player, but we use only the one-number overall rating. We merge the two datasets by player name, which produces several missing values because names are recorded slightly differently in the two sources. We resolve these missing values by manual matching. Figure 1 shows the distribution of EPL FIFA 2022 ratings.

Figure 1 plots the frequency of FIFA ratings among players, with the ratings on the x-axis and the frequency on the y-axis. The histogram is close to a normal distribution, with a large proportion of the observations in the range of 72 to 83.
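The name-based merge described above can be sketched as follows. This is only a minimal illustration, assuming that most mismatches come from accents, punctuation, and letter case; the function names and the toy players and ratings are hypothetical and are not taken from our data. Names that still fail to match after normalization (for example, because of a different word order) are returned separately so that they can be matched manually.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Strip accents, lower-case, drop punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def match_players(box_names, fifa_ratings):
    """Join FIFA ratings onto box-score names via the normalized name.

    Returns (matched, unmatched): a dict of name -> rating, and the list of
    box-score names that still need manual matching.
    """
    lookup = {normalize_name(n): r for n, r in fifa_ratings.items()}
    matched, unmatched = {}, []
    for name in box_names:
        key = normalize_name(name)
        if key in lookup:
            matched[name] = lookup[key]
        else:
            unmatched.append(name)
    return matched, unmatched


# Hypothetical players and ratings, invented for illustration only.
box_names = ["André Müller", "Lee Jun-ho", "Sam Carter"]
fifa_ratings = {"Andre Muller": 78, "Jun Ho Lee": 75, "Sam Carter": 80}

matched, unmatched = match_players(box_names, fifa_ratings)
print(matched)    # names joined automatically
print(unmatched)  # names left for manual matching
```

In this toy example the accented name is matched automatically, while the name recorded in a different word order is left in the unmatched list.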

2. RAPM Stage

In this stage, the goal is to obtain the expected goals for each stint, and we again use two datasets. To create the stint dataset, we collected the match summary and line-up information for every match in the 2021-22 English Premier League season. The resulting dataset for the regularized adjusted plus-minus (RAPM) model has “stints” as rows and an indicator column for each player, along with additional columns that adjust for red cards and game state. The response variable is the expected goal difference (home - away) per 90 minutes. A stint is defined as a period of time within a match in which there are no substitutions and no goals scored. A new stint begins whenever there is a substitution, since the players on the field change, and whenever there is a goal, since the game state changes. With this setup, a single player’s individual impact on their team’s expected goal difference per 90 minutes can be estimated with ridge regression, using the stint lengths as weights. The dataset contains almost 4000 stints across 380 matches, an average of over 10 stints per match. The following plots describe the distribution of the number of stints per match in the 2021-22 EPL season.

As the distributions show, almost 150 matches have 8 or 9 stints, while more than 150 matches have between 10 and 14 stints. There are several potential outliers, which may result from matches with no substitutions or from matches in which both sides score many goals.
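To make the stint-level regression concrete, the sketch below fits a weighted ridge regression on a toy stint matrix using the closed-form solution. It is a simplified illustration under several assumptions that are not part of the description above: players are encoded as +1 when on the field for the home team, -1 for the away team, and 0 otherwise; the red-card and game-state columns and any intercept are omitted; and the penalty `lam`, the player labels, and all numbers are hypothetical.

```python
import numpy as np


def weighted_ridge(X, y, weights, lam):
    """Minimize sum_i w_i * (y_i - x_i @ beta)**2 + lam * ||beta||**2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    XtW = X.T * w  # equivalent to X.T @ diag(w)
    A = XtW @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, XtW @ y)


# Hypothetical players; each row of X is one stint.
# Assumed encoding: +1 = on field for home team, -1 = for away team, 0 = off.
players = ["A", "B", "C", "D"]
X = [
    [1, 1, -1, 0],
    [1, 0, -1, -1],
    [0, 1, -1, -1],
    [1, 1, 0, -1],
]
minutes = [30, 25, 20, 15]        # stint lengths, used as weights
xg_diff = [0.4, -0.1, 0.2, 0.3]   # home minus away xG within each stint

# Response: expected goal difference per 90 minutes.
y = [90 * d / m for d, m in zip(xg_diff, minutes)]

beta = weighted_ridge(X, y, minutes, lam=50.0)
ratings = dict(zip(players, beta))
print(ratings)
```

The ridge penalty is what keeps the estimates stable when players share almost all of their minutes: as `lam` grows, every coefficient shrinks toward zero rather than being split arbitrarily between collinear players.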

The next step in this stage is to obtain shooting information for each match. Once again, we collect this information from the “Sports Reference Website”. Each observation in this data frame represents a shot taken by a player in a given game, along with its expected goal value. We also retain several key fields from this dataset, such as the minute of the shot, the player’s name, and the team’s name.

After collecting these two datasets, we merge them to compute the expected goal difference between the home and away teams during each stint. The join is based on the Match ID, which is the concatenated string of the home team and away team names, and on whether the shot was taken during the stint. The table below is a sample of the resulting dataset for the Manchester City vs Liverpool match from last EPL season.
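The merging logic can be sketched in a few lines. The example below is a minimal illustration rather than our exact pipeline: it assumes that a shot at minute m belongs to the stint (start, end] containing m, that the Match ID uses an underscore separator, and that stoppage time is ignored. The function names, field names, team names, and shot values are all hypothetical.

```python
def build_stints(event_minutes, match_end=90):
    """Cut a match into stints at every substitution or goal minute."""
    cuts = sorted({m for m in event_minutes if 0 < m < match_end})
    bounds = [0] + cuts + [match_end]
    return list(zip(bounds[:-1], bounds[1:]))


def xg_diff_per_stint(match_id, stints, shots):
    """Join shots to stints on Match ID and minute; one output row per stint."""
    rows = []
    for start, end in stints:
        home = away = 0.0
        for shot in shots:
            # Assumed convention: a shot at minute m falls in (start, end].
            in_stint = start < shot["minute"] <= end
            if shot["match_id"] == match_id and in_stint:
                if shot["side"] == "home":
                    home += shot["xg"]
                else:
                    away += shot["xg"]
        diff = home - away
        rows.append({
            "match_id": match_id,
            "start": start,
            "end": end,
            "xg_diff": diff,
            "xg_diff_per90": 90 * diff / (end - start),
        })
    return rows


# Hypothetical match: Match ID is home team + away team (separator assumed).
match_id = "HomeFC" + "_" + "AwayFC"
events = [60, 30, 60]  # a goal at 30' and two substitutions at 60'
shots = [
    {"match_id": match_id, "minute": 12, "side": "home", "xg": 0.10},
    {"match_id": match_id, "minute": 30, "side": "away", "xg": 0.25},
    {"match_id": match_id, "minute": 45, "side": "home", "xg": 0.50},
    {"match_id": "Other_Match", "minute": 50, "side": "home", "xg": 0.90},
    {"match_id": match_id, "minute": 70, "side": "away", "xg": 0.20},
    {"match_id": match_id, "minute": 88, "side": "home", "xg": 0.05},
]

stints = build_stints(events)
rows = xg_diff_per_stint(match_id, stints, shots)
for row in rows:
    print(row)
```

Under the assumed boundary convention, the shot that produces a goal is credited to the stint that the goal ends, not to the stint that follows it.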