Overview
This project will begin on Friday, June 19 and involve a 15-minute presentation one week later on Friday, June 26. There is also a written report, with a maximum of 4 pages, due by 5:00pm EST on Wednesday, July 1. Both the presentation and the written report will follow the same structure: a summary of key takeaways, followed by a walk through the EDA, the modeling process, results and conclusions. Students will work in teams of 2, and data sets will be assigned randomly.
The goal of this project is to practice planning, creating, evaluating and interpreting linear regression models in R. Each team will fit a linear model using one of the continuous variables in their provided data set. Models should be tested with out-of-sample prediction methods, and must be interpretable. Making a model needlessly complicated can lead to overfitting and make results difficult to interpret.
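One common out-of-sample method is k-fold cross-validation. The sketch below is only an illustration: it uses R's built-in mtcars data as a stand-in for your assigned data set, and mpg, wt and hp are placeholders for your own response and predictors. The number of folds (5) and the error metric (RMSE) are assumptions, not requirements of the project.

```r
# Minimal sketch: 5-fold cross-validation for a linear model.
# mtcars, mpg, wt and hp are placeholders for your own data and variables.
set.seed(2020)
k <- 5

# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other folds, predict the held-out fold,
# and record the root mean squared error of those predictions
fold_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - preds)^2))
})

# Average held-out error across the folds
cv_rmse <- mean(fold_rmse)
cv_rmse
```

Comparing cv_rmse across a few candidate models is one way to check whether an extra predictor actually improves prediction or only adds complexity.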
The first slide of the presentation, and first page of the written report, will be for key takeaways. For the presentation this can be in the form of a single graph, bullet points or a statement. For the written report it should be a combination of the three. This is your work in elevator pitch form, meant to grab the audience’s attention and give them a reason to be interested. In the working world you may have only a short time to pitch results to superiors, and will have to start with the conclusions before explaining your process.
The rest of the presentation, and the second page and beyond of the written report, will follow a more traditional data analysis report format. The format will be as follows:
- Introduction of the data set & sport
- Exploratory Data Analysis conducted
- Modeling analysis & model validation
  - this step includes variable selection, error rates, and checking diagnostic plots
- Model Results
  - this step should include interpretation of the coefficients in the model
- Conclusion
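For the modeling and results steps, a minimal sketch of checking diagnostic plots and pulling out coefficients might look like the following. It again uses the built-in mtcars data as a placeholder, so the variables and the model formula are assumptions for illustration only.

```r
# Minimal sketch: coefficient table and diagnostic plots for a fitted lm.
# mtcars, mpg, wt and hp are placeholders for your own data and variables.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimates, standard errors, t statistics and p-values
coef_table <- summary(fit)$coefficients
coef_table

# 95% confidence intervals for each coefficient
confint(fit)

# The four default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```

Each slope is read as the expected change in the response for a one-unit increase in that predictor, holding the other predictors in the model fixed.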
Deliverables & Timeline
There will be 3 different deliverable deadlines:
Tuesday, June 23 @ 4:00pm EST - Each student will push an R Markdown file with their analysis so far to their GitHub account for review. We will then provide feedback on the code submitted.
Friday, June 26 @ 12:00pm EST - Slides must be completed and ready for presentation. Send your slides to Nick’s email, ncitrone@pittsburghpenguins.com. All code and visualizations must be done in R, but the slides may be created in any program. Presentations will be 15 minutes long with 5 minutes for Q&A.
As a reminder, the presentation should start with a Key Takeaway slide, and then lead into a traditional data analysis report (Introduction, EDA, Modeling, Results, Conclusion).
Wednesday, July 1 @ 5:00pm EST - Each team’s final written report must be emailed to Nick’s email, ncitrone@pittsburghpenguins.com.
As a reminder, the written report will have the ‘model pitch’ on the first page, which is a short executive summary and potentially a key graphic. The second page and beyond of the written file will follow the traditional data analysis report format (Introduction, EDA, Modeling, Results, Conclusion). These last pages should set up and support your key takeaways.
Notes
Although this project will use simple linear regression, feel free to contact any of the instructors if you have questions about potential GLM (Poisson, logistic) models using any of the data sets provided.
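For teams curious about the GLM option, here is a hedged sketch of how Poisson and logistic models are fit with glm() in base R. The data are simulated purely for illustration; x, counts and win are hypothetical names and do not come from any of the project data sets.

```r
# Minimal sketch: Poisson and logistic regression with glm().
# All data here are simulated; x, counts and win are hypothetical names.
set.seed(1)
n <- 200
x <- rnorm(n)

# Simulated count response (e.g. goals) for Poisson regression
counts <- rpois(n, lambda = exp(0.5 + 0.3 * x))

# Simulated binary response (e.g. win/loss) for logistic regression
win <- rbinom(n, size = 1, prob = plogis(-0.2 + 0.8 * x))

pois_fit  <- glm(counts ~ x, family = poisson)
logit_fit <- glm(win ~ x, family = binomial)

# Exponentiated coefficients: multiplicative change in the expected count
# (Poisson) or in the odds of a win (logistic) per one-unit increase in x
exp(coef(pois_fit))
exp(coef(logit_fit))
```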
Data Sets
There are eight different datasets for the regression projects (three of which were generated via the init_regression_project_data.R script):
- nba_team_season_summary.csv - summary of regular season performance for each NBA team since 2003, courtesy of NBA stats via the nbastatR package.
- tennis_2013_2017_GS.csv - tennis grand slam statistics for 3066 ATP and WTA matches between 2013 and 2017. Data from Jeff Sackmann’s tennis data repo, retrieved by Stephanie Kovalchik’s R deuce package, and synthesized in Gallagher, Frisoli, and Luby’s R courtsports package.
- baseball_batting.csv - MLB player season level batting statistics for 1429 player-seasons from 2010 to 2019. Data generated by FanGraphs and accessed courtesy of the baseballr package.
- cfb_2019_games.csv - results from all NCAA D1 College Football games in 2019. Includes team data, final score and an excitement rating for the game. Data accessed via the cfbscrapR package.
- overwatch_odds.csv - Overwatch E-Sports League head to head match results with betting odds data. Data from Kaggle’s E-Sports Data Sets.
- womens_ncaa_soccer_20182019.csv - NCAA Women’s D1 soccer offensive and defensive team statistics from the 2018 and 2019 seasons. Data acquired from NCAA.com.
- womens_ncaa_volleyball_20182019.csv - NCAA Women’s D1 volleyball offensive and defensive team statistics from the 2018 and 2019 seasons. Data acquired from NCAA.com.
- womens_ncaa_lacrosse.csv - NCAA Women’s D1 lacrosse offensive and defensive team statistics from the 2018, 2019 and shortened 2020 seasons. Data acquired from NCAA.com.
NBA team season summary data
Each row in the nba_team_season_summary.csv dataset corresponds to a single NBA team in a single regular season dating back to 2003. The column names are self-explanatory, but note that columns ending in _perc are percentage-based statistics.
Tennis grand slams data
Each row in the data corresponds to a grand slam match between two players. A variety of summary statistics of the match are reported along with winner and loser information. Variables include:
tournament - one of the four grand slams: Australian Open, French Open, US Open, and Wimbledon
year
winner_name and loser_name
winner_rank and loser_rank according to ATP or WTA, respectively, at the time of the tournament
Retirement whether the match ended in a retirement (i.e. one player was unable to finish the match). Logical: TRUE means the match ended in retirement
Tour either WTA or ATP
round - R128 Round of 128, R64 Round of 64, R32 Round of 32, R16 Round of 16, QF Quarter Final, SF Semi Final, and F Final
w_* and l_* stand for winner and loser, respectively, where the suffix is one of many summary statistics, including:
ave_serve_speed in mph
n_aces number of aces
n_winners number of winners including aces
n_netpt_w number of net points won
n_netpt number of net points played
n_bp_w number of break points won (to break the opponent)
n_bp number of break points (to break the opponent)
n_ue number of unforced errors
n_sv number of serves
n_sv_w number of service points won
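Several of these counts come in pairs (won and total), so rates can be derived from them. The sketch below assumes the winner columns are named by joining the w_ prefix to the suffixes above (for example w_n_bp_w and w_n_bp); check the actual column names in your file. The toy values and the derived column names w_bp_pct and w_netpt_pct are made up for illustration.

```r
# Minimal sketch: deriving rate statistics from paired count columns.
# Column names assume the w_ prefix plus the suffixes listed above;
# the values are toy data, not taken from tennis_2013_2017_GS.csv.
tennis <- data.frame(
  w_n_bp_w    = c(5, 3, 0),
  w_n_bp      = c(10, 4, 0),
  w_n_netpt_w = c(12, 9, 20),
  w_n_netpt   = c(16, 18, 25)
)

# Break point conversion rate; NA when the winner had no break points
tennis$w_bp_pct <- ifelse(tennis$w_n_bp > 0,
                          tennis$w_n_bp_w / tennis$w_n_bp,
                          NA_real_)

# Share of net points won by the winner
tennis$w_netpt_pct <- tennis$w_n_netpt_w / tennis$w_n_netpt

tennis
```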
MLB Batting Statistics 2010-2019
Each row in the baseball_batting.csv data corresponds to the batting statistics for a single player in a single season between 2010 and 2019. The first few variables, as well as the singles, doubles and triples, are self-explanatory. The other baseball variables are defined as follows:
G games played
AB at bats: Plate appearances, not including bases on balls, being hit by pitch, sacrifices, interference, or obstruction.
PA plate appearances
H hits
HR home runs
R runs scored; the number of times a player crosses home plate
RBI runs batted in: the number of runners who score due to a batter’s action
BB walks (bases on balls)
IBB intentional base on balls, times walked intentionally by pitcher
HBP hit by pitch: walked as a result of being hit by a pitch
SF sacrifice fly: fly balls hit to the outfield which although caught for an out, allow a baserunner to advance
SH sacrifice hit: number of sacrifice bunts which allow runners to advance on the basepaths
GDP ground into double-play: number of ground balls that became double plays
SB stolen bases
CS number of times caught stealing
AVG batting average
Pitches number of pitches faced
Balls number of balls faced
Strikes number of strikes faced
SO strike outs
BB_K walks / strike outs. Walk to strike out ratio
OBP on base percentage
SLG slugging average: total bases achieved on hits / at bats
OPS on-base plus slugging: on-base percentage plus slugging average
ISO isolated power: a hitter’s ability to hit for extra bases, calculated by subtracting batting average from slugging percentage
wOBA weighted on base average
WAR wins above replacement: a non-standard formula to calculate the number of wins a player contributes to his team over a “replacement-level player”
WPA_plus win probability added, positive total
WPA_minus win probability added, negative total
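Many of these rate statistics are simple functions of the counting statistics, which matters for modeling: using a derived statistic and its components as predictors together can make coefficients hard to interpret. The worked example below uses one made-up player-season. The column names singles, doubles and triples are hypothetical, and the on-base percentage formula is the standard one, which is not spelled out in this document.

```r
# Minimal sketch: recomputing rate statistics from counting statistics.
# One made-up player-season; singles, doubles and triples are
# hypothetical column names.
batting <- data.frame(
  AB = 500, H = 150, singles = 95, doubles = 30, triples = 5, HR = 20,
  BB = 60, HBP = 5, SF = 5, SO = 100
)

# Total bases: 1 per single, 2 per double, 3 per triple, 4 per home run
total_bases <- with(batting, singles + 2 * doubles + 3 * triples + 4 * HR)

avg  <- batting$H / batting$AB                                 # AVG
obp  <- with(batting, (H + BB + HBP) / (AB + BB + HBP + SF))   # OBP (standard formula)
slg  <- total_bases / batting$AB                               # SLG
ops  <- obp + slg                                              # OPS
iso  <- slg - avg                                              # ISO
bb_k <- batting$BB / batting$SO                                # BB_K

round(c(AVG = avg, OBP = obp, SLG = slg, OPS = ops, ISO = iso, BB_K = bb_k), 3)
```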
Overwatch League Results and Odds
Each row in the overwatch_odds.csv data set contains data on a single Overwatch League match played between 2018 and 2020. Data includes the two teams, stage, winner, as well as information on the two teams’ success in the season thus far and in their history up until that point. Columns include:
id game id
corona_virus_isolation was the game played under corona virus isolation measures?
t1_wins_season t2_wins_season how many games team 1 has won in the season prior to the game (t2 for team 2)
t1_losses_season t2_losses_season how many games team 1 has lost in the season prior to the game (t2 for team 2)
t1_win_percent team 1 (t2 for team 2) win percentage in the season, in the last X games or all-time depending on variable name
t1_odds betting odds for team 1 to win the game.
Positive figures: The odds state the winnings on a 100 dollar bet (e.g. American odds of 110 would win 110 on a 100 dollar bet).
Negative figures: The odds state how much must be bet to win 100 in profit (e.g. American odds of -90 would win 100 on a 90 dollar bet).
t2_odds betting odds for team 2 to win the game
t1_probability the implied win probability for team 1 given the betting odds
t2_probability the implied win probability for team 2 given the betting odds
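The usual conversion from American odds to an implied win probability is stake divided by total return. The helper below, american_to_prob, is a hypothetical function written for illustration. Whether t1_probability and t2_probability were computed exactly this way, or were adjusted to remove the bookmaker's margin, is not stated here, so compare against the data before relying on it.

```r
# Minimal sketch: implied win probability from American odds.
# american_to_prob is a hypothetical helper, not part of the data set.
american_to_prob <- function(odds) {
  ifelse(odds > 0,
         100 / (odds + 100),     # positive odds: risk 100 to win `odds`
         -odds / (-odds + 100))  # negative odds: risk |odds| to win 100
}

# The two examples from the definitions above
american_to_prob(c(110, -90))
```

Because bookmakers build in a margin, the implied probabilities for the two teams in a match will often sum to more than 1.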
Women’s NCAA D1 Soccer 2018 & 2019 Team Statistics
Each row in the womens_ncaa_soccer_20182019.csv data set refers to the statistics for a single school in a particular season. There are 668 team-season combinations spanning 2018 and 2019. Variables include:
assists the total number of assists earned by players on the team
team_games games played
assists_gp total assists earned per game played
corners corner kicks taken. This variable is unavailable for 2018
corners_gp corner kicks taken per game played. This variable is unavailable for 2018
fouls total fouls called on the team.
fouls_gp fouls called on the team per game played
ga goals against
team_min total minutes played by the team, including stoppage time
gaa goals against per game played
ps penalty kicks scored on
psatt penalty kicks attempted
pk_pct percentage of penalty kicks completed
points total points (goals + assists) accumulated for all players on the team
points_gp points accumulated by players on the team per game played
saves saves made by team goalkeepers
save_pct percentage of shots faced that goalkeepers saved
saves_gp number of saves made per game played
goals total goals scored by team
gpg the number of goals scored by the team per game played
sog total shots on goal
shatt total shot attempts
sog_pct percentage of shot attempts that were on goal
won games won
lost games lost
tied games tied
win_pct winning percentage
sog_gp number of shots on goal per game played
season the season the data refers to
Women’s NCAA D1 Volleyball 2018 & 2019 Team Statistics
Each row in the womens_ncaa_volleyball_20182019.csv data set refers to the statistics for a single school in a particular season. There are 666 team-season combinations spanning 2018 and 2019. Variables include:
s number of sets played
aces aces hit. An ace is a serve which lands in the opponent’s court without being touched, or is touched, but unable to be kept in play by one or more receiving team players
aces_per_set aces earned per set played
assists total team assists. Assists are awarded to a player who passes the ball to a teammate who attacks the ball for a kill. Can be awarded off a dig (first contact), provided the attack comes on the second contact
assists_per_set assists earned per set played
block_solos total team solo blocks. A player blocks the ball into the opponent’s court leading to a point or side out
block_assists total team assisted blocks
blocks_per_set total team blocks per set
digs total team digs. A dig occurs when a player passes the ball which has been attacked by the opposition. Digs are only given when players receive an attacked ball and it is kept in play
digs_per_set team digs per set played
kills team kills. An attack by a player that is not returnable by the receiving player on the opposing team and leads directly to a point or loss of rally
errors total team serve errors
total_attacks total attack attempts. An attack is any overhead contact of the ball designed to score
hit_pct Hitting percentage is calculated by totaling kills, subtracting the hitting errors, then dividing that number by the total number of attack attempts.
kills_per_set kills earned per set played
w team wins
l team losses
win_pct team winning percentage
season season
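The hitting percentage definition above can be written out as a short calculation. Note that the errors column is described above as serve errors, so this sketch uses a hypothetical attack_errors column for hitting errors; it may not correspond to any column in the file. The values are toy data.

```r
# Minimal sketch: hitting percentage = (kills - hitting errors) / attack attempts.
# attack_errors is a hypothetical column; values are toy data.
volleyball <- data.frame(
  kills         = c(1500, 1200),
  attack_errors = c(500, 600),
  total_attacks = c(4000, 3600)
)

volleyball$hit_pct_check <- with(volleyball,
                                 (kills - attack_errors) / total_attacks)
volleyball
```

Since hit_pct is built directly from kills and total_attacks, including all three as predictors in one model is likely to introduce strong collinearity.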
Women’s NCAA D1 Lacrosse 2018, 2019 & 2020 Team Statistics
Each row in the womens_ncaa_lacrosse.csv data set refers to the D1 Women’s lacrosse statistics for a single school in a particular season. There are 348 team-season combinations spanning 2018, 2019 and 2020. Due to the coronavirus pandemic, no team played more than 10 games in 2020. Note that all _gp variables are the per game played versions of the variable they name (e.g. assists_gp is total team assists per game played). Other variable definitions are:
assists total team assists. The player who passes the ball to the player who scores a goal is credited with an assist
caused_tos total turnovers caused by the team. Also referred to as ‘takeaways’
draw_controls total team draw controls. A draw control occurs when a player successfully gains control of the ball after a draw.
fouls total team fouls
clears total team clears. A clear occurs when a team passes the offensive restraining line and is clearly able to get an offensive attempt.
clr_att total team attempted clears
clr_pct percent of team clear attempts that were successful
opp_dc opponents draw control total
drawc_control_pct percent of total draws that the team controlled
freepos_goals total free position goals. A free-position shot in women’s lacrosse is similar to a foul shot in basketball, awarded to an offensive player when a defender commits a major foul inside the 8-meter arc
freepos_shots total free position shots taken
free_position_pct percent of free position shots which resulted in goals
goals total team goals scored
points total points earned by all players on the team (goals + assists)
team_min total team minutes played
goals_allowed total goals allowed
saves total saves from all team goalkeepers
sv_pct team percentage of shots allowed which were saved, and not goals against
ga_gp goals allowed per game played
margin difference in goals scored - goals allowed per game played
gf_gp goals scored per game played
sog total shots on goal
turnovers total team turnovers committed
won games won
lost games lost
win_pct team winning percentage
yellow_cards total yellow cards earned by all players on the team
season season
