36-460: World Cup Fan Base Impacts
Measuring the Impact of Fan Attendance and Travel Distance on Goals Scored
Introduction
Problem
The World Cup is a global soccer (or football) event held every 4 years where the best teams across the world travel to a host country to play in a tournament. Like many other worldwide sporting events, teams will travel far and wide to reach the host city, and we wanted to understand what the relationship was between distance traveled and a team’s performance at the World Cup.
Why is this Interesting?
There were a few reasons we found this research question to be interesting. Firstly, traveling can take a toll on the body and may result in more time needed to acclimate to the host nation. Factors such as jetlag, fatigue, and even climate can affect a team’s performance, and we wanted to know whether that would be visible in the data. Additionally, we know that fans can make a great impact on the environment of a match and the momentum on the field. We wanted to understand if greater distances meant that fewer fans would travel to the host nation, potentially reducing the “home field advantage” that a team may gain.
Summary of Results
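Our results showed that distance from the host city has a small but meaningful association with goals scored: teams tend to score less the farther they travel and more the farther their opponent travels. We reached these results by building four models, two for predicting home team goals and two for predicting away team goals. Adding random effects of distance by team, along with a random intercept for stage, slightly reduced predictive performance on held-out matches.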
Data
We used a Kaggle dataset that lists all World Cup matches and their results. The data states where games were played, which teams were playing, and the goals scored by each team. The data labels the two teams as “Home” and “Away,” which is a bit of a misnomer. World Cup matches are held in one nation/geographical area, and these labels are randomly assigned to teams by FIFA. There did not seem to be any pattern to these labels, although we note that teams labelled “Home” overwhelmingly performed better than their “Away” opponents.
We had to pre-process the variable representing the tournament stage. Each group had its own label, and for some of the earliest tournaments, the Group stage was called either the “First round” or “Preliminary stage.” Since these were all equivalent, we compressed all these rows under one “Group stage” label. The same can be said for the Third Place Match, where we grouped multiple names under the same label.
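The relabeling described above can be sketched in base R as follows. This is only an illustration: the raw labels in `raw_stage` and the helper `collapse_stage` are hypothetical, and the exact strings in the Kaggle data may differ.

```r
# Hypothetical raw stage labels; the dataset's exact strings may differ.
raw_stage <- c("Group 1", "Group A", "First round", "Preliminary stage",
               "Quarter-finals", "Third place", "Match for third place", "Final")

collapse_stage <- function(stage) {
  out <- stage
  # Per-group labels and early-tournament names all become "Group stage"
  out[grepl("^Group |^First round$|^Preliminary", stage)] <- "Group stage"
  # Alternative names for the third-place match share one label
  out[grepl("third place", stage, ignore.case = TRUE)] <- "Third place"
  out
}

stage_clean <- collapse_stage(raw_stage)
table(stage_clean)
```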
To get distances, we started with the geocode function from the tidygeocoder package, which allowed us to encode the latitude and longitude of all host cities/stadiums as well as approximate latitudes and longitudes for competing countries. Countries are large, so their coordinates are approximate; from our research on the geocode function, it returns the centroid of the requested country. Some countries were not geocoded properly, either because they no longer exist (e.g., Yugoslavia) or because of conflicting naming (e.g., Korea Republic for South Korea). As a result, we manually entered the coordinates for some cities and countries according to the geographic centroid rule. Using this location data, we used the distm function from the geosphere package to compute the Haversine (great-circle) distance between the host city and both countries playing.
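For readers unfamiliar with the Haversine formula, here is a minimal base-R sketch of the calculation. It is not the code used in the analysis, which relied on geosphere. The function name `haversine_km`, the mean Earth radius of 6371 km, and the choice of kilometers as the output unit are all assumptions made for illustration.

```r
# Great-circle distance between two points given in decimal degrees.
# radius_km = 6371 is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# A quarter of the equator: from (0, 0) to (0, 90 degrees east)
quarter_equator <- haversine_km(0, 0, 0, 90)
```

The function is vectorized, so it can be applied to whole columns of host-city and country-centroid coordinates at once.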
When analyzing the variable representing attendance at World Cup matches, we found that the data was unimodal but slightly right-skewed. To account for this, we first tried a log transformation, but this resulted in a left-skewed distribution. We instead used a square root transformation, which yielded a more symmetric, unimodal distribution.
The distribution of distances was interesting because it was bimodal: countries were either especially close to or especially far from the host nation. We assumed that this was a result of oceans. Teams would travel either shorter distances within the same continent or much farther across the globe to reach the destination.
We can see that distance is also a right-skewed variable, so we applied the same transformation as for Attendance. A log transformation again resulted in left-skewed distributions, and so we used a square root. The plots show the transformation on home distances, and the same pattern/shape holds for away distances.
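The comparison between transformations can be made numerically with sample skewness. The sketch below uses simulated gamma data as a stand-in for attendance or distance (an assumption, not the actual data) and a hand-written `skewness` helper, to show how a log transformation can overcorrect a mild right skew while a square root lands closer to symmetric.

```r
set.seed(1)
# Simulated right-skewed data standing in for attendance or distance
x <- rgamma(5000, shape = 4, rate = 1)

# Sample skewness: mean of the cubed standardized values
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skews <- c(raw  = skewness(x),
           log  = skewness(log(x)),
           sqrt = skewness(sqrt(x)))
round(skews, 2)
```

A positive value indicates right skew and a negative value indicates left skew, so the transformation with the value closest to zero is the most symmetric of the three.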
Methods
Statistical Models Description
We will use four Bayesian hierarchical Poisson regression models for this project:
1. home_full_model: Predicting home goals from the distance from the home team’s city to the host city, distance from the away team’s city to the host city, attendance, an interaction term between attendance and stage, random attendance slope by home team, random home distance slope and intercept by home team, random away distance slope and intercept by away team, and a random intercept of stage.
2. away_full_model: The same as 1, but for predicting away goals.
3. home_base_model: Predicting home goals from the distance from the home team’s city to the host city, distance from the away team’s city to the host city, attendance, and an interaction term between attendance and stage (removing random slopes and intercepts).
4. away_base_model: The same as 3, but for predicting away goals.
\(\textit{Assumptions}\):
- Each outcome variable (home_goals or away_goals) follows a Poisson distribution
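One way to write the full model for home goals in equation form is sketched below. The symbols are our own notation, introduced for illustration only: \(h[i]\), \(a[i]\), and \(s[i]\) denote the home team, away team, and stage of match \(i\); \(\gamma\) and \(\delta\) are the population-level stage effects and stage-by-attendance interactions; and \(u\), \(v\), and \(w\) are the group-level (random) effects, assumed to be normally distributed.

\[
\begin{aligned}
y_i &\sim \text{Poisson}(\lambda_i), \\
\log \lambda_i &= \beta_0 + \beta_1\,\text{home.dist}_i + \beta_2\,\text{away.dist}_i + \beta_3\,\text{Attendance}_i + \gamma_{s[i]} + \delta_{s[i]}\,\text{Attendance}_i \\
&\quad + u_{0,h[i]} + u_{1,h[i]}\,\text{home.dist}_i + v_{0,a[i]} + v_{1,a[i]}\,\text{away.dist}_i + w_{s[i]}, \\
(u_{0,h}, u_{1,h}) &\sim \mathcal{N}(0, \Sigma_{\text{home}}), \quad (v_{0,a}, v_{1,a}) \sim \mathcal{N}(0, \Sigma_{\text{away}}), \quad w_s \sim \mathcal{N}(0, \sigma_{\text{stage}}^2).
\end{aligned}
\]

The base models keep only the first line of the linear predictor, dropping the \(u\), \(v\), and \(w\) terms. The away-goal models have the same structure with away goals as \(y_i\).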
We used the brm function from the brms package to fit the models as shown below:
    library(brms)

    # Full models: population-level effects plus random slopes and intercepts
    home_full_model = brm(home_goals ~ home.dist + away.dist + Attendance +
                            stage * Attendance +
                            (1 + home.dist | home) +
                            (1 + away.dist | away) + (1 | stage),
                          family = poisson,
                          data = train_data, init = 0, cores = 4, seed = 2025)

    away_full_model = brm(away_goals ~ home.dist + away.dist + Attendance +
                            stage * Attendance +
                            (1 + home.dist | home) +
                            (1 + away.dist | away) + (1 | stage),
                          family = poisson,
                          data = train_data, init = 0, cores = 4, seed = 2026)

    # Base models: population-level effects only
    home_base_model = brm(home_goals ~ home.dist + away.dist + Attendance +
                            stage * Attendance,
                          family = poisson,
                          data = train_data, init = 0, cores = 4, seed = 2027)

    away_base_model = brm(away_goals ~ home.dist + away.dist + Attendance +
                            stage * Attendance,
                          family = poisson,
                          data = train_data, init = 0, cores = 4, seed = 2028)
Model Comparison
We will compare our models with root mean-squared error (RMSE) using the posterior distribution for the conditional expectation, which is calculated using the predictive_error function, and then take the difference between the models with and without random slopes and intercepts. We will create an 80-20 training and testing split on the dataset and compute the RMSE per observation from the testing subset.
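The split and the per-observation RMSE can be sketched in base R as follows. The matrix `y_rep` is simulated and merely stands in for posterior draws from a fitted model, laid out with one row per draw and one column per test match (to our understanding, the layout that predictive_error also uses). All object names here are hypothetical.

```r
set.seed(2025)

# 80-20 split over the 836 matches reported in this write-up
n_matches <- 836
train_idx <- sample(n_matches, size = floor(0.8 * n_matches))
test_idx  <- setdiff(seq_len(n_matches), train_idx)

# Toy example: 5 test matches and 200 simulated posterior draws
y_test <- c(0, 1, 2, 1, 3)
y_rep  <- matrix(rpois(200 * length(y_test), lambda = 1.5), nrow = 200)

# Subtract each observed value from its column of draws
err <- sweep(y_rep, 2, y_test)

# RMSE per observation: average the squared errors over draws
rmse_per_obs <- sqrt(colMeans(err^2))
```

Averaging over draws first yields one RMSE value per test match, which is what allows the full and base models to be compared match by match.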
Quantifying Uncertainty
We will quantify the uncertainty about the difference in performance by computing the mean per-observation difference in RMSE, with its standard error, between the models with and without random effects, separately for the home models and the away models. This indicates how uncertain we are about the effect of including the random effect terms when predicting goals.
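A minimal sketch of this calculation is given below, assuming a normal approximation for the interval and taking the difference as full-model RMSE minus base-model RMSE. Both choices, along with the helper name `mean_diff_ci`, are assumptions made for illustration.

```r
# Mean paired difference in per-observation RMSE with a normal-approximation CI.
# Assumed direction: full model minus base model, so positive values mean
# the model with random effects has the larger error.
mean_diff_ci <- function(rmse_full, rmse_base, z = 1.96) {
  d   <- rmse_full - rmse_base
  est <- mean(d)
  se  <- sd(d) / sqrt(length(d))
  c(lower = est - z * se, estimate = est, upper = est + z * se)
}

# Toy inputs, not the project's results
ci <- mean_diff_ci(rmse_full = c(1.10, 1.20, 1.30),
                   rmse_base = c(1.00, 1.00, 1.00))
```

An interval that excludes zero indicates that the two models differ in average per-observation error by more than sampling noise would explain.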
Results
Model Inference
We start by examining the coefficients in the base model, which does not have random effects. The first table shows the coefficients for the model estimating home goals scored.
Coefficient | Estimate | Estimated Error | Lower Bound | Upper Bound |
---|---|---|---|---|
home.dist | -1.3e-04 | 2.8e-05 | -0.000186 | -0.000074 |
away.dist | 7.4e-05 | 2.9e-05 | 0.000018 | 0.000129 |
When modeling goals scored by home teams, greater home distance is associated with lower goal-scoring output, while greater away distance is associated with higher goal-scoring output. This is consistent with travel distance contributing to fatigue and affecting performance overall.
The next table shows the coefficients for the model estimating away goals scored.
Coefficient | Estimate | Estimated Error | Lower Bound | Upper Bound |
---|---|---|---|---|
home.dist | 8.7e-05 | 3.8e-05 | 0.000015 | 0.000162 |
away.dist | -1.2e-04 | 3.9e-05 | -0.000195 | -0.000045 |
The same pattern holds when analyzing the away goal scoring model. Away teams see lower goal scoring when they travel more, and higher goal scoring when their opponent travels more.
All of the coefficients are small but do not have intervals overlapping with zero, implying a small but significant effect.
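Because the models use a Poisson likelihood with a log link, each coefficient acts multiplicatively on the expected number of goals. The sketch below shows the conversion. The helper `rate_ratio` is hypothetical, and the change of 1000 units is purely illustrative, since the practical meaning depends on the unit and scale on which distance entered the model.

```r
# Multiplicative change in expected goals for a change of `delta`
# in a predictor with log-link coefficient `beta`
rate_ratio <- function(beta, delta) exp(beta * delta)

# home.dist coefficient from the home-goals base model,
# evaluated over an illustrative 1000-unit increase in distance
rr_home <- rate_ratio(-1.3e-04, 1000)

# Percent change in expected goals implied by that ratio
pct_change <- 100 * (rr_home - 1)
```

Under these assumptions the ratio is exp(-0.13), or about 0.88, meaning roughly 12% fewer expected home goals per 1000 units of additional home distance.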
These plots show the coefficients of the random effect of distance on each team. We notice that the variable generally follows our hypothesis: greater distances can lead to fewer goals scored. The top left and bottom right plots show the effect of a team’s own distance on its ability to score. With a median that is less than zero, we interpret this to mean that, on average, greater distances will hurt goal scoring. There is a spread that shows how different teams respond, but we note that all coefficients are negative, meaning that distance negatively impacts all teams’ scoring ability. In contrast, the top right and bottom left plots show the effect of the opponent’s distance traveled on one’s goal-scoring ability. Here we see that both medians are above zero, implying that playing against a team that has traveled a lot can positively impact your goal-scoring ability. Again, while there is some spread, all coefficients are above zero.
It is still worth noting that these coefficients are incredibly close to zero, and almost all of their credible intervals overlap with zero. Notably, some countries with large numbers of games played have intervals that do not overlap with zero, suggesting that more data could strengthen the conclusions made. When looking at the effect of home distance on home scoring and away distance on away scoring, powerhouses like France, Germany, Italy, and Uruguay have intervals that do not overlap with zero, implying that there is a significant negative impact that comes with distance traveled. All of the intervals overlapped with zero when looking at opponent distance (i.e., away distance in the home scoring model and home distance in the away scoring model), implying that the effect is much weaker here.
Model Comparison Overview
To compare our models, we calculated the RMSE for each observation in the testing data under each of the four models. We then took the per-observation difference in RMSE between the models with and without random effects, separately for home and away goals, to evaluate how the random effects change performance. We plotted these differences as histograms and also computed the mean difference with its standard error, for the home models and for the away models, to obtain confidence intervals.
Models | Lower Bound | Point Estimate | Upper Bound |
---|---|---|---|
Home Models | 0.0227 | 0.0306 | 0.0384 |
Away Models | 0.0349 | 0.0420 | 0.0491 |
Interpretation of Models
The histograms of the differences in RMSE and the 95% confidence intervals show a statistically significant difference between the models without the random effects terms and the models with these terms.
The confidence interval for the difference in RMSE for the home models is [0.023, 0.038]. The interval lies entirely above zero, so the model with random effects has the higher error: adding the random effects terms for distance given home and away teams, as well as stage, decreases the model’s performance in predicting home goals.
The confidence interval for the difference in RMSE for the away models is [0.035, 0.049]. This interval also lies entirely above zero, so adding the random effects terms for distance given home and away teams, as well as stage, decreases the model’s performance in predicting away goals.
Using this, we can conclude that for these Bayesian hierarchical Poisson regression models, adding random effects of distance from the host city for the home and away teams, together with a random intercept for the tournament stage, reduces predictive performance on the testing data.
Discussion
Conclusions
Our models show that distance can have a meaningful effect on a team’s goal-scoring performance in the World Cup. In both the base models and the mixed effects models, we can see that distance has a negative effect on a team’s scoring ability. The same pattern holds when looking at their opponent, where goal scoring increases if their opponent travels more.
When comparing our simple Poisson regression to a multilevel model, we see that adding random effects to our model structure decreases its predictive ability. This suggests some overfitting that leads to less generalizable results, likely due to the greater number of parameters estimated. The inference gained through the mixed effects models was still meaningful, as we saw how distance affected the performance of every team.
Limitations
There are a number of limitations to these models that arise from the data, as well as some unknown factors.
First, the data only contains 836 observations, which may result in larger standard errors given the large number of unique values for many variables.
Second, some other variables could have a confounding effect on the results. While we tried to account for this by including an interaction term between stage and Attendance, there may be other factors at play, like the longevity of the team and the home country’s sports culture, that are not present in the data or are nearly impossible to measure.
Future Work
The most basic way to improve the models would be through more data. Our data only covers the World Cup games played to date (836 observations). Including World Cup qualifiers could give us more data to work with.
The most important piece of future work would be implementing some notion of team strength. Our data was fairly thin when it came to team-level details, and only gave a high-level overview of the result of every match. Implementing some idea of team strength could prove to be more informative for inference, to truly understand what portion of the observed result can be attributed to distance, as opposed to just team capabilities. This could be implemented in a few ways. Firstly, we could try to build our own Elo rating system, starting with the first World Cup and moving up. This could give inaccurate results, though, as the 4-year gap between tournaments could lead to massive changes between teams that may not be captured in our potential adjustments. We could also try to gather historical rankings data, as FIFA started ranking the world’s teams in 1992. This obviously does not capture the full range of data, but it could improve assessments in later years.
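As a rough illustration of the first option, a standard Elo update can be written in a few lines of R. This is a sketch of the general technique and not something built for this project. The function names, the starting rating of 1500, and the K-factor of 20 are all assumptions.

```r
# Expected score of team A against team B under the standard Elo formula
elo_expected <- function(r_a, r_b) 1 / (1 + 10^((r_b - r_a) / 400))

# Update both ratings after one match.
# score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
elo_update <- function(r_a, r_b, score_a, k = 20) {
  e_a <- elo_expected(r_a, r_b)
  change <- k * (score_a - e_a)
  c(a = r_a + change, b = r_b - change)
}

# Two evenly rated teams; team A wins
new_ratings <- elo_update(1500, 1500, score_a = 1)
```

Ratings computed this way could enter the models as an additional predictor, such as the pre-match rating difference, though the four-year gap between tournaments noted above would still limit how well they track team strength.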
Next, we could use more nuanced measures of distance than the distance between the country and the host. The main change would be to take distances between matches instead of the distance from the team’s nation to the host. A team will make a long trip to the host nation, but once they are there, their travel greatly diminishes. Tracking the distance between games instead of from their point of origin would likely give better indications of the effect of travel. This could also bring its own disadvantages, as the distance variable would become incredibly right-skewed. The vast majority of distances would be close to zero, while the initial lengthy leg of travel would have much less coverage.
Finally, another layer of distance traveled that would require more effort could be done at the player level. While it is true that teams will practice and travel together prior to big events like the World Cup, individual players spend the majority of their time away from the national team. Many top players will play their year-long football away from their home country. An added level of analysis could be assessing individual player fatigue and seeing whether that can capture more variance.