# Loading libraries and data
library(tidyverse)
library(ggplot2)
library(dplyr)
library(usmap)
library(ggridges)
library(ggfortify)

polls <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/raw_polls.csv")
pollster_ratings <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/pollster-ratings-combined.csv")

polls <- polls |>
  mutate(error = abs(margin_poll - margin_actual))

poll_full <- polls %>%
  left_join(pollster_ratings, by = c("pollster" = "pollster"))
Report of Research on Polling Errors in Political Polls
Intro to Dataset
This report analyzes data from the FiveThirtyEight pollster ratings repository (https://github.com/fivethirtyeight/data/tree/master/pollster-ratings). The data cover polls from 1998 to 2023 along with information about the pollsters that conducted them. Each row of the "raw_polls" dataset records one poll on a particular political race; its variables include the race type, pollster name, election cycle, polled and actual margins, poll and election dates, and the candidates' names and party affiliations. Each row of the "pollster_ratings" dataset describes a single pollster; its variables include the number of its polls analyzed by FiveThirtyEight, its pollster rating, bias score, whether it is inactive, its transparency score, and the percentage of its work considered partisan.
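For orientation, a minimal sketch of how the two tables can be inspected after loading; it assumes only the pollster and year columns that are used later in this report.
# Peek at the structure of both tables (one row per poll vs. one row per pollster)
glimpse(polls)
glimpse(pollster_ratings)

# Basic coverage counts for the poll-level data
polls |>
  summarize(n_polls = n(),
            n_pollsters = n_distinct(pollster),
            first_cycle = min(year),
            last_cycle = max(year))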
Research Questions
- How does partisanship affect pollsters’ behavior and the outcome of the polls, notably the error rate of the poll?
- How do different polling methodologies affect the accuracy of election predictions and what insights can be used to improve our current data collection process?
- Do the type of election and the election year contribute to polling accuracy?
- Does greater pollster transparency lead to more accurate polling results?
Effect of Partisanship on Pollsters and Polls
We would like to explore how partisanship affects pollsters' behavior and the outcomes of their polls, notably poll error.
First, we run a principal component analysis (PCA) to explore how the quantitative variables relate to a pollster's score and quality. The dataset provides a grade for every active pollster, indicating its overall quality.
# Active pollsters
pollster_active <- filter(pollster_ratings, inactive == FALSE)
quant_cols_active <- sapply(pollster_active, is.numeric)
pollster_active_quant <- pollster_active[, quant_cols_active]
pollster_active_quant <- pollster_active_quant %>% select(-numeric_grade, -rank)
# Center and scale variables:
pollster_active_pca <- prcomp(pollster_active_quant,
                              center = TRUE, scale. = TRUE)
autoplot(pollster_active_pca,
         data = pollster_active,
         alpha = 0.25,
         color = "numeric_grade",
         loadings = TRUE, loadings.colour = 'darkblue',
         loadings.label.colour = 'darkblue',
         loadings.label = TRUE, loadings.label.size = 3,
         loadings.label.repel = TRUE) +
  scale_color_gradient(low = "navajowhite2", high = "blue", name = "Pollster Grade") +
  theme_bw() +
  labs(title = "PCA Plot for All Active Pollsters in the Dataset",
       subtitle = "Colored by Pollster Grade")
In the principal components plot above, the blue dots are pollsters with higher pollster grades from FiveThirtyEight, indicating better pollsters overall, while the lighter dots correspond to lower grades and lower quality. There is a visible positive relationship between pollster grade and the first principal component (PC1): grades tend to be higher at higher values of PC1.
The loadings show that average transparency is positively associated with PC1, and therefore with pollster quality. Pointing in the other direction are bias, error, and the percentage of partisan polls, which load negatively on PC1, suggesting that higher partisanship, bias, and error go along with worse pollsters.
However, it is worth noting that the first two principal components combined explain only around 60% of the variance in the active-pollster data, so there is variation that this plot does not capture.
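To make the roughly 60% figure concrete, a short sketch that reads the proportion of variance explained directly off the pollster_active_pca object fitted above:
# Proportion of variance explained by each principal component
summary(pollster_active_pca)

# Equivalent computation from the standard deviations returned by prcomp()
pve <- pollster_active_pca$sdev^2 / sum(pollster_active_pca$sdev^2)
round(cumsum(pve), 3)  # cumulative share; the second entry is PC1 + PC2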
partisan_lm <- lm(numeric_grade ~ percent_partisan_work, data = pollster_active)
partisan_error <- lm(error_ppm ~ percent_partisan_work, data = pollster_active)
summary(partisan_lm)
Call:
lm(formula = numeric_grade ~ percent_partisan_work, data = pollster_active)
Residuals:
Min 1Q Median 3Q Max
-1.30296 -0.39363 -0.04258 0.39704 1.30532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.90296 0.03640 52.278 < 2e-16 ***
percent_partisan_work -0.70828 0.08179 -8.659 3.83e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5278 on 280 degrees of freedom
Multiple R-squared: 0.2112, Adjusted R-squared: 0.2084
F-statistic: 74.98 on 1 and 280 DF, p-value: 3.83e-16
summary(partisan_error)
Call:
lm(formula = error_ppm ~ percent_partisan_work, data = pollster_active)
Residuals:
Min 1Q Median 3Q Max
-1.2499 -0.2499 -0.0331 0.2501 1.8501
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04986 0.03516 1.418 0.157
percent_partisan_work -0.01676 0.07900 -0.212 0.832
Residual standard error: 0.5098 on 280 degrees of freedom
Multiple R-squared: 0.0001608, Adjusted R-squared: -0.00341
F-statistic: 0.04503 on 1 and 280 DF, p-value: 0.8321
The linear models show a statistically significant negative relationship between an active pollster's grade and the percentage of partisan polls it conducts, confirming the pattern from the PCA plot. Interestingly, there is no significant relationship between predictive plus-minus error and the percentage of partisan work. This could indicate that partisanship compromises pollster quality in ways other than raw poll error.
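For a rough sense of effect size, a sketch pulling 95% confidence intervals from the two models fitted above:
# 95% confidence intervals for the slope on percent_partisan_work
confint(partisan_lm)     # grade model: interval lies well below zero
confint(partisan_error)  # error model: interval straddles zero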
We would also like to explore how partisanship affects individual polls.
poll_full <- poll_full |>
  mutate(partisan = if_else(is.na(partisan), "Non-partisan", partisan))

poll_full |>
  group_by(partisan) |>
  summarize(count = n())
# A tibble: 8 × 2
partisan count
<chr> <int>
1 CRV 1
2 DEM 1489
3 DEM,REP 2
4 IND 10
5 LIB 8
6 Non-partisan 17910
7 REP 1043
8 REP,DEM 3
An overview of the partisan breakdown of polls shows that the majority are non-partisan. The dataset defines a partisan poll as any poll affiliated with, or internal to, a party. Among the partisan polls, the overwhelming majority are affiliated with either the Democrats or the Republicans; the rest are affiliated with multiple parties, with independent candidates, or with smaller parties such as the Libertarians. Here we examine only polls that are not affiliated with multiple parties.
poll_partisan <- poll_full |>
  filter(partisan == "REP" | partisan == "DEM" | partisan == "IND" | partisan == "LIB" | partisan == "Non-partisan")

poll_partisan |>
  ggplot(aes(x = error, y = partisan)) +
  geom_boxplot(width = .2, alpha = .5) +
  scale_y_discrete(labels = c(
    "REP" = "Republicans",
    "DEM" = "Democrats",
    "IND" = "Independents",
    "LIB" = "Libertarians",
    "Non-partisan" = "Non-partisan"
  )) +
  theme_bw() +
  geom_vline(xintercept = median(poll_partisan$error), color = "green4", linewidth = 1) +
  labs(title = "Error of Polls by Parties",
       subtitle = "Green line = Overall Median",
       x = "Absolute error = |Poll Margin - Actual Margin|",
       y = "Partisan Polls")
The boxplot above shows that partisan polls have higher absolute error than non-partisan polls and the overall median, regardless of party affiliation. Among the partisan polls, those affiliated with the two major parties (Democrats and Republicans) appear to perform better than Libertarian and Independent polls. It is also worth noting the many right-tail outliers among non-partisan, Democratic, and Republican polls.
poll_non_partisan <- filter(poll_full, partisan == "Non-partisan")
poll_partisan2 <- filter(poll_full, partisan != "Non-partisan")
t.test(poll_non_partisan$error, poll_partisan2$error)
Welch Two Sample t-test
data: poll_non_partisan$error and poll_partisan2$error
t = -7.4912, df = 3149.9, p-value = 8.815e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0804868 -0.6322108
sample estimates:
mean of x mean of y
5.596337 6.452686
A Welch two-sample t-test shows a statistically significant difference between the mean absolute error of partisan and non-partisan polls: on average, partisan polls have higher absolute error than non-partisan polls.
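Because the error distributions are right-skewed with many outliers, a nonparametric check is a reasonable robustness step; a sketch (not part of the original analysis) using a Wilcoxon rank-sum test on the same two groups:
# Rank-based comparison of non-partisan vs. partisan absolute error,
# less sensitive to the extreme values visible in the boxplot
wilcox.test(poll_non_partisan$error, poll_partisan2$error)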
Overall, we conclude that partisanship negatively affects both pollsters and polls.
How do different polling methodologies affect the accuracy of election predictions and what insights can be used to improve our current data collection process?
Before we can explore the relationship between polling methodologies and polling error, we must first standardize the entries in the methodology column of our data. This is so we can avoid overlap due to the many combined survey methods displayed below.
# Displaying Complex Methodologies
method_counts <- table(polls$methodology)
method_counts_df <- as.data.frame(method_counts)
colnames(method_counts_df) <- c("Method", "Count")
method_counts_df <- method_counts_df[order(-method_counts_df$Count), ]
method_counts_df
Method Count
22 Live Phone 7662
4 IVR 3214
40 Online Panel 2745
15 IVR/Online Panel 1107
6 IVR/Live Phone 300
25 Live Phone/Online Panel 201
18 IVR/Text 165
51 Text-to-Web/Online Ad 156
47 Probability Panel 154
44 Online Panel/Text-to-Web 122
13 IVR/Live Phone/Text/Online Panel/Email 118
16 IVR/Online Panel/Text-to-Web 112
33 Live Phone/Text-to-Web 94
39 Online Ad 87
42 Online Panel/Online Ad 80
19 IVR/Text-to-Web 70
49 Text-to-Web 57
48 Text 46
17 IVR/Online Panel/Text-to-Web/Email 44
2 Email 34
14 IVR/Online Ad 34
43 Online Panel/Probability Panel 34
36 Mail 29
37 Mail-to-Web/Mail-to-Phone 27
26 Live Phone/Online Panel/Text 24
5 IVR/Email 22
31 Live Phone/Probability Panel 22
32 Live Phone/Text 17
7 IVR/Live Phone/Online Panel 15
27 Live Phone/Online Panel/Text-to-Web 14
10 IVR/Live Phone/Text 11
21 IVR/Text-to-Web/Online Ad 11
1 App Panel 10
20 IVR/Text-to-Web/Email 10
34 Live Phone/Text-to-Web/Email 10
9 IVR/Live Phone/Online Panel/Text-to-Web/Mail-to-Web 8
23 Live Phone/Email 8
29 Live Phone/Online Panel/Text-to-Web/Mail-to-Web 7
11 IVR/Live Phone/Text-to-Web 6
38 Mixed 6
50 Text-to-Web/Email 6
3 Face-to-Face 3
45 Online Panel/Text-to-Web/Face-to-Face 3
8 IVR/Live Phone/Online Panel/Text-to-Web 2
41 Online Panel/Mail-to-Web 2
12 IVR/Live Phone/Text-to-Web/Email 1
24 Live Phone/Mail-to-Web/Mail-to-Phone 1
28 Live Phone/Online Panel/Text-to-Web/Email 1
30 Live Phone/Online Panel/Text-to-Web/Text 1
35 Live Phone/Text/Online Ad 1
46 Online Panel/Text-to-Web/Mail-to-Web 1
We identified the unique method types by splitting each combined methodology on its "/" separator.
# Locating Method Types
# Find all unique methods
unique_methods <- unique(polls$methodology)
methods_array <- array(unique_methods)

# Split methods by type
split_methods <- strsplit(methods_array, split = "/")
all_method_types <- unlist(split_methods)

method_type_counts <- table(all_method_types)
method_type_counts_df <- as.data.frame(method_type_counts)
colnames(method_type_counts_df) <- c("Method Type", "Count")
method_type_counts_df <- method_type_counts_df[order(-method_type_counts_df$Count), ]
method_type_counts_df
Method Type Count
5 Live Phone 22
14 Text-to-Web 21
11 Online Panel 20
4 IVR 18
2 Email 10
13 Text 8
8 Mail-to-Web 6
10 Online Ad 6
12 Probability Panel 3
3 Face-to-Face 2
7 Mail-to-Phone 2
1 App Panel 1
6 Mail 1
9 Mixed 1
Then we sorted the types of methods into 5 different categories.
# Separating Method Types into Categories
polls$method_simplified <- NA

polls$method_simplified[grepl("Live Phone|IVR", polls$methodology, ignore.case = TRUE)] <- "Phone-Based"
polls$method_simplified[grepl("Text-to-Web|Text", polls$methodology, ignore.case = TRUE)] <- "Text-Based"
polls$method_simplified[grepl("Online Panel|Online Ad|App Panel|Email", polls$methodology, ignore.case = TRUE)] <- "Online-Based"
polls$method_simplified[grepl("Mail-to-Web|Mail-to-Phone|Mail", polls$methodology, ignore.case = TRUE)] <- "Mail-Based"
polls$method_simplified[grepl("Probability Panel|Face-to-Face|Mixed|NA", polls$methodology, ignore.case = TRUE)] <- "Other"
polls$method_simplified[is.na(polls$method_simplified)] <- "Other"

method_category_counts <- table(polls$method_simplified)
method_category_counts_df <- as.data.frame(method_category_counts)
colnames(method_category_counts_df) <- c("Method Category", "Count")
method_category_counts_df
Method Category Count
1 Mail-Based 329
2 Online-Based 4722
3 Other 3773
4 Phone-Based 11176
5 Text-Based 466
Now that we have simplified the methodologies, we can learn how different polling methodologies affect the accuracy of election predictions. We calculated the polling error using margin_poll and margin_actual and used it to examine our five methodology categories.
Error:
polls$polling_error <- abs(polls$margin_poll - polls$margin_actual)

custom_colors <- c(
  "Phone-Based" = "#e94d4d",
  "Text-Based" = "#cc3c3c",
  "Online-Based" = "#87abe7",
  "Mail-Based" = "#6288c8",
  "Other" = "#aaaaaa"
)
ggplot(polls, aes(x = method_simplified, y = polling_error, fill = method_simplified)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "text", aes(label = round(after_stat(y), 2)), color = "white") +
  scale_fill_manual(values = custom_colors) +
  labs(title = "Polling Error by Simplified Methodology",
       x = "Simplified Methodology",
       y = "Polling Error (%)")
The graph above shows boxplots of polling error by methodology, with the mean polling error labeled for each group. The Mail-Based category had the lowest error and the Other category the highest, while Online-Based, Phone-Based, and Text-Based had similar errors. This suggests noticeable differences in polling error between methodologies: mail-based data collection might produce the most accurate predictions, and technology-based data collection appears reliable across different mediums. This highlights an opportunity to prioritize and encourage mail-based data collection when making election predictions.
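To complement the boxplot, a sketch that tabulates the mean and median error per simplified methodology (the object name method_error_summary is ours, not from the original analysis):
# Mean, median, and sample size of polling error by methodology category
method_error_summary <- polls %>%
  group_by(method_simplified) %>%
  summarize(mean_error = mean(polling_error, na.rm = TRUE),
            median_error = median(polling_error, na.rm = TRUE),
            n_polls = n()) %>%
  arrange(mean_error)
method_error_summary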
anova_result <- oneway.test(polling_error ~ method_simplified, data = polls)
anova_result
One-way analysis of means (not assuming equal variances)
data: polling_error and method_simplified
F = 8.3578, num df = 4.0, denom df = 1513.6, p-value = 1.145e-06
We used a one-way ANOVA (Welch's test, not assuming equal variances) to assess whether polling errors differ significantly across the simplified polling methodologies. The extremely small p-value (1.145e-06) indicates that at least one methodology differs in average polling error. This suggests that polling methodology significantly affects the accuracy of election predictions and is an important factor to take into consideration.
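A natural follow-up, not run in the original analysis, is a pairwise comparison to see which categories differ; one sketch uses pairwise t-tests without pooled standard deviations and a Bonferroni adjustment:
# Pairwise comparisons between methodology categories, Bonferroni-adjusted
pairwise.t.test(polls$polling_error, polls$method_simplified,
                p.adjust.method = "bonferroni", pool.sd = FALSE)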
We also wanted to see if polling errors for each methodology differ by state. This can help us assess whether certain methodologies produce larger errors at the state level, not just overall, so that we can find ways to improve our data collection process as much as possible.
We begin by standardizing the entries in the location column of our data by including states only and stripping district numbers (e.g., “-24”) to retain only the state abbreviation.
polls$states_simplified <- gsub("-.*", "", polls$location)

exclude <- c("DC", "US", "PR", "M1", "M2", "N2", "N1", "N3", "VI")
polls_filtered <- polls %>%
  filter(!states_simplified %in% exclude)
print(sort(unique(polls_filtered$states_simplified)))
[1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WV" "WY"
Next, we create datasets for each state by method.
mail_polling_error <- polls %>%
  filter(method_simplified == "Mail-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Mail-Based")

online_polling_error <- polls %>%
  filter(method_simplified == "Online-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Online-Based")

other_polling_error <- polls %>%
  filter(method_simplified == "Other") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Other")

phone_polling_error <- polls %>%
  filter(method_simplified == "Phone-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Phone-Based")

text_polling_error <- polls %>%
  filter(method_simplified == "Text-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Text-Based")
Then we take the mean polling error of each method, as displayed in the boxplots, and use it as the midpoint for that method's heatmap. States with missing (null) data are displayed in white to focus attention on the meaningful variation among available observations.
library(grid)
mail_overall_mean <- mean(mail_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = mail_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = mail_overall_mean, na.value = "white") +
  labs(
    title = "Polling Error by State and Mail-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")
In the US state heatmap above, California and Utah show higher polling errors for the mail-based methodology than the other states. This could be because both are among the eight states that automatically send every registered voter a mail-in ballot. Interestingly, this finding challenges our earlier assessment that mail-based data collection might be more accurate than other methodologies, and it suggests that election predictions should take into account both the data collection methodology and state-specific factors.
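To check which states actually drive this pattern, a sketch ranking states by mean mail-based polling error; mail_state_error is a hypothetical name, and a small n_polls flags a noisy state estimate:
# Mean mail-based polling error by state, highest first
mail_state_error <- mail_polling_error %>%
  group_by(state) %>%
  summarize(mean_error = mean(polling_error, na.rm = TRUE),
            n_polls = n()) %>%
  arrange(desc(mean_error))
head(mail_state_error, 10)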
For the next three US state heatmaps, no states showed notably high polling errors for the Online-Based, Other, or Phone-Based methodologies. This suggests that polling errors were relatively consistent across states for these categories.
online_overall_mean <- mean(online_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = online_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = online_overall_mean, na.value = "white") +
  labs(
    title = "Polling Error by State and Online-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")
other_mean_error <- mean(other_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = other_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = other_mean_error, na.value = "white") +
  labs(
    title = "Polling Error by State and Other Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")
phone_mean_error <- mean(phone_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = phone_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = phone_mean_error, na.value = "white") +
  labs(
    title = "Polling Error by State and Phone-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")
text_mean_error <- mean(text_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = text_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = text_mean_error, na.value = "white") +
  labs(
    title = "Polling Error by State and Text-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")
In the US state heatmap above, South Dakota shows a notably higher polling error for the Text-Based methodology than the other states. This is interesting because it is the only state with such a discrepancy and would benefit from further analysis. It also contrasts with our earlier assessment that technology-based data collection methods produce similar polling errors. This finding highlights that we cannot make broad assumptions, and that state-specific factors are important to consider when analyzing the effect of methodologies on election predictions.
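One explanation worth checking is sample size: if only a handful of text-based polls were fielded in South Dakota, a few bad races could dominate its average. A sketch:
# Number of text-based polls and mean error per state;
# very small counts make a state's mean error unstable
text_polling_error %>%
  group_by(state) %>%
  summarize(n_polls = n(),
            mean_error = mean(polling_error, na.rm = TRUE)) %>%
  arrange(desc(mean_error)) %>%
  head(10)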
Do the type of election and the election year contribute to polling accuracy?
<- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/2023/raw-polls.csv") raw_polls
Rows: 11475 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (19): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
polls_2022 <- raw_polls %>%
  filter(year == 2022) %>%
  filter(!is.na(error), !is.na(type_simple), !is.na(location))
ggplot(polls_2022, aes(x = type_simple, y = error, color = location)) +
geom_jitter(width = 0.3, alpha = 0.6, size = 2) +
labs(
title = "Poll Error by Race Type and State (2022)",
x = "Race Type",
y = "Polling Error (|Predicted Margin - Actual Margin|)",
color = "State"
  ) +
  theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14)
)
library(tidyverse)
# Load the raw poll data
<- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/2023/raw-polls.csv") raw_polls
Rows: 11475 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (19): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean and summarize for a given year
plot_state_level_error <- function(year_selected) {
  raw_polls %>%
    filter(year == year_selected) %>%
    filter(!is.na(error), !is.na(type_simple), !is.na(location)) %>%
    # Extract just the state abbreviation (everything before the "-")
    mutate(state = str_extract(location, "^[A-Z]{2}")) %>%
    # Group by state and race type, then average the error
    group_by(state, type_simple) %>%
    summarize(mean_error = mean(error), .groups = "drop") %>%
    # Plot it
    ggplot(aes(x = type_simple, y = mean_error, color = state)) +
    geom_jitter(width = 0.3, alpha = 0.6, size = 2) +
    labs(
      title = paste("Avg Poll Error by Race Type and State (", year_selected, ")", sep = ""),
      x = "Race Type",
      y = "Mean Polling Error (|Predicted Margin - Actual Margin|)",
      color = "State"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14)
    )
}
# Create one plot per election cycle (2010, 2014, 2018, 2022)
plot_2022_state_level <- plot_state_level_error(2022)
plot_2018_state_level <- plot_state_level_error(2018)
plot_2014_state_level <- plot_state_level_error(2014)
plot_2010_state_level <- plot_state_level_error(2010)
plot_2022_state_level
plot_2018_state_level
plot_2014_state_level
plot_2010_state_level
The plots visualize mean polling error across race types (Governor, House, Senate) for the 2010, 2014, 2018, and 2022 cycles, with each point colored by the state where the polls occurred. We observe that House races tend to have slightly lower polling errors than Governor and Senate races, which may reflect differences in media coverage, sample sizes, or polling methodologies across race types. These plots are informative for our project because they highlight how polling reliability can vary depending on the type of election and geographic region, two factors that could be important when evaluating overall polling accuracy. We focus on the most recent cycles because they are most relevant to understanding patterns that still exist.
<- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/2023/raw-polls.csv") raw_polls
Rows: 11475 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (19): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
polls_clean <- raw_polls %>%
  filter(!is.na(error), !is.na(year))

polls_by_year <- polls_clean %>%
  group_by(year) %>%
  summarize(mean_error = mean(error, na.rm = TRUE))
ggplot(polls_by_year, aes(x = year, y = mean_error)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "darkred", size = 2) +
  labs(
    title = "Mean Polling Error by Year",
    x = "Election Year",
    y = "Mean Polling Error"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )
polls_clean <- polls_clean %>%
  mutate(election_type = case_when(
    year %in% c(2012, 2016, 2020) ~ "Presidential",
    year %in% c(2010, 2014, 2018, 2022) ~ "Midterm",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(election_type))

t_test_result <- t.test(error ~ election_type, data = polls_clean)
print(t_test_result)
Welch Two Sample t-test
data: error by election_type
t = -7.3318, df = 6019.8, p-value = 2.57e-13
alternative hypothesis: true difference in means between group Midterm and group Presidential is not equal to 0
95 percent confidence interval:
-1.1467662 -0.6629009
sample estimates:
mean in group Midterm mean in group Presidential
5.097626 6.002459
Polling errors have fluctuated over time without clear long-term improvement, and statistical testing shows that Presidential election polls are significantly less accurate than Midterm election polls, with a mean error roughly 0.9 points higher.
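As a simple check on the "no clear long-term improvement" claim, a sketch regressing the yearly mean error (from polls_by_year above) on year; a small, non-significant slope would support it. The object name trend_lm is ours:
# Linear trend in mean polling error across election years
trend_lm <- lm(mean_error ~ year, data = polls_by_year)
summary(trend_lm)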
Does greater pollster transparency lead to more accurate polling results?
library(ggplot2)
= read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/pollster-ratings/pollster-ratings-combined.csv")
polls2
ggplot(polls2, aes(x = wtd_avg_transparency, y = POLLSCORE)) +
geom_point(alpha = 0.7, color = "#e94d4d") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(title = "Pollster Transparency vs. PollScore",
x = "Weighted Average Transparency",
y = "POLLScore (lower = better accuracy)") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
This scatterplot shows the relationship between pollster transparency and polling accuracy. Each point represents a different pollster, with the x-axis measuring their weighted average transparency score and the y-axis showing their PollScore, where lower values indicate better accuracy. The black trend line slopes slightly downward, suggesting a weak negative correlation: as transparency increases, polling errors tend to decrease slightly. However, the spread of points around the trend line shows a lot of variability, meaning transparency alone does not perfectly predict polling accuracy.
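To quantify the weak negative association beyond the fitted line, a sketch computing the correlation and the same linear model that geom_smooth draws (column names as used in the plot above; transparency_lm is our name):
# Correlation and linear fit between transparency and PollScore
cor(polls2$wtd_avg_transparency, polls2$POLLSCORE, use = "complete.obs")
transparency_lm <- lm(POLLSCORE ~ wtd_avg_transparency, data = polls2)
summary(transparency_lm)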
library("hexbin")
ggplot(polls2, aes(x = wtd_avg_transparency, y = POLLSCORE)) +
geom_hex(bins = 30) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(title = "Pollster Transparency vs. PollScore (Hexbin Plot)",
x = "Weighted Average Transparency",
y = "POLLScore (lower = better accuracy)") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
This hexbin plot shows the relationship between pollster transparency and polling accuracy. Each hexagon represents an area containing multiple pollsters, with darker hexagons indicating a higher number of pollsters in that region. The x-axis shows the weighted average transparency score, and the y-axis shows the PollScore, where lower values represent better accuracy. The black trend line slopes downward, indicating a slight negative correlation: pollsters with higher transparency scores tend to have slightly better accuracy on average. However, the spread and clustering of hexagons suggest that while transparency helps, it is not the only factor affecting pollster performance.
Conclusion
Our findings reveal that multiple factors influence polling accuracy, with some having a greater impact than others. We found that partisanship negatively affects both pollsters and polls, with partisan pollsters and partisan-affiliated polls demonstrating higher error rates and lower overall quality. We discovered that mail-based methodologies generally performed better than other methodologies, although state-specific factors can complicate this trend. Additionally, we found that transparency can contribute to pollster accuracy but is not influential enough on its own. Lastly, we found that polling accuracy can depend on the type of election, with Presidential election polls being significantly less accurate than Midterm election polls. Overall, polling error in election predictions can be reduced by placing less weight on partisan polls, selecting effective methodologies at the state level, promoting transparency, and considering the specific context and type of each election. By keeping these factors in mind, we can produce more trustworthy election predictions that better inform the public during elections.
Future Work
This report is not an exhaustive analysis of all variables that could affect polls or pollsters. It did not explore the relationships between individual candidates, parties, and poll performance, and more analysis could be done with the pollster-level information.
The dataset itself is limited: it only contains polls between 1998 and 2023. It could answer more questions if it included more recent data, as well as data from before 1998, which could provide insight into eras when the political environment was very different. Exploring the data as a time series could also provide more insight, as sketched below.
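As one concrete direction for that time-series exploration, a sketch (reusing the already-loaded raw_polls data) that tracks mean error by year separately for each race type:
# Mean polling error per year and race type; diverging lines would show
# whether some race types have become harder to poll over time
raw_polls %>%
  filter(!is.na(error), !is.na(type_simple)) %>%
  group_by(year, type_simple) %>%
  summarize(mean_error = mean(error), .groups = "drop") %>%
  ggplot(aes(x = year, y = mean_error, color = type_simple)) +
  geom_line() +
  geom_point() +
  labs(title = "Mean Polling Error by Year and Race Type",
       x = "Election Year",
       y = "Mean Polling Error",
       color = "Race Type") +
  theme_minimal()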
There is also plenty of poll information excluded from the current dataset, including question wording, response rates, and budget or cost. All of these could help us understand what makes a good poll.