Report of Research on Polling Errors in Political Polls

Authors

Fletcher Sun, Emma Enkhbold, Alex Palmer, Joseph Scharpf

Published

April 28, 2025

# Loading libraries and data
library(tidyverse)
library(ggplot2)
library(dplyr)
library(usmap)

polls <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/raw_polls.csv")

library(ggridges)
library(ggfortify)
pollster_ratings <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/pollster-ratings-combined.csv")

polls <- polls |>
  mutate(error = abs(margin_poll-margin_actual))

poll_full <- polls %>%
  left_join(pollster_ratings, by = c("pollster" = "pollster"))

Intro to Dataset

This report analyzes data from the FiveThirtyEight pollster ratings (https://github.com/fivethirtyeight/data/tree/master/pollster-ratings). The data include polls from 1998 to 2023 along with information about the pollsters who conducted them. Each row of the “raw_polls” dataset records one poll of a specific political race; its variables include the race type, pollster name, election cycle, polled and actual margins, poll and election dates, and candidate names and party affiliations. Each row of the “pollster_ratings” dataset records information about a specific pollster; its variables include the number of its polls analyzed by FiveThirtyEight, its pollster rating, bias score, whether it is inactive, its transparency score, and the percentage of its work considered partisan.
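
For a quick structural overview of the joined data used throughout this report, a dplyr::glimpse() call lists every column and its type (a supplementary sketch, not part of the original analysis):

# Overview of the joined polls + pollster ratings data
glimpse(poll_full)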

Research Questions

  1. How does partisanship affect pollsters’ behavior and the outcome of the polls, notably the error rate of the poll?
  2. How do different polling methodologies affect the accuracy of election predictions and what insights can be used to improve our current data collection process?
  3. Do the type of election and the year contribute to polling accuracy?
  4. Does greater pollster transparency lead to more accurate polling results?

Effect of Partisanship on Pollsters and Polls

We begin by exploring how partisanship affects pollsters’ behavior and the outcomes of their polls, notably the polls’ error rates.

First, we run a principal component analysis (PCA) to explore how the quantitative variables relate to a pollster’s score and quality. The dataset provides a grade for every active pollster, indicating its overall quality.

# Active pollsters
pollster_active <- filter(pollster_ratings, inactive == FALSE)
quant_cols_active <- sapply(pollster_active, is.numeric)
pollster_active_quant <- pollster_active[, quant_cols_active]
pollster_active_quant <- pollster_active_quant %>% select(-numeric_grade, -rank)
pollster_active_pca <- prcomp(pollster_active_quant, 
                        # Center and scale variables:
                        center = TRUE, scale. = TRUE)

autoplot(pollster_active_pca, 
         data = pollster_active,
         alpha = 0.25,
         color = "numeric_grade",
         loadings = TRUE, loadings.colour = 'darkblue',
         loadings.label.colour = 'darkblue',
         loadings.label = TRUE, loadings.label.size = 3,
         loadings.label.repel = TRUE)+
         scale_color_gradient(low = "navajowhite2", high = "blue", name = "Pollster Grade")+
          theme_bw()+
          labs(title = "PCA Plot for All Active Pollsters in the Dataset",
               subtitle = "Colored by Pollster Grade")

In the principal components plot above, the blue dots are pollsters with higher grades from FiveThirtyEight, indicating that they are better pollsters in general, while the white dots correspond to lower grades and lower quality. There is a visible positive relationship between pollster grade and the first principal component (PC1): pollsters with higher PC1 values tend to have higher grades.

The loadings show that average transparency is positively related to PC1, which suggests it is positively correlated with pollster quality. Pointing in the other direction are bias, error, and the percentage of partisan polls, which load negatively on PC1, suggesting that higher partisanship, bias, and error are associated with worse pollsters.

However, it is worth noting that the first two principal components combined account for only around 60% of the variation among active pollsters, so some structure in the data is not fully captured by this plot.
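
The roughly 60% figure can be verified directly from the fitted PCA object; a quick check using the object created above:

# Proportion of variance explained by each principal component
summary(pollster_active_pca)

# Cumulative share of variance across components
cumsum(pollster_active_pca$sdev^2) / sum(pollster_active_pca$sdev^2)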

partisan_lm <- lm(numeric_grade  ~ percent_partisan_work, data = pollster_active)
partisan_error <- lm(error_ppm  ~ percent_partisan_work, data = pollster_active)
summary(partisan_lm)

Call:
lm(formula = numeric_grade ~ percent_partisan_work, data = pollster_active)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.30296 -0.39363 -0.04258  0.39704  1.30532 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.90296    0.03640  52.278  < 2e-16 ***
percent_partisan_work -0.70828    0.08179  -8.659 3.83e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5278 on 280 degrees of freedom
Multiple R-squared:  0.2112,    Adjusted R-squared:  0.2084 
F-statistic: 74.98 on 1 and 280 DF,  p-value: 3.83e-16
summary(partisan_error)

Call:
lm(formula = error_ppm ~ percent_partisan_work, data = pollster_active)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2499 -0.2499 -0.0331  0.2501  1.8501 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            0.04986    0.03516   1.418    0.157
percent_partisan_work -0.01676    0.07900  -0.212    0.832

Residual standard error: 0.5098 on 280 degrees of freedom
Multiple R-squared:  0.0001608, Adjusted R-squared:  -0.00341 
F-statistic: 0.04503 on 1 and 280 DF,  p-value: 0.8321

The linear models show a statistically significant negative relationship between an active pollster’s grade and the percentage of partisan polls it conducts, confirming the pattern from the PCA plot. Interestingly, there is no significant relationship between the predictive plus-minus error and the percentage of partisan polls. This could indicate that partisanship compromises pollster quality in ways other than raw poll error.
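
To complement the p-values with a sense of effect size, confidence intervals for the partisanship slopes can be pulled from the two fitted models (a supplementary check on the models above):

# 95% confidence intervals for the percent_partisan_work coefficients
confint(partisan_lm)["percent_partisan_work", ]
confint(partisan_error)["percent_partisan_work", ]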

We would also like to explore how partisanship affects individual polls.

poll_full <- poll_full |>
  mutate(partisan = if_else(is.na(partisan), "Non-partisan", partisan))
poll_full|>
  group_by(partisan)|>
  summarize(count = n())
# A tibble: 8 × 2
  partisan     count
  <chr>        <int>
1 CRV              1
2 DEM           1489
3 DEM,REP          2
4 IND             10
5 LIB              8
6 Non-partisan 17910
7 REP           1043
8 REP,DEM          3

An overview of the partisan breakdown shows that the majority of polls are non-partisan. The dataset defines a partisan poll as one affiliated with, or internal to, a party. Among the partisan polls, the overwhelming majority are affiliated with either the Democrats or the Republicans; the rest are affiliated with multiple parties, independent candidates, or smaller parties such as the Libertarians. Here we examine only polls that are not affiliated with multiple parties.

poll_partisan <- poll_full|>
  filter(partisan == "REP" | partisan == "DEM" | partisan == "IND" | partisan == "LIB"| partisan == "Non-partisan")


poll_partisan |>
  ggplot(aes(x = error, y = partisan))+
  geom_boxplot(width = .2, alpha = .5)+
  scale_y_discrete(labels = c(
    "REP" = "Republicans",
    "DEM" = "Democrats",
    "IND" = "Independents",
    "LIB" = "Libertarians",
    "Non-partisan"="Non-partisan"
  )) +
  theme_bw()+
  geom_vline(xintercept = median(poll_partisan$error), color = "green4", linewidth = 1)+
  labs(title = "Error of Polls by Parties",
       subtitle = "Green line = Overall Median",
       x = "Absolute error = |Poll Margin - Actual Margin|",
       y = "Partisan Polls")

The boxplot above shows that partisan polls have higher absolute error than non-partisan polls and the overall median, regardless of party affiliation. Among the partisan polls, those affiliated with the major parties (Democrats and Republicans) appear to perform better overall than Libertarian and Independent polls. However, it is also worth noting the many right-tail outliers among non-partisan, Democratic, and Republican polls.
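
Because of those outliers, it is also useful to tabulate the medians and sample sizes behind the boxplot; a supplementary summary using the same poll_partisan data:

# Median and mean absolute error, and number of polls, by affiliation
poll_partisan |>
  group_by(partisan) |>
  summarize(n = n(),
            median_error = median(error, na.rm = TRUE),
            mean_error = mean(error, na.rm = TRUE)) |>
  arrange(median_error)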

poll_non_partisan <- filter(poll_full, partisan == "Non-partisan")
poll_partisan2 <- filter(poll_full, partisan != "Non-partisan")
t.test(poll_non_partisan$error, poll_partisan2$error)

    Welch Two Sample t-test

data:  poll_non_partisan$error and poll_partisan2$error
t = -7.4912, df = 3149.9, p-value = 8.815e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.0804868 -0.6322108
sample estimates:
mean of x mean of y 
 5.596337  6.452686 

A Welch two-sample t-test shows a statistically significant difference between the mean absolute errors of partisan and non-partisan polls. On average, partisan polls have higher absolute error than non-partisan polls.
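
Given the right-skewed error distributions noted above, a rank-based test is a reasonable robustness check alongside the t-test; a sketch using the same two groups (not part of the original analysis):

# Wilcoxon rank-sum (Mann-Whitney) comparison of the two error distributions
wilcox.test(poll_non_partisan$error, poll_partisan2$error)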

Overall, we can conclude that partisanship negatively affects both pollsters and polls.

How do different polling methodologies affect the accuracy of election predictions and what insights can be used to improve our current data collection process?

Before we can explore the relationship between polling methodologies and polling error, we must first standardize the entries in the methodology column of our data. This is so we can avoid overlap due to the many combined survey methods displayed below.

# Displaying Complex Methodologies

method_counts <- table(polls$methodology)
method_counts_df <- as.data.frame(method_counts)
colnames(method_counts_df) <- c("Method", "Count")
method_counts_df <- method_counts_df[order(-method_counts_df$Count), ]
method_counts_df
                                                Method Count
22                                          Live Phone  7662
4                                                  IVR  3214
40                                        Online Panel  2745
15                                    IVR/Online Panel  1107
6                                       IVR/Live Phone   300
25                             Live Phone/Online Panel   201
18                                            IVR/Text   165
51                               Text-to-Web/Online Ad   156
47                                   Probability Panel   154
44                            Online Panel/Text-to-Web   122
13              IVR/Live Phone/Text/Online Panel/Email   118
16                        IVR/Online Panel/Text-to-Web   112
33                              Live Phone/Text-to-Web    94
39                                           Online Ad    87
42                              Online Panel/Online Ad    80
19                                     IVR/Text-to-Web    70
49                                         Text-to-Web    57
48                                                Text    46
17                  IVR/Online Panel/Text-to-Web/Email    44
2                                                Email    34
14                                       IVR/Online Ad    34
43                      Online Panel/Probability Panel    34
36                                                Mail    29
37                           Mail-to-Web/Mail-to-Phone    27
26                        Live Phone/Online Panel/Text    24
5                                            IVR/Email    22
31                        Live Phone/Probability Panel    22
32                                     Live Phone/Text    17
7                          IVR/Live Phone/Online Panel    15
27                 Live Phone/Online Panel/Text-to-Web    14
10                                 IVR/Live Phone/Text    11
21                           IVR/Text-to-Web/Online Ad    11
1                                            App Panel    10
20                               IVR/Text-to-Web/Email    10
34                        Live Phone/Text-to-Web/Email    10
9  IVR/Live Phone/Online Panel/Text-to-Web/Mail-to-Web     8
23                                    Live Phone/Email     8
29     Live Phone/Online Panel/Text-to-Web/Mail-to-Web     7
11                          IVR/Live Phone/Text-to-Web     6
38                                               Mixed     6
50                                   Text-to-Web/Email     6
3                                         Face-to-Face     3
45               Online Panel/Text-to-Web/Face-to-Face     3
8              IVR/Live Phone/Online Panel/Text-to-Web     2
41                            Online Panel/Mail-to-Web     2
12                    IVR/Live Phone/Text-to-Web/Email     1
24                Live Phone/Mail-to-Web/Mail-to-Phone     1
28           Live Phone/Online Panel/Text-to-Web/Email     1
30            Live Phone/Online Panel/Text-to-Web/Text     1
35                           Live Phone/Text/Online Ad     1
46                Online Panel/Text-to-Web/Mail-to-Web     1

We split the combined methodologies to identify each unique method type.

# Locating Method Types 

# Find all unique methods
unique_methods <- unique(polls$methodology)
methods_array <- array(unique_methods)
 
# Split methods by type
split_methods <- strsplit(methods_array, split = "/")
all_method_types <- unlist(split_methods)

method_type_counts <- table(all_method_types)
method_type_counts_df <- as.data.frame(method_type_counts)
colnames(method_type_counts_df) <- c("Method Type", "Count")
method_type_counts_df <- method_type_counts_df[order(-method_type_counts_df$Count), ]
method_type_counts_df
         Method Type Count
5         Live Phone    22
14       Text-to-Web    21
11      Online Panel    20
4                IVR    18
2              Email    10
13              Text     8
8        Mail-to-Web     6
10         Online Ad     6
12 Probability Panel     3
3       Face-to-Face     2
7      Mail-to-Phone     2
1          App Panel     1
6               Mail     1
9              Mixed     1

Then we sorted the types of methods into 5 different categories.

# Separating Method Types into Categories

polls$method_simplified <- NA

# Assign a simplified category via pattern matching; for combined methodologies,
# later assignments overwrite earlier ones, so the lowest matching line below
# takes precedence.
polls$method_simplified[grepl("Live Phone|IVR", polls$methodology, ignore.case = TRUE)] <- "Phone-Based"
polls$method_simplified[grepl("Text-to-Web|Text", polls$methodology, ignore.case = TRUE)] <- "Text-Based"
polls$method_simplified[grepl("Online Panel|Online Ad|App Panel|Email", polls$methodology, ignore.case = TRUE)] <- "Online-Based"
polls$method_simplified[grepl("Mail-to-Web|Mail-to-Phone|Mail", polls$methodology, ignore.case = TRUE)] <- "Mail-Based"
polls$method_simplified[grepl("Probability Panel|Face-to-Face|Mixed|NA", polls$methodology, ignore.case = TRUE)] <- "Other"
# Any poll with an unrecognized or missing methodology falls into "Other"
polls$method_simplified[is.na(polls$method_simplified)] <- "Other"

method_category_counts <- table(polls$method_simplified)
method_category_counts_df <- as.data.frame(method_category_counts)
colnames(method_category_counts_df) <- c("Method Category", "Count")
method_category_counts_df
  Method Category Count
1      Mail-Based   329
2    Online-Based  4722
3           Other  3773
4     Phone-Based 11176
5      Text-Based   466

Now that we have simplified the methodologies, we can learn how different polling methodologies affect the accuracy of election predictions. We calculated the polling error using margin_poll and margin_actual and used it to examine our five methodology categories.

Error:

polls$polling_error <- abs(polls$margin_poll - polls$margin_actual)
custom_colors <- c(
  "Phone-Based" = "#e94d4d",   
  "Text-Based" = "#cc3c3c",    
  "Online-Based" = "#87abe7",  
  "Mail-Based" = "#6288c8",    
  "Other" = "#aaaaaa"         
)

ggplot(polls, aes(x = method_simplified, y = polling_error, fill = method_simplified)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "text", aes(label = round(after_stat(y), 2)), color = "white") +
  scale_fill_manual(values = custom_colors) +
  labs(title = "Polling Error by Simplified Methodology",
       x = "Simplified Methodology",
       y = "Polling Error (%)")

The graph above shows boxplots of polling error by methodology, with the mean polling error labeled for each group. The Mail-Based methodology had the lowest error and the Other category had the highest, while Online-Based, Phone-Based, and Text-Based had similar errors. This suggests noticeable differences in polling error between methodologies: mail-based data collection might produce the most accurate predictions, while technology-based data collection appears similarly reliable across different mediums. This highlights an opportunity to prioritize and encourage mail-based data collection when making election predictions.

anova_result <- oneway.test(polling_error ~ method_simplified, data = polls)
anova_result

    One-way analysis of means (not assuming equal variances)

data:  polling_error and method_simplified
F = 8.3578, num df = 4.0, denom df = 1513.6, p-value = 1.145e-06

We used a one-way ANOVA (Welch’s test, not assuming equal variances) to further assess whether polling errors differ significantly across the simplified methodologies. The extremely small p-value (1.145e-06) suggests that at least one methodology differs in average polling error. This indicates that polling methodology significantly impacts the accuracy of election predictions and is an important factor to take into consideration.
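
Because the ANOVA only indicates that at least one category differs, pairwise comparisons can show which methodology categories drive the difference; a follow-up sketch using Welch-style pairwise t-tests with a Bonferroni correction (not part of the original analysis):

# Pairwise comparisons between methodology categories (unequal variances, Bonferroni-adjusted)
pairwise.t.test(polls$polling_error, polls$method_simplified,
                pool.sd = FALSE, p.adjust.method = "bonferroni")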

We also wanted to see if polling errors for each methodology differ by state. This can help us assess whether certain methodologies produce larger errors at the state level, not just overall, so that we can find ways to improve our data collection process as much as possible.

We begin by standardizing the entries in the location column of our data by including states only and stripping district numbers (e.g., “-24”) to retain only the state abbreviation.

polls$states_simplified <- gsub("-.*", "", polls$location)

exclude<- c("DC", "US", "PR", "M1", "M2", "N2", "N1", "N3", "VI")
polls_filtered <- polls %>%
  filter(!states_simplified %in% exclude)

print(sort(unique(polls_filtered$states_simplified)))
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WV" "WY"

Next, we create datasets for each state by method.

mail_polling_error <- polls %>%
  filter(method_simplified == "Mail-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Mail-Based")

online_polling_error <- polls %>%
  filter(method_simplified == "Online-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Online-Based")

other_polling_error <- polls %>%
  filter(method_simplified == "Other") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Other")

phone_polling_error <- polls %>%
  filter(method_simplified == "Phone-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Phone-Based")

text_polling_error <- polls %>%
  filter(method_simplified == "Text-Based") %>%
  group_by(states_simplified) %>%
  rename(state = states_simplified) %>%
  mutate(plot_type = "Text-Based")
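
As a design note, the five near-identical blocks above could also be generated in a single pass; a compact alternative sketch that produces an equivalent data frame per category (the name polling_error_by_method is ours, not from the original analysis):

# One data frame per methodology category, named by category
method_levels <- unique(polls$method_simplified)
polling_error_by_method <- lapply(setNames(method_levels, method_levels), function(m) {
  polls %>%
    filter(method_simplified == m) %>%
    rename(state = states_simplified) %>%
    mutate(plot_type = m)
})

# e.g., polling_error_by_method[["Mail-Based"]] mirrors mail_polling_error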

Then we take the mean polling errors of each method that were displayed in the boxplots and use them as our midpoint for the heatmaps. States with missing (null) data are displayed in white to focus attention on the meaningful variation among available observations.

library(grid)

mail_overall_mean <- mean(mail_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = mail_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = mail_overall_mean, na.value = "white") +
  labs(
    title = "Polling Error by State and Mail-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")

In the US state heatmap above, California and Utah show higher polling errors for the mail-based methodology than the other states. This could be because both are among the eight states that automatically send every registered voter a mail-in ballot. Interestingly, this finding challenges our earlier assessment that mail-based data collection might be more accurate than other methodologies, and it suggests that election predictions should account for both the data collection methodology and state-specific factors.
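
The visual reading for California and Utah can be checked numerically by averaging the mail-based errors per state; a quick check using the mail_polling_error data frame created above:

# Mean mail-based polling error by state, largest first
mail_polling_error |>
  group_by(state) |>
  summarize(mean_error = mean(polling_error, na.rm = TRUE), n_polls = n()) |>
  arrange(desc(mean_error)) |>
  head(5)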

In the following three US state heatmaps, no state shows notably high polling errors for the Online-Based, Phone-Based, or Other methodologies, suggesting that polling errors were relatively consistent across states for these approaches.

online_overall_mean <- mean(online_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = online_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = online_overall_mean, na.value = "white") +
  labs(
    title = "Polling Error by State and Online-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")

other_mean_error <- mean(other_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = other_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = other_mean_error, na.value = "white") +
  labs(
    title = "Polling Error by State and Other Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")

phone_mean_error <- mean(phone_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = phone_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = phone_mean_error, na.value = "white") +
  labs(
    title = "Polling Error by State and Phone-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")

text_mean_error <- mean(text_polling_error$polling_error, na.rm = TRUE)

plot_usmap(data = text_polling_error, values = "polling_error", regions = "states") +
  scale_fill_gradient2(low = "white", high = "red", midpoint = text_mean_error, na.value = "white") +
  labs(
    title = "Polling Error by State and Text-Based Methodology",
    fill = "Polling Error (%)"
  ) +
  theme(legend.position = "left")

In the US state heatmap above, South Dakota shows a notably higher polling error for the Text-Based methodology than other states. This is interesting because it is the only state with such a discrepancy, and it would benefit from further analysis. It also contrasts with our earlier assessment that technology-based data collection methods produced similar polling errors. This finding highlights that we cannot make broad assumptions and that state-specific factors are important to consider when analyzing the effect of methodologies on election predictions.

Do the type of election and the year contribute to polling accuracy?

raw_polls <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/2023/raw-polls.csv")
Rows: 11475 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (19): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
polls_2022 <- raw_polls %>%
  filter(year == 2022) %>%
  filter(!is.na(error), !is.na(type_simple), !is.na(location))

ggplot(polls_2022, aes(x = type_simple, y = error, color = location)) +
  geom_jitter(width = 0.3, alpha = 0.6, size = 2) +
  labs(
    title = "Poll Error by Race Type and State (2022)",
    x = "Race Type",
    y = "Polling Error (|Predicted Margin - Actual Margin|)",
    color = "State"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

library(tidyverse)

# Load the raw poll data
raw_polls <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/2023/raw-polls.csv")
Rows: 11475 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (19): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean and summarize for a given year
plot_state_level_error <- function(year_selected) {
  raw_polls %>%
    filter(year == year_selected) %>%
    filter(!is.na(error), !is.na(type_simple), !is.na(location)) %>%
    
    # Extract just the state abbreviation (everything before the "-")
    mutate(state = str_extract(location, "^[A-Z]{2}")) %>%
    
    # Group by state and race type, then average the error
    group_by(state, type_simple) %>%
    summarize(mean_error = mean(error), .groups = "drop") %>%
    
    # Plot it
    ggplot(aes(x = type_simple, y = mean_error, color = state)) +
    geom_jitter(width = 0.3, alpha = 0.6, size = 2) +
    labs(
      title = paste("Avg Poll Error by Race Type and State (", year_selected, ")", sep = ""),
      x = "Race Type",
      y = "Mean Polling Error (|Predicted Margin - Actual Margin|)",
      color = "State"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14)
    )
}

# Create plots for the 2022, 2018, 2014, and 2010 cycles
plot_2022_state_level <- plot_state_level_error(2022)
plot_2018_state_level <- plot_state_level_error(2018)
plot_2014_state_level <- plot_state_level_error(2014)
plot_2010_state_level <- plot_state_level_error(2010)

plot_2022_state_level

plot_2018_state_level

plot_2014_state_level

plot_2010_state_level

The plots visualize polling error across different race types (Governor, House, Senate), with each point colored by the state or district where the poll occurred. We observe that House races tend to have slightly lower polling errors than Governor and Senate races, which may reflect differences in media coverage, sample sizes, or polling methodologies across race types. These plots are informative for our project because they highlight how polling reliability can vary with the type of election and geographic region, two factors that could be important when evaluating overall polling accuracy. We focus on the most recent cycles, as they are most relevant to understanding patterns that still exist.
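
The reading that House races show somewhat lower errors can be quantified with a simple summary of the same 2022 polls; a supplementary check using the polls_2022 data frame from above:

# Mean and median polling error by race type, 2022 cycle
polls_2022 %>%
  group_by(type_simple) %>%
  summarize(n = n(),
            mean_error = mean(error),
            median_error = median(error)) %>%
  arrange(mean_error)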

raw_polls <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/pollster-ratings/2023/raw-polls.csv")
Rows: 11475 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (19): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
polls_clean <- raw_polls %>%
  filter(!is.na(error), !is.na(year)) 

polls_by_year <- polls_clean %>%
  group_by(year) %>%
  summarize(mean_error = mean(error, na.rm = TRUE))

ggplot(polls_by_year, aes(x = year, y = mean_error)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "darkred", size = 2) +
  labs(
    title = "Mean Polling Error by Year",
    x = "Election Year",
    y = "Mean Polling Error"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

# Label the 2010-2022 cycles as Presidential or Midterm; other cycles are dropped
polls_clean <- polls_clean %>%
  mutate(election_type = case_when(
    year %in% c(2012, 2016, 2020) ~ "Presidential",
    year %in% c(2010, 2014, 2018, 2022) ~ "Midterm",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(election_type))


t_test_result <- t.test(error ~ election_type, data = polls_clean)


print(t_test_result)

    Welch Two Sample t-test

data:  error by election_type
t = -7.3318, df = 6019.8, p-value = 2.57e-13
alternative hypothesis: true difference in means between group Midterm and group Presidential is not equal to 0
95 percent confidence interval:
 -1.1467662 -0.6629009
sample estimates:
     mean in group Midterm mean in group Presidential 
                  5.097626                   6.002459 

Polling errors have fluctuated over time without clear long-term improvement, and statistical testing shows that Presidential election polls are significantly less accurate than Midterm election polls, with a mean error roughly 0.9 percentage points higher.
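
To connect the year-by-year trend with the Presidential/Midterm comparison, the mean error per cycle can be tabulated alongside its label; a supplementary sketch using the polls_clean data from above:

# Mean polling error per cycle, labeled Presidential or Midterm
polls_clean %>%
  group_by(election_type, year) %>%
  summarize(mean_error = mean(error), n_polls = n(), .groups = "drop") %>%
  arrange(year)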

Does greater pollster transparency lead to more accurate polling results?

library(ggplot2)

polls2 = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/pollster-ratings/pollster-ratings-combined.csv")

ggplot(polls2, aes(x = wtd_avg_transparency, y = POLLSCORE)) +
  geom_point(alpha = 0.7, color = "#e94d4d") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Pollster Transparency vs. PollScore",
       x = "Weighted Average Transparency",
       y = "POLLScore (lower = better accuracy)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

This scatterplot shows the relationship between pollster transparency and polling accuracy. Each point represents a different pollster, with the x-axis measuring their weighted average transparency score and the y-axis showing their PollScore, where lower values indicate better accuracy. The black trend line slopes slightly downward, suggesting a weak negative correlation: as transparency increases, polling errors tend to decrease slightly. However, the spread of points around the trend line shows a lot of variability, meaning transparency alone does not perfectly predict polling accuracy.
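
The strength of this relationship can be quantified directly; a quick check computing the correlation and the slope of the fitted line, using the same wtd_avg_transparency and POLLSCORE columns as in the plot:

# Correlation between transparency and POLLSCORE (complete cases only)
cor(polls2$wtd_avg_transparency, polls2$POLLSCORE, use = "complete.obs")

# Slope of the linear trend shown in the scatterplot
coef(lm(POLLSCORE ~ wtd_avg_transparency, data = polls2))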

library("hexbin")

ggplot(polls2, aes(x = wtd_avg_transparency, y = POLLSCORE)) +
  geom_hex(bins = 30) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Pollster Transparency vs. PollScore (Hexbin Plot)",
       x = "Weighted Average Transparency",
       y = "POLLScore (lower = better accuracy)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

This hexbin plot shows the relationship between pollster transparency and polling accuracy. Each hexagon represents an area containing multiple pollsters, with darker hexagons indicating a higher number of pollsters in that region. The x-axis shows the weighted average transparency score, and the y-axis shows the PollScore, where lower values represent better accuracy. The black trend line slopes downward, indicating a slight negative correlation: pollsters with higher transparency scores tend to have slightly better accuracy on average. However, the spread and clustering of hexagons suggest that while transparency helps, it is not the only factor affecting pollster performance.

Conclusion

Our findings reveal that multiple factors influence polling accuracy, with some having a greater impact than others. We found that partisanship negatively affects both pollsters and polls, with partisan pollsters and partisan-affiliated polls demonstrating higher error rates and lower overall quality. We discovered that mail-based methodologies generally performed better than other methodologies, although state-specific factors can complicate this trend. Additionally, we found that transparency can contribute to pollster accuracy, but it is not influential enough on its own. Lastly, we found that polling accuracy can depend on the type of election, with Presidential election polls being significantly less accurate than Midterm election polls. Overall, polling error in election predictions can be minimized by placing less weight on partisan polls, selecting effective methodologies at the state level, promoting transparency, and considering the specific context and type of each election. By keeping these factors in mind, we can produce more trustworthy election predictions that better inform the public during elections.

Future Work

This report is not an exhaustive analysis of all variables that could affect polls or pollsters. We did not explore the relationships between individual candidates or parties and poll performance, and more analysis could be done with the pollster-level information.

The dataset itself is limited: it only contains polls from 1998 to 2023. It could answer more questions if it included more recent data, as well as data from before 1998, which could provide insight into eras when the political landscape was very different. Exploring the data as a time series could also yield additional insight.

There is also plenty of poll information excluded from the current dataset, including question wording, response rates, and budget or cost. All of these could help us understand what makes a good poll.