36-315 Final Project Report

Author

Cole MacSwain, Micheal Olorunsola, Justin Sun, Qingqing Zhao

Published

April 28, 2025

Introduction

Movies are one of the largest artistic mediums to date in terms of popularity and financial capital. For the past century, they have captivated audiences with their ability to evoke powerful emotions like suspense, joy, and sorrow, and have told long-lasting, impactful stories across a variety of themes and styles. As current college undergraduate students, we have watched our fair share of movies and were interested in analyzing the data surrounding movies as a whole, including production and reception. We wanted to see how certain factors could affect other factors and what relationships or associations could be drawn between two or more variables of interest. Due to the breadth and depth of this area, we settled on specifically studying the horror movie genre, the genre in which movies attempt to evoke physical or psychological fear in viewers.

Dataset

Our dataset originally came from a GitHub repository (tashapiro/horror-movies) created by Tanya Shapiro in October 2022. She extracted the data from The Movie Database (TMDB) via its public API and the R httr package, then filtered it to keep only the movies classified under the horror genre. TMDB is a collaborative online database that aims to keep track of all sorts of information regarding movies, TV shows, and actors/actresses, describing itself as “a specialized version of Wikipedia.”

The dataset originally contained $32,540$ rows and $20$ columns, but we further filtered it to better fit our usage and research questions. In particular, we kept only the variables we were interested in analyzing and removed all incomplete rows so as not to obtain findings skewed by missing data. Here, missing data was marked with either an NA or a 0, depending on the type of the variable in question. Afterwards, the dataset contained $1,073$ rows and $10$ columns. While this decrease in size is significant, the remaining data should still be enough to identify general patterns, with the caveat that movies with complete financial records may not be representative of the original dataset.

Code

horror_movies <- read_csv("https://raw.githubusercontent.com/tashapiro/horror-movies/refs/heads/main/data/horror_movies.csv") |>
  # drop uninteresting variables
  dplyr::select(!c(id, original_title, overview, tagline, poster_path, status, adult, backdrop_path, collection, collection_name)) |>
  # clean up incomplete data entries
  dplyr::filter(budget > 0, revenue > 0, runtime > 0, !is.na(release_date))
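
As a small illustration of the filtering rule above, the sketch below applies the same conditions to a toy data frame. The values are made up for illustration (they are not rows from the real dataset), and base R's subset() stands in for dplyr::filter() so the example runs without extra packages.

```r
# Toy data (hypothetical values): a 0 or an NA marks a missing entry
toy <- data.frame(
  title        = c("A", "B", "C", "D"),
  budget       = c(1e6, 0, 5e5, 2e6),
  revenue      = c(3e6, 1e6, 0, 4e6),
  runtime      = c(90, 100, 95, 88),
  release_date = as.Date(c("2020-01-01", "2019-05-05", "2018-03-03", NA))
)

# Keep only rows where every field of interest is present
complete <- subset(
  toy,
  budget > 0 & revenue > 0 & runtime > 0 & !is.na(release_date)
)

# Only film "A" survives: "B" has no budget, "C" no revenue, "D" no release date
print(complete)
```

Rows are dropped as soon as any one field is missing, which is why the real dataset shrinks so sharply: budget and revenue in particular are unreported for most entries.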

Each row corresponds to a unique horror movie, while each column corresponds to a specific variable of interest. A table naming and briefly defining all the variables of interest can be found below.

Variable	Definition
`title`	Movie’s release title
`original_language`	Movie’s original language
`release_date`	Movie’s release date
`popularity`	Movie’s popularity score from TMDB
`vote_count`	Movie’s number of votes from IMDb
`vote_average`	Movie’s average rating from IMDb
`budget`	Movie’s budget (in USD)
`revenue`	Movie’s revenue (in USD)
`runtime`	Movie’s runtime (in min)
`genre_names`	Movie’s list of genres

The first row of the dataset is displayed below as an example of what each data entry looks like.

Code

knitr::kable(horror_movies[1:1,])

title	original_language	release_date	popularity	vote_count	vote_average	budget	revenue	runtime	genre_names
Smile	en	2022-09-23	1863.628	114	6.8	1.7e+07	4.5e+07	115	Horror, Mystery, Thriller

Research Questions

After some basic exploratory data analysis, we came up with the following research questions:

What variables affect horror movie popularity?
How does the runtime of horror movies affect their average IMDb rating?
How do horror movie financials like budget and revenue affect their average IMDb rating?
How does the original language of horror movies affect their average IMDb rating?

We focused on popularity and rating because we wanted to see if there were any patterns or useful information that could be used to determine the success of a horror movie, as both are metrics in which directors and production companies want their movies to succeed.

Data Analysis

Research Question 1: What Variables Affect Popularity

To determine which variables could affect horror movie popularity, we analyzed the quantitative and categorical variables separately. For the quantitative variables, we applied Principal Component Analysis (PCA) and graphed a biplot of the result.

Code

horror_movies |>
  # keep only the quantitative variables
  dplyr::select(c(popularity, vote_count, vote_average, budget, revenue, runtime)) |>
  # perform PCA with centered and standardized variables
  prcomp(center = TRUE, scale. = TRUE) |>
  # generate the resulting biplot
  fviz_pca_biplot(alpha.ind = 0.25, alpha.var = 0.75,
                  col.var = "royalblue3",
                  label = "var", repel = TRUE) +
  labs(title = "PCA Biplot Over Quantitative Variables for Horror Movies",
       subtitle = "All quantitative variables are positively correlated with popularity")

From the biplot, we can see that as popularity increases, $Z_1$ tends to decrease while $Z_2$ tends to increase. Popularity appears to be positively correlated with every other quantitative variable, as each of their arrows forms an acute angle with the popularity arrow. This means that, in general, horror movies with greater popularity have longer runtimes, greater budgets and revenues, and higher vote_count and vote_average. Furthermore, runtime appears to have the strongest correlation with popularity, with vote_average and vote_count close behind, as the angle between the runtime and popularity arrows is the smallest. This suggests that, out of all the quantitative variables in the dataset, runtime may be the most strongly associated with popularity. This is quite interesting, as one would not expect the length of a horror movie to be closely tied to its popularity. Note that angles in a biplot only approximate correlations, since the plot shows just the first two principal components. Further analysis could be done between runtime and popularity to determine whether there is a causal relationship between them or whether this is the result of a combination of multiple factors, but this would require more data and possibly more advanced techniques.
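
Because the biplot reading rests on angles between arrows, it can help to see how those angles relate to actual correlations. The sketch below uses simulated data (not the horror movie dataset) with hypothetical variables a, b, and c. It assumes one common biplot convention, in which each arrow is the variable's loading on the first two components scaled by the component standard deviations.

```r
set.seed(315)
n <- 200
z <- rnorm(n)

# Simulated data: a and b share a common signal, c is unrelated noise
toy <- data.frame(
  a = z + rnorm(n, sd = 0.5),
  b = z + rnorm(n, sd = 0.5),
  c = rnorm(n)
)

# PCA on centered and standardized variables, as in the report
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Arrow coordinates in the plane of the first two components
arrows <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

# Cosine of the angle between two arrows
cos_angle <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))

cos_ab <- cos_angle(arrows["a", ], arrows["b", ])  # what the biplot suggests
cor_ab <- cor(toy$a, toy$b)                        # the actual correlation

print(c(biplot_cosine = cos_ab, correlation = cor_ab))
```

The two numbers agree in sign but need not agree in size, because the third component is discarded. On the real data, calling cor() on the six quantitative columns would be a direct way to confirm which variable is most correlated with popularity.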

For the categorical variables, we focused specifically on genre_names, as the genres of a movie can have a large effect on its popularity (e.g., if a movie is categorized under unpopular genres, we would expect its own popularity to be low as well). To see whether this is the case, we plotted a LOESS regression curve of popularity against release_date for each genre of interest in genre_names, and highlighted the 3 genres with the highest trend lines and the 3 with the lowest.

Code

horror_movies_rq1 <- horror_movies |>
  # keep variables of interest
  dplyr::select(c(release_date, popularity, genre_names)) |>
  # for each row, for each genre in `genre_names`, create a separate row with that genre only
  separate_rows(genre_names, sep = ",\\s*") |>
  # remove all rows with unimportant genres
  filter(!(genre_names %in% c("Horror", "Animation", "TV Show")))
# reorder `genre_names` so that the genres of interest are first in the order we want
highlighted_genres <- c("Mystery", "Thriller", "Science Fiction", "Comedy", "Music", "Romance")
horror_movies_rq1$genre_names <- fct_relevel(horror_movies_rq1$genre_names, highlighted_genres)
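
To make the reshaping step above concrete, here is a minimal base R sketch of what splitting genre_names into one row per genre does. The first row uses the genres shown for Smile earlier in the report; the second film is hypothetical. Base R's strsplit() is used here as a stand-in for tidyr's separate_rows(), with the same separator pattern.

```r
toy <- data.frame(
  title       = c("Smile", "Hypothetical Film"),
  genre_names = c("Horror, Mystery, Thriller", "Horror, Comedy")
)

# Split each genre string on a comma followed by optional whitespace
genres <- strsplit(toy$genre_names, ",\\s*")

# Repeat each title once per genre to get one row per (movie, genre) pair
long <- data.frame(
  title       = rep(toy$title, lengths(genres)),
  genre_names = unlist(genres)
)

# Drop "Horror" itself, since every movie in the dataset carries it
long <- long[long$genre_names != "Horror", ]
print(long)
```

One consequence of this layout is that a movie with several genres contributes to several trend lines at once, so the curves in the plot are not based on disjoint sets of movies.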

Code

horror_movies_rq1 |>
  ggplot(aes(x = release_date, y = popularity, color = genre_names)) +
    geom_smooth(method = "loess", alpha = 0.6, se = FALSE) +
    gghighlight(genre_names %in% highlighted_genres) +
    scale_color_manual(values = c("Mystery" = "royalblue1", "Thriller" = "steelblue2", "Science Fiction" = "skyblue2",
                                  "Comedy" = "salmon", "Music" = "tomato", "Romance" = "orangered3")) +
    labs(title = "Trend of Horror Movie Popularity Over Time By Subgenre",
         subtitle = "The other genres a horror movie is classified as seems to affect its popularity",
         x = "Release Date", y = "Popularity", color = "Genre")

From the plot, we can see that different genres have different levels of popularity, which suggests that the other genres a horror movie is classified under are associated with its popularity. In particular, the 3 genres with the greatest popularity are mystery, thriller, and science fiction, and the 3 genres with the lowest popularity are comedy, music, and romance. This makes sense, as some genres pair well with horror while others do not, which is either intuitive or learned through trial and error. Further analysis could be done on the number of horror films that have a specific combination of genres to determine the most popular combinations, or on the large rise in popularity across most genres over the past ten years, but again this would require more data to analyze thoroughly.

Research Question 2: Runtime Versus Rating

To look at how the runtime of horror movies relates to their rating, we examined how average runtime, budget, and rating have evolved over time. To do so, we created the variables avg_runtime, avg_budget, and avg_rating at an annual level.

Code

horror_movies_rq2 <- horror_movies |>
  # get the year each movie was released
  dplyr::mutate(year = as.numeric(format(as.Date(release_date), "%Y"))) |>
  # filter out years with very sparse movie releases
  dplyr::filter(year >= 1960) |>
  # over each year, summarize the budget, rating, and runtime of released movies
  dplyr::group_by(year) |>
  dplyr::summarize(avg_budget = mean(budget, na.rm = TRUE),
                   avg_rating = mean(vote_average, na.rm = TRUE),
                   avg_runtime = mean(runtime, na.rm = TRUE))
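
The yearly summary above can be sketched on a toy data frame as follows. The values are made up for illustration, and base R's aggregate() is used in place of the dplyr pipeline. The extra column n is our addition (not part of the report's code), included because a yearly average built from one or two movies is much noisier than one built from dozens.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  release_date = as.Date(c("1960-05-01", "1960-10-31", "2022-09-23")),
  runtime      = c(100, 110, 115),
  vote_average = c(7, 6, 6.8)
)

# Extract the release year as a number, as in the report
toy$year <- as.numeric(format(toy$release_date, "%Y"))

# Average runtime and rating within each year
yearly <- aggregate(cbind(runtime, vote_average) ~ year, data = toy, FUN = mean)

# Number of movies behind each yearly average
yearly$n <- as.vector(table(toy$year))
print(yearly)
```

Keeping n alongside the averages makes it easy to see which years rest on very few releases before reading trends off the plots.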

We then produced a scatterplot of avg_runtime by year, with points colored by avg_rating.

Code

horror_movies_rq2 |>
  ggplot(aes(x = year, y = avg_runtime)) +
    geom_point(aes(color = avg_rating), alpha = 0.7) +
    scale_color_gradient(low = "red", high = "green") +
    theme_minimal() +
    labs(title = "Horror Movie Trends Over Time",
         subtitle = "Runtime (Y-Axis), Rating (Color)",
         x = "Year", y = "Average Runtime (minutes)",
         color = "Avg Rating")

Looking at this graph we can see that average runtimes as well as average ratings fell substantially from the year 1960 to 2020. This could suggest that either movies in the 1960s were higher rated or that movies with higher runtimes tend to have higher ratings.

To investigate further, we fit a linear regression model of average rating on average runtime.

Code

summary(lm(avg_rating ~ avg_runtime, data = horror_movies_rq2))


Call:
lm(formula = avg_rating ~ avg_runtime, data = horror_movies_rq2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2017 -0.3899 -0.0665  0.3454  1.5009 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.55931    1.03382   4.410 4.44e-05 ***
avg_runtime  0.01492    0.01080   1.381    0.172    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7393 on 59 degrees of freedom
Multiple R-squared:  0.03132,   Adjusted R-squared:  0.01491 
F-statistic: 1.908 on 1 and 59 DF,  p-value: 0.1724

Looking at the output of the regression, the relationship is not statistically significant, as the p-value is $0.1724$. However, the output also shows that there is a weak positive correlation between the two variables in this sample, so we made a scatterplot showing average runtime against average rating with the regression line and a 95% confidence interval in an attempt to visualize the relationship.
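
For reference, the size of that correlation can be recovered from the regression output above. In a simple linear regression with a single predictor, $R^2$ is the square of the sample correlation $r$, and $r$ takes the sign of the slope (positive here):

$$r = \sqrt{R^2} = \sqrt{0.03132} \approx 0.177$$

So the correlation between the yearly averages is positive but weak. Likewise, the t value in the coefficient table is the estimate divided by its standard error, $0.01492 / 0.01080 \approx 1.38$, which is too small to distinguish the slope from zero at the usual $0.05$ level.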

Code

horror_movies_rq2 |>
  ggplot(aes(x = avg_runtime, y = avg_rating)) +
  geom_point(color = "darkblue", size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  theme_minimal() +
  labs(title = "Relationship Between Runtime and Rating in Horror Movies",
       y = "Average Rating", x = "Average Runtime",
       caption = "Red line shows linear regression with 95% confidence interval")

Looking at this graph, we can see that the fitted line slopes upward: on average, years with longer average runtimes tend to have higher average ratings. Had the p-value from the linear regression model been lower, this graph, together with the first one, would indicate that longer movies generally have higher ratings, so that as average runtimes have gone down over time, ratings have as well. As it stands, the evidence is not strong enough to draw that conclusion. We could conduct further analysis on a larger sample to see whether the p-value changes and to explore why this might be the case (for example, movies fitting more enticing content into shorter runtimes), but again this would require more data.

Research Question 3: Financials Versus Rating

We were also interested in exploring whether movie budget influences IMDb ratings, and whether this potential relationship differs across revenue levels. As such, we examined the variables budget, vote_average (IMDb rating), and revenue_group, a variable we created that categorizes movies earning less than $\$20$ million as “Low”, movies earning between $\$20$ million and $\$200$ million as “Medium”, and movies earning $\$200$ million or more as “High” in revenue.

Code

horror_movies_rq3 <- horror_movies |>
  # create a new categorical variable based on `revenue`
  dplyr::mutate(revenue_group = case_when(revenue < 2e7 ~ "Low",
                                          revenue < 2e8 ~ "Medium",
                                          TRUE ~ "High"))
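
As a quick check of where the boundaries in the code above fall, the sketch below reproduces the same cutoffs on a few hypothetical revenue values. It uses base R's cut() rather than case_when(), which is a swapped-in technique chosen so the example runs without extra packages.

```r
# Hypothetical revenues, chosen to sit on and around the two cutoffs
revenue <- c(1e7, 2e7, 1.9e8, 2e8, 5e8)

# right = FALSE makes each interval include its lower bound,
# matching `revenue < 2e7` and `revenue < 2e8` in the code above
revenue_group <- cut(
  revenue,
  breaks = c(0, 2e7, 2e8, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE
)

print(data.frame(revenue, revenue_group))
```

One design difference is worth noting. case_when() returns a character column, so lm() orders the groups alphabetically and treats "High" as the baseline, which is why the regression output later in this section reports coefficients for Low and Medium. The factor returned by cut() would instead make "Low" the baseline unless it is releveled.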

Code

horror_movies_rq3 |>
  ggplot(aes(x = budget, y = vote_average, col = revenue_group)) +
  geom_point(alpha = .4) + 
  geom_smooth(se = FALSE) +
  facet_wrap(~ revenue_group) +
  coord_flip() +
  labs(title = "How Horror Movie Budget and Revenue Relate to IMDB Ratings",
       x = "Movie Budget (USD)", y = "IMDB Rating", colour = "Revenue (USD)")

This plot reveals that most horror movies have modest budgets and tend to cluster around average IMDb ratings between 5 and 7. There isn’t a clear relationship between budget, revenue, and rating, suggesting that spending more on a horror movie does not guarantee better audience scores. This is consistent with the idea that success in the horror genre depends less on financial investment and more on audience tastes, creativity, or strong storytelling, although this plot alone cannot confirm it.

Furthermore, to better visualize patterns in the data, we created a heatmap to depict the average revenue of horror movies as a function of both movie budget and IMDb rating.

Code

horror_movies_rq3 |>
  ggplot(aes(x = budget, y = vote_average, z = revenue)) +
    # bin movies by budget and rating, then fill each hexagon with the mean revenue of its movies
    stat_summary_hex(fun = mean, bins = 30) +
    scale_fill_gradient2(low = "white", mid = "darkorange", high = "darkred", midpoint = 3.5e8) +
    theme_bw() +
    labs(title = "Average Revenue of Horror Movies by Budget and IMDB Rating",
         x = "Movie Budget (USD)", y = "IMDB Rating", fill = "Average Revenue (USD)")

This plot highlights that while higher-budget films tend to achieve higher average revenues, IMDb ratings remain fairly consistent, primarily clustering between 5 and 7, regardless of their budget size.

Finally, we followed this up by conducting a linear regression analysis with IMDb rating as the outcome variable and both budget and revenue group as predictors.

Code

summary(lm(vote_average ~ budget + revenue_group, data = horror_movies_rq3))


Call:
lm(formula = vote_average ~ budget + revenue_group, data = horror_movies_rq3)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5004 -0.4618  0.1998  0.8010  4.5161 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          6.847e+00  3.029e-01  22.604  < 2e-16 ***
budget              -1.368e-09  3.028e-09  -0.452   0.6516    
revenue_groupLow    -1.347e+00  3.017e-01  -4.463 8.95e-06 ***
revenue_groupMedium -6.509e-01  2.877e-01  -2.263   0.0239 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.655 on 1069 degrees of freedom
Multiple R-squared:  0.04958,   Adjusted R-squared:  0.04691 
F-statistic: 18.59 on 3 and 1069 DF,  p-value: 9.292e-12

The model as a whole was statistically significant, with a p-value of $9.292\times 10^{-12}$. However, it explained only $4.958\%$ of the variance in ratings, indicating weak predictive power. The effect of budget was not statistically significant ($p = 0.6516$), and its estimated size was negligible: a $\$100$ million budget increase would reduce the predicted rating by only about $0.14$ points. In contrast, revenue group was significant: holding budget fixed, low-revenue movies were rated about $1.35$ points lower than high-revenue movies, and medium-revenue movies about $0.65$ points lower. Overall, while revenue group showed a statistically significant association with ratings, the model explains little of the variation in ratings, suggesting that other unmeasured factors play a more substantial role in determining audience ratings for horror movies.
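
The effect sizes can be read directly off the coefficient table above. The sketch below copies the estimates from that output and works through the arithmetic. The helper predict_rating() is a hypothetical name introduced only for this illustration.

```r
# Estimates copied from the regression output above
b0       <- 6.847       # intercept: High revenue group at zero budget
b_budget <- -1.368e-09  # change in rating per additional dollar of budget
b_low    <- -1.347      # Low group relative to High
b_medium <- -0.6509     # Medium group relative to High

# Predicted change in rating for a $100 million increase in budget
budget_effect <- b_budget * 1e8
print(budget_effect)

# Predicted rating for a given budget and revenue group (hypothetical helper)
predict_rating <- function(budget, group = c("High", "Medium", "Low")) {
  group <- match.arg(group)
  shift <- switch(group, High = 0, Medium = b_medium, Low = b_low)
  b0 + b_budget * budget + shift
}

# Two movies with the same $10 million budget but different revenue groups
print(predict_rating(1e7, "High"))
print(predict_rating(1e7, "Low"))
```

The budget term moves the prediction by roughly a tenth of a point even for a very large change in budget, while moving between revenue groups shifts it by more than a full point.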

Research Question 4: Language Versus Rating

One other variable whose effect on rating we were interested in was original_language. Most movies viewed by audiences are produced in English, but foreign films exist and can be highly rated in their respective regions, so we wanted to see what this actually looked like in the data.

Code

horror_movies |>
  count(original_language, sort = TRUE) |>
  slice_max(n, n = 10) |>
  ggplot(aes(x = fct_reorder(original_language, n), y = n)) +
    geom_col(fill = "lightblue", color = "black") +
    coord_flip() +
    labs(title = "Top 10 Languages of Horror Movies",
         x = "Original Language", y = "Number of Movies") +
    theme_minimal()

This bar chart shows the top 10 original languages of horror movies in the dataset. English is by far the most common, followed by Spanish, Japanese, Portuguese, and German. French, Indonesian, Chinese, and Korean also appear, but with fewer movies compared to the top five. Overall, the chart highlights that while horror movies are made in many different languages, English dominates the genre by a wide margin.

Having seen the distribution of original_language, we plotted vote_average against vote_count, colored by original_language, to see what relationship exists among the three variables.

Code

horror_movies |>
  ggplot(aes(x = vote_count, y = vote_average, color = original_language)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  labs(title = "Horror Movies with More Votes Receive Higher Ratings",
       subtitle = "Color represents original language; log-scale used for vote count",
       x = "Vote Count (log10 scale)", y = "Average IMDB Rating", color = "Original Language") +
  theme_minimal()

In the scatterplot, we use a log scale for vote count to better visualize its wide range. Each colored trend line represents a linear regression for a different language group. Overall, the plot suggests a slight positive trend: horror movies with more votes tend to have slightly higher ratings. However, this relationship varies by language: some languages like English and Spanish show a clearer upward trend, while others appear flatter or more variable, possibly due to smaller sample sizes. In short, movies with more votes tend to have slightly higher ratings, especially in widely spoken languages.
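
To make explicit what each trend line represents: ggplot2 applies scale transformations before computing statistics, so with scale_x_log10() each line is a fit of vote_average on log10(vote_count) within one language. The sketch below reproduces that calculation on simulated data (the values and the two language codes are illustrative, not taken from the dataset).

```r
set.seed(1)

# Simulated data: two language groups with vote counts spread over orders of magnitude
toy <- data.frame(
  original_language = rep(c("en", "es"), each = 50),
  vote_count        = round(10^runif(100, min = 1, max = 4))
)
toy$vote_average <- 5 + 0.3 * log10(toy$vote_count) + rnorm(100, sd = 0.5)

# One regression per language, on the log10 scale used by the plot
slopes <- sapply(split(toy, toy$original_language), function(d) {
  fit <- lm(vote_average ~ log10(vote_count), data = d)
  coef(fit)[["log10(vote_count)"]]
})

# Each slope is the change in rating per tenfold increase in vote count
print(slopes)
```

Running the same per-language fits on the real data, along with the number of movies in each group, would show which of the flatter or more erratic lines rest on only a handful of films.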

Conclusion

In this project, we gained insight into numerous patterns regarding horror movies in this dataset. First, we found that many factors are associated with horror movie popularity, from the movie’s runtime to the other genres the movie is classified under. Second, we saw that runtime seems to have a weak positive correlation with IMDb rating, but it was not statistically significant, so this assertion requires more data. We did see that revenue group has a statistically significant association with IMDb rating, although it is a weak predictor, while budget does not. We also saw that English was the most common original language in the dataset by a wide margin and that the relationship between vote count and rating varies by original language.

All these insights not only offer valuable information about the horror movies in this dataset but also motivate further questions, such as whether movies in different languages have significantly different budgets or earn significantly different revenue. Additionally, there are many other variables in the original dataset that we did not analyze, so we could conduct further research by analyzing those variables and seeing how they relate to the variables we did analyze in this project. Nonetheless, the data and variables we did look into offered interesting observations that may extend to the horror movie industry more broadly.