The dataset we chose to analyze contains IMDb scores and related information for movies spanning four decades. The data was scraped from IMDb and is available at https://www.kaggle.com/datasets/danielgrijalvas/movies?resource=download. It contains approximately 6,820 movies released between 1986 and 2016 and includes 5 quantitative and 10 qualitative variables. These cover the budget, revenue, IMDb score and its associated vote count, release date, genre, and rating, as well as production information such as company, cast, writers, directors, country of origin, and runtime. During data cleaning and preprocessing, we removed records with missing values and created informative new variables such as profit, decade, and release season. The data and the analysis below make it possible to explore the underlying trends in the movie industry.
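The preprocessing described above can be illustrated with a minimal sketch. It is not the exact cleaning pipeline used for this report: the column names (`budget`, `gross`, `year`), the definition of profit as revenue minus budget, and the decade labels are assumptions made for illustration, and the sample rows are made-up toy values.

```python
def clean_and_derive(rows):
    """Drop rows with missing key fields and add 'profit' and 'decade'.

    Assumes each row is a dict with string-valued 'budget', 'gross', and
    'year' keys (hypothetical column names, not confirmed by the report).
    """
    required = ("budget", "gross", "year")
    cleaned = []
    for row in rows:
        # Treat None or an empty string as a missing value and skip the row.
        if any(row.get(key) in (None, "") for key in required):
            continue
        budget = float(row["budget"])
        gross = float(row["gross"])
        year = int(row["year"])
        new_row = dict(row)
        # Assumed definition: profit = revenue (gross) minus budget.
        new_row["profit"] = gross - budget
        # Round the year down to the start of its decade, e.g. 1986 -> "1980s".
        new_row["decade"] = f"{year // 10 * 10}s"
        cleaned.append(new_row)
    return cleaned


# Toy rows for illustration only; these are not values from the dataset.
sample = [
    {"name": "Toy Movie A", "budget": "10", "gross": "25", "year": "1986"},
    {"name": "Toy Movie B", "budget": "", "gross": "40", "year": "1994"},
]
result = clean_and_derive(sample)
```

In this sketch the second toy row is dropped because its budget is missing, and the first row gains a `profit` of 15.0 and a `decade` of "1980s".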
Through our analysis of the data, we identified three overarching questions that help us understand the dataset and the relationships among the movie variables.
This question addresses the relationships between movie ratings, IMDb scores, and the overall distribution of movies, which shows how the global audience feels about specific types of movies. Understanding these relationships will help to identify truly good movies and may even help to predict which upcoming movies are likely to receive a high IMDb score.
This question compares the movies in each decade covered by the dataset, focusing on genre and titles. Both of these components of a movie drive its success, so understanding how they differ over time helps to show how movie production has evolved.
This question helps to better understand what movie producers and financiers consider when selecting a release date (season). Many different types of movies are released at strategic times, and this question will help determine whether some release seasons make for a more successful movie. [The season variable was not originally in the data, but was created based on a string decomposition and grouping of the release date variable.]
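One way the season variable could be derived is sketched below. It is a hedged illustration, not the exact procedure used for this report: it assumes the release date is stored as a "YYYY-MM-DD" string and groups months into Northern Hemisphere meteorological seasons. The function name `release_season` and the month grouping are assumptions; the actual grouping may differ.

```python
# Assumed grouping: Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def release_season(released: str) -> str:
    """Map a release date string to a season label.

    Assumes 'released' looks like "YYYY-MM-DD" (an assumption about the
    raw format); the string is split on "-" and the month is looked up.
    """
    parts = released.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected release date format: {released!r}")
    month = int(parts[1])
    return SEASON_BY_MONTH[month]


# Illustrative dates only.
summer_example = release_season("1986-08-22")
winter_example = release_season("2016-12-16")
```

Splitting the string keeps the sketch close to the "string decomposition" wording above; parsing with a date library would work equally well if the raw format is consistent.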
Answering the question of how movie ratings impact IMDb scores and the overall composition of movies starts with an exploratory data analysis (EDA) to understand preliminary relationships in the data.
Our first chart, below, shows how the distribution of IMDb scores differs across ratings and across the decade in which each movie was released.