Introduction

Movies are an art form that originated in the late 19th century and have remained popular ever since. However, with the advent of the internet in the second half of the 20th century, people began to upload pirated movies to websites. According to research published by the U.S. Chamber of Commerce in 2019, piracy costs the United States at least 29 billion US dollars in lost revenue each year, and other countries likely face similar losses.

As investigative data reporters, we aim to determine which types of movies are most popular on pirated websites. By doing so, we can provide insights to the movie industry about which movies should be prioritized for protection from piracy to minimize revenue loss. Specifically, we believe that the audience’s preference can be represented by the number of views and downloads, and that IMDb rating and industry are two important predictors of that preference: we hypothesize that people want to see movies with relatively high ratings and movies produced by certain industries, depending on their cultural and social background. Thus, in this investigation, we first explore which industries and IMDb ratings the pirated website favors when choosing movies, then analyze the relationships between the number of views of a movie and its industry and IMDb rating, and finally address potential confounding variables in the relationship between views and IMDb ratings.

Data Source Description

The dataset we use was created by Arsalan Ur Rehman and consists of information on movies from a pirated website with a user base of around 2 million visitors every month. It has 20,548 rows, each representing a movie found on the website, and 14 columns, each describing one attribute of the movie.

The quantitative variables in the dataset include:

  • views: the number of clicks on the movie
  • downloads: the number of downloads of the movie
  • IMDb-rating: the rating of the movie on the Internet Movie Database
  • runtime: the run time of the movie (in minutes)

The categorical variables in the dataset include:

  • id: the movie’s unique identification number
  • title: the movie’s title
  • storyline: a short description of the plot of the movie
  • appropriate_for: the classification of the movie (R-rated, PG-13, etc.)
  • language: the language of the movie (can be multiple languages)
  • industry: the industry the movie belongs to (Hollywood, Bollywood, etc.)
  • posted_date: the date when the movie was posted on the pirated platform
  • release_date: the date when the movie was released worldwide
  • director: the name(s) of the director(s) of the movie
  • writer: the name(s) of the writer(s) of the movie

Research Questions

Question 1: What are the preferred types of movies for the pirated website, with a primary focus on IMDb ratings and industry?

In this question, we want to learn which industries and IMDb ratings the pirated website favors when it chooses movies. We focus on industry and rating because we hypothesize that people prefer movies from certain industries and movies with relatively high IMDb ratings. The two variables we use are industry and rating.

a. Marginal distribution of the movies’ industries on the pirated website

First, we explore the marginal distribution of the industries of movies on the pirated website. Because the original dataset contains too many industries to compare directly, we clean the data and regroup industry into “Bollywood / Indian,” “Hollywood / English,” and “Other Industries.” We keep “Hollywood / English” as its own group because it contains more movies than all the other industries combined, and “Bollywood / Indian” because it is the second largest industry.
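The regrouping step described above can be sketched as follows. This is a minimal Python illustration rather than the code used for the report: the function name `regroup_industry`, the exact spelling of the raw labels, the sample values, and the choice to pool missing labels into “Other Industries” are all assumptions.

```python
from collections import Counter

# Labels kept as their own groups; every other label is pooled.
# The exact spelling of the raw labels is an assumption.
MAJOR_INDUSTRIES = {"Hollywood / English", "Bollywood / Indian"}


def regroup_industry(label):
    """Map a raw industry label to one of the three groups."""
    cleaned = (label or "").strip()
    # Assumption: missing or unknown labels fall into "Other Industries".
    return cleaned if cleaned in MAJOR_INDUSTRIES else "Other Industries"


# Hypothetical raw labels, for illustration only (not taken from the dataset).
raw = ["Hollywood / English", "Tollywood", "Bollywood / Indian", None, "Anime / Kids"]
counts = Counter(regroup_industry(x) for x in raw)
print(counts)
```

The resulting counts per group are what a barplot of the marginal distribution would display.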

From the barplot, we can see that “Hollywood / English” movies make up the largest proportion of movies on the pirated website. This makes sense, as Hollywood is one of the most famous movie industries in the world and English is widely spoken both in the US, where Hollywood is based, and around the world. The industry with the second highest proportion on the website is “Bollywood / Indian,” and its bar is only slightly lower than the bar representing all the other industries combined.

## 
##  Chi-squared test for given probabilities
## 
## data:  industry_marginal$count
## X-squared = 13350, df = 2, p-value < 2.2e-16

We perform a Chi-squared goodness-of-fit test to check whether the differences in the proportions of the three groups of industries are statistically significant. The p-value of the test is less than \(2.2 \times 10 ^ {-16}\), which is smaller than 0.05. We therefore reject the null hypothesis that the three groups of industries have equal proportions and have sufficient evidence to conclude that the differences in proportions among the groups of industries on the website are statistically significant.
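To make the test concrete, here is a small Python sketch of a goodness-of-fit test against equal proportions. It is not the code that produced the output above, and the counts are hypothetical placeholders rather than the counts in the dataset. The function name `chi_squared_equal_proportions` is ours. The sketch relies on the fact that with three groups (2 degrees of freedom) the chi-square upper tail probability has the closed form \(e^{-x/2}\).

```python
import math


def chi_squared_equal_proportions(counts):
    """Goodness-of-fit test against equal expected proportions.

    Returns (statistic, degrees of freedom, p-value). The closed-form
    p-value exp(-x / 2) is valid only for df = 2, i.e. three groups.
    """
    k = len(counts)
    if k != 3:
        raise ValueError("this sketch handles exactly three groups (df = 2)")
    total = sum(counts)
    expected = total / k  # equal proportions under the null hypothesis
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    p_value = math.exp(-statistic / 2)  # chi-square upper tail for df = 2
    return statistic, k - 1, p_value


# Hypothetical counts, NOT the counts from the dataset.
stat, df, p = chi_squared_equal_proportions([120, 50, 40])
print(f"X-squared = {stat:.2f}, df = {df}, p-value = {p:.3g}")
```

The larger the gap between the observed counts and the equal-share expectation, the larger the statistic and the smaller the p-value.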

b. Marginal distribution of the movies’ IMDb ratings on the pirated website

Second, we observe the marginal distribution of the IMDb ratings of the movies on the pirated website.

From the histogram, we can see that the distribution of IMDb rating is unimodal and close to normal, with a peak at a rating of around 5.2. The smoothed density curve leads to a similar conclusion. This indicates that, overall, the ratings of the movies on the pirated website are approximately normally distributed, which is interesting because we expected the website to choose more movies with relatively high ratings.

c. Marginal distributions of movies’ IMDb ratings from different industries on the website

To further interpret the results in the previous section, we investigate the marginal distributions of the IMDb ratings of movies from different industries on the pirated website. By doing so, we can check whether the trend in the choice of movies is consistent across the groups of industries, or whether the overall trend simply reflects the “Hollywood / English” industry, which dominates the website. To address the question, we visualize the empirical cumulative distribution functions (eCDFs) of the three groups of industries.

The eCDF plot shows that while the IMDb rating distribution of “Hollywood / English” is close to normal and resembles the overall distribution shown in the previous section, the IMDb ratings of the other two groups are both skewed to the left to some degree. For other industries, roughly 15% of the movies chosen have an IMDb rating of about 8.1, making it the peak of the distribution. For “Bollywood / Indian,” the distribution peaks at a rating of around 9. These differences suggest that the distribution of IMDb ratings may differ across industries.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  movie_hollywood and movie_bollywood
## D = 0.19089, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  movie_hollywood and movie_others
## D = 0.27456, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  movie_bollywood and movie_others
## D = 0.12682, p-value < 2.2e-16
## alternative hypothesis: two-sided

We ran three two-sample Kolmogorov-Smirnov (KS) tests, one for each pair of industry groups, to test this observation. Since the p-values are all smaller than \(2.2 \times 10 ^ {-16}\), we reject the null hypothesis that the IMDb ratings follow the same distribution in different groups of industries and have sufficient evidence to conclude that the distribution of IMDb rating differs significantly between every pair of groups. Thus, we find that the “Hollywood / English” movies the pirated website chooses have a close-to-normal rating distribution that peaks slightly above 5 on the IMDb rating scale, while the movies it chooses from other industries tend to be higher rated.
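The KS statistic D is the largest vertical gap between two eCDFs. The following Python sketch computes D and an asymptotic two-sided p-value from the limiting Kolmogorov distribution. It is an illustration under our own assumptions, not the routine that produced the output above: the function name `ks_two_sample` is ours, the ratings are hypothetical, and the cutoff below which the p-value is reported as 1 is a numerical convenience.

```python
import math


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic two-sided p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the eCDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] == x:  # step past ties in the first sample
            i += 1
        while j < m and b[j] == x:  # step past ties in the second sample
            j += 1
        d = max(d, abs(i / n - j / m))
    # Limiting Kolmogorov distribution:
    # P(sqrt(nm / (n + m)) * D > t) ~ 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 t^2)
    t = d * math.sqrt(n * m / (n + m))
    if t < 0.2:  # the tail probability is numerically 1 here
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * t * t) for k in range(1, 101))
    return d, min(1.0, max(0.0, 2 * series))


# Hypothetical ratings, for illustration only (not taken from the dataset).
hollywood = [4.8, 5.1, 5.3, 5.6, 6.0, 6.4]
others = [6.9, 7.4, 8.1, 8.1, 8.3, 9.0]
print(ks_two_sample(hollywood, others))
```

Because the p-value is asymptotic, it is most trustworthy for large samples such as the ones in this dataset; for very small samples an exact method would be preferable.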

Question 2: How does movie industry influence the number of views of pirated movies?

In the pirated movies dataset, two similar variables, downloads and views, particularly interest us as potential response variables. We suspect that there is a positive linear relationship between the two; namely, a movie with many views also has many downloads.

a. Choosing the response of interest

After plotting downloads against views in a scatterplot with a fitted line, we can clearly see a positive linear relationship between the two. This supports our assumption that movies with many views also have many downloads, and suggests that studying how different factors affect views or downloads would produce similar results. Thus, we decide to focus on views as the main response in the following study.
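One way to quantify the relationship seen in such a scatterplot is the Pearson correlation together with the least-squares line. The Python sketch below is only an illustration: the function name `pearson_and_line` is ours and the views and downloads are hypothetical values, not rows from the dataset.

```python
import math


def pearson_and_line(x, y):
    """Return (r, slope, intercept) for a least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)       # Pearson correlation coefficient
    slope = sxy / sxx                    # least-squares slope
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return r, slope, intercept


# Hypothetical views and downloads, for illustration only.
views = [100, 200, 300, 400]
downloads = [30, 55, 95, 120]
r, slope, intercept = pearson_and_line(views, downloads)
print(f"r = {r:.3f}, downloads ~ {intercept:.1f} + {slope:.3f} * views")
```

A correlation close to 1 would indicate that either variable could serve as the response with similar conclusions.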

While conducting exploratory data analysis (EDA), we also found that the original distribution of views is highly skewed to the right. After a log transformation, however, the distribution appears approximately normal. Thus, for convenience, we use the log of views as the main variable of interest in the rest of the study.
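The effect of the log transformation on skewness can be illustrated with a short Python sketch. The view counts below are hypothetical, and the sketch assumes every count is strictly positive, since the logarithm of zero is undefined; the function name `skewness` is ours.

```python
import math


def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed view counts (all strictly positive).
views = [120, 340, 560, 980, 1500, 4200, 9800, 26000, 150000]
log_views = [math.log(v) for v in views]

print(f"skewness of views:      {skewness(views):.2f}")
print(f"skewness of log(views): {skewness(log_views):.2f}")
```

A skewness much closer to zero after the transformation is consistent with a distribution that looks roughly symmetric on the log scale.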

b. How do different movie industries influence the number of views?

The dataset contains pirated movies that mainly fall into the categories of Hollywood / English and Bollywood / Indian, as well as some other less popular categories that we chose to classify as Other Industries.

To explore how different movie industries affect the number of views, we drew boxplots as well as violin plots. We can see from the graph that the average number of views is highest for Bollywood / Indian movies, followed by Other Industries and Hollywood / English. The most viewed movie also falls under the Bollywood / Indian category. The shapes of the distributions indicate that for Indian and English movies, the number of views is highly concentrated around the median. For Other Industries, however, the number of views is concentrated around the 25th percentile rather than the median. Our overall takeaway is that Bollywood / Indian movies are the most popular on pirated movie sites, while Hollywood / English movies are the least popular. One potential explanation is that Bollywood movies are shown less often in theatres or on major streaming platforms, so people tend to watch them more on pirate sites.
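The quantities a boxplot displays for each industry (first quartile, median, third quartile) can be computed with a sketch like the one below. It is a Python illustration under our own assumptions: the function name `summarize_by_group` is ours, the log-view values are hypothetical, and the "inclusive" quantile method is one of several common conventions.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize_by_group(rows):
    """rows: iterable of (group, value) pairs. Returns {group: (q1, median, q3)}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    summary = {}
    for group, values in groups.items():
        # quantiles() needs at least two values per group.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[group] = (q1, median(values), q3)
    return summary


# Hypothetical log(views) values, for illustration only.
rows = [
    ("Bollywood / Indian", 10.2), ("Bollywood / Indian", 10.8), ("Bollywood / Indian", 11.1),
    ("Hollywood / English", 8.9), ("Hollywood / English", 9.3), ("Hollywood / English", 9.4),
    ("Other Industries", 9.0), ("Other Industries", 9.1), ("Other Industries", 10.5),
]
for group, (q1, med, q3) in summarize_by_group(rows).items():
    print(f"{group}: Q1 = {q1:.2f}, median = {med:.2f}, Q3 = {q3:.2f}")
```

Comparing the medians across groups gives a numeric counterpart to the visual ranking of industries.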

c. How does rating, together with industry, influence the number of views?

Next, we are particularly interested in how IMDb rating, which is a score on a scale of 1-10 voted by registered IMDb users, influences the number of views of pirated movies. Now that we know different movie industries correspond to different levels of popularity, we also want to explore the role that industry plays in this relationship.

To understand the relationship between rating and number of views, we drew a contour plot colored by industry. We can see from the graph that there are five major clusters, each with a center that has a log of views between 7.5 and 10.5 and an IMDb rating between 5 and 10. Of the five clusters, three contain mainly Other Industries movies, one contains mainly Bollywood movies, and the remaining one contains mainly Hollywood movies. These clusters have the highest data point densities, which in the context of our study means the highest popularity of pirated movies. Our major takeaway is that Other Industries movies with an IMDb rating higher than 5 are highly popular on pirated movie sites.

Question 3: What are some possible factors that might influence the relationship between Views and Rating?

Looking further into the relationship between views and rating, we would like to know whether any other factors might influence it.

a. Is the relationship influenced by level of restrictions?

First, we would like to know if the relationships are different for restricted movies and non-restricted ones.

For convenience, we categorize movies into two restriction groups: a restricted group with an R rating, and a non-restricted group with all other ratings. We can see that the two groups cover a similar range, with no clear distinction between restricted and non-restricted movies. By plotting non-linear regression curves for the two groups separately, we observe that even though the two groups share a similar trend of views for movies with IMDb ratings under 7.5, they behave very differently for highly rated movies. For restricted movies, views increase steeply with IMDb rating, whereas for non-restricted movies they decrease.
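The grouping described above can be sketched in Python as follows. This is an illustration only: it assumes that restricted movies are coded as "R" in appropriate_for and that missing values count as non-restricted, and the function names, dictionary keys, and sample movies are hypothetical.

```python
import math
from statistics import mean


def restriction_group(appropriate_for):
    """Label a movie 'restricted' (R rating) or 'non-restricted' (anything else)."""
    # Assumption: the restricted rating is stored as the string "R".
    label = (appropriate_for or "").strip().upper()
    return "restricted" if label == "R" else "non-restricted"


def mean_log_views(movies, group, min_rating=7.5):
    """Mean log(views) of movies in `group` with IMDb rating >= min_rating."""
    selected = [
        math.log(m["views"])
        for m in movies
        if restriction_group(m["appropriate_for"]) == group and m["rating"] >= min_rating
    ]
    return mean(selected) if selected else None


# Hypothetical movies, for illustration only.
movies = [
    {"appropriate_for": "R", "rating": 8.0, "views": 52000},
    {"appropriate_for": "PG-13", "rating": 8.2, "views": 9000},
    {"appropriate_for": "R", "rating": 6.1, "views": 15000},
    {"appropriate_for": None, "rating": 7.9, "views": 7000},
]
for group in ("restricted", "non-restricted"):
    print(group, mean_log_views(movies, group))
```

Comparing the two group means above a rating threshold is a coarse numeric check of the divergence that the fitted curves show for highly rated movies.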

b. Is the relationship influenced by release year?

Second, we want to see if pirated website visitors have varying preferences for movies from different time periods.

We grouped movies into four time periods: before 2000, the 2000s, the 2010s, and the 2020s. We can see that the distribution of views by IMDb rating is roughly the same for movies released before 2010, regardless of their specific time period. For “new” movies released in the 2020s, however, there are noticeably more highly rated movies. Overall, movies from the 2010s have far more views than movies from the other three time periods.
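The period grouping can be written as a small binning function. The Python sketch below assumes the release year has already been extracted from release_date as an integer, since the date format is not described here; the function name `release_period` and the sample years are hypothetical.

```python
from collections import Counter


def release_period(year):
    """Assign a release year to one of the four periods used above."""
    if year < 2000:
        return "before 2000"
    if year < 2010:
        return "2000s"
    if year < 2020:
        return "2010s"
    return "2020s"


# Hypothetical release years, for illustration only.
years = [1994, 2003, 2009, 2010, 2015, 2019, 2020, 2022]
print(Counter(release_period(y) for y in years))
```

Using half-open year ranges keeps the boundaries unambiguous: 2009 falls in the 2000s and 2010 in the 2010s.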

c. What are some most frequent words used in movie descriptions?

Apart from the analyses of ratings and number of views, we are also interested in whether pirated websites prefer certain movie topics. We explore this question by looking at the most common words used in movie descriptions.

We can see that some of the most frequent words are “life”, “family”, “friend”, and “love” for both groups of movies. Most movies involve the personal relationships of the main characters to some degree, so it is not surprising that the most common words all relate to them. We also notice that “father” appears more frequently than “mother” in movies with relatively high views, which raises an interesting question for further exploration: do fathers really play a more important role than mothers in these movies? Other than that, “kill” and “police” are also frequent in movies with higher views, indicating that pirated movie websites might have a preference for horror and suspense movies.
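A word-frequency count of this kind can be sketched in Python as shown below. It is a simplified illustration, not the procedure used for the report: the stopword list is a short hypothetical one, no stemming is applied (so “friend” and “friends” are counted separately), and the storylines are invented examples.

```python
import re
from collections import Counter

# A deliberately short, hypothetical stopword list.
STOPWORDS = {
    "the", "and", "his", "her", "their", "with", "for", "that", "from",
    "into", "who", "when", "but", "are", "was", "has", "have", "this",
}


def top_words(storylines, k=5):
    """Return the k most frequent non-stopword words across all storylines."""
    counts = Counter()
    for text in storylines:
        for word in re.findall(r"[a-z]+", text.lower()):
            # Skip stopwords and very short tokens such as "a" or "of".
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += 1
    return counts.most_common(k)


# Hypothetical storylines, for illustration only.
storylines = [
    "A father fights to protect his family from the police.",
    "Two friends discover love and rebuild their life and family.",
]
print(top_words(storylines, k=3))
```

Running the function separately on high-view and low-view movies would allow the two word rankings to be compared side by side.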

Conclusion

Based on the analyses above, we can conclude that the pirated website shows clear preferences when choosing movies. The majority of movies on the website are from the “Hollywood / English” industry, and these movies follow a close-to-normal distribution of IMDb ratings with a peak at around 5.2. Meanwhile, the website tends to choose higher-rated movies from other industries. We also found that visitors to the pirated movie website exhibit certain preferences when choosing which movies to view. Even though we do not see a direct relationship between rating and number of views, other factors influence this relationship. On the pirated website, highly rated Bollywood movies, restricted movies, and relatively recent movies receive more views.

These findings suggest that 1) Bollywood should focus more on protecting its highly rated movies; 2) Hollywood should protect its movies in general, regardless of IMDb rating and restriction level; and 3) other industries should focus in particular on preventing highly rated movies from being uploaded to pirated movie websites. These findings could help movie companies put more targeted effort into protecting their copyrights and profits.

Future Inspirations

Our analyses shed light on how specific movie industries can enhance their copyright protection against infringement by pirated movie websites. However, we mainly focused on characteristics unrelated to movie genre, which could play a substantial role in determining which pirated movies people prefer. Do visitors to pirated movie websites favor suspense and horror movies over romance movies? Do they prefer a storyline that involves conflicts between family members? Do they prefer movies where male characters play major roles over those starring female heroes? We were not able to explore these genre-related questions because of the limited information in the dataset, but they would be interesting to investigate. With more detailed data collected from pirated movie websites, further research could help industries protect their movies and fight copyright infringement accordingly.