NOBODY KNOWS ANYTHING. Not one person in the entire motion picture field knows for a certainty what’s going to work. Every time out it’s a guess—and, if you’re lucky, an educated one. — William Goldman, Adventures in the Screen Trade

Your data for the midterm project consists of the 1000 highest-rated movies on the Internet Movie Database (IMDB). For your convenience, there is a 61 MB .zip archive of all the information on the class website, [http://www.stat.cmu.edu/~cshalizi/statcomp/14/exams/1/midterm.zip]. This archive contains RData files with the HTML of the web pages already turned into character vectors, one string per line. It also contains HTML pages which list the 1000 movie titles, and an example of one of the movie pages; you may find these convenient to work with.

Each IMDB page records a large amount of information about each movie. We are interested in the following five variables: gross revenue, year of release, average user rating, number of raters, and genre.

Note that the first four variables are numerical, and the last is categorical. In some cases, some of these variables may be missing for a particular movie.

Each movie on IMDB has a separate page of business information; these are also included in the zip archive. In most cases, the gross revenue is on the movie’s business page, rather than on its main page. In some cases, the business information page lists multiple gross revenues, depending on the country, or gross receipts by different dates. In case of ambiguity, we are interested in gross receipts for the US, and want to use the figure for the latest available date. If no gross revenue figure is available for the US, treat the gross revenue as missing.
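
The selection rule above (US figures only, latest date wins, otherwise missing) can be sketched as follows. This is only an illustration: the line format in `gross_lines` is an assumption made up for the example, and `latest_us_gross` is a hypothetical helper name. Check the actual business pages in the archive before relying on any pattern.

```r
# Hypothetical lines in the style of a business page; the real layout may differ.
gross_lines <- c(
  "$28,241,469 (USA) (5 October 1997)",
  "$28,341,469 (USA) (19 October 1997)",
  "$4,000,000 (Canada) (1 December 1997)",
  "$727,327 (USA) (25 September 1994)"
)

# Return the most recent US gross as a number, or NA if there is none.
latest_us_gross <- function(lines) {
  pattern <- "^\\$([0-9,]+) \\(USA\\) \\(([0-9]{1,2}) ([A-Za-z]+) ([0-9]{4})\\)"
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) == 5]   # full match plus four capture groups
  if (length(hits) == 0) return(NA_real_)
  amounts <- as.numeric(gsub(",", "", sapply(hits, `[`, 2)))
  day <- as.integer(sapply(hits, `[`, 3))
  month <- match(sapply(hits, `[`, 4), month.name)  # English month names only
  year <- as.integer(sapply(hits, `[`, 5))
  # A sortable key such as 19971019; avoids locale-dependent date parsing.
  key <- year * 10000 + month * 100 + day
  latest <- which.max(key)
  if (length(latest) == 0) return(NA_real_)
  amounts[latest]
}

latest_us_gross(gross_lines)
latest_us_gross("No gross information here")
```

Returning `NA_real_` rather than 0 or an empty vector keeps the result numeric and makes the missing case explicit when the function is later applied to every movie.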

General advice: Get started on this one early. If you wait until the last minute, it will not go well.

  1. (25 pts) Write code to extract this information from all 1000 movies, and store it in a data frame. For full credit, you should write a function which can extract this information from an arbitrary movie, and then further code which uses that function and applies it to all 1000 movies. For full credit, your code should avoid loops in favor of vectorized operations and apply (and sapply, lapply, etc., as convenient). Your code should handle missing values appropriately, and should not convert categorical variables into numbers, or numbers into strings, etc.
    Victory conditions: You have a data frame with 1000 rows and 5 columns, corresponding to the 5 variables of interest. Columns have short but clear names. The overwhelming majority of rows have no missing values; the few rows where there are missing values have NA in the appropriate places.
    Hint: Use regular expressions.
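
One possible shape for "a function for one movie, then apply it to all of them" is sketched below. The toy `pages` list, its tag layout, and the helper names `extract_first` and `extract_movie` are all assumptions for illustration, and only two of the five variables are shown.

```r
# Two toy "pages", each a character vector with one string per line.
# The tag layout is invented for illustration; inspect the real HTML first.
pages <- list(
  c("<title>Movie A (1994) - IMDb</title>", "<span class=\"genre\">Drama</span>"),
  c("<title>Movie B - IMDb</title>", "<span class=\"genre\">Comedy</span>")
)

# Return the first capture group of `pattern` found in `lines`, or NA.
extract_first <- function(lines, pattern) {
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) > 1]   # keep only lines where the pattern matched
  if (length(hits) == 0) return(NA_character_)
  hits[[1]][2]
}

# Extract the variables for one movie as a one-row data frame.
extract_movie <- function(lines) {
  data.frame(
    year = as.integer(extract_first(lines, "<title>.*\\(([0-9]{4})\\)")),
    genre = extract_first(lines, "class=\"genre\">([^<]+)<"),
    stringsAsFactors = FALSE
  )
}

# Apply the single-movie function to every page, then stack the rows.
movies <- do.call(rbind, lapply(pages, extract_movie))
movies$genre <- factor(movies$genre)  # categorical stays categorical
movies
```

The second toy page has no year in its title, so its `year` comes out as `NA` while the column remains an integer.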

  2. (20 pts) Write code to plot the probability distributions of the five variables. Make sure missing values, if any, are handled gracefully.
    Victory conditions: You have five plots showing five different probability distributions, each with clearly labeled axes and reasonable precision.
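
As a rough sketch of graceful handling of missing values, one function can choose between a histogram and a bar plot and report how many values it dropped. The name `plot_distribution` and the choice of plot types are assumptions here, not requirements of the problem.

```r
# Plot the empirical distribution of one variable, ignoring missing values.
# Numeric variables get a density-scaled histogram; categorical ones a bar plot.
plot_distribution <- function(x, label) {
  n_missing <- sum(is.na(x))
  x <- x[!is.na(x)]
  main <- sprintf("%s (%d missing values dropped)", label, n_missing)
  if (is.factor(x) || is.character(x)) {
    barplot(prop.table(table(x)), xlab = label,
            ylab = "Proportion of movies", main = main)
  } else {
    hist(x, freq = FALSE, xlab = label, ylab = "Density", main = main)
  }
  invisible(n_missing)  # returned so the caller can report it
}
```

A call such as `plot_distribution(movies$year, "Year of release")` would then produce one of the five plots, assuming a data frame named `movies` with a `year` column.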

  3. (30 pts) Write code to fit distributions to the variables listed below. After fitting, conduct appropriate tests to determine which distributions (if any) are reasonable fits to the data.
    For full credit, include tests that your estimation functions will correctly recover the parameters when fed simulated data from the corresponding theoretical distribution. Also, make sure that the code does reasonable things in the presence of missing values (and explain why your choices are reasonable).
    Victory conditions: You have six sets of parameter estimates, clearly labeled by the data set they came from and the model they assume, with all parameters reported to reasonable precision. You have appropriate tests of the fit of the distributions, and well-written explanations of which models (if any) fit their data sets, supported by specific references to the output of your code.
    1. Gross revenue; fit both the exponential distribution and the log-normal distribution.
    2. Average user rating; fit both the Gaussian distribution and the gamma distribution.
    3. Number of raters; fit both the geometric distribution and the Poisson distribution.
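
A minimal sketch of the "recover the parameters from simulated data" check, shown for the exponential and log-normal cases only. The function names, the simulated sample sizes, and the decision to drop `NA`s before fitting are assumptions for illustration; the other four fits would follow the same pattern.

```r
# Maximum-likelihood estimates; missing values are dropped before fitting.
fit_exponential <- function(x) {
  x <- x[!is.na(x)]
  c(rate = 1 / mean(x))
}

fit_lognormal <- function(x) {
  logx <- log(x[!is.na(x)])
  # The MLE of sdlog divides by n, not n - 1.
  c(meanlog = mean(logx), sdlog = sqrt(mean((logx - mean(logx))^2)))
}

# Simulate from known parameters, with an NA appended to exercise the NA handling.
set.seed(1)
sim_exp <- c(rexp(1e4, rate = 0.5), NA)
sim_lnorm <- c(rlnorm(1e4, meanlog = 2, sdlog = 0.7), NA)

est_exp <- fit_exponential(sim_exp)
est_lnorm <- fit_lognormal(sim_lnorm)

# The estimates should land close to the true values 0.5, 2 and 0.7.
stopifnot(abs(est_exp["rate"] - 0.5) < 0.05)
stopifnot(abs(est_lnorm["meanlog"] - 2) < 0.05)
stopifnot(abs(est_lnorm["sdlog"] - 0.7) < 0.05)

# One possible goodness-of-fit check: a Kolmogorov-Smirnov test.
ks_exp <- ks.test(sim_exp[!is.na(sim_exp)], "pexp", rate = unname(est_exp["rate"]))
ks_exp$p.value
```

Note that the Kolmogorov-Smirnov p-value is only approximate when the parameters were estimated from the same data being tested, so it is worth saying so when interpreting it.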
  4. (25 pts) Do a linear regression of the log of gross receipts on year of release, average rating, and number of raters. Make a scatter-plot showing, for all movies, actual gross receipts (not log receipts) on the vertical axis and predicted gross receipts on the horizontal axis (again, not log receipts). Make further plots showing actual vs. predicted receipts separately for each genre of movies. Do a further linear regression of log of gross receipts on year of release, average rating, number of raters, and genre. Make the same series of plots for the expanded model. Comment on which model appears preferable.
    For full credit, the repetitive aspects of producing the plots for the two models should be automated in one or more functions which work equally well with either model, rather than blocks of redundant code.
    Victory conditions: You have the estimated coefficients and related summary statistics for both models, printed to reasonable precision. You have the prescribed series of plots of correctly-calculated predictions, with axes and titles on all plots which are both short and informative. The final comparison of the models argues cogently from specific results of the code.
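
The idea of plotting functions that work equally well with either model might look like the sketch below. The `toy` data frame is simulated purely so the code runs, and the column and function names are assumptions. Exponentiating the fitted log values is the simplest back-transformation and ignores retransformation bias, which may or may not matter for your comparison.

```r
# Simulated stand-in for the movie data frame (all values are made up).
set.seed(2)
n <- 200
toy <- data.frame(
  year = sample(1950:2013, n, replace = TRUE),
  rating = runif(n, 7.5, 9.3),
  raters = rpois(n, 50000),
  genre = factor(sample(c("Drama", "Comedy", "Action"), n, replace = TRUE))
)
toy$gross <- exp(10 + 0.01 * (toy$year - 1950) + 0.5 * toy$rating + rnorm(n, sd = 0.5))
toy$gross[1:5] <- NA  # lm() drops rows with a missing response by default

model_small <- lm(log(gross) ~ year + rating + raters, data = toy)
model_big <- lm(log(gross) ~ year + rating + raters + genre, data = toy)

# Predicted gross on the original (not log) scale, for any fitted model.
predicted_gross <- function(model, data) {
  exp(predict(model, newdata = data))
}

# Actual vs. predicted scatter-plot with a 45-degree reference line.
plot_actual_vs_predicted <- function(model, data, main) {
  plot(predicted_gross(model, data), data$gross,
       xlab = "Predicted gross", ylab = "Actual gross", main = main)
  abline(0, 1, lty = 2)
}

# The same plot, once per genre.
plot_by_genre <- function(model, data) {
  parts <- split(data, data$genre)
  invisible(lapply(names(parts), function(g) {
    plot_actual_vs_predicted(model, parts[[g]], main = g)
  }))
}
```

Calling `plot_actual_vs_predicted(model_small, toy, "All movies")` and `plot_by_genre(model_big, toy)` would then produce the two kinds of plot without duplicating code across models.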

  5. (20 pts., extra credit) Construct a list of all actors and actresses (for short, actors) who have appeared in at least 5 movies in the data set. For each such actor, construct a vector indicating whether or not they appeared in a given movie. Add all these vectors to the data frame as additional columns. Re-run the regressions for problem 4 with the actors as additional variables. Describe what happens to the coefficients of the four original variables with these additional controls.
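
Building the indicator columns could be sketched as follows, assuming the cast of each movie has already been extracted into a list of character vectors. The `casts` data and the name `actor_indicators` are hypothetical, and the toy example lowers the threshold to 2 so that it has something to show.

```r
# Hypothetical cast lists, one character vector per movie.
casts <- list(
  c("Actor A", "Actor B"),
  c("Actor A", "Actor C"),
  c("Actor B", "Actor A"),
  c("Actor D")
)

# One logical column per actor who appears in at least `min_movies` movies.
actor_indicators <- function(casts, min_movies = 5) {
  # unique() counts each actor at most once per movie.
  counts <- table(unlist(lapply(casts, unique)))
  frequent <- names(counts)[as.vector(counts) >= min_movies]
  cols <- lapply(frequent, function(actor) {
    vapply(casts, function(cast) actor %in% cast, logical(1))
  })
  # make.names() gives syntactic column names that are easy to use in formulas.
  names(cols) <- make.names(frequent)
  as.data.frame(cols)
}

indicators <- actor_indicators(casts, min_movies = 2)
indicators
```

The result has one row per movie, so it can be attached to the main data frame with `cbind()` as long as the rows are in the same order.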