---
title: "Lab 10m: Exploratory Data Analysis"
author: "Statistical Computing, 36-350"
date: "Monday October 31, 2016"
---
Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit **your own** lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 9 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 9 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit **your own** homework as a knit HTML file on Blackboard, by 11:59pm on **Tuesday November 8**. This document contains 15 of the 45 total points for Homework 9.
Reading in, exploring exoplanets
===
- There are now over 1,000 confirmed planets outside of our solar system. They have been discovered through a variety of methods, with each method providing access to different information about the planet. Many were discovered by NASA's [Kepler space telescope](https://en.wikipedia.org/wiki/Kepler_(spacecraft)), which observes the "transit" of a planet in front of its host star. In these problems you will use data from the [NASA Exoplanet Archive](http://exoplanetarchive.ipac.caltech.edu) to investigate some of the properties of these exoplanets. (You don't have to do anything yet, this was just by way of background.)
- A data table of dimension 1892 x 10 on exoplanets (planets outside of our solar system, i.e., which orbit anothet star, other than our sun) is up at http://www.stat.cmu.edu/~ryantibs/statcomp/data/exoplanets.csv. Load this data table into your R session with `read.csv()` and save the resulting data frame as `exo.df`. (Hint: the first 13 lines of the linked filed just explain the nature of the data, so you can use an appropriate setting for the `skip` argument in `read.csv()`.) Check that `exo.df` has the right dimensions, and display its first 5 rows.
- The column `pl_discmethod` of `exo.df` documents the method by which the planet was discovered. How many planets were discovered by the "Transit" method? How many were discovered by the "Radial Velocity" method? How method different methods of discovery are there total, and what is the most common? Least common?
- The last 6 columns of the `exo.df` data frame, `pl_pnum`,` pl_orbper`, `pl_orbsmax`, `pl_massj`, `pl_msinij`, and `st_mass`, are all numeric variables. Define a matrix `exo.mat` that contains these variables as columns. Display the first 5 rows of `exo.mat`.
- As we can see, at least one of the columns, `pl_massj`, has many NA values in it. How many missing values are present in each of the 6 columns of `exo.mat`? (Hint: you can do this easily with vectorization, you shouldn't use any loops.)
- Define `ind.clean` to be the vector of row indices corresponding to the "clean" rows in `exo.mat`, i.e., rows for which there are no NAs among the 6 variables. (Hint: again, you can do this easily with vectorization, you shouldn't use any loops.) Use `ind.clean` to define `exo.mat.clean`, a new matrix containing the corresponding clean rows of `exo.mat`. How many rows are left in `exo.mat.clean`?
- Yikes! You should have seen that, because there are so many missing values (NA values) in `exo.mat`, we only have 7 rows with complete observations! This is far too little data. Because of this, we're going to restrict our attention to the variables `pl_orbper`, `st_mass`, and `pl_orbsmax`. Redefine `exo.mat` to contain just these 3 variables as columns. Then repeat the previous part, i.e., define `ind.clean` to be the vector of row indices corresponding to the "clean" rows in `exo.mat`, and define `exo.mat.clean` accordingly. Now, how many rows are left in `exo.mat.clean`?
- Compute histograms of each of the variables in `exo.mat.clean`. Set the titles and label the x-axes appropriately (indicating the variable being considered.) What do you notice about these distributions?
- Apply a log transformation to the variables in `exo.mat.clean`, saving the resulting matrix as `exo.mat.clean.log`. Name the columns of `exo.mat.clean.log` to be "pl\_orbper\_log", "st\_mass\_log", and "pl\_orbsmax\_log", respectively, to remind yourself that these variables are log transformed. Recompute histograms as in the last question. Now what do you notice about these distributions?
- Display all correlations between pairs of variables in `exo.mat.clean.log`. Report, programmatically (not just by eye), the most correlated pair, and the value of this correlation.
- Plot the relationships between pairs of variables in `exo.mat.clean.log` with `pairs()`. What do you notice?
Linear regression modeling
===
- For our exoplanet data set, the orbital period $T$ is found in the variable `pl_orbper`, and the mass of the host star $M$ in the variable `st_mass`, and the semi-major axis $a$ in the variable `pl_orbsmax`. Kepler's third law states that (when the mass $M$ of the host star is much greater than the mass $m$ of the planet), the orbital period $T$ satisfies:
$$T^2 \approx \frac{4\pi^2}{GM}a^3.$$
Above, $G$ is Newton's constant. (You don't have to do anthing yet, this was just by way of background.)
- We are going to consider only the observations in `exo.mat.clean.log` for which the mass of the host star is between 0.9 and 1.1 (on the log scale, between $\log(0.9) \approx -0.105$ and $\log(1.1) \approx 0.095$), inclusive. Define `exo.reg.data` to be the corresponding matrix. Check that it has 439 rows. It will help for what follows to convert `exo.reg.data` to be a data frame, so do that as well, and check that it still has the right number of rows.
- Perform a linear regression of a response variable $\log(T)$ onto predictor variables $\log(M)$ and $\log(a)$, using only planets for which the host star mass is between 0.9 and 1.1, i.e., the data in `exo.reg.data`. (Hint: use `lm()`, and study the last slide of the mini-lecture "Exploratory Data Analysis", or skip ahead to the mini-lecture "Linear Models", if you need to.) What values do you get for the coefficients of the predictors $\log(M)$ and $\log(a)$? (Hint: use `coef()`.) Does this match what you would expect, given Kepler's third law (displayed above)?
**Hw9 Bonus.** What value do you get for the intercept, and does this match what you would expect, given Kepler's third law (displayed above)?
- What are the standard errors associated with the coefficients of $\log(M)$ and $\log(a)$? (Hint: use `summary()`.)
**Hw9 Q1 (5 points).** Write a function called `exo.reg()`, that has just one input `exo.reg.data.cur`, performs a regression of the of the log orbital period, $\log(T)$, onto the log host star mass and log semi-major axis, $\log(M)$ and $\log(a)$, using the data in `exo.reg.data.cur`, and returns a vector of the 3 regression coefficients (intercept, coefficient for $\log(M)$, and coefficient for $\log(a)$). Verify, that when run with `exo.reg.data.cur=exo.reg.data`, which should be the default, it gives the same coefficients that you computed previously.
**Hw9 Q2 (10 points).** Compute jackknife estimates of the standard errors of the estimated regression coefficients of $\log(M)$ and $\log(a)$, from the `exo.reg.data` data set. Your code here should use either a `for()` loop, or one of the `apply()` functions, and should involve repeatedly calling the function `exo.reg()` defined in the last question. (Hint: you may wish to revisit Lab 7w as a refresh on jackknife estimates of standard errors.) How do your estimates here compare to this read off of the summary of the linear model object, reported in the lab above?