Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are no homework questions here. (Lucky you! But don’t worry, you still have something to think about at home: you should be working on your final project…)

Practice with training and test errors

The code below generates and plots training and test data from a simple univariate linear model, as in the “Training and Testing Errors” mini-lecture. (You don’t need to do anything yet.)

set.seed(1)
n = 30
x = sort(runif(n, -3, 3))
y = 2*x + 2*rnorm(n)
x0 = sort(runif(n, -3, 3))
y0 = 2*x0 + 2*rnorm(n)

par(mfrow=c(1,2))
xlim = range(c(x,x0)); ylim = range(c(y,y0))
plot(x, y, xlim=xlim, ylim=ylim, main="Training data")
plot(x0, y0, xlim=xlim, ylim=ylim, main="Test data")

For every \(k\) in between 1 and 15, regress y onto a polynomial in x of degree \(k\). (Hint: look at the “Training and Testing Errors” mini-lecture to see how to use the poly() function.) Then use this fitted model to predict y0 from x0, and record the observed test error. Also record the observed training error. Plot the test error and training errors curves, as functions of \(k\), on the same plot, with properly labeled axes, and an informative legend. What do you notice about the relative magnitudes of the training and test errors? What do you notice about the shapes of the two curves? If you were going to select a regression model based on training error, which would you choose? Based on test error, which would you choose?
Without any programmatic implementation, answer: what would happen to the training error in the current example if we let the polynomial degree be as large as 29?
Modify the above code for the generating current example data so that the underlying trend between y and x, and y0 and x0, is cubic (with a reasonable amount of added noise). Recompute training and test errors from regressions of y onto polynomials in x of degrees 1 up to 15. Answer the same questions as before, and notably: if you were going to select a regression model based on training error, which would you choose? Based on test error, which would you choose?

Sample-splitting with the prostate cancer data

Below, we read in data on 97 men who have prostate cancer (from the book The Elements of Statistical Learning), which we studied in the “Exploratory Data Analysis”, “Linear Models”, and “Beyond Linear Models” mini-lectures. (You don’t need to do anything yet.)

pros.df = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
dim(pros.df)

## [1] 97 10

head(pros.df)

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train
## 1  TRUE
## 2  TRUE
## 3  TRUE
## 4  TRUE
## 5  TRUE
## 6  TRUE

As we can see, the designers of this data set already defined training and test sets for us, in the last column of pros.ds! Split the prostate cancer data frame into two parts according to the last column, and report the number of observations in each part. On the training set, fit a linear model of lpsa on lcavol and lweight. On the test set, predict lpsa from the lcavol and lweight measurements. What is the test error?
Using the same training and test set division as in the previous question, fit a linear model on the training set lpsa on age, gleason, and pgg45. Then on the test set, predict lpsa from the relevant predictor measurements. What is the test error?
How do the test errors compare in the last two questions? Based on this comparison, what regression model would you recommend to your clinician friend? What other considerations might your clinician friend have when deciding between the two models that is not captured by your test error comparison?
Advanced. The difference between the test errors of the two linear models considered above seems significant, but we have no sense of variability of these test error estimates, since it was just performed with one training/testing split. Repeatedly, split the prostate cancer data frame randomly into training and test sets of roughly equal size, fit the two linear models on the training set, and record errors on the test set. As a final estimate of test error, average the observed test errors over this process for each of the two model types. Then, compute the standard deviation of the test errors over this process for each of the two model types. After accounting for the standard errors, do the test errors between the two linear model types still appear significantly different?

Sample-splitting with the wage data

Below, we read in data on 3000 individuals living in the mid-Atlantic regression, measuring various demographic and economic variables (adapted from the book An Introduction to Statistical Learning.)) (You don’t have to do anything yet.)

wage.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/wage.csv", 
                   skip=16)
dim(wage.df)

## [1] 3000   11

head(wage.df, 5)

##        year age     sex           maritl     race       education
## 231655 2006  18 1. Male 1. Never Married 1. White    1. < HS Grad
## 86582  2004  24 1. Male 1. Never Married 1. White 4. College Grad
## 161300 2003  45 1. Male       2. Married 1. White 3. Some College
## 155159 2003  43 1. Male       2. Married 3. Asian 4. College Grad
## 11443  2005  50 1. Male      4. Divorced 1. White      2. HS Grad
##                    region       jobclass         health health_ins
## 231655 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
## 86582  2. Middle Atlantic 2. Information 2. >=Very Good      2. No
## 161300 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
## 155159 2. Middle Atlantic 2. Information 2. >=Very Good     1. Yes
## 11443  2. Middle Atlantic 2. Information      1. <=Good     1. Yes
##             wage
## 231655  75.04315
## 86582   70.47602
## 161300 130.98218
## 155159 154.68529
## 11443   75.04315

Randomly split the wage data frame into training and test sets of roughly equal size. Report how many observations ended up in each half.
On the training set, fit the following two models. The first is a linear model of wage on year, age, and education. The second is an additive model of wage on year, s(age), and education, using the gam package. (Hint: recall the “Beyond Linear Models” mini-lecture, and Lab 10f.) For the second model, plot the effects fit to each predictor, with plot(). Then, use each of these two models to make predictions on the test set, and report the associated test errors. Which model predicts better? What does that tell you about the nonlinearity that we allowed in the additive model for the age predictor?
Advanced. Sample-splitting can be done for logistic regression too, but it just requires us to be a bit more careful about how we choose the training and testing sets. In particular, we want to ensure that the ratio of 0s to 1s (assuming without a loss of generality that the response variable takes these two values) ends up being roughly equal in each of the training and testing sets. For the current wage data set, consider as the response variable the indicator that wage is above 250 (recall, this is measured in thousands of dollars!). Discard for the moment all observations that correspond to an education level of less than HS graduate (recall, the reason for this important step was explained in Lab 10f.) Then split the remaining observations into training and testing sets, but do so in a way that maintains equal ratios of 0s to 1s in the two sets, as best as possible. Once you have done this, fit two models on the training set. The first is a logistic model of I(wage>250) on year, age, and education. The second is an additive logistic model of I(wage>250) on year, s(age), and education.

Now, on the test set, use each of the two models to predict the probabilities that wage is above 250 for each of the test points. From these probabilities, produce the predicted outcomes—0s and 1s—according to the following rule: 0 if the probability is below \(\pi\), and 1 if the probability is above \(\pi\), where \(\pi\) is the observed proportion of 1s in the training set. From these predicted outcomes, compute and compare test errors. Note that test error, here, is just an average of the number of times we would have predicted the response (wage above or below 250) incorrectly. What does this tell you about the nonlinearity allowed in the second model for age?

Lab 13m: Training and Testing Errors, Sample-Splitting and Cross-Validation

Statistical Computing, 36-350

Monday November 28, 2016

Practice with training and test errors

Sample-splitting with the prostate cancer data

Sample-splitting with the wage data