Often, we want an accurate estimate of the test error of our method (e.g., linear regression). Why? Two main purposes:
Predictive assessment: get an absolute understanding of the magnitude of errors we should expect in making future predictions
Model/method selection: choose among different models/methods, attempting to minimize test error
As’ve we’ve seen, training error is inappropriate for both purposes: generally too optimistic, and also more optimistic the more complex/adaptive the method
Given a data set, how can we estimate test error? (Can’t simply simulate more data for testing.) We know training error won’t work
A tried-and-true technique with an old history in statistics: sample-splitting
dat = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/xy.dat")
head(dat)##           x         y
## 1 -2.908021 -7.298187
## 2 -2.713143 -3.105055
## 3 -2.439708 -2.855283
## 4 -2.379042 -4.902240
## 5 -2.331305 -6.936175
## 6 -2.252199 -2.703149n = nrow(dat)
# Split data in half, randomly
set.seed(0)
inds = sample(rep(1:2, length=n))
head(inds, 10)##  [1] 1 2 2 1 2 2 2 1 2 2table(inds)## inds
##  1  2 
## 25 25dat.tr = dat[inds==1,] # Training data
dat.te = dat[inds==2,] # Test data
plot(dat$x, dat$y, pch=c(21,19)[inds], main="Sample-splitting")
legend("topleft", legend=c("Training","Test"), pch=c(21,19))# Train on the first half
lm.1 = lm(y ~ x, data=dat.tr)
lm.10 = lm(y ~ poly(x,10), data=dat.tr)
# Predict on the second half, evaluate test error
pred.1 = predict(lm.1, data.frame(x=dat.te$x))
pred.10 = predict(lm.10, data.frame(x=dat.te$x))
test.err.1 = mean((dat.te$y - pred.1)^2)
test.err.10 = mean((dat.te$y - pred.10)^2)
# Plot the results
par(mfrow=c(1,2))
xx = seq(min(dat$x), max(dat$x), length=100)
plot(dat$x, dat$y, pch=c(21,19)[inds], main="Sample-splitting")
lines(xx, predict(lm.1, data.frame(x=xx)), col=2, lwd=2)
legend("topleft", legend=c("Training","Test"), pch=c(21,19))
text(0, -6, label=paste("Test error:", round(test.err.1,3)))
plot(dat$x, dat$y, pch=c(21,19)[inds], main="Sample-splitting")
lines(xx, predict(lm.10, data.frame(x=xx)), col=3, lwd=2)
legend("topleft", legend=c("Training","Test"), pch=c(21,19))
text(0, -6, label=paste("Test error:", round(test.err.10,3)))Sample-splitting is simple, effective. But its it estimates the test error when the model/method is trained on less data (say, roughly half as much)
An improvement over sample splitting: \(k\)-fold cross-validation
A common choice is \(k=5\) or \(k=10\) (sometimes \(k=n\), called leave-one-out!)
For demonstration purposes, suppose \(n=6\) and we choose \(k=3\) parts
| Data point | Part | Trained on | Prediction | 
|---|---|---|---|
| \(Y_1\) | 1 | 2,3 | \(\hat{Y}^{-(1)}_1\) | 
| \(Y_2\) | 1 | 2,3 | \(\hat{Y}^{-(1)}_2\) | 
| \(Y_3\) | 2 | 1,3 | \(\hat{Y}^{-(2)}_3\) | 
| \(Y_4\) | 2 | 1,3 | \(\hat{Y}^{-(2)}_4\) | 
| \(Y_5\) | 3 | 1,2 | \(\hat{Y}^{-(3)}_5\) | 
| \(Y_6\) | 3 | 1,2 | \(\hat{Y}^{-(3)}_6\) | 
Notation: model trained on parts 2 and 3 in order to make predictions for part 1. So prediction \(\hat{Y}^{-(1)}_1\) for \(Y_1\) comes from model trained on all data except that in part 1. And so on
The cross-validation estimate of test error (also called the cross-validation error) is \[ \frac{1}{6}\Big( (Y_1-\hat{Y}^{-(1)}_1)^2 + (Y_1-\hat{Y}^{-(1)}_2)^2 + (Y_1-\hat{Y}^{-(2)}_3)^2 + \\ (Y_1-\hat{Y}^{-(2)}_4)^2 + (Y_1-\hat{Y}^{-(3)}_5)^2 + (Y_1-\hat{Y}^{-(3)}_6)^2 \Big) \]
# Split data in 5 parts, randomly
k = 5
set.seed(0)
inds = sample(rep(1:k, length=n))
head(inds, 10)##  [1] 5 4 3 2 2 5 5 1 3 1table(inds)## inds
##  1  2  3  4  5 
## 10 10 10 10 10# Now run cross-validation: easiest with for loop, running over
# which part to leave out
pred.mat = matrix(0, n, 2) # Empty matrix to store predictions
for (i in 1:k) {
  cat(paste("Fold",i,"... "))
  
  dat.tr = dat[inds!=i,] # Training data
  dat.te = dat[inds==i,] # Test data
  
  # Train our models
  lm.1.minus.i = lm(y ~ x, data=dat.tr)
  lm.10.minus.i = lm(y ~ poly(x,10), data=dat.tr)
  
  # Record predictions
  pred.mat[inds==i,1] = predict(lm.1.minus.i, data.frame(x=dat.te$x))
  pred.mat[inds==i,2] = predict(lm.10.minus.i, data.frame(x=dat.te$x))
}## Fold 1 ... Fold 2 ... Fold 3 ... Fold 4 ... Fold 5 ...# Compute cross-validation error, one for each model
cv.errs = colMeans((pred.mat - dat$y)^2)
# Plot the results
par(mfrow=c(1,2))
xx = seq(min(dat$x), max(dat$x), length=100)
plot(dat$x, dat$y, pch=20, col=inds+1, main="Cross-validation")
lines(xx, predict(lm.1, data.frame(x=xx)), # Note: model trained on FULL data!
      lwd=2, lty=2) 
legend("topleft", legend=paste("Fold",1:k), pch=20, col=2:(k+1))
text(0, -6, label=paste("CV error:", round(cv.errs[1],3)))
plot(dat$x, dat$y, pch=20, col=inds+1, main="Cross-validation")
lines(xx, predict(lm.10, data.frame(x=xx)), # Note: model trained on FULL data!
      lwd=2, lty=2) 
legend("topleft", legend=paste("Fold",1:k), pch=20, col=2:(k+1))
text(0, -6, label=paste("CV error:", round(cv.errs[2],3)))# Now we visualize the different models trained, one for each CV fold
for (i in 1:k) {
  dat.tr = dat[inds!=i,] # Training data
  dat.te = dat[inds==i,] # Test data
  
  # Train our models
  lm.1.minus.i = lm(y ~ x, data=dat.tr)
  lm.10.minus.i = lm(y ~ poly(x,10), data=dat.tr)
  
  # Plot fitted models
  par(mfrow=c(1,2)); cols = c("red","gray")
  plot(dat$x, dat$y, pch=20, col=cols[(inds!=i)+1], main=paste("Fold",i))
  lines(xx, predict(lm.1.minus.i, data.frame(x=xx)), lwd=2, lty=2) 
  legend("topleft", legend=c(paste("Fold",i),"Other folds"), pch=20, col=cols)
  text(0, -6, label=paste("Fold",i,"error:",
       round(mean((dat.te$y - pred.mat[inds==i,1])^2),3)))
                          
  plot(dat$x, dat$y, pch=20, col=cols[(inds!=i)+1], main=paste("Fold",i))
  lines(xx, predict(lm.10.minus.i, data.frame(x=xx)), lwd=2, lty=2) 
  legend("topleft", legend=c(paste("Fold",i),"Other folds"), pch=20, col=cols)
  text(0, -6, label=paste("Fold",i,"error:",
       round(mean((dat.te$y - pred.mat[inds==i,2])^2),3)))
}