---
title: "Training and Testing Errors"
author: "Statistical Computing, 36-350"
date: "Monday November 28, 2016"
---

Reminder: statistical (regression) models
===

You have some data $X_1,\ldots,X_p,Y$: the variables $X_1,\ldots,X_p$ are called predictors, and $Y$ is called a response. You're interested in the relationship that governs them

So you posit that $Y|X_1,\ldots,X_p \sim P_\theta$, where $\theta$ represents some unknown parameters. This is called a **regression model** for $Y$ given $X_1,\ldots,X_p$. The goal is to estimate the parameters. Why?

- To assess model validity and predictor importance (**inference**)
- To predict future $Y$'s from future $X_1,\ldots,X_p$'s (**prediction**)

Reminder: linear regression models
===

The linear model is arguably the **most widely used** statistical model; it has a place in nearly every application domain of statistics

Given response $Y$ and predictors $X_1,\ldots,X_p$, in a **linear regression model**, we posit:

$$Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon,
\quad \text{where } \epsilon \sim N(0,\sigma^2)$$

The goal is to estimate the parameters $\beta_0,\beta_1,\ldots,\beta_p$. Why?

- To assess whether the linear model is true, and which predictors are important (**inference**)
- To simply predict future $Y$'s from future $X_1,\ldots,X_p$'s (**prediction**)

Shifting tides: a focus on prediction
===

Nowadays, we try to fit linear models in such a wide variety of difficult problem settings that, in many cases, we have no reason to believe the true data generating model is linear, that the errors are close to Gaussian or homoskedastic, etc.

Hence, a modern perspective:

> The linear model is only a rough approximation, so **evaluate prediction accuracy**, and let this determine its usefulness

This idea, to focus on prediction, is far more general than linear models.
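As a quick sketch of this perspective (simulated data; the variable names here are just for illustration), we can fit a linear model to data whose true mean is not exactly linear, and judge the fit by its prediction accuracy on fresh data rather than by whether the model is "true":

```{r}
# Illustrative sketch: the truth is mildly nonlinear, but we fit a
# linear model anyway and evaluate it by its prediction error on
# fresh data drawn from the same process
set.seed(0)
n = 100
x = runif(n, -2, 2)
y = x + 0.2*x^2 + rnorm(n)      # true mean is not exactly linear
fit = lm(y ~ x)                 # linear approximation

x.new = runif(n, -2, 2)         # fresh data, same process
y.new = x.new + 0.2*x.new^2 + rnorm(n)
pred.err = mean((y.new - predict(fit, data.frame(x=x.new)))^2)
pred.err                        # prediction accuracy is the report card
```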
We'll return to this later in the week, but for now here is the summary:

> Models are only approximations; some methods need not even have underlying models; let's **evaluate prediction accuracy**, and let this determine model/method usefulness

In fact, this is (in some sense) one of the basic principles of machine learning

Test error
===

Suppose we have **training data** $X_{i1},\ldots,X_{ip},Y_i$, $i=1,\ldots,n$, used to estimate regression coefficients $\hat\beta_0,\hat\beta_1,\ldots,\hat\beta_p$

Given new $X_1^*,\ldots,X_p^*$, we are asked to predict the associated $Y^*$. From the estimated linear model, the prediction is: $\hat{Y}^* = \hat\beta_0 + \hat\beta_1 X_1^* + \ldots + \hat\beta_p X_p^*$

We define the **test error**, also called **prediction error**, by

$$\mathbb{E}(Y^* - \hat{Y}^*)^2$$

where the expectation is over everything that is random: the training data, $X_{i1},\ldots,X_{ip},Y_i$, $i=1,\ldots,n$, and the test data, $X_1^*,\ldots,X_p^*,Y^*$

This was explained for a linear model, but the same definition of test error **holds in general**

Estimating test error
===

Often, we want an accurate **estimate of the test error** of our method (e.g., linear regression). Why? Two main purposes:

- **Predictive assessment:** get an absolute understanding of the magnitude of errors we should expect in making future predictions
- **Model/method selection:** choose among different models/methods, attempting to minimize test error

Training error
===

Suppose, as an estimate of the test error of our method, we take the observed **training error**

$$\frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$$

What's wrong with this? It is generally **too optimistic** as an estimate of the test error---after all, the parameters $\hat\beta_0,\hat\beta_1,\ldots,\hat\beta_p$ were estimated to make $\hat{Y}_i$ close to $Y_i$, $i=1,\ldots,n$, in the first place!
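As a minimal sketch (simulated data; names here are just for illustration), the training error formula is simply the average squared residual of the fitted model:

```{r}
# Training error = (1/n) * sum of (Y_i - Yhat_i)^2, i.e., the
# mean squared residual of the fit on the data used to fit it
set.seed(0)
n = 50
x = runif(n, -3, 3)
y = 2*x + rnorm(n)
fit = lm(y ~ x)
train.err = mean((y - fitted(fit))^2)
train.err
```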
Also, importantly, the more **complex/adaptive** the method, the more optimistic its training error is as an estimate of test error

Examples
===

```{r}
set.seed(1)
n = 30
x = sort(runif(n, -3, 3))
y = 2*x + 2*rnorm(n)
x0 = sort(runif(n, -3, 3))
y0 = 2*x0 + 2*rnorm(n)
par(mfrow=c(1,2))
xlim = range(c(x,x0)); ylim = range(c(y,y0))
plot(x, y, xlim=xlim, ylim=ylim, main="Training data")
plot(x0, y0, xlim=xlim, ylim=ylim, main="Test data")
```

(Continued)
===

```{r}
# Training and test errors for a simple linear model
lm.1 = lm(y ~ x)
yhat.1 = predict(lm.1, data.frame(x=x))
train.err.1 = mean((y-yhat.1)^2)
y0hat.1 = predict(lm.1, data.frame(x=x0))
test.err.1 = mean((y0-y0hat.1)^2)
par(mfrow=c(1,2))
plot(x, y, xlim=xlim, ylim=ylim, main="Training data")
lines(x, yhat.1, col=2, lwd=2)
text(0, -6, label=paste("Training error:", round(train.err.1,3)))
plot(x0, y0, xlim=xlim, ylim=ylim, main="Test data")
lines(x0, y0hat.1, col=3, lwd=2)
text(0, -6, label=paste("Test error:", round(test.err.1,3)))
```

(Continued)
===

```{r}
# Training and test errors for a 10th order polynomial regression
# (The problem is only exacerbated!)
lm.10 = lm(y ~ poly(x,10))
yhat.10 = predict(lm.10, data.frame(x=x))
train.err.10 = mean((y-yhat.10)^2)
y0hat.10 = predict(lm.10, data.frame(x=x0))
test.err.10 = mean((y0-y0hat.10)^2)
par(mfrow=c(1,2))
xx = seq(min(xlim), max(xlim), length=100)
plot(x, y, xlim=xlim, ylim=ylim, main="Training data")
lines(xx, predict(lm.10, data.frame(x=xx)), col=2, lwd=2)
text(0, -6, label=paste("Training error:", round(train.err.10,3)))
plot(x0, y0, xlim=xlim, ylim=ylim, main="Test data")
lines(xx, predict(lm.10, data.frame(x=xx)), col=3, lwd=2)
text(0, -6, label=paste("Test error:", round(test.err.10,3)))
```
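To see the complexity/optimism point across the whole range of fits, not just degrees 1 and 10, here is an extra sketch (same simulated setup as above; the loop and variable names are just for illustration) that sweeps the polynomial degree: training error can only go down as the model class grows, while test error need not.

```{r}
# Sweep polynomial degree: training error is non-increasing in the
# degree (the model classes are nested), but test error is not
set.seed(1)
n = 30
x = sort(runif(n, -3, 3)); y = 2*x + 2*rnorm(n)
x0 = sort(runif(n, -3, 3)); y0 = 2*x0 + 2*rnorm(n)
degs = 1:10
train.err = test.err = numeric(length(degs))
for (k in degs) {
  fit = lm(y ~ poly(x, k))
  train.err[k] = mean((y - predict(fit, data.frame(x=x)))^2)
  test.err[k] = mean((y0 - predict(fit, data.frame(x=x0)))^2)
}
matplot(degs, cbind(train.err, test.err), type="b", pch=19, col=c(2,3),
        xlab="Polynomial degree", ylab="Error")
legend("topleft", legend=c("Training error", "Test error"),
       col=c(2,3), lty=1)
```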