Cross-Validation (Lecture 10)

36-462/662, Spring 2022

17 February 2022

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\optimand}{\theta} \newcommand{\optimum}{\optimand^*} \newcommand{\OptimalParameter}{\optimand^*} \newcommand{\OptDim}{{p}} \newcommand{\ERM}{\hat{\optimand}} \newcommand{\ObjFunc}{{M}} \newcommand{\Hessian}{\mathbf{k}} \]

Reminders

Agenda for today: more direct estimates of risk

Asymptotic results on the risk of our trained model

Why aren’t we happy with using the optimism to estimate true risk?

“Why think, when you can do the experiment?”

  1. Wait for new data to appear, or go out to gather it
    • Actual science (not data science); slow, expensive, painful, does not play to our strengths as statisticians
  2. Get someone else’s data, see if the model generalizes
    • Who is this person, who magically has exactly the type of data we need?
  3. See how well our model generalizes from one part of our data to another

Out-of-sample prediction evaluation, in general

Make sure the training and testing data are independent

We have a data set

Deaths per day in Chicago, from all causes except accidents, plus measurements of air pollution and temperature.

death pm10median pm25median o3median so2median time tmpd date
130 -7.4335443 NA -19.59234 1.9280426 -2556.5 31.5 1986-12-31
150 NA NA -19.03861 -0.9855631 -2555.5 33.0 1987-01-01
101 -0.8265306 NA -20.21734 -1.8914161 -2554.5 33.0 1987-01-02
135 5.5664557 NA -19.67567 6.1393413 -2553.5 29.0 1987-01-03
126 NA NA -19.21734 2.2784649 -2552.5 32.0 1987-01-04
130 6.5664557 NA -17.63400 9.8585839 -2551.5 40.0 1987-01-05

We fit models and want to know how well they do

chicago.lm <- lm(death ~ pm10median + o3median + so2median + tmpd, data = chicago)
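As a baseline, the in-sample mean squared error takes one line of R (a minimal sketch; the name in.sample.mse is just illustrative):

# Average squared residual on the same data used to fit the model
# (lm has already dropped rows with missing values)
in.sample.mse <- mean(residuals(chicago.lm)^2)

The whole point of today is that this number is an optimistic estimate of how well the model will predict new days.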

Sample splitting

Sample splitting

chicago$split <- sample(c("train", "test"), size = nrow(chicago), replace = TRUE)
head(chicago$split)
## [1] "train" "test"  "train" "test"  "train" "test"
death pm10median pm25median o3median so2median time tmpd date split
130 -7.4335443 NA -19.59234 1.9280426 -2556.5 31.5 1986-12-31 train
150 NA NA -19.03861 -0.9855631 -2555.5 33.0 1987-01-01 test
101 -0.8265306 NA -20.21734 -1.8914161 -2554.5 33.0 1987-01-02 train
135 5.5664557 NA -19.67567 6.1393413 -2553.5 29.0 1987-01-03 test
126 NA NA -19.21734 2.2784649 -2552.5 32.0 1987-01-04 train
130 6.5664557 NA -17.63400 9.8585839 -2551.5 40.0 1987-01-05 test

Sample splitting

kable(head(chicago[chicago$split == "train", ]))
death pm10median pm25median o3median so2median time tmpd date split
1 130 -7.4335443 NA -19.592338 1.928043 -2556.5 31.5 1986-12-31 train
3 101 -0.8265306 NA -20.217338 -1.891416 -2554.5 33.0 1987-01-02 train
5 126 NA NA -19.217338 2.278465 -2552.5 32.0 1987-01-04 train
8 109 -5.4335443 NA -12.170527 -5.107941 -2549.5 29.0 1987-01-07 train
10 153 NA NA -18.580280 -2.046929 -2547.5 32.5 1987-01-09 train
11 124 -19.4335443 NA -5.712194 -1.600999 -2546.5 29.5 1987-01-10 train
kable(head(chicago[chicago$split == "test", ]))
death pm10median pm25median o3median so2median time tmpd date split
2 150 NA NA -19.03861 -0.9855631 -2555.5 33.0 1987-01-01 test
4 135 5.5664557 NA -19.67567 6.1393413 -2553.5 29.0 1987-01-03 test
6 130 6.5664557 NA -17.63400 9.8585839 -2551.5 40.0 1987-01-05 test
7 129 -0.4335443 NA -15.37440 -5.8189921 -2550.5 34.5 1987-01-06 test
9 125 -0.5714286 NA -20.09234 0.1822373 -2548.5 26.5 1987-01-08 test
12 111 -15.4335443 NA -15.62886 2.9379306 -2545.5 34.5 1987-01-11 test

Sample splitting
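The payoff from the split: fit on the training rows only, then score the fit on the test rows it never saw. A minimal sketch (object names like chicago.train.lm are illustrative, not from the original slides):

# Fit the same model using only the training rows
chicago.train.lm <- lm(death ~ pm10median + o3median + so2median + tmpd,
    data = chicago[chicago$split == "train", ])
# Evaluate on the held-out test rows
test.set <- chicago[chicago$split == "test", ]
test.predictions <- predict(chicago.train.lm, newdata = test.set)
mean((test.set$death - test.predictions)^2, na.rm = TRUE)  # out-of-sample MSE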

Drawbacks of sample splitting

Enter cross-validation

head(rep(1:3, length.out = nrow(chicago)))
## [1] 1 2 3 1 2 3
chicago$fold <- sample(rep(1:3, length.out = nrow(chicago)))
head(chicago$fold)
## [1] 1 1 1 3 1 3
kable(head(chicago[chicago$fold == 1, ]))
death pm10median pm25median o3median so2median time tmpd date split fold
1 130 -7.4335443 NA -19.59234 1.9280426 -2556.5 31.5 1986-12-31 train 1
2 150 NA NA -19.03861 -0.9855631 -2555.5 33.0 1987-01-01 test 1
3 101 -0.8265306 NA -20.21734 -1.8914161 -2554.5 33.0 1987-01-02 train 1
5 126 NA NA -19.21734 2.2784649 -2552.5 32.0 1987-01-04 train 1
9 125 -0.5714286 NA -20.09234 0.1822373 -2548.5 26.5 1987-01-08 test 1
10 153 NA NA -18.58028 -2.0469293 -2547.5 32.5 1987-01-09 train 1
kable(head(chicago[chicago$fold == 2, ]))
death pm10median pm25median o3median so2median time tmpd date split fold
8 109 -5.433544 NA -12.170527 -5.1079414 -2549.5 29.0 1987-01-07 train 2
11 124 -19.433544 NA -5.712194 -1.6009988 -2546.5 29.5 1987-01-10 train 2
15 109 -7.256018 NA -18.712194 -1.1414161 -2542.5 32.5 1987-01-14 train 2
17 128 NA NA -17.378861 1.8072373 -2540.5 27.0 1987-01-16 test 2
19 130 -9.433544 NA -7.634004 -0.7186133 -2538.5 23.0 1987-01-18 test 2
20 133 -3.433544 NA -18.413614 0.6816042 -2537.5 20.5 1987-01-19 train 2
kable(head(chicago[chicago$fold == 3, ]))
death pm10median pm25median o3median so2median time tmpd date split fold
4 135 5.5664557 NA -19.67567 6.139341 -2553.5 29.0 1987-01-03 test 3
6 130 6.5664557 NA -17.63400 9.858584 -2551.5 40.0 1987-01-05 test 3
7 129 -0.4335443 NA -15.37440 -5.818992 -2550.5 34.5 1987-01-06 test 3
12 111 -15.4335443 NA -15.62886 2.937931 -2545.5 34.5 1987-01-11 test 3
13 104 11.5664557 NA -17.04553 3.641800 -2544.5 34.0 1987-01-12 train 3
18 141 -2.4335443 NA -17.00386 0.562692 -2539.5 17.5 1987-01-17 train 3

How to actually do cross-validation
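Concretely, with the fold labels created above: hold out each fold in turn, fit on the remaining folds, score on the held-out fold, and average. A minimal sketch using the running Chicago model (names like fold.mses are illustrative):

fold.mses <- rep(NA, 3)
for (fold in 1:3) {
    train <- chicago[chicago$fold != fold, ]  # fit on the other folds
    test <- chicago[chicago$fold == fold, ]   # score on the held-out fold
    fold.lm <- lm(death ~ pm10median + o3median + so2median + tmpd, data = train)
    predictions <- predict(fold.lm, newdata = test)
    fold.mses[fold] <- mean((test$death - predictions)^2, na.rm = TRUE)
}
mean(fold.mses)  # the cross-validated estimate of the risk

Every data point gets used for testing exactly once, and (unlike a single split) every data point also gets used for training.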

Why should this work?

How do we know this works?

In-sample error just goes down with the order of the polynomial

The more complicated models don’t generalize well

More complicated models generalize very badly

\(R^2\) and \(R^2_{adjusted}\) are useless for goodness-of-fit
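The figures behind these slides are not reproduced here, but the phenomenon is easy to recreate with a small simulation (everything below, from the data-generating process to the range of degrees, is illustrative):

set.seed(42)
sim.data <- function(n) {
    x <- runif(n, -3, 3)
    data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.5))  # the truth is not a polynomial
}
train <- sim.data(30)
test <- sim.data(10000)  # a large fresh sample approximates the true risk
for (degree in c(1, 2, 5, 10, 20)) {
    poly.lm <- lm(y ~ poly(x, degree), data = train)
    in.mse <- mean(residuals(poly.lm)^2)                            # can only shrink with degree
    out.mse <- mean((test$y - predict(poly.lm, newdata = test))^2)  # eventually gets worse
    cat("degree", degree, ": in-sample", signif(in.mse, 3),
        ", out-of-sample", signif(out.mse, 3), "\n")
}

Because the models are nested, the in-sample MSE can only fall as the degree grows, while the error on fresh data eventually gets worse.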

k-fold CV for linear models

# k-fold cross-validation for a collection of linear-model formulas
# Inputs: a data frame, a vector or list of formulas, and the number of folds
# Output: a vector of cross-validated MSEs, one per formula
cv.lm <- function(data, formulae, nfolds = 5) {
    data <- na.omit(data)                     # drop rows with missing values
    formulae <- sapply(formulae, as.formula)  # coerce strings to formulas
    responses <- sapply(formulae, response.name)
    names(responses) <- as.character(formulae)
    n <- nrow(data)
    # Randomly assign each row to one of the folds
    fold.labels <- sample(rep(1:nfolds, length.out = n))
    mses <- matrix(NA, nrow = nfolds, ncol = length(formulae))
    colnames(mses) <- as.character(formulae)
    for (fold in 1:nfolds) {
        test.rows <- which(fold.labels == fold)
        train <- data[-test.rows, ]  # fit on everything except this fold
        test <- data[test.rows, ]    # evaluate on this fold
        for (form in 1:length(formulae)) {
            current.model <- lm(formula = formulae[[form]], data = train)
            predictions <- predict(current.model, newdata = test)
            test.responses <- test[, responses[form]]
            test.errors <- test.responses - predictions
            mses[fold, form] <- mean(test.errors^2)
        }
    }
    # Average over folds to get one CV score per formula
    return(colMeans(mses))
}
# Extract the name of the response variable from a formula
response.name <- function(formula) {
    var.names <- all.vars(formula)
    return(var.names[1])  # the response is the first variable in the formula
}
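For example, to compare a few specifications on the Chicago data (the particular formulas here are just illustrative):

cv.lm(chicago,
      formulae = c("death ~ tmpd",
                   "death ~ pm10median + tmpd",
                   "death ~ pm10median + o3median + so2median + tmpd"),
      nfolds = 5)

This returns one cross-validated MSE per formula; smaller is better.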

How well does cross-validation work?

Leave-one-out-CV (LOOCV)

Short-cut formula for LOOCV
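For linear regression (and linear smoothers generally), we do not have to refit the model \(n\) times: if \(H_{ii}\) is the \(i\)-th diagonal entry of the hat matrix,

\[ \text{LOOCV} = \frac{1}{n} \sum_{i=1}^{n} { \left( \frac{y_i - \hat{y}_i}{1 - H_{ii}} \right) }^2 \]

In R, for the model fit earlier, this is one line with no refitting:

# Leave-one-out CV MSE via the hat-matrix short-cut
mean((residuals(chicago.lm) / (1 - hatvalues(chicago.lm)))^2)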

\(k\)-fold CV vs. LOOCV

Reminder: Best CV score is optimistically biased
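A toy way to see the bias: generate a response that is pure noise, cross-validate many equally useless models, and look at the best score. A minimal sketch reusing cv.lm from above (the sample sizes and number of candidate models are arbitrary):

set.seed(7)
n <- 200
p <- 50
noise <- data.frame(y = rnorm(n), matrix(rnorm(n * p), ncol = p))  # columns X1, ..., X50
candidate.formulas <- paste0("y ~ X", 1:p)
scores <- cv.lm(noise, candidate.formulas, nfolds = 5)
min(scores)   # the winning model's score: selection-biased downward
mean(scores)  # the average across candidates is a fairer summary

Picking the model with the smallest CV score is fine for selection, but that minimum is not an unbiased estimate of the selected model's risk (Tibshirani and Tibshirani 2009).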

Summing up

Backup: Further reading

Backup: The Winner’s Curse

References

Arlot, Sylvain, and Alain Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4:40–79. https://doi.org/10.1214/09-SS054.

Burman, Prabir, Edmond Chow, and Deborah Nolan. 1994. “A Cross-Validatory Method for Dependent Data.” Biometrika 81:351–58. https://doi.org/10.1093/biomet/81.2.351.

Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge, England: Cambridge University Press.

Geisser, Seymour. 1975. “The Predictive Sample Reuse Method with Applications.” Journal of the American Statistical Association 70:320–28. https://doi.org/10.1080/01621459.1975.10479865.

Geisser, Seymour, and William F. Eddy. 1979. “A Predictive Approach to Model Selection.” Journal of the American Statistical Association 74:153–60. https://doi.org/10.1080/01621459.1979.10481632.

Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.

Laan, Mark J. van der, and Sandrine Dudoit. 2003. “Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples.” 130. U.C. Berkeley Division of Biostatistics Working Paper Series. http://www.bepress.com/ucbbiostat/paper130/.

Racine, Jeff. 2000. “Consistent Cross-Validatory Model-Selection for Dependent Data: Hv-Block Cross-Validation.” Journal of Econometrics 99:39–61. https://doi.org/10.1016/S0304-4076(00)00030-0.

Stone, M. 1974. “Cross-Validatory Choice and Assessment of Statistical Predictions.” Journal of the Royal Statistical Society B 36:111–47. http://www.jstor.org/stable/2984809.

Thaler, Richard H. 1994. The Winner’s Curse: Paradoxes and Anomalies of Economic Life. Princeton, New Jersey: Princeton University Press.

Tibshirani, Ryan J., and Robert Tibshirani. 2009. “A Bias Correction for the Minimum Error Rate in Cross-Validation.” Annals of Applied Statistics 3:822–29. http://arxiv.org/abs/0908.2904.

Vaart, Aad W. van der, Sandrine Dudoit, and Mark J. van der Laan. 2006. “Oracle Inequalities for Multi-Fold Cross Validation.” Statistics and Decisions 24:351–71. https://doi.org/10.1524/stnd.2006/24.3.351.

Wahba, Grace. 1990. Spline Models for Observational Data. Philadelphia: Society for Industrial and Applied Mathematics.

Yang, Yuhong. 2005. “Can the Strengths of AIC and BIC Be Shared? A Conflict Between Model Identification and Regression Estimation.” Biometrika 92:937–50. https://doi.org/10.1093/biomet/92.4.937.