\documentclass{article}
\input{../common}
\begin{document}
\title{Lecture 21: Model Selection}
\author{36-401, Fall 2015, Section B}
\date{10 November 2015}
\maketitle
\tableofcontents
<<echo=FALSE>>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE)
options(show.signif.stars=FALSE)
@
\vspace{10mm}
\section{Generalization and Optimism}
We estimated our model by minimizing the mean squared error on our data:
\[
\widehat{\mathbf{\beta}} = \argmin_{\mathbf{b}}{\frac{1}{n}(\mathbf{y} - \mathbf{x}\mathbf{b})^T(\mathbf{y} - \mathbf{x}\mathbf{b})}
\]
Different linear models will amount to different choices of the design matrix
$\mathbf{x}$ --- we add or drop variables, we add or drop interactions or
polynomial terms, etc., and this adds or removes columns from the design
matrix. We might consider selecting among the models themselves
by minimizing the MSE. This is a very bad idea, for a fundamental reason:
\begin{quotation}
Every model is too optimistic about how well it will actually predict.
\end{quotation}
Let's be very clear about what it would mean to predict well. The most
challenging case would be that we see a new {\em random} point,
with predictor values $X_1, \ldots X_p$ and response $Y$, and our
{\em old} $\widehat{\mathbf{\beta}}$ has a small expected squared error:
\[
\Expect{\left(Y - \left(\hat{\beta}_0 + \sum_{j=1}^{p}{X_j \hat{\beta}_j}\right)\right)^2}
\]
Here both $Y$ and the $X$'s are random (hence the capital letters), so we might
be asking the model for a prediction at a point it never saw before. (Of
course if we have multiple identically distributed $(X,Y)$ pairs, say $q$ of
them, the expected MSE over those $q$ points is just the same as the expected
squared error at one point.)
An easier task would be to ask the model for predictions at the {\em same}
values of the predictor variables as before, but with different random noises.
That is, we fit the model to
\[
\mathbf{Y} = \mathbf{x}\mathbf{\beta} + \mathbf{\epsilon}
\]
and now Tyche\footnote{Look her up.} reaches into her urn and gives us
\[
\mathbf{Y}^{\prime} = \mathbf{x}\mathbf{\beta} + \mathbf{\epsilon}^\prime
\]
where $\mathbf{\epsilon}$ and $\mathbf{\epsilon}^\prime$ are independent but
identically distributed. The design matrix is the same, the true parameters
$\mathbf{\beta}$ are the same, but the noise is different\footnote{If we really
are in an experimental setting, we really could get a realization of
$\mathbf{Y}^{\prime}$ just by running the experiment a second time. With
surveys or with observational data, it would be harder to actually realize
$\mathbf{Y}^\prime$, but mathematically at least it's unproblematic.}. We
now want to see if the coefficients we estimated from $(\mathbf{x},
\mathbf{Y})$ can predict $(\mathbf{x}, \mathbf{Y}^\prime)$. Since the only
thing that's changed is the noise, if the coefficients can't predict well any
more, that means that they were really just memorizing the noise, and not
actually doing anything useful.
Our out-of-sample expected MSE, then, is
\[
\Expect{n^{-1}(\mathbf{Y}^\prime - \mathbf{x}\widehat{\mathbf{\beta}})^T (\mathbf{Y}^\prime - \mathbf{x}\widehat{\mathbf{\beta}})}
\]
It will be convenient to break this down into an average over data points, and
to abbreviate $\mathbf{x}\widehat{\mathbf{\beta}} = \widehat{\mathbf{m}}$, the
vector of fitted values. Notice that since the predictor variables and the
coefficients aren't changing, our predictions are the same both in and out of
sample --- at point $i$, we will predict $\widehat{m}_i$.
In this notation, then, the expected out-of-sample MSE is
\[
\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}}
\]
We'll compare this to the expected in-sample MSE,
\[
\Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}}
\]
Notice that $\widehat{m}_i$ is a function of $Y_i$ (among other things), so
those are dependent random variables, while $\widehat{m}_i$ and $Y_i^\prime$
are completely statistically independent\footnote{That might sound weird, but
remember we're holding $\mathbf{x}$ fixed in this exercise, so what we mean
  is that knowing $\widehat{m}_i$ doesn't give us any extra information about
$Y_i^\prime$ beyond what we'd get from knowing the values of the $X$
variables.}.
Break this down term by term. What's the expected value of the
$i^{\mathrm{th}}$ in-sample squared error?
\begin{eqnarray}
\Expect{(Y_i - \hat{m}_i)^2} & = & \Var{Y_i-\hat{m}_i} + {\left(\Expect{Y_i - \hat{m}_i}\right)}^2\\
& = & \Var{Y_i} + \Var{\hat{m}_i} - 2\Cov{Y_i, \hat{m}_i} + {\left(\Expect{Y_i} - \Expect{\hat{m}_i}\right)}^2
\end{eqnarray}
The covariance term is not (usually) zero, because, as I just said, $\hat{m}_i$ is a function of, in part, $Y_i$.
On the other hand, what's the expected value of the $i^{\mathrm{th}}$ squared
error on new data?
\begin{eqnarray}
\Expect{(Y_i^\prime - \hat{m}_i)^2} & = & \Var{Y_i^\prime-\hat{m}_i} + {\left(\Expect{Y_i^\prime - \hat{m}_i}\right)}^2\\
& = & \Var{Y_i^\prime} + \Var{\hat{m}_i} - 2\Cov{Y_i^\prime, \hat{m}_i} + {\left(\Expect{Y_i^\prime} - \Expect{\hat{m}_i}\right)}^2
\end{eqnarray}
Since $Y_i^\prime$ is independent of $Y_i$ but has the same distribution, we
get $\Expect{Y_i^\prime} = \Expect{Y_i}$ and $\Var{Y_i^\prime} = \Var{Y_i}$,
while the independence of $Y_i^\prime$ from $\hat{m}_i$ gives
$\Cov{Y_i^\prime, \hat{m}_i} = 0$. So
\begin{eqnarray}
\Expect{(Y_i^\prime - \hat{m}_i)^2} &= & \Var{Y_i} + \Var{\hat{m}_i} + {\left(\Expect{Y_i} - \Expect{\hat{m}_i}\right)}^2\\
& = & \Expect{(Y_i - \hat{m}_i)^2} + 2\Cov{Y_i, \hat{m}_i}
\end{eqnarray}
Averaging over data points,
\[
\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} = \Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}} + \frac{2}{n}\sum_{i=1}^{n}{\Cov{Y_i, \hat{m}_i}}
\]
Clearly, we need to get a handle on that sum of covariances.
For a linear model, though, $\Cov{Y_i, \hat{m}_i} = \sigma^2 H_{ii}$ (Exercise \ref{exercise:cov-of-fit-and-y}).
So, for linear models,
\[
\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} = \Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}} + \frac{2}{n}\sigma^2 \tr{\mathbf{H}}
\]
and we know that with $p$ predictors and one intercept, $\tr{\mathbf{H}} = p+1$
(Homework 5). Thus, for linear models,
\[
\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} = \Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}} + \frac{2}{n}\sigma^2 (p+1)
\]
Of course, we don't actually know the {\em expectation} on the right-hand
side, but we do have a sample estimate of it, which is the in-sample MSE.
If the law of large numbers is still our friend,
\[
\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} \approx \frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2} + \frac{2}{n}\sigma^2 (p+1)
\]
The second term on the right, $(2/n)\sigma^2 (p+1)$, is the {\bf optimism} of
the model --- the amount by which its in-sample MSE systematically
under-estimates its true expected squared error. Notice that this:
\begin{itemize}
\item Grows with $\sigma^2$: more noise gives the model more opportunities to
seem to fit well by capitalizing on chance.
\item Shrinks with $n$: at any fixed level of noise, more data makes it harder
to pretend the fit is better than it really is.
\item Grows with $p$: every extra parameter is another control which can be
adjusted to fit to the noise.
\end{itemize}
Minimizing the in-sample MSE completely ignores the bias from optimism, so it
is guaranteed to pick models which are too large and predict poorly out of
sample. If we could calculate the optimism term, we could at least use an
unbiased estimate of the true MSE on new data.
Of course, we do not actually know $\sigma^2$.
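To make the optimism concrete, here is a small simulation sketch (the sample size, number of predictors, and noise level are invented for illustration): fit a linear model, draw new noise with the same design matrix and true coefficients, and compare the in-sample MSE to the out-of-sample MSE. On any one run the gap is noisy, but averaged over many replicates it should match $(2/n)\sigma^2(p+1)$.
<<>>=
# Illustration with invented numbers: optimism of the in-sample MSE
n <- 100; p <- 10; sigma <- 2
x <- matrix(rnorm(n*p), nrow=n)
beta <- rep(1, p)
y <- x %*% beta + rnorm(n, sd=sigma)
mdl <- lm(y ~ x)
mse.in <- mean(residuals(mdl)^2)
y.prime <- x %*% beta + rnorm(n, sd=sigma)   # same x and beta, new noise
mse.out <- mean((y.prime - fitted(mdl))^2)
c(mse.out - mse.in, 2*sigma^2*(p+1)/n)       # gap vs. theoretical optimism
@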
\section{Mallows's $C_p$ Statistic}
The Mallows $C_p$ statistic just substitutes in a feasible estimator of
$\sigma^2$, which is $\hat{\sigma}^2$ {\em from the largest model we consider}.
This will be an unbiased estimator of $\sigma^2$ if the real model is smaller
(contains a strict subset of the predictor variables), but not vice
versa\footnote{This assumes the largest model must contain the truth!}.
That is, for a linear model with $p+1$ coefficients fit by OLS,
\begin{equation}
\label{eqn:cp}
C_p \equiv \frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2} + \frac{2}{n}\hat{\sigma}^2(p+1)
\end{equation}
The selection rule is to pick the model which minimizes $C_p$.
We can think of $C_p$ as having two parts,
\[
C_p = MSE + (\mathrm{penalty})
\]
From one point of view, the penalty is just an estimate of the bias.
From another point of view, it's a cost we're imposing on models for having
extra parameters. Every new parameter has got to pay that cost by
reducing the MSE by at least a certain amount; if it doesn't, the extra
parameter isn't worth it.
(Before this, we've only been dealing with {\em one} model, so we've not had
to distinguish carefully between the in-sample MSE and the maximum likelihood
estimate of $\sigma^2$. With multiple models floating around,
though, each can have its own MSE, but there is only one true $\sigma^2$,
and we need {\em an} estimate of it.)
For comparing models, we really care about differences:
\begin{equation}
\label{eqn:delta-Cp}
\Delta C_p = MSE_1 - MSE_2 + \frac{2}{n}\hat{\sigma}^2(p_1 - p_2)
\end{equation}
(The extra term for the intercept, being common to both models, doesn't
contribute.)
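Eq.\ \ref{eqn:cp} is easy to compute from a fitted model. Here is a minimal sketch (the function name \texttt{Cp.lm} is mine, not a standard R function); $\hat{\sigma}^2$ would come from the largest candidate model, e.g.\ \texttt{summary(largest.model)\$sigma\^{}2}.
<<>>=
# Sketch of Mallows's C_p for a fitted lm, given a sigma^2 estimate
Cp.lm <- function(mdl, sigma2.hat) {
  n <- nobs(mdl)
  p <- length(coefficients(mdl)) - 1   # number of predictors
  mean(residuals(mdl)^2) + (2/n)*sigma2.hat*(p+1)
}
@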
\paragraph{Alternate form of $C_p$} You will find many references
which define $C_p$ somewhat differently:
\begin{equation}
\label{eqn:funky-Cp}
\frac{n MSE}{\hat{\sigma}^2} - n + 2p
\end{equation}
and say that the optimal value is close to $p$, not close to 0. To
see that this selects exactly the same models as the rule given above,
take a difference between two models, with MSE's $MSE_1, MSE_2$ and $p_1, p_2$
predictors. We get
\[
\frac{n(MSE_1 - MSE_2)}{\hat{\sigma}^2} + 2(p_1 -p_2)
\]
Dividing by $n$ and multiplying by $\hat{\sigma}^2$ gives us back Eq.\
\ref{eqn:delta-Cp}. There are reasons to assert that Eq.\ \ref{eqn:funky-Cp}
should indeed be close to $p$ for the right model (if the Gaussian noise
assumption holds), but Eq.\ \ref{eqn:cp} is a good estimate of the
out-of-sample error, and a good model selection rule, much more broadly.
\subsection{$R^2$ and Adjusted $R^2$}
Recall that
\[
R^2 = 1 - \frac{MSE}{s^2_Y}
\]
Picking a model by maximizing $R^2$ is thus equivalent to picking a model
by minimizing MSE. It is therefore stupid for exactly the same reasons
that minimizing MSE across models is stupid.
Recall that the adjusted $R^2$ is
\[
R^2_{adj} = 1 - \frac{MSE \frac{n}{n-p-1}}{s^2_Y}
\]
That is, it's $R^2$ with the unbiased estimator of $\sigma^2$. Maximizing adjusted
$R^2$ therefore corresponds to minimizing that unbiased estimator. What does that translate to?
\begin{eqnarray}
MSE\frac{n}{n-p-1} & = & MSE \frac{1}{1-(p+1)/n}\\
& \approx & MSE\left(1 + \frac{p+1}{n}\right)\\
& = & MSE + MSE\frac{p+1}{n}
\end{eqnarray}
where the approximation becomes exact as $n\rightarrow\infty$ with $p$
fixed\footnote{Expand $1/(1-u)$ as the geometric series $1 + u + u^2 +
  \ldots$, and truncate the series at first order. (If $u$ is small, $u^2$ is
  tiny, and the higher powers microscopic.)}. Even for the completely right
model, where $MSE$ is a consistent estimator of $\sigma^2$, the
correction or penalty is only half as big as we've seen it should be.
Selecting models using adjusted $R^2$ is not completely stupid, as maximizing
$R^2$ is, but it is still not going to work very well.
\section{Akaike Information Criterion (AIC)}
The great Japanese statistician Hirotugu Akaike proposed a famous model
selection rule which also has the form of ``in-sample performance plus
penalty''. What has come to be called the {\bf Akaike information
criterion} (AIC) is
\[
AIC(S) \equiv L_S - \mathrm{dim}(S)
\]
where $L_S$ is the log likelihood of the model $S$, evaluated at the
maximum likelihood estimate, and $\mathrm{dim}(S)$ is the dimension of $S$, the
number of adjustable parameters it has. Akaike's rule is to pick the model
which maximizes AIC\footnote{Actually, in his original paper
\citep{Akaike-AIC}, he proposed using {\em twice} this, to simplify some
calculations involving chi-squared distributions. Many subsequent authors
have since kept the factor of 2, which of course will not change which model
is selected. Also, some authors define AIC as negative of this, and then
minimize it; again, clearly the same thing.}.
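In R, the built-in \texttt{AIC} function uses the $-2 L_S + 2\,\mathrm{dim}(S)$ scaling (Akaike's factor of two, negated), so in practice one {\em minimizes} it; the model it ranks first is the same. A quick check on a built-in data set:
<<>>=
# R's AIC() returns -2*(log-likelihood - dim), so smaller is better
mdl.cars <- lm(dist ~ speed, data=cars)   # built-in example data
AIC(mdl.cars)
ll <- logLik(mdl.cars)
-2*(as.numeric(ll) - attr(ll, "df"))      # agrees with AIC(mdl.cars)
@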
The reason for this definition is that Akaike showed $AIC/n$ is an unbiased
estimate of the expected log-probability the estimated parameters will give to
a new data point which it hasn't seen before, if the model is right. This is
the natural counterpart of expected squared error for more general
distributions than the Gaussian. If we do specialize to linear-Gaussian
models, then we've seen (Lecture 10) that
\[
L = -\frac{n}{2}(1+\log{2\pi}) - \frac{n}{2}\log{MSE}
\]
and the dimension of the model is $p+2$ (because $\sigma^2$ is also an
adjustable parameter). Notice that $-\frac{n}{2}(1+\log{2\pi})$ doesn't
involve the parameters at all. If we compare AICs for two models, with mean
squared errors in-sample of $MSE_1$ and $MSE_2$, and one with $p_1$ predictors
and the other with $p_2$, the difference in AICs will be
\[
\Delta AIC = -\frac{n}{2}\log{MSE_1} + \frac{n}{2}\log{MSE_2} - (p_1 - p_2)
\]
To relate this to $C_p$, let's write $MSE_2 = MSE_1 + \Delta MSE$. Then
\begin{eqnarray}
\Delta AIC & = & -\frac{n}{2}\log{MSE_1} + \frac{n}{2}\log{MSE_1\left(1+\frac{\Delta MSE}{MSE_1}\right)} - (p_1 - p_2)\\
& = & \frac{n}{2}\log{\left(1+\frac{\Delta MSE}{MSE_1}\right)} - (p_1 - p_2)
\end{eqnarray}
Now let's suppose that model 1 is actually the correct model, so $MSE_1
\approx \hat{\sigma}^2$, and that $\Delta MSE$ is small compared to
$\hat{\sigma}^2$, so\footnote{Taylor expand $\log{(1+u)}$ around $u=0$ to get
  $\log{(1+u)} \approx u$, for $u$ close to 0.}
\begin{eqnarray}
\Delta AIC & \approx & \frac{n}{2}\frac{\Delta MSE}{\hat{\sigma}^2} - (p_1 - p_2)\\
-\frac{2 \hat{\sigma}^2}{n}\Delta AIC & \approx & -\Delta MSE + \frac{2}{n}\hat{\sigma}^2(p_1 - p_2) = \Delta C_p
\end{eqnarray}
So, if one of the models we're looking at is actually the correct model, and
the others aren't too different from it, picking by maximizing AIC will give
the same answer as picking by minimizing $C_p$.
\paragraph{Other Uses of AIC}
AIC can be applied whenever we have a likelihood. It is therefore used for
tasks like comparing models of probability distributions, or predictive models
where the whole distribution is important. $C_p$, by contrast, really only
makes sense if we're trying to do regression and want to use squared error.
\subsection{Why $-\mathrm{dim}(S)$?}
Akaike had a truly brilliant argument for subtracting a penalty equal to the
number of parameters from the log-likelihood, which is too pretty not to at
least sketch here.\footnote{Nonetheless, this subsection is optional.}
Generically, say that the parameter vector is $\theta$, and its true value is
$\theta^*$. (For linear regression with Gaussian noise, $\theta$ consists of
all $p+1$ coefficients plus $\sigma^2$.) The length of this vector, which is
$\mathrm{dim}(S)$, is let's say $d$. (For linear regression with Gaussian
noise, $d=p+2$.) The maximum likelihood estimate is $\hat{\theta}$. We know
that the derivative of the likelihood is zero at the MLE:
\[
\nabla L(\hat{\theta}) = 0
\]
Let's do a Taylor series expansion of $\nabla L(\theta)$ around the true
parameter value $\theta^*$:
\[
\nabla L(\theta) = \nabla L(\theta^*) + \nabla\nabla L(\theta^*) \, (\theta - \theta^*)
\]
Here $ \nabla\nabla L(\theta^*)$ is the $d\times d$ matrix of second partial
derivatives of $L$, evaluated at $\theta^*$. This is called the {\bf Hessian},
and would traditionally be written $\mathbf{H}$, but that would lead to confusion with the hat matrix, so I'll call it $\mathbf{K}$. Therefore the
Taylor expansion for the gradient of the log-likelihood is
\[
\nabla L(\theta) = \nabla L(\theta^*) + \mathbf{K} (\theta-\theta^*)
\]
Applied to the MLE,
\[
\mathbf{0} = \nabla L(\theta^*) + \mathbf{K} (\hat{\theta} - \theta^*)
\]
or
\[
\hat{\theta} = \theta^* - \mathbf{K}^{-1} \nabla L(\theta^*)
\]
What is the {\em expected} log-likelihood, on new data, of $\hat{\theta}$?
Call this expected log-likelihood $\ell$ (using a lower-case letter to
indicate that it is non-random). Doing another Taylor series,
\[
\ell(\theta) \approx \ell(\theta^*) + (\theta-\theta^*)^T \nabla \ell(\theta^*) + \frac{1}{2}(\theta-\theta^*)^T \nabla \nabla \ell(\theta^*) (\theta-\theta^*)
\]
However, it's not hard to show that the expected log-likelihood is
always\footnote{Except for quite weird models.} maximized by the true
parameters, so $\nabla \ell(\theta^*) = 0$. (The same argument also shows
$\Expect{\nabla L(\theta^*)} = 0$.)
Call the Hessian in this Taylor expansion $\mathbf{k}$. (Again, notice the
lower-case letter for a non-random quantity.) We have
\[
\ell(\theta) \approx \ell(\theta^*) + \frac{1}{2}(\theta-\theta^*)^T \mathbf{k} (\theta-\theta^*)
\]
Apply this to the MLE:
\[
\ell(\hat{\theta}) \approx \ell(\theta^*) + \frac{1}{2}{\left(\mathbf{K}^{-1} \nabla L(\theta^*)\right)}^T \mathbf{k} \, \mathbf{K}^{-1} \nabla L(\theta^*)
\]
Taking expectations,
\[
\Expect{\ell(\hat{\theta})} \approx \ell(\theta^*) + \frac{1}{2}\tr{\mathbf{K}^{-1} \mathbf{k} \mathbf{K}^{-1} \mathbf{J}}
\]
where $\Var{\nabla L(\theta^*)} = \mathbf{J}$. For large $n$,
$\mathbf{K}$ converges on $\mathbf{k}$, so this simplifies to
\[
\Expect{\ell(\hat{\theta})} \approx \ell(\theta^*) + \frac{1}{2}\tr{\mathbf{k}^{-1}\mathbf{J}}
\]
This still leaves things in terms of $\ell(\theta^*)$, which of course we don't
know, but now we do another Taylor expansion, this time of $L$ around
$\hat{\theta}$:
\[
L(\theta^*) \approx L(\hat{\theta}) + \frac{1}{2}(\theta^* - \hat{\theta})^T \nabla \nabla L(\hat{\theta}) (\theta^* - \hat{\theta})
\]
so
\[
L(\theta^*) \approx L(\hat{\theta}) + \frac{1}{2}(\mathbf{K}^{-1} \nabla L(\theta^*))^T \nabla \nabla L(\hat{\theta}) (\mathbf{K}^{-1} \nabla L(\theta^*))
\]
For large $n$, $\nabla\nabla L(\hat{\theta}) \rightarrow \nabla\nabla L(\theta^*) \rightarrow \mathbf{k}$. So,
again taking expectations,
\[
\ell(\theta^*) \approx \Expect{L(\hat{\theta})} + \frac{1}{2}\tr{\mathbf{k}^{-1}\mathbf{J}}
\]
Putting these together,
\[
\Expect{\ell(\hat{\theta})} \approx \Expect{L(\hat{\theta})} + \tr{\mathbf{k}^{-1}\mathbf{J}}
\]
An unbiased estimate is therefore
\[
L(\hat{\theta}) + \tr{\mathbf{k}^{-1}\mathbf{J}}
\]
Finally, a fundamental result (the ``Fisher identity'') says that for
well-behaved models, {\em if} the model is correct, then
\[
\Var{\nabla L(\theta^*)} = -\nabla\nabla \ell(\theta^*)
\]
or $\mathbf{J} = -\mathbf{k}$. Hence, if the model is correct, our
unbiased estimate is just
\[
L(\hat{\theta}) - \tr{\mathbf{I}}
\]
and of course $\tr{\mathbf{I}} = d$.
There were, as you'll notice, several steps where we made a bunch of
approximations. Some of these approximations (especially those involving the
Taylor expansions) can be shown to be OK asymptotically (i.e., as
$n\rightarrow\infty$) by more careful math. The last step, however,
where we invoke the Fisher identity, is rather more dubious. (After all,
the models we're comparing can hardly all contain the true distribution.) A
somewhat more robust version of AIC is therefore to use as the criterion
\[
L(\hat{\theta}) + \tr{\mathbf{K}^{-1}\mathbf{J}}
\]
with $\mathbf{J}$ and $\mathbf{K}$ estimated from the data.
\section{Leave-one-out Cross-Validation (LOOCV)}
When looking at influential points and outliers, we considered omitting one
point from the data set, estimating the model, and then trying to predict that
one data point. The {\bf leave-one-out} fitted value for data point $i$ is
$\hat{m}^{(-i)}_i$, where the subscript $(-i)$ indicates that point $i$ was
left out in calculating this fit. The {\bf leave-one-out cross-validation score} of the model is
\[
LOOCV = \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \hat{m}^{(-i)}_i)^2}
\]
(Many more old-fashioned regression textbooks look at $n \cdot LOOCV$, and call
it PRESS, the ``predicted residual sum of squares''.)
The story for cross-validation is pretty compelling: we want to know if our
model can generalize to new data, so {\em see} how well it generalizes to new
data. Leaving out each point in turn ensures that the set of points on
which we try to make predictions is just as representative of the whole
population as the original sample was. Fortunately, this is one of those cases
where a compelling story is actually true: LOOCV is an essentially unbiased
estimate of the generalization error.
\subsection{Short-cut Based on Leverage}
Re-estimating the model $n$ times would be seriously time-consuming, but there
is fortunately a short-cut:
\[
LOOCV = \frac{1}{n}\sum_{i=1}^{n}{\left( \frac{Y_i - \hat{m}_i}{1-H_{ii}}\right)^2}
\]
The numerator inside the square is just the residual of the model fit to the
full data. This gets divided by $1-H_{ii}$, which is also something we can
calculate with just one fit to the model. (The denominator says that the
residuals for high-leverage points count more, and those for low-leverage
points count less. If the model is going out of its way to match $Y_i$ (high
leverage $H_{ii}$) and it still can't fit it, that's worse than the same sized
residual at a point the model doesn't really care about (low leverage).)
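In R, this short-cut is one line, using the built-in \texttt{hatvalues} function to get the $H_{ii}$ (the function name \texttt{loocv.lm} is mine, not standard):
<<>>=
# LOOCV score from a single model fit, via the leverage short-cut
loocv.lm <- function(mdl) {
  mean((residuals(mdl)/(1 - hatvalues(mdl)))^2)
}
loocv.lm(lm(dist ~ speed, data=cars))
@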
The gap between LOOCV and the MSE can be thought of as a penalty, just like
with $C_p$ or AIC. The penalty doesn't have such a nice mathematical
expression, but it's well-defined and easy for us to calculate.
It also converges to the penalty $C_p$ applies as $n$ grows. To help see this,
first observe that the $H_{ii}$ must be getting small. (We know that
$\sum_{i}{H_{ii}} = p+1$.) Then\footnote{Expand $(1-u)^{-2}$ as $1 + 2u +
  \ldots$, and again truncate at first order.} $(1-H_{ii})^{-2} \approx 1 + 2
H_{ii}$, and, since each squared residual is roughly $\sigma^2$,
\[
LOOCV \approx \frac{1}{n}\sum_{i=1}^{n}{( Y_i - \hat{m}_i)^2(1+2 H_{ii})} \approx MSE + \frac{2}{n} \sigma^2 \tr{\mathbf{H}}
\]
\paragraph{Cross-validation with log-likelihood} The leave-one-out idea can
also be applied for any model where we make a probabilistic prediction.
Instead of measuring mean squared error, we measure the negative log
probability density the model assigns to the actual left-out point. (Negative,
so that a lower score is still better.) With Gaussian noise, this comes to the
same thing as the MSE, of course.
\subsection{Summing Up $C_p$, AIC, LOOCV}
Under a very broad range of circumstances, there are theorems which say,
roughly, the following:
\begin{quotation}
As $n\rightarrow \infty$, the expected out-of-sample MSE of the model picked
by leave-one-out cross-validation is close to that of the best model
considered.
\end{quotation}
The conditions for these results do {\em not} require that any of the models
considered be true, or that the true model have Gaussian noise or even be
linear.
As we've seen, for large $n$ leave-one-out and Mallows's $C_p$ become extremely
similar, and will pick the same model, and so will AIC, if one of the models is
right. So they will also pick models which predict almost as well as the best
of the models we're working with. Since $C_p$ and AIC involve less calculation
than leave-one-out, they have advantages when $n$ is large. Against this,
there don't seem to be any situations where $C_p$ or AIC pick models with good
predictive performance but leave-one-out does not. The best way to think about
$C_p$ and AIC is that they are fast approximations to the more fundamental
quantity, which is leave-one-out.
On the other hand, one can {\em also} prove the following:
\begin{quotation}
As $n\rightarrow\infty$, if the true model is among those being compared,
LOOCV, $C_p$ and AIC will all tend to pick a {\em strictly larger} model than
the truth.
\end{quotation}
That is, all three criteria tend to prefer models which are bigger than
the true model, even when the true model is available to them. They are
``not consistent for model selection''.
The problem is that while these methods give unbiased estimates of the
generalization error, that doesn't say anything about the variance of the
estimates. Models with more parameters have higher variance, and the penalty
applied by these methods isn't strong enough to overcome the chance of
capitalizing on that variance.
\section{Other Model Selection Criteria}
While many, many other model selection criteria have been proposed,
two are particularly important.
\subsection{$k$-Fold Cross-Validation}
In leave-one-out cross-validation, we omitted each data point in turn,
and tried to predict it.
$k$-fold cross-validation is somewhat different, and goes as follows.
\begin{itemize}
\item Randomly divide the data into $k$ equally-sized parts, or ``folds''.
\item For each fold
\begin{itemize}
\item Temporarily hold back that fold, calling it the ``testing set''.
\item Call the other $k-1$ folds, taken together, the ``training set''.
\item Estimate each model on the training set.
\item Calculate the MSE of each model on the testing set.
\end{itemize}
\item Average MSEs over folds.
\end{itemize}
We then pick the model with the lowest MSE, averaged across testing sets.
The point of this is just like the point of leave-one-out: the models are
compared only on data which they didn't get to see during estimation.
Indeed, leave-one-out is the special case of $k$-fold cross-validation
where $k=n$. The disadvantage of doing that is that in leave-one-out,
all of the training sets are very similar (they share $n-2$ data points),
so averaging over folds does very little to reduce variance. For
moderate $k$ --- people typically use 5 or 10 --- $k$-fold CV tends
to produce very good model selection results.
Like leave-one-out CV, $k$-fold cross-validation can be applied to any loss
function, such as the proportion of cases mis-classified, or negative
log-likelihood.
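The steps above can be sketched in a few lines of R for a single linear-model formula (the function \texttt{cv.lm} is my own illustration, not a standard one):
<<>>=
# Minimal k-fold cross-validation for one lm formula (illustration only)
cv.lm <- function(formula, data, k=5) {
  fold <- sample(rep(1:k, length.out=nrow(data)))  # random fold labels
  fold.mses <- numeric(k)
  for (f in 1:k) {
    train <- data[fold != f, ]   # k-1 folds for estimation
    test <- data[fold == f, ]    # held-back fold for evaluation
    mdl <- lm(formula, data=train)
    response <- all.vars(formula)[1]
    fold.mses[f] <- mean((test[[response]] - predict(mdl, newdata=test))^2)
  }
  mean(fold.mses)
}
cv.lm(dist ~ speed, data=cars, k=5)
@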
\subsection{BIC}
A more AIC-like criterion is the ``Bayesian\footnote{Bayesianism is the idea
that we ought to have probabilities for parameter values and for models, and
not just for random variables (or, said another way, to treat parameters and
models as also random variables), and update those probabilities as we see
more events using Bayes's rule. It is a controversial position within
statistics and philosophy of science, with many able and learned supporters,
and equally able and learned opponents. (It is also the only position in
statistics and philosophy of science I know of which has an online cult
dedicated to promoting it, alongside reading certain works of Harry Potter
fanfic, and trying not to think about the possibility a future
superintelligent computer will simulate your being tortured.)} information
criterion'' introduced by \citet{Schwarz-BIC}. The name is quite
misleading\footnote{The truly Bayesian position is not to {\em select} a model
at all, but rather to maintain a probability distribution over all models you
think possible.}, but irrelevant; it's got the exact same idea of
penalizing the log-likelihood with the number of parameters, but using a penalty which gets bigger with $n$:
\[
BIC(S) = L_S - \frac{\log{n}}{2}\mathrm{dim}(S)
\]
This is a stronger penalty than AIC applies, and this has consequences:
\begin{quotation}
As $n\rightarrow\infty$, if the true model is among those BIC can select
among, BIC will tend to pick the true model.
\end{quotation}
Of course there are various conditions attached to this, some of them
quite technical, but it's generally true for IID samples, for regression
modeling, for many sorts of time series model, etc. Unfortunately,
the model selected by BIC will tend to predict less well than the one
selected by leave-one-out cross-validation or AIC.
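R's built-in \texttt{BIC} function uses the same $-2$ scaling as its \texttt{AIC}, returning $-2 L_S + \log(n)\,\mathrm{dim}(S)$, so again one minimizes it:
<<>>=
mdl.cars <- lm(dist ~ speed, data=cars)
BIC(mdl.cars)                    # -2*logLik + log(n)*dim; smaller is better
ll <- logLik(mdl.cars)
-2*as.numeric(ll) + log(nobs(mdl.cars))*attr(ll, "df")   # the same number
@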
\section{Stepwise Model Selection}
One way to automatically select a model is to begin with the largest
model you can, and then prune it, which can be done in several ways:
\begin{itemize}
\item Eliminate the least-significant coefficient.
\item Pick your favorite model selection criterion, consider deleting each
coefficient in turn, and pick the sub-model with the best value of the
criterion.
\end{itemize}
Having eliminated a variable, one then re-estimates the model, and repeats the
procedure. Stop when either all the remaining coefficients are significant
(under the first option), or nothing can be eliminated without worsening the
criterion.
(What I've described is {\bf backwards} stepwise model selection. {\bf
Forward} stepwise model selection starts with the intercept-only model and
adds variables in the same fashion. There are, naturally, forward-backward
hybrids.)
Stepwise model selection is a {\bf greedy} procedure: it takes the move which
does the most to immediately improve the criterion, without considering the
consequences down the line. There are very, very few situations where it is
consistent for model selection, or (in its significance-testing version) where
it even does a particularly good job of coming up with predictive models, but
it's surprisingly popular.
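R automates criterion-based stepwise selection in its built-in \texttt{step} function. Schematically (assuming a data frame \texttt{df} with response \texttt{y}; not run here):
<<eval=FALSE>>=
big <- lm(y ~ ., data=df)          # start from the largest model
step(big, direction="backward")    # prune by AIC, one term at a time
# direction="forward" or "both" give the other variants; setting
# k=log(n) in the call makes step() use the BIC penalty instead of AIC's
@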
\section{Inference after Selection}
\label{sec:inference-after-selection}
All of the inferential statistics we have done in earlier lectures presumed
that our choice of model was completely fixed, and not at all dependent on the
data. If different data sets would lead us to use different models, and our
data are (partly) random, then which model we're using is also random. This
leads to some extra uncertainty in, say, our estimate of the slope on $X_1$,
which is {\em not} accounted for by our formulas for the sampling
distributions, hypothesis tests, confidence sets, etc.
A very common response to this problem, among practitioners, is to ignore it,
or at least hope it doesn't matter. This can be OK, if the data-generating
distribution forces us to pick one model with very high probability, or if all
of the models we might pick are very similar to each other. Otherwise,
ignoring it leads to nonsense.
Here, for instance, I simulate 200 data points where the $Y$ variable
is a standard Gaussian, and there are 100 independent predictor
variables, all also standard Gaussians, independent of each other
{\em and of} $Y$:
<<>>=
n <- 200; p <- 100
y <- rnorm(n)
x <- matrix(rnorm(n*p),nrow=n)
df <- data.frame(y=y,x)
mdl <- lm(y~., data=df)
@
Of the \Sexpr{p} predictors, \Sexpr{sum(coefficients(summary(mdl))[-1,4]<0.05)} have
$t$-statistics which are significant at the $0.05$ level or less. (The
expected number would be \Sexpr{p*0.05}.) If I select the model using just
those variables\footnote{Exercise: Explain all the ways in which this is a bad
idea. Now imagine explaining the same thing to your boss, who took
econometrics 20 years ago, and wants to know why he can't just follow the
stars.}, I get the following:
<<>>=
stars <- 1+which(coefficients(summary(mdl))[-1,4]<0.05) # Why 1+?
mdl.2 <- lm(y~., data=df[,c(1,stars)])
summary(mdl.2)
@
Notice that final over-all $F$ statistic: it's testing whether including those
variables fits better than an intercept-only model, and saying it thinks it
does, with a definitely significant $p$-value. This is the case even though,
by construction, the response is {\em completely independent} of {\em all}
predictors. This is not a fluke: if you re-run my simulation many times, your
$p$-values in the full $F$ test will not be uniformly distributed (as they
would be on all 100 predictors), but rather will have a distribution strongly
shifted over to the left. Similarly, if we looked at the confidence intervals,
they would be much too narrow.
These issues do not go away if the true model isn't ``everything is independent
of everything else'', but rather has some structure. Because we picked the
model to predict well on this data, if we then run hypothesis tests on that
same data, they'll be too likely to tell us everything is significant, and our
confidence intervals will be too narrow. Doing statistical inference on the
same data we used to select our model is just broken. It may not always be as
spectacularly broken as in my demo above, but it's still broken.
There are three ways around this. One is to pretend the issue doesn't exist;
as I said, this is popular, but it's got nothing else to recommend it.
Another, which is an area of very active research currently in statistics, is
to try to come up with clever technical adjustments to the inferential
statistics\footnote{If you're curious, ask Profs.\ Tibshirani or G'Sell about
this.}. The third approach, which is in many ways the simplest, is to use
{\em different data sets} to select a model and to do inference within the
selected model\footnote{Technically, there is a fourth possible approach, which
is to select the model completely at random, and then do inference within it.
This may sound like a joke, but there are actually situations, like testing
for a difference in means between high-dimensional vectors, where it's
perfectly reasonable.}.
\subsection{Data Splitting}
Data splitting is (for regression) a very simple procedure:
\begin{itemize}
\item Randomly divide your data set into two parts.
\item Calculate your favorite model selection criterion for all your candidate
models using only the first part of the data. Pick one model as the winner.
\item Re-estimate the winner, and calculate all your inferential statistics,
using only the other half of the data.
\end{itemize}
(Division into two equal halves is optional, but usual.)
Because the winning model is statistically independent of the second half
of the data, the confidence intervals, hypothesis tests, etc., can treat
it as though that model were fixed {\em a priori}. Since we're only
using $n/2$ data points to calculate confidence intervals (or whatever),
they will be somewhat wider than if we really had fixed the model
in advance and used all $n$ data points, but that's the price we pay for
having to select a model based on data.
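As a minimal sketch of the procedure (the data frame, variable names, and candidate formulas here are all invented for illustration, and I use AIC as the selection criterion just as one example):

```r
# Sketch of data splitting on simulated data; df, x1, x2 and the
# candidate formulas are made up for this illustration
set.seed(21)
n <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 2 * df$x1 + rnorm(n)
# Step 1: randomly divide the data into two halves
in.half1 <- sample(1:n, size = n/2, replace = FALSE)
half1 <- df[in.half1, ]
half2 <- df[-in.half1, ]
# Step 2: select among candidate models using the first half only
candidates <- list(y ~ x1, y ~ x2, y ~ x1 + x2)
fits1 <- lapply(candidates, lm, data = half1)
winner <- candidates[[which.min(sapply(fits1, AIC))]]
# Step 3: re-estimate the winner on the second half; the inferential
# statistics from this summary can be taken at face value
summary(lm(winner, data = half2))
```

Because `half2` never entered the selection step, the $p$-values and confidence intervals in the final summary are legitimate (just based on $n/2$ points).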
\section{R Practicalities}
$R^2$ and adjusted $R^2$ are calculated by the \texttt{summary} function for
\texttt{lm} objects, if --- Heaven forbid --- you should ever need them.
So, more practically, is the in-sample root mean squared error, using the
unbiased estimator:
<<eval=FALSE>>=
mdl <- lm(something ~ other_things, data=df)
summary(mdl)$r.squared
summary(mdl)$adj.r.squared
summary(mdl)$sigma
@
The un-adjusted MSE is also easily calculated:
<<eval=FALSE>>=
mean(residuals(mdl)^2)
@
The \texttt{AIC} function knows how to work with models produced by
\texttt{lm}; it uses an alternate definition of AIC which is $-2 \times$ the
one I gave above (so smaller AIC is preferred). Similarly for the \texttt{BIC}
function.
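To see the scaling convention concretely (a sanity check on simulated data, not one of the lecture's examples): for an \texttt{lm} fit, \texttt{AIC} returns $-2$ times the log-likelihood plus twice the number of estimated parameters, where $\sigma^2$ counts as a parameter.

```r
# Check R's AIC scaling convention on a small simulated fit
set.seed(1)
df <- data.frame(x = rnorm(20))
df$y <- df$x + rnorm(20)
mdl <- lm(y ~ x, data = df)
k <- length(coefficients(mdl)) + 1   # intercept, slope, plus sigma^2
AIC(mdl)                             # R's convention: smaller is better
-2 * as.numeric(logLik(mdl)) + 2 * k # the same number, by hand
```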
The \texttt{step} function will do stepwise model selection based on AIC,
either forward or backward. Manipulating the arguments also allows for doing
BIC. (See the help file.) {\em Warning:} By default this prints out a lot of
information about every model it looks at; consider setting \texttt{trace=0}.
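For instance (a sketch on simulated data, with made-up variable names), backward stepwise selection with the BIC penalty amounts to setting \texttt{k=log(n)}:

```r
# Backward stepwise selection; k = 2 (the default) penalizes like AIC,
# k = log(n) penalizes like BIC.  Data and variables are invented.
set.seed(2)
n <- 50
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- df$x1 + rnorm(n)       # only x1 actually matters
full <- lm(y ~ ., data = df)
picked <- step(full, direction = "backward", k = log(n), trace = 0)
formula(picked)                # should retain x1
```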
For leave-one-out cross-validation, the most straightforward approach is to use the following
function:
<<>>=
# Calculate LOOCV score for a linear model
# Input: a model as fit by lm()
# Output: leave-one-out CV score
cv.lm <- function(mdl) {
  return(mean((residuals(mdl)/(1 - hatvalues(mdl)))^2))
}
@
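The leverage-based short-cut in that function is exact for linear models; as a sanity check (on simulated data invented for the purpose), it agrees with actually refitting the model $n$ times:

```r
# Verify the leverage short-cut for LOOCV against brute-force refitting
set.seed(3)
df <- data.frame(x = rnorm(30))
df$y <- df$x + rnorm(30)
mdl <- lm(y ~ x, data = df)
# The short-cut: rescale each residual by 1 - leverage
shortcut <- mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)
# Brute force: drop each point, refit, predict the held-out point
brute <- mean(sapply(1:nrow(df), function(i) {
  fit.minus.i <- lm(y ~ x, data = df[-i, ])
  (df$y[i] - predict(fit.minus.i, newdata = df[i, , drop = FALSE]))^2
}))
c(shortcut = shortcut, brute.force = brute)  # identical, up to rounding
```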
For $k$-fold cross-validation, the easiest option at this stage is to
use the \texttt{cv.glm} function in the package \texttt{boot}\footnote{Later in this course and in 402, we
will write our own CV code, partly as character building and partly because
there's nothing quite like doing this to actually get how it works.}.
Note that this requires you to fit your model with the \texttt{glm}
function, not with \texttt{lm}, and that you will really only be
interested in the \texttt{delta} component of what \texttt{cv.glm}
returns. (See the help file, especially the examples at the end.)
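A minimal sketch of the workflow (on simulated data; the only non-default argument here is the number of folds \texttt{K}):

```r
# k-fold CV via boot::cv.glm, on invented data
library(boot)
set.seed(4)
df <- data.frame(x = rnorm(100))
df$y <- df$x + rnorm(100)
fit <- glm(y ~ x, data = df)   # default gaussian family: same fit as lm
cv5 <- cv.glm(df, fit, K = 5)
cv5$delta   # 1st entry: raw 5-fold CV estimate of the MSE;
            # 2nd entry: a bias-corrected version of the same
```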
Nobody seems to have written a function for calculating $C_p$.
Here is one.
<<>>=
# Calculate Mallows' Cp for a list of linear models
# Input: List of models, all fit by lm
# Output: Vector of Cp statistics
# Presumes: All models are nested inside the largest model; all models
# fit on a common data set
Cp.lm <- function(mdl.list) {
  # How many samples do we have?
  # Presumes all models fit to the same data
  n <- nobs(mdl.list[[1]])
  # Extract the number of degrees of freedom for each model
  DoFs <- sapply(mdl.list, function(mdl) { sum(hatvalues(mdl)) })
  # Extract the MSEs of each model
  MSEs <- sapply(mdl.list, function(mdl) { mean(residuals(mdl)^2) })
  # Which model had the most parameters?
  # Presuming that model includes all the others as special cases
  biggest <- which.max(DoFs)
  # Use the nesting model's MSE to estimate sigma^2
  sigma2.hat <- MSEs[[biggest]]*n/(n - DoFs[[biggest]])
  Cp <- MSEs + 2*sigma2.hat*DoFs/n
  return(Cp)
}
@
<<eval=FALSE>>=
# Example of usage:
Cp.lm(list(mdl1, mdl2, mdl3))
@
\subsection{A Demo}
We'll do polynomial regression with just one $X$ variable; this way we can
keep throwing in as many terms as we need to, in order to make the point.
$X$ will be uniformly distributed on the interval $[-2,2]$, and when
we use a $q^{\mathrm{th}}$ order polynomial, we'll set
\[
Y = \sum_{j=1}^{q}{(-1)^j X^j} + \epsilon
\]
with $\epsilon$ having our usual Gaussian distribution with mean
0 and standard deviation $0.1$.
Here's code to simulate from the model:
<<>>=
# Simulate variable-degree polynomial with fixed X and coefficients
# Inputs: Number of points to simulate; degree of polynomial
# Output: Data frame with x and y columns
sim.poly <- function(n, degree) {
  x <- runif(n, min=-2, max=2)
  poly.x <- poly(x, degree=degree, raw=TRUE)
  alternating.signs <- rep(c(-1,1), length.out=degree)
  sum.poly <- poly.x %*% alternating.signs
  y <- sum.poly + rnorm(n, 0, 0.1)
  return(data.frame(x=x, y=y))
}
@
And here is code to fit many polynomials to it:
<<>>=
# Fit multiple univariate polynomials to the same data
# Input: data frame; maximum degree of polynomial
# Output: List of estimated models
# Presumes: data frame has columns called x and y; y is response; maximum
# degree is an integer >= 1.
poly.fit <- function(df, max.degree) {
  lapply(1:max.degree, function(deg) { lm(y ~ poly(x, degree=deg), data=df) })
}
@
And to apply multiple selection criteria to a list of models:
<<>>=
# Apply multiple model selection criteria to a list of models
# Inputs: list of models
# Outputs: Vector, indicating which model from the list was picked by
# each criterion
# Presumes: all models are set up to work with all the criteria functions
#   applied here (true if all models were fit by lm()); all models were
#   fit on the same data set (otherwise, comparisons are meaningless)
selectors <- function(mdl.list) {
  Rsq <- which.max(sapply(mdl.list, function(mdl) { summary(mdl)$r.squared }))
  Rsq.adj <- which.max(sapply(mdl.list, function(mdl) { summary(mdl)$adj.r.squared }))
  Cp <- which.min(Cp.lm(mdl.list))
  LOOCV <- which.min(sapply(mdl.list, cv.lm))
  AIC <- which.min(sapply(mdl.list, AIC))
  BIC <- which.min(sapply(mdl.list, BIC))
  choices <- c(Rsq=Rsq, Rsq.adj=Rsq.adj, Cp=Cp, LOOCV=LOOCV,
               AIC=AIC, BIC=BIC)
  return(choices)
}
@
To put this all together, let's see what gets picked if we simulate 20
data points from the quadratic, and allow models of up to order 10:
<<>>=
selectors(poly.fit(sim.poly(n=20, degree=2), max.degree=10))
@
Of course, one run doesn't mean much, so let's do this a bunch of times:
<<>>=
summary(t(replicate(1000,
selectors(poly.fit(sim.poly(n=20, degree=2),
max.degree=10)))))
@
This is showing us the summary statistics for the degree of the polynomial
model selected according to each criterion. (Why do I put in the transpose?)
Remember that the right degree here is 2, so $R^2$ is (as usual) useless, and
adjusted $R^2$ is little better. The others all at least do something
roughly right, though AIC does worse than the other three.
Of course, $n=20$ isn't very much information\footnote{Though it seems to be enough for leave-one-out or BIC.}, so let's increase that
to $n=1000$.
<<>>=
summary(t(replicate(1000,
selectors(poly.fit(sim.poly(n=1000, degree=2),
max.degree=10)))))
@
Adjusted $R^2$ is hopeless, leave-one-out does essentially the same as AIC or
$C_p$ (and all have a 25\% probability of picking a model which is too big),
and BIC is, as expected, much more conservative.
You can experiment with seeing what happens if you change the true
order of the model, or the range of orders compared by the model
selectors, or make some of the higher-order polynomial terms close to but
not quite zero, etc., etc.
\section{Further Reading}
The best reference on model selection I know of, by far, is
\citet{Claeskens-Hjort-model-selection}; unfortunately, much of the theory is
beyond the level of this course, but some of the earlier chapters should not
be. \citet{Hansen-challenges-for-model-selection} provides interesting
perspectives based on extensive experience in econometrics.
Cross-validation goes back in statistics to the 1950s, if not earlier, but
did not become formalized as a tool until the 1970s, with the work of
\citet{Stone1974}, \citet{Geisser-predictive-sample-reuse} and
\citet{Geisser-Eddy-predictive-approach}. (The last paper, written in 1977,
made it perfectly clear the approach could be used on log-likelihood,
mis-classification rates, etc., as well as squared error.) It was adopted,
along with many other statistical ideas, by computer scientists during the
period in the late 1980s--early 1990s when the modern area of ``machine
learning'' emerged from (parts of) earlier areas called ``artificial
intelligence'', ``pattern recognition'', ``connectionism'', ``neural
networks'', or indeed ``machine learning''. Subsequently, many of the
scientific descendants of the early machine learners forgot where their ideas
came from, to the point where many people now think cross-validation is
something computer science contributed to data analysis. For a recent survey
of cross-validation techniques and their uses, see \citet{Arlot-Celisse-on-cv}.
For theoretical results on model selection by cross-validation, and on data
splitting, see \citet{Gyorfi-Kohler-Krzyzak-Walk-nonparametric-regrssion}.
This lecture has emphasized model selection criteria which could be applied
automatically. Of course, doing anything automatically is usually somewhat
dubious. An alternative, with a lot to recommend it, is to very carefully
construct models, test their implications, and gradually move towards
more complicated models as improvements in data (volume, precision
of measurement, range of variables, etc.) show definite problems
with simpler models \citep{Gelman-Shalizi-philosophy}.
\section{Exercises}
To think through or practice on, not to hand in.
\begin{enumerate}
\item \label{exercise:cov-of-fit-and-y} Show that $\Cov{Y_i, \hat{m}_i} =
\sigma^2 H_{ii}$. ({\em Hint:} Write $\hat{m}_i$ as a weighted sum of
$Y_j$.)
\item Using the \texttt{step} function, repeat the simulation from \S
\ref{sec:inference-after-selection} and report the number of selected
coefficients, their median $p$-value (in Wald tests of the slope being zero),
and the $p$-value of the full $F$-test. Repeat the simulation many times,
and plot a histogram of the $F$-test $p$-values.
\end{enumerate}
\bibliography{locusts}
\bibliographystyle{crs}
\end{document}