\documentclass{article}
\input{../common}
\begin{document}
\title{Lecture 21: Model Selection}
\author{36-401, Fall 2015, Section B}
\date{10 November 2015}
\maketitle
\tableofcontents

<<echo=FALSE>>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE,
               options(show.signif.stars=FALSE))
@

\vspace{10mm}

\section{Generalization and Optimism}

We estimated our model by minimizing the mean squared error on our data:
$\widehat{\mathbf{\beta}} = \argmin_{\mathbf{b}}{\frac{1}{n}(\mathbf{y} - \mathbf{x}\mathbf{b})^T(\mathbf{y} - \mathbf{x}\mathbf{b})}$
Different linear models will amount to different choices of the design matrix $\mathbf{x}$ --- we add or drop variables, we add or drop interactions or polynomial terms, etc., and this adds or removes columns from the design matrix.  We might consider selecting among the models themselves by minimizing the MSE.  This is a very bad idea, for a fundamental reason:
\begin{quotation}
Every model is too optimistic about how well it will actually predict.
\end{quotation}

Let's be very clear about what it would mean to predict well.  The most challenging case would be that we see a new {\em random} point, with predictor values $X_1, \ldots X_p$ and response $Y$, and our {\em old} $\widehat{\mathbf{\beta}}$ has a small expected squared error:
$\Expect{\left(Y - \left(\hat{\beta}_0 + \sum_{j=1}^{p}{X_j \hat{\beta}_j}\right)\right)^2}$
Here both $Y$ and the $X$'s are random (hence the capital letters), so we might be asking the model for a prediction at a point it never saw before.  (Of course, if we have multiple identically distributed $(X,Y)$ pairs, say $q$ of them, the expected MSE over those $q$ points is just the same as the expected squared error at one point.)

An easier task would be to ask the model for predictions at the {\em same} values of the predictor variables as before, but with different random noises.
That is, we fit the model to
$\mathbf{Y} = \mathbf{x}\mathbf{\beta} + \mathbf{\epsilon}$
and now Tyche\footnote{Look her up.} reaches into her urn and gives us
$\mathbf{Y}^{\prime} = \mathbf{x}\mathbf{\beta} + \mathbf{\epsilon}^\prime$
where $\mathbf{\epsilon}$ and $\mathbf{\epsilon}^\prime$ are independent but identically distributed.  The design matrix is the same, the true parameters $\mathbf{\beta}$ are the same, but the noise is different\footnote{If we really are in an experimental setting, we really could get a realization of $\mathbf{Y}^{\prime}$ just by running the experiment a second time.  With surveys or with observational data, it would be harder to actually realize $\mathbf{Y}^\prime$, but mathematically at least it's unproblematic.}.  We now want to see if the coefficients we estimated from $(\mathbf{x}, \mathbf{Y})$ can predict $(\mathbf{x}, \mathbf{Y}^\prime)$.  Since the only thing that's changed is the noise, if the coefficients can't predict well any more, that means they were really just memorizing the noise, and not actually doing anything useful.

Our out-of-sample expected MSE, then, is
$\Expect{n^{-1}(\mathbf{Y}^\prime - \mathbf{x}\widehat{\mathbf{\beta}})^T (\mathbf{Y}^\prime - \mathbf{x}\widehat{\mathbf{\beta}})}$
It will be convenient to break this down into an average over data points, and to abbreviate $\mathbf{x}\widehat{\mathbf{\beta}} = \widehat{\mathbf{m}}$, the vector of fitted values.  Notice that since the predictor variables and the coefficients aren't changing, our predictions are the same both in and out of sample --- at point $i$, we will predict $\widehat{m}_i$.
In this notation, then, the expected out-of-sample MSE is
$\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}}$
We'll compare this to the expected in-sample MSE,
$\Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}}$
Notice that $\widehat{m}_i$ is a function of $Y_i$ (among other things), so those are dependent random variables, while $\widehat{m}_i$ and $Y_i^\prime$ are completely statistically independent\footnote{That might sound weird, but remember we're holding $\mathbf{x}$ fixed in this exercise, so what we mean is that knowing $\widehat{m}_i$ doesn't give us any extra information about $Y_i^\prime$ beyond what we'd get from knowing the values of the $X$ variables.}.

Break this down term by term.  What's the expected value of the $i^{\mathrm{th}}$ in-sample squared error?
\begin{eqnarray}
\Expect{(Y_i - \hat{m}_i)^2} & = & \Var{Y_i-\hat{m}_i} + {\left(\Expect{Y_i - \hat{m}_i}\right)}^2\\
& = & \Var{Y_i} + \Var{\hat{m}_i} - 2\Cov{Y_i, \hat{m}_i} + {\left(\Expect{Y_i} - \Expect{\hat{m}_i}\right)}^2
\end{eqnarray}
The covariance term is not (usually) zero, because, as I just said, $\hat{m}_i$ is a function of, in part, $Y_i$.

On the other hand, what's the expected value of the $i^{\mathrm{th}}$ squared error on new data?
\begin{eqnarray}
\Expect{(Y_i^\prime - \hat{m}_i)^2} & = & \Var{Y_i^\prime-\hat{m}_i} + {\left(\Expect{Y_i^\prime - \hat{m}_i}\right)}^2\\
& = & \Var{Y_i^\prime} + \Var{\hat{m}_i} - 2\Cov{Y_i^\prime, \hat{m}_i} + {\left(\Expect{Y_i^\prime} - \Expect{\hat{m}_i}\right)}^2
\end{eqnarray}
$Y_i^\prime$ is independent of $Y_i$, but has the same distribution.  This tells us that $\Expect{Y_i^\prime} = \Expect{Y_i}$, $\Var{Y_i^\prime} = \Var{Y_i}$, but $\Cov{Y_i^\prime, \hat{m}_i} = 0$.
So
\begin{eqnarray}
\Expect{(Y_i^\prime - \hat{m}_i)^2} & = & \Var{Y_i} + \Var{\hat{m}_i} + {\left(\Expect{Y_i} - \Expect{\hat{m}_i}\right)}^2\\
& = & \Expect{(Y_i - \hat{m}_i)^2} + 2\Cov{Y_i, \hat{m}_i}
\end{eqnarray}
Averaging over data points,
$\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} = \Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}} + \frac{2}{n}\sum_{i=1}^{n}{\Cov{Y_i, \hat{m}_i}}$
Clearly, we need to get a handle on that sum of covariances.  For a linear model, though, $\Cov{Y_i, \hat{m}_i} = \sigma^2 H_{ii}$ (Exercise \ref{exercise:cov-of-fit-and-y}).  So, for linear models,
$\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} = \Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}} + \frac{2}{n}\sigma^2 \tr{\mathbf{H}}$
and we know that with $p$ predictors and one intercept, $\tr{\mathbf{H}} = p+1$ (Homework 5).  Thus, for linear models,
$\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} = \Expect{\frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2}} + \frac{2}{n}\sigma^2 (p+1)$
Of course, we don't actually know the {\em expectation} on the right-hand side, but we do have a sample estimate of it, which is the in-sample MSE.  If the law of large numbers is still our friend,
$\Expect{\frac{1}{n}\sum_{i=1}^{n}{(Y_i^\prime - \widehat{m}_i)^2}} \approx \frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2} + \frac{2}{n}\sigma^2 (p+1)$
The second term on the right, $(2/n)\sigma^2 (p+1)$, is the {\bf optimism} of the model --- the amount by which its in-sample MSE systematically under-estimates its true expected squared error.  Notice that this:
\begin{itemize}
\item Grows with $\sigma^2$: more noise gives the model more opportunities to seem to fit well by capitalizing on chance.
\item Shrinks with $n$: at any fixed level of noise, more data makes it harder to pretend the fit is better than it really is.
\item Grows with $p$: every extra parameter is another control which can be adjusted to fit to the noise.
\end{itemize}

Minimizing the in-sample MSE completely ignores the bias from optimism, so it is guaranteed to pick models which are too large and predict poorly out of sample.  If we could calculate the optimism term, we could at least use an unbiased estimate of the true MSE on new data.  Of course, we do not actually know $\sigma^2$.

\section{Mallows's $C_p$ Statistic}

The Mallows $C_p$ statistic just substitutes in a feasible estimator of $\sigma^2$, which is $\hat{\sigma}^2$ {\em from the largest model we consider}.  This will be an unbiased estimator of $\sigma^2$ if the real model is smaller (contains a strict subset of the predictor variables), but not vice versa\footnote{This assumes the largest model must contain the truth!}.  That is, for a linear model with $p+1$ coefficients fit by OLS,
\begin{equation}
\label{eqn:cp}
C_p \equiv \frac{1}{n}\sum_{i=1}^n{(Y_i - \widehat{m}_i)^2} + \frac{2}{n}\hat{\sigma}^2(p+1)
\end{equation}
The selection rule is to pick the model which minimizes $C_p$.

We can think of $C_p$ as having two parts,
$C_p = MSE + (\mathrm{penalty})$
From one point of view, the penalty is just an estimate of the bias.  From another point of view, it's a cost we're imposing on models for having extra parameters.  Every new parameter has got to pay that cost by reducing the MSE by at least a certain amount; if it doesn't, the extra parameter isn't worth it.

(Before this, we've only been dealing with {\em one} model, so we've not had to distinguish carefully between the in-sample MSE and the maximum likelihood estimate of $\sigma^2$.  With multiple models floating around, though, each can have its own MSE, but there is only one true $\sigma^2$, and we need {\em an} estimate of it.)

For comparing models, we really care about differences:
\begin{equation}
\label{eqn:delta-Cp}
\Delta C_p = MSE_1 - MSE_2 + \frac{2}{n}\hat{\sigma}^2(p_1 - p_2)
\end{equation}
(The extra term for the intercept, being common to both models, doesn't contribute.)
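The optimism calculation, and hence the size of the $C_p$ penalty, can be checked by simulation.  The following chunk is my own illustration, not part of the main analysis: the true model ($Y = 1 + 2X + \epsilon$), the sample size, and $\sigma$ are all invented for the example.  We hold $\mathbf{x}$ fixed, repeatedly draw $\mathbf{Y}$ and $\mathbf{Y}^\prime$, and compare the average gap between out-of-sample and in-sample MSE to $(2/n)\sigma^2(p+1)$.

<<>>=
# Simulation check of the optimism formula
# (illustrative example: true model Y = 1 + 2x + noise; n, sigma arbitrary)
set.seed(36401)
n <- 50; p <- 1; sigma <- 1
x <- runif(n, -2, 2)  # fixed design, held constant across replicates
optimism <- replicate(2000, {
  y <- 1 + 2*x + rnorm(n, 0, sigma)        # training responses
  y.prime <- 1 + 2*x + rnorm(n, 0, sigma)  # same x, fresh noise
  mdl <- lm(y ~ x)
  mean((y.prime - fitted(mdl))^2) - mean(residuals(mdl)^2)
})
mean(optimism)     # simulated optimism; should be close to the next line
2*sigma^2*(p+1)/n  # theoretical optimism
@
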
\paragraph{Alternate form of $C_p$}  You will find many references which define $C_p$ somewhat differently:
\begin{equation}
\label{eqn:funky-Cp}
\frac{n MSE}{\hat{\sigma}^2} - n + 2p
\end{equation}
and say that the optimal value is close to $p$, not close to 0.  To see that this selects exactly the same models as the rule given above, take a difference between two models, with MSE's $MSE_1, MSE_2$ and $p_1, p_2$ predictors.  We get
$\frac{n(MSE_1 - MSE_2)}{\hat{\sigma}^2} + 2(p_1 -p_2)$
Dividing by $n$ and multiplying by $\hat{\sigma}^2$ gives us back Eq.\ \ref{eqn:delta-Cp}.  There are reasons to assert that Eq.\ \ref{eqn:funky-Cp} should indeed be close to $p$ for the right model (if the Gaussian noise assumption holds), but Eq.\ \ref{eqn:cp} is a good estimate of the out-of-sample error, and a good model selection rule, much more broadly.

\subsection{$R^2$ and Adjusted $R^2$}

Recall that
$R^2 = 1 - \frac{MSE}{s^2_Y}$
Picking a model by maximizing $R^2$ is thus equivalent to picking a model by minimizing MSE.  It is therefore stupid for exactly the same reasons that minimizing MSE across models is stupid.

Recall that the adjusted $R^2$ is
$R^2_{adj} = 1 - \frac{MSE \frac{n}{n-p-1}}{s^2_Y}$
That is, it's $R^2$ with the unbiased estimator of $\sigma^2$.  Maximizing adjusted $R^2$ therefore corresponds to minimizing that unbiased estimator.  What does that translate to?
\begin{eqnarray}
MSE\frac{n}{n-p-1} & = & MSE \frac{1}{1-(p+1)/n}\\
& \approx & MSE\left(1 + \frac{p+1}{n}\right)\\
& = & MSE + MSE\frac{p+1}{n}
\end{eqnarray}
where the approximation becomes exact as $n\rightarrow\infty$ with $p$ fixed\footnote{Use the binomial theorem to expand $1/(1-u)$ as $1 + u + u^2 + \ldots$, and truncate the series at first order.  (If $u$ is small, $u^2$ is tiny, and the higher powers microscopic.)}.  Even for the completely right model, where $MSE$ is a consistent estimator of $\sigma^2$, the correction or penalty is only half as big as we've seen it should be.
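To see how accurate that first-order approximation is at a realistic sample size, here is a quick numerical check (the particular $n$ and $p$ are arbitrary, chosen just for illustration):

<<>>=
# Exact inflation factor vs. its first-order approximation (arbitrary n, p)
n <- 200; p <- 5
n/(n-p-1)    # exact factor n/(n-p-1)
1 + (p+1)/n  # approximation 1 + (p+1)/n; agrees to about 3 decimal places
@
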
Selecting models using adjusted $R^2$ is not completely stupid, as maximizing $R^2$ is, but it is still not going to work very well.

\section{Akaike Information Criterion (AIC)}

The great Japanese statistician Hirotugu Akaike proposed a famous model selection rule which also has the form of ``in-sample performance plus penalty''.  What has come to be called the {\bf Akaike information criterion} (AIC) is
$AIC(S) \equiv L_S - \mathrm{dim}(S)$
where $L_S$ is the log likelihood of the model $S$, evaluated at the maximum likelihood estimate, and $\mathrm{dim}(S)$ is the dimension of $S$, the number of adjustable parameters it has.  Akaike's rule is to pick the model which maximizes AIC\footnote{Actually, in his original paper \citep{Akaike-AIC}, he proposed using {\em twice} this, to simplify some calculations involving chi-squared distributions.  Many subsequent authors have since kept the factor of 2, which of course will not change which model is selected.  Also, some authors define AIC as the negative of this, and then minimize it; again, clearly the same thing.}.

The reason for this definition is that Akaike showed $AIC/n$ is an unbiased estimate of the expected log-probability the estimated parameters will give to a new data point which it hasn't seen before, if the model is right.  This is the natural counterpart of expected squared error for more general distributions than the Gaussian.

If we do specialize to linear-Gaussian models, then we've seen (Lecture 10) that
$L = -\frac{n}{2}(1+\log{2\pi}) - \frac{n}{2}\log{MSE}$
and the dimension of the model is $p+2$ (because $\sigma^2$ is also an adjustable parameter).  Notice that $-\frac{n}{2}(1+\log{2\pi})$ doesn't involve the parameters at all.
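In R, we can confirm the relationship between the log-likelihood, the parameter count, and the built-in \texttt{AIC} function, which uses the $-2\times$ convention.  The data below are invented just so we have a model to fit:

<<>>=
# Relating R's AIC() to the definition AIC(S) = L_S - dim(S)
# (data are arbitrary, for illustration only)
set.seed(1)
df.aic <- data.frame(x = runif(30))
df.aic$y <- 1 + df.aic$x + rnorm(30)
mdl.aic <- lm(y ~ x, data=df.aic)
L <- as.numeric(logLik(mdl.aic))    # log-likelihood at the MLE
d <- attr(logLik(mdl.aic), "df")    # dim(S) = p + 2 = 3 here
all.equal(AIC(mdl.aic), -2*(L - d)) # TRUE: R's AIC is -2*(L_S - dim(S))
@
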
If we compare AICs for two models, with in-sample mean squared errors $MSE_1$ and $MSE_2$, and one with $p_1$ predictors and the other with $p_2$, the difference in AICs will be
$\Delta AIC = -\frac{n}{2}\log{MSE_1} + \frac{n}{2}\log{MSE_2} - (p_1 - p_2)$
To relate this to $C_p$, let's write $MSE_2 = MSE_1 + \Delta MSE$.  Then
\begin{eqnarray}
\Delta AIC & = & -\frac{n}{2}\log{MSE_1} + \frac{n}{2}\log{MSE_1\left(1+\frac{\Delta MSE}{MSE_1}\right)} - (p_1 - p_2)\\
& = & \frac{n}{2}\log{\left(1+\frac{\Delta MSE}{MSE_1}\right)} - (p_1 - p_2)
\end{eqnarray}
Now let's suppose that model 1 is actually the correct model, so $MSE_1 = \hat{\sigma}^2$, and that $\Delta MSE$ is small compared to $\hat{\sigma}^2$, so\footnote{Taylor expand $\log{(1+u)}$ around $u=0$ to get $\log{(1+u)} \approx u$, for $u$ close to 0.}
\begin{eqnarray}
\Delta AIC & \approx & \frac{n}{2}\frac{\Delta MSE}{\hat{\sigma}^2} - (p_1 - p_2)\\
-\frac{2 \hat{\sigma}^2}{n}\Delta AIC & \approx & -\Delta MSE + \frac{2}{n}\hat{\sigma}^2(p_1 - p_2) = \Delta C_p
\end{eqnarray}
(remembering that $-\Delta MSE = MSE_1 - MSE_2$).  So, if one of the models we're looking at is actually the correct model, and the others aren't too different from it, picking by maximizing AIC will give the same answer as picking by minimizing $C_p$.

\paragraph{Other Uses of AIC}  AIC can be applied whenever we have a likelihood.  It is therefore used for tasks like comparing models of probability distributions, or predictive models where the whole distribution is important.  $C_p$, by contrast, really only makes sense if we're trying to do regression and want to use squared error.

\subsection{Why $-\mathrm{dim}(S)$?}

Akaike had a truly brilliant argument for subtracting a penalty equal to the number of parameters from the log-likelihood, which is too pretty not to at least sketch here.\footnote{Nonetheless, this subsection is optional.}

Generically, say that the parameter vector is $\theta$, and its true value is $\theta^*$.
(For linear regression with Gaussian noise, $\theta$ consists of all $p+1$ coefficients plus $\sigma^2$.)  The length of this vector, which is $\mathrm{dim}(S)$, is, let's say, $d$.  (For linear regression with Gaussian noise, $d=p+2$.)  The maximum likelihood estimate is $\hat{\theta}$.

We know that the derivative of the likelihood is zero at the MLE:
$\nabla L(\hat{\theta}) = 0$
Let's do a Taylor series expansion of $\nabla L(\theta)$ around the true parameter value $\theta^*$:
$\nabla L(\theta) = \nabla L(\theta^*) + (\theta - \theta^*) \nabla\nabla L(\theta^*)$
Here $\nabla\nabla L(\theta^*)$ is the $d\times d$ matrix of second partial derivatives of $L$, evaluated at $\theta^*$.  This is called the {\bf Hessian}, and would traditionally be written $\mathbf{H}$, but that would lead to confusion with the hat matrix, so I'll call it $\mathbf{K}$.  Therefore the Taylor expansion for the gradient of the log-likelihood is
$\nabla L(\theta) = \nabla L(\theta^*) + (\theta-\theta^*) \mathbf{K}$
Applied to the MLE,
$\mathbf{0} = \nabla L(\theta^*) + (\hat{\theta} - \theta^*)\mathbf{K}$
or
$\hat{\theta} = \theta^* - \mathbf{K}^{-1} \nabla L(\theta^*)$
What is the {\em expected} log-likelihood, on new data, of $\hat{\theta}$?  Call this expected log-likelihood $\ell$ (using a lower-case letter to indicate that it is non-random).  Doing another Taylor series,
$\ell(\theta) \approx \ell(\theta^*) + (\theta-\theta^*)^T \nabla \ell(\theta^*) + \frac{1}{2}(\theta-\theta^*)^T \nabla \nabla \ell(\theta^*) (\theta-\theta^*)$
However, it's not hard to show that the expected log-likelihood is always\footnote{Except for quite weird models.} maximized by the true parameters, so $\nabla \ell(\theta^*) = 0$.  (The same argument also shows $\Expect{\nabla L(\theta^*)} = 0$.)  Call the Hessian in this Taylor expansion $\mathbf{k}$.  (Again, notice the lower-case letter for a non-random quantity.)
We have
$\ell(\theta) \approx \ell(\theta^*) + \frac{1}{2}(\theta-\theta^*)^T \mathbf{k} (\theta-\theta^*)$
Apply this to the MLE:
$\ell(\hat{\theta}) \approx \ell(\theta^*) + \frac{1}{2}\nabla L(\theta^*)^T \mathbf{K}^{-1} \mathbf{k} \mathbf{K}^{-1} \nabla L(\theta^*)$
Taking expectations,
$\Expect{\ell(\hat{\theta})} \approx \ell(\theta^*) + \frac{1}{2}\tr{\mathbf{K}^{-1} \mathbf{k} \mathbf{K}^{-1} \mathbf{J}}$
where $\Var{\nabla L(\theta^*)} = \mathbf{J}$.  For large $n$, $\mathbf{K}$ converges on $\mathbf{k}$, so this simplifies to
$\Expect{\ell(\hat{\theta})} \approx \ell(\theta^*) + \frac{1}{2}\tr{\mathbf{k}^{-1}\mathbf{J}}$
This still leaves things in terms of $\ell(\theta^*)$, which of course we don't know, but now we do another Taylor expansion, this time of $L$ around $\hat{\theta}$:
$L(\theta^*) \approx L(\hat{\theta}) + \frac{1}{2}(\theta^* - \hat{\theta})^T \nabla \nabla L(\hat{\theta}) (\theta^* - \hat{\theta})$
so
$L(\theta^*) \approx L(\hat{\theta}) + \frac{1}{2}(\mathbf{K}^{-1} \nabla L(\theta^*))^T \nabla \nabla L(\hat{\theta}) (\mathbf{K}^{-1} \nabla L(\theta^*))$
For large $n$, $\nabla\nabla L(\hat{\theta}) \rightarrow \nabla\nabla L(\theta^*) \rightarrow \mathbf{k}$.  So, again taking expectations,
$\ell(\theta^*) \approx \Expect{L(\hat{\theta})} + \frac{1}{2}\tr{\mathbf{k}^{-1}\mathbf{J}}$
Putting these together,
$\Expect{\ell(\hat{\theta})} \approx \Expect{L(\hat{\theta})} + \tr{\mathbf{k}^{-1}\mathbf{J}}$
An unbiased estimate is therefore
$L(\hat{\theta}) + \tr{\mathbf{k}^{-1}\mathbf{J}}$
Finally, a fundamental result (the ``Fisher identity'') says that for well-behaved models, {\em if} the model is correct, then
$\Var{\nabla L(\theta^*)} = -\nabla\nabla \ell(\theta^*)$
or $\mathbf{J} = -\mathbf{k}$.  Hence, if the model is correct, our unbiased estimate is just
$L(\hat{\theta}) - \tr{\mathbf{I}}$
and of course $\tr{\mathbf{I}} = d$.

There are, as you'll notice, several steps where we made a bunch of approximations.
Some of these approximations (especially those involving the Taylor expansions) can be shown to be OK asymptotically (i.e., as $n\rightarrow\infty$) by more careful math.  The last steps, however, where we invoke the Fisher identity, are rather more dubious.  (After all, the models we're working with can hardly all be expected to contain the true distribution.)  A somewhat more robust version of AIC is therefore to use as the criterion
$L(\hat{\theta}) + \tr{\mathbf{K}^{-1}\mathbf{J}}$

\section{Leave-one-out Cross-Validation (LOOCV)}

When looking at influential points and outliers, we considered omitting one point from the data set, estimating the model, and then trying to predict that one data point.  The {\bf leave-one-out} fitted value for data point $i$ is $\hat{m}^{(-i)}_i$, where the subscript $(-i)$ indicates that point $i$ was left out in calculating this fit.  The {\bf leave-one-out cross-validation score} of the model is
$LOOCV = \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \hat{m}^{(-i)}_i)^2}$
(Many more old-fashioned regression textbooks look at $n \times LOOCV$, and call it PRESS, for ``predictive residual sum of squares''.)

The story for cross-validation is pretty compelling: we want to know if our model can generalize to new data, so {\em see} how well it generalizes to new data.  Leaving out each point in turn ensures that the set of points on which we try to make predictions is just as representative of the whole population as the original sample was.  Fortunately, this is one of those cases where a compelling story is actually true: LOOCV is an unbiased estimate of the generalization error.

\subsection{Short-cut Based on Leverage}

Re-estimating the model $n$ times would be seriously time-consuming, but there is fortunately a short-cut:
$LOOCV = \frac{1}{n}\sum_{i=1}^{n}{\left( \frac{Y_i - \hat{m}_i}{1-H_{ii}}\right)^2}$
The numerator inside the square is just the residual of the model fit to the full data.
This gets divided by $1-H_{ii}$, which is also something we can calculate with just one fit to the model.  (The denominator says that the residuals for high-leverage points count more, and those for low-leverage points count less.  If the model is going out of its way to match $Y_i$ (high leverage $H_{ii}$) and it still can't fit it, that's worse than the same sized residual at a point the model doesn't really care about (low leverage).)

The gap between LOOCV and the MSE can be thought of as a penalty, just like with $C_p$ or AIC.  The penalty doesn't have such a nice mathematical expression, but it's well-defined and easy for us to calculate.  It also converges to the penalty $C_p$ applies as $n$ grows.  To help see this, first observe that the $H_{ii}$ must be getting small.  (We know that $\sum_{i}{H_{ii}} = p+1$.)  Then\footnote{Use the binomial theorem again.} $(1-H_{ii})^{-2} \approx 1 + 2 H_{ii}$, and
$LOOCV \approx \frac{1}{n}\sum_{i=1}^{n}{( Y_i - \hat{m}_i)^2(1+2 H_{ii})} \approx MSE + \frac{2}{n}\sigma^2 \tr{\mathbf{H}}$

\paragraph{Cross-validation with log-likelihood}  The leave-one-out idea can also be applied for any model where we make a probabilistic prediction.  Instead of measuring mean squared error, we measure the negative log probability density the model assigns to the actual left-out point.  (Negative, so that a lower score is still better.)  With Gaussian noise, this comes to the same thing as the MSE, of course.

\subsection{Summing Up $C_p$, AIC, LOOCV}

Under a very broad range of circumstances, there are theorems which say, roughly, the following:
\begin{quotation}
As $n\rightarrow \infty$, the expected out-of-sample MSE of the model picked by leave-one-out cross-validation is close to that of the best model considered.
\end{quotation}
The conditions for these results do {\em not} require that any of the models considered be true, or that the true model have Gaussian noise or even be linear.
As we've seen, for large $n$ leave-one-out and Mallows's $C_p$ become extremely similar, and will pick the same model, and so will AIC, if one of the models is right.  So they will also pick models which predict almost as well as the best of the models we're working with.  Since $C_p$ and AIC involve less calculation than leave-one-out, they have advantages when $n$ is large.  Against this, there don't seem to be any situations where $C_p$ or AIC pick models with good predictive performance but leave-one-out does not.  The best way to think about $C_p$ and AIC is that they are fast approximations to the more fundamental quantity, which is leave-one-out.

On the other hand, one can {\em also} prove the following:
\begin{quotation}
As $n\rightarrow\infty$, if the true model is among those being compared, LOOCV, $C_p$ and AIC will all tend to pick a {\em strictly larger} model than the truth.
\end{quotation}
That is, all three criteria tend to prefer models which are bigger than the true model, even when the true model is available to them.  They are not ``consistent for model selection''.  The problem is that while these methods give unbiased estimates of the generalization error, that doesn't say anything about the variance of the estimates.  Models with more parameters have higher variance, and the penalty applied by these methods isn't strong enough to overcome the chance of capitalizing on that variance.

\section{Other Model Selection Criteria}

While many, many other model selection criteria have been proposed, two are particularly important.

\subsection{$k$-Fold Cross-Validation}

In leave-one-out cross-validation, we omitted each data point in turn, and tried to predict it.  $k$-fold cross-validation is somewhat different, and goes as follows.
\begin{itemize}
\item Randomly divide the data into $k$ equally-sized parts, or ``folds''.
\item For each fold
\begin{itemize}
\item Temporarily hold back that fold, calling it the ``testing set''.
\item Call the other $k-1$ folds, taken together, the ``training set''.
\item Estimate each model on the training set.
\item Calculate the MSE of each model on the testing set.
\end{itemize}
\item Average MSEs over folds.
\end{itemize}
We then pick the model with the lowest MSE, averaged across testing sets.

The point of this is just like the point of leave-one-out: the models are compared only on data which they didn't get to see during estimation.  Indeed, leave-one-out is the special case of $k$-fold cross-validation where $k=n$.  The disadvantage of doing that is that in leave-one-out, all of the training sets are very similar (they share $n-2$ data points), so averaging over folds does very little to reduce variance.  For moderate $k$ --- people typically use 5 or 10 --- $k$-fold CV tends to produce very good model selection results.

Like leave-one-out CV, $k$-fold cross-validation can be applied to any loss function, such as the proportion of cases mis-classified, or negative log-likelihood.

\subsection{BIC}

A more AIC-like criterion is the ``Bayesian information criterion''\footnote{Bayesianism is the idea that we ought to have probabilities for parameter values and for models, and not just for random variables (or, said another way, to treat parameters and models as also random variables), and update those probabilities as we see more events using Bayes's rule.  It is a controversial position within statistics and philosophy of science, with many able and learned supporters, and equally able and learned opponents.  (It is also the only position in statistics and philosophy of science I know of which has an online cult dedicated to promoting it, alongside reading certain works of Harry Potter fanfic, and trying not to think about the possibility a future superintelligent computer will simulate your being tortured.)} introduced by \citet{Schwarz-BIC}.
The name is quite misleading\footnote{The truly Bayesian position is not to {\em select} a model at all, but rather to maintain a probability distribution over all models you think possible.}, but irrelevant; it's got the exact same idea of penalizing the log-likelihood with the number of parameters, but using a penalty which gets bigger with $n$:
$BIC(S) = L_S - \frac{\log{n}}{2}\mathrm{dim}(S)$
This is a stronger penalty than AIC applies, and this has consequences:
\begin{quotation}
As $n\rightarrow\infty$, if the true model is among those BIC can select among, BIC will tend to pick the true model.
\end{quotation}
Of course there are various conditions attached to this, some of them quite technical, but it's generally true for IID samples, for regression modeling, for many sorts of time series model, etc.  Unfortunately, the model selected by BIC will tend to predict less well than the one selected by leave-one-out cross-validation or AIC.

\section{Stepwise Model Selection}

One way to automatically select a model is to begin with the largest model you can, and then prune it, which can be done in several ways:
\begin{itemize}
\item Eliminate the least-significant coefficient.
\item Pick your favorite model selection criterion, consider deleting each coefficient in turn, and pick the sub-model with the best value of the criterion.
\end{itemize}
Having eliminated a variable, one then re-estimates the model, and repeats the procedure.  Stop when either all the remaining coefficients are significant (under the first option), or nothing can be eliminated without worsening the criterion.

(What I've described is {\bf backwards} stepwise model selection.  {\bf Forward} stepwise model selection starts with the intercept-only model and adds variables in the same fashion.  There are, naturally, forward-backward hybrids.)
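A minimal sketch of backward stepwise selection by AIC, using R's \texttt{step} function; the data-generating process here is invented purely for illustration, with only one predictor actually mattering:

<<>>=
# Backward stepwise selection by AIC (simulated data for illustration)
set.seed(3)
df.step <- data.frame(x1=rnorm(50), x2=rnorm(50), x3=rnorm(50))
df.step$y <- 2*df.step$x1 + rnorm(50)  # only x1 matters
full <- lm(y ~ x1 + x2 + x3, data=df.step)
reduced <- step(full, direction="backward", trace=0)
formula(reduced)  # typically keeps x1 and drops the noise predictors
@
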
Stepwise model selection is a {\bf greedy} procedure: it takes the move which does the most to immediately improve the criterion, without considering the consequences down the line.  There are very, very few situations where it is consistent for model selection, or (in its significance-testing version) where it even does a particularly good job of coming up with predictive models, but it's surprisingly popular.

\section{Inference after Selection}
\label{sec:inference-after-selection}

All of the inferential statistics we have done in earlier lectures presumed that our choice of model was completely fixed, and not at all dependent on the data.  If different data sets would lead us to use different models, and our data are (partly) random, then which model we're using is also random.  This leads to some extra uncertainty in, say, our estimate of the slope on $X_1$, which is {\em not} accounted for by our formulas for the sampling distributions, hypothesis tests, confidence sets, etc.

A very common response to this problem, among practitioners, is to ignore it, or at least hope it doesn't matter.  This can be OK, if the data-generating distribution forces us to pick one model with very high probability, or if all of the models we might pick are very similar to each other.  Otherwise, ignoring it leads to nonsense.

Here, for instance, I simulate 200 data points where the $Y$ variable is a standard Gaussian, and there are 100 independent predictor variables, all also standard Gaussians, independent of each other {\em and of} $Y$:

<<>>=
n <- 200; p <- 100
y <- rnorm(n)
x <- matrix(rnorm(n*p), nrow=n)
df <- data.frame(y=y, x)
mdl <- lm(y~., data=df)
@

Of the \Sexpr{p} predictors, \Sexpr{sum(coefficients(summary(mdl))[-1,4]<0.05)} have $t$-statistics which are significant at the $0.05$ level or less.  (The expected number would be \Sexpr{p*0.05}.)  If I select the model using just those variables\footnote{Exercise: Explain all the ways in which this is a bad idea.
Now imagine explaining the same thing to your boss, who took econometrics 20 years ago, and wants to know why he can't just follow the stars.}, I get the following:

<<>>=
stars <- 1+which(coefficients(summary(mdl))[-1,4]<0.05) # Why 1+?
mdl.2 <- lm(y~., data=df[,c(1,stars)])
summary(mdl.2)
@

Notice that final over-all $F$ statistic: it's testing whether including those variables fits better than an intercept-only model, and saying it thinks it does, with a definitely significant $p$-value.  This is the case even though, by construction, the response is {\em completely independent} of {\em all} the predictors.  This is not a fluke: if you re-run my simulation many times, your $p$-values in the full $F$ test will not be uniformly distributed (as they would be on all 100 predictors), but rather will have a distribution strongly shifted over to the left.  Similarly, if we looked at the confidence intervals, they would be much too narrow.

These issues do not go away if the true model isn't ``everything is independent of everything else'', but rather has some structure.  Because we picked the model to predict well on this data, if we then run hypothesis tests on that same data, they'll be too likely to tell us everything is significant, and our confidence intervals will be too narrow.  Doing statistical inference on the same data we used to select our model is just broken.  It may not always be as spectacularly broken as in my demo above, but it's still broken.

There are three ways around this.  One is to pretend the issue doesn't exist; as I said, this is popular, but it's got nothing else to recommend it.  Another, which is currently an area of very active research in statistics, is to try to come up with clever technical adjustments to the inferential statistics\footnote{If you're curious, ask Profs.\ Tibshirani or G'Sell about this.}.
The third approach, which is in many ways the simplest, is to use {\em different data sets} to select a model and to do inference within the selected model\footnote{Technically, there is a fourth possible approach, which is to select the model completely at random, and then do inference within it.  This may sound like a joke, but there are actually situations, like testing for a difference in means between high-dimensional vectors, where it's perfectly reasonable.}.

\subsection{Data Splitting}

Data splitting is (for regression) a very simple procedure:
\begin{itemize}
\item Randomly divide your data set into two parts.
\item Calculate your favorite model selection criterion for all your candidate models using only the first part of the data.  Pick one model as the winner.
\item Re-estimate the winner, and calculate all your inferential statistics, using only the other half of the data.
\end{itemize}
(Division into two equal halves is optional, but usual.)

Because the winning model is statistically independent of the second half of the data, the confidence intervals, hypothesis tests, etc., can treat it as though that model were fixed {\em a priori}.  Since we're only using $n/2$ data points to calculate confidence intervals (or whatever), they will be somewhat wider than if we really had fixed the model in advance and used all $n$ data points, but that's the price we pay for having to select a model based on data.

\section{R Practicalities}

$R^2$ and adjusted $R^2$ are calculated by the \texttt{summary} function for \texttt{lm} objects, if --- Heaven forbid --- you should ever need them.
So, more usefully, is the in-sample root mean squared error, based on the unbiased estimator of $\sigma^2$:

<<eval=FALSE>>=
mdl <- lm(something ~ other_things, data=df)
summary(mdl)$r.squared
summary(mdl)$adj.r.squared
summary(mdl)$sigma
@

The un-adjusted MSE is also easily calculated:

<<eval=FALSE>>=
mean(residuals(mdl)^2)
@

The \texttt{AIC} function knows how to work with models produced by \texttt{lm}; it uses an alternate definition of AIC, which is $-2 \times$ the one I gave above (so smaller AIC is preferred). Similarly for the \texttt{BIC} function.

The \texttt{step} function will do stepwise model selection based on AIC, either forward or backward. Manipulating the arguments also allows for using BIC. (See the help file.) {\em Warning:} By default, this prints out a lot of information about every model it looks at; consider setting \texttt{trace=0}.

For leave-one-out cross-validation, the most straightforward approach is to use the following function:

<<>>=
# Calculate LOOCV score for a linear model
# Input: a model as fit by lm()
# Output: leave-one-out CV score
cv.lm <- function(mdl) {
  return(mean((residuals(mdl)/(1-hatvalues(mdl)))^2))
}
@

For $k$-fold cross-validation, the easiest option at this stage is to use the \texttt{cv.glm} function in the package \texttt{boot}\footnote{Later in this course and in 402, we will write our own CV code, partly as character building and partly because there's nothing quite like doing this to actually get how it works.}. Note that this requires you to fit your model with the \texttt{glm} function, not with \texttt{lm}, and that you will really only be interested in the \texttt{delta} component of what \texttt{cv.glm} returns. (See the help file, especially the examples at the end.)

Nobody seems to have written a function for calculating $C_p$. Here is one.
<<>>=
# Calculate Mallows' Cp for a list of linear models
# Input: list of models, all fit by lm
# Output: vector of Cp statistics
# Presumes: all models are nested inside the largest model; all models
#   fit on a common data set
Cp.lm <- function(mdl.list) {
  # How many samples do we have?
  # Presumes all models fit to the same data
  n <- nobs(mdl.list[[1]])
  # Extract the number of degrees of freedom for each model
  DoFs <- sapply(mdl.list, function(mdl) { sum(hatvalues(mdl)) })
  # Extract the MSE of each model
  MSEs <- sapply(mdl.list, function(mdl) { mean(residuals(mdl)^2) })
  # Which model had the most parameters?
  # Presuming that model includes all the others as special cases
  biggest <- which.max(DoFs)
  # Use the nesting model's MSE to estimate sigma^2
  sigma2.hat <- MSEs[[biggest]]*n/(n-DoFs[[biggest]])
  Cp <- MSEs + 2*sigma2.hat*DoFs/n
  return(Cp)
}
@

<<eval=FALSE>>=
# Example of usage:
Cp.lm(list(mdl1, mdl2, mdl3))
@

\subsection{A Demo}

We'll do polynomial regression with just one $X$ variable; this way we can keep throwing in as many terms as we need to, in order to make the point. $X$ will be uniformly distributed on the interval $[-2,2]$, and when we use a $q^{\mathrm{th}}$-order polynomial, we'll set $Y = \sum_{j=1}^{q}{(-1)^j X^j} + \epsilon$, with $\epsilon$ having our usual Gaussian distribution with mean 0 and standard deviation $0.1$.
Here's code to simulate from the model:

<<>>=
# Simulate variable-degree polynomial with random X and fixed coefficients
# Inputs: number of points to simulate; degree of polynomial
# Output: data frame with x and y columns
sim.poly <- function(n, degree) {
  x <- runif(n, min=-2, max=2)
  poly.x <- poly(x, degree=degree, raw=TRUE)
  alternating.signs <- rep(c(-1,1), length.out=degree)
  sum.poly <- poly.x %*% alternating.signs
  y <- sum.poly + rnorm(n, 0, 0.1)
  return(data.frame(x=x, y=y))
}
@

And here is code to fit many polynomials to it:

<<>>=
# Fit multiple univariate polynomials to the same data
# Inputs: data frame; maximum degree of polynomial
# Output: list of estimated models
# Presumes: data frame has columns called x and y; y is the response;
#   maximum degree is an integer >= 1
poly.fit <- function(df, max.degree) {
  lapply(1:max.degree, function(deg) {
    lm(y ~ poly(x, degree=deg), data=df)
  })
}
@

And to apply multiple selection criteria to a list of models:

<<>>=
# Apply multiple model selection criteria to a list of models
# Input: list of models
# Output: vector indicating which model from the list was picked by
#   each criterion
# Presumes: all models were fit by lm(), on the same data set
#   (otherwise, comparing them is weird)
selectors <- function(mdl.list) {
  Rsq <- which.max(sapply(mdl.list, function(mdl) { summary(mdl)$r.squared }))
  Rsq.adj <- which.max(sapply(mdl.list, function(mdl) { summary(mdl)$adj.r.squared }))
  Cp <- which.min(Cp.lm(mdl.list))
  LOOCV <- which.min(sapply(mdl.list, cv.lm))
  AIC <- which.min(sapply(mdl.list, AIC))
  BIC <- which.min(sapply(mdl.list, BIC))
  choices <- c(Rsq=Rsq, Rsq.adj=Rsq.adj, Cp=Cp, LOOCV=LOOCV, AIC=AIC, BIC=BIC)
  return(choices)
}
@

To put this all together, let's see what gets picked if we simulate 20 data points from the quadratic, and allow models of up to order 10:

<<>>=
selectors(poly.fit(sim.poly(n=20, degree=2), max.degree=10))
@

Of course, one run doesn't mean much, so let's
do this a bunch of times:

<<>>=
summary(t(replicate(1000, selectors(poly.fit(sim.poly(n=20, degree=2), max.degree=10)))))
@

This shows summary statistics for the degree of the polynomial model selected by each criterion. (Why do I put in the transpose?) Remember that the right degree here is 2, so $R^2$ is (as usual) useless, and adjusted $R^2$ is little better. The others all at least do something roughly right, though AIC does worse than the other three. Of course, $n=20$ isn't very much information\footnote{Though it seems to be enough for leave-one-out or BIC.}, so let's increase that to $n=1000$.

<<>>=
summary(t(replicate(1000, selectors(poly.fit(sim.poly(n=1000, degree=2), max.degree=10)))))
@

Adjusted $R^2$ is hopeless; leave-one-out does essentially the same as AIC or $C_p$ (and all three have about a 25\% probability of picking a model which is too big), while BIC is, as expected, much more conservative. You can experiment with what happens if you change the true order of the model, or the range of orders compared by the model selectors, or make some of the higher-order polynomial coefficients close to but not quite zero, etc.

\section{Further Reading}

The best reference on model selection I know of, by far, is \citet{Claeskens-Hjort-model-selection}; unfortunately, much of its theory is beyond the level of this course, but some of the earlier chapters should not be. \citet{Hansen-challenges-for-model-selection} provides interesting perspectives based on extensive experience in econometrics.

Cross-validation goes back in statistics to the 1950s, if not earlier, but did not become formalized as a tool until the 1970s, with the work of \citet{Stone1974}, \citet{Geisser-predictive-sample-reuse} and \citet{Geisser-Eddy-predictive-approach}. (The last paper, written in 1977, made it perfectly clear that the approach could be used on log-likelihoods, mis-classification rates, etc., as well as on squared error.)
It was adopted, along with many other statistical ideas, by computer scientists during the period in the late 1980s--early 1990s when the modern area of ``machine learning'' emerged from (parts of) earlier areas called ``artificial intelligence'', ``pattern recognition'', ``connectionism'', ``neural networks'', or indeed ``machine learning''. Subsequently, many of the scientific descendants of the early machine learners forgot where their ideas came from, to the point where many people now think cross-validation is something computer science contributed to data analysis. For a recent survey of cross-validation techniques and their uses, see \citet{Arlot-Celisse-on-cv}. For theoretical results on model selection by cross-validation, and on data splitting, see \citet{Gyorfi-Kohler-Krzyzak-Walk-nonparametric-regrssion}.

This lecture has emphasized model selection criteria which can be applied automatically. Of course, doing anything automatically is usually somewhat dubious. An alternative, with a lot to recommend it, is to very carefully construct models, test their implications, and gradually move towards more complicated models as improvements in data (volume, precision of measurement, range of variables, etc.) show definite problems with the simpler ones \citep{Gelman-Shalizi-philosophy}.

\section{Exercises}

To think through or practice on, not to hand in.

\begin{enumerate}
\item \label{exercise:cov-of-fit-and-y} Show that $\Cov{Y_i, \hat{m}_i} = \sigma^2 H_{ii}$. ({\em Hint:} Write $\hat{m}_i$ as a weighted sum of the $Y_j$.)
\item Using the \texttt{step} function, repeat the simulation from \S \ref{sec:inference-after-selection}, and report the number of selected coefficients, their median $p$-value (in Wald tests of the slope being zero), and the $p$-value of the full $F$-test. Repeat the simulation many times, and plot a histogram of the $F$-test $p$-values.
\end{enumerate}

\bibliography{locusts}
\bibliographystyle{crs}

\end{document}