\documentclass{article}
\input{../common}
\begin{document}
\title{Lecture 20: Outliers and Influential Points}
\author{36-401, Fall 2015, Section B}
\date{5 November 2015}
\maketitle
\tableofcontents

<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE,
               options(show.signif.stars=FALSE))
@

\newpage

An {\bf outlier} is a data point which is very far, somehow, from the rest of the data. They are often worrisome, but not always a problem. When we are doing regression modeling, in fact, we don't really care about whether some data point is far from the rest of the data, but whether it breaks a pattern the rest of the data seems to follow.

Here, we'll first try to build some intuition for when outliers cause trouble in linear regression models. Then we'll look at some ways of quantifying how much influence particular data points have on the model; consider the strategy of pretending that inconvenient data doesn't exist; and take a brief look at the {\bf robust regression} strategy, of replacing least squares estimates with others which are less easily influenced.

\section{Outliers Are Data Points Which Break a Pattern}

Consider Figure \ref{fig:outliers}. The points marked in red and blue are clearly not like the main cloud of the data points, even though their $x$ and $y$ coordinates are quite typical of the data as a whole: the $x$ coordinates of those points aren't related to the $y$ coordinates in the right way; they break the pattern. On the other hand, the point marked in green, while its coordinates are very weird on both axes, does not break that pattern --- it was positioned to fall right on the regression line.

\begin{figure}[h]
<<echo=FALSE>>=
x <- runif(20)
y <- 2*x + rnorm(n=length(x),mean=0,sd=0.2)
xy <- cbind(x=x,y=y)
plot(xy, xlim=c(0,5), ylim=c(min(0,min(y)),9))
outliers <- rbind(c(0.5,0), c(0,1), c(4,8))
rug(c(x,outliers[,1]), side=1)
rug(c(y,outliers[,2]), side=2)
points(outliers, pch=c(4,6,15), col=c("red","blue","green"))
all.points <- data.frame(rbind(outliers,xy))
@
\caption{Points marked with a red $\times$ and a blue triangle are outliers for the regression line through the main cloud of points, even though their $x$ and $y$ coordinates are quite typical of the marginal distributions (see rug plots along axes). The point marked by the green square, while an outlier along both axes, falls right along the regression line. (See the source file online for the figure-making code.)}
\label{fig:outliers}
\end{figure}

\clearpage

What effect do these different outliers have on a simple linear model here? Table \ref{table:some-estimates} shows the estimates we get from using just the black points, from adding only one of the three outlying points to the black points, and from using all the points. As promised, adding the red or blue points shifts the line, while adding the green point changes hardly anything at all.
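The kind of comparison behind Table \ref{table:some-estimates} is easy to do for any point you're suspicious of. Here is a minimal sketch on freshly simulated data (so the numbers won't match the table exactly; the suspect point's coordinates are arbitrary, chosen only for illustration):

<<>>=
set.seed(36401)   # arbitrary seed, just so the sketch is reproducible
x <- runif(20)
y <- 2*x + rnorm(n=length(x), mean=0, sd=0.2)
main.cloud <- data.frame(x=x, y=y)
suspect <- data.frame(x=4, y=0)          # far from the pattern of the cloud
fit.without <- lm(y ~ x, data=main.cloud)
fit.with <- lm(y ~ x, data=rbind(main.cloud, suspect))
# How much does including the one extra point move the coefficients?
coefficients(fit.with) - coefficients(fit.without)
@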
\begin{table}
<<echo=FALSE, results="asis">>=
some.coefs <- rbind(coefficients(lm(y~x, data=all.points[-(1:3),])),
                    coefficients(lm(y~x, data=all.points[-(2:3),])),
                    coefficients(lm(y~x, data=all.points[-c(1,3),])),
                    coefficients(lm(y~x, data=all.points[-(1:2),])),
                    coefficients(lm(y~x, data=all.points)))
rownames(some.coefs) <- c("black only","black+blue","black+red",
                          "black+green","all points")
kable(signif(some.coefs,3))
@
\caption{Estimates of the simple regression line from the black points in Figure \ref{fig:outliers}, plus re-estimates adding in various outliers.}
\label{table:some-estimates}
\end{table}

If we are worried that outliers might be messing up our model, we would like to quantify how much the estimates change if we add or remove individual data points. Fortunately, we can quantify this using only quantities we estimated on the complete data, especially the hat matrix.

\clearpage

\subsection{Examples with Simple Linear Regression}

To further build intuition, let's think about what happens with simple linear regression for a moment; that is, our model is
\begin{equation}
Y = \beta_0 + \beta_1 X + \epsilon
\end{equation}
with a single real-valued predictor variable $X$. When we estimate the coefficients by least squares, we know that
\begin{equation}
\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}
\end{equation}
Let us turn this around. The fitted value at $X=\overline{x}$ is
\begin{equation}
\hat{\beta}_0 + \hat{\beta}_1 \overline{x} = \overline{y}
\end{equation}
Suppose we had a data point, say the $i^\mathrm{th}$ point, where $X=\overline{x}$. Then the actual value of $y_i$ almost wouldn't matter for the fitted value there --- the regression line {\em has} to go through $\overline{y}$ at $\overline{x}$, never mind whether $y_i$ there is close to $\overline{y}$ or far away. If $x_i = \overline{x}$, we say that $y_i$ has little {\em leverage} over $\hat{m}_i$, or little {\em influence} on $\hat{m}_i$. It has {\em some} influence, because $y_i$ is part of what we average to get $\overline{y}$, but that's not a lot of influence.

Again, with simple linear regression, we know that
\begin{equation}
\hat{\beta}_1 = \frac{c_{XY}}{s^2_X}
\end{equation}
the ratio between the sample covariance of $X$ and $Y$ and the sample variance of $X$. How does $y_i$ show up in this? It's
\begin{equation}
\hat{\beta}_1 = \frac{n^{-1}\sum_{i=1}^{n}{(x_i - \overline{x})(y_i - \overline{y})}}{s^2_X}
\end{equation}
Notice that when $x_i = \overline{x}$, $y_i$ doesn't actually matter at all to the slope. If $x_i$ is far from $\overline{x}$, then $y_i - \overline{y}$ will contribute to the slope, and its contribution will get bigger (whether positive or negative) as $x_i - \overline{x}$ grows. $y_i$ will also make a big contribution to the slope when $y_i -\overline{y}$ is big (unless, again, $x_i =\overline{x}$).

Let's write a general formula for the predicted value, at an arbitrary point $X=x$.
\begin{eqnarray}
\hat{m}(x) & = & \hat{\beta}_0 + \hat{\beta}_1 x\\
& = & \overline{y} - \hat{\beta}_1 \overline{x} + \hat{\beta}_1 x\\
& = & \overline{y} + \hat{\beta}_1 (x - \overline{x})\\
& = & \overline{y} + \frac{1}{n}\frac{\sum_{i=1}^{n}{(x_i - \overline{x})(y_i-\overline{y})}}{s^2_X}(x-\overline{x})
\end{eqnarray}
So, in words:
\begin{itemize}
\item The predicted value is always a weighted average of all the $y_i$.
\item As $x_i$ moves away from $\overline{x}$, $y_i$ gets more weight (possibly a large negative weight). When $x_i = \overline{x}$, $y_i$ only matters because it contributes to the global mean $\overline{y}$.
\item The weights on all data points increase in magnitude when the point $x$ where we're trying to predict is far from $\overline{x}$. If $x=\overline{x}$, only $\overline{y}$ matters.
\end{itemize}

All of this is still true of the fitted values at the original data points:
\begin{itemize}
\item If $x_i$ is at $\overline{x}$, $y_i$ only matters for the fit because it contributes to $\overline{y}$.
\item As $x_i$ moves away from $\overline{x}$, in either direction, it makes a bigger contribution to {\em all} the fitted values.
\end{itemize}

Why is this happening? We get the coefficient estimates by minimizing the mean squared error, and the MSE treats all data points equally:
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{m}(x_i))^2}
\end{equation}
But we're not just using any old function $\hat{m}(x)$; we're using a linear function. This has only two parameters, so we can't change the predicted value to match each data point --- altering the parameters to bring $\hat{m}(x_i)$ closer to $y_i$ might actually increase the error elsewhere. By minimizing the over-all MSE with a linear function, we get two constraints,
\begin{equation}
\overline{y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{x}
\end{equation}
and
\begin{equation}
\sum_{i}{e_i (x_i - \overline{x})} = 0
\end{equation}
The first of these makes the regression line insensitive to $y_i$ values when $x_i$ is close to $\overline{x}$. The second makes the regression line {\em very} sensitive to residuals when $x_i - \overline{x}$ is big --- when $x_i - \overline{x}$ is large, a big residual ($e_i$ far from 0) is harder to balance out than if $x_i -\overline{x}$ were smaller.

So, let's sum this up.
\begin{itemize}
\item Least squares estimation tries to bring all the predicted values closer to $y_i$, but it can't match each data point at once, because the fitted values are all functions of the same coefficients.
\item If $x_i$ is close to $\overline{x}$, $y_i$ makes little difference to the coefficients or fitted values --- they're pinned down by needing to go through the mean of the data.
\item As $x_i$ moves away from $\overline{x}$, $y_i - \overline{y}$ makes a bigger and bigger impact on both the coefficients and on the fitted values.
\end{itemize}

If we worry that some point isn't falling on the same regression line as the others, we're really worrying that including it will throw off our estimate of the line. This is going to be a concern when $x_i$ is far from $\overline{x}$, or when the combination of $x_i-\overline{x}$ and $y_i -\overline{y}$ gives that point a disproportionate impact on the estimates. We should also be worried if the residual values are too big, but when asking what's ``too big'', we need to take into account the fact that the model will try harder to fit some points than others. A big residual at a point of high leverage is more of a red flag than an equal-sized residual at a point with little influence.

All of this will carry over to multiple regression models, but with more algebra to keep track of the different dimensions.
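To make these claims concrete, here is a minimal sketch on simulated data (the seed, the sample size, and the size of the perturbation are all arbitrary choices for illustration): it checks that the fitted line passes through $(\overline{x}, \overline{y})$, and that nudging a $y_i$ near $\overline{x}$ barely moves the slope, while the same nudge at an extreme $x_i$ moves it much more.

<<>>=
set.seed(36401)
x <- runif(50)
y <- 2*x + rnorm(n=length(x), mean=0, sd=0.2)
fit <- lm(y ~ x)
# The fitted line passes through the mean point (xbar, ybar):
predict(fit, newdata=data.frame(x=mean(x))) - mean(y)   # zero, up to rounding
# Compare the effect of the same vertical perturbation at two points,
# one near xbar and one far from it
near.mean <- which.min(abs(x - mean(x)))
far.out <- which.max(abs(x - mean(x)))
slope.change <- function(i, delta=1) {
    y.new <- y
    y.new[i] <- y.new[i] + delta
    coefficients(lm(y.new ~ x))[2] - coefficients(fit)[2]
}
slope.change(near.mean)   # small change in the slope
slope.change(far.out)     # much bigger change
@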
\section{Influence of Individual Data Points on Estimates}

Recall that our least-squares coefficient estimator is
\begin{equation}
\widehat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}
\end{equation}
from which we get our fitted values as
\begin{equation}
\widehat{\mathbf{m}} = \mathbf{x} \widehat{\mathbf{\beta}} = \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y} = \mathbf{H}\mathbf{y}
\end{equation}
with the hat matrix $\mathbf{H} \equiv \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T$. This leads to a very natural sense in which one observation might be more or less influential than another:
\begin{equation}
\frac{\partial \hat{\beta}_k}{\partial y_i} = \left((\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\right)_{ki}
\end{equation}
and
\begin{equation}
\frac{\partial \hat{m}_k}{\partial y_i} = H_{ki}
\end{equation}
If $y_i$ were different, it would change the estimates for all the coefficients and for all the fitted values. The rate at which the $k^{\mathrm{th}}$ coefficient or fitted value changes is given by the $ki^{\mathrm{th}}$ entry in these matrices --- matrices which, notice, are completely defined by the design matrix $\mathbf{x}$.

\subsection{Leverage}

$H_{ii}$ is the influence of $y_i$ on its own fitted value; it tells us how much of $\hat{m}_i$ is just $y_i$. This turns out to be a key quantity in looking for outliers, so we'll give it a special name, the {\bf leverage}. It is sometimes also written $h_i$. Once again, the leverage of the $i^{\mathrm{th}}$ data point doesn't depend on $y_i$, only on the design matrix.

Because the general linear regression model doesn't assume anything about the distribution of the predictors, other than that they're not collinear, we can't say definitely that some values of the leverage break model assumptions, or even are very unlikely under the model assumptions. But we can say some things about the leverage.

\paragraph{Average leverages} We showed in the homework that the trace of the hat matrix equals the number of coefficients we estimate:
\begin{equation}
\tr{\mathbf{H}} = p+1
\end{equation}
But the trace of any matrix is the sum of its diagonal entries,
\begin{equation}
\tr{\mathbf{H}} = \sum_{i=1}^{n}{H_{ii}}
\end{equation}
so the trace of the hat matrix is the sum of each point's leverage. The average leverage is therefore $\frac{p+1}{n}$. We don't expect every point to have exactly the same leverage, but if some points have much more than others, the regression function is going to be pulled towards fitting the high-leverage points, and the function will tend to ignore the low-leverage points.

\paragraph{Leverage vs.\ geometry} Let's center all the predictor variables, i.e., subtract off the mean of each predictor variable. Call this new vector of predictor variables $Z$, with the $n\times p$ matrix $\mathbf{z}$. This will not change any of the slopes, and will fix the intercept to be $\overline{y}$. The fitted values then come from
\begin{equation}
\hat{m}_i = \overline{y} + \frac{1}{n}(\mathbf{x}_i - \overline{\mathbf{x}})\Var{X}^{-1} \mathbf{z}^T \mathbf{y}
\end{equation}
This tells us that $y_i$ will have a lot of leverage if $(\mathbf{x}_i -\overline{\mathbf{x}})\Var{X}^{-1}(\mathbf{x}_i - \overline{\mathbf{x}})^T$ is big\footnote{This sort of thing --- take the difference between two vectors, multiply by an inverse variance matrix, and multiply by the difference vector again --- is called a {\bf Mahalanobis distance}.
As we will see in a moment, it gives more attention to differences along coordinates where the variance is small, and less attention to differences along coordinates where the variance is high.}. If the data point falls exactly at the mean of the predictors, $y_i$ matters only because it contributes to the over-all mean $\overline{y}$.

If the data point moves away from the mean of the predictors, not all directions count equally. Remember the eigen-decomposition of $\Var{X}$:
\begin{equation}
\Var{X} = \mathbf{V} \mathbf{U} \mathbf{V}^T
\end{equation}
where $\mathbf{V}$ is the matrix whose columns are the eigenvectors of $\Var{X}$, $\mathbf{V}^T = \mathbf{V}^{-1}$, and $\mathbf{U}$ is the diagonal matrix of the eigenvalues of $\Var{X}$. Each eigenvalue gives the variance of the predictors along the direction of the corresponding eigenvector. It follows that
\begin{equation}
\Var{X}^{-1} = \mathbf{V} \mathbf{U}^{-1} \mathbf{V}^T
\end{equation}
So if the data point is far from the center of the predictors along a high-variance direction, that doesn't count as much as being equally far along a low-variance direction\footnote{I have an unfortunate feeling that I said this backwards throughout the afternoon.}. Figure \ref{fig:hatmap} shows a distribution for two predictor variables we're very familiar with, together with the two eigenvectors from the variance matrix, and the corresponding surface of leverages.

\begin{figure}
<<echo=FALSE>>=
par(mfrow=c(1,2))
# Mobility data
mobility <- read.csv("~/teaching/mreg/15/dap/1/mobility.csv")
# Consider regressing mobility (or anything else) purely on geographic
# coordinates
# Where's the geographic center of the country?
mean.coords <- colMeans(mobility[,c("Longitude","Latitude")])
# Variance-covariance matrix of the coordinates
var.coords <- var(mobility[,c("Longitude","Latitude")])
# Eigen-decomposition
eigen.coords <- eigen(var.coords)
# Eigenvectors, scaled by the square roots of the eigenvalues
scaled.eigenvector.1 <- sqrt(eigen.coords$values[1])*eigen.coords$vectors[,1]
scaled.eigenvector.2 <- sqrt(eigen.coords$values[2])*eigen.coords$vectors[,2]
# Move +- along those eigenvectors away from the center
offsets <- rbind(mean.coords+scaled.eigenvector.1,
                 mean.coords+scaled.eigenvector.2,
                 mean.coords-scaled.eigenvector.1,
                 mean.coords-scaled.eigenvector.2)
# Plot all the communities in the country
plot(Latitude ~ Longitude, data=mobility, pch=19, cex=0.5, col="grey")
# Mark the center
points(x=mean.coords[1], y=mean.coords[2], pch=19)
# Draw arrows along the two axes
arrows(x0=rep(mean.coords[1],4), y0=rep(mean.coords[2],4),
       x1=offsets[,1], y1=offsets[,2])
# Now make a 3D plot
library(scatterplot3d)
# Coordinates versus hat-matrix values for the regression
scatterplot3d(mobility$Longitude, mobility$Latitude,
              hatvalues(lm(Mobility ~ Longitude+Latitude, data=mobility)),
              zlim=c(0,0.06), pch=19, cex.symbols=0.5,
              xlab="Longitude", ylab="Latitude", zlab=expression(H[ii]))
par(mfrow=c(1,1))
@
\caption{Left: The geographic coordinates of the communities from the \texttt{mobility} data, along with their mean, and arrows marking the eigenvectors of the variance-covariance matrix (lengths scaled by the square roots of the eigenvalues). Right: leverages for each point when regressing rates of economic mobility (or anything else) on latitude and longitude. See online for the code.}
\label{fig:hatmap}
\end{figure}

You may convince yourself that with one predictor variable, all of this collapses down to just $1/n + (x_i - \overline{x})^2/(n s^2_X)$ (Exercise \ref{exercise:mahalanobis-distance-in-1d}).
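As a sanity check on that formula, here is a small sketch on simulated data (the seed, sample size, and coefficients are arbitrary). Note that $s^2_X$ here is the version with denominator $n$, matching the convention used for $c_{XY}$ and $s^2_X$ above.

<<>>=
set.seed(36401)
x <- runif(40)
y <- 2*x + rnorm(n=length(x), mean=0, sd=0.2)  # the response is irrelevant to H
n <- length(x)
s2.X <- mean((x - mean(x))^2)    # sample variance with denominator n
leverage.formula <- 1/n + (x - mean(x))^2/(n*s2.X)
max(abs(hatvalues(lm(y ~ x)) - leverage.formula))   # zero, up to rounding
@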
This formula leads to plots which may be easier to grasp (Figure \ref{fig:H-mob-lm}).

\begin{figure}
<<echo=FALSE>>=
H.mob.lm <- hatvalues(lm(Mobility~Commute,data=mobility))
plot(mobility$Commute, H.mob.lm, ylim=c(0,max(H.mob.lm)),
     xlab="Fraction of workers with short commute",
     ylab=expression(H[ii]))
abline(h=2/nrow(mobility),col="grey")
rug(mobility$Commute,side=1)
@
\caption{Leverages ($H_{ii}$) for a simple regression of economic mobility (or anything else) against the fraction of workers with short commutes. The grey line marks the average we'd see if every point were exactly equally influential. Note how leverage increases automatically as \texttt{Commute} moves away from its mean in either direction. (See below for the \texttt{hatvalues} function.)}
\label{fig:H-mob-lm}
\end{figure}

One curious feature of the leverage, and of the hat matrix in general, is that it doesn't care {\em what} we are regressing on the predictor variables; it could be economic mobility or sightings of Bigfoot, and the same design matrix will give us the same hat matrix and leverages.

To sum up: The leverage of a data point just depends on the value of the predictors there; it increases as the point moves away from the mean of the predictors. It increases more if the difference is along low-variance coordinates, and less for differences along high-variance coordinates.

\section{Studentized Residuals}

We return once more to the hat matrix, the source of all knowledge.
\begin{equation}
\widehat{\mathbf{m}} = \mathbf{H}\mathbf{y}
\end{equation}
The residuals, too, depend only on the hat matrix:
\begin{equation}
\mathbf{e} = \mathbf{y} - \widehat{\mathbf{m}} = (\mathbf{I}-\mathbf{H})\mathbf{y}
\end{equation}
We know that the residuals vary randomly with the noise, so let's re-write this in terms of the noise (Exercise \ref{exercise:linear-functions-of-x-project-away}).
\begin{equation}
\mathbf{e} = (\mathbf{I}-\mathbf{H})\mathbf{\epsilon}
\end{equation}
Since $\Expect{\mathbf{\epsilon}} = \mathbf{0}$ and $\Var{\mathbf{\epsilon}} = \sigma^2\mathbf{I}$, we have
\begin{equation}
\Expect{\mathbf{e}} = \mathbf{0}
\end{equation}
and
\begin{equation}
\Var{\mathbf{e}} = \sigma^2 (\mathbf{I}-\mathbf{H}) (\mathbf{I}-\mathbf{H})^T = \sigma^2(\mathbf{I} - \mathbf{H})
\end{equation}
If we also assume that the noise is Gaussian, the residuals are Gaussian, with the stated mean and variance.

What does this imply for the residual at the $i^{\mathrm{th}}$ data point? It has expectation 0,
\begin{equation}
\Expect{e_i} = 0
\end{equation}
and it has a variance which depends on $i$ through the hat matrix:
\begin{equation}
\Var{e_i} = \sigma^2(\mathbf{I} - \mathbf{H})_{ii} = \sigma^2(1-H_{ii})
\end{equation}
In words: the bigger the leverage of $i$, the smaller the variance of the residual there. This is yet another sense in which points with high leverage are points which the model tries very hard to fit.

Previously, when we looked at the residuals, we expected them to all be of roughly the same magnitude. This rests on the leverages $H_{ii}$ being all about the same size. If there are substantial variations in leverage across the data points, it's better to scale the residuals by their expected size. The usual way to do this is through the {\bf standardized} or {\bf studentized residuals}
\begin{equation}
r_i \equiv \frac{e_i}{\hat{\sigma} \sqrt{1-H_{ii}}}
\end{equation}
Why ``studentized''?
Because we're dividing by an estimate of the standard error, just like in ``Student's'' $t$-test for differences in means.\footnote{The distribution here is however not quite a $t$-distribution, because, while $e_i$ has a Gaussian distribution and $\hat{\sigma}$ is the square root of a $\chi^2$-distributed variable, $e_i$ is actually used in computing $\hat{\sigma}$, hence they're not statistically independent. Rather, $r_i^2/(n-p)$ has a $\beta(\frac{1}{2}, \frac{1}{2}(n-p-1))$ distribution \citep[p.\ 267]{Seber-Lee-linear-reg-2nd}. This gives us studentized residuals which all have the same distribution, and that distribution does approach a Gaussian as $n\rightarrow \infty$ with $p$ fixed.}

All of the residual plots we've done before can also be done with the studentized residuals. In particular, the studentized residuals should look flat, with constant variance, when plotted against the fitted values or the predictors.

\section{Leave-One-Out}

Suppose we left out the $i^{\mathrm{th}}$ data point altogether. How much would that change the model?

\subsection{Fitted Values and Cross-Validated Residuals}

Let's take the fitted values first. The hat matrix, $\mathbf{H}$, is an $n\times n$ matrix. If we deleted the $i^{\mathrm{th}}$ observation when estimating the model, but still asked for a prediction at $\mathbf{x}_i$, we'd get a different, $n \times (n-1)$ matrix, say $\mathbf{H}^{(-i)}$. This in turn would lead to a new fitted value:
\begin{equation}
\label{eqn:leave-one-out-fitted-value}
\hat{m}^{(-i)}({\mathbf{x}_i}) = \frac{(\mathbf{H}\mathbf{y})_i - \mathbf{H}_{ii}y_i}{1-\mathbf{H}_{ii}}
\end{equation}
Basically, this is saying we can take the old fitted value, and then subtract off the part of it which came from having included the observation $y_i$ in the first place. Because each row of the hat matrix has to add up to $1$ (Exercise \ref{exercise:constant-rows}), we need to include the denominator (Exercise \ref{exercise:leave-one-out-lemma}).

The {\bf leave-one-out residual} is the difference between this and $y_i$:
\begin{equation}
e_i^{(-i)} \equiv y_i - \hat{m}^{(-i)}({\mathbf{x}_i})
\end{equation}
That is, this is how far off the model's prediction of $y_i$ would be if it didn't actually get to see $y_i$ during the estimation, but had to honestly predict it.

Leaving out the data point $i$ would give us an MSE of $\hat{\sigma}^2_{(-i)}$, and a little work says that
\begin{equation}
t_i \equiv \frac{e_i^{(-i)}}{\hat{\sigma}_{(-i)}\sqrt{1+\mathbf{x}_i^T(\mathbf{x}^T_{(-i)}\mathbf{x}_{(-i)})^{-1}\mathbf{x}_i}} \sim t_{n-p-1}
\end{equation}
These are called the {\bf cross-validated}, or {\bf jackknife}, or {\bf externally studentized}, residuals. (Some people use the name ``studentized residuals'' only for these, calling the others the ``standardized residuals''.) Fortunately, we can compute this without having to actually re-run the regression:
\begin{eqnarray}
t_i & = & \frac{e_i^{(-i)}}{\hat{\sigma}_{(-i)}\sqrt{1+\mathbf{x}_i^T(\mathbf{x}^T_{(-i)}\mathbf{x}_{(-i)})^{-1}\mathbf{x}_i}}\\
& = & \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1-H_{ii}}}\\
& = & r_i \sqrt{\frac{n-p-1}{n-p-r_i^2}}
\end{eqnarray}

\subsection{Cook's Distance}

Omitting point $i$ will generally change all of the fitted values, not just the fitted value at that point. We go from the vector of predictions $\widehat{\mathbf{m}}$ to $\widehat{\mathbf{m}}^{(-i)}$. How big a change is this? It's natural (by this point!)
to use the squared length of the difference vector,
\begin{equation}
\|\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)}\|^2 = (\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})^T(\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})
\end{equation}
To make this more comparable across data sets, it's conventional to divide this by $(p+1)\hat{\sigma}^2$, since there are really only $p+1$ independent coordinates here, each of which might contribute something on the order of $\hat{\sigma}^2$. This is called the {\bf Cook's distance} or {\bf Cook's statistic} for point $i$:
\begin{equation}
D_i = \frac{(\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})^T(\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})}{(p+1)\hat{\sigma}^2}
\end{equation}
As usual, there is a simplified formula, which evades having to re-fit the regression:
\begin{equation}
\label{eqn:cooks-statistic-from-hat-matrix}
D_i = \frac{1}{p+1}\frac{e_i^2}{\hat{\sigma}^2} \frac{H_{ii}}{(1-H_{ii})^2}
\end{equation}
Notice that $H_{ii}/(1-H_{ii})^2$ is a growing function of $H_{ii}$ (Figure \ref{fig:leverage-to-cooks-curve}). So this says that the total influence of a point over all the fitted values grows with both its leverage ($H_{ii}$) and the size of its residual when it is included ($e_i^2$).

\begin{figure}
<<echo=FALSE>>=
curve(x/(1-x)^2, from=0, to=1, xlab="Leverage H", ylab=expression(H/(1-H)^2))
@
\caption{Illustration of the function $H/(1-H)^2$ relating leverage $H$ to Cook's distance. Notice that leverage must be $\geq 0$ and $\leq 1$, so this is the whole relevant range of the curve.}
\label{fig:leverage-to-cooks-curve}
\end{figure}

\subsection{Coefficients}

The leave-one-out idea can also be applied to the coefficients. Write $\widehat{\mathbf{\beta}}^{(-i)}$ for the vector of coefficients we get when we drop the $i^{\mathrm{th}}$ data point; one can show \citep[p.\ 268]{Seber-Lee-linear-reg-2nd} that
\begin{equation}
\widehat{\mathbf{\beta}}^{(-i)} = \widehat{\mathbf{\beta}} - \frac{(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}_i^Te_i}{1-H_{ii}}
\end{equation}
Cook's distance can actually be computed from this, since the change in the vector of fitted values is $\mathbf{x}(\widehat{\mathbf{\beta}}^{(-i)} - \widehat{\mathbf{\beta}})$, so
\begin{equation}
\label{eqn:cooks-statistic-in-terms-of-change-in-beta}
D_i = \frac{(\widehat{\mathbf{\beta}}^{(-i)} - \widehat{\mathbf{\beta}})^T \mathbf{x}^T \mathbf{x}(\widehat{\mathbf{\beta}}^{(-i)} - \widehat{\mathbf{\beta}})}{(p+1)\hat{\sigma}^2}
\end{equation}

\subsection{Leave-More-Than-One-Out}

Sometimes, whole clusters of nearby points might be potential outliers. In such cases, removing just one of them might change the model very little, while removing them all might change it a great deal. Unfortunately there are ${n \choose k} = O(n^k)$ groups of $k$ points you could consider deleting at once, so while looking at all leave-one-out results is feasible, looking at all leave-two-out or leave-ten-out results is not. Instead, you have to think.

\section{Practically, and with R}

We have three ways of looking at whether points are outliers:
\begin{enumerate}
\item We can look at their leverage, which depends only on the value of the predictors.
\item We can look at their studentized residuals, either ordinary or cross-validated, which depend on how far they are from the regression line.
\item We can look at their Cook's statistics, which say how much removing each point shifts all the fitted values; it depends on the product of leverage and residuals.
\end{enumerate}

The model assumptions don't put any limit on how big the leverage can get (just that it's $\leq 1$ at each point) or on how it's distributed across the points (just that it's got to add up to $p+1$). Having most of the leverage in a few super-influential points doesn't break the model, exactly, but it should make us worry.

The model assumptions {\em do} say how the studentized residuals should be distributed. In particular, the cross-validated studentized residuals should follow a $t$ distribution. This is something we can test, either for specific points which we're worried about (say because they showed up on our diagnostic plots), or across all the points\footnote{Be careful about testing all the points. If you use a size $\alpha$ test and everything is fine, you'd see about $\alpha n$ rejections. A good, if not necessarily optimal, way to deal with this is to lower the threshold to $\alpha/n$ for each test --- another example of the Bonferroni correction from Lecture 18.}.

Because Cook's distance is related to how much the parameters change, the theory of confidence ellipsoids (Lecture 18) can be used to get some idea of how big a $D_i$ is worrying\footnote{Remember we saw that for large $n$, $(\widehat{\mathbf{\beta}}-\mathbf{\beta})^T \mathbf{\Sigma}^{-1} (\widehat{\mathbf{\beta}} - \mathbf{\beta}) \sim \chi^2_{p+1}$, where $\mathbf{\Sigma}$ is the variance matrix of the coefficient estimates. But that's $\sigma^2(\mathbf{x}^T\mathbf{x})^{-1}$, so we get $\sigma^{-2}(\widehat{\mathbf{\beta}}-\mathbf{\beta})^T \mathbf{x}^T \mathbf{x} (\widehat{\mathbf{\beta}} - \mathbf{\beta}) \sim \chi^2_{p+1}$. Now compare with Eq.\ \ref{eqn:cooks-statistic-in-terms-of-change-in-beta}.}. Cook's original rule-of-thumb translates into worrying when $(p+1)D_i$ is bigger than about $\chi^2_{p+1}(0.1)$, though the $0.1$ is arbitrary\footnote{More exactly, he used an $F$ distribution to take account of small-$n$ uncertainties in $\hat{\sigma}^2$, and suggested worrying when $D_i$ was bigger than $F_{p+1,n-p-1}(0.1)$. This will come to the same thing for large $n$.}. However, this is not really a hypothesis test.

\subsection{In R}

Almost everything we've talked about --- leverages, studentized residuals, Cook's statistics --- can be calculated using the \texttt{influence} function. However, there are more user-friendly functions which call that in turn, and are probably better to use.

Leverages come from the \texttt{hatvalues} function, or from the \texttt{hat} component of what \texttt{influence} returns:
<<eval=FALSE>>=
mob.lm <- lm(Mobility~Commute, data=mobility)
hatvalues(mob.lm)
influence(mob.lm)$hat # Same as previous line
@
The standardized, or internally-studentized, residuals $r_i$ are available with \texttt{rstandard}:
<<eval=FALSE>>=
rstandard(mob.lm)
# Same as previous line, but done by hand:
residuals(mob.lm)/(summary(mob.lm)$sigma*sqrt(1-hatvalues(mob.lm)))
@
The cross-validated or externally-studentized residuals $t_i$ are available with \texttt{rstudent}:
<<eval=FALSE>>=
rstudent(mob.lm) # Too tedious to calculate from rstandard, though you could
@
Cook's statistic is calculated with \texttt{cooks.distance}:
<<eval=FALSE>>=
cooks.distance(mob.lm)
@

Often the most useful thing to do with these is to plot them, and look at the most extreme points. (One might also rank them, and plot them against ranks.) Figure \ref{fig:outlier-plots} does so. The standardized and studentized residuals can also be put into our usual diagnostic plots, since they should average to zero and have constant variance when plotted against the fitted values or the predictors.
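If you want to reassure yourself that the short-cut formulas from the last two sections match what you would get by actually re-fitting, here is a minimal sketch which drops one observation, re-fits, and compares. (It assumes the \texttt{mobility} data frame loaded above; the choice of row 1 is arbitrary, and the \texttt{na.omit} step is only there to keep row indices aligned in case of missing values.)

<<>>=
mob <- na.omit(mobility[, c("Mobility", "Commute")])
full.fit <- lm(Mobility ~ Commute, data=mob)
i <- 1                                     # any row would do
drop.fit <- lm(Mobility ~ Commute, data=mob[-i,])
# Cross-validated residual at point i, found by actually re-fitting:
e.loo <- mob$Mobility[i] - predict(drop.fit, newdata=mob[i,])
X.drop <- model.matrix(drop.fit)
x.i <- c(1, mob$Commute[i])
se.loo <- summary(drop.fit)$sigma *
    sqrt(1 + t(x.i) %*% solve(t(X.drop) %*% X.drop) %*% x.i)
c(by.hand=as.numeric(e.loo/se.loo), shortcut=unname(rstudent(full.fit)[i]))
# Cook's statistic at point i, found by actually re-fitting:
D.by.hand <- sum((fitted(full.fit) - predict(drop.fit, newdata=mob))^2) /
    (2*summary(full.fit)$sigma^2)
c(by.hand=D.by.hand, shortcut=unname(cooks.distance(full.fit)[i]))
@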
(I omit those plots here because, in this case, $1/\sqrt{1-H_{ii}}$ is sufficiently close to 1 that it makes no visual difference.)

\begin{figure}
<<echo=FALSE>>=
par(mfrow=c(2,2))
mob.lm <- lm(Mobility~Commute,data=mobility)
plot(hatvalues(mob.lm), ylab="Leverage")
abline(h=2/nrow(mobility), col="grey")
plot(rstandard(mob.lm), ylab="Standardized residuals")
plot(rstudent(mob.lm), ylab="Cross-validated studentized residuals")
abline(h=qt(0.025,df=nrow(mobility)-2),col="red")
abline(h=qt(1-0.025,df=nrow(mobility)-2),col="red")
plot(cooks.distance(mob.lm), ylab="Cook's statistic")
abline(h=qchisq(0.1,2)/2,col="grey")
@
\caption{Leverages, two sorts of standardized residuals, and Cook's distance statistic for each point in a basic linear model of economic mobility as a function of the fraction of workers with short commutes. The horizontal line in the plot of leverages shows the average leverage. The lines in the plot of cross-validated studentized residuals show a 95\% $t$-distribution sampling interval. (What is the grey line in the plot of Cook's distances?) Note the clustering of extreme residuals {\em and} leverage around row 600, and another cluster of points with extreme residuals around row 400.}
\label{fig:outlier-plots}
\end{figure}

We can now look at exactly which points have the extreme values, say the 10 most extreme residuals, or largest Cook's statistics:
<<>>=
mobility[rank(-abs(rstudent(mob.lm)))<=10,]
mobility[rank(-abs(cooks.distance(mob.lm)))<=10,]
@

\subsection{\texttt{plot}}

We have not used the \texttt{plot} function on an \texttt{lm} object yet. This is because most of what it gives us is in fact related to residuals (Figure \ref{fig:plot-lm}). The first plot is of residuals versus fitted values, plus a smoothing line, with extreme residuals marked by row number. The second is a Q-Q plot of the standardized residuals, again with extremes marked by row number. The third shows the square root of the absolute standardized residuals against fitted values (ideally, flat); the fourth plots standardized residuals against leverage, with contour lines showing equal values of Cook's distance. There are many options, described in \texttt{help(plot.lm)}.

\begin{figure}
<<echo=FALSE>>=
par(mfrow=c(2,2))
plot(mob.lm)
par(mfrow=c(1,1))
@
\caption{The basic \texttt{plot} function applied to our running example model.}
\label{fig:plot-lm}
\end{figure}

\clearpage

\section{Responses to Outliers}

There are essentially three things to do when we're convinced there are outliers: delete them; change the model; or change how we estimate.

\subsection{Deletion}

Deleting data points should never be done lightly, but it is sometimes the right thing to do. The best case for removing a data point is when you have good reasons to think it's just wrong (and you have no way to fix it). Medical records which give a patient's blood pressure as 0, or their temperature as 200 degrees, are just impossible and have to be errors\footnote{This is true whether the temperature is in degrees Fahrenheit, degrees centigrade, or kelvins.}. Those points aren't giving you useful information about the process you're studying\footnote{Unless it's the very process of making errors of measurement and recording.}, so getting rid of them makes sense.

The next best case is if you have good reasons to think that the data point isn't {\em wrong}, exactly, but belongs to a different phenomenon or population from the one you're studying.
(You're trying to see if a new drug helps cancer patients, but you discover the hospital has included some burn patients and influenza cases as well.) Or the data point does belong to the right population, but also somehow to another one which isn't what you're interested in right now. (All of the data is on cancer patients, but some of them were also sick with the flu.) You should be careful about that last, though. (After all, some proportion of future cancer patients are also going to have the flu.)

The next best scenario after that is that there's nothing quite so definitely wrong about the data point, but it just looks really weird compared to all the others. Here you are really making a judgment call that either the data really are mistaken, or not from the right population, but you can't put your finger on a concrete reason why. The rules-of-thumb used to identify outliers, like ``Cook's distance shouldn't be too big'', or ``Tukey's rule''\footnote{Which flags any point more than 1.5 times the inter-quartile range above the third quartile, or below the first quartile, on any dimension.}, are at best of this sort. It is always more satisfying, and more reliable, if investigating how the data were gathered lets you turn cases of this sort into one of the two previous kinds.

The least good case for getting rid of data points which isn't just bogus is that you've got a model which almost works, and would work a lot better if you just got rid of a few stubborn points. This is really a sub-case of the previous one, with added special pleading on behalf of your favorite model. You are here basically trusting your model more than your data, so it had better be either a really good model or really bad data. Beyond this, we get into what can only be called ignoring inconvenient facts so that you get the answer you want.

\subsection{Changing the Model}

Outliers are points that break a pattern. This can be because the points are bad, or because we made a bad guess about the pattern. Figure \ref{fig:wrong-model} shows data where the cloud of points on the right consists of definite outliers for any linear model. But I drew those points following a quadratic model, and they fall perfectly along it (as they should). Deleting them, in order to make a linear model work better, would have been short-sighted at best.

\begin{figure}
<<echo=FALSE>>=
x <- sort(c(runif(n=30), runif(n=5, min=5.0, max=5.1)))
y <- x^2+rnorm(n=length(x),0,0.05)
par(mfrow=c(1,2))
plot(x,y)
abline(lm(y~x))
quad.coefs <- coefficients(lm(y~poly(x,2,raw=TRUE)))
curve(quad.coefs[1]+quad.coefs[2]*x+quad.coefs[3]*x^2,add=TRUE,col="blue")
legend("topleft",legend=c("data","linear","quadratic"),
       lty=c(NA,"solid","solid"), pch=c(1,NA,NA),
       col=c("black","black","blue"))
plot(hatvalues(lm(y~x)), ylab=expression(H[ii]))
points(hatvalues(lm(y~poly(x,2))), col="blue")
@
\caption{Left: the points in the upper right are outliers for any linear model fit through the main body of points, but they dominate the fitted line because of their very high leverage, so they would be flagged as outliers; all of the points, however, were generated from the same quadratic model (blue curve). Right: leverages of each point under the linear fit (black) and the quadratic fit (blue).}
\label{fig:wrong-model}
\end{figure}

The moral of Figure \ref{fig:wrong-model} is that data points can look like outliers because we're looking for the wrong pattern. If, when we find apparent outliers, we can't convince ourselves that the data are erroneous or irrelevant, we should consider changing our model before, or as well as, deleting them.
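To see how much the verdict depends on the assumed pattern, here is a minimal sketch on simulated data mimicking Figure \ref{fig:wrong-model} (the seed and the sample sizes are arbitrary). Under the linear model, deleting the far cluster changes the estimated slope drastically; under the quadratic model, deleting the same points changes the coefficients far less.

<<>>=
set.seed(36401)
x <- sort(c(runif(n=30), runif(n=5, min=5.0, max=5.1)))
y <- x^2 + rnorm(n=length(x), 0, 0.05)
df <- data.frame(x=x, y=y)
far.cluster <- which(df$x > 5)
# Slope of the linear model, with and without the far cluster:
c(with=coefficients(lm(y ~ x, data=df))[2],
  without=coefficients(lm(y ~ x, data=df[-far.cluster,]))[2])
# The quadratic model's coefficients change much less:
coefficients(lm(y ~ poly(x, 2, raw=TRUE), data=df)) -
    coefficients(lm(y ~ poly(x, 2, raw=TRUE), data=df[-far.cluster,]))
@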
\subsection{Robust Linear Regression}

A final alternative is to change how we estimate our model. Everything we've done has been based on ordinary least-squares (OLS) estimation. Because the squared error grows very rapidly with the error, OLS can be very strongly influenced by a few large ``vertical'' errors\footnote{Suppose there are 100 data points, and we start with parameter values where $e_1 > 10$, while $e_2$ through $e_{100}$ are all $0$. Changing to a new parameter value where $e_i=1$ for all $i$ actually reduces the MSE, even though it moves us away from perfectly fitting 99\% of the data points.}. We might, therefore, consider using not a different statistical model, but a different method of estimating its parameters.

Estimation techniques which are less influenced by outliers in the residuals than OLS are called {\bf robust estimators}, or (for regression models) {\bf robust regression}. Usually (though not always), robust estimation, like OLS, tries to minimize\footnote{Hence the name ``$M$-estimators''.} some average of a function of the errors:
\begin{equation}
\tilde{\mathbf{\beta}} = \argmin_{\mathbf{b}}{\frac{1}{n}\sum_{i=1}^{n}{\rho(y_i - \mathbf{x}_i \mathbf{b})}}
\end{equation}
Different choices of $\rho$, the {\bf loss function}, yield different estimators. $\rho(u) = |u|$ is {\bf least absolute deviation} (LAD) estimation\footnote{For minimizing absolute error, the scenario suggested in the previous footnote seems like a horrible idea: the average loss function goes from $0.1$ to $1.0$.}. $\rho(u) = u^2$ is OLS again. A popular compromise is to use {\bf Huber's} loss function\footnote{Often written $\psi$, since that's the symbol Huber used when he introduced it. Also, some people define it as $1/2$ of the way I have here; this way, though, it's identical to squared error for small $u$.}
\begin{equation}
\rho(u) = \left\{ \begin{array}{ll} u^2 & |u| \leq c\\ 2c|u| - c^2 & |u| \geq c \end{array} \right.
\end{equation}
Notice that Huber's loss looks like squared error for small errors, but like absolute error for large errors\footnote{If we set $c=1$ in our little scenario, the average loss would go from $0.19$ to $1.0$, a definite worsening.}. Huber's loss is designed to be continuous at $c$, and have a continuous first derivative there as well (which helps with optimization). We need to pick the scale $c$ at which it switches over from acting like squared error to acting like absolute error; this is usually done using a robust estimate of the noise standard deviation $\sigma$.

Robust estimation with Huber's loss can be conveniently done with the \texttt{rlm} function in the \texttt{MASS} package, which, as the name suggests, is designed to work very much like \texttt{lm}.
<<>>=
library(MASS)
summary(rlm(Mobility~Commute,data=mobility))
@

Robust linear regression is designed for the situation where it's still true that $Y = X\beta + \epsilon$, but the noise $\epsilon$ is not very close to Gaussian, and indeed is sometimes ``contaminated'' by wildly larger values. It does nothing to deal with non-linearity, or correlated noise, or even some points having excessive leverage because we're insisting on a linear model.
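To connect the $\rho$-minimization formula above to what \texttt{rlm} is doing, here is a minimal sketch which minimizes the average Huber loss directly with \texttt{optim}, on simulated, deliberately contaminated data. (The seed, the contamination, and the fixed tuning constant \texttt{c=1.345} are all arbitrary choices for illustration; \texttt{rlm} also re-estimates a scale for the residuals, so its answer need not agree exactly with this fixed-$c$ version.)

<<>>=
library(MASS)
set.seed(36401)
# A line plus Gaussian noise, with five wildly contaminated responses
n <- 100
x <- runif(n)
y <- 1 + 2*x + rnorm(n, sd=0.2)
y[1:5] <- y[1:5] + 10
# Huber's loss, as defined above, with a fixed threshold c
huber.loss <- function(u, c=1.345) {
    ifelse(abs(u) <= c, u^2, 2*c*abs(u) - c^2)
}
# Average Huber loss of the residuals, as a function of (intercept, slope)
mean.huber <- function(b, x, y) { mean(huber.loss(y - b[1] - b[2]*x)) }
# Minimize it numerically, starting from the OLS estimate
ols.coefs <- coefficients(lm(y ~ x))
huber.coefs <- optim(par=ols.coefs, fn=mean.huber, x=x, y=y)$par
rbind(ols=ols.coefs, huber.by.optim=huber.coefs, rlm=coefficients(rlm(y ~ x)))
@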
\section{Exercises} \begin{enumerate} \item \label{exercise:mahalanobis-distance-in-1d} Prove that in a simple linear regression \begin{equation} H_{ii} = \frac{1}{n}\left( 1 + \frac{(x_i - \overline{x})^2}{s^2_X}\right) \end{equation} \item \label{exercise:linear-functions-of-x-project-away} Show that $(\mathbf{I}-\mathbf{H})\mathbf{x}\mathbf{c} = 0$ for any matrix $\mathbf{c}$. \item \label{exercise:constant-rows} Every row of the hat matrix has entries that sum to 1. \begin{enumerate} \item Show that if all of the $y_i$ are equal, say $c$, then $\hat{\beta}_0 = c$ and all the estimated slopes are 0. \item Using the previous part, show that $\mathbf{1}$, the $n\times 1$ matrix of all 1s, must be an eigenvector of the hat matrix with eigenvalue 1, $\mathbf{H}\mathbf{1} = \mathbf{1}$. \item Using the previous part, show that the sum of each row of $\mathbf{H}$ must be 1, $\sum_{j=1}^{n}{H_{ij}} = 1$ for all $i$. \end{enumerate} \item \label{exercise:leave-one-out-lemma} {\em Fitted values after deleting a point} \begin{enumerate} \item (Easier) Presume that $\mathbf{H}^{(-i)}$ can be found by setting $\mathbf{H}^{(-i)}_{jk} = \mathbf{H}_{jk}/(1-\mathbf{H}_{ji})$. Prove Eq.\ \ref{eqn:leave-one-out-fitted-value}. \item (Challenging) Let $\mathbf{x}^{(-i)}$ be $\mathbf{x}$ with its $i^{\mathrm{th}}$ row removed. By construction, $\mathbf{H}^{(-i)}$, the $n\times (n-1)$ matrix which gives predictions at all of the original data points, is \begin{equation} \mathbf{H}^{(-i)} = \mathbf{x}{((\mathbf{x}^{(-i)})^T\mathbf{x}^{(-i)})}^{-1}(\mathbf{x}^{(-i)})^T \end{equation} Show that this matrix has the form claimed in the previous problem. \end{enumerate} \item (Challenging) Derive Eq.\ \ref{eqn:cooks-statistic-from-hat-matrix} for Cook's statistic from the definition. {\em Hint:} First, derive a formula for $\widehat{\mathbf{m}}^{(-i)}_{j}$ in terms of the hat matrix. Next, substitute in to the definition of $D_i$. Finally, you will need to use properties of the hat matrix to simplify. \end{enumerate} \bibliography{locusts} \bibliographystyle{crs} \end{document}