\documentclass{article}
\input{../common}
\begin{document}
\title{Lecture 20: Outliers and Influential Points}
\author{36-401, Fall 2015, Section B}
\date{5 November 2015}
\maketitle
\tableofcontents

<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE,
               options(show.signif.stars=FALSE))
@

\newpage

An {\bf outlier} is a data point which is very far, somehow, from the rest of the data. They are often worrisome, but not always a problem. When we are doing regression modeling, in fact, we don't really care about whether some data point is far from the rest of the data, but whether it breaks a pattern the rest of the data seems to follow.

Here, we'll first try to build some intuition for when outliers cause trouble in linear regression models. Then we'll look at some ways of quantifying how much influence particular data points have on the model; consider the strategy of pretending that inconvenient data doesn't exist; and take a brief look at the {\bf robust regression} strategy, of replacing least squares estimates with others which are less easily influenced.

\section{Outliers Are Data Points Which Break a Pattern}

Consider Figure \ref{fig:outliers}. The points marked in red and blue are clearly not like the main cloud of the data points, even though their $x$ and $y$ coordinates are quite typical of the data as a whole: the $x$ coordinates of those points aren't related to the $y$ coordinates in the right way; they break the pattern. On the other hand, the point marked in green, while its coordinates are very weird on both axes, does not break that pattern --- it was positioned to fall right on the regression line.

\begin{figure}[h]
<<echo=FALSE>>=
x <- runif(20)
y <- 2*x + rnorm(n=length(x),mean=0,sd=0.2)
xy <- cbind(x=x,y=y)
plot(xy, xlim=c(0,5), ylim=c(min(0,min(y)),9))
outliers <- rbind(c(0.5,0), c(0,1), c(4,8))
rug(c(x,outliers[,1]), side=1)
rug(c(y,outliers[,2]), side=2)
points(outliers, pch=c(4,6,15), col=c("red","blue","green"))
all.points <- data.frame(rbind(outliers,xy))
@
\caption{Points marked with a red $\times$ and a blue triangle are outliers for the regression line through the main cloud of points, even though their $x$ and $y$ coordinates are quite typical of the marginal distributions (see rug plots along axes). The point marked by the green square, while an outlier along both axes, falls right along the regression line. (See the source file online for the figure-making code.)}
\label{fig:outliers}
\end{figure}

\clearpage

What effect do these different outliers have on a simple linear model here? Table \ref{table:some-estimates} shows the estimates we get from using just the black points, from adding only one of the three outlying points to the black points, and from using all the points. As promised, adding the red or blue points shifts the line, while adding the green point changes hardly anything at all.
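The kind of comparison behind Table \ref{table:some-estimates} is easy to do for any point you're suspicious of. Here is a minimal sketch on freshly simulated data (so the numbers won't match the table exactly; the suspect point's coordinates are arbitrary, chosen only for illustration):

<<>>=
set.seed(36401)   # arbitrary seed, just so the sketch is reproducible
x <- runif(20)
y <- 2*x + rnorm(n=length(x), mean=0, sd=0.2)
main.cloud <- data.frame(x=x, y=y)
suspect <- data.frame(x=4, y=0)          # far from the pattern of the cloud
fit.without <- lm(y ~ x, data=main.cloud)
fit.with <- lm(y ~ x, data=rbind(main.cloud, suspect))
# How much does including the one extra point move the coefficients?
coefficients(fit.with) - coefficients(fit.without)
@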
\begin{table}
<<echo=FALSE, results="asis">>=
some.coefs <- rbind(coefficients(lm(y~x, data=all.points[-(1:3),])),
                    coefficients(lm(y~x, data=all.points[-(2:3),])),
                    coefficients(lm(y~x, data=all.points[-c(1,3),])),
                    coefficients(lm(y~x, data=all.points[-(1:2),])),
                    coefficients(lm(y~x, data=all.points)))
rownames(some.coefs) <- c("black only","black+blue","black+red",
                          "black+green","all points")
kable(signif(some.coefs,3))
@
\caption{Estimates of the simple regression line from the black points in Figure \ref{fig:outliers}, plus re-estimates adding in various outliers.}
\label{table:some-estimates}
\end{table}

If we are worried that outliers might be messing up our model, we would like to quantify how much the estimates change if we add or remove individual data points. Fortunately, we can quantify this using only quantities we estimated on the complete data, especially the hat matrix.

\clearpage

\subsection{Examples with Simple Linear Regression}

To further build intuition, let's think about what happens with simple linear regression for a moment; that is, our model is
\begin{equation}
Y = \beta_0 + \beta_1 X + \epsilon
\end{equation}
with a single real-valued predictor variable $X$. When we estimate the coefficients by least squares, we know that
\begin{equation}
\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}
\end{equation}
Let us turn this around. The fitted value at $X=\overline{x}$ is
\begin{equation}
\hat{\beta}_0 + \hat{\beta}_1 \overline{x} = \overline{y}
\end{equation}
Suppose we had a data point, say the $i^\mathrm{th}$ point, where $X=\overline{x}$. Then the actual value of $y_i$ almost wouldn't matter for the fitted value there --- the regression line {\em has} to go through $\overline{y}$ at $\overline{x}$, never mind whether $y_i$ there is close to $\overline{y}$ or far away. If $x_i = \overline{x}$, we say that $y_i$ has little {\em leverage} over $\hat{m}_i$, or little {\em influence} on $\hat{m}_i$. It has {\em some} influence, because $y_i$ is part of what we average to get $\overline{y}$, but that's not a lot of influence.

Again, with simple linear regression, we know that
\begin{equation}
\hat{\beta}_1 = \frac{c_{XY}}{s^2_X}
\end{equation}
the ratio between the sample covariance of $X$ and $Y$ and the sample variance of $X$. How does $y_i$ show up in this? It's
\begin{equation}
\hat{\beta}_1 = \frac{n^{-1}\sum_{i=1}^{n}{(x_i - \overline{x})(y_i - \overline{y})}}{s^2_X}
\end{equation}
Notice that when $x_i = \overline{x}$, $y_i$ doesn't actually matter at all to the slope. If $x_i$ is far from $\overline{x}$, then $y_i - \overline{y}$ will contribute to the slope, and its contribution will get bigger (whether positive or negative) as $x_i - \overline{x}$ grows. $y_i$ will also make a big contribution to the slope when $y_i -\overline{y}$ is big (unless, again, $x_i =\overline{x}$).

Let's write a general formula for the predicted value, at an arbitrary point $X=x$.
\begin{eqnarray}
\hat{m}(x) & = & \hat{\beta}_0 + \hat{\beta}_1 x\\
& = & \overline{y} - \hat{\beta}_1 \overline{x} + \hat{\beta}_1 x\\
& = & \overline{y} + \hat{\beta}_1 (x - \overline{x})\\
& = & \overline{y} + \frac{1}{n}\frac{\sum_{i=1}^{n}{(x_i - \overline{x})(y_i-\overline{y})}}{s^2_X}(x-\overline{x})
\end{eqnarray}
So, in words:
\begin{itemize}
\item The predicted value is always a weighted average of all the $y_i$.
\item As $x_i$ moves away from $\overline{x}$, $y_i$ gets more weight (possibly a large negative weight). When $x_i = \overline{x}$, $y_i$ only matters because it contributes to the global mean $\overline{y}$.
\item The weights on all data points increase in magnitude when the point $x$ where we're trying to predict is far from $\overline{x}$. If $x=\overline{x}$, only $\overline{y}$ matters.
\end{itemize}

All of this is still true of the fitted values at the original data points:
\begin{itemize}
\item If $x_i$ is at $\overline{x}$, $y_i$ only matters for the fit because it contributes to $\overline{y}$.
\item As $x_i$ moves away from $\overline{x}$, in either direction, it makes a bigger contribution to {\em all} the fitted values.
\end{itemize}

Why is this happening? We get the coefficient estimates by minimizing the mean squared error, and the MSE treats all data points equally:
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{m}(x_i))^2}
\end{equation}
But we're not just using any old function $\hat{m}(x)$; we're using a linear function. This has only two parameters, so we can't change the predicted value to match each data point --- altering the parameters to bring $\hat{m}(x_i)$ closer to $y_i$ might actually increase the error elsewhere. By minimizing the over-all MSE with a linear function, we get two constraints,
\begin{equation}
\overline{y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{x}
\end{equation}
and
\begin{equation}
\sum_{i}{e_i (x_i - \overline{x})} = 0
\end{equation}
The first of these makes the regression line insensitive to $y_i$ values when $x_i$ is close to $\overline{x}$. The second makes the regression line {\em very} sensitive to residuals when $x_i - \overline{x}$ is big --- when $x_i - \overline{x}$ is large, a big residual ($e_i$ far from 0) is harder to balance out than if $x_i -\overline{x}$ were smaller.

So, let's sum this up.
\begin{itemize}
\item Least squares estimation tries to bring all the predicted values closer to $y_i$, but it can't match each data point at once, because the fitted values are all functions of the same coefficients.
\item If $x_i$ is close to $\overline{x}$, $y_i$ makes little difference to the coefficients or fitted values --- they're pinned down by needing to go through the mean of the data.
\item As $x_i$ moves away from $\overline{x}$, $y_i - \overline{y}$ makes a bigger and bigger impact on both the coefficients and on the fitted values.
\end{itemize}

If we worry that some point isn't falling on the same regression line as the others, we're really worrying that including it will throw off our estimate of the line. This is going to be a concern when $x_i$ is far from $\overline{x}$, or when the combination of $x_i-\overline{x}$ and $y_i -\overline{y}$ gives that point a disproportionate impact on the estimates. We should also be worried if the residual values are too big, but when asking what's ``too big'', we need to take into account the fact that the model will try harder to fit some points than others. A big residual at a point of high leverage is more of a red flag than an equal-sized residual at a point with little influence.

All of this will carry over to multiple regression models, but with more algebra to keep track of the different dimensions.
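To make these claims concrete, here is a minimal sketch on simulated data (the seed, the sample size, and the size of the perturbation are all arbitrary choices for illustration): it checks that the fitted line passes through $(\overline{x}, \overline{y})$, and that nudging a $y_i$ near $\overline{x}$ barely moves the slope, while the same nudge at an extreme $x_i$ moves it much more.

<<>>=
set.seed(36401)
x <- runif(50)
y <- 2*x + rnorm(n=length(x), mean=0, sd=0.2)
fit <- lm(y ~ x)
# The fitted line passes through the mean point (xbar, ybar):
predict(fit, newdata=data.frame(x=mean(x))) - mean(y)   # zero, up to rounding
# Compare the effect of the same vertical perturbation at two points,
# one near xbar and one far from it
near.mean <- which.min(abs(x - mean(x)))
far.out <- which.max(abs(x - mean(x)))
slope.change <- function(i, delta=1) {
    y.new <- y
    y.new[i] <- y.new[i] + delta
    coefficients(lm(y.new ~ x))[2] - coefficients(fit)[2]
}
slope.change(near.mean)   # small change in the slope
slope.change(far.out)     # much bigger change
@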
\section{Influence of Individual Data Points on Estimates}

Recall that our least-squares coefficient estimator is
\begin{equation}
\widehat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}
\end{equation}
from which we get our fitted values as
\begin{equation}
\widehat{\mathbf{m}} = \mathbf{x} \widehat{\mathbf{\beta}} = \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y} = \mathbf{H}\mathbf{y}
\end{equation}
with the hat matrix $\mathbf{H} \equiv \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T$. This leads to a very natural sense in which one observation might be more or less influential than another:
\begin{equation}
\frac{\partial \hat{\beta}_k}{\partial y_i} = \left((\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\right)_{ki}
\end{equation}
and
\begin{equation}
\frac{\partial \hat{m}_k}{\partial y_i} = H_{ki}
\end{equation}
If $y_i$ were different, it would change the estimates for all the coefficients and for all the fitted values. The rate at which the $k^{\mathrm{th}}$ coefficient or fitted value changes is given by the $ki^{\mathrm{th}}$ entry in these matrices --- matrices which, notice, are completely defined by the design matrix $\mathbf{x}$.

\subsection{Leverage}

$H_{ii}$ is the influence of $y_i$ on its own fitted value; it tells us how much of $\hat{m}_i$ is just $y_i$. This turns out to be a key quantity in looking for outliers, so we'll give it a special name, the {\bf leverage}. It is sometimes also written $h_i$. Once again, the leverage of the $i^{\mathrm{th}}$ data point doesn't depend on $y_i$, only on the design matrix.

Because the general linear regression model doesn't assume anything about the distribution of the predictors, other than that they're not collinear, we can't say definitely that some values of the leverage break model assumptions, or even are very unlikely under the model assumptions. But we can say some things about the leverage.

\paragraph{Average leverages} We showed in the homework that the trace of the hat matrix equals the number of coefficients we estimate:
\begin{equation}
\tr{\mathbf{H}} = p+1
\end{equation}
But the trace of any matrix is the sum of its diagonal entries,
\begin{equation}
\tr{\mathbf{H}} = \sum_{i=1}^{n}{H_{ii}}
\end{equation}
so the trace of the hat matrix is the sum of each point's leverage. The average leverage is therefore $\frac{p+1}{n}$. We don't expect every point to have exactly the same leverage, but if some points have much more than others, the regression function is going to be pulled towards fitting the high-leverage points, and the function will tend to ignore the low-leverage points.

\paragraph{Leverage vs.\ geometry} Let's center all the predictor variables, i.e., subtract off the mean of each predictor variable. Call this new vector of predictor variables $Z$, with the $n\times p$ matrix $\mathbf{z}$. This will not change any of the slopes, and will fix the intercept to be $\overline{y}$. The fitted values then come from
\begin{equation}
\hat{m}_i = \overline{y} + \frac{1}{n}(\mathbf{x}_i - \overline{\mathbf{x}})\Var{X}^{-1} \mathbf{z}^T \mathbf{y}
\end{equation}
This tells us that $y_i$ will have a lot of leverage if $(\mathbf{x}_i -\overline{\mathbf{x}})\Var{X}^{-1}(\mathbf{x}_i - \overline{\mathbf{x}})^T$ is big\footnote{This sort of thing --- take the difference between two vectors, multiply by an inverse variance matrix, and multiply by the difference vector again --- is called a {\bf Mahalanobis distance}.
As we will see in a moment, it gives more attention to differences along coordinates where the variance is small, and less attention to differences along coordinates where the variance is high.}. If the data point falls exactly at the mean of the predictors, $y_i$ matters only because it contributes to the over-all mean $\overline{y}$.

If the data point moves away from the mean of the predictors, not all directions count equally. Remember the eigen-decomposition of $\Var{X}$:
\begin{equation}
\Var{X} = \mathbf{V} \mathbf{U} \mathbf{V}^T
\end{equation}
where $\mathbf{V}$ is the matrix whose columns are the eigenvectors of $\Var{X}$, $\mathbf{V}^T = \mathbf{V}^{-1}$, and $\mathbf{U}$ is the diagonal matrix of the eigenvalues of $\Var{X}$. Each eigenvalue gives the variance of the predictors along the direction of the corresponding eigenvector. It follows that
\begin{equation}
\Var{X}^{-1} = \mathbf{V} \mathbf{U}^{-1} \mathbf{V}^T
\end{equation}
So if the data point is far from the center of the predictors along a high-variance direction, that doesn't count as much as being equally far along a low-variance direction\footnote{I have an unfortunate feeling that I said this backwards throughout the afternoon.}. Figure \ref{fig:hatmap} shows a distribution for two predictor variables we're very familiar with, together with the two eigenvectors from the variance matrix, and the corresponding surface of leverages.

\begin{figure}
<<echo=FALSE>>=
par(mfrow=c(1,2))
# Mobility data
mobility <- read.csv("~/teaching/mreg/15/dap/1/mobility.csv")
# Consider regressing mobility (or anything else) purely on geographic
# coordinates
# Where's the geographic center of the country?
mean.coords <- colMeans(mobility[,c("Longitude","Latitude")])
# Variance-covariance matrix of the coordinates
var.coords <- var(mobility[,c("Longitude","Latitude")])
# Eigen-decomposition
eigen.coords <- eigen(var.coords)
# Eigenvectors, scaled by the square roots of the eigenvalues
scaled.eigenvector.1 <- sqrt(eigen.coords$values[1])*eigen.coords$vectors[,1]
scaled.eigenvector.2 <- sqrt(eigen.coords$values[2])*eigen.coords$vectors[,2]
# Move +- along those eigenvectors away from the center
offsets <- rbind(mean.coords+scaled.eigenvector.1,
                 mean.coords+scaled.eigenvector.2,
                 mean.coords-scaled.eigenvector.1,
                 mean.coords-scaled.eigenvector.2)
# Plot all the communities in the country
plot(Latitude ~ Longitude, data=mobility, pch=19, cex=0.5, col="grey")
# Mark the center
points(x=mean.coords[1], y=mean.coords[2], pch=19)
# Draw arrows along the two axes
arrows(x0=rep(mean.coords[1],4), y0=rep(mean.coords[2],4),
       x1=offsets[,1], y1=offsets[,2])
# Now make a 3D plot
library(scatterplot3d)
# Coordinates versus hat-matrix values for the regression
scatterplot3d(mobility$Longitude, mobility$Latitude,
              hatvalues(lm(Mobility ~ Longitude+Latitude, data=mobility)),
              zlim=c(0,0.06), pch=19, cex.symbols=0.5,
              xlab="Longitude", ylab="Latitude", zlab=expression(H[ii]))
par(mfrow=c(1,1))
@
\caption{Left: The geographic coordinates of the communities from the \texttt{mobility} data, along with their mean, and arrows marking the eigenvectors of the variance-covariance matrix (lengths scaled by the square roots of the eigenvalues). Right: leverages for each point when regressing rates of economic mobility (or anything else) on latitude and longitude. See online for the code.}
\label{fig:hatmap}
\end{figure}

You may convince yourself that with one predictor variable, all of this collapses down to just $1/n + (x_i - \overline{x})^2/(n s^2_X)$ (Exercise \ref{exercise:mahalanobis-distance-in-1d}).
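As a sanity check on that formula, here is a small sketch on simulated data (the seed, sample size, and coefficients are arbitrary). Note that $s^2_X$ here is the version with denominator $n$, matching the convention used for $c_{XY}$ and $s^2_X$ above.

<<>>=
set.seed(36401)
x <- runif(40)
y <- 2*x + rnorm(n=length(x), mean=0, sd=0.2)  # the response is irrelevant to H
n <- length(x)
s2.X <- mean((x - mean(x))^2)    # sample variance with denominator n
leverage.formula <- 1/n + (x - mean(x))^2/(n*s2.X)
max(abs(hatvalues(lm(y ~ x)) - leverage.formula))   # zero, up to rounding
@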
This formula leads to plots which may be easier to grasp (Figure \ref{fig:H-mob-lm}).

\begin{figure}
<<echo=FALSE>>=
H.mob.lm <- hatvalues(lm(Mobility~Commute,data=mobility))
plot(mobility$Commute, H.mob.lm, ylim=c(0,max(H.mob.lm)),
     xlab="Fraction of workers with short commute",
     ylab=expression(H[ii]))
abline(h=2/nrow(mobility),col="grey")
rug(mobility$Commute,side=1)
@
\caption{Leverages ($H_{ii}$) for a simple regression of economic mobility (or anything else) against the fraction of workers with short commutes. The grey line marks the average we'd see if every point were exactly equally influential. Note how leverage increases automatically as \texttt{Commute} moves away from its mean in either direction. (See below for the \texttt{hatvalues} function.)}
\label{fig:H-mob-lm}
\end{figure}

One curious feature of the leverage, and of the hat matrix in general, is that it doesn't care {\em what} we are regressing on the predictor variables; it could be economic mobility or sightings of Bigfoot, and the same design matrix will give us the same hat matrix and leverages.

To sum up: The leverage of a data point just depends on the value of the predictors there; it increases as the point moves away from the mean of the predictors. It increases more if the difference is along low-variance coordinates, and less for differences along high-variance coordinates.

\section{Studentized Residuals}

We return once more to the hat matrix, the source of all knowledge.
\begin{equation}
\widehat{\mathbf{m}} = \mathbf{H}\mathbf{y}
\end{equation}
The residuals, too, depend only on the hat matrix:
\begin{equation}
\mathbf{e} = \mathbf{y} - \widehat{\mathbf{m}} = (\mathbf{I}-\mathbf{H})\mathbf{y}
\end{equation}
We know that the residuals vary randomly with the noise, so let's re-write this in terms of the noise (Exercise \ref{exercise:linear-functions-of-x-project-away}).
\begin{equation}
\mathbf{e} = (\mathbf{I}-\mathbf{H})\mathbf{\epsilon}
\end{equation}
Since $\Expect{\mathbf{\epsilon}} = \mathbf{0}$ and $\Var{\mathbf{\epsilon}} = \sigma^2\mathbf{I}$, we have
\begin{equation}
\Expect{\mathbf{e}} = \mathbf{0}
\end{equation}
and
\begin{equation}
\Var{\mathbf{e}} = \sigma^2 (\mathbf{I}-\mathbf{H}) (\mathbf{I}-\mathbf{H})^T = \sigma^2(\mathbf{I} - \mathbf{H})
\end{equation}
If we also assume that the noise is Gaussian, the residuals are Gaussian, with the stated mean and variance.

What does this imply for the residual at the $i^{\mathrm{th}}$ data point? It has expectation 0,
\begin{equation}
\Expect{e_i} = 0
\end{equation}
and it has a variance which depends on $i$ through the hat matrix:
\begin{equation}
\Var{e_i} = \sigma^2(\mathbf{I} - \mathbf{H})_{ii} = \sigma^2(1-H_{ii})
\end{equation}
In words: the bigger the leverage of $i$, the smaller the variance of the residual there. This is yet another sense in which points with high leverage are points which the model tries very hard to fit.

Previously, when we looked at the residuals, we expected them to all be of roughly the same magnitude. This rests on the leverages $H_{ii}$ being all about the same size. If there are substantial variations in leverage across the data points, it's better to scale the residuals by their expected size. The usual way to do this is through the {\bf standardized} or {\bf studentized residuals}
\begin{equation}
r_i \equiv \frac{e_i}{\hat{\sigma} \sqrt{1-H_{ii}}}
\end{equation}
Why ``studentized''?
Because we're dividing by an estimate of the standard error, just like in ``Student's'' $t$-test for differences in means.\footnote{The distribution here is however not quite a $t$-distribution, because, while $e_i$ has a Gaussian distribution and $\hat{\sigma}$ is the square root of a $\chi^2$-distributed variable, $e_i$ is actually used in computing $\hat{\sigma}$, hence they're not statistically independent. Rather, $r_i^2/(n-p)$ has a $\beta(\frac{1}{2}, \frac{1}{2}(n-p-1))$ distribution \citep[p.\ 267]{Seber-Lee-linear-reg-2nd}. This gives us studentized residuals which all have the same distribution, and that distribution does approach a Gaussian as $n\rightarrow \infty$ with $p$ fixed.}

All of the residual plots we've done before can also be done with the studentized residuals. In particular, the studentized residuals should look flat, with constant variance, when plotted against the fitted values or the predictors.

\section{Leave-One-Out}

Suppose we left out the $i^{\mathrm{th}}$ data point altogether. How much would that change the model?

\subsection{Fitted Values and Cross-Validated Residuals}

Let's take the fitted values first. The hat matrix, $\mathbf{H}$, is an $n\times n$ matrix. If we deleted the $i^{\mathrm{th}}$ observation when estimating the model, but still asked for a prediction at $\mathbf{x}_i$, we'd get a different, $n \times (n-1)$ matrix, say $\mathbf{H}^{(-i)}$. This in turn would lead to a new fitted value:
\begin{equation}
\label{eqn:leave-one-out-fitted-value}
\hat{m}^{(-i)}({\mathbf{x}_i}) = \frac{(\mathbf{H}\mathbf{y})_i - \mathbf{H}_{ii}y_i}{1-\mathbf{H}_{ii}}
\end{equation}
Basically, this is saying we can take the old fitted value, and then subtract off the part of it which came from having included the observation $y_i$ in the first place. Because each row of the hat matrix has to add up to $1$ (Exercise \ref{exercise:constant-rows}), we need to include the denominator (Exercise \ref{exercise:leave-one-out-lemma}).

The {\bf leave-one-out residual} is the difference between this and $y_i$:
\begin{equation}
e_i^{(-i)} \equiv y_i - \hat{m}^{(-i)}({\mathbf{x}_i})
\end{equation}
That is, this is how far off the model's prediction of $y_i$ would be if it didn't actually get to see $y_i$ during the estimation, but had to honestly predict it.

Leaving out the data point $i$ would give us an MSE of $\hat{\sigma}^2_{(-i)}$, and a little work says that
\begin{equation}
t_i \equiv \frac{e_i^{(-i)}}{\hat{\sigma}_{(-i)}\sqrt{1+\mathbf{x}_i^T(\mathbf{x}^T_{(-i)}\mathbf{x}_{(-i)})^{-1}\mathbf{x}_i}} \sim t_{n-p-1}
\end{equation}
These are called the {\bf cross-validated}, or {\bf jackknife}, or {\bf externally studentized}, residuals. (Some people use the name ``studentized residuals'' only for these, calling the others the ``standardized residuals''.) Fortunately, we can compute this without having to actually re-run the regression:
\begin{eqnarray}
t_i & = & \frac{e_i^{(-i)}}{\hat{\sigma}_{(-i)}\sqrt{1+\mathbf{x}_i^T(\mathbf{x}^T_{(-i)}\mathbf{x}_{(-i)})^{-1}\mathbf{x}_i}}\\
& = & \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1-H_{ii}}}\\
& = & r_i \sqrt{\frac{n-p-1}{n-p-r_i^2}}
\end{eqnarray}

\subsection{Cook's Distance}

Omitting point $i$ will generally change all of the fitted values, not just the fitted value at that point. We go from the vector of predictions $\widehat{\mathbf{m}}$ to $\widehat{\mathbf{m}}^{(-i)}$. How big a change is this? It's natural (by this point!)
to use the squared length of the difference vector,
\begin{equation}
\|\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)}\|^2 = (\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})^T(\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})
\end{equation}
To make this more comparable across data sets, it's conventional to divide this by $(p+1)\hat{\sigma}^2$, since there are really only $p+1$ independent coordinates here, each of which might contribute something on the order of $\hat{\sigma}^2$. This is called the {\bf Cook's distance} or {\bf Cook's statistic} for point $i$:
\begin{equation}
D_i = \frac{(\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})^T(\widehat{\mathbf{m}} - \widehat{\mathbf{m}}^{(-i)})}{(p+1)\hat{\sigma}^2}
\end{equation}
As usual, there is a simplified formula, which evades having to re-fit the regression:
\begin{equation}
\label{eqn:cooks-statistic-from-hat-matrix}
D_i = \frac{1}{p+1}\frac{e_i^2}{\hat{\sigma}^2} \frac{H_{ii}}{(1-H_{ii})^2}
\end{equation}
Notice that $H_{ii}/(1-H_{ii})^2$ is a growing function of $H_{ii}$ (Figure \ref{fig:leverage-to-cooks-curve}). So this says that the total influence of a point over all the fitted values grows with both its leverage ($H_{ii}$) and the size of its residual when it is included ($e_i^2$).

\begin{figure}
<<echo=FALSE>>=
curve(x/(1-x)^2, from=0, to=1, xlab="Leverage H", ylab=expression(H/(1-H)^2))
@
\caption{Illustration of the function $H/(1-H)^2$ relating leverage $H$ to Cook's distance. Notice that leverage must be $\geq 0$ and $\leq 1$, so this is the whole relevant range of the curve.}
\label{fig:leverage-to-cooks-curve}
\end{figure}

\subsection{Coefficients}

The leave-one-out idea can also be applied to the coefficients. Write $\widehat{\mathbf{\beta}}^{(-i)}$ for the vector of coefficients we get when we drop the $i^{\mathrm{th}}$ data point; one can show \citep[p.\ 268]{Seber-Lee-linear-reg-2nd} that
\begin{equation}
\widehat{\mathbf{\beta}}^{(-i)} = \widehat{\mathbf{\beta}} - \frac{(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}_i^Te_i}{1-H_{ii}}
\end{equation}
Cook's distance can actually be computed from this, since the change in the vector of fitted values is $\mathbf{x}(\widehat{\mathbf{\beta}}^{(-i)} - \widehat{\mathbf{\beta}})$, so
\begin{equation}
\label{eqn:cooks-statistic-in-terms-of-change-in-beta}
D_i = \frac{(\widehat{\mathbf{\beta}}^{(-i)} - \widehat{\mathbf{\beta}})^T \mathbf{x}^T \mathbf{x}(\widehat{\mathbf{\beta}}^{(-i)} - \widehat{\mathbf{\beta}})}{(p+1)\hat{\sigma}^2}
\end{equation}

\subsection{Leave-More-Than-One-Out}

Sometimes, whole clusters of nearby points might be potential outliers. In such cases, removing just one of them might change the model very little, while removing them all might change it a great deal. Unfortunately there are ${n \choose k} = O(n^k)$ groups of $k$ points you could consider deleting at once, so while looking at all leave-one-out results is feasible, looking at all leave-two-out or leave-ten-out results is not. Instead, you have to think.

\section{Practically, and with R}

We have three ways of looking at whether points are outliers:
\begin{enumerate}
\item We can look at their leverage, which depends only on the value of the predictors.
\item We can look at their studentized residuals, either ordinary or cross-validated, which depend on how far they are from the regression line.
\item We can look at their Cook's statistics, which say how much removing each point shifts all the fitted values; it depends on the product of leverage and residuals.
\end{enumerate}

The model assumptions don't put any limit on how big the leverage can get (just that it's $\leq 1$ at each point) or on how it's distributed across the points (just that it's got to add up to $p+1$). Having most of the leverage in a few super-influential points doesn't break the model, exactly, but it should make us worry.

The model assumptions {\em do} say how the studentized residuals should be distributed. In particular, the cross-validated studentized residuals should follow a $t$ distribution. This is something we can test, either for specific points which we're worried about (say because they showed up on our diagnostic plots), or across all the points\footnote{Be careful about testing all the points. If you use a size $\alpha$ test and everything is fine, you'd see about $\alpha n$ rejections. A good, if not necessarily optimal, way to deal with this is to lower the threshold to $\alpha/n$ for each test --- another example of the Bonferroni correction from Lecture 18.}.

Because Cook's distance is related to how much the parameters change, the theory of confidence ellipsoids (Lecture 18) can be used to get some idea of how big a $D_i$ is worrying\footnote{Remember we saw that for large $n$, $(\widehat{\mathbf{\beta}}-\mathbf{\beta})^T \mathbf{\Sigma}^{-1} (\widehat{\mathbf{\beta}} - \mathbf{\beta}) \sim \chi^2_{p+1}$, where $\mathbf{\Sigma}$ is the variance matrix of the coefficient estimates. But that's $\sigma^2(\mathbf{x}^T\mathbf{x})^{-1}$, so we get $\sigma^{-2}(\widehat{\mathbf{\beta}}-\mathbf{\beta})^T \mathbf{x}^T \mathbf{x} (\widehat{\mathbf{\beta}} - \mathbf{\beta}) \sim \chi^2_{p+1}$. Now compare with Eq.\ \ref{eqn:cooks-statistic-in-terms-of-change-in-beta}.}. Cook's original rule-of-thumb translates into worrying when $(p+1)D_i$ is bigger than about $\chi^2_{p+1}(0.1)$, though the $0.1$ is arbitrary\footnote{More exactly, he used an $F$ distribution to take account of small-$n$ uncertainties in $\hat{\sigma}^2$, and suggested worrying when $D_i$ was bigger than $F_{p+1,n-p-1}(0.1)$. This will come to the same thing for large $n$.}. However, this is not really a hypothesis test.

\subsection{In R}

Almost everything we've talked about --- leverages, studentized residuals, Cook's statistics --- can be calculated using the \texttt{influence} function. However, there are more user-friendly functions which call that in turn, and are probably better to use.

Leverages come from the \texttt{hatvalues} function, or from the \texttt{hat} component of what \texttt{influence} returns:
<<eval=FALSE>>=
mob.lm <- lm(Mobility~Commute, data=mobility)
hatvalues(mob.lm)
influence(mob.lm)$hat # Same as previous line
@
The standardized, or internally-studentized, residuals $r_i$ are available with \texttt{rstandard}:
<<eval=FALSE>>=
rstandard(mob.lm)
# Same as previous line, but done by hand:
residuals(mob.lm)/(summary(mob.lm)$sigma*sqrt(1-hatvalues(mob.lm)))
@
The cross-validated or externally-studentized residuals $t_i$ are available with \texttt{rstudent}:
<<eval=FALSE>>=
rstudent(mob.lm) # Too tedious to calculate from rstandard, though you could
@
Cook's statistic is calculated with \texttt{cooks.distance}:
<<eval=FALSE>>=
cooks.distance(mob.lm)
@

Often the most useful thing to do with these is to plot them, and look at the most extreme points. (One might also rank them, and plot them against ranks.) Figure \ref{fig:outlier-plots} does so. The standardized and studentized residuals can also be put into our usual diagnostic plots, since they should average to zero and have constant variance when plotted against the fitted values or the predictors.
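If you want to reassure yourself that the short-cut formulas from the last two sections match what you would get by actually re-fitting, here is a minimal sketch which drops one observation, re-fits, and compares. (It assumes the \texttt{mobility} data frame loaded above; the choice of row 1 is arbitrary, and the \texttt{na.omit} step is only there to keep row indices aligned in case of missing values.)

<<>>=
mob <- na.omit(mobility[, c("Mobility", "Commute")])
full.fit <- lm(Mobility ~ Commute, data=mob)
i <- 1                                     # any row would do
drop.fit <- lm(Mobility ~ Commute, data=mob[-i,])
# Cross-validated residual at point i, found by actually re-fitting:
e.loo <- mob$Mobility[i] - predict(drop.fit, newdata=mob[i,])
X.drop <- model.matrix(drop.fit)
x.i <- c(1, mob$Commute[i])
se.loo <- summary(drop.fit)$sigma *
    sqrt(1 + t(x.i) %*% solve(t(X.drop) %*% X.drop) %*% x.i)
c(by.hand=as.numeric(e.loo/se.loo), shortcut=unname(rstudent(full.fit)[i]))
# Cook's statistic at point i, found by actually re-fitting:
D.by.hand <- sum((fitted(full.fit) - predict(drop.fit, newdata=mob))^2) /
    (2*summary(full.fit)$sigma^2)
c(by.hand=D.by.hand, shortcut=unname(cooks.distance(full.fit)[i]))
@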
(I omit those plots here because, in this case, $1/\sqrt{1-H_{ii}}$ is sufficiently close to 1 that it makes no visual difference.)

\begin{figure}
<<echo=FALSE>>=
par(mfrow=c(2,2))
mob.lm <- lm(Mobility~Commute,data=mobility)
plot(hatvalues(mob.lm), ylab="Leverage")
abline(h=2/nrow(mobility), col="grey")
plot(rstandard(mob.lm), ylab="Standardized residuals")
plot(rstudent(mob.lm), ylab="Cross-validated studentized residuals")
abline(h=qt(0.025,df=nrow(mobility)-2),col="red")
abline(h=qt(1-0.025,df=nrow(mobility)-2),col="red")
plot(cooks.distance(mob.lm), ylab="Cook's statistic")
abline(h=qchisq(0.1,2)/2,col="grey")
@
\caption{Leverages, two sorts of standardized residuals, and Cook's distance statistic for each point in a basic linear model of economic mobility as a function of the fraction of workers with short commutes. The horizontal line in the plot of leverages shows the average leverage. The lines in the plot of cross-validated studentized residuals show a 95\% $t$-distribution sampling interval. (What is the grey line in the plot of Cook's distances?) Note the clustering of extreme residuals {\em and} leverage around row 600, and another cluster of points with extreme residuals around row 400.}
\label{fig:outlier-plots}
\end{figure}

We can now look at exactly which points have the extreme values, say the 10 most extreme residuals, or largest Cook's statistics:
<<>>=
mobility[rank(-abs(rstudent(mob.lm)))<=10,]
mobility[rank(-abs(cooks.distance(mob.lm)))<=10,]
@

\subsection{\texttt{plot}}

We have not used the \texttt{plot} function on an \texttt{lm} object yet. This is because most of what it gives us is in fact related to residuals (Figure \ref{fig:plot-lm}). The first plot is of residuals versus fitted values, plus a smoothing line, with extreme residuals marked by row number. The second is a Q-Q plot of the standardized residuals, again with extremes marked by row number. The third shows the square root of the absolute standardized residuals against fitted values (ideally, flat); the fourth plots standardized residuals against leverage, with contour lines showing equal values of Cook's distance. There are many options, described in \texttt{help(plot.lm)}.

\begin{figure}
<<echo=FALSE>>=
par(mfrow=c(2,2))
plot(mob.lm)
par(mfrow=c(1,1))
@
\caption{The basic \texttt{plot} function applied to our running example model.}
\label{fig:plot-lm}
\end{figure}

\clearpage

\section{Responses to Outliers}

There are essentially three things to do when we're convinced there are outliers: delete them; change the model; or change how we estimate.

\subsection{Deletion}

Deleting data points should never be done lightly, but it is sometimes the right thing to do. The best case for removing a data point is when you have good reasons to think it's just wrong (and you have no way to fix it). Medical records which give a patient's blood pressure as 0, or their temperature as 200 degrees, are just impossible and have to be errors\footnote{This is true whether the temperature is in degrees Fahrenheit, degrees centigrade, or kelvins.}. Those points aren't giving you useful information about the process you're studying\footnote{Unless it's the very process of making errors of measurement and recording.}, so getting rid of them makes sense.

The next best case is if you have good reasons to think that the data point isn't {\em wrong}, exactly, but belongs to a different phenomenon or population from the one you're studying.
(You're trying to see if a new drug helps cancer patients, but you discover the hospital has included some burn patients and influenza cases as well.) Or the data point does belong to the right population, but also somehow to another one which isn't what you're interested in right now. (All of the data is on cancer patients, but some of them were also sick with the flu.) You should be careful about that last, though. (After all, some proportion of future cancer patients are also going to have the flu.)

The next best scenario after that is that there's nothing quite so definitely wrong about the data point, but it just looks really weird compared to all the others. Here you are really making a judgment call that either the data really are mistaken, or not from the right population, but you can't put your finger on a concrete reason why. The rules-of-thumb used to identify outliers, like ``Cook's distance shouldn't be too big'', or ``Tukey's rule''\footnote{Which flags any point more than 1.5 times the inter-quartile range above the third quartile, or below the first quartile, on any dimension.}, are at best of this sort. It is always more satisfying, and more reliable, if investigating how the data were gathered lets you turn cases of this sort into one of the two previous kinds.

The least good case for getting rid of data points which isn't just bogus is that you've got a model which almost works, and would work a lot better if you just got rid of a few stubborn points. This is really a sub-case of the previous one, with added special pleading on behalf of your favorite model. You are here basically trusting your model more than your data, so it had better be either a really good model or really bad data. Beyond this, we get into what can only be called ignoring inconvenient facts so that you get the answer you want.

\subsection{Changing the Model}

Outliers are points that break a pattern. This can be because the points are bad, or because we made a bad guess about the pattern. Figure \ref{fig:wrong-model} shows data where the cloud of points on the right consists of definite outliers for any linear model. But I drew those points following a quadratic model, and they fall perfectly along it (as they should). Deleting them, in order to make a linear model work better, would have been short-sighted at best.

\begin{figure}
<<echo=FALSE>>=
x <- sort(c(runif(n=30), runif(n=5, min=5.0, max=5.1)))
y <- x^2+rnorm(n=length(x),0,0.05)
par(mfrow=c(1,2))
plot(x,y)
abline(lm(y~x))
quad.coefs <- coefficients(lm(y~poly(x,2,raw=TRUE)))
curve(quad.coefs[1]+quad.coefs[2]*x+quad.coefs[3]*x^2,add=TRUE,col="blue")
legend("topleft",legend=c("data","linear","quadratic"),
       lty=c(NA,"solid","solid"), pch=c(1,NA,NA),
       col=c("black","black","blue"))
plot(hatvalues(lm(y~x)), ylab=expression(H[ii]))
points(hatvalues(lm(y~poly(x,2))), col="blue")
@
\caption{Left: the points in the upper right are outliers for any linear model fit through the main body of points, but they dominate the fitted line because of their very high leverage, so they would be flagged as outliers; all of the points, however, were generated from the same quadratic model (blue curve). Right: leverages of each point under the linear fit (black) and the quadratic fit (blue).}
\label{fig:wrong-model}
\end{figure}

The moral of Figure \ref{fig:wrong-model} is that data points can look like outliers because we're looking for the wrong pattern. If, when we find apparent outliers, we can't convince ourselves that the data are erroneous or irrelevant, we should consider changing our model before, or as well as, deleting them.
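To see how much the verdict depends on the assumed pattern, here is a minimal sketch on simulated data mimicking Figure \ref{fig:wrong-model} (the seed and the sample sizes are arbitrary). Under the linear model, deleting the far cluster changes the estimated slope drastically; under the quadratic model, deleting the same points changes the coefficients far less.

<<>>=
set.seed(36401)
x <- sort(c(runif(n=30), runif(n=5, min=5.0, max=5.1)))
y <- x^2 + rnorm(n=length(x), 0, 0.05)
df <- data.frame(x=x, y=y)
far.cluster <- which(df$x > 5)
# Slope of the linear model, with and without the far cluster:
c(with=coefficients(lm(y ~ x, data=df))[2],
  without=coefficients(lm(y ~ x, data=df[-far.cluster,]))[2])
# The quadratic model's coefficients change much less:
coefficients(lm(y ~ poly(x, 2, raw=TRUE), data=df)) -
    coefficients(lm(y ~ poly(x, 2, raw=TRUE), data=df[-far.cluster,]))
@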
\subsection{Robust Linear Regression}

A final alternative is to change how we estimate our model. Everything we've done has been based on ordinary least-squares (OLS) estimation. Because the squared error grows very rapidly with the error, OLS can be very strongly influenced by a few large ``vertical'' errors\footnote{Suppose there are 100 data points, and we start with parameter values where $e_1 > 10$, while $e_2$ through $e_{100}$ are all $0$. Changing to a new parameter value where $e_i=1$ for all $i$ actually reduces the MSE, even though it moves us away from perfectly fitting 99\% of the data points.}. We might, therefore, consider using not a different statistical model, but a different method of estimating its parameters.

Estimation techniques which are less influenced by outliers in the residuals than OLS are called {\bf robust estimators}, or (for regression models) {\bf robust regression}. Usually (though not always), robust estimation, like OLS, tries to minimize\footnote{Hence the name ``$M$-estimators''.} some average of a function of the errors:
\begin{equation}
\tilde{\mathbf{\beta}} = \argmin_{\mathbf{b}}{\frac{1}{n}\sum_{i=1}^{n}{\rho(y_i - \mathbf{x}_i \mathbf{b})}}
\end{equation}
Different choices of $\rho$, the {\bf loss function}, yield different estimators. $\rho(u) = |u|$ is {\bf least absolute deviation} (LAD) estimation\footnote{For minimizing absolute error, the scenario suggested in the previous footnote seems like a horrible idea: the average loss function goes from $0.1$ to $1.0$.}. $\rho(u) = u^2$ is OLS again. A popular compromise is to use {\bf Huber's} loss function\footnote{Often written $\psi$, since that's the symbol Huber used when he introduced it. Also, some people define it as $1/2$ of the way I have here; this way, though, it's identical to squared error for small $u$.}
\begin{equation}
\rho(u) = \left\{ \begin{array}{ll} u^2 & |u| \leq c\\ 2c|u| - c^2 & |u| \geq c \end{array} \right.
\end{equation}
Notice that Huber's loss looks like squared error for small errors, but like absolute error for large errors\footnote{If we set $c=1$ in our little scenario, the average loss would go from $0.19$ to $1.0$, a definite worsening.}. Huber's loss is designed to be continuous at $c$, and have a continuous first derivative there as well (which helps with optimization). We need to pick the scale $c$ at which it switches over from acting like squared error to acting like absolute error; this is usually done using a robust estimate of the noise standard deviation $\sigma$.

Robust estimation with Huber's loss can be conveniently done with the \texttt{rlm} function in the \texttt{MASS} package, which, as the name suggests, is designed to work very much like \texttt{lm}.
<<>>=
library(MASS)
summary(rlm(Mobility~Commute,data=mobility))
@

Robust linear regression is designed for the situation where it's still true that $Y = X\beta + \epsilon$, but the noise $\epsilon$ is not very close to Gaussian, and indeed is sometimes ``contaminated'' by wildly larger values. It does nothing to deal with non-linearity, or correlated noise, or even some points having excessive leverage because we're insisting on a linear model.
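To connect the $\rho$-minimization formula above to what \texttt{rlm} is doing, here is a minimal sketch which minimizes the average Huber loss directly with \texttt{optim}, on simulated, deliberately contaminated data. (The seed, the contamination, and the fixed tuning constant \texttt{c=1.345} are all arbitrary choices for illustration; \texttt{rlm} also re-estimates a scale for the residuals, so its answer need not agree exactly with this fixed-$c$ version.)

<<>>=
library(MASS)
set.seed(36401)
# A line plus Gaussian noise, with five wildly contaminated responses
n <- 100
x <- runif(n)
y <- 1 + 2*x + rnorm(n, sd=0.2)
y[1:5] <- y[1:5] + 10
# Huber's loss, as defined above, with a fixed threshold c
huber.loss <- function(u, c=1.345) {
    ifelse(abs(u) <= c, u^2, 2*c*abs(u) - c^2)
}
# Average Huber loss of the residuals, as a function of (intercept, slope)
mean.huber <- function(b, x, y) { mean(huber.loss(y - b[1] - b[2]*x)) }
# Minimize it numerically, starting from the OLS estimate
ols.coefs <- coefficients(lm(y ~ x))
huber.coefs <- optim(par=ols.coefs, fn=mean.huber, x=x, y=y)$par
rbind(ols=ols.coefs, huber.by.optim=huber.coefs, rlm=coefficients(rlm(y ~ x)))
@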
\section{Exercises} \begin{enumerate} \item \label{exercise:mahalanobis-distance-in-1d} Prove that in a simple linear regression \begin{equation} H_{ii} = \frac{1}{n}\left( 1 + \frac{(x_i - \overline{x})^2}{s^2_X}\right) \end{equation} \item \label{exercise:linear-functions-of-x-project-away} Show that $(\mathbf{I}-\mathbf{H})\mathbf{x}\mathbf{c} = 0$ for any matrix $\mathbf{c}$. \item \label{exercise:constant-rows} Every row of the hat matrix has entries that sum to 1. \begin{enumerate} \item Show that if all of the $y_i$ are equal, say $c$, then $\hat{\beta}_0 = c$ and all the estimated slopes are 0. \item Using the previous part, show that $\mathbf{1}$, the $n\times 1$ matrix of all 1s, must be an eigenvector of the hat matrix with eigenvalue 1, $\mathbf{H}\mathbf{1} = \mathbf{1}$. \item Using the previous part, show that the sum of each row of $\mathbf{H}$ must be 1, $\sum_{j=1}^{n}{H_{ij}} = 1$ for all $i$. \end{enumerate} \item \label{exercise:leave-one-out-lemma} {\em Fitted values after deleting a point} \begin{enumerate} \item (Easier) Presume that $\mathbf{H}^{(-i)}$ can be found by setting $\mathbf{H}^{(-i)}_{jk} = \mathbf{H}_{jk}/(1-\mathbf{H}_{ji})$. Prove Eq.\ \ref{eqn:leave-one-out-fitted-value}. \item (Challenging) Let $\mathbf{x}^{(-i)}$ be $\mathbf{x}$ with its $i^{\mathrm{th}}$ row removed. By construction, $\mathbf{H}^{(-i)}$, the $n\times (n-1)$ matrix which gives predictions at all of the original data points, is \begin{equation} \mathbf{H}^{(-i)} = \mathbf{x}{((\mathbf{x}^{(-i)})^T\mathbf{x}^{(-i)})}^{-1}(\mathbf{x}^{(-i)})^T \end{equation} Show that this matrix has the form claimed in the previous problem. \end{enumerate} \item (Challenging) Derive Eq.\ \ref{eqn:cooks-statistic-from-hat-matrix} for Cook's statistic from the definition. {\em Hint:} First, derive a formula for $\widehat{\mathbf{m}}^{(-i)}_{j}$ in terms of the hat matrix. Next, substitute in to the definition of $D_i$. Finally, you will need to use properties of the hat matrix to simplify. \end{enumerate} \bibliography{locusts} \bibliographystyle{crs} \end{document}