\documentclass{article}
\input{../common.tex}
\begin{document}
\title{Lecture 7: Diagnostics and Modifications for Simple Regression}
\author{36-401, Fall 2015, Section B}
\date{22 September 2015}
\maketitle
<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE)
@
\tableofcontents
In the previous lectures, we have laid out what regression analysis is for, why
we use statistical models to do it, the assumptions of the simple linear
regression model, and estimation and prediction for the simple regression model
using both the method of maximum likelihood and the method of least squares.
We could, at this point, follow the traditional route in a class like this of
plunging into detailed inferential calculations for the model (this is how you
find a confidence interval for such-and-such, etc.). However, those
calculations have little point if the assumptions for the model do not hold.
Thus, we will look at model checking {\em first}, and in later lectures go on
to the inferential theory.
It is somewhat more traditional to talk about ``diagnostics'' for the
regression model, rather than ``model checking''; which name is best depends on
how formal you want to sound. Just as in medicine, a good diagnostic procedure
is one which will {\em differentiate} between health (for us: the model
assumptions hold) and illness (the assumptions are violated). Indeed, a really
good diagnostic procedure will distinguish between different sorts of illness
(tell us {\em which} assumptions are violated). More mathematically, we want
to find consequences of the assumptions which will hold no matter what the
parameter values might be, but would be hard to achieve if the assumptions
fail. (Checking whether $\sum_{i}{\hat{\beta}_0 + \hat{\beta}_1 x_i} =
\sum_{i}{y_i}$ is not a useful diagnostic, because while that will be true if
the assumptions are all right, it would still be true if they were all
wrong\footnote{I could give you a demonstration of that equation, but that
would spoil one of the problems in the current homework.}.) The simplest and
most useful checks all involve the residuals.
\section{The Residuals}
In previous lectures, we defined the {\bf residual} at the $i^{\mathrm{th}}$
data point as the difference between the realized value of the response $y_i$
and what the estimated model would predict:
\[
e_i = y_i - \hat{m}(x_i)
\]
which, for the simple linear regression model, is just
\[
e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)
\]
\paragraph{Residuals vs.\ the noise}
The residuals are very closely related to the noise variables $\epsilon_i$,
but with some important differences. If we take the basic equation for the simple linear model, $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, we can re-arrange it to read
\[
\epsilon_i = Y_i - ( \beta_0 + \beta_1 x_i)
\]
This has the same form as the equation for the residuals, but it involves
the true parameters, not their estimates. If we take that equation for the $i^{\mathrm{th}}$ residual and substitute in the equation for $Y_i$,
\begin{eqnarray}
e_i & = & \beta_0 + \beta_1 x_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) + \epsilon_i\\
& = & (\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1) x_i + \epsilon_i
\end{eqnarray}
The terms in parentheses on the right-hand side are hopefully small, but
they're not (in general) zero.
\paragraph{Residuals as weighted sums of noise}
To understand what's going on with the residuals, it's helpful to write them
as a weighted sum of the $\epsilon$'s. Going back to previous
lectures,
\begin{eqnarray}
\hat{\beta}_0 & = & \beta_0 + \frac{1}{n}\sum_{i=1}^{n}{\left(1 -\overline{x} \frac{x_i - \overline{x}}{s^2_X} \right) \epsilon_i}\\
\hat{\beta}_1 & = & \beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i}
\end{eqnarray}
Substitute these in to the equation for $e_i$:
\begin{eqnarray}
e_i & = & \epsilon_i - \frac{1}{n}\sum_{j=1}^{n}{\left(1 -\overline{x} \frac{x_j - \overline{x}}{s^2_X} \right) \epsilon_j} - x_i \sum_{j=1}^{n}{\frac{x_j - \overline{x}}{ns^2_X}\epsilon_j} \\
& = & \sum_{j=1}^{n}{\left( \delta_{ij} - \frac{1}{n} - (x_i - \overline{x})\frac{x_j - \overline{x}}{ns^2_X}\right) \epsilon_j} \label{eqn:residual-as-sum-of-noise}
\end{eqnarray}
using the ``Kronecker delta'' ($\delta_{ij} =0$ if $i\neq j$, $=1$ if $i=j$)
and some algebra. The factor in parentheses is a weight which gets applied to
each $\epsilon_j$ when summing them up --- a weight which depends on the $x$'s
alone. Let's abbreviate this weight by $c_{ij}$.
One of the assumptions of the simple linear model is that $\Expect{\epsilon_i|X} = 0$. Since we've shown that $e_i$ is a weighted sum of $\epsilon_j$,
it follows that
\[
\Expect{e_i|X} = \sum_{j}{c_{ij}\Expect{\epsilon_j|X}} = 0
\]
Since the simple linear model assumes that the $\epsilon$'s are uncorrelated
and all have variance $\sigma^2$, even conditional on $X$,
\begin{eqnarray}
\Var{e_i|X} & = & \sum_{j}{\Var{c_{ij}\epsilon_j|X}}\\
& = & \sum_{j}{c_{ij}^2 \Var{\epsilon_j|X}}\\
& = & \sigma^2\sum_{j=1}^{n}{c_{ij}^2}
\end{eqnarray}
(I will not bother writing out the sum explicitly.) From here, one can go
on to show
\[
\Var{e_i} = \frac{n-2}{n}\sigma^2
\]
though again I omit the details, so as not to spoil a future assignment.
If we make the Gaussian noise assumption, the $\epsilon_j$ are independent
Gaussians; since a weighted sum of independent Gaussians is itself Gaussian, it
follows that each $e_i$ also has a Gaussian distribution.
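A small simulation can confirm the variance formula numerically (this is a sanity check, not a derivation, so no homework spoilers): averaging over the data points, the mean squared residual should come out near $\frac{n-2}{n}\sigma^2$. The following R sketch uses made-up parameter values.

```r
# Simulation check that the average of Var(e_i) is ((n-2)/n) sigma^2;
# all parameter values here are invented for illustration
set.seed(401)
n <- 20; sigma <- 2
x <- runif(n, 0, 10)
avg.sq.resid <- replicate(2000, {
    y <- 3 + 1.5*x + rnorm(n, sd=sigma)   # simulate from the model
    mean(residuals(lm(y ~ x))^2)          # average squared residual
})
mean(avg.sq.resid)       # should be close to the next line
((n - 2)/n)*sigma^2
```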
\paragraph{Contrast between residuals and noise terms}
The sum of the noise terms which produced the data is rarely zero.
The {\em expectation value} of the sum of the noise is zero,
\[
\Expect{\sum_{i=1}^{n}{\epsilon_i}} = \sum_{i=1}^{n}{\Expect{\epsilon_i}} = 0
\]
but the {\em variance} is not:
\[
\Var{\sum_{i=1}^{n}{\epsilon_i}} = \sum_{i=1}^{n}{\Var{\epsilon_i}} = n\sigma^2
\]
so the sum of the noise terms can't be exactly zero all the time:
\[
\sum_{i=1}^{n}{\epsilon_i} \neq 0
\]
(Indeed, if the $\epsilon_i$ follow a Gaussian distribution, then
$\sum_{i=1}^{n}{\epsilon_i} \sim N(0,n\sigma^2)$, and the probability that
$\sum_{i=1}^{n}{\epsilon_i} = 0$ is zero, not one.) Similarly, while the
$\epsilon$ are uncorrelated with $X$,
\[
\Cov{X,\epsilon} = \Expect{X\epsilon}-\Expect{\epsilon}\Expect{X} = 0
\]
the $\epsilon_i$ don't have a zero {\em sample} correlation with $X$:
\[
\sum_{i=1}^{n}{\epsilon_i (x_i-\overline{x})} \neq 0
\]
On the other hand, such equations do hold exactly and deterministically for the
residuals. In particular,
\[
\sum_{i=1}^{n}{e_i} = 0
\]
when the simple linear model is estimated by least squares, no matter what.
This is because this equation is a consequence of the estimating equations and
the estimating equations alone --- it applies all the time, on any data set,
not just with probability 1 but without exception. Similarly,
\[
\sum_{i=1}^{n}{(x_i - \overline{x}) e_i} = 0
\]
also follows from the estimating equations. I will not go over the
derivations, so as not to spoil the homework you are currently doing.
(If we think of $(e_1, e_2, \ldots e_n)$ as an $n$-dimensional vector, these
two equations tell us that not every vector is possible. Only vectors which
live in an $(n-2)$-dimensional subspace are allowed. This is, so to speak, the
same $n-2$ as in $\Var{e_i}$.)
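These two identities are easy to verify numerically (which is not the same as deriving them, so again no spoilers): fit a least-squares line to any small data set and compute both sums. The numbers below are made up.

```r
# The least-squares residuals satisfy the two estimating-equation
# identities exactly, up to floating-point rounding (toy data)
x <- c(1, 2, 4, 7, 9)
y <- c(2.1, 3.9, 8.2, 13.8, 18.5)
e <- residuals(lm(y ~ x))
sum(e)                   # essentially zero
sum((x - mean(x))*e)     # also essentially zero
```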
These two equations imply that even when the $\epsilon$'s are independent, the
residuals are not. (After all, if we know all but one residual, the last one
is completely determined.) However, the dependence is typically very slight
and subtle, and it gets weaker as $n$ grows (because each $e_i$ is making a
comparatively-small contribution to the sum that must be zero). So the
residuals should show only negligible correlation (Exercise
\ref{exercise:correlation-between-residuals}).
\subsection{Summary on Properties of the Residuals}
Let's sum up the most relevant observations from the last couple of paragraphs.
\begin{enumerate}
\item The residuals should have expectation zero, conditional on $x$,
$\Expect{e_i|X=x} = 0$. (The residuals should also have an over-all sample
mean of exactly zero.)
\item The residuals should show a constant variance, unchanging with $x$.
\item The residuals can't be completely uncorrelated with each other, but the
correlation should be extremely weak, and grow negligible as
$n\rightarrow\infty$.
\item If the noise is Gaussian, the residuals should also be Gaussian.
\end{enumerate}
Each one of these points leads to a diagnostic, to a check on the model.
These checks take the form of four plots, which you should always, always make
for any regression you run.
\begin{figure}
<<>>=
# Load the data set from the last homework
library(gamair)
data(chicago)
# Plot deaths each day vs. temperature
plot(death ~ tmpd, data=chicago,
     xlab="Temperature (Fahrenheit)",
     ylab="Mortality (deaths/day)")
# Estimate and store a linear model
death.temp.lm <- lm(death ~ tmpd, data=chicago)
abline(death.temp.lm, col="blue")
@
\caption{Plot of the data from the last homework, used as a running example,
along with the estimated linear model (in blue).}
\end{figure}
\clearpage
\subsection{Plot the Residuals Against the Predictor}
Make a scatter-plot with the residuals on the vertical axis and the predictor
variable on the horizontal axis. Because $\Expect{e|X=x} = 0$, and
$\Var{e|X=x}$ is constant, this should, ideally, look like a constant-width
blur of points around a straight, flat line at height zero. Deviations from
this --- changing width, curvature, substantial regions of the $x$ axis where
the average residuals are either positive or negative --- are all signs that
the model is mis-specified. In particular, curved or stepped patterns indicate
that $\Expect{e|X=x} \neq 0$, which in turn means that $\Expect{\epsilon|X=x}
\neq 0$, which means that the simple-linear part of the simple linear
regression model is wrong\footnote{See \S \ref{sec:misspecification}.}. One
needs either more predictor variables (getting us into multiple regression), or
a different functional form for the regression (getting us into nonlinear
regression), or both.
\begin{figure}
<<>>=
# Always plot residuals vs. predictor variable
plot(chicago$tmpd, residuals(death.temp.lm),
     xlab="Temperature (F)",
     ylab="Prediction error (deaths/day)")
abline(h=0, col="grey")
@
\caption{Residuals (vertical axis) vs.\ the predictor variable of temperature
(horizontal axis).}
\end{figure}
\clearpage
\paragraph{Plotting Against Another Variable} If you have other potential
predictor variables, you should be able to plot the residual against them, and
also see a flat line around zero. If not, that's an indication that the other
variable does in fact help predict the response, and so you should probably
incorporate that in your model. (Part of the residuals from your simple linear
model are really the contributions of that other predictor, which you were
treating as noise out of ignorance\footnote{Whether there really is such a
thing as ``noise'', or only predictor variables we have neglected, is a deep
and subtle question we don't have to answer here.}.) In particular, if you
make such a plot and you see the points in it fall around a straight line,
that's an excellent sign that you need a multiple linear regression model.
(Exercise: make a plot of the residuals from this model against one
of the pollution variables from the data set. Does it look like noise?)
\subsection{Plot the Magnitude of the Residuals Against the Predictor}
Because $\Expect{e|X=x} = 0$, $\Var{e|X=x} = \Expect{e^2|X=x}$. (Why?) This
means we can check whether the variance of the residuals is constant by
plotting the {\em squared} residuals against the predictor variable. This
should give a scatter of points around a flat line, whose height should be
around the in-sample MSE. Regions of the $x$ axis where the residuals are
persistently above or below this level are evidence of a problem with the
simple linear regression model. This could be due to non-constant noise
variance (``heteroskedasticity'', in the jargon), or due to getting the
functional form of the regression wrong. One can often get a clue as to what
is driving the problem by looking to see whether the regions where the squared
residuals are too big are also regions where the residuals are persistently
above or below zero.
\begin{figure}
<<>>=
# Always plot residuals^2 vs. predictor variable
plot(chicago$tmpd, residuals(death.temp.lm)^2,
     xlab="Temperature (F)",
     ylab="Squared prediction error (deaths/day)^2")
abline(h=mean(residuals(death.temp.lm)^2),
       lwd=2, col="grey")
@
\caption{Squared residuals vs.\ temperature.}
\end{figure}
Sometimes, particularly when the model is not doing so well, squaring the
residuals leads to a visually uninformative plot, because big residuals lead to
really, really big squared residuals, and it's hard to make out any detail. A
common fix is to then plot the absolute value of the residuals, with the
reference horizontal line being at the square root of the mean squared error.
\begin{figure}
<<>>=
plot(chicago$tmpd, abs(residuals(death.temp.lm)),
     xlab="Temperature (F)",
     ylab="Absolute prediction error (deaths/day)")
abline(h=sqrt(mean(residuals(death.temp.lm)^2)),
       lwd=2, col="grey")
@
\caption{Absolute residuals vs.\ temperature; plotting these rather than the
squared residuals reduces the {\em visual} impact of the few huge residuals.}
\end{figure}
\clearpage
\subsection{Plot the Residuals Against Coordinates and Each Other}
Lots of the time, our data were collected in a certain order --- each data
point has some coordinates, in space or in time or both. Under the simple
linear regression model, these shouldn't matter\footnote{Unless the predictor
variable is one of the coordinates, of course.}, so you should always plot
the residuals against the coordinates. (Even if you have no coordinates, you
should always plot residuals against the row numbers of the data set.)
Clusters of nearby observations with unusually high or low residuals are a bad
sign.
\begin{figure}
<<>>=
# Always plot residuals vs. coordinate
plot(as.Date(chicago$time, origin="1993-12-31"),
     residuals(death.temp.lm), xlab="Date",
     ylab="Prediction error (deaths/day)")
@
\caption{Residuals vs.\ date.}
\end{figure}
Of course, a certain amount of apparent clustering will happen naturally in any
random process, so if this looks worrisome, one should really do some sort of
formal test. Fortunately, there are lots of good test procedures for finding
``runs'' in what should be random noise. A quick hack, though, is simply to
put the residuals in a totally random order and re-plot them:
<<eval=FALSE>>=
sample(residuals(my.model))
@
will take the residuals vector of \texttt{my.model} and randomly permute it;
plotting these permuted residuals against the coordinate will then give an
example of how things should look. Do this a few times, and you'll get a good
sense of how much apparent clumping of residuals should be produced by chance.
This trick can also be used when plotting the residuals, or squared residuals,
against the predictor variable, and can be formalized as what's called a
``permutation test''.
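To illustrate the trick without spoiling the exercise below, here is a sketch on simulated residuals with a deliberately planted clump; everything here (the seed, the sizes, the clump) is invented.

```r
# Permutation trick on made-up "residuals" with a planted run of
# positive values; permuting destroys any real structure in time,
# so the permuted panels show how much clumping chance alone produces
set.seed(7)
t <- 1:200
e.obs <- rnorm(200) + c(rep(0, 80), rep(1.5, 40), rep(0, 80))
par(mfrow=c(2, 2))
plot(t, e.obs, ylab="Residual", main="Actual order")
for (i in 1:3) {
    plot(t, sample(e.obs), ylab="Residual", main="Permuted")
}
par(mfrow=c(1, 1))
```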
(Exercise: Make a few plots of the permuted, shuffled residuals against date.)
\clearpage
\paragraph{Residuals vs.\ Residuals}
A related diagnostic plot is particularly useful when the observations are
taken at regular intervals along some axis (usually time but occasionally
distance): make a scatter-plot with one point for each observation except the
very last; the horizontal coordinate comes from that point's residual, and the
vertical coordinate comes from the {\em next} point's residual. Ideally, this
should give a blob of points with no particular structure. If the residuals
are Gaussian, or follow any other bell-ish distribution, it should show a circular
blob. (If they were uniform, the blob should fill out a square --- why?)
Falling along a line or curve, or even a tilted ellipse, would be an indication
of correlation between successive residuals, which in turn may be a sign that
the noise is correlated\footnote{Or, again, it could be a sign that we've got
the functional form wrong, so it's our systematic error which is
correlated --- see \S \ref{sec:misspecification}.}.
\begin{figure}
<<>>=
# Always plot successive residuals against each other
# head() and tail() here are used to get "every day except the last"
# and "every day except the first", respectively
# see help(head)
plot(head(residuals(death.temp.lm), -1),
     tail(residuals(death.temp.lm), -1),
     xlab="Residual today",
     ylab="Residual tomorrow")
abline(lm(tail(residuals(death.temp.lm), -1)
          ~ head(residuals(death.temp.lm), -1)))
@
\caption{Residuals for each day (except the first) plotted as a function of the
residuals of the day before. The straight line shows a regression of
tomorrow's residual on today's residual, which ideally should be a totally
flat line.}
\end{figure}
Again, if you're not sure whether you're looking at a worryingly big departure
from blob-hood, try permuting the residuals before re-plotting.
\clearpage
\subsection{Plot the Distribution of the Residuals}
Under the Gaussian noise assumption, the residuals should also follow a
Gaussian distribution. We should therefore make plots of the distribution of
the residuals, and compare that to a Gaussian.
The most basic plot of the distribution for the residuals is of course a
histogram. This should be over-laid with a Gaussian probability density ---
but which Gaussian? The most reasonable one has mean 0 (because we know the
residuals average to 0), and the same standard deviation as the residuals
(because that's the MLE of the standard deviation in a Gaussian model). At
that point, one can {\em see} whether the distribution of residuals looks like
that of the best-fitting Gaussian.
\begin{figure}
<<>>=
# Always plot distribution of residuals
hist(residuals(death.temp.lm), breaks=40,
     freq=FALSE,
     xlab="Residual (deaths/day)", main="")
curve(dnorm(x, mean=0, sd=sd(residuals(death.temp.lm))),
      add=TRUE, col="blue")
@
\caption{Histogram of the residuals, on a density scale, and the theoretical
Gaussian distribution with the same mean (0) and standard deviation.}
\end{figure}
\clearpage
\paragraph{Q-Q plots}
An alternative is what's called a ``quantile-quantile'' or ``Q-Q'' plot. (The
textbook, old-fashionedly, calls this a ``normal probability plot''.) This takes
a bit more thought to get used to than visually comparing density estimates,
but with practice becomes more sensitive and less subjective. Here's the idea.
As you remember from intro.\ prob., knowing the cumulative distribution
function (CDF) tells us all there is to know, mathematically, about a
probability distribution; call this $F(x) = \Prob{X \leq x}$. If the
distribution is continuous, the CDF has an inverse function, $F^{-1}$, where
$F^{-1}(p)$ is the unique $x$ such that $\Prob{X \leq x} = p$. This is called
the {\bf quantile function} --- it tells us what level of $x$ will contain a
fraction $p$ of the total probability. Since saying things like ``the $0.95$
quantile'' is rather awkward, we usually pronounce it as ``the
$95^{\mathrm{th}}$ percentile'', meaning the value greater than or equal to
95\% of the population. If we know the quantile function, we can invert it to
get the CDF, so the quantile function also completely determines the
probability distribution\footnote{In R, the CDF and quantile functions have
names beginning with \texttt{p} and \texttt{q} --- \texttt{pnorm} and
\texttt{qnorm} for the Gaussian, \texttt{pexp} and \texttt{qexp} for the
exponential, etc.}.
As $p$ varies from $0$ to $1$, $F^{-1}(p)$ will vary from the smallest possible
value for the distribution up to the largest possible value. If our
distribution is a Gaussian, with mean $\mu$ and variance $\sigma^2$, then
$F^{-1}(p) = \sigma \Phi^{-1}(p) + \mu$, where $\Phi$ is the standard Gaussian
CDF. (Why?) So if instead of plotting $F^{-1}$ against $p$, we make a plot
where $F^{-1}(p)$ goes on one axis and $\Phi^{-1}(p)$ goes on the other, as we
sweep $p$ from $0$ to $1$ we'll get a straight line. Conversely, if we weren't
sure whether the distribution $F$ we were interested in was Gaussian, but we
did one of these quantile-quantile plots against the standard Gaussian and
got a straight line, then we'd know $F$ was, in fact, Gaussian.
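A quick numerical illustration of this linearity, with arbitrary made-up values of $\mu$ and $\sigma$:

```r
# Gaussian quantiles are a linear function of standard-Gaussian
# quantiles: F^{-1}(p) = sigma*Phi^{-1}(p) + mu
mu <- 3; sigma <- 2
p <- c(0.1, 0.25, 0.5, 0.9, 0.975)
qnorm(p, mean=mu, sd=sigma)    # quantile function of N(3, 4)
sigma*qnorm(p) + mu            # the same numbers
```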
With a finite sample from a distribution (like, say, the vector of residuals),
we don't really have $F$ or $F^{-1}$. However, we can use the {\bf sample
quantiles} or {\bf empirical quantiles}. Start with our observations, say
$x_1, x_2, \ldots x_n$. Now put them in increasing order: to help distinguish
the ordered from unordered observations, I'll use a common convention where the
subscripts in parentheses show order, so
\[
x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n-1)} \leq x_{(n)}
\]
(These ordered values of the data are sometimes called the {\bf order
statistics}.) Now $x_{(i)}$ is $\geq$ a fraction $i/n$ of the sample
observations, so $\hat{F}^{-1}(i/n) = x_{(i)}$. When we make a $Q-Q$ plot
against a Gaussian distribution, we therefore put $x_{(i)}$ on one axis, and
$\Phi^{-1}(i/n)$ on the other, and hope to see a straight line if the
distribution of $x$ is indeed Gaussian\footnote{You might worry about what
happens when $i=n$, since $\Phi^{-1}(1) = \infty$. The answer is that I have
slightly over-simplified what goes on that axis. When you call
\texttt{qqnorm} in R, it actually goes through a calculation to find
$\Expect{X_{(i)}}$ for $i \in 1:n$ under a Gaussian distribution, and plots
{\em that} against the observed value $x_{(i)}$. As $n\rightarrow\infty$,
$\Expect{X_{(i)}} \rightarrow \Phi^{-1}(i/n)$, but the difference can be
important for small $n$, and/or $i/n$ close to 0 or 1.}.
To sum up: we put the data in order, and then plot $x_{(i)}$ against
$\Phi^{-1}(i/n)$. If the data are from a Gaussian distribution, these points
should fall along a straight line. Small wiggles around the line are to be
anticipated from chance; systematic deviations from a line indicate systematic
departures from a Gaussian distribution.
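The construction can be hand-rolled in a few lines of R; this sketch uses the simple plotting positions $i/(n+1)$ (to avoid $\Phi^{-1}(1) = \infty$) rather than the more careful calculation \texttt{qqnorm} does, and the data are simulated.

```r
# A hand-rolled Gaussian Q-Q plot (a sketch; qqnorm() is more careful)
x <- rnorm(100, mean=5, sd=3)    # made-up data
n <- length(x)
x.ord <- sort(x)                 # order statistics x_(1) <= ... <= x_(n)
theo <- qnorm((1:n)/(n + 1))     # i/(n+1) avoids qnorm(1) = Inf
plot(theo, x.ord, xlab="Theoretical quantiles", ylab="Sample quantiles")
```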
\begin{figure}
<<>>=
# An alternative: plot vs. theoretical Gaussian distribution
qqnorm(residuals(death.temp.lm))
qqline(residuals(death.temp.lm))
@
\caption{QQ plot of the residuals, using the standard Gaussian distribution as
the reference distribution.}
\end{figure}
\paragraph{Q-Q plots for other distributions} One can make a Q-Q plot for data
against any other distribution one likes, provided one knows the reference
distribution's quantile function; R just provides code for the Gaussian
case, however.
\paragraph{Q-Q plots for two data distributions} With two data sets, say $x_1,
\ldots x_n$ and $y_1, \ldots y_m$, one can compare their sample quantiles. The
easiest way is when $n=m$, since then one just plots $x_{(i)}$ against
$y_{(i)}$. (If $n\neq m$, one might pick a grid of $p$ values, work out
$\hat{F}^{-1}_{x}(p)$ and $\hat{F}^{-1}_{y}(p)$ by interpolation, and plot
those against each other.) This should show points around the $x=y$ line
when $x$ and $y$ are both drawn from the same distribution.
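Base R's \texttt{qqplot} function does this for two samples, including the interpolation step when the sample sizes differ; a minimal sketch with simulated data:

```r
# Two-sample Q-Q plot; points should hug the x = y line when both
# samples come from the same distribution (samples are made up)
x <- rexp(100, rate=1)
y <- rexp(150, rate=1)      # same distribution, different n
qqplot(x, y)
abline(0, 1, col="grey")    # the x = y reference line
```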
\paragraph{P-P plots} An alternative to $Q-Q$ plots is the $P-P$ plot, where both
axes run from $0$ to $1$. Call the horizontal coordinate $p_1$ and the
vertical coordinate $p_2$; what we plot is $p_2 = \hat{F}(F^{-1}(p_1))$.
$F^{-1}(p_1)$ is the quantile which ought, under the distribution $F$, to be
$\geq$ a fraction $p_1$ of the population; $\hat{F}(F^{-1}(p_1))$ shows the
actual fraction of the sample which it exceeds. R doesn't have a built-in
function to make this --- can you write one?
\paragraph{Formal tests} Comparing the histogram to a theoretical density can
be formalized with a $\chi^2$ test\footnote{This is why the visual comparison
is informally called ``$\chi$ by eye''.}. Checking whether the $Q-Q$ plot
follows a straight line can be formalized in the Kolmogorov-Smirnov test. In
both cases, since we're really estimating a parameter of the reference
distribution (the standard deviation), we need to take some care to account for
that in a hypothesis test. (A tweak to the K-S test which does this, when
testing for Gaussianity, is the ``Lilliefors test'', which you can find in the
package \texttt{nortest}.)
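As a sketch of what such a test looks like in R, here is a plain Kolmogorov-Smirnov test of some stand-in residuals against a Gaussian whose standard deviation is estimated from those same residuals; because of that estimation the reported $p$-value is somewhat optimistic, which is exactly what the Lilliefors correction fixes.

```r
# Plain K-S test against a fitted Gaussian; the p-value is somewhat
# optimistic, since sd(e) was estimated from the same data (see text)
e <- rnorm(200, sd=2)    # stand-in residuals
ks.test(e, "pnorm", mean=0, sd=sd(e))
```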
\clearpage
\subsection{Generalization}
If the model assumptions are correct, the model should work about equally
well on new data from the same source. Because the parameters were adjusted to
fit the data we used to estimate the model, we should expect the prediction
errors on new data to be slightly larger in magnitude, but they shouldn't be
biased or otherwise show patterns. An important basic check on the model is
therefore to divide the data into two parts, estimate the model on one part,
the {\bf training set}, and then examine the predictions and the residuals on
the rest of the data, the {\bf testing set}.
We can either make the division into training and testing sets by random
sampling, or systematically. A random division ensures that the testing set
has (almost) the same distribution as the training set. The averaged squared
error on the testing set is therefore an unbiased estimate of the true mean
squared error on new data. We will return to this topic later in the course, under the
heading of ``cross-validation'', since it is one of the most useful ways of
selecting among competing models.
\begin{figure}
<<>>=
# Always look at whether the model can extrapolate to new data
# Basic check: randomly divide into two parts, here say 90% of the data
# vs. 10%
# Use the "training set" to estimate the model
training.rows <- sample(1:nrow(chicago), size=round(nrow(chicago)*0.9),
                        replace=FALSE)
training.set <- chicago[training.rows,]
# We'll use the "testing set" to see how well it does
testing.set <- chicago[-training.rows,]
# Estimate the model on the training set only
training.lm <- lm(death ~ tmpd, data=training.set)
# Make predictions on the testing set
# The model didn't get to see these points while it was being
# estimated, so this really checks (or tests) whether it can predict
testing.preds <- predict(training.lm, newdata=testing.set)
# Unfortunately residuals() doesn't know about the new data set,
# so calculate the residuals by hand
testing.residuals <- testing.set$death - testing.preds
@
\caption{Code setting up a {\em random} division of the data into training and
testing sets, and looking at how well the model does on points in the testing
set (which it didn't get to see during estimation).}
\end{figure}
\begin{figure}
<<>>=
# Plot our residuals against the predictor variable
plot(testing.set$tmpd, testing.residuals,
     xlab="Daily mean temperature (F)",
     ylab="Prediction error (deaths/day)",
     main="Out-of-sample residuals")
abline(h=0, col="grey")
abline(h=mean(testing.residuals), col="red")
@
\caption{Plot of residuals vs.\ temperature for the testing set. Remember that
the data points here were {\em not} available to the model during estimation.
The grey line marks the average we'd see on the training set (zero), while
the red line shows the average on the testing set.}
\label{fig:cv-residuals-plot}
\end{figure}
\begin{figure}
<<>>=
# Plot absolute residuals vs. predictor variable
plot(testing.set$tmpd, abs(testing.residuals),
     xlab="Daily mean temperature (F)",
     ylab="Absolute prediction error (deaths/day)",
     main="Out-of-sample absolute residuals")
abline(h=sqrt(mean(residuals(training.lm)^2)), col="grey")
abline(h=sqrt(mean(testing.residuals^2)), col="red")
@
\caption{As in Figure \ref{fig:cv-residuals-plot}, but looking at the absolute
residuals.}
\end{figure}
\clearpage
\paragraph{Extrapolative Generalization}
An alternative to random division, which can be even more useful for model
checking, is to systematically make the testing set into a part of the data
where the over-all fit of the model seems dubious based on other diagnostics.
With our running example, for instance, the model seems to do decently at
lower temperatures, but starts to look iffy at high temperatures. We might,
then, fit the model to low temperatures, and see whether it can extrapolate to
high temperatures, or whether it seems to make systematic errors there. (We
could also estimate the model on the high-temperature portion of the data, and
see how well it extrapolates to the lower temperatures, or estimate on the
middle range and see about both extremes, etc., etc.)
\begin{figure}
<<>>=
# Find the low-temperature days
lowtemp.rows <- which(chicago$tmpd < 75) # About 90% of the data
# Divide into low- and high- temperature data sets
lowtemp.set <- chicago[lowtemp.rows,]
hightemp.set <- chicago[-lowtemp.rows,]
# Estimate the model on the colder days only
lowtemp.lm <- lm(death ~ tmpd, data=lowtemp.set)
# For you: how much do the parameters change, as compared to
# using all the data?
# Now predict on the high-temperature days
# Again, these are new data points, but now systematically
# different (because of their temperature) from the
# data used to estimate
hightemp.preds <- predict(lowtemp.lm, newdata=hightemp.set)
# Calculate our own residuals
hightemp.residuals <- hightemp.set$death-hightemp.preds
@
\caption{Setting up a division of the data into a low-temperature training set
and a high-temperature testing set.}
\end{figure}
\begin{figure}
<<>>=
# Plot residuals vs. temperature
plot(hightemp.set$tmpd, hightemp.residuals,
     xlab="Daily mean temperature (F)",
     ylab="Prediction error (deaths/day)",
     main="Out-of-sample residuals")
# Flat line at 0 (ideal, if the model is right)
abline(h=0, col="grey")
# Flat line at the mean of the new residuals
abline(h=mean(hightemp.residuals), col="red")
# Regressing the new residuals on temperature does not look good...
abline(lm(hightemp.residuals ~ hightemp.set$tmpd), col="purple")
@
\caption{Residuals vs.\ temperature, where the testing set here consists of
days with temperature $\geq 75$ degrees Fahrenheit, and the training set only
those $< 75$ degrees. The grey line indicates the average residual we'd see
on the training data (zero); the red line the average residual on the testing
data; the purple a regression of residual on temperature, which ideally
should have had slope zero.}
\label{fig:cv-by-temp-plot}
\end{figure}
\begin{figure}
<<>>=
# Similar plots for the absolute residuals
plot(hightemp.set$tmpd, abs(hightemp.residuals),
     xlab="Daily mean temperature (F)",
     ylab="Absolute prediction error (deaths/day)",
     main="Out-of-sample absolute residuals")
abline(h=sqrt(mean(residuals(lowtemp.lm)^2)), col="grey")
abline(h=sqrt(mean(hightemp.residuals^2)), col="red")
abline(lm(abs(hightemp.residuals) ~ hightemp.set$tmpd), col="purple")
@
\caption{As in Figure \ref{fig:cv-by-temp-plot}, but for absolute residuals.}
\end{figure}
\paragraph{Generalize All the Things!} {\em All} of the diagnostic plots
discussed earlier can be combined with the trick of estimating the model on one
part of the data, and then seeing whether it generalizes to the testing set.
This is slightly more complicated than doing everything on the full data, but
arguably has more power to detect problems with the model.
\clearpage
\section{Nonlinear Functions of $X$}
When we plot the residuals against the predictor variable, we may see a curved
or stepped pattern. This strongly suggests that the relationship between $Y$
and $X$ is not linear. At this point, it is often useful to fall back on
plotting the $y_i$ against the $x_i$, and trying to guess at the functional
form of the curve.
\subsection{Transformations}
The easy case is when $Y = \beta_0 + \beta_1 f(x) + \epsilon$ for some fixed,
easily-computed function $f$ like $\sqrt{x}$, $x^2$, $\log{x}$, $\sin{x}$, etc.
We then calculate $f(x_i)$, and run a simple linear regression of $y_i$ on
$f(x_i)$. The interpretation of the coefficients hardly changes, except that
we need to replace $X$ everywhere with $f(X)$ --- $\beta_0 =
\Expect{Y|f(X)=0}$, $\beta_1$ is the difference $\Expect{Y|f(X)=f_0}$ and
$\Expect{Y|f(X)=f_0-1}$, etc. This is called ``transforming the predictor'',
and it only works this simply when the transformation itself doesn't have to be
estimated, but can just be guessed at. Ideally, in fact, we derive the
transformation from some physical (chemical, biological, psychological,
sociological, economic, \ldots) theory. For instance, there are good reasons
in physiology and psychology to say that an organism's behavioral response to a
stimulus should vary with the logarithm of the stimulus's physical
intensity\footnote{This is also (roughly) true, at least in people, of the
perceived intensity of the stimulus. Full sunlight carries $\approx 1000
\mathrm{W}/\mathrm{m}^2$ of power, while the light of a full moon is $\approx
0.025 \mathrm{W}/\mathrm{m}^2$, yet moonlight doesn't seem forty thousand
times dimmer. The logarithmic relationship between perceived and physical
intensity is sometimes called ``Fechner's law'' in psychophysics.}. A good
check on such a transformation is to plot the $y_i$ against the $f(x_i)$, and
see that the data now fall on a straight line.
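As a sketch of the mechanics in R (on simulated data, since we are just illustrating the idea):

```r
# Simulated data where Y really is linear in log(x), not in x
set.seed(42)
x <- runif(500, 1, 100)
y <- 2 + 3*log(x) + rnorm(500, sd=0.5)
fit.raw <- lm(y ~ x)        # mis-specified: linear in the untransformed x
fit.log <- lm(y ~ log(x))   # linear after transforming the predictor
coef(fit.log)               # should be close to the true (2, 3)
```

Plotting \texttt{y} against \texttt{log(x)} would show the straightened-out relationship, while the residuals of \texttt{fit.raw} would show a clear curve.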
\subsection{Nonlinear Least Squares}
If the transformation does have to be estimated, but the functional form is
known\footnote{For instance, if $Y=\beta_0 + \beta_1 \log{(x+\beta_2)} +
\epsilon$, for unknown $\beta_2$.}, then the method of least squares (or maximum
likelihood) still applies. Taking the derivative of the mean squared error (or
the log likelihood) with respect to the parameters and setting them equal to
zero gives a set of normal or estimating equations. Usually, however, these
equations are nonlinear, and don't have a closed-form solution. Finding the
solution is thus called ``solving a nonlinear least squares (NLS) problem''.
When we have theoretical reasons to use some thoroughly nonlinear model, say $Y
= \beta_0 x^{\beta_1} + \epsilon$, we can still estimate it using NLS in this
way. Aside from needing to solve the estimating equations numerically, there
are some special considerations to NLS, which we'll either cover later in this
class (time permitting) or in 402\footnote{The relevant R commands are
\texttt{optim}, which is a general-purpose minimization function, and
\texttt{nls}, which is, unsurprisingly, specialized to nonlinear least
squares.}.
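Here is a sketch of NLS with \texttt{nls}, fitting the power-law model $Y = \beta_0 x^{\beta_1} + \epsilon$ to simulated data (the data and the starting values are made up for illustration):

```r
# Nonlinear least squares for Y = b0 * x^b1 + noise, on simulated data
set.seed(17)
x <- runif(300, 1, 10)
y <- 2*x^1.5 + rnorm(300)
# nls solves the estimating equations numerically, so it needs starting values
fit.nls <- nls(y ~ b0 * x^b1, start=list(b0=1, b1=1))
coef(fit.nls)   # should be close to the true b0 = 2, b1 = 1.5
```

With badly-chosen starting values the numerical search can stall or diverge, which is one of the special considerations mentioned above.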
\subsection{Smoothing}
An alternative to using a parametrized nonlinear model is to try to let the
data tell us what the appropriate curve is --- to take a ``non-parametric''
approach. The most basic sort of curve-fitting would just take a little interval around any point $x$, say from $x-h$ to $x+h$, and average all the $y_i$ where $x_i$ was in the interval:
\[
\hat{m}(x) = \frac{\sum_{i=1}^{n}{y_i I_{[x-h,x+h]}(x_i)}}{\sum_{j=1}^{n}{I_{[x-h,x+h]}(x_j)}}
\]
(Here $I_{[a,b]}(x)$ is the {\bf indicator function} for the interval $[a,b]$, i.e., 1 if $a \leq x \leq b$, and 0 otherwise.)
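The local-averaging formula is simple enough to implement directly; this little function is just an illustration, not a substitute for R's built-in smoothers:

```r
# Crude local averaging: average the y_i whose x_i fall in [x0-h, x0+h]
local.average <- function(x0, x, y, h) {
  in.window <- (x >= x0 - h) & (x <= x0 + h)   # the indicator function
  mean(y[in.window])
}
# Example: estimate the curve at a grid of points
set.seed(3)
x <- seq(0, 10, length.out=200)
y <- sin(x) + rnorm(200, sd=0.1)
m.hat <- sapply(seq(0, 10, by=0.5), local.average, x=x, y=y, h=0.5)
```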
This sort of local averaging gives an $\hat{m}(x)$ which can make jerky steps as
$x$ changes. Another approach is {\bf spline smoothing}: this looks for the
function, called a {\bf spline}, which comes closest to the data points, subject
to a constraint on the average curvature of the function. Allowing no
curvature at all gives back the least squares line; allowing unlimited
curvature gives a function which interpolates exactly between the data
points\footnote{This turns out to be equivalent to a kind of weighted local
averaging.}. Of course one has to decide how much curvature to allow; the
best idea is generally to do what's called ``cross-validation'': hold back a
little bit of the data, fit the spline with some level of curvature to the rest
of the data, and see how well it predicts the held-back part; pick the
curvature which generalizes best to unseen data points. While there is a lot
of R code for splines, because they're very useful, the most user-friendly is
called \texttt{smooth.spline}.
<>=
smooth.spline(x, y, cv=TRUE)
@
returns an object which contains the estimated spline. (The default,
\texttt{cv=FALSE}, is to use a fast approximation to cross-validation called
``generalized cross-validation''; \texttt{cv=TRUE} is often a bit more
accurate, if you can afford the time.) The commands \texttt{fitted} and
\texttt{residuals} work on this object just like they do on \texttt{lm}
objects; \texttt{predict} works very similarly, but the argument giving the new
values must be a vector called \texttt{x}, not a data frame. Passing the
estimated spline to the \texttt{lines} or \texttt{points} command adds
the fitted curve to an existing plot.
\begin{figure}
<>=
plot(death ~ tmpd, data=chicago,
     xlab="Temperature (Fahrenheit)",
     ylab="Mortality (deaths/day)")
death.temp.ss <- smooth.spline(x=chicago$tmpd, y=chicago$death, cv=TRUE)
abline(death.temp.lm, col="blue")
lines(death.temp.ss, col="red")
@
<>=
<>
@
\caption{Scatter-plot of mortality as a function of temperature, along with the
estimated linear model (blue) and the estimated smoothing spline (red).
Notice how the smoothing spline tracks the linear model over a wide range of
temperatures, but then turns {\em up} for high temperatures.}
\end{figure}
\begin{figure}
<>=
plot(chicago$tmpd, residuals(death.temp.lm),
     xlab="Temperature (F)",
     ylab="Prediction error (deaths/day)")
abline(h=0, col="grey")
lines(smooth.spline(x=chicago$tmpd, y=residuals(death.temp.lm), cv=TRUE),
      col="red")
@
<>=
<>
@
\caption{Estimating a smoothing spline on the residuals of the linear model.
Ideally, the spline would be close to zero everywhere; here we see it's
working pretty well, {\em except} at high temperature.}
\end{figure}
\texttt{smooth.spline} can also be used on the squared residuals (not the
absolute residuals -- why?) to try to get an estimate of the conditional
variance, if you're pretty sure that the functional form of the regression
is right.
\begin{figure}
<>=
plot(chicago$tmpd, residuals(death.temp.lm)^2, log="y",
     xlab="Temperature (F)",
     ylab="Squared prediction error (deaths/day)^2")
abline(h=mean(residuals(death.temp.lm)^2),
       lwd=2, col="grey")
lines(smooth.spline(x=chicago$tmpd, y=residuals(death.temp.lm)^2,
                    cv=TRUE),
      col="red")
@
<>=
<>
@
\caption{Smoothing spline of squared residuals versus temperature. (Notice the
logarithmic scale for the vertical axis, to compensate for the fact that some
of the residuals are really big.) If we thought that the functional form of
the regression was right, this would be a reasonable estimate of the
conditional variance. Should we think that?}
\end{figure}
\clearpage
\section{Transforming the Response}
Another way to try to accommodate nonlinearity is to transform the response
variable, rather than the predictor. That is, one imagines the model is
\[
g(Y) = \beta_0 + \beta_1 x + \epsilon
\]
for some invertible function $g$. In more old-fashioned sources, like our
textbook, this is advocated as a way of handling non-constant variance, or
non-Gaussian noise\footnote{To be quite honest, this seems like as clear a case
of being enslaved by one's own tools as one could hope to find in the
sciences. More charitably: it made a certain amount of sense before people
figured out how to deal with non-constant variance (around 1935) and
non-Gaussian noise (around 1980).}. A better rationale is that it might in
fact be true. Since the transformation $g$ has an inverse, we can write
\[
Y = g^{-1}\left(\beta_0 + \beta_1 x + \epsilon \right)
\]
Even if $\epsilon \sim N(0,\sigma^2)$, this implies that $Y$ will have a
non-Gaussian distribution, with a non-linear relationship between
$\Expect{Y|X=x}$ and $x$, and a non-constant variance. If that's actually the
case, we'd like to incorporate that into the model. For instance, $Y$ might be
an integer-valued count variable, or even a binary-valued categorical variable,
and then we pretty much have to have some sort of non-Gaussian noise in $Y$.
(Indeed, the noise around $\Expect{Y|X=x}$ isn't even strictly additive.)
Let me illustrate these points by working through a log transformation of $Y$.
Suppose
\[
\log{Y} = \beta_0 + \beta_1 x + \epsilon
\]
where $\epsilon \sim N(0,\sigma^2)$, independent of $x$.
The inverse function to $\log$ is $\exp$, so this is logically equivalent to
\[
Y = e^{\beta_0 + \beta_1 x}e^{\epsilon} = e^{\beta_0 + \beta_1 x} \xi
\]
where $\xi = e^{\epsilon}$ follows a {\bf log-normal} distribution, with
$\Expect{\log{\xi}}=0$, $\Var{\log{\xi}}=\sigma^2$. One can
show\footnote{Check Wikipedia if you don't believe me.} that $\Expect{\xi} =
e^{\sigma^2/2}$, that $\Var{\xi} = (e^{\sigma^2}-1)e^{\sigma^2}$, that the
median of $\xi$ is 1, and that $\xi$ is, consequently, skewed to the
right. Conversely, if $Y = e^{\beta_0 + \beta_1 x} +
\epsilon$, with $\epsilon \sim N(0,\sigma^2)$, then it is {\em not} the case
that $\log{Y} = \beta_0 + \beta_1 x + \eta$, for some Gaussian $\eta$.
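A quick simulation confirms these facts about $\xi$ (the sample size and seed below are arbitrary):

```r
# Simulate xi = e^epsilon with epsilon ~ N(0, sigma^2) and check its moments
set.seed(123)
sigma <- 1
xi <- exp(rnorm(1e6, mean=0, sd=sigma))
mean(xi)     # should be near exp(sigma^2/2), about 1.65
median(xi)   # should be near 1
var(xi)      # should be near (exp(sigma^2)-1)*exp(sigma^2), about 4.67
```

Note how far the mean sits above the median: that gap is the right-skewness at work.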
The point of this is that transforming the response thoroughly changes the
interpretation of the parameters: $\beta_0$ is not $\Expect{Y|X=x}$, but
$\Expect{g(Y)|X=x}$, $\beta_1$ is the slope of $\Expect{g(Y)|X=x}$, etc. It
also implies very particular, even peculiar, models for noise around the
regression line. This makes it impossible to compare mean squared errors or
log-likelihoods before and after transformations\footnote{More exactly,
log-likelihoods {\em can} be compared, but we need to compensate for the
transformation.}. One can, however, look at the residuals for the {\em
transformed} response, and see whether they are flat in the predictor
variable, etc.
\subsection{Box-Cox Transformations}
The great statisticians G. E. P. Box and D. R. Cox introduced a family
of transformations which includes powers of $Y$ and taking the
logarithm of $Y$, parameterized by a number $\lambda$:
\[
b_{\lambda}(y) = \frac{y^{\lambda}-1}{\lambda}
\]
In the limit $\lambda \rightarrow 0$, this becomes $\log{y}$ (Exercise
\ref{exercise:box-cox}). If one assumes that
\begin{eqnarray}
b_{\lambda}(Y) & = & \beta_0 + \beta_1 x + \epsilon\\
\epsilon & \sim & N(0, \sigma^2)\\
\epsilon & \mathrm{independent\ of} & x
\end{eqnarray}
then one can estimate $\lambda$ by maximizing the likelihood. This is
implemented in R through the function \texttt{boxcox} in the package
\texttt{MASS}.
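A quick numerical check of the $\lambda \rightarrow 0$ limit (the analytical derivation is Exercise \ref{exercise:box-cox}); the helper function here is written out by hand, not taken from \texttt{MASS}:

```r
# The Box-Cox transform, written out directly
box.cox <- function(y, lambda) (y^lambda - 1)/lambda
y <- c(0.5, 1, 2, 10)
max(abs(box.cox(y, 1e-6) - log(y)))   # effectively zero
```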
\newpage
\begin{figure}
<>=
library(MASS)
# Works with a previously-fit lm model
boxcox(death.temp.lm)
@
<>=
<>
@
\caption{Plot of the log-likelihood of different values of $\lambda$ in the
Box-Cox transformation, applied to the regression of deaths on temperature.
The dashed lines indicate an approximate 95\% confidence interval for
$\lambda$.}
\end{figure}
\begin{figure}
<>=
# You can also give a model formula, and suppress plotting
bc.death.temp <- boxcox(death ~ tmpd, data=chicago, plotit=FALSE)
# Result is a list showing lambda values ($x component) vs. log-likelihood
# (the $y component)
# We can use this to get a (rough) maximum value for lambda
(lambda.hat <- bc.death.temp$x[which.max(bc.death.temp$y)])
@
\caption{Another way of using the \texttt{boxcox} function.}
\end{figure}
\begin{figure}
<>=
chicago$bc.death <- (chicago$death^lambda.hat-1)/lambda.hat
bc.lm <- lm(bc.death ~ tmpd, data=chicago)
plot(death ~ tmpd, data=chicago,
     xlab="Temperature (Fahrenheit)",
     ylab="Mortality (deaths/day)")
points(chicago[,"tmpd"], (lambda.hat*fitted(bc.lm)+1)^(1/lambda.hat), pch=19,
       col="blue")
@
<>=
<>
@
\caption{Actual data (hollow circles) and fitted values from the Box-Cox
transformation.}
\end{figure}
\clearpage
It is important to remember that this family of transformations is just
something that Box and Cox made up because it led to tractable math. There is
absolutely no justification for it in probability theory, general
considerations of mathematical modeling, or scientific theories. There is also
no reason to think that the correct model will be one where some transformation
of $Y$ is a linear function of the predictor variable plus Gaussian noise.
Estimating a Box-Cox transformation by maximum likelihood does not relieve us
of the need to run all the diagnostic checks {\em after} the transformation.
Even the best Box-Cox transformation may be utter rubbish.
I will add that while Box-Cox transformations have been part of courses like
this since they were introduced in 1964, I have never encountered a real,
non-textbook data-analysis problem where they helped, with {\em one} exception,
which I will discuss immediately.
\paragraph{Log transformations with log-normal multiplicative noise} The usual
argument for anticipating Gaussian noise is the central limit theorem: if lots
of little disturbing causes add up their contributions to the response
variable, and they're roughly independent and of comparable size, we'll expect
that their net effect is Gaussian. If instead of adding up they multiply
together, however, the central limit theorem does not apply directly. Since
logarithms turn multiplication into addition, when effects multiply we may
anticipate log-normal fluctuations. If the simple linear regression model applies after taking the log of everything,
\begin{eqnarray}
\log{Y} & = & \beta_0 + \beta_1 \log{x} + \epsilon\\
\epsilon & \sim & N(0,\sigma^2)\\
\epsilon & \mathrm{independent\ of} & X
\end{eqnarray}
then, undoing the log,
\begin{eqnarray}
Y & = & e^{\beta_0} x^{\beta_1} \eta \label{eqn:lognormal-regression}\\
\nonumber \eta & \sim & \log{N}(0,\sigma^2)\\
\nonumber \eta & \mathrm{independent\ of} & X
\end{eqnarray}
This sort of model is not actually unheard of in practice, because there
really are situations where causes multiply.
Eq.\ \ref{eqn:lognormal-regression} is a {\em different model} from
\[
Y = e^{\beta_0} x^{\beta_1} + \epsilon
\]
We would estimate the former model by doing a linear regression of $\log{Y}$
on $\log{X}$ with Gaussian noise. We would estimate the latter model
by nonlinear least squares.
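To see the two estimation strategies side by side, here is a sketch on data simulated from the multiplicative-noise model (all parameter values invented):

```r
# Data from the log-normal multiplicative model: Y = e^b0 * x^b1 * xi
set.seed(5)
x <- runif(400, 1, 10)
y <- exp(1) * x^2 * exp(rnorm(400, sd=0.2))   # b0 = 1, b1 = 2
fit.loglog <- lm(log(y) ~ log(x))             # right for this noise model
fit.nls <- nls(y ~ exp(b0) * x^b1,            # the additive-noise model
               start=list(b0=0, b1=1))
coef(fit.loglog)   # slope should be close to 2
coef(fit.nls)      # similar point estimates, but a different noise model
```

The point estimates can be close, but the two fits imply very different pictures of the scatter around the curve, so residual checks will favor one over the other.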
\section{Looking Forward}
Non-constant variance in a linear model is actually comparatively easy to
handle, if we can work out what the variance is: the trick is a modification
of the method of least squares called ``weighted least squares'', which we will cover
later in the course. Similarly, correlated noise can be handled through a
modification called ``generalized least squares'', which we'll also cover.
There are a range of simulation-based techniques for doing inference when the
Gaussian-noise assumption fails; these have opaque, forcedly-whimsical names
like ``the jackknife'' and ``the bootstrap'', and we'll get to these towards
the end of the course. Dealing with nonlinearity will occupy us for much of
402, though in a few weeks we will look at polynomial regression.
The easiest direction to go in is to add more predictor variables, and hope
that the response is linear in each of them. We will explore this path after
the midterm.
\section{Residuals under Misspecification}
\label{sec:misspecification}
(This section is optional, but strongly recommended.)
Suppose that the conditional expectation function $\mu(x) \equiv \Expect{Y|X=x}$
is not linear in $x$, but we use a linear model anyway. We know, from the first
lecture, that there is a best linear approximation to this $\mu$,
with slope $\beta_1 = \Cov{X,Y}/\Var{X}$ and intercept $\beta_0 = \Expect{Y}-\beta_1 \Expect{X}$. So we can write
\begin{eqnarray}
Y & = & \mu(x) + \epsilon\\
Y & = & \beta_0 + \beta_1 x + (\mu(x) - (\beta_0 + \beta_1 x)) + \epsilon
\end{eqnarray}
It's annoying to keep writing out $\mu(x) - (\beta_0 + \beta_1 x)$, so let's
define that to be a new function, $\nu(x) \equiv \mu(x) - (\beta_0 + \beta_1 x)$.
You can show (Exercise \ref{exercise:properties-of-nu}) that
\[
\Expect{\nu(X)} = 0
\]
and that
\[
\Cov{\nu(X), X} = 0
\]
However, it is not in general the case that
\[
\Expect{\nu(X) | X=x} =0
\]
Suppose now that --- through ignorance or as a deliberate strategy --- we
fit a linear model by least squares. It will be the case
that
\[
Y_i = \beta_0 + \beta_1 x_i + \eta_i
\]
where $\eta_i$ has expectation zero and is uncorrelated with $X$. But
what we are treating as noise is partially the real noise, and partially
the systematic effect of ignoring the nonlinearities:
\[
\eta_i = \epsilon_i + \nu(x_i)
\]
Thus, what looks like noise to the linear model will {\em not}
have conditional expectation zero:
\[
\Expect{\eta|X=x} = \nu(x)
\]
The conditional variance doesn't change,
\[
\Var{\eta|X=x} = \Var{\epsilon|X=x}
\]
but it's no longer equal to the expected square:
\[
\Expect{\eta^2|X=x} = \nu^2(x) + \Var{\epsilon|X=x}
\]
As mentioned in earlier lectures, when we run least squares, our slope and
intercept will still tend to converge on those of the optimal linear model. We
can even still use the expressions we worked out for $\hat{\beta}_0$ (or
$\hat{\beta}_1$) in terms of $\beta_0$ (or $\beta_1$) and a weighted sum of
random variables --- but now those are the $\eta_i$, not the $\epsilon_i$.
This implies that the conditional expectation of the residuals will {\em not}
be zero (in general), the conditional expectation of the squared residuals
will not be constant, and so forth.
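A small simulation makes these claims concrete; $\mu(x) = x^2$ here is just a convenient nonlinear example:

```r
# Fit a line to a genuinely nonlinear mu(x) and inspect the residuals
set.seed(99)
x <- runif(1000, -2, 2)
y <- x^2 + rnorm(1000, sd=0.1)   # mu(x) = x^2
fit <- lm(y ~ x)
r <- residuals(fit)
mean(r)                  # essentially zero: least squares guarantees this
mean(r[abs(x) < 0.5])    # clearly negative: E[residual | X near 0] != 0
```

The overall mean residual is zero by construction, but conditioning on $X$ reveals the systematic part $\nu(x)$ hiding inside what the linear model treats as noise.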
\section*{Exercises}
To think through or for practice, not to hand in.
\begin{enumerate}
\item \label{exercise:correlation-between-residuals} Using Eq.\
\ref{eqn:residual-as-sum-of-noise}, find the covariance, conditional on all
the $x_i$, between $e_i$ and $e_k$, for arbitrary $i, k \in 1:n$.
\item \label{exercise:box-cox} Show that $b_{\lambda}(y) \rightarrow \log{y}$
as $\lambda \rightarrow 0$. {\em Hint:} L'H{\^o}pital's rule.
\item \label{exercise:properties-of-nu} Consider the $\nu(x)$ function from \S
\ref{sec:misspecification}.
\begin{enumerate}
\item Show that $\Expect{\nu(X)} = 0$. {\em Hint:} Show that $\Expect{Y} =
\Expect{\mu(X)}$.
\item Show that $\Cov{X, \nu(X)} = 0$. {\em Hint:} Show that $\Cov{X, Y} =
\Cov{X, \mu(X)}$.
\end{enumerate}
\end{enumerate}
\end{document}