\documentclass{article}
\input{../common}
\begin{document}

\title{Lecture 15: Diagnostics and Inference for Multiple Linear Regression}
\author{36-401, Section B, Fall 2015}
\date{20 October 2015}
\maketitle
\tableofcontents

<<echo=FALSE>>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE)
@

\section{Lightning Review of Multiple Linear Regression}

In the multiple linear regression model, we assume that the response $Y$ is a linear function of all the predictors, plus a constant, plus noise:
\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon
\end{equation}
We make no assumptions about the (marginal or joint) distributions of the $X_i$, but we assume that $\Expect{\epsilon|X} = 0$, $\Var{\epsilon|X} = \sigma^2$, and that $\epsilon$ is uncorrelated across measurements. In matrix form, the model is
\begin{equation}
\mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\end{equation}
where $\mathbf{X}$ includes an initial column of all 1s. When we add the Gaussian noise assumption, we are making all of the assumptions above, and further assuming that
\begin{equation}
\mathbf{\epsilon} \sim MVN(\mathbf{0}, \sigma^2 \mathbf{I})
\end{equation}
independently of $\mathbf{X}$.

The least squares estimate of the coefficients is
\begin{equation}
\widehat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}
\end{equation}
Under the Gaussian noise assumption, this is also the maximum likelihood estimate. The fitted values (i.e., estimates of the conditional means at data points used to estimate the model) are given by the ``hat'' or ``influence'' matrix:
\begin{equation}
\widehat{\mathbf{m}} = \mathbf{x}\widehat{\mathbf{\beta}} = \mathbf{H}\mathbf{y}
\end{equation}
which is symmetric and idempotent. The residuals are given by
\begin{equation}
\mathbf{e} = (\mathbf{I}- \mathbf{H})\mathbf{y}
\end{equation}
and $\mathbf{I}-\mathbf{H}$ is also symmetric and idempotent. The in-sample mean squared error, which is the maximum likelihood estimate of $\sigma^2$, has a small negative bias:
\begin{equation}
\Expect{\hat{\sigma}^2} = \Expect{\frac{1}{n}\mathbf{e}^T\mathbf{e}} = \sigma^2 \frac{n-p-1}{n} = \sigma^2\left(1 - \frac{p+1}{n}\right)
\end{equation}
Since $\mathbf{H}\mathbf{x}\mathbf{\beta} = \mathbf{x}\mathbf{\beta}$, the residuals can also be written
\begin{equation}
\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{\epsilon}
\end{equation}
hence
\begin{equation}
\Expect{\mathbf{e}} = \mathbf{0}
\end{equation}
and
\begin{equation}
\label{eqn:var-of-residuals}
\Var{\mathbf{e}} = \sigma^2(\mathbf{I} - \mathbf{H})
\end{equation}
Under the Gaussian noise assumption, $\widehat{\mathbf{\beta}}$, $\widehat{\mathbf{m}}$ and $\mathbf{e}$ all have Gaussian distributions (about which more below, \S \ref{sec:sampling-distributions}).
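To keep the matrix formulas concrete, here is a tiny numerical check, on made-up simulated data, purely for illustration, that the hat matrix $\mathbf{H} = \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T$ reproduces \texttt{lm}'s fitted values and is symmetric and idempotent:

<<hat-matrix-check>>=
# Tiny simulated example (made-up data) to check the matrix formulas against lm()
n <- 20
x1 <- runif(n); x2 <- runif(n)
y <- 5 - 2*x1 + x2 + rnorm(n, 0, 0.5)
x <- cbind(1, x1, x2)                  # design matrix, with an intercept column
H <- x %*% solve(t(x) %*% x) %*% t(x)  # the hat matrix
fit <- lm(y ~ x1 + x2)
max(abs(H %*% y - fitted(fit)))        # fitted values = Hy, up to round-off
max(abs(H - t(H)))                     # H is symmetric
max(abs(H %*% H - H))                  # H is idempotent
@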
\subsection{Point Predictions}
\label{sec:point-predictions}

Say that $\mathbf{x}^{\prime}$ is the $m \times (p+1)$ dimensional matrix storing the values of the predictor variables at the $m$ points where we want to make predictions. (These may or may not include points we used to estimate the model, and $m$ may be bigger than, smaller than, or equal to $n$.) Similarly, let $\mathbf{Y}^{\prime}$ be the $m\times 1$ matrix of random values of $Y$ at those points. The point predictions we want to make are
\begin{equation}
\Expect{\mathbf{Y}^{\prime}| \mathbf{X}^{\prime} = \mathbf{x}^{\prime}} = \mathbf{m}(\mathbf{x}^{\prime}) = \mathbf{x}^{\prime} \mathbf{\beta}
\end{equation}
and we {\em estimate} this by
\begin{equation}
\widehat{\mathbf{m}}(\mathbf{x}^{\prime}) = \mathbf{x}^{\prime}\widehat{\mathbf{\beta}}
\end{equation}
which is to say
\begin{equation}
\widehat{\mathbf{m}}(\mathbf{x}^{\prime}) = \mathbf{x}^{\prime}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\end{equation}
(It's easy to verify that when $\mathbf{x}^{\prime} = \mathbf{x}$, this reduces to $\mathbf{H}\mathbf{y}$.) Notice that the point predictions we make {\em anywhere} are always weighted sums (linear combinations) of the values of the response we happened to observe when we estimated the model. The weights just depend on the values of the predictors at the original data points, and at the points where we'll be making predictions --- changing the responses doesn't change those weights.

\section{Diagnostics for Multiple Linear Regression}

Before proceeding to detailed statistical inference, we need to check our modeling assumptions, which means we need diagnostics.

\subsection{Plot All the Things!}
\marginpar{\includegraphics[width=\marginparwidth]{all-the-things-pos}}

All of the plots we learned how to do for simple linear regression remain valuable:
\begin{enumerate}
\item {\em Plot the residuals against the predictors.} This now means $p$ distinct plots, of course. Each of them should show a flat scatter of points around $0$ (because $\Expect{\epsilon|X_i} = 0$), of roughly constant width (because $\Var{\epsilon|X_i} = \sigma^2$). Curvature or steps in these plots are a sign of potential nonlinearity, or of an omitted variable. Changing width is a potential sign of non-constant variance.
\item {\em Plot the squared residuals against the predictors.} Each of these $p$ plots should show a flat scatter of points around $\hat{\sigma}^2$.
\item {\em Plot the residuals against the fitted values.} This is an extra plot, redundant when we only have one predictor (because the fitted values were linear in the predictor).
\item {\em Plot the squared residuals against the fitted values.}
\item {\em Plot the residuals against coordinates.} If observations are dated, time-stamped, or spatially located, plot the residuals as functions of time, or make a map. If there is a meaningful order to the observations, plot residuals from successive observations against each other. Because the $\epsilon_i$ are uncorrelated, all of these plots should show a lack of structure.
\item {\em Plot the residuals' distribution against a Gaussian.}
\end{enumerate}
Out-of-sample predictions, with either random or deliberately selected testing sets, also remain valuable.

\subsubsection{Collinearity}

A linear dependence between two (or more) columns of the $\mathbf{x}$ matrix is called {\bf collinearity}, and it keeps us from finding a unique solution by least squares. (In fact, collinearity at the population level makes the coefficients ill-defined, not just impossible to estimate.) Collinearity between a pair of variables will show up in a pairs plot as an exact straight line. Collinearity among more than two variables will not. For instance, if $X_3 = (X_1 + X_2)/2$, we can't include all three variables in a regression, but we would not see that from any of the pairwise scatterplots. Computationally, collinearity will show up in the form of the determinant of $\mathbf{x}^T\mathbf{x}$ being zero; equivalently, the smallest eigenvalue of $\mathbf{x}^T\mathbf{x}$ will be zero. If \texttt{lm} is given a collinear set of predictor variables, it will sometimes give an error message, but more often it will decide not to estimate a coefficient for one of the collinear variables, and return an \texttt{NA} for the offending coefficient.
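Here is a minimal illustration, with made-up simulated data, of what exact collinearity looks like both in the matrix algebra and in \texttt{lm}'s output:

<<collinearity-demo>>=
# Two predictors plus a third which is exactly their average, so the
# design matrix is exactly collinear (made-up data for illustration)
n <- 100
x1 <- runif(n); x2 <- runif(n)
x3 <- (x1 + x2)/2
y <- 1 + 2*x1 - x2 + rnorm(n, 0, 0.1)
x <- cbind(1, x1, x2, x3)          # design matrix, with intercept column
det(t(x) %*% x)                    # zero, up to round-off error
min(eigen(t(x) %*% x)$values)      # smallest eigenvalue, also essentially zero
# lm() drops one of the collinear variables and reports NA for its coefficient
coefficients(lm(y ~ x1 + x2 + x3))
@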
We will return to the subject of collinearity in Lecture 17.

\subsubsection{Interactions}

Another possible complication for multiple regression, which we didn't have with the simple regression model, is that of {\em interactions} between variables. One of our assumptions is that each variable makes a distinct, additive contribution to the response, and the size of this contribution is completely insensitive to the contributions of other variables. If this is {\em not} true --- if the relationship between $Y$ and $X_i$ changes depending on the value of another predictor, $X_j$ --- then there is an {\bf interaction} between them. There are several ways of looking for interactions. We will return to this subject in Lecture 19, but, for now, I'll stick with describing some diagnostic procedures.

\paragraph{Sub-divide and re-estimate} The simplest thing to do, if you suspect an interaction between $X_i$ and $X_j$, is to sub-divide the data based on the value of $X_j$, into two or more parts, and re-estimate the model. If there is no interaction, the coefficient on $X_i$ should be the same, up to estimation error, in each part of the data. (That is, there should be no significant difference in the estimated coefficients.) While in principle straightforward, drawbacks to this include having to guess how to sub-divide the data (into two parts? three? more?), and at what values of $X_j$ to make the cuts.

\paragraph{Scatterplot with color or symbols} A more visual alternative is to plot the residuals against $X_i$, as usual, but to give each point a color which varies continuously with the value of $X_j$. In the absence of interactions, there should be no pattern to the colors. If there are interactions, however, we could predict what the residuals will be from knowing both variables, so we should tend to see similarly-colored regions in the plot. If color is not available, a similar effect can be obtained by using different plotting symbols, corresponding to different values of $X_j$.

\paragraph{3D Plots}

\begin{figure}
<<>>=
# Minimal simulation of interactions
# Model: Y = sin(X1*X2) + noise
X <- matrix(runif(200), ncol=2)    # 100 rows, 2 columns of Uniform(0,1) draws
Y <- sin(X[,1]*X[,2]) + rnorm(nrow(X), 0, 0.1)
df <- data.frame(Y=Y, X1=X[,1], X2=X[,2])
missed.interact <- lm(Y ~ X1+X2, data=df)
@
\caption{Simulating data from the model $Y=\sin(X_1 X_2) + \epsilon$, to illustrate detecting interactions. Self-checks: what is the distribution of $X_1$ and $X_2$? what is $\sigma^2$?}
\label{fig:minimal-sim}
\end{figure}

\begin{figure}
<<>>=
coefficients(summary(lm(Y~X1+X2, data=df, subset=which(df$X2 < median(df$X2)))))
coefficients(summary(lm(Y~X1+X2, data=df, subset=which(df$X2 > median(df$X2)))))
@
\caption{Here we have sub-setted the data based on the value of the second predictor (dividing it, somewhat arbitrarily, at its median). Notice that the difference in the two coefficients for $X_1$ is much larger than their standard errors.
Can you give a significance level for the difference between the two slope estimates?}
\label{fig:subdivide-and-re-estimate}
\end{figure}

\begin{figure}
<<color-plot, eval=FALSE>>=
# Create a vector of gradually-changing colors, with one entry for
# each data point
the.colors <- rainbow(n=nrow(df))
# For each data point, see how it ranks according to X2, from smallest (1)
# to largest
the.ranks <- rank(df$X2)
# Plot residuals vs. X1, colored according to X2
# Defining the color and rank vectors makes this next line a bit less
# mysterious, but it's not necessary; this could all be a one-liner.
plot(df$X1, residuals(missed.interact), pch=19, col=the.colors[the.ranks],
     xlab=expression(X[1]), ylab="Residuals")
@
<<echo=FALSE>>=
<<color-plot>>
@
\caption{Plotting residuals from the linear model against $X_1$, with the color of the point set by the value of $X_2$. Notice the clumping of points with similar colors: this means that knowing both $X_1$ and $X_2$ lets us predict the residual. Horizontal bands of the same color, on the other hand, would show that $X_2$ helped predict the residuals but $X_1$ did not, pointing to a mis-specification for the dependence of $Y$ on $X_2$.}
\label{fig:color-for-interactions}
\end{figure}

\begin{figure}
<<scatter3d-plot, eval=FALSE>>=
library(scatterplot3d)
# Make a 3D scatterplot of residuals against the two predictor variables
s3d <- scatterplot3d(x=df$X1, y=df$X2, z=residuals(missed.interact),
                     tick.marks=TRUE, label.tick.marks=TRUE,
                     xlab=expression(X[1]), ylab=expression(X[2]),
                     zlab="Residuals")
# Add a plane with intercept 0 and both slopes also 0, for visual
# reference
s3d$plane3d(c(0,0,0), lty.box="solid")
@
<<echo=FALSE>>=
<<scatter3d-plot>>
@
\caption{Residuals (vertical axis) vs.\ predictor variables. Notice that there are regions where the residuals are persistently positive or negative, but that these are defined by the value of {\em both} variables, not one or the other alone.}
\label{fig:scatter3d-for-interactions}
\end{figure}

\clearpage

\subsection{Remedies}

All of the remedies for model problems we discussed earlier, for the simple linear model, are still available to us.

\paragraph{Transform the response} We can change the response variable from $Y$ to $g(Y)$, in the hope that the assumptions of the linear-Gaussian model are more nearly satisfied for this new variable. That is, we hope that
\begin{equation}
g(Y) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon, ~ \epsilon \sim N(0,\sigma^2)
\end{equation}
The Box-Cox method, if you want to use it, will work as well as it did for simple linear models. Computationally, we'd just fill the $n\times 1$ response matrix with $[ g(y_1)\ g(y_2)\ \ldots g(y_n)]^T$, and proceed as with any other multiple regression. However, see the handout on transformations for cautions on interpretation after such transformations.

\paragraph{Transform the predictors} We can also transform each of the predictors, making the model
\begin{equation}
Y = \beta_0 + \beta_1 f_1(X_1) + \ldots + \beta_p f_p(X_p) + \epsilon, ~ \epsilon \sim N(0, \sigma^2)
\end{equation}
As the notation suggests, each $X_i$ could be subject to a different transformation. Again, it's just a matter of what we put in the columns of the $\mathbf{x}$ matrix before solving for $\widehat{\mathbf{\beta}}$. Again, see the handout on transformations for cautions on interpretations. (A model of this form is called an {\bf additive} model; in 402 we will look extensively at how they can be estimated, by automatically searching for near-optimal transformations.)
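Computationally, both of these remedies are just changes to the model formula. Here is a minimal, purely illustrative sketch (not evaluated, and with arbitrarily chosen transformations), using the simulated data frame \texttt{df} from Figure \ref{fig:minimal-sim}:

<<transform-sketch, eval=FALSE>>=
# Transforming the response: regress g(Y) on the same predictors;
# here g is, arbitrarily, the square root
lm(sqrt(Y) ~ X1 + X2, data=df)
# Transforming the predictors: each column of the x matrix can hold any fixed
# function of the original variable; I() protects arithmetic inside a formula
lm(Y ~ I(X1^2) + log(X2), data=df)
@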
An alternative is to transform, not each predictor variable, but their linear combination:
\begin{equation}
Y = h\left(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p\right) + \epsilon, ~ \epsilon \sim N(0, \sigma^2)
\end{equation}
This is called a ``single index'' model, because there is only one combination of the predictors, the weighted sum $\beta_1 X_1 + \ldots + \beta_p X_p$, which matters to the response. Notice that this is {\em not} the same model as the transform-$Y$ model, even if $h=g^{-1}$, because of the different role of the noise.

\paragraph{Changing the variables used} One option which is available to us with multiple regression is to add in new variables, or to remove ones we're already using. This should be done carefully, with an eye towards satisfying the model assumptions, rather than blindly increasing some score. We will discuss this extensively in Lectures 20 and 26.

\subsection{Plot {\em All} the Things?}
\marginpar{\includegraphics{all-the-things-neg}}

There is one important caution about exuberant diagnostic plotting. This is that the more checks you run, the bigger the chance that you will find something which looks weird {\em just by chance}. If we were doing formal hypothesis tests, and insisted on a uniform false positive rate of $\alpha$, then after running $r$ tests, we'd expect to make $\approx r \alpha$ rejections, {\em even if all of our null hypotheses are true}. (Why?) If you are doing lots of diagnostic plots --- say, 20 or 30 or more --- it becomes a very good idea to do some randomization, to see whether the {\em magnitude} of the bad-looking things you're seeing is about what you should anticipate from one plot or another, even if everything were absolutely fine.

\section{Inference for Multiple Linear Regression}

Unless I say otherwise, all results in this section presume that all of the modeling assumptions, Gaussian noise very much included, are correct. Also, all distributions stated are conditional on $\mathbf{x}$.

\subsection{Sampling Distributions}
\label{sec:sampling-distributions}

As in the simple linear model, the sampling distributions are the basis of all inference.

\subsubsection{Gaussian Sampling Distributions}

\paragraph{Gaussian distribution of coefficient estimators} In the simple linear model, because the noise $\epsilon$ is Gaussian, and the coefficient estimators were linear in the noise, $\widehat{\beta}_0$ and $\widehat{\beta}_1$ were also Gaussian. This remains true for the Gaussian multiple linear regression model:
\begin{eqnarray}
\widehat{\mathbf{\beta}} & = & (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{Y}\\
& = & (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T(\mathbf{x}\mathbf{\beta} + \mathbf{\epsilon})\\
& = & \mathbf{\beta} + (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{\epsilon}
\end{eqnarray}
Since $(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{\epsilon}$ is a constant matrix times a Gaussian vector, it is also Gaussian; adding on the constant vector $\mathbf{\beta}$ just shifts the mean, and still leaves us with a Gaussian. We saw the expectation and variance last time, so
\begin{equation}
\label{eqn:sampling-dist-of-beta-hat}
\widehat{\mathbf{\beta}} \sim MVN(\mathbf{\beta}, \sigma^2(\mathbf{x}^T\mathbf{x})^{-1})
\end{equation}
It follows that
\begin{equation}
\label{eqn:sampling-dist-of-one-beta-hat}
\widehat{\beta}_i \sim N\left(\beta_i, \sigma^2(\mathbf{x}^T\mathbf{x})^{-1}_{ii}\right)
\end{equation}
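As a quick sanity check on Eq.\ \ref{eqn:sampling-dist-of-beta-hat}: R's \texttt{vcov} function reports the plug-in version of this covariance matrix, with $\sigma^2$ replaced by the degrees-of-freedom-adjusted estimate $\mathbf{e}^T\mathbf{e}/(n-p-1)$. Here is a minimal sketch on made-up simulated data:

<<vcov-check>>=
# Check that vcov() on an lm object matches sigmahat^2 * (x^T x)^{-1},
# using the degrees-of-freedom-corrected estimate of sigma^2 (made-up data)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2*x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
x <- cbind(1, x1, x2)
s2 <- sum(residuals(fit)^2)/(n - 3)   # e^T e / (n-(p+1)), with p = 2 here
max(abs(vcov(fit) - s2 * solve(t(x) %*% x)))   # agreement, up to round-off
@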
\paragraph{Gaussian distribution of estimated conditional means} The same logic applies to the estimates of conditional means. In \S \ref{sec:point-predictions}, we saw that the estimated conditional means at new observations $\mathbf{x}^{\prime}$ are given by
\begin{equation}
\widehat{\mathbf{m}}(\mathbf{x}^{\prime}) = \mathbf{x}^{\prime}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\end{equation}
so (Exercise \ref{exercise:dist-of-est-cond-means})
\begin{equation}
\label{eqn:dist-of-est-cond-means}
\widehat{\mathbf{m}}(\mathbf{x}^{\prime}) \sim MVN(\mathbf{x}^{\prime}\mathbf{\beta}, \sigma^2 \mathbf{x}^{\prime}(\mathbf{x}^T\mathbf{x})^{-1}(\mathbf{x}^{\prime})^T)
\end{equation}

\paragraph{Gaussian distribution of fitted values} Eq.\ \ref{eqn:dist-of-est-cond-means} simplifies for the special case of the fitted values, i.e., the estimated conditional means on the original data:
\begin{equation}
\widehat{\mathbf{m}}(\mathbf{x}) \sim MVN(\mathbf{x}\mathbf{\beta}, \sigma^2 \mathbf{H})
\end{equation}

\paragraph{Gaussian distribution of residuals} Similarly, the residuals have a Gaussian distribution:
\begin{equation}
\label{eqn:dist-of-residuals}
\mathbf{e} \sim MVN(\mathbf{0}, \sigma^2(\mathbf{I}-\mathbf{H}))
\end{equation}

\subsubsection{$\hat{\sigma}^2$ and Degrees of Freedom}

The in-sample mean squared error $\hat{\sigma}^2 = n^{-1}\mathbf{e}^T\mathbf{e}$ has the distribution
\begin{equation}
\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-(p+1)}
\end{equation}
I won't prove this here, because it involves the same sort of tedious manipulations of Gaussians as I evaded in showing the special-case $\chi^2_{n-2}$ result for simple linear models. To give a hint of what's going on, though, I'll make two (related) observations.

\paragraph{Constraints on the residuals} The residuals are not all independent of each other. In the case of the simple linear model, the fact that we estimated the model by least squares left us with two constraints, $\sum_{i}{e_i} = 0$ and $\sum_{i}{e_i x_i} = 0$. If we had only one constraint, that would let us fill in the last residual if we knew the other $n-1$ residuals. Having two constraints meant that knowing any $n-2$ residuals determined the remaining two. We got those constraints from the normal or estimating equations, which in turn came from setting the derivative of the mean squared error (or of the log-likelihood) to zero. In the multiple regression model, when we set the derivative to zero, we get the matrix equation
\begin{equation}
\mathbf{x}^T(\mathbf{y} - \mathbf{x}\widehat{\mathbf{\beta}}) = \mathbf{0}
\end{equation}
But the term in parentheses is just $\mathbf{e}$, so the equation is
\begin{equation}
\mathbf{x}^T\mathbf{e} = \mathbf{0}
\end{equation}
Expanding out the matrix multiplication,
\begin{equation}
\left[\begin{array}{c} \sum_i{e_i} \\ \sum_{i}{x_{i1} e_i} \\ \vdots \\ \sum_{i}{x_{ip}e_i} \end{array}\right] = \left[ \begin{array}{c} 0 \\ 0 \\ \vdots \\ 0 \end{array}\right]
\end{equation}
Thus the residuals are subject to $p+1$ linear constraints, and knowing any $n-(p+1)$ of them will fix the rest.

\paragraph{Geometric interpretation of constraints} The vector of residuals $\mathbf{e}$ is a point in an $n$-dimensional space. As a random vector, without any constraints it could lie anywhere in that space, as, for instance, $\mathbf{\epsilon}$ can. The constraints, however, force it to live in a lower-dimensional subspace, specifically, a space of dimension $n-(p+1)$.
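Both observations are easy to check by simulation. Here is a minimal sketch, with made-up data and arbitrarily chosen parameter values, which verifies the constraint $\mathbf{x}^T\mathbf{e} = \mathbf{0}$ on every draw, and compares the simulated distribution of $n\hat{\sigma}^2/\sigma^2$ to the $\chi^2_{n-(p+1)}$ claim through its mean and variance:

<<residual-constraints-demo>>=
# Simulate from a model where we know the truth (made-up parameter values)
n <- 50; p <- 3; sigma <- 2
x <- matrix(rnorm(n*p), ncol=p)
sims <- replicate(1000, {
  y <- 1 + x %*% c(2, -1, 0.5) + rnorm(n, 0, sigma)
  fit <- lm(y ~ x)
  e <- residuals(fit)
  stopifnot(all(abs(t(cbind(1, x)) %*% e) < 1e-8))  # the p+1 constraints x^T e = 0
  sum(e^2)/sigma^2                                  # = n * sigmahat^2 / sigma^2
})
mean(sims)   # should be close to n-(p+1) = 46, the chi^2 mean
var(sims)    # should be close to 2*(n-(p+1)) = 92, the chi^2 variance
@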
\paragraph{Bias of $\hat{\sigma}^2$} As more of a formal manipulation, when we look at the expectation of $\hat{\sigma}^2$, we get
\begin{eqnarray}
\Expect{\hat{\sigma}^2} & = & \frac{1}{n}\Expect{\mathbf{e}^T\mathbf{e}}\\
& = & \frac{1}{n}\Expect{((\mathbf{I}-\mathbf{H}) \mathbf{e})^T ((\mathbf{I}-\mathbf{H}) \mathbf{e})}\\
& = & \frac{1}{n}\Expect{\mathbf{e}^T(\mathbf{I}^T-\mathbf{H}^T)(\mathbf{I}-\mathbf{H})\mathbf{e}}\\
& = & \frac{1}{n}\Expect{\mathbf{e}^T(\mathbf{I} - \mathbf{H} - \mathbf{H}^T + \mathbf{H}^T\mathbf{H})\mathbf{e}}\\
& = & \frac{1}{n}\Expect{\mathbf{e}^T(\mathbf{I}-\mathbf{H})\mathbf{e}}
\end{eqnarray}
using the easily-checked facts that $\mathbf{H} = \mathbf{H}^T$, and that $\mathbf{H}^2 = \mathbf{H}$; these also imply that $(\mathbf{I}-\mathbf{H})\mathbf{e} = (\mathbf{I}-\mathbf{H})^2\mathbf{y} = (\mathbf{I}-\mathbf{H})\mathbf{y} = \mathbf{e}$, which justifies the second line. We've therefore reduced the expectation to a quadratic form, and so (Lecture 13)
\begin{eqnarray}
\Expect{\hat{\sigma}^2} & = & \frac{1}{n} \tr{((\mathbf{I}-\mathbf{H})\Var{\mathbf{e}})}\\
& = & \frac{1}{n}\tr{((\mathbf{I}-\mathbf{H})\sigma^2(\mathbf{I}-\mathbf{H}))}\\
& = & \frac{\sigma^2}{n}\tr{(\mathbf{I}-\mathbf{H})^2}\\
& = & \frac{\sigma^2}{n}\tr{(\mathbf{I}-\mathbf{H})}
\end{eqnarray}
since we've just seen that $(\mathbf{I}-\mathbf{H})^2 = (\mathbf{I}-\mathbf{H})$, and (Eq.\ \ref{eqn:var-of-residuals}) $\Var{\mathbf{e}} = \sigma^2(\mathbf{I}-\mathbf{H})$. Making one last push,
\begin{equation}
\Expect{\hat{\sigma}^2} = \frac{\sigma^2}{n}(n-p-1)
\end{equation}
since $\tr{\mathbf{I}} = n$ while (as you proved in the homework) $\tr{\mathbf{H}} = p+1$.

\subsection{$t$ Distributions for Coefficient and Conditional Mean Estimators}

From Eq.\ \ref{eqn:sampling-dist-of-one-beta-hat}, it follows that
\begin{equation}
\frac{\hat{\beta}_i - \beta_i}{\sqrt{\sigma^2(\mathbf{x}^T\mathbf{x})^{-1}_{ii}}} \sim N(0,1)
\end{equation}
This would be enough to let us do hypothesis tests and form confidence intervals, if only we knew $\sigma^2$. Since $\sigma^2$ is itself estimated, and $\hat{\sigma}^2$ has a distribution derived from a $\chi^2_{n-p-1}$, we can go through the same arguments we did in the simple linear model case to get $t$ distributions. Specifically,
\begin{equation}
\frac{\hat{\beta}_i - \beta_i}{\sehat{\hat{\beta}_i}} \sim t_{n-p-1}
\end{equation}
The same applies to the estimated conditional means, and to the distribution of a new $Y^{\prime}$ around the estimated conditional mean (in a prediction interval). Thus, all the theory we did for parametric and predictive inference in the simple model carries over, just with a different number of degrees of freedom. As with the simple model, $t_{n-p-1} \rightarrow N(0,1)$, so $t$ statistics approach $z$ statistics as the sample size grows.

\subsection{What, Exactly, Is R Testing?}

The \texttt{summary} function lists a $p$-value for each coefficient in a linear model. For each coefficient, say $\beta_i$, this is the $p$-value in testing the hypothesis that $\beta_i = 0$. It is important to be very clear about exactly what this means. The hypothesis being tested is ``$Y$ is a linear function of all of the $X_i$, $i \in 1:p$, with constant-variance, independent Gaussian noise, and it just so happens that $\beta_i = 0$''. Since, as we saw in Lecture 14, the optimal coefficients for each predictor variable depend on which other variables are included in the model (through the off-diagonal terms in $(\mathbf{x}^T\mathbf{x})^{-1}$), this is a {\em very} specific hypothesis. In particular, whether the null hypothesis that $\beta_i = 0$ is true or not can easily depend on what other variables are included in the regression.
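To illustrate, here is a small, purely hypothetical simulation in which $Y$ depends only on $X_2$, but $X_1$ is strongly correlated with $X_2$; whether $\hat{\beta}_1$ comes out ``significant'' then depends entirely on whether $X_2$ is also included in the model:

<<significance-depends-on-model>>=
# Hypothetical example: Y depends on X2 alone, but X1 is correlated with X2
n <- 200
x2 <- rnorm(n)
x1 <- x2 + rnorm(n, 0, 0.5)
y <- 3*x2 + rnorm(n)
# Leaving out X2: the slope on X1 picks up X2's effect and looks very significant
coefficients(summary(lm(y ~ x1)))["x1", ]
# Including X2: the slope on X1 is near zero and usually nowhere near significant
coefficients(summary(lm(y ~ x1 + x2)))["x1", ]
@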
What is really being checked here is, in ordinary language, something like ``If you included all these other variables, would the model really fit {\em that} much better if you gave $X_i$ a non-zero slope?''

\subsubsection{Why, on Earth, Would You Want to Test That?}

I am afraid that usually the answer is ``you do not actually want to test that''. You should ask yourself, carefully, whether it would really make any difference to you to know that the coefficient was precisely zero. (See Lecture 8 for some ideas about when that's worth testing and when it isn't.)

\subsubsection{What Will Tend to Make a $\hat{\beta}$ Significant?}

The $t$ statistic for testing $\beta_i = 0$ is
\begin{equation}
\frac{\hat{\beta}_i}{\sehat{\hat{\beta}_i}}
\end{equation}
We know that $\hat{\beta}_i$, being unbiased, will have a distribution centered on $\beta_i$, and the typical deviation away from that will in fact be about $\sehat{\hat{\beta}_i}$ in size, so we need to get a grip on that standard error. From the theory above,
\begin{equation}
\sehat{\hat{\beta}_i} = \sqrt{\frac{\hat{\sigma}^2}{n}{\left[{\left( \frac{1}{n} \mathbf{x}^T\mathbf{x} \right)}^{-1}\right]}_{ii}}
\end{equation}
You showed in the homework (problem 3) that
\begin{equation}
\left( \frac{1}{n} \mathbf{x}^T\mathbf{x} \right) = \left[\begin{array}{ccccc} 1 & \overline{x_1} & \overline{x_2} & \ldots & \overline{x_p}\\ \overline{x_1} & \overline{x_1^2} & \overline{x_1 x_2} & \ldots & \overline{x_1 x_p}\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ \overline{x_p} & \overline{x_1 x_p} & \overline{x_2 x_p} & \ldots & \overline{x_p^2} \end{array} \right]
\end{equation}
What will happen when we invert this? You can check (Exercise \ref{exercise:3by3-inverse-info}) that if $\overline{x_i x_j} = \bar{x_i}\bar{x_j}$ for all $i \neq j$ --- that is, if the predictors have no sample covariance with one another --- the inverse takes a particularly simple form. Except for the very first entry on the diagonal (corresponding to the intercept), the diagonal entries will be inversely proportional to the variances of the predictor variables. If $\overline{x_i x_j} \neq \bar{x_i}\bar{x_j}$, the predictors are correlated, and this is going to increase the variance of their coefficients.

So, to sum up, four things control the standard error of $\hat{\beta}_i$: $\sigma^2$, the variance around the true regression function, since all standard errors are proportional to $\sigma$; $n$, since (all else being equal) all the standard errors are proportional to $1/\sqrt{n}$; the sample variance of $X_i$ (since having data more widely spread on that axis makes it easier to find the slope); and the sample correlation between $X_i$ and the other $X_j$ (since strong correlations, positive or negative, make it harder to find their {\em specific} slopes). What are the consequences?
\begin{enumerate}
\item Since, on any one data set, $\sigma^2$ and $n$ are the same for all coefficients, the ones which are going to have the biggest test statistics, and so be ``most significant'', are the ones where (i) $|\beta_i|$ is large, (ii) the sample variance of $X_i$ is large, and (iii) the sample correlation of $X_i$ with other predictors is small.
\item The coefficients with the smallest $p$-values aren't necessarily the largest, let alone the most important; they may just be the most precisely measured.
\item Two people dealing with the same system, with precisely the same parameters and even the same $n$, can find different sets of coefficients to be significant, if their design matrices $\mathbf{x}$ differ.
In fact, there need be no overlap in which coefficients are significant at all\footnote{In this case, the natural thing to do would be to combine the data sets.}.
\item Adding or removing predictors will change which coefficients are significant, not just by changing the $\beta_i$, but also by changing the standard errors.
\item Holding all the parameters fixed and letting $n$ grow, the $t$ statistic will go off to $\pm \infty$, unless $\beta_i = 0$ exactly. Every non-zero coefficient eventually becomes significant at arbitrarily small levels.
\end{enumerate}
The same reasoning as in Lecture 8 shows that $p$-values will tend to go to zero exponentially fast as $n$ grows, unless of course $\beta_i=0$.

\subsubsection{Things It Would Be Very Stupid to Do, So Of Course You Would Never Even {\em Think} of Doing Them}

\begin{itemize}
\item Saying ``$\beta_i$ wasn't significantly different from zero, so $X_i$ doesn't matter for $Y$''. After all, $X_i$ could still be an important cause of $Y$, but we don't have enough data, or enough variance in $X_i$, or enough variance in $X_i$ uncorrelated with the other $X$'s, to accurately estimate its slope. Any of these would prevent us from saying that $\beta_i$ was {\em significantly} different from 0, i.e., distinguishable from 0 with high reliability.
\item Saying ``$\beta_i$ was significantly different from zero, so $X_i$ really matters to $Y$''. After all, any $\beta_i$ which is not {\em exactly} zero can be made arbitrarily significant by increasing $n$ and/or the sample variance of $X_i$. That is, its $t$ statistic will go to $\pm \infty$, and the $p$-value can be made as small as you have patience to make it.
\item Deleting all the variables whose coefficients didn't have stars by them, and re-running the regression. After all, since it makes no sense to pretend that the statistically significant variables are the only ones which matter, limiting the regression to the statistically significant variables is even less sensible.
\item Saying ``all my coefficients are really significant, so the linear-Gaussian model must be right''. After all, all the hypothesis tests on individual coefficients {\em presume} the linear-Gaussian model, both in the null and in the alternative. The tests have no power to notice nonlinearities, non-constant noise variance, or non-Gaussian noise.
\end{itemize}

\section{Further Reading}

The marginal figures are taken from Allie Brosh, ``This Is Why I'll Never Be an Adult'', \href{http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html}{{\em Hyperbole and a Half}, 17 June 2010}, without permission but with the deepest possible respect. If these notes do nothing beyond inspiring you to read one of the greatest moral psychologists of our age, they will have done more than many classes.

On a profoundly lower plane, \citet{Berk-on-regression} has one of the most sensible discussions of the uses and abuses of statistical inference in multiple regression I know of.

We will discuss additive models (where we automatically search for transformations of the predictors) extensively in 402 \citep[ch.\ 9]{CRS-ADAfaEPoV}. Single-index models are used widely in econometrics; see, for instance, \citet{Li-Racine-nonparametric-econometrics}.

\section{Exercises}

To think through or practice on, not to hand in.

\begin{enumerate}
\item \label{exercise:var-of-residuals} Prove Eq.\ \ref{eqn:var-of-residuals}.
\item \label{exercise:dist-of-est-cond-means} Prove Eq.\ \ref{eqn:dist-of-est-cond-means}.
\item {\em What if all null hypotheses were true?} Draw a $\mathbf{Y}$ from a standard Gaussian distribution with 1000 observations. Draw $\mathbf{X}$ by setting $p=100$, and giving each $X_i$ a standard Gaussian distribution, independent of $\mathbf{Y}$ and of the other predictors.
\begin{enumerate}
\item Regress $Y$ on all 100 $X$'s (plus an intercept). How many of the $\beta_i$s are significant at the 10\% level? At the 5\% level? At the 1\% level? What is the $R^2$? The adjusted $R^2$?
\item Re-run the regression using just the variables which are significant at the 5\% level. Plot a histogram of the change in coefficient for each variable from the old regression to the new regression. How many variables are now significant at the 1\% level? What is the $R^2$? The adjusted $R^2$?
\end{enumerate}
(This problem is inspired by an old example of David Freedman's.)
\item \label{exercise:3by3-inverse-info} {\em Standard errors and correlations among the predictors.} Assume that $p=2$, so $n^{-1} \mathbf{x}^T\mathbf{x}$ is a $3 \times 3$ matrix.
\begin{enumerate}
\item Suppose that $\overline{x_1 x_2} = \bar{x_1} \bar{x_2}$, so there is no sample covariance between the two predictors. Find $(\frac{1}{n} \mathbf{x}^T \mathbf{x})^{-1}$ in terms of $\bar{x_1}$, $\bar{x_2}$, $\overline{x_1^2}$ and $\overline{x_2^2}$. Simplify, where possible, to eliminate second moments in favor of variances.
\item Give the general form of the inverse, $(\frac{1}{n} \mathbf{x}^T \mathbf{x})^{-1}$, without assuming $\overline{x_1 x_2} = \bar{x_1} \bar{x_2}$. How, qualitatively, do the variances of the slope estimates depend on the variances and covariances of the predictors?
\end{enumerate}
\end{enumerate}

\bibliography{locusts}
\bibliographystyle{crs}

\end{document}