\documentclass{article}
\input{../common}
\begin{document}
\title{Lecture 16: Polynomial and Categorical Regression}
\author{36-401, Fall 2015, Section B}
\date{22 October 2015}
\maketitle
\tableofcontents
<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE)
options(show.signif.stars=FALSE)
@
\section{Essentials of Multiple Linear Regression}
We predict a scalar random variable $Y$ as a linear function of $p$
different predictor variables $X_1, \ldots X_p$, plus noise:
\[
Y = \beta_0 + \beta_1 X_1 + \ldots \beta_p X_p + \epsilon
\]
and assume that $\Expect{\epsilon|X}=0$, $\Var{\epsilon|X} = \sigma^2$,
with $\epsilon$ being uncorrelated across observations. In matrix form,
\[
\mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\]
the design matrix $\mathbf{X}$ including an extra column of 1s to handle
the intercept, and $\Expect{\mathbf{\epsilon}|\mathbf{X}} = \mathbf{0}$,
$\Var{\mathbf{\epsilon}|\mathbf{X}} = \sigma^2 \mathbf{I}$.
If we add the Gaussian noise assumption, $\mathbf{\epsilon} \sim MVN(\mathbf{0}, \sigma^2\mathbf{I})$, independent of all the predictor variables.
The least squares estimate of the coefficient vector, which is also the maximum likelihood estimate
if the noise is Gaussian, is
\[
\widehat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\]
These estimates are unbiased, with variance matrix $\sigma^2(\mathbf{x}^T\mathbf{x})^{-1}$.
Under the Gaussian noise assumption, $\widehat{\mathbf{\beta}}$ itself has a
Gaussian distribution. The standard error $\sehat{\hat{\beta}_i} =
\sigma\sqrt{(\mathbf{x}^T\mathbf{x})^{-1}_{ii}}$. Fitted values are given by
$\mathbf{x}\widehat{\mathbf{\beta}} = \mathbf{H}\mathbf{y}$, and residuals by
$\mathbf{e} = (\mathbf{I}-\mathbf{H})\mathbf{y}$. Fitted values
$\mathbf{\widehat{m}}$ and residuals $\mathbf{e}$ are also unbiased and have
Gaussian distributions, with variance matrices $\sigma^2 \mathbf{H}$ and
$\sigma^2(\mathbf{I}-\mathbf{H})$, respectively.
When (as is usually the case) $\sigma^2$ is unknown, the maximum likelihood
estimator is the in-sample mean-squared error, $\hat{\sigma}^2 =
n^{-1}(\mathbf{e}^T\mathbf{e})$. This is a negatively biased estimator of
$\sigma^2$: $\Expect{\hat{\sigma}^2} =
\sigma^2 \frac{n-p-1}{n}$. Under the Gaussian noise assumption,
$n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-p-1}$. Also under the Gaussian noise
assumption, the Gaussian sampling distribution of any particular coefficient or
conditional mean can be converted into a $t$ distribution, with $n-p-1$ degrees
of freedom, by using the appropriate standard error, obtained by plugging
in the de-biased estimate of $\sigma^2$.
None of these results require any assumptions on the predictor variables $X_i$,
except that they take real numerical values, and that they are linearly
independent.
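As a quick sanity check, the closed-form estimator can be compared with \texttt{lm} on simulated data (the coefficients, sample size, and noise level below are made up):
<<eval=FALSE>>=
# Simulated check: (x^T x)^{-1} x^T y should match lm's coefficients
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 2 + 3*x1 - 1*x2 + rnorm(n)
x <- cbind(1, x1, x2)                   # design matrix, with a column of 1s
beta.hat <- solve(t(x) %*% x, t(x) %*% y)
cbind(beta.hat, coef(lm(y ~ x1 + x2)))  # agree up to floating-point error
@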
\section{Adding Curvature: Polynomial Regression}
Because the predictor variables are almost totally arbitrary, there is
no harm in making one predictor variable a function of another, so long as
it isn't a linear function. In particular, there is nothing
wrong with a model like
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \ldots \beta_d X_1^d + \beta_{d+1} X_2 + \ldots \beta_{p+d-1} X_p + \epsilon
\]
where instead of $Y$ being linearly related to $X_1$, it's polynomially
related, with the order of the polynomial being $d$. We just add $d-1$ columns
to the design matrix $\mathbf{x}$, containing $x_1^2, x_1^3, \ldots x_1^d$, and
treat them just as we would any other predictor variables. With this expanded
design matrix, it's still true that $\widehat{\mathbf{\beta}} =
(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}$, that fitted values are
$\mathbf{H}\mathbf{y}$ (using the expanded $\mathbf{x}$ to get $\mathbf{H}$),
etc. The number of degrees of freedom for the residuals will be
$n-(p+1+(d-1))$.
Nor is there any principled reason why every predictor variable can't have its own
polynomial, each with (potentially) a different degree
$d_i$. In that case, numbering the $\beta$s sequentially gets tricky,
and better notation would be something like
\[
Y= \beta_0 + \sum_{i=1}^{p}{\sum_{j=1}^{d_i}{\beta_{i,j} X_i^j}}+\epsilon
\]
though then we'd have to remember to ``stack'' the $\beta_{i,j}$s into a vector
of length $1+\sum_{i=1}^p{d_i}$ for estimation.
Mathematically, we are treating $X_i$ and $X_i^2$ (and $X_i^3$, etc.) as
distinct predictor variables, but that's fine, since they won't be linearly
dependent on each other\footnote{Well, hardly ever: if $X_i$ was only ever,
say, $0$ or $1$, then it would be equal to $X_i^2$. Such awkward cases happen
with probability 0 for continuous variables.}, or linearly dependent on other
predictors\footnote{Again, you can contrive awkward cases where this is not
true, if you really want to. For instance, if $X_1$ and $X_2$ are horizontal
and vertical coordinates of points laid out on a circle, they are linearly
independent of each other and of their own squares, but $X_1^2$ and $X_2^2$
are linearly dependent. (Why?) The linear dependence would be broken if the
points were laid out in an ellipse or oval, however. (Why?)}. Again, we
just expand the design matrix with extra columns for all the desired powers of
each predictor variable. The number of degrees of freedom for the residuals
will be $n-(1+\sum_{i}{d_i})$.
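For example (with made-up numbers), if $n = 100$ and we used a cubic in one predictor and a quadratic in another, the residual degrees of freedom would be
\[
n - \left(1 + \sum_{i}{d_i}\right) = 100 - (1 + 3 + 2) = 94 ~.
\]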
There are a bunch of mathematical and statistical points to make about
polynomial regression, but let's take a look at how we'd actually
estimate one of these models in R first.
\subsection{R Practicalities}
There are a couple of ways of doing polynomial regression in R.
The most basic is to manually add columns to the data frame with the desired
powers, and then include those extra columns in the regression formula:
<<eval=FALSE>>=
df$x.sq <- df$x^2
lm(y~x+x.sq, data=df)
@
I do not recommend using this form, since it means that you need to do a lot of
repetitive, boring, error-prone work, and get it exactly right. (For example,
to do predictions with \texttt{predict}, you'd need to specify the values for
all the powers of all the predictors.)
A somewhat more elegant alternative is to tell R to use various powers in the
formula itself:
<<eval=FALSE>>=
lm(y ~ x + I(x^2), data=df)
@
Here \texttt{I()} is the {\bf identity function}, which tells R ``leave this
alone''. We use it here because the usual symbol for raising to a power,
\verb_^_, has a special meaning in linear-model formulas, relating to
interactions. (We'll cover this in Lecture 19, or, if you're impatient, see
\texttt{help(formula.lm)}.) When you do this, \texttt{lm} will create the relevant columns in the matrix it uses internally to calculate the estimates, but
it leaves \texttt{df} alone. When it comes time to make
a prediction, however, R will take care of the transformations on the
new data.
Finally, since it can grow tedious to write out all the powers one wants, there
is the convenience function \texttt{poly}, which will create all the necessary
columns for a polynomial of a specified degree:
<<eval=FALSE>>=
lm(y ~ poly(x,2), data=df)
@
Here the second argument, \texttt{degree}, tells \texttt{poly} what order of
polynomial to use. R remembers how this works when the estimated model is used
in \texttt{predict}. My {\em advice} is to use \texttt{poly}, but the other
forms aren't wrong.
\paragraph{Small demo} Here is a small demo of polynomial regression, using
the data from the first data analysis project.
<<>>=
# Load the data
mobility <- read.csv("http://www.stat.cmu.edu/~cshalizi/mreg/15/dap/1/mobility.csv")
mob.quad <- lm(Mobility ~ Commute + poly(Latitude,2)+Longitude, data=mobility)
@
This fits a quadratic in the \texttt{Latitude} variable, but linear terms
for the other two predictors. You will notice that \texttt{summary} does
nothing strange here:
<<>>=
summary(mob.quad)
@
and we can use \texttt{predict} as usual:
<<>>=
predict(mob.quad, newdata=data.frame(Commute=0.298,
Latitude=40.57, Longitude=-79.58))
@
See also Figure \ref{fig:hypothetical-pghs} for an illustration that this
really is giving us behavior which is non-linear in the \texttt{Latitude}
variable.
\begin{figure}
<<hypothetical-pghs, eval=FALSE>>=
hypothetical.pghs <- data.frame(Commute=0.287,
Latitude=seq(from=min(mobility$Latitude),
to=max(mobility$Latitude), length.out=100),
Longitude=-79.92)
plot(hypothetical.pghs$Latitude, predict(mob.quad, newdata=hypothetical.pghs),
xlab="Latitude", ylab="Expected mobility", type="l")
@
<<echo=FALSE>>=
<<hypothetical-pghs>>
@
\caption{Predicted rates of economic mobility for hypothetical communities at
the same longitude as Pittsburgh, and with the same proportion of workers
with short commutes, but different latitudes.}
\label{fig:hypothetical-pghs}
\end{figure}
\subsection{Properties, Issues, and Caveats}
\paragraph{Diagnostic plots} The appropriate diagnostic plot is of residuals
against the predictor. There is no need to make separate plots of residuals
against each power of the predictor.
\paragraph{Smoothness} Polynomial functions vary continuously in all their
arguments. In fact, they are ``smooth'' in the sense in which mathematicians
use that word, meaning that all their derivatives exist and are continuous,
too. This is desirable if you think the real regression function you're trying
to model is smooth, but not if you think there are sharp thresholds or jumps.
Polynomials {\em can} approximate thresholds arbitrarily closely, but you end
up needing a very high order polynomial.
\paragraph{Interpretation} In a linear model, we were able to offer
simple interpretations of the coefficients, in terms of slopes of the
regression surface.
In the multiple linear regression model, we could say
\[
\beta_i = \Expect{Y|X_i = x_i+1, X_{-i} = x_{-i}} - \Expect{Y|X_i = x_i, X_{-i}=x_{-i}}
\]
(``$\beta_i$ is the difference in the expected response when $X_i$ is increased
by one unit, all other predictor variables being equal''), or
\[
\beta_i = \frac{ \Expect{Y|X_i = x_i+h, X_{-i} = x_{-i}} - \Expect{Y|X_i = x_i, X_{-i}=x_{-i}}}{h}
\]
(``$\beta_i$ is the slope of the expected response as $X_i$ is varied, all
other predictor variables being equal''), or
\[
\beta_i = \frac{\partial\Expect{Y|X=x}}{\partial x_i}
\]
(``$\beta_i$ is the rate of change in the expected response as $X_i$
varies'').
None of these statements is true any more in a polynomial regression.
Take them in reverse order. The rate of change in $\Expect{Y|X}$ when
we vary $X_i$ is now
\[
\frac{\partial\Expect{Y|X=x}}{\partial x_i} = \sum_{j=1}^{d}{j\beta_{i,j} x^{j-1}_i}
\]
This not only involves all the coefficients for all the powers of $X_i$, but
also has a different answer at different points $x_i$. The linear coefficient
on $X_i$, $\beta_{i,1}$, is the rate of change when $X_i=0$, but not otherwise.
There just is no one answer to ``what's the rate of change?''.
Similarly, if we ask for the slope,
\[
\frac{ \Expect{Y|X_i = x_i+h, X_{-i} = x_{-i}} - \Expect{Y|X_i = x_i, X_{-i}=x_{-i}}}{h}
\]
that isn't given by one single number either; it depends on the starting value
$x_i$ and the size of the change $h$. If $h$ is very close to zero, the slope
will be approximately $\sum_{j=1}^{d}{j\beta_{i,j} x^{j-1}_i}$, but not,
generally, otherwise. If you really want to know, you have to actually plug in
to the polynomial.
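To make this concrete with made-up coefficients: suppose the fitted polynomial
in $x_i$ were $\widehat{m}(x_i) = 1 + 2x_i - 0.5 x_i^2$. Then
$\widehat{m}(1) = 2.5$ and $\widehat{m}(2) = 3$, so the slope from $x_i=1$ to
$x_i=2$ is
\[
\frac{\widehat{m}(2) - \widehat{m}(1)}{1} = \frac{3 - 2.5}{1} = 0.5 ~,
\]
while the derivative, $2 - x_i$, is $1$ at $x_i=1$ and $0$ at $x_i=2$; no
single number serves as ``the'' slope.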
Finally, the change associated with a one-unit change in $X_i$ is just a
special case of the slope when $h=1$, and so not equal to any of the
coefficients either. It will definitely change as the starting point $x_i$
changes.
Rather than trying to give one single rate of change (or slope or
response-associated-to-a-one-unit-change) when none exists, a more honest
procedure is to make a plot, either of the polynomial itself, or of the
derivative. (See the example in the model report for the first DAP.)
\paragraph{Interpreting the polynomial as a transformation of $X_i$} If you
really wanted to, you could try to complete the square (cube, other polynomial)
to re-write the polynomial
\[
\beta_1 X_1 + \beta_2 X_1^2+ \ldots \beta_d X_1^d = k + \beta_d\prod_{j=1}^{d}{(X_1 - c_j)}
\]
You could then say that $\beta_d$ was the change in the response for a one-unit
change in $\prod_{j=1}^{d}{(X_1 - c_j)}$, etc., etc. The zeroes or roots of
the polynomial, $c_j$, will be functions of the coefficients on the lower
powers of $X_1$, but their sampling distributions, unlike those of the
$\beta_j$, would be very tricky, and so, consequently, would their confidence
sets. Moreover, it is not very common for the transformed predictor
$\prod_{j=1}^{d}{(X_1 - c_j)}$ to itself be a particularly interpretable
variable, so this is often a considerable amount of work for little gain.
\paragraph{``Testing for nonlinearity''} It is not uncommon to see people
claiming to test whether the relationship between $Y$ and $X_i$ is linear by
adding a quadratic term in $X_i$ and testing whether the coefficient on it is
significantly different from zero. This would work fine if you knew that the
only possible sort of nonlinearity was quadratic --- that if the relationship
wasn't a straight line, it was a parabola. Since it is perfectly possible to
have a very nonlinear relationship where the coefficient on $X_i^2$ is zero,
this is not a very powerful test.
\paragraph{Over-fitting and wiggliness} A polynomial of degree $d$ can exactly
fit any $d+1$ points. (Any two points lie on a line, any three on a parabola,
etc.) Using a high-order polynomial, or even summing a large number of
low-order polynomials, can therefore lead to curves which come very close to
the data we used to estimate them, but predict very badly. In particular,
high-order polynomials can display very wild oscillations in between the data
points. Plotting the function in between the data points (using
\texttt{predict}) is a good way of noting this. We will also look at more
formal checks when we cover cross-validation later in the course.
\paragraph{Picking the polynomial order} The best way to pick the polynomial
order is on the basis of some actual scientific theory which says that the
relationship between $Y$ and $X_i$ should, indeed, be a polynomial of order
$d_i$. Failing that, carefully examining the diagnostic plots is your next
best bet. Finally, the methods we'll talk about for variable and model
selection in forthcoming lectures can also be applied to picking the order
of a polynomial, though as we will see, you need to be very careful about
what those methods actually do, and whether that's really what you want.
\subsection{Orthogonal Polynomials}
I have written out polynomial regression above in its most readily-comprehended
form, but that is not always the best way to estimate it. We know, from
our previous examination of multiple linear regression, that we'll get smaller
standard errors when our predictor variables are uncorrelated. While $X_i$ and
its higher powers are linearly independent, they are generally (for most
distributions) somewhat correlated. An alternative to regressing on the powers
of $X_i$ is to regress on a linear function of $X_i$, a quadratic function of
$X_i$, a cubic, etc., which are chosen so that they are {\em un-}correlated on
the data. These functions, being uncorrelated, are called {\bf orthogonal}.
Any polynomial could also be expressed as a linear combination of these {\bf
basis functions}, which are thus called {\bf orthogonal polynomials}. The
advantage, again, is that the estimates of coefficients on these basis
functions have less variance than using the powers of $X_i$.
In fact, this is what the \texttt{poly} function does by default; to force
it to use the powers of $X_i$, we need to set the \texttt{raw}
option to \texttt{TRUE}.
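To see the difference empirically, compare the sample correlations of the two bases on some made-up data:
<<eval=FALSE>>=
# Raw powers of x are correlated; poly's default basis is not
x <- runif(100, min=0, max=10)
cor(poly(x, 2, raw=TRUE))  # off-diagonal entries far from 0
cor(poly(x, 2))            # off-diagonal entries (numerically) 0
@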
To be concrete, let's start with the linear function. We'll arrange it
so that it has mean zero (and therefore doesn't contribute to the intercept):
\[
\sum_{i=1}^{n}{(\alpha_{110} + \alpha_{111}x_{i1})} = 0
\]
Here I am using $\alpha_{ijk}$ to indicate the coefficient on $X_i^k$ in
the $j^{\mathrm{th}}$ order basis function for $X_i$. This is one equation
with two unknowns, so we need another equation to be able to solve the system.
What \texttt{poly} does is to impose a constraint on the sample variance:
\[
\sum_{i=1}^{n}{(\alpha_{110} + \alpha_{111}x_{i1})^2} = 1
\]
(Why is this a constraint on the variance?) The quadratic function
is found by requiring that it have mean zero,
\[
\sum_{i=1}^{n}{(\alpha_{120} + \alpha_{121}x_{i1} + \alpha_{122}x_{i1}^2)} = 0 ~,
\]
that it be uncorrelated with the linear function,
\[
\sum_{i=1}^{n}{\left(\alpha_{110} + \alpha_{111}x_{i1} \right) \left(\alpha_{120} + \alpha_{121}x_{i1} + \alpha_{122}x_{i1}^2 \right)} = 0 ~,
\]
and that it have the same variance as the linear function:
\[
\sum_{i=1}^{n}{\left(\alpha_{120} + \alpha_{121}x_{i1} + \alpha_{122}x_{i1}^2\right)^2} = 1
\]
To get the $j^{\mathrm{th}}$ basis function, we need all the $j-1$ basis
functions that came before it, so we can make sure it has mean 0, that it's
uncorrelated with all of the others, and that it has the same variance. All of
the coefficients I've written $\alpha$ are encoded in the attributes of the
output of \texttt{poly}, though not always in an especially humanly-readable
way. (For details, see \texttt{help(poly)}, and the references it cites.)
Notice that changing the sample values of $X_i$ will change the basis
functions; one reason to use the powers of $X_i$ instead would be to make it
easier to compare coefficients across data sets. If the distribution of $X_i$
is known, one can work out systems of orthogonal polynomials in advance, for
instance, the Legendre polynomials which are orthogonal when the predictor
variable has a uniform distribution\footnote{See, for instance, Wikipedia,
s.v. ``Legendre polynomials''.}.
\subsection{Non-Polynomial Function Bases}
There are basically three reasons to want to use polynomials. First, many
scientific theories claim that there are polynomial relationships between
variables in the real world. Second, they're things we've all been familiar
with since basic algebra, so we understand them very well, we find them
un-intimidating, and very little math is required to use them. Third, they
have the nice property that any well-behaved function can be approximated
arbitrarily closely by a polynomial of sufficiently high degree\footnote{See
further reading, below, for details.}.
If we don't have strong scientific reasons to want to use polynomials, and are
willing to go beyond basic algebra, there are many other systems of functions
which also have the universal approximation property. If we're just doing
curve fitting, it can be just as good, and sometimes much better, to use one
of these other function bases. For instance, we might use sines and cosines
at multiples of a basic frequency $\omega$,
\[
\sum_{j=1}^{d}{\gamma_{i1j}\sin{(j \omega X_i)} + \gamma_{i2j}\cos{(j \omega X_i)}}
\]
Such a basis would be especially appropriate for variables which are really
angles, or when there is a periodicity in the system. Exactly matching a sum
of sines and cosines like the above would require an infinite-order polynomial;
conversely, matching a linear function with a sum of sines and cosines would
require letting $d \rightarrow \infty$.
As this suggests, there is a bit of an art to picking a suitable function
basis; as it also suggests, it's an area where knowledge of more advanced
mathematics (specifically, functional analysis) can be really useful to
actually doing statistics.
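As an illustration of using a non-polynomial basis in \texttt{lm} (a sketch on made-up data, assuming the base frequency $\omega$ is known in advance):
<<eval=FALSE>>=
# Curve-fitting with a sine/cosine basis instead of powers
omega <- 2*pi
x <- runif(200)
y <- sin(omega*x) + 0.5*cos(2*omega*x) + rnorm(200, sd=0.1)
fourier.fit <- lm(y ~ sin(omega*x) + cos(omega*x)
                    + sin(2*omega*x) + cos(2*omega*x))
coefficients(fourier.fit)
@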
\clearpage
\section{Categorical Predictors}
We often have variables which we think are related to $Y$ which are not real
numbers, but are qualitative rather than quantitative --- answers to ``what
kind?'' rather than to ``how much?''. For people, these might be things like
sex, gender, race, caste, religious affiliation, educational attainment,
occupation, whether they've had chicken pox, whether they have previously
defaulted on a loan, or their country of citizenship. For geographic
communities (as in the data analysis project), state was a categorical
variable, though not one we used because we didn't know how.
Some of these are purely qualitative, coming in distinct types, but with no
sort of order or ranking implied; these are often specifically called
``categorical'', and the distinct values ``categories''. (The values are also
called ``levels'', though that's not a good metaphor without an order.) Others
have distinct levels which can be put in a sensible order, but there is no real
sense that the {\em distance} between one level and the next is the same ---
they are {\bf ordinal} but not {\bf metric}. When it is necessary to
distinguish non-ordinal categorical variables, they are often called {\bf
nominal}, to indicate that their values have names but no order.
In R, categorical variables are represented by a special data type called
\texttt{factor}, which has as a sub-type for ordinal variables the data type
\texttt{ordered}.
In this section, we'll see how to include both categorical and ordinal
variables in multiple linear regression models, by {\bf coding} them as
numerical variables, which we know how to handle.
\subsection{Binary Categories}
The simplest case is that of a binary variable $B$, one which comes in two
qualitatively different types. To represent this in a format which fits with
the regression model, we pick one of the two levels or categories as the
``reference'' or ``baseline'' category. We then add a column $X_B$ to the
design matrix $\mathbf{x}$ which indicates, for each data point, whether it
belongs to the reference category ($X_B=0$) or to the other category ($X_B=1$).
This is called an {\bf indicator variable} or {\bf dummy variable}. That is,
we {\bf code} the qualitative categories as 0 and 1.
We then regress on the indicator variable, along with all of the others,
getting the model
\[
Y = \beta_0 + \beta_B X_B + \beta_1 X_1 + \ldots \beta_p X_p + \epsilon
\]
The coefficient $\beta_B$ is the expected difference in $Y$ between two units
which are identical, except that one of them has $X_B=0$ and the other has
$X_B=1$. That is, it's the expected difference in the response between members
of the reference category and members of the other category, all else being
equal. For this reason, $\beta_B$ is often called the {\bf contrast}
between the two classes.
Geometrically, if we plot the expected value of $Y$ against $X_1, \ldots X_p$,
we will now get {\em two} regression surfaces: they will be parallel to each
other, and offset by $\beta_B$. We thus have a model where each category gets
its own intercept: $\beta_0$ for the reference class, $\beta_0 + \beta_B$ for
the other class. You should, at this point, convince yourself that if we had
switched which class was the reference class, we'd get exactly the same slopes,
only with the over-all intercept being $\beta_0 + \beta_B$ and the contrast
being $-\beta_B$ (Exercise \ref{exercise:doesnt-matter-which-class-is-ref}).
\paragraph{In R} If a data frame has a column which is a two-valued factor
already, and it's included in the right-hand side of the regression formula,
\texttt{lm} handles creating the column of indicator variables internally.
Here, for instance, we use a classic data set to regress the weight of a cat's
heart on its body weight and its sex. (If it worked, such a model would be
useful in gauging doses of veterinary heart medicines.)
<<>>=
library(MASS)
data(cats)
Hwt.lm <- lm(Hwt ~ Sex+Bwt, data=cats)
summary(Hwt.lm)
@
\texttt{Sex} is coded as \texttt{F} and \texttt{M}, and R's output
indicates that it chose \texttt{F} as the reference category.
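As a check on the point above about switching reference categories, we can re-fit after changing the reference level with \texttt{relevel}; the slope on \texttt{Bwt} is unchanged, while the intercept shifts and the contrast flips sign:
<<eval=FALSE>>=
cats$Sex2 <- relevel(cats$Sex, ref="M")
coefficients(lm(Hwt ~ Sex + Bwt, data=cats))   # F is the reference
coefficients(lm(Hwt ~ Sex2 + Bwt, data=cats))  # M is the reference
@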
\paragraph{Diagnostics} The mean of the residuals within each category is
guaranteed to be zero (Exercise \ref{exercise:zero-mean-residual-by-category}),
but they should also have the same variance and otherwise the same
distribution, so there is still some point in plotting residuals against
$X_B$. Sometimes a little jitter on the horizontal axis helps,
or making a box-plot.
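For instance, with the cats model from above:
<<eval=FALSE>>=
# Residuals within each category, as box-plots
boxplot(residuals(Hwt.lm) ~ cats$Sex, xlab="Sex", ylab="Residuals")
@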
\paragraph{Inference} There is absolutely nothing special about the inferential
statistics for the estimated contrast $\hat{\beta}_B$. It works just like
inference for any other regression coefficient.
\paragraph{Why not just split the data?} If we want to give each class its own
intercept, why not just split the data and estimate two models, one for each
class? The answer is that sometimes we'll do just this, especially if there's
a lot of data for each class. However, if the regression surfaces for the two
categories really are parallel to each other, by splitting the data we're
losing some precision in our estimate of the common slopes, without gaining
anything. In fact, if the two surfaces are {\em nearly} parallel, for moderate
sample sizes the small bias that comes from pretending the slopes are all equal
can be overwhelmed by the reduction in variance.
\paragraph{Why not two columns?} It's natural to wonder why we have to pick out
one level as the reference, and estimate a contrast. Why not add {\em two}
columns to $\mathbf{x}$, one indicating each class? The problem is that then
those two columns will be linearly dependent (they'll always add up to one), so
the data would be collinear and the model inestimable.
\paragraph{Why not two slopes?} The model we've specified has two parallel
regression surfaces, with the same slopes but different intercepts. We could
also have a model with the same intercept across categories, but different
slopes for each variable. Geometrically, this would mean that the regression
surfaces weren't parallel, but would meet at the origin (and elsewhere). We'll
see how to make that work when we deal with interactions in a few lectures.
If we wanted different slopes and intercepts, we might as well just split
the data.
\paragraph{Contrasts need contrasts} Just as we can't estimate $\beta_i$ if
$\Var{X_i} = 0$, we can't estimate any categorical contrasts if all the data
points belong to the same category.
\subsubsection{``Adjusted effect of a category''}
As I said, $\beta_B$ is the expected difference in $Y$ between two individuals
which have the same value for all of the variables {\em except} the category.
This is generally {\em not} the same as the difference in expectations
between the two categories:
\[
\beta_B \neq \Expect{Y|X_B=1} - \Expect{Y|X_B=0}
\]
One of the few situations where $\beta_B = \Expect{Y|X_B=1} - \Expect{Y|X_B=0}$
is when the distribution of all the {\em other} variables is the same between
the categories. (Said another way, the categories are statistically
independent of the other predictors.) Another is when there are no other
predictors.
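A small simulation with made-up coefficients illustrates the gap between the contrast and the difference in expectations when the other predictor's distribution differs across categories:
<<eval=FALSE>>=
n <- 1e4
xb <- rbinom(n, size=1, prob=0.5)
x1 <- rnorm(n, mean=2*xb)            # x1's distribution depends on the category
y <- 1 + 3*xb + 2*x1 + rnorm(n)
coefficients(lm(y ~ xb + x1))["xb"]  # near the contrast, 3
mean(y[xb==1]) - mean(y[xb==0])      # near 3 + 2*2 = 7
@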
Because of this, it's very natural to want to interpret $\beta_B$ as the
difference in the response between the two groups, {\em adjusting for} all of
the other variables. It's even common to talk about $\beta_B$ as ``the
adjusted effect'' of the category. As you might imagine, such interpretations
come up all the time in disputes about discrimination.
Even leaving aside the emotional charge of such arguments, it is wise to be
cautious about such interpretations, for several reasons.
\begin{enumerate}
\item The regression is only properly adjusting for all of the other variables
if it's well-specified. If it's not, the contrast between the categories
will also pick up some of the average difference in bias (due to getting the
model wrong), which is not relevant.
\item As usual, finding that the contrast coefficient isn't significant doesn't
necessarily mean there is no contrast! It means that the contrast, if there
is one, can't be reliably distinguished from 0, which could be because it's
very small or because we can't estimate it well. Again as usual, a
confidence interval is called for.
\item It's not clear that we always {\em do} want to adjust for other
variables, even when we can measure them. For instance, if economists in
  Lilliput found no difference in income between those who broke their eggs at the
  big end and those at the little end, after adjusting for education and
  occupational prestige \citep{Swift-Gulliver}, that wouldn't necessarily
  settle the question of whether big-endians were discriminated against. After
  all, it might be that they have less access to education and high-paid jobs
  {\em because} they were big-endians. And this could be true even if
  Lilliputians were initially randomly assigned between big- and
  little-end-breaking. The same goes for finding that there {\em is} an ``adjusted
  effect''.
\end{enumerate}
The last point brings us close to topics of causal inference, which we won't
get to until 402. For now, a good rule of thumb is not to adjust for variables
which might themselves be effects of the variable we're interested in.
\subsection{Categorical Variables with More than Two Levels}
Suppose our categorical variable $C$ has more than two levels, say $k$ of them.
We can handle it in almost exactly the same way as the binary case. We pick
one level --- it really doesn't matter which --- as the reference level. We
then introduce $k-1$ columns into the design matrix $\mathbf{x}$, which are
indicators for the other categories. If, for instance, $k=3$ and the classes
are \texttt{North}, \texttt{South}, \texttt{West}, we pick one level, say
\texttt{North}, as the reference, and then add a column $X_{\texttt{South}}$
which is 1 for data points in class \texttt{South} and 0 otherwise, and another
column $X_{\texttt{West}}$ which is 1 for data points in that class and 0
otherwise.
Having added these columns to the design matrix, we regress as usual, and get
$k-1$ contrasts. The over-all $\beta_0$ is really the intercept for the
reference class; the contrasts are added to $\beta_0$ to get the intercept for
each class. Geometrically, we now have $k$ parallel regression surfaces, one
for each level of the variable.
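We can inspect the coding R uses by looking at the design matrix it builds; here is a small sketch with a made-up three-level factor:
<<eval=FALSE>>=
region <- factor(c("North", "South", "West", "South", "North"))
model.matrix(~ region)  # intercept plus k-1 = 2 indicator columns
@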
\paragraph{Interpretation} $\beta_{C=c}$ is the expected difference between two
individuals who are otherwise identical, except that one is in the reference
category and the other is in class $c$. The expected difference between two
otherwise-identical individuals in two different categories, say $c$ and $d$,
is therefore the difference in their contrasts, $\beta_{C=d} - \beta_{C=c}$.
\paragraph{Diagnostics and inference} Work just the same as in the binary case.
\paragraph{Why not $k$ columns?} Because, just like in the binary case,
that would make all those columns for that variable sum to 1,
causing problems with collinearity.
\paragraph{Contrasts need contrasts} If we know there are $k$ categories, but
some of them don't appear in our data, we can't estimate their contrasts.
\paragraph{Category-specific slopes and splitting the data} The same remarks apply
as under binary predictor variables.
\subsection{Two, Three, Many Categorical Predictors}
Nothing in what we did above requires that there be only one categorical
predictor; the other variables in the model could be indicator variables for
other categorical predictors. Nor do all the categorical predictors have to
have the same number of categories. The only wrinkle with having multiple
categories is that $\beta_0$, the over-all intercept, is now the intercept for
individuals where {\em all} categorical variables are in their respective
reference levels. Each combination of categories gets its own regression
surface, still parallel to each other.
With multiple categories, it is natural to want to look at interactions --- to
let there be an intercept for left-handed little-endian plumbers, rather than
just adding up contrasts for being left-handed and being a little-endian and
being a plumber. We'll look at that when we deal with interactions.
\subsection{Analysis of Variance: Only Categorical Predictors}
A model in which there are {\em only} categorical predictors is, for historical
reasons, often called an {\bf analysis of variance} model. Estimating such a
model presents absolutely no special features beyond what we have already
covered, but it's worth a paragraph or two on the interpretation
and the origins of such models.
Suppose, for simplicity, that there are two categorical predictors, $B$ and
$C$, and the reference level for each is written $\emptyset$. The conditional
expectation of $Y$ will be pinned down by giving a level for each, say $b$ and
$c$, respectively. Then
\[
\Expect{Y|B=b, C=c} = \beta_0 + \beta_b(1-\delta_{b\emptyset}) + \beta_c(1-\delta_{c\emptyset})
\]
where $\delta$ is the Kronecker delta, so a contrast term drops out when the
corresponding variable is at its reference level.
That is, we add the appropriate contrast for each categorical variable, and
nothing else. (This presumes no interactions, a limitation which we'll lift
next week.) Conversely, if we knew $\Expect{Y|B=b,C=c}$ for every category, we
could work out the contrasts without having to ever (explicitly) compute
$(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}$, which was a very real
consideration before computation became so cheap\footnote{To see how, notice
that $\beta_0$ can be estimated by the sample mean of all cases where
$B=\emptyset$, $C=\emptyset$. Then to get, say, $\hat{\beta}_b$, we average the
difference in means between cases where $B=b, C=c$ and $B=\emptyset, C=c$ for
each level $c$ of the other variable. (This averaging of differences
eliminates the contribution from $\beta_c$.)}. Obviously, however, it is not
much of an issue now.
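As a numerical check of the recipe in the footnote, here is a small simulation
(the balanced design and all the coefficients below are made up for
illustration):
<<>>=
# Simulated balanced data with two binary categorical predictors
set.seed(42)
B <- factor(rep(c("ref", "b"), each=100), levels=c("ref", "b"))
C <- factor(rep(rep(c("ref", "c"), each=50), times=2), levels=c("ref", "c"))
y <- 1 + 2*(B == "b") + 3*(C == "c") + rnorm(200)
# The least-squares contrast for B ...
coef(lm(y ~ B + C))
# ... matches the average, over levels of C, of differences in group means
mean(sapply(levels(C), function(l) {
    mean(y[B == "b" & C == l]) - mean(y[B == "ref" & C == l])
}))
@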
As for the name, it arises from the basic probability fact sometimes called the
``law of total variance'':
\[
\Var{Y} = \Var{\Expect{Y|X}} + \Expect{\Var{Y|X}}
\]
If $X$ is our complete set of categorical variables, each of which defines a
group, this says ``The total variance of the response is the variance in
average responses across groups, plus the average variance within a group''.
Thus, after estimating the contrasts, we have decomposed or analyzed the
variance in $Y$ into between-group and within-group variance. This was
extremely useful in the early days of agricultural and industrial
experimentation, but has frankly become a bit of a fossil, if not a fetish.
An ``analysis of covariance'' model is just a regression with both qualitative
and quantitative predictors.
\subsection{Ordinal Variables}
An ordinal variable, as I said, is one where the qualitatively-distinct
levels can be put in a sensible order, but there's no implication
that the distance from one level to the next is constant. At our
present level of sophistication, we have basically two ways to handle them:
\begin{enumerate}
\item Ignore the ordering and treat them like nominal categorical variables.
\item Ignore the fact that they're only ordinal and not metric, assign them
  numerical codes (say 1, 2, 3, \ldots), and treat them like ordinary numerical
  variables.
\end{enumerate}
The first procedure is unbiased, but can end up dealing with a lot of distinct
coefficients. It also has the drawback that if the relationship between $Y$
and the categorical variable is monotone, that may not be respected by the
coefficients we estimate. The second procedure is very easy, but usually
without any substantive or logical basis. It implies that each step up in the
ordinal variable will predict exactly the {\em same} difference in $Y$, and why
should that be the case? If, after treating an ordinal variable like a nominal
one, we get contrasts which are all (approximately) equally spaced, we might
then try the second approach.
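In R, the two procedures look like this (a sketch, not run; the data frame
\texttt{df} and its \texttt{education} column are hypothetical):
<<eval=FALSE>>=
# Sketch only: df and its education column are hypothetical
df$ed <- factor(df$education, levels=c("HS", "BA", "grad"))
lm(y ~ ed, data=df)              # (1) nominal: a contrast per non-reference level
lm(y ~ as.numeric(ed), data=df)  # (2) numeric codes 1, 2, 3: a single slope
@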
Other procedures for ordinal variables which are, perhaps, more conceptually
satisfying need much more math than we're presuming here; see the further
reading.
\subsection{Detailed R Example}
The data set for the first data analysis project included a categorical
variable, \texttt{State}, which we did not use. Let's try adding it
to the model.
First, let's do some basic counting and examination:
<<>>=
# How many levels does State have?
nlevels(mobility$State)
# What are they?
levels(mobility$State)
@
There are 51 levels for \texttt{State}, as there should be, corresponding to the 50 states and the District of Columbia. We see that these are given by the two-letter postal codes, in alphabetical order.
Running a model with \texttt{State} and \texttt{Commute} as the predictors, we
therefore expect to get 52 coefficients (1 intercept, 1 slope, and 51-1 = 50
contrasts). R will calculate contrasts from the first level, which here is
\texttt{AK}, or Alaska.
<<>>=
mob.state <- lm(Mobility ~ Commute + State, data=mobility)
signif(coefficients(mob.state),3)
@
In the interest of space, I won't run \texttt{summary} on this, but you can.
You will find that quite a few of the contrasts are statistically significant.
We'd expect about $50\times 0.05 = 2.5$ of them to be significant at the 5\%
level even if all the true contrasts were zero, but many more than this
baseline are significant. As usual, of course, this doesn't mean the model is
right; it just means that if we are going to put in an intercept, a slope for
\texttt{Commute}, and contrasts for the other states, we should really put in
contrasts for those states as well.
One issue with the simple linear regression from the DAP was that its
residuals were very strongly correlated spatially. We might hope that
adding all these state-by-state contrasts has gotten rid of some of that
correlation.
\begin{figure}
<<>>=
# Set up a function to make maps
# "Terrain" color levels set based on quantiles of the variable being plotted
# Inputs: vector to be mapped over the data frame; number of levels to
# use for colors; other plotting arguments
# Outputs: invisibly, list giving cut-points and the level each observation
# was assigned
mapper <- function(z, levels, ...) {
    # Which quantiles do we need?
    probs <- seq(from=0, to=1, length.out=(levels+1))
    # What are those quantiles?
    z.quantiles <- quantile(z, probs)
    # Assign each observation to its quantile
    z.categories <- cut(z, z.quantiles, include.lowest=TRUE)
    # Make up a color scale
    shades <- terrain.colors(levels)
    plot(x=mobility$Longitude, y=mobility$Latitude,
         col=shades[z.categories], ...)
    invisible(list(quantiles=z.quantiles, categories=z.categories))
}
@
\caption{Function for making maps, from the model DAP 1.}
\label{fig:mapper}
\end{figure}
\begin{figure}
<<residuals-basic-map, eval=FALSE>>=
residuals.basic.map <- mapper(residuals(lm(Mobility ~ Commute, data=mobility)),
                              levels=4, pch=19,
                              cex=0.5, xlab="Longitude", ylab="Latitude",
                              main="Mobility")
legend("topright", legend=levels(residuals.basic.map$categories), pch=19,
       col=terrain.colors(4), cex=0.8)
@
<<echo=FALSE>>=
<<residuals-basic-map>>
@
\caption{Map of residuals from a baseline linear regression of the rate of
  economic mobility on the fraction of workers with short commutes.}
\end{figure}
\begin{figure}
<<residuals-state-map, eval=FALSE>>=
residuals.state.map <- mapper(residuals(mob.state), levels=4, pch=19,
                              cex=0.5, xlab="Longitude", ylab="Latitude",
                              main="Mobility")
legend("topright", legend=levels(residuals.state.map$categories), pch=19,
       col=terrain.colors(4), cex=0.8)
@
<<echo=FALSE>>=
<<residuals-state-map>>
@
\caption{Map of the residuals for the model with state-level contrasts. (See
Figure \ref{fig:mapper} for the \texttt{mapper} function.)}
\label{fig:state-level-residuals}
\end{figure}
When we have a large number of categories, it's often tempting to try
compressing them to a smaller number, by grouping together some of the levels.
If we do this right, we reduce the variance in our estimates of the
coefficients, while introducing little (if any) bias.
To illustrate this, let's try boiling down the 51 states (and DC) into two
categories: the South versus the rest of the country. The South has long been
quite distinct from the rest of the country culturally, politically and
economically, in ways which are, plausibly, very relevant to economic mobility.
More relevantly, when we looked at the residuals in the DAP, there was a big
cluster of negative residuals in the southeastern part of the map. To make
this concrete, I'll define the South as consisting of those states which joined
the Confederacy during the Civil War (Alabama, Arkansas, Florida, Georgia,
Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, Texas and
Virginia).
Let's start by adding the relevant column to the data frame:
<<>>=
# The states of the Confederacy
Confederacy <- c("AR", "AL", "FL", "GA", "LA", "MS", "NC", "SC", "TN", "TX", "VA")
mobility$Dixie <- mobility$State %in% Confederacy
@
The new \texttt{Dixie} column of \texttt{mobility} will contain the value
\texttt{TRUE} for each community located in one of those states, and
\texttt{FALSE} for the rest. In such circumstances, R treats \texttt{FALSE}
as the reference category.
<<>>=
mob.dixie <- lm(Mobility ~ Commute + Dixie, data=mobility)
signif(coefficients(summary(mob.dixie)),3)
@
The contrast for the old Confederacy versus the rest of the country is
negative, meaning those states have lower levels of economic mobility,
and highly statistically significant. Of course, the model
could still be wrong. The residuals, while better-behaved than those from a
model with no geographic contrasts, don't look as random as those from the
model with contrasts for each state.
\begin{figure}
<<residuals-dixie-map, eval=FALSE>>=
residuals.dixie.map <- mapper(residuals(mob.dixie), levels=4, pch=19,
                              cex=0.5, xlab="Longitude", ylab="Latitude",
                              main="Mobility")
legend("topright", legend=levels(residuals.dixie.map$categories), pch=19,
       col=terrain.colors(4), cex=0.8)
@
<<echo=FALSE>>=
<<residuals-dixie-map>>
@
\caption{Map of the residuals from the model with \texttt{Commute} and a
  categorical contrast between the old Confederacy and the rest of the
  country.}
\label{fig:residuals-dixie}
\end{figure}
\clearpage
\section{Further Reading}
Polynomial regression and categorical predictors are both ancient topics; I
don't know who first introduced either.
The above discussion has assumed that when we use a polynomial, we use the {\em
same} polynomial for all values of $X_i$. An alternative is to use
different, low-order polynomials in different regions. If these piecewise
polynomial functions are required to be continuous, they are called {\bf
splines}, and regression with splines will occupy us for much of 402, because
it gives us ways to tackle lots of the issues with polynomials, like
over-fitting \citep[chs.\ 8 and 9]{CRS-ADAfaEPoV}. Personally, I have found
splines to almost always be a better tool than polynomial regression, but they
do demand a bit more math.
The matter of ``adjusted effects'' and causal inference will occupy us for
about the last quarter of 36-402.
\citet{Tutz-regression-for-categorical} is a thorough and modern survey of
regression with categorical {\em response} variables. We will go over this in
some detail in 402, but his book covers many topics we won't have time for.
\citet{Winship-Mare-regression-with-ordinal-variables} proposes some
interesting techniques for dealing with ordinal variables, under the (strong)
assumption that they arise from taking continuous variables and breaking them
into discrete categories. This seems to require rather strong assumptions
about the measurement process. Another direction we could go would be to
estimate a separate contrast for each level of an ordinal variable (except the
lowest), but require these to be either all increasing or all decreasing, so
the response to the ordinal variable was monotone. This would mean solving a
{\em constrained} least squares (or maximum likelihood) problem to get the
estimates, not an unconstrained one. Worse, the constraints would be a somewhat
awkward set of inequalities. Still, it's do-able in principle, though I don't
know of a straightforward R implementation.
Analysis of variance models were introduced by R. A. Fisher, probably the
greatest statistician who ever lived, in connection with problems in genetics
and in designing and interpreting experiments. They have given rise to a huge
literature and an elaborate system of notation and terminology, much of which
boils down to short-cuts for computing regression estimates when the design
matrix $\mathbf{x}$ has very special structure. As I said, there were many
decades when such short-cuts were vital, but I am frankly skeptical how much
value these techniques retain in the present day. In the interest of balance,
see \citet{Gelman-pro-ANOVA} for a contrary view.
I mentioned that one reason to use polynomials is that any well-behaved
function can be approximated arbitrarily closely by polynomials of sufficiently
high degree. Obviously ``well-behaved'' needs a proper definition, as (perhaps
less obviously) does ``approximated arbitrarily closely''. What I had in mind
was the Stone-Weierstrass theorem, which states that you can pick any
continuous function $f$, interval $[a,b]$, and tolerance $\epsilon > 0$ you
like, and I can find some polynomial which is within $\epsilon$ of $f$
everywhere on the interval,
\[
\max_{a \leq x \leq b}{\left| f(x) - \sum_{j=0}^{d}{\gamma_j x^j}\right|} \leq \epsilon
\]
provided there is no limit on the order $d$ or the magnitude of the
coefficients $\gamma_j$. This is a standard result of real analysis, which
will be found in almost any textbook on that subject, or on functional analysis or
approximation theory. There are parallel results for other
function bases.
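We can illustrate (though certainly not prove) the theorem by fitting
least-squares polynomials of increasing degree to a smooth function and
watching the worst-case error shrink; the function and degrees below are
arbitrary choices:
<<>>=
# Worst-case error of least-squares polynomial approximations to sin(x)
# on [0, 2*pi], for a few degrees
x <- seq(0, 2*pi, length.out=1000)
y <- sin(x)
for (d in c(1, 3, 5, 9)) {
    cat("degree", d, ": max |error| =",
        signif(max(abs(y - fitted(lm(y ~ poly(x, d))))), 3), "\n")
}
@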
\section{Exercises}
To think through or practice on, not to hand in.
\begin{enumerate}
\item \label{exercise:doesnt-matter-which-class-is-ref} Consider regressing $Y$
on a binary categorical variable $B$, plus some other predictors. Suppose we
switch which level is the reference category and which one is contrasted with
it. Show that this produces the following changes to the parameters, and
leaves all the others unchanged:
\begin{eqnarray}
\beta_0 & \rightarrow & \beta_0 + \beta_B\\
\beta_B & \rightarrow & -\beta_B
\end{eqnarray}
{\em Hint:} Show that the change to the indicator variable is $X_B
\rightarrow 1 - X_B$.
\item \label{exercise:zero-mean-residual-by-category} Consider again regressing
$Y$ on a binary variable $B$, plus some other predictors, and estimating all
coefficients by least squares. Show that the average of all residuals where
$X_B=1$ must be exactly 0, as must the average of all residuals where
$X_B=0$. {\em Hint:} Use the estimating equations to show $\sum_{i}{e_i} =
0$, $\sum_{i}{e_i x_{Bi}} = 0$, and algebra to show $\sum_{i}{e_i(1-x_{Bi})}
= 0$.
\end{enumerate}
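Neither exercise calls for simulation, but both claims are easy to check
numerically; here is a sketch with made-up data (a sanity check, not a
substitute for the derivations):
<<>>=
# Numerical check of both exercises on simulated data
set.seed(1)
x <- rnorm(100)
b <- rbinom(100, 1, 0.5)
y <- 1 + 2*x - 3*b + rnorm(100)
fit <- lm(y ~ x + b)
fit.switched <- lm(y ~ x + I(1 - b))  # swap which level is the reference
# Exercise 1: intercept becomes beta_0 + beta_B, contrast flips sign
signif(rbind(coef(fit), coef(fit.switched)), 3)
# Exercise 2: residuals average to zero within each category
signif(tapply(residuals(fit), b, mean), 3)
@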
\bibliography{locusts}
\bibliographystyle{crs}
\end{document}