---
title: Trends, Smoothing, Detrending
author: 36-467/36-667
date: 30 August 2018
bibliography: locusts.bib
output:
html_document:
toc: true
number_sections: true
toc_depth: 4
---
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\dof}{DoF}
\DeclareMathOperator{\det}{det}
\]
# In our last episode...
Our data will generally be a realization of a **stochastic process**, i.e., a
collection of random variables spread over time and space. The random variable
we observe at spatial coordinate $r$ and time $t$ will be, generically,
$X(r,t)$. If we just care about spatial data, we'll just have $X(r)$, and
purely temporal data (like the Kyoto data) will be $X(t)$. In general, the
random variables at different points will all be dependent on each other.
Because they are random variables, it makes sense to think about their central
tendency, and the expected value or mean is often a convenient one:
\[
\TrueRegFunc(r,t) = \Expect{X(r,t)}
\]
This is a _deterministic_ function of the coordinates; it's what we mean
by the **trend**.
Mathematically, we can always decompose the random variables into the
deterministic trend plus random or stochastic **fluctuations**, which always
have expectation zero:
\begin{eqnarray}
X(r,t) & = & \TrueRegFunc(r,t) + \TrueNoise(r,t)\\
\Expect{\TrueNoise(r,t)} & = & 0
\end{eqnarray}
We can't assert anything more about $\TrueNoise$ without further assumptions. In
particular, we can't make the usual assumptions that are made in regression
courses, that $\TrueNoise$ is uncorrelated, that it has constant variance, that
it has the same distribution at each point, that it's Gaussian, etc. But
we can, without any loss of generality, say $\Expect{\TrueNoise(r,t)} = 0$
everywhere.
# Learning the Trend
We would like to learn the trend function $\TrueRegFunc$.
## Many realizations
If we have _many_ independent realizations of the process, we could just
average them. Say we can re-run some experiment many times, getting
independent realizations $X^{(1)}, X^{(2)}, \ldots X^{(m)}$. Then
\[
\frac{1}{m}\sum_{i=1}^{m}{X^{(i)}(r,t)} \rightarrow \TrueRegFunc(r,t)
\]
by the ordinary law of large numbers. As my mention of "experiment" suggests,
this idea _can_ be applied in the experimental sciences. In neuroscience,
for instance, we might repeatedly measure what happens to a particular neuron
(or group of neurons, or brain region) after we apply some stimulus to
an experimental animal[^whiskers]. This is, however, not available
in non-experimental contexts.
[^whiskers]: I once analyzed an experiment where a mechanical vibrator repeatedly gave very precise tweaks to a rat's whiskers. (Rats devote a _lot_ of their brain to processing sensations from their whiskers, which help them navigate in the dark.)
## Single realization
More typically, we have only _one_ realization of the $X$ process --- only one
history of cherry-blossoms in Kyoto, only one map of ground-water under
Wisconsin, etc. We will only be able to learn about $\TrueRegFunc$ under
assumptions.
### Trends from theory
In some situations, we might have a well-confirmed scientific theory which
tells us _exactly_ what the trend should be, say that $\TrueRegFunc(r,t) =
f(r,t)$ for some completely-specified function $f$. There are very few
situations where this applies --- _perhaps_ in parts of astronomy and physics.
More common is the situation where theory pins down a specific functional
_form_ for the trend, but there are some unknown coefficients in the formula.
That is, $\TrueRegFunc(r,t) = f(r,t,\theta)$ for one or more parameters
$\theta$. One would then need to estimate $\theta$ from the data $X$ to obtain
an estimate of the trend. The dependence of the observations on each other
makes the statistical inference more complicated, which we'll come back to
later in the course, but in principle this is just about estimating parameters.
Of course, for these approaches to be good ideas, you have to actually have
reason to think your theory really does give you a good idea of the trend.
There are domains where this is true --- physics, chemistry, some parts of
biology (especially genetics, evolution and ecology) --- where we have very
well-supported theories which are detailed enough to pin down specific
functional forms. (Some economists think this is true in economics as well,
but I think they're fooling themselves.) Beyond that, though, scientific
theories are much looser -- they might tell us that one variable tends to go up
with another, but just don't have the detail to give a functional form. This
might not be a permanent condition --- it used to be true even of physics! ---
but for now, in lots of scientific areas, let alone in business or policy, we
just don't have credible theories which tell us about trends.
### Trends from data
We might, however, make a different sort of assumption, which is just that
$\TrueRegFunc$ doesn't change very rapidly --- that it is a **smooth**
function. The basic idea we need is that $\TrueRegFunc(t-h) \approx \TrueRegFunc(t) \approx
\TrueRegFunc(t+h)$, at least for sufficiently small $h$, and at least for most values of $t$. (You can
write out the corresponding idea for spatio-temporal trends, with some more
notation.) Each $X(t)$ is a noisy observation of $\TrueRegFunc(t)$, but if $\TrueRegFunc(t-h)$
and $\TrueRegFunc(t+h)$ are close to $\TrueRegFunc(t)$, then $X(t-h)$ and $X(t+h)$ will be noisy
observations of things close to $\TrueRegFunc(t)$, giving us more information.
To make this concrete, and see how it can work, let's think of a very crude
estimate:
\[
\EstRegFunc(t_i) = \frac{1}{3}\sum_{k=-1}^{k=+1}{X(t_{i+k})}
\]
Each $X$ is trend plus fluctuation, so
\begin{eqnarray}
\EstRegFunc(t_i) & = & \frac{1}{3}\sum_{k=-1}^{k=+1}{\TrueRegFunc(t_{i+k})} + \frac{1}{3}\sum_{k=-1}^{k=+1}{\TrueNoise(t_{i+k})}\\
& = & \TrueRegFunc(t_i) + \frac{1}{3}\sum_{k=-1}^{k=+1}{(\TrueRegFunc(t_{i+k}) - \TrueRegFunc(t_i))} + \frac{1}{3}\sum_{k=-1}^{k=+1}{\TrueNoise(t_{i+k})}\\
& = & \text{truth} + \text{bias} + \text{noise}
\end{eqnarray}
The first term is exactly what we want. The third term is noise, and the
average of many noise terms is _generally_ smaller than the individual noises.
(If the $\TrueNoise$ were uncorrelated with equal variance $\sigma^2$, this would
have variance $\sigma^2/3$.) The middle term is **bias**, systematic error.
The more rapidly $\TrueRegFunc$ changes, and the further apart our observations are
located, the bigger this bias gets. Conversely, if $\TrueRegFunc$ changes slowly and our
observations are closely spaced, the bias will be small.
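To see this trade-off in action, here's a tiny simulation (my own made-up example, not one of our data sets), where the trend is a slow sine wave and we apply the crude three-point average:
```{r}
# Simulated illustration: slowly-varying trend plus noise, smoothed by
# averaging each interior point with its two neighbors
t.grid <- 1:100
mu.true <- sin(2*pi*t.grid/100)                    # a smooth, slowly-changing trend
x.obs <- mu.true + rnorm(100, sd=0.5)              # trend plus uncorrelated noise
mu.hat3 <- (x.obs[1:98] + x.obs[2:99] + x.obs[3:100])/3  # 3-point averages
mean((x.obs[2:99] - mu.true[2:99])^2)              # squared error of the raw data
mean((mu.hat3 - mu.true[2:99])^2)                  # typically about 1/3 as large
```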
At this point, we _could_ go into considerable detail about exactly what the
bias is, under various precise mathematical characterizations of "smooth
function". (For instance, we could use Taylor approximation[^taylor] to relate
the bias to the first and second derivatives of $\TrueRegFunc$.) But it'll be more
productive to think about the more general _type_ of estimator which includes
the simple averaging of near-by values. There is, after all, no reason
to just use the immediate neighbors, nor to use equal weights.
[^taylor]: As you remember from calculus, the Taylor expansion of a function $f$ around a point $t_0$ is $f(t) = f(t_0) + (t-t_0) f^{\prime}(t_0) + \frac{1}{2}(t-t_0)^2 f^{\prime\prime}(t_0) + \ldots$, where $f^{\prime}$, $f^{\prime\prime}$, etc., are the first, second, etc., derivatives. In a Taylor approximation, we truncate the series at a certain order. The error of a $k^{\mathrm{th}}$ order approximation is $O((t-t_0)^{k+1})$.
# Linear smoothers
We have data points $x(t_1), \ldots x(t_n)$; for conciseness I will sometimes
abbreviate these as $x_1, \ldots x_n$.
The general form of a **linear smoother** is
\[
\EstRegFunc(t) = \sum_{j=1}^{n}{w(t, t_j) x_j}
\]
That is, the predictions or fitted values, the estimates of the trend
$\TrueRegFunc$, are always linear combinations of the observed values of $x$.
Notice that the smoother is linear in $x$, not in $t$ --- the resulting curve
can be _very_ nonlinear in $t$.
## Examples of linear smoothers
Most of the smoothing, regression and estimation techniques used in practice
fall in this class.
- _Global constant_: The trivial case, $w=1/n$ no matter what.
- $k$ _Nearest neighbors_ (kNN): $w(t, t_j) = 1/k$ if $t_j$ is one of the $k$
points closest to $t$, and otherwise $w=0$.
- _Moving average_ (MA): $w(t, t_j) = 0$ outside some distance from $t$, and
constant within it.
+ If data points are always equally spaced in time (or space, etc.), moving averages and nearest neighbors are equivalent, but they can differ for irregularly-spaced data.
+ You can also do one-sided moving averages, usually for the past.
- _Weighted moving averages_: $w(t, t_j) = f(|t-t_j|/\tau)$, for some fixed function $f$. A common choice is the exponential function, giving **exponentially-weighted moving averages**.
- _Kernels_: Like moving averages, but with weights normalized, so we're always doing a proper average. That is, one picks a probability-density function $K$, and some scale, or **bandwidth**, $h$, and
\[
w(t, t_j) = \frac{K(|t-t_j|/h)}{\sum_{j^{\prime}=1}^{n}{K(|t-t_{j^{\prime}}|/h)}}
\]
- _Linear and polynomial regression_: It's not obvious, but these too
are linear smoothers, though the weights are weird. (See the optional section on linear regression below.)
- _Splines_: I teased you with these last time. They're smooth, piecewise
polynomials, and so they, too, are linear smoothers. (We'll go over the exact
formulas below.)
## Properties of linear smoothers
The fact that lots of statistical methods can be seen as linear smoothers is
useful because it gives us a unified way of understanding how these methods
work, and in particular of when they will do a good job at recovering
trends. Before establishing these properties, I need to introduce one
more bit of notation.
At each $t_i$ where we took an observation, we have a fitted value
$\EstRegFunc(t_i)$, or for short $\EstRegFunc_i$. Group these together
into a vector, or $n\times 1$ matrix,
\[
\mathbf{\EstRegFunc} = \left[\begin{array}{c} \EstRegFunc(t_1) \\ \EstRegFunc(t_2) \\ \vdots \\ \EstRegFunc(t_n) \end{array} \right]
\]
Similarly, group all the observations of $x$ into a vector $\mathbf{x}$.
Then the fitted vectors are the observations _times_ a matrix:
\[
\mathbf{\EstRegFunc} = \mathbf{w} \mathbf{x}
\]
where $w_{ij} = w(t_i, t_j)$.
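As a concrete example of how such a matrix gets built, here is a minimal sketch (my own illustration) of the weight matrix for a Gaussian-kernel smoother at some hypothetical, irregularly spaced observation times:
```{r}
# Weight matrix of a Gaussian-kernel smoother (simulated times and observations)
t.kern <- sort(runif(20, 0, 10))             # hypothetical observation times
h <- 1                                       # an arbitrary bandwidth
K <- dnorm(outer(t.kern, t.kern, "-")/h)     # unnormalized kernel weights
w.kern <- K/rowSums(K)                       # normalize each row to sum to 1
range(rowSums(w.kern))                       # check: every row is a proper average
x.kern <- sin(t.kern) + rnorm(20, sd=0.3)    # simulated observations
mu.hat.kern <- w.kern %*% x.kern             # fitted values = weight matrix times data
```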
### Expectation of the fitted values; bias
The first thing this lets us get at is the expected value of the fitted
values:
\begin{eqnarray}
\Expect{\mathbf{\EstRegFunc}} & = & \Expect{\mathbf{w}\mathbf{X}}\\
& = & \mathbf{w}\Expect{\mathbf{X}}\\
& = & \mathbf{w}\Expect{\mathbf{\TrueRegFunc} + \mathbf{\TrueNoise}}\\
& = & \mathbf{w}\Expect{\mathbf{\TrueRegFunc}} + \mathbf{w}\Expect{\mathbf{\TrueNoise}}\\
& = & \mathbf{w}\mathbf{\TrueRegFunc}
\end{eqnarray}
because $\mathbf{\TrueRegFunc}$ is a constant vector, and the noise $\TrueNoise$
always has expectation 0.
The **bias** of any estimator is the difference between its expected value and
the truth. The estimator is **unbiased** when that difference is zero. So we
have a simple equation for when a smoother gives us an unbiased estimate of the
trend:
\begin{equation}
\text{unbiased trend estimates} \Leftrightarrow \mathbf{\TrueRegFunc} = \mathbf{w}\mathbf{\TrueRegFunc}
\end{equation}
This says that we get an unbiased estimate if, and only if,
$\mathbf{\TrueRegFunc}$ is an eigenvector of $\mathbf{w}$ with eigenvalue 1.
(See below for some reminders of the necessary linear algebra.) Otherwise,
we get some bias in our estimate.
### Shrinkage
Suppose (what is generally true) that $\mathbf{w}$ has $n$ linearly independent
eigenvectors, say $\mathbf{v}_1, \ldots \mathbf{v}_n$, with corresponding
eigenvalues $\lambda_1, \ldots \lambda_n$. These eigenvectors form a basis for the
space of vectors, so for an arbitrary vector $\mathbf{x}$,
\[
\mathbf{x} = \sum_{i=1}^{n}{c_i \mathbf{v}_i}
\]
for some coordinates $c_1, \ldots c_n$. These coordinates say how much
$\mathbf{x}$ projects on to the eigenvectors, i.e., how similar it is to each
of the patterns those eigenvectors represent.
Now what happens when we multiply by $\mathbf{w}$, i.e., when we apply our
favorite linear smoother to $\mathbf{x}$?
\[
\mathbf{w}\mathbf{x} = \sum_{i=1}^{n}{\mathbf{w} c_i \mathbf{v}_i} = \sum_{i=1}^{n}{c_i \lambda_i \mathbf{v}_i}
\]
The components of $\mathbf{x}$ which match eigenvectors $\mathbf{v}_i$ with
large eigenvalues $\lambda_i$ get enhanced, and those which match eigenvectors
with small eigenvalues get **shrunk**.
Most of the time, when we're dealing with smoothers, the largest eigenvalue is
exactly 1. The eigenvectors with eigenvalue 1 define an **invariant**
subspace, one which is left alone by the smoother. Because all the other
eigenvalues are smaller, smoothing _shrinks_ the part of the data which doesn't
conform to the invariant subspace. If the eigenvalues which are less than 1
are in fact exactly 0, every component of the data which doesn't fall in the
invariant subspace is projected away, and the vector of fitted values
$\mathbf{\EstRegFunc}$ has to fall in the span of the leading eigenvectors. If
the smaller eigenvalues aren't quite 0, the shrinkage isn't quite so extreme,
and $\mathbf{\EstRegFunc}$ doesn't have to be in the invariant subspace, but
it is always closer to that space than the data was.
1. If all the weights are non-negative, $w_{ij} \geq 0$, and the weights for
each fitted value sum to 1, $\sum_{j}{w_{ij}} = 1$ for all $i$, then the
largest eigenvalue of $\mathbf{w}$ is exactly 1, and all the other eigenvalues
are $\leq 1$ in magnitude. So when the linear smoother really does proper
averages to get each fitted value, the vector of fitted values is shrunk towards
the invariant subspace.
2. For linear regression, every eigenvalue of $\mathbf{w}$ is either 0 or 1.
Hence the vector of fitted values is always in the invariant subspace.
#### A little example
Let's go back to the idea of averaging each observation with the one
ahead and behind it. We need to decide what to do with the first
and last observations; I'll just average each of them with its single neighbor, using weights of 1/2.
```{r}
n <- 10
w <- matrix(0, nrow=n, ncol=n)
diag(w) <- 1/3              # each point gets weight 1/3 on itself...
for (i in 2:(n-1)) {
    w[i,i+1] <- 1/3         # ...1/3 on the point ahead...
    w[i,i-1] <- 1/3         # ...and 1/3 on the point behind
}
w[1,1] <- 1/2               # the two end points are just averaged with
w[1,2] <- 1/2               # their single neighbor, with weights 1/2
w[n,n-1] <- 1/2
w[n,n] <- 1/2
```
(You can check that the `w` this produces has the right entries.)
Here are the eigenvalues:
```{r}
eigen(w)$values
```
Notice, as promised, that 1 is an eigenvalue, and all the others are smaller in magnitude. The corresponding eigenvector is
```{r}
eigen(w)$vectors[,1]
```
i.e., _constant_. This should make sense --- if every entry in $\mathbf{x}$
was the same, averaging wouldn't change anything.
The next few eigenvectors are better grasped by visualization than by looking
at the numbers.
```{r}
plot(1:10, eigen(w)$vectors[,2], lty="solid", type="l",
xlab="Position", ylab="Eigenvectors of smoothing matrix")
lines(1:10, eigen(w)$vectors[,3], lty="dashed")
lines(1:10, eigen(w)$vectors[,4], lty="dotted")
legend("bottomleft", legend=c(expression(j==2), expression(j==3),
expression(j==4)),
lty=c("solid","dashed","dotted"))
```
These look somewhat like sine waves, because they _are_ somewhat like sine
waves[^movingaveragenadsinewaves]. Because the eigenvalues which go with these
waves are pretty close to 1, the components of the original data which match
these sine waves will be shrunk just a little, while the other components of
the data will be shrunk a lot. So these (plus the constant) are the sorts of
patterns which this moving average will preserve; everything else it tends to
ignore.
[^movingaveragenadsinewaves]: We will come back to this connection, between moving averages and sine waves, when we look at Fourier analysis later in the course. For now, you can take it on trust. --- If you are unwilling to take it on trust, think about how you'd approximate a 2nd derivative with finite differences: $\frac{\frac{x(t+h) - x(t)}{h} - \frac{x(t) - x(t-h)}{h}}{h} = \frac{x(t+h) - 2 x(t) + x(t-h)}{h^2}$. So replacing $x(t)$ with the moving average is approximately the same as replacing $x(t)$ with $x(t) + \kappa x^{\prime\prime}(t)$ for some scaling factor $\kappa$. And you can verify for yourself that every sine or cosine function is an eigenfunction of this differential operator, i.e., that $\left(1+\kappa\frac{d^2}{dt^2}\right)\sin{\omega t} = \lambda\sin{\omega t}$. (Can you find $\lambda$?) So it should make some sense that the eigenvectors of the smoothing matrix are close to the eigenfunctions of the differential operator.
### Variance of the fitted values
Because the data contains fluctuations, which are random, it makes sense
to worry about the variance and covariance of the fitted values. These
are easy:
\begin{eqnarray}
\Var{\mathbf{\EstRegFunc}} & = & \Var{\mathbf{w}\mathbf{X}}\\
& = & \mathbf{w}\Var{\mathbf{X}}\mathbf{w}^T\\
& = & \mathbf{w}\Var{\mathbf{\TrueRegFunc} + \mathbf{\TrueNoise}}\mathbf{w}^T\\
& = & \mathbf{w}\Var{\mathbf{\TrueNoise}}\mathbf{w}^T
\end{eqnarray}
Of course, we don't necessarily know what $\Var{\mathbf{\TrueNoise}}$ is.
In other classes, we often assume that all of the noise terms have equal
variance $\sigma^2$, and are uncorrelated with each other. That means
$\Var{\mathbf{\TrueNoise}} = \sigma^2 \mathbf{I}$, and we get the special
case
\[
\Var{\mathbf{\EstRegFunc}} = \sigma^2 \mathbf{w}\mathbf{w}^T
\]
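As a quick numerical check, here is this formula for the little moving-average matrix $\mathbf{w}$ constructed in the shrinkage example above, taking $\sigma = 1$:
```{r}
# Variance matrix of the fitted values, sigma^2 w w^T, with sigma = 1
v.fitted <- w %*% t(w)
diag(v.fitted)      # every fitted value has variance < 1: averaging reduces the noise
v.fitted[1,2]       # but neighboring fitted values are now correlated
```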
### More about shrinkage; degrees of freedom
As the data randomly vary, the fitted values are going to randomly
vary in response. One way to get a sense of how sensitive the fitted
values are to the data is to look at the sum of the covariances,
\begin{eqnarray}
\sum_{i=1}^{n}{\Cov{\EstRegFunc_i, X_i}} & = & \sum_{i=1}^{n}{\Cov{\sum_{j=1}^{n}{w_{ij} X_j}, X_i}}\\
& = & \sum_{i=1}^{n}{\sum_{j=1}^{n}{w_{ij} \Cov{X_i, X_j}}}\\
& = & \sum_{i=1}^{n}{\sum_{j=1}^{n}{w_{ij} \Cov{\TrueNoise_i, \TrueNoise_j}}}
\end{eqnarray}
Again, in the special case where $\Cov{\TrueNoise_i, \TrueNoise_j} = 0$ if $i\neq j$, and $=\sigma^2$ if $i=j$, this simplifies to
\[
\sum_{i=1}^{n}{\Cov{\EstRegFunc_i, X_i}} = \sigma^2 \tr{\mathbf{w}}
\]
Remembering that the trace of a matrix equals the sum of its eigenvalues,
we use this to _define_ the **effective number of degrees of freedom**:
\[
\dof{\mathbf{w}} \equiv \tr{\mathbf{w}}
\]
If $\mathbf{w}$ is an idempotent projection matrix, this is exactly equal
to the number of dimensions we're projecting down to. If $\mathbf{w}$
isn't a projection matrix, we pick up the dimension of the
invariant space (= the number of $\lambda=1$ eigenvalues), plus a bit more.
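For the little moving-average matrix above, for instance, the effective degrees of freedom is easy to compute:
```{r}
# Effective degrees of freedom = trace of the smoothing matrix
sum(diag(w))       # somewhere between 1 (global constant) and n=10 (no smoothing)
```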
## Splines: Direct control over smoothness
I have teased you several times by now by mentioning **splines**, or more formally **smoothing splines**[^splines]. These
are smooth curves which we find by balancing coming close to the data
points, against not having too much wiggliness or curvature. More specifically,
we pick a $\lambda > 0$, and then solve the optimization problem
\[
\EstRegFunc = \argmin_{m}{\frac{1}{n}\sum_{i=1}^{n}{(x_i - m(t_i))^2} + \lambda\int{(m^{\prime\prime}(t))^2 dt}}
\]
The first term we're optimizing over is the mean squared error. The second
term is (proportional to) the total curvature of the function, as measured by the integrated squared second derivative. It would be zero for any linear
function (whose second derivative is 0 everywhere), and it grows the more the
function bends and twists around. One way to read this is that we are
**penalizing** curvature. In fact, you can think of $\lambda$ as a price ---
we're willing to trade one unit more of curvature for at least $\lambda$ units
of reduced MSE. (Conversely, we're willing to increase the MSE by one unit, if
in doing so we reduce the curvature by at least $1/\lambda$ units.) As
$\lambda \rightarrow 0$, we get incredibly wiggly functions which go through
the data points exactly. As $\lambda \rightarrow \infty$, we approach the
best-fitting straight line, i.e., linear regression.
[^splines]: People have defined many variants on splines. When I use the word without an adjective, I mean smoothing splines, as defined by the optimization problem immediately following this footnote.
The optimization here is over all possible functions with second derivatives.
You might well ask how we could possibly do this, since it's an
infinite-dimensional space. The answer is that the solution is _always_
a piecewise cubic polynomial, whose pieces meet at the data points $t_i$, and
that the solution is also continuous, with continuous first and second
derivatives. (See the optional section below, if you care.) All of
this means that we only have $n$ unknown coefficients to find, rather than
having to search over all possible functions. (Again, see the optional
section below.)
One upshot of being a piecewise polynomial is that the spline is a linear
smoother, with weights which do a kind of local averaging. Another is that
there is a direct correspondence between $\lambda$ and the effective degrees of
freedom, where increasing $\lambda$ lowers the degrees of freedom and vice
versa. (The exact relationship depends on the values of $t_i$ --- again, see
the optional section.)
### Multi-dimensional splines
All of the above is for one-dimensional smoothing. If we're dealing with
data over two dimensions, so it's really $x(r,t)$ rather than
just $x(t)$, we can define the analogous problem:
\[
\argmin_{m}{\frac{1}{n}\sum_{i=1}^{n}{(x_i - m(r_i, t_i))^2} + \lambda\int{\left[ {\left(\frac{\partial^2 m}{\partial r^2}\right)}^2 + 2{\left(\frac{\partial^2 m}{\partial r \partial t}\right)}^2 + {\left(\frac{\partial^2 m}{\partial t^2}\right)}^2\right] dr dt}}
\]
(Three or more coordinates is similar, but with even more notation.) These
are called **thin-plate** splines.
An alternative approach is to use what are called "tensor-product splines".
If you dive into the math (see the optional section), you find that
one-dimensional splines are linear combinations of "basis functions" $B_1(t),
\ldots B_n(t)$ --- the exact shape of the basis functions depends on the $t_i$ but not on the $x_i$. In two dimensions, we can find the basis functions for the two coordinates, say $B_i$ for $r$ and $C_j$ for $t$, and take products, so
\[
m(r,t) = \sum_{i=1}^{n}{\sum_{j=1}^{n}{\beta_{ij} B_i(r) C_j(t)}}
\]
- Thin-plate splines have the nice interpretation of trading curvature against MSE, just like in 1D. Tensor-product splines are harder to interpret.
+ On the other hand, this itself makes more sense when the coordinates have similar scales and meaning --- if $r$ and $t$ are both map coordinates, total curvature is a lot more meaningful than if $r$ is space and $t$ is time.
- Tensor-product splines are usually easier to calculate than thin-plate splines.
### Splines in R
The simplest way to work with splines in R is to use the `smooth.spline`
command. This takes two vectors, `x` (corresponding to our $t$ variable) and
`y` (corresponding to our $x$ variable), and returns a slightly complicated
object. The `x` and `y` components contain the `x` values, and the _fitted_
`y` values, so it can be used for plotting. (See the example in the slides for
lecture 1.) You can also calculate predictions at _new_ values, not
corresponding to the data.
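Here is a minimal sketch of that workflow, with simulated data rather than one of the course data sets:
```{r}
# Fit a smoothing spline to simulated data and predict at new time points
t.sp <- seq(0, 10, length.out=50)
x.sp <- sin(t.sp) + rnorm(50, sd=0.3)
fit.sp <- smooth.spline(x=t.sp, y=x.sp)
head(fit.sp$y)                       # fitted trend values at the observed times
predict(fit.sp, x=c(2.5, 7.5))       # predictions at new values of t
```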
For more sophisticated uses, especially multi-dimensional splines, you need to
use a package. I recommend the `mgcv` package [@Wood-mgcv-2004; @Wood-on-GAMs],
which is primarily designed for smooth, additive models, and as such includes a
lot of different smoothers. A command like
```{r, eval=FALSE}
library(mgcv)
my.fit <- gam(x ~ s(t), data=df)
```
estimates a regression model where `x` (drawn from data frame `df`) is a smooth
spline function of `t` (drawn from the same source), and stores the result in the object `my.fit`. On the other hand,
```{r, eval=FALSE}
gam(x ~ s(r,t), data=df)
```
models `x` as a thin-plate spline over the two variables `r` and `t`, while
```{r, eval=FALSE}
gam(x ~ te(r, t), data=df)
```
uses a tensor-product spline. If you want to model `x` as the _sum_ of two
smooth functions, that would be
```{r, eval=FALSE}
gam(x ~ s(r) + s(t), data=df)
```
There are also a large number of more advanced packages exclusively for dealing
with splines --- too many for me to list here.
### Hodrick-Prescott filtering
In econometrics, it is very common to use what is called a "Hodrick-Prescott
filter" to extract the trend of a time series. (The canonical reference is
@Hodrick-Prescott-on-HP-filter, but this circulated as a working paper for many
years before being formally published.) As shown by
@Paige-Trindade-on-HP-filter, this is _exactly_ the same as applying a
smoothing spline to the data. The only difference is that Hodrick and Prescott
supplied a (fallacious) argument for why $\lambda$ should always be exactly
1600.
### Some mathematical details for splines
> This section is optional, but a lot of it is cool. The text is largely ripped off from @CRS-ADAfaEPoV.
#### Solutions to the spline problem are piecewise-cubic polynomials
The spline problem
asks us to find the function which
minimize the sum of the MSE and a certain integral. Even the MSE can be
brought inside the integral, using Dirac delta functions[^dirac]:
\begin{equation}
\mathcal{L} = \int{\left[\lambda(m^{\prime\prime}(t))^2 + \frac{1}{n} \sum_{i=1}^{n}{(x_i - m(t_i))^2 \delta(t-t_i)}\right] dt}
\end{equation}
In what follows, without loss of generality, assume that the $t_i$ are ordered,
so $t_1 \leq t_2 \leq \ldots t_i \leq t_{i+1} \leq \ldots t_n$. With some loss
of generality but a great gain in simplicity, assume none of the $t_i$ are
equal, so we can make those inequalities strict.
[^dirac]: The Dirac delta function (named after the physicist P. A. M. Dirac) is defined by how it integrates: for any function $f(t)$, $\int{f(t) \delta(t) dt} = f(0)$. You can think of this as the limit of a sequence of Gaussian pdfs, all centered at 0, with variances shrinking towards zero. The limit would be something like a function $\delta$ which was $0$ except at the origin, but had an infinitely skinny, infinitely tall spike at the origin, with the total area under the curve being exactly 1. There isn't really a _function_ like that, but it's definitely true that $\lim_{\sigma\rightarrow 0}{\int{f(t) \frac{1}{\sigma}\phi(t/\sigma) dt}} = f(0)$, so $\delta(t)$ makes sense _in integrals_. One can in fact build a whole theory of "generalized functions" which make sense in integrals in this way, including actual functions as a special case. If this interests you, @Kolmogorov-Fomin-analysis is still one of the best ways in to the subject. (Dirac, being a physicist, didn't let something like $\delta$ not really being a function stop him from calculating.)
The subject which deals with maximizing or minimizing integrals of functions is
the calculus of variations, and one of its basic tricks is to write the integrand as a function of
$t$, the function $m$, and the derivatives of $m$:
\begin{equation}
\mathcal{L} = \int{L(t,m,m^{\prime},m^{\prime\prime}) dt}
\end{equation}
where, in our case,
\begin{equation}
L = \lambda(m^{\prime\prime}(t))^2 + \frac{1}{n} \sum_{i=1}^{n}{(x_i - m(t_i))^2 \delta(t-t_i)}
\end{equation}
This sets us up to use a general theorem of the calculus of variations, to the
effect that any function $\hat{m}$ which minimizes $\mathcal{L}$ must also solve $L$'s
**Euler-Lagrange equation**:
\[
\left.{\frac{\partial L}{\partial m} - \frac{d}{dt}\frac{\partial L}{\partial m^{\prime}} + \frac{d^2}{dt^2}\frac{\partial L}{\partial m^{\prime\prime}}}\right|_{m=\hat{m}} = 0
\]
In our case, the Euler-Lagrange equation reads
\[
-\frac{2}{n}\sum_{i=1}^{n}{(x_i-\hat{m}(t_i))\delta(t-t_i)} + 2\lambda\frac{d^2}{dt^2}\hat{m}^{\prime\prime}(t) = 0
\]
Remembering that $\hat{m}^{\prime\prime}(t) = d^2\hat{m}/dt^2$,
\[
\frac{d^4}{dt^4}\hat{m}(t) = \frac{1}{n\lambda}\sum_{i=1}^{n}{(x_i-\hat{m}(t_i))\delta(t-t_i)}
\]
The right-hand side is zero at any point $t$ other than one of the $t_i$, so the
fourth derivative has to be zero in between the $t_i$. This in turn means that
the function must be piecewise cubic. Now fix a $t_i$, and pick any two points
which bracket it, but are both greater than $t_{i-1}$ and less than $t_{i+1}$;
call them $l$ and $u$. Integrate our Euler-Lagrange equation from $l$ to
$u$:
\begin{eqnarray}
\int_{l}^{u}{\frac{d^4}{dt^4}\hat{m}(t) dt} & = & \int_{l}^{u}{\frac{1}{n\lambda}\sum_{i=1}^{n}{(x_i-\hat{m}(t_i))\delta(t-t_i)} dt}\\
\hat{m}^{\prime\prime\prime}(u) - \hat{m}^{\prime\prime\prime}(l) &= & \frac{x_i - \hat{m}(t_i)}{n\lambda}
\end{eqnarray}
That is, the third derivative makes a jump when we move across $t_i$, though
(since the fourth derivative is zero), it doesn't matter which pair of points
above and below $t_i$ we compare third derivatives at. Integrating the
equation again,
\begin{equation}
\hat{m}^{\prime\prime}(u) - \hat{m}^{\prime\prime}(l) = (u-l)\frac{x_i - \hat{m}(t_i)}{n\lambda}
\end{equation}
Letting $u$ and $l$ approach $t_i$ from either side, so $u-l \rightarrow 0$, we
see that $\hat{m}^{\prime\prime}$ makes no jump at $t_i$. Repeating this trick
twice more, we conclude the same about $\hat{m}^{\prime}$ and $\hat{m}$ itself.
In other words, $\hat{m}$ must be continuous, with continuous first and second
derivatives, and a third derivative that is constant on each $(t_i, t_{i+1})$
interval. Since the fourth derivative is zero on those intervals (and
undefined at the $t_i$), the function must be a piecewise cubic, with the piece
boundaries at the $t_i$, and continuity (up to the second derivative) across
pieces.
To see that the optimal function must be linear below $t_1$ and above $t_n$,
suppose that it wasn't. Clearly, though, we could reduce the curvature as much
as we want in those regions, without altering the value of the function at the
boundary, or even its first derivative there. This would yield a better
function, i.e., one with a lower value of $\mathcal{L}$, since the MSE would be
unchanged and the average curvature would be smaller. Taking this to the
limit, then, the function must be linear outside the observed data range.
#### Finding the spline coefficients
Splines, I said, are piecewise cubic polynomials. To see how to fit them,
let's think about how to fit a global cubic polynomial. We would define
four **basis functions**,
\begin{eqnarray}
B_1(t) & = & 1\\
B_2(t) & = & t\\
B_3(t) & = & t^2\\
B_4(t) & = & t^3
\end{eqnarray}
and chose to only consider regression functions that are linear combinations
of the basis functions,
\begin{equation}
m(t) = \sum_{j=1}^{4}{\beta_j B_j(t)}
\end{equation}
Such regression functions would be linear in the _transformed_ variables
$B_1(t), \ldots B_4(t)$, even though they are nonlinear in $t$.
To estimate the coefficients of the cubic polynomial, we would apply each basis
function to each data point $t_i$ and gather the results in an $n\times 4$
matrix $\mathbf{b}$,
\begin{equation}
b_{ij} = B_j(t_i)
\end{equation}
Then we would do ordinary least squares using the $\mathbf{b}$ matrix in place of the usual data
matrix $\mathbf{t}$:
\begin{equation}
\hat{\beta} = (\mathbf{b}^T\mathbf{b})^{-1}\mathbf{b}^T \mathbf{x}
\end{equation}
Since splines are piecewise cubics, things proceed similarly, but we need to be
a little more careful in defining the basis functions. Recall that we have $n$
values of $t$, $t_1, t_2, \ldots t_n$, and we can assume they're in increasing
order. These $n$ **knots** define $n+1$ pieces or segments:
$n-1$ of them between the knots, one from $-\infty$ to $t_1$, and one from
$t_n$ to $+\infty$. A third-order polynomial on each segment would seem to
need a constant, linear, quadratic and cubic term per segment. So the segment
running from $t_i$ to $t_{i+1}$ would need the basis functions
\begin{equation}
\mathbf{1}_{(t_i,t_{i+1})}(t), ~ (t-t_i)\mathbf{1}_{(t_i,t_{i+1})}(t), ~ (t-t_i)^2\mathbf{1}_{(t_i,t_{i+1})}(t), ~ (t-t_i)^3\mathbf{1}_{(t_i,t_{i+1})}(t)
\end{equation}
where as usual the indicator function $\mathbf{1}_{(t_i,t_{i+1})}(t)$ is 1 if
$t \in (t_i,t_{i+1})$ and $0$ otherwise. This makes it seem like we need
$4(n+1) = 4n+4$ basis functions.
However, we know from linear algebra that the number of basis vectors we need
is equal to the number of dimensions of the vector space. The number of
adjustable coefficients for an arbitrary piecewise cubic with $n+1$ segments is
indeed $4n+4$, but splines are constrained to be smooth. The spline must be
continuous, which means that at each $t_i$, the value of the cubic from the
left, defined on $(t_{i-1}, t_i)$, must match the value of the cubic from the
right, defined on $(t_{i}, t_{i+1})$. This gives us one constraint per data
point, reducing the number of adjustable coefficients to at most $3n+4$. Since
the first and second derivatives are also continuous, we are down to just $n+4$
coefficients. Finally, we know that the spline function is linear outside the
range of the data, i.e., on $(-\infty,t_1)$ and on ($t_n,\infty)$, lowering the
number of coefficients to $n$. There are no more constraints, so we end up
needing only $n$ basis functions. And in fact, from linear algebra, any set of
$n$ piecewise cubic functions which are linearly independent can
be used as a basis. One common choice is
\begin{eqnarray}
B_1(t) & = & 1\\
B_2(t) & = & t\\
B_{i+2}(t) & = & \frac{(t-t_i)_{+}^3 - (t-t_n)_{+}^3}{t_n - t_i} - \frac{(t-t_{n-1})_{+}^3 - (t-t_n)_{+}^3}{t_n - t_{n-1}}
\end{eqnarray}
where $(a)_{+} = a$ if $a > 0$, and $=0$ otherwise. This rather
unintuitive-looking basis has the nice property that the second and third
derivatives of each $B_j$ are zero outside the interval $(t_1, t_n)$.
Now that we have our basis functions, we can once again write the spline
as a weighted sum of them,
\begin{equation}
m(t) = \sum_{j=1}^{n}{\beta_j B_j(t)}
\end{equation}
and put together the matrix $\mathbf{b}$ where $b_{ij} = B_j(t_i)$. We can
write the spline objective function in terms of the basis functions,
\begin{equation}
n\mathcal{L} = (\mathbf{x}-\mathbf{b}\beta)^T (\mathbf{x}-\mathbf{b}\beta) + n\lambda \beta^T \Omega \beta
\end{equation}
where the matrix $\Omega$ encodes information about the curvature of the basis
functions:
\begin{equation}
\Omega_{jk} = \int{B^{\prime\prime}_j(t) B^{\prime\prime}_k(t) dt}
\end{equation}
Notice that only the quadratic and cubic basis functions will make non-zero
contributions to $\Omega$. With the choice of basis above, the second
derivatives are non-zero on, at most, the interval $(t_1, t_n)$, so each of the
integrals in $\Omega$ is going to be finite. $\Omega$ is something we (or,
realistically, R) can calculate _once_, no matter what $\lambda$ is. Now
we can find the smoothing spline by differentiating with respect to $\beta$:
\begin{eqnarray}
0 & = & -2\mathbf{b}^T\mathbf{x} + 2 \mathbf{b}^T\mathbf{b} \hat{\beta} + 2n\lambda\Omega\hat{\beta}\\
\mathbf{b}^T\mathbf{x} & = & \left(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\right)\hat{\beta}\\
\hat{\beta} & = & {\left(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\right)}^{-1} \mathbf{b}^T\mathbf{x} \label{eqn:spline-coefficients-as-ridge-regression}
\end{eqnarray}
Notice, incidentally, that we can now show splines are linear
smoothers:
\begin{eqnarray}
\mathbf{\EstRegFunc} & = & \mathbf{b}\hat{\beta}\\
& = & \mathbf{b}{\left(\mathbf{b}^T \mathbf{b} + n\lambda\Omega\right)}^{-1} \mathbf{b}^T\mathbf{x}
\end{eqnarray}
Once again, if this were ordinary linear regression, the OLS estimate of the
coefficients would be $(\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T \mathbf{x}$.
In comparison to that, we've made two changes. First, we've substituted the
basis-function matrix $\mathbf{b}$ for the original matrix of predictor
variables, $\mathbf{t}$ --- a change we'd have made already for a polynomial
regression. Second, the "denominator" is not $\mathbf{t}^T\mathbf{t}$,
or even $\mathbf{b}^T \mathbf{b}$, but
$\mathbf{b}^T \mathbf{b} + n\lambda\Omega$. Since $\mathbf{t}^T\mathbf{t}$ is
$n$ times the covariance matrix of the independent variables, we are taking the
covariance matrix of the spline basis functions and adding some extra
covariance --- how much depends on the shapes of the functions (through
$\Omega$) and how much smoothing we want to do (through $\lambda$). The larger
we make $\lambda$, the less the actual data matters to the fit.
In addition to explaining how splines can be fit quickly (do some matrix
arithmetic), this illustrates two important tricks. One, which we won't
explore further here, is to turn a nonlinear regression problem into one which
is linear in another set of basis functions. This is like using not just one
transformation of the input variables, but a whole library of them, and letting
the data decide which transformations are important. There remains the issue
of selecting the basis functions, which can be quite tricky. In addition to
the spline basis, most choices are various
sorts of waves --- sine and cosine waves of different frequencies, various
wave-forms of limited spatial extent ("wavelets"), etc. The ideal is to
chose a function basis where only a few non-zero coefficients would need to be
estimated, but this requires some understanding of the data...
The other trick is that of stabilizing an unstable estimation problem by adding
a penalty term. This reduces variance at the cost of introducing some bias.
#### Effective degrees of freedom
Since the effective degrees of freedom of a linear smoother is $\tr{\mathbf{w}}$, we've just shown that a spline has
\[
\tr{\left(\mathbf{b}{\left(\mathbf{b}^T \mathbf{b} +
n\lambda\Omega\right)}^{-1} \mathbf{b}^T\right)}
\]
degrees of freedom. Holding the data (and so $\mathbf{b}$ and $\Omega$) fixed,
as $\lambda$ goes up, this will go down, and vice versa.
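We can see this numerically with `smooth.spline`, re-using the simulated `t.sp` and `x.sp` from the sketch above; the `df` component of the fit reports this trace:
```{r}
# Larger lambda means a smaller trace, i.e., fewer effective degrees of freedom
smooth.spline(t.sp, x.sp, lambda=1e-4)$df
smooth.spline(t.sp, x.sp, lambda=1e-1)$df
```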
# Choosing How, or How Much, to Smooth --- Cross-validation
Pretty much any smoothing method we might use has some adjustable
control settings[^hyperparameters] --- the size of the moving-average
window, say, or the "bandwidth" of a kernel --- which we need to set. How
should we tune these control settings?
[^hyperparameters]: Some people call these "hyperparameters", rather than "control settings". I think this is a misleading name, because they're not in any way parameters of the process we're studying, they're just aspects of _our_ method.
One of the simplest, but also most practical, ways of picking control settings
is to **make the smoother predict well on data it didn't see**. The uses to
which we're putting our smoother --- interpolating to fill in gaps in the data,
extrapolating beyond the range of the data, filtering to remove noise from the
data --- are all, in the end, about predicting unseen values. The most natural
way to make sure the smoother can predict unseen values is to _have it predict
unseen values_.
This brings us to the idea of **cross-validation**. There are many variants,
but the key notion is this:
- Take the data set, and divide the data points into a **training set** and a **testing set**.
- Fit each smoother (or other model) to the training set.
- Evaluate each smoother on the testing set, say by taking the squared error.
- Average the scores for each model over different choices of training and testing sets.
This gives a score for each smoother, which says how well it was able to
predict data which it _didn't_ get to see during the fitting or training
process. This is exactly what we want.
## Leave-one-out Cross-validation
The most straight-forward form of this is **leave-one-out cross-validation**
(LOOCV).[^adaripoff] We do the following for each smoother we're considering:
[^adaripoff]: Some of the text in this section is modified from @CRS-ADAfaEPoV.
- For each of the $n$ data points:
+ Fit each smoother using every data point _except_ data point $i$; call this function $\EstRegFunc^{(-i)}$;
+ Calculate the prediction at data point $i$, $\EstRegFunc^{(-i)}(t_i)$;
+ Calculate squared[^squarederror] prediction error $(x_i - \EstRegFunc^{(-i)}(t_i))^2$.
- Average the squared prediction errors for all models over all $n$ data points,
to get $n^{-1}\sum_{i=1}^{n}{(x_i - \EstRegFunc^{(-i)}(t_i))^2}$.
[^squarederror]: The same idea works if you'd rather use the absolute error, a "robust" loss function, etc.
This makes it sound as though, with LOOCV, we need to estimate each possible
smoother $n$ times, which would be slow for a large data set. There is in fact
a trick for linear smoothers which lets us fit the smoother just once.
What happens when we hold back data point $i$, and then
make a prediction at $t_i$? The observed response at $i$ can't contribute
to the prediction, but otherwise the linear smoother should work as before, so
\[
\EstRegFunc^{(-i)}(t_i)= \frac{{(\mathbf{w} \mathbf{x})}_i - w_{ii} x_i}{1-w_{ii}}
\]
The numerator just removes the contribution to $\EstRegFunc(t_i)$ that came
from $x_i$, and the denominator just re-normalizes the weights in the smoother.
Now a little algebra says that
\[
x_i - \EstRegFunc^{(-i)}(t_i) = \frac{x_i - \EstRegFunc(t_i)}{1-w_{ii}}
\]
The quantity on the left of that equation is what we want to square and average
to get the leave-one-out CV score, but everything on the right can be
calculated from the fit we did to the whole data. The leave-one-out CV score
is therefore
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}{\left(\frac{x_i-\EstRegFunc(t_i)}{1-w_{ii}}\right)^2}
\end{equation}
Again, this formula requires _both_ leave-one-out-CV _and_ a linear smoother.
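Here is that short-cut in action for the little moving-average smoother from earlier, with simulated observations (my own illustration):
```{r}
# Leave-one-out CV score via the short-cut formula for a linear smoother
x.cv <- sin(2*pi*(1:10)/10) + rnorm(10, sd=0.2)   # ten simulated observations
fitted.cv <- as.vector(w %*% x.cv)                # fitted values from the full data
mean(((x.cv - fitted.cv)/(1 - diag(w)))^2)        # the LOOCV score, with no re-fitting
```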
If we had independent observations, you might think that LOOCV would give us
$n$ independent, unbiased estimates of how well a smoother predicts new data,
and so an unbiased, low-variance estimate of model performance. The problem is
that even if the _data points_ are independent, $\EstRegFunc^{(-i)}$ and
$\EstRegFunc^{(-j)}$ are based on $n-2$ shared data points, so the
leave-one-out errors for each data point are pretty correlated with each other.
(Can you work out the covariance between $x_i - \EstRegFunc^{(-i)}(t_i)$ and
$x_j - \EstRegFunc^{(-j)}(t_j)$?) One consequence of this is that LOOCV tends
to over-fit --- to pick models with more degrees of freedom than are strictly
necessary.
We sometimes therefore prefer to use _many_ data points in each testing set, so
the predictions are less correlated with each other. The commonest approach is
**$k$-fold CV**, where we divide all the data points into $k$ equal groups, or
"folds", and each fold gets to be the testing set in turn. (Typically $k=5$ or
$10$.) This is much less prone to over-fitting than leave-one-out.
## Special issues on cross-validation for dependent data
These forms of cross-validation work very, very well for independent data, to
the point where they're pretty much the standard for any situation involving
statistical modeling of independent data. But what about dependent data? If
each data point tends to be like its neighbors, you might well worry that a
smoother which _just_ copies the neighbors will have an unfair advantage in
cross-validation. At the very least, this would seem to lead to more noise in
our estimate of how well each smoother predicts.
A natural way out of this is to do leave-one-out cross-validation, but to leave
out a buffer of size (say) $h$ around each point
[@Burman-Chow-Nolan-cross-validation-for-dependent]. This buffer is not part
of the training set, but we also don't use it in evaluating the prediction of
the smoother. (We _do_ use values in the buffer when we _calculate_ the
prediction at the held-out point, however.) If remote observations are nearly
independent of each other, and the buffer is big enough, then the testing set
is nearly independent of the training set, and everything is well-behaved. You
can combine this idea with $k$-fold cross-validation, where you omit a buffer
around each block [@Racine-consistent-cv-for-dependent-data].
The obvious question is "how big a buffer?", to which the obvious answer is
"big enough to make the training and test points nearly independent". Since we
usually don't know that, we typically rely on rules of thumb.
@Burman-Chow-Nolan-cross-validation-for-dependent suggest using a constant
fraction of $n$, say $h=0.5 n$. @Racine-consistent-cv-for-dependent-data
suggests that this rule is prone to over-fitting, and that a better one would
be $h = n^{0.25}$. I don't know of a really great, automatic resolution of
this issue.
In practice, it can often be easiest to just live with the extra imprecision
and over-fitting that come from ignoring dependence, but we'll come back to
this later when we look in more detail at time-series forecasting.
## Smoothing ratios and proportions
Before moving on from smoothers, I should mention one last special
consideration. When our data consists of a ratio or proportion, say $X(t) =
Y(t)/Z(t)$, we _could_ smooth it the same way we'd smooth any other numerical
data. However, if the denominator is small, any noise in it can have a massive
effect on the ratio. It can make a lot more sense to _separately_ smooth $Y$
and $Z$, and then take the ratio of the smoothed values.
@Kafadar-smoothing-geographical, from whom I learned this point, explores how
best to do this with splines and related smoothers.
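Here is a minimal sketch of the idea, with simulated counts (my own illustration, not a reproduction of Kafadar's procedure):
```{r}
# Smooth the numerator and denominator separately, then take the ratio
t.r <- 1:50
Y <- rpois(50, lambda=20 + 10*sin(t.r/8))   # hypothetical numerator counts
Z <- rpois(50, lambda=10 + 3*cos(t.r/8))    # hypothetical (smallish) denominator counts
ratio.smooth <- smooth.spline(t.r, Y)$y / smooth.spline(t.r, Z)$y
head(ratio.smooth)                          # estimated trend in the ratio Y/Z
```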
# Detrended values and residuals
After we have chosen how to smooth the data, and gotten an
estimate of the trend, we also have an estimate of the fluctuations.
Since we can always write
\[
X(r,t) = \TrueRegFunc(r,t) + \TrueNoise(r,t)
\]
where $\TrueRegFunc$ is a deterministic (but unknown) function and $\TrueNoise$ is a stochastic process, the natural estimate of the latter is
\[
\EstNoise(r,t) \equiv X(r,t) - \EstRegFunc(r,t)
\]
These natural estimates of the noise are the **residuals** of the smoother.
When we use a linear smoother, we can find the residuals in matrix form:
\begin{eqnarray}
\mathbf{\EstNoise} & = & \mathbf{x} - \mathbf{\EstRegFunc}\\
& = & \mathbf{x} - \mathbf{w}\mathbf{x}\\
& = & (\mathbf{I} - \mathbf{w})\mathbf{x}
\end{eqnarray}
Thus the residuals, like the fitted values, are also linear in $\mathbf{x}$. We
can therefore work out properties of the residuals much as we did properties
of the fitted values.
The expected value of the residual vector is
\begin{eqnarray}
\Expect{\mathbf{\EstNoise}} & = & \Expect{(\mathbf{I}-\mathbf{w})\mathbf{X}}\\
& = & (\mathbf{I}-\mathbf{w})\mathbf{\TrueRegFunc}
\end{eqnarray}
This will be zero --- i.e., match the true expectation of the noise --- if and only if $\mathbf{\TrueRegFunc}$ is an eigenvector of $\mathbf{I}-\mathbf{w}$ with
eigenvalue 0.
--- It should not be hard to convince yourself that if $\mathbf{v}$ is an
eigenvector of $\mathbf{w}$ with eigenvalue $\lambda$, then $\mathbf{v}$ is
also an eigenvector of $\mathbf{I}-\mathbf{w}$, with eigenvalue $1-\lambda$.
It should also not be hard to check that if $\mathbf{w}$ is symmetric and
idempotent, then so is $\mathbf{I} - \mathbf{w}$.
The variance of the residuals is
\[
\Var{\mathbf{\EstNoise}} = (\mathbf{I}-\mathbf{w}) \Var{\mathbf{\epsilon}} (\mathbf{I}-\mathbf{w})^T
\]
Under the special assumption that the true noise has constant variance and is uncorrelated, $\Var{\mathbf{\epsilon}} = \sigma^2 \mathbf{I}$, and this simplifies
to
\[
\sigma^2 (\mathbf{I}-\mathbf{w})(\mathbf{I}-\mathbf{w})^T
\]
Notice that in general this is _not_ $\sigma^2 \mathbf{I}$. That is, the
residuals generally have somewhat different variances, and correlation
structure, than the true noise. Homework 2 will, among other things, explore
some consequences of this.
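For instance, with the little moving-average matrix from before, and $\sigma = 1$:
```{r}
# Residual variance matrix (I - w)(I - w)^T for the moving-average smoother
r.var <- (diag(n) - w) %*% t(diag(n) - w)
diag(r.var)       # the residuals have unequal variances...
r.var[1,2]        # ...and are correlated with their neighbors
```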
# Linear regression as a linear smoother
> This section can be skipped, if you're willing to take the fact that linear regression is really, or also, a linear smoother on trust. We will be coming back to some of this stuff later, when we look at linear predictors more generally.
## Simple linear regression as a linear smoother
Think of the simple linear regression model:
\[
m(t) = \beta_0 + \beta_1 t
\]
We usually estimate this by minimizing the mean squared error (MSE) --- what's
called the **ordinary least squares** or OLS estimate:
\begin{eqnarray}
(\hat{\beta}_0, \hat{\beta}_1) = \argmin_{b_0, b_1}{\frac{1}{n}\sum_{i=1}^{n}{(x_i - (b_0 + b_1 t_i))^2}}
\end{eqnarray}
Take the derivatives with respect to $b_0$ and $b_1$, and set them to zero.
\begin{eqnarray}
\frac{1}{n}\sum_{i=1}^{n}{2(x_i - (\hat{\beta}_0 + \hat{\beta}_1 t_i))(-1)} & = & 0\\
\frac{1}{n}\sum_{i=1}^{n}{2(x_i - (\hat{\beta}_0 + \hat{\beta}_1 t_i))(-t_i)} & = & 0
\end{eqnarray}
We can drop the over-all factor of $-2$ in what follows.
First, solve for $\hat{\beta}_0$ in terms of $\hat{\beta}_1$:
\begin{eqnarray}
\overline{x} - \hat{\beta}_0 - \hat{\beta}_1 \overline{t} & = & 0\\
\overline{x} - \hat{\beta}_1 \overline{t} & = & \hat{\beta}_0
\end{eqnarray}
Now solve for $\hat{\beta}_1$:
\begin{eqnarray}
\overline{xt} - \hat{\beta}_0 \overline{t} - \hat{\beta}_1 \overline{t^2} & = & 0\\
\overline{xt} - \overline{x}\overline{t} + \hat{\beta}_1 (\overline{t})^2 - \hat{\beta}_1 \overline{t^2} & = & 0\\
\frac{\overline{xt} - \overline{x}\overline{t}}{\overline{t^2} - \overline{t}^2} & = & \hat{\beta}_1
\end{eqnarray}
It's convenient to abbreviate the denominator as $s_t^2$.
Putting this together,
\[
\EstRegFunc(t) = \overline{x} + \frac{t - \overline{t}}{s_t^2}(\overline{xt} - \overline{x}\overline{t})
\]
This is linear in $x$, so we know it can be put in the linear-smoother form.
In fact,
\[
\EstRegFunc(t) = \sum_{j=1}^{n}{\left(\frac{1}{n} + \frac{(t-\overline{t})(t_j - \overline{t})}{n s_t^2}\right) x_j}
\]
This particular form of weights ensures both that the estimated curve is a
straight line, and that it comes as close to the data as possible, on average,
among all straight lines.
Beyond that, however, the weights are really weird. The first part is just the
sample mean, but the variable part is $\propto
(t-\overline{t})(t_j-\overline{t})$. This means we give more weight to data
points which are far from the center of the data, no matter what. We also ramp
up all the weights as the point $t$ where we're making a prediction moves away
from the center of the data.
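Here is a small numerical check of that weight formula (my own made-up data): applied at the observed $t_i$, the weights reproduce the fitted values from `lm`.
```{r}
# The simple-regression smoother weights reproduce lm()'s fitted values
t.lr <- 1:20
x.lr <- 3 + 0.5*t.lr + rnorm(20)
s2.t <- mean((t.lr - mean(t.lr))^2)       # the denominator s_t^2
w.lr <- outer(t.lr, t.lr,
              function(a, b) { 1/20 + (a - mean(t.lr))*(b - mean(t.lr))/(20*s2.t) })
all.equal(as.vector(w.lr %*% x.lr), as.vector(fitted(lm(x.lr ~ t.lr))))
```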
## Linear regression with multiple predictors
Suppose we want to make our prediction a linear combination of multiple
variables:
\[
m(t,z, \ldots u) = \beta_0 + \beta_1 t + \beta_2 z + \ldots \beta_p u
\]
We fit this by least squares again. It simplifies the math immensely
if we define two matrices,
\[
\mathbf{t} = \left[ \begin{array}{ccccc}
1 & t_1 & z_1 & \ldots & u_1\\
1 & t_2 & z_2 & \ldots & u_2\\
\vdots & \vdots & \vdots & \ldots & \vdots\\
1 & t_n & z_n & \ldots & u_n \end{array} \right]
\]
and
\[
\mathbf{b} = \left[ \begin{array}{c}
b_0 \\ b_1 \\ \vdots \\ b_p \end{array}\right]
\]
Then the fitted values are
\[
\mathbf{m} = \mathbf{t}\mathbf{b}
\]
and the mean-squared error is
\[
\frac{1}{n}\sum_{i=1}^{n}{(x_i - (b_0 + b_1 t_i + b_2 z_i + \ldots + b_p u_i))^2} = \frac{1}{n} (\mathbf{x} - \mathbf{t}\mathbf{b})^T (\mathbf{x} - \mathbf{t}\mathbf{b})
\]
Taking the derivative with respect to $\mathbf{b}$ and setting it to zero at the optimum $\mathbf{\widehat{\beta}}$,
\[
\frac{2}{n}\mathbf{t}^T (\mathbf{x} - \mathbf{t}\mathbf{\widehat{\beta}}) = 0
\]
Solving,
\[
\mathbf{\widehat{\beta}} = (\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T \mathbf{x}
\]
The fitted values come from
\[
\mathbf{\EstRegFunc} = \mathbf{t} (\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T \mathbf{x}
\]
so the weight matrix in this case is
\[
\mathbf{w} = \mathbf{t} (\mathbf{t}^T\mathbf{t})^{-1} \mathbf{t}^T
\]
You can check that this matrix is symmetric, and that it is idempotent. You
can also check that when we just deal with one predictor, these
matrix formulas are equivalent to the ones for simple linear regression
above.
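Here is a quick numerical check of those claims, with made-up predictors:
```{r}
# The hat matrix t (t^T t)^{-1} t^T is symmetric and idempotent
z.lr <- runif(20)
t.mat <- cbind(1, t.lr, z.lr)                    # intercept plus two predictors
hat <- t.mat %*% solve(crossprod(t.mat)) %*% t(t.mat)
all.equal(hat, t(hat))                           # symmetric
all.equal(hat %*% hat, hat)                      # idempotent
table(round(eigen(hat)$values))                  # eigenvalues: seventeen 0s, three 1s
```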
This formula handles _all_ of the coefficients at once. You can
check, by explicitly taking the partial derivative just with respect to $b_0$
and setting that to 0, that it is still the case that
\[
\hat{\beta}_0 = \overline{x} - (\hat{\beta}_1 \overline{t} + \hat{\beta}_2 \overline{z} + \ldots + \hat{\beta}_p \overline{u})
\]
This tells us that the individual weights in $\mathbf{w}$ still have the same
form --- they're a constant, plus terms that give more weight to points far
from the center of the data.
## Transformations of one predictor; polynomial regression
Nothing in the logic for the previous section requires that the extra
predictors be new variables. If we want to do polynomial regression,
\[
m(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \ldots \beta_p t^p
\]
then everything carries through exactly as before. The matrix $\mathbf{t}$
would just have columns containing the constant, $t_i$, $t_i^2$ and so on.
What's important is that we're just taking an additive combination of the
functions of $t$ on the right-hand side of the model formula, and that
all of the functions on the RHS are linearly independent of each other.
(If there were linear dependence, then the matrix $\mathbf{t}^T \mathbf{t}$
wouldn't be invertible.)
# Some linear algebra reminders: eigen-stuff
> This section can be skipped if you are comfortable working with the eigenvalues and eigenvectors of matrices. Even if you are, it won't hurt to at least skim it.
A (non-zero) vector or $n\times 1$ matrix $\mathbf{v}$ is an **eigenvector** of
the $n\times n$ matrix $\mathbf{w}$, with **eigenvalue** $\lambda$, when
\[
\mathbf{w}\mathbf{v} = \lambda\mathbf{v}
\]
We can re-write this as
\[
(\mathbf{w} - \lambda\mathbf{I})\mathbf{v} = 0
\]
and demanding $\mathbf{v} \neq 0$ tells us that the $\lambda$ which make
this true have to be roots of the **characteristic polynomial** or **eigenpolynomial** of the matrix
\[
\det{(\mathbf{w} - \lambda\mathbf{I})} = 0
\]
In R, the command to find the eigenvalues and eigenvectors of a matrix
is `eigen`:
```{r}
theta <- runif(n=2)
# Why "mark"?
mark <- matrix(c(theta[1], 1-theta[1], 1-theta[2], theta[2]),
byrow=TRUE, nrow=2, ncol=2)
mark
eigen(mark)
```
This returns a list with two components, one of them a vector of eigenvalues,
the other a matrix whose columns are the eigenvectors.
## General properties of eigenvalues and eigenvectors
There are a number of salient properties of eigenvalues and eigenvectors.
1. There are at most $n$ distinct eigenvalues of $\mathbf{w}$, which might be
complex[^neigenvalues]. When there are fewer than $n$ distinct eigenvalues, we
nonetheless write the eigenvalues as $\lambda_1, \lambda_2, \ldots \lambda_n$,
keeping track of how often each one is a root of the characteristic polynomial.
[^neigenvalues]: $\det{(\mathbf{w} - \lambda\mathbf{I})} = 0$ is called the characteristic _polynomial_ because working out the determinant gives us an $n^{\mathrm{th}}$ order polynomial in $\lambda$. This has at most $n$ distinct roots, some of which might be complex; the same $\lambda$ can appear multiple times as a root.
2. The determinant of a matrix is the product of its eigenvalues, "counted
with multiplicity":
\[
\det{\mathbf{w}} = \prod_{j=1}^{n}{\lambda_j}
\]
The trace of a matrix is the sum of its eigenvalues:
\[
\tr{\mathbf{w}} = \sum_{j=1}^{n}{\lambda_j}
\]
3. For any matrix $\mathbf{w}$, and any power $k > 0$, if $\mathbf{v}$ is
an eigenvector of $\mathbf{w}$ with eigenvalue $\lambda$, then $\mathbf{v}$ is
an eigenvector of $\mathbf{w}^k$ with eigenvalue $\lambda^k$.
If $\mathbf{w}$ is invertible, then $\mathbf{v}$ is an eigenvector of
$\mathbf{w}^{-1}$ with eigenvalue $1/\lambda$.
4. If $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$, then for any scalar $a \neq 0$, $a\mathbf{v}$ is also an eigenvector, with the same eigenvalue[^scaleeigenvectors]. This means that the eigenvectors, unlike the eigenvalues, aren't uniquely defined. Instead, we usually _choose_ eigenvectors to have nice properties, e.g., length 1. Also, $-\mathbf{v}$ is always another eigenvector, so two calculations of eigenvectors might end up with different signs.
[^scaleeigenvectors]: By the ordinary rules for matrix multiplication, $\mathbf{w}(a\mathbf{v}) = a\mathbf{w}\mathbf{v} = a\lambda \mathbf{v}$, because $\mathbf{v}$ is an eigenvector.
5. If $\mathbf{v}$ and $\mathbf{u}$ are eigenvectors with the same eigenvalue
$\lambda$, then so is $a\mathbf{v}+b\mathbf{u}$, for any scalars $a, b$. (Check this yourself!)
6. For fixed $\lambda$, the
eigenvectors solve the equation $(\mathbf{w} - \lambda\mathbf{I})\mathbf{v}
= 0$. This matrix equation is equivalent to a system of $n$ linear
equations in the $n$ unknowns $(v_1, v_2, \ldots v_n)$. If there are no
redundant (linearly dependent) equations in this system, then the solution
is unique, up to the over-all scaling factor we just saw in (4). But if
there are redundant equations, we can have a whole space of solutions, the
**eigenspace** of $\lambda$. The number of redundant equations gives the
dimension of this eigenspace, a.k.a. the **geometric multiplicity** of
$\lambda$. When there are multiple, linearly-independent eigenvectors with
eigenvalue $\lambda$, we can _choose_ orthogonal eigenvectors to represent
the whole eigenspace.
7. If $\mathbf{w}$ is symmetric, and $\mathbf{v}$ and $\mathbf{u}$ are
eigenvectors with distinct eigenvalues $\lambda_1 \neq \lambda_2$, then $\mathbf{v}^T\mathbf{u} = 0$, and we say the vectors are orthogonal[^orthgonaleigenvectors]. _Remark_: The orthogonality of
eigenvectors actually holds for many asymmetric matrices as well, so long as
they are **normal**, meaning $\mathbf{w}^T\mathbf{w} = \mathbf{w}\mathbf{w}^T$.
The proof is similar, but more involved. I am afraid I mis-spoke in lecture when I asserted that it is always true for all matrices[^counterexample].
[^orthgonaleigenvectors]: Since $\mathbf{w}$ is symmetric, $\mathbf{w} = \mathbf{w}^T$, we have $\mathbf{u}^T \mathbf{w}\mathbf{v} = \mathbf{v}^T \mathbf{w}\mathbf{u}$. Because these are eigenvectors, this means
\begin{eqnarray}
\mathbf{u}^T \lambda_1 \mathbf{v} & = & \mathbf{v}^T \lambda_2 \mathbf{u}\\
\lambda_1 \mathbf{u}^T\mathbf{v} & = & \lambda_2 \mathbf{v}^T \mathbf{u}
\end{eqnarray}
but since $\mathbf{v}^T\mathbf{u} = \mathbf{u}^T\mathbf{v}$ (it's a $1\times 1$ matrix, which is always symmetric), we get
\[
(\lambda_1 - \lambda_2)( \mathbf{u}^T\mathbf{v}) = 0
\]
and we assumed $\lambda_1 \neq \lambda_2$.
```{r, include=FALSE}
w <- matrix(c(2,2,0,3), nrow=2)
eigen(w)$values
e <- eigen(w)$vectors
e[,1] %*% e[,2]
```
[^counterexample]: Here's a counter-example. The matrix $\left[ \begin{array}{cc} 2 & 0 \\ 2 & 3 \end{array} \right]$ has eigenvalues $`r eigen(w)$values[1]`$ and $`r eigen(w)$values[2]`$, and the corresponding eigenvectors are $\left[ `r e[,1]` \right]$ and $\left[ `r e[,2]` \right]$, which aren't orthogonal.
8. What is _generally_ true, of most matrices, is that the eigenvectors span the whole
$n$-dimensional space. (There are "defective" matrices where this is not
true.) In those cases, we can stack the eigenvectors column-wise into a matrix, say $\mathbf{e}$, and have $\mathbf{w}\mathbf{e} =\mathbf{e} \mathbf{\lambda}$,
where $\mathbf{\lambda}$ is the diagonal matrix of the eigenvalues. But,
because the eigenvectors are (by assumption) linearly independent,
$\mathbf{e}^{-1}$ exists, and we have the **eigendecomposition**
\[
\mathbf{w} = \mathbf{e}\mathbf{\lambda}\mathbf{e}^{-1}
\]
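For example, the `mark` matrix from the `eigen` demonstration above can be rebuilt from its eigenvectors and eigenvalues:
```{r}
# Reconstruct mark from its eigendecomposition
e.m <- eigen(mark)
e.m$vectors %*% diag(e.m$values) %*% solve(e.m$vectors)   # recovers mark, up to rounding
```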
## Eigen-properties of some special types of matrices
### Symmetric matrices
If $\mathbf{w} = \mathbf{w}^T$, then all the eigenvalues are real. Because
eigenvectors with different eigenvalues are orthogonal in this case, and we can
always normalize all the eigenvectors, we have $\mathbf{e}^{-1} =
\mathbf{e}^T$, and $\mathbf{w} = \mathbf{e}\mathbf{\lambda}\mathbf{e}^T$.
### "Stochastic" matrices
A matrix is called **stochastic** if $w_{ij} \geq 0$ for all $i, j$,
and if $\sum_{j=1}^{n}{w_{ij}} = 1$ for all $i$. Some very important
properties follow from this:
1. The largest eigenvalue is exactly 1.
2. The eigenvectors corresponding to the eigenvalue 1 have only non-negative entries.
3. All other eigenvalues have magnitude $\leq 1$.
These are all consequences of a result in linear algebra called the
Perron-Frobenius theorem, which we will take on trust.
### Projection / idempotent matrices
A matrix is called **idempotent**[^idempotent] when $\mathbf{w}^2 = \mathbf{w}
\mathbf{w} = \mathbf{w}$. It follows[^idempotenteigenvalues] that the only
possible eigenvalues are 0 and 1.
Projecting on to a linear subspace (line, plane, hyper-plane, etc.) always
corresponds to an idempotent matrix. (Points already in the space aren't
affected by the projection, so projecting twice is the same as projecting
once.) The dimension of the space corresponds to the number of eigenvalues
equal to 1.
[^idempotent]: From Latin _idem_, "the same", and _potens_, "power", hence something which is the same as its powers.
[^idempotenteigenvalues]: If $\mathbf{v}$ is an eigenvector of $\mathbf{w}$ with eigenvalue $\lambda$, then clearly $\mathbf{w}^2 \mathbf{v} = \lambda^2\mathbf{v}$, so $\lambda^2$ is an eigenvalue of $\mathbf{w}$. But, since we're assuming $\mathbf{w}$ is idempotent, $\mathbf{w}^2 = \mathbf{w}$, so all eigenvalues must obey the equation $\lambda^2=\lambda$, and the only two solutions are 0 and 1.
# References