# Principal Components Analysis I

6 September 2018


# In our last episode…

• Data is a vector
• Break the vector up into additive components
• Each component is itself a vector, representing a pattern
• Which components?

• Before: components were eigenvectors of the smoother matrix
• Made sense to work with them because we were smoothing
• Can we get the data to tell us what the “right” components are?

# What the data looks like

• Data is an $$n\times p$$ matrix $$\mathbf{x}$$
• Could be multivariate data at $$n$$ time points
• Or multivariate data at $$n$$ locations
• Or scalar data at $$n$$ time points and $$p$$ locations
• Write $$\vec{x}_i$$ for row $$i$$ ($$1\times p$$ matrix)

• First, center the data, just to reduce book-keeping
• i.e., subtract the mean of each column
• Optional: scale each column to equal variance
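For concreteness, a minimal R sketch of this preprocessing step, using the built-in state.x77 data that shows up later in these slides (the variable names here are just illustrative):

```r
# Center each column (subtract the column mean); rows are observations
x.centered <- scale(state.x77, center = TRUE, scale = FALSE)
colMeans(x.centered)                  # every column mean is now (numerically) zero

# Optional: also rescale each column to unit variance
x.std <- scale(state.x77, center = TRUE, scale = TRUE)
apply(x.std, 2, sd)                   # every column sd is now 1
```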

# Finding a “principal” component

• We don’t want to keep track of $$p$$ dimensions
• We want one dimension
• We also don’t want to distort the data too much

• Pick a direction in the $$p$$-dimensional space, and a length-1 vector $$\vec{w}$$
• What’s the best $$\vec{w}$$?

# Projections

• $$\vec{x}_i \cdot \vec{w} =$$ length of $$\vec{x}_i$$’s projection on to the direction of $$\vec{w}$$

• $$(\vec{x}_i \cdot \vec{w})\vec{w} =$$ the actual projected vector
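A small R illustration of these two formulas, with a made-up data vector and a made-up unit vector (any length-1 $$\vec{w}$$ would do):

```r
x.i <- c(3, 4)                    # one data vector
w   <- c(1, 1) / sqrt(2)          # a direction, normalized to length 1
proj.length <- sum(x.i * w)       # x_i . w : length of the projection, about 4.95
proj.vector <- proj.length * w    # (x_i . w) w : the projected vector, about (3.5, 3.5)
```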

# How well does the projection approximate the original?

Do it for one vector first:

$\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \vec{x_i}\cdot\vec{x_i} - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}$

# How well does the projection approximate the original?

Add up across all the data vectors:

$\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}$
• First bit doesn’t depend on $$\vec{w}$$, so doesn’t matter for minimizing
• So we want to maximize $L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}$
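A quick numerical check of this decomposition: for centered data and any unit vector $$\vec{w}$$ (here just a random one, nothing special), the mean squared residual equals the mean squared length minus $$L(\vec{w})$$. A sketch in R, using the state data centered and (to match the later slides) scaled:

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)    # centered data matrix
set.seed(1)
w <- rnorm(ncol(x)); w <- w / sqrt(sum(w^2))          # a random unit vector

resid <- x - (x %*% w) %*% t(w)                       # x_i - (x_i . w) w, row by row
mean(rowSums(resid^2))                                # MSE(w), computed directly
mean(rowSums(x^2)) - mean((x %*% w)^2)                # same number, via the identity above
```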

# Minimizing MSE is maximizing variance

$\begin{eqnarray} L(w) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}$

($$\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}$$)

But $\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0$ Why?

# Minimizing MSE is maximizing variance

so $L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}}$

# Minimizing MSE is maximizing variance

The direction which gives us the best approximation of the data is the direction with the greatest variance

# OK, how do we find this magic direction?

Matrix form: the vector of all the projection lengths is $$\X \w$$ $$[n\times 1]$$

$\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}$

Fact: $$\V \equiv \frac{\X^T \X}{n} =$$ sample covariance matrix of the vectors (because we centered each column)
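This is easy to check in R. One caveat: R's cov() divides by $$n-1$$ rather than $$n$$, so the two agree only up to that factor:

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)
n <- nrow(x)
V <- t(x) %*% x / n                    # the matrix called V on this slide
max(abs(V - cov(x) * (n - 1) / n))     # essentially zero: same matrix up to n vs. n-1
```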

# OK, how do we find this magic direction?

We need to maximize $\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w$

Constraint: $$\vec{w}$$ has length 1 $$\Leftrightarrow$$ $$\w^T \w = 1$$

Add a Lagrange multiplier $$\lambda$$

$\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & \w^T \w -1\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}$

Set to zero:

$\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}$

# The magic direction is an eigenvector

$\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}$

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, $\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda$ so the maximizing $$\w$$ is the leading eigenvector of $$\V$$ (the one with the largest eigenvalue), and the maximum variance is that eigenvalue
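A sketch of this in R with eigen(): the first eigenvector of $$\V$$ is the maximum-variance direction, its eigenvalue is that maximum variance, and any other unit vector does worse:

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)
V <- t(x) %*% x / nrow(x)
eig <- eigen(V)                      # V is symmetric, so real eigenvalues and eigenvectors
w1 <- eig$vectors[, 1]               # leading eigenvector

mean((x %*% w1)^2)                   # variance along w1 ...
eig$values[1]                        # ... equals the leading eigenvalue

set.seed(2)
w.rand <- rnorm(ncol(x)); w.rand <- w.rand / sqrt(sum(w.rand^2))
mean((x %*% w.rand)^2)               # a random direction captures less variance
```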

# About the sample covariance matrix

$$\V$$ is a special matrix: symmetric and non-negative definite:

• Eigenvalues are all real (b/c symmetric)
• If $$\lambda_i \neq \lambda_j$$, then $$v_i \perp v_j$$ (b/c symmetric)
• Eigenvectors form a basis, and (can be chosen to be) orthonormal
• All $$\lambda_i \geq 0$$ (b/c non-negative definite)
$\begin{eqnarray} \text{Leading eigenvector of } \V & = & 1^{\mathrm{st}} \text{ principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}$

# Multiple principal components

• What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
• Intuition: take the direction of maximum variance $$\perp$$ the first principal component
• Then direction of maximum variance $$\perp$$ the first two principal components
• These are the eigenvectors of $$\V$$, in order of decreasing $$\lambda$$
• Gory details go at the end of these slides
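A sketch of the "keep $$k$$ components" idea in R, with $$k=2$$: project onto the first two eigenvectors of $$\V$$, and confirm that prcomp() picks out the same directions (columns may differ by sign, which is harmless):

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)
W <- eigen(t(x) %*% x / nrow(x))$vectors[, 1:2]   # first two eigenvectors as columns
scores   <- x %*% W                               # n x 2 matrix of projection lengths
x.approx <- scores %*% t(W)                       # best 2-dimensional approximation

pca <- prcomp(state.x77, scale. = TRUE)
max(abs(abs(W) - abs(pca$rotation[, 1:2])))       # ~ 0: prcomp finds the same directions
mean(rowSums((x - x.approx)^2))                   # MSE of the 2-component approximation
```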

# In R

• prcomp is the best PCA command
• Slightly complicated object returned
• Understand it with a (spatial) data example

# USA, $$\approx 1977$$

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

# Principal components of the USA, $$\approx 1977$$

state.pca <- prcomp(state.x77,scale.=TRUE)

# Principal components of the USA, $$\approx 1977$$

The weight/loading matrix $$\w$$ gets called $rotation (why?):

signif(state.pca$rotation[,1:2], 2)
##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of $$\V$$
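This is easy to verify directly, at least up to sign (eigen() and prcomp() may each flip the sign of any column):

```r
# cor(state.x77) is proportional to V for the standardized columns,
# so it has the same eigenvectors
max(abs(abs(eigen(cor(state.x77))$vectors) - abs(state.pca$rotation)))   # essentially zero
```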

# Break for in-class exercise

• What does the first principal component represent? What kind of state would score very high on it, and what kind of state would score very low?
• Ditto for the 2nd PC

# Principal components of the USA, $$\approx 1977$$

signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component $$=\sqrt{\lambda_i}$$

If we keep $$k$$ components, $R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}}$

(Denominator $$=\tr{\V}$$ — why?)

# Principal components of the USA, $$\approx 1977$$

signif(state.pca$x[1:10, 1:2], 2)
##               PC1   PC2
## Alabama      3.80 -0.23
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns are $$\vec{x}_i \cdot \vec{w}_1$$ and $$\vec{x}_i \cdot \vec{w}_2$$
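Going back to the $$R^2$$ formula above: since prcomp() stores the $$\sqrt{\lambda_i}$$ in $sdev, the share of variance from keeping the first $$k$$ components is just a cumulative sum. A quick sketch, using the state.pca object fitted above:

```r
lambda <- state.pca$sdev^2                 # the eigenvalues lambda_i of V
signif(cumsum(lambda) / sum(lambda), 2)    # R^2 for k = 1, 2, ..., p components
```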

# PC1 is kinda southern

• size of state abbreviation $$\propto$$ projection on to PC1

• coordinates = state capitals, except for AK and HI
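The figure itself isn't reproduced in these notes, but a rough version of such a map can be sketched with base R's state.center coordinates (approximate geographic centers, standing in here for the capital locations, so this is only an approximation of the original plot):

```r
pc1   <- state.pca$x[, 1]                                   # scores on the first PC
sizes <- 0.5 + 1.5 * (pc1 - min(pc1)) / diff(range(pc1))    # map scores to text sizes
plot(state.center$x, state.center$y, type = "n",
     xlab = "Longitude", ylab = "Latitude")
text(state.center$x, state.center$y, state.abb, cex = sizes)
```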

# PC1 is kinda the legacy of slavery

• Correlation of PC1 with having been a slave state in 1861 is 0.78
• Correlation of PC1 with having been in the Confederacy is 0.8

# Summing up

• Finding the first $$k$$ PCs = finding the best $$k$$-dimensional approximation
• = finding the $$k$$ directions of maximum variance
• Then project on to the low-dimensional space
• PCs describe contrasts in the data

# The gory details

Collect the $$k$$ directions as the columns of a $$p\times k$$ matrix $$\w$$

Require: $$\w^T\w = \mathbf{I}$$, i.e., the basis vectors are orthonormal

$$\X \w =$$ matrix of projection lengths $$[n\times k]$$

$$\X \w \w^T =$$ matrix of projected vectors $$[n\times p]$$

$$\X - \X \w \w^T =$$ matrix of vector residuals $$[n\times p]$$

$$(\X-\X\w\w^T)(\X-\X\w\w^T)^T =$$ matrix of inner products of vector residuals $$[n\times n]$$

$$\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =$$ sum of squared errors $$[1\times 1]$$
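A hedged R sketch of this bookkeeping with $$k=2$$ (here W is the matrix of the first two eigenvectors, but any $$p\times k$$ matrix with orthonormal columns would do):

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)       # n x p data matrix
W <- eigen(t(x) %*% x / nrow(x))$vectors[, 1:2]          # p x k, orthonormal columns
round(t(W) %*% W, 10)                                    # = I (k x k)

proj.len  <- x %*% W                                     # n x k projection lengths
projected <- x %*% W %*% t(W)                            # n x p projected vectors
resid     <- x - projected                               # n x p vector residuals
sum(diag(resid %*% t(resid))) / nrow(x)                  # MSE(W) via the trace
mean(rowSums(resid^2))                                   # the same number, done directly
```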

# The gory details

$\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}$

so maximize $$\frac{1}{n}\tr{(\X\w\w^T\X^T)}$$

# The gory details

“trace is cyclic” so $\tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)}$ so we want to maximize $\tr{\left(\w^T \frac{\X^T \X}{n}\w\right)}$ under the constraint $\w^T \w = \mathbf{I}$

This is the same form we saw before, so it has the same sort of solution: each column of $$\w$$ must be an eigenvector of $$\V$$.

# The Lagrange multiplier trick

• Want to solve $\max_{w}{L(w)}$ with constraint $$f(w) = c$$
• Option I: Learn techniques for constrained optimization
• Drawback: Who wants to major in OR?
• Option II: Use $$f(w)=c$$ to eliminate one coordinate of $$w$$
• Then $$w=g(v,c)$$ for some function $$g$$ and unconstrained $$v$$
• Do $$\max_{v}{L(g(v,c))}$$
• An unconstrained problem with one less variable
• Drawback: math is hard!
• Option III: Solve an unconstrained problem with more variables
• Drawback: Would only occur to a French mathematician

# The Lagrange multiplier trick

• Define a Lagrangian $\mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c)$
• $$=L(w)$$ when the constraint holds
• $$\lambda$$ is the Lagrange multiplier which enforces the constraint $$f(w)=c$$
• Now do an unconstrained optimization over $$w$$ and $$\lambda$$: $\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$

# The Lagrange multiplier trick

$\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$

• Take derivatives: $\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\ \frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w} \end{eqnarray}$
• Set to 0 at the maximum: $\begin{eqnarray} f(w) & = & c\\ \frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w} \end{eqnarray}$
• We’ve automatically recovered the constraint!
• 1 equation per unknown $$\Rightarrow$$ Solve for $$\lambda$$, $$w$$
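For example (a small worked case): maximize $$L(w) = w_1 + w_2$$ subject to $$f(w) = w_1^2 + w_2^2 = 1$$.

$\begin{eqnarray} \frac{\partial L}{\partial w} = \lambda\frac{\partial f}{\partial w} & \Rightarrow & 1 = 2\lambda w_1,\ 1 = 2\lambda w_2\\ f(w) = c & \Rightarrow & w_1^2 + w_2^2 = 1 \end{eqnarray}$

so $$w_1 = w_2 = 1/\sqrt{2}$$ and $$\lambda = 1/\sqrt{2}$$ (taking the positive root, which gives the maximum rather than the minimum).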

# The Lagrange multiplier trick

• More equality constraints $$\Rightarrow$$ more Lagrange multipliers
• Inequality constraints, $$g(w) \leq d$$, are trickier
• Is the unconstrained maximum inside the feasible set?
• Yes: problem solved
• No: constrained maximum is on the boundary
• Boundary is usually $$g(w) = d$$ $$\Rightarrow$$ treat like an equality
• There are subtleties; sometimes we need to learn some OR

# The Lagrange multiplier trick

• Why not do this instead? $\max_{\lambda, w}{L(w) + \lambda{(f(w)-c)}^2}$