Principal Components Analysis I

6 September 2018

$\newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}}$

In our last episode…

• Data is a vector
• Break the vector up into additive components
• Each component is a vector, and each such vector is a pattern
• Which components?

• Before: components were eigenvectors of the smoother matrix
• Made sense to work with them because we were smoothing
• Can we get the data to tell us what the “right” components are?

What the data looks like

• Data is an $$n\times p$$ matrix $$\mathbf{x}$$
• Could be multivariate data at $$n$$ time points
• Or multivariate data at $$n$$ locations
• Or scalar data at $$n$$ time points and $$p$$ locations
• Write $$\vec{x}_i$$ for row $$i$$ ($$1\times p$$ matrix)

• First, center the data, just to reduce book-keeping
• i.e., subtract the mean of each column
• Optional: scale each column to equal variance
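
For concreteness, a minimal R sketch of this preprocessing step (it borrows the state.x77 data that appears later in these slides; any numeric data matrix would do):

x <- scale(state.x77, center = TRUE, scale = FALSE)   # subtract each column's mean
# optional: scale(state.x77, center = TRUE, scale = TRUE) also gives each column variance 1
colMeans(x)                                           # all (numerically) zero after centering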

Finding a “principal” component

• We don’t want to keep track of $$p$$ dimensions
• We want one dimension
• We also don’t want to distort the data too much

• Pick a direction in the $$p$$-dimensional space, represented by a length-1 vector $$\vec{w}$$ pointing along it
• What’s the best $$\vec{w}$$?

Projections

• $$\vec{x}_i \cdot \vec{w} =$$ length of $$\vec{x}_i$$’s projection on to the direction of $$\vec{w}$$

• $$(\vec{x}_i \cdot \vec{w})\vec{w} =$$ the actual projected vector
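
A tiny R illustration with a made-up two-dimensional vector, just to fix ideas:

x1 <- c(3, 4)         # a (hypothetical) data vector
w <- c(1, 0)          # a unit-length direction
sum(x1 * w)           # length of x1's projection on to w: here 3
sum(x1 * w) * w       # the actual projected vector: here (3, 0)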

How well does the projection approximate the original?

Do it for one vector first:

$\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \vec{x_i}\cdot\vec{x_i} - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}$

How well does the projection approximate the original?

Add up across all the data vectors:

$\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}$
• First bit doesn’t depend on $$\vec{w}$$, so doesn’t matter for minimizing
• So we want to maximize $L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}$
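
A quick numeric check of this decomposition (a sketch, again borrowing the state.x77 data from later in the slides and an arbitrary unit-length direction):

x <- scale(state.x77)                         # centered (and scaled) data
w <- rep(1, ncol(x)) / sqrt(ncol(x))          # some unit-length direction
proj <- x %*% w                               # projection lengths, one per row
mean(rowSums((x - proj %*% t(w))^2))          # MSE(w), computed from the residuals
mean(rowSums(x^2)) - mean(proj^2)             # mean ||x_i||^2 minus L(w): the same number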

Minimizing MSE is maximizing variance

$\begin{eqnarray} L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}$

($$\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}$$)

But $\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0$ Why?

Minimizing MSE is maximizing variance

so $L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}}$

Minimizing MSE is maximizing variance

The direction which gives us the best approximation of the data is the direction with the greatest variance

OK, how do we find this magic direction?

Matrix form: the vector of all projection lengths is $$\mathbf{x}\mathbf{w}$$ $$[n\times 1]$$

$\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}$

Fact: $$\V \equiv \frac{\X^T \X}{n} =$$ sample covariance matrix of the vectors
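
This is easy to check in R for centered data (note that cov() divides by $$n-1$$ rather than $$n$$):

x <- scale(state.x77, scale = FALSE)       # centered data
n <- nrow(x)
V <- t(x) %*% x / n                        # X^T X / n
max(abs(V - cov(x) * (n - 1) / n))         # ~ 0: agrees with cov() up to the n vs. n-1 divisor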

OK, how do we find this magic direction?

We need to maximize $\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w$

Constraint: $$\vec{w}$$ has length 1 $$\Leftrightarrow$$ $$\w^T \w = 1$$

Add a Lagrange multiplier $$\lambda$$

$\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & \w^T \w -1\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}$

Set to zero:

$\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}$

The magic direction is an eigenvector

$\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}$

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, $\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda$ so the variance is maximized by taking $$\w$$ to be the leading eigenvector of $$\V$$, the one with the largest eigenvalue
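
In R, eigen() returns the eigenvalues in decreasing order, so a sketch of this result on the (scaled) state data looks like:

x <- scale(state.x77)                      # center and scale
V <- t(x) %*% x / nrow(x)                  # sample covariance matrix
eig <- eigen(V)
w1 <- eig$vectors[, 1]                     # leading eigenvector = the magic direction
eig$values[1]                              # the maximized variance, lambda_1
drop(t(w1) %*% V %*% w1)                   # the same number, computed as w^T V w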

About the sample covariance matrix

$$\V$$ is a special matrix: symmetric and non-negative definite:

• Eigenvalues are all real (b/c symmetric)
• If $$\lambda_i \neq \lambda_j$$, then $$v_i \perp v_j$$ (b/c symmetric)
• Eigenvectors form a basis, and (can be chosen to be) orthonormal
• All $$\lambda_i \geq 0$$ (b/c non-negative definite)
$\begin{eqnarray} \text{Lead eigenvector of } \V & = & 1^{\mathrm{st}}\text{ principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}$

Multiple principal components

• What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
• Intuition: take the direction of maximum variance $$\perp$$ the first principal component
• Then direction of maximum variance $$\perp$$ the first two principal components
• These are the eigenvectors of $$\V$$, in order of decreasing $$\lambda$$
• Gory details go at the end of these slides

In R

• prcomp is the best PCA command
• Slightly complicated object returned
• Understand it with a (spatial) data example

USA, $$\approx 1977$$

Dataset pre-loaded in R:

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

Principal components of the USA, $$\approx 1977$$

state.pca <- prcomp(state.x77,scale.=TRUE)
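
The pieces of the returned object we will use below are sdev, rotation, and x:

names(state.pca)           # "sdev", "rotation", "center", "scale", "x"
dim(state.pca$rotation)    # p x p matrix of loadings/weights, one column per PC
dim(state.pca$x)           # n x p matrix of scores (the data projected on to the PCs)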

Principal components of the USA, $$\approx 1977$$

The weight/loading matrix $$\w$$ gets called $rotation (why?):

signif(state.pca$rotation[,1:2], 2)
##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## HS Grad    -0.420  0.300
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of $$\V$$
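
A quick sketch verifying that claim (eigenvector signs are arbitrary, so the comparison is up to sign):

x <- scale(state.x77)                      # same centering and scaling as prcomp(..., scale. = TRUE)
V <- t(x) %*% x / nrow(x)
eig <- eigen(V)
round(abs(eig$vectors[, 1:2]) - abs(state.pca$rotation[, 1:2]), 3)   # all (near) zero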

Break for in-class exercise

• What does the first principal component represent? What kind of state would score very high on it, and what kind of state would score very low?
• Ditto for the 2nd PC

signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component $$=\sqrt{\lambda_i}$$

If we keep $$k$$ components, $R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}}$

(Denominator $$=\tr{\V}$$; why?)

Principal components of the USA, $$\approx 1977$$

signif(state.pca$x[1:10, 1:2], 2)
##               PC1   PC2
## Alabama      3.80 -0.23
## Alaska      -1.10  5.50
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Colorado    -2.10  0.51
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns are $$\vec{x}_i \cdot \vec{w}_1$$ and $$\vec{x}_i \cdot \vec{w}_2$$
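
As a sketch, the scores can be reproduced by projecting the centered-and-scaled data on to the loadings, and the $$R^2$$ formula from the previous slide computed from sdev:

scores <- scale(state.x77) %*% state.pca$rotation    # x_i . w_j for every state and every PC
max(abs(scores - state.pca$x))                       # essentially zero
sum(state.pca$sdev[1:2]^2) / sum(state.pca$sdev^2)   # R^2 if we keep the first k = 2 components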

PC1 is kinda southern

• size of state abbreviation $$\propto$$ projection on to PC1

• coordinates = state capitals, except for AK and HI

PC1 is kinda the legacy of slavery

• Correlation of PC1 with having been a slave state in 1861 is 0.78
• Correlation of PC1 with having been in the Confederacy is 0.8

Summing up

• Finding the first $$k$$ PCs = finding the best $$k$$-dimensional approximation
• = finding the $$k$$ directions of maximum variance
• Then project on to the low-dimensional space
• PCs describe contrasts in the data

The gory details

Use $$k$$ directions in a $$p\times k$$ matrix $$\w$$

Require: $$\mathbf{w}^T\mathbf{w} = \mathbf{I}$$, the basis vectors are orthonormal

$$\X \w =$$ matrix of projection lengths $$[n\times k]$$

$$\X \w \w^T =$$ matrix of projected vectors $$[n\times p]$$

$$\X - \X \w \w^T =$$ matrix of vector residuals $$[n\times p]$$

$$(\X-\X\w\w^T)(\X-\X\w\w^T)^T =$$ matrix of inner products of vector residuals $$[n\times n]$$

$$\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =$$ sum of squared errors $$[1\times 1]$$

The gory details

$\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}$

so minimizing the MSE means maximizing $$\frac{1}{n}\tr{(\X\w\w^T\X^T)}$$

The gory details

“trace is cyclic” so $\tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)}$ so we want to maximize $\tr{\left(\w^T \frac{\X^T \X}{n}\w\right)}$ under the constraint $\w^T \w = \mathbf{I}$

This is the same form we saw before, so it has the same sort of solution: each column of $$\w$$ must be an eigenvector of $$\V$$.
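
The algebra translates directly into R; a sketch of the rank-$$k$$ approximation and its MSE for $$k=2$$, reusing the state.pca object from above:

x <- scale(state.x77)                       # the centered, scaled data matrix X
w <- state.pca$rotation[, 1:2]              # p x k matrix whose columns are the first k PCs
xhat <- x %*% w %*% t(w)                    # X w w^T: the projected vectors
sum((x - xhat)^2) / nrow(x)                 # (1/n) tr((X - X w w^T)(X - X w w^T)^T) = MSE(w)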

The Lagrange multiplier trick

• Want to solve $\max_{w}{L(w)}$ with constraint $$f(w) = c$$
• Option I: Learn techniques for constrained optimization
• Drawback: Who wants to major in OR?
• Option II: Use $$f(w)=c$$ to eliminate one coordinate of $$w$$
• Then $$w=g(v,c)$$ for some function $$g$$ and unconstrained $$v$$
• Do $$\max_{v}{L(g(v,c))}$$
• An unconstrained problem with one less variable
• Drawback: math is hard!
• Option III: Solve an unconstrained problem with more variables
• Drawback: Would only occur to a French mathematician

The Lagrange multiplier trick

• Define a Lagrangian $\mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c)$
• $$=L(w)$$ when the constraint holds
• $$\lambda$$ is the Lagrange multiplier which enforces the constraint $$f(w)=c$$
• Now do an unconstrained optimization over $$w$$ and $$\lambda$$: $\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$

The Lagrange multiplier trick

$\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$

• Take derivatives: $\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\ \frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w} \end{eqnarray}$
• Set to 0 at the maximum: $\begin{eqnarray} f(w) & = & c\\ \frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w} \end{eqnarray}$
• We’ve automatically recovered the constraint!
• 1 equation per unknown $$\Rightarrow$$ Solve for $$\lambda$$, $$w$$
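• For instance, to maximize $$L(w) = w_1 + w_2$$ subject to $$f(w) = w_1^2 + w_2^2 = 1$$: the conditions above give $$1 = 2\lambda w_1$$ and $$1 = 2\lambda w_2$$, so $$w_1 = w_2$$, and the constraint then forces $$w_1 = w_2 = 1/\sqrt{2}$$ (taking the root that maximizes rather than minimizes)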

The Lagrange multiplier trick

• More equality constraints $$\Rightarrow$$ more Lagrange multipliers
• Inequality constraints, $$g(w) \leq d$$, are trickier
• Is the unconstrained maximum inside the feasible set?
• Yes: problem solved
• No: constrained maximum is on the boundary
• Boundary is usually $$g(w) = d$$ $$\Rightarrow$$ treat like an equality
• There are subtleties; sometimes we need to learn some OR

The Lagrange multiplier trick

• Why not do this instead? $\max_{\lambda, w}{L(w) + \lambda{\left(f(w)-c\right)}^2}$