# Principal Components Analysis I

6 September 2018


# In our last episode…

• Data is a vector
• Break the vector up into additive components
• Each component is itself a vector, representing a pattern
• Which components?

• Before: components were eigenvectors of the smoother matrix
• Made sense to work with them because we were smoothing
• Can we get the data to tell us what the “right” components are?

# What the data looks like

• Data is an $$n\times p$$ matrix $$\mathbf{x}$$
• Could be multivariate data at $$n$$ time points
• Or multivariate data at $$n$$ locations
• Or scalar data at $$n$$ time points and $$p$$ locations
• Write $$\vec{x}_i$$ for row $$i$$ ($$1\times p$$ matrix)

• First, center the data, just to reduce book-keeping
• i.e., subtract the mean of each column
• Optional: scale each column to equal variance
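For concreteness, a minimal R sketch of this preprocessing step, using the built-in state.x77 data that shows up later in these slides (the variable names here are just illustrative):

```r
# Center each column (subtract the column mean); rows are observations
x.centered <- scale(state.x77, center = TRUE, scale = FALSE)
colMeans(x.centered)                  # every column mean is now (numerically) zero

# Optional: also rescale each column to unit variance
x.std <- scale(state.x77, center = TRUE, scale = TRUE)
apply(x.std, 2, sd)                   # every column sd is now 1
```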

# Finding a “principal” component

• We don’t want to keep track of $$p$$ dimensions
• We want one dimension
• We also don’t want to distort the data too much

• Pick a direction in the $$p$$-dimensional space, and a length-1 vector $$\vec{w}$$
• What’s the best $$\vec{w}$$?

# Projections

• $$\vec{x}_i \cdot \vec{w} =$$ length of $$\vec{x}_i$$’s projection on to the direction of $$\vec{w}$$

• $$(\vec{x}_i \cdot \vec{w})\vec{w} =$$ the actual projected vector
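A small R illustration of these two formulas, with a made-up data vector and a made-up unit vector (any length-1 $$\vec{w}$$ would do):

```r
x.i <- c(3, 4)                    # one data vector
w   <- c(1, 1) / sqrt(2)          # a direction, normalized to length 1
proj.length <- sum(x.i * w)       # x_i . w : length of the projection, about 4.95
proj.vector <- proj.length * w    # (x_i . w) w : the projected vector, about (3.5, 3.5)
```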

# How well does the projection approximate the original?

Do it for one vector first:

$\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \vec{x_i}\cdot\vec{x_i} - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}$

# How well does the projection approximate the original?

Add up across all the data vectors:

$\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}$
• First bit doesn’t depend on $$\vec{w}$$, so doesn’t matter for minimizing
• So we want to maximize $L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}$
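A quick numerical check of this decomposition: for centered data and any unit vector $$\vec{w}$$ (here just a random one, nothing special), the mean squared residual equals the mean squared length minus $$L(\vec{w})$$. A sketch in R, using the state data centered and (to match the later slides) scaled:

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)    # centered data matrix
set.seed(1)
w <- rnorm(ncol(x)); w <- w / sqrt(sum(w^2))          # a random unit vector

resid <- x - (x %*% w) %*% t(w)                       # x_i - (x_i . w) w, row by row
mean(rowSums(resid^2))                                # MSE(w), computed directly
mean(rowSums(x^2)) - mean((x %*% w)^2)                # same number, via the identity above
```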

# Minimizing MSE is maximizing variance

$\begin{eqnarray} L(w) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}$

($$\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}$$)

But $\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0$ Why?

# Minimizing MSE is maximizing variance

so $L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}}$

# Minimizing MSE is maximizing variance

The direction which gives us the best approximation of the data is the direction with the greatest variance

# OK, how do we find this magic direction?

Matrix form: the vector of all the projection lengths is $$\X \w$$ $$[n\times 1]$$

$\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}$

Fact: $$\V \equiv \frac{\X^T \X}{n} =$$ sample covariance matrix of the vectors (because we centered each column)
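This is easy to check in R. One caveat: R's cov() divides by $$n-1$$ rather than $$n$$, so the two agree only up to that factor:

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)
n <- nrow(x)
V <- t(x) %*% x / n                    # the matrix called V on this slide
max(abs(V - cov(x) * (n - 1) / n))     # essentially zero: same matrix up to n vs. n-1
```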

# OK, how do we find this magic direction?

We need to maximize $\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w$

Constraint: $$\vec{w}$$ has length 1 $$\Leftrightarrow$$ $$\w^T \w = 1$$

Add a Lagrange multiplier $$\lambda$$

$\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & \w^T \w -1\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}$

Set to zero:

$\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}$

# The magic direction is an eigenvector

$\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}$

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, $\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda$ so the maximizing $$\w$$ is the leading eigenvector of $$\V$$ (the one with the largest eigenvalue), and the maximum variance is that eigenvalue
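A sketch of this in R with eigen(): the first eigenvector of $$\V$$ is the maximum-variance direction, its eigenvalue is that maximum variance, and any other unit vector does worse:

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)
V <- t(x) %*% x / nrow(x)
eig <- eigen(V)                      # V is symmetric, so real eigenvalues and eigenvectors
w1 <- eig$vectors[, 1]               # leading eigenvector

mean((x %*% w1)^2)                   # variance along w1 ...
eig$values[1]                        # ... equals the leading eigenvalue

set.seed(2)
w.rand <- rnorm(ncol(x)); w.rand <- w.rand / sqrt(sum(w.rand^2))
mean((x %*% w.rand)^2)               # a random direction captures less variance
```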

# About the sample covariance matrix

$$\V$$ is a special matrix: symmetric and non-negative definite:

• Eigenvalues are all real (b/c symmetric)
• If $$\lambda_i \neq \lambda_j$$, then $$v_i \perp v_j$$ (b/c symmetric)
• Eigenvectors form a basis, and (can be chosen to be) orthonormal
• All $$\lambda_i \geq 0$$ (b/c non-negative definite)
$\begin{eqnarray} \text{Leading eigenvector of } \V & = & 1^{\mathrm{st}} \text{ principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}$

# Multiple principal components

• What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
• Intuition: take the direction of maximum variance $$\perp$$ the first principal component
• Then direction of maximum variance $$\perp$$ the first two principal components
• These are the eigenvectors of $$\V$$, in order of decreasing $$\lambda$$
• Gory details go at the end of these slides
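A sketch of the "keep $$k$$ components" idea in R, with $$k=2$$: project onto the first two eigenvectors of $$\V$$, and confirm that prcomp() picks out the same directions (columns may differ by sign, which is harmless):

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)
W <- eigen(t(x) %*% x / nrow(x))$vectors[, 1:2]   # first two eigenvectors as columns
scores   <- x %*% W                               # n x 2 matrix of projection lengths
x.approx <- scores %*% t(W)                       # best 2-dimensional approximation

pca <- prcomp(state.x77, scale. = TRUE)
max(abs(abs(W) - abs(pca$rotation[, 1:2])))       # ~ 0: prcomp finds the same directions
mean(rowSums((x - x.approx)^2))                   # MSE of the 2-component approximation
```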

# In R

• prcomp is the best PCA command
• Slightly complicated object returned
• Understand it with a (spatial) data example

# USA, $$\approx 1977$$

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

# Principal components of the USA, $$\approx 1977$$

state.pca <- prcomp(state.x77,scale.=TRUE)

# Principal components of the USA, $$\approx 1977$$

The weight/loading matrix $$\w$$ gets called $rotation (why?):

signif(state.pca$rotation[,1:2], 2)
##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of $$\V$$
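This is easy to verify directly, at least up to sign (eigen() and prcomp() may each flip the sign of any column):

```r
# cor(state.x77) is proportional to V for the standardized columns,
# so it has the same eigenvectors
max(abs(abs(eigen(cor(state.x77))$vectors) - abs(state.pca$rotation)))   # essentially zero
```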

# Break for in-class exercise

• What does the first principal component represent? What kind of state would score very high on it, and what kind of state would score very low?
• Ditto for the 2nd PC

# Principal components of the USA, $$\approx 1977$$

signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component $$=\sqrt{\lambda_i}$$

If we keep $$k$$ components, $R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}}$

(Denominator $$=\tr{\V}$$ — why?)

# Principal components of the USA, $$\approx 1977$$

signif(state.pca$x[1:10, 1:2], 2)
##               PC1   PC2
## Alabama      3.80 -0.23
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns are $$\vec{x}_i \cdot \vec{w}_1$$ and $$\vec{x}_i \cdot \vec{w}_2$$
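Going back to the $$R^2$$ formula above: since prcomp() stores the $$\sqrt{\lambda_i}$$ in $sdev, the share of variance from keeping the first $$k$$ components is just a cumulative sum. A quick sketch, using the state.pca object fitted above:

```r
lambda <- state.pca$sdev^2                 # the eigenvalues lambda_i of V
signif(cumsum(lambda) / sum(lambda), 2)    # R^2 for k = 1, 2, ..., p components
```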

# PC1 is kinda southern

• size of state abbreviation $$\propto$$ projection on to PC1

• coordinates = state capitals, except for AK and HI
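The figure itself isn't reproduced in these notes, but a rough version of such a map can be sketched with base R's state.center coordinates (approximate geographic centers, standing in here for the capital locations, so this is only an approximation of the original plot):

```r
pc1   <- state.pca$x[, 1]                                   # scores on the first PC
sizes <- 0.5 + 1.5 * (pc1 - min(pc1)) / diff(range(pc1))    # map scores to text sizes
plot(state.center$x, state.center$y, type = "n",
     xlab = "Longitude", ylab = "Latitude")
text(state.center$x, state.center$y, state.abb, cex = sizes)
```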

# PC1 is kinda the legacy of slavery

• Correlation of PC1 with having been a slave state in 1861 is 0.78
• Correlation of PC1 with having been in the Confederacy is 0.8

# Summing up

• Finding the first $$k$$ PCs = finding the best $$k$$-dimensional approximation
• = finding the $$k$$ directions of maximum variance
• Then project on to the low-dimensional space
• PCs describe contrasts in the data

# The gory details

Collect the $$k$$ directions as the columns of a $$p\times k$$ matrix $$\w$$

Require: $$\w^T\w = \mathbf{I}$$, i.e., the basis vectors are orthonormal

$$\X \w =$$ matrix of projection lengths $$[n\times k]$$

$$\X \w \w^T =$$ matrix of projected vectors $$[n\times p]$$

$$\X - \X \w \w^T =$$ matrix of vector residuals $$[n\times p]$$

$$(\X-\X\w\w^T)(\X-\X\w\w^T)^T =$$ matrix of inner products of vector residuals $$[n\times n]$$

$$\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =$$ sum of squared errors $$[1\times 1]$$
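A hedged R sketch of this bookkeeping with $$k=2$$ (here W is the matrix of the first two eigenvectors, but any $$p\times k$$ matrix with orthonormal columns would do):

```r
x <- scale(state.x77, center = TRUE, scale = TRUE)       # n x p data matrix
W <- eigen(t(x) %*% x / nrow(x))$vectors[, 1:2]          # p x k, orthonormal columns
round(t(W) %*% W, 10)                                    # = I (k x k)

proj.len  <- x %*% W                                     # n x k projection lengths
projected <- x %*% W %*% t(W)                            # n x p projected vectors
resid     <- x - projected                               # n x p vector residuals
sum(diag(resid %*% t(resid))) / nrow(x)                  # MSE(W) via the trace
mean(rowSums(resid^2))                                   # the same number, done directly
```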

# The gory details

$\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}$

so maximize $$\frac{1}{n}\tr{(\X\w\w^T\X^T)}$$

# The gory details

“trace is cyclic” so $\tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)}$ so we want to maximize $\tr{\left(\w^T \frac{\X^T \X}{n}\w\right)}$ under the constraint $\w^T \w = \mathbf{I}$

This is the same form we saw before, so it has the same sort of solution: each column of $$\w$$ must be an eigenvector of $$\V$$.

# The Lagrange multiplier trick

• Want to solve $\max_{w}{L(w)}$ with constraint $$f(w) = c$$
• Option I: Learn techniques for constrained optimization
• Drawback: Who wants to major in OR?
• Option II: Use $$f(w)=c$$ to eliminate one coordinate of $$w$$
• Then $$w=g(v,c)$$ for some function $$g$$ and unconstrained $$v$$
• Do $$\max_{v}{L(g(v,c))}$$
• An unconstrained problem with one less variable
• Drawback: math is hard!
• Option III: Solve an unconstrained problem with more variables
• Drawback: Would only occur to a French mathematician

# The Lagrange multiplier trick

• Define a Lagrangian $\mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c)$
• $$=L(w)$$ when the constraint holds
• $$\lambda$$ is the Lagrange multiplier which enforces the constraint $$f(w)=c$$
• Now do an unconstrained optimization over $$w$$ and $$\lambda$$: $\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$

# The Lagrange multiplier trick

$\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$

• Take derivatives: $\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\ \frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w} \end{eqnarray}$
• Set to 0 at the maximum: $\begin{eqnarray} f(w) & = & c\\ \frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w} \end{eqnarray}$
• We’ve automatically recovered the constraint!
• 1 equation per unknown $$\Rightarrow$$ Solve for $$\lambda$$, $$w$$
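For example (a small worked case): maximize $$L(w) = w_1 + w_2$$ subject to $$f(w) = w_1^2 + w_2^2 = 1$$.

$\begin{eqnarray} \frac{\partial L}{\partial w} = \lambda\frac{\partial f}{\partial w} & \Rightarrow & 1 = 2\lambda w_1,\ 1 = 2\lambda w_2\\ f(w) = c & \Rightarrow & w_1^2 + w_2^2 = 1 \end{eqnarray}$

so $$w_1 = w_2 = 1/\sqrt{2}$$ and $$\lambda = 1/\sqrt{2}$$ (taking the positive root, which gives the maximum rather than the minimum).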

# The Lagrange multiplier trick

• More equality constraints $$\Rightarrow$$ more Lagrange multipliers
• Inequality constraints, $$g(w) \leq d$$, are trickier
• Is the unconstrained maximum inside the feasible set?
• Yes: problem solved
• No: constrained maximum is on the boundary
• Boundary is usually $$g(w) = d$$ $$\Rightarrow$$ treat like an equality
• There are subtleties; sometimes we need to learn some OR

# The Lagrange multiplier trick

• Why not do this instead? $\max_{\lambda, w}{L(w) + \lambda{(f(w)-c)}^2}$