---
title: Principal Components Analysis I
author: 36-467/667
date: 6 September 2018
output: slidy_presentation
---

$\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\dof}{DoF}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}$

## In our last episode...

- Data is a vector
- Break the vector up into additive components
- Each component is a vector, i.e., a pattern
- Which components?
- Before: components were eigenvectors of the smoother matrix
    + Made sense to work with them because we were smoothing
- Can we get the data to tell us what the "right" components are?

## What the data looks like

- Data is $\mathbf{x} = n\times p$ matrix
    + Could be multivariate data at $n$ time points
    + Or multivariate data at $n$ locations
    + Or scalar data at $n$ time points and $p$ locations
- Write $\vec{x}_i$ for row $i$ ($1\times p$ matrix)
- First, center the data, just to reduce book-keeping
    + i.e., subtract the mean of each column
    + Optional: scale each column to equal variance

## Finding a "principal" component

- We don't want to keep track of $p$ dimensions
- We want _one_ dimension
- We also don't want to distort the data too much
- Pick a direction in the $p$-dimensional space, represented by a length-1 vector $\vec{w}$
- What's the best $\vec{w}$?
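Before answering, it may help to see the ingredients in code. This is a minimal sketch, using made-up two-dimensional data and an arbitrary candidate direction (neither is part of the lecture's running example): center the data, pick a unit vector $\vec{w}$, and compute the projections and the resulting approximation error.

```r
# Sketch: project centered data on to a candidate unit vector w
# (made-up 2D data; w is an arbitrary choice, not necessarily a good one)
set.seed(42)
x <- matrix(rnorm(20), ncol=2)           # 10 points in 2 dimensions
x <- scale(x, center=TRUE, scale=FALSE)  # subtract the mean of each column
w <- c(1, 1)/sqrt(2)                     # a length-1 direction vector
lengths <- x %*% w                       # projection lengths, one per point
approx <- lengths %*% w                  # the projected (approximating) vectors
mse <- mean(rowSums((x - approx)^2))     # how badly this w distorts the data
```

Minimizing `mse` over all unit vectors `w` is exactly the problem the following slides solve.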
## Projections

- $\vec{x}_i \cdot \vec{w} =$ length of $\vec{x}_i$'s projection on to the direction of $\vec{w}$
- $(\vec{x}_i \cdot \vec{w})\vec{w} =$ the actual projected vector

##

```{r, echo=FALSE}
# Make an arc
demo.theta <- runif(10, min=0, max=pi/2)
demo.x <- cbind(cos(demo.theta), sin(demo.theta))
# Center the coordinates
demo.x <- scale(demo.x, center=TRUE, scale=FALSE)
# Plot the points
plot(demo.x, xlab=expression(x^1), ylab=expression(x^2),
     xlim=c(-1,1), ylim=c(-1,1),
     main="Projections of data on to an arbitrary vector")
# Pick a direction (not a very good one), by a unit vector
demo.w <- c(cos(-3*pi/8), sin(-3*pi/8))
# Draw an arrow for the unit vector
arrows(0, 0, demo.w[1], demo.w[2], col="blue")
text(demo.w[1], demo.w[2], pos=4, labels=expression(w))
# Draw a dashed line for the whole line
abline(0, b=demo.w[2]/demo.w[1], col="blue", lty="dashed")
# Get the length of the projection of each data point on to the line
projection.lengths <- demo.x %*% demo.w
# Get the actual projections
projections <- projection.lengths %*% demo.w
# Draw those as points along the line
points(projections, pch=16, col="blue")
# Draw lines from each data point to its projection
segments(x0=demo.x[,1], y0=demo.x[,2], x1=projections[,1], y1=projections[,2],
         col="grey")
```

## How well does the projection approximate the original?

Do it for one vector first:
\begin{eqnarray}
{\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & = & \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\
& = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber
& & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\
& = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\
& = & {\|\vec{x_i}\|}^2 - (\vec{w}\cdot\vec{x_i})^2
\end{eqnarray}

(The last step uses $\vec{w}\cdot\vec{w}=1$, since $\vec{w}$ has length 1.)

## How well does the projection approximate the original?
Add up across all the data vectors:
\begin{eqnarray}
MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{\|\vec{x_i}\|}^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\
& = & \frac{1}{n}\left(\sum_{i=1}^{n}{{\|\vec{x_i}\|}^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right)
\end{eqnarray}

- First term doesn't depend on $\vec{w}$, so it doesn't matter for minimizing
- So we want to maximize $L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}$

## Minimizing MSE is maximizing variance

\begin{eqnarray}
L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\
& = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}}
\end{eqnarray}

(just as $\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}$)

But $\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0$

Why? (Hint: we centered the data)

## Minimizing MSE is maximizing variance

so $L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}}$

## Minimizing MSE is maximizing variance

> The direction which gives us the best approximation of the data is the direction with the greatest variance

## OK, how do we find this magic direction?

Matrix form: the vector of all $n$ projection _lengths_ is $\mathbf{x}\mathbf{w}$ $[n\times 1]$

\begin{eqnarray}
\SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\
& = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\
& = & \frac{1}{n} \w^T \X^T \X \w\\
& = & \w^T \frac{\X^T \X}{n} \w
\end{eqnarray}

Fact: $\V \equiv \frac{\X^T \X}{n} =$ sample covariance matrix of the vectors (because the columns are centered)

## OK, how do we find this magic direction?
We need to maximize
$$\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w$$

Constraint: $\vec{w}$ has length 1 $\Leftrightarrow$ $\w^T \w = 1$

Add a **Lagrange multiplier** $\lambda$:
\begin{eqnarray}
\mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w -1)\\
\frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w
\end{eqnarray}

Set to zero:
\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}

## The magic direction is an eigenvector

\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}

> THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution,
$$\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda$$
so the maximum is attained at the _leading_ eigenvector of $\V$, the one with the largest eigenvalue

## About the sample covariance matrix

$\V$ is a special matrix: symmetric and non-negative definite, so

- Eigenvalues are all real (b/c symmetric)
- If $\lambda_i \neq \lambda_j$, then $v_i \perp v_j$ (b/c symmetric)
- Eigenvectors form a basis, and (can be chosen to be) orthonormal
- All $\lambda_i \geq 0$ (b/c non-negative definite)

\begin{eqnarray}
\text{Leading eigenvector of } \V & = & 1^{\mathrm{st}} \text{ principal component}\\
& = & \text{Direction of maximum variance}\\
& = & \text{Best 1D approximation to the data}
\end{eqnarray}

##

```{r, echo=FALSE}
plot(demo.x, xlab=expression(x^1), ylab=expression(x^2),
     xlim=c(-1,1), ylim=c(-1,1))
# Pick a direction (not a very good one), by a unit vector
demo.w <- c(cos(-3*pi/8), sin(-3*pi/8))
# Draw an arrow for the unit vector
arrows(0, 0, demo.w[1], demo.w[2], col="blue")
text(demo.w[1], demo.w[2], pos=4, labels="arbitrary w")
# Draw a dashed line for the whole line
abline(0, b=demo.w[2]/demo.w[1], col="blue", lty="dashed")
# Get the length of the projection of each data point on to the line
projection.lengths <- demo.x %*% demo.w
# Get the actual projections
projections <- projection.lengths %*% demo.w
# Draw those as points along the line
points(projections, pch=16, col="blue")
# Draw lines from each data point to its projection
segments(x0=demo.x[,1], y0=demo.x[,2], x1=projections[,1], y1=projections[,2],
         col="grey")
# Find the variance-covariance matrix of the data
v <- var(demo.x)
# First principal component is the leading eigenvector
pc.1 <- eigen(v)$vectors[,1]
# Draw the arrow and the dashed line, but in a different color
arrows(0, 0, pc.1[1], pc.1[2], col="red")
abline(0, b=pc.1[2]/pc.1[1], col="red", lty="dashed")
text(pc.1[1], pc.1[2], pos=4, labels="PC1")
# Get the projection lengths and actual projections
projection.lengths <- demo.x %*% pc.1
projections <- projection.lengths %*% pc.1
# Draw the points
points(projections, pch=16, col="red")
segments(x0=demo.x[,1], y0=demo.x[,2], x1=projections[,1], y1=projections[,2],
         col="pink")
legend("topright",
       legend=c("Data", "Projection on arbitrary w", "Projection on PC1",
                "Error with arbitrary w", "Error with PC1"),
       pch=c(1, 16, 16, NA, NA),
       lty=c("blank", "blank", "blank", "solid", "solid"),
       col=c("black", "blue", "red", "grey", "pink"))
```

## Multiple principal components

- What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
- Intuition: take the direction of maximum variance $\perp$ the first principal component
- Then the direction of maximum variance $\perp$ the first two principal components
- These are the eigenvectors of $\V$, in order of decreasing $\lambda$
- Gory details go at the end of these slides

## In R

- `prcomp` is the best PCA command
- Slightly complicated object returned
- Understand it with a (spatial) data example

## USA, $\approx 1977$

Dataset pre-loaded in R:
```{r}
head(state.x77)
```

## Principal components of the USA, $\approx 1977$

```{r}
state.pca <- prcomp(state.x77, scale.=TRUE)
```

## Principal components of the USA, $\approx 1977$

The weight/loading matrix $\w$ gets called `$rotation` (why?):
```{r}
signif(state.pca$rotation[,1:2], 2)
```

Each column is an eigenvector of $\V$

## Break for in-class exercise

- What does the first principal component represent? What kind of state would score very high on it, and what kind of state would score very low?
- Ditto for the 2nd PC

## Principal components of the USA, $\approx 1977$

```{r}
signif(state.pca$sdev, 2)
```

Standard deviations along each principal component $= \sqrt{\lambda_i}$

If we keep $k$ components, $R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}}$

(Denominator $=\tr{\V}$ --- why?)

## Principal components of the USA, $\approx 1977$

```{r}
signif(state.pca$x[1:10, 1:2], 2)
```

Columns are $\vec{x}_i \cdot \vec{w}_1$ and $\vec{x}_i \cdot \vec{w}_2$

## PC1 is kinda southern

```{r, echo=FALSE}
plot.states_scaled <- function(sizes, min.size=0.4, max.size=2,
                               text.color="black", ...) {
    plot(state.center, type="n", ...)
    out.range <- max.size - min.size
    in.range <- max(sizes) - min(sizes)
    scaled.sizes <- out.range*((sizes - min(sizes))/in.range)
    text(state.center, state.abb, cex=scaled.sizes + min.size, col=text.color)
    invisible(scaled.sizes)
}
plot.states_scaled(state.pca$x[,1], min.size=0.3, max.size=1.5,
                   xlab="longitude", ylab="latitude")
```

- size of state abbreviation $\propto$ projection on to PC1
- coordinates from `state.center`, except that AK and HI are relocated

## PC1 is kinda the legacy of slavery

```{r, echo=FALSE}
state.confed <- rep(0, times=50)
names(state.confed) <- rownames(state.x77)
state.confed[c("South Carolina", "Mississippi", "Florida", "Alabama",
               "Georgia", "Louisiana", "Texas", "Virginia", "Arkansas",
               "Tennessee", "North Carolina")] <- 1
state.slave <- state.confed
state.slave[c("Kentucky", "Missouri", "Maryland", "Delaware",
              "West Virginia")] <- 1
state.colors <- rep("blue", 50)
state.colors[state.confed==1] <- "grey"
state.colors[(state.slave==1) & (state.confed==0)] <- "lightblue"
plot.states_scaled(state.pca$x[,1], min.size=0.3, max.size=1.5,
                   xlab="longitude", ylab="latitude", text.color=state.colors)
```

- Correlation of PC1 with having been a slave state in 1861 is `r signif(cor(state.pca$x[,1], state.slave), 2)`
- Correlation of PC1 with having been in the Confederacy is `r signif(cor(state.confed, state.pca$x[,1]), 2)`

## Summing up

- Finding the first $k$ PCs = finding the best $k$-dimensional approximation
- = finding the $k$ directions of maximum variance
- Then project on to the low-dimensional space
- PCs describe _contrasts_ in the data

## The gory details

Use $k$ directions in a $p\times k$ matrix $\w$

_Require:_ $\mathbf{w}^T\mathbf{w} = \mathbf{I}$, i.e., the basis vectors are orthonormal

$\X \w =$ matrix of projection lengths $[n\times k]$

$\X \w \w^T =$ matrix of projected _vectors_ $[n\times p]$

$\X - \X \w \w^T =$ matrix of vector residuals $[n\times p]$

$(\X-\X\w\w^T)(\X-\X\w\w^T)^T =$ matrix of inner products of vector residuals $[n\times n]$

$\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =$ sum of squared errors $[1\times 1]$

## The gory details

\begin{eqnarray}
MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\
& = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\
& = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\
& = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)}
\end{eqnarray}

so maximize $\frac{1}{n}\tr{(\X\w\w^T\X^T)}$

## The gory details

"trace is cyclic", so
$$\tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)}$$
so we want to maximize
$$\tr{\left(\w^T \frac{\X^T \X}{n}\w\right)}$$
under the constraint $\w^T \w = \mathbf{I}$

This is the same form we saw before, so it has the same sort of solution: each column of $\w$ must be an eigenvector of $\V$.

## The Lagrange multiplier trick

- Want to solve $\max_{w}{L(w)}$ with **constraint** $f(w) = c$
- Option I: Learn techniques for constrained optimization
    + Drawback: Who wants to major in OR (operations research)?
- Option II: Use $f(w)=c$ to eliminate one coordinate of $w$
    + Then $w=g(v,c)$ for some function $g$ and _unconstrained_ $v$
    + Do $\max_{v}{L(g(v,c))}$
    + An unconstrained problem with one less variable
    + Drawback: math is hard!
- Option III: Solve an unconstrained problem with _more_ variables
    + Drawback: Would only occur to a French mathematician

## The Lagrange multiplier trick

- Define a **Lagrangian**
$$\mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c)$$
    + $=L(w)$ when the constraint holds
- $\lambda$ is the **Lagrange multiplier** which **enforces** the constraint $f(w)=c$
- Now do an unconstrained optimization over $w$ and $\lambda$:
$$\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$$

## The Lagrange multiplier trick

$$\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$$

- Take derivatives:
\begin{eqnarray}
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\
\frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w}
\end{eqnarray}
- Set to 0 at the maximum:
\begin{eqnarray}
f(w) & = & c\\
\frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w}
\end{eqnarray}
    + We've automatically recovered the constraint!
- 1 equation per unknown $\Rightarrow$ solve for $\lambda$, $w$

## The Lagrange multiplier trick

- More equality constraints $\Rightarrow$ more Lagrange multipliers
- Inequality constraints, $g(w) \leq d$, are trickier
- Is the unconstrained maximum inside the **feasible set**?
    + Yes: problem solved
    + No: constrained maximum is on the boundary
    + Boundary is _usually_ $g(w) = d$ $\Rightarrow$ treat like an equality
    + There are subtleties; sometimes we need to learn some OR

## The Lagrange multiplier trick

- Why not do this instead?
$$\max_{\lambda, w}{L(w) + \lambda{\left(f(w)-c\right)}^2}$$
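To tie the trick back to PCA: the eigenvector characterization derived earlier is easy to check numerically. A minimal sketch, using simulated data rather than the `state.x77` example, and only base R:

```r
# Check the Lagrange-multiplier result numerically: the leading
# eigenvector of V = X^T X / n maximizes w^T V w over unit vectors
set.seed(1)
x <- matrix(rnorm(200), ncol=2)           # 100 simulated 2D points
x <- scale(x, center=TRUE, scale=FALSE)   # center, as in the lecture
V <- crossprod(x)/nrow(x)                 # sample covariance, X^T X / n
w <- eigen(V)$vectors[, 1]                # first principal component
var.w <- drop(t(w) %*% V %*% w)           # variance along w = leading eigenvalue
u <- c(cos(0.3), sin(0.3))                # some other unit vector, for comparison
var.u <- drop(t(u) %*% V %*% u)           # variance along u: can't beat var.w
```

Up to sign, `w` here matches `prcomp(x)$rotation[, 1]`; note that `prcomp` divides by $n-1$ rather than $n$, which rescales the eigenvalues but not the eigenvectors.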