---
title: Principal Components Analysis
author: 36-467/667
date: 11 September 2019
output: slidy_presentation
bibliography: locusts.bib
---

```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white", highlight=FALSE, cache=TRUE,
               autodep=TRUE, tidy=TRUE, warning=FALSE, message=FALSE,
               tidy.opts=list(comment=FALSE))
```

$\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}$

## Previously...

- Represent items in our database as vectors of numerical features
- Want to look for patterns in those features
- First, or as an example, low-dimensional structure in the vectors

## What the data looks like

- Data is an $n\times p$ matrix $\mathbf{x}$
    + Could be multivariate data at $n$ time points
    + Or multivariate data at $n$ locations
    + Or scalar data at $n$ time points and $p$ locations
    + Write $\vec{x}_i$ for row $i$ (a $1\times p$ matrix)
- First, center the data, just to reduce book-keeping
    + i.e., subtract the mean of each column
    + Optional: scale each column to equal variance

## Finding _the_ "principal" component

- We don't want to keep track of $p$ dimensions
- We want _one_ dimension
- We also don't want to distort the data too much

> - Pick a direction in the $p$-dimensional space, represented by a length-1 vector $\vec{w}$
> - What's the best $\vec{w}$?
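Before deriving the answer, the projection arithmetic itself is easy to check numerically. A minimal sketch (the matrix `x` and direction `w` below are made-up toy values, not the lecture's data):

```r
# Toy check of projecting centered data on to a unit vector w
x <- scale(cbind(c(1, 2, 3, 4), c(2, 1, 4, 3)), center=TRUE, scale=FALSE)
w <- c(1, 1)/sqrt(2)               # a length-1 direction vector
lengths <- x %*% w                 # projection lengths x_i . w
projections <- lengths %*% t(w)    # projected vectors (x_i . w) w
# Each residual x_i - (x_i . w) w is orthogonal to its projection,
# so these inner products should all be (numerically) zero
rowSums((x - projections) * projections)
```

The orthogonality of residual and projection is exactly what makes the squared-error calculation on the following slides come out so cleanly.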
## Projections

> - $\vec{x}_i \cdot \vec{w} =$ length of $\vec{x}_i$'s projection on to the direction of $\vec{w}$
> - $(\vec{x}_i \cdot \vec{w})\vec{w} =$ the actual projected vector

##

```{r, echo=FALSE}
# Make an arc
demo.theta <- runif(10, min=0, max=pi/2)
demo.x <- cbind(cos(demo.theta), sin(demo.theta))
# Center the coordinates
demo.x <- scale(demo.x, center=TRUE, scale=FALSE)
# Plot the points
plot(demo.x, xlab=expression(x^1), ylab=expression(x^2),
     xlim=c(-1,1), ylim=c(-1,1),
     main="Projections of data on to an arbitrary vector")
# Pick a direction (not a very good one), by a unit vector
demo.w <- c(cos(-3.5*pi/8), sin(-3.5*pi/8))
# Draw an arrow for the unit vector
arrows(0, 0, demo.w[1], demo.w[2], col="blue")
text(demo.w[1], demo.w[2], pos=4, labels=expression(w))
# Draw a dashed line for the whole line
abline(0, b=demo.w[2]/demo.w[1], col="blue", lty="dashed")
# Get the length of the projection of each data point on to the line
projection.lengths <- demo.x %*% demo.w
# Get the actual projections
projections <- projection.lengths %*% demo.w
# Draw those as points along the line
points(projections, pch=16, col="blue")
# Draw lines from each data point to its projection
segments(x0=demo.x[,1], y0=demo.x[,2], x1=projections[,1], y1=projections[,2],
         col="grey")
```

## How well does the projection approximate the original?

Do it for one vector first:
\begin{eqnarray}
{\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & = & \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\
& = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\
\nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\
& = & {\|\vec{x_i}\|}^2 - 2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\
& = & {\|\vec{x_i}\|}^2 - (\vec{w}\cdot\vec{x_i})^2
\end{eqnarray}
(the last step uses $\vec{w}\cdot\vec{w} = 1$)

## How well does the projection approximate the original?
- Add up across all the data vectors:
\begin{eqnarray}
MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 - {(\vec{w}\cdot\vec{x_i})}^2}\\
& = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} - \sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right)
\end{eqnarray}
- The first term doesn't depend on $\vec{w}$, so it doesn't matter for minimizing
- So we want to maximize
$$L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}$$

## Minimizing MSE is maximizing variance

\begin{eqnarray}
L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\
& = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}}
\end{eqnarray}
(using $\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}$)

But
$$\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0$$
(Why? Remember that we centered the data)

## Minimizing MSE is maximizing variance

so
$$L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}}$$

## Minimizing MSE is maximizing variance

> The direction which gives us the best approximation of the data is the direction with the greatest variance

## OK, how do we find this magic direction?

Matrix form: all the _lengths_ of projections $= \mathbf{x}\mathbf{w}$, an $[n\times 1]$ matrix
\begin{eqnarray}
\SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\
& = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\
& = & \frac{1}{n} \w^T \X^T \X \w\\
& = & \w^T \frac{\X^T \X}{n} \w
\end{eqnarray}
Fact: $\V \equiv \frac{\X^T \X}{n} =$ sample covariance matrix of the vectors

## OK, how do we find this magic direction?
- We need to maximize
$$\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w$$
- Constraint: $\vec{w}$ has length 1 $\Leftrightarrow$ $\w^T \w = 1$
- Add a **Lagrange multiplier** $\lambda$ and take derivatives
\begin{eqnarray}
\mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w - 1)\\
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w - 1)\\
\frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w
\end{eqnarray}
- Set derivatives to zero:
\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}

## The magic direction is an eigenvector

\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}

> THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution,
$$\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda$$
so the maximum variance is the largest eigenvalue, attained by the _leading_ eigenvector of $\V$

## About the sample covariance matrix

$\V$ is a special matrix: symmetric and non-negative definite:

- Eigenvalues are all real (b/c symmetric)
- If $\lambda_i \neq \lambda_j$, then $v_i \perp v_j$ (b/c symmetric)
- Eigenvectors form a basis, and (can be chosen to be) orthonormal
- All $\lambda_i \geq 0$ (b/c non-negative definite)

\begin{eqnarray}
\text{Lead eigenvector of } \V & = & 1^{\mathrm{st}} \text{ principal component}\\
& = & \text{Direction of maximum variance}\\
& = & \text{Best 1D approximation to the data}
\end{eqnarray}

##

```{r, echo=FALSE}
plot(demo.x, xlab=expression(x^1), ylab=expression(x^2),
     xlim=c(-1,1), ylim=c(-1,1))
# Draw an arrow for the unit vector
arrows(0, 0, demo.w[1], demo.w[2], col="blue")
text(demo.w[1], demo.w[2], pos=4, labels="arbitrary w")
# Draw a dashed line for the whole line
abline(0, b=demo.w[2]/demo.w[1], col="blue", lty="dashed")
# Get the length of the projection of each data point on to the line
projection.lengths <- demo.x %*% demo.w
# Get the actual projections
projections <- projection.lengths %*% demo.w
# Draw those as points along the line
points(projections, pch=16, col="blue")
# Draw lines from each data point to its projection
segments(x0=demo.x[,1], y0=demo.x[,2], x1=projections[,1], y1=projections[,2],
         col="grey")
# Find the variance-covariance matrix of the data
v <- var(demo.x)
# First principal component is the leading eigenvector
pc.1 <- eigen(v)$vectors[,1]
# Draw the arrow and the dashed line, but in a different color
arrows(0, 0, pc.1[1], pc.1[2], col="red")
abline(0, b=pc.1[2]/pc.1[1], col="red", lty="dashed")
text(pc.1[1], pc.1[2], pos=4, labels="PC1")
# Get the projection lengths and actual projections
projection.lengths <- demo.x %*% pc.1
projections <- projection.lengths %*% pc.1
# Draw the points
points(projections, pch=16, col="red")
segments(x0=demo.x[,1], y0=demo.x[,2], x1=projections[,1], y1=projections[,2],
         col="pink")
legend("topright",
       legend=c("Data", "Projection on arbitrary w", "Projection on PC1",
                "Error with arbitrary w", "Error with PC1"),
       pch=c(1, 16, 16, NA, NA),
       lty=c("blank", "blank", "blank", "solid", "solid"),
       col=c("black", "blue", "red", "grey", "pink"))
```

## Multiple principal components

- What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
- Intuition: take the direction of maximum variance $\perp$ the first principal component
- Then the direction of maximum variance $\perp$ the first two principal components
- These are the eigenvectors of $\V$, in order of decreasing $\lambda$
- Gory details go at the end of these slides

## Some properties of the eigenvalues

- All eigenvalues $\geq 0$
- In general, $p$ non-zero eigenvalues
- If the data are _exactly_ in a $q$-dimensional subspace, then exactly $q$ non-zero eigenvalues
- If $n < p$, at most $n$ non-zero eigenvalues
    + Two points define a line, three define a plane, ...
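The eigenvector claim is also easy to sanity-check numerically. A minimal sketch on made-up data (none of the object names below come from the slides): the projected variance along the leading eigenvector matches the largest eigenvalue, and beats the variance along any random unit direction.

```r
set.seed(467)
# Made-up centered data: n = 200 points in p = 3 correlated dimensions
x <- scale(matrix(rnorm(600), ncol=3) %*% matrix(rnorm(9), 3, 3),
           center=TRUE, scale=FALSE)
v <- crossprod(x)/nrow(x)          # sample covariance (1/n) x^T x
lead.vec <- eigen(v)$vectors[,1]   # leading eigenvector = PC1
# Sample variance of projections along a (normalized) direction w
var.along <- function(w) { w <- w/sqrt(sum(w^2)); mean((x %*% w)^2) }
# Variance along PC1 should equal the largest eigenvalue...
c(var.along(lead.vec), eigen(v)$values[1])
# ...and no random unit vector should do better
random.vars <- replicate(1000, var.along(rnorm(3)))
max(random.vars) <= var.along(lead.vec)
```

This is just the statement $\w^T \V \w \leq \lambda_1$ for all unit vectors $\w$, checked by simulation.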
## Some properties of the PCs

- The principal components are orthonormal
    + $\vec{w}_i \cdot \vec{w}_i = 1$
    + $\vec{w}_i \cdot \vec{w}_j = 0$ (unless $i=j$)
    + $\w^T\w = \mathbf{I}$
- PC1 is the direction of maximum variance through the data
    + That variance is $\lambda_1$, the biggest eigenvalue of $\V$
- PC $i+1$ is the direction of maximum variance $\perp$ PC1, PC2, $\ldots$ PC $i$
    + That variance is $\lambda_{i+1}$

## Some properties of PC scores

$$\S = \X \w$$

- Average score on each PC $=0$ (b/c we centered the data)
\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S
\end{eqnarray}

Show that this $=\mathbf{\Lambda}$ (Hint: $\V = \w \mathbf{\Lambda} \w^T$)

## Some properties of PC scores

$$\S = \X \w$$

- Average score on each PC $=0$ (b/c we centered the data)
\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\
& = & \frac{1}{n} (\X\w)^T(\X\w)\\
& = & \frac{1}{n}\w^T \X^T \X \w\\
& = & \w^T \V \w\\
& = & \w^T (\w \mathbf{\Lambda} \w^T) \w\\
& = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\
& = & \mathbf{\Lambda}
\end{eqnarray}
- Variance of score on PC $i$ $=\lambda_i$ (by construction)
- Covariance of score on PC $i$ with score on PC $j$ $=0$

## Some properties of PCA as a whole

- If we use all $p$ principal components, we have the **eigendecomposition** of $\V$:
$$\V = \w \mathbf{\Lambda} \w^T$$
$\mathbf{\Lambda}=$ diagonal matrix of the eigenvalues $\lambda_1, \ldots \lambda_p$
- If we use all $p$ principal components, $\X = \S\w^T$
- If we use only the top $q$ PCs, we get:
    + the best rank-$q$ approximation to $\V$
    + the best dimension-$q$ approximation to $\X$

## Another way to think about PCA

- The original coordinates are correlated
- There is always another coordinate system with _uncorrelated_ coordinates
- We're rotating to that coordinate system
    + Rotating to new coordinates $\Rightarrow$ multiplying by an orthogonal matrix
    + That matrix is $\mathbf{w}$
    + The new coordinates are the scores

## Interpreting PCA results

> - PCs are linear combinations of the original coordinates
>     + Good idea to scale variables with different units of measurement first
>     + $\Rightarrow$ PCs change as you add or remove coordinates
>     + Put in 1000 measures of education and PC1 is education...
> - Very tempting to **reify** the PCs
>     + i.e., to "make them a thing"
>     + sometimes totally appropriate...
>     + sometimes not at all appropriate...
>     + Be very careful when the _only_ evidence is the PCA

## PCA is exploratory analysis, not statistical inference

> - We assumed no model
>     + Pro: It's the best linear approximation, no matter what
>     + Con: doesn't tell us how much our results are driven by noise
> - Prediction: PCA predicts nothing
> - Inference: _If_ $\V \rightarrow \Var{X}$ _then_ PCs $\rightarrow$ eigenvectors of $\Var{X}$
>     + But PCA doesn't **need** this assumption
>     + Doesn't tell us about uncertainty
> - Next time, we'll look at a PCA-like _model_: factor models

## In R

- `prcomp` is the best built-in PCA command
- Slightly complicated object returned
- Understand it with a real data example from a recent, high-profile paper

## Example: The axis of historical complexity

@Turchin-et-al-single-axis-of-complexity

```{r, include=FALSE}
soccomp <- read.csv("http://www.stat.cmu.edu/~cshalizi/dst/18/hw/03/soccomp.irep1.csv")
complexities <- 4:12 # Column numbers
```

`r length(complexities)` variables measuring social complexity, over time, for dozens of locations across the world

Dates from `r min(soccomp$Time)` to `r max(soccomp$Time)`

```{r}
summary(soccomp[,complexities], digits=2)
```

## Example: The axis of historical complexity

```{r}
soccomp.pca <- prcomp(soccomp[,complexities], scale=TRUE)
```

What are the parts of the return value?
```{r}
str(soccomp.pca)
```

- `sdev` = standard deviations along the PCs $=\sqrt{\lambda_i}$
- `rotation` = matrix of eigenvectors $=\w$
- `x` = scores on all the PCs $=\S$

## Example: The axis of historical complexity

```{r}
# R^2 for k components = sum(first k eigenvalues)/sum(all eigenvalues)
# cumsum(x) returns the cumulative sums along the vector x
plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2),
     xlab="Number of components used", ylab=expression(R^2), ylim=c(0,1))
# Add the lines indicating the desired levels
abline(h=0.75, lty="dashed")
abline(h=0.90, lty="dotted")
```

_One_ principal component already keeps more than 75% of the variance (3 components keep just under 90%)

## Example: The axis of historical complexity

```{r}
# Plot the coordinates/weights/loadings of PC1
# But suppress the usual labels on the horizontal axis
plot(soccomp.pca$rotation[,1], xlab="", main="PCs of the complexity measures",
     xaxt="n", ylab="PC weight", ylim=c(-1,1), type="l")
# Label the horizontal axis by the names of the variables
# The las option turns the axis labels on their side (so they don't overlap
# each other)
axis(side=1, at=1:9, labels=colnames(soccomp)[complexities], las=2,
     cex.axis=0.5)
# Now add PC2 and PC3, in distinct colors
lines(soccomp.pca$rotation[,2], col="red")
lines(soccomp.pca$rotation[,3], col="blue")
# Legend
legend("bottomright", legend=c("PC1", "PC2", "PC3"), lty="solid",
       col=c("black", "red", "blue"))
# Guide-to-the-eye horizontal line at 0
abline(h=0, col="grey")
```

- PC1: Nearly equal positive weight on all variables
    + High scores: large population, large areas, large capital cities, much hierarchy, very differentiated, etc.
    + Low scores: small _and_ unsophisticated
- PC2: negative weight on population, area, levels of hierarchy; positive weight on gov't, infrastructure, information, money
    + High scores: small but sophisticated
    + Low scores: big but dumb

## Example: The axis of historical complexity

```{r}
# Add scores to data frame
soccomp$PC1 <- soccomp.pca$x[,1]
# Plot vs. time (all areas)
with(soccomp, plot(PC1 ~ Time))
```

## Example: The axis of historical complexity

```{r, echo=FALSE}
# Set up a grid of smaller plots
par(mfrow=c(2,3))
# We're doing the same thing to each area, so list the areas we want...
desired.NGAs <- c("Cahokia", "Kachi Plain", "Latium",
                  "Middle Yellow River Valley", "Niger Inland Delta", "Susiana")
# ... and then loop over them, making a plot of PC1 against time
for (the.NGA in desired.NGAs) {
    with(soccomp[soccomp$NGA==the.NGA,],
         plot(PC1 ~ Time, ylim=c(-6, 4), main=paste(the.NGA)))
    # ylim here enforces a common vertical scale for all plots --- optional,
    # but makes them easier to compare
}
# Turn off the grid of smaller plots for succeeding code chunks
par(mfrow=c(1,1))
```

- Illinois, North India, Italy around Rome, central China, West Africa, Persia/Iran
- Spot the collapses of civilization

## Example: The axis of historical complexity

```{r}
plot(soccomp.pca$x[,1], soccomp.pca$x[,2],
     xlab="Score on PC1", ylab="Score on PC2")
```

- Correlation of scores between PC1 and PC2 is `r signif(cor(soccomp.pca$x[,1], soccomp.pca$x[,2]))`
- Uncorrelated but _dependent_

## Some more data-mining-y applications

- Latent semantic indexing
- Multidimensional scaling
- Netflix

## Latent semantic indexing

[@Indexing-by-latent-semantic-analysis;@solution-to-platos-problem]

> - Build a bag-of-words representation of our documents, $p \approx 10^4$ or even $10^5$
> - Do PCA
> - Keep $q \ll p$ components
> - Similarity search on the components
>     + Project new queries on to those components

## LSI for the New York _Times_ stories

```{r}
source("http://www.stat.cmu.edu/~cshalizi/dm/19/hw/01/nytac-and-bow.R")
art.stories <- read.directory("nyt_corpus/art")
music.stories <- read.directory("nyt_corpus/music")
art.BoW.list <- lapply(art.stories, table)
music.BoW.list <- lapply(music.stories, table)
nyt.BoW.frame <- make.BoW.frame(c(art.BoW.list, music.BoW.list),
    row.names=c(paste("art", 1:length(art.BoW.list), sep="."),
                paste("music", 1:length(music.BoW.list), sep=".")))
nyt.nice.frame <- div.by.euc.length(idf.weight(nyt.BoW.frame))
nyt.pca <- prcomp(nyt.nice.frame)
```

## LSI for the New York _Times_ stories

First component:
```{r}
nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[,1], decreasing=TRUE)[1:30], 2)
signif(sort(nyt.latent.sem[,1], decreasing=FALSE)[1:30], 2)
```

- What's up with "p", "m", personal pronouns?

## LSI for the New York _Times_ stories

Second component:
```{r}
signif(sort(nyt.latent.sem[,2], decreasing=TRUE)[1:30], 2)
signif(sort(nyt.latent.sem[,2], decreasing=FALSE)[1:30], 2)
```

## Visualization

```{r}
story.classes <- c(rep("art", times=length(art.stories)),
                   rep("music", times=length(music.stories)))
plot(nyt.pca$x[,1:2],
     pch=ifelse(story.classes=="music", "m", "a"),
     col=ifelse(story.classes=="music", "blue", "red"))
```

## Multidimensional scaling

- Given: High-dimensional vectors with known distances
- Desired: Low-dimensional vectors with nearly the same distances
- One approach: PCA
    + Because: triangle inequality
- So we've just done MDS

## Netflix

- The data: user ratings of movies (1--5), say $n$ users by $p$ movies
    + (Really, _accounts_ rather than _users_...)
- Are users features for movies, or are movies features for users?
- How do we handle missing values?
## Low-rank approximation

- If we use _all_ the scores, we just re-express the data:
$$\X = \S\w^T = (\X\w) \w^T$$
- Suppose we just keep the top $q$ PCs, so $\w_{q}$ is $p\times q$
$$\S_{q} = \X\w_{q} ~ [n\times q]$$
- Approximation to the data:
$$\X_{q} = \S_{q} \w_{q}^T ~ [n\times q][q\times p]$$
- $\X_q$ is a rank-$q$ matrix, and it's the closest rank-$q$ matrix to $\X$

## Low-rank approximation

- Given: an $[n\times p]$ matrix $\mathbf{x}$ and a desired rank $q < \min(n,p)$
- Desired: an $[n\times q]$ matrix $\mathbf{f}$ and a $[q\times p]$ matrix $\mathbf{g}$ such that $\mathbf{x} \approx \mathbf{f} \mathbf{g}$
- If we are given the _complete_ matrix $\mathbf{x}$, the solution comes from a generalized eigendecomposition of $\mathbf{x}$ called the **singular value decomposition**
- If we don't have complete data, find $\mathbf{f}$ and $\mathbf{g}$ by numerical minimization of
$$\sum_{(i,j) ~ \mathrm{observed}}{\left(x_{ij} - \sum_{k=1}^{q}{f_{ik} g_{kj}}\right)^2}$$
    + (Usually alternate between minimizing over $\mathbf{f}$ and minimizing over $\mathbf{g}$, but there are lots of tricks)
- Once we have $\mathbf{f}$ and $\mathbf{g}$, we get a _prediction_ for the unseen entries in $\mathbf{x}$
    + What is the implicit model here?
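The alternating minimization can be sketched in a few lines of R (a toy illustration on made-up data, not anyone's production recommender): with $\mathbf{g}$ fixed, each row of $\mathbf{f}$ solves an ordinary least-squares problem over that row's observed entries, and symmetrically for the columns of $\mathbf{g}$.

```r
set.seed(667)
n <- 30; p <- 20; q <- 2
# Made-up rank-2 "ratings", with about half the entries hidden
x.true <- matrix(rnorm(n*q), n, q) %*% matrix(rnorm(q*p), q, p)
observed <- matrix(runif(n*p) < 0.5, n, p)   # mask of observed entries
x <- ifelse(observed, x.true, NA)
# Random starting values for the factors
f <- matrix(rnorm(n*q), n, q); g <- matrix(rnorm(q*p), q, p)
for (iteration in 1:100) {
    # Fix g; each row of f is a least-squares fit to that row's observations
    for (i in 1:n) {
        obs <- observed[i,]
        f[i,] <- coef(lm.fit(t(g[, obs, drop=FALSE]), x[i, obs]))
    }
    # Fix f; each column of g is a least-squares fit to that column's observations
    for (j in 1:p) {
        obs <- observed[,j]
        g[,j] <- coef(lm.fit(f[obs, , drop=FALSE], x[obs, j]))
    }
}
# RMSE of the predictions on the held-out (unobserved) entries
sqrt(mean((x.true[!observed] - (f %*% g)[!observed])^2))
```

Each alternating step can only lower the observed-entry squared error, so the procedure converges, though (unlike the complete-data SVD) not necessarily to the global minimum.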
## Back to Netflix

[@Statistical-significance-of-netflix]

> - Baseline (predict the mean for everything): $RMSE = 1.1296$
> - Netflix's original in-house algorithm: $RMSE = 0.9525$
> - Netflix challenge: beat that by 10% for a $1 million prize
> - SVD _alone_: $RMSE = 0.9167$ (a `r signif(100*(0.9525-0.9167)/(0.9525), 3)`% improvement out of the box)

## Summing up

- We have multivariate data $\X$ (dimension $= [n\times p]$)
- We want the best $q$-dimensional _linear_ approximation
- Solution: **Principal components analysis**
    + Take the $q$ leading eigenvectors of $\V \equiv \frac{1}{n}\X^T \X =$ the sample/empirical covariance matrix
    + These eigenvectors = the principal components = directions of largest variance = biggest contrasts within the data
    + Project the data on to the principal components
    + Equivalently: rotate to new, uncorrelated coordinates
- Assemble the eigenvectors into $\w_q$ ($[p\times q]$)
    + **Scores** for the data are $\X\w_q \equiv \S_q$ ($[n\times q]$)
    + **Approximations** are $(\X\w_q) \w_q^T = \S_q \w_q^T$ ($[n\times p]$)
- No inference or prediction: that comes next

## Backup: Gory details of PCA in matrix form

Use $k$ directions, collected in a $p\times k$ matrix $\w$

_Require:_ $\mathbf{w}^T\mathbf{w} = \mathbf{I}$, i.e., the basis vectors are orthonormal

$\X \w =$ matrix of projection lengths $[n\times k]$

$\X \w \w^T =$ matrix of projected _vectors_ $[n\times p]$

$\X - \X \w \w^T =$ matrix of vector residuals $[n\times p]$

$(\X-\X\w\w^T)(\X-\X\w\w^T)^T =$ matrix of inner products of the vector residuals $[n\times n]$

$\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =$ sum of squared errors $[1\times 1]$

## Backup: The gory details

\begin{eqnarray}
MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\
& = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\
& = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\
& = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)}
\end{eqnarray}

so we need to maximize $\frac{1}{n}\tr{(\X\w\w^T\X^T)}$

## Backup: The gory details

"trace is cyclic", so
$$\tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)}$$
so we want to maximize
$$\tr{\left(\w^T \frac{\X^T \X}{n}\w\right)}$$
under the constraint
$$\w^T \w = \mathbf{I}$$

This is the same form we saw before, so it has the same sort of solution: each column of $\w$ must be an eigenvector of $\V$.

## Backup: The Lagrange multiplier trick

- Want to solve
$$\max_{w}{L(w)}$$
with the **constraint** $f(w) = c$
- Option I: Learn techniques for constrained optimization
    + Drawback: Who wants to major in OR?
- Option II: Use $f(w)=c$ to eliminate one coordinate of $w$
    + Then $w=g(v,c)$ for some function $g$ and _unconstrained_ $v$
    + Do $\max_{v}{L(g(v,c))}$
    + An unconstrained problem with one less variable
    + Drawback: the math is hard!
- Option III: Solve an unconstrained problem with _more_ variables
    + Drawback: Would only occur to a French mathematician

## Backup: The Lagrange multiplier trick

- Define a **Lagrangian**
$$\mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c)$$
    + $=L(w)$ when the constraint holds
- $\lambda$ is the **Lagrange multiplier** which **enforces** the constraint $f(w)=c$
- Now do an unconstrained optimization over $w$ and $\lambda$:
$$\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$$

## Backup: The Lagrange multiplier trick

$$\max_{w, \lambda}{\mathcal{L}(w,\lambda)}$$

- Take derivatives:
\begin{eqnarray}
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\
\frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w}
\end{eqnarray}
- Set to 0 at the maximum:
\begin{eqnarray}
f(w) & = & c\\
\frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w}
\end{eqnarray}
    + We've automatically recovered the constraint!
- 1 equation per unknown $\Rightarrow$ Solve for $\lambda$, $w$

## Backup: The Lagrange multiplier trick

- More equality constraints $\Rightarrow$ more Lagrange multipliers
- Inequality constraints, $g(w) \leq d$, are trickier
- Is the unconstrained maximum inside the **feasible set**?
    + Yes: problem solved
    + No: the constrained maximum is on the boundary
    + The boundary is _usually_ $g(w) = d$ $\Rightarrow$ treat it like an equality
    + There are subtleties; sometimes we need to learn some OR

## Backup: The Lagrange multiplier trick

- Why not do this instead?
$$\max_{\lambda, w}{L(w) + \lambda{(f(w)-c)}^2}$$

## References