\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

Previously…

Represent items in our database as vectors of numerical features
Want to look for patterns in those features
First, or as an example, low-dimensional structure in the vectors

What the data looks like

Data is $\mathbf{x}$, an $n\times p$ matrix
- Could be multivariate data at $n$ time points
- Or multivariate data at $n$ locations
- Or scalar data at $n$ time points and $p$ locations
- Write $\vec{x}_i$ for row $i$ ($1\times p$ matrix)
First, center the data, just to reduce book-keeping
- i.e., subtract the mean of each column
- Optional: scale each column to equal variance
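
A minimal sketch of the centering and scaling step, on made-up data (the matrix `x`, its size, and the seed are illustrative assumptions, not from the slides):

```r
# Hypothetical toy data: n = 5 observations of p = 3 features
set.seed(1)
x <- matrix(rnorm(15, mean = 10, sd = 2), nrow = 5, ncol = 3)

# Subtract each column's mean; scale = FALSE leaves the variances alone
x.centered <- scale(x, center = TRUE, scale = FALSE)

# Optional: also divide each column by its standard deviation
x.scaled <- scale(x, center = TRUE, scale = TRUE)

colMeans(x.centered)     # all zero, up to rounding error
apply(x.scaled, 2, sd)   # all one
```

`scale()` is base R; it stores the column means (and standard deviations) it removed as attributes of the result, so the step can be undone later.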

Finding the “principal” component

We don’t want to keep track of $p$ dimensions
We want one dimension
We also don’t want to distort the data too much

Pick a direction in the $p$-dimensional space, represented by a length-1 vector $\vec{w}$
What’s the best $\vec{w}$?

Projections

$\vec{x}_i \cdot \vec{w} =$ length of $\vec{x}_i$’s projection on to the direction of $\vec{w}$
$(\vec{x}_i \cdot \vec{w})\vec{w} =$ the actual projected vector

How well does the projection approximate the original?

Do it for one vector first:

\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \vec{x_i}\cdot\vec{x_i} - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]
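
A quick numerical check of the projection formulas and of the last line of the derivation, using one made-up vector and one made-up unit direction (both are assumptions for illustration):

```r
x.i <- c(3, 4)                  # a made-up data vector
w <- c(1, 1) / sqrt(2)          # a made-up direction, scaled to length 1

proj.length <- sum(x.i * w)     # x_i . w, the length of the projection
proj.vector <- proj.length * w  # (x_i . w) w, the projected vector
residual <- x.i - proj.vector   # what the projection misses

sum(residual * w)               # residual is perpendicular to w
sum(residual^2)                 # squared error of the approximation
sum(x.i^2) - proj.length^2      # same number, from the derivation
```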

How well does the projection approximate the original?

Add up across all the data vectors:

\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]

First bit doesn’t depend on $\vec{w}$, so doesn’t matter for minimizing
So we want to maximize \[ L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2} \]

Minimizing MSE is maximizing variance

\[\begin{eqnarray} L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}\]

($\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}$)

But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \] (Why?)

Minimizing MSE is maximizing variance

so \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]

Minimizing MSE is maximizing variance

The direction which gives us the best approximation of the data is the direction with the greatest variance
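
One way to see the trade-off numerically: a sketch on simulated, centered 2-D data (the data, the grid of angles, and all variable names are assumptions for illustration). For every candidate direction, MSE plus projected variance should add up to the same total.

```r
set.seed(2)
n <- 200
# Made-up correlated 2-D data, then centered
x <- cbind(rnorm(n), rnorm(n))
x[, 2] <- 0.8 * x[, 1] + 0.3 * x[, 2]
x <- scale(x, center = TRUE, scale = FALSE)

# Candidate directions: unit vectors at angles 0, 1, ..., 179 degrees
angles <- (0:179) * pi / 180
mse <- numeric(length(angles))
proj.var <- numeric(length(angles))
for (k in seq_along(angles)) {
  w <- c(cos(angles[k]), sin(angles[k]))
  s <- as.vector(x %*% w)                      # projection lengths
  mse[k] <- mean(rowSums((x - outer(s, w))^2)) # mean squared residual
  proj.var[k] <- mean(s^2)                     # variance of the projections
}

range(mse + proj.var)       # constant: the mean squared length of the data
mean(rowSums(x^2))          # that same constant, computed directly
angles[which.min(mse)] * 180 / pi   # best direction, in degrees
```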

OK, how do we find this magic direction?

Matrix form: all lengths of projections $=\mathbf{x}\mathbf{w}$ $[n\times 1]$

\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}\]

Fact: $\V \equiv \frac{\X^T \X}{n} =$ sample covariance matrix of the vectors

OK, how do we find this magic direction?

We need to maximize \[\begin{equation} \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w \end{equation}\]
Constraint: $\vec{w}$ has length 1 $\Leftrightarrow$ $\w^T \w = 1$
Add a Lagrange multiplier $\lambda$ and take derivatives

\[\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & \w^T \w -1\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}\]

Set derivatives to zero:

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

The magic direction is an eigenvector

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the maximum variance is the largest eigenvalue of $\V$, and the direction achieving it is the corresponding (leading) eigenvector
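
A sketch checking this on simulated data (the mixing matrix, sample size, and seed are made up): the variance along the leading eigenvector should equal the largest eigenvalue, and no random unit vector should beat it.

```r
set.seed(3)
n <- 500
# Made-up correlated data in p = 3 dimensions, then centered
mix <- matrix(c(2, 0, 0,  1, 1, 0,  0.5, 0.5, 0.5), nrow = 3)
x <- scale(matrix(rnorm(n * 3), n, 3) %*% mix, center = TRUE, scale = FALSE)

v <- crossprod(x) / n                # V = X^T X / n
eig <- eigen(v, symmetric = TRUE)    # eigenvalues are returned in decreasing order
w1 <- eig$vectors[, 1]               # leading eigenvector, already length 1
lambda1 <- eig$values[1]

var.along.w1 <- mean((x %*% w1)^2)   # variance of the projections on w1

# Variance along 1000 random unit vectors, for comparison
random.dirs <- matrix(rnorm(3 * 1000), nrow = 3)
random.dirs <- sweep(random.dirs, 2, sqrt(colSums(random.dirs^2)), "/")
var.along.random <- colMeans((x %*% random.dirs)^2)

c(lambda1 = lambda1, var.along.w1 = var.along.w1,
  best.random = max(var.along.random))
```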

About the sample covariance matrix

$\V$ is a special matrix: symmetric and non-negative definite:

Eigenvalues are all real (b/c symmetric)
If $\lambda_i \neq \lambda_j$, then $v_i \perp v_j$ (b/c symmetric)
Eigenvectors form a basis, and (can be chosen to be) orthonormal
All $\lambda_i \geq 0$ (b/c non-negative definite)

\[\begin{eqnarray} \text{Lead eigenvector of } \V & = & 1^{\mathrm{st}} \text{ principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}\]

Multiple principal components

What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
Intuition: take the direction of maximum variance $\perp$ the first principal component
Then direction of maximum variance $\perp$ the first two principal components
These are the eigenvectors of $\V$, in order of decreasing $\lambda$
Gory details go at the end of these slides

Some properties of the eigenvalues

All eigenvalues $\geq 0$
In general, $p$ non-zero eigenvalues
If the data are exactly in a $q$-dimensional subspace, then exactly $q$ non-zero eigenvalues
If $n < p$, at most $n-1$ non-zero eigenvalues (centering uses up one dimension)
- Two points define a line, three define a plane, …
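
A small simulated illustration of the $n < p$ case (the sizes and seed are assumptions): with 4 centered points in 10 dimensions, the points span at most a 3-dimensional subspace, so only 3 eigenvalues are non-zero.

```r
set.seed(4)
n <- 4
p <- 10
# Made-up data with fewer observations than features, then centered
x <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)

lambda <- eigen(crossprod(x) / n, symmetric = TRUE)$values
signif(lambda, 3)

# Count eigenvalues that are non-zero beyond rounding error
n.nonzero <- sum(lambda > 1e-10)
n.nonzero
```

The zero eigenvalues typically come back as tiny numbers of either sign (around $10^{-16}$), which is why the count uses a small threshold rather than `lambda > 0`.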

Some properties of the PCs

The principal components are orthonormal
- $\vec{w}_i \cdot \vec{w}_i = 1$
- $\vec{w}_i \cdot \vec{w}_j = 0$ (unless $i=j$)
- $\w^T\w = \mathbf{I}$
PC1 is the direction of maximum variance through the data
- That variance is $\lambda_1$, biggest eigenvalue of $\V$
PC $i+1$ is the direction of maximum variance $\perp$ PC1, PC2, $\ldots$ PC $i$
- That variance is $\lambda_{i+1}$

Some properties of PC scores

\[ \S = \X \w \]

Average score on each PC $=0$ (b/c we centered the data)

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ \end{eqnarray}\]

Show that this $=\mathbf{\Lambda}$

(Hint: $\V = \w \mathbf{\Lambda} \w^T$)

Some properties of PC scores

\[ \S = \X \w \]

Average score on each PC $=0$ (b/c we centered the data)

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ & = & \frac{1}{n} (\X\w)^T(\X\w)\\ & = & \frac{1}{n}\w^T \X^T \X \w\\ & = & \w^T \V\w\\ & = & \w^T (\w \mathbf{\Lambda} \w^T) \w\\ & = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\ & = & \mathbf{\Lambda} \end{eqnarray}\]

Variance of score on PC $i$ $=\lambda_i$ (by construction)
Covariance of score on PC $i$ with score on PC $j$ $=0$
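
The same result can be checked numerically. A sketch on simulated data (the data-generating step and names are made up), using the slides' $1/n$ convention for the covariance:

```r
set.seed(5)
n <- 300
# Made-up correlated data in p = 4 dimensions, then centered
x <- matrix(rnorm(n * 4), n, 4) %*% matrix(runif(16), 4, 4)
x <- scale(x, center = TRUE, scale = FALSE)

eig <- eigen(crossprod(x) / n, symmetric = TRUE)
w <- eig$vectors               # columns = principal components
s <- x %*% w                   # scores, n x p

score.cov <- crossprod(s) / n  # (1/n) S^T S
signif(score.cov, 3)           # diagonal, up to rounding error
signif(eig$values, 3)          # the diagonal entries are the eigenvalues
colMeans(s)                    # average score on each PC is zero
```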

Some properties of PCA as a whole

If we use all $p$ principal components, we have the eigendecomposition of $\V$: \[ \V = \w \mathbf{\Lambda} \mathbf{w}^T \] $\mathbf{\Lambda}=$ diagonal matrix of eigenvalues $\lambda_1, \ldots \lambda_p$
If we use all $p$ principal components, \[ \X = \S\w^T \]
If we use only the top $q$ PCs, we get:
- the best rank-$q$ approximation to $\V$
- the best dimension-$q$ approximation to $\X$

Another way to think about PCA

The original coordinates are correlated
There is always another coordinate system with uncorrelated coordinates
We’re rotating to that coordinate system
- Rotating to new coordinates $\Rightarrow$ multiplying by an orthogonal matrix
- That matrix is $\mathbf{w}$
- The new coordinates are the scores

Interpreting PCA results

PCs are linear combinations of the original coordinates
- Good idea to scale variables with different units of measurement first
- $\Rightarrow$ PCs change as you add or remove coordinates
- Put in 1000 measures of education and PC1 is education…
Very tempting to reify the PCs
- i.e., to “make them a thing”
- sometimes totally appropriate…
- sometimes not at all appropriate…
- Be very careful when the only evidence is the PCA

PCA is exploratory analysis, not statistical inference

We assumed no model
- Pro: That’s the best linear approximation, no matter what
- Con: doesn’t tell us how much our results are driven by noise
Prediction: PCA predicts nothing
Inference: If $\V \rightarrow \Var{X}$ then PCs $\rightarrow$ eigenvectors of $\Var{X}$
- But PCA doesn’t need this assumption
- Doesn’t tell us about uncertainty
Next time, will look at a PCA-like model: factor models

In R

prcomp is the best built-in PCA command
Slightly complicated object returned
Understand it with a real data example from a recent, high-profile paper

Example: The axis of historical complexity

Turchin et al. (2018)

9 variables measuring social complexity, over time, for dozens of locations across the world

Dates from -9600 to 1900

summary(soccomp[, complexities], digits = 2)

##      PolPop       PolTerr          CapPop        levels      government  
##  Min.   :1.4   Min.   :-0.22   Min.   :1.4   Min.   :0.0   Min.   :0.00  
##  1st Qu.:4.2   1st Qu.: 3.65   1st Qu.:3.5   1st Qu.:1.8   1st Qu.:0.24  
##  Median :6.0   Median : 5.18   Median :4.3   Median :3.0   Median :0.62  
##  Mean   :5.5   Mean   : 4.78   Mean   :4.2   Mean   :2.9   Mean   :0.55  
##  3rd Qu.:6.8   3rd Qu.: 5.97   3rd Qu.:5.1   3rd Qu.:4.0   3rd Qu.:0.86  
##  Max.   :8.5   Max.   : 7.40   Max.   :6.3   Max.   :6.6   Max.   :1.00  
##     infrastr       writing         texts          money    
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.0  
##  1st Qu.:0.34   1st Qu.:0.26   1st Qu.:0.10   1st Qu.:1.8  
##  Median :0.75   Median :0.82   Median :0.93   Median :4.0  
##  Mean   :0.64   Mean   :0.65   Mean   :0.63   Mean   :3.4  
##  3rd Qu.:0.90   3rd Qu.:0.86   3rd Qu.:0.97   3rd Qu.:5.0  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :6.0

Example: The axis of historical complexity

soccomp.pca <- prcomp(soccomp[, complexities], scale = TRUE)

What are the parts of the return value?

str(soccomp.pca)

## List of 5
##  $ sdev    : num [1:9] 2.634 0.733 0.646 0.581 0.481 ...
##  $ rotation: num [1:9, 1:9] 0.351 0.32 0.339 0.341 0.332 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:9] 5.515 4.779 4.229 2.923 0.552 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ scale   : Named num [1:9] 1.59 1.561 1.112 1.449 0.325 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ x       : num [1:414, 1:9] -4.35 -4.24 -4.11 -3.67 -3.51 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

sdev = standard deviations along PCs $=\sqrt{\lambda_i}$
rotation = matrix of eigenvectors $=\w$
x = scores on all PCs $=\S$
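
These relationships can be verified on a stand-in data set. The sketch below uses a made-up data frame `toy` in place of `soccomp[, complexities]`. One caveat: `prcomp` divides by $n-1$ rather than $n$, so its eigenvalues differ from the slides' $\lambda_i$ by a factor of $n/(n-1)$.

```r
set.seed(7)
# Made-up data frame standing in for soccomp[, complexities]
toy <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
toy$b <- toy$a + 0.5 * toy$b
toy.pca <- prcomp(toy, scale. = TRUE)

z <- scale(toy)   # centered and scaled copy of the data, as a matrix
# With scaling, the covariance of z (divisor n - 1) is the correlation matrix
eig <- eigen(cor(toy), symmetric = TRUE)

max(abs(toy.pca$sdev^2 - eig$values))          # sdev^2 = eigenvalues
max(abs(z %*% toy.pca$rotation - toy.pca$x))   # x = scaled data %*% rotation
max(abs(toy.pca$center - colMeans(toy)))       # center = column means
max(abs(toy.pca$scale - sapply(toy, sd)))      # scale = column std. deviations
```

All four differences should be zero up to rounding error.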

Example: The axis of historical complexity

plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2), xlab = "Number of components used", 
    ylab = expression(R^2), ylim = c(0, 1))
abline(h = 0.75, lty = "dashed")
abline(h = 0.9, lty = "dotted")

One principal component already keeps more than 75% of the variance (3 components keep just under 90%)

Example: The axis of historical complexity

plot(soccomp.pca$rotation[, 1], xlab = "", main = "PCs of the complexity measures", 
    xaxt = "n", ylab = "PC weight", ylim = c(-1, 1), type = "l")
axis(side = 1, at = 1:9, labels = colnames(soccomp)[complexities], las = 2, 
    cex.axis = 0.5)
lines(soccomp.pca$rotation[, 2], col = "red")
lines(soccomp.pca$rotation[, 3], col = "blue")
legend("bottomright", legend = c("PC1", "PC2", "PC3"), lty = "solid", col = c("black", 
    "red", "blue"))
abline(h = 0, col = "grey")

PC1: Nearly equal positive weight on all variables
- High scores: large population, large areas, large capital cities, much hierarchy, very differentiated, etc.
- Low scores: small and unsophisticated
PC2: negative weight on population, area, levels of hierarchy, positive weight on gov’t, infrastructure, information, money
- High scores: small but sophisticated
- Low scores: big but dumb

Example: The axis of historical complexity

soccomp$PC1 <- soccomp.pca$x[, 1]
with(soccomp, plot(PC1 ~ Time))

Example: The axis of historical complexity

Illinois, North India, Italy around Rome, central China, West Africa, Persia/Iran
Spot the collapses of civilization

Example: The axis of historical complexity

plot(soccomp.pca$x[, 1], soccomp.pca$x[, 2], xlab = "Score on PC1", ylab = "Score on PC2")

Correlation of scores between PC1 and PC2 is $1.54062\times 10^{-14}$ (zero, up to rounding error)
Uncorrelated but dependent

Some more data-mining-y applications

Latent semantic indexing
Multidimensional scaling
Netflix

Latent semantic indexing (Deerwester et al. 1990; Landauer and Dumais 1997)

Build bag-of-words representation of our documents, $p \approx 10^4$ or even $10^5$
Do PCA
Keep $q \ll p$ components
Similarity search on components
- Project new queries on to those components
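
A toy sketch of the recipe. The 4-document, 4-word count matrix and the query are invented for illustration, and the normalization and IDF weighting a real LSI pipeline would use are skipped. `predict()` on a `prcomp` object centers the query the same way as the training data and projects it on to the components.

```r
# Made-up bag-of-words counts: two "music" documents, two "art" documents
bow <- matrix(c(3, 2, 0, 0,
                2, 3, 0, 1,
                0, 0, 3, 2,
                0, 1, 2, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("music", "opera", "painting", "museum")))
bow.pca <- prcomp(bow)

q <- 1                                           # keep q << p components
doc.scores <- bow.pca$x[, 1:q, drop = FALSE]

# A made-up query mentioning "music" and "opera" once each
query <- matrix(c(1, 1, 0, 0), nrow = 1,
                dimnames = list("query", colnames(bow)))
query.score <- predict(bow.pca, newdata = query)[, 1:q, drop = FALSE]

# Similarity search: rank documents by distance to the query in score space
dist.to.query <- abs(doc.scores[, 1] - query.score[1, 1])
names(sort(dist.to.query))
```

The two music documents should come out closest to the query; the sign of the component is arbitrary, but distances in score space do not depend on it.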

LSI for the New York Times stories

source("http://www.stat.cmu.edu/~cshalizi/dm/19/hw/01/nytac-and-bow.R")
art.stories <- read.directory("nyt_corpus/art")
music.stories <- read.directory("nyt_corpus/music")
art.BoW.list <- lapply(art.stories, table)
music.BoW.list <- lapply(music.stories, table)
nyt.BoW.frame <- make.BoW.frame(c(art.BoW.list, music.BoW.list), row.names = c(paste("art", 
    1:length(art.BoW.list), sep = "."), paste("music", 1:length(music.BoW.list), 
    sep = ".")))
nyt.nice.frame <- div.by.euc.length(idf.weight(nyt.BoW.frame))
nyt.pca <- prcomp(nyt.nice.frame)

LSI for the New York Times stories

First component:

nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[, 1], decreasing = TRUE)[1:30], 2)

##       music        trio     theater   orchestra   composers       opera 
##       0.110       0.084       0.083       0.067       0.059       0.058 
##    theaters           m    festival        east     program           y 
##       0.055       0.054       0.051       0.049       0.048       0.048 
##      jersey     players   committee      sunday        june     concert 
##       0.047       0.047       0.046       0.045       0.045       0.045 
##    symphony       organ     matinee   misstated instruments           p 
##       0.044       0.044       0.043       0.042       0.041       0.041 
##         X.d       april      samuel        jazz     pianist     society 
##       0.041       0.040       0.040       0.039       0.038       0.038

signif(sort(nyt.latent.sem[, 1], decreasing = FALSE)[1:30], 2)

##       she       her        ms         i      said    mother    cooper 
##    -0.260    -0.240    -0.200    -0.150    -0.130    -0.110    -0.100 
##        my  painting   process paintings        im        he       mrs 
##    -0.094    -0.088    -0.071    -0.070    -0.068    -0.065    -0.065 
##        me  gagosian       was   picasso     image sculpture      baby 
##    -0.063    -0.062    -0.058    -0.057    -0.056    -0.056    -0.055 
##   artists      work    photos       you    nature    studio       out 
##    -0.055    -0.054    -0.051    -0.051    -0.050    -0.050    -0.050 
##      says      like 
##    -0.050    -0.049

What’s up with “p”, “m”, personal pronouns?

LSI for the New York Times stories

Second component:

signif(sort(nyt.latent.sem[, 2], decreasing = TRUE)[1:30], 2)

##         art      museum      images     artists   donations     museums 
##       0.150       0.120       0.095       0.092       0.075       0.073 
##    painting         tax   paintings   sculpture     gallery  sculptures 
##       0.073       0.070       0.065       0.060       0.055       0.051 
##     painted       white    patterns      artist      nature     service 
##       0.050       0.050       0.047       0.047       0.046       0.046 
##  decorative        feet     digital      statue       color    computer 
##       0.043       0.043       0.043       0.042       0.042       0.041 
##       paris         war collections     diamond       stone     dealers 
##       0.041       0.041       0.041       0.041       0.041       0.040

signif(sort(nyt.latent.sem[, 2], decreasing = FALSE)[1:30], 2)

##          her          she      theater        opera           ms 
##       -0.220       -0.220       -0.160       -0.130       -0.130 
##            i         hour   production         sang     festival 
##       -0.083       -0.081       -0.075       -0.075       -0.074 
##        music      musical        songs        vocal    orchestra 
##       -0.070       -0.070       -0.068       -0.067       -0.067 
##           la      singing      matinee  performance         band 
##       -0.065       -0.065       -0.061       -0.061       -0.060 
##       awards    composers         says           my           im 
##       -0.058       -0.058       -0.058       -0.056       -0.056 
##         play     broadway       singer       cooper performances 
##       -0.056       -0.055       -0.052       -0.051       -0.051

Visualization

story.classes <- c(rep("art", times = length(art.stories)), rep("music", times = length(music.stories)))
plot(nyt.pca$x[, 1:2], pch = ifelse(story.classes == "music", "m", "a"), col = ifelse(story.classes == 
    "music", "blue", "red"))

Multidimensional scaling

Given: High-dimensional vectors with known distances
Desired: Low-d vectors with nearly the same distances
One approach: PCA
- Because: triangle inequality
So we’ve just done MDS
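
A sketch comparing the two on simulated data (the data and seed are assumptions). Base R's `cmdscale()` does classical MDS from a distance matrix; with Euclidean distances its coordinates should match the PCA scores up to the sign of each axis.

```r
set.seed(8)
# Made-up high-dimensional vectors: 30 points in 5 dimensions
x <- matrix(rnorm(30 * 5), 30, 5)

pca.scores <- prcomp(x)$x[, 1:2]          # scores on the top 2 PCs
mds.coords <- cmdscale(dist(x), k = 2)    # classical MDS in 2 dimensions

# Same coordinates, up to a sign flip of each axis
max(abs(abs(pca.scores) - abs(mds.coords)))

# Projecting can only shrink distances, never stretch them
d.high <- as.vector(dist(x))
d.low <- as.vector(dist(pca.scores))
all(d.low <= d.high + 1e-10)
cor(d.high, d.low)                        # how well distances are preserved
```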

Netflix

The data: user ratings of movies (1–5), say $n$ users by $p$ movies
- (Really, accounts rather than users…)
Are users features for movies, or are movies features for users?
How do we handle missing values?

Low-rank approximation

If we use all the scores, we just re-express the data: \[ \X = \S\w^T = (\X\w) \w^T \]
Suppose we just keep the top $q$ PCs, so $\w_{q}$ is $p\times q$ \[ \S_{q} = \X\w_{q} ~[n\times q] \]
Approximation to the data: \[ \X_{q} = \S_{q} \w_{q}^T ~ [n\times q][q\times p] \]
$\X_q$ is a rank-$q$ matrix, and it’s the closest rank-$q$ matrix to $\X$
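
A sketch of the rank-$q$ approximation on simulated data (sizes, seed, and names are made up). It also compares against the truncated singular value decomposition from base R's `svd()`, which should give the same matrix.

```r
set.seed(6)
n <- 50
p <- 6
q <- 2
# Made-up correlated data, then centered
x <- scale(matrix(rnorm(n * p), n, p) %*% matrix(rnorm(p * p), p, p),
           center = TRUE, scale = FALSE)

w <- eigen(crossprod(x) / n, symmetric = TRUE)$vectors
max(abs(x - (x %*% w) %*% t(w)))   # all p components: exact reconstruction

w.q <- w[, 1:q, drop = FALSE]      # p x q
s.q <- x %*% w.q                   # n x q scores
x.q <- s.q %*% t(w.q)              # n x p approximation, rank q
qr(x.q)$rank

# Truncated SVD gives the same approximation
sv <- svd(x)
x.svd <- sv$u[, 1:q, drop = FALSE] %*% diag(sv$d[1:q], nrow = q) %*%
  t(sv$v[, 1:q, drop = FALSE])
max(abs(x.q - x.svd))

# Squared approximation error = sum of the discarded squared singular values
c(sum((x - x.q)^2), sum(sv$d[-(1:q)]^2))
```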

Low-rank approximation

Given: $[n\times p]$ matrix $\mathbf{x}$, desired rank $q < \min\{n,p\}$
Desired: $[n\times q]$ matrix $\mathbf{f}$, $[q\times p]$ matrix $\mathbf{g}$ such that \[ \mathbf{x} \approx \mathbf{f} \mathbf{g} \]
If we are given the complete matrix $\mathbf{x}$, the solution comes from the singular value decomposition (SVD), a generalization of the eigendecomposition to rectangular matrices
If we don’t have complete data, find $\mathbf{f}$ and $\mathbf{g}$ by numerical minimization of \[ \sum_{(i,j) ~ \mathrm{observed}}{\left(x_{ij} - \sum_{k=1}^{q}{f_{ik} g_{kj}}\right)^2} \]
- (Usually alternate between minimizing over $\mathbf{f}$ and minimizing over $\mathbf{g}$ but lots of tricks)
Once we have $\mathbf{f}$ and $\mathbf{g}$, we get a prediction for the unseen entries in $\mathbf{x}$
- What is the implicit model here?
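
A bare-bones sketch of the alternating minimization, on a simulated matrix that really is rank $q$ with about 30% of its entries hidden. Everything here (sizes, seed, number of sweeps, the use of `qr.solve()` for the least-squares steps) is an illustrative assumption, and none of the "lots of tricks" (regularization, bias terms) are included.

```r
set.seed(9)
n <- 30
p <- 20
q <- 2
# Made-up rank-q matrix; hide roughly 30% of the entries
truth <- matrix(rnorm(n * q), n, q) %*% matrix(rnorm(q * p), q, p)
observed <- matrix(runif(n * p) > 0.3, n, p)
x <- ifelse(observed, truth, NA)

# Random starting values for the two factors
f <- matrix(rnorm(n * q), n, q)
g <- matrix(rnorm(q * p), q, p)

for (sweep.number in 1:50) {
  # Hold g fixed: each row of f is a least-squares fit to that row's observed entries
  for (i in 1:n) {
    obs <- observed[i, ]
    f[i, ] <- qr.solve(t(g[, obs, drop = FALSE]), x[i, obs])
  }
  # Hold f fixed: each column of g is a least-squares fit to that column's observed entries
  for (j in 1:p) {
    obs <- observed[, j]
    g[, j] <- qr.solve(f[obs, , drop = FALSE], x[obs, j])
  }
}

x.hat <- f %*% g
rmse.observed <- sqrt(mean((x.hat[observed] - truth[observed])^2))
rmse.hidden <- sqrt(mean((x.hat[!observed] - truth[!observed])^2))
rmse.baseline <- sqrt(mean(truth[!observed]^2))   # predicting 0 for everything
c(observed = rmse.observed, hidden = rmse.hidden, baseline = rmse.baseline)
```

The factors `f` and `g` are only determined up to an invertible $q\times q$ transformation, so only the product `f %*% g` is meaningful.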

Back to Netflix (Feuerverger, He, and Khatri 2012)

Baseline (predict the mean for everything): $RMSE = 1.1296$
Netflix’s original in-house algorithm: $RMSE =0.9525$
Netflix challenge: beat that by 10% for $1 million prize
SVD alone: $RMSE=0.9167$ (3.76% improvement out of the box)

Summing up

We have multivariate data $\X$ (dimension = $[n\times p]$)
We want the best $q$-dimensional linear approximation
Solution: Principal components analysis
- Take the $q$ leading eigenvectors of $\V \equiv \frac{1}{n}\X^T \X =$ sample/empirical covariance matrix
- These eigenvectors = the principal components = directions of largest variance = biggest contrasts within the data
- Project the data on to the principal components
- Equivalently: rotate to new, uncorrelated coordinates
Assemble the eigenvectors into $\w_q$ ($[p\times q]$)
- Scores for the data are $\X\w_q \equiv \S_q$ ($[n\times q]$)
- Approximations are $(\X\w_q) \w_q^T = \S_q \w_q^T$ ($[n\times p]$)
No inference or prediction: that comes next

Backup: Gory details of PCA in matrix form

Use $k$ directions in a $p\times k$ matrix $\w$

Require: $\mathbf{w}^T\mathbf{w} = \mathbf{I}$, the basis vectors are orthonormal

$\X \w =$ matrix of projection lengths $[n\times k]$

$\X \w \w^T =$ matrix of projected vectors $[n\times p]$

$\X - \X \w \w^T =$ matrix of vector residuals $[n\times p]$

$(\X-\X\w\w^T)(\X-\X\w\w^T)^T =$ matrix of inner products of vector residuals $[n\times n]$

$\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =$ sum of squared errors $[1\times 1]$

Backup: The gory details

\[\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}\]

so maximize $\frac{1}{n}\tr{(\X\w\w^T\X^T)}$

Backup: The gory details

“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]

This is the same form we saw before, so it has the same sort of solution: each column of $\w$ must be an eigenvector of $\V$.
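
A numerical sanity check of the trace argument, on simulated data (the data, $k = 2$, and the helper functions `objective` and `mse` are made up for illustration):

```r
set.seed(10)
n <- 200
p <- 5
k <- 2
# Made-up correlated data, then centered
x <- scale(matrix(rnorm(n * p), n, p) %*% matrix(rnorm(p * p), p, p),
           center = TRUE, scale = FALSE)
v <- crossprod(x) / n
eig <- eigen(v, symmetric = TRUE)

# tr(w^T V w) for a p x k matrix w with orthonormal columns
objective <- function(w) sum(diag(t(w) %*% v %*% w))
# MSE computed directly from the residual matrix X - X w w^T
mse <- function(w) sum((x - x %*% w %*% t(w))^2) / n

w.top <- eig$vectors[, 1:k]                        # top k eigenvectors
w.random <- qr.Q(qr(matrix(rnorm(p * k), p, k)))   # some other orthonormal basis

total <- sum(x^2) / n                              # (1/n) tr(X X^T)
c(objective(w.top), sum(eig$values[1:k]))          # equal
c(mse(w.top) + objective(w.top), total)            # MSE + objective = total
c(mse(w.top), mse(w.random))                       # eigenvectors do better
```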

Backup: The Lagrange multiplier trick

Want to solve \[ \max_{w}{L(w)} \] with constraint $f(w) = c$
Option I: Learn techniques for constrained optimization
- Drawback: Who wants to major in OR?
Option II: Use $f(w)=c$ to eliminate one coordinate of $w$
- Then $w=g(v,c)$ for some function $g$ and unconstrained $v$
- Do $\max_{v}{L(g(v,c))}$
- An unconstrained problem with one less variable
- Drawback: math is hard!
Option III: Solve an unconstrained problem with more variables
- Drawback: Would only occur to a French mathematician

Backup: The Lagrange multiplier trick

Define a Lagrangian \[ \mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c) \]
- $=L(w)$ when the constraint holds
$\lambda$ is the Lagrange multiplier which enforces the constraint $f(w)=c$
Now do an unconstrained optimization over $w$ and $\lambda$: \[ \max_{w, \lambda}{\mathcal{L}(w,\lambda)} \]

Backup: The Lagrange multiplier trick

\[ \max_{w, \lambda}{\mathcal{L}(w,\lambda)} \]

Take derivatives: \[\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\ \frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w} \end{eqnarray}\]
Set to 0 at the maximum: \[\begin{eqnarray} f(w) & = & c\\ \frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w} \end{eqnarray}\]
- We’ve automatically recovered the constraint!
1 equation per unknown $\Rightarrow$ Solve for $\lambda$, $w$
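
A small worked case, with an objective and constraint invented purely for illustration (not the PCA problem): maximize $L(w) = w_1 + w_2$ subject to $f(w) = w_1^2 + w_2^2 = 1$.

\[\begin{eqnarray} \mathcal{L}(w,\lambda) & = & w_1 + w_2 - \lambda(w_1^2 + w_2^2 - 1)\\ \frac{\partial \mathcal{L}}{\partial w_1} & = & 1 - 2\lambda w_1\\ \frac{\partial \mathcal{L}}{\partial w_2} & = & 1 - 2\lambda w_2\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(w_1^2 + w_2^2 - 1) \end{eqnarray}\]

Setting the first two to zero gives $w_1 = w_2 = \frac{1}{2\lambda}$; substituting into the constraint gives $\frac{1}{2\lambda^2} = 1$, so $\lambda = \pm \frac{1}{\sqrt{2}}$.
- $\lambda = +\frac{1}{\sqrt{2}}$: $w = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)$, $L(w) = \sqrt{2}$ (the maximum)
- $\lambda = -\frac{1}{\sqrt{2}}$: $w = \left(-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right)$, $L(w) = -\sqrt{2}$ (the minimum)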

Backup: The Lagrange multiplier trick

More equality constraints $\Rightarrow$ more Lagrange multipliers
Inequality constraints, $g(w) \leq d$, are trickier
Is the unconstrained maximum inside the feasible set?
- Yes: problem solved
- No: constrained maximum is on the boundary
- Boundary is usually $g(w) = d$ $\Rightarrow$ treat like an equality
- There are subtleties; sometimes we need to learn some OR

Backup: The Lagrange multiplier trick

Why not do this instead? \[ \max_{\lambda, w}{L(w) + \lambda(f(w)-c)^2} \]

References

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41:391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

Feuerverger, Andrey, Yu He, and Shashi Khatri. 2012. “Statistical Significance of the Netflix Challenge.” Statistical Science 27:202–31. https://doi.org/10.1214/11-STS368.

Landauer, Thomas K., and Susan T. Dumais. 1997. “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review 104:211–40. http://lsa.colorado.edu/papers/plato/plato.annote.html.

Turchin, Peter, Thomas E. Currie, Harvey Whitehouse, Pieter François, Kevin Feeney, Daniel Mullins, Daniel Hoyer, et al. 2018. “Quantitative Historical Analysis Uncovers a Single Dimension of Complexity That Structures Global Variation in Human Social Organization.” Proceedings of the National Academy of Sciences (USA) 115:E144–E151. https://doi.org/10.1073/pnas.1708800115.

Principal Components Analysis
