Linear Dimension Reduction (mostly Principal Components)

36-462/662, Spring 2022

24 March 2022 (Lecture 18)

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

Previously…

What the data looks like

Finding the “principal” component

Projections

How well does the projection approximate the original? (1)

Do it for one vector first:

\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \| \vec{x_i}\|^2 - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]

How well does the projection approximate the original? (2)

\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]
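As a quick numerical sanity check (a sketch using made-up data and an arbitrary unit vector, not part of the derivation itself), the identity can be verified directly in R:

# Numerical check: MSE(w) = mean(||x_i||^2) - mean((w . x_i)^2)
# (sketch with synthetic data and an arbitrary unit-length w)
set.seed(462)
x <- matrix(rnorm(100*3), nrow=100)        # 100 points in 3 dimensions
w <- rnorm(3); w <- w/sqrt(sum(w^2))       # arbitrary unit vector
proj <- x %*% w                            # lengths of projections, w . x_i
resid <- x - proj %*% t(w)                 # x_i - (w . x_i) w
mean(rowSums(resid^2))                     # MSE computed directly
mean(rowSums(x^2)) - mean(proj^2)          # MSE via the identity; same number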

Minimizing MSE is maximizing variance (1)

The only part of the MSE that depends on \(\vec{w}\) enters with a minus sign, so minimizing the MSE means maximizing

\[\begin{eqnarray} L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}\]

(\(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\))

But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \] (Why?)

Minimizing MSE is maximizing variance (2)

so \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]

Minimizing MSE is maximizing variance (3)

The direction which gives us the best approximation of the data is the direction along which the projections have the greatest variance

OK, how do we find this magic direction? (1)

Matrix form: stack the lengths of the projections, \(\vec{w}\cdot\vec{x}_i\), into the \([n\times 1]\) vector \(\X\w\)

\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i=1}^{n}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}\]

Fact: \(\V \equiv \frac{\X^T \X}{n} =\) sample covariance matrix of the vectors
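A quick check of this fact (a sketch with synthetic, centered data; R's cov() uses the \(n-1\) denominator, so the two agree up to that factor):

# Check: for centered data, t(x) %*% x / n is the sample covariance matrix
# (cov() divides by n-1, hence the correction factor below)
set.seed(462)
x <- scale(matrix(rnorm(200*4), ncol=4), center=TRUE, scale=FALSE)
n <- nrow(x)
V <- t(x) %*% x / n
max(abs(V - (n-1)/n * cov(x)))             # should be ~ 0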

OK, how do we find this magic direction? (2)

\[\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}\]

Setting both derivatives to zero:

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

The magic direction is an eigenvector

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the maximum comes from the leading eigenvector of \(\V\)

The first principal component (PC1) and scores on the first principal component

\[\begin{eqnarray} \text{Lead eigenvector of}~ \V & = & 1^{\mathrm{st}}~\text{principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}\]
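A small illustration of the claim (a sketch with synthetic, centered data, not from the slides): no randomly chosen unit vector gives projections with more variance than the leading eigenvector does.

# Sketch: projections onto the leading eigenvector of V have at least
# as much variance as projections onto random unit vectors
set.seed(462)
x <- scale(matrix(rnorm(500*5), ncol=5), scale=FALSE)   # centered data
V <- t(x) %*% x / nrow(x)
w1 <- eigen(V)$vectors[, 1]                             # leading eigenvector
rand.unit <- function(p) { u <- rnorm(p); u/sqrt(sum(u^2)) }
rand.vars <- replicate(1000, mean((x %*% rand.unit(5))^2))
mean((x %*% w1)^2)      # variance along the leading eigenvector
max(rand.vars)          # no random direction does better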

Multiple principal components

About the sample covariance matrix (1)

About the sample covariance matrix (2)

Some properties of the eigenvalues

Some properties of the PC vectors

The eigendecomposition of \(\V\)

Some properties of the PC scores (1)

Define the \([n\times p]\) matrix of the scores of all data points on all PCs: \[ \S \equiv \X \w \]

Some properties of the PC scores (2)

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ & = & \frac{1}{n} (\X\w)^T(\X\w)\\ & = & \frac{1}{n}\w^T \X^T \X \w\\ & = & \w^T \V\w\\ & = & \w^T (\w \mathbf{\Lambda} \w^T) \w\\ & = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\ & = & \mathbf{\Lambda} \end{eqnarray}\]
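The same fact in code (a sketch with synthetic correlated features; prcomp() and cov() both use the \(n-1\) denominator, so the variances of the scores match sdev^2):

# Sketch: scores on distinct PCs are uncorrelated, and their variances
# are the eigenvalues (prcomp() and cov() both use the n-1 denominator)
set.seed(462)
x <- matrix(rnorm(300*4), ncol=4) %*% matrix(runif(16), 4, 4)  # correlated columns
pca <- prcomp(x)
round(cov(pca$x), 3)     # diagonal matrix of variances
round(pca$sdev^2, 3)     # the same values, i.e., the eigenvalues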

Some properties of PCA as a whole

Another way to think about PCA

Interpreting PCA results

USA, \(\approx 1977\)

Dataset pre-loaded in R:

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

Principal components of the USA, \(\approx 1977\)

state.pca <- prcomp(state.x77,scale.=TRUE)
str(state.pca)
## List of 5
##  $ sdev    : num [1:8] 1.897 1.277 1.054 0.841 0.62 ...
##  $ rotation: num [1:8, 1:8] 0.126 -0.299 0.468 -0.412 0.444 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:8] 4246.42 4435.8 1.17 70.88 7.38 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ scale   : Named num [1:8] 4464.49 614.47 0.61 1.34 3.69 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ x       : num [1:50, 1:8] 3.79 -1.053 0.867 2.382 0.241 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

The principal component vectors

The weight/loading matrix \(\w\) gets called $rotation (why?):

signif(state.pca$rotation[,1:2], 2)
##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## HS Grad    -0.420  0.300
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of \(\V\)
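We can confirm this directly (eigenvectors are only defined up to sign, hence the absolute values; eigen() and prcomp() order the components the same way):

# Check: each column of $rotation is, up to sign, an eigenvector of the
# covariance matrix of the scaled features
V <- cov(scale(state.x77))
W.eigen <- eigen(V)$vectors
max(abs(abs(W.eigen) - abs(state.pca$rotation)))   # should be ~ 0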

Break for in-class exercise

  1. What kind of state would get a large positive score on the 1st PC, and what kind of state would get a large negative score?
  2. What kind of state would get a large positive score on the 2nd PC, and what kind of state would get a large negative score?

Scores on the principal components

signif(state.pca$x[1:10, 1:2], 2)
##               PC1   PC2
## Alabama      3.80 -0.23
## Alaska      -1.10  5.50
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Colorado    -2.10  0.51
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns here are \(\vec{x}_i \cdot \vec{w}_1\) and \(\vec{x}_i \cdot \vec{w}_2\)

So, for instance, \[ s_{\text{Alabama}, 1} = 0.13 x_{\text{Alabama}, \text{Population}} - 0.3 x_{\text{Alabama}, \text{Income}} + \ldots - 0.033 x_{\text{Alabama}, \text{Area}} \]

(after centering and scaling the features)
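As a check, we can recompute Alabama's score by hand (a small sketch; scale() here re-uses the centering and scaling values that prcomp() stored):

# Check: Alabama's PC1 score = centered-and-scaled features . first PC vector
x.scaled <- scale(state.x77, center=state.pca$center, scale=state.pca$scale)
sum(x.scaled["Alabama", ] * state.pca$rotation[, 1])   # about 3.8
state.pca$x["Alabama", 1]                              # the same number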

PC1 is kinda southern (1)

signif(head(sort(state.pca$x[,1])),2)
##    Minnesota North Dakota         Iowa         Utah     Nebraska     Colorado 
##         -2.4         -2.4         -2.3         -2.3         -2.2         -2.1
signif(tail(sort(state.pca$x[,1])),2)
## North Carolina        Georgia South Carolina        Alabama    Mississippi 
##            2.7            3.3            3.7            3.8            4.0 
##      Louisiana 
##            4.2

PC1 is kinda southern (2)

PC1 is kinda southern (3)

The eigenvalues / variances along each PC

signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component \(=\sqrt{\lambda_i}\)

If we keep \(q\) components, \[ R^2 = \frac{\sum_{i=1}^{q}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]

(Denominator \(=\tr{\V}\) — why?)
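In code (since the features were scaled, the eigenvalues sum to \(p = 8\), which is the denominator here):

# Cumulative R^2 from keeping the first q components
lambda <- state.pca$sdev^2            # the eigenvalues of V
signif(cumsum(lambda)/sum(lambda), 2)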

Scree plot (1)

plot(state.pca$sdev^2, xlab="PC", ylab="Variance", type="b")

Scree plot (2)

plot(cumsum(state.pca$sdev^2), xlab="Number of PCs", ylab="Cumulative variance",
     type="b", ylim=c(0, sum(state.pca$sdev^2)))

PCA as exploratory analysis, not statistical inference

Some statistical-learning / data-mining-y applications

Latent semantic indexing (Deerwester et al. 1990; Landauer and Dumais 1997)

Multidimensional scaling (MDS)

Principal components regression

Netflix and recommender systems in general

Summing up

Backup: Gory details of PCA in matrix form

Backup: The gory details

\[\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}\]

so maximize \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)

Backup: The gory details

“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]

This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
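A numerical check of the cyclic-trace step (a sketch with synthetic matrices; qr.Q() is just a convenient way to get a matrix with orthonormal columns):

# Check that the three traces in the display above are equal
set.seed(462)
x <- matrix(rnorm(50*4), ncol=4)
w <- qr.Q(qr(matrix(rnorm(4*2), 4, 2)))   # 4 x 2, orthonormal columns
c(sum(diag(x %*% w %*% t(w) %*% t(x))),
  sum(diag(t(x) %*% x %*% w %*% t(w))),
  sum(diag(t(w) %*% t(x) %*% x %*% w)))   # all three agree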

Backup: Distinct eigenvalues of a symmetric matrix have orthogonal eigenvectors

For any symmetric matrix with eigenvalues \(\lambda_1, \ldots \lambda_p\) and eigenvectors \(\vec{w}_1, \ldots \vec{w}_p\), if \(\lambda_i \neq \lambda_j\), then \(\vec{w}_i \perp \vec{w}_j\)

Proof: Remember that \(\vec{w}_i \perp \vec{w}_j\) iff \(\vec{w}_i \cdot \vec{w}_j = 0\), so let’s get at that inner product. Now, for any two vectors \(\vec{a}\), \(\vec{b}\) and square matrix \(\mathbf{c}\) (not necessarily symmetric), \[ \vec{a} \cdot (\mathbf{c} \vec{b}) = (\mathbf{c}^T \vec{a}) \cdot \vec{b} \] (To see this, write \(\vec{a}\) and \(\vec{b}\) as \(p\times 1\) matrices, so we’ve got \(\vec{a}^T \mathbf{c} \vec{b} = \vec{a}^T (\mathbf{c}^T)^T \vec{b} = (\mathbf{c}^T \vec{a})^T \vec{b}\).) Now let’s apply this to our eigenvectors: \[\begin{eqnarray} \vec{w}_i \cdot (\V \vec{w}_j) & = & (\V^T \vec{w}_i) \cdot \vec{w}_j\\ \vec{w}_i \cdot (\V \vec{w}_j) & = & (\V \vec{w}_i) \cdot \vec{w}_j\\ \vec{w}_i \cdot (\lambda_j \vec{w}_j) & = & (\lambda_i \vec{w}_i) \cdot \vec{w}_j\\ \lambda_j (\vec{w}_i \cdot \vec{w}_j) & = & \lambda_i (\vec{w}_i \cdot \vec{w}_j) \end{eqnarray}\] (The second line uses the assumption that \(\V\) is symmetric, the third line uses the fact that \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of \(\V\), the last line uses linearity of inner products.)

Since, by assumption, \(\lambda_i \neq \lambda_j\), the only way the two sides of the equation can balance is if \(\vec{w}_i \cdot \vec{w}_j = 0\), as was to be shown.
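A quick numerical confirmation (R's eigen() already returns orthonormal eigenvectors for symmetric matrices, so this just illustrates the theorem on a random covariance matrix, which has distinct eigenvalues with probability 1):

# Eigenvectors of a symmetric matrix with distinct eigenvalues are orthogonal
set.seed(462)
V <- cov(matrix(rnorm(200*5), ncol=5))   # a symmetric matrix
W <- eigen(V)$vectors
round(t(W) %*% W, 10)                    # the identity matrix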

Backup: Distinct eigenvectors with equal eigenvalues can be made or chosen to be orthogonal

Backup: Low-rank approximation

Backup Example: The axis of historical complexity (1)

Turchin et al. (2018)

9 variables measuring social complexity, over time, for dozens of locations across the world

Dates from -9600 to 1900

summary(soccomp[,complexities], digits=2)
##      PolPop       PolTerr          CapPop        levels      government  
##  Min.   :1.4   Min.   :-0.22   Min.   :1.4   Min.   :0.0   Min.   :0.00  
##  1st Qu.:4.2   1st Qu.: 3.65   1st Qu.:3.5   1st Qu.:1.8   1st Qu.:0.24  
##  Median :6.0   Median : 5.18   Median :4.3   Median :3.0   Median :0.62  
##  Mean   :5.5   Mean   : 4.78   Mean   :4.2   Mean   :2.9   Mean   :0.55  
##  3rd Qu.:6.8   3rd Qu.: 5.97   3rd Qu.:5.1   3rd Qu.:4.0   3rd Qu.:0.86  
##  Max.   :8.5   Max.   : 7.40   Max.   :6.3   Max.   :6.6   Max.   :1.00  
##     infrastr       writing         texts          money    
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.0  
##  1st Qu.:0.34   1st Qu.:0.26   1st Qu.:0.10   1st Qu.:1.8  
##  Median :0.75   Median :0.82   Median :0.93   Median :4.0  
##  Mean   :0.64   Mean   :0.65   Mean   :0.63   Mean   :3.4  
##  3rd Qu.:0.90   3rd Qu.:0.86   3rd Qu.:0.97   3rd Qu.:5.0  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :6.0

Backup Example: The axis of historical complexity (2)

soccomp.pca <- prcomp(soccomp[,complexities], scale=TRUE)

What are the parts of the return value?

str(soccomp.pca)
## List of 5
##  $ sdev    : num [1:9] 2.634 0.733 0.646 0.581 0.481 ...
##  $ rotation: num [1:9, 1:9] 0.351 0.32 0.339 0.341 0.332 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:9] 5.515 4.779 4.229 2.923 0.552 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ scale   : Named num [1:9] 1.59 1.561 1.112 1.449 0.325 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ x       : num [1:414, 1:9] -4.35 -4.24 -4.11 -3.67 -3.51 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Backup Example: The axis of historical complexity (3)

# R^2 for k components = sum(first k eigenvalues)/sum(all eigenvalues)
# cumsum(x) returns the cumulative sums along the vector x
plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2),
     xlab="Number of components used", ylab=expression(R^2),
     ylim=c(0,1))
# Add the lines indicating the desired levels
abline(h=0.75, lty="dashed")
abline(h=0.90, lty="dotted")

One principal component already keeps more than 75% of the variance (3 components keep just under 90%)
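The same numbers, printed rather than plotted (assuming soccomp.pca from the previous slide):

# Cumulative R^2 for the social-complexity PCA
signif(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2), 2)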

Backup Example: The axis of historical complexity (4)

# Plot the coordinates/weights/loadings of PC1
  # But suppress the usual labels on the horizontal axis
plot(soccomp.pca$rotation[,1], xlab="", main="PCs of the complexity measures",
     xaxt="n", ylab="PC weight", ylim=c(-1,1), type="l")
# Label the horizontal axis by the names of the variables
  # The las option turns the axis labels on their side (so they don't overlap
  # each other)
axis(side=1, at=1:9, labels=colnames(soccomp)[complexities], las=2,
     cex.axis=0.5)
# Now add PC2 and PC3, in distinct colors
lines(soccomp.pca$rotation[,2], col="red")
lines(soccomp.pca$rotation[,3], col="blue")
# Legend
legend("bottomright", legend=c("PC1", "PC2", "PC3"), lty="solid",
       col=c("black", "red", "blue"))
# Guide-to-the-eye horizontal line at 0
abline(h=0, col="grey")

Backup Example: The axis of historical complexity (5)

# Add scores to data frame
soccomp$PC1 <- soccomp.pca$x[,1]
# Plot vs. time (all areas)
with(soccomp, plot(PC1 ~ Time))

Backup Example: The axis of historical complexity (6)

Backup Example: The axis of historical complexity (7)

plot(soccomp.pca$x[,1], soccomp.pca$x[,2], xlab="Score on PC1",
     ylab="Score on PC2")

Backup Example: Latent Semantic Indexing (1)

A random selection of stories from the Times, about either art or music

dim(nyt.nice.frame)
## [1]  102 4431
nyt.pca <- prcomp(nyt.nice.frame)

Backup Example: Latent Semantic Indexing (2)

First component:

nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[,1],decreasing=TRUE)[1:30],2)
##       music        trio     theater   orchestra   composers       opera 
##       0.110       0.084       0.083       0.067       0.059       0.058 
##    theaters           m    festival        east     program           y 
##       0.055       0.054       0.051       0.049       0.048       0.048 
##      jersey     players   committee      sunday        june     concert 
##       0.047       0.047       0.046       0.045       0.045       0.045 
##    symphony       organ     matinee   misstated instruments           p 
##       0.044       0.044       0.043       0.042       0.041       0.041 
##         X.d       april      samuel        jazz     pianist     society 
##       0.041       0.040       0.040       0.039       0.038       0.038
signif(sort(nyt.latent.sem[,1],decreasing=FALSE)[1:30],2)
##       she       her        ms         i      said    mother    cooper        my 
##    -0.260    -0.240    -0.200    -0.150    -0.130    -0.110    -0.100    -0.094 
##  painting   process paintings        im        he       mrs        me  gagosian 
##    -0.088    -0.071    -0.070    -0.068    -0.065    -0.065    -0.063    -0.062 
##       was   picasso     image sculpture      baby   artists      work    photos 
##    -0.058    -0.057    -0.056    -0.056    -0.055    -0.055    -0.054    -0.051 
##       you    nature    studio       out      says      like 
##    -0.051    -0.050    -0.050    -0.050    -0.050    -0.049

Backup Example: Latent Semantic Indexing (3)

Second component:

signif(sort(nyt.latent.sem[,2],decreasing=TRUE)[1:30],2)
##         art      museum      images     artists   donations     museums 
##       0.150       0.120       0.095       0.092       0.075       0.073 
##    painting         tax   paintings   sculpture     gallery  sculptures 
##       0.073       0.070       0.065       0.060       0.055       0.051 
##     painted       white    patterns      artist      nature     service 
##       0.050       0.050       0.047       0.047       0.046       0.046 
##  decorative        feet     digital      statue       color    computer 
##       0.043       0.043       0.043       0.042       0.042       0.041 
##       paris         war collections     diamond       stone     dealers 
##       0.041       0.041       0.041       0.041       0.041       0.040
signif(sort(nyt.latent.sem[,2],decreasing=FALSE)[1:30],2)
##          her          she      theater        opera           ms            i 
##       -0.220       -0.220       -0.160       -0.130       -0.130       -0.083 
##         hour   production         sang     festival        music      musical 
##       -0.081       -0.075       -0.075       -0.074       -0.070       -0.070 
##        songs        vocal    orchestra           la      singing      matinee 
##       -0.068       -0.067       -0.067       -0.065       -0.065       -0.061 
##  performance         band       awards    composers         says           my 
##       -0.061       -0.060       -0.058       -0.058       -0.058       -0.056 
##           im         play     broadway       singer       cooper performances 
##       -0.056       -0.056       -0.055       -0.052       -0.051       -0.051

Backup Example: Latent Semantic Indexing (4)

story.classes <- c(rep("art", times=length(art.stories)),
                   rep("music", times=length(music.stories)))
plot(nyt.pca$x[,1:2],
     pch=ifelse(story.classes=="music","m","a"),
     col=ifelse(story.classes=="music","blue","red"))

Bonus: PCA and distance preservation

Bonus: Distance-preserving projections

Bonus: The random projection trick

Bonus: The random projection trick

Bonus: Random projections are nearly distance-preserving with high probability

MOAR READING

References

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199535255.001.0001.

Dasgupta, Sanjoy, and Anupam Gupta. 2002. “An Elementary Proof of a Theorem of Johnson and Lindenstrauss.” Random Structures and Algorithms 22:60–65. https://doi.org/10.1002/rsa.10073.

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41:391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

Dhillon, Paramveer S., Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. 2013. “A Risk Comparison of Ordinary Least Squares Vs Ridge Regression.” Journal of Machine Learning Research 14:1505–11. http://jmlr.org/papers/v14/dhillon13a.html.

Eshel, Gidon. 2012. Spatiotemporal Data Analysis. Princeton, New Jersey: Princeton University Press. https://doi.org/10.1515/9781400840632.

Hand, David, Heikki Mannila, and Padhraic Smyth. 2001. Principles of Data Mining. Cambridge, Massachusetts: MIT Press.

Hotelling, Harold. 1933a. “Analysis of a Complex of Statistical Variables into Principal Components [Part 1 of 2].” Journal of Educational Psychology 24:417–41. https://doi.org/10.1037/h0071325.

———. 1933b. “Analysis of a Complex of Statistical Variables into Principal Components [Part 2 of 2].” Journal of Educational Psychology 24:498–520. https://doi.org/10.1037/h0070888.

Johnson, William B., and Joram Lindenstrauss. 1984. “Extensions of Lipschitz Mappings into a Hilbert Space.” In Conference on Modern Analysis and Probability, edited by Richard Beals, Anatole Beck, Alexandra Bellow, and Arshag Hajian, 26:189–206. Contemporary Mathematics. Providence, Rhode Island: American Mathematical Society.

Landauer, Thomas K., and Susan T. Dumais. 1997. “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review 104:211–40. http://lsa.colorado.edu/papers/plato/plato.annote.html.

Loève, Michel. 1955. Probability Theory. 1st ed. New York: D. Van Nostrand Company.

Pearson, Karl. 1901. “On Lines and Planes of Closest Fit to Systems of Points in Space.” Philosophical Magazine 2 (series 6):559–72. https://doi.org/10.1080/14786440109462720.

Porter, Theodore M. 2004. Karl Pearson: The Scientific Life in a Statistical Age. Princeton, New Jersey: Princeton University Press. https://doi.org/10.1515/9781400835706.

Turchin, Peter, Thomas E. Currie, Harvey Whitehouse, Pieter François, Kevin Feeney, Daniel Mullins, Daniel Hoyer, et al. 2018. “Quantitative Historical Analysis Uncovers a Single Dimension of Complexity That Structures Global Variation in Human Social Organization.” Proceedings of the National Academy of Sciences (USA) 115:E144–E151. https://doi.org/10.1073/pnas.1708800115.