Linear Dimension Reduction (mostly Principal Components)

36-462/662, Spring 2022

24 March 2022 (Lecture 18)

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

Previously…

What the data looks like

Finding the “principal” component

Projections

How well does the projection approximate the original? (1)

Do it for one vector first:

\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \| \vec{x_i}\|^2 - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]

How well does the projection approximate the original? (2)

\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]
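As a quick numerical sanity check (a sketch using made-up data and an arbitrary unit vector, not part of the derivation itself), the identity can be verified directly in R:

# Numerical check: MSE(w) = mean(||x_i||^2) - mean((w . x_i)^2)
# (sketch with synthetic data and an arbitrary unit-length w)
set.seed(462)
x <- matrix(rnorm(100*3), nrow=100)        # 100 points in 3 dimensions
w <- rnorm(3); w <- w/sqrt(sum(w^2))       # arbitrary unit vector
proj <- x %*% w                            # lengths of projections, w . x_i
resid <- x - proj %*% t(w)                 # x_i - (w . x_i) w
mean(rowSums(resid^2))                     # MSE computed directly
mean(rowSums(x^2)) - mean(proj^2)          # MSE via the identity; same number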

Minimizing MSE is maximizing variance (1)

The only part of the MSE that depends on \(\vec{w}\) enters with a minus sign, so minimizing the MSE means maximizing

\[\begin{eqnarray} L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}\]

(\(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\))

But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \] (Why?)

Minimizing MSE is maximizing variance (2)

so \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]

Minimizing MSE is maximizing variance (3)

The direction which gives us the best approximation of the data is the direction along which the projections have the greatest variance

OK, how do we find this magic direction? (1)

Matrix form: stack the lengths of the projections, \(\vec{w}\cdot\vec{x}_i\), into the \([n\times 1]\) vector \(\X\w\)

\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i=1}^{n}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}\]

Fact: \(\V \equiv \frac{\X^T \X}{n} =\) sample covariance matrix of the vectors
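A quick check of this fact (a sketch with synthetic, centered data; R's cov() uses the \(n-1\) denominator, so the two agree up to that factor):

# Check: for centered data, t(x) %*% x / n is the sample covariance matrix
# (cov() divides by n-1, hence the correction factor below)
set.seed(462)
x <- scale(matrix(rnorm(200*4), ncol=4), center=TRUE, scale=FALSE)
n <- nrow(x)
V <- t(x) %*% x / n
max(abs(V - (n-1)/n * cov(x)))             # should be ~ 0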

OK, how do we find this magic direction? (2)

\[\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}\]

Setting both derivatives to zero:

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

The magic direction is an eigenvector

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the maximum comes from the leading eigenvector of \(\V\)

The first principal component (PC1) and scores on the first principal component

\[\begin{eqnarray} \text{Lead eigenvector of}~ \V & = & 1^{\mathrm{st}}~\text{principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}\]
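A small illustration of the claim (a sketch with synthetic, centered data, not from the slides): no randomly chosen unit vector gives projections with more variance than the leading eigenvector does.

# Sketch: projections onto the leading eigenvector of V have at least
# as much variance as projections onto random unit vectors
set.seed(462)
x <- scale(matrix(rnorm(500*5), ncol=5), scale=FALSE)   # centered data
V <- t(x) %*% x / nrow(x)
w1 <- eigen(V)$vectors[, 1]                             # leading eigenvector
rand.unit <- function(p) { u <- rnorm(p); u/sqrt(sum(u^2)) }
rand.vars <- replicate(1000, mean((x %*% rand.unit(5))^2))
mean((x %*% w1)^2)      # variance along the leading eigenvector
max(rand.vars)          # no random direction does better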

Multiple principal components

About the sample covariance matrix (1)

About the sample covariance matrix (2)

Some properties of the eigenvalues

Some properties of the PC vectors

The eigendecomposition of \(\V\)

Some properties of the PC scores (1)

Define the \([n\times p]\) matrix of the scores of all data points on all PCs: \[ \S \equiv \X \w \]

Some properties of the PC scores (2)

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ & = & \frac{1}{n} (\X\w)^T(\X\w)\\ & = & \frac{1}{n}\w^T \X^T \X \w\\ & = & \w^T \V\w\\ & = & \w^T (\w \mathbf{\Lambda} \w^T) \w\\ & = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\ & = & \mathbf{\Lambda} \end{eqnarray}\]
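The same fact in code (a sketch with synthetic correlated features; prcomp() and cov() both use the \(n-1\) denominator, so the variances of the scores match sdev^2):

# Sketch: scores on distinct PCs are uncorrelated, and their variances
# are the eigenvalues (prcomp() and cov() both use the n-1 denominator)
set.seed(462)
x <- matrix(rnorm(300*4), ncol=4) %*% matrix(runif(16), 4, 4)  # correlated columns
pca <- prcomp(x)
round(cov(pca$x), 3)     # diagonal matrix of variances
round(pca$sdev^2, 3)     # the same values, i.e., the eigenvalues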

Some properties of PCA as a whole

Another way to think about PCA

Interpreting PCA results

USA, \(\approx 1977\)

Dataset pre-loaded in R:

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

Principal components of the USA, \(\approx 1977\)

state.pca <- prcomp(state.x77,scale.=TRUE)
str(state.pca)
## List of 5
##  $ sdev    : num [1:8] 1.897 1.277 1.054 0.841 0.62 ...
##  $ rotation: num [1:8, 1:8] 0.126 -0.299 0.468 -0.412 0.444 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:8] 4246.42 4435.8 1.17 70.88 7.38 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ scale   : Named num [1:8] 4464.49 614.47 0.61 1.34 3.69 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ x       : num [1:50, 1:8] 3.79 -1.053 0.867 2.382 0.241 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

The principal component vectors

The weight/loading matrix \(\w\) gets called $rotation (why?):

signif(state.pca$rotation[,1:2], 2)
##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## HS Grad    -0.420  0.300
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of \(\V\)
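We can confirm this directly (eigenvectors are only defined up to sign, hence the absolute values; eigen() and prcomp() order the components the same way):

# Check: each column of $rotation is, up to sign, an eigenvector of the
# covariance matrix of the scaled features
V <- cov(scale(state.x77))
W.eigen <- eigen(V)$vectors
max(abs(abs(W.eigen) - abs(state.pca$rotation)))   # should be ~ 0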

Break for in-class exercise

  1. What kind of state would get a large positive score on the 1st PC, and what kind of state would get a large negative score?
  2. What kind of state would get a large positive score on the 2nd PC, and what kind of state would get a large negative score?

Scores on the principal components

signif(state.pca$x[1:10, 1:2], 2)
##               PC1   PC2
## Alabama      3.80 -0.23
## Alaska      -1.10  5.50
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Colorado    -2.10  0.51
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns here are \(\vec{x}_i \cdot \vec{w}_1\) and \(\vec{x}_i \cdot \vec{w}_2\)

So, for instance, \[ s_{\text{Alabama}, 1} = 0.13 x_{\text{Alabama}, \text{Population}} - 0.3 x_{\text{Alabama}, \text{Income}} + \ldots - 0.033 x_{\text{Alabama}, \text{Area}} \]

(after centering and scaling the features)
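As a check, we can recompute Alabama's score by hand (a small sketch; scale() here re-uses the centering and scaling values that prcomp() stored):

# Check: Alabama's PC1 score = centered-and-scaled features . first PC vector
x.scaled <- scale(state.x77, center=state.pca$center, scale=state.pca$scale)
sum(x.scaled["Alabama", ] * state.pca$rotation[, 1])   # about 3.8
state.pca$x["Alabama", 1]                              # the same number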

PC1 is kinda southern (1)

signif(head(sort(state.pca$x[,1])),2)
##    Minnesota North Dakota         Iowa         Utah     Nebraska     Colorado 
##         -2.4         -2.4         -2.3         -2.3         -2.2         -2.1
signif(tail(sort(state.pca$x[,1])),2)
## North Carolina        Georgia South Carolina        Alabama    Mississippi 
##            2.7            3.3            3.7            3.8            4.0 
##      Louisiana 
##            4.2

PC1 is kinda southern (2)

PC1 is kinda southern (3)

The eigenvalues / variances along each PC

signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component \(=\sqrt{\lambda_i}\)

If we keep \(q\) components, \[ R^2 = \frac{\sum_{i=1}^{q}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]

(Denominator \(=\tr{\V}\) — why?)
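In code (since the features were scaled, the eigenvalues sum to \(p = 8\), which is the denominator here):

# Cumulative R^2 from keeping the first q components
lambda <- state.pca$sdev^2            # the eigenvalues of V
signif(cumsum(lambda)/sum(lambda), 2)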

Scree plot (1)

plot(state.pca$sdev^2, xlab="PC", ylab="Variance", type="b")

Scree plot (2)

plot(cumsum(state.pca$sdev^2), xlab="Number of PCs", ylab="Cumulative variance",
     type="b", ylim=c(0, sum(state.pca$sdev^2)))

PCA as exploratory analysis, not statistical inference

Some statistical-learning / data-mining-y applications

Latent semantic indexing (Deerwester et al. 1990; Landauer and Dumais 1997)

Multidimensional scaling (MDS)

Principal components regression

Netflix and recommender systems in general

Summing up

Backup: Gory details of PCA in matrix form

Backup: The gory details

\[\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}\]

so maximize \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)

Backup: The gory details

“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]

This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
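A numerical check of the cyclic-trace step (a sketch with synthetic matrices; qr.Q() is just a convenient way to get a matrix with orthonormal columns):

# Check that the three traces in the display above are equal
set.seed(462)
x <- matrix(rnorm(50*4), ncol=4)
w <- qr.Q(qr(matrix(rnorm(4*2), 4, 2)))   # 4 x 2, orthonormal columns
c(sum(diag(x %*% w %*% t(w) %*% t(x))),
  sum(diag(t(x) %*% x %*% w %*% t(w))),
  sum(diag(t(w) %*% t(x) %*% x %*% w)))   # all three agree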

Backup: Distinct eigenvalues of a symmetric matrix have orthogonal eigenvectors

For any symmetric matrix with eigenvalues \(\lambda_1, \ldots \lambda_p\) and eigenvectors \(\vec{w}_1, \ldots \vec{w}_p\), if \(\lambda_i \neq \lambda_j\), then \(\vec{w}_i \perp \vec{w}_j\)

Proof: Remember that \(\vec{w}_i \perp \vec{w}_j\) iff \(\vec{w}_i \cdot \vec{w}_j = 0\), so let’s get at that inner product. Now, for any two vectors \(\vec{a}\), \(\vec{b}\) and square matrix \(\mathbf{c}\) (not necessarily symmetric), \[ \vec{a} \cdot (\mathbf{c} \vec{b}) = (\mathbf{c}^T \vec{a}) \cdot \vec{b} \] (To see this, write \(\vec{a}\) and \(\vec{b}\) as \(p\times 1\) matrices, so we’ve got \(\vec{a}^T \mathbf{c} \vec{b} = \vec{a}^T (\mathbf{c}^T)^T \vec{b} = (\mathbf{c}^T \vec{a})^T \vec{b}\).) Now let’s apply this to our eigenvectors: \[\begin{eqnarray} \vec{w}_i \cdot (\V \vec{w}_j) & = & (\V^T \vec{w}_i) \cdot \vec{w}_j\\ \vec{w}_i \cdot (\V \vec{w}_j) & = & (\V \vec{w}_i) \cdot \vec{w}_j\\ \vec{w}_i \cdot (\lambda_j \vec{w}_j) & = & (\lambda_i \vec{w}_i) \cdot \vec{w}_j\\ \lambda_j (\vec{w}_i \cdot \vec{w}_j) & = & \lambda_i (\vec{w}_i \cdot \vec{w}_j) \end{eqnarray}\] (The second line uses the assumption that \(\V\) is symmetric, the third line uses the fact that \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of \(\V\), the last line uses linearity of inner products.)

Since, by assumption, \(\lambda_i \neq \lambda_j\), the only way the two sides of the equation can balance is if \(\vec{w}_i \cdot \vec{w}_j = 0\), as was to be shown.
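A quick numerical confirmation (R's eigen() already returns orthonormal eigenvectors for symmetric matrices, so this just illustrates the theorem on a random covariance matrix, which has distinct eigenvalues with probability 1):

# Eigenvectors of a symmetric matrix with distinct eigenvalues are orthogonal
set.seed(462)
V <- cov(matrix(rnorm(200*5), ncol=5))   # a symmetric matrix
W <- eigen(V)$vectors
round(t(W) %*% W, 10)                    # the identity matrix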

Backup: Distinct eigenvectors with equal eigenvalues can be made or chosen to be orthogonal

Backup: Low-rank approximation

Backup Example: The axis of historical complexity (1)

Turchin et al. (2018)

9 variables measuring social complexity, over time, for dozens of locations across the world

Dates from -9600 to 1900

summary(soccomp[,complexities], digits=2)
##      PolPop       PolTerr          CapPop        levels      government  
##  Min.   :1.4   Min.   :-0.22   Min.   :1.4   Min.   :0.0   Min.   :0.00  
##  1st Qu.:4.2   1st Qu.: 3.65   1st Qu.:3.5   1st Qu.:1.8   1st Qu.:0.24  
##  Median :6.0   Median : 5.18   Median :4.3   Median :3.0   Median :0.62  
##  Mean   :5.5   Mean   : 4.78   Mean   :4.2   Mean   :2.9   Mean   :0.55  
##  3rd Qu.:6.8   3rd Qu.: 5.97   3rd Qu.:5.1   3rd Qu.:4.0   3rd Qu.:0.86  
##  Max.   :8.5   Max.   : 7.40   Max.   :6.3   Max.   :6.6   Max.   :1.00  
##     infrastr       writing         texts          money    
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.0  
##  1st Qu.:0.34   1st Qu.:0.26   1st Qu.:0.10   1st Qu.:1.8  
##  Median :0.75   Median :0.82   Median :0.93   Median :4.0  
##  Mean   :0.64   Mean   :0.65   Mean   :0.63   Mean   :3.4  
##  3rd Qu.:0.90   3rd Qu.:0.86   3rd Qu.:0.97   3rd Qu.:5.0  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :6.0

Backup Example: The axis of historical complexity (2)

soccomp.pca <- prcomp(soccomp[,complexities], scale=TRUE)

What are the parts of the return value?

str(soccomp.pca)
## List of 5
##  $ sdev    : num [1:9] 2.634 0.733 0.646 0.581 0.481 ...
##  $ rotation: num [1:9, 1:9] 0.351 0.32 0.339 0.341 0.332 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:9] 5.515 4.779 4.229 2.923 0.552 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ scale   : Named num [1:9] 1.59 1.561 1.112 1.449 0.325 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ x       : num [1:414, 1:9] -4.35 -4.24 -4.11 -3.67 -3.51 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Backup Example: The axis of historical complexity (3)

# R^2 for k components = sum(first k eigenvalues)/sum(all eigenvalues)
# cumsum(x) returns the cumulative sums along the vector x
plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2),
     xlab="Number of components used", ylab=expression(R^2),
     ylim=c(0,1))
# Add the lines indicating the desired levels
abline(h=0.75, lty="dashed")
abline(h=0.90, lty="dotted")

One principal component already keeps more than 75% of the variance (3 components keep just under 90%)
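The same numbers, printed rather than plotted (assuming soccomp.pca from the previous slide):

# Cumulative R^2 for the social-complexity PCA
signif(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2), 2)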

Backup Example: The axis of historical complexity (4)

# Plot the coordinates/weights/loadings of PC1
  # But suppress the usual labels on the horizontal axis
plot(soccomp.pca$rotation[,1], xlab="", main="PCs of the complexity measures",
     xaxt="n", ylab="PC weight", ylim=c(-1,1), type="l")
# Label the horizontal axis by the names of the variables
  # The las option turns the axis labels on their side (so they don't overlap
  # each other)
axis(side=1, at=1:9, labels=colnames(soccomp)[complexities], las=2,
     cex.axis=0.5)
# Now add PC2 and PC3, in distinct colors
lines(soccomp.pca$rotation[,2], col="red")
lines(soccomp.pca$rotation[,3], col="blue")
# Legend
legend("bottomright", legend=c("PC1", "PC2", "PC3"), lty="solid",
       col=c("black", "red", "blue"))
# Guide-to-the-eye horizontal line at 0
abline(h=0, col="grey")

Backup Example: The axis of historical complexity (5)

# Add scores to data frame
soccomp$PC1 <- soccomp.pca$x[,1]
# Plot vs. time (all areas)
with(soccomp, plot(PC1 ~ Time))

Backup Example: The axis of historical complexity (6)

Backup Example: The axis of historical complexity (7)

plot(soccomp.pca$x[,1], soccomp.pca$x[,2], xlab="Score on PC1",
     ylab="Score on PC2")

Backup Example: Latent Semantic Indexing (1)

A random selection of stories from the Times, about either art or music

dim(nyt.nice.frame)
## [1]  102 4431
nyt.pca <- prcomp(nyt.nice.frame)

Backup Example: Latent Semantic Indexing (2)

First component:

nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[,1],decreasing=TRUE)[1:30],2)
##       music        trio     theater   orchestra   composers       opera 
##       0.110       0.084       0.083       0.067       0.059       0.058 
##    theaters           m    festival        east     program           y 
##       0.055       0.054       0.051       0.049       0.048       0.048 
##      jersey     players   committee      sunday        june     concert 
##       0.047       0.047       0.046       0.045       0.045       0.045 
##    symphony       organ     matinee   misstated instruments           p 
##       0.044       0.044       0.043       0.042       0.041       0.041 
##         X.d       april      samuel        jazz     pianist     society 
##       0.041       0.040       0.040       0.039       0.038       0.038
signif(sort(nyt.latent.sem[,1],decreasing=FALSE)[1:30],2)
##       she       her        ms         i      said    mother    cooper        my 
##    -0.260    -0.240    -0.200    -0.150    -0.130    -0.110    -0.100    -0.094 
##  painting   process paintings        im        he       mrs        me  gagosian 
##    -0.088    -0.071    -0.070    -0.068    -0.065    -0.065    -0.063    -0.062 
##       was   picasso     image sculpture      baby   artists      work    photos 
##    -0.058    -0.057    -0.056    -0.056    -0.055    -0.055    -0.054    -0.051 
##       you    nature    studio       out      says      like 
##    -0.051    -0.050    -0.050    -0.050    -0.050    -0.049

Backup Example: Latent Semantic Indexing (3)

Second component:

signif(sort(nyt.latent.sem[,2],decreasing=TRUE)[1:30],2)
##         art      museum      images     artists   donations     museums 
##       0.150       0.120       0.095       0.092       0.075       0.073 
##    painting         tax   paintings   sculpture     gallery  sculptures 
##       0.073       0.070       0.065       0.060       0.055       0.051 
##     painted       white    patterns      artist      nature     service 
##       0.050       0.050       0.047       0.047       0.046       0.046 
##  decorative        feet     digital      statue       color    computer 
##       0.043       0.043       0.043       0.042       0.042       0.041 
##       paris         war collections     diamond       stone     dealers 
##       0.041       0.041       0.041       0.041       0.041       0.040
signif(sort(nyt.latent.sem[,2],decreasing=FALSE)[1:30],2)
##          her          she      theater        opera           ms            i 
##       -0.220       -0.220       -0.160       -0.130       -0.130       -0.083 
##         hour   production         sang     festival        music      musical 
##       -0.081       -0.075       -0.075       -0.074       -0.070       -0.070 
##        songs        vocal    orchestra           la      singing      matinee 
##       -0.068       -0.067       -0.067       -0.065       -0.065       -0.061 
##  performance         band       awards    composers         says           my 
##       -0.061       -0.060       -0.058       -0.058       -0.058       -0.056 
##           im         play     broadway       singer       cooper performances 
##       -0.056       -0.056       -0.055       -0.052       -0.051       -0.051

Backup Example: Latent Semantic Indexing (4)

story.classes <- c(rep("art", times=length(art.stories)),
                   rep("music", times=length(music.stories)))
plot(nyt.pca$x[,1:2],
     pch=ifelse(story.classes=="music","m","a"),
     col=ifelse(story.classes=="music","blue","red"))

Bonus: PCA and distance preservation

Bonus: Distance-preserving projections

Bonus: The random projection trick

Bonus: The random projection trick

Bonus: Random projections are nearly distance-preserving with high probability

MOAR READING

References

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199535255.001.0001.

Dasgupta, Sanjoy, and Anupam Gupta. 2002. “An Elementary Proof of a Theorem of Johnson and Lindenstrauss.” Random Structures and Algorithms 22:60–65. https://doi.org/10.1002/rsa.10073.

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41:391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

Dhillon, Paramveer S., Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. 2013. “A Risk Comparison of Ordinary Least Squares Vs Ridge Regression.” Journal of Machine Learning Research 14:1505–11. http://jmlr.org/papers/v14/dhillon13a.html.

Eshel, Gidon. 2012. Spatiotemporal Data Analysis. Princeton, New Jersey: Princeton University Press. https://doi.org/10.1515/9781400840632.

Hand, David, Heikki Mannila, and Padhraic Smyth. 2001. Principles of Data Mining. Cambridge, Massachusetts: MIT Press.

Hotelling, Harold. 1933a. “Analysis of a Complex of Statistical Variables into Principal Components [Part 1 of 2].” Journal of Educational Psychology 24:417–41. https://doi.org/10.1037/h0071325.

———. 1933b. “Analysis of a Complex of Statistical Variables into Principal Components [Part 2 of 2].” Journal of Educational Psychology 24:498–520. https://doi.org/10.1037/h0070888.

Johnson, William B., and Joram Lindenstrauss. 1984. “Extensions of Lipschitz Mappings into a Hilbert Space.” In Conference on Modern Analysis and Probability, edited by Richard Beals, Anatole Beck, Alexandra Bellow, and Arshag Hajian, 26:189–206. Contemporary Mathematics. Providence, Rhode Island: American Mathematical Society.

Landauer, Thomas K., and Susan T. Dumais. 1997. “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review 104:211–40. http://lsa.colorado.edu/papers/plato/plato.annote.html.

Loève, Michel. 1955. Probability Theory. 1st ed. New York: D. Van Nostrand Company.

Pearson, Karl. 1901. “On Lines and Planes of Closest Fit to Systems of Points in Space.” Philosophical Magazine 2 (series 6):559–72. https://doi.org/10.1080/14786440109462720.

Porter, Theodore M. 2004. Karl Pearson: The Scientific Life in a Statistical Age. Princeton, New Jersey: Princeton University Press. https://doi.org/10.1515/9781400835706.

Turchin, Peter, Thomas E. Currie, Harvey Whitehouse, Pieter François, Kevin Feeney, Daniel Mullins, Daniel Hoyer, et al. 2018. “Quantitative Historical Analysis Uncovers a Single Dimension of Complexity That Structures Global Variation in Human Social Organization.” Proceedings of the National Academy of Sciences (USA) 115:E144–E151. https://doi.org/10.1073/pnas.1708800115.