36-462/662, Fall 2019

18 September 2019

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\Y}{\mathbf{y}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

- We have \(n\) items represented as \(p\)-dimensional feature vectors; \(p\) is too big
- PCA: Find the best \(q\)-dimensional linear approximation to the data
- \(\equiv\) find the \(q\)-directions of maximum variance
- \(\equiv\) find the top \(q\) eigenvectors of the variance matrix
- Then project the \(p\)-dimensional feature vectors onto those eigenvectors
- Scores = lengths of the projections = new, \(q\)-dimensional features
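The recipe above is short enough to sketch directly. Here is a minimal NumPy version (Python rather than the course's R; `pca_scores` is an illustrative helper, not course code): center, eigendecompose the variance matrix, project onto the top \(q\) eigenvectors.

```python
import numpy as np

def pca_scores(x, q):
    """Scores on the top q principal components: center the data,
    take the top q eigenvectors of the variance matrix, and project."""
    xc = x - x.mean(axis=0)            # center each feature
    v = np.cov(xc, rowvar=False)       # p x p sample variance matrix
    vals, vecs = np.linalg.eigh(v)     # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :q]         # p x q matrix of top-q eigenvectors
    return xc @ top                    # n x q matrix of scores

rng = np.random.default_rng(36462)
x = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features
scores = pca_scores(x, 2)
print(scores.shape)  # (100, 2)
```

The score columns come out uncorrelated, with variances in decreasing order, as they should for principal components.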

- Factor analysis: Suppose the data lie on a \(q\)-dimensional linear subspace, plus noise
- Find the \(q\)-dimensional subspace which best preserves the correlations
- Backtrack to coordinates in the subspace (factor scores)
- Factor scores are the new, \(q\)-dimensional features

- Drawbacks:
- This is a lot of computational work!
- Who says linear subspaces are so great?

- PCA: we project from \(\vec{x}_i\) to \(\mathbf{u}\vec{x}_i\)
- \(\mathbf{u} =\) projection operator for the principal components

- PCA: tries to make the projected points close to the original points, so \[ \frac{1}{n}\sum_{i=1}^{n}{\| \vec{x}_i - \mathbf{u}\vec{x}_i\|^2} \] is minimized
- This means distances between projected points track distances between original points
- Triangle inequality

- Suppose we just want to preserve distances
- \(\mathbf{u}\) would be **distance-preserving**, or an **isometry**, if \[ \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 = \|\vec{x}_i - \vec{x}_j\|^2 \]
- *Exact* distance-preservation is a lot to ask
- Unless \(p > n\)

- Approximate distance preservation: you pick an \(\epsilon > 0\); \(\mathbf{u}\) is an **\(\epsilon\)-isometry** when \[ (1-\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \leq \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 \leq (1+\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \]
- A remarkable math fact: if \(q = O(\frac{\log{n}}{\epsilon^2})\), then there is *always* an \(\epsilon\)-isometry

- Make \(Z_{ij} \sim \mathcal{N}(0,1)\) independent standard Gaussians
- Set \[ \mathbf{U} = \frac{1}{\sqrt{q}}\left[\begin{array}{ccc} Z_{11} & Z_{12} & \ldots & Z_{1p}\\ Z_{21} & Z_{22} & \ldots & Z_{2p}\\ \vdots & \vdots & \ldots & \vdots\\ Z_{q1} & Z_{q2} & \ldots & Z_{qp}\end{array}\right] \]
- \(\mathbf{U}\vec{x}_i\) is like projecting \(\vec{x}_i\) onto \(q\) *random* \(p\)-dimensional vectors
- What is the expected squared length of those projections?

- Now you can check that for *any* vector \(\vec{x}\), \[ \Expect{\|\mathbf{U}\vec{x}\|^2} = \|\vec{x}\|^2 \]
- Hint: \(\Expect{\|\mathbf{U}\vec{x}\|^2} = \Expect{\sum_{i=1}^{q}{\left(\frac{1}{\sqrt{q}}\sum_{j=1}^{p}{Z_{ij} x_j}\right)^2}}\) and \(\|\vec{x}\|^2 = \sum_{j=1}^{p}{x_j^2}\)

Expectation is linear and the \(Z_{ij}\) are uncorrelated so \[ \Expect{\|\mathbf{U}\vec{x}\|^2} = \frac{1}{q}\sum_{i=1}^{q}{\sum_{j=1}^{p}{x_j^2}} = \|\vec{x}\|^2 \]

**Random projections are distance-preserving on average**
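A quick simulation bears this out. A sketch in Python/NumPy (the course's own code is in R), averaging \(\|\mathbf{U}\vec{x}\|^2 / \|\vec{x}\|^2\) over many draws of the random projection matrix:

```python
import numpy as np

rng = np.random.default_rng(18)
p, q, reps = 50, 10, 5000
x = rng.normal(size=p)        # any fixed vector will do

# Average ||Ux||^2 / ||x||^2 over many draws of the projection matrix
ratios = np.empty(reps)
for r in range(reps):
    U = rng.normal(size=(q, p)) / np.sqrt(q)   # entries Z_ij / sqrt(q)
    ratios[r] = np.sum((U @ x) ** 2) / np.sum(x ** 2)

print(abs(ratios.mean() - 1) < 0.05)   # True: the mean ratio is close to 1
```

Each ratio is a \(\chi^2_q/q\) random variable, so the average converges to 1 as the number of draws grows.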

- Fix your favorite \(\epsilon > 0\) (approximation level) and \(\delta > 0\) (error probability)
- **Johnson-Lindenstrauss theorem**: With probability at least \(1-\delta\), \(\mathbf{U}\) is an \(\epsilon\)-isometry for our \(n\) data points if \[ q \geq \frac{8}{\epsilon^2}\log{\frac{n}{\sqrt{\delta}}} \]
- E.g., for 95% confidence, need \(q \geq \frac{8}{\epsilon^2}\left(\log{(n)} + 1.498\ldots \right)\)
- For 95% confidence with \(n=205\) and \(\epsilon=0.2\), we need \(q \geq 1365\)
- If we had \(n=2\times 10^6\), we’d need \(q \geq 3202\)
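The bound is easy to evaluate. A small Python sketch (`jl_q` is an illustrative name) reproduces the numbers above:

```python
import numpy as np

def jl_q(n, eps, delta=0.05):
    """Smallest q guaranteed to give an eps-isometry for n points
    with probability at least 1 - delta, per the JL bound."""
    return int(np.ceil((8 / eps**2) * np.log(n / np.sqrt(delta))))

print(jl_q(205, 0.2))    # 1365
print(jl_q(2e6, 0.2))    # 3202
```

Note that \(n\) enters only through its logarithm, which is why going from \(205\) to \(2\times 10^6\) points barely doubles the required \(q\).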

- The remarkable bits: \(p\) doesn’t matter and neither does the data
- For the homework data, \(p=4.725\times 10^{5}\)

- You can also turn this around to see what \(\epsilon\) you can achieve with a fixed budget of \(q\)
- With \(q = 100\), \(n=205\), \(\delta=0.05\), we get \(\epsilon \approx 0.74\), which isn’t great but is *some* signal
- Again, independent of \(p\)

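Turning the bound around is a one-liner; a sketch in Python (`jl_eps` is an illustrative name) inverting the formula for \(\epsilon\):

```python
import numpy as np

def jl_eps(n, q, delta=0.05):
    """Distortion level achievable with a budget of q dimensions:
    invert q = (8/eps^2) log(n/sqrt(delta)) for eps."""
    return float(np.sqrt((8 / q) * np.log(n / np.sqrt(delta))))

print(round(jl_eps(205, 100), 2))   # 0.74
```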

- Some nonlinear data (a nice spiral):

```
x = matrix(c(exp(-0.2 * (-(1:300)/10)) * cos(-(1:300)/10),
             exp(-0.2 * (-(1:300)/10)) * sin(-(1:300)/10)), ncol = 2)
plot(x)
```

```
fit.all = prcomp(x)
approx.all = fit.all$x[, 1] %*% t(fit.all$rotation[, 1])
plot(x, xlab = expression(x[1]), ylab = expression(x[2]))
points(approx.all, pch = 4)
```

- PCA and FA generally do badly when the data are curves (or curved surfaces)
- Geometry fact: *locally*, every smooth curved surface looks linear; a set we get by piecing these together is called a “manifold”
- **Manifold learning** = the data are on or near a manifold; find that manifold
- Factor analysis is a kind of linear manifold learning

```
fit = prcomp(x[270:280, ])
pca.approx = sweep(fit$x[, 1] %*% t(fit$rotation[, 1]), 2, colMeans(x[270:280, ]), "+")  # add the column means back to each row
plot(rbind(x[270:280, ], pca.approx), type = "n", xlab = expression(x[1]), ylab = expression(x[2]))
points(x[270:280, ])
points(pca.approx, pch = 4)
```

- Write each data point as a linear combination of near-by data points
- Look for lower-dimensional coordinates which fit those linear combinations

- Inputs: \(\X\) \([n \times p]\), \(q < p\), \(k \geq q+1\)
- Output: \(\Y\) \([n \times q]\)

- For each \(\vec{x}_i\), find the \(k\) nearest neighbors.
- Find optimal weights for reconstructing \(\vec{x}_i\) from its neighbors, i.e., minimize \[\begin{equation} MSE(\mathbf{w}) \equiv \frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|}^2} \end{equation}\] with \(w_{ij} = 0\) unless \(j\) is a nearest neighbor of \(i\), and \(\sum_{j}{w_{ij}} = 1\)
- Find coordinates \(\Y\) which minimize the reconstruction error, \[\begin{equation} \Phi(\mathbf{Y}) \equiv \sum_{i=1}^{n}{{\|\vec{y}_i - \sum_{j\neq i}{w_{ij} \vec{y}_j}\|}^2} \end{equation}\] with constraints \(\sum_{i}{y_{ij}} = 0\) and \(\Y^T \Y = \mathbf{I}\) (centered, de-correlated)

- We know how to do this

\[ \min_{\w}{\frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|}^2}} \]

- Can find weights separately for each data point \(\vec{x}_i\)
- i.e., independent computation of each row \(\w_{i\cdot}\) so \[ \min_{\w_{i\cdot}}{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|^2} \]

- If \(k \geq p+1\), can always make the reconstruction error exactly 0
- Why \(k \geq p+1\)?

- If all points are in a \(q\)-dimensional linear subspace, reconstruction error could be made *exactly* 0 with only \(k \geq q+1\) neighbors (why?)
- Non-zero minimum error tells us about local curvature / nonlinearity
- For big \(n\), nearest neighbors should be close and we should be able to get close to 0

- A solve-a-linear-system problem, but a different one for each \(i\)
- In fact, another homework problem
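For one point the linear system can be sketched as follows (Python/NumPy rather than the course's R; `local_weights` is an illustrative name, and the ridge term `alpha` is one common way of handling a singular system):

```python
import numpy as np

def local_weights(xi, neighbors, alpha=1e-3):
    """Weights for reconstructing one point from its k nearest neighbors:
    minimize ||xi - sum_j w_j x_j||^2 subject to sum_j w_j = 1.
    Using the constraint, the objective equals w^T C w with local Gram
    matrix C_jm = (xi - x_j) . (xi - x_m), so the minimizer is
    proportional to C^{-1} 1; alpha is a small ridge term for when
    C is singular (e.g., k > p)."""
    d = xi - neighbors                                 # k x p differences
    C = d @ d.T                                        # k x k Gram matrix
    C = C + alpha * np.trace(C) * np.eye(len(C)) / len(C)
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                                 # enforce sum-to-one

# A point on the line through its two neighbors: error is (nearly) zero
xi = np.array([1.0, 2.0])
nbrs = np.array([[0.0, 0.0], [2.0, 4.0]])
w = local_weights(xi, nbrs)
print(np.round(w, 3))   # [0.5 0.5]
```

Because the point sits midway along the line through its neighbors, equal weights reconstruct it exactly, illustrating the zero-error case for data in a low-dimensional subspace.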

- We should be able to transfer the neighborhood structure from the old space to the new one
- Take the weights as fixed and ask for new, low-dimensional coordinates which match up with the weights: minimize \[ \Phi(\Y) = \sum_{i}{{\left\|\vec{y}_i - \sum_{j \neq i}{\vec{y}_j w_{ij}}\right\|}^2} \]

- Yet another eigen-problem
- Yet another homework problem
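The eigen-problem can likewise be sketched (Python/NumPy; `coords_from_weights` mirrors the `coords.from.weights` helper named below but is my own sketch): minimizing \(\Phi(\Y)\) under the constraints amounts to taking the bottom eigenvectors of \((\mathbf{I}-\mathbf{w})^T(\mathbf{I}-\mathbf{w})\), discarding the constant one.

```python
import numpy as np

def coords_from_weights(w, q):
    """New coordinates minimizing Phi(Y) = ||(I - W)Y||^2_F subject to
    the centering / de-correlation constraints: the eigenvectors of
    M = (I - W)^T (I - W) with the smallest eigenvalues, skipping the
    bottom one (the constant vector, which always has eigenvalue 0
    because the rows of W sum to 1)."""
    n = w.shape[0]
    iw = np.eye(n) - w
    vals, vecs = np.linalg.eigh(iw.T @ iw)   # eigenvalues ascending
    return vecs[:, 1:q + 1]                  # n x q coordinate matrix

# Demo: 6 points on a ring, each point the average of its 2 neighbors
n = 6
w = np.zeros((n, n))
for i in range(n):
    w[i, (i - 1) % n] = w[i, (i + 1) % n] = 0.5
y = coords_from_weights(w, 2)
print(y.shape)                              # (6, 2)
print(bool(np.allclose(y.sum(axis=0), 0)))  # True: columns are centered
```

Since `eigh` returns orthonormal eigenvectors, the returned coordinates automatically satisfy \(\Y^T \Y = \mathbf{I}\), and dropping the constant eigenvector enforces the centering constraint.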

```
# Local linear embedding of data vectors
# Inputs: n*p matrix of vectors, number of dimensions q to find (< p),
#   number of nearest neighbors per vector, scalar regularization setting
# Calls: find.kNNs, reconstruction.weights, coords.from.weights
# Output: n*q matrix of new coordinates
lle <- function(x, q, k = q + 1, alpha = 0.01) {
    stopifnot(q > 0, q < ncol(x), k > q, alpha > 0)  # sanity checks
    kNNs = find.kNNs(x, k)  # should return an n*k matrix of indices
    w = reconstruction.weights(x, kNNs, alpha)  # n*n weight matrix
    coords = coords.from.weights(w, q)  # n*q coordinate matrix
    return(coords)
}
```

```
spiral.lle = lle(x, 1, 2)
plot(spiral.lle, ylab = "Coordinate on manifold")
```

`all(diff(spiral.lle) > 0)`

`## [1] TRUE`

```
plot(x, col = rainbow(300, end = 5/6)[cut(spiral.lle, 300, labels = FALSE)],
     pch = 16)
```

- There are many versions of the Johnson-Lindenstrauss Theorem on random projections, varying in what they assume about the random vectors and the exact proof technique
- The version given here, for Gaussian-entry random vectors, is taken from chapter 2 of Boucheron, Lugosi, and Massart (2013); because of the Gaussianity, it can be proved using just facts about \(\chi^2\) distributions
- More general versions need more advanced probability theory, as in Boucheron, Lugosi, and Massart (2013)
- If you know what Fourier transforms are, and the Chernoff/Hoeffding bound on the probability of a random variable deviating from its expectation, then Suresh Venkatasubramanian’s 2011 explanation of the JL theorem is a good one
- (The result is often called the Johnson-Lindenstrauss “Lemma”, because it was just a lemma (subsidiary result) in their original paper)
- The JL Theorem is part of a broader family of results about using random samplings and projections to deal with high-dimensional linear algebra (e.g., linear regression for very big data sets); Mahoney (2011) is a good introduction

- Local linear embedding comes from Roweis and Saul (2000) and Saul and Roweis (2003)
- The source-code for these slides includes a complete (though not maximally efficient) implementation

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. *Concentration Inequalities: A Nonasymptotic Theory of Independence*. Oxford: Oxford University Press.

Mahoney, Michael W. 2011. “Randomized Algorithms for Matrices and Data.” *Foundations and Trends in Machine Learning* 2:123–224. https://doi.org/10.1561/2200000035.

Roweis, Sam T., and Lawrence K. Saul. 2000. “Nonlinear Dimensionality Reduction by Locally Linear Embedding.” *Science* 290:2323–6. https://doi.org/10.1126/science.290.5500.2323.

Saul, Lawrence K., and Sam T. Roweis. 2003. “Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds.” *Journal of Machine Learning Research* 4:119–55. http://jmlr.csail.mit.edu/papers/v4/saul03a.html.