36-462/662, Fall 2019

18 September 2019

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\Y}{\mathbf{y}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

- We have \(n\) items represented as \(p\)-dimensional feature vectors; \(p\) is too big
- PCA: Find the best \(q\)-dimensional linear approximation to the data
- \(\equiv\) find the \(q\)-directions of maximum variance
- \(\equiv\) find the top \(q\) eigenvectors of the variance matrix
- Then project the \(p\)-dimensional feature vectors onto those eigenvectors
- Scores = lengths of the projections = new, \(q\)-dimensional features
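The recipe above is short enough to sketch directly. Here is a minimal NumPy version (Python rather than the course's R; `pca_scores` is an illustrative helper, not course code): center, eigendecompose the variance matrix, project onto the top \(q\) eigenvectors.

```python
import numpy as np

def pca_scores(x, q):
    """Scores on the top q principal components: center the data,
    take the top q eigenvectors of the variance matrix, and project."""
    xc = x - x.mean(axis=0)            # center each feature
    v = np.cov(xc, rowvar=False)       # p x p sample variance matrix
    vals, vecs = np.linalg.eigh(v)     # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :q]         # p x q matrix of top-q eigenvectors
    return xc @ top                    # n x q matrix of scores

rng = np.random.default_rng(36462)
x = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features
scores = pca_scores(x, 2)
print(scores.shape)  # (100, 2)
```

The score columns come out uncorrelated, with variances in decreasing order, as they should for principal components.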

- Factor analysis: Suppose the data lie on a \(q\)-dimensional linear subspace, plus noise
- Find the \(q\)-dimensional subspace which best preserves the correlations
- Backtrack to coordinates in the subspace (factor scores)
- Factor scores are the new, \(q\)-dimensional features

- Drawbacks:
- This is a lot of computational work!
- Who says linear subspaces are so great?

- PCA: we project from \(\vec{x}_i\) to \(\mathbf{u}\vec{x}_i\)
- \(\mathbf{u} =\) projection operator for the principal components

- PCA: tries to make the projected points close to the original points, so \[ \frac{1}{n}\sum_{i=1}^{n}{\| \vec{x}_i - \mathbf{u}\vec{x}_i\|^2} \] is minimized
- This means distances between projected points track distances between original points
- Triangle inequality

- Suppose we just want to preserve distances
- \(\mathbf{u}\) would be **distance-preserving**, or an **isometry**, if \[ \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 = \|\vec{x}_i - \vec{x}_j\|^2 \]
- *Exact* distance-preservation is a lot to ask
- Unless \(p > n\)

- Approximate distance preservation: you pick an \(\epsilon > 0\); \(\mathbf{u}\) is an **\(\epsilon\)-isometry** when \[ (1-\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \leq \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 \leq (1+\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \]
- A remarkable math fact: if \(q = O(\frac{\log{n}}{\epsilon^2})\), then there is *always* an \(\epsilon\)-isometry

- Make \(Z_{ij} \sim \mathcal{N}(0,1)\) independent standard Gaussians
- Set \[ \mathbf{U} = \frac{1}{\sqrt{q}}\left[\begin{array}{ccc} Z_{11} & Z_{12} & \ldots & Z_{1p}\\ Z_{21} & Z_{22} & \ldots & Z_{2p}\\ \vdots & \vdots & \ldots & \vdots\\ Z_{q1} & Z_{q2} & \ldots & Z_{qp}\end{array}\right] \]
- \(\mathbf{U}\vec{x}_i\) is like projecting \(\vec{x}_i\) onto \(q\) *random* \(p\)-dimensional vectors
- What is the expected squared length of those projections?

- Now you can check that for *any* vector \(\vec{x}\), \[ \Expect{\|\mathbf{U}\vec{x}\|^2} = \|\vec{x}\|^2 \]
- Hint: \(\Expect{\|\mathbf{U}\vec{x}\|^2} = \Expect{\sum_{i=1}^{q}{\left(\frac{1}{\sqrt{q}}\sum_{j=1}^{p}{Z_{ij} x_j}\right)^2}}\) and \(\|\vec{x}\|^2 = \sum_{j=1}^{p}{x_j^2}\)

Expectation is linear and the \(Z_{ij}\) are uncorrelated so \[ \Expect{\|\mathbf{U}\vec{x}\|^2} = \frac{1}{q}\sum_{i=1}^{q}{\sum_{j=1}^{p}{x_j^2}} = \|\vec{x}\|^2 \]

**Random projections are distance-preserving on average**
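A quick simulation bears this out. A sketch in Python/NumPy (the course's own code is in R), averaging \(\|\mathbf{U}\vec{x}\|^2 / \|\vec{x}\|^2\) over many draws of the random projection matrix:

```python
import numpy as np

rng = np.random.default_rng(18)
p, q, reps = 50, 10, 5000
x = rng.normal(size=p)        # any fixed vector will do

# Average ||Ux||^2 / ||x||^2 over many draws of the projection matrix
ratios = np.empty(reps)
for r in range(reps):
    U = rng.normal(size=(q, p)) / np.sqrt(q)   # entries Z_ij / sqrt(q)
    ratios[r] = np.sum((U @ x) ** 2) / np.sum(x ** 2)

print(abs(ratios.mean() - 1) < 0.05)   # True: the mean ratio is close to 1
```

Each ratio is a \(\chi^2_q/q\) random variable, so the average converges to 1 as the number of draws grows.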

- Fix your favorite \(\epsilon > 0\) (approximation level) and \(\delta > 0\) (error probability)
- **Johnson-Lindenstrauss theorem**: With probability at least \(1-\delta\), \(\mathbf{U}\) is an \(\epsilon\)-isometry for our \(n\) data points if \[ q \geq \frac{8}{\epsilon^2}\log{\frac{n}{\sqrt{\delta}}} \]
- E.g., for 95% confidence, need \(q \geq \frac{8}{\epsilon^2}\left(\log{(n)} + 1.498\ldots \right)\)
- For 95% confidence with \(n=205\) and \(\epsilon=0.2\), we need \(q \geq 1365\)
- If we had \(n=2\times 10^6\), we’d need \(q \geq 3202\)
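The bound is easy to evaluate. A small Python sketch (`jl_q` is an illustrative name) reproduces the numbers above:

```python
import numpy as np

def jl_q(n, eps, delta=0.05):
    """Smallest q guaranteed to give an eps-isometry for n points
    with probability at least 1 - delta, per the JL bound."""
    return int(np.ceil((8 / eps**2) * np.log(n / np.sqrt(delta))))

print(jl_q(205, 0.2))    # 1365
print(jl_q(2e6, 0.2))    # 3202
```

Note that \(n\) enters only through its logarithm, which is why going from \(205\) to \(2\times 10^6\) points barely doubles the required \(q\).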

- The remarkable bits: \(p\) doesn’t matter and neither does the data
- For the homework data, \(p=4.725\times 10^{5}\)

- You can also turn this around to see what \(\epsilon\) you can achieve with a fixed budget of \(q\)
- With \(q = 100\), \(n=205\), \(\delta=0.05\), we get \(\epsilon \approx 0.74\), which isn’t great but is *some* signal
- Again, independent of \(p\)

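Turning the bound around is a one-liner; a sketch in Python (`jl_eps` is an illustrative name) inverting the formula for \(\epsilon\):

```python
import numpy as np

def jl_eps(n, q, delta=0.05):
    """Distortion level achievable with a budget of q dimensions:
    invert q = (8/eps^2) log(n/sqrt(delta)) for eps."""
    return float(np.sqrt((8 / q) * np.log(n / np.sqrt(delta))))

print(round(jl_eps(205, 100), 2))   # 0.74
```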

- Some nonlinear data (a nice spiral):

```
x = matrix(c(exp(-0.2 * (-(1:300)/10)) * cos(-(1:300)/10),
             exp(-0.2 * (-(1:300)/10)) * sin(-(1:300)/10)), ncol = 2)
plot(x)
```

```
fit.all = prcomp(x)
approx.all = fit.all$x[, 1] %*% t(fit.all$rotation[, 1])
plot(x, xlab = expression(x[1]), ylab = expression(x[2]))
points(approx.all, pch = 4)
```

- PCA and FA generally do badly when the data are curves (or curved surfaces)
- Geometry fact: *locally*, every smooth curved surface looks linear; a set we get by piecing these together is called a “manifold”
- **Manifold learning** = the data are on or near a manifold; find that manifold
- Factor analysis is a kind of linear manifold learning

```
fit = prcomp(x[270:280, ])
pca.approx = sweep(fit$x[, 1] %*% t(fit$rotation[, 1]), 2, colMeans(x[270:280, ]), "+")  # add the column means back to each row
plot(rbind(x[270:280, ], pca.approx), type = "n", xlab = expression(x[1]), ylab = expression(x[2]))
points(x[270:280, ])
points(pca.approx, pch = 4)
```

- Write each data point as a linear combination of near-by data points
- Look for lower-dimensional coordinates which fit those linear combinations

- Inputs: \(\X\) \([n \times p]\), \(q < p\), \(k \geq q+1\)
- Output: \(\Y\) \([n \times q]\)

- For each \(\vec{x}_i\), find the \(k\) nearest neighbors.
- Find optimal weights for reconstructing \(\vec{x}_i\) from its neighbors, i.e., minimize \[\begin{equation} MSE(\mathbf{w}) \equiv \frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|}^2} \end{equation}\] with \(w_{ij} = 0\) unless \(j\) is a nearest neighbor of \(i\), and \(\sum_{j}{w_{ij}} = 1\)
- Find coordinates \(\Y\) which minimize the reconstruction error, \[\begin{equation} \Phi(\mathbf{Y}) \equiv \sum_{i=1}^{n}{{\|\vec{y}_i - \sum_{j\neq i}{w_{ij} \vec{y}_j}\|}^2} \end{equation}\] with constraints \(\sum_{i}{y_{ij}} = 0\) and \(\Y^T \Y = \mathbf{I}\) (centered, de-correlated)

- We know how to do this

\[ \min_{\w}{\frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|}^2}} \]

- Can find weights separately for each data point \(\vec{x}_i\)
- i.e., independent computation of each row \(\w_{i\cdot}\) so \[ \min_{\w_{i\cdot}}{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|^2} \]

- If \(k \geq p+1\), can always make the reconstruction error exactly 0
- Why \(k \geq p+1\)?

- If all points are in a \(q\)-dimensional linear subspace, reconstruction error could be made *exactly* 0 with only \(k \geq q+1\) neighbors (why?)
- Non-zero minimum error tells us about local curvature / nonlinearity
- For big \(n\), nearest neighbors should be close and we should be able to get close to 0

- A solve-a-linear-system problem, but a different one for each \(i\)
- In fact, another homework problem
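For one point the linear system can be sketched as follows (Python/NumPy rather than the course's R; `local_weights` is an illustrative name, and the ridge term `alpha` is one common way of handling a singular system):

```python
import numpy as np

def local_weights(xi, neighbors, alpha=1e-3):
    """Weights for reconstructing one point from its k nearest neighbors:
    minimize ||xi - sum_j w_j x_j||^2 subject to sum_j w_j = 1.
    Using the constraint, the objective equals w^T C w with local Gram
    matrix C_jm = (xi - x_j) . (xi - x_m), so the minimizer is
    proportional to C^{-1} 1; alpha is a small ridge term for when
    C is singular (e.g., k > p)."""
    d = xi - neighbors                                 # k x p differences
    C = d @ d.T                                        # k x k Gram matrix
    C = C + alpha * np.trace(C) * np.eye(len(C)) / len(C)
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                                 # enforce sum-to-one

# A point on the line through its two neighbors: error is (nearly) zero
xi = np.array([1.0, 2.0])
nbrs = np.array([[0.0, 0.0], [2.0, 4.0]])
w = local_weights(xi, nbrs)
print(np.round(w, 3))   # [0.5 0.5]
```

Because the point sits midway along the line through its neighbors, equal weights reconstruct it exactly, illustrating the zero-error case for data in a low-dimensional subspace.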

- We should be able to transfer the neighborhood structure from the old space to the new one
- Take the weights as fixed and ask for new, low-dimensional coordinates which match up with the weights: minimize \[ \Phi(\Y) = \sum_{i}{{\left\|\vec{y}_i - \sum_{j \neq i}{\vec{y}_j w_{ij}}\right\|}^2} \]

- Yet another eigen-problem
- Yet another homework problem
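The eigen-problem can likewise be sketched (Python/NumPy; `coords_from_weights` mirrors the `coords.from.weights` helper named below but is my own sketch): minimizing \(\Phi(\Y)\) under the constraints amounts to taking the bottom eigenvectors of \((\mathbf{I}-\mathbf{w})^T(\mathbf{I}-\mathbf{w})\), discarding the constant one.

```python
import numpy as np

def coords_from_weights(w, q):
    """New coordinates minimizing Phi(Y) = ||(I - W)Y||^2_F subject to
    the centering / de-correlation constraints: the eigenvectors of
    M = (I - W)^T (I - W) with the smallest eigenvalues, skipping the
    bottom one (the constant vector, which always has eigenvalue 0
    because the rows of W sum to 1)."""
    n = w.shape[0]
    iw = np.eye(n) - w
    vals, vecs = np.linalg.eigh(iw.T @ iw)   # eigenvalues ascending
    return vecs[:, 1:q + 1]                  # n x q coordinate matrix

# Demo: 6 points on a ring, each point the average of its 2 neighbors
n = 6
w = np.zeros((n, n))
for i in range(n):
    w[i, (i - 1) % n] = w[i, (i + 1) % n] = 0.5
y = coords_from_weights(w, 2)
print(y.shape)                              # (6, 2)
print(bool(np.allclose(y.sum(axis=0), 0)))  # True: columns are centered
```

Since `eigh` returns orthonormal eigenvectors, the returned coordinates automatically satisfy \(\Y^T \Y = \mathbf{I}\), and dropping the constant eigenvector enforces the centering constraint.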

```
# Local linear embedding of data vectors
# Inputs: n*p matrix of vectors, number of dimensions q to find (< p),
#   number of nearest neighbors per vector, scalar regularization setting
# Calls: find.kNNs, reconstruction.weights, coords.from.weights
# Output: n*q matrix of new coordinates
lle <- function(x, q, k = q + 1, alpha = 0.01) {
    stopifnot(q > 0, q < ncol(x), k > q, alpha > 0)  # sanity checks
    kNNs = find.kNNs(x, k)  # should return an n*k matrix of indices
    w = reconstruction.weights(x, kNNs, alpha)  # n*n weight matrix
    coords = coords.from.weights(w, q)  # n*q coordinate matrix
    return(coords)
}
```

```
spiral.lle = lle(x, 1, 2)
plot(spiral.lle, ylab = "Coordinate on manifold")
```

`all(diff(spiral.lle) > 0)`

`## [1] TRUE`

```
plot(x, col = rainbow(300, end = 5/6)[cut(spiral.lle, 300, labels = FALSE)],
     pch = 16)
```

- There are many versions of the Johnson-Lindenstrauss Theorem on random projections, varying in what they assume about the random vectors and the exact proof technique
- The version given here, for Gaussian-entry random vectors, is taken from chapter 2 of Boucheron, Lugosi, and Massart (2013); because of the Gaussianity, it can be proved using just facts about \(\chi^2\) distributions
- More general versions need more advanced probability theory, as in Boucheron, Lugosi, and Massart (2013)
- If you know what Fourier transforms are, and the Chernoff/Hoeffding bound on the probability of a random variable deviating from its expectation, then Suresh Venkatasubramanian’s 2011 explanation of the JL theorem is a good one
- (The result is often called the Johnson-Lindenstrauss “Lemma”, because it was just a lemma (subsidiary result) in their original paper)
- The JL Theorem is part of a broader family of results about using random samplings and projections to deal with high-dimensional linear algebra (e.g., linear regression for very big data sets); Mahoney (2011) is a good introduction

- Local linear embedding comes from Roweis and Saul (2000) and Saul and Roweis (2003)
- The source-code for these slides includes a complete (though not maximally efficient) implementation

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. *Concentration Inequalities: A Nonasymptotic Theory of Independence*. Oxford: Oxford University Press.

Mahoney, Michael W. 2011. “Randomized Algorithms for Matrices and Data.” *Foundations and Trends in Machine Learning* 2:123–224. https://doi.org/10.1561/2200000035.

Roweis, Sam T., and Lawrence K. Saul. 2000. “Nonlinear Dimensionality Reduction by Locally Linear Embedding.” *Science* 290:2323–6. https://doi.org/10.1126/science.290.5500.2323.

Saul, Lawrence K., and Sam T. Roweis. 2003. “Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds.” *Journal of Machine Learning Research* 4:119–55. http://jmlr.csail.mit.edu/papers/v4/saul03a.html.