---
title: Dimension Reduction III --- Random Linear Projections and Locally Linear Embeddings
author: 36-462/662, Fall 2019
date: 18 September 2019
output: slidy_presentation
bibliography: locusts.bib
---
```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white", highlight=FALSE, cache=TRUE,
               autodep=TRUE, tidy=TRUE, warning=FALSE, message=FALSE)
```
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\Y}{\mathbf{y}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\]
## Recap
- We have $n$ items represented as $p$-dimensional feature vectors; $p$ is too big
- PCA: Find the best $q$-dimensional linear approximation to the data
    + $\equiv$ find the $q$ directions of maximum variance
    + $\equiv$ find the top $q$ eigenvectors of the variance matrix
    + Then project the $p$-dimensional feature vectors on to those eigenvectors
    + Scores = lengths of the projections = new, $q$-dimensional features
- Factor analysis: Suppose the data are a $q$-dimensional linear subspace plus noise
    + Find the $q$-dimensional subspace which best preserves the correlations
    + Backtrack to coordinates in the subspace (factor scores)
    + Factor scores are the new, $q$-dimensional features
- Drawbacks:
    + This is a lot of computational work!
    + Who says linear subspaces are so great?
## PCA and distance preservation
- PCA: we project from $\vec{x}_i$ to $\mathbf{u}\vec{x}_i$
    + $\mathbf{u} =$ projection operator for the principal components
- PCA tries to make the projected points close to the original points, i.e., it minimizes
\[
\frac{1}{n}\sum_{i=1}^{n}{\| \vec{x}_i - \mathbf{u}\vec{x}_i\|^2}
\]
- This means distances between projected points track distances between original points
    + Triangle inequality: $\|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|$ differs from $\|\vec{x}_i - \vec{x}_j\|$ by at most $\|\vec{x}_i - \mathbf{u}\vec{x}_i\| + \|\vec{x}_j - \mathbf{u}\vec{x}_j\|$
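A quick simulation check of this (a sketch with made-up Gaussian data; the dimensions and seed here are arbitrary): projecting on to the principal components is an orthogonal projection, so it can only shrink inter-point distances.

```{r}
# Sketch: check that projection on the top q PCs never stretches
# inter-point distances (orthogonal projections are contractions)
set.seed(462)  # arbitrary seed, for reproducibility
sim <- matrix(rnorm(50*5), nrow=50)  # 50 made-up points in 5 dimensions
fit <- prcomp(sim)
q <- 2
# Reconstruct each point from its first q scores (in centered coordinates)
approx.sim <- fit$x[,1:q] %*% t(fit$rotation[,1:q])
# Centering doesn't change pairwise distances, so compare directly
all(dist(approx.sim) <= dist(sim) + 1e-12)  # TRUE
```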
## Distance-preserving projections
- Suppose we just want to preserve distances
- $\mathbf{u}$ would be **distance preserving** or an **isometry** if
\[
\|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 = \|\vec{x}_i - \vec{x}_j\|^2
\]
- _Exact_ distance-preservation is a lot to ask
    + Unless $p > n$
- Approximate distance preservation: you pick an $\epsilon > 0$; $\mathbf{u}$ is
an **$\epsilon$-isometry** when
\[
(1-\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \leq \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 \leq (1+\epsilon) \|\vec{x}_i - \vec{x}_j\|^2
\]
- A remarkable math fact: If $q = O(\frac{\log{n}}{\epsilon^2})$ then there is _always_ an $\epsilon$-isometry
## The random projection trick
- Make $Z_{ij} \sim \mathcal{N}(0,1)$ independent standard Gaussians
- Set
\[
\mathbf{U} = \frac{1}{\sqrt{q}}\left[\begin{array}{ccc} Z_{11} & Z_{12} & \ldots & Z_{1p}\\
Z_{21} & Z_{22} & \ldots & Z_{2p}\\
\vdots & \vdots & \ldots & \vdots\\
Z_{q1} & Z_{q2} & \ldots & Z_{qp}\end{array}\right]
\]
- $\mathbf{U}\vec{x}_i$ is like projecting $\vec{x}_i$ on to $q$ _random_ $p$-dimensional vectors
    + What is the expected squared length of those vectors?
- Now you can check that for _any_ vector $\vec{x}$,
\[
\Expect{\|\mathbf{U}\vec{x}\|^2} = \|\vec{x}\|^2
\]
- Hint: $\Expect{\|\mathbf{U}\vec{x}\|^2} = \Expect{\sum_{i=1}^{q}{\left(\frac{1}{\sqrt{q}}\sum_{j=1}^{p}{Z_{ij} x_j}\right)^2}}$ and $\|\vec{x}\|^2 = \sum_{j=1}^{p}{x_j^2}$
## The random projection trick
- Expectation is linear and the $Z_{ij}$ are uncorrelated so
\[
\Expect{\|\mathbf{U}\vec{x}\|^2} = \frac{1}{q}\sum_{i=1}^{q}{\sum_{j=1}^{p}{x_j^2}} = \|\vec{x}\|^2
\]
- **Random projections are distance-preserving on average**
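This identity is easy to check by simulation (a sketch; the particular vector, dimensions, and seed are arbitrary):

```{r}
# Sketch: Monte Carlo check that E[||U x||^2] = ||x||^2
set.seed(462)
p <- 30; q <- 5
x.test <- rnorm(p)  # an arbitrary, fixed vector
sq.lengths <- replicate(1e4, {
  U <- matrix(rnorm(q*p), nrow=q)/sqrt(q)  # a fresh random projection
  sum((U %*% x.test)^2)  # squared length of the projected vector
})
mean(sq.lengths)/sum(x.test^2)  # should be very close to 1
```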
## Random projections are nearly distance-preserving with high probability
- Fix your favorite $\epsilon > 0$ (approximation level) and $\delta > 0$ (error probability)
- **Johnson-Lindenstrauss theorem**: With probability at least $1-\delta$, $\mathbf{U}$ is an $\epsilon$-isometry for our $n$ data points if
\[
q \geq \frac{8}{\epsilon^2}\log{\frac{n}{\sqrt{\delta}}}
\]
    + E.g., for 95% confidence, need $q \geq \frac{8}{\epsilon^2}\left(\log{(n)} + `r signif(log(1/sqrt(0.05)),4)`\ldots \right)$
    + For 95% confidence with $n=205$ and $\epsilon=0.2$, we need $q \geq `r ceiling((8/((0.2)^2))*log(205/sqrt(0.05)))`$
    + If we had $n=2\times 10^6$, we'd need $q \geq `r ceiling((8/((0.2)^2))*log(2e6/sqrt(0.05)))`$
- The remarkable bits: $p$ doesn't matter, and neither do the data
    + For the homework data, $p=`r 350*450*3`$
- You can also turn this around to see what $\epsilon$ you can achieve with a fixed budget of $q$
    + With $q = 100$, $n=205$, $\delta=0.05$, we get $\epsilon \approx `r signif(sqrt((8/100)*log(205/sqrt(0.05))),2)`$, which isn't great but is _some_ signal
    + Again, independent of $p$
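Empirically (a sketch with simulated data; the specific $n$, $p$, $q$ are arbitrary), a single random draw of $\mathbf{U}$ tends to have small worst-case distortion:

```{r}
# Sketch: worst-case distortion of squared pairwise distances under
# a single random Gaussian projection
set.seed(462)
n <- 100; p <- 1000; q <- 500
data.hd <- matrix(rnorm(n*p), nrow=n)  # made-up high-dimensional data
U <- matrix(rnorm(q*p), nrow=q)/sqrt(q)
projected <- data.hd %*% t(U)  # n*q matrix of projected points
# Empirical epsilon: worst relative error in squared distances over all pairs
max(abs(as.vector(dist(projected))^2/as.vector(dist(data.hd))^2 - 1))
```

The theorem gives a worst-case guarantee over the random draw; a typical single draw usually distorts less than the bound suggests.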
## The problem with nonlinear structure
- Some nonlinear data (a nice spiral):
```{r}
x = matrix(c(exp(-0.2*(-(1:300)/10))*cos(-(1:300)/10),
             exp(-0.2*(-(1:300)/10))*sin(-(1:300)/10)),
           ncol=2)
plot(x)
```
## PCA just fails here
```{r}
fit.all = prcomp(x)
approx.all=fit.all$x[,1]%*%t(fit.all$rotation[,1])
plot(x,xlab=expression(x[1]),ylab=expression(x[2]))
points(approx.all,pch=4)
```
## Manifold learning
- PCA and FA generally do badly when the data fall on curves (or curved surfaces)
- Geometry fact: _locally_, every smooth curved surface looks linear; a set we get by piecing such local patches together is called a "manifold"
- **Manifold learning** = assume the data are on or near a manifold; find that manifold
    + Factor analysis is a kind of linear manifold learning
## PCA doesn't do too badly around any small part of the curve
```{r}
fit = prcomp(x[270:280,])
pca.approx = fit$x[,1]%*%t(fit$rotation[,1])+colMeans(x[270:280,])
plot(rbind(x[270:280,],pca.approx), type="n",
     xlab=expression(x[1]), ylab=expression(x[2]))
points(x[270:280,])
points(pca.approx,pch=4)
```
## Local(ly) Linear Embedding
- Write each data point as a linear combination of near-by data points
- Look for lower-dimensional coordinates which fit those linear combinations
## LLE
- Inputs: $\X$ $[n \times p]$, $q < p$, $k \geq q+1$
- Output: $\Y$ $[n \times q]$
1. For each $\vec{x}_i$, find the $k$ nearest neighbors.
2. Find optimal weights for reconstructing $\vec{x}_i$ from its neighbors,
i.e., minimize
\begin{equation}
MSE(\mathbf{w}) \equiv \frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij}
\vec{x}_j}\|}^2}
\end{equation}
with $w_{ij} = 0$ unless $j$ is a nearest neighbor of $i$, and $\sum_{j}{w_{ij}} = 1$
3. Find coordinates $\Y$ which minimize the reconstruction error,
\begin{equation}
\Phi(\mathbf{Y}) \equiv \sum_{i=1}^{n}{{\|\vec{y}_i - \sum_{j\neq i}{w_{ij}
\vec{y}_j}\|}^2}
\end{equation}
with constraints $\sum_{i}{y_{ij}} = 0$ and $\Y^T \Y = \mathbf{I}$ (centered, de-correlated)
## Finding neighbors in $p$-dimensional space
- We know how to do this
## Finding weights
\[
\min_{\w}{\frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|}^2}}
\]
- We can find the weights separately for each data point $\vec{x}_i$
    + i.e., an independent computation for each row $\w_{i\cdot}$:
\[
\min_{\w_{i\cdot}}{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|^2}
\]
- If $k \geq p+1$, we can always make the reconstruction error exactly 0
    + Why $k \geq p+1$?
- If all points are in a $q$-dimensional linear subspace, the reconstruction error could be made _exactly_ 0 with only $k \geq q+1$ neighbors (why?)
- A non-zero minimum error tells us about local curvature / nonlinearity
- For big $n$, the nearest neighbors should be close by, and we should be able to get the error close to 0
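For reference, this constrained least-squares problem has a closed form (a sketch of the derivation; a regularized version of exactly this linear system appears in the implementation later). Because the weights sum to 1,
\[
\vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j} = \sum_{j \neq i}{w_{ij}\left(\vec{x}_i - \vec{x}_j\right)}
\]
so the squared error is $\w_{i\cdot}^T \mathbf{G}_i \w_{i\cdot}$, where $(\mathbf{G}_i)_{jl} = (\vec{x}_i - \vec{x}_j)\cdot(\vec{x}_i - \vec{x}_l)$ is the Gram matrix of the neighbors centered on $\vec{x}_i$. A Lagrange multiplier for the constraint $\sum_{j}{w_{ij}} = 1$ gives $\mathbf{G}_i \w_{i\cdot} \propto \mathbf{1}$: solve $\mathbf{G}_i \w_{i\cdot} = \mathbf{1}$, then rescale the solution to sum to 1.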
## Finding the weights
- A solve-a-linear-system problem, but a different one for each $i$
- In fact, another homework problem
## Finding the new coordinates
- We should be able to transfer the neighborhood structure from the old space to the new one
- Take the weights as fixed and ask for new, low-dimensional coordinates which
match up with the weights: minimize
\[
\Phi(\Y) = \sum_{i}{{\left\|\vec{y}_i - \sum_{j \neq i}{\vec{y}_j
w_{ij}}\right\|}^2}
\]
- Yet another eigen-problem
- Yet another homework problem
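In matrix form (a sketch of why this turns into an eigen-problem): the $i$th row of $(\mathbf{I}-\w)\Y$ is $\vec{y}_i - \sum_{j}{w_{ij}\vec{y}_j}$, so
\[
\Phi(\Y) = \tr\left(\Y^T (\mathbf{I}-\w)^T (\mathbf{I}-\w) \Y\right) \equiv \tr\left(\Y^T \mathbf{M} \Y\right)
\]
Under the constraint $\Y^T \Y = \mathbf{I}$, this is minimized by taking the columns of $\Y$ to be the eigenvectors of $\mathbf{M}$ with the smallest eigenvalues; because the rows of $\w$ sum to 1, the constant vector is a trivial eigenvector with eigenvalue 0, and dropping it enforces the centering constraint.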
## Implementation
```{r}
# Local linear embedding of data vectors
# Inputs: n*p matrix of vectors, number of dimensions q to find (< p),
# number of nearest neighbors per vector, scalar regularization setting
# Calls: find.kNNs, reconstruction.weights, coords.from.weights
# Output: n*q matrix of new coordinates
lle <- function(x,q,k=q+1,alpha=0.01) {
  stopifnot(q > 0, q < ncol(x), k > q, alpha > 0) # sanity checks
  kNNs = find.kNNs(x,k) # should return an n*k matrix of indices
  w = reconstruction.weights(x,kNNs,alpha) # n*n weight matrix
  coords = coords.from.weights(w,q) # n*q coordinate matrix
  return(coords)
}
```
```{r, echo=FALSE}
# Find multiple nearest neighbors in a data frame
# Inputs: n*p matrix of data vectors, number of neighbors to find,
# optional arguments to dist function
# Calls: smallest.by.rows
# Output: n*k matrix of the indices of nearest neighbors
find.kNNs <- function(x,k,...) {
  x.distances = dist(x,...) # Uses the built-in distance function
  x.distances = as.matrix(x.distances) # need to make it a matrix
  kNNs = smallest.by.rows(x.distances,k+1) # see text for +1
  return(kNNs[,-1]) # see text for -1
}

# Find the k smallest entries in each row of an array
# Inputs: n*p array, p >= k, number of smallest entries to find
# Output: n*k array of column indices for smallest entries per row
smallest.by.rows <- function(m,k) {
  stopifnot(ncol(m) >= k) # Otherwise "k smallest" is meaningless
  row.orders = t(apply(m,1,order))
  k.smallest = row.orders[,1:k]
  return(k.smallest)
}
```
```{r, echo=FALSE}
# Least-squares weights for linear approx. of data from neighbors
# Inputs: n*p matrix of vectors, n*k matrix of neighbor indices,
# scalar regularization setting
# Calls: local.weights
# Outputs: n*n matrix of weights
reconstruction.weights <- function(x,neighbors,alpha) {
  stopifnot(is.matrix(x), is.matrix(neighbors), alpha > 0)
  n = nrow(x)
  stopifnot(nrow(neighbors) == n)
  w = matrix(0,nrow=n,ncol=n)
  for (i in 1:n) {
    i.neighbors = neighbors[i,]
    w[i,i.neighbors] = local.weights(x[i,],x[i.neighbors,],alpha)
  }
  return(w)
}
```
```{r, echo=FALSE}
# Calculate local reconstruction weights from vectors
# Inputs: focal vector (1*p matrix), k*p matrix of neighbors,
# scalar regularization setting
# Outputs: length k vector of weights, summing to 1
local.weights <- function(focal,neighbors,alpha) {
  # basic matrix-shape sanity checks
  stopifnot(nrow(focal)==1, ncol(focal)==ncol(neighbors))
  # Should really sanity-check the rest (is.numeric, etc.)
  k = nrow(neighbors)
  # Center on the focal vector
  neighbors = t(t(neighbors)-focal) # exploits the recycling rule,
                                    # which has a weird preference for columns
  gram = neighbors %*% t(neighbors)
  # Try to solve the problem without regularization
  weights = try(solve(gram,rep(1,k)))
  # The try() function tries to evaluate its argument and returns
  # the value if successful; otherwise it returns an error
  # message of class "try-error"
  if (identical(class(weights),"try-error")) {
    # The un-regularized solution failed, so try to regularize
    # ATTN: It'd be better to examine the error, and check whether it's
    # something regularization could fix, before trying a regularized
    # version of the problem
    weights = solve(gram+alpha*diag(k),rep(1,k))
  }
  # Enforce the unit-sum constraint
  weights = weights/sum(weights)
  return(weights)
}
```
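A standalone check of the weight calculation (a sketch that redoes the same algebra inline rather than calling `local.weights`; the points are made up): a focal point midway between its two neighbors should get equal weights, and here the regularization is genuinely needed, since the $2 \times 2$ Gram matrix has rank 1.

```{r}
# Sketch: weights for a focal point that is the midpoint of its
# two neighbors; the exact answer is (1/2, 1/2)
focal <- c(0.5, 0.5)
neighbors <- rbind(c(0,0), c(1,1))
z <- t(t(neighbors) - focal)  # center the neighbors on the focal point
gram <- z %*% t(z)  # rank 1 here, so the plain solve would fail
wts <- solve(gram + 0.01*diag(2), rep(1,2))  # regularized solve
wts <- wts/sum(wts)  # enforce the unit-sum constraint
wts  # approximately (0.5, 0.5)
```

In `local.weights` above, the same fallback to the regularized system happens automatically when the un-regularized `solve()` fails.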
```{r, echo=FALSE}
# Get approximation weights from indices of point and neighbors
# Inputs: index of focal point, n*p matrix of vectors, n*k matrix
# of nearest neighbor indices, scalar regularization setting
# Calls: local.weights
# Output: vector of n reconstruction weights
local.weights.for.index <- function(focal,x,NNs,alpha) {
  n = nrow(x)
  stopifnot(n > 0, 0 < focal, focal <= n, nrow(NNs) == n)
  w = rep(0,n)
  neighbors = NNs[focal,]
  wts = local.weights(x[focal,],x[neighbors,],alpha)
  w[neighbors] = wts
  return(w)
}
```
```{r, echo=FALSE}
# Local linear approximation weights, without iteration
# Inputs: n*p matrix of vectors, n*k matrix of neighbor indices,
# scalar regularization setting
# Calls: local.weights.for.index
# Outputs: n*n matrix of reconstruction weights
reconstruction.weights.2 <- function(x,neighbors,alpha) {
  # Sanity-checking should go here
  n = nrow(x)
  w = sapply(1:n, local.weights.for.index, x=x, NNs=neighbors,
             alpha=alpha)
  w = t(w) # sapply returns the transpose of the matrix we want
  return(w)
}
```
```{r, echo=FALSE}
# Find intrinsic coordinates from local linear approximation weights
# Inputs: n*n matrix of weights, number of dimensions q, numerical
# tolerance for checking the row-sum constraint on the weights
# Output: n*q matrix of new coordinates on the manifold
coords.from.weights <- function(w,q,tol=1e-7) {
  n = nrow(w)
  stopifnot(ncol(w)==n) # Needs to be square
  # Check that the weights are normalized,
  # to within tol > 0 to handle round-off error
  stopifnot(all(abs(rowSums(w)-1) < tol))
  # Make the Laplacian
  M = t(diag(n)-w) %*% (diag(n)-w) # diag(n) is the n*n identity matrix
  soln = eigen(M) # eigenvalues and eigenvectors (here, eigenfunctions),
                  # in order of decreasing eigenvalue
  coords = soln$vectors[,((n-q):(n-1))] # bottom eigenfunctions,
                                        # except for the trivial one
  return(coords)
}
```
##
```{r}
spiral.lle = lle(x,1,2)
plot(spiral.lle,ylab="Coordinate on manifold")
all(diff(spiral.lle) > 0)
```
##
```{r}
plot(x,col=rainbow(300,end=5/6)[cut(spiral.lle,300,labels=FALSE)], pch=16)
```
## Afternotes
- There are many versions of the Johnson-Lindenstrauss Theorem on random projections, varying in what they assume about the random vectors and the exact proof technique
    + The version given here, for Gaussian-entry random vectors, is taken from chapter 2 of @Boucheron-Lugosi-Massart-concentration; because of the Gaussianity, it can be proved using just facts about $\chi^2$ distributions
    + More general versions need more advanced probability theory, as in @Boucheron-Lugosi-Massart-concentration
    + If you know what Fourier transforms are, and the Chernoff/Hoeffding bound on the probability of a random variable deviating from its expectation, then Suresh Venkatasubramanian gave a [good explanation of the JL theorem in 2011](http://blog.geomblog.org/2011/11/intuitive-argument-for-jl-lemma.html)
    + (The result is often called the Johnson-Lindenstrauss "Lemma", because it was just a lemma, i.e., a subsidiary result, in their original paper)
    + The JL theorem is part of a broader family of results about using random sampling and projections to deal with high-dimensional linear algebra (e.g., linear regression for very big data sets); @Mahoney-randomized-for-matrices-and-data is a good introduction
- Local linear embedding comes from @Roweis-Saul-locally-linear-embedding and @Saul-Roweis-think-globally
    + The source code for these slides includes a complete (though not maximally efficient) implementation
## References