---
title: Dimension Reduction III --- Random Linear Projections and Locally Linear Embeddings
author: 36-462/662, Fall 2019
date: 18 September 2019
output: slidy_presentation
bibliography: locusts.bib
---

```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white",
               highlight=FALSE, cache=TRUE, autodep=TRUE,
               tidy=TRUE, warning=FALSE, message=FALSE)
```

$\newcommand{\X}{\mathbf{x}}
\newcommand{\Y}{\mathbf{y}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}$

## Recap

- We have $n$ items represented as $p$-dimensional feature vectors; $p$ is too big
- PCA: Find the best $q$-dimensional linear approximation to the data
    + $\equiv$ find the $q$ directions of maximum variance
    + $\equiv$ find the top $q$ eigenvectors of the variance matrix
    + Then project the $p$-dimensional feature vectors on to those eigenvectors
    + Scores = lengths of the projections = new, $q$-dimensional features
- Factor analysis: Suppose the data are a $q$-dimensional linear subspace plus noise
    + Find the $q$-dimensional subspace which best preserves the correlations
    + Backtrack to coordinates in the subspace (factor scores)
    + Factor scores are the new, $q$-dimensional features
- Drawbacks:
    + This is a lot of computational work!
    + Who says linear subspaces are so great?
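As a concrete reminder of the recap above, here is a small sketch (not part of the original slides; the seed and toy data are arbitrary) of how the PCA scores from `prcomp` are exactly the projections of the centered data on the top $q$ eigenvectors:

```{r}
# Toy illustration: PCA scores = projections on top q eigenvectors
set.seed(462)                      # arbitrary seed, for reproducibility
x <- matrix(rnorm(100*5), ncol=5)  # n=100 items, p=5 features
x[,4] <- x[,1] + 0.1*rnorm(100)    # induce some linear structure
x[,5] <- x[,2] + 0.1*rnorm(100)
fit <- prcomp(x)
q <- 2
scores <- fit$x[, 1:q]             # n*q matrix of new features
# Equivalently: project the centered data on the top q eigenvectors
scores.by.hand <- scale(x, center=TRUE, scale=FALSE) %*% fit$rotation[, 1:q]
all.equal(scores, scores.by.hand, check.attributes=FALSE)  # TRUE
```

(`prcomp` centers the data by default, which is why the by-hand version centers, too.)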
## PCA and distance preservation

- PCA: we project from $\vec{x}_i$ to $\mathbf{u}\vec{x}_i$
    + $\mathbf{u} =$ projection operator for the principal components
- PCA tries to make the projected points close to the original points, i.e., it minimizes
$\frac{1}{n}\sum_{i=1}^{n}{\| \vec{x}_i - \mathbf{u}\vec{x}_i\|^2}$
- This means distances between projected points track distances between original points
    + Triangle inequality

## Distance-preserving projections

- Suppose we just want to preserve distances
- $\mathbf{u}$ would be **distance preserving** or an **isometry** if
$\|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 = \|\vec{x}_i - \vec{x}_j\|^2$
- _Exact_ distance-preservation is a lot to ask
    + Unless $p > n$
- Approximate distance preservation: you pick an $\epsilon > 0$; $\mathbf{u}$ is an **$\epsilon$-isometry** when
$(1-\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \leq \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 \leq (1+\epsilon) \|\vec{x}_i - \vec{x}_j\|^2$
- A remarkable math fact: if $q = O(\frac{\log{n}}{\epsilon^2})$, then there is _always_ an $\epsilon$-isometry

## The random projection trick

- Make $Z_{ij} \sim \mathcal{N}(0,1)$, independent standard Gaussians
- Set
$\mathbf{U} = \frac{1}{\sqrt{q}}\left[\begin{array}{cccc} Z_{11} & Z_{12} & \ldots & Z_{1p}\\ Z_{21} & Z_{22} & \ldots & Z_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ Z_{q1} & Z_{q2} & \ldots & Z_{qp}\end{array}\right]$
- $\mathbf{U}\vec{x}_i$ is like projecting on to $q$ _random_ $p$-dimensional vectors
    + What is the expected squared length of those vectors?
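Before doing the algebra, we can check the key property numerically; this simulation is a sketch, not part of the original slides, and the seed and dimensions are arbitrary:

```{r}
# Numerical check: averaging ||Ux||^2 over many random draws of U
# should approach ||x||^2
set.seed(36462)                              # arbitrary seed
p <- 50; q <- 5
x <- rnorm(p)                                # an arbitrary fixed vector
sq.proj.length <- replicate(1e4, {
    U <- matrix(rnorm(q*p), nrow=q)/sqrt(q)  # random projection matrix
    sum((U %*% x)^2)
})
mean(sq.proj.length)   # should be close to the next number
sum(x^2)
```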
- Now you can check that for _any_ vector $\vec{x}$,
$\Expect{\|\mathbf{U}\vec{x}\|^2} = \|\vec{x}\|^2$
- Hint: $\Expect{\|\mathbf{U}\vec{x}\|^2} = \Expect{\sum_{i=1}^{q}{\left(\frac{1}{\sqrt{q}}\sum_{j=1}^{p}{Z_{ij} x_j}\right)^2}}$ and $\|\vec{x}\|^2 = \sum_{j=1}^{p}{x_j^2}$

## The random projection trick

- Expectation is linear and the $Z_{ij}$ are uncorrelated, so
$\Expect{\|\mathbf{U}\vec{x}\|^2} = \frac{1}{q}\sum_{i=1}^{q}{\sum_{j=1}^{p}{x_j^2}} = \|\vec{x}\|^2$
- **Random projections are distance-preserving on average**

## Random projections are nearly distance-preserving with high probability

- Fix your favorite $\epsilon > 0$ (approximation level) and $\delta > 0$ (error probability)
- **Johnson-Lindenstrauss theorem**: With probability at least $1-\delta$, $\mathbf{U}$ is an $\epsilon$-isometry for our $n$ data points if
$q \geq \frac{8}{\epsilon^2}\log{\frac{n}{\sqrt{\delta}}}$
    + E.g., for 95% confidence, need $q \geq \frac{8}{\epsilon^2}\left(\log{(n)} + `r signif(log(1/sqrt(0.05)),4)`\ldots \right)$
    + For 95% confidence with $n=205$ and $\epsilon=0.2$, we need $q \geq$ `r ceiling((8/((0.2)^2))*log(205/sqrt(0.05)))`
    + If we had $n=2\times 10^6$, we'd need $q \geq$ `r ceiling((8/((0.2)^2))*log(2e6/sqrt(0.05)))`
- The remarkable bits: $p$ doesn't matter, and neither does the data
    + For the homework data, $p =$ `r 350*450*3`
- You can also turn this around, to see what $\epsilon$ you can achieve with a fixed budget of $q$
    + With $q = 100$, $n=205$, $\delta=0.05$, we get $\epsilon \approx$ `r signif(sqrt((8/100)*log(205/sqrt(0.05))),2)`, which isn't great, but is _some_ signal
    + Again, independent of $p$

## The problem with nonlinear structure

- Some nonlinear data (a nice spiral):

```{r}
x = matrix(c(exp(-0.2*(-(1:300)/10))*cos(-(1:300)/10),
             exp(-0.2*(-(1:300)/10))*sin(-(1:300)/10)),
           ncol=2)
plot(x)
```

## PCA just fails here

```{r}
fit.all = prcomp(x)
approx.all = fit.all$x[,1] %*% t(fit.all$rotation[,1])
plot(x, xlab=expression(x[1]), ylab=expression(x[2]))
points(approx.all, pch=4)
```

## Manifold learning

- PCA and FA generally do
badly when the data are curves (or curved surfaces)
- Geometry fact: _locally_, every smooth curved surface looks linear; a set we get by piecing these together is called a "manifold"
- **Manifold learning** = the data are on or near a manifold; find that manifold
    + Factor analysis is a kind of linear manifold learning

## PCA doesn't do too badly around any small part of the curve

```{r}
fit = prcomp(x[270:280,])
pca.approx = fit$x[,1] %*% t(fit$rotation[,1]) + colMeans(x[270:280,])
plot(rbind(x[270:280,], pca.approx), type="n",
     xlab=expression(x[1]), ylab=expression(x[2]))
points(x[270:280,])
points(pca.approx, pch=4)
```

## Local(ly) Linear Embedding

- Write each data point as a linear combination of near-by data points
- Look for lower-dimensional coordinates which fit those linear combinations

## LLE

- Inputs: $\X$ $[n \times p]$, $q < p$, $k \geq q+1$
- Output: $\Y$ $[n \times q]$

1. For each $\vec{x}_i$, find the $k$ nearest neighbors.
2. Find optimal weights for reconstructing $\vec{x}_i$ from its neighbors, i.e., minimize
$$MSE(\mathbf{w}) \equiv \frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|}^2}$$
with $w_{ij} = 0$ unless $j$ is a nearest neighbor of $i$, and $\sum_{j}{w_{ij}} = 1$
3. Find coordinates $\Y$ which minimize the reconstruction error,
$$\Phi(\mathbf{Y}) \equiv \sum_{i=1}^{n}{{\|\vec{y}_i - \sum_{j\neq i}{w_{ij} \vec{y}_j}\|}^2}$$
with constraints $\sum_{i}{y_{ij}} = 0$ and $\Y^T \Y = \mathbf{I}$ (centered, de-correlated)

## Finding neighbors in $p$-dimensional space

- We know how to do this

## Finding weights

$\min_{\w}{\frac{1}{n}\sum_{i=1}^{n}{{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|}^2}}$

- Can find weights separately for each data point $\vec{x}_i$
    + i.e., an independent computation for each row $\w_{i\cdot}$, so
$\min_{\w_{i\cdot}}{\| \vec{x}_i - \sum_{j \neq i}{w_{ij} \vec{x}_j}\|^2}$
- If $k \geq p+1$, can always make the reconstruction error exactly 0
    + Why $k \geq p+1$?
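One way to see the answer: $p+1$ generic neighbors affinely span all of $\mathbb{R}^p$, so the focal point is always an exact affine combination of them. A quick numerical sketch (not in the original slides; the seed and dimensions are arbitrary):

```{r}
# With k = p+1 neighbors in general position, we can solve exactly for
# weights which sum to 1 and reconstruct the focal point perfectly
set.seed(36462)                                # arbitrary seed
p <- 3
xi <- rnorm(p)                                 # the focal point
neighbors <- matrix(rnorm((p+1)*p), nrow=p+1)  # k = p+1 random neighbors
# Stack the neighbor coordinates over a row of 1s (the sum-to-1 constraint)
A <- rbind(t(neighbors), rep(1, p+1))          # (p+1)*(p+1) linear system
w <- solve(A, c(xi, 1))                        # exact solution for weights
max(abs(xi - t(neighbors) %*% w))              # error: 0 up to round-off
sum(w)                                         # weights sum to 1
```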
- If all points are in a $q$-dimensional linear subspace, reconstruction error could be made _exactly_ 0 with only $k \geq q+1$ neighbors (why?)
- Non-zero minimum error tells us about local curvature / nonlinearity
- For big $n$, nearest neighbors should be close, and we should be able to get close to 0

## Finding the weights

- A solve-a-linear-system problem, but a different one for each $i$
- In fact, another homework problem

## Finding the new coordinates

- We should be able to transfer the neighborhood structure from the old space to the new one
- Take the weights as fixed, and ask for new, low-dimensional coordinates which match up with the weights: minimize
$\Phi(\Y) = \sum_{i}{{\left\|\vec{y}_i - \sum_{j \neq i}{\vec{y}_j w_{ij}}\right\|}^2}$
- Yet another eigen-problem
- Yet another homework problem

## Implementation

```{r}
# Local linear embedding of data vectors
# Inputs: n*p matrix of vectors, number of dimensions q to find (< p),
  # number of nearest neighbors per vector, scalar regularization setting
# Calls: find.kNNs, reconstruction.weights, coords.from.weights
# Output: n*q matrix of new coordinates
lle <- function(x, q, k=q+1, alpha=0.01) {
  stopifnot(q > 0, q < ncol(x), k > q, alpha > 0) # sanity checks
  kNNs = find.kNNs(x, k) # should return an n*k matrix of indices
  w = reconstruction.weights(x, kNNs, alpha) # n*n weight matrix
  coords = coords.from.weights(w, q) # n*q coordinate matrix
  return(coords)
}
```

```{r, echo=FALSE}
# Find multiple nearest neighbors in a data frame
# Inputs: n*p matrix of data vectors, number of neighbors to find,
  # optional arguments to dist function
# Calls: smallest.by.rows
# Output: n*k matrix of the indices of nearest neighbors
find.kNNs <- function(x, k, ...) {
  x.distances = dist(x, ...) # Uses the built-in distance function
  x.distances = as.matrix(x.distances) # need to make it a matrix
  kNNs = smallest.by.rows(x.distances, k+1) # see text for +1
  return(kNNs[, -1]) # see text for -1
}

# Find the k smallest entries in each row of an array
# Inputs: n*p array, p >= k, number of smallest entries to find
# Output: n*k array of column indices for smallest entries per row
smallest.by.rows <- function(m, k) {
  stopifnot(ncol(m) >= k) # Otherwise "k smallest" is meaningless
  row.orders = t(apply(m, 1, order))
  k.smallest = row.orders[, 1:k]
  return(k.smallest)
}
```

```{r, echo=FALSE}
# Least-squares weights for linear approx. of data from neighbors
# Inputs: n*p matrix of vectors, n*k matrix of neighbor indices,
  # scalar regularization setting
# Calls: local.weights
# Outputs: n*n matrix of weights
reconstruction.weights <- function(x, neighbors, alpha) {
  stopifnot(is.matrix(x), is.matrix(neighbors), alpha > 0)
  n = nrow(x)
  stopifnot(nrow(neighbors) == n)
  w = matrix(0, nrow=n, ncol=n)
  for (i in 1:n) {
    i.neighbors = neighbors[i, ]
    w[i, i.neighbors] = local.weights(x[i, ], x[i.neighbors, ], alpha)
  }
  return(w)
}
```

```{r, echo=FALSE}
# Calculate local reconstruction weights from vectors
# Inputs: focal vector (1*p matrix), k*p matrix of neighbors,
  # scalar regularization setting
# Outputs: length k vector of weights, summing to 1
local.weights <- function(focal, neighbors, alpha) {
  # basic matrix-shape sanity checks
  stopifnot(nrow(focal) == 1, ncol(focal) == ncol(neighbors))
  # Should really sanity-check the rest (is.numeric, etc.)
  k = nrow(neighbors)
  # Center on the focal vector
  neighbors = t(t(neighbors) - focal) # exploits recycling rule, which
    # has a weird preference for columns
  gram = neighbors %*% t(neighbors)
  # Try to solve the problem without regularization
  weights = try(solve(gram, rep(1, k)))
  # The try() function tries to evaluate its argument and returns
    # the value if successful; otherwise it returns an error
    # message of class "try-error"
  if (identical(class(weights), "try-error")) {
    # Un-regularized solution failed, try to regularize
    # ATTN: It'd be better to examine the error, and check
      # if it's something regularization could fix, before
      # trying a regularized version of the problem
    weights = solve(gram + alpha*diag(k), rep(1, k))
  }
  # Enforce the unit-sum constraint
  weights = weights/sum(weights)
  return(weights)
}
```

```{r, echo=FALSE}
# Get approximation weights from indices of point and neighbors
# Inputs: index of focal point, n*p matrix of vectors, n*k matrix
  # of nearest neighbor indices, scalar regularization setting
# Calls: local.weights
# Output: vector of n reconstruction weights
local.weights.for.index <- function(focal, x, NNs, alpha) {
  n = nrow(x)
  stopifnot(n > 0, 0 < focal, focal <= n, nrow(NNs) == n)
  w = rep(0, n)
  neighbors = NNs[focal, ]
  wts = local.weights(x[focal, ], x[neighbors, ], alpha)
  w[neighbors] = wts
  return(w)
}
```

```{r, echo=FALSE}
# Local linear approximation weights, without iteration
# Inputs: n*p matrix of vectors, n*k matrix of neighbor indices,
  # scalar regularization setting
# Calls: local.weights.for.index
# Outputs: n*n matrix of reconstruction weights
reconstruction.weights.2 <- function(x, neighbors, alpha) {
  # Sanity-checking should go here
  n = nrow(x)
  w = sapply(1:n, local.weights.for.index, x=x, NNs=neighbors, alpha=alpha)
  w = t(w) # sapply returns the transpose of the matrix we want
  return(w)
}
```

```{r, echo=FALSE}
# Find intrinsic coordinates from local linear approximation weights
# Inputs: n*n matrix of weights, number of dimensions q, numerical
  # tolerance for checking the row-sum constraint on the weights
# Output: n*q matrix of new coordinates on the manifold
coords.from.weights <- function(w, q, tol=1e-7) {
  n = nrow(w)
  stopifnot(ncol(w) == n) # Needs to be square
  # Check that the weights are normalized,
    # to within tol > 0 to handle round-off error
  stopifnot(all(abs(rowSums(w) - 1) < tol))
  # Make the Laplacian
  M = t(diag(n) - w) %*% (diag(n) - w) # diag(n) is the n*n identity matrix
  soln = eigen(M) # eigenvalues and eigenvectors (here,
    # eigenfunctions), in order of decreasing eigenvalue
  coords = soln$vectors[, ((n-q):(n-1))] # bottom eigenfunctions,
    # except for the trivial one
  return(coords)
}
```

##

```{r}
spiral.lle = lle(x, 1, 2)
plot(spiral.lle, ylab="Coordinate on manifold")
all(diff(spiral.lle) > 0)
```

##

```{r}
plot(x, col=rainbow(300, end=5/6)[cut(spiral.lle, 300, labels=FALSE)],
     pch=16)
```

## Afternotes

- There are many versions of the Johnson-Lindenstrauss theorem on random projections, varying in what they assume about the random vectors and in the exact proof technique
    + The version given here, for Gaussian-entry random vectors, is taken from chapter 2 of @Boucheron-Lugosi-Massart-concentration; because of the Gaussianity, it can be proved using just facts about $\chi^2$ distributions
    + More general versions need more advanced probability theory, as in @Boucheron-Lugosi-Massart-concentration
    + If you know what Fourier transforms are, and the Chernoff/Hoeffding bound on the probability of a random variable deviating from its expectation, then Suresh Venkatasubramanian gave a [good explanation of the JL theorem in 2011](http://blog.geomblog.org/2011/11/intuitive-argument-for-jl-lemma.html)
    + (The result is often called the Johnson-Lindenstrauss "Lemma", because it was just a lemma (subsidiary result) in their original paper)
    + The JL theorem is part of a broader family of results about using random samplings and projections to deal with high-dimensional linear algebra (e.g., linear regression for very big data sets);
@Mahoney-randomized-for-matrices-and-data is a good introduction
- Local linear embedding comes from @Roweis-Saul-locally-linear-embedding and @Saul-Roweis-think-globally
    + The source code for these slides includes a complete (though not maximally efficient) implementation

## References