---
title: Nearest Neighbors III --- Mostly Computational
output: slidy_presentation
bibliography: locusts.bib
---

## Agenda

1. Wrap-up on k selection
2. Computational costs of a naive implementation of kNN
3. Fast, approximate kNN search

# Selecting k

## A general trade-off in model selection

- Some methods are "model-selection-consistent"
    + $\equiv$ If the true model is one of the options, then the probability of selecting it $\rightarrow 1$ as $n\rightarrow \infty$; it converges on the truth
- Some methods are "predictively optimal"
    + The model they select gives predictions (almost) as good as the best possible predictions among the available models
- Dilemma: in general, model-selection-consistent methods are not predictively optimal, and vice versa
    + In fact, predictively optimal methods tend to over-fit (select too big a model _even when_ $n\rightarrow\infty$)

## For cross-validation

- LOOCV is usually predictively optimal
- v-fold CV isn't quite consistent, but it's close, and there are tweaks

## For kNN

- There just is no true number of nearest neighbors!
    + $k$ is _only_ a control setting for prediction, _not_ an aspect of reality
- We should expect LOOCV to be the best form of CV
- Recent theoretical proof of optimality [@Azadkia-leave-one-out-is-optimal-for-k-nearest-neighhor-regression]

# Computational costs of a naive implementation

## How expensive is it to run kNN?

- Need to store the _entire_ set of training data
    + Memory cost: $p$ features plus $1$ label per item, $n$ training items, cost $O(n(p+1)) = O(np)$
- Need to compare a new point to _every_ other data point
    + Cost to compute one distance: $O(p)$
    + $\therefore$ Cost to compute all distances: $O(np)$
    + Cost to find the $k$ smallest distances: $O(n)$ (regardless of $p$)
    + Time to compute the prediction from the NNs: $O(k)$
    + Total time cost: $O(np+n+k) = O(np+k)$
- Theoretical computer scientists will tell you anything polynomial in $n$ is "tractable"; don't believe them...

## Why is it $O(n)$?
- There are _two_ parts which take $O(n)$ time:
    + Computing distances to _all_ training points
    + Finding the smallest distances
- $k$ is small, $n$ is big --- _most_ of the distances we compute end up being useless
    + Do we need all the points?
    + Can we be faster about computing distances (if only approximately)?
    + Can we rule out some points as nearest neighbors, _without_ examining them?

## Using fewer data points

- **Diminishing returns**: risk shrinks as $n$ grows, but more and more slowly
    + Remember the distance to the nearest neighbor is $O(n^{-1/p})$
    + Bias will be on the order of the distance (Taylor-expand $\mu$)
    + Contribution to the risk $\propto \mathrm{bias}^2 = O(n^{-2/p})$
    + For fixed $k$, the variance contribution is $O(1)$
- If $p=10$, doubling $n$ doubles the computing cost, but the bias contribution to the risk is still $2^{-2/10} \approx 0.87$ of what it was before
- **Sampling**: pick a random subset of $m \ll n$ data points to keep computing time small, at acceptable risk
    + Needs a price at which we can trade risk against computing cost
    + Constrain either time or risk, and the Lagrange multiplier gives us the price
- Seems wasteful to collect the data and then ignore most of it at random

## Faster distance computation

- Use random projections: it takes only $O(\log{n})$ random projections to preserve distances
    + (to within a factor of $1\pm \epsilon$)
    + Time to project one vector on to one direction: $O(p)$
    + Time to project all $n$ training vectors on to all the directions: $O(np\log{n})$, but we only do this once
- We can find (approximate) nearest neighbors in time $O(n\log{n}+p\log{n}+k)$
    + $O(p\log{n})$ to project the new vector on to the $O(\log{n})$ random vectors
    + $O(n\log{n})$ to find distances between projected vectors
    + $O(n)$ to find the smallest projected distances (absorbed into $O(n\log{n})$)
    + $O(k)$ to average the $k$ nearest neighbors' responses
    + Helps with the scaling in $p$
    + Only useful if $p \gg \log{n}$, but $p$ might easily be $10^3$ while $n=10^6$ gives only $\log_{10}{n}=6$
    + Doesn't help with the scaling
in $n$

## Pre-selecting possible neighbors

- Deterministic data structures for clever searching
- Use random summaries to pre-select possible neighbors

## Data structures: $k$-d trees

- i.e., "$k$-dimensional trees"
    + (this $k$ is the number of features, not the number of neighbors)
- Build a sorting tree to categorize the data points
    + Leaves are the actual data points
    + Each internal node splits on one and only one feature
    + Nodes on the same level split on the same feature (generally)
- There are other data structures, but $k$-d trees work well
    + $k$-d trees are the default in the FNN package

## Using a $k$-d tree

- To find potential neighbors for a new point, "drop the point down the tree"
    + Start at the root
    + Go to one child or the other depending on the first feature
    + Go to a grand-child node depending on the second feature
    + Continue until there are only $k$ leaf nodes below us

## Why is a $k$-d tree fast?

- Assume the number of points we _could_ be matched to gets cut in 1/2 at each node
    + So $n$ points under the root, $n/2$ under each child of the root, etc.

EXERCISE: How many levels do we need to go down to reach $\approx k$ candidate neighbors?

## Why is a $k$-d tree fast?
- Ideally, cut the number of points in 1/2 at each node
- We go from $n$ to $n/2$ to $n/4$ to $n 2^{-d}$ after $d$ levels

SOLUTION: Set $n 2^{-d}$ to $k$ and solve:
\begin{eqnarray}
n 2^{-d} & = & k\\
\log_2{n} - d & = & \log_2{k}\\
d & = & \log_2{n/k}
\end{eqnarray}

- _This might not work_ at finding the _nearest_ neighbors
    + Nearer neighbors might be on the other side of one of these splits
    + There are tricks which will guarantee finding the nearest neighbors with the $k$-d tree
    + Using those tricks, the time complexity is still $O(\log{n})$ _on average_, but $O(n)$ in the worst case

## Building the $k$-d tree (one approach)

- Put the features in some fixed order
- At step $i$, we'll be dividing on feature $i \mod p$
- Initially, all points sit under the root node; divide at the median on feature 1
    + Finding a median takes $O(n)$ time
    + Sometimes we randomly select a fixed small set of $m \ll n$ points and take their median
- Within each child node, split on the median of the associated points
- Recurse until there is only one data point within each node; those are the leaves
- Because we've used the median, we've ensured that each child contains 1/2 of the points of its parent
- Drawbacks of the $k$-d tree:
    + We need to actually analyze the training data
    + If we get more data later, updating the tree is annoying (but possible)

## Locality-sensitive hashing

- "Hash functions": map data (e.g., vectors) to a fixed set of categories ("buckets", "bins", "slots", ...)
    + Try to ensure a uniform distribution over bins
    + Try to ensure that changes in the data result in changes in the bin
    + $\Rightarrow$ If $h(x) \neq h(y)$, then $x$ and $y$ are pretty different, with high probability
    + Usually, people want to ensure that even a _small_ change to $x$ will put it in a different bin (with high probability)
        * Used for detecting errors & tampering
        * Or for cryptography
    + Can "amplify" by using multiple hash functions to generate a vector of categories (= one bigger set of categories)
- **Locality-sensitive hashing**: try to ensure that points which _do_ end up in the same bin _are_ close to each other
- Two important LSHs:
    + The random-hyperplane hash
    + The random-inner-product hash

## The random-hyperplane hash

- Generate a random vector $\vec{V}$, of length 1 but otherwise uniformly distributed
    + One way: make a random Gaussian vector and normalize it
- $h(\vec{x}) = \operatorname{sgn}(\vec{x} \cdot \vec{V})$
- So $h(\vec{x}) = h(\vec{y})$ if and only if $\vec{x}$ and $\vec{y}$ are on the same side of the hyperplane orthogonal to $\vec{V}$
- Probability of being on _opposite_ sides of a random hyperplane $= \theta/\pi$, where $\theta =$ the angle between $\vec{x}$ and $\vec{y}$ in radians
- Probability of being on the _same_ side $= 1-\theta/\pi$
- Probability of being on the _same_ side of $q$ different random hyperplanes $= (1-\theta/\pi)^q$

## The random-hyperplane hash (cont'd)

- Make $q$ different vectors $\vec{V}_1, \ldots \vec{V}_q$ and compute $s(\vec{x}) = [h_1(\vec{x}) \ldots h_q(\vec{x})]$
- If $s(\vec{x}) = s(\vec{y})$, then $\vec{x}$ and $\vec{y}$ have a small angle between them with high probability
- If $s(\vec{x}) = s(\vec{y})$, then $\vec{x}$ and $\vec{y}$ have high cosine similarity with high probability
- To find cosine-similarity NNs for $\vec{x}$, compute $s(\vec{x})$ and then look only at vectors in that bin
    + If there are $\sqrt{n}$ bins, then each contains about $\sqrt{n}$ training points
    + Search within a bin takes $O(p\sqrt{n})$ time
    + With $q$ random hyperplanes we get $2^q$ bins, so we need $q=\frac{1}{2}\log_{2}{n}$ hyperplanes
    + Each inner product takes $O(p)$ time, so computing the extended hash $s$ takes $O(p\log{n})$ time
    + Total time: $O(p\sqrt{n}+p\log{n}+k)$
    + (Can you do better by adjusting the number of bins?)

## The random-inner-product hash

- Generate a random vector $\vec{V}$ with standard independent Gaussian entries
- Generate a random scalar $B$, uniform on $[0, r]$ for some $r$
- $h(\vec{x}) = \left\lfloor \frac{\vec{V}\cdot\vec{x} + B}{r} \right\rfloor$ (an integer)
- You can show: the probability that $h(\vec{x})=h(\vec{y})$ decreases monotonically with $\|\vec{x}-\vec{y}\|$
- Vectors which are hashed into the same bin under multiple random vectors have low distance with high probability
    + As with the random-hyperplane hash
- Finding the approximate nearest neighbor takes time $O(p n^{\rho} \log{n})$, where $\rho$ is a (calculable) constant $< 1$

## The cluster hash

- Randomly assign each data point to a bin from $1$ to $q$, giving labels $L_1, \ldots L_n$
- Compute the average of all points within each bin
    + Get vectors $\vec{c}_1, \ldots \vec{c}_q$
- Re-label points: $L_i = \operatorname{argmin}_{j \in 1:q}{\|\vec{x}_i - \vec{c}_j\|}$
- Re-compute averages, re-label, etc., until nothing changes
- To find neighbors for a new point, find the bin center it's closest to, and then look for neighbors in that bin
- If $q=\sqrt{n}$, then it takes $O(p\sqrt{n})$ to find the right bin, and $O(pn/q) = O(p\sqrt{n})$ to search for neighbors within the bin

## Some common threads to the LSH techniques

- We need to hash all the training data
- Figuring out which bin a new $\vec{x}$ belongs to (hashing it) is fast
- We don't need to keep all the data around to hash $\vec{x}$
- Randomness helps!
    + Only the cluster hash needs the training data to work out the hash function
    + Only the cluster hash needs to be revised when we get more data
- We still need to keep the training data around, but not in memory
    + $O(np)$ storage is a lot cheaper than $O(np)$ RAM

## Wrapping up

- Time (and memory) costs of straightforward kNN are linear in $n$
    + Technically "tractable", but not good when $n$ is industrial-sized
- Most of the time comes from computing distances and finding the nearest neighbors
- Common ways to find (approximate) nearest neighbors faster:
    + Use fewer training points
    + Use projections to approximate distances
    + Use search trees, hashes, or clusters to pre-select possible neighbors
- We pay some cost in risk for great savings in time

# After-notes

## After-notes

- Model selection: see @Claeskens-Hjort-model-selection
- Sampling: there are ways to _carefully_ select subsets of points which will work almost as well as the full data, but they're complicated and it's not clear how much they really improve over random sampling; references in the textbook
- $k$-d trees: due to @Bentley-introduces-k-d-trees (a very clear paper)
    + Using $k$-d trees as density estimators: see @Gershenfeld-modeling
- Locality-sensitive hashing: due to @Gionis-et-al-introduce-locality-sensitive-hashing
    + Good explanations in @Mining-Massive-Datasets-2nd
    + Random hyperplanes hash: @Charikar-rounding-locality-sensitive-hashing
    + Random projections hash: @Datar-et-al-locality-sensitive-hashing-based-on-stable-distributions
    + Clustering hash: we'll come back to this when we look at k-means clustering

## References
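
## Appendix: sketching the random-hyperplane hash

The random-hyperplane hash is short enough to sketch in full. The following is a minimal illustration in Python/NumPy (not the R/FNN code used elsewhere in the course), and all function names here are made up for the example: we draw $q$ random unit vectors, bucket the training points by the sign pattern of their inner products with those vectors, and answer a query by exact kNN within the query's bucket, falling back to brute force if that bucket happens to be empty.

```python
from collections import defaultdict

import numpy as np


def make_hyperplane_hash(p, q, rng):
    """Draw q random unit vectors in R^p; the hash of x is the tuple of
    signs of its inner products with them (one bit per hyperplane)."""
    V = rng.normal(size=(q, p))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return lambda x: tuple(bool(b) for b in (V @ x >= 0))


def lsh_knn(X_train, y_train, x_new, k, q, rng):
    """Approximate kNN regression: search for neighbors only within the
    query point's hash bucket, instead of over all n training points."""
    s = make_hyperplane_hash(X_train.shape[1], q, rng)
    # One pass over the training data to fill the buckets (done once)
    buckets = defaultdict(list)
    for i, x in enumerate(X_train):
        buckets[s(x)].append(i)
    # Fall back to brute-force search if the query's bucket is empty
    candidates = np.array(buckets.get(s(x_new), list(range(len(X_train)))))
    # Exact kNN, but only among the candidates in the bucket
    d = np.linalg.norm(X_train[candidates] - x_new, axis=1)
    nearest = candidates[np.argsort(d)[:k]]
    return y_train[nearest].mean()
```

With $q = \frac{1}{2}\log_2{n}$ hyperplanes this matches the $O(p\sqrt{n}+p\log{n}+k)$ query time derived above; the sketch leaves $q$ as a free parameter so you can experiment with the bin count.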