---
title: Nearest Neighbors III --- Mostly Computational
output: slidy_presentation
bibliography: locusts.bib
---
## Agenda
1. Wrap-up on k selection
2. Computational costs of naive implementation of kNN
3. Fast, approximate kNN search
# Selecting k
## A general trade-off to model selection
- Some methods are "model-selection-consistent"
+ $\equiv$ If the true model is one of the options, then the probability of selecting it $\rightarrow 1$ as $n\rightarrow \infty$; it converges on the truth
- Some methods are "predictively optimal"
+ the model they select gives predictions (almost) as good as the best possible prediction among the available models
- Dilemma: In general, model-selection-consistent methods are not predictively optimal and vice versa
+ In fact, predictively optimal methods tend to over-fit (select too big a model _even when_ $n\rightarrow\infty$)
## For cross-validation
- LOOCV is usually predictively optimal
- v-fold CV isn't quite consistent, but it's close and there are tweaks
## For kNN
- There just is no true number of nearest neighbors!
+ $k$ is _only_ a control setting for prediction, _not_ an aspect of reality
- We should expect LOOCV to be the best form of CV here
- Recent theoretical proof of optimality [@Azadkia-leave-one-out-is-optimal-for-k-nearest-neighhor-regression]
# Computational costs of naive implementation
## How expensive is it to run kNN?
- Need to store the _entire_ set of training data
+ Memory cost: $p$ features plus $1$ label per item, $n$ training items, cost $O(n(p+1)) = O(np)$
- Need to compare a new point to _every_ other data point
+ Cost to compute one distance: $O(p)$
+ $\therefore$ Cost to compute all distances: $O(np)$
+ Cost to find the $k$ smallest distances: $O(n)$, e.g., by quickselect (regardless of $p$)
+ Time to compute the prediction from the NNs: $O(k)$
+ Total time cost: $O(np+n+k) = O(np+k)$
- Theoretical computer scientists will tell you anything sub-exponential in $n$ is "tractable";
don't believe them...
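To make the cost accounting concrete, here is a minimal naive-kNN regression sketch. (It's in Python rather than the R we use in class, and all the names are illustrative, not any library's API.)

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Naive kNN regression for a single query point."""
    # O(np): one O(p) squared distance per training point
    dists = np.sum((X_train - x_new) ** 2, axis=1)
    # O(n): quickselect-style partial sort picks out the k smallest distances
    nn = np.argpartition(dists, k)[:k]
    # O(k): average the selected neighbors' responses
    return y_train[nn].mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)
yhat = knn_predict(X, y, np.zeros(5), k=10)
```

Every query repeats the $O(np)$ distance pass over the full training set, which is exactly the cost we'd like to avoid.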
## Why is it $O(n)$?
- There are _two_ parts which take $O(n)$ time:
+ Computing distances to _all_ training points
+ Finding the smallest distances
- $k$ is small, $n$ is big --- _most_ of the distances we compute end up being
useless
+ Do we need all the points?
+ Can we be faster about computing distances (if only approximately)?
+ Can we rule out some points as nearest neighbors, _without_ examining them?
## Using fewer data points
- **Diminishing returns**: risk shrinks as $n$ grows, but more and more slowly
+ Remember the distance to the nearest neighbor is $O(n^{-1/p})$
+ Bias will be on the order of the distance (Taylor expand $\mu$)
+ Contribution to the risk $\propto \mathrm{bias}^2 = O(n^{-2/p})$
+ For fixed $k$, variance contribution is $O(1)$
- If $p=10$, doubling $n$ doubles the computing cost, but the squared bias is still $\approx `r signif(2^(-2/10), 2)`$ of what it was before
- **Sampling**: pick a random subset of $m \ll n$ data points to keep computing
time small, at acceptable risk
+ Needs a price at which we can trade risk against computing cost
+ Constrain either time or risk, and Lagrange multiplier gives us the price
- Seems wasteful to collect the data and then ignore most of it at random
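One way to cash out the sampling idea, as a sketch (hypothetical function names; the subsample would in practice be drawn once, not per query):

```python
import numpy as np

def subsampled_knn(X, y, x_new, k, m, rng):
    """kNN regression on a random subsample of m << n points: O(mp) per query."""
    keep = rng.choice(len(X), size=m, replace=False)  # random subset of the data
    dists = np.sum((X[keep] - x_new) ** 2, axis=1)    # O(mp) instead of O(np)
    nn = np.argpartition(dists, k)[:k]                # O(m)
    return y[keep][nn].mean()                         # O(k)

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 10))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=10000)
yhat = subsampled_knn(X, y, np.zeros(10), k=5, m=500, rng=rng)
```

Here $m=500$ cuts the per-query work by a factor of 20, at the price of a larger nearest-neighbor distance and hence a larger bias.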
## Faster distance computation
- Use random projections: it takes only $O(\log{n})$ random projections to preserve distances
+ (to within a factor of $1\pm \epsilon$)
+ Time to project one vector on to one direction: $O(p)$
+ Time to project all training vectors: $O(p\log{n})$, but we only do this once
- We can find (approximate) nearest neighbors in time $O(n\log{n}+p\log{n}+k)$
+ $O(p\log{n})$ to project new vector on to the $O(\log{n})$ random vectors
+ $O(n\log{n})$ to find distances between projected vectors
+ $O(n)$ to find the smallest projected distances (absorbed into the $O(n\log{n})$)
+ $O(k)$ to average $k$ nearest neighbors' responses
+ Helps with the scaling in $p$
+ Only useful if $p \gg \log{n}$, but $p$ might easily be $10^3$ while $n=10^6$ gives $\log_{2}{n} \approx 20$
+ Doesn't help with the scaling in $n$
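A quick numerical check of the distance-preservation claim (the projection dimension $d$ here is just an illustrative choice, a comfortable multiple of $\log{n}$ to keep $\epsilon$ small):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 2000, 1000, 200
X = rng.normal(size=(n, p))
# Gaussian projection matrix, scaled so squared distances are preserved on average
R = rng.normal(size=(p, d)) / np.sqrt(d)
Z = X @ R  # one-time O(npd) projection of all training vectors

orig = np.linalg.norm(X[0] - X[1])  # distance in the original 1000-dim space
proj = np.linalg.norm(Z[0] - Z[1])  # distance between the 200-dim projections
ratio = proj / orig                 # should land within 1 +/- epsilon
```

After this one-time projection, each per-query distance costs $O(d)$ rather than $O(p)$.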
## Pre-selecting possible neighbors
- Deterministic data structures for clever searching
- Use random summaries to pre-select possible neighbors
## Data structures: $k-d$ trees
- i.e., "$k$-dimensional trees"; here $k$ is the number of features ($p$ in our notation), not the number of neighbors
- Build a sorting tree to categorize the data points
+ Leaves are the actual data points
+ Each internal node splits on one and only one feature
+ Nodes on the same level split on the same feature (generally)
- There are other data structures but $k-d$ trees work well
+ k-d trees are the default in the `FNN` package
## Using a $k-d$ tree
- To find potential neighbors for a new point, "drop the point down the tree"
+ Start at the root
+ Go to one child or the other depending on the first feature
+ Go to one grandchild node or another depending on the second feature
+ Continue until there are only $k$ leaf nodes below us
## Why is a $k-d$ tree fast?
- Assume the number of points we _could_ be matched to gets cut in 1/2 at each
node
+ So $n$ nodes under the root, $n/2$ under each child of the root, etc.
EXERCISE: How many levels do we need to go down to reach $\approx k$ candidate
neighbors?
## Why is a $k-d$ tree fast?
- Ideally, we cut the number of candidate points by 1/2 at each node
- We go from $n$ to $n/2$ to $n/4$, down to $n 2^{-d}$ after $d$ levels
SOLUTION: Set $n 2^{-d}$ to $k$ and solve:
\begin{eqnarray}
n 2^{-d} & = & k\\
\log_2{n} - d & =& \log_2{k}\\
d & = & \log_2{n/k}
\end{eqnarray}
- _This might not work_ at finding the _nearest_ neighbors
+ Nearer neighbors might be on the other side of one of these splits
+ There are tricks which will guarantee finding the nearest neighbor with the $k-d$ tree
+ Using those tricks, time complexity is still $O(\log{n})$ _on average_, but $O(n)$ in worst case
## Building the $k-d$ tree (one approach)
- Put the features in some fixed order
- At step $i$, we'll be dividing on feature $i\mod p$
- Initially, all points sit under the root node; divide at the median on feature 1
+ Finding a median takes $O(n)$ time
+ A common speed-up: randomly select a small set of $m \ll n$ points and take their median
- Within each child node, split on the median of the associated points
- Recurse until there is only one data point within each node; those are the leaves
- Because we've used the median, we've ensured that each child contains 1/2 of
the points of its parent
- Drawbacks of the $k-d$ tree:
+ We need to actually analyze the training data
+ If we get more data later, updating the tree is annoying (but possible)
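A bare-bones sketch of this construction (Python, using a full sort instead of true $O(n)$ median-finding, and without the backtracking tricks that guarantee exact nearest neighbors — purely illustrative):

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Split at the median of feature (depth mod p); leaves are the data points."""
    if len(points) <= 1:
        return points  # leaf: the actual data point
    axis = depth % points.shape[1]
    points = points[np.argsort(points[:, axis])]  # a real build finds the median in O(n)
    mid = len(points) // 2
    return {"axis": axis, "split": points[mid, axis],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid:], depth + 1)}

def drop_down(tree, x):
    """Drop x down the tree; the leaf holds candidate (approximate) neighbors."""
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["axis"]] < tree["split"] else tree["right"]
    return tree

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 3))
leaf = drop_down(build_kdtree(X), np.zeros(3))
```

With 64 points and median splits, the query reaches a single-point leaf after $\log_2{64} = 6$ comparisons.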
## Locality-sensitive hashing
- "Hash functions": map data (e.g., vectors) to fixed set of categories
("buckets", "bins", "slots", ...)
+ Try to ensure a uniform distribution over bins
+ Try to ensure that changes in the data result in changes in the bin
+ $\Rightarrow$ If $h(x) = h(y)$, then $x$ and $y$ are, with high probability, identical (or nearly so)
+ Usually: people want to ensure that even a _small_ change to $x$ will put it in a different bin (with high probability)
* Used for detecting errors & tampering
* Or for cryptography
+ Can "amplify" by using multiple hash functions to generate a vector of categories (= one bigger set of categories)
- **Locality-sensitive hashing**: Try to ensure that points which _do_ end up in the same bin _are_ close to each other
- Two important LSH's:
+ The random-hyperplane hash
+ The random-inner-product hash
## The random-hyperplane hash
- Generate a random vector $\vec{V}$, length 1 but otherwise uniformly distributed
+ One way: Make a random Gaussian vector and normalize
- $h(\vec{x}) = \mathrm{sgn}{(\vec{x} \cdot \vec{V})}$
- So $h(\vec{x}) = h(\vec{y})$ if and only if $\vec{x}$ and $\vec{y}$ are on
the same side of the hyperplane orthogonal to $\vec{V}$
- Probability of being on _opposite_ sides of a random plane $= \theta/\pi$
where $\theta=$ angle between $\vec{x}$ and $\vec{y}$ in radians
- Probability of being on the _same_ side $=1-\theta/\pi$
- Probability of being on the _same_ side of $q$ different random
hyperplanes = $(1-\theta/\pi)^q$
## The random-hyperplane hash (cont'd)
- Make $q$ different vectors $\vec{V}_1, \ldots \vec{V}_q$ and compute
$s(\vec{x}) = [h_1(\vec{x}) \ldots h_q(\vec{x})]$
- If $s(\vec{x}) = s(\vec{y})$, then $\vec{x}$ and $\vec{y}$ have small angle
between them with high probability
- If $s(\vec{x}) = s(\vec{y})$, then $\vec{x}$ and $\vec{y}$ have high cosine
similarity with high probability
- To find cosine-similarity NN's for $\vec{x}$, compute $s(\vec{x})$ and then look only
at vectors in that bin
+ If there are $\sqrt{n}$ bins then each contains about $\sqrt{n}$ training points
+ Search within a bin takes $O(p\sqrt{n})$ time
+ With $q$ random hyper-planes we get $2^q$ bins so we need $q=\frac{1}{2}\log_{2}{n}$ hyper-planes
+ Each inner product takes $O(p)$ time so computing the extended hash
$s$ takes $O(p\log{n})$ time
+ Total time: $O(p\sqrt{n}+p\log{n}+k)$
+ (Can you do better by adjusting the number of bins?)
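The signature is just the signs of $q$ inner products; a toy sketch (the names and the choice $q=16$ are illustrative):

```python
import numpy as np

def signature(x, V):
    """q-bit hash: which side of each of q random hyperplanes x falls on."""
    return (V @ x > 0).astype(int)

rng = np.random.default_rng(4)
p, q = 50, 16
V = rng.normal(size=(q, p))         # rows are random directions; scale doesn't change signs
x = rng.normal(size=p)
y = x + 0.01 * rng.normal(size=p)   # a tiny angle away from x
z = rng.normal(size=p)              # an unrelated vector
agree_xy = (signature(x, V) == signature(y, V)).mean()  # near 1 for small angles
agree_xz = (signature(x, V) == signature(z, V)).mean()  # about 1/2 on average
```

Each signature costs $O(pq)$ to compute, and equal signatures send us to the same bucket of candidate neighbors.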
## The random-inner-product hash
- Generate a random vector $\vec{V}$ with standard independent Gaussian entries
- Generate a random scalar $B$ uniform on $[0, r]$ for some $r$
- $h(\vec{x}) = \left\lfloor \frac{\vec{V}\cdot\vec{x} + B}{r} \right\rfloor$
(an integer)
- You can show: the probability that $h(\vec{x})=h(\vec{y})$ decreases monotonically with $\|\vec{x}-\vec{y}\|$
- Vectors which land in the same bin under several such hash functions are, with high probability, close together
+ As with the random-hyperplanes hash
- Finding the approximate nearest neighbor takes time $O(p n^{\rho} \log{n})$,
where $\rho$ is a (calculable) constant $< 1$
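A sketch of one such hash function (the names and the bin width $r$ are illustrative choices, not a tuned setting):

```python
import numpy as np

def rip_hash(x, V, B, r):
    """Bin index floor((V.x + B)/r): nearby x's tend to share an integer bin."""
    return int(np.floor((V @ x + B) / r))

rng = np.random.default_rng(5)
p, r = 20, 8.0
V = rng.normal(size=p)              # standard independent Gaussian entries
B = rng.uniform(0, r)               # random offset, so bin boundaries are random too
x = rng.normal(size=p)
y = x + 0.01 * rng.normal(size=p)   # very close to x
bx, by = rip_hash(x, V, B, r), rip_hash(y, V, B, r)
```

Since $|\vec{V}\cdot(\vec{x}-\vec{y})|$ is far smaller than $r$ here, the two points land in the same bin (or, at worst, adjacent ones).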
## The cluster hash
- Randomly assign each data point to a bin from $1$ to $q$, giving labels $L_1, \ldots, L_n$
- Compute the average of all points within each bin
+ Get vectors $\vec{c}_1, \ldots \vec{c}_q$
- Re-label points: $L_i = \operatorname{argmin}_{j \in 1:q}{\|\vec{x}_i - \vec{c}_j\|}$
- Re-compute averages, re-label, etc., until nothing changes
- To find neighbors for a new point, find the bin center it's closest to,
and then look for neighbors in that bin
- If $q=\sqrt{n}$ then it takes $O(p\sqrt{n})$ to find the right bin, and
$O(pn/q) = O(p\sqrt{n})$ to search for neighbors within the bin
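This is exactly the k-means iteration we'll meet again later; a small sketch (hypothetical helper names, fixed iteration count, and a crude guard for bins that empty out):

```python
import numpy as np

def cluster_hash(X, q, n_iter, rng):
    """Alternate: average the points in each bin, re-label each point to its nearest center."""
    labels = rng.integers(q, size=len(X))  # random initial bin labels
    for _ in range(n_iter):
        # bin averages; if a bin empties, re-seed it with a random data point
        centers = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else X[rng.integers(len(X))] for j in range(q)])
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)         # re-label to the nearest center
    return centers, labels

def nearest_bin(centers, x):
    """O(pq) scan for the closest bin center; then search only that bin."""
    return int(((centers - x) ** 2).sum(axis=1).argmin())

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
centers, labels = cluster_hash(X, q=4, n_iter=5, rng=rng)
```

A new point is routed with `nearest_bin` and compared only to the $\approx n/q$ points sharing its label.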
## Some common threads to the LSH techniques
- We need to hash all the training data
- Figuring out which bin a new $\vec{x}$ belongs to (hashing it) is fast
- We don't need to keep all the data around to hash $\vec{x}$
- Randomness helps!
+ Only the cluster hash needs the training data to work out the hash function
+ Only the cluster hash needs to be revised when we get more data
- We still need to keep the training data around, but not in memory
+ $O(np)$ storage is a lot cheaper than $O(np)$ RAM
## Wrapping up
- Time (and memory) costs of straightforward kNN are linear in $n$
+ Technically "tractable" but not good when $n$ is industrial sized
- Most of the time comes from computing distances and finding the nearest
neighbors
- Common ways to find (approximate) nearest neighbors faster:
+ Use fewer training points
+ Use projections to approximate distances
+ Use search trees, hashes or clusters to pre-select possible neighbors
- We pay some cost in risk for great savings in time
# After-notes
## After-notes
- Model selection: See @Claeskens-Hjort-model-selection
- Sampling: There are ways to _carefully_ select subsets of points which will
work almost as well as the full data, but they're complicated and it's not
clear how much they really improve over random sampling; references in the
textbook
- k-d trees: Due to @Bentley-introduces-k-d-trees (a very clear paper)
+ Using k-d trees as density estimators: see @Gershenfeld-modeling
- Locality-sensitive hashing: Due to @Gionis-et-al-introduce-locality-sensitive-hashing
+ Good explanations in @Mining-Massive-Datasets-2nd
+ Random hyperplanes hash: @Charikar-rounding-locality-sensitive-hashing
+ Random projections hash: @Datar-et-al-locality-sensitive-hashing-based-on-stable-distributions
+ Clustering hash: we'll come back to this when we look at k-means clustering
## References