# Agenda

1. Wrap-up on k selection
2. Computational costs of naive implementation of kNN
3. Fast, approximate kNN search

# A general trade-off in model selection

• Some methods are “model-selection-consistent”
• $$\equiv$$ If the true model is one of the options, then the probability of selecting it $$\rightarrow 1$$ as $$n\rightarrow \infty$$; it converges on the truth
• Some methods are “predictively optimal”
• the model they select gives predictions (almost) as good as the best possible prediction among the available models
• Dilemma: In general, model-selection-consistent methods are not predictively optimal and vice versa
• In fact, predictively optimal methods tend to over-fit (select too big a model even when $$n\rightarrow\infty$$)

# For cross-validation

• LOOCV is usually predictively optimal
• v-fold CV isn’t quite consistent, but it’s close and there are tweaks

# For kNN

• There just is no true number of nearest neighbors!
• $$k$$ is only a control setting for prediction, not an aspect of reality
• We should expect LOOCV to be the best form of CV
• Recent theoretical proof of optimality (Azadkia 2019)
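
As a minimal sketch of picking $$k$$ by LOOCV: the code below uses a one-dimensional feature, synthetic data, and an arbitrary grid of candidate $$k$$ values. All of these are assumptions made for illustration, and the helper names `knn_predict` and `loocv_mse` are hypothetical.

```python
import random

def knn_predict(train_x, train_y, query, k):
    """Average the responses of the k training points nearest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return sum(train_y[i] for i in order[:k]) / k

def loocv_mse(x, y, k):
    """Leave-one-out estimate of mean squared prediction error for a given k."""
    total = 0.0
    for i in range(len(x)):
        # Hold out point i and predict it from all the others
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        total += (y[i] - knn_predict(rest_x, rest_y, x[i], k)) ** 2
    return total / len(x)

# Synthetic toy data (an assumption): smooth curve plus Gaussian noise
random.seed(1)
x = [random.uniform(0, 1) for _ in range(200)]
y = [4 * xi * (1 - xi) + random.gauss(0, 0.3) for xi in x]

# Score each candidate k and keep the one with the smallest LOOCV error
scores = {k: loocv_mse(x, y, k) for k in (1, 2, 5, 10, 20, 50)}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

Here `best_k` is just the setting that predicts best on held-out points, in keeping with the view that $$k$$ is a control setting rather than an aspect of reality.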

# How expensive is it to run kNN?

• Need to store the entire set of training data
• Memory cost: $$p$$ features plus $$1$$ label per item, $$n$$ training items, cost $$O(n(p+1)) = O(np)$$
• Need to compare a new point to every other data point
• Cost to compute one distance: $$O(p)$$
• $$\therefore$$ Cost to compute all distances: $$O(np)$$
• Cost to find the $$k$$ smallest distances: $$O(n)$$ (regardless of $$p$$)
• Time to compute the prediction from the NNs: $$O(k)$$
• Total time cost: $$O(np+n+k) = O(np+k)$$, i.e., $$O(np)$$ since $$k \leq n$$
• Theoretical computer scientists will tell you anything sub-exponential in $$n$$ is “tractable”; don’t believe them…
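
The cost accounting above can be read off a naive implementation. This is a sketch only: the function name `knn_brute_force` and the toy data are assumptions, and `heapq.nsmallest` takes $$O(n\log{k})$$ time rather than the $$O(n)$$ a selection algorithm would achieve.

```python
import heapq
import math

def knn_brute_force(train_x, train_y, query, k):
    """Naive kNN regression: compare the query to every training point."""
    # O(np): one O(p) Euclidean distance for each of the n training points
    dists = [math.dist(row, query) for row in train_x]
    # Find the indices of the k smallest distances (O(n log k) with a heap)
    nearest = heapq.nsmallest(k, range(len(dists)), key=dists.__getitem__)
    # O(k): average the neighbors' responses
    prediction = sum(train_y[i] for i in nearest) / k
    return prediction, nearest

# Toy data (an assumption): three points near the origin and one far away
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]
pred, idx = knn_brute_force(train_x, train_y, (0.1, 0.1), k=3)
print(pred, sorted(idx))
```

Note that the whole of `train_x` and `train_y` has to be kept around, which is the $$O(np)$$ memory cost.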

# Why is it $$O(n)$$?

• There are two parts which take $$O(n)$$ time:
• Computing distances to all training points
• Finding the smallest distances
• $$k$$ is small, $$n$$ is big — most of the distances we compute end up being useless
• Do we need all the points?
• Can we be faster about computing distances (if only approximately)?
• Can we rule out some points as nearest neighbors, without examining them?

# Using fewer data points

• Diminishing returns: risk shrinks as $$n$$ grows, but more and more slowly
• Remember the distance to the nearest neighbor is $$O(n^{-1/p})$$
• Bias will be on the order of the distance (Taylor expand $$\mu$$)
• Contribution to the risk $$\propto \mathrm{bias}^2 = O(n^{-2/p})$$
• For fixed $$k$$, variance contribution is $$O(1)$$
• If $$p=10$$, doubling $$n$$ doubles computing cost, but squared bias is still $$2^{-2/10} \approx 0.87$$ of what it was before (and bias itself $$2^{-1/10} \approx 0.93$$)
• Sampling: pick a random subset of $$m \ll n$$ data points to keep computing time small, at acceptable risk
• Needs a price at which we can trade risk against computing cost
• Constrain either time or risk, and the Lagrange multiplier gives us the price
• Seems wasteful to collect the data and then ignore most of it at random
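
To see the diminishing returns numerically, here is a small sketch that takes the $$O(n^{-2/p})$$ scaling of squared bias at face value and ignores constants. The function name `squared_bias_ratio` is hypothetical.

```python
def squared_bias_ratio(p, factor=2.0):
    """Ratio of new to old squared bias when n is multiplied by `factor`,
    assuming squared bias scales exactly as n^(-2/p)."""
    return factor ** (-2.0 / p)

# Doubling n helps a lot in low dimensions and very little in high dimensions
for p in (1, 2, 10, 100):
    print(p, round(squared_bias_ratio(p), 3))

# Read in reverse: keeping a random half of the data (factor = 0.5) inflates
# squared bias by only about 15% when p = 10, while halving computing cost
inflation = squared_bias_ratio(10, factor=0.5)
print(round(inflation, 3))
```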

# Faster distance computation

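• Use random projections: it takes only $$O(\log{n})$$ random projections to approximately preserve all the distances (the Johnson-Lindenstrauss lemma)
• (to within a factor of $$1\pm \epsilon$$; the constant in $$O(\log{n})$$ grows as $$\epsilon$$ shrinks)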
• Time to project one vector on to one direction: $$O(p)$$
• Time to project all $$n$$ training vectors on to $$O(\log{n})$$ directions: $$O(np\log{n})$$, but we only do this once
• We can find (approximate) nearest neighbors in time $$O(n\log{n}+p\log{n}+k)$$
• $$O(p\log{n})$$ to project new vector on to the $$O(\log{n})$$ random vectors
• $$O(n\log{n})$$ to find distances between projected vectors
• $$O(n)$$ to find smallest projected distance (absorbed into $$O(n\log{n})$$)
• $$O(k)$$ to average $$k$$ nearest neighbors’ responses
• Helps with the scaling in $$p$$
• Only useful if $$p \gg \log{n}$$, but $$p$$ might easily be $$10^3$$ and $$n=10^6$$ so $$\log_{10}{n}=6$$ (natural log $$\approx 14$$)
• Doesn’t help with the scaling in $$n$$
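
A rough sketch of the projection step, under assumptions: Gaussian random directions scaled by $$1/\sqrt{d}$$, toy sizes, and $$d=40$$ directions chosen arbitrarily (the right multiple of $$\log{n}$$ depends on $$\epsilon$$). The helper names are hypothetical.

```python
import math
import random

def random_projection_matrix(p, d, rng):
    """d random Gaussian directions in R^p, scaled by 1/sqrt(d) so that
    squared distances are preserved in expectation."""
    scale = 1.0 / math.sqrt(d)
    return [[rng.gauss(0, 1) * scale for _ in range(p)] for _ in range(d)]

def project(matrix, x):
    """Project x on to each random direction: O(p) per direction."""
    return [sum(r * xj for r, xj in zip(row, x)) for row in matrix]

rng = random.Random(0)
p, d, n = 500, 40, 100   # toy sizes (assumptions)
R = random_projection_matrix(p, d, rng)

# Project the training vectors once, up front
train = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
train_proj = [project(R, x) for x in train]

# At query time: project the new vector, then compare d-dimensional vectors
query = [rng.gauss(0, 1) for _ in range(p)]
query_proj = project(R, query)
ratios = [math.dist(query_proj, tp) / math.dist(query, t)
          for tp, t in zip(train_proj, train)]
print(min(ratios), max(ratios))  # projected/original distance ratios, near 1
```

Each query-time distance now costs $$O(d)$$ rather than $$O(p)$$, but we still compute $$n$$ of them.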

# Pre-selecting possible neighbors

• Deterministic data structures for clever searching
• Use random summaries to pre-select possible neighbors

# Data structures: $$k-d$$ trees

• i.e., “$$k$$-dimensional trees” (this $$k$$ is the number of dimensions, our $$p$$, not the number of neighbors)
• Build a sorting tree to categorize the data points
• Leaves are the actual data points
• Each internal node splits on one and only one feature
• Nodes on the same level split on the same feature (generally)
• There are other data structures but $$k-d$$ trees work well
• k-d trees are the default in the FNN package

# Using a $$k-d$$ tree

• To find potential neighbors for a new point, “drop the point down the tree”
• Start at the root
• Go to one child or the other depending on the first feature
• Go to a grand-child node depending on the second feature
• Continue until there are only $$k$$ leaf nodes below us

# Why is a $$k-d$$ tree fast?

• Assume the number of points we could be matched to gets cut in 1/2 at each node
• So $$n$$ points under the root, $$n/2$$ under each child of the root, etc.

EXERCISE: How many levels do we need to go down to reach $$\approx k$$ candidate neighbors?

# Why is a $$k-d$$ tree fast?

• Ideally, we cut the number of points by 1/2 at each node
• We go from $$n$$ to $$n/2$$ to $$n/4$$ to $$n 2^{-d}$$ after $$d$$ levels

SOLUTION: Set $$n 2^{-d}$$ to $$k$$ and solve: $\begin{eqnarray} n 2^{-d} & = & k\\ \log_2{n} - d & =& \log_2{k}\\ d & = & \log_2{(n/k)} \end{eqnarray}$
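
For a sense of scale (a worked example with illustrative numbers, not from the original slides): with $$n = 10^6$$ training points and $$k = 10$$, we get $$d = \log_2{(10^6/10)} = 5\log_2{10} \approx 16.6$$, so about 17 levels of the tree, compared to $$10^6$$ distance computations for the brute-force scan.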

• This might not work at finding the nearest neighbors
• Nearer neighbors might be on the other side of one of these splits
• There are tricks which will guarantee finding the nearest neighbor with the $$k-d$$ tree
• Using those tricks, time complexity is still $$O(\log{n})$$ on average, but $$O(n)$$ in worst case

# Building the $$k-d$$ tree (one approach)

• Put the features in some fixed order
• At step $$i$$ (counting from 1), we’ll be dividing on feature $$((i-1) \bmod p) + 1$$
• Initially, all points sit under the root node; divide at the median on feature 1
• Finding a median takes $$O(n)$$ time
• Sometimes, to save time, randomly select a fixed small set of $$m \ll n$$ points and take their median instead
• Within each child node, split on the median of the associated points on the next feature
• Recurse until there is only one data point within each node; those are the leaves
• Because we’ve used the median, we’ve ensured that each child contains 1/2 of the points of its parent

• Drawbacks of the $$k-d$$ tree:
• We need to actually analyze the training data
• If we get more data later, updating the tree is annoying (but possible)
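
The build-and-descend procedure can be sketched as follows. This is an illustration under stated assumptions, not the FNN implementation: features are indexed from 0, medians are found by sorting (simpler than an $$O(n)$$ median), every node keeps a list of the points below it (wasteful in memory, convenient for display), and the search is the plain approximate descent without the tricks that guarantee finding the true nearest neighbors. The names `Node`, `build` and `candidates` are hypothetical.

```python
import random

class Node:
    def __init__(self, points, feature=None, threshold=None, left=None, right=None):
        self.points = points        # training points under this node
        self.feature = feature      # feature this node splits on (None at a leaf)
        self.threshold = threshold  # split value: the median on that feature
        self.left = left
        self.right = right

def build(points, depth=0):
    """Recursively split at the median, cycling through the features in order."""
    if len(points) <= 1:
        return Node(points)                      # leaf: a single data point
    feature = depth % len(points[0])             # 0-based feature index
    ordered = sorted(points, key=lambda pt: pt[feature])
    mid = len(ordered) // 2                      # each child gets about half
    return Node(points, feature, ordered[mid][feature],
                build(ordered[:mid], depth + 1),
                build(ordered[mid:], depth + 1))

def candidates(root, query, k):
    """Drop the query down the tree until at most k points remain below us."""
    node, levels = root, 0
    while len(node.points) > k and node.feature is not None:
        go_left = query[node.feature] < node.threshold
        node = node.left if go_left else node.right
        levels += 1
    return node.points, levels

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(1024)]   # toy data (assumption)
root = build(pts)
cands, levels = candidates(root, (0.5, 0.5), k=8)
print(len(cands), levels)   # 8 candidates after log2(1024/8) = 7 levels
```

The candidates returned are only potential neighbors: a nearer point may sit just across one of the splits.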

# Locality-sensitive hashing

• “Hash functions”: map data (e.g., vectors) to fixed set of categories (“buckets”, “bins”, “slots”, …)
• Try to ensure a uniform distribution over bins
• Try to ensure that changes in the data result in changes in the bin
• $$\Rightarrow$$ If $$h(x) \neq h(y)$$, then $$x$$ and $$y$$ are pretty different, with high probability
• Usually: people want to ensure that even a small change to $$x$$ will put it in a different bin (with high probability)
• Used for detecting errors & tampering
• Or for cryptography
• Can “amplify” by using multiple hash functions to generate a vector of categories (= one bigger set of categories)
• Locality-sensitive hashing: Try to ensure that points which do end up in the same bin are close to each other
• Two important LSH’s:
• The random-hyperplane hash
• The random-inner-product hash

# The random-hyperplane hash

• Generate a random vector $$\vec{V}$$, length 1 but otherwise uniformly distributed
• One way: Make a random Gaussian vector and normalize
• $$h(\vec{x}) = \mathrm{sgn}(\vec{x} \cdot \vec{V})$$
• So $$h(\vec{x}) = h(\vec{y})$$ if and only if $$\vec{x}$$ and $$\vec{y}$$ are on the same side of the hyperplane through the origin defined by (orthogonal to) $$\vec{V}$$
• Probability of being on opposite sides of a random plane $$= \theta/\pi$$ where $$\theta=$$ angle between $$\vec{x}$$ and $$\vec{y}$$ in radians
• Probability of being on the same side $$=1-\theta/\pi$$
• Probability of being on the same side of $$q$$ different random hyperplanes = $$(1-\theta/\pi)^q$$
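
The $$\theta/\pi$$ claim is easy to check by simulation. The sketch below assumes two fixed vectors in three dimensions at an angle of $$\pi/4$$ and 20,000 random hyperplanes; all of these choices, and the helper names, are illustrative.

```python
import math
import random

def random_unit_vector(p, rng):
    """Uniform random direction: a Gaussian vector, normalized to length 1."""
    v = [rng.gauss(0, 1) for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def hyperplane_hash(x, v):
    """Sign of the inner product of x with v (zero is counted as +1)."""
    return 1 if sum(a * b for a, b in zip(x, v)) >= 0 else -1

rng = random.Random(42)
x = [1.0, 0.0, 0.0]
y = [1.0, 1.0, 0.0]          # the angle between x and y is pi/4
theta = math.pi / 4

trials = 20000
opposite = 0
for _ in range(trials):
    v = random_unit_vector(3, rng)
    if hyperplane_hash(x, v) != hyperplane_hash(y, v):
        opposite += 1

estimate = opposite / trials
print(estimate, theta / math.pi)   # simulated vs. theoretical (0.25)
```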

# The random-hyperplane hash (cont’d)

• Make $$q$$ different vectors $$\vec{V}_1, \ldots \vec{V}_q$$ and compute $$s(\vec{x}) = [h_1(\vec{x}) \ldots h_q(\vec{x})]$$
• If $$s(\vec{x}) = s(\vec{y})$$, then $$\vec{x}$$ and $$\vec{y}$$ have small angle between them with high probability
• If $$s(\vec{x}) = s(\vec{y})$$, then $$\vec{x}$$ and $$\vec{y}$$ have high cosine similarity with high probability
• To find cosine-similarity NN’s for $$\vec{x}$$, compute $$s(\vec{x})$$ and then look only at vectors in that bin
• If there are $$\sqrt{n}$$ bins then each contains about $$\sqrt{n}$$ training points
• Search within a bin takes $$O(p\sqrt{n})$$ time
• With $$q$$ random hyper-planes we get $$2^q$$ bins so we need $$q=\frac{1}{2}\log_{2}{n}$$ hyper-planes
• Each inner product takes $$O(p)$$ time so computing the extended hash $$s$$ takes $$O(p\log{n})$$ time
• Total time: $$O(p\sqrt{n}+p\log{n}+k)$$
• (Can you do better by adjusting the number of bins?)

# The random-inner-product hash

• Generate a random vector $$\vec{V}$$ with standard independent Gaussian entries
• Generate a random scalar $$B$$ uniform on $$[0, r]$$ for some $$r$$
• $$h(\vec{x}) = \left\lfloor \frac{\vec{V}\cdot\vec{x} + B}{r} \right\rfloor$$ (an integer)
• You can show: the probability that $$h(\vec{x})=h(\vec{y})$$ decreases monotonically with $$\|\vec{x}-\vec{y}\|$$
• Vectors which are hashed into the same bin under multiple random vectors $$\vec{V}$$ (and offsets $$B$$) are close together with high probability
• As with the random-hyperplanes hash
• Finding the approximate nearest neighbor takes time $$O(p n^{\rho} \log{n})$$, where $$\rho$$ is a (calculable) constant $$< 1$$
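
A small simulation of the hash defined above, checking that a nearby pair of vectors collides more often than a distant pair. The bin width $$r=4$$, the dimension, and the three test vectors are arbitrary choices for illustration, and `make_hash` is a hypothetical helper name.

```python
import math
import random

def make_hash(p, r, rng):
    """Draw one random-inner-product hash h(x) = floor((V.x + B) / r)."""
    v = [rng.gauss(0, 1) for _ in range(p)]   # standard Gaussian entries
    b = rng.uniform(0, r)                     # random offset in [0, r]
    def h(x):
        return math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / r)
    return h

rng = random.Random(7)
p, r = 5, 4.0                 # dimension and bin width (assumptions)
base = [0.0] * p
near = [0.1] * p              # distance about 0.22 from base
far = [3.0] * p               # distance about 6.7 from base

# Draw many independent hashes and count how often each pair shares a bin
trials = 5000
near_hits = far_hits = 0
for _ in range(trials):
    h = make_hash(p, r, rng)
    near_hits += int(h(base) == h(near))
    far_hits += int(h(base) == h(far))

print(near_hits / trials, far_hits / trials)  # collision rate falls with distance
```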

# The cluster hash

• Randomly assign each data point to a bin from $$1$$ to $$q$$, giving labels $$L_1, \ldots L_n$$
• Compute the average of all the points within each bin
• Get vectors $$\vec{c}_1, \ldots \vec{c}_q$$
• Re-label points: $$L_i = \mathrm{argmin}_{j \in 1:q}{\|\vec{x}_i - \vec{c}_j\|}$$
• Re-compute averages, re-label, etc., until nothing changes
• To find neighbors for a new point, find the bin center it’s closest to, and then look for neighbors in that bin
• If $$q=\sqrt{n}$$ then it takes $$O(p\sqrt{n})$$ to find the right bin, and $$O(pn/q) = O(p\sqrt{n})$$ to search for neighbors within the bin
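
The steps above can be sketched as below. Two details are assumptions, since the slides do not specify them: a bin that ends up empty is re-seeded with a random data point, and a `max_iter` cap guards against a very long loop. The function names and the uniform toy data are also illustrative.

```python
import math
import random

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return [sum(pt[j] for pt in points) / len(points) for j in range(len(points[0]))]

def nearest_center(x, centers):
    """Index of the bin center closest to x: O(pq)."""
    return min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))

def cluster_hash(points, q, rng, max_iter=1000):
    labels = [rng.randrange(q) for _ in points]      # random initial bins
    centers = []
    for _ in range(max_iter):
        centers = []
        for j in range(q):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            # Empty bin: re-seed with a random data point (an assumption)
            centers.append(mean(members) if members else list(rng.choice(points)))
        new_labels = [nearest_center(pt, centers) for pt in points]
        if new_labels == labels:                     # nothing changed: done
            break
        labels = new_labels
    return centers, labels

rng = random.Random(3)
n, q = 100, 10                                       # q = sqrt(n)
points = [(rng.random(), rng.random()) for _ in range(n)]
centers, labels = cluster_hash(points, q, rng)

# Query: find the closest center, then search only within that bin
query = (0.5, 0.5)
bin_index = nearest_center(query, centers)
bucket = [i for i, lab in enumerate(labels) if lab == bin_index]
neighbor = min(bucket, key=lambda i: math.dist(query, points[i]))
print(bin_index, len(bucket), neighbor)
```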

# Some common threads to the LSH techniques

• We need to hash all the training data
• Figuring out which bin a new $$\vec{x}$$ belongs to (hashing it) is fast
• We don’t need to keep all the data around to hash $$\vec{x}$$
• Randomness helps!
• Only the cluster hash needs the training data to work out the hash function
• Only the cluster hash needs to be revised when we get more data
• We still need to keep the training data around, but not in memory
• $$O(np)$$ disk storage is a lot cheaper than $$O(np)$$ RAM

# Wrapping up

• Time (and memory) costs of straightforward kNN are linear in $$n$$
• Technically “tractable” but not good when $$n$$ is industrial-sized
• Most of the time comes from computing distances and finding the nearest neighbors
• Common ways to find (approximate) nearest neighbors faster:
• Use fewer training points
• Use projections to approximate distances
• Use search trees, hashes or clusters to pre-select possible neighbors
• We pay some cost in risk for great savings in time

# After-notes

• Model selection: See Claeskens and Hjort (2008)
• Sampling: There are ways to carefully select subsets of points which will work almost as well as the full data, but they’re complicated and it’s not clear how much they really improve over random sampling; references in the textbook
• k-d trees: Due to Bentley (1975) (a very clear paper)
• Using k-d trees as density estimators: see Gershenfeld (1999)
• Locality-sensitive hashing: Due to Gionis, Indyk, and Motwani (1999)
• Good explanations in Leskovec, Rajaraman, and Ullman (2014)
• Random hyperplanes hash: Charikar (2002)
• Random projections hash: Datar et al. (2004)
• Clustering hash: we’ll come back to this when we look at k-means clustering