# Agenda

1. Wrap-up on k selection
2. Computational costs of naive implementation of kNN
3. Fast, approximate kNN search

# A general trade-off to model selection

• Some methods are “model-selection-consistent”
• $$\equiv$$ If the true model is one of the options, then the probability of selecting it $$\rightarrow 1$$ as $$n\rightarrow \infty$$; it converges on the truth
• Some methods are “predictively optimal”
• the model they select gives predictions (almost) as good as the best possible prediction among the available models
• Dilemma: In general, model-selection-consistent methods are not predictively optimal and vice versa
• In fact, predictively optimal methods tend to over-fit (select too big a model even when $$n\rightarrow\infty$$)

# For cross-validation

• LOOCV is usually predictively optimal
• v-fold CV isn’t quite consistent, but it’s close and there are tweaks

# For kNN

• There just is no true number of nearest neighbors!
• $$k$$ is only a control setting for prediction, not an aspect of reality
• We should expect LOOCV to be the best form of CV here
• Recent theoretical proof of optimality (Azadkia 2019)

# How expensive is it to run kNN?

• Need to store the entire set of training data
• Memory cost: $$p$$ features plus $$1$$ label per item, $$n$$ training items, cost $$O(n(p+1)) = O(np)$$
• Need to compare a new point to every other data point
• Cost to compute one distance: $$O(p)$$
• $$\therefore$$ Cost to compute all distances: $$O(np)$$
• Cost to find the $$k$$ smallest distances: $$O(n)$$ (regardless of $$p$$)
• Time to compute the prediction from the NNs: $$O(k)$$
• Total time cost: $$O(np+n+k) = O(np+k)$$
• Theoretical computer scientists will tell you anything sub-exponential in $$n$$ is “tractable”; don’t believe them…
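
The cost accounting above can be read off a brute-force implementation directly. A minimal sketch (Python/NumPy for concreteness; all names are illustrative, and the FNN package plays this role in R):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Brute-force kNN regression for one query point."""
    # O(np): one O(p) squared-distance per training point
    dists = ((X_train - x_new) ** 2).sum(axis=1)
    # O(n): partial selection of the k smallest distances (no full sort needed)
    nn_idx = np.argpartition(dists, k)[:k]
    # O(k): average the neighbors' responses
    return y_train[nn_idx].mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X.sum(axis=1)
pred = knn_predict(X, y, np.zeros(3), k=5)
```

The only subtlety is the selection step: `argpartition` finds the $$k$$ smallest entries in linear time, which is where the $$O(n)$$ (rather than $$O(n\log{n})$$) term comes from.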

# Why is it $$O(n)$$?

• There are two parts which take $$O(n)$$ time:
• Computing distances to all training points
• Finding the smallest distances
• $$k$$ is small, $$n$$ is big — most of the distances we compute end up being useless
• Do we need all the points?
• Can we be faster about computing distances (if only approximately)?
• Can we rule out some points as nearest neighbors, without examining them?

# Using fewer data points

• Diminishing returns: risk shrinks as $$n$$ grows, but more and more slowly
• Remember the distance to the nearest neighbor is $$O(n^{-1/p})$$
• Bias will be on the order of the distance (Taylor expand $$\mu$$)
• Contribution to the risk $$\propto \mathrm{bias}^2 = O(n^{-2/p})$$
• For fixed $$k$$, variance contribution is $$O(1)$$
• If $$p=10$$, doubling $$n$$ doubles the computing cost, but the squared-bias contribution to the risk is still $$2^{-2/10} \approx 0.87$$ of what it was before
• Sampling: pick a random subset of $$m \ll n$$ data points to keep computing time small, at acceptable risk
• Needs a price at which we can trade risk against computing cost
• Constrain either time or risk, and Lagrange multiplier gives us the price
• Seems wasteful to collect the data and then ignore most of it at random
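
Sampling is a one-line change to the brute-force search: ignore all but a random subset. A sketch, with `m` as the assumed subsample size:

```python
import numpy as np

def subsampled_knn(X_train, y_train, x_new, k, m, rng):
    """kNN on a random subsample of m << n points: time drops from O(np) to O(mp)."""
    keep = rng.choice(len(X_train), size=m, replace=False)  # discard all but m points
    Xs, ys = X_train[keep], y_train[keep]
    dists = ((Xs - x_new) ** 2).sum(axis=1)
    nn = np.argpartition(dists, k)[:k]
    return ys[nn].mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))
y = X[:, 0]
pred = subsampled_knn(X, y, np.zeros(10), k=5, m=500, rng=rng)
```

Here `m` is exactly the knob the Lagrange-multiplier argument prices: larger `m` buys lower risk at linear cost in time.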

# Faster distance computation

• Use random projections: it takes only $$O(\log{n})$$ random projections to preserve distances
• (to within a factor of $$1\pm \epsilon$$)
• Time to project one vector on to one direction: $$O(p)$$
• Time to project all training vectors: $$O(p\log{n})$$, but we only do this once
• We can find (approximate) nearest neighbors in time $$O(n\log{n}+p\log{n}+k)$$
• $$O(p\log{n})$$ to project new vector on to the $$O(\log{n})$$ random vectors
• $$O(n\log{n})$$ to find distances between projected vectors
• $$O(n)$$ to find smallest projected distances (absorbed into $$O(n\log{n})$$)
• $$O(k)$$ to average $$k$$ nearest neighbors’ responses
• Helps with the scaling in $$p$$
• Only useful if $$p \gg \log{n}$$, but $$p$$ might easily be $$10^3$$ while $$n=10^6$$ gives only $$\log_{2}{n} \approx 20$$
• Doesn’t help with the scaling in $$n$$
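
A sketch of the projection idea (illustrative code; the $$O(\log{n})$$ direction count and the $$1/\sqrt{d}$$ scaling follow the usual Johnson–Lindenstrauss construction):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 200
X = rng.normal(size=(n, p))

d = int(np.ceil(np.log2(n)))               # O(log n) random directions
R = rng.normal(size=(p, d)) / np.sqrt(d)   # scaling keeps projected distances unbiased
Z = X @ R                                  # one-time projection of the training set

def approx_nn(x_new, k):
    """Approximate kNN: O(p log n) to project the query, O(n log n) to compare."""
    z = x_new @ R
    dists = ((Z - z) ** 2).sum(axis=1)
    return np.argpartition(dists, k)[:k]
```

With only $$d \approx \log_2{n}$$ dimensions the distortion on any one pair can be substantial, but distances are preserved on average, which is enough for approximate neighbor-finding.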

# Pre-selecting possible neighbors

• Deterministic data structures for clever searching
• Use random summaries to pre-select possible neighbors

# Data structures: $$k$$-d trees

• i.e., “$$k$$-dimensional trees”
• Build a sorting tree to categorize the data points
• Leaves are the actual data points
• Each internal node splits on one and only one feature
• Nodes on the same level split on the same feature (generally)
• There are other data structures, but $$k$$-d trees work well
• $$k$$-d trees are the default in the FNN package

# Using a $$k$$-d tree

• To find potential neighbors for a new point, “drop the point down the tree”
• Start at the root
• Go to one child or the other depending on the first feature
• Go to a grandchild node depending on the second feature
• Continue until there are only $$\approx k$$ leaf nodes (data points) below us
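
In practice one uses a library implementation of this search; in Python, for example, scipy's `KDTree` supports exactly this query (FNN plays the analogous role in R):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))

tree = KDTree(X)                            # built once from the training data
dists, idx = tree.query(np.zeros(3), k=5)   # drop the query point down the tree
```

`query` returns the distances (in increasing order) and the row indices of the neighbors; scipy's version includes the backtracking tricks that make the search exact.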

# Why is a $$k$$-d tree fast?

• Assume the number of points we could be matched to gets cut in half at each level
• So $$n$$ points under the root, $$n/2$$ under each child of the root, etc.

EXERCISE: How many levels do we need to go down to reach $$\approx k$$ candidate neighbors?

# Why is a $$k$$-d tree fast?

• Ideally, we cut the number of points in half at each level
• We go from $$n$$ to $$n/2$$ to $$n/4$$, and so on down to $$n 2^{-d}$$ after $$d$$ levels

SOLUTION: Set $$n 2^{-d}$$ equal to $$k$$ and solve: $$\begin{eqnarray} n 2^{-d} & = & k\\ \log_2{n} - d & = & \log_2{k}\\ d & = & \log_2{n/k} \end{eqnarray}$$

• This might fail to find the true nearest neighbors
• Nearer neighbors might be on the other side of one of these splits
• There are tricks which will guarantee finding the nearest neighbor with the $$k$$-d tree
• Using those tricks, time complexity is still $$O(\log{n})$$ on average, but $$O(n)$$ in worst case

# Building the $$k$$-d tree (one approach)

• Put the features in some fixed order
• At step $$i$$, we’ll be dividing on feature $$i\mod p$$
• Initially, all points sit under the root node; divide at the median on feature 1
• Finding a median takes $$O(n)$$ time
• sometimes randomly select a fixed small set of $$m \ll n$$ points and take their median
• Within each child node, split on the median of the associated points
• Recurse until there is only one data point within each node; those are the leaves
• Because we’ve used the median, we’ve ensured that each child contains half the points of its parent

• Drawbacks of the $$k$$-d tree:
• We need to actually analyze the training data
• If we get more data later, updating the tree is annoying (but possible)
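
A bare-bones version of this construction, using a full sort where the $$O(n)$$ median-selection mentioned above would do:

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Median-split k-d tree; leaves hold single data points."""
    n = len(points)
    if n <= 1:
        return {"leaf": points}
    axis = depth % points.shape[1]                 # cycle through the features in order
    points = points[np.argsort(points[:, axis])]   # O(n log n); true median-finding is O(n)
    mid = n // 2                                   # median split: each child gets ~half
    return {"axis": axis, "split": points[mid, axis],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid:], depth + 1)}

def count_points(node):
    """Total data points stored below a node (sanity check on the split)."""
    if "leaf" in node:
        return len(node["leaf"])
    return count_points(node["left"]) + count_points(node["right"])
```

Because each split is at the median, the recursion bottoms out after $$\approx \log_2{n}$$ levels.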

# Locality-sensitive hashing

• “Hash functions”: map data (e.g., vectors) to fixed set of categories (“buckets”, “bins”, “slots”, …)
• Try to ensure a uniform distribution over bins
• Try to ensure that changes in the data result in changes in the bin
• $$\Rightarrow$$ If $$h(x) \neq h(y)$$, then $$x$$ and $$y$$ are pretty different, with high probability
• Usually: people want to ensure that even a small change to $$x$$ will put it in a different bin (with high probability)
• Used for detecting errors & tampering
• Or for cryptography
• Can “amplify” by using multiple hash functions to generate a vector of categories (= one bigger set of categories)
• Locality-sensitive hashing: Try to ensure that points which do end up in the same bin are close to each other
• Two important LSH’s:
• The random-hyperplane hash
• The random-inner-product hash

# The random-hyperplane hash

• Generate a random vector $$\vec{V}$$, length 1 but otherwise uniformly distributed
• One way: Make a random Gaussian vector and normalize
• $$h(\vec{x}) = \mathrm{sgn}(\vec{x} \cdot \vec{V})$$
• So $$h(\vec{x}) = h(\vec{y})$$ if and only if $$\vec{x}$$ and $$\vec{y}$$ are on the same side of the hyperplane orthogonal to $$\vec{V}$$
• Probability of being on opposite sides of a random plane $$= \theta/\pi$$ where $$\theta=$$ angle between $$\vec{x}$$ and $$\vec{y}$$ in radians
• Probability of being on the same side $$=1-\theta/\pi$$
• Probability of being on the same side of $$q$$ different random hyperplanes = $$(1-\theta/\pi)^q$$
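
The $$\theta/\pi$$ formula is easy to check by simulation; a quick Monte Carlo sketch in two dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])          # angle between x and y is pi/4
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Draw many random hyperplane normals; count how often x and y land on opposite sides
V = rng.normal(size=(100_000, 2))
opposite = (np.sign(V @ x) != np.sign(V @ y)).mean()
```

The empirical fraction of separating hyperplanes should sit very close to $$\theta/\pi = 1/4$$.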

# The random-hyperplane hash (cont’d)

• Make $$q$$ different vectors $$\vec{V}_1, \ldots \vec{V}_q$$ and compute $$s(\vec{x}) = [h_1(\vec{x}) \ldots h_q(\vec{x})]$$
• If $$s(\vec{x}) = s(\vec{y})$$, then $$\vec{x}$$ and $$\vec{y}$$ have small angle between them with high probability
• If $$s(\vec{x}) = s(\vec{y})$$, then $$\vec{x}$$ and $$\vec{y}$$ have high cosine similarity with high probability
• To find cosine-similarity NN’s for $$\vec{x}$$, compute $$s(\vec{x})$$ and then look only at vectors in that bin
• If there are $$\sqrt{n}$$ bins then each contains about $$\sqrt{n}$$ training points
• Search within a bin takes $$O(p\sqrt{n})$$ time
• With $$q$$ random hyper-planes we get $$2^q$$ bins so we need $$q=\frac{1}{2}\log_{2}{n}$$ hyper-planes
• Each inner product takes $$O(p)$$ time so computing the extended hash $$s$$ takes $$O(p\log{n})$$ time
• Total time: $$O(p\sqrt{n}+p\log{n}+k)$$
• (Can you do better by adjusting the number of bins?)

# The random-inner-product hash

• Generate a random vector $$\vec{V}$$ with standard independent Gaussian entries
• Generate a random scalar $$B$$ uniform on $$[0, r]$$ for some $$r$$
• $$h(\vec{x}) = \left\lfloor \frac{\vec{V}\cdot\vec{x} + B}{r} \right\rfloor$$ (an integer)
• You can show: the probability that $$h(\vec{x})=h(\vec{y})$$ decreases monotonically with $$\|\vec{x}-\vec{y}\|$$
• Vectors which are hashed into the same bin under multiple vectors have low distance with high probability
• As with the random-hyperplanes hash
• Finding the approximate nearest neighbor takes time $$O(p n^{\rho} \log{n})$$, where $$\rho$$ is a (calculable) constant $$< 1$$
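
The hash itself is one line; a sketch with an arbitrary choice of the bin width $$r$$:

```python
import numpy as np

rng = np.random.default_rng(4)
p, r = 20, 1.0
V = rng.normal(size=p)        # standard independent Gaussian entries
B = rng.uniform(0, r)         # random offset, uniform on [0, r]

def h(x):
    """Bucket index: floor of the shifted, scaled random projection."""
    return int(np.floor((V @ x + B) / r))
```

Nearby points get similar values of $$\vec{V}\cdot\vec{x}$$ and so tend to fall in the same integer bin; $$r$$ controls how coarse the bins are.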

# The cluster hash

• Randomly assign each data point to a bin from $$1$$ to $$q$$; call the labels $$L_1, \ldots L_n$$
• Compute the average of all points within each bin
• Get vectors $$\vec{c}_1, \ldots \vec{c}_q$$
• Re-label points: $$L_i = \operatorname{argmin}_{j \in 1:q}{\|\vec{x}_i - \vec{c}_j\|}$$
• Re-compute averages, re-label, etc., until nothing changes
• To find neighbors for a new point, find the bin center it’s closest to, and then look for neighbors in that bin
• If $$q=\sqrt{n}$$ then it takes $$O(p\sqrt{n})$$ to find the right bin, and $$O(pn/q) = O(p\sqrt{n})$$ to search for neighbors within the bin
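
A sketch of the cluster hash (this iteration is Lloyd's algorithm, which we will meet again under the name k-means):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 400, 2
X = rng.normal(size=(n, p))
q = int(np.sqrt(n))                     # q = sqrt(n) bins

labels = rng.integers(0, q, size=n)     # random initial bin labels
for _ in range(20):                     # re-center and re-label until nothing changes
    centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else X[rng.integers(n)]      # restart any empty bin
                        for j in range(q)])
    new = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    if (new == labels).all():
        break
    labels = new

def bin_of(x_new):
    """O(pq) = O(p sqrt(n)): find the nearest bin center for a new point."""
    return ((centers - x_new) ** 2).sum(axis=1).argmin()
```

To predict for `x_new`, search for neighbors only among the training points whose label equals `bin_of(x_new)`.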

# Some common threads to the LSH techniques

• We need to hash all the training data
• Figuring out which bin a new $$\vec{x}$$ belongs to (hashing it) is fast
• We don’t need to keep all the data around to hash $$\vec{x}$$
• Randomness helps!
• Only the cluster hash needs the training data to work out the hash function
• Only the cluster hash needs to be revised when we get more data
• We still need to keep the training data around, but not in memory
• $$O(np)$$ storage is a lot cheaper than $$O(np)$$ RAM

# Wrapping up

• Time (and memory) costs of straightforward kNN are linear in $$n$$
• Technically “tractable” but not good when $$n$$ is industrial sized
• Most of the time comes from computing distances and finding the nearest neighbors
• Common ways to find (approximate) nearest neighbors faster:
• Use fewer training points
• Use projections to approximate distances
• Use search trees, hashes or clusters to pre-select possible neighbors
• We pay some cost in risk for great savings in time

# After-notes

• Model selection: See Claeskens and Hjort (2008)
• Sampling: There are ways to carefully select subsets of points which will work almost as well as the full data, but they’re complicated and it’s not clear how much they really improve over random sampling; references in the textbook
• k-d trees: Due to Bentley (1975) (a very clear paper)
• Using k-d trees as density estimators: see Gershenfeld (1999)
• Locality-sensitive hashing: Due to Gionis, Indyk, and Motwani (1999)
• Good explanations in Leskovec, Rajaraman, and Ullman (2014)
• Random hyperplanes hash: Charikar (2002)
• Random projections hash: Datar et al. (2004)
• Clustering hash: we’ll come back to this when we look at k-means clustering

# References

Azadkia, Mona. 2019. “Optimal Choice of $$k$$ for $$k$$-Nearest Neighbor Regression.” E-print, arxiv:1909.05495. http://arxiv.org/abs/1909.05495.

Bentley, Jon Louis. 1975. “Multidimensional Binary Search Trees Used for Associative Searching.” Communications of the ACM 18:508–17. https://doi.org/10.1145/361002.361007.

Charikar, Moses S. 2002. “Similarity Estimation Techniques from Rounding Algorithms.” In Proceedings of the 34th Annual ACM Symposium on Theory of Computing [STOC ’02], edited by John Reif, 380–88. New York: ACM. https://doi.org/10.1145/509907.509965.

Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge, England: Cambridge University Press.

Datar, Mayur, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. “Locality-Sensitive Hashing Scheme Based on p-Stable Distributions.” In Proceedings of the 20th Annual Symposium on Computational Geometry [SCG ’04], edited by Jack Snoeyink and Jean-Daniel Boissonnat, 253–62. New York: ACM. https://doi.org/10.1145/997817.997857.

Gershenfeld, Neil. 1999. The Nature of Mathematical Modeling. Cambridge, England: Cambridge University Press.

Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. 1999. “Similarity Search in High Dimensions via Hashing.” In Proceedings of the 25th International Conference on Very Large Data Bases [VLDB ’99], edited by Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, 518–29. San Francisco: Morgan Kaufmann.

Leskovec, Jure, Anand Rajaraman, and Jeffrey D. Ullman. 2014. Mining of Massive Datasets. Second. Cambridge, England: Cambridge University Press. http://www.mmds.org.