- Wrap-up on k selection
- Computational costs of naive implementation of kNN
- Fast, approximate kNN search

- Some methods are “model-selection-consistent”
- \(\equiv\) If the true model is one of the options, then the probability of selecting it \(\rightarrow 1\) as \(n\rightarrow \infty\); it converges on the truth

- Some methods are “predictively optimal”
- the model they select gives predictions (almost) as good as the best possible prediction among the available models

- Dilemma: In general, model-selection-consistent methods are not predictively optimal and vice versa
- In fact, predictively optimal methods tend to over-fit (select too big a model *even when* \(n\rightarrow\infty\))

- LOOCV is usually predictively optimal
- v-fold CV isn’t quite consistent, but it’s close and there are tweaks

- There just is no true number of nearest neighbors!
- \(k\) is *only* a control setting for prediction, *not* an aspect of reality

- We should expect LOOCV to be the best form of CV
- Recent theoretical proof of optimality (Azadkia 2019)

- Need to store the *entire* set of training data
- Memory cost: \(p\) features plus \(1\) label per item, \(n\) training items, cost \(O(n(p+1)) = O(np)\)

- Need to compare a new point to *every* other data point
- Cost to compute one distance: \(O(p)\)
- \(\therefore\) Cost to compute all distances: \(O(np)\)
- Cost to find the \(k\) smallest distances: \(O(n)\) (regardless of \(p\))
- Time to compute the prediction from the NNs: \(O(k)\)
- Total time cost: \(O(np+n+k) = O(np+k)\)
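The cost accounting above can be seen in a minimal pure-Python sketch of the naive predictor (function and variable names are my own, not from any particular package):

```python
import math, random

def knn_predict(X, y, x_new, k):
    """Naive kNN regression on a list of feature vectors X and responses y."""
    # O(np): one O(p) Euclidean distance per training point
    dists = [math.dist(x, x_new) for x in X]
    # Find the k smallest distances; plain sorting is O(n log n), a
    # selection algorithm (or heapq.nsmallest) would get this down to O(n)
    idx = sorted(range(len(X)), key=dists.__getitem__)[:k]
    # O(k): average the k nearest neighbors' responses
    return sum(y[i] for i in idx) / k

# Toy data: 3 features, response = sum of the features (no noise)
random.seed(1)
X = [[random.random() for _ in range(3)] for _ in range(500)]
y = [sum(x) for x in X]
print(knn_predict(X, y, [0.5, 0.5, 0.5], k=5))  # should be near 1.5
```

Note that every training point is touched at query time, which is exactly the \(O(np)\) term that the rest of this section tries to avoid.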

- Theoretical computer scientists will tell you anything sub-exponential in \(n\) is “tractable”; don’t believe them…

- There are *two* parts which take \(O(n)\) time:
  - Computing distances to *all* training points
  - Finding the smallest distances

- \(k\) is small, \(n\) is big — *most* of the distances we compute end up being useless
- Do we need all the points?
- Can we be faster about computing distances (if only approximately)?
- Can we rule out some points as nearest neighbors, *without* examining them?

**Diminishing returns**: risk shrinks as \(n\) grows, but more and more slowly

- Remember the distance to the nearest neighbor is \(O(n^{-1/p})\)
- Bias will be on the order of the distance (Taylor expand \(\mu\))
- Contribution to the risk \(\propto \mathrm{bias}^2 = O(n^{-2/p})\)
- For fixed \(k\), variance contribution is \(O(1)\)

- If \(p=10\), doubling \(n\) doubles computing cost, but the bias is still \(2^{-1/10} \approx 0.93\) of what it was, and its contribution to the risk is still \(2^{-2/10} \approx 0.87\) of what it was before
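To see those factors concretely (plain Python arithmetic):

```python
# With p = 10 features, doubling n shrinks the nearest-neighbor distance
# (and hence the bias) by 2^(-1/p), and the squared-bias contribution
# to the risk by 2^(-2/p) -- barely better, at twice the cost
p = 10
print(round(2 ** (-1 / p), 3))  # bias factor, about 0.933
print(round(2 ** (-2 / p), 3))  # risk (bias^2) factor, about 0.871
```
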
**Sampling**: pick a random subset of \(m \ll n\) data points to keep computing time small, at acceptable risk

- Needs a price at which we can trade risk against computing cost
- Constrain either time or risk, and Lagrange multiplier gives us the price

- Seems wasteful to collect the data and then ignore most of it at random

- Use random projections: it takes only \(O(\log{n})\) random projections to preserve distances
- (to within a factor of \(1\pm \epsilon\))
- Time to project one vector on to one direction: \(O(p)\)
- Time to project all training vectors: \(O(p\log{n})\), but we only do this once

- We can find (approximate) nearest neighbors in time \(O(n\log{n}+p\log{n}+k)\)
- \(O(p\log{n})\) to project new vector on to the \(O(\log{n})\) random vectors
- \(O(n\log{n})\) to find distances between projected vectors
- \(O(n)\) to find the smallest projected distances (absorbed into the \(O(n\log{n})\) term)
- \(O(k)\) to average \(k\) nearest neighbors’ responses
- Helps with the scaling in \(p\)
- Only useful if \(p \gg \log{n}\), but \(p\) might easily be \(10^3\) with \(n=10^6\), so \(\log_{10}{n}=6\)
- Doesn’t help with the scaling in \(n\)
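A quick sketch of the distance-preservation claim (pure Python; the dimensions and scaling are illustrative choices, in the spirit of the Johnson-Lindenstrauss lemma):

```python
import math, random

random.seed(42)
p, d = 1000, 40   # original dimension; d = O(log n) random directions

# Gaussian projection directions, scaled by 1/sqrt(d) so that projected
# distances are (on average) the same size as the original distances
R = [[random.gauss(0, 1) / math.sqrt(d) for _ in range(p)] for _ in range(d)]

def project(x):
    # O(p) per direction, O(pd) total for one vector; done once per point
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in R]

x = [random.gauss(0, 1) for _ in range(p)]
y = [random.gauss(0, 1) for _ in range(p)]
ratio = math.dist(project(x), project(y)) / math.dist(x, y)
print(round(ratio, 2))  # close to 1, i.e., within a 1 +/- epsilon factor
```

Distances between the 40-dimensional projections track the 1000-dimensional originals, which is why we can afford to compute distances in the projected space only.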

- Deterministic data structures for clever searching
- Use random summaries to pre-select possible neighbors

- i.e., “\(k\)-dimensional trees” (the \(k\) here is the number of features, not the number of neighbors)
- Build a sorting tree to categorize the data points
- Leaves are the actual data points
- Each internal node splits on one and only one feature
- Nodes on the same level split on the same feature (generally)
- There are other data structures, but \(k\)-d trees work well
- \(k\)-d trees are the default in the `FNN` package

- To find potential neighbors for a new point, “drop the point down the tree”
- Start at the root
- Go to one child or the other depending on the first feature
- Go to a grandchild node depending on the second feature
- Continue until there are only \(k\) leaf nodes below us

- Assume the number of points we *could* be matched to gets cut in half at each node
- So \(n\) points under the root, \(n/2\) under each child of the root, etc.

EXERCISE: How many levels do we need to go down to reach \(\approx k\) candidate neighbors?

- Ideally, we cut the number of points in half at each node
- So we go from \(n\) to \(n/2\) to \(n/4\), down to \(n 2^{-d}\) after \(d\) levels

SOLUTION: Set \(n 2^{-d}\) to \(k\) and solve: \[\begin{eqnarray} n 2^{-d} & = & k\\ \log_2{n} - d & =& \log_2{k}\\ d & = & \log_2{n/k} \end{eqnarray}\]

*This might not work* at finding the *nearest* neighbors

- Nearer neighbors might be on the other side of one of these splits
- There are tricks which will guarantee finding the nearest neighbors with the \(k\)-d tree
- Using those tricks, the time complexity is still \(O(\log{n})\) *on average*, but \(O(n)\) in the worst case

- Put the features in some fixed order
- At step \(i\), we’ll be dividing on feature \(i\mod p\)
- Initially, all points sit under the root node; divide at the median on feature 1
- Finding a median takes \(O(n)\) time
  - A common shortcut: randomly select a fixed small set of \(m \ll n\) points and take their median

- Within each child node, split on the median of the associated points
- Recurse until there is only one data point within each node; those are the leaves
- Because we’ve used the median, each child contains half of its parent’s points
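The build-and-drop procedure can be sketched in a few dozen lines (pure Python; names and the toy 2-D data are my own, and this version omits the backtracking tricks that guarantee exact nearest neighbors):

```python
import random

def build_kdtree(points, depth=0):
    """Split at the median of feature (depth mod p); each child gets
    half of its parent's points, and leaves hold single points."""
    if len(points) <= 1:
        return points                      # leaf: list of 0 or 1 points
    axis = depth % len(points[0])
    points = sorted(points, key=lambda pt: pt[axis])
    mid = len(points) // 2
    return (axis, points[mid][axis],       # internal node: (axis, cut, L, R)
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid:], depth + 1))

def size(tree):
    return len(tree) if isinstance(tree, list) else size(tree[2]) + size(tree[3])

def leaves(tree):
    return list(tree) if isinstance(tree, list) else leaves(tree[2]) + leaves(tree[3])

def candidates(tree, x, k):
    """'Drop x down the tree' until fewer than k points would remain below;
    the survivors are candidate neighbors (true NNs may lie across a split)."""
    while isinstance(tree, tuple):
        axis, cut, left, right = tree
        child = left if x[axis] < cut else right
        if size(child) < k:
            break
        tree = child
    return leaves(tree)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(64)]
tree = build_kdtree(pts)
print(len(candidates(tree, (0.5, 0.5), k=3)))  # 4: smallest subtree with >= 3 points
```

With \(n=64\) and \(k=3\), the drop-down descends \(64 \to 32 \to 16 \to 8 \to 4\) and stops, matching \(d = \log_2{n/k}\) levels.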

- Drawbacks of the \(k-d\) tree:
- We need to actually analyze the training data
- If we get more data later, updating the tree is annoying (but possible)

- “Hash functions”: map data (e.g., vectors) to fixed set of categories (“buckets”, “bins”, “slots”, …)
- Try to ensure a uniform distribution over bins
- Try to ensure that changes in the data result in changes in the bin
- \(\Rightarrow\) If \(h(x) \neq h(y)\), then \(x\) and \(y\) are pretty different, with high probability
- Usually, people want to ensure that even a *small* change to \(x\) will put it in a different bin (with high probability)
  - Used for detecting errors & tampering
  - Or for cryptography

- Can “amplify” by using multiple hash functions to generate a vector of categories (= one bigger set of categories)

**Locality-sensitive hashing**: Try to ensure that points which *do* end up in the same bin *are* close to each other

- Two important LSHs:
- The random-hyperplane hash
- The random-inner-product hash

- Generate a random vector \(\vec{V}\), length 1 but otherwise uniformly distributed
- One way: Make a random Gaussian vector and normalize

- \(h(\vec{x}) = \sgn{\vec{x} \cdot \vec{V}}\)
- So \(h(\vec{x}) = h(\vec{y})\) if and only if \(\vec{x}\) and \(\vec{y}\) are on the same side of the plane orthogonal to \(\vec{V}\)
- Probability of being on *opposite* sides of a random plane \(= \theta/\pi\), where \(\theta=\) angle between \(\vec{x}\) and \(\vec{y}\) in radians
- Probability of being on the *same* side \(=1-\theta/\pi\)
- Probability of being on the *same* side of \(q\) different random hyperplanes \(= (1-\theta/\pi)^q\)
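The \(1-\theta/\pi\) collision probability is easy to check by simulation (pure Python; the 60-degree angle is an arbitrary choice for illustration):

```python
import math, random

random.seed(7)
theta = math.pi / 3                       # 60-degree angle between x and y
x, y = (1.0, 0.0), (math.cos(theta), math.sin(theta))

def same_side():
    # A random hyperplane through the origin; only the signs of the inner
    # products matter, so we can skip normalizing the Gaussian vector
    v = (random.gauss(0, 1), random.gauss(0, 1))
    return (x[0]*v[0] + x[1]*v[1] >= 0) == (y[0]*v[0] + y[1]*v[1] >= 0)

trials = 20000
est = sum(same_side() for _ in range(trials)) / trials
print(round(est, 2), round(1 - theta / math.pi, 2))  # both about 0.67
```
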

- Make \(q\) different vectors \(\vec{V}_1, \ldots \vec{V}_q\) and compute \(s(\vec{x}) = [h_1(\vec{x}) \ldots h_q(\vec{x})]\)
- If \(s(\vec{x}) = s(\vec{y})\), then \(\vec{x}\) and \(\vec{y}\) have small angle between them with high probability
- If \(s(\vec{x}) = s(\vec{y})\), then \(\vec{x}\) and \(\vec{y}\) have high cosine similarity with high probability
- To find cosine-similarity NN’s for \(\vec{x}\), compute \(s(\vec{x})\) and then look only at vectors in that bin
- If there are \(\sqrt{n}\) bins then each contains about \(\sqrt{n}\) training points
- Search within a bin takes \(O(p\sqrt{n})\) time
- With \(q\) random hyperplanes we get \(2^q\) bins, so we need \(q=\frac{1}{2}\log_{2}{n}\) hyperplanes
- Each inner product takes \(O(p)\) time so computing the extended hash \(s\) takes \(O(p\log{n})\) time
- Total time: \(O(p\sqrt{n}+p\log{n}+k)\)
- (Can you do better by adjusting the number of bins?)
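Putting the pieces together, a toy sketch of the bucketed search (pure Python, illustrative names; practical LSH implementations typically use several hash tables to boost the chance of catching the true neighbors):

```python
import random

random.seed(3)
p, n, q = 20, 1024, 5    # q = (1/2) log2(n) hyperplanes, so 2^5 = 32 bins

V = [[random.gauss(0, 1) for _ in range(p)] for _ in range(q)]

def s(x):
    """Extended hash: a tuple of q sign bits, O(pq) = O(p log n) to compute."""
    return tuple(sum(v_i * x_i for v_i, x_i in zip(v, x)) >= 0 for v in V)

# Hash every training point once, grouping point indices by bucket
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
buckets = {}
for i, x in enumerate(X):
    buckets.setdefault(s(x), []).append(i)

# Query: hash the new point, then search only its bucket (about n / 2^q points)
x_new = [random.gauss(0, 1) for _ in range(p)]
cands = buckets.get(s(x_new), [])
print(len(cands), "candidates instead of", n)
```

Only the points sharing the new point's bucket are ever compared to it, which is where the \(O(p\sqrt{n})\) search cost comes from.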

- Generate a random vector \(\vec{V}\) with standard independent Gaussian entries
- Generate a random scalar \(B\) uniform on \([0, r]\) for some \(r\)
- \(h(\vec{x}) = \left\lfloor \frac{\vec{V}\cdot\vec{x} + B}{r} \right\rfloor\) (an integer)
- You can show: the probability that \(h(\vec{x})=h(\vec{y})\) decreases monotonically with \(\|\vec{x}-\vec{y}\|\)
- Vectors which are hashed into the same bin under multiple vectors have low distance with high probability
- As with the random-hyperplanes hash
- Finding the approximate nearest neighbor takes time \(O(p n^{\rho} \log{n})\), where \(\rho\) is a (calculable) constant \(< 1\)
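A sketch of this hash (pure Python; the dimension and cell width \(r\) are arbitrary choices for illustration):

```python
import math, random

random.seed(11)
p, r = 8, 2.0
V = [random.gauss(0, 1) for _ in range(p)]   # standard Gaussian entries
B = random.uniform(0, r)                     # random offset

def h(x):
    """Quantize a random projection into width-r cells: nearby vectors
    usually land in the same cell, distant ones usually don't."""
    return math.floor((sum(v_i * x_i for v_i, x_i in zip(V, x)) + B) / r)

x = [random.gauss(0, 1) for _ in range(p)]
x_near = [x_i + 1e-6 for x_i in x]           # a tiny perturbation of x
print(h(x) == h(x_near))                     # almost always True for nearby points
```

The random offset \(B\) makes the cell boundaries random too, so no fixed pair of nearby points is systematically split apart.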

- Randomly assign each data point to a bin from \(1\) to \(q\), giving labels \(L_1, \ldots L_n\)
- Compute the average of all points within each bin
- Get vectors \(\vec{c}_1, \ldots \vec{c}_q\)

- Re-label points: \(L_i = \argmin_{j \in 1:q}{\|\vec{x}_i - \vec{c}_j\|}\)
- Re-compute averages, re-label, etc., until nothing changes
- To find neighbors for a new point, find the bin center it’s closest to, and then look for neighbors in that bin
- If \(q=\sqrt{n}\) then it takes \(O(p\sqrt{n})\) to find the right bin, and \(O(pn/q) = O(p\sqrt{n})\) to search for neighbors within the bin
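A minimal sketch of this cluster hash (essentially Lloyd's \(k\)-means iterations; the toy 2-D data and the iteration cap are my own choices):

```python
import math, random

random.seed(5)
n, q = 400, 20                 # q ~ sqrt(n) bins
X = [(random.random(), random.random()) for _ in range(n)]

# Random initial labels, then alternate averaging and re-labeling
labels = [random.randrange(q) for _ in range(n)]
centers = [X[j] for j in range(q)]           # fallback centers for empty bins
for _ in range(50):                          # cap iterations for safety
    for j in range(q):
        members = [X[i] for i in range(n) if labels[i] == j]
        if members:
            centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    new_labels = [min(range(q), key=lambda j: math.dist(X[i], centers[j]))
                  for i in range(n)]
    if new_labels == labels:                 # nothing changed: converged
        break
    labels = new_labels

# New point: O(pq) to find the nearest center, then search its ~n/q points
x_new = (0.5, 0.5)
j_star = min(range(q), key=lambda j: math.dist(x_new, centers[j]))
bucket = [X[i] for i in range(n) if labels[i] == j_star]
print(len(bucket), "points to search instead of", n)
```
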

- We need to hash all the training data
- Figuring out which bin a new \(\vec{x}\) belongs to (hashing it) is fast
- We don’t need to keep all the data around to hash \(\vec{x}\)
- Randomness helps!
- Only the cluster hash needs the training data to work out the hash function
- Only the cluster hash needs to be revised when we get more data

- We still need to keep the training data around, but not in memory
- \(O(np)\) storage is a lot cheaper than \(O(np)\) RAM

- Time (and memory) costs of straightforward kNN are linear in \(n\)
- Technically “tractable”, but not good when \(n\) is industrial-sized

- Most of the time comes from computing distances and finding the nearest neighbors
- Common ways to find (approximate) nearest neighbors faster:
- Use fewer training points
- Use projections to approximate distances
- Use search trees, hashes or clusters to pre-select possible neighbors

- We pay some cost in risk for great savings in time

- Model selection: See Claeskens and Hjort (2008)
- Sampling: There are ways to *carefully* select subsets of points which will work almost as well as the full data, but they’re complicated and it’s not clear how much they really improve over random sampling; references in the textbook
- \(k\)-d trees: Due to Bentley (1975) (a very clear paper)
- Using k-d trees as density estimators: see Gershenfeld (1999)

- Locality-sensitive hashing: Due to Gionis, Indyk, and Motwani (1999)
- Good explanations in Leskovec, Rajaraman, and Ullman (2014)
- Random hyperplanes hash: Charikar (2002)
- Random projections hash: Datar et al. (2004)
- Clustering hash: we’ll come back to this when we look at k-means clustering

Azadkia, Mona. 2019. “Optimal Choice of \(k\) for \(k\)-Nearest Neighbor Regression.” E-print, arxiv:1909.05495. http://arxiv.org/abs/1909.05495.

Bentley, Jon Louis. 1975. “Multidimensional Binary Search Trees Used for Associative Searching.” *Communications of the ACM* 18:508–17. https://doi.org/10.1145/361002.361007.

Charikar, Moses S. 2002. “Similarity Estimation Techniques from Rounding Algorithms.” In *Proceedings of the 34th Annual ACM Symposium on Theory of Computing [STOC ’02]*, edited by John Reif, 380–88. New York: ACM. https://doi.org/10.1145/509907.509965.

Claeskens, Gerda, and Nils Lid Hjort. 2008. *Model Selection and Model Averaging*. Cambridge, England: Cambridge University Press.

Datar, Mayur, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. “Locality-Sensitive Hashing Scheme Based on P-Stable Distributions.” In *Proceedings of the 20th Annual Symposium on Computational Geometry [SCG ’04]*, edited by Jack Snoeyink and Jean-Daniel Boissonnat, 253–62. New York: ACM. https://doi.org/10.1145/997817.997857.

Gershenfeld, Neil. 1999. *The Nature of Mathematical Modeling*. Cambridge, England: Cambridge University Press.

Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. 1999. “Similarity Search in High Dimensions via Hashing.” In *Proceedings of the 25th International Conference on Very Large Data Bases [Vldb ’99]*, edited by Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, 518–29. San Francisco: Morgan Kaufmann.

Leskovec, Jure, Anand Rajaraman, and Jeffrey D. Ullman. 2014. *Mining of Massive Datasets*. Second. Cambridge, England: Cambridge University Press. http://www.mmds.org.