---
title: $k$-Nearest Neighbors I (Mostly Theory)
author: 36-462/662, Fall 2019
date: 25 September 2019 (Lecture 9)
bibliography: locusts.bib
output:
  html_document:
    toc: true
---

\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Prob}[1]{\Pr\left( #1 \right)}
\]

# Setting: Prediction, including both classification and regression

Let's fix our setting.  As usual, we have a database of $n$ items, represented
as vectors of $p$ features.  Following the usual notation for regression
courses, we'll write this as an $n\times p$ matrix $\mathbf{x}$; the vector for
data-point $i$ will be $\vec{x}_i$.  (To get to this point, we may have done
some dimension reduction as a pre-processing step, but that won't matter for us
here.)  Beyond these features, we have an additional variable for each item
that we want to **predict**, based on the features.  We'll write it $y_i$ for
data-point $i$, compiled into the $n\times 1$ matrix $\mathbf{y}$ (again, this
is regression notation).  This variable is called the **label**, **outcome**,
**target**, **output** or (oddly) **dependent variable** (sometimes even just
called the **predictand**).

A prediction here is going to be a function of the features which outputs a
guess ("point prediction") about the outcome or label.

- **Regression**: $y$ is a continuous numerical variable, so the **regression
  function** should map $\vec{x}$ to a number.
- **Classification**: $y$ is binary, so the **classification rule**
  should map $\vec{x}$ to 0 or 1.
  + Multi-class classification works similarly but with more notation.

## Prediction quality, risk, optimal risk and optimal predictors

How good is the guess?

- Regression: measure with expected squared error, so
  $\Expect{(Y-m(\vec{X}))^2}$ indicates how bad the function $m$ is.
- Classification: measure with inaccuracy or error rate, so
  $\Prob{Y \neq m(\vec{X})}$ indicates how bad the function $m$ is.

This expected error on new data is called the **risk**.  In predictive
modeling, we want to learn functions with low risk.

There's an optimal function, i.e., one which has a lower risk than any
other function.

- Regression: The optimum function is $\mu(\vec{x}) = \Expect{Y|\vec{X}=\vec{x}}$.
  This is called the **true regression function**.
- Classification: The optimum function[^bayes] depends on the conditional probability
  $\Prob{Y=1|\vec{X}=\vec{x}} \equiv p(\vec{x})$.  The function is
  $c(\vec{x}) = 1$ if $p(\vec{x}) \geq 0.5$ and $c(\vec{x}) = 0$ otherwise.


[^bayes]: Some people call it the **Bayes rule**, even though it has nothing to do with Bayes's rule, with Bayesian inference, or with Thomas Bayes.

Even the optimal function will not, in general, have zero risk.

- Regression: Suppose $Y=\mu(\vec{X}) + \epsilon$, where the noise $\epsilon$
  has $\Expect{\epsilon|\vec{X}} = 0$, $\Var{\epsilon|\vec{X}=\vec{x}} = \sigma^2(\vec{x})$.
  (In your linear-regression class, you would have assumed this, _plus_ that
  $\sigma^2$ is constant.)  Then
  the risk of the true regression function is $\Expect{\sigma^2(\vec{X})} > 0$.
- Classification: Suppose $p(\vec{x})$ isn't either 0 or 1 everywhere.  Then
  the probability of mis-classifying at $\vec{x}$ is $p(\vec{x})$ if
  $p(\vec{x}) < 0.5$ (because there $c(\vec{x}) = 0$, and $1-p(\vec{x})$ if
  $p(\vec{x} \geq 0.5$.  A little thought shows that we can write the
  conditional probability in a unified way[^altexp] as $\min{\left{p(\vec{x}),1-p(\vec{x})\right}}$.
  The risk of the optimal classifier is then
  $\Expect{\min{\left{p(\vec{X}),1-p(\vec{X})\right}}}$.  This minimal risk will
  be $>0$ unless $p(\vec{x})=0$ or $=1$ almost everywhere.

[^altexp]: An alternative expression would be  $\frac{1}{2} - \left|p(\vec{x})-\frac{1}{2}\right|$, but this will be less useful later on.

Unfortunately, the optimal function depends on the true distribution
generating the data.  So does the risk of the optimal function.
What we want, then, is some way of estimating a function from the data
which can learn what the true function is, or at least learn to
predict almost as well as the true function, without having to know in
advance too much about that function.

# Nearest neighbors as a predictor

This is where nearest neighbors comes in.

In this context, "distance" always refers to distances between the
$p$-dimensional feature vectors.  The **nearest neighbor** of a vector
$\vec{x}$ is the $\vec{x}_i$ closest to it.  The $k$ nearest neighbors are the
$k$ vectors $\vec{x}_i$ closest to $\vec{x}$.  (Notice that these definitions
make sense whether or not $\vec{x}$ is also one of the $\vec{x}_i$.)  We will
often need a way of keeping track of the indices of the neighbors, so we'll
write $NN(\vec{x}, j)$ for the index of the $j^{\mathrm{th}}$ nearest neighbor
of $\vec{x}$.

The k-nearest-neighbor estimate of the regression function is then
the average value of the response over the $k$ nearest neighbors:
\[
\hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{y_{NN(\vec{x}, j)}}
\]
For classification, we similarly average the labels of neighbors to
estimate $p(\vec{x})$,
\[
\hat{p}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{y_{NN(\vec{x}, j)}}
\]
and then threshold it:
\[
\hat{c}(\vec{x}) = \mathbf{1}(\hat{p}(\vec{x}) \geq 0.5)
\]

# Analysis of 1-Nearest-Neighbors for Learning Noise-Free Functions

There are lots of estimation methods we _could_ use.  To decide on using
this one, nearest neighbors, we should have some reason to
think it will predict well.  This is where theory comes in.

Start with the simplest, most extreme setting, to build ideas.  We'll assume
that there is _no noise_ in the outcomes (responses, labels).  This means for
regression that $y_i = \mu(\vec{x}_i)$, and for classification that $y_i =
c(\vec{x}_i)$.  If nearest neighbors can't learn to predict here, it's got to
be toast; if it can, we'll add noise back in.  Let's also make our lives simple
by only looking at 1-nearest-neighbors, $k=1$.  To simplify notation, I'll
write $NN$ as the index for the nearest neighbor of $\vec{x}$, leaving the
dependence on $\vec{x}$ implicit.

In this setting --- no noise, $k=1$ --- the error nearest neighbors will
make for regression at $\vec{x}$ will be
\[
\mu(\vec{x}) - y_NN = \mu(\vec{x}) - \mu(x_{NN})
\]
so the risk will be
\[
\Expect{(\mu(\vec{X}) - \mu(\vec{X}_{NN}))^2}
\]
Similarly, for classification, the risk will be
\[
\Prob{c(\vec{X}) \neq c(\vec{X}_{NN})}
\]
We'd like these risks to go to 0 as $n\rightarrow\infty$ (because here the
optimal risk _is_ zero).  The equations above suggests that for regression we
want the true function to be continuous, but for classification we want it to
be piecewise constant.  (In fact, piecewise continuity is usually enough for
regression.)  But even with this, we need to see that $\vec{x}_{NN} \rightarrow
\vec{x}$, otherwise continuity won't help.

## Convergence of the nearest neighbor

Requiring $\vec{x}_{NN} \rightarrow \vec{x}$ is the same as want
$\|\vec{x}_{NN} - \vec{x}\| \rightarrow 0$.  When will this happen?

Well, pick some positive distance $\epsilon > 0$.  What is the
probability that $\|\vec{x}_{NN} - \vec{x}\| > \epsilon$?  Ideally, we'd
like this to go to zero as $n\rightarrow\infty$, no matter how small
that $\epsilon$ might be; that would indicate that the nearest neighbor
is approaching the point of interest.

A little thought should convince you that the nearest neighbor is more than
$\epsilon$ away from $\vec{x}$ if and only if every $\vec{x}_i$ is more than $\epsilon$ away.  So
\[
\Prob{\|\vec{x}_{NN} - \vec{x}\| > \epsilon} = \Prob{\forall i, \|\vec{x}_i -\vec{x}\| > \epsilon}
\]

At this point, we need to make an assumption about the feature vectors.  We'll
assume they're IID.  Then the probability of _all_ the feature vectors
doing the same thing (being far from $\vec{x}$) turns into the product of _each_ of them doing that thing:
\begin{eqnarray}
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} & = & \Prob{\forall i, \|\vec{X}_i -\vec{x}\| > \epsilon}\\
& = & \prod_{i=1}^{n}{\Prob{ \|\vec{X}_i -\vec{x}\| > \epsilon}}\\
& = & \left(\Prob{ \|\vec{X} -\vec{x}\| > \epsilon}\right)^n
\end{eqnarray}
This has got to go to zero as $n\rightarrow\infty$ (which is what we want),
unless the probability we're raising to the power $n$ is exactly 1.
To get a handle on that, let's re-write it a bit more:
\begin{eqnarray}
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} & = & \left(\Prob{ \|\vec{X} -\vec{x}\| > \epsilon}\right)^n\\
& = & \left(1 - \Prob{\|\vec{X} - \vec{x}\| \leq \epsilon}\right)^n
\end{eqnarray}
So all we need is for there to be {\em some} probability of being within
$\epsilon$ of $\vec{x}$.  If we're asking for a prediction at a point in the
middle of a region of zero probability, nearest neighbors is not a great idea,
but otherwise, we're set.

We can be a little bit more detailed by approximating the probability
in question.  Assume $\vec{X}$ follows a pdf $f(\vec{u})$.  Then
the probability integrates the pdf over the radius-$\epsilon$ ball
centered on $\vec{x}$,
\[
 \Prob{\|\vec{X} - \vec{x}\| \leq \epsilon} = \int_{\vec{u}: \|\vec{u} - \vec{x}\| \leq \epsilon}{f(\vec{u}) d\vec{u}}
\]
Let's assume $\epsilon$ is small.  (Ultimately we want it to shrink towards
zero, after all.)  Over a small enough ball, $f(\vec{u})$ will be nearly
constant, and equal to $f(\vec{x})$.  So
\[
 \Prob{\|\vec{X} - \vec{x}\| \leq \epsilon} \approx c_p \epsilon^p f(\vec{x})
\]
where $c_p$ is a constant, geometrical factor ($c_2 = \pi$, $c_3 = \frac{4}{3}\pi$, etc.).  That is, the probability of a _small_ ball centered around $\vec{x}$ is about $f(\vec{x})$ times the volume of the ball.

Putting all this together,
\[
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} \approx (1-c_p \epsilon^p f(\vec{x}))^n
\]

Let's make use of one last approximation, that $(1+h)^b \approx 1+bh$ when $|h| \ll 1$.  (Use the binomial theorem if you don't believe me.)  Then we
get
\[
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} \approx 1- n c_p \epsilon^p f(\vec{x})
\]
This is going to zero for each fixed $\epsilon$.  If we want this to be constant
--- say, if we want to find an $\epsilon$ which bounds the distance to
the nearest neighbor with 50% confidence, the median nearest-neighbor distance --- we'd need to say
\[
1- n c_p \epsilon^p f(\vec{x}) = \delta
\]
or
\[
\epsilon = n^{-1/p} \left(\frac{(1-\delta)}{c_p f(\vec{x})}\right)^{1/p}
\]
So the typical distance to the nearest neighbor is shrinking to 0, at rate $n^{-1/p}$.  This $\rightarrow 0$ as $n\rightarrow\infty$, as desired.

## 1NN is consistent for noise-free functions

To recap, because $\|\vec{x}_NN - \vec{x}\| \rightarrow 0$ as
$n\rightarrow\infty$, if the true function is (piecewise) continuous, then 1NN
will approximate it arbitrarily well given enough data.  When an estimator
converges on the truth as $n\rightarrow\infty$, it's called "consistent", so
we've just shown that nearest neighbors is consistent for learning
noise-free functions.

# Putting the noise back in for 1NN

What happens if we add in noise, but still use 1NN?  In the regression
case, $Y=\mu(X)+\epsilon$, so
\begin{eqnarray}
\hat{\mu}(\vec{x}) & = & y_{NN}\\
& = & \mu(\vec{X}_{NN}) + \epsilon_{NN}
\end{eqnarray}
The error in predicting a new response at $\vec{x}$, $Y_{new}$,
is thus
\begin{eqnarray}
Y_{new} - \hat{\mu}(\vec{x}) & = & \mu(\vec{x}) + \epsilon_{new} -  \mu(\vec{X}_{NN}) - \epsilon_{NN}\\
& = & (\mu(\vec{x}) -  \mu(\vec{X}_{NN})) + \epsilon_{new} - \epsilon_{NN}
\end{eqnarray}
As $n\rightarrow\infty$, the $\mu$ term in parentheses $\rightarrow 0$,
since $\mu$ is continuous (by assumption) and the nearest neighbor converges
on the point.  So the error approaches $\epsilon_{new} - \epsilon_{NN}$.
Squaring, taking expectations, and remembering that the noises are uncorrelated,
we get that the risk of 1NN regression at $\vec{x}$ approaches
\[
\Var{\epsilon|\vec{X}=\vec{x}} + (-1)^2\Var{-\epsilon|\vec{X}=\vec{x}} = 2\sigma^2(\vec{x})
\]
The over-all risk of 1NN regression thus approaches
\[
2\Expect{\sigma^2(\vec{X})}$
\]
as $n\rightarrow\infty$.
$2\Var{\epsilon}$.
But the risk of the true regression function is already $\Expect{\sigma^2(\vec{X})}$, so we've come within a factor of two of the optimum risk.[^homosked]

[^homosked]: If the noise variance is constant, $\sigma^2(\vec{x}) = \sigma^2$, this simplifes: the risk of 1NN regression approach $2\sigma^2$, while the risk of the true regression function is just $\sigma^2$.

For classification, the risk at a particular point $\vec{x}$ is
\begin{eqnarray}
\Prob{Y_{new} \neq \hat{c}(\vec{x}_{new})} & = & \Prob{Y_{new} \neq Y_{NN}}\\
& = & \Prob{Y_{new}=1, Y_{NN}=0} + \Prob{Y_{new}=0, Y_{NN}=1}\\
& = & p(\vec{x}(1-p(\vec{x}_{NN})) + (1-p(\vec{x}))p(\vec{x}_{NN})
\end{eqnarray}
As $\vec{x}_{NN} \rightarrow \vec{x}$, this approaches
\[
2p(\vec{x})(1-p(\vec{x}))
\]
provided $p$ is a continuous function (or at least piecewise continuous).
Recall from earlier that the conditional risk of the optimal classification function
is $\min{\left\{p(\vec{x}), 1-p(\vec{x})\right}}$, say $r(\vec{x})$.  So
the conditional risk of 1NN approaches $2r(\vec{x})(1-r(\vec{x}))$ \leq 2r(\vec{x})$.
The
over-all risk will thus approach
\[
2\Expect{p(\vec{X})(1-p(\vec{X}))}
\]
which is at most twice the risk of the optimal classifier.

# What about multiple neighbors?

Recall how we defined the predictions for $k$-nearest-neighbor regression[^class]:
\[
\hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{Y_{NN(\vec{x}, j)}}
\]
For every data point, $Y=\mu(\vec{X})+\epsilon$, where quite generally
$\Expect{\epsilon|\vec{X}=\vec{x}} = 0$.  So we can write
\[
\hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{x}_{NN(\vec{x}, j)}} + \frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}
\]
What we'd like the prediction to be is of course $\mu(\vec{x})$, as before.

[^class]: The analysis for kNN-classification is very similar and comes to the same conclusion, but I don't feel like writing everything out twice.


The last equation makes it clear that the error in kNN-regression has two
sources:

1. Evaluating the true regression function at the nearest neighbors.  That is, we're _approximating_ the quantity we want ($\mu(\vec{x})$) by something else, namely $\frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{x}_{NN(\vec{x}, j)})}$.  We'll call this the approximation error[^bias].
2. The noise in the response values for the nearest neighbors $\left( \frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}\right)$.  This is pure noise.

[^bias]: If we think of the locations of the nearest neighbors as fixed, and only the responses $Y$ as random, then we can call this "bias" in the technical sense, as the expected difference between the estimate $\hat{\mu}(\vec{x})$ and the truth $\mu(\vec{x})$.  If we treat the locations of the nearest neighbors as random, then the bias would be $\Expect{\frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{X}_{NN(\vec{x}, j)}) - \mu(\vec{x})}}$, which is a bit of a mess, though fortunately not something we'll need to know in detail, as the next paragraph will explain.

For 1NN, we controlled the approximation error by realizing that it goes to
zero as $\vec{x}$'s nearest neighbor converges on $\vec{x}$.  You[^exercise]
can extend the argument to show that the $k^{\mathrm{th}}$ nearest neighbor
does too, for any fixed $k$.  If the $k^{\mathrm{th}}$ nearest neighbor is
within $\epsilon$ of $\vec{x}$, then all of $k$ nearest neighbor_s_ must be
too.  Then continuity of $\mu$ says that the approximation error $\rightarrow
0$ as $n\rightarrow\infty$.

[^exercise]: "You", meaning "not me, at least not now".  The trick is however to realize that if the $k$th neighbor is more than $\epsilon$ away from $\vec{x}$, _at least_ $n-k+1$ of the data points must be more than $\epsilon$ away.  (Said differently, if the $k$th neighbor is within $\epsilon$, then at least $k-1$ other data points must also be within $\epsilon$.)  The probability of this happening is something we can calculate from a binomial distribution, with $n$ trials and a success probability depending on the probability of a random point being in the $\epsilon$-ball around $\vec{x}$.

As for the noise, it's the average of $k$ noise terms.  If we assume the $\epsilon$s are uncorrelated across data points, we can say that
\[
\Var{\frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}} = \frac{1}{k}\sum_{j=1}^{k^2}{\Var{\epsilon_{NN(\vec{x}, j)}}}
\]
If $\Var{\epsilon|\vec{X}=\vec{u}} = \sigma^2(\vec{u})$, then all of those
variances are converging on $\sigma^2(\vec{x})$, and we get
\[
\frac{\sigma^2(\vec{x})}{k}
\]
for the variance of the noise.

The over-all risk of kNN-regression at $\vec{x}$ will thus tend, as $n\rightarrow\infty$,
to
\[
(\text{system noise}) + (\text{approximation error}) + (\text{estimation noise}) \rightarrow \sigma^2(\vec{x}) + 0 + \frac{\sigma^2(\vec{x})}{k} = \left(1+\frac{1}{k}\right)\sigma^2(\vec{x})
\]
That is, rather than having twice the optimum risk with $k=1$, kNN regression gets only $1+1/k$ of the optimum risk --- at least as $n\rightarrow\infty$.

That last phrase is of course why we don't just automatically set $k$ to be
as large as possible.  At any _finite_ $n$, we face a trade-off:

- Increasing $k$ means averaging over more data points for each prediction, which reduces the variance by averaging together more noise terms (i.e., big $k$ means less variance);
- Decreasing $k$ means averaging over fewer data points for each prediction, which reduces the approximation error by averaging over points closer to where we want a prediction (i.e., small $k$ means less bias).

This is a manifestation of one of the fundamental issues in statistics, the
**bias-variance tradeoff**.  When we are doing prediction, we don't (usually)
_care_ about whether our errors come from bias or from variance, just about the
over-all magnitude of the error.  We will usually find that we want methods
with _some_ bias, because the error added by the bias is more than compensated
for by the reduction in variance.  We need some practical way of deciding how
much bias we want to trade for less variance; this is what we'll tackle
next time.


# Background

The pioneering theoretical analysis of nearest neighbors, covering both
regression and classification as special cases of prediction-in-general, was
done by Cover in the 1960s
[@Cover-Hart-nearest-neighbor; @Cover-estimation-by-NN; @Cover-rates-of-conv-for-NN].
What I've done above is basically "Cover made even simpler".  For more refined
analyses of kNN classification and regression, see the appropriate chapters of
@Devroye-Gyorfi-Lugosi-probabilistic-theory-of-pattern-recognition and
@Gyorfi-Kohler-Krzyzak-Walk-nonparametric-regrssion, respectively.


**Historical note**: "Find the most similar case with a known outcome, and
guess that a new case will be similar" is such a natural idea that it's almost
impossible to trace its earliest history.  The recognition that this idea could
be a general, explicit statistical method, along with the name "nearest
neighbors", seems to go back to the 1950s (see @Cover-estimation-by-NN for
references).  But _because_ it's such a natural idea that it keeps getting
re-invented in different subjects: in nonlinear dynamics and the physics of
chaotic systems, for instance, it was introduced in the 1980s as the "method of
analogs" (see @Kantz-Schreiber for references).

# References