---
title: $k$-Nearest Neighbors I (Mostly Theory)
author: 36-462/662, Fall 2019
date: 25 September 2019 (Lecture 9)
bibliography: locusts.bib
output:
  html_document:
    toc: true
---

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Prob}[1]{\Pr\left( #1 \right)}$

# Setting: Prediction, including both classification and regression

Let's fix our setting. As usual, we have a database of $n$ items, represented as vectors of $p$ features. Following the usual notation for regression courses, we'll write this as an $n\times p$ matrix $\mathbf{x}$; the vector for data-point $i$ will be $\vec{x}_i$. (To get to this point, we may have done some dimension reduction as a pre-processing step, but that won't matter for us here.)

Beyond these features, we have an additional variable for each item that we want to **predict**, based on the features. We'll write it $y_i$ for data-point $i$, compiled into the $n\times 1$ matrix $\mathbf{y}$ (again, this is regression notation). This variable is called the **label**, **outcome**, **target**, **output** or (oddly) **dependent variable** (sometimes even just called the **predictand**).

A prediction here is going to be a function of the features which outputs a guess ("point prediction") about the outcome or label.

- **Regression**: $y$ is a continuous numerical variable, so the **regression function** should map $\vec{x}$ to a number.
- **Classification**: $y$ is binary, so the **classification rule** should map $\vec{x}$ to 0 or 1.
    + Multi-class classification works similarly but with more notation.

## Prediction quality, risk, optimal risk and optimal predictors

How good is the guess?

- Regression: measure with expected squared error, so $\Expect{(Y-m(\vec{X}))^2}$ indicates how bad the function $m$ is.
- Classification: measure with inaccuracy or error rate, so $\Prob{Y \neq m(\vec{X})}$ indicates how bad the function $m$ is.

This expected error on new data is called the **risk**. In predictive modeling, we want to learn functions with low risk.

There's an optimal function, i.e., one whose risk is no higher than that of any other function.

- Regression: The optimum function is $\mu(\vec{x}) = \Expect{Y|\vec{X}=\vec{x}}$. This is called the **true regression function**.
- Classification: The optimum function[^bayes] depends on the conditional probability $\Prob{Y=1|\vec{X}=\vec{x}} \equiv p(\vec{x})$. The function is $c(\vec{x}) = 1$ if $p(\vec{x}) \geq 0.5$ and $c(\vec{x}) = 0$ otherwise.

[^bayes]: Some people call it the **Bayes rule**, even though it has nothing to do with Bayes's rule, with Bayesian inference, or with Thomas Bayes.

Even the optimal function will not, in general, have zero risk.

- Regression: Suppose $Y=\mu(\vec{X}) + \epsilon$, where the noise $\epsilon$ has $\Expect{\epsilon|\vec{X}} = 0$, $\Var{\epsilon|\vec{X}=\vec{x}} = \sigma^2(\vec{x})$. (In your linear-regression class, you would have assumed this, _plus_ that $\sigma^2$ is constant.) Then the risk of the true regression function is $\Expect{\sigma^2(\vec{X})} > 0$.
- Classification: Suppose $p(\vec{x})$ isn't either 0 or 1 everywhere. Then the probability of mis-classifying at $\vec{x}$ is $p(\vec{x})$ if $p(\vec{x}) < 0.5$ (because there $c(\vec{x}) = 0$), and $1-p(\vec{x})$ if $p(\vec{x}) \geq 0.5$. A little thought shows that we can write the conditional probability of error in a unified way[^altexp] as $\min{\left\{p(\vec{x}),1-p(\vec{x})\right\}}$. The risk of the optimal classifier is then $\Expect{\min{\left\{p(\vec{X}),1-p(\vec{X})\right\}}}$. This minimal risk will be $>0$ unless $p(\vec{x})=0$ or $=1$ almost everywhere.

[^altexp]: An alternative expression would be $\frac{1}{2} - \left|p(\vec{x})-\frac{1}{2}\right|$, but this will be less useful later on.

Unfortunately, the optimal function depends on the true distribution generating the data. So does the risk of the optimal function.
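
To make the optimal classification rule and its risk concrete, here is a small simulation sketch. Everything specific in it is an assumption made for illustration and is not part of the lecture: the logistic form of `p`, the uniform distribution of the single feature on $(-3, 3)$, and the sample size. It compares the observed error rate of the rule $c$ with the Monte Carlo average of $\min\{p, 1-p\}$.

```python
import math
import random

def p(x):
    """Hypothetical P(Y=1 | X=x): a logistic curve, chosen only for illustration."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))

def optimal_rule(x):
    """The optimal classifier c(x): predict 1 exactly when p(x) >= 0.5."""
    return 1 if p(x) >= 0.5 else 0

def simulate(n, seed=0):
    """Return (observed error rate of c, Monte Carlo average of min(p, 1-p))."""
    rng = random.Random(seed)
    errors = 0
    min_sum = 0.0
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)           # assumed feature distribution
        y = 1 if rng.random() < p(x) else 0  # label drawn with P(Y=1|x) = p(x)
        errors += (optimal_rule(x) != y)
        min_sum += min(p(x), 1.0 - p(x))
    return errors / n, min_sum / n

empirical_risk, formula_risk = simulate(200_000)
print(empirical_risk, formula_risk)
```

The two printed numbers should agree up to simulation noise, and neither is zero, since this assumed $p(x)$ is strictly between 0 and 1 everywhere.
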
What we want, then, is some way of estimating a function from the data which can learn what the true function is, or at least learn to predict almost as well as the true function, without having to know in advance too much about that function.

# Nearest neighbors as a predictor

This is where nearest neighbors comes in.

In this context, "distance" always refers to distances between the $p$-dimensional feature vectors. The **nearest neighbor** of a vector $\vec{x}$ is the $\vec{x}_i$ closest to it. The $k$ nearest neighbors are the $k$ vectors $\vec{x}_i$ closest to $\vec{x}$. (Notice that these definitions make sense whether or not $\vec{x}$ is also one of the $\vec{x}_i$.) We will often need a way of keeping track of the indices of the neighbors, so we'll write $NN(\vec{x}, j)$ for the index of the $j^{\mathrm{th}}$ nearest neighbor of $\vec{x}$.

The $k$-nearest-neighbor estimate of the regression function is then the average value of the response over the $k$ nearest neighbors:
$$
\hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{y_{NN(\vec{x}, j)}}
$$

For classification, we similarly average the labels of neighbors to estimate $p(\vec{x})$,
$$
\hat{p}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{y_{NN(\vec{x}, j)}}
$$
and then threshold it:
$$
\hat{c}(\vec{x}) = \mathbf{1}(\hat{p}(\vec{x}) \geq 0.5)
$$

# Analysis of 1-Nearest-Neighbors for Learning Noise-Free Functions

There are lots of estimation methods we _could_ use. To decide on using this one, nearest neighbors, we should have some reason to think it will predict well. This is where theory comes in.

Start with the simplest, most extreme setting, to build ideas. We'll assume that there is _no noise_ in the outcomes (responses, labels). This means for regression that $y_i = \mu(\vec{x}_i)$, and for classification that $y_i = c(\vec{x}_i)$. If nearest neighbors can't learn to predict here, it's got to be toast; if it can, we'll add noise back in. Let's also make our lives simple by only looking at 1-nearest-neighbors, $k=1$.
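
Before the analysis, it may help to see the predictor itself written out. This is a minimal brute-force sketch of the definitions of $\hat{\mu}$, $\hat{p}$ and $\hat{c}$ from the previous section, assuming Euclidean distance. The function names and the toy data are made up for illustration, and a serious implementation would use a faster neighbor search than sorting all $n$ distances.

```python
import math

def k_nearest_indices(x, xs, k):
    """Indices NN(x, 1), ..., NN(x, k): the k training vectors closest to x."""
    order = sorted(range(len(xs)), key=lambda i: math.dist(x, xs[i]))
    return order[:k]

def knn_regress(x, xs, ys, k):
    """mu-hat(x): average response over the k nearest neighbors of x."""
    neighbors = k_nearest_indices(x, xs, k)
    return sum(ys[i] for i in neighbors) / k

def knn_classify(x, xs, labels, k):
    """c-hat(x): average the 0/1 labels of the neighbors, then threshold at 0.5."""
    p_hat = knn_regress(x, xs, labels, k)
    return 1 if p_hat >= 0.5 else 0

# Toy data (hypothetical): four points in the plane.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
ys = [1.0, 2.0, 3.0, 10.0]   # responses for regression
labels = [0, 1, 1, 0]        # labels for classification

query = (0.1, 0.1)
print(knn_regress(query, xs, ys, k=3))       # mean of the 3 closest responses
print(knn_classify(query, xs, labels, k=1))  # label of the single nearest neighbor
print(knn_classify(query, xs, labels, k=3))  # vote among the 3 nearest neighbors
```

With $k=1$ the prediction simply copies the outcome of the single closest training point, which is the case analyzed next.
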
To simplify notation, I'll write $NN$ as the index for the nearest neighbor of $\vec{x}$, leaving the dependence on $\vec{x}$ implicit.

In this setting --- no noise, $k=1$ --- the error nearest neighbors will make for regression at $\vec{x}$ will be
$$
\mu(\vec{x}) - y_{NN} = \mu(\vec{x}) - \mu(\vec{x}_{NN})
$$
so the risk will be
$$
\Expect{(\mu(\vec{X}) - \mu(\vec{X}_{NN}))^2}
$$
Similarly, for classification, the risk will be
$$
\Prob{c(\vec{X}) \neq c(\vec{X}_{NN})}
$$

We'd like these risks to go to 0 as $n\rightarrow\infty$ (because here the optimal risk _is_ zero). The equations above suggest that for regression we want the true function to be continuous, but for classification we want it to be piecewise constant. (In fact, piecewise continuity is usually enough for regression.) But even with this, we need to see that $\vec{x}_{NN} \rightarrow \vec{x}$, otherwise continuity won't help.

## Convergence of the nearest neighbor

Requiring $\vec{x}_{NN} \rightarrow \vec{x}$ is the same as wanting $\|\vec{x}_{NN} - \vec{x}\| \rightarrow 0$. When will this happen?

Well, pick some positive distance $\epsilon > 0$. What is the probability that $\|\vec{x}_{NN} - \vec{x}\| > \epsilon$? Ideally, we'd like this to go to zero as $n\rightarrow\infty$, no matter how small that $\epsilon$ might be; that would indicate that the nearest neighbor is approaching the point of interest.

A little thought should convince you that the nearest neighbor is more than $\epsilon$ away from $\vec{x}$ if and only if every $\vec{x}_i$ is more than $\epsilon$ away. So
$$
\Prob{\|\vec{x}_{NN} - \vec{x}\| > \epsilon} = \Prob{\forall i, \|\vec{x}_i -\vec{x}\| > \epsilon}
$$

At this point, we need to make an assumption about the feature vectors. We'll assume they're IID.
Then the probability of _all_ the feature vectors doing the same thing (being far from $\vec{x}$) turns into the product of _each_ of them doing that thing:

\begin{eqnarray}
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} & = & \Prob{\forall i, \|\vec{X}_i -\vec{x}\| > \epsilon}\\
& = & \prod_{i=1}^{n}{\Prob{ \|\vec{X}_i -\vec{x}\| > \epsilon}}\\
& = & \left(\Prob{ \|\vec{X} -\vec{x}\| > \epsilon}\right)^n
\end{eqnarray}

This has got to go to zero as $n\rightarrow\infty$ (which is what we want), unless the probability we're raising to the power $n$ is exactly 1. To get a handle on that, let's re-write it a bit more:

\begin{eqnarray}
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} & = & \left(\Prob{ \|\vec{X} -\vec{x}\| > \epsilon}\right)^n\\
& = & \left(1 - \Prob{\|\vec{X} - \vec{x}\| \leq \epsilon}\right)^n
\end{eqnarray}

So all we need is for there to be _some_ probability of being within $\epsilon$ of $\vec{x}$. If we're asking for a prediction at a point in the middle of a region of zero probability, nearest neighbors is not a great idea, but otherwise, we're set.

We can be a little bit more detailed by approximating the probability in question. Assume $\vec{X}$ follows a pdf $f(\vec{u})$. Then the probability integrates the pdf over the radius-$\epsilon$ ball centered on $\vec{x}$,
$$
\Prob{\|\vec{X} - \vec{x}\| \leq \epsilon} = \int_{\vec{u}: \|\vec{u} - \vec{x}\| \leq \epsilon}{f(\vec{u}) d\vec{u}}
$$
Let's assume $\epsilon$ is small. (Ultimately we want it to shrink towards zero, after all.) Over a small enough ball, $f(\vec{u})$ will be nearly constant, and equal to $f(\vec{x})$. So
$$
\Prob{\|\vec{X} - \vec{x}\| \leq \epsilon} \approx c_p \epsilon^p f(\vec{x})
$$
where $c_p$ is a constant, geometrical factor ($c_2 = \pi$, $c_3 = \frac{4}{3}\pi$, etc.). That is, the probability of a _small_ ball centered around $\vec{x}$ is about $f(\vec{x})$ times the volume of the ball.
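
The small-ball approximation is easy to check numerically. The sketch below assumes, purely for illustration, that $\vec{X}$ is a standard bivariate Gaussian ($p=2$), and picks the point $\vec{x} = (0.5, 0)$ and radius $\epsilon = 0.1$ arbitrarily. It uses the standard formula for the volume of the unit ball, $c_p = \pi^{p/2}/\Gamma(p/2+1)$, which reproduces the values of $c_2$ and $c_3$ quoted above.

```python
import math
import random

def unit_ball_volume(p):
    """Geometric factor c_p: volume of the unit ball in p dimensions."""
    return math.pi ** (p / 2) / math.gamma(p / 2 + 1)

def normal_pdf_2d(x):
    """Density f(x) of a standard bivariate Gaussian (the assumed distribution)."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)

def ball_probability_mc(x, eps, n, seed=0):
    """Monte Carlo estimate of P(||X - x|| <= eps) under the assumed Gaussian."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        u = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        if math.dist(u, x) <= eps:
            hits += 1
    return hits / n

x = (0.5, 0.0)   # point where we want a prediction (arbitrary choice)
eps = 0.1        # small radius (arbitrary choice)

approx = unit_ball_volume(2) * eps ** 2 * normal_pdf_2d(x)  # c_p * eps^p * f(x)
estimate = ball_probability_mc(x, eps, 300_000)
print(approx, estimate)
```

The two numbers should be close, with the remaining gap coming from simulation noise and from the density not being exactly constant over the ball.
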
Putting all this together,
$$
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} \approx (1-c_p \epsilon^p f(\vec{x}))^n
$$
Let's make use of one last approximation, that $(1+h)^b \approx 1+bh$ when $|h| \ll 1$. (Use the binomial theorem if you don't believe me.) Then we get
$$
\Prob{\|\vec{X}_{NN} - \vec{x}\| > \epsilon} \approx 1- n c_p \epsilon^p f(\vec{x})
$$
This probability is going to zero for each fixed $\epsilon$. (The linear approximation is only trustworthy while $n c_p \epsilon^p f(\vec{x})$ is small, but the exact expression $(1-c_p \epsilon^p f(\vec{x}))^n$ shrinks geometrically in $n$.) If we want this to be constant --- say, if we want to find an $\epsilon$ which bounds the distance to the nearest neighbor with 50% confidence, the median nearest-neighbor distance --- we'd need to say
$$
1- n c_p \epsilon^p f(\vec{x}) = \delta
$$
for a constant $\delta$ ($\delta = 0.5$ for the median), or
$$
\epsilon = n^{-1/p} \left(\frac{(1-\delta)}{c_p f(\vec{x})}\right)^{1/p}
$$
So the typical distance to the nearest neighbor shrinks at rate $n^{-1/p}$, which $\rightarrow 0$ as $n\rightarrow\infty$, as desired.

## 1NN is consistent for noise-free functions

To recap, because $\|\vec{x}_{NN} - \vec{x}\| \rightarrow 0$ as $n\rightarrow\infty$, if the true function is (piecewise) continuous, then 1NN will approximate it arbitrarily well given enough data. When an estimator converges on the truth as $n\rightarrow\infty$, it's called "consistent", so we've just shown that nearest neighbors is consistent for learning noise-free functions.

# Putting the noise back in for 1NN

What happens if we add in noise, but still use 1NN? In the regression case, $Y=\mu(\vec{X})+\epsilon$, so

\begin{eqnarray}
\hat{\mu}(\vec{x}) & = & y_{NN}\\
& = & \mu(\vec{X}_{NN}) + \epsilon_{NN}
\end{eqnarray}

The error in predicting a new response at $\vec{x}$, $Y_{new}$, is thus

\begin{eqnarray}
Y_{new} - \hat{\mu}(\vec{x}) & = & \mu(\vec{x}) + \epsilon_{new} - \mu(\vec{X}_{NN}) - \epsilon_{NN}\\
& = & (\mu(\vec{x}) - \mu(\vec{X}_{NN})) + \epsilon_{new} - \epsilon_{NN}
\end{eqnarray}

As $n\rightarrow\infty$, the $\mu$ term in parentheses $\rightarrow 0$, since $\mu$ is continuous (by assumption) and the nearest neighbor converges on the point.
So the error approaches $\epsilon_{new} - \epsilon_{NN}$. Squaring, taking expectations, and remembering that the noises are uncorrelated, we get that the risk of 1NN regression at $\vec{x}$ approaches
$$
\Var{\epsilon|\vec{X}=\vec{x}} + (-1)^2\Var{\epsilon|\vec{X}=\vec{x}} = 2\sigma^2(\vec{x})
$$
The over-all risk of 1NN regression thus approaches $2\Expect{\sigma^2(\vec{X})}$ as $n\rightarrow\infty$. But the risk of the true regression function is already $\Expect{\sigma^2(\vec{X})}$, so we've come within a factor of two of the optimum risk.[^homosked]

[^homosked]: If the noise variance is constant, $\sigma^2(\vec{x}) = \sigma^2$, this simplifies: the risk of 1NN regression approaches $2\sigma^2$, while the risk of the true regression function is just $\sigma^2$.

For classification, the risk at a particular point $\vec{x}$ is

\begin{eqnarray}
\Prob{Y_{new} \neq \hat{c}(\vec{x})} & = & \Prob{Y_{new} \neq Y_{NN}}\\
& = & \Prob{Y_{new}=1, Y_{NN}=0} + \Prob{Y_{new}=0, Y_{NN}=1}\\
& = & p(\vec{x})(1-p(\vec{x}_{NN})) + (1-p(\vec{x}))p(\vec{x}_{NN})
\end{eqnarray}

As $\vec{x}_{NN} \rightarrow \vec{x}$, this approaches $2p(\vec{x})(1-p(\vec{x}))$ provided $p$ is a continuous function (or at least piecewise continuous). Recall from earlier that the conditional risk of the optimal classification function is $\min{\left\{p(\vec{x}), 1-p(\vec{x})\right\}}$, say $r(\vec{x})$. Since $p(1-p)$ is unchanged when $p$ and $1-p$ are swapped, the conditional risk of 1NN approaches $2r(\vec{x})(1-r(\vec{x})) \leq 2r(\vec{x})$. The over-all risk will thus approach $2\Expect{p(\vec{X})(1-p(\vec{X}))}$, which is at most twice the risk of the optimal classifier.

# What about multiple neighbors?

Recall how we defined the predictions for $k$-nearest-neighbor regression[^class]:
$$
\hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{Y_{NN(\vec{x}, j)}}
$$
For every data point, $Y=\mu(\vec{X})+\epsilon$, where quite generally $\Expect{\epsilon|\vec{X}=\vec{x}} = 0$.
So we can write
$$
\hat{\mu}(\vec{x}) = \frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{x}_{NN(\vec{x}, j)})} + \frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}
$$
What we'd like the prediction to be is of course $\mu(\vec{x})$, as before.

[^class]: The analysis for kNN-classification is very similar and comes to the same conclusion, but I don't feel like writing everything out twice.

The last equation makes it clear that the error in kNN-regression has two sources:

1. Evaluating the true regression function at the nearest neighbors rather than at $\vec{x}$ itself. That is, we're _approximating_ the quantity we want ($\mu(\vec{x})$) by something else, namely $\frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{x}_{NN(\vec{x}, j)})}$. We'll call this the approximation error[^bias].
2. The noise in the response values for the nearest neighbors $\left( \frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}\right)$. This is pure noise.

[^bias]: If we think of the locations of the nearest neighbors as fixed, and only the responses $Y$ as random, then we can call this "bias" in the technical sense, as the expected difference between the estimate $\hat{\mu}(\vec{x})$ and the truth $\mu(\vec{x})$. If we treat the locations of the nearest neighbors as random, then the bias would be $\Expect{\frac{1}{k}\sum_{j=1}^{k}{\mu(\vec{X}_{NN(\vec{x}, j)})} - \mu(\vec{x})}$, which is a bit of a mess, though fortunately not something we'll need to know in detail, as the next paragraph will explain.

For 1NN, we controlled the approximation error by realizing that it goes to zero as $\vec{x}$'s nearest neighbor converges on $\vec{x}$. You[^exercise] can extend the argument to show that the $k^{\mathrm{th}}$ nearest neighbor does too, for any fixed $k$. If the $k^{\mathrm{th}}$ nearest neighbor is within $\epsilon$ of $\vec{x}$, then all $k$ nearest neighbor*s* must be too. Then continuity of $\mu$ says that the approximation error $\rightarrow 0$ as $n\rightarrow\infty$.

[^exercise]: "You", meaning "not me, at least not now".
The trick, however, is to realize that if the $k$th neighbor is more than $\epsilon$ away from $\vec{x}$, _at least_ $n-k+1$ of the data points must be more than $\epsilon$ away. (Said differently, if the $k$th neighbor is within $\epsilon$, then at least $k-1$ other data points must also be within $\epsilon$.) The probability of this happening is something we can calculate from a binomial distribution, with $n$ trials and a success probability depending on the probability of a random point being in the $\epsilon$-ball around $\vec{x}$.

As for the noise, it's the average of $k$ noise terms. If we assume the $\epsilon$s are uncorrelated across data points, we can say that
$$
\Var{\frac{1}{k}\sum_{j=1}^{k}{\epsilon_{NN(\vec{x}, j)}}} = \frac{1}{k^2}\sum_{j=1}^{k}{\Var{\epsilon_{NN(\vec{x}, j)}}}
$$
If $\Var{\epsilon|\vec{X}=\vec{u}} = \sigma^2(\vec{u})$, then all of those variances are converging on $\sigma^2(\vec{x})$, and we get
$$
\frac{\sigma^2(\vec{x})}{k}
$$
for the variance of the noise.

The over-all risk of kNN-regression at $\vec{x}$ will thus tend, as $n\rightarrow\infty$, to
$$
(\text{system noise}) + (\text{approximation error}) + (\text{estimation noise}) \rightarrow \sigma^2(\vec{x}) + 0 + \frac{\sigma^2(\vec{x})}{k} = \left(1+\frac{1}{k}\right)\sigma^2(\vec{x})
$$
That is, rather than having twice the optimum risk with $k=1$, kNN regression gets only $1+1/k$ times the optimum risk --- at least as $n\rightarrow\infty$.

That last phrase is of course why we don't just automatically set $k$ to be as large as possible. At any _finite_ $n$, we face a trade-off:

- Increasing $k$ means averaging over more data points for each prediction, which reduces the variance by averaging together more noise terms (i.e., big $k$ means less variance);
- Decreasing $k$ means averaging over fewer data points for each prediction, which reduces the approximation error by averaging over points closer to where we want a prediction (i.e., small $k$ means less bias).
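
The trade-off can be seen in a small simulation. The sketch below is an illustration under assumptions that are not from the lecture: one feature, uniform on $(0,1)$; $\mu(x) = \sin(2\pi x)$; Gaussian noise with constant $\sigma = 0.5$; $n = 500$; and predictions at the single point $x_0 = 0.25$. It estimates the risk of kNN regression at $x_0$ for a tiny, a moderate, and a very large $k$.

```python
import math
import random

def mu(x):
    """Hypothetical true regression function, chosen only for illustration."""
    return math.sin(2.0 * math.pi * x)

def knn_risk_at_point(x0, n, ks, sigma, reps, seed=0):
    """Monte Carlo estimate of E[(Y_new - mu_hat(x0))^2] for each k in ks."""
    rng = random.Random(seed)
    sq_err = {k: 0.0 for k in ks}
    for _ in range(reps):
        # Fresh training set of size n on each replication.
        xs = [rng.random() for _ in range(n)]
        ys = [mu(x) + rng.gauss(0.0, sigma) for x in xs]
        # Training indices sorted by distance to x0, nearest first.
        order = sorted(range(n), key=lambda i: abs(xs[i] - x0))
        # A new response at x0, to be predicted.
        y_new = mu(x0) + rng.gauss(0.0, sigma)
        for k in ks:
            mu_hat = sum(ys[i] for i in order[:k]) / k
            sq_err[k] += (y_new - mu_hat) ** 2
    return {k: total / reps for k, total in sq_err.items()}

risks = knn_risk_at_point(x0=0.25, n=500, ks=[1, 25, 250], sigma=0.5, reps=1000)
for k, risk in risks.items():
    print(k, round(risk, 3))
```

Here $\sigma^2 = 0.25$, so the limiting formula $(1+1/k)\sigma^2$ gives $0.5$ for $k=1$ and $0.26$ for $k=25$, and the simulated risks for those two values should land near them. For $k=250$, half the data, the formula would promise almost exactly $\sigma^2$, but the neighbors then spread over a wide interval around $x_0$, and the approximation error should push the risk well above what $k=25$ achieves.
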

This is a manifestation of one of the fundamental issues in statistics, the **bias-variance tradeoff**. When we are doing prediction, we don't (usually) _care_ about whether our errors come from bias or from variance, just about the over-all magnitude of the error. We will usually find that we want methods with _some_ bias, because the error added by the bias is more than compensated for by the reduction in variance. We need some practical way of deciding how much bias we want to trade for less variance; this is what we'll tackle next time.

# Background

The pioneering theoretical analysis of nearest neighbors, covering both regression and classification as special cases of prediction-in-general, was done by Cover in the 1960s [@Cover-Hart-nearest-neighbor; @Cover-estimation-by-NN; @Cover-rates-of-conv-for-NN]. What I've done above is basically "Cover made even simpler". For more refined analyses of kNN classification and regression, see the appropriate chapters of @Devroye-Gyorfi-Lugosi-probabilistic-theory-of-pattern-recognition and @Gyorfi-Kohler-Krzyzak-Walk-nonparametric-regrssion, respectively.

**Historical note**: "Find the most similar case with a known outcome, and guess that a new case will be similar" is such a natural idea that it's almost impossible to trace its earliest history. The recognition that this idea could be a general, explicit statistical method, along with the name "nearest neighbors", seems to go back to the 1950s (see @Cover-estimation-by-NN for references). But _because_ it's such a natural idea, it keeps getting re-invented in different subjects: in nonlinear dynamics and the physics of chaotic systems, for instance, it was introduced in the 1980s as the "method of analogs" (see @Kantz-Schreiber for references).

# References