# Reminder: statistical (regression) models

You have some data $$X_1,\ldots,X_p,Y$$: the variables $$X_1,\ldots,X_p$$ are called predictors, and $$Y$$ is called the response. You're interested in the relationship that governs them.

So you posit that $$Y|X_1,\ldots,X_p \sim P_\theta$$, where $$\theta$$ represents some unknown parameters. This is called a regression model for $$Y$$ given $$X_1,\ldots,X_p$$. The goal is to estimate the parameters $$\theta$$. Why?

• To assess model validity, predictor importance (inference)
• To predict future $$Y$$’s from future $$X_1,\ldots,X_p$$’s (prediction)

# Shifting tides: a focus on prediction

Classically, statistics has focused in large part on inference. The tides are shifting (at least in part), and in many modern problems, the following view is taken:

Models are only approximations; some methods need not even have underlying models; let’s evaluate prediction accuracy, and let this determine model/method usefulness

This is (in some sense) one of the basic tenets of machine learning


# Statistical prediction machines

Some methods for predicting $$Y$$ from $$X_1,\ldots,X_p$$ have (in a sense) no parameters at all. Perhaps better said: they are not motivated from writing down a statistical model like $$Y|X_1,\ldots,X_p \sim P_\theta$$

We’ll call these statistical prediction machines. Admittedly: not a real term, but it’s evocative of what they are doing, and there’s no real consensus terminology. You might also see these described as:

• Model-free methods
• Distribution-free methods
• Machine learning methods

Comment: in a broad sense, most of these methods would have been completely unthinkable before the rise of high-performance computing

# $$k$$-nearest neighbors

One of the simplest prediction machines: $$k$$-nearest neighbors regression

• Given training data $$X_i=(X_{i1},\ldots,X_{ip})$$ and $$Y_i$$, $$i=1,\ldots,n$$
• Given a new test point $$X^*=(X^*_1,\ldots,X^*_p)$$
• Find $$k$$-nearest training points $$X_{(1)},\ldots,X_{(k)}$$ to $$X^*$$
• Use as our prediction the average of the responses at these points: $$\hat{Y}^* = \frac{1}{k}\sum_{i=1}^k Y_{(i)}$$

Ask yourself: what happens when $$k=1$$? What happens when $$k=n$$?
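The steps above can be sketched in a few lines. This is a minimal illustration (not from the source), assuming Euclidean distance and ties broken by sort order; the data are toy data invented for the example:

```python
import numpy as np

def knn_predict(X_train, y_train, x_star, k):
    """Predict Y at x_star by averaging the responses of the
    k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_star, axis=1)
    nearest = np.argsort(dists)[:k]     # indices of the k closest points
    return y_train[nearest].mean()      # average their responses

# toy training data: Y = sum of coordinates plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X.sum(axis=1) + 0.1 * rng.normal(size=100)

x_star = np.array([0.5, -0.2])
print(knn_predict(X, y, x_star, k=5))
```

Note the two extremes: with $$k=1$$ the prediction at any training point is exactly its own response (zero training error, high variance); with $$k=n$$ every prediction is the global mean $$\bar{Y}$$ (high bias).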

# From $$k$$-nearest neighbors to trees

Can think of $$k$$-nearest neighbors predictions as being simply given by averages within each element of what is called a Voronoi tessellation: these are polyhedra that partition the predictor space

Regression trees are similar in spirit but different in detail. In a nutshell, they use (nested) rectangles instead of polyhedra. These rectangles are fit through sequential (greedy) split-point determinations
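One such greedy split-point determination can be sketched as follows. This is a hypothetical one-predictor example (not from the source): it searches over all split points on a single variable and picks the one minimizing the total squared error of the two resulting within-rectangle means; a full tree would repeat this recursively within each rectangle:

```python
import numpy as np

def best_split(x, y):
    """Greedy search for the single split point on a 1-D predictor
    that minimizes total squared error around the two group means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_point = np.inf, None
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() \
            + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # split halfway between adjacent observed values
            best_sse, best_point = sse, (xs[i - 1] + xs[i]) / 2
    return best_point

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
print(best_split(x, y))  # → 6.5, the midpoint of the gap between groups
```

With $$p$$ predictors, the same search runs over every variable, and the best (variable, split-point) pair defines the next rectangle boundary.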
