You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You're interested in the relationship that governs them.

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called a **regression model** for \(Y\) given \(X_1,\ldots,X_p\). The goal is to estimate the parameters \(\theta\). Why?

- To assess model validity and predictor importance (**inference**)
- To predict future \(Y\)'s from future \(X_1,\ldots,X_p\)'s (**prediction**)

Classically, statistics has focused in large part on inference. The tides are shifting (at least to some extent), and in many modern problems the following view is taken:

Models are only approximations; some methods need not even have underlying models; let's evaluate prediction accuracy, and let this determine model/method usefulness.

This is (in some sense) one of the basic tenets of machine learning


Some methods for predicting \(Y\) from \(X_1,\ldots,X_p\) have (in a sense) **no parameters** at all. Perhaps better said: they are not motivated by writing down a statistical model like \(Y|X_1,\ldots,X_p \sim P_\theta\).

We’ll call these **statistical prediction machines**. Admittedly: not a real term, but it’s evocative of what they are doing, and there’s no real consensus terminology. You might also see these described as:

- Model-free methods
- Distribution-free methods
- Machine learning methods

Comment: in a broad sense, most of these methods would have been **completely unthinkable** before the rise of high-performance computing

One of the simplest prediction machines: **\(k\)-nearest neighbors** regression

- Given training data \(X_i=(X_{i1},\ldots,X_{ip})\) and \(Y_i\), \(i=1,\ldots,n\)
- Given a new test point \(X^*=(X^*_1,\ldots,X^*_p)\)
- Find the \(k\) nearest training points \(X_{(1)},\ldots,X_{(k)}\) to \(X^*\)
- Use as our prediction the average \(\hat{Y}^* = \frac{1}{k}\sum_{i=1}^k Y_{(i)}\)

Ask yourself: what happens when \(k=1\)? What happens when \(k=n\)?

**Advantages**: simple and flexible. **Disadvantages**: can be slow and cumbersome
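Here is a minimal sketch of \(k\)-nearest neighbors regression in Python (numpy only); the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def knn_predict(X_train, Y_train, x_star, k):
    # distances from the test point to every training point (Euclidean)
    dists = np.linalg.norm(X_train - x_star, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # prediction: the average response over those k neighbors
    return Y_train[nearest].mean()

# toy usage with made-up data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
Y_train = X_train[:, 0] + rng.normal(scale=0.1, size=100)
print(knn_predict(X_train, Y_train, np.array([0.5, -0.2]), k=5))
```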

Can think of \(k\)-nearest neighbors predictions as being simply given by averages within each element of what is called a **Voronoi tessellation**: these are polyhedra that partition the predictor space.

Regression **trees** are similar but somewhat different. In a nutshell, they use (nested) rectangles instead of polyhedra. These rectangles are fit through sequential (greedy) split-point determinations

**Advantage**: easier to make predictions (from split-points). **Disadvantage**: less flexible
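As a rough sketch of the greedy split-point idea, the function below (hypothetical names, numpy only) finds the single best split on one predictor by minimizing the squared error of predicting each side by its mean; a full regression tree would scan all predictors and then recurse within each resulting rectangle:

```python
import numpy as np

def best_split(x, y):
    # sort the single predictor x, carrying y along
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    best_sse, best_point = np.inf, None
    # try every split point between consecutive sorted x-values
    for i in range(1, len(x_s)):
        left, right = y_s[:i], y_s[i:]
        # predict each side by its mean; measure total squared error
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_sse, best_point = sse, (x_s[i-1] + x_s[i]) / 2
    return best_point, best_sse

# toy usage: the best split should land near x = 0
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = np.where(x < 0, 0.0, 1.0) + rng.normal(scale=0.1, size=200)
print(best_split(x, y))
```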

**Boosting** is a method built on top of regression trees in a clever way. To make predictions, we can think of taking predictions from a sequence of trees, and combining them with weights (coefficients):

\[
\hat{Y}^* = \beta_1 \cdot \hat{Y}^*_{\mathrm{tree}\,1} + \beta_2 \cdot \hat{Y}^*_{\mathrm{tree}\,2} + \cdots + \beta_B \cdot \hat{Y}^*_{\mathrm{tree}\,B}
\]
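The prediction step of this combination can be sketched as follows (assuming each already-fitted tree exposes a hypothetical `predict` method; the stagewise fitting of the trees and the weights \(\beta_b\) is not shown):

```python
def boosted_predict(trees, betas, x_star):
    # each `tree` is assumed to have a .predict(x_star) method returning a number;
    # the boosted prediction is the weighted sum of the individual tree predictions
    return sum(beta * tree.predict(x_star) for beta, tree in zip(betas, trees))
```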