36-402, Section A, Spring 2019

19 February 2019

\[ \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\mu}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia! \]

- We want to predict \(Y\) from \(X\) with small MSE, \(\Expect{(Y-m(X))^2}\)
- Optimum would be \(\TrueRegFunc(x) = \Expect{Y|X=x}\); then \(Y=\TrueRegFunc(X)+\epsilon\) with \(\Expect{\epsilon|X=x}=0\) for all \(x\)
- \(\sigma^2 \equiv \Expect{(Y-\mu(X))^2} > 0\) (generally)

- We try to learn/estimate \(\TrueRegFunc\) from data \((x_1, y_1), \ldots, (x_n, y_n)\), getting an approximation \(\EstRegFunc\)
- Linear smoothers: \[ \EstRegFunc(x_0) = \sum_{i=1}^{n}{y_i w(x_0, x_i)} \]

- Kernel smoothing, 1D: \[ \EstRegFunc(x_0) = \sum_{i=1}^{n}{y_i \frac{K(\frac{x_0-x_i}{h})}{\sum_{k=1}^{n}{K(\frac{x_0-x_k}{h})}}} \]
- Kernel smoothing, \(p\)-dimensional: \[ \EstRegFunc(x_0) = \sum_{i=1}^{n}{y_i \frac{\prod_{j=1}^{p}{K(\frac{x_{0j}-x_{ij}}{h_j})}}{\sum_{k=1}^{n}{\prod_{j=1}^{p}{K(\frac{x_{0j}-x_{kj}}{h_j})}}}} \]
- Bias of kernel smoothing is \(O(h^2)\) in any dimension
- Variance of kernel smoothing is \(O(n^{-1} h^{-p})\) in \(p\) dimensions
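The estimator above in a few lines of Python (an illustrative sketch, not course code; the Gaussian kernel is an assumption, since any reasonable \(K\) works):

```python
import numpy as np

def kernel_smoother(x0, x, y, h):
    """Nadaraya-Watson kernel estimate of mu(x0).

    x0: (p,) query point; x: (n, p) training inputs; y: (n,) responses;
    h: scalar bandwidth, or a length-p vector of per-coordinate bandwidths.
    """
    u = (x0 - x) / h                         # (n, p) scaled differences
    w = np.exp(-0.5 * u ** 2).prod(axis=1)   # product Gaussian kernel over coordinates
    return np.sum(y * w) / np.sum(w)         # normalized weighted average of the y_i
```

With \(p = 1\) this reduces to the first formula, and the normalized weights are exactly the \(w(x_0, x_i)\) of a linear smoother.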

- Bias is \(O(h^2)\), variance is \(O(n^{-1} h^{-p})\)
- MSE of kernel smoothing is \[
MSE(h) = \sigma^2 + b h^4 + v n^{-1}h^{-p}
\]
- (plus negligible remainders)

- What’s the best bandwidth? How does it change with \(n\) and \(p\)?
- Does the bias at the best bandwidth \(\rightarrow 0\) as \(n\rightarrow\infty\)?
- Does the variance at the best bandwidth \(\rightarrow 0\) as \(n\rightarrow\infty\)?
- Does \(\EstRegFunc \rightarrow \TrueRegFunc\) as \(n\rightarrow\infty\)?
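- A sketch of where the best bandwidth comes from (here \(b\) and \(v\) are the unknown constants from the bias and variance terms): minimize the MSE over \(h\) by setting \(\frac{d}{dh}\left(b h^4 + v n^{-1} h^{-p}\right) = 4 b h^3 - p v n^{-1} h^{-p-1} = 0\), so \[ h_{opt} = \left(\frac{p v}{4 b}\right)^{\frac{1}{p+4}} n^{-\frac{1}{p+4}} = O(n^{-\frac{1}{p+4}}) \]
- At \(h_{opt}\), the squared bias is \(O(h_{opt}^4) = O(n^{-\frac{4}{4+p}})\) and the variance is \(O(n^{-1} h_{opt}^{-p}) = O(n^{-\frac{4}{4+p}})\), both \(\rightarrow 0\) as \(n \rightarrow \infty\)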

Yes, if bias and variance both \(\rightarrow 0\), then the estimate must converge on the truth

- In any dimension, as \(n\rightarrow \infty\), the bias of \(\EstRegFunc\) \(\rightarrow 0\) and the variance \(\rightarrow 0\), so \[ \EstRegFunc \rightarrow \TrueRegFunc \]
- Kernel regression is **universally consistent**
- At least if \(\TrueRegFunc\) is smooth enough for a 2nd-order Taylor series

- The MSE at the best bandwidth is \[
MSE = \sigma^2+O(h_{opt}^4) = \sigma^2 + O(n^{-\frac{4}{4+p}})
\]
- So MSE approaches that of \(\TrueRegFunc\)

- BUT we need exponentially more data as the dimension grows
- With \(n={10}^{12}\) and \(p=1\), \(n^{-4/5} = 2.5\times 10^{-10}\)
- With \(n={10}^{12}\) and \(p=100\), \(n^{-4/(104)} = 0.35\)
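These numbers are quick to check (illustrative Python; the constants suppressed by the \(O(\cdot)\) are ignored):

```python
def kernel_mse_rate(n, p):
    """Rate n^(-4/(4+p)) at which kernel smoothing's excess MSE,
    at the best bandwidth, shrinks toward sigma^2."""
    return n ** (-4 / (4 + p))

print(kernel_mse_rate(1e12, p=1))    # about 2.5e-10
print(kernel_mse_rate(1e12, p=100))  # about 0.35
```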

- \(k\)-nearest neighbors is also universally consistent, and has the same rate of convergence
- So do splines, trees, random forests, deep neural networks, …

- In fact, this is the best possible rate for *any* universally-consistent method
- You can converge faster for *some* \(\TrueRegFunc\), if you don’t converge at all for others

- This — exponentially slow convergence as \(p\) grows — is the **curse of dimensionality**

- Impose special structure on \(\TrueRegFunc\), either because:
- We believe it’s true, or
- We’re willing to make that approximation

- Extreme example: Only use linear functions
- If \(\TrueRegFunc\) is linear, then \(MSE = \sigma^2 + O(p/n)\)
- If \(\TrueRegFunc\) isn’t linear, then \(MSE = \sigma^2 + (\mathrm{bias\ of\ best\ linear\ model})^2 + O(p/n) = \sigma^2 + O(1)\)
- Can we do something less extreme than linear models?

- Linear model: \[ Y = \beta_0 + \sum_{j=1}^{p}{\beta_j x_j} + \epsilon \]
- **Additive** model: \[ Y = \alpha + \sum_{j=1}^{p}{f_j(x_j)} + \epsilon \]
- Every linear model is additive, with \(f_j(x_j) = \beta_j x_j\)
- Not every additive model is linear, e.g., \(x_1 + \sin{(x_2)} + \log{(1+x_3)}\)

- The \(f_j\) are called **partial response** functions
- To get the total response (= prediction for \(Y\)), add up all the partial responses, plus \(\alpha\)
- All else being equal, if you change \(x_j\) to \(x_j^{\prime}\), the predicted \(Y\) changes by \(f_j(x_j^{\prime}) - f_j(x_j)\)
- Like transforming your predictors in a linear model

- Re-arrange terms \[ f_j(x_j) + \epsilon = Y - (\alpha + \sum_{k \neq j}{f_k(x_k)}) \]
- If we knew all the other functions, we could estimate \(f_j\) by 1D smoothing

- Start with \(\alpha = \overline{y}\), and \(f_1 = f_2 = \ldots = f_p = 0\)

1. For \(j \in 1:p\):
    - \(r_i \leftarrow y_i - (\alpha + \sum_{k\neq j}{f_k(x_{ik})})\) (**partial residual**)
    - Use your favorite smoother to regress the \(r\)’s on \(x_j\) (1D smoothing), getting a function \(s_j\)
    - \(f_j \leftarrow s_j - \overline{s_j}\) (keep the functions centered at 0)
2. Repeat (1) until things stop changing
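The backfitting loop in Python (a minimal sketch; the 1D smoother here is a Gaussian-kernel smoother with a fixed, hand-picked bandwidth, standing in for "your favorite smoother"):

```python
import numpy as np

def smooth1d(x, y, h=0.3):
    """1-D Gaussian-kernel smoother; returns fitted values at each x_i."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

def backfit(x, y, n_iter=20, h=0.3):
    """Backfitting for an additive model y = alpha + sum_j f_j(x_j) + noise.

    x: (n, p) predictors; y: (n,) responses.
    Returns the intercept alpha and an (n, p) array of fitted f_j(x_ij).
    """
    n, p = x.shape
    alpha = y.mean()                     # start with alpha = ybar
    f = np.zeros((n, p))                 # and f_1 = ... = f_p = 0
    for _ in range(n_iter):              # repeat until things stop changing
        for j in range(p):
            r = y - alpha - f.sum(axis=1) + f[:, j]  # partial residuals
            s = smooth1d(x[:, j], r, h)              # smooth the r's on x_j
            f[:, j] = s - s.mean()                   # keep f_j centered at 0
    return alpha, f
```

On data generated from, say, \(y = 1 + x_1 + \sin(x_2) + \epsilon\), the columns of `f` recover the partial-response shapes up to centering.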

`mgcv`

- Several packages available, the older `gam` and the newer, nicer `mgcv`
- Main function is called `gam` (Generalized Additive Model)
- We’ll get to the “generalized” in a few weeks

- Syntax is very like `lm`, *but* you can tell it to smooth variables: `s()`
- People often read `s()` as “smooth”, but it really means “spline”
- A **smoothing spline** is a curve balancing fit to the data against wiggliness: \[ s_{\lambda} = \argmin_{m}{\frac{1}{n}\sum_{i=1}^{n}{(y_i - m(x_i))^2} + \lambda\int{\left(\frac{d^2 m}{dx^2}(x)\right)^2 dx}} \]
- The solution for each \(\lambda\) is a smooth, piecewise-cubic polynomial
- \(s_{\lambda}\) is linear in \(\mathbf{y}\) (surprisingly)
- Polynomials are *fast* to calculate and estimate
- \(\lambda\rightarrow 0\) gives interpolation; \(\lambda\rightarrow\infty\) gives a linear model
- (There are also other kinds of spline)
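A discrete analogue makes the "linear in \(\mathbf{y}\)" claim concrete. This Python sketch replaces the integrated second derivative with squared second differences on an evenly spaced grid (a Whittaker-style smoother, used here purely for illustration; it is not what `mgcv` actually fits):

```python
import numpy as np

def smoother_matrix(n, lam):
    """Hat matrix S of a discrete smoothing spline on an evenly spaced grid.

    Fitted values are S @ y, minimizing ||y - m||^2 + lam * ||D2 m||^2,
    where D2 takes second differences (a discrete d^2 m / dx^2).
    """
    d2 = np.diff(np.eye(n), n=2, axis=0)          # (n-2, n) second-difference matrix
    return np.linalg.inv(np.eye(n) + lam * d2.T @ d2)

y = np.sin(np.linspace(0, np.pi, 30))             # a smooth signal on a grid
s0 = smoother_matrix(30, 1e-8) @ y                # lam -> 0: essentially interpolates y
s_inf = smoother_matrix(30, 1e8) @ y              # lam -> infinity: essentially a straight line
# S depends only on lam and the design, not on y, so the fit is linear in y.
```

Since the fit is \(S\mathbf{y}\) for a fixed matrix \(S\), this smoother is a linear smoother in exactly the sense defined at the start.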

- `gam()` knows about *lots* of other smoother techniques: `te()`, `ti()`, etc., etc., etc.
- Mostly about how to handle interactions
- See the textbook

`mgcv::gam`

- `predict`: Works just like with `lm` (or `npreg`, for that matter), see `help(predict.gam)` for details
- `fitted`, `residuals`: Work just like with `lm`
- `plot`: We just saw this; many options, some described in the text, see `help(plot.gam)` for details
- `coefficients`: *Mostly* coefficients of the spline polynomials — NOT INTERPRETABLE
- Can include linear terms, e.g., `gam(y ~ x1 + s(x2))`
- Can include interaction terms, e.g., `gam(y ~ s(x1, x2))`

- If \(\TrueRegFunc\) is additive, then \[ MSE = \sigma^2 + O(p n^{-4/5}) \] which is pretty good
- If \(\TrueRegFunc\) is not additive, then \[
MSE = \sigma^2 + (\mathrm{bias\ of\ best\ additive\ model})^2 + O(pn^{-4/5})
\]
- This is still better, as \(n\rightarrow\infty\), than linear models
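Plugging the earlier \(n = 10^{12}\), \(p = 100\) numbers into both rates shows the payoff (illustrative Python arithmetic; the constants hidden in the \(O(\cdot)\) terms are ignored):

```python
n, p = 1e12, 100

full_kernel = n ** (-4 / (4 + p))  # unstructured kernel smoothing: n^(-4/(4+p))
additive = p * n ** (-4 / 5)       # additive model: p * n^(-4/5)

print(full_kernel)  # about 0.35
print(additive)     # about 2.5e-8, roughly seven orders of magnitude smaller
```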

- Estimating regression functions in high dimensions is **intrinsically hard**
- We need either a *lot* of data OR special assumptions about the function
- Additive models are an easy step beyond linear models with a lot of advantages
- Get used to typing `gam` rather than `lm`