Linear Regression as Prediction
36-462/36-662, Spring 2022
20 January 2022
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\newcommand{\OptLinPred}{m}
\newcommand{\EstLinPred}{\hat{m}}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\DeclareMathOperator*{\argmin}{argmin}
\]
Context
- We want to find patterns in our data which will let us predict one variable from another (or from more than one variable)
- The oldest and most widely-used prediction method is linear regression
- We’re going to review linear regression, as a prediction method
- Without a lot of the mythology from 401
- But with an eye to more powerful methods
Optimal prediction in general
- We want to predict a numerical random variable \(Y\)
- We want a one-number (“point”) prediction, not a range or a distribution
- We say how good our guess is by expected squared error
Optimal prediction in general (cont’d.)
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m \in \mathbb{R}}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\
& = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\
& = & \argmin_m{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
(Because: \(\Expect{Z^2} = \Var{Z} + (\Expect{Z})^2\), always)
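A minimal numerical check (the simulated data and variable names here are mine, not the slides'): over a large sample, the constant with the smallest average squared error is the sample mean.

```r
# Sketch: the constant that minimizes average squared error is the mean
set.seed(462)
y <- rexp(1e4, rate = 1/3)                      # any distribution will do; E[Y] = 3 here
mse.of.constant <- function(m) { mean((y - m)^2) }
grid <- seq(0, 10, by = 0.01)
c(best.on.grid = grid[which.min(sapply(grid, mse.of.constant))],
  sample.mean = mean(y))                        # agree, up to the grid spacing
```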
Optimal prediction in general (cont’d.)
What’s the best function of \(X\) to guess for \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m: \mathcal{X} \mapsto \mathbb{R}}{\Expect{(Y-m(X))^2}}\\
& = & \argmin_{m}{\Expect{\Expect{(Y-m(X))^2|X}}}
\end{eqnarray}\]
For each \(x\), best \(m(x)\) is \(\Expect{Y|X=x}\) (by previous slide)
\[
\TrueRegFunc(x) = \Expect{Y|X=x}
\]
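Another quick sanity check, on simulated data where we know \(\Expect{Y|X=x}\) exactly (the data-generating process below is invented purely for illustration):

```r
# Sketch: the conditional mean E[Y|X=x] beats other functions of x on squared error
set.seed(462)
n <- 1e5
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.5)     # so mu(x) = E[Y|X=x] = sin(x); noise variance 0.25
mean((y - sin(x))^2)                 # ~ 0.25: the irreducible error
mean((y - x/3)^2)                    # some other function of x: strictly worse
mean((y - mean(y))^2)                # the best constant guess: worse still
```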
Optimal prediction in general (cont’d.)
Learning arbitrary functions is hard!
Who knows what the right function might be?
What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
\OptLinPred(x) = a + b x
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bX))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bX))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bX)} + \Expect{(a+bX)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YX} +\\
& & a^2 + 2 ab \Expect{X} + b^2 \Expect{X^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bX))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{X} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{X}
\end{eqnarray}\]
\(\therefore\) optimal linear predictor \(m(X) = \alpha+\beta X\) looks like \[\begin{eqnarray}
m(X) & = & \alpha + \beta X\\
& = & \Expect{Y} - \beta\Expect{X} + \beta X\\
& = & \Expect{Y} + \beta(X-\Expect{X})
\end{eqnarray}\] The optimal linear predictor only cares about how far \(X\) is from its expectation \(\Expect{X}\) And when \(X=\Expect{X}\), we will always predict \(\Expect{Y}\)
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bX))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YX} + 2\alpha \Expect{X} + 2\beta \Expect{X^2} = 0\\
0 & = & -\Expect{YX} + (\Expect{Y} - \beta\Expect{X})\Expect{X} + \beta\Expect{X^2} \\
0 & = & \Expect{Y}\Expect{X} - \Expect{YX} + \beta(\Expect{X^2} - \Expect{X}^2)\\
0 & = & -\Cov{Y,X} + \beta \Var{X}\\
\beta & = & \frac{\Cov{Y,X}}{\Var{X}}
\end{eqnarray}\]
- If we replace \(X\) with \(X' = X-\Expect{X}\), \(\beta\) doesn’t change
- If we replace \(Y\) with \(Y' = Y-\Expect{Y}\), \(\beta\) doesn’t change
- \(\therefore\) centering the variables doesn’t change the slope
The optimal linear predictor of \(Y\) from \(X\)
The optimal linear predictor of \(Y\) from a single \(X\) is always
\[
\alpha + \beta X = \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (X - \Expect{X})
\]
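A small worked example, with invented numbers: if \(\Expect{Y} = 2\), \(\Expect{X} = 1\), \(\Cov{X,Y} = 3\) and \(\Var{X} = 2\), then
\[
\OptLinPred(X) = 2 + \frac{3}{2}(X - 1) = 0.5 + 1.5\,X
\]
so \(\alpha = 0.5\) and \(\beta = 1.5\), whatever the rest of the joint distribution of \(X\) and \(Y\) looks like.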
What did we not assume?
- That the true relationship between \(Y\) and \(X\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
NONE OF THAT MATTERS for the optimal linear predictor
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-\OptLinPred(X)} & = & \Expect{Y - (\Expect{Y} + \beta(X-\Expect{X}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{X} - \Expect{X}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-\OptLinPred(X)|X} \neq 0\)
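A sketch of both points on simulated data (the quadratic truth below is invented for illustration): the errors of the fitted linear predictor average to zero overall, but not within slices of \(X\).

```r
# Errors of the (estimated) optimal linear predictor: mean zero overall,
# but not conditionally on X, because the truth here is nonlinear
set.seed(462)
n <- 1e5
x <- runif(n, 0, 2)
y <- x^2 + rnorm(n, sd = 0.1)              # quadratic truth
beta <- cov(y, x) / var(x)
alpha <- mean(y) - beta * mean(x)
err <- y - (alpha + beta * x)
mean(err)                                  # ~ 0
c(mean(err[x < 0.5]), mean(err[x > 1.5]))  # both around +0.25; the middle range is correspondingly negative
```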
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\
& = & \Var{Y - \beta X}\\
\end{eqnarray}\]
After-class exercise: Reduce this to an expression involving only \(\Var{Y}\), \(\Var{X}\) and \(\Cov{Y,X}\); if you get the right answer you should see that it’s \(< \Var{Y}\) unless \(\Cov{Y,X}=0\)
\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
Multivariate case
- We try to predict \(Y\) from a whole bunch of variables
- Bundle those predictor variables into \(\vec{X}\)
- We need a vector of slope coefficients \(\vec{\beta}\)
- Solution:
\[
\OptLinPred(\vec{X}) = \alpha+ \vec{\beta} \cdot \vec{X} = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
and
\[
\Var{Y-\OptLinPred(\vec{X})} = \Var{Y} - \Cov{Y,\vec{X}}^T \Var{\vec{X}}^{-1} \Cov{Y,\vec{X}}
\]
(Gory details in the back-up slides)
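A sketch in R (the data-generating process is invented, and deliberately non-Gaussian): the displayed formula, with sample moments plugged in, reproduces the coefficients from `lm()`.

```r
# Multivariate optimal-linear-predictor formula, via sample moments, vs. lm()
set.seed(462)
n <- 1e4
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
X[, 2] <- 0.5 * X[, 1] + X[, 2]                # correlated predictors
y <- 1 + 2 * X[, 1] - X[, 2] + rt(n, df = 5)   # heavy-tailed (non-Gaussian) noise
beta <- solve(var(X), cov(X, y))               # Var(X)^{-1} Cov(X, Y)
alpha <- mean(y) - sum(beta * colMeans(X))     # E[Y] - beta . E[X]
rbind(plug.in = c(alpha, beta),
      via.lm = coef(lm(y ~ X)))                # identical, up to floating point
```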
What we don’t assume, again
- That the linear predictor is correct
- That anything is Gaussian
- Anything about the distributions of \(Y\) or \(\vec{X}\)
Estimation: Data, not the full distribution
- \(\Var{X}\), \(\Cov{Y,X}\), \(\Expect{(Y-(a+bX))^2}\) all involve the true distribution, which we don’t know
- \(\therefore\) We can’t just calculate \(\beta\)
- What we do know is \((x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)\)
- That is, \(n\) pairs of predictor and outcome
Estimation: Ordinary Least Squares (OLS)
Set up the mean squared error, and minimize it:
\[\begin{eqnarray}
MSE(a,b) & = & \frac{1}{n}\sum_{i=1}^{n}{(y_i - (a+b x_i))^2}\\
(\hat{\alpha}, \hat{\beta}) & \equiv & \argmin_{a,b}{MSE(a,b)}
\end{eqnarray}\]
Do the calculus: \[\begin{eqnarray}
\frac{\partial MSE}{\partial a} & = & \frac{1}{n}\sum_{i=1}^{n}{-2(y_i - (a+b x_i))}\\
\frac{\partial MSE}{\partial b} & = & \frac{1}{n}\sum_{i=1}^{n}{-2(y_i - (a+b x_i))x_i}\\
\hat{\alpha} & = & \overline{y} - \hat{\beta} \overline{x}\\
\hat{\beta} & = & \frac{ \overline{yx} - \overline{y}\,\overline{x}}{\overline{x^2} - \overline{x}^2} = \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}}
\end{eqnarray}\]
(with \(\overline{x} =\) sample mean of \(x\), etc.)
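A quick check in R (simulated data; the names are mine): the plug-in formulas above give exactly the `lm()` coefficients.

```r
# OLS as plug-in: sample covariance over sample variance
set.seed(462)
x <- runif(200)
y <- 3 - 2 * x + rnorm(200, sd = 0.5)
beta.hat <- cov(y, x) / var(x)
alpha.hat <- mean(y) - beta.hat * mean(x)
c(alpha.hat = alpha.hat, beta.hat = beta.hat)
coef(lm(y ~ x))                  # the same numbers, up to floating point
```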
Optimum vs. estimate (I)
- Optimal linear model versus Linear model estimated by OLS: \[\begin{eqnarray}
\alpha + \beta x & = & \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (x - \Expect{X})\\
\hat{\alpha} + \hat{\beta} x & = & \overline{y} + \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}} (x - \overline{x})
\end{eqnarray}\]
- These are not the same, but we can hope they’re close, and increasingly close with more data
When does OLS/plug-in work?
- “Work” = converge on the optimal linear predictor
- Jointly sufficient conditions:
- Sample means converge on expectation values
- Sample covariances converge on true covariance
- Sample variances converge on true, invertible variance
- Then by continuity OLS coefficients converge on true \(\beta\)
- None of this requires that the linear model is right, that anything is Gaussian, etc.
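A simulation sketch of those conditions in action (the data-generating process is invented, and deliberately nonlinear, skewed, and heteroskedastic):

```r
# OLS converges on the optimal *linear* coefficients, not on the (nonlinear) truth
set.seed(462)
simulate <- function(n) {
  x <- runif(n, 0, 2)
  y <- exp(x) + (1 + x) * (rexp(n) - 1)   # nonlinear mean; skewed, heteroskedastic noise
  data.frame(x, y)
}
target <- coef(lm(y ~ x, data = simulate(1e6)))   # stand-in for the optimal linear coefficients
for (n in c(100, 1000, 10000)) { print(coef(lm(y ~ x, data = simulate(n)))) }
target                                            # the estimates above close in on this
```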
Optimum vs. estimate (II)
- A little more subtle: for every fixed \((a,b)\), by the law of large numbers, \[
MSE(a,b) \rightarrow \Expect{(Y - (a+b X))^2}
\]
- Mean squared error \(\rightarrow\) expected squared error
- This is almost enough to ensure that minimizing the MSE will converge on the true optimum
- We’ll come back to this, because this approach generalizes better to other models and loss functions
What do the estimates look like?
- Bundle all the regressor values into an \(n\times (p+1)\) matrix \(\mathbf{x}\), with an extra column of 1s
- Bundle all the regressand values into an \(n\times 1\) matrix \(\mathbf{y}\)
- Then the \((p+1)\times 1\) matrix of least-squares coefficients \(\hat{\mathbf{\beta}}\) is \[
\hat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\] (as you learned in linear models)
- This is really \[\begin{eqnarray}
\hat{\mathbf{\beta}} & = & {\left(\frac{1}{n}\mathbf{x}^T\mathbf{x}\right)}^{-1} \left(\frac{1}{n}\mathbf{x}^T \mathbf{y}\right)\\
& = & (\mathrm{sample\ variance\ matrix\ of\ regressors})^{-1} (\mathrm{sample\ covariances\ between\ regressors\ and\ response})
\end{eqnarray}\]
- (Strictly speaking, with the intercept column included these are uncentered second moments; center the variables and the identification with sample variances and covariances is exact)
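A sketch of the matrix formula in R (simulated design matrix; the names and numbers are purely illustrative):

```r
# (x^T x)^{-1} x^T y, computed directly, vs. lm()
set.seed(462)
n <- 100; p <- 3
x <- cbind(1, matrix(rnorm(n * p), n, p))         # n x (p+1), with the column of 1s
y <- as.vector(x %*% c(1, 2, 0, -1) + rnorm(n))
beta.hat <- solve(t(x) %*% x, t(x) %*% y)         # solve() rather than an explicit inverse
data.frame(by.hand = as.vector(beta.hat), via.lm = coef(lm(y ~ x[, -1])))
```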
What do the predictions look like?
- The prediction at an arbitrary point \(\vec{x} = (x_1, x_2, \ldots x_p)\) is \[
\EstLinPred(\vec{x}) = \left[ \begin{array}{ccccc} 1 & x_1 & x_2 & \ldots & x_p \end{array} \right] \hat{\beta} = \left[ \begin{array}{ccccc} 1 & x_1 & x_2 & \ldots & x_p \end{array} \right] \left((\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\right)
\]
- The \(n\times 1\) matrix of fitted values, at the training points, is \[
\mathbf{\EstLinPred} = \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\]
- The matrix \(\mathbf{h} \equiv \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\) is called the weight matrix, influence matrix or hat matrix; its \((i,j)\) entry says how much \(y_j\) matters to the fitted value for observation \(i\)
Fitted values and other predictions are weighted sums of the observations
\[\begin{eqnarray}
\EstLinPred(\vec{x}) & = & \vec{x} \hat{\beta} = \vec{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\\
\mathbf{\EstLinPred} & = & \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\end{eqnarray}\]
- Every prediction we make is linear in \(\mathbf{y}\): it’s a weighted sum of all the observed values of the response
- How much weight do we put on \(y_i\) when we’re predicting at \(\vec{x}\)?
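A sketch in R tying these formulas together (simulated data; everything below is illustrative): the hat matrix reproduces `lm()`’s fitted values and leverages, and the weight vector at a new point gives the prediction as a weighted sum of the \(y_i\).

```r
# Fitted values and new predictions as weighted sums of the observed responses
set.seed(462)
n <- 50
x <- cbind(1, runif(n), rnorm(n))                 # design matrix, intercept column included
y <- as.vector(x %*% c(1, 2, -1) + rnorm(n))
fit <- lm(y ~ x[, -1])
h <- x %*% solve(t(x) %*% x) %*% t(x)             # the hat matrix
max(abs(h %*% y - fitted(fit)))                   # ~ 0: rows of h give the fitted-value weights
max(abs(diag(h) - hatvalues(fit)))                # ~ 0: diag(h) are the leverages
x.new <- c(1, 0.5, -0.3)                          # a new point, leading 1 included
w <- as.vector(x.new %*% solve(t(x) %*% x) %*% t(x))    # one weight per training point
c(weighted.sum = sum(w * y),
  direct = sum(x.new * solve(t(x) %*% x, t(x) %*% y)))  # the same prediction, two ways
```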
Generalizing: linear smoothers
- Linear regression is a special case of a linear smoother \[
\EstRegFunc(\vec{x}) = \sum_{i=1}^{n}{w(\vec{x}, \vec{x}_i) y_i}
\]
- Notice: Linear in the \(y\)’s, not (necessarily) in \(\vec{x}\)
- General idea: predict at \(\vec{x}\) by finding similar \(\vec{x}_i\)’s and averaging their \(y_i\)’s
- Different choices of \(w\) are different ideas about “similar” and about weighted averaging
- Most prediction methods used in practice are linear smoothers
- Nearest neighbors
- Trees and forests
- Kernels
- Neural networks, a.k.a. “deep learning”, a.k.a. “artificial intelligence”
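As a concrete instance (this k-nearest-neighbor smoother is my own illustration, not code from the course):

```r
# A k-nearest-neighbor regression is a linear smoother:
# w(x, x_i) = 1/k for the k nearest training points, 0 for the rest
knn.smoother <- function(x0, x, y, k = 5) {
  w <- numeric(length(x))
  w[order(abs(x - x0))[1:k]] <- 1 / k
  sum(w * y)                        # prediction = weighted sum of the y_i
}
set.seed(462)
x <- runif(200, -3, 3)
y <- sin(x) + rnorm(200, sd = 0.3)
sapply(c(-2, 0, 2), knn.smoother, x = x, y = y, k = 10)   # compare to sin(c(-2, 0, 2))
```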
What about the rest of your linear models course?
- Say you want to test whether \(\beta_{37} = 1\), or to give a confidence set for \((\alpha, \beta_{12})\)
- That’s hard to do explicitly without more assumptions
- Generally, you need to know the sampling distribution of \(\mathbf{\hat{\beta}}\)
What about the rest of your linear models course? (cont’d)
- The usual (“401”) assumptions give nice formulas for the sampling distribution of parameter estimators, and so for parameter inference:
1. The true regression function is exactly linear.
2. \(Y=\alpha + \vec{X} \cdot \vec{\beta} + \epsilon\), where \(\epsilon\) is independent of \(\vec{X}\).
3. \(\epsilon\) is independent across observations.
4. \(\epsilon \sim \mathcal{N}(0,\sigma^2)\), with the same \(\sigma^2\) for every observation.
- Assuming (1)–(4), the usual formulas (embedded in R) are exactly right
- Gaussian distribution for \(\hat{\mathbf{\beta}}\), centered at true \(\mathbf{\beta}\)
- Assuming (1)–(3), non-Gaussian noise, and the influence of any one observation \(\rightarrow 0\), then the usual formulas hold as \(n\rightarrow\infty\)
- Get back the Gaussian distribution by a (complicated) central limit theorem
- Do not use those formulas without checking those assumptions
The most important assumption to check
- If the true regression function is exactly linear, then \[
\Expect{Y-(\alpha+ \vec{X} \cdot \vec{\beta})|\vec{X}=\vec{x}} = 0
\] for all \(\vec{x}\)
- So \[
\Expect{Y-(\hat{\alpha}+\vec{X} \cdot \hat{\beta})|\vec{X}=\vec{x}} \approx 0
\]
- The residuals should have conditional mean 0 everywhere if the linear model is right
- This is something we can check using nonlinear regression methods
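One way to run that check in R (a sketch only; the data and the choice of `loess()` as the nonlinear smoother are mine):

```r
# Smooth the residuals against the predictor; a curve that bends systematically
# away from zero is evidence against the linear specification
set.seed(462)
x <- runif(500, 0, 2)
y <- exp(x) + rnorm(500, sd = 0.3)         # the truth is not linear
fit <- lm(y ~ x)
resid.smooth <- loess(residuals(fit) ~ x)
plot(x, residuals(fit))
lines(sort(x), fitted(resid.smooth)[order(x)], lwd = 2)   # bends well away from zero here
abline(h = 0, lty = "dashed")
```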
Summing up
- We can always decide to use a linear predictor, \(\OptLinPred(\vec{X}) = \alpha + \vec{\beta} \cdot \vec{X}\)
- The optimal linear predictor of \(Y\) from \(\vec{X}\) always takes the same form: \[
\OptLinPred(\vec{X}) = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
- Doing linear prediction requires finding those covariances
- We usually estimate those covariances (implicitly) by ordinary least squares
- OLS converges pretty robustly to the optimal linear predictor (not to the truth)
- The usual “401” assumptions are needed for parameter inference, not point prediction
- Once we use OLS, all our predictions are weighted sums of the observations
- The weights for linear regression are weird and implausible
- We’re going to explore other ways of weighting
- We’re also going to explore minimizing other measures of prediction error, over other classes of predictors
A final thought
When you’re fundraising, it’s AI
When you’re hiring, it’s ML
When you’re implementing, it’s linear regression
- Baron Schwartz, 15 November 2017
Backup: Further reading
- See Berk (2008), chapter 1, and Hastie, Tibshirani, and Friedman (2009), sections 2.3.1 and 2.6, and chapter 3 (especially through section 3.4), for more on linear regression as a prediction method
- If you want more from the point of view taken here, see Shalizi (n.d.), chapter 2; if you want a lot more, see Shalizi (2015)
- If you need to do inference about the coefficients but the usual assumptions don’t hold, your best bet is the bootstrap (covered in 402, and in some excellent textbooks like Davison and Hinkley (1997)); some of the complicated procedures developed in econometrics for “robust standard errors” in linear models turn out to be equivalent to simple bootstraps (Buja et al. 2014)
- If you need prediction intervals or predictive distributions from your linear model, there are also ways of doing that using the bootstrap — see Davison and Hinkley (1997) again, or again Berk (2008), chapter 1
- The view that the optimal linear model is just \(\Var{\vec{X}}^{-1}\Cov{Y, \vec{X}}\) and all we need is for sample covariance to converge, goes back to Wiener (1949)
- A notoriously hard-to-read book, but that’s because Wiener was working in a setting where, instead of regressing \(Y\) on a fixed-length vector \(\vec{X}\), you regress \(Y\) on a continuous-time signal, and because he framed the solution so that it could be implemented with 1940s-vintage analog electronics…
Backup: The optimal regression function
- We set up the problem \[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m: \mathcal{X} \mapsto \mathbb{R}}{\Expect{\Expect{(Y-m(X))^2|X}}}\\
& = & \argmin_{m: \mathcal{X} \mapsto \mathbb{R}}{\int{dx p(x) \int{dy p(y|x) (y-m(x))^2}}}
\end{eqnarray}\]
- For each \(x\), we can minimize the inner integral by setting \(m(x) = \Expect{Y|X=x}\)
- We just pasted together all of those pointwise minimizers to get \(\TrueRegFunc\)
- We could do this because we were minimizing over all possible functions, so minimizing at \(x_1\) doesn’t interfere with minimizing at \(x_2\)
- This doesn’t work if we limit ourselves to a particular set of functions \(\mathcal{M}\)
- Unless of course \(\TrueRegFunc \in \mathcal{M}\), i.e., that model is well-specified
Backup: Gory details for multivariate predictors
\[\begin{eqnarray}
\OptLinPred(\vec{X}) & = & a + \vec{b} \cdot \vec{X}\\
(\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{X}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{X}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{X})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{X})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{X}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{X} \otimes \vec{X}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{X}} + 2a\vec{b}\cdot \Expect{\vec{X}}
\end{eqnarray}\]
Backup: Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0: \[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{X}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}
\end{eqnarray}\] just like when \(X\) was univariate
Backup: Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{X}} + 2 \Expect{\vec{X} \otimes \vec{X}} \vec{\beta} + 2 \alpha \Expect{\vec{X}} & = & 0\\
\Expect{Y\vec{X}} - \alpha\Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Expect{Y\vec{X}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}) \Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Cov{Y,\vec{X}} & = & \Var{\vec{X}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{X}})^{-1} \Cov{Y,\vec{X}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,X}/\Var{X}\) when \(X\) is univariate
Backup: Gory details: the PCA view
The factor of \(\Var{\vec{X}}^{-1}\) rotates and scales \(\vec{X}\) to uncorrelated, unit-variance variables
\[\begin{eqnarray}
\Var{\vec{X}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T\\
\vec{U} & \equiv & \vec{X} \Var{\vec{X}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\vec{X}\cdot\vec{\beta} & = & \vec{X} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\
& = & \vec{X} \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T \Cov{\vec{X}, Y}\\
& = & \vec{U} \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
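A numerical sketch of the whitening step (simulated, correlated predictors; all the names here are mine):

```r
# Whitening X with Var(X)^{-1/2}, built from the eigendecomposition above
set.seed(462)
n <- 1e4
X <- matrix(rnorm(2 * n), n, 2) %*% matrix(c(1, 0.8, 0, 1), 2, 2)   # correlated columns
y <- as.vector(X %*% c(2, -1) + rnorm(n))
eig <- eigen(var(X))
root.inv <- eig$vectors %*% diag(1 / sqrt(eig$values))   # w Lambda^{-1/2}
U <- scale(X, scale = FALSE) %*% root.inv                # centered, whitened predictors
round(var(U), 6)                                         # ~ the identity matrix
beta <- solve(var(X), cov(X, y))
max(abs(scale(X, scale = FALSE) %*% beta - U %*% cov(U, y)))   # ~ 0: same predictions
```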
Backup: Square root of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
- Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
- Just like every positive real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\)
Backup/Aside: \(R^2\) is useless
- \(R^2 \equiv \Var{\mathrm{fitted\ values}}/\Var{Y}\)
- Suppose we know \(Y=\beta X+\epsilon\), with \(\Var{\epsilon} = \sigma^2\) and \(\epsilon\) independent of \(X\)
- Then \(R^2 = \frac{\beta^2 \Var{X}}{\beta^2 \Var{X} + \sigma^2}\)
- Consequence 1: \(R^2\) can be arbitrarily close to 0, even for the completely correct model (shrink \(\Var{X}\) and/or grow \(\sigma^2\))
- Consequence 2: If we make \(\Var{X}\) bigger, and the linear model is right, \(R^2\) will grow \(\uparrow 1\) (unless \(\beta = 0\)); \(R^2\) measures the range of the regressor, not goodness of fit
- Suppose the linear model is wrong but we use it anyway; \(R^2 = \beta^2\Var{X}/\Var{Y}\)
- Consequence 3: \(R^2\) can be arbitrarily close to 1 even if every single assumption of the linear model is badly, obviously broken
- Conclusion: \(R^2\) tells us nothing about how good or bad our model is that we didn’t already know from the mean squared error; you’d be better off forgetting about it
- More: Shalizi (2015), chapter 10
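A simulation sketch of Consequences 1 and 2 (the numbers and names are mine): exactly the same correct model, three very different \(R^2\) values, obtained just by changing \(\Var{X}\).

```r
# R^2 for the *same, correct* linear model, varying only the spread of X
set.seed(462)
r.squared <- function(x.sd, n = 1e4, beta = 1, sigma = 1) {
  x <- rnorm(n, sd = x.sd)
  y <- beta * x + rnorm(n, sd = sigma)     # the linear model is exactly right
  summary(lm(y ~ x))$r.squared
}
round(sapply(c(0.1, 1, 10), r.squared), 3)  # roughly 0.01, 0.50, 0.99
```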
References
Berk, Richard A. 2008. Statistical Learning from a Regression Perspective. New York: Springer-Verlag.
Buja, Andreas, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. 2014. “Models as Approximations, Part I: A Conspiracy of Nonlinearity and Random Regressors in Linear Regression.” arxiv:1404.1578. http://arxiv.org/abs/1404.1578.
Wiener, Norbert. 1949. Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications. Cambridge, Massachusetts: The Technology Press of the Massachusetts Institute of Technology.