Linear Regression as Prediction

36-462/36-662, Spring 2022

20 January 2022

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \newcommand{\OptLinPred}{m} \newcommand{\EstLinPred}{\hat{m}} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \DeclareMathOperator*{\argmin}{argmin} \]

Context

Optimal prediction in general

Optimal prediction in general (cont’d.)

What’s the best constant guess for a random variable \(Y\)?

\[\begin{eqnarray} \TrueRegFunc & = & \argmin_{m \in \mathbb{R}}{\Expect{(Y-m)^2}}\\ & = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\ & = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\ & = & \argmin_m{ (\Expect{Y} - m)^2}\\ & = & \Expect{Y} \end{eqnarray}\]

(Because: \(\Expect{Z^2} = \Var{Z} + (\Expect{Z})^2\), always)
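
As a quick numerical check, here is a minimal Python/numpy sketch (the exponential distribution and the grid of candidate constants are illustrative assumptions, not part of the slides): the constant that minimizes mean squared error on a large sample is, as claimed, essentially the sample mean.

```python
# Minimal check: the constant minimizing mean squared error is ~ the mean.
import numpy as np

rng = np.random.default_rng(462)
y = rng.exponential(scale=2.0, size=100_000)      # E[Y] = 2

guesses = np.linspace(0.0, 5.0, 501)              # candidate constant guesses
mse = [np.mean((y - m) ** 2) for m in guesses]

print(guesses[np.argmin(mse)], y.mean())          # both close to 2
```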

Optimal prediction in general (cont’d.)

What’s the best function of \(X\) to guess for \(Y\)?

\[\begin{eqnarray} \TrueRegFunc & = & \argmin_{m: \mathcal{X} \mapsto \mathbb{R}}{\Expect{(Y-m(X))^2}}\\ & = & \argmin_{m}{\Expect{\Expect{(Y-m(X))^2|X}}} \end{eqnarray}\]

For each \(x\), best \(m(x)\) is \(\Expect{Y|X=x}\) (by previous slide)

\[ \TrueRegFunc(x) = \Expect{Y|X=x} \]
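
To make this concrete, here is a minimal simulation sketch (the sinusoidal regression function, the noise level, and the binning scheme are all illustrative assumptions): approximating \(\Expect{Y|X=x}\) by averaging \(Y\) within narrow bins of \(X\) already gets the mean squared error down to roughly the noise variance, while a different function of \(X\) does noticeably worse.

```python
# Sketch: the conditional mean E[Y|X=x] (here crudely approximated by
# bin averages) beats a rival function of X in squared error.
import numpy as np

rng = np.random.default_rng(662)
x = rng.uniform(0.0, 3.0, size=200_000)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)        # true mu(x) = sin(x)

bins = np.linspace(0.0, 3.0, 31)                          # 30 narrow bins
which = np.clip(np.digitize(x, bins) - 1, 0, len(bins) - 2)
bin_means = np.array([y[which == b].mean() for b in range(len(bins) - 1)])
mu_hat = bin_means[which]                                 # approximate E[Y|X=x]

print(np.mean((y - mu_hat) ** 2))   # ~ 0.3^2 = 0.09, the noise variance
print(np.mean((y - x) ** 2))        # the rival guess m(x) = x does much worse
```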

Optimal prediction in general (cont’d.)

Learning arbitrary functions is hard!

Who knows what the right function might be?

What if we decide to make our predictions linear?

Optimal linear prediction with univariate predictor

Our prediction will be of the form \[ \OptLinPred(x) = a + b x \] and we want the best \(a, b\)

Optimal linear prediction, univariate case

\[ (\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bX))^2}} \]

Expand out that expectation, then take derivatives and set them to 0

The intercept

\[\begin{eqnarray} \Expect{(Y-(a+bX))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bX)} + \Expect{(a+bX)^2}\\ & = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YX} +\\ & & a^2 + 2 ab \Expect{X} + b^2 \Expect{X^2}\\ \left. \frac{\partial}{\partial a}\Expect{(Y-(a+bX))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{X} = 0\\ \alpha & = & \Expect{Y} - \beta\Expect{X} \end{eqnarray}\]

\(\therefore\) the optimal linear predictor \(m(X) = \alpha+\beta X\) looks like \[\begin{eqnarray} m(X) & = & \alpha + \beta X\\ & = & \Expect{Y} - \beta\Expect{X} + \beta X\\ & = & \Expect{Y} + \beta(X-\Expect{X}) \end{eqnarray}\] The optimal linear predictor only cares about how far \(X\) is from its expectation \(\Expect{X}\). And when \(X=\Expect{X}\), we will always predict \(\Expect{Y}\).

The slope

\[\begin{eqnarray} \left. \frac{\partial}{\partial b}\Expect{(Y-(a+bX))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YX} + 2\alpha \Expect{X} + 2\beta \Expect{X^2} = 0\\ 0 & = & -\Expect{YX} + (\Expect{Y} - \beta\Expect{X})\Expect{X} + \beta\Expect{X^2} \\ 0 & = & \Expect{Y}\Expect{X} - \Expect{YX} + \beta(\Expect{X^2} - \Expect{X}^2)\\ 0 & = & -\Cov{Y,X} + \beta \Var{X}\\ \beta & = & \frac{\Cov{Y,X}}{\Var{X}} \end{eqnarray}\]

The optimal linear predictor of \(Y\) from \(X\)

The optimal linear predictor of \(Y\) from a single \(X\) is always

\[ \alpha + \beta X = \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (X - \Expect{X}) \]
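
A minimal numpy sketch of these formulas in action (the data-generating process, with a deliberately nonlinear relationship, is an illustrative assumption): \(\beta\) and \(\alpha\) come straight from the covariance, the variance, and the means.

```python
# Sketch: alpha and beta of the optimal linear predictor, computed from
# (sample approximations to) the moments.  The true relationship is
# deliberately nonlinear; the formulas still give the best *linear* predictor.
import numpy as np

rng = np.random.default_rng(2022)
x = rng.normal(1.0, 1.0, size=500_000)
y = np.exp(x / 2.0) + rng.normal(scale=0.5, size=x.size)

beta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)
```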

What did we not assume?

NONE OF THAT MATTERS for the optimal linear predictor

The prediction errors average out to zero

\[\begin{eqnarray} \Expect{Y-\OptLinPred(X)} & = & \Expect{Y - (\Expect{Y} + \beta(X-\Expect{X}))}\\ & = & \Expect{Y} - \Expect{Y} - \beta(\Expect{X} - \Expect{X}) = 0 \end{eqnarray}\]

The prediction errors are uncorrelated with \(X\)

\[\begin{eqnarray} \Cov{X, Y-\OptLinPred(X)} & = & \Expect{X(Y-\OptLinPred(X))} ~\text{(by previous slide)}\\ & = & \Expect{X(Y - \Expect{Y} - \frac{\Cov{Y,X}}{\Var{X}}(X-\Expect{X}))}\\ & = & \Expect{XY - X\Expect{Y} - \frac{\Cov{Y,X}}{\Var{X}}(X^2) + \frac{\Cov{Y,X}}{\Var{X}} (X \Expect{X})}\\ & = & \Expect{XY} - \Expect{X}\Expect{Y} - \frac{\Cov{Y,X}}{\Var{X}}\Expect{X^2} + \frac{\Cov{Y,X}}{\Var{X}} (\Expect{X})^2\\ & = & \Cov{X,Y} - \frac{\Cov{Y,X}}{\Var{X}}(\Var{X})\\ & = & 0 \end{eqnarray}\]

The prediction errors are uncorrelated with \(X\)

Alternate take:

\[\begin{eqnarray} \Cov{X, Y-\OptLinPred(X)} & = & \Cov{X, Y} - \Cov{X, \alpha + \beta X}\\ & = & \Cov{Y,X} - \Cov{X, \beta X}\\ & = & \Cov{Y,X} - \beta\Cov{X,X}\\ & = & \Cov{Y,X} - \beta\Var{X}\\ & = & \Cov{Y,X} - \Cov{Y,X} = 0 \end{eqnarray}\]
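
Repeating the simulated setup from the univariate sketch a few slides back (an illustrative assumption, not part of the slides), both properties are easy to confirm numerically:

```python
# Numerical check: the optimal linear predictor's errors average to ~0
# and are ~uncorrelated with X (sampling noise aside).
import numpy as np

rng = np.random.default_rng(2022)
x = rng.normal(1.0, 1.0, size=500_000)
y = np.exp(x / 2.0) + rng.normal(scale=0.5, size=x.size)   # same setup as before

beta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
err = y - (alpha + beta * x)

print(err.mean())              # approximately 0
print(np.cov(err, x)[0, 1])    # approximately 0
```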

How big are the prediction errors?

\[\begin{eqnarray} \Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\ & = & \Var{Y - \beta X}\\ \end{eqnarray}\]

After-class exercise: Reduce this to an expression involving only \(\Var{Y}\), \(\Var{X}\) and \(\Cov{Y,X}\); if you get the right answer you should see that it’s \(< \Var{Y}\) unless \(\Cov{Y,X}=0\)

\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…

Multivariate case

\[ \OptLinPred(\vec{X}) = \alpha + \vec{\beta} \cdot \vec{X} = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}}) \]

and

\[ \Var{Y-\OptLinPred(\vec{X})} = \Var{Y} - \Cov{Y,\vec{X}}^T \Var{\vec{X}}^{-1} \Cov{Y,\vec{X}} \]

(Gory details in the back-up slides)
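
Here is a minimal numpy sketch of the multivariate formulas (the three correlated predictors and their coefficients are illustrative assumptions): solve \(\Var{\vec{X}} \vec{\beta} = \Cov{\vec{X},Y}\) for \(\vec{\beta}\), and note that the two ways of computing the error variance agree.

```python
# Sketch: multivariate optimal linear predictor, beta = Var(X)^{-1} Cov(X, Y).
import numpy as np

rng = np.random.default_rng(36462)
n, p = 200_000, 3
X = rng.normal(size=(n, p)) @ np.array([[1.0, 0.3, 0.0],
                                        [0.0, 1.0, 0.5],
                                        [0.0, 0.0, 1.0]])      # correlated predictors
Y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

var_X = np.cov(X, rowvar=False)                                # p x p variance matrix
cov_XY = np.cov(np.column_stack([X, Y]), rowvar=False)[:p, p]  # Cov(X, Y), length p
beta = np.linalg.solve(var_X, cov_XY)
alpha = Y.mean() - X.mean(axis=0) @ beta
print(alpha, beta)

# Var(Y - m(X)) versus Var(Y) - Cov(Y,X)^T Var(X)^{-1} Cov(Y,X): they agree
print(np.var(Y - (alpha + X @ beta), ddof=1), np.var(Y, ddof=1) - cov_XY @ beta)
```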

What we don’t assume, again

Estimation: Data, not the full distribution

Estimation: Ordinary Least Squares (OLS)

Set up the mean squared error, and minimize it:

\[\begin{eqnarray} MSE(a,b) & = & \frac{1}{n}\sum_{i=1}^{n}{(y_i - (a+b x_i))^2}\\ (\hat{\alpha}, \hat{\beta}) & \equiv & \argmin_{a,b}{MSE(a,b)} \end{eqnarray}\]

Do the calculus: \[\begin{eqnarray} \frac{\partial MSE}{\partial a} & = & \frac{1}{n}\sum_{i=1}^{n}{-2(y_i - (a+b x_i))}\\ \frac{\partial MSE}{\partial b} & = & \frac{1}{n}\sum_{i=1}^{n}{-2(y_i - (a+b x_i))x_i}\\ \hat{\alpha} & = & \overline{y} - \hat{\beta} \overline{x}\\ \hat{\beta} & = & \frac{ \overline{yx} - \overline{y}\,\overline{x}}{\overline{x^2} - \overline{x}^2} = \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}} \end{eqnarray}\]

(with \(\overline{x} =\) sample mean of \(x\), etc.)
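
A minimal sketch of the plug-in computation on simulated data (the data-generating process is an assumption for illustration); it matches numpy's built-in least-squares line.

```python
# Sketch: OLS intercept and slope from sample moments, checked against
# numpy's polyfit on the same data.
import numpy as np

rng = np.random.default_rng(20)
x = rng.normal(size=1_000)
y = 3.0 - 2.0 * x + rng.normal(scale=0.5, size=x.size)

beta_hat = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

print(alpha_hat, beta_hat)
print(np.polyfit(x, y, deg=1))     # [slope, intercept]; same values
```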

Optimum vs. estimate (I)

When does OLS/plug-in work?

  1. Sample means converge on the expectation values
  2. Sample covariances converge on the true covariances
  3. Sample variances converge on the true, invertible variance matrix (see the sketch below for what goes wrong without invertibility)
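
A minimal numpy sketch of the invertibility requirement (the exactly collinear predictors are an illustrative assumption): when one predictor is a linear function of another, the sample variance matrix of \(\vec{X}\) is singular and there is nothing to invert.

```python
# Sketch: with exactly collinear predictors, the sample variance matrix
# of X is singular, so the plug-in formula has nothing to invert.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=1_000)
X = np.column_stack([x1, 2.0 * x1])         # second predictor = 2 * first
var_X = np.cov(X, rowvar=False)

print(np.linalg.matrix_rank(var_X))         # 1, not 2
print(np.linalg.det(var_X))                 # (essentially) 0
# np.linalg.solve(var_X, ...) would fail or be wildly unstable here
```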

Optimum vs. estimate (II)

What do the estimates look like?

What do the predictions look like?

Fitted values and other predictions are weighted sums of the observations

\[\begin{eqnarray} \EstLinPred(\vec{x}) & = & \vec{x} \hat{\beta} = \vec{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\\ \mathbf{\EstLinPred} & = & \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y} \end{eqnarray}\]
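
A minimal numpy sketch of the "weighted sums of the observations" point (the small simulated design is an illustrative assumption): build the matrix \(\mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\) explicitly and apply it to \(\mathbf{y}\).

```python
# Sketch: fitted values as weighted sums of the observed y's, via the
# hat matrix H = x (x^T x)^{-1} x^T.
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept column + predictor
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = x @ np.linalg.inv(x.T @ x) @ x.T                    # n x n matrix of weights
fitted = H @ y                                          # each entry is a weighted sum of y
beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]

print(np.allclose(fitted, x @ beta_hat))                # True
print(H[0, :5])      # weights the first fitted value places on y_1, ..., y_5
```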

Explicit form of the weights for OLS

Generalizing: linear smoothers

What about the rest of your linear models course?

What about the rest of your linear models course? (cont’d)

  1. The true regression function is exactly linear.
  2. \(Y=\alpha + \vec{X} \cdot \vec{\beta} + \epsilon\) where \(\epsilon\) is independent of \(\vec{X}\).
  3. \(\epsilon\) is independent across observations.
  4. \(\epsilon \sim \mathcal{N}(0,\sigma^2)\).

The most important assumption to check

Summing up

A final thought

When you’re fundraising, it’s AI
When you’re hiring, it’s ML
When you’re implementing, it’s linear regression

Backup: Further reading

Backup: The optimal regression function

Backup: Gory details for multivariate predictors

\[\begin{eqnarray} \OptLinPred(\vec{X}) & = & a + \vec{b} \cdot \vec{X}\\ (\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{X}))^2}}\\ \Expect{(Y-(a+\vec{b}\cdot \vec{X}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{X})^2}\\ \nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{X})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{X}}\\ & = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{X} \otimes \vec{X}} \vec{b} \\ \nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{X}} + 2a\vec{b}\cdot \Expect{\vec{X}}\\ \end{eqnarray}\]

Backup: Gory details: the intercept

Take derivative w.r.t. \(a\), set to 0: \[\begin{eqnarray} 0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{X}} + 2\alpha \\ \alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}\\ \end{eqnarray}\] just like when \(X\) was univariate

Backup: Gory details: the slopes

\[\begin{eqnarray} -2 \Expect{Y\vec{X}} + 2 \Expect{\vec{X} \otimes \vec{X}} \vec{\beta} + 2 \alpha \Expect{\vec{X}} & = & 0\\ \Expect{Y\vec{X}} - \alpha\Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\ \Expect{Y\vec{X}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}) \Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\ \Cov{Y,\vec{X}} & = & \Var{\vec{X}} \vec{\beta}\\ \vec{\beta} & = & (\Var{\vec{X}})^{-1} \Cov{Y,\vec{X}} \end{eqnarray}\]

Reduces to \(\Cov{Y,X}/\Var{X}\) when \(X\) is univariate

Backup: Gory details: the PCA view

The factor of \(\Var{\vec{X}}^{-1}\) rotates and scales \(\vec{X}\) to uncorrelated, unit-variance variables

\[\begin{eqnarray} \Var{\vec{X}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\ \Var{\vec{X}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\ \Var{\vec{X}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\ & = & \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T\\ \vec{U} & \equiv & \vec{X} \Var{\vec{X}}^{-1/2}\\ \Var{\vec{U}} & = & \mathbf{I}\\ \vec{X}\cdot\vec{\beta} & = & \vec{X} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\ & = & \vec{X} \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T \Cov{\vec{X}, Y}\\ & = & \vec{U} \Cov{\vec{U}, Y}\\ \end{eqnarray}\]
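
A minimal numpy sketch of this whitening step (the two correlated predictors are an illustrative assumption): build \(\Var{\vec{X}}^{-1/2}\) from the eigendecomposition, check that \(\vec{U}\) has the identity as its variance matrix, and check that \(\vec{X}\cdot\vec{\beta} = \vec{U}\cdot\Cov{\vec{U},Y}\).

```python
# Sketch: whitening X via the eigendecomposition of Var(X), and the
# identity X . beta = U . Cov(U, Y).
import numpy as np

rng = np.random.default_rng(3)
n, p = 10_000, 2
X = rng.normal(size=(n, p)) @ np.array([[1.0, 0.8], [0.0, 1.0]])
Y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

var_X = np.cov(X, rowvar=False)
lam, w = np.linalg.eigh(var_X)                 # Var(X) = w diag(lam) w^T
root_inv = w @ np.diag(lam ** -0.5)            # a square root of Var(X)^{-1}
U = X @ root_inv                               # rotated and rescaled predictors

print(np.cov(U, rowvar=False))                 # the identity matrix, up to rounding

cov_XY = np.cov(np.column_stack([X, Y]), rowvar=False)[:p, p]
cov_UY = np.cov(np.column_stack([U, Y]), rowvar=False)[:p, p]
beta = np.linalg.solve(var_X, cov_XY)
print(np.allclose(X @ beta, U @ cov_UY))       # True
```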

Backup: Square root of a matrix

Backup/Aside: \(R^2\) is useless

References

Berk, Richard A. 2008. Statistical Learning from a Regression Perspective. New York: Springer-Verlag.

Buja, Andreas, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. 2014. “Models as Approximations, Part I: A Conspiracy of Nonlinearity and Random Regressors in Linear Regression.” arxiv:1404.1578. http://arxiv.org/abs/1404.1578.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press. https://doi.org/10.1017/CBO9780511802843.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second. Berlin: Springer. http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

Shalizi, Cosma Rohilla. 2015. “The Truth About Linear Regression.” Online Manuscript. http://www.stat.cmu.edu/~cshalizi/TALR/.

———. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.

Wiener, Norbert. 1949. Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications. Cambridge, Massachusetts: The Technology Press of the Massachusetts Institute of Technology.


  1. Since-deleted tweet, but see, e.g., https://twitter.com/bc238dev/status/1225150435729666048. I’ve often seen the last line quoted as “it’s logistic regression”, which fits with computer science’s emphasis on classification rather than regression, but so far as I can work out that’s a later mutation.