Linear Regression as Prediction
36-462/36-662, Spring 2022
20 January 2022
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\newcommand{\OptLinPred}{m}
\newcommand{\EstLinPred}{\hat{m}}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\DeclareMathOperator*{\argmin}{argmin}
\]
Context
- We want to find patterns in our data which will let us predict one variable from another (or from more than one variable)
- The oldest and most widely-used prediction method is linear regression
- We’re going to review linear regression, as a prediction method
- Without a lot of the mythology from 401
- But with an eye to more powerful methods
Optimal prediction in general
- We want to predict a numerical random variable \(Y\)
- We want a one-number (“point”) prediction, not a range or a distribution
- We say how good our guess is by expected squared error
Optimal prediction in general (cont’d.)
What’s the best constant guess for a random variable \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m \in \mathbb{R}}{\Expect{(Y-m)^2}}\\
& = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\
& = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\
& = & \argmin_m{ (\Expect{Y} - m)^2}\\
& = & \Expect{Y}
\end{eqnarray}\]
(Because: \(\Expect{Z^2} = \Var{Z} + (\Expect{Z})^2\), always)
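A minimal numerical check (the simulated data and variable names here are mine, not the slides'): over a large sample, the constant with the smallest average squared error is the sample mean.

```r
# Sketch: the constant that minimizes average squared error is the mean
set.seed(462)
y <- rexp(1e4, rate = 1/3)                      # any distribution will do; E[Y] = 3 here
mse.of.constant <- function(m) { mean((y - m)^2) }
grid <- seq(0, 10, by = 0.01)
c(best.on.grid = grid[which.min(sapply(grid, mse.of.constant))],
  sample.mean = mean(y))                        # agree, up to the grid spacing
```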
Optimal prediction in general (cont’d.)
What’s the best function of \(X\) to guess for \(Y\)?
\[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m: \mathcal{X} \mapsto \mathbb{R}}{\Expect{(Y-m(X))^2}}\\
& = & \argmin_{m}{\Expect{\Expect{(Y-m(X))^2|X}}}
\end{eqnarray}\]
For each \(x\), best \(m(x)\) is \(\Expect{Y|X=x}\) (by previous slide)
\[
\TrueRegFunc(x) = \Expect{Y|X=x}
\]
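Another quick sanity check, on simulated data where we know \(\Expect{Y|X=x}\) exactly (the data-generating process below is invented purely for illustration):

```r
# Sketch: the conditional mean E[Y|X=x] beats other functions of x on squared error
set.seed(462)
n <- 1e5
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.5)     # so mu(x) = E[Y|X=x] = sin(x); noise variance 0.25
mean((y - sin(x))^2)                 # ~ 0.25: the irreducible error
mean((y - x/3)^2)                    # some other function of x: strictly worse
mean((y - mean(y))^2)                # the best constant guess: worse still
```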
Optimal prediction in general (cont’d.)
Learning arbitrary functions is hard!
Who knows what the right function might be?
What if we decide to make our predictions linear?
Optimal linear prediction with univariate predictor
Our prediction will be of the form \[
\OptLinPred(x) = a + b x
\] and we want the best \(a, b\)
Optimal linear prediction, univariate case
\[
(\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bX))^2}}
\]
Expand out that expectation, then take derivatives and set them to 0
The intercept
\[\begin{eqnarray}
\Expect{(Y-(a+bX))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bX)} + \Expect{(a+bX)^2}\\
& = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YX} +\\
& & a^2 + 2 ab \Expect{X} + b^2 \Expect{X^2}\\
\left. \frac{\partial}{\partial a}\Expect{(Y-(a+bX))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{X} = 0\\
\alpha & = & \Expect{Y} - \beta\Expect{X}
\end{eqnarray}\]
\(\therefore\) optimal linear predictor \(m(X) = \alpha+\beta X\) looks like \[\begin{eqnarray}
m(X) & = & \alpha + \beta X\\
& = & \Expect{Y} - \beta\Expect{X} + \beta X\\
& = & \Expect{Y} + \beta(X-\Expect{X})
\end{eqnarray}\] The optimal linear predictor only cares about how far \(X\) is from its expectation \(\Expect{X}\) And when \(X=\Expect{X}\), we will always predict \(\Expect{Y}\)
The slope
\[\begin{eqnarray}
\left. \frac{\partial}{\partial b}\Expect{(Y-(a+bX))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YX} + 2\alpha \Expect{X} + 2\beta \Expect{X^2} = 0\\
0 & = & -\Expect{YX} + (\Expect{Y} - \beta\Expect{X})\Expect{X} + \beta\Expect{X^2} \\
0 & = & \Expect{Y}\Expect{X} - \Expect{YX} + \beta(\Expect{X^2} - \Expect{X}^2)\\
0 & = & -\Cov{Y,X} + \beta \Var{X}\\
\beta & = & \frac{\Cov{Y,X}}{\Var{X}}
\end{eqnarray}\]
- If we replace \(X\) with \(X' = X-\Expect{X}\), \(\beta\) doesn’t change
- If we replace \(Y\) with \(Y' = Y-\Expect{Y}\), \(\beta\) doesn’t change
- \(\therefore\) centering the variables doesn’t change the slope
The optimal linear predictor of \(Y\) from \(X\)
The optimal linear predictor of \(Y\) from a single \(X\) is always
\[
\alpha + \beta X = \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (X - \Expect{X})
\]
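A small worked example, with invented numbers: if \(\Expect{Y} = 2\), \(\Expect{X} = 1\), \(\Cov{X,Y} = 3\) and \(\Var{X} = 2\), then
\[
\OptLinPred(X) = 2 + \frac{3}{2}(X - 1) = 0.5 + 1.5\,X
\]
so \(\alpha = 0.5\) and \(\beta = 1.5\), whatever the rest of the joint distribution of \(X\) and \(Y\) looks like.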
What did we not assume?
- That the true relationship between \(Y\) and \(X\) is linear
- That anything is Gaussian
- That anything has constant variance
- That anything is independent or even uncorrelated
NONE OF THAT MATTERS for the optimal linear predictor
The prediction errors average out to zero
\[\begin{eqnarray}
\Expect{Y-\OptLinPred(X)} & = & \Expect{Y - (\Expect{Y} + \beta(X-\Expect{X}))}\\
& = & \Expect{Y} - \Expect{Y} - \beta(\Expect{X} - \Expect{X}) = 0
\end{eqnarray}\]
- If they didn’t average to zero, we’d adjust the coefficients until they did
- Important: In general, \(\Expect{Y-\OptLinPred(X)|X} \neq 0\)
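A sketch of both points on simulated data (the quadratic truth below is invented for illustration): the errors of the fitted linear predictor average to zero overall, but not within slices of \(X\).

```r
# Errors of the (estimated) optimal linear predictor: mean zero overall,
# but not conditionally on X, because the truth here is nonlinear
set.seed(462)
n <- 1e5
x <- runif(n, 0, 2)
y <- x^2 + rnorm(n, sd = 0.1)              # quadratic truth
beta <- cov(y, x) / var(x)
alpha <- mean(y) - beta * mean(x)
err <- y - (alpha + beta * x)
mean(err)                                  # ~ 0
c(mean(err[x < 0.5]), mean(err[x > 1.5]))  # both around +0.25; the middle range is correspondingly negative
```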
How big are the prediction errors?
\[\begin{eqnarray}
\Var{Y-\OptLinPred(X)} & = & \Var{Y - \alpha - \beta X}\\
& = & \Var{Y - \beta X}\\
\end{eqnarray}\]
After-class exercise: Reduce this to an expression involving only \(\Var{Y}\), \(\Var{X}\) and \(\Cov{Y,X}\); if you get the right answer you should see that it’s \(< \Var{Y}\) unless \(\Cov{Y,X}=0\)
\(\Rightarrow\) Optimal linear predictor is almost always better than nothing…
Multivariate case
- We try to predict \(Y\) from a whole bunch of variables
- Bundle those predictor variables into \(\vec{X}\)
- We need a vector of slope coefficients \(\vec{\beta}\)
- Solution:
\[
\OptLinPred(\vec{X}) = \alpha+ \vec{\beta} \cdot \vec{X} = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
and
\[
\Var{Y-\OptLinPred(\vec{X})} = \Var{Y} - \Cov{Y,\vec{X}}^T \Var{\vec{X}}^{-1} \Cov{Y,\vec{X}}
\]
(Gory details in the back-up slides)
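A sketch in R (the data-generating process is invented, and deliberately non-Gaussian): the displayed formula, with sample moments plugged in, reproduces the coefficients from `lm()`.

```r
# Multivariate optimal-linear-predictor formula, via sample moments, vs. lm()
set.seed(462)
n <- 1e4
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
X[, 2] <- 0.5 * X[, 1] + X[, 2]                # correlated predictors
y <- 1 + 2 * X[, 1] - X[, 2] + rt(n, df = 5)   # heavy-tailed (non-Gaussian) noise
beta <- solve(var(X), cov(X, y))               # Var(X)^{-1} Cov(X, Y)
alpha <- mean(y) - sum(beta * colMeans(X))     # E[Y] - beta . E[X]
rbind(plug.in = c(alpha, beta),
      via.lm = coef(lm(y ~ X)))                # identical, up to floating point
```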
What we don’t assume, again
- That the linear predictor is correct
- That anything is Gaussian
- Anything about the distributions of \(Y\) or \(\vec{X}\)
Estimation: Data, not the full distribution
- \(\Var{X}\), \(\Cov{Y,X}\), \(\Expect{(Y-(a+bX))^2}\) all involve the true distribution, which we don’t know
- \(\therefore\) We can’t just calculate \(\beta\)
- What we do know is \((x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)\)
- That is, \(n\) pairs of predictor and outcome
Estimation: Ordinary Least Squares (OLS)
Set up the mean squared error, and minimize it:
\[\begin{eqnarray}
MSE(a,b) & = & \frac{1}{n}\sum_{i=1}^{n}{(y_i - (a+b x_i))^2}\\
(\hat{\alpha}, \hat{\beta}) & \equiv & \argmin_{a,b}{MSE(a,b)}
\end{eqnarray}\]
Do the calculus: \[\begin{eqnarray}
\frac{\partial MSE}{\partial a} & = & \frac{1}{n}\sum_{i=1}^{n}{-2(y_i - (a+b x_i))}\\
\frac{\partial MSE}{\partial b} & = & \frac{1}{n}\sum_{i=1}^{n}{-2(y_i - (a+b x_i))x_i}\\
\hat{\alpha} & = & \overline{y} - \hat{\beta} \overline{x}\\
\hat{\beta} & = & \frac{ \overline{yx} - \overline{y}\,\overline{x}}{\overline{x^2} - \overline{x}^2} = \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}}
\end{eqnarray}\]
(with \(\overline{x} =\) sample mean of \(x\), etc.)
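A quick check in R (simulated data; the names are mine): the plug-in formulas above give exactly the `lm()` coefficients.

```r
# OLS as plug-in: sample covariance over sample variance
set.seed(462)
x <- runif(200)
y <- 3 - 2 * x + rnorm(200, sd = 0.5)
beta.hat <- cov(y, x) / var(x)
alpha.hat <- mean(y) - beta.hat * mean(x)
c(alpha.hat = alpha.hat, beta.hat = beta.hat)
coef(lm(y ~ x))                  # the same numbers, up to floating point
```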
Optimum vs. estimate (I)
- Optimal linear model versus Linear model estimated by OLS: \[\begin{eqnarray}
\alpha + \beta x & = & \Expect{Y} + \left(\frac{\Cov{X,Y}}{\Var{X}}\right) (x - \Expect{X})\\
\hat{\alpha} + \hat{\beta} x & = & \overline{y} + \frac{\widehat{\Cov{Y,X}}}{\widehat{\Var{X}}} (x - \overline{x})
\end{eqnarray}\]
- These are not the same, but we can hope they’re close, and increasingly close with more data
When does OLS/plug-in work?
- “Work” = converge on the optimal linear predictor
- Jointly sufficient conditions:
- Sample means converge on expectation values
- Sample covariances converge on true covariance
- Sample variances converge on true, invertible variance
- Then by continuity OLS coefficients converge on true \(\beta\)
- None of this requires that the linear model is right, that anything is Gaussian, etc.
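A simulation sketch of those conditions in action (the data-generating process is invented, and deliberately nonlinear, skewed, and heteroskedastic):

```r
# OLS converges on the optimal *linear* coefficients, not on the (nonlinear) truth
set.seed(462)
simulate <- function(n) {
  x <- runif(n, 0, 2)
  y <- exp(x) + (1 + x) * (rexp(n) - 1)   # nonlinear mean; skewed, heteroskedastic noise
  data.frame(x, y)
}
target <- coef(lm(y ~ x, data = simulate(1e6)))   # stand-in for the optimal linear coefficients
for (n in c(100, 1000, 10000)) { print(coef(lm(y ~ x, data = simulate(n)))) }
target                                            # the estimates above close in on this
```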
Optimum vs. estimate (II)
- A little more subtle: for every fixed \((a,b)\), by the law of large numbers, \[
MSE(a,b) \rightarrow \Expect{(Y - (a+b X))^2}
\]
- Mean squared error \(\rightarrow\) expected squared error
- This is almost enough to ensure that minimizing the MSE will converge on the true optimum
- We’ll come back to this, because this approach generalizes better to other models and loss functions
What do the estimates look like?
- Bundle all the regressor values into an \(n\times (p+1)\) matrix \(\mathbf{x}\), with an extra column of 1s
- Bundle all the regressand values into an \(n\times 1\) matrix \(\mathbf{y}\)
- Then the \((p+1)\times 1\) matrix of least-squares coefficients \(\hat{\mathbf{\beta}}\) is \[
\hat{\mathbf{\beta}} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\] (as you learned in linear models)
- This is really \[\begin{eqnarray}
\hat{\mathbf{\beta}} & = & {\left(\frac{1}{n}\mathbf{x}^T\mathbf{x}\right)}^{-1} \left(\frac{1}{n}\mathbf{x}^T \mathbf{y}\right)\\
& = & (\mathrm{sample\ variance\ matrix\ of\ regressors})^{-1} (\mathrm{sample\ covariances\ between\ regressors\ and\ response})
\end{eqnarray}\]
- (Strictly speaking, with the intercept column included these are uncentered second moments; center the variables and the identification with sample variances and covariances is exact)
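A sketch of the matrix formula in R (simulated design matrix; the names and numbers are purely illustrative):

```r
# (x^T x)^{-1} x^T y, computed directly, vs. lm()
set.seed(462)
n <- 100; p <- 3
x <- cbind(1, matrix(rnorm(n * p), n, p))         # n x (p+1), with the column of 1s
y <- as.vector(x %*% c(1, 2, 0, -1) + rnorm(n))
beta.hat <- solve(t(x) %*% x, t(x) %*% y)         # solve() rather than an explicit inverse
data.frame(by.hand = as.vector(beta.hat), via.lm = coef(lm(y ~ x[, -1])))
```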
What do the predictions look like?
- The prediction at an arbitrary point \(\vec{x} = (x_1, x_2, \ldots x_p)\) is \[
\EstLinPred(\vec{x}) = \left[ \begin{array}{ccccc} 1 & x_1 & x_2 & \ldots & x_p \end{array} \right] \hat{\beta} = \left[ \begin{array}{ccccc} 1 & x_1 & x_2 & \ldots & x_p \end{array} \right] \left((\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\right)
\]
- The \(n\times 1\) matrix of fitted values, at the training points, is \[
\mathbf{\EstLinPred} = \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\]
- The matrix \(\mathbf{h} \equiv \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\) is called the weight matrix, influence matrix or hat matrix; its \((i,j)\) entry says how much \(y_j\) matters to the fitted value for observation \(i\)
Fitted values and other predictions are weighted sums of the observations
\[\begin{eqnarray}
\EstLinPred(\vec{x}) & = & \vec{x} \hat{\beta} = \vec{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\\
\mathbf{\EstLinPred} & = & \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}
\end{eqnarray}\]
- Every prediction we make is linear in \(\mathbf{y}\): it’s a weighted sum of all the observed values of the response
- How much weight do we put on \(y_i\) when we’re predicting at \(\vec{x}\)?
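A sketch in R tying these formulas together (simulated data; everything below is illustrative): the hat matrix reproduces `lm()`’s fitted values and leverages, and the weight vector at a new point gives the prediction as a weighted sum of the \(y_i\).

```r
# Fitted values and new predictions as weighted sums of the observed responses
set.seed(462)
n <- 50
x <- cbind(1, runif(n), rnorm(n))                 # design matrix, intercept column included
y <- as.vector(x %*% c(1, 2, -1) + rnorm(n))
fit <- lm(y ~ x[, -1])
h <- x %*% solve(t(x) %*% x) %*% t(x)             # the hat matrix
max(abs(h %*% y - fitted(fit)))                   # ~ 0: rows of h give the fitted-value weights
max(abs(diag(h) - hatvalues(fit)))                # ~ 0: diag(h) are the leverages
x.new <- c(1, 0.5, -0.3)                          # a new point, leading 1 included
w <- as.vector(x.new %*% solve(t(x) %*% x) %*% t(x))    # one weight per training point
c(weighted.sum = sum(w * y),
  direct = sum(x.new * solve(t(x) %*% x, t(x) %*% y)))  # the same prediction, two ways
```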
Generalizing: linear smoothers
- Linear regression is a special case of a linear smoother \[
\EstRegFunc(\vec{x}) = \sum_{i=1}^{n}{w(\vec{x}, \vec{x}_i) y_i}
\]
- Notice: Linear in the \(y\)’s, not (necessarily) in \(\vec{x}\)
- General idea: predict at \(\vec{x}\) by finding similar \(\vec{x}_i\)’s and averaging their \(y_i\)’s
- Different choices of \(w\) are different ideas about “similar” and about weighted averaging
- Most prediction methods used in practice are linear smoothers
- Nearest neighbors
- Trees and forests
- Kernels
- Neural networks, a.k.a. “deep learning”, a.k.a. “artificial intelligence”
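As a concrete instance (this k-nearest-neighbor smoother is my own illustration, not code from the course):

```r
# A k-nearest-neighbor regression is a linear smoother:
# w(x, x_i) = 1/k for the k nearest training points, 0 for the rest
knn.smoother <- function(x0, x, y, k = 5) {
  w <- numeric(length(x))
  w[order(abs(x - x0))[1:k]] <- 1 / k
  sum(w * y)                        # prediction = weighted sum of the y_i
}
set.seed(462)
x <- runif(200, -3, 3)
y <- sin(x) + rnorm(200, sd = 0.3)
sapply(c(-2, 0, 2), knn.smoother, x = x, y = y, k = 10)   # compare to sin(c(-2, 0, 2))
```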
What about the rest of your linear models course?
- Say you want to test whether \(\beta_{37} = 1\), or to give a confidence set for \((\alpha, \beta_{12})\)
- That’s hard to do explicitly without more assumptions
- Generally, you need to know the sampling distribution of \(\mathbf{\hat{\beta}}\)
What about the rest of your linear models course? (cont’d)
- The usual (“401”) assumptions give nice formulas for the sampling distribution of parameter estimators, and so for parameter inference:
1. The true regression function is exactly linear.
2. \(Y=\alpha + \vec{X} \cdot \vec{\beta} + \epsilon\), where \(\epsilon\) is independent of \(\vec{X}\).
3. \(\epsilon\) is independent across observations.
4. \(\epsilon \sim \mathcal{N}(0,\sigma^2)\), with the same \(\sigma^2\) for every observation.
- Assuming (1)–(4), the usual formulas (embedded in R) are exactly right
- Gaussian distribution for \(\hat{\mathbf{\beta}}\), centered at true \(\mathbf{\beta}\)
- Assuming (1)–(3), non-Gaussian noise, and the influence of any one observation \(\rightarrow 0\), then the usual formulas hold as \(n\rightarrow\infty\)
- Get back the Gaussian distribution by a (complicated) central limit theorem
- Do not use those formulas without checking those assumptions
The most important assumption to check
- If the true regression function is exactly linear, then \[
\Expect{Y-(\alpha+ \vec{X} \cdot \vec{\beta})|\vec{X}=\vec{x}} = 0
\] for all \(\vec{x}\)
- So \[
\Expect{Y-(\hat{\alpha}+\vec{X} \cdot \hat{\beta})|\vec{X}=\vec{x}} \approx 0
\]
- The residuals should have conditional mean 0 everywhere if the linear model is right
- This is something we can check using nonlinear regression methods
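One way to run that check in R (a sketch only; the data and the choice of `loess()` as the nonlinear smoother are mine):

```r
# Smooth the residuals against the predictor; a curve that bends systematically
# away from zero is evidence against the linear specification
set.seed(462)
x <- runif(500, 0, 2)
y <- exp(x) + rnorm(500, sd = 0.3)         # the truth is not linear
fit <- lm(y ~ x)
resid.smooth <- loess(residuals(fit) ~ x)
plot(x, residuals(fit))
lines(sort(x), fitted(resid.smooth)[order(x)], lwd = 2)   # bends well away from zero here
abline(h = 0, lty = "dashed")
```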
Summing up
- We can always decide to use a linear predictor, \(\OptLinPred(\vec{X}) = \alpha + \vec{\beta} \cdot \vec{X}\)
- The optimal linear predictor of \(Y\) from \(\vec{X}\) always takes the same form: \[
\OptLinPred(\vec{X}) = \Expect{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X},Y}\right) \cdot (\vec{X} - \Expect{\vec{X}})
\]
- Doing linear prediction requires finding those covariances
- We usually estimate those covariances (implicitly) by ordinary least squares
- OLS converges pretty robustly to the optimal linear predictor (not to the truth)
- The usual “401” assumptions are needed for parameter inference, not point prediction
- Once we use OLS, all our predictions are weighted sums of the observations
- The weights for linear regression are weird and implausible
- We’re going to explore other ways of weighting
- We’re also going to explore minimizing other measures of prediction error, over other classes of predictors
A final thought
When you’re fundraising, it’s AI
When you’re hiring, it’s ML
When you’re implementing, it’s linear regression
- Baron Schwartz, 15 November 2017
Backup: Further reading
- See Berk (2008), chapter 1, and Hastie, Tibshirani, and Friedman (2009), sections 2.3.1 and 2.6, and chapter 3 (especially through section 3.4), for more on linear regression as a prediction method
- If you want more from the point of view taken here, see Shalizi (n.d.), chapter 2; if you want a lot more, see Shalizi (2015)
- If you need to do inference about the coefficients but the usual assumptions don’t hold, your best bet is the bootstrap (covered in 402, and in some excellent textbooks like Davison and Hinkley (1997)); some of the complicated procedures developed in econometrics for “robust standard errors” in linear models turn out to be equivalent to simple bootstraps (Buja et al. 2014)
- If you need prediction intervals or predictive distributions from your linear model, there are also ways of doing that using the bootstrap — see Davison and Hinkley (1997) again, or again Berk (2008), chapter 1
- The view that the optimal linear model is just \(\Var{\vec{X}}^{-1}\Cov{Y, \vec{X}}\) and all we need is for sample covariance to converge, goes back to Wiener (1949)
- A notoriously hard-to-read book, but that’s because Wiener was working in a setting where, instead of regressing \(Y\) on a fixed-length vector \(\vec{X}\), you regress \(Y\) on a continuous-time signal, and because he framed the solution so that it could be implemented with 1940s-vintage analog electronics…
Backup: The optimal regression function
- We set up the problem \[\begin{eqnarray}
\TrueRegFunc & = & \argmin_{m: \mathcal{X} \mapsto \mathbb{R}}{\Expect{\Expect{(Y-m(X))^2|X}}}\\
& = & \argmin_{m: \mathcal{X} \mapsto \mathbb{R}}{\int{dx p(x) \int{dy p(y|x) (y-m(x))^2}}}
\end{eqnarray}\]
- For each \(x\), we can minimize the inner integral by setting \(m(x) = \Expect{Y|X=x}\)
- We just pasted together all of those pointwise minimizers to get \(\TrueRegFunc\)
- We could do this because we were minimizing over all possible functions, so minimizing at \(x_1\) doesn’t interfere with minimizing at \(x_2\)
- This doesn’t work if we limit ourselves to a particular set of functions \(\mathcal{M}\)
- Unless of course \(\TrueRegFunc \in \mathcal{M}\), i.e., that model is well-specified
Backup: Gory details for multivariate predictors
\[\begin{eqnarray}
\OptLinPred(\vec{X}) & = & a + \vec{b} \cdot \vec{X}\\
(\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{X}))^2}}\\
\Expect{(Y-(a+\vec{b}\cdot \vec{X}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{X})^2}\\
\nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{X})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{X}}\\
& = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{X} \otimes \vec{X}} \vec{b} \\
\nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{X}} + 2a\vec{b}\cdot \Expect{\vec{X}}
\end{eqnarray}\]
Backup: Gory details: the intercept
Take derivative w.r.t. \(a\), set to 0: \[\begin{eqnarray}
0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{X}} + 2\alpha \\
\alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}
\end{eqnarray}\] just like when \(X\) was univariate
Backup: Gory details: the slopes
\[\begin{eqnarray}
-2 \Expect{Y\vec{X}} + 2 \Expect{\vec{X} \otimes \vec{X}} \vec{\beta} + 2 \alpha \Expect{\vec{X}} & = & 0\\
\Expect{Y\vec{X}} - \alpha\Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Expect{Y\vec{X}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{X}}) \Expect{\vec{X}} & = & \Expect{\vec{X} \otimes \vec{X}} \vec{\beta}\\
\Cov{Y,\vec{X}} & = & \Var{\vec{X}} \vec{\beta}\\
\vec{\beta} & = & (\Var{\vec{X}})^{-1} \Cov{Y,\vec{X}}
\end{eqnarray}\]
Reduces to \(\Cov{Y,X}/\Var{X}\) when \(X\) is univariate
Backup: Gory details: the PCA view
The factor of \(\Var{\vec{X}}^{-1}\) rotates and scales \(\vec{X}\) to uncorrelated, unit-variance variables
\[\begin{eqnarray}
\Var{\vec{X}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\
\Var{\vec{X}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\
& = & \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T\\
\vec{U} & \equiv & \vec{X} \Var{\vec{X}}^{-1/2}\\
\Var{\vec{U}} & = & \mathbf{I}\\
\vec{X}\cdot\vec{\beta} & = & \vec{X} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\
& = & \vec{X} \Var{\vec{X}}^{-1/2} \left(\Var{\vec{X}}^{-1/2}\right)^T \Cov{\vec{X}, Y}\\
& = & \vec{U} \Cov{\vec{U}, Y}\\
\end{eqnarray}\]
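A numerical sketch of the whitening step (simulated, correlated predictors; all the names here are mine):

```r
# Whitening X with Var(X)^{-1/2}, built from the eigendecomposition above
set.seed(462)
n <- 1e4
X <- matrix(rnorm(2 * n), n, 2) %*% matrix(c(1, 0.8, 0, 1), 2, 2)   # correlated columns
y <- as.vector(X %*% c(2, -1) + rnorm(n))
eig <- eigen(var(X))
root.inv <- eig$vectors %*% diag(1 / sqrt(eig$values))   # w Lambda^{-1/2}
U <- scale(X, scale = FALSE) %*% root.inv                # centered, whitened predictors
round(var(U), 6)                                         # ~ the identity matrix
beta <- solve(var(X), cov(X, y))
max(abs(scale(X, scale = FALSE) %*% beta - U %*% cov(U, y)))   # ~ 0: same predictions
```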
Backup: Square root of a matrix
- A square matrix \(\mathbf{d}\) is a square root of \(\mathbf{c}\) when \(\mathbf{c} = \mathbf{d} \mathbf{d}^T\)
- If there are any square roots, there are many square roots
- Pick any orthogonal matrix \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \((\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{d}^T\)
- Just like every positive real number has two square roots…
- If \(\mathbf{c}\) is diagonal, define \(\mathbf{c}^{1/2}\) as the diagonal matrix of square roots
- If \(\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T\), one square root is \(\mathbf{w}\mathbf{\Lambda}^{1/2}\)
Backup/Aside: \(R^2\) is useless
- \(R^2 \equiv \Var{\mathrm{fitted\ values}}/\Var{Y}\)
- Suppose we know \(Y=\beta X+\epsilon\), with \(\Var{\epsilon} = \sigma^2\) and \(\epsilon\) independent of \(X\)
- Then \(R^2 = \frac{\beta^2 \Var{X}}{\beta^2 \Var{X} + \sigma^2}\)
- Consequence 1: \(R^2\) can be arbitrarily close to 0, even for the completely correct model (shrink \(\Var{X}\) and/or grow \(\sigma^2\))
- Consequence 2: If we make \(\Var{X}\) bigger, and the linear model is right, \(R^2\) will grow \(\uparrow 1\) (unless \(\beta = 0\)); \(R^2\) measures the range of the regressor, not goodness of fit
- Suppose the linear model is wrong but we use it anyway; \(R^2 = \beta^2\Var{X}/\Var{Y}\)
- Consequence 3: \(R^2\) can be arbitrarily close to 1 even if every single assumption of the linear model is badly, obviously broken
- Conclusion: \(R^2\) tells us nothing about how good or bad our model is that we didn’t already know from the mean squared error; you’d be better off forgetting about it
- More: Shalizi (2015), chapter 10
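A simulation sketch of Consequences 1 and 2 (the numbers and names are mine): exactly the same correct model, three very different \(R^2\) values, obtained just by changing \(\Var{X}\).

```r
# R^2 for the *same, correct* linear model, varying only the spread of X
set.seed(462)
r.squared <- function(x.sd, n = 1e4, beta = 1, sigma = 1) {
  x <- rnorm(n, sd = x.sd)
  y <- beta * x + rnorm(n, sd = sigma)     # the linear model is exactly right
  summary(lm(y ~ x))$r.squared
}
round(sapply(c(0.1, 1, 10), r.squared), 3)  # roughly 0.01, 0.50, 0.99
```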
References
Berk, Richard A. 2008. Statistical Learning from a Regression Perspective. New York: Springer-Verlag.
Buja, Andreas, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. 2014. “Models as Approximations, Part I: A Conspiracy of Nonlinearity and Random Regressors in Linear Regression.” arxiv:1404.1578. http://arxiv.org/abs/1404.1578.
Wiener, Norbert. 1949. Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications. Cambridge, Massachusetts: The Technology Press of the Massachusetts Institute of Technology.