# Optimal Linear Prediction

18 September 2018

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}}$

# In our previous episodes

• Linear smoothers
    • Predictions are linear combinations of the data
    • How to choose the weights?
• PCA
    • Use correlations to break the data into additive components

Today: use correlations to do prediction

# Optimal prediction in general

What’s the best constant guess for a random variable $$Y$$?

$\begin{eqnarray} \TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m)^2}}\\ & = & \argmin_{m}{\Var{(Y-m)} + (\Expect{Y-m})^2}\\ & = & \argmin_m{\Var{Y} + (\Expect{Y} - m)^2}\\ & = & \argmin_m{ (\Expect{Y} - m)^2}\\ & = & \Expect{Y} \end{eqnarray}$
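
A quick numerical check of this result, as a sketch only: the exponential distribution, sample size, and grid below are arbitrary choices for illustration, and the sample mean squared error stands in for the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=10_000)  # deliberately skewed, non-Gaussian

# Sample mean squared error of each constant guess m on a grid
grid = np.linspace(0.0, 5.0, 501)
mse = np.array([np.mean((y - m) ** 2) for m in grid])

best_m = grid[np.argmin(mse)]
print(best_m, y.mean())  # the grid minimizer sits next to the sample mean
```

Nothing here relied on the shape of the distribution, only on the squared-error criterion.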

# Optimal prediction in general

What’s the best function of $$Z$$ to use as a guess for $$Y$$?

$\begin{eqnarray} \TrueRegFunc & = & \argmin_{m}{\Expect{(Y-m(Z))^2}}\\ & = & \argmin_{m}{\Expect{\Expect{(Y-m(Z))^2|Z}}} \end{eqnarray}$

For each $$z$$, the best $$m(z)$$ is $$\Expect{Y|Z=z}$$ (the best-constant-guess result, applied conditionally on $$Z=z$$)

$\TrueRegFunc(z) = \Expect{Y|Z=z}$
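
A minimal sketch of the same idea with a discrete $$Z$$, where the conditional expectation can be estimated by averaging within groups. The relationship $$Y = Z^2 + \text{noise}$$ is a made-up example, not something from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.integers(0, 3, size=20_000)       # Z takes values 0, 1, 2
y = z ** 2 + rng.normal(size=z.size)      # hypothetical nonlinear relationship

# Estimate E[Y | Z = k] by averaging Y within each value of Z
cond_mean = np.array([y[z == k].mean() for k in range(3)])
mse_cond = np.mean((y - cond_mean[z]) ** 2)

# Any other function of Z does worse, e.g. shifting the guess at Z = 2
other = cond_mean.copy()
other[2] += 0.5
mse_other = np.mean((y - other[z]) ** 2)

print(cond_mean, mse_cond, mse_other)
```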

# Optimal prediction in general

Learning arbitrary functions is hard!

Who knows what the right function might be?

What if we decide to make our predictions linear?

# Optimal linear prediction with univariate predictor

Our prediction will be of the form $m(z) = a + b z$ and we want the best $$a, b$$

# Optimal linear prediction, univariate case

$(\alpha, \beta) = \argmin_{a,b}{\Expect{(Y-(a+bZ))^2}}$

Expand out that expectation, then take derivatives and set them to 0

# The intercept

$\begin{eqnarray} \Expect{(Y-(a+bZ))^2} & = & \Expect{Y^2} - 2\Expect{Y(a+bZ)} + \Expect{(a+bZ)^2}\\ & = & \Expect{Y^2} - 2a\Expect{Y} - 2b\Expect{YZ} +\\ & & a^2 + 2 ab \Expect{Z} + b^2 \Expect{Z^2}\\ \left. \frac{\partial}{\partial a}\Expect{(Y-(a+bZ))^2} \right|_{a=\alpha, b=\beta} & = & -2\Expect{Y} + 2\alpha + 2\beta\Expect{Z} = 0\\ \alpha & = & \Expect{Y} - \beta\Expect{Z} \end{eqnarray}$

$$\therefore$$ optimal linear predictor looks like $\Expect{Y} + \beta(Z-\Expect{Z})$ $$\Rightarrow$$ centering $$Z$$ and/or $$Y$$ won’t change the slope

# The slope

$\begin{eqnarray} \left. \frac{\partial}{\partial b}\Expect{(Y-(a+bZ))^2}\right|_{a=\alpha, b=\beta} & = & -2\Expect{YZ} + 2\alpha \Expect{Z} + 2\beta \Expect{Z^2} = 0\\ 0 & = & -\Expect{YZ} + (\Expect{Y} - \beta\Expect{Z})\Expect{Z} + \beta\Expect{Z^2} \\ 0 & = & \Expect{Y}\Expect{Z} - \Expect{YZ} + \beta(\Expect{Z^2} - \Expect{Z}^2)\\ 0 & = & -\Cov{Y,Z} + \beta \Var{Z}\\ \beta & = & \frac{\Cov{Y,Z}}{\Var{Z}} \end{eqnarray}$

# The optimal linear predictor of $$Y$$ from $$Z$$

The optimal linear predictor of $$Y$$ from a single $$Z$$ is always

$\alpha + \beta Z = \Expect{Y} + \left(\frac{\Cov{Z,Y}}{\Var{Z}}\right) (Z - \Expect{Z})$
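
A sketch of this formula in action, assuming a large simulated sample as a stand-in for the population. The true relationship here is deliberately nonlinear with non-Gaussian noise (both arbitrary choices), and the formula still gives the best line.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.uniform(0.0, 3.0, size=50_000)
y = np.exp(z) + rng.standard_t(df=5, size=z.size)  # nonlinear truth, heavy-tailed noise

# beta = Cov[Y,Z] / Var[Z], alpha = E[Y] - beta E[Z], with sample moments
beta = np.cov(y, z)[0, 1] / np.var(z, ddof=1)
alpha = y.mean() - beta * z.mean()

def mse(a, b):
    """Sample mean squared error of the line a + b*z."""
    return np.mean((y - (a + b * z)) ** 2)

# Nudging either coefficient away from (alpha, beta) only makes things worse
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
worse = [mse(alpha + da, beta + db) > mse(alpha, beta) for da, db in nudges]
print(alpha, beta, all(worse))
```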

# What did we not assume?

• That the true relationship between $$Y$$ and $$Z$$ is linear
• That anything is Gaussian
• That anything has constant variance
• That anything is independent or even uncorrelated

NONE OF THAT MATTERS for the optimal linear predictor

# The prediction errors average out to zero

$\begin{eqnarray} \Expect{Y-m(Z)} & = & \Expect{Y - (\Expect{Y} + \beta(Z-\Expect{Z}))}\\ & = & \Expect{Y} - \Expect{Y} - \beta(\Expect{Z} - \Expect{Z}) = 0 \end{eqnarray}$
• If they didn’t average to zero, we’d adjust the coefficients until they did
• Important: In general, $$\Expect{Y-m(Z)|Z} \neq 0$$

# The prediction errors are uncorrelated with $$Z$$

$\begin{eqnarray} \Cov{Z, Y-m(Z)} & = & \Expect{Z(Y-m(Z))} ~\text{(by previous slide)}\\ & = & \Expect{Z(Y - \Expect{Y} - \frac{\Cov{Y,Z}}{\Var{Z}}(Z-\Expect{Z}))}\\ & = & \Expect{ZY - Z\Expect{Y} - \frac{\Cov{Y,Z}}{\Var{Z}}(Z^2) + \frac{\Cov{Y,Z}}{\Var{Z}} (Z \Expect{Z})}\\ & = & \Expect{ZY} - \Expect{Z}\Expect{Y} - \frac{\Cov{Y,Z}}{\Var{Z}}\Expect{Z^2} + \frac{\Cov{Y,Z}}{\Var{Z}} (\Expect{Z})^2\\ & = & \Cov{Z,Y} - \frac{\Cov{Y,Z}}{\Var{Z}}(\Var{Z})\\ & = & 0 \end{eqnarray}$
• If they weren’t uncorrelated, we’d adjust the coefficients until they were

# The prediction errors are uncorrelated with $$Z$$

Alternate take:

$\begin{eqnarray} \Cov{Z, Y-m(Z)} & = & \Cov{Z, Y} - \Cov{Z, \alpha + \beta Z}\\ & = & \Cov{Y,Z} - \Cov{Z, \beta Z}\\ & = & \Cov{Y,Z} - \beta\Cov{Z,Z}\\ & = & \Cov{Y,Z} - \beta\Var{Z}\\ & = & \Cov{Y,Z} - \Cov{Y,Z} = 0 \end{eqnarray}$
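
Both properties, and the warning that $$\Expect{Y-m(Z)|Z} \neq 0$$ in general, can be seen in a small simulation. This is a sketch with a made-up quadratic relationship; sample moments stand in for the true ones.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.uniform(-1.0, 2.0, size=100_000)
y = z ** 2 + rng.normal(scale=0.5, size=z.size)  # the truth is not linear

beta = np.cov(y, z)[0, 1] / np.var(z, ddof=1)
alpha = y.mean() - beta * z.mean()
resid = y - (alpha + beta * z)

mean_resid = resid.mean()                # zero up to floating-point error
cov_resid_z = np.cov(z, resid)[0, 1]     # zero up to floating-point error

# But the average error within a range of Z values is far from zero
left_mean = resid[z < -0.5].mean()            # line under-predicts here
mid_mean = resid[(z > 0) & (z < 1)].mean()    # line over-predicts here
print(mean_resid, cov_resid_z, left_mean, mid_mean)
```

The errors are uncorrelated with $$Z$$ overall, yet still systematically predictable from $$Z$$ through a nonlinear function.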

# How big are the prediction errors?

$\begin{eqnarray} \Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\ & = & \Var{Y - \beta Z}\\ \end{eqnarray}$

In-class exercise: finish this! Answer in terms of $$\Var{Y}$$, $$\Var{Z}$$, $$\Cov{Y,Z}$$

# How big are the prediction errors?

$\begin{eqnarray} \Var{Y-m(Z)} & = & \Var{Y - \alpha - \beta Z}\\ & = & \Var{Y - \beta Z}\\ & = & \Var{Y} + \beta^2\Var{Z} - 2\beta\Cov{Y,Z} \end{eqnarray}$

but $$\beta = \Cov{Y,Z}/\Var{Z}$$ so

$\begin{eqnarray} \Var{Y-m(Z)} & = & \Var{Y} + \frac{\Cov{Y,Z}^2}{\Var{Z}} - 2\frac{\Cov{Y,Z}^2}{\Var{Z}}\\ & = & \Var{Y} - \frac{\Cov{Y,Z}^2}{\Var{Z}}\\ & < & \Var{Y} ~ \text{unless} ~ \Cov{Y,Z} = 0 \end{eqnarray}$

$$\Rightarrow$$ Optimal linear predictor is almost always better than nothing…
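
One way to read the size of the improvement is through the correlation coefficient $$\rho$$ between $$Y$$ and $$Z$$ (notation introduced here, not used elsewhere in these slides):

$\begin{eqnarray} \rho & \equiv & \frac{\Cov{Y,Z}}{\sqrt{\Var{Y}\Var{Z}}}\\ \Var{Y-m(Z)} & = & \Var{Y} - \frac{\Cov{Y,Z}^2}{\Var{Z}} = \Var{Y}(1-\rho^2) \end{eqnarray}$

So, e.g., $$\rho = 0.5$$ removes only a quarter of the variance of $$Y$$.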

# Multivariate case

We try to predict $$Y$$ from a whole bunch of variables

Bundle those predictor variables into $$\vec{Z}$$

Solution:

$m(\vec{Z}) = \alpha+\vec{\beta}\cdot \vec{Z} = \Expect{Y} + \left(\Var{\vec{Z}}^{-1} \Cov{\vec{Z},Y}\right) \cdot (\vec{Z} - \Expect{\vec{Z}})$

and

$\Var{Y-m(\vec{Z})} = \Var{Y} - \Cov{Y,\vec{Z}}^T \Var{\vec{Z}}^{-1} \Cov{Y,\vec{Z}}$
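
A sketch of both formulas with three predictors. The mixing matrix and the nonlinear relationship are made up for illustration, and sample moments again stand in for the true ones. Solving the linear system is used in place of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50_000, 3
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])            # hypothetical, makes predictors correlated
Z = rng.exponential(size=(n, p)) @ mix       # non-Gaussian predictors
y = np.sin(Z[:, 0]) + Z[:, 1] * Z[:, 2] + rng.normal(size=n)

var_Z = np.cov(Z, rowvar=False)              # p x p matrix Var[Z]
cov_Zy = np.array([np.cov(Z[:, j], y)[0, 1] for j in range(p)])  # vector Cov[Z, Y]

beta = np.linalg.solve(var_Z, cov_Zy)        # Var[Z]^{-1} Cov[Z, Y]
alpha = y.mean() - beta @ Z.mean(axis=0)
resid = y - (alpha + Z @ beta)

# Var[Y] - Cov[Y,Z]^T Var[Z]^{-1} Cov[Y,Z], compared with the actual error variance
predicted_var = np.var(y, ddof=1) - cov_Zy @ beta
print(predicted_var, np.var(resid, ddof=1))
```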

# What we don’t assume, again

• Anything about the distributions of $$Y$$ or $$\vec{Z}$$
• That the linear predictor is correct
• That anything is Gaussian

# Some possible contexts

• Interpolating or extrapolating one variable over space and/or time
• Predicting one variable from another
• Predicting one variable from 2+ others

# Interpolating or extrapolating a single variable

• Given: $$X(r_1, t_1), X(r_2, t_2), \ldots X(r_n, t_n)$$
• Desired: estimate/guess at $$X(r_0, t_0)$$
$\begin{eqnarray} Y & = & X(r_0, t_0)\\ \vec{Z} & = & [X(r_1, t_1), X(r_2, t_2), \ldots X(r_n, t_n)] \end{eqnarray}$

Prediction for $$X(r_0, t_0)$$ is a linear combination of $$X$$ at other points

$\begin{eqnarray} \EstRegFunc(r_0, t_0) & = & \alpha + \vec{\beta} \cdot \left[\begin{array}{c} X(r_1, t_1) \\ X(r_2, t_2) \\ \vdots \\ X(r_n, t_n) \end{array}\right]\\ \alpha & = & \Expect{X(r_0, t_0)} - \vec{\beta} \cdot \left[\begin{array}{c} \Expect{X(r_1, t_1)}\\ \Expect{X(r_2, t_2)} \\ \vdots \\ \Expect{X(r_n, t_n)}\end{array}\right] ~ \text{(goes away if everything's centered)}\\ \vec{\beta} & = & {\left[\begin{array}{cccc} \Var{X(r_1, t_1)} & \Cov{X(r_1, t_1), X(r_2, t_2)} & \ldots & \Cov{X(r_1, t_1), X(r_n, t_n)}\\ \Cov{X(r_1, t_1), X(r_2, t_2)} & \Var{X(r_2, t_2)} & \ldots & \Cov{X(r_2, t_2), X(r_n, t_n)}\\ \vdots & \vdots & \ddots & \vdots\\ \Cov{X(r_1, t_1), X(r_n, t_n)} & \Cov{X(r_2, t_2), X(r_n, t_n)} & \ldots & \Var{X(r_n, t_n)}\end{array}\right]}^{-1} \left[\begin{array}{c} \Cov{X(r_0, t_0), X(r_1, t_1)}\\ \Cov{X(r_0, t_0), X(r_2, t_2)}\\ \vdots \\ \Cov{X(r_0, t_0), X(r_n, t_n)}\end{array}\right] \end{eqnarray}$
• looks a lot like a linear smoother
• best choice of weights $$\mathbf{w}$$ from variances and covariances
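
A sketch of interpolation in time only, assuming the covariances are handed to us by a model. The exponential covariance function `cov_fn`, the observation times, and the observed values are all hypothetical, chosen only to make the matrix algebra concrete.

```python
import numpy as np

def cov_fn(s, t, scale=1.0):
    """Hypothetical covariance model: Cov[X(s), X(t)] = exp(-|s - t| / scale)."""
    return np.exp(-np.abs(s - t) / scale)

t_obs = np.array([0.0, 1.0, 2.5, 4.0])    # times t_1..t_n where X was observed
x_obs = np.array([0.3, -0.1, 0.8, 0.2])   # made-up values, taken as already centered
t0 = 2.0                                   # where we want a prediction

var_Z = cov_fn(t_obs[:, None], t_obs[None, :])  # n x n matrix of Cov[X(t_i), X(t_j)]
cov_Zy = cov_fn(t_obs, t0)                      # vector of Cov[X(t_0), X(t_i)]

weights = np.linalg.solve(var_Z, cov_Zy)   # beta = Var[Z]^{-1} Cov[Z, Y]
prediction = weights @ x_obs               # alpha = 0 because everything is centered
print(weights, prediction)
```

If `t0` coincides with an observed time, the weights put all their mass on that observation, so the predictor reproduces the data there.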

# Predicting one variable from another

• Given: values of variable $$U$$ at many points, $$U(r_1, t_1), \ldots U(r_n, t_n)$$
• Desired: estimate of $$X$$ at point $$(r_0, t_0)$$, $$X\neq U$$
$\begin{eqnarray} Y & = & X(r_0, t_0)\\ \vec{Z} & = & [U(r_1, t_1), U(r_2, t_2), \ldots U(r_n, t_n)]\\ \end{eqnarray}$
• Need to find covariances of the $$U$$s with each other, and their covariances with $$X$$

# Predicting one variable from 2+ others

• Given: values of two variables $$U$$, $$V$$ at many points
• Desired: estimate of $$X$$ at one point
$\begin{eqnarray} Y & = & X(r_0, t_0)\\ \vec{Z} & = & [U(r_1, t_1), V(r_1, t_1), U(r_2, t_2), V(r_2, t_2), \ldots U(r_n, t_n), V(r_n, t_n)] \end{eqnarray}$
• Need to find covariances of $$U$$s and $$V$$s with each other, and with $$X$$

# Optimal prediction depends on variances and covariances

so how do we get these?

• Repeat the experiment many times
• OR make assumptions
    • E.g., some covariances should be the same
    • E.g., covariances should change smoothly in time or space
    • E.g., covariances should follow a particular model
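
A sketch contrasting the two routes on simulated data. The AR(1) process in `simulate` is a hypothetical example, and "covariance depends only on the time lag" is one possible instance of the "some covariances should be the same" assumption, not the only one.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(n, phi=0.7):
    """Hypothetical process for illustration: a stationary AR(1) series."""
    x = np.empty(n)
    x[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - phi ** 2))
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

# Route 1: repeat the experiment; each row is one independent run of length 4
runs = np.array([simulate(4) for _ in range(5_000)])
cov_repeated = np.cov(runs, rowvar=False)     # 4 x 4 sample covariance matrix

# Route 2: a single long run, plus the assumption that Cov[X(t), X(t+h)]
# depends only on the lag h, so all pairs h steps apart can be pooled
one_run = simulate(20_000)
centered = one_run - one_run.mean()
cov_by_lag = np.array([np.mean(centered[: centered.size - h] * centered[h:])
                       for h in range(4)])

print(cov_repeated[0], cov_by_lag)  # first row of route 1 vs. pooled lags 0..3
```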

# Summing up

• We can always decide to use a linear predictor, $$m(\vec{Z}) = \alpha + \vec{\beta} \cdot \vec{Z}$$
• The optimal linear predictor of $$Y$$ from $$\vec{Z}$$ always takes the same form: $m(\vec{Z}) = \Expect{Y} + \left(\Var{\vec{Z}}^{-1} \Cov{Y,\vec{Z}}\right) \cdot (\vec{Z} - \Expect{\vec{Z}})$
• Doing linear prediction requires finding the covariances
• Next few lectures: how to find and use covariances over time, over space, over both

# Gory details for multivariate predictors

$\begin{eqnarray} m(\vec{Z}) & = & a + \vec{b} \cdot \vec{Z}\\ (\alpha, \vec{\beta}) & = & \argmin_{a, \vec{b}}{\Expect{(Y-(a + \vec{b} \cdot \vec{Z}))^2}}\\ \Expect{(Y-(a+\vec{b}\cdot \vec{Z}))^2} & = & \Expect{Y^2} + a^2 + \Expect{(\vec{b}\cdot \vec{Z})^2}\\ \nonumber & & - 2\Expect{Y (\vec{b}\cdot \vec{Z})} - 2 \Expect{Y a} + 2 \Expect{a \vec{b} \cdot \vec{Z}}\\ & = & \Expect{Y^2} + a^2 + \vec{b} \cdot \Expect{\vec{Z} \otimes \vec{Z}} \vec{b} \\ \nonumber & & -2a\Expect{Y} - 2 \vec{b} \cdot \Expect{Y\vec{Z}} + 2a\vec{b}\cdot \Expect{\vec{Z}} \end{eqnarray}$

# Gory details: the intercept

Take derivative w.r.t. $$a$$, set to 0:

$\begin{eqnarray} 0 & = & -2\Expect{Y} + 2\vec{\beta} \cdot \Expect{\vec{Z}} + 2\alpha \\ \alpha & = & \Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}} \end{eqnarray}$

just like when $$Z$$ was univariate

# Gory details: the slopes

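Take derivative w.r.t. $$\vec{b}$$, set to 0:

$\begin{eqnarray} -2 \Expect{Y\vec{Z}} + 2 \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta} + 2 \alpha \Expect{\vec{Z}} & = & 0\\ \Expect{Y\vec{Z}} - \alpha\Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\ \Expect{Y\vec{Z}} - (\Expect{Y} - \vec{\beta} \cdot \Expect{\vec{Z}}) \Expect{\vec{Z}} & = & \Expect{\vec{Z} \otimes \vec{Z}} \vec{\beta}\\ \Expect{Y\vec{Z}} - \Expect{Y}\Expect{\vec{Z}} & = & \left(\Expect{\vec{Z} \otimes \vec{Z}} - \Expect{\vec{Z}} \otimes \Expect{\vec{Z}}\right) \vec{\beta}\\ \Cov{Y,\vec{Z}} & = & \Var{\vec{Z}} \vec{\beta}\\ \vec{\beta} & = & (\Var{\vec{Z}})^{-1} \Cov{Y,\vec{Z}} \end{eqnarray}$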

Reduces to $$\Cov{Y,Z}/\Var{Z}$$ when $$Z$$ is univariate

# Gory details: the PCA view

The factor of $$\Var{\vec{Z}}^{-1}$$ rotates and scales $$\vec{Z}$$ to uncorrelated, unit-variance variables

$\begin{eqnarray} \Var{\vec{Z}} & = & \mathbf{w} \mathbf{\Lambda} \mathbf{w}^T\\ \Var{\vec{Z}}^{-1} & = & \mathbf{w} \mathbf{\Lambda}^{-1} \mathbf{w}^T\\ \Var{\vec{Z}}^{-1} & = & (\mathbf{w} \mathbf{\Lambda}^{-1/2}) (\mathbf{w} \mathbf{\Lambda}^{-1/2})^T\\ & = & \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T\\ \vec{U} & \equiv & \vec{Z} \Var{\vec{Z}}^{-1/2}\\ \Var{\vec{U}} & = & \mathbf{I}\\ \vec{Z}\cdot\vec{\beta} & = & \vec{Z} \cdot \Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\\ & = & \vec{Z} \Var{\vec{Z}}^{-1/2} \left(\Var{\vec{Z}}^{-1/2}\right)^T \Cov{\vec{Z}, Y}\\ & = & \vec{U} \Cov{\vec{U}, Y}\\ \end{eqnarray}$
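
The algebra above can be checked numerically. This is a sketch on simulated, centered data (the mixing matrix and coefficients are arbitrary), treating each observation of $$\vec{Z}$$ as a row vector, as the slide does.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
Z = rng.normal(size=(n, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Z = Z - Z.mean(axis=0)                       # center, so alpha drops out
y = Z[:, 0] - 2.0 * Z[:, 1] + rng.normal(size=n)
y = y - y.mean()

var_Z = np.cov(Z, rowvar=False)
eigvals, w = np.linalg.eigh(var_Z)           # Var[Z] = w Lambda w^T
inv_sqrt = w @ np.diag(eigvals ** -0.5)      # Var[Z]^{-1/2} = w Lambda^{-1/2}

U = Z @ inv_sqrt                             # rotated and rescaled predictors
cov_Uy = U.T @ y / (n - 1)                   # Cov[U, Y]
cov_Zy = Z.T @ y / (n - 1)                   # Cov[Z, Y]
beta = np.linalg.solve(var_Z, cov_Zy)

print(np.cov(U, rowvar=False))               # identity matrix, up to rounding
print(np.max(np.abs(Z @ beta - U @ cov_Uy))) # the two predictions agree
```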

# Estimation I: “plug-in”

• We don’t see the true expectations, variances, covariances
• But we can have sample/empirical values
• One estimate of the optimal linear predictor: plug in the sample values

so for univariate $$Z$$,

$\EstRegFunc(z) = \overline{y} + \frac{\widehat{\Cov{Y,Z}}}{\widehat{\Var{Z}}}(z-\overline{z})$

# Estimation II: ordinary least squares

• We don’t see the true expected squared error, but we do have the sample mean of the squared errors
• Minimize that
• Leads to exactly the same results as plug-in approach!
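
A sketch confirming the agreement on simulated data (the data-generating process is arbitrary). Least squares is done here with NumPy's `np.linalg.lstsq` on a design matrix with a column of ones for the intercept.

```python
import numpy as np

rng = np.random.default_rng(7)
z = rng.gamma(shape=2.0, size=1_000)
y = np.log1p(z) + rng.normal(scale=0.3, size=z.size)

# Plug-in: sample moments in place of the true ones
beta_plug = np.cov(y, z)[0, 1] / np.var(z, ddof=1)
alpha_plug = y.mean() - beta_plug * z.mean()

# OLS: minimize the sample mean of squared errors over (a, b)
design = np.column_stack([np.ones_like(z), z])
(alpha_ols, beta_ols), *_ = np.linalg.lstsq(design, y, rcond=None)

print(alpha_plug, alpha_ols)
print(beta_plug, beta_ols)
```

The match does not depend on whether the variance and covariance use $$n$$ or $$n-1$$ in the denominator, so long as both use the same one, since the factor cancels in the ratio.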

# When does OLS/plug-in work?

• Jointly sufficient conditions:
    • Sample means converge on expectation values
    • Sample covariances converge on true covariances
    • Sample variances converge on true, invertible variance
• Then by continuity OLS coefficients converge on true $$\beta$$
• This can all happen even when everything is dependent on everything else!
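
A sketch of the last point: predicting the next value of a dependent series from the current one. The AR(1) model and its coefficient `phi` are hypothetical; for this model $$\Cov{X(t+1), X(t)}/\Var{X(t)}$$ equals `phi` once the series has settled down, so that is the value the estimates should approach.

```python
import numpy as np

rng = np.random.default_rng(8)
phi = 0.7                                  # hypothetical AR(1) coefficient
n = 100_000
x = np.zeros(n)
noise = rng.normal(size=n)
for t in range(1, n):                      # every value depends on the one before
    x[t] = phi * x[t - 1] + noise[t]

z, y = x[:-1], x[1:]                       # predict the next value from the current one
slopes = {}
for m in (100, 1_000, 10_000, n - 1):
    slopes[m] = np.cov(y[:m], z[:m])[0, 1] / np.var(z[:m], ddof=1)

print(slopes)                              # plug-in slope at growing sample sizes
```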

# Square root of a matrix

• A square matrix $$\mathbf{d}$$ is a square root of $$\mathbf{c}$$ when $$\mathbf{c} = \mathbf{d} \mathbf{d}^T$$
• If there are any square roots, there are many square roots
    • Pick any orthogonal matrix $$\mathbf{o}$$, i.e., $$\mathbf{o}^T = \mathbf{o}^{-1}$$
    • $$(\mathbf{d}\mathbf{o})(\mathbf{d}\mathbf{o})^T = \mathbf{d}\mathbf{o}\mathbf{o}^T\mathbf{d}^T = \mathbf{d}\mathbf{d}^T$$
    • Just like every positive real number has two square roots…
• If $$\mathbf{c}$$ is diagonal with non-negative entries, define $$\mathbf{c}^{1/2}$$ as the diagonal matrix of square roots
• If $$\mathbf{c} = \mathbf{w}\mathbf{\Lambda}\mathbf{w}^T$$, one square root is $$\mathbf{w}\mathbf{\Lambda}^{1/2}$$
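
A sketch of these facts for one small matrix. The matrix `c` and the rotation angle are arbitrary; the Cholesky factor is included only as one more example of a square root in the $$\mathbf{c} = \mathbf{d}\mathbf{d}^T$$ sense.

```python
import numpy as np

c = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # a symmetric positive-definite example

eigvals, w = np.linalg.eigh(c)             # c = w Lambda w^T
d = w @ np.diag(np.sqrt(eigvals))          # one square root: w Lambda^{1/2}

theta = 0.3                                # any rotation angle gives another root
o = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
d2 = d @ o                                 # (d o)(d o)^T = d d^T

chol = np.linalg.cholesky(c)               # lower-triangular root, yet another one

for root in (d, d2, chol):
    print(np.allclose(root @ root.T, c))
```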