\documentclass{article}
\usepackage{amsmath,amsfonts}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage[font={it,small},labelfont={sc,small}]{caption}
\usepackage{datetime}
\usepackage{fancyhdr}
% For putting the revision date on the bottom of each page
% Needs fancyhdr, datetime packages
\pagestyle{fancy}
\fancyhead{}
\fancyhead[LO,RE]{\thepage}
\fancyhead[RO,LE]{\rightmark}
\fancyfoot{}
\fancyfoot[CO,CE]{\currenttime\ \today}
\renewcommand{\headrulewidth}{0pt}
\fancypagestyle{plain}{
\fancyhf{}
\fancyhead[C]{\currenttime\ \today\\ \scriptsize{See updates and corrections at \url{http://www.stat.cmu.edu/~cshalizi/mreg/}}}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\DeclareMathOperator*{\argmin}{argmin}
\begin{document}
\title{Lecture 5: The Method of Least Squares for Simple Linear Regression}
\author{36-401, Fall 2015, Section B}
\date{15 September 2015}
\maketitle
% Set global options for knitr
<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE)
@
\tableofcontents
\section{Recapitulation}
Let's recap from last time. The simple linear regression model is a
statistical model for two variables, $X$ and $Y$. We use $X$ --- the {\bf
predictor} variable --- to try to predict $Y$, the {\bf target} or {\bf
response}\footnote{Older terms would be ``independent'' and ``dependent''
variables, respectively. These import an unwarranted suggestion of causality
or even deliberate manipulation on the part of $X$, so I will try to avoid
them.}. The assumptions of the model are:
\begin{enumerate}
\item The distribution of $X$ is arbitrary (and perhaps $X$ is even non-random).
\item If $X=x$, then $Y = \beta_0 + \beta_1 x + \epsilon$, for some constants
(``coefficients'', ``parameters'') $\beta_0$ and $\beta_1$, and some random
noise variable $\epsilon$.
\item $\Expect{\epsilon|X=x} = 0$ (no matter what $x$ is), $\Var{\epsilon|X=x}
= \sigma^2$ (no matter what $x$ is).
\item $\epsilon$ is uncorrelated across observations.
\end{enumerate}
In a typical situation, we also possess observations $(x_1, y_1), (x_2, y_2),
\ldots (x_n, y_n)$, which we presume are a realization of the model. Our goals
are to estimate the parameters of the model, and to use those parameters to
make predictions.
In the notes for the last lecture, we saw that we could estimate the parameters
by the {\bf method of least squares}: that is, of minimizing the in-sample
mean squared error:
\begin{equation}
\widehat{MSE}(b_0, b_1) \equiv \frac{1}{n}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))^2}
\end{equation}
In particular, we obtained the following results:
\paragraph{Normal or estimating equations} The least-squares estimates solve the {\bf normal} or {\bf estimating} equations:
\begin{eqnarray}
\label{eqn:estimating-eqn-for-beta.0} \overline{y} - \hat{\beta}_0 - \hat{\beta}_1 \overline{x} & = & 0\\
\label{eqn:estimating-eqn-for-beta.1} \overline{xy} - \hat{\beta}_0 \overline{x} - \hat{\beta}_1 \overline{x^2} & = & 0
\end{eqnarray}
\paragraph{Closed-form solutions} The solution to the estimating equations can
be given in closed form:
\begin{eqnarray}
\label{eqn:closed-form-beta1-hat} \hat{\beta}_1 & = & \frac{c_{XY}}{s^2_X}\\
\label{eqn:closed-form-beta0-hat} \hat{\beta}_0 & = & \overline{y} - \hat{\beta}_1 \overline{x}
\end{eqnarray}
\paragraph{Unbiasedness} The least-squares estimator is unbiased:
\begin{eqnarray}
\label{eqn:bias-of-beta0} \Expect{\hat{\beta}_0} & = & \beta_0 \\
\Expect{\hat{\beta}_1} & = & \beta_1
\end{eqnarray}
\paragraph{Variance shrinks like $1/n$} The variance of the estimator goes to $0$ as $n\rightarrow \infty$, like $1/n$:
\begin{eqnarray}
\Var{\hat{\beta}_1} & = & \frac{\sigma^2}{n s^2_X}\\
\label{eqn:var-of-beta0} \Var{\hat{\beta}_0} & = & \frac{\sigma^2}{n}\left(1 + \frac{\overline{x}^2}{s^2_X}\right)
\end{eqnarray}
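As a quick numerical check of Eqs.\ \ref{eqn:closed-form-beta1-hat}--\ref{eqn:closed-form-beta0-hat}, we can compare the closed-form estimates to what R's \texttt{lm} function reports (a sketch; the simulated data here are arbitrary, and the Gaussian noise is used purely for convenience):
<<>>=
# Simulate a small data set, then compare the closed-form least-squares
# estimates to R's lm(); the two should agree up to rounding error
x <- runif(50, min=-2, max=2)
y <- 5 - 2*x + rnorm(50)
# c_XY/s^2_X equals cov(x,y)/var(x), since the 1/(n-1) factors cancel
beta.1.hat <- cov(x, y)/var(x)
beta.0.hat <- mean(y) - beta.1.hat*mean(x)
c(beta.0.hat, beta.1.hat)
coefficients(lm(y ~ x))
@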
In these notes, I will try to explain a bit more of the general picture
underlying these results, and to explain what it has to do with prediction.
\section{In-Sample MSE vs.\ True MSE}
The true regression coefficients minimize the true MSE, which is (under
the simple linear regression model):
\begin{equation}
(\beta_0, \beta_1) = \argmin_{(b_0, b_1)} {\Expect{(Y- (b_0 + b_1 X))^2}}
\end{equation}
What we minimize instead is the mean squared error on the data:
\begin{equation}
(\hat{\beta}_0, \hat{\beta}_1) = \argmin_{(b_0, b_1)}{\frac{1}{n}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))^2}}
\end{equation}
This is the {\bf in-sample} or {\bf empirical} version of the MSE. It's clear
that it's a sample average, so for any {\em fixed} parameters $b_0, b_1$, when
the law of large numbers applies, we should have
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))^2} \rightarrow \Expect{(Y- (b_0 + b_1 X))^2}
\end{equation}
as $n \rightarrow \infty$. This should make it {\em plausible} that the
minimum of the function on the left is going to converge on the minimum of the
function on the right, but there can be tricky situations, with more complex
models, where this convergence doesn't happen.
To illustrate what I mean by this convergence, Figure \ref{fig:error-surfaces}
shows a sequence of surfaces of the MSE as a function of $(b_0, b_1)$. (The
simulation code is in Figure \ref{fig:error-surface-code}.) The first row
shows different in-sample MSE surfaces at a small value of $n$; the next row at
a larger value of $n$; the next row at a still larger value of $n$. What you
can see is that as $n$ grows, these surfaces all become more similar to each
other, and the locations of the minima are also becoming more similar. This
isn't a {\em proof}, but shows why it's worth looking for a proof.
\begin{figure}
<<>>=
# Simulate from a linear model with uniform X and t-distributed noise
# Inputs: number of points; intercept; slope; width of uniform X distribution
# (symmetric around 0); degrees of freedom for t
# Output: data frame with columns for X and Y
sim.linmod <- function(n, beta.0, beta.1, width, df) {
# draw n points from a uniform distribution centered on 0
x <- runif(n, min=-width/2, max=width/2)
# draw n points from a t distribution with the given number of degrees
# of freedom
epsilon <- rt(n, df=df)
# make y from a linear model
y <- beta.0 + beta.1*x + epsilon
# return the data frame
return(data.frame(x=x, y=y))
}
# Calculate in-sample MSE of a linear model
# First define a function that works for just one slope/intercept pair at
# time
# Then "Vectorize" it to handle vectors of intercepts and slopes
# Inputs: slope; intercept; data frame with "x" and "y" columns
# Output: the in-sample MSE
# Presumes: "y" is the target variable and "x" is the predictor
mse.insample <- function(b.0, b.1, data) { mean((data$y-(b.0+b.1*data$x))^2) }
mse.insample <- Vectorize(mse.insample, vectorize.args=c("b.0","b.1"))
# Grids of possible intercepts and slopes
b.0.seq <- seq(from=-10,to=10, length.out=20)
b.1.seq <- seq(from=-10,to=10, length.out=20)
# 3d wire-mesh ("perspective") plot of a linear model's error surface
# Input: data set; maximum value for Z axis (for comparability across plots)
# Output: Transformation matrix for adding new points/lines to the plot,
# invisibly --- see help(persp) under "Value". (Ignored here)
# ATTN: hard-coded slope/intercept sequences less than ideal
in.sample.persp <- function(data, zmax=600) {
  # Calculate the in-sample MSE for every combination of intercept and slope
z <- outer(b.0.seq, b.1.seq, mse.insample, data=data)
persp(b.0.seq, b.1.seq, z, zlim=c(0,zmax), xlab="Intercept",
ylab="Slope", zlab="MSE", ticktype="detailed")
}
@
\caption{Code to simulate from a linear model with $t$-distributed noise and
uniformly distributed $X$ (to emphasize that nothing here needs to be Gaussian);
to calculate the MSE of a linear model on a given data sample; and to
plot the error surface on a given data set.}
\label{fig:error-surface-code}
\end{figure}
\begin{figure}
<<error-surfaces, eval=FALSE>>=
par(mfrow=c(3,2))
in.sample.persp(sim.linmod(n=10,beta.0=5,beta.1=-2,width=4,df=3))
in.sample.persp(sim.linmod(n=10,beta.0=5,beta.1=-2,width=4,df=3))
in.sample.persp(sim.linmod(n=100,beta.0=5,beta.1=-2,width=4,df=3))
in.sample.persp(sim.linmod(n=100,beta.0=5,beta.1=-2,width=4,df=3))
in.sample.persp(sim.linmod(n=1e5,beta.0=5,beta.1=-2,width=4,df=3))
in.sample.persp(sim.linmod(n=1e5,beta.0=5,beta.1=-2,width=4,df=3))
par(mfrow=c(1,1))
@
<<echo=FALSE>>=
<<error-surfaces>>
@
\caption{Error surfaces for the linear model $Y=5-2X+\epsilon$, $\epsilon \sim
t_3$, $X \sim U(-2,2)$, at $n=10$ (top row), $n=100$ (middle) and $n=100000$
(bottom). Each column is an independent run of the model. Notice how these
become increasingly similar as $n$ grows.}
\label{fig:error-surfaces}
\end{figure}
\subsection{Existence and Uniqueness}
On any given finite data set, it is evident from Eqs.\
\ref{eqn:closed-form-beta1-hat}--\ref{eqn:closed-form-beta0-hat} that there is
always a least-squares estimate, {\em unless} $s^2_X = 0$, i.e., unless the
sample variance of $X$ is zero, i.e., unless all the $x_i$ have the same value.
(Obviously, with only one value of the $x$ coordinate, we can't work out the
slope of a line!) Moreover, if $s^2_X > 0$, then there is exactly {\em one}
combination of slope and intercept which minimizes the MSE in-sample.
One way to understand this algebraically is that the estimating equations give
us a system of two linear equations in two unknowns. As we remember from
linear algebra (or earlier), such systems have a unique solution, unless one of
the equations of the system is redundant. (See Exercise
\ref{exercise:linear-systems-of-estimating-equations}.)
Notice that this existence and uniqueness of a least-squares estimate assumes
{\em absolutely nothing} about the data-generating process. In particular, it
does {\em not} assume that the simple linear regression model is correct.
There is always {\em some} straight line that comes closest to our data points,
no matter how wrong, inappropriate or even just plain silly the simple linear
model might be.
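We can see the failure mode when $s^2_X = 0$ directly in R: with only one value of $x$, the slope is not identified, and \texttt{lm} reports it as \texttt{NA}. (A small illustration; the particular numbers are arbitrary.)
<<>>=
# All x values identical, so s^2_X = 0 and the slope cannot be estimated;
# lm() reports NA for the slope coefficient
constant.x <- data.frame(x=rep(3, times=10), y=rnorm(10))
coefficients(lm(y ~ x, data=constant.x))
@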
\section{Constant-Plus-Noise Representations}
In deriving the properties of the least-squares estimators, it is extremely
helpful to re-write them so that they have the form ``constant $+$ noise'', and
especially to try to write the noise as a sum of uncorrelated random variables.
This sort of ``representation'' of the estimator makes it much simpler to
determine its properties, because adding up constants and uncorrelated random
variables is what the rules of algebra from Lecture 1 make easy for us.
To this end, let's be explicit about writing out $\hat{\beta}_1$ in the
form of a constant plus a sum of uncorrelated noise random variables.
Begin with the fact that $\hat{\beta}_1$ is the ratio of the sample
covariance to the sample variance of $X$:
\begin{eqnarray}
\hat{\beta}_1 & = & \frac{c_{XY}}{s^2_X}\\
& = & \frac{\frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})(y_i - \overline{y})}}{s^2_X}\\
& = & \frac{\frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})y_i} - \frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})\overline{y}}}{s^2_X}
\end{eqnarray}
At this point, we need to pause for a fundamental fact which we will use often:
for {\em any} variable $z$, the average difference from the sample mean is zero: $n^{-1}\sum_{i}{(z_i - \overline{z})} = 0$. To see this, break up the sum of
the difference into a difference in sums:
\begin{eqnarray}
\frac{1}{n}\sum_{i=1}^{n}{(z_i -\overline{z})} & = & \frac{1}{n}\sum_{i=1}^{n}{z_i} - \frac{1}{n}\sum_{i=1}^{n}{\overline{z}}\\
& = & \overline{z} - \frac{n\overline{z}}{n} = 0
\end{eqnarray}
It follows that for any $w$ which is constant in $i$,
\begin{equation}
\label{eqn:average-difference-from-average-is-zero}
\frac{1}{n}\sum_{i=1}^{n}{(z_i - \overline{z})w} = 0
\end{equation}
Thus
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})\overline{y}} = 0
\end{equation}
So
\begin{equation}
\label{eqn:hat-beta1-in-terms-of-yi} \hat{\beta}_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})y_i}}{s^2_X}
\end{equation}
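Eq.\ \ref{eqn:hat-beta1-in-terms-of-yi} says that centering the $y_i$ makes no difference to the slope estimate, which is easy to check numerically (a sketch with arbitrary simulated numbers):
<<>>=
# Dropping y-bar from the sample covariance doesn't change the slope:
# the two expressions below should agree up to rounding error
x <- runif(20, min=-2, max=2)
y <- 5 - 2*x + rnorm(20)
sum((x - mean(x))*y)/sum((x - mean(x))^2)              # using y_i alone
sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)  # c_XY/s^2_X
@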
So far, we have not used any of our modeling assumptions. We now
do so. Specifically, we use the assumption that
\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\end{equation}
For reasons which should become clear momentarily, it will be more convenient
to write this in terms of how far $x_i$ is from the sample mean $\overline{x}$:
\begin{equation}
y_i = \beta_0 + \beta_1 \overline{x} + \beta_1 (x_i - \overline{x}) + \epsilon_i
\end{equation}
Now substitute this expression for $y_i$ into Eq.\ \ref{eqn:hat-beta1-in-terms-of-yi}:
\begin{eqnarray}
\hat{\beta}_1 & = & \frac{\frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})(\beta_0 + \beta_1\overline{x} + \beta_1 (x_i-\overline{x}) + \epsilon_i)}}{s^2_X}\\
& = & \frac{\frac{\beta_0+\beta_1 \overline{x}}{n}\sum_{i=1}^{n}{(x_i-\overline{x})} + \frac{\beta_1}{n}\sum_{i=1}^{n}{(x_i-\overline{x})^2} + \frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})\epsilon_i}}{s^2_X}
\end{eqnarray}
The first sum in the numerator is a constant times the average difference of
$x_i$ from $\overline{x}$, so it's zero (by Eq.\
\ref{eqn:average-difference-from-average-is-zero}). The second sum in the
numerator is just $s^2_X$ again. The third sum, because the $\epsilon_i$
are {\em not} constant, is not (necessarily) zero. Simplifying:
\begin{equation}
\label{eqn:hat-beta-1-is-constant-plus-noise}
\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i}
\end{equation}
Notice the form of Eq.\ \ref{eqn:hat-beta-1-is-constant-plus-noise}: it writes
our random estimator as a constant plus a weighted sum of the noise terms
$\epsilon_i$. In fact, by the fourth item in our listing of assumptions for
the simple linear regression model, it writes $\hat{\beta_1}$ as a constant
plus a weighted sum of {\em uncorrelated} noise terms.
It is now very easy to work out the expected value:
\begin{eqnarray}
\Expect{\hat{\beta}_1} & = & \Expect{\beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i}} \\
& = & \beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\Expect{\epsilon_i}} = \beta_1
\end{eqnarray}
or the variance:
\begin{eqnarray}
\Var{\hat{\beta}_1} & = & \Var{\beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i}}\\
& = & \Var{\sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i}}\\
& = & \sum_{i=1}^{n}{\frac{(x_i - \overline{x})^2}{n^2 s^4_X}\Var{\epsilon_i}}\\
& = & \sigma^2 \frac{n s^2_X}{n^2 s^4_X} = \frac{\sigma^2}{ns^2_X}
\end{eqnarray}
where the last line uses the modeling assumption that all of the $\epsilon_i$
have the same variance. (The next-to-last line uses the assumption that they
are uncorrelated.)
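The variance formula can be checked by simulation, holding the $x_i$ fixed across replications so that we compute the same conditional variance (a sketch; the design, sample size, and $t_3$ noise are arbitrary choices, and the agreement is only approximate):
<<>>=
# Monte Carlo check of Var(hat(beta)_1) = sigma^2/(n s^2_X),
# conditional on a fixed design
n <- 30
x <- runif(n, min=-2, max=2)
s2.x <- mean((x - mean(x))^2)  # note: the 1/n version, not var(x)
slope.estimates <- replicate(1e4, {
    y <- 5 - 2*x + rt(n, df=3)  # t_3 noise has variance 3/(3-2) = 3
    cov(x, y)/var(x)
})
var(slope.estimates)   # simulated variance of the estimator
3/(n*s2.x)             # theoretical variance
@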
So far, this is just re-capitulating stuff we've done already, but the exact
same strategy works for any estimator (or test statistic, etc.) which we can
manipulate into constant-plus-noise form. It's not {\em always} possible to do
this (though see the optional section \ref{sec:propagation-of-error}, and, for
the ambitious, \citealt{van-der-Vaart-asymptotic-stats}), but it's a very
powerful strategy when it works. To illustrate its power, we'll now use
it on predictions of the simple linear model, when estimated by least squares.
\section{Predictions}
\label{sec:prediction}
Remember that we got into all this mess not because we want to know
the numbers $\beta_0$ and $\beta_1$ for their own sake, but because we wanted
to predict $Y$ from $X$. How do we make those predictions, and how good
are they?
If we knew $\beta_0$ and $\beta_1$, and that $X=x$, then our
prediction\footnote{This is called a {\bf point} prediction; think of it as
``if you have to give one number, this is the best single number to give.''
We might also make {\bf interval} predictions (e.g., ``with probability $p$,
$Y$ will be in the interval $[l, u]$'') or {\bf distributional} predictions
(e.g., ``$Y$ will follow an $N(m,v)$ distribution'').} for $Y$ would be
$\beta_0 + \beta_1 x$. This is, assuming the simple linear regression model is
true, exactly $\Expect{Y|X=x}$, which we saw back in Lecture 1 is the best
prediction we can make. As $x$ changes, this prediction changes, but that's
precisely what we want --- the predictions will just follow points on the line.
Since we do not know $\beta_0$ and $\beta_1$, we fake it --- that is, we use
our estimates of the coefficients. At an {\em arbitrary} value of $X$, say $x$
(sometimes called the {\bf operating point}), we predict that on average $Y$
will be
\begin{equation}
\hat{m}(x) = \hat{\beta}_0 + \hat{\beta}_1 x
\end{equation}
This point prediction is called the {\bf fitted value}\footnote{The name
originates from thinking of $\epsilon$ as purely measurement error, so that
$\hat{m}(x)$ is our best-fitting estimate of the true value at $x$.} at $x$.
Notice the fitted value $\hat{m}(x)$ is an {\em estimate} of $\Expect{Y|X=x}$.
The latter is a perfectly deterministic quantity; it has the value $\beta_0 +
\beta_1 x$, which is some number or other, and we just happen not to know it.
But $\hat{m}(x)$ is a function of our data, which are random, hence
$\hat{m}(x)$ is also random. It inherits its randomness from $\hat{\beta}_0$
and $\hat{\beta}_1$, which in turn inherit theirs from $\overline{y}$ and
$c_{XY}$.
To analyze the randomness in $\hat{m}(x)$, we will represent it as constants
plus a weighted sum of uncorrelated noise terms. Using Eq.\ \ref{eqn:closed-form-beta0-hat},
\begin{eqnarray}
\hat{m}(x) & = & \hat{\beta}_0 + \hat{\beta}_1 x \\
& = & \overline{y} - \hat{\beta}_1 \overline{x} + \hat{\beta}_1 x\\
& = & \overline{y} + (x-\overline{x})\hat{\beta}_1
\end{eqnarray}
Using Eq.\ \ref{eqn:hat-beta-1-is-constant-plus-noise} and the definition of a
sample mean,
\begin{eqnarray}
\hat{m}(x) & = & \frac{1}{n}\sum_{i=1}^{n}{y_i} + (x -\overline{x})\left( \beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i} \right)\\
& = & \frac{1}{n}\sum_{i=1}^{n}{(\beta_0 + \beta_1 x_i + \epsilon_i)}
+ (x -\overline{x})\left( \beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i} \right)\\
& = & \beta_0 + \beta_1 \overline{x} + \frac{1}{n}\sum_{i=1}^{n}{\epsilon_i}
+ (x-\overline{x})\beta_1 + (x-\overline{x})\sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i}\\
& = & \beta_0 + \beta_1 x + \frac{1}{n}\sum_{i=1}^{n}{\left(1 + (x-\overline{x}) \frac{x_i - \overline{x}}{s^2_X} \right) \epsilon_i} \label{eqn:prediction-as-constant-plus-noise}
\end{eqnarray}
where in the last line I've canceled $\beta_1 \overline{x}$ terms of opposite
sign, and combined terms in the $\epsilon_i$. Also, the second line used the
second assumption in the simple linear regression model, that $Y$ is a linear
function of $X$ plus noise.
Now we can check whether or not our predictions are biased:
\begin{eqnarray}
\Expect{\hat{m}(x)} & = & \Expect{\beta_0 + \beta_1 x + \frac{1}{n}\sum_{i=1}^{n}{\left(1 + (x-\overline{x}) \frac{x_i - \overline{x}}{s^2_X} \right) \epsilon_i}}\\
& = & \beta_0 + \beta_1 x + \frac{1}{n}\sum_{i=1}^{n}{\left(1 + (x-\overline{x})\frac{x_i - \overline{x}}{s^2_X}\right)\Expect{\epsilon_i}}\\
& = & \beta_0 + \beta_1 x
\end{eqnarray}
This is to say, no, under the simple linear model, the predictions of
least squares are unbiased.
Of course, our predictions are somewhat random, because (as I said) they're
functions of the somewhat-random data we estimated the model on. What is
the variance of these predictions?
\begin{eqnarray}
\Var{\hat{m}(x)} & = & \Var{\beta_0 + \beta_1 x + \frac{1}{n}\sum_{i=1}^{n}{\left(1 + (x-\overline{x}) \frac{x_i - \overline{x}}{s^2_X} \right) \epsilon_i}}\\
& = & \Var{\frac{1}{n}\sum_{i=1}^{n}{\left(1 + (x-\overline{x}) \frac{x_i - \overline{x}}{s^2_X} \right) \epsilon_i}}\\
& = & \frac{1}{n^2}\sum_{i=1}^{n}{\left(1 + (x-\overline{x}) \frac{x_i - \overline{x}}{s^2_X} \right)^2 \Var{\epsilon_i}}\\
& = & \frac{\sigma^2}{n^2}\sum_{i=1}^{n}{1 + 2 (x-\overline{x})\frac{x_i - \overline{x}}{s^2_X} + (x-\overline{x})^2 \frac{(x_i - \overline{x})^2}{s^4_X} }\\
& = & \frac{\sigma^2}{n^2}\left(n + 0 + (x-\overline{x})^2\frac{ ns^2_X}{n s^4_X}\right)\\
& = & \frac{\sigma^2}{n}\left(1 + \frac{(x-\overline{x})^2}{s^2_X}\right)
\label{eqn:variance-of-prediction}
\end{eqnarray}
Notice what's going on here:
\begin{itemize}
\item The variance grows as $\sigma^2$ grows: the more noise there is around
the regression line, the harder we find it to estimate the regression line,
and the more of that noise will propagate into our predictions.
\item The larger $n$ is, the smaller the variance: the more points we see, the
more exactly we can pin down the line, and so our predictions.
\item The variance of our predictions is the sum of two terms. The first,
which doesn't depend on $x$, is $\sigma^2/n$, which is the variance of
$\overline{y}$ (Exercise \ref{exercise:variance-of-y-bar}). Since our line
  has to go through the center of the data, this is just how much noise there is
  in the height of that center.
\item The other term does change with $x$, specifically with
$(x-\overline{x})^2$: the further our operating point $x$ is from the center
of the data $\overline{x}$, the bigger our uncertainty. This is the
uncertainty that comes with being unable to pin down the slope precisely; it
therefore shrinks with $s^2_X$, since it's easier to find the slope when the
points have a wide spread on the horizontal axis.
\end{itemize}
Again, Eq.\ \ref{eqn:variance-of-prediction} is conditional on the $x_i$. If
those are random, we need to use the law of total variance (as in the last
lecture) to get the unconditional variance of $\hat{m}(x)$.
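Eq.\ \ref{eqn:variance-of-prediction} can be checked by simulation in the same conditional-on-$x$ spirit (again a sketch; all the settings are arbitrary and the agreement is approximate):
<<>>=
# Monte Carlo check of the variance of the fitted value at an operating point
n <- 30
x <- runif(n, min=-2, max=2)
s2.x <- mean((x - mean(x))^2)
x.new <- 1.5   # an arbitrary operating point
m.hats <- replicate(1e4, {
    y <- 5 - 2*x + rt(n, df=3)   # t_3 noise, variance 3
    b <- coefficients(lm(y ~ x))
    b[1] + b[2]*x.new
})
var(m.hats)                            # simulated variance
(3/n)*(1 + (x.new - mean(x))^2/s2.x)   # theoretical variance
@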
Figure \ref{fig:scatter-of-regression-lines} illustrates how the spread in
point predictions changes as we move away from the mean of the $x$ values.
\begin{figure}
<<sample-lines, eval=FALSE>>=
# Create an empty plot (type="n" for "do Nothing")
plot(0,type="n",xlim=c(-10,10),ylim=c(-10,10),xlab="x",ylab="y")
# Add the true regression line; exaggerate width so it stands out
abline(a=5, b=-2, lwd=5)
# Repeat 20 times: simulate a data set, fit a line to the simulation, add
# the fitted line to the plot
invisible(replicate(20, abline(lm(y ~ x, data=sim.linmod(n=10,beta.0=5,
                                  beta.1=-2,width=4,df=3)),
                               col="grey")))
@
<<echo=FALSE>>=
<<sample-lines>>
@
\caption{Scatter of estimated least-squares regression lines (thin, grey)
around the true regression line (thick, black). Notice how the estimated
lines become more spread out as we move away from the center of the
distribution (here conveniently set at $X=0$).}
\label{fig:scatter-of-regression-lines}
\end{figure}
\clearpage
\section{Estimating $\sigma^2$; Sum of Squared Errors}
Under the simple linear regression model, it is easy to show (Exercise \ref{exercise:sigmasq-is-generalization-mse}) that
\begin{equation}
\label{eqn:sigmasq-is-generalization-mse}
\Expect{(Y-(\beta_0 + \beta_1 X))^2} = \sigma^2
\end{equation}
This suggests that the minimal value of the in-sample MSE,
\begin{equation}
\label{eqn:minimal-in-sample-mse}
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{m}(x_i))^2}
\end{equation}
is a natural estimator for $\sigma^2$. This is, in fact, a consistent
estimator. (You can prove this using the consistency of $\hat{\beta}_0$ and
$\hat{\beta}_1$, and continuity.) It is, however, a slightly biased estimator.
Specifically (Exercise \ref{exercise:debiasing-mse})
\begin{equation}
\label{eqn:unbiased-in-sample-mse} s^2 = \frac{n}{n-2} \hat{\sigma}^2
\end{equation}
is an {\em un}-biased estimator of $\sigma^2$, though one with a larger
variance. Some people, accordingly, use Eq.\ \ref{eqn:unbiased-in-sample-mse},
rather than Eq.\ \ref{eqn:minimal-in-sample-mse}, as their definition
of ``MSE''.
This is mostly something to be aware of when different pieces of R code,
textbooks, papers, etc., differ in what they are calling ``MSE''; to get
sensible results, you may need to apply conversion factors in one direction or
the other. As usual, if the difference between $1/n$ and $1/(n-2)$ is large
enough to make a difference to your conclusions, you should really be asking
yourself whether you have enough data to be doing any statistics at all.
\paragraph{Sum of squared errors} The sum of squared errors for a fitted
regression is just what it sounds like:
\begin{equation}
SSE = \sum_{i=1}^{n}{(y_i - \hat{m}(x_i))^2} = \sum_{i=1}^{n}{(y_i - (\hat{\beta}_0+\hat{\beta}_1 x_i))^2}
\end{equation}
It's mostly important as a historical relic, from the days when people fit
regression models by hand, or with slide rules and adding machines, and so
every bit of arithmetic you could avoid was a win.
\section{Residuals}
The {\bf residual} value at a data point is the difference between
the actual value of the response $y_i$ and the fitted value $\hat{m}(x_i)$:
\begin{equation}
\label{eqn:ith-residual-defined}
e_i = y_i - \hat{m}(x_i) = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)
\end{equation}
This may look like re-arranging the basic equation for the
linear regression model,
\begin{equation}
\label{eqn:ith-noise-term}
\epsilon_i = Y_i - (\beta_0 + \beta_1 x_i)
\end{equation}
and it {\em is} similar, but it's {\em not} the same. The right-hand
side of Eq.\ \ref{eqn:ith-noise-term} involves the true parameters.
The right-hand side of Eq.\ \ref{eqn:ith-residual-defined} involves the
{\em estimated} parameters, which are different. In particular, the estimated
parameters are functions of {\em all} the noise variables. Therefore
\begin{quotation}
The residuals are not the noise terms; $e_i \neq \epsilon_i$
\end{quotation}
There are some ways in which the residuals are {\em like} the noise terms.
For example, the residuals are always uncorrelated with the $x_i$:
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}{e_i (x_i - \overline{x})} = 0
\end{equation}
However, these facts (and others like them, which you will get to prove in the
homework) are consequences of the estimating equations, and are true {\em
whether or not} the simple linear regression model is actually true. Another
consequence of the estimating equations is that
\begin{equation}
\label{eqn:residuals-average-to-zero}
\frac{1}{n}\sum_{i=1}^{n}{e_i} = 0
\end{equation}
This is {\em reminiscent} of $\Expect{\epsilon} = 0$, but generally
$n^{-1}\sum_{i=1}^{n}{\epsilon_i} \neq 0$. In fact, Eq.\
\ref{eqn:residuals-average-to-zero} implies that the residuals are actually
linearly dependent on each other, while the $\epsilon_i$ are not.
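Both identities are easy to verify numerically (assuming the \texttt{sim.linmod} function defined earlier):
<<>>=
# The residuals average to zero and are uncorrelated with x; these are
# algebraic consequences of the estimating equations, so they hold up to
# floating-point rounding no matter what the data look like
resid.data <- sim.linmod(n=50, beta.0=5, beta.1=-2, width=4, df=3)
e <- residuals(lm(y ~ x, data=resid.data))
mean(e)
mean(e*(resid.data$x - mean(resid.data$x)))
@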
Despite these differences, there is enough of a relationship between
the $\epsilon_i$ and the $e_i$ that a lot of our model-checking and
diagnostics will be done in terms of the residuals. You should get used to
looking at them for basically any model you estimate, or even think
seriously about estimating.
\section{Limitations of Least Squares}
The results in this handout, and the last, almost exhaust the theory of
statistical inference for least squares estimates in the simple linear
regression model\footnote{The main exception is a result called the {\bf
Gauss-Markov theorem}: the least squares estimator has smaller variance
than any other unbiased estimator which is a linear function of the $y_i$.
This was more impressive when nobody had the computing power to use nonlinear
estimators\ldots}. Going beyond the mean and variance of parameter estimates
or predicted values is pretty much impossible, using {\em just} least squares
and the simple linear regression model.
In particular, we can't get {\bf sampling distributions} for anything --- we
can't say what probability law $\hat{\beta}_1$ follows, even as a function of
the true parameters $\beta_0, \beta_1, \sigma^2$. There are just too many
possibilities which are compatible with the model assumptions. Since, as you
remember from your mathematical statistics course, we need sampling
distributions to form confidence intervals, evaluate the properties of
hypothesis tests, etc., etc., there is really not much more to say about this
combination of model and method.
\paragraph{Chebyshev} If we absolutely must do some probability calculations
under the least-squares assumptions, the best we can usually do is to invoke
Chebyshev's inequality (the extra credit problem in homework 1): for any
random variable $Z$ with expected value $\mu$ and finite variance, and any $r > 0$,
\begin{equation}
\Prob{|Z-\mu| \geq r} \leq \frac{\Var{Z}}{r^2}
\end{equation}
In particular, we can say that
\begin{equation}
\Prob{|Z-\mu| \geq k \sqrt{\Var{Z}}} \leq \frac{1}{k^2}
\end{equation}
These probability bounds are {\em very} loose, so if we do try to use them
to do hypothesis tests, they have very little power (equivalently:
the confidence intervals we get are huge).
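To see just how loose, we can compare the Chebyshev bound to the exact tail probabilities of a standard Gaussian (an illustration only; Chebyshev's whole point is that it applies far beyond the Gaussian case):
<<>>=
# Chebyshev bound vs. exact two-sided tail probability for N(0,1)
k <- 1:4
chebyshev.bound <- 1/k^2
gaussian.tail <- 2*pnorm(-k)   # P(|Z - mu| >= k sd) for a Gaussian
rbind(chebyshev.bound, gaussian.tail)
@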
\paragraph{Asymptotic Gaussianity} The right-hand side of Eq.\
\ref{eqn:hat-beta-1-is-constant-plus-noise} shows that $\hat{\beta}_1$ is
$\beta_1$ plus a weighted average of the $\epsilon_i$. If we add to the simple
linear regression model the assumption that the $\epsilon_i$ are IID draws from
a fixed, {\em not-necessarily-Gaussian} distribution, we might then try to use
the central limit theorem to show that the weighted average tends towards a
Gaussian as $n\rightarrow\infty$. This can be done in some generality, but it
needs more delicate probability theory than the rest of what we are doing, and
if the initial distribution of the $\epsilon_i$ is, say, appreciably skewed,
$n$ might have to be truly huge before the Gaussian approximation is any
good\footnote{To those who think everything is Gaussian once $n \geq 30$, all I
can say is ``Bless your heart.''}.
\section{Least-Squares in R}
The basic command for fitting a linear model by least squares in R is
\texttt{lm}. It has a huge number of options, and can do a lot more than
we will ask it to here, but for our purposes we use it as follows:
<<eval=FALSE>>=
lm(y ~ x, data=df)
@
Here \texttt{df} is a data frame containing the data we want to fit a
regression to, and the first part, the {\bf formula}, tells \texttt{lm} that we
want to regress the column of \texttt{df} called \texttt{y} on the column
called \texttt{x}. By default\footnote{There are ways to tweak this, some of
which we'll see later in the course.}, \texttt{lm} always fits its
regression models by least squares.
What \texttt{lm} returns is a rather complicated object. If you just
print it out, it seems to be only the intercept and the slope:
<<>>=
# Make a very small simulated data set from our running example
toy.data <- sim.linmod(n=10, beta.0=5, beta.1=-2, width=4, df=3)
# Fit the simple linear regression model to it by least squares
lm(y~x, data=toy.data)
@
In fact, \texttt{lm} has done {\em lots} of calculations as part of fitting
the model, and stored many of the results into the object it returns; R just
doesn't print all of that, unless you make it. We can see some of what's
in there, though:
<<>>=
names(lm(y~x, data=toy.data))
@
(See \texttt{help(lm)}, under ``Value'', for the gory details.) It's
annoying (and slow and error-prone) to keep having R re-estimate the model
every time we want to refer back to it, so we usually store the estimated
model under a new variable name:
<<>>=
# Fit a linear model to the toy data, and store as toy.lm
# The name of the estimated model needn't match that of the data, but it's
# often a good idea
toy.lm <- lm(y~x, data=toy.data)
@
We can access some of what's in the \texttt{lm} object by using special
functions. A couple in particular will become close and familiar friends.
\texttt{coefficients} gives us the vector of coefficients:
<<>>=
coefficients(toy.lm)
@
\texttt{fitted} gives us the vector of fitted values, in the order which
the data points appeared:
<<>>=
fitted(toy.lm)
@
\texttt{residuals} gives us the vector of residuals (ditto order):
<<>>=
residuals(toy.lm)
@
(How would you use \texttt{residuals} to calculate $s^2$? To calculate
$\frac{n}{n-2}s^2$?)
You might think that \texttt{plot(toy.lm)} would draw a picture of the fitted
model; instead, it goes through a bunch of diagnostic plots, which we will get
to later. If we want to add a line based on the model to an existing plot, we
use \texttt{abline}, as in Figure \ref{fig:toy-model}.
\begin{figure}
<<toy-model-code, eval=FALSE>>=
plot(y~x, data=toy.data, xlab="x",ylab="y", main="Simulated ('toy') data")
abline(toy.lm)
@
<<toy-model, echo=FALSE>>=
<<toy-model-code>>
@
\caption{Using \texttt{abline} to add the line of an estimated linear
regression model to a plot.}
\label{fig:toy-model}
\end{figure}
\texttt{fitted} gives us the model's predictions at the particular $x_i$
where we happened to have data. In principle, though, the model has an
opinion about what $\Expect{Y|X=x}$ should be at every possible value of
$x$. To extract that, we use a function called \texttt{predict}. Naturally
enough, we need to tell it both which model we want to use (since we could
have more than one floating around), {\em and} where to make
the predictions:
<<eval=FALSE>>=
predict(object, newdata)
@
Here the first argument, what \texttt{predict} somewhat obscurely calls
\texttt{object}, is the estimated regression model, like our \texttt{toy.lm}.
(It is {\em not} the name of the estimating function, like \texttt{lm}.)
The second argument, \texttt{newdata}, is a data frame with a column whose
name matches the column name of the predictor variable in our original
data frame. Thus
<<>>=
predict(toy.lm, newdata=data.frame(x=1:5))
@
gives us the fitted model's predictions at the integers from 1 to 5.
You might well think that if \texttt{newdata} were missing, then
\texttt{predict} would throw an error. You might very well think that.
<<>>=
predict(toy.lm)
predict(toy.lm, data=data.frame(x=1:5)) # What's wrong here?
@
For reasons best known to the designers of R\footnote{Really, the designers of
the predecessor language, S.}, when \texttt{newdata} is missing or
mal-formed, \texttt{predict} returns the fitted values on the original data.
On the other hand, you {\em will} get an error if \texttt{newdata}
exists but doesn't contain the right column name:
<<>>=
predict(toy.lm, newdata=data.frame(zebra=1:5))
@
Extraneous columns, however, are just ignored:
<<>>=
predict(toy.lm, newdata=data.frame(x=1:5, zebra=6:10))
@
There is one further option to \texttt{predict} which is worth mentioning at
this time. If we set \texttt{se.fit=TRUE}, we get the standard errors of the
fitted values, i.e., the square roots of the variances\footnote{If a homework
problem asks you to calculate the variance of a predicted value, it's
(generally) asking you to do the character-building work of actually putting
numbers into an algebraic formula by yourself, though you can use this to
check your work.}:
<<>>=
predict(toy.lm, newdata=data.frame(x=1:5), se.fit=TRUE)
@
Notice that what this gives us back is not a vector but a {\em list}, whose
first two entries are vectors. (We will come back to what the \texttt{df}
means, but you should already be able to guess what \texttt{residual.scale}
might be.)
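As a first taste of what these standard errors are good for, here is a sketch
(my own, not an example from the text above) which plots the toy data, the
fitted line, and curves lying two standard errors above and below it; the
grid of $x$ values and the factor of two are arbitrary choices:
<<eval=FALSE>>=
# Sketch: fitted line plus and minus two standard errors of the fitted values
x.grid <- seq(from=min(toy.data$x), to=max(toy.data$x), length.out=100)
preds <- predict(toy.lm, newdata=data.frame(x=x.grid), se.fit=TRUE)
plot(y~x, data=toy.data)
lines(x.grid, preds$fit)
lines(x.grid, preds$fit + 2*preds$se.fit, lty="dashed")
lines(x.grid, preds$fit - 2*preds$se.fit, lty="dashed")
@
The dashed curves are not yet confidence bands in any formal sense; making
that interpretation precise is business for later lectures.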
\section{Propagation of Error, {\em alias} ``The Delta Method''}
\label{sec:propagation-of-error}
\begin{quotation}
An optional section, but a very useful one.
\end{quotation}
The constant-plus-sum-of-noise-terms trick is the core of an extremely handy
technique for getting approximate variances and standard errors for functions
of quantities which are themselves estimated with error. It is known variously
as ``propagation of error'' or (more obscurely) as ``the delta method''.
Suppose we are trying to estimate some quantity $\theta$. We compute an
estimate $\widehat{\theta}$, based on our data. Since our data is more or less
random, so is $\widehat{\theta}$. One convenient way of measuring the purely
statistical noise or uncertainty in $\widehat{\theta}$ is its standard
deviation. This is the {\bf standard error} of our estimate of
$\theta$.\footnote{It is not, of course, to be confused with the standard
deviation of the data. It is not even to be confused with the standard error
of the mean, unless $\theta$ is the expected value of the data and
$\widehat{\theta}$ is the sample mean.} Standard errors are not the only way
of summarizing this noise, nor a completely sufficient way, but they are often
useful.
Suppose that our estimate $\widehat{\theta}$ is a function of some intermediate
quantities $\widehat{\psi_1}, \widehat{\psi_2}, \ldots, \widehat{\psi_p}$,
which are also estimated:
\begin{equation}
\widehat{\theta} = f(\widehat{\psi_1},\widehat{\psi_2}, \ldots \widehat{\psi_p} )
\end{equation}
For instance, $\theta$ might be the difference in expected values between two
groups, with $\psi_1$ and $\psi_2$ the expected values in the two groups, and
$f(\psi_1,\psi_2) = \psi_1 - \psi_2$. If we have a standard error for each of
the original quantities $\widehat{\psi_i}$, it would seem like we should be
able to get a standard error for the {\bf derived quantity} $\widehat{\theta}$.
Propagation of error achieves this, by writing $\widehat{\theta}$ in
the constant-plus-noise form.
We start with a Taylor expansion. We'll write $\psi_i^*$ for the true
(population, distribution, ensemble) value which is estimated by
$\widehat{\psi_i}$.
\begin{eqnarray}
f(\psi_1^*,\psi_2^*,\ldots \psi_p^*) & \approx & f(\widehat{\psi_1},\widehat{\psi_2},\ldots\widehat{\psi_p}) + \sum_{i=1}^{p}{(\psi_i^* - \widehat{\psi_i}) {\left.\frac{\partial f}{\partial \psi_i}\right|}_{\psi = \widehat{\psi}}}\\
f(\widehat{\psi_1},\widehat{\psi_2},\ldots\widehat{\psi_p}) & \approx & f(\psi_1^*,\psi_2^*,\ldots \psi_p^*) + \sum_{i=1}^{p}{(\widehat{\psi_i}-\psi_i^*) {\left.\frac{\partial f}{\partial \psi_i}\right|}_{\psi = \widehat{\psi}}}\\
\hat{\theta} & \approx & \theta^* + \sum_{i=1}^{p}{(\widehat{\psi_i} - \psi_i^*) f^{\prime}_i(\widehat{\psi})}
\label{eqn:estimate-in-taylor-expansion}
\end{eqnarray}
introducing $f^{\prime}_i$ as an abbreviation for $\frac{\partial f}{\partial
\psi_i}$. The left-hand side is now the quantity whose standard error we
want. I have done this manipulation because now $\hat{\theta}$ is a linear
function (approximately!) of some random quantities whose variances we know,
and some derivatives which we can calculate.
Remember (from Lecture 1) the rules for arithmetic with variances: if $X$ and $Y$ are random variables, and $a$, $b$ and $c$ are constants,
\begin{eqnarray}
\Var{a} & = & 0\\
\Var{a+bX} & = & b^2\Var{X}\\
\Var{a+bX+cY} & = & b^2\Var{X} + c^2\Var{Y} + 2bc\Cov{X,Y}
\end{eqnarray}
While
we don't know $f(\psi_1^*,\psi_2^*,\ldots \psi_p^*)$, it's constant, so it has
variance 0. Similarly, $\Var{\widehat{\psi_i} - \psi_i^*} =
\Var{\widehat{\psi_i}}$. Repeatedly applying these rules to Eq.\ \ref{eqn:estimate-in-taylor-expansion},
\begin{equation}
\Var{\widehat{\theta}} \approx \sum_{i=1}^{p}{(f^{\prime}_i(\widehat{\psi}))^2 \Var{\widehat{\psi_i}}} + 2\sum_{i=1}^{p-1}{\sum_{j=i+1}^{p}{f^{\prime}_i(\widehat{\psi}) f^{\prime}_{j}(\widehat{\psi}) \Cov{\widehat{\psi_i},\widehat{\psi_j}}}}
\label{eqn:prop-of-error-formula}
\end{equation}
The standard error for $\widehat{\theta}$ would then be the square root of this.
If we follow this rule for the simple case of group differences, $f(\psi_1,\psi_2) = \psi_1 - \psi_2$, we find that
\begin{equation}
\Var{\widehat{\theta}} = \Var{\widehat{\psi_1}} + \Var{\widehat{\psi_2}} - 2\Cov{\widehat{\psi_1},\widehat{\psi_2}}
\end{equation}
just as we would find from the basic rules for arithmetic with variances. The
approximation in Eq.\ \ref{eqn:prop-of-error-formula} comes from the
nonlinearities in $f$.
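To see the role of the nonlinearity concretely (an illustration of mine, not
an example from the text), take $p=1$ and $f(\psi) = \psi^2$, so that
$f^{\prime}(\widehat{\psi}) = 2\widehat{\psi}$ and Eq.\
\ref{eqn:prop-of-error-formula} gives
\begin{equation}
\Var{\widehat{\psi}^2} \approx 4\widehat{\psi}^2 \Var{\widehat{\psi}}
\end{equation}
If $\widehat{\psi}$ happens to be Gaussian with mean $\psi^*$ and variance
$\tau^2$, the exact variance of $\widehat{\psi}^2$ works out to
$4{\psi^*}^2\tau^2 + 2\tau^4$, so the linearization drops a term of order
$\tau^4$, which is negligible when the noise in $\widehat{\psi}$ is small.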
If the estimates of the initial quantities are uncorrelated, Eq.\
\ref{eqn:prop-of-error-formula} simplifies to
\begin{equation}
\Var{\widehat{\theta}} \approx \sum_{i=1}^{p}{(f^{\prime}_i(\widehat{\psi}))^2 \Var{\widehat{\psi_i}}} \label{eqn:uncorrelated-prop-of-error}
\end{equation}
and, again, the standard error of $\widehat{\theta}$ would be the square root
of this. The special case of Eq.\ \ref{eqn:uncorrelated-prop-of-error} is
sometimes called {\em the} propagation of error formula, but I think it's
better to use that name for the more general Eq.\
\ref{eqn:prop-of-error-formula}.
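It is easy, and reassuring, to check Eq.\
\ref{eqn:uncorrelated-prop-of-error} by simulation. The following sketch
(mine, not code from the text) takes $\theta = e^{\psi}$, with $\psi$
estimated by the mean of $n$ Gaussian observations, and compares the
delta-method standard error to the standard deviation of $\widehat{\theta}$
across many simulated data sets; the particular numbers are arbitrary:
<<eval=FALSE>>=
# Check propagation of error by simulation: theta = exp(psi),
# with psi estimated by the mean of n Gaussian observations
n <- 100; psi.star <- 1; sigma <- 0.5
theta.hats <- replicate(1e4, exp(mean(rnorm(n, mean=psi.star, sd=sigma))))
sd(theta.hats)                  # standard error by simulation
exp(psi.star)*sigma/sqrt(n)     # delta method, evaluated at the true psi
@
Here $f^{\prime}(\psi) = e^{\psi}$ and $\Var{\widehat{\psi}} = \sigma^2/n$,
so the two numbers should agree to within a few percent.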
\section*{Exercises}
To think through or practice on, not to hand in.
\begin{enumerate}
\item {\em True MSE of a linear model} Prove that the full-distribution MSE of
a linear model with intercept $b_0$ and slope $b_1$ is
\begin{equation}
\Var{Y} + (\Expect{Y})^2 + b_0^2 - 2b_0 \Expect{Y} - 2b_1 \Cov{X,Y} - 2
b_1\Expect{X}\Expect{Y} + 2 b_0 b_1 \Expect{X} + b_1^2\Var{X} + b_1^2
(\Expect{X})^2
\end{equation}
\item \label{exercise:linear-systems-of-estimating-equations} Show that if all
$x_i = \overline{x}$, then the system of linear equations, Eqs.\
\ref{eqn:estimating-eqn-for-beta.0}--\ref{eqn:estimating-eqn-for-beta.1},
actually contains only one linearly-independent equation. {\em Hint:} Write
the system of equations as a matrix multiplying the vector whose entries are
$(\hat{\beta}_0, \hat{\beta}_1)$, and consider the determinant of the matrix.
(How does the determinant of such a matrix relate to whether the equations
are all linearly independent?)
\item \label{exercise:variance-of-y-bar} Show that, under the simple linear
regression model, $\Var{\overline{y}} = \sigma^2/n$. You may treat the $x_i$
as fixed in this calculation.
\item Derive Eqs.\ \ref{eqn:bias-of-beta0} and \ref{eqn:var-of-beta0} from the
results in \S \ref{sec:prediction}. (Hint: $\hat{\beta}_0 = \hat{m}(x)$ for
what value of $x$?) Is this a circular argument?
\item \label{exercise:sigmasq-is-generalization-mse} Prove Eq.\ \ref{eqn:sigmasq-is-generalization-mse}.
\item \label{exercise:debiasing-mse} Express the right-hand side of Eq.\ \ref{eqn:minimal-in-sample-mse} in
terms of $\beta_0$, $\beta_1$, the $x_i$ and the $\epsilon_i$. Use this
expression to find $\Expect{\hat{\sigma}^2}$, and show that it equals
$\frac{n-2}{n}\sigma^2$.
\end{enumerate}
\bibliography{locusts}
\bibliographystyle{crs}
\end{document}