\documentclass{article}
\usepackage{amsmath,amsfonts}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage[font={it,small},labelfont={sc,small}]{caption}
\usepackage{datetime}
\usepackage{fancyhdr}
% For putting the revision date on the bottom of each page
% Needs fancyhdr, datetime packages
\pagestyle{fancy}
\fancyhead{}
\fancyhead[LO,RE]{\thepage}
\fancyhead[RO,LE]{\rightmark}
\fancyfoot{}
\fancyfoot[CO,CE]{\currenttime\ \today}
\renewcommand{\headrulewidth}{0pt}
\fancypagestyle{plain}{
\fancyhf{}
\fancyhead[C]{\currenttime\ \today\\ \scriptsize{See updates and corrections at \url{http://www.stat.cmu.edu/~cshalizi/mreg/}}}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\DeclareMathOperator*{\argmin}{argmin}
\begin{document}
\title{Lecture 4: Simple Linear Regression Models, with Hints at Their Estimation}
\author{36-401, Fall 2015, Section B}
\date{10 September 2015}
\maketitle
% Set global options for knitr
<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small',background='white')
@
\section{The Simple Linear Regression Model}
Let's recall the simple linear regression model from last time. This is a
statistical model with two variables $X$ and $Y$, where we try to predict $Y$
from $X$. The assumptions of the model are as follows:
\begin{enumerate}
\item The distribution of $X$ is arbitrary (and perhaps $X$ is even non-random).
\item If $X=x$, then $Y = \beta_0 + \beta_1 x + \epsilon$, for some constants
(``coefficients'', ``parameters'') $\beta_0$ and $\beta_1$, and some random
noise variable $\epsilon$.
\item $\Expect{\epsilon|X=x} = 0$ (no matter what $x$ is), $\Var{\epsilon|X=x}
= \sigma^2$ (no matter what $x$ is).
\item $\epsilon$ is uncorrelated across observations.
\end{enumerate}
To elaborate: with multiple data points $(X_1, Y_1), (X_2, Y_2), \ldots (X_n, Y_n)$, the model says that, for {\em each} $i \in 1:n$,
\begin{equation}
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
\end{equation}
where the noise variables $\epsilon_i$ all have the same expectation (0) and
the same variance ($\sigma^2)$, and $\Cov{\epsilon_i, \epsilon_j} = 0$ (unless
$i=j$, of course).
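To make these assumptions concrete, here is a small R simulation that generates one data set satisfying all four of them; the parameter values ($\beta_0 = 5$, $\beta_1 = -2$, $\sigma = 3$) are made up for illustration.

```r
# Simulate one data set from the simple linear regression model
# (illustrative, made-up parameter values)
set.seed(36401)
n <- 100
beta.0 <- 5; beta.1 <- -2; sigma <- 3
x <- runif(n, min=0, max=10)            # X's distribution is arbitrary (assumption 1)
epsilon <- rnorm(n, mean=0, sd=sigma)   # mean 0, constant variance, uncorrelated (3, 4)
y <- beta.0 + beta.1*x + epsilon        # the linear model (assumption 2)
```

Here \texttt{rnorm} is just a convenient way to get mean-zero, constant-variance noise; the model itself does not require the noise to be Gaussian.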
\subsection{``Plug-In'' Estimates}
In lecture 1, we saw that the optimal linear predictor of $Y$ from $X$ has
slope $\beta_1 = \Cov{X,Y}/\Var{X}$, and intercept $\beta_0 = \Expect{Y} -
\beta_1 \Expect{X}$. A common tactic in devising estimators is to use what's
sometimes called the ``plug-in principle'', where we find equations for the
parameters which would hold if we knew the full distribution, and ``plug in''
the sample versions of the population quantities. We saw this in the last
lecture, where we estimated $\beta_1$ by the ratio of the sample covariance to
the sample variance:
\begin{equation}
\widehat{\beta_1} = \frac{c_{XY}}{s^2_{X}}
\end{equation}
We also saw, in the notes to the last lecture, that so long as the law of large numbers holds,
\begin{equation}
\widehat{\beta_1} \rightarrow \beta_1
\end{equation}
as $n\rightarrow\infty$. It follows easily that
\begin{equation}
\widehat{\beta_0} = \overline{Y} - \widehat{\beta_1}\overline{X}
\end{equation}
will also converge on $\beta_0$.
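In R, the plug-in estimates take one line each. Note that \texttt{cov} and \texttt{var} divide by $n-1$ rather than $n$, but the convention cancels in the ratio. (The data here are tiny and made up.)

```r
# Plug-in estimates of slope and intercept from data vectors x and y
x <- c(1, 2, 3, 4, 5)                  # made-up example data
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
beta.1.hat <- cov(x, y) / var(x)       # = c_XY / s^2_X (n vs. n-1 cancels)
beta.0.hat <- mean(y) - beta.1.hat * mean(x)
```

These agree with the coefficients \texttt{lm(y \~{} x)} reports, for reasons the next subsection makes clear.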
\subsection{Least Squares Estimates}
An alternative way of estimating the simple linear regression model starts from
the objective we are trying to reach, rather than from the formula for the
slope. Recall, from lecture 1, that the true optimal slope and intercept are
the ones which minimize the mean squared error:
\begin{equation}
(\beta_0, \beta_1) = \argmin_{(b_0, b_1)}{\Expect{(Y-(b_0 + b_1 X))^2}}
\end{equation}
This is a function of the complete distribution, so we can't get it
from data, but we can approximate it with data. The {\bf in-sample},
{\bf empirical} or {\bf training} MSE is
\begin{equation}
\widehat{MSE}(b_0, b_1) \equiv \frac{1}{n}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))^2}
\end{equation}
Notice that this is a function of $b_0$ and $b_1$; it is also, of course, a
function of the data, $(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$, but we will
generally suppress that in our notation.
If our samples are all independent, for any fixed $(b_0, b_1)$, the law of
large numbers tells us that $\widehat{MSE}(b_0,b_1) \rightarrow MSE(b_0,b_1)$
as $n\rightarrow\infty$. So it doesn't seem unreasonable to try minimizing the
in-sample error, which we can compute, as a proxy for minimizing the true MSE,
which we can't. Where does it lead us?
Start by taking the derivatives with respect to the intercept and the slope:
\begin{eqnarray}
\frac{\partial \widehat{MSE}}{\partial b_0} & = & \frac{1}{n}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))(-2)}\\
\frac{\partial \widehat{MSE}}{\partial b_1} & = & \frac{1}{n}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))(-2x_i)}
\end{eqnarray}
Set these to zero at the optimum $(\hat{\beta}_0, \hat{\beta}_1)$:
\begin{eqnarray}
\label{eqn:normal-equations-for-least-squares}
\frac{1}{n}\sum_{i=1}^{n}{(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))} & = & 0\\
\nonumber \frac{1}{n}\sum_{i=1}^{n}{(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))(x_i)} & = & 0
\end{eqnarray}
These are often called the {\bf normal equations} for least-squares estimation,
or the {\bf estimating equations}: a system of two equations in two unknowns,
whose solution gives the estimate. Many people would, at this point, remove the factor of $1/n$, but keeping it makes the next steps easier to follow:
\begin{eqnarray}
\overline{y} - \hat{\beta}_0 - \hat{\beta}_1 \overline{x} & = & 0\\
\overline{xy} - \hat{\beta}_0 \overline{x} - \hat{\beta}_1 \overline{x^2} & = & 0
\end{eqnarray}
The first equation, re-written, gives
\begin{equation}
\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}
\end{equation}
Substituting this into the remaining equation,
\begin{eqnarray}
0 & = & \overline{xy} - \bar{y}\bar{x} + \hat{\beta}_1 \bar{x}\bar{x} - \hat{\beta}_1 \overline{x^2}\\
0 & = & c_{XY} - \hat{\beta}_1 s^2_X\\
\hat{\beta}_1 & = & \frac{c_{XY}}{s^2_{X}}
\end{eqnarray}
That is, the least-squares estimate of the slope is our old friend the plug-in
estimate of the slope, and thus the least-squares intercept is also the plug-in
intercept.
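One way to double-check this algebra is to minimize the in-sample MSE numerically and confirm that the optimum matches the plug-in formulas; a minimal sketch in R, with made-up data:

```r
# Numerically minimize the in-sample MSE and compare to the plug-in estimates
x <- c(0.5, 1.2, 2.3, 3.1, 4.8, 5.5)   # made-up data
y <- c(1.1, 2.0, 3.2, 4.1, 5.9, 6.8)
mse.hat <- function(b) { mean((y - (b[1] + b[2]*x))^2) }
fit <- optim(par=c(0, 0), fn=mse.hat)  # generic Nelder-Mead minimizer
beta.1.hat <- cov(x, y) / var(x)       # plug-in slope
beta.0.hat <- mean(y) - beta.1.hat * mean(x)
# fit$par should match (beta.0.hat, beta.1.hat) to numerical tolerance
```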
\paragraph{Going forward} The equivalence between the plug-in estimator and the
least-squares estimator is a bit of a special case for linear models. In some
non-linear models, least squares is quite feasible (though the optimum can only
be found numerically, not in closed form); in others, plug-in estimates
are more useful than optimization.
\subsection{Bias, Variance and Standard Error of Parameter Estimates}
Whether we think of it as deriving from plugging in or from least squares, we
can now work out some of the properties of this estimator of the coefficients,
using the model assumptions. We'll start with the slope, $\hat{\beta}_1$.
\begin{eqnarray}
\hat{\beta}_1 & = & \frac{c_{XY}}{s^2_X}\\
& = & \frac{\frac{1}{n}\sum_{i=1}^{n}{x_i y_i} - \bar{x}\bar{y}}{s^2_X}\\
& = & \frac{\frac{1}{n}\sum_{i=1}^{n}{x_i (\beta_0 + \beta_1 x_i + \epsilon_i)} - \bar{x}(\beta_0 + \beta_1 \bar{x} + \bar{\epsilon})}{s^2_X}\\
& = & \frac{\beta_0 \bar{x} + \beta_1 \overline{x^2} + \frac{1}{n}\sum_{i=1}^{n}{x_i \epsilon_i} - \bar{x}\beta_0 - \beta_1 \bar{x}^2 - \bar{x}\bar{\epsilon}}{s^2_X}\\
& = & \frac{\beta_1 s^2_X + \frac{1}{n}\sum_{i=1}^{n}{x_i \epsilon_i} - \bar{x}\bar{\epsilon}}{s^2_X}\\
& = & \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}{x_i \epsilon_i} - \bar{x}\bar{\epsilon}}{s^2_X}
\end{eqnarray}
Since $\bar{x}\bar{\epsilon}=n^{-1}\sum_{i}{\bar{x}\epsilon_i}$,
\begin{equation}
\hat{\beta_1} = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}{(x_i - \bar{x})\epsilon_i}}{s^2_X}
\end{equation}
This representation of the slope estimate shows that it is equal to the true
slope ($\beta_1$) plus something which depends on the noise terms (the
$\epsilon_i$, and their sample average $\bar{\epsilon}$). We'll use this to
find the expected value and the variance of the estimator $\hat{\beta}_1$.
In the next couple of paragraphs, I am going to treat the $x_i$ as non-random
variables. This is appropriate in ``designed'' or ``controlled'' experiments,
where we get to choose their value. In randomized experiments or in
observational studies, obviously the $x_i$ aren't necessarily fixed; however,
these expressions will be correct for the conditional expectation
$\Expect{\hat{\beta}_1|x_1, \ldots x_n}$ and conditional variance
$\Var{\hat{\beta}_1|x_1, \ldots x_n}$, and I will come back to how we get the
unconditional expectation and variance.
\paragraph{Expected value and bias}
Recall that $\Expect{\epsilon_i|X_i} = 0$, so
\begin{equation}
\frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x}) \Expect{\epsilon_i}} =0
\end{equation}
Thus,
\begin{equation}
\Expect{\hat{\beta}_1} = \beta_1
\end{equation}
Since the {\bf bias} of an estimator is the difference between its expected
value and the truth, $\hat{\beta}_1$ is an {\bf unbiased} estimator of the
optimal slope.
(To repeat what I'm sure you remember from mathematical statistics: ``bias''
here is a technical term, meaning no more and no less than
$\Expect{\hat{\beta}_1} - \beta_1$. An unbiased estimator could still make
systematic mistakes --- for instance, it could underestimate 99\% of the time,
provided that the 1\% of the time it over-estimates, it does so by much more
than it under-estimates. Moreover, unbiased estimators are not necessarily
superior to biased ones: the total error depends on both the bias of the
estimator and its variance, and there are many situations where you can remove
lots of bias at the cost of adding a little variance. Least squares for simple
linear regression happens not to be one of them, but you shouldn't expect that
as a general rule.)
Turning to the intercept,
\begin{eqnarray}
\Expect{\hat{\beta}_0} & = & \Expect{\overline{Y} - \hat{\beta}_1 \overline{X}}\\
& = & \beta_0 + \beta_1 \overline{X} - \Expect{\hat{\beta}_1}\overline{X}\\
& = & \beta_0 + \beta_1 \overline{X} - \beta_1 \overline{X}\\
& = & \beta_0
\end{eqnarray}
so it, too, is unbiased.
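Unbiasedness is easy to check by simulation: hold the $x_i$ fixed, generate many sets of noise terms, and average the slope estimates. (Parameter values are again made up.)

```r
# Monte Carlo check that the slope estimator is (conditionally) unbiased
set.seed(36401)
beta.0 <- 5; beta.1 <- -2; sigma <- 3
x <- seq(from=1, to=10, length.out=30)   # x's held fixed across replications
beta.1.hats <- replicate(5000, {
  y <- beta.0 + beta.1*x + rnorm(length(x), sd=sigma)
  cov(x, y) / var(x)
})
mean(beta.1.hats)   # should come out very close to beta.1 = -2
```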
\paragraph{Variance and Standard Error}
Using the formula for the variance of a sum from lecture 1, and the model
assumption that all the $\epsilon_i$ are uncorrelated with each other,
\begin{eqnarray}
\Var{\hat{\beta}_1} & = & \Var{\beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x}) \epsilon_i}}{s^2_X}}\\
& = & \Var{\frac{\frac{1}{n}\sum_{i=1}^{n}{(x_i-\bar{x}) \epsilon_i}}{s^2_X}}\\
& = & \frac{\frac{1}{n^2}\sum_{i=1}^{n}{(x_i - \bar{x})^2 \Var{\epsilon_i}}}{(s^2_X)^2}\\
& = & \frac{\frac{\sigma^2}{n}s^2_X}{(s^2_X)^2}\\
& = & \frac{\sigma^2}{n s^2_X}
\end{eqnarray}
In words, this says that the variance of the slope estimate goes up as the
noise around the regression line ($\sigma^2$) gets bigger, and goes down as we
have more observations ($n$), which are further spread out along the horizontal
axis ($s^2_X$); it should not be surprising that it's easier to work out the
slope of a line from many, well-separated points on the line than from a few
points smushed together.
The {\bf standard error} of an estimator is just its standard deviation,
or the square root of its variance:
\begin{equation}
\mathrm{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{n s^2_X}}
\end{equation}
I will leave working out the variance of $\hat{\beta}_0$ as an exercise.
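The same sort of simulation lets us check the variance formula; note that \texttt{s2.X} below is the $1/n$ version of the sample variance, as used throughout these notes.

```r
# Monte Carlo check of Var(hat(beta)_1) = sigma^2/(n s^2_X), x's held fixed
set.seed(36401)
beta.0 <- 5; beta.1 <- -2; sigma <- 3
x <- seq(from=1, to=10, length.out=30)
n <- length(x)
s2.X <- mean(x^2) - mean(x)^2            # 1/n version of the sample variance
beta.1.hats <- replicate(5000, {
  y <- beta.0 + beta.1*x + rnorm(n, sd=sigma)
  cov(x, y) / var(x)
})
c(simulated=var(beta.1.hats), theoretical=sigma^2/(n*s2.X))
```

The two numbers should agree to within Monte Carlo error.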
\paragraph{Unconditional-on-$X$ Properties}
The last few paragraphs, as I said, have looked at the expectation and variance
of $\hat{\beta}_1$ conditional on $x_1, \ldots x_n$, either because the $x$'s
really are non-random (e.g., controlled by us), or because we're just
interested in conditional inference. If we do care about unconditional
properties, then we still need to find $\Expect{\hat{\beta}_1}$ and
$\Var{\hat{\beta}_1}$, not just $\Expect{\hat{\beta}_1|x_1, \ldots x_n}$ and
$\Var{\hat{\beta}_1|x_1, \ldots x_n}$. Fortunately, this is easy, {\em so long
as the simple linear regression model holds}.
To get the unconditional expectation, we use the ``law of total
expectation'':
\begin{eqnarray}
\Expect{\hat{\beta}_1} & = & \Expect{\Expect{\hat{\beta}_1|X_1, \ldots X_n}}\\
& = & \Expect{\beta_1} = \beta_1
\end{eqnarray}
That is, the estimator is {\em unconditionally} unbiased.
To get the unconditional variance, we use the ``law of total variance'':
\begin{eqnarray}
\Var{\hat{\beta_1}} & = & \Expect{\Var{\hat{\beta_1}|X_1, \ldots X_n}} + \Var{\Expect{\hat{\beta}_1|X_1, \ldots X_n}}\\
& = & \Expect{\frac{\sigma^2}{n s^2_X}} + \Var{\beta_1}\\
& = & \frac{\sigma^2}{n}\Expect{\frac{1}{s^2_X}}
\end{eqnarray}
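When the $X_i$ really are random, we can check this unconditional variance formula by drawing fresh $x$'s in every replication, and estimating $\Expect{1/s^2_X}$ by simulation as well (made-up parameters, as before):

```r
# Monte Carlo check of Var(hat(beta)_1) = (sigma^2/n) E[1/s^2_X], X random
set.seed(36401)
beta.0 <- 5; beta.1 <- -2; sigma <- 3; n <- 30
sims <- replicate(5000, {
  x <- runif(n, min=0, max=10)           # new x's each time
  y <- beta.0 + beta.1*x + rnorm(n, sd=sigma)
  c(cov(x, y)/var(x), 1/(mean(x^2) - mean(x)^2))
})
mc.var <- var(sims[1, ])                 # unconditional variance of the estimates
theory <- (sigma^2/n) * mean(sims[2, ])  # (sigma^2/n) * estimated E[1/s^2_X]
```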
\subsection{Parameter Interpretation; Causality}
Two of the parameters are easy to interpret.
\begin{description}
\item[$\sigma^2$] is the variance of the noise around the regression
line; $\sigma$ is a typical distance of a point from the line. (``Typical''
here in a special sense, it's the root-mean-squared distance, rather than,
say, the average absolute distance.)
\item[$\beta_0$] is simply the expected value of $Y$ when $X$ is 0,
$\Expect{Y|X=0}$. The point $X=0$ usually has no special significance, but
this setting does ensure that the line goes through the point ($\Expect{X},
\Expect{Y}$).
\end{description}
The interpretation of the slope is both very straightforward and very tricky.
Mathematically, it's easy to convince yourself that, for any $x$,
\begin{equation}
\beta_1 = \Expect{Y|X=x} - \Expect{Y|X=x-1}
\end{equation}
or, for any $x_1, x_2$,
\begin{equation}
\beta_1 = \frac{\Expect{Y|X=x_2} - \Expect{Y|X=x_1}}{x_2 - x_1}
\end{equation}
This is just saying that the slope of a line is ``rise$/$run''.
The tricky part is that we have a {\em very} strong, natural tendency to
interpret this as telling us something about causation --- ``If we change $X$
by 1, then on average $Y$ will change by $\beta_1$''. This interpretation is
usually completely unsupported by the analysis. If I use an old-fashioned
mercury thermometer, the height of mercury in the tube usually has a nice
linear relationship with the temperature of the room the thermometer is in.
This linear relationship goes both ways, so we could regress temperature ($Y$)
on mercury height ($X$). But if I manipulate the height of the mercury (say,
by changing the ambient pressure, or shining a laser into the tube, etc.),
changing the height $X$ will not, in fact, change the temperature outside.
The right way to interpret $\beta_1$ is not as the result of a {\em change},
but as an expected {\em difference}. The correct catch-phrase would be
something like ``If we select two sets of cases from the un-manipulated
distribution where $X$ differs by 1, we expect $Y$ to differ by $\beta_1$.''
This covers the thermometer example, and every other example I can think of. It is, I
admit, much more inelegant than ``If $X$ changes by 1, $Y$ changes by $\beta_1$
on average'', but it has the advantage of being true, which the other does not.
There {\em are} circumstances where regression can be a useful part of causal
inference, but we will need a lot more tools to grasp them; that will come
towards the end of 402.
\section{The Gaussian-Noise Simple Linear Regression Model}
We have, so far, assumed comparatively little about the noise term $\epsilon$.
The advantage of this is that our conclusions apply to lots of different
situations; the drawback is that there's really not all that much more to say
about our estimator $\widehat{\beta}$ or our predictions than we've already
gone over. If we made more detailed assumptions about $\epsilon$,
we could make more precise inferences.
There are lots of forms of distributions for $\epsilon$ which we might
contemplate, and which are compatible with the assumptions of the simple linear
regression model (Figure \ref{fig:some-noise-dists}). The one which has become
the most common over the last two centuries is to assume $\epsilon$ follows a
Gaussian distribution.
\begin{figure}
<<noise-dists, echo=FALSE>>=
par(mfrow=c(2,3))
curve(dnorm(x), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(exp(-abs(x))/2, from=-3, to=3, xlab=expression(epsilon), ylab="",
ylim=c(0,1))
curve(sqrt(pmax(0,1-x^2))/(pi/2), from=-3, to=3, xlab=expression(epsilon),
ylab="", ylim=c(0,1))
curve(dt(x,3), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(dgamma(x+1.5, shape=1.5, scale=1), from=-3, to=3,
xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(0.5*dgamma(x+1.5, shape=1.5, scale=1) +
0.5*dgamma(0.5-x, shape=0.5, scale=1), from=-3,
to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
par(mfrow=c(1,1))
@
<<eval=FALSE>>=
<<noise-dists>>
@
\caption{Some possible noise distributions for the simple linear regression
  model: all have $\Expect{\epsilon}=0$, and each could be given any variance
  by scaling.  (The model is even compatible with each observation taking
  $\epsilon$ from a different distribution.)  From top left to bottom right:
Gaussian; double-exponential (``Laplacian''); ``circular'' distribution; $t$
with 3 degrees of freedom; a gamma distribution (shape 1.5, scale 1) shifted
to have mean 0; mixture of two gammas with shape 1.5 and shape 0.5, each
off-set to have expectation 0. The first three were all used as error models
in the 18th and 19th centuries. (See the source file for how to get the code
below the figure.)}
\label{fig:some-noise-dists}
\end{figure}
The result is the Gaussian-noise simple linear regression model\footnote{Our
textbook, rather old-fashionedly, calls this the ``normal error'' model
rather than ``Gaussian noise''. I dislike this: ``normal'' is an over-loaded
word in math, while ``Gaussian'' is (comparatively) specific; ``error'' made
sense in Gauss's original context of modeling, specifically, errors of
observation, but is misleading generally; and calling Gaussian distributions
``normal'' suggests they are much more common than they really are.}:
\begin{enumerate}
\item The distribution of $X$ is arbitrary (and perhaps $X$ is even non-random).
\item If $X=x$, then $Y = \beta_0 + \beta_1 x + \epsilon$, for some constants
(``coefficients'', ``parameters'') $\beta_0$ and $\beta_1$, and some random
noise variable $\epsilon$.
\item $\epsilon \sim N(0,\sigma^2)$, independent of $X$.
\item $\epsilon$ is independent across observations.
\end{enumerate}
You will notice that these assumptions are strictly stronger than those of the
simple linear regression model. More exactly, the first two assumptions are
the same, while the third and fourth assumptions of the Gaussian-noise model
imply the corresponding assumptions of the other model. This means that
everything we have done so far directly applies to the Gaussian-noise model.
On the other hand, the stronger assumptions let us say more. They tell us,
exactly, the probability distribution for $Y$ given $X$, and so will let us get
exact distributions for predictions and for other inferential statistics.
\paragraph{Why the Gaussian noise model?} Why should we think that the noise
around the regression line would follow a Gaussian distribution, independent of
$X$? There are two big reasons.
\begin{enumerate}
\item {\em Central limit theorem.}  The noise might be due to adding up the
effects of lots of little random causes, all nearly independent of each other
and of $X$, where each of the effects are of roughly similar magnitude. Then
the central limit theorem will take over, and the distribution of the sum of
effects will indeed be pretty Gaussian. For Gauss's original context, $X$
was (simplifying) ``Where is such-and-such-a-planet in space?'', $Y$ was
``Where does an astronomer record the planet as appearing in the sky?'', and
noise came from defects in the telescope, eye-twitches, atmospheric
distortions, etc., etc., so this was pretty reasonable. It is clearly {\em
not} a universal truth of nature, however, or even something we should
expect to hold true as a general rule, as the name ``normal'' suggests.
\item {\em Mathematical convenience.}  Assuming Gaussian noise lets us work out a
very complete theory of inference and prediction for the model, with lots of
closed-form answers to questions like ``What is the optimal estimate of the
variance?'' or ``What is the probability that we'd see a fit this good from a
line with a non-zero intercept if the true line goes through the origin?'',
etc., etc. Answering such questions without the Gaussian-noise assumption
needs somewhat more advanced techniques, and much more advanced computing;
we'll get to it towards the end of the class.
\end{enumerate}
\subsection{Visualizing the Gaussian Noise Model}
The Gaussian noise model gives us not just an expected value for $Y$ at each
$x$, but a whole conditional distribution for $Y$ at each $x$. To visualize
it, then, it's not enough to just sketch a curve; we need a three-dimensional
surface, showing, for each combination of $x$ and $y$, the probability density
of $Y$ around that $y$ given that $x$. Figure
\ref{fig:persp-plot-for-gaussian-noise} illustrates.
\begin{figure}
<<persp-gaussian-noise, echo=FALSE>>=
x.seq <- seq(from=-1, to=3, length.out=150)
y.seq <- seq(from=-5, to=5, length.out=150)
cond.pdf <- function(x,y) { dnorm(y, mean=10-5*x, sd=0.5) }
z <- outer(x.seq, y.seq, cond.pdf)
persp(x.seq,y.seq,z, ticktype="detailed", phi=75, xlab="x",
ylab="y", zlab=expression(p(y|x)), cex.axis=0.8)
@
<<eval=FALSE>>=
<<persp-gaussian-noise>>
@
\caption{Illustrating how the conditional pdf of $Y$ varies as a function of
$X$, for a hypothetical Gaussian noise simple linear regression where
$\beta_0=10$, $\beta_1 = -5$, and $\sigma^2 = (0.5)^2$. The perspective is
adjusted so that we are looking nearly straight down from above on the
surface. (Can you find a better viewing angle?) See \texttt{help(persp)}
for the 3D plotting (especially the examples), and \texttt{help(outer)} for
the \texttt{outer} function, which takes all combinations of elements from
two vectors and pushes them through a function. How would you modify this so
that the regression line went through the origin with a slope of $4/3$ and a
standard deviation of 5?}
\label{fig:persp-plot-for-gaussian-noise}
\end{figure}
\subsection{Maximum Likelihood vs.\ Least Squares}
As you remember from your mathematical statistics class, the {\bf likelihood}
of a parameter value on a data set is the probability density at the data under
those parameters. We could not work with the likelihood with the simple linear
regression model, because it didn't specify enough about the distribution to
let us calculate a density. With the Gaussian-noise model, however, we can
write down a likelihood\footnote{Strictly speaking, this is a ``conditional''
  (on $X$) likelihood; but only pedants use the adjective in this context.}.
By the model's assumptions, if the parameters are $b_0, b_1, s^2$ (reserving
the Greek letters for their true values), then $Y|X=x \sim N(b_0 + b_1 x,
s^2)$, and $Y_i$ and $Y_j$ are independent given $X_i$ and $X_j$, so the
over-all likelihood is
\begin{equation}
\prod_{i=1}^{n}{\frac{1}{\sqrt{2\pi s^2}}e^{-\frac{(y_i - (b_0+b_1 x_i))^2}{2s^2}}}
\end{equation}
As usual, we work with the log-likelihood, which gives us the same information\footnote{Why is this?}
but replaces products with sums:
\begin{equation}
L(b_0, b_1, s^2) = -\frac{n}{2}\log{2\pi} - n\log{s} - \frac{1}{2s^2}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))^2}
\end{equation}
We recall from mathematical statistics that when we've got a likelihood
function, we generally want to maximize it. That is, we want to find the
parameter values which make the data we observed as likely, as probable, as the
model will allow. (This may not be very likely; that's another issue.)
We recall from calculus that one way to maximize is to take derivatives
and set them to zero.
\begin{eqnarray}
\frac{\partial L}{\partial b_0} & = & -\frac{1}{2s^2}\sum_{i=1}^{n}{2(y_i - (b_0 + b_1 x_i))(-1)}\\
\frac{\partial L}{\partial b_1} & = & -\frac{1}{2s^2}\sum_{i=1}^{n}{2(y_i - (b_0 + b_1 x_i))(-x_i)}
\end{eqnarray}
Notice that when we set these derivatives to zero, all the multiplicative constants --- in particular, the prefactor of $\frac{1}{2s^2}$ --- go away. We are left with
\begin{eqnarray}
\sum_{i=1}^{n}{y_i - (\widehat{\beta_0} + \widehat{\beta_1} x_i)} & = & 0\\
\sum_{i=1}^{n}{(y_i - (\widehat{\beta_0} + \widehat{\beta_1} x_i))x_i} & = & 0
\end{eqnarray}
These are, up to a factor of $1/n$, {\em exactly} the equations we got from the
method of least squares (Eq.\ \ref{eqn:normal-equations-for-least-squares}).
That means that the least squares solution {\em is} the maximum likelihood
estimate under the Gaussian noise model; this is no coincidence\footnote{It's
no coincidence because, to put it somewhat anachronistically, what Gauss did
was ask himself ``for what distribution of the noise would least squares
maximize the likelihood?''.}.
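To see this equivalence numerically, we can minimize the negative log-likelihood directly with a generic optimizer and compare the result to a least-squares fit; a sketch with made-up data (parameterizing by $\log{s}$ keeps the standard deviation positive):

```r
# Check: maximizing the Gaussian likelihood recovers the least-squares fit
x <- c(0.2, 1.1, 2.4, 3.3, 4.7)          # made-up data
y <- c(0.9, 2.2, 4.1, 5.0, 7.2)
neg.log.lik <- function(theta) {         # theta = (b_0, b_1, log(s))
  -sum(dnorm(y, mean=theta[1] + theta[2]*x, sd=exp(theta[3]), log=TRUE))
}
mle <- optim(par=c(0, 1, 0), fn=neg.log.lik)
ls.fit <- coef(lm(y ~ x))
# mle$par[1:2] should match ls.fit to numerical tolerance
```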
Now let's take the derivative with respect to $s$:
\begin{eqnarray}
\frac{\partial L}{\partial s} & = & -\frac{n}{s} + 2\frac{1}{2s^3}\sum_{i=1}^{n}{(y_i - (b_0 + b_1 x_i))^2}
\end{eqnarray}
Setting this to $0$ at the optimum, and multiplying through by $\widehat{\sigma}^3/n$, we get
\begin{equation}
\widehat{\sigma^2} = \frac{1}{n}\sum_{i=1}^{n}{(y_i - (\widehat{\beta_0} + \widehat{\beta_1} x_i))^2}
\end{equation}
Notice that the right-hand side is just the in-sample mean squared error.
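Be aware that this is not the number \texttt{lm} reports: the residual standard error from \texttt{summary} squares to the sum of squared residuals divided by $n-2$, not $n$ (more on that adjustment later). A quick check with made-up data:

```r
# The MLE of sigma^2 is the in-sample MSE (divide by n);
# lm's reported residual variance divides by n-2 instead
x <- c(1, 2, 3, 4, 5, 6)                 # made-up data
y <- c(1.2, 1.9, 3.3, 3.8, 5.2, 5.7)
fit <- lm(y ~ x)
n <- length(y)
sigma2.mle <- mean(residuals(fit)^2)     # in-sample MSE = MLE of sigma^2
sigma2.lm  <- summary(fit)$sigma^2       # = sum(residuals(fit)^2)/(n-2)
# sigma2.mle equals sigma2.lm * (n-2)/n
```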
\paragraph{Other models} Maximum likelihood estimates of the regression curve
coincide with least-squares estimates when the noise around the curve is
additive, Gaussian, of constant variance, and both independent of $X$ and of
other noise terms. These were all assumptions we used in setting up a
log-likelihood which was, up to constants, proportional to the (negative)
mean-squared error. If any of those assumptions fail, maximum likelihood and
least squares estimates can diverge, though sometimes the MLE solves a
``generalized'' least squares problem (as we'll see later in this course).
\section*{Exercises}
To think through, not to hand in.
\begin{enumerate}
\item Show that if $\Expect{\epsilon|X=x} = 0$ for all $x$, then
$\Cov{X,\epsilon} = 0$. Would this still be true if $\Expect{\epsilon|X=x} =
a$ for some other constant $a$?
\item Find the variance of $\hat{\beta_0}$. {\em Hint:} Do you need to worry
about covariance between $\overline{Y}$ and $\hat{\beta_1}$?
\end{enumerate}
\bibliography{locusts}
\bibliographystyle{crs}
\end{document}