\documentclass{article}
\input{../common}
\begin{document}
\title{Lecture 8: Inference on Parameters}
\author{36-401, Fall 2015, Section B}
\date{24 September 2015}
\maketitle
<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE)
@
\tableofcontents
Having gone over the Gaussian-noise simple linear regression model, over ways
of estimating its parameters and some of the properties of the model, and over
how to check the model's assumptions, we are now ready to begin doing some
serious statistical inference within the model\footnote{Presuming, of course,
that the model's assumptions, when thoroughly checked, do in fact hold
good.}. In previous lectures, we came up with {\bf point estimators} of the
parameters and the conditional mean (prediction) function, but we weren't able
to say much about the margin of uncertainty around these estimates. In this
lecture we will focus on supplementing point estimates with {\em reliable}
measures of uncertainty. This will naturally lead us to testing hypotheses
about the true parameters --- again, we will want hypothesis tests which are
unlikely to get the answer wrong, whatever the truth might be.

To accomplish all this, we first need to understand the sampling distributions
of our point estimators. We can find these, mathematically, but they involve
the unknown true parameters in inconvenient ways. We will therefore work to
find combinations of our estimators and the true parameters with fixed,
parameter-free distributions; we'll get our confidence sets and our hypothesis
tests from them.

Throughout this lecture, I am assuming, unless otherwise noted, that all of the
assumptions of the Gaussian-noise simple linear regression model hold. After
all, we checked those assumptions last time...
\section{Sampling Distribution of $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\sigma}^2$}
The Gaussian-noise simple linear regression model has three parameters: the
intercept $\beta_0$, the slope $\beta_1$, and the noise variance $\sigma^2$.
We've seen, previously, how to estimate all of these by maximum likelihood;
the MLE for the $\beta$s is the same as their least-squares estimates.
These are
\begin{eqnarray}
\hat{\beta}_1 & = & \frac{c_{XY}}{s^2_X} = \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{n s^2_X} y_i}\\
\hat{\beta}_0 & = & \overline{y} - \hat{\beta}_1 \overline{x}\\
\hat{\sigma}^2 & = & \frac{1}{n}\sum_{i=1}^{n}{(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2}
\end{eqnarray}
We have also seen how to re-write the first two of these as a deterministic
part plus a weighted sum of the noise terms $\epsilon$:
\begin{eqnarray}
\label{eqn:hat-beta-1-in-terms-of-epsilon} \hat{\beta}_1 & = & \beta_1 + \sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i}\\
\label{eqn:hat-beta-0-in-terms-of-epsilon} \hat{\beta}_0 & = & \beta_0 + \frac{1}{n}\sum_{i=1}^{n}{\left(1 -\overline{x} \frac{x_i - \overline{x}}{s^2_X} \right) \epsilon_i}
\end{eqnarray}
Finally, we have our modeling assumption that the $\epsilon_i$ are independent
Gaussians, $\epsilon_i \sim N(0,\sigma^2)$.
\subsection{Reminders of Basic Properties of Gaussian Distributions}
Suppose $U \sim N(\mu,\sigma^2)$. By the basic algebra of expectations and
variances, $\Expect{a+bU} = a+ b\mu$, while $\Var{a+bU} = b^2 \sigma^2$. This
would be true of any random variable; a special property of
Gaussians\footnote{There are some other families of distributions which have this
property; they're called ``location-scale'' families.} is that $a+bU \sim
N(a+b\mu, b^2 \sigma^2)$.
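In particular, standardizing a Gaussian gives a standard Gaussian:
\[
\frac{U - \mu}{\sigma} \sim N(0,1)
\]
a fact we will lean on repeatedly below.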
Suppose $U_1, U_2, \ldots U_n$ are {\em independent} Gaussians, with means
$\mu_i$ and variances $\sigma^2_i$. Then
\[
\sum_{i=1}^{n}{U_i} \sim N(\sum_{i}{\mu_i}, \sum_{i}{\sigma^2_i})
\]
That the expected values add up for a sum is true of all random variables; that
the variances add up is true for all uncorrelated random variables. That the
sum follows the same type of distribution as the summands is a special property
of Gaussians\footnote{There are some other families of distributions which have
this property; they're called ``stable'' families.}.
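For example, combining the two properties, the sample mean of $n$ independent
$N(\mu,\sigma^2)$ variables is again Gaussian:
\[
\overline{U} = \frac{1}{n}\sum_{i=1}^{n}{U_i} \sim N\left(\mu, \frac{\sigma^2}{n}\right)
\]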
\subsection{Sampling Distribution of $\hat{\beta}_1$}
Since we're assuming Gaussian noise, the $\epsilon_i$ are independent
Gaussians, $\epsilon_i \sim N(0,\sigma^2)$. Hence (using the first basic
property of Gaussians)
\[
\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i \sim N(0, \left(\frac{x_i - \overline{x}}{ns^2_X}\right)^2 \sigma^2)
\]
Thus, using the second basic property of Gaussians,
\begin{eqnarray}
\sum_{i=1}^{n}{\frac{x_i - \overline{x}}{ns^2_X}\epsilon_i} & \sim & N(0, \sigma^2 \sum_{i=1}^{n}{\left(\frac{x_i - \overline{x}}{ns^2_X}\right)^2})\\
& = & N(0, \frac{\sigma^2}{n s^2_X})
\end{eqnarray}
Using the first property of Gaussians again,
\begin{equation}
\hat{\beta}_1 \sim N(\beta_1, \frac{\sigma^2}{n s^2_X}) \label{eqn:sampling-dist-of-hat-beta-1}
\end{equation}
This is the distribution of estimates we'd see if we repeated the experiment
(survey, observation, etc.) many times, and collected the results. Every
particular run of the experiment would give a slightly different
$\hat{\beta}_1$, but they'd average out to $\beta_1$, the average squared
difference from $\beta_1$ would be $\sigma^2/(n s^2_X)$, and a histogram of them
would follow the Gaussian probability density function (Figure
\ref{fig:sampling-distribution-of-hat-beta-1}).
\begin{figure}
<<>>=
# Simulate a Gaussian-noise simple linear regression model
# Inputs: x sequence; intercept; slope; noise variance; switch for whether to
# return the simulated values, or run a regression and return the coefficients
# Output: data frame or coefficient vector
sim.gnslrm <- function(x, intercept, slope, sigma.sq, coefficients=TRUE) {
n <- length(x)
y <- intercept + slope*x + rnorm(n,mean=0,sd=sqrt(sigma.sq))
if (coefficients) {
return(coefficients(lm(y~x)))
} else {
return(data.frame(x=x, y=y))
}
}
# Fix an arbitrary vector of x's
x <- seq(from=-5, to=5, length.out=42)
@
\caption{Code setting up a simulation of a Gaussian-noise simple linear
regression model, along a fixed vector of $x_i$ values.}
\end{figure}
\begin{figure}
<<code-for-sampling-dist-fig, eval=FALSE>>=
# Run the simulation 10,000 times and collect all the coefficients
# What intercept, slope and noise variance does this impose?
many.coefs <- replicate(1e4, sim.gnslrm(x=x, 5, -2, 0.1, coefficients=TRUE))
# Histogram of the slope estimates
hist(many.coefs[2,], breaks=50, freq=FALSE, xlab=expression(hat(beta)[1]),
main="")
# Theoretical Gaussian sampling distribution
theoretical.se <- sqrt(0.1/(length(x)*var(x)))
curve(dnorm(x,mean=-2,sd=theoretical.se), add=TRUE,
col="blue")
@
<<sampling-dist-fig, echo=FALSE>>=
<<code-for-sampling-dist-fig>>
@
\caption{Simulating 10,000 runs of a Gaussian-noise simple linear regression
model, calculating $\hat{\beta}_1$ each time, and comparing the histogram of
estimates to the theoretical Gaussian distribution (Eq.\
\ref{eqn:sampling-dist-of-hat-beta-1}, in blue).}
\label{fig:sampling-distribution-of-hat-beta-1}
\end{figure}
It is a bit hard to use Eq.\ \ref{eqn:sampling-dist-of-hat-beta-1}, because it
involves two of the unknown parameters. We can manipulate it a bit to
remove one of the parameters from the probability distribution,
\[
\hat{\beta}_1 - \beta_1 \sim N(0, \frac{\sigma^2}{n s^2_X})
\]
but that still has $\sigma^2$ on the right hand side, so we can't actually calculate anything. We could write
\[
\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{n s^2_X}} \sim N(0,1)
\]
but now we've got two unknown parameters on the left-hand side, which is also
awkward.
\subsection{Sampling Distribution of $\hat{\beta}_0$}
Starting from Eq.\ \ref{eqn:hat-beta-0-in-terms-of-epsilon} rather than Eq.\ \ref{eqn:hat-beta-1-in-terms-of-epsilon}, an argument exactly parallel to the
one we just went through gives
\[
\hat{\beta}_0 \sim N(\beta_0, \frac{\sigma^2}{n}\left(1 + \frac{\overline{x}^2}{s^2_X}\right))
\]
It follows, again by parallel reasoning, that
\[
\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\frac{\sigma^2}{n}\left(1 + \frac{\overline{x}^2}{s^2_X}\right)}} \sim N(0,1)
\]
The right-hand side of this equation is admirably simple and easy for us to
calculate, but the left-hand side unfortunately involves two unknown
parameters, and that complicates any attempt to use it.
\subsection{Sampling Distribution of $\hat{\sigma}^2$}
It is mildly challenging, but certainly not too hard, to show that
\[
\Expect{\hat{\sigma}^2} = \frac{n-2}{n}\sigma^2
\]
As I have said before, this will be a problem on a future assignment, so I
will not give a proof, but I will note that the way to proceed is to
write
\[
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}{e^2_i} ~ ;
\]
then to write each residual $e_i$ as a weighted sum of the noise terms
$\epsilon$; to use $\Expect{e^2_i} = \Var{e_i} + (\Expect{e_i})^2$; and
finally to sum up over $i$.
Notice that this implies that $\Expect{\hat{\sigma}^2} = 0$ when $n=2$. This
is because any two points in the plane define a (unique) line, so if we have
only two data points, least squares will just run a line through them exactly,
and have an in-sample MSE of 0. In general, we get the factor of $n-2$
from the fact that we are estimating two parameters.
We can however be much more specific. When $\epsilon_i \sim N(0,\sigma^2)$,
it can be shown that
\[
\frac{n \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}
\]
Notice, by the way, that this equation involves no unknown parameters on the
right-hand side, and only one on the left-hand side. It lets us calculate the
probability that $\hat{\sigma}^2$ is within any given {\em factor} of
$\sigma^2$. If, for instance, we wanted to know the probability that
$\hat{\sigma}^2 \geq 7 \sigma^2$, this will let us find it.
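For instance, since $\Prob{\hat{\sigma}^2 \geq 1.5 \sigma^2} =
\Prob{n\hat{\sigma}^2/\sigma^2 \geq 1.5 n} = \Prob{\chi^2_{n-2} \geq 1.5 n}$,
this probability is one line of R (the factor $1.5$ and the sample size $n=42$,
matching the simulations above, are of course arbitrary):
<<eval=FALSE>>=
# P(hat(sigma)^2 >= 1.5 sigma^2) = P(chi^2_{n-2} >= 1.5 n); here n = 42
n <- 42
pchisq(1.5*n, df=n-2, lower.tail=FALSE)
@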
I will offer only a hand-waving explanation; I am afraid I am not aware of any
truly elementary mathematical explanation --- every one I know of either uses
probability facts which are about as obscure as the result to be shown, or
linear-algebraic facts about the properties of idempotent
matrices\footnote{Where $M^2 = M$.}, and we've not seen, {\em yet}, how to
write linear regression in matrix form. I do however want to re-assure you
that there are actual proofs, and I promise to include one in these notes once
we've seen how to connect what we're doing to matrices and linear algebra.
I am afraid I do not have even a hand-waving explanation of a second important
property of $\hat{\sigma}^2$: it is statistically independent of
$\hat{\beta}_0$ and $\hat{\beta}_1$. This is {\em not} obvious --- after all,
all three of these estimators are functions of the same noise variables
$\epsilon$ --- but it {\em is} true, and, again, I promise to provide a genuine
proof in these notes once we've gone over the necessary math.
\subsubsection{The Hand-Waving Explanation for $n-2$}
Let's think for a moment about a related (but strictly different!) quantity
from $\hat{\sigma}^2$, namely
\[
\frac{1}{n}\sum_{i=1}^{n}{\epsilon_i^2}
\]
This is a weighted sum of independent, mean-zero squared Gaussians,
which is where the connection to $\chi^2$ distributions comes in.
\paragraph{Some reminders about $\chi^2$} If $Z \sim N(0,1)$, then $Z^2 \sim
\chi^2_1$ {\em by definition} (of the $\chi^2_1$ distribution). From this, it
follows that $\Expect{\chi^2_1} = 1$, while $\Var{\chi^2_1} = \Expect{Z^4} -
(\Expect{Z^2})^2 = 3 - 1 = 2$, using the Gaussian fourth moment
$\Expect{Z^4} = 3$. If $Z_1, Z_2, \ldots Z_d \sim N(0,1)$ and are
independent, then the $\chi^2_d$ distribution is {\em defined} to be the
distribution of $\sum_{i=1}^{d}{Z^2_i}$. By simple algebra, it follows that
$\Expect{\chi^2_d} = d$ while $\Var{\chi^2_d} = 2d$.
\paragraph{Back to the sum of squared noise terms}
$\epsilon_i$ isn't a standard Gaussian, but $\epsilon_i/\sigma$ is, so
\[
\frac{\sum_{i=1}^{n}{\epsilon^2_i}}{\sigma^2} = \sum_{i=1}^{n}{(\frac{\epsilon_i}{\sigma})^2} \sim \chi^2_n
\]
The numerator here is {\em like} $n\hat{\sigma}^2 = \sum_{i}{e^2_i}$, but of
course the residuals $e_i$ are not the same as the noise terms $\epsilon_i$.
The reason we end up with a $\chi^2_{n-2}$ distribution, rather than a
$\chi^2_n$ distribution, is that estimating two parameters from the data
removes two degrees of freedom, so two of the $\epsilon_i$ end up making no
real contribution to the sum of squared errors. (Again, if $n=2$, we'd be able
to fit the two data points {\em exactly} with the least squares line.) If we
had estimated more or fewer parameters, we would have had to adjust the number
of degrees of freedom accordingly.
(There is also a geometric interpretation: the sum of squared errors,
$\sum_{i=1}^{n}{e^2_i}$, is the squared length of the $n$-dimensional vector of
residuals, $(e_1, e_2, \ldots e_n)$. But the residuals must obey the two
equations $\sum_{i}{e_i} = 0$, $\sum_{i}{x_i e_i} = 0$, so the residual vector
actually is confined to an $(n-2)$-dimensional linear subspace. Thus
we only end up adding up $(n-2)$ {\em independent} contributions to its
length. If we estimated more parameters, we'd have more estimating
equations, and so more constraints on the vector of residuals.)
\subsection{Standard Errors of $\hat{\beta}_0$ and $\hat{\beta}_1$}
The {\bf standard error} of an estimator is its standard deviation\footnote{We
don't just call it the standard deviation because we want to emphasize that
it is, in fact, telling us about the random errors our estimator makes.}.
We've just seen that the true standard errors of $\hat{\beta}_0$ and
$\hat{\beta}_1$ are, respectively,
\begin{eqnarray}
\se{\hat{\beta}_1} & = & \frac{\sigma}{s_X \sqrt{n}}\\
\se{\hat{\beta}_0} & = & \frac{\sigma}{\sqrt{n} s_X}\sqrt{s^2_X + \overline{x}^2}
\end{eqnarray}
Unfortunately, these standard errors involve the unknown parameter $\sigma^2$
(or its square root $\sigma$, equally unknown to us).
We can, however, {\em estimate} the standard errors. The maximum-likelihood estimates just substitute $\hat{\sigma}$ for $\sigma$:
\begin{eqnarray}
\sehat{\hat{\beta}_1} & = & \frac{\hat{\sigma}}{s_X \sqrt{n}}\\
\sehat{\hat{\beta}_0} & = & \frac{\hat{\sigma}}{s_X \sqrt{n} }\sqrt{s^2_X + \overline{x}^2}
\end{eqnarray}
For later theoretical purposes, however, things will work out slightly nicer
if we use the de-biased version, $\frac{n}{n-2}\hat{\sigma}^2$:
\begin{eqnarray}
\label{eqn:debiased-se-hats}
\sehat{\hat{\beta}_1} & = & \frac{\hat{\sigma}}{s_X \sqrt{n-2}}\\
\sehat{\hat{\beta}_0} & = & \frac{\hat{\sigma}}{s_X \sqrt{n-2}}\sqrt{s^2_X + \overline{x}^2}
\end{eqnarray}
These standard errors --- approximate or estimated though they be --- are one
important way of quantifying how much uncertainty there is around our point
estimates. However, we can't use them, {\em alone}, to say anything terribly
precise\footnote{Exercise to think through: Could you use Chebyshev's
inequality (the extra credit problem from Homework 1) here?} about, say, the
probability that $\beta_1$ is in the interval $[\hat{\beta}_1 -
\sehat{\hat{\beta}_1}, \hat{\beta}_1 + \sehat{\hat{\beta}_1}]$, which is the
sort of thing we'd need in order to give guarantees about the reliability of
our estimates.
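For concreteness, here is a quick sketch of how the de-biased standard error of
the slope (Eq.\ \ref{eqn:debiased-se-hats}) could be computed from one run of
the simulation above; R's \texttt{summary.lm} reports essentially the same
quantity, differing only in whether $s^2_X$ is computed with $n$ or $n-1$ in
its denominator:
<<eval=FALSE>>=
# Estimated (de-biased) standard error of the slope, from one simulation run
sim.data <- sim.gnslrm(x=x, 5, -2, 0.1, coefficients=FALSE)
fit <- lm(y ~ x, data=sim.data)
hat.sigma.sq <- mean(residuals(fit)^2)   # MLE of sigma^2
se.hat.slope <- sqrt(hat.sigma.sq/(var(x)*(length(x)-2)))
# Compare to R's own standard error for the slope:
coefficients(summary(fit))["x", "Std. Error"]
@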
\section{Sampling Distribution of $(\hat{\beta} - \beta)/\sehat{\hat{\beta}}$}
It should take only a little work with the properties of the Gaussian
distribution to convince yourself that
\[
\frac{\hat{\beta}_1 - \beta_1}{\se{\hat{\beta}_1}} \sim N(0,1)
\]
the standard Gaussian distribution. If the Oracle told us $\sigma^2$,
we'd know $\se{\hat{\beta}_1}$, and so we could assert that (for example)
\begin{eqnarray}
\lefteqn{\Prob{\beta_1 - 1.96 \se{\hat{\beta}_1} \leq \hat{\beta}_1 \leq \beta_1 + 1.96 \se{\hat{\beta}_1}}} & & \\
& = & \Prob{-1.96\se{\hat{\beta}_1} \leq \hat{\beta}_1 - \beta_1 \leq 1.96 \se{\hat{\beta}_1}}\\
& = & \Prob{-1.96 \leq \frac{\hat{\beta}_1 - \beta_1}{\se{\hat{\beta}_1}} \leq 1.96}\\
& = & \Phi(1.96) - \Phi(-1.96) = 0.95
\end{eqnarray}
where $\Phi$ is the cumulative distribution function of the $N(0,1)$
distribution.
Since the oracles have fallen silent, we can't use this approach. What we {\em
can} do is use the following fact\footnote{When I messed up the derivation in
class today, I left out dividing by $d$ in the denominator. As I mentioned
at the end of that debacle, this was stupid. As $d \rightarrow\infty$,
$t_{d}$ converges on the standard Gaussian distribution $N(0,1)$. (Notice
that $\Expect{d^{-1} \chi^2_d} = 1$, while $\Var{d^{-1} \chi^2_d} = 2/d$, so
$d^{-1}\chi^2_d \rightarrow 1$.) Without the normalizing factor of $d$
inside the square root, however, looking just at $Z/S$, we've got a random
variable whose distribution doesn't change with $d$ ($Z$), divided by
something whose magnitude {\em grows} with $d$ ($S$), so $Z/S \rightarrow 0$ as $d
\rightarrow \infty$, not $\rightarrow N(0,1)$. I apologize again for my
error.}:
\begin{proposition}
If $Z \sim N(0,1)$, $S^2 \sim \chi^2_{d}$, and $Z$ and $S^2$ are independent, then
\[
\frac{Z}{\sqrt{S^2/d}} \sim t_{d}
\]
\end{proposition}
(I call this a proposition, but it's almost a definition of what we mean by a
$t$ distribution with $d$ degrees of freedom. Of course, if we take this as
the definition, the proposition that this distribution has a probability
density $\propto (1+x^2/d)^{-(d+1)/2}$ would become yet another proposition to
be demonstrated.)
Let's try to manipulate $(\hat{\beta}_1-\beta_1)/\sehat{\hat{\beta}_1}$ into
this form.
\begin{eqnarray}
\frac{\hat{\beta}_1 - \beta_1}{\sehat{\hat{\beta}_1}} & = & \frac{\hat{\beta}_1 - \beta_1}{\sigma} \frac{\sigma}{\sehat{\hat{\beta}_1}}\\
& = & \frac{\frac{\hat{\beta}_1 - \beta_1}{\sigma}}{\frac{\sehat{\hat{\beta}_1}}{\sigma}}\\
& = & \frac{N(0,1/n s^2_X)}{\frac{\hat{\sigma}}{s_X \sigma \sqrt{n-2}}}\\
& = & \frac{s_X N(0,1/n s^2_X)}{\frac{\hat{\sigma}}{\sigma \sqrt{n-2}}}\\
& = & \frac{N(0,1/n)}{\frac{\hat{\sigma}}{\sigma \sqrt{n-2}}}\\
& = & \frac{\sqrt{n} N(0,1/n)}{\frac{\sqrt{n}\hat{\sigma}}{\sigma \sqrt{n-2}}}\\
& = & \frac{N(0,1)}{\sqrt{\frac{n\hat{\sigma}^2}{\sigma^2}\frac{1}{n-2}}}\\
& = & \frac{N(0,1)}{\sqrt{\chi^2_{n-2}/(n-2)}}\\
& = & t_{n-2}
\end{eqnarray}
where in the last step I've used the proposition I stated (without proof) above.
To sum up:
\begin{proposition}
Using the $\sehat{\hat{\beta}_1}$ of Eq.\ \ref{eqn:debiased-se-hats},
\begin{eqnarray}
\label{eqn:t-statistic}
\frac{\hat{\beta}_1 - \beta_1}{\sehat{\hat{\beta}_1}} \sim t_{n-2}
\end{eqnarray}
\end{proposition}
Notice that we can compute $\sehat{\hat{\beta}_1}$ without knowing any of the
true parameters --- it's a pure statistic, just a function of the data. This
is a key to actually using the proposition for anything useful.
By exactly parallel reasoning, we may also demonstrate that
\[
\frac{\hat{\beta}_0 - \beta_0}{\sehat{\hat{\beta}_0}} \sim t_{n-2}
\]
\section{Sampling Intervals for $\hat{\beta}$; Hypothesis Tests for $\beta$}
Let's trace through one of the consequences of Eq.\ \ref{eqn:t-statistic}.
For any $k > 0$,
\begin{eqnarray}
\lefteqn{\Prob{\beta_1 - k \sehat{\hat{\beta}_1} \leq \hat{\beta}_1 \leq \beta_1 + k \sehat{\hat{\beta}_1}}} & & \\
& = & \Prob{-k\sehat{\hat{\beta}_1} \leq \hat{\beta}_1 - \beta_1 \leq k \sehat{\hat{\beta}_1}}\\
& = & \Prob{-k \leq \frac{\hat{\beta}_1 - \beta_1}{\sehat{\hat{\beta}_1}} \leq k}\\
& = & \int_{-k}^{k}{t_{n-2}(u) du}
\end{eqnarray}
where by a slight abuse of notation I am writing $t_{n-2}(u)$ for the probability density of the $t$ distribution with $n-2$ degrees of freedom, evaluated
at the point $u$.
It should be evident that if you pick any $\alpha$ between $0$ and $1$,
I can find a $k(n,\alpha)$ such that
\[
\int_{-k(n,\alpha)}^{k(n,\alpha)}{t_{n-2}(u) du} = 1-\alpha
\]
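Numerically, $k(n,\alpha)$ is just a quantile of the $t$ distribution, so in R
it is one call to \texttt{qt} (wrapped here, purely for clarity, in a little
helper function of my own naming):
<<eval=FALSE>>=
# k(n, alpha) = the 1-alpha/2 quantile of t with n-2 degrees of freedom,
# so the central interval [-k, k] holds probability 1-alpha
k.of.n.alpha <- function(n, alpha) { qt(1-alpha/2, df=n-2) }
k.of.n.alpha(n=42, alpha=0.02)   # the value used in the figures below
@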
I therefore define the (symmetric) $1-\alpha$ {\bf sampling interval} for
$\hat{\beta}_1$, when the true slope is $\beta_1$, as
\begin{equation}
\label{eqn:sampling-interval}
[\beta_1 - k(n,\alpha) \sehat{\hat{\beta}_1}, \beta_1 + k(n,\alpha) \sehat{\hat{\beta}_1}]
\end{equation}
If the true slope is $\beta_1$, then $\hat{\beta}_1$ will be within this
sampling interval with probability $1-\alpha$. This leads directly to a test
of the null hypothesis that the slope $\beta_1=\beta_1^*$: reject the null if
$\hat{\beta}_1$ is outside the sampling interval for $\beta_1^*$, and retain
the null when $\hat{\beta}_1$ is inside that sampling interval. This test is
called the {\bf Wald test}, after the great statistician Abraham
Wald\footnote{As is common with eponyms in the sciences, Wald was not, in fact,
the first person to use the test, but he made one of the most important early
studies of its properties, and he was already famous for other reasons.}.
By construction, the Wald test's probability of rejection under the null
hypothesis --- the {\bf size}, or {\bf type I error rate}, or {\bf false alarm
rate} of the test --- is exactly $\alpha$. Of course, the other important
property of a hypothesis test is its {\bf power} --- the probability of
rejecting the null when it is false. From Eqn.\ \ref{eqn:t-statistic}, it
should be clear that if the true $\beta_1 \neq \beta_1^*$, the probability that
$\hat{\beta}_1$ is inside the sampling interval for $\beta_1^*$ is $<
1-\alpha$, with the difference growing as $|\beta_1 - \beta_1^*|$ grows. An
exact calculation could be done (it'd involve what's called the ``non-central
$t$ distribution''), but is not especially informative. The point is that the
power is always $> \alpha$, and grows with the departure from the null
hypothesis.
If you were an economist, psychologist, or something of their ilk, you would
feel a powerful drive --- almost a spinal reflex not involving the higher brain
regions --- to test whether $\beta_1 = 0$. Under the Wald test, you would
reject that point null hypothesis when $|\hat{\beta}_1|$ exceeds a certain
number of standard deviations. As an intelligent statistician in control of
your own actions, you would read the section on ``statistical significance''
below, before doing any such thing.
All of the above applies, {\em mutatis mutandis}, to $\frac{\hat{\beta}_0 -
\beta_0}{\sehat{\hat{\beta}_0}}$.
\section{Building Confidence Intervals from Sampling Intervals}
Once we know how to calculate sampling intervals, we can plot the sampling
interval for every possible value of $\beta_1$ (Figure
\ref{fig:sampling-intervals}). They're the region marked off by two parallel
lines, one $k(n,\alpha)\sehat{\hat{\beta}_1}$ above the main diagonal and one
equally far below the main diagonal.
\begin{figure}
<<code-for-sampling-intervals, eval=FALSE>>=
lm.sim <- lm(y~x, data=sim.gnslrm(x=x, 5, -2, 0.1, coefficients=FALSE))
hat.sigma.sq <- mean(residuals(lm.sim)^2)
se.hat.beta.1 <- sqrt(hat.sigma.sq/(var(x)*(length(x)-2)))
alpha <- 0.02
k <- qt(1-alpha/2, df=length(x)-2)
plot(0, xlim=c(-3,-1),ylim=c(-3,-1),type="n",
xlab=expression(beta[1]),
ylab=expression(hat(beta)[1]), main="")
abline(a=k*se.hat.beta.1,b=1)
abline(a=-k*se.hat.beta.1,b=1)
abline(a=0,b=1,lty="dashed")
beta.1.star <- -1.73
segments(x0=beta.1.star,y0=k*se.hat.beta.1+beta.1.star,
x1=beta.1.star,y1=-k*se.hat.beta.1+beta.1.star,
col="blue")
@
<<sampling-intervals, echo=FALSE>>=
<<code-for-sampling-intervals>>
@
\caption{Sampling intervals for $\hat{\beta}_1$ as a function of $\beta_1$.
For compatibility with the earlier simulation, I have set $n=42$, $s^2_X =
\Sexpr{signif(var(x),2)}$, and (from one run of the model) $\hat{\sigma}^2 =
\Sexpr{signif(hat.sigma.sq,2)}$; and, just because $\alpha=0.05$ is clich\'{e}d,
$\alpha=\Sexpr{alpha}$. Equally arbitrarily, the blue vertical line
illustrates the sampling interval when $\beta_1 = \Sexpr{beta.1.star}$.}
\label{fig:sampling-intervals}
\end{figure}
The sampling intervals (as in Figure \ref{fig:sampling-intervals}) are
theoretical constructs --- mathematical consequences of the assumptions in the
probability model that (we hope) describes the world. After we gather
data, we can actually calculate $\hat{\beta}_1$. This is a random quantity,
but it will have some particular value on any data set. We can mark
this realized value, and draw a horizontal line across the graph at that
height (Figure \ref{fig:sampling-intervals-plus-hat-beta-1}).
\begin{figure}
<<code-for-sampling-intervals-plus-hat-beta-1, eval=FALSE>>=
plot(0, xlim=c(-3,-1),ylim=c(-3,-1),type="n",
xlab=expression(beta[1]),
ylab=expression(hat(beta)[1]), main="")
abline(a=k*se.hat.beta.1,b=1)
abline(a=-k*se.hat.beta.1,b=1)
abline(a=0,b=1,lty="dashed")
beta.1.hat <- coefficients(lm.sim)[2]
abline(h=beta.1.hat,col="grey")
@
<<sampling-intervals-plus-hat-beta-1, echo=FALSE>>=
<<code-for-sampling-intervals-plus-hat-beta-1>>
@
\caption{As in Figure \ref{fig:sampling-intervals}, but with the addition of a
horizontal line marking the observed value of $\hat{\beta}_1$ on a particular
realization of the simulation (in grey).}
\label{fig:sampling-intervals-plus-hat-beta-1}
\end{figure}
The $\hat{\beta}_1$ we observed is within the sampling interval for some
(possible or hypothetical) values of $\beta_1$, and outside the sampling
interval for others. We define the {\bf confidence set}, with
{\bf confidence level} $1-\alpha$, as
\begin{equation}
\label{eqn:defn-of-ci}
\left\{ \beta_1 ~: ~ \hat{\beta}_1 \in [\beta_1 - k(n,\alpha) \sehat{\hat{\beta}_1}, \beta_1 + k(n,\alpha) \sehat{\hat{\beta}_1}] \right\}
\end{equation}
This is precisely the set of $\beta_1$ which we retain when we run the Wald
test with size $\alpha$\marginpar{Confidence set = Test all the hypotheses!}.
In other words: we test every possible $\beta_1$; if we'd reject that null
hypothesis, that value of $\beta_1$ gets excluded from the confidence set; if
we'd retain that null, $\beta_1$ stays in the confidence set\footnote{Cf.\ the
famous Sherlock Holmes line ``When you have eliminated the impossible,
whatever remains, however improbable, must be the truth.'' In forming the
confidence set, we are eliminating the merely {\em unlikely}, rather than the
absolutely impossible. This is because, not living in a detective story, we
get only noisy and imperfect evidence.}. Figure \ref{fig:confidence-set}
illustrates a confidence set, and shows (unsurprisingly) that in this case the
confidence set is indeed a confidence {\em interval}. Indeed, a little
manipulation of Eq.\ \ref{eqn:defn-of-ci} gives us an explicit formula for the
confidence set, which is an interval:
\[
[\hat{\beta}_1 - k(n,\alpha) \sehat{\hat{\beta}_1}, \hat{\beta}_1 + k(n,\alpha) \sehat{\hat{\beta}_1}]
\]
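With the quantities already computed for the figures (\texttt{lm.sim},
\texttt{k} and \texttt{se.hat.beta.1}), this interval is two lines of R;
\texttt{confint} gives nearly the same answer, the small discrepancy again
coming from the $n$ vs.\ $n-1$ convention in the standard error:
<<eval=FALSE>>=
# Explicit 1-alpha confidence interval for the slope
beta.1.hat <- coefficients(lm.sim)[2]
c(lower=beta.1.hat - k*se.hat.beta.1, upper=beta.1.hat + k*se.hat.beta.1)
confint(lm.sim, parm="x", level=1-alpha)   # R's built-in, for comparison
@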
\begin{figure}
<<code-for-confidence-set, eval=FALSE>>=
plot(0, xlim=c(-3,-1),ylim=c(-3,-1),type="n",
xlab=expression(beta[1]),
ylab=expression(hat(beta)[1]), main="")
abline(a=k*se.hat.beta.1,b=1)
abline(a=-k*se.hat.beta.1,b=1)
abline(a=0,b=1,lty="dashed")
beta.1.hat <- coefficients(lm.sim)[2]
abline(h=beta.1.hat,col="grey")
segments(x0=beta.1.hat-k*se.hat.beta.1, y0=beta.1.hat,
x1=beta.1.hat+k*se.hat.beta.1, y1=beta.1.hat,
col="red")
@
<<confidence-set, echo=FALSE>>=
<<code-for-confidence-set>>
@
\caption{As in Figure \ref{fig:sampling-intervals-plus-hat-beta-1}, but with
the {\bf confidence set} marked in red. This is the collection of all
$\beta_1$ where $\hat{\beta}_1$ falls within the $1-\alpha$ sampling
interval.}
\label{fig:confidence-set}
\end{figure}
The correct interpretation of a confidence set is that it offers us a dilemma.
One of two\footnote{Strictly speaking, there is a third option: our model could
be wrong. Hence the importance of model checking {\em before} doing
within-model inference.} things must be true:
\begin{enumerate}
\item The true $\beta_1$ is inside the confidence set.
\item $\hat{\beta}_1$ is outside the sampling interval of the true $\beta_1$.
\end{enumerate}
We know that the second option has probability at most $\alpha$, no matter what
the true $\beta_1$ is, so we may rephrase the dilemma. Either
\begin{enumerate}
\item The true $\beta_1$ is inside the confidence set, or
\item We're very unlucky, because something whose probability is $\leq \alpha$
happened.
\end{enumerate}
Since, most of the time, we're not very unlucky, the confidence set is,
in fact, a reliable way of giving a margin of error for the true
parameter $\beta_1$.
\paragraph{Width of the confidence interval} Notice that the width
of the confidence interval is $2k(n,\alpha) \sehat{\hat{\beta}_1}$. This
tells us what controls the width of the confidence interval:
\begin{enumerate}
\item As $\alpha$ shrinks, the interval widens. (High confidence comes at the
price of big margins of error.)
\item As $n$ grows, the interval shrinks. (Large samples mean precise
estimates.)
\item As $\sigma^2$ increases, the interval widens. (The more noise there is
around the regression line, the less precisely we can measure the line.)
\item As $s^2_X$ grows, the interval shrinks. (Widely-spread measurements give
us a precise estimate of the slope.)
\end{enumerate}
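Putting these together, Eq.\ \ref{eqn:debiased-se-hats} lets us write the width
of the interval for the slope explicitly as
\[
2 k(n,\alpha) \sehat{\hat{\beta}_1} = \frac{2 k(n,\alpha) \hat{\sigma}}{s_X \sqrt{n-2}}
\]
from which all four effects can be read off.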
\paragraph{What about $\beta_0$?} By exactly parallel reasoning, a
$1-\alpha$ confidence interval for $\beta_0$ is $[\hat{\beta}_0 - k(n,\alpha)\sehat{\hat{\beta}_0}, \hat{\beta}_0 + k(n,\alpha)\sehat{\hat{\beta}_0}]$.
\paragraph{What about $\sigma^2$?} See Exercise \ref{exercise:CI-for-sigma}.
\paragraph{What $\alpha$ should we use?} It's become conventional to set
$\alpha=0.05$. To be honest, this owes more to the fact that the resulting $k$
tends to $1.96$ as $n\rightarrow\infty$, and $1.96 \approx 2$, and most
psychologists and economists could multiply by 2, even in 1950, than to any
genuine principle of statistics or scientific method. A 5\% error rate
corresponds to messing up about one working day in every month, which you might
well find high. On the other hand, there is nothing which stops you from
increasing $\alpha$. It's often illuminating to plot a series of confidence
sets, at different values of $\alpha$.
\paragraph{What about power?} The {\bf coverage} of a confidence set is the
probability that it includes the true parameter value. This is not, however,
the only virtue we want in a confidence set; if it was, we could just say
``Every possible parameter is in the set'', and have 100\% coverage no matter
what. We would also like the {\em wrong} values of the parameter to have a
high probability of {\em not} being in the set. Just as the coverage is
controlled by the size / false-alarm probability / type-I error rate $\alpha$
of the hypothesis test, the probability of excluding the wrong parameters is
controlled by the power / miss probability / type-II error rate. Tests with
higher power exclude (correctly) more parameter values, and give smaller
confidence sets.
\subsection{Confidence Sets and Hypothesis Tests}
I have derived confidence sets for $\beta$ by inverting a specific
hypothesis test, the Wald test. There is a more general relationship between
confidence sets and hypothesis tests.
\begin{enumerate}
\item Inverting any hypothesis test gives us a confidence set.
\item If we have a way of constructing a $1-\alpha$ confidence set, we can
use it to test the hypothesis that $\beta = \beta^*$: reject when $\beta^*$ is outside the confidence set, retain the null when $\beta^*$ is inside the set.
\end{enumerate}
I will leave it as a pair of exercises (\ref{exercise:ci-from-ht} and
\ref{exercise:ht-from-ci}) to show that inverting a test of size $\alpha$ gives a
$1-\alpha$ confidence set, and that inverting a $1-\alpha$ confidence set gives
a test of size $\alpha$.
\subsection{Large-$n$ Asymptotics}
As $n\rightarrow\infty$, $\hat{\sigma}^2 \rightarrow \sigma^2$. It follows (by continuity) that $\sehat{\hat{\beta}} \rightarrow \se{\hat{\beta}}$. Hence,
\[
\frac{\hat{\beta} - \beta}{\sehat{\hat{\beta}}} \rightarrow N(0,1)
\]
which considerably simplifies the sampling intervals and confidence sets; as
$n$ grows, we can forget about the $t$ distribution and just use the standard
Gaussian distribution. Figure \ref{fig:t-dist-vs-Gaussian-dist} plots the
convergence of $k(n,\alpha)$ towards the $k(\infty,\alpha)$ we'd get from the
Gaussian approximation. As you can see from the figure, by the time $n=100$
--- a quite small data set by modern standards --- the difference between the
$t$ distribution and the standard Gaussian is pretty trivial.
\begin{figure}
<<code-for-t-vs-gaussian, eval=FALSE>>=
curve(qt(0.995,df=x-2),from=3,to=1e4,log="x", ylim=c(0,10),
xlab="Sample size (n)", ylab=expression(k(n,alpha)),col="blue")
abline(h=qnorm(0.995),lty="dashed",col="blue")
curve(qt(0.975,df=x-2), add=TRUE)
abline(h=qnorm(0.975),lty="dashed")
curve(qt(0.75,df=x-2), add=TRUE, col="orange")
abline(h=qnorm(0.75), lty="dashed", col="orange")
legend("topright", legend=c(expression(alpha==0.01), expression(alpha==0.05),
expression(alpha==0.5)),
col=c("blue","black","orange"), lty="solid")
@
<>=
<>
@
\caption{Convergence of $k(n,\alpha)$ as $n\rightarrow\infty$, illustrated for
$\alpha = 0.01$, $\alpha=0.05$ and $\alpha=0.5$. (Why do I plot the
$97.5^{\mathrm{th}}$ percentile when I'm interested in $\alpha=0.05$?)}
\label{fig:t-dist-vs-Gaussian-dist}
\end{figure}
\clearpage
\section{Statistical Significance: Uses and Abuses}
\subsection{$p$-Values}
The test statistic for the Wald test,
\[
T = \frac{\hat{\beta}_1 - \beta_1^*}{\sehat{\hat{\beta}_1}}
\]
has the nice, intuitive property that it ought to be close to zero when the
null hypothesis $\beta_1 = \beta_1^*$ is true, and take large values (either
positive or negative) when the null hypothesis is false. When a test
statistic works like this, it makes sense to summarize just how bad
the data looks for the null hypothesis in a {\bf $p$-value}:
when our observed value of the test statistic is $T_{obs}$, the $p$-value is
\[
P = \Prob{|T| \geq |T_{obs}|}
\]
with the probability calculated under the null hypothesis. (I write a capital $P$
here as a reminder that this is a random quantity, though it's conventional to
write the phrase ``$p$-value'' with a lower-case $p$.) This is the
probability, under the null, of getting results which are at least as extreme
as what we saw. It should be easy to convince yourself that rejecting the null
in a level-$\alpha$ test is the same as getting a $p$-value $< \alpha$.
It is not too hard (Exercise \ref{exercise:p-value-is-uniform-under-null})
to show that $P$ has a uniform distribution over $[0,1]$ under the null
hypothesis.
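We can check this uniformity by simulation. The sketch below (with made-up
parameter values) repeatedly generates data from a Gaussian-noise model whose
true slope is exactly zero, and collects the slope's $p$-value from each run;
the histogram should look roughly flat:
<<eval=FALSE>>=
# Simulate p-values for testing beta_1 = 0 when that null is true
p.vals <- replicate(1000, {
  x <- runif(100)
  y <- 5 + rnorm(100, sd=0.1)  # true slope is exactly 0
  coefficients(summary(lm(y ~ x)))[2,4]  # p-value for the slope
})
hist(p.vals)  # should be close to uniform on [0,1]
@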
\subsection{$p$-Values and Confidence Sets}
When our test lets us calculate a $p$-value, we can form a $1-\alpha$
confidence set by taking all the $\beta$'s where the $p$-value is $\geq
\alpha$. Conversely, if we have some way of making confidence sets
already, we can get a $p$-value for the hypothesis $\beta = \beta^*$; it's
the largest $\alpha$ such that $\beta^*$ is in the $1-\alpha$ confidence set.
\subsection{Statistical Significance}
If we test the hypothesis that $\beta_1 = \beta_1^*$ and reject it, we say that
the difference between $\beta_1$ and $\beta_1^*$ is {\bf statistically
significant}. Since, as I mentioned, many professions have an overwhelming
urge to test the hypothesis $\beta_1 = 0$, it's common to hear people say that
``$\beta_1$ is statistically significant'' when they mean ``$\beta_1$'s
difference from 0 is statistically significant''.
This is harmless enough, as long as we keep firmly in mind that ``significant''
is used here as a technical term, with a special meaning, and is {\em not} the
same as ``important'', ``relevant'', etc. When we reject the hypothesis that
$\beta_1 = 0$, what we're saying is ``It's really implausible that we'd see
data like this if the truth were a flat line, as opposed to one with a
slope''. This is informative, if we had serious reason to think that a flat
line was a live option.
It is incredibly common for researchers from other fields, and even some
statisticians, to reason as follows: ``I tested whether $\beta_1 = 0$ or not,
and I retained the null; {\em therefore} $\beta_1$ is {\em in}significant, and
I can ignore it.'' This is, of course, a complete fallacy.
To see why, it is enough to realize that there are (at least) two
reasons why our hypothesis test might retain the null $\beta_1 = 0$:
\begin{enumerate}
\item $\beta_1$ is, in fact, zero,
\item $\beta_1 \neq 0$, but $\sehat{\hat{\beta}_1}$ is so large that we can't
tell anything about $\beta_1$ with any confidence.
\end{enumerate}
There is a very big difference between data which lets us say ``we can be quite
confident that the true $\beta_1$ is, if not perhaps exactly 0, then very
small'', and data which only lets us say ``we have no earthly idea what
$\beta_1$ is, and it may as well be zero for all we can tell''\footnote{Imagine
hearing what sounds like the noise of an animal in the next room. If the
room is small, brightly lit, free of obstructions, and you make a thorough
search of it with unimpaired vision and concentration, not finding an animal
in it is, in fact, good evidence that there was no animal there to be found.
If on the other hand the room is dark, large, full of hiding places, and you
make a hurried search while distracted, without your contact lenses and after
a few too many drinks, you could easily have missed all sorts of things, and
your negative report has little weight as evidence. (In this parable, the
difference between a large $|\beta_1|$ and a small $|\beta_1|$ is the
difference between looking for a Siberian tiger and looking for a little
black cat.)}. It is good practice to always compute a confidence interval,
but it is {\em especially} important to do so when you retain the null, so you
know whether you can say ``this parameter is zero to within such-and-such a
(small) precision'', or whether you have to admit ``I couldn't begin to tell
you what this parameter is''.
\paragraph{Substantive vs.\ statistical significance} Even a huge $\beta_1$,
which it would be crazy to ignore in any circumstance, can be statistically
insignificant, so long as $\sehat{\hat{\beta}_1}$ is large enough. Conversely,
any $\beta_1$ which isn't exactly zero, no matter how close it might be to 0,
will become statistically significant at any threshold once
$\sehat{\hat{\beta}_1}$ is small enough. Since, as $n \rightarrow \infty$,
\[
\sehat{\hat{\beta}_1} \rightarrow \frac{\sigma}{s_X \sqrt{n}}
\]
we can show that $\sehat{\hat{\beta}_1} \rightarrow 0$, and
$\frac{\hat{\beta}_1}{\sehat{\hat{\beta}_1}} \rightarrow \pm\infty$, unless
$\beta_1$ is exactly 0 (see below).
Statistical significance is a weird mixture of how big the coefficient is, how
big a sample we've got, how much noise there is around the regression line, and
how spread out the data is along the $x$ axis. This has so little to do with
``significance'' in ordinary language that it's pretty unfortunate we're stuck
with the word; if the Ancestors had decided to say ``statistically detectable''
or ``statistically distinguishable from 0'', we might have avoided a lot of
confusion.
If {\em you} confuse substantive and statistical significance in this class,
it will go badly for you.
\subsection{Appropriate Uses of $p$-Values and Significance Testing}
I do not want this section to give the impression that $p$-values, hypothesis
testing, and statistical significance are unimportant or necessarily misguided.
They're often used badly, but that's true of every statistical tool from the
sample mean on down the line. There are certainly situations where we really do want to know whether we have good evidence against some {\em exact}
statistical hypothesis, and that's just the job these tools do.
What are some of these situations?
\paragraph{Model checking} Our statistical models often make very strong
claims about the probability distribution of the data, with little wiggle room.
The simple linear regression model, for instance, claims that the regression
function is {\em exactly} linear, and that the noise around this line has {\em
exactly} constant variance. If we test these claims and find very small
$p$-values, then we have evidence that there's a detectable, systematic
departure from the model assumptions, and we should re-formulate the model.
\paragraph{Actual scientific interest} Some scientific theories make very
precise predictions about coefficients. According to Newton, the gravitational
force between two masses is inversely proportional to the {\em square} of the
distance between them, $\propto r^{-2}$. The prediction is exactly $\propto
r^{-2}$, not $\propto r^{-1.99}$ nor $\propto r^{-2.05}$. Measuring that
exponent and finding even tiny departures from $2$ would be big news, if we had
reason to think they were real and not just noise\footnote{In fact, it {\em
was} big news: Einstein's theory of general relativity.}. One of the most
successful theories in physics, quantum electrodynamics, makes predictions
about some properties of hydrogen atoms with a theoretical precision of one
part in a trillion; finding even tiny discrepancies between what the theory
predicts and what we estimate would force us to rethink lots of
physics\footnote{\citet{Feynman-QED} gives a great conceptual overview of
quantum electrodynamics. Currently, theory agrees with experiment to the
limits of experimental precision, which is only about one part in a billion
(\url{https://en.wikipedia.org/wiki/Precision_tests_of_QED}).}. Experiments
to detect new particles, like the Higgs boson, essentially boil down to
hypothesis testing, looking for deviations from theoretical predictions which
should be exactly zero if the particle doesn't exist.
Outside of the natural sciences, however, it is harder to find examples of
interesting, exact null hypotheses which are, so to speak, ``live options''.
The best I can come up with are theories of economic growth and business cycles
which predict that the share of national income going to labor (as opposed to
capital) should be constant over time. Otherwise, in the social sciences,
there's usually little theoretical reason to think that certain regression
coefficients should be {\em exactly} zero, or {\em exactly} one, or anything
else.
\paragraph{Neutral models} A partial exception is the use of {\bf neutral
models}, which comes out of genetics and ecology. The idea here is to check
whether some mechanism is at work in a particular situation --- say, whether
some gene is subject to natural selection. One constructs two models; one
incorporates all the mechanisms (which we think are) at work, including the one
under investigation, and the other incorporates all the {\em other} mechanisms,
but ``neutralizes'' the one of interest. (In a genetic example, the neutral
model would probably incorporate the effects of mutation, sexual reproduction,
the random sampling of which organisms become the ancestors of the next
generation, perhaps migration, etc. The non-neutral model would include all
this {\em plus} the effects of natural selection.) Rejecting the neutral model
in favor of the non-neutral one then becomes evidence that the disputed
mechanism is needed to explain the data.
In the cases where this strategy has been done well, the neutral model is
usually a pretty sophisticated stochastic model, and the ``neutralization'' is
not as simple as just setting some slope to zero. Nonetheless, this is a
situation where we do actually learn something about the world by testing a
null hypothesis.
\section{Any Non-Zero Parameter Becomes Significant with Enough Information}
\label{sec:p-value-is-a-measure-of-sample-size}
(This section is optional, but strongly recommended.)
Let's look more closely at what happens to the test statistic when
$n\rightarrow\infty$, and so at what happens to the $p$-value. Throughout,
we'll be testing the null hypothesis that $\beta_1 = 0$, since this is what
people most often do, but the same reasoning applies to departures from any
fixed value of the slope. (Everything carries over with straightforward
changes to testing hypotheses about the intercept $\beta_0$, too.)
We know that $\hat{\beta}_1 \sim N(\beta_1, \sigma^2/n s^2_X)$. This
means\footnote{If seeing something like $\frac{\sigma}{s_X \sqrt{n}} N(0,1)$
bothers you, feel free to introduce random variables $Z_n \sim N(0,1)$ (though
not necessarily independent ones), and modify the equations accordingly.}
\begin{eqnarray}
\hat{\beta}_1 & \sim & \beta_1 + N(0, \sigma^2/n s^2_X)\\
& = & \beta_1 + \frac{\sigma}{s_X \sqrt{n}} N(0,1)\\
& = & \beta_1 + O(1/\sqrt{n})
\end{eqnarray}
where $O(f(n))$ is read ``order-of $f(n)$'', meaning that it's a term whose
size grows like $f(n)$ as $n\rightarrow\infty$, and we don't want (or need) to
keep track of the details. Similarly, since $n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}$, we have\footnote{Again, feel free to introduce the random variable $\Xi_n$, which just so happens to have a $\chi^2_{n-2}$ distribution.}
\begin{eqnarray}
n\hat{\sigma}^2 & \sim & \sigma^2 \chi^2_{n-2}\\
\hat{\sigma}^2 & \sim & \sigma^2 \frac{\chi^2_{n-2}}{n}
\end{eqnarray}
Since $\Expect{\chi^2_{n-2}} = n-2$ and $\Var{\chi^2_{n-2}} = 2(n-2)$,
\begin{eqnarray}
\Expect{\frac{\chi^2_{n-2}}{n}} & = & \frac{n-2}{n} \rightarrow 1\\
\Var{\frac{\chi^2_{n-2}}{n}} & = & \frac{2(n-2)}{n^2} \rightarrow 0
\end{eqnarray}
with both limits happening as $n\rightarrow\infty$. In
fact $\Var{\frac{\chi^2_{n-2}}{n}} = O(1/n)$, so
\begin{equation}
\hat{\sigma}^2 = \sigma^2\left(1+O(1/\sqrt{n})\right)
\end{equation}
Taking the square root, and using the fact\footnote{From the binomial theorem,
back in high school algebra.} that $(1+x)^a \approx 1+ax$ when $|x| \ll 1$,
\begin{equation}
\hat{\sigma} = \sigma\left(1+O(1/\sqrt{n})\right)
\end{equation}
Put this together to look at our test statistic:
\begin{eqnarray}
\frac{\hat{\beta}_1}{\sehat{\hat{\beta}_1}} & = & \frac{\beta_1 + O(1/\sqrt{n})}{\frac{\sigma\left(1+O(1/\sqrt{n})\right)}{s_X \sqrt{n}}}\\
& = & \sqrt{n}\frac{\beta_1 + O(1/\sqrt{n})}{(\sigma/s_X)\left(1+O(1/\sqrt{n})\right)}\\
& = & \sqrt{n}\frac{\beta_1}{\sigma/s_X}\left(1+O(1/\sqrt{n})\right)\\
& = & \sqrt{n}\frac{\beta_1}{\sigma/s_X} + O(1)
\end{eqnarray}
In words: so long as the true $\beta_1 \neq 0$, the test statistic is going to
go off to $\pm \infty$, and the rate at which it escapes towards infinity is
going to be proportional to $\sqrt{n}$. When we compare this against the
null distribution, which is $N(0,1)$, eventually we'll get arbitrarily
small $p$-values.
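A quick simulation illustrates this escape to infinity. In the sketch below
(with made-up parameter values), quadrupling $n$ should roughly double the
typical size of the test statistic:
<<eval=FALSE>>=
# Growth of the t statistic with n when beta_1 = 0.5 (not zero)
for (n in c(100, 400, 1600, 6400)) {
  x <- runif(n)
  y <- 0.5*x + rnorm(n)
  print(coefficients(summary(lm(y ~ x)))[2,3])  # the t statistic
}
@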
We can actually compute what those $p$-values should be, by two bounds on the
standard Gaussian distribution\footnote{See \citet{Feller-prob-theory}, Chapter
VII, \S 1, Lemma 2. For a brief proof online, see
\url{http://www.johndcook.com/normalbounds.pdf}.}:
\begin{equation}
\label{eqn:std-gaussian-bounds} \left(\frac{1}{x}-\frac{1}{x^3}\right)\frac{1}{\sqrt{2\pi}}e^{-x^2/2} < 1 - \Phi(x) < \frac{1}{x}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}
\end{equation}
Thus
\begin{eqnarray}
P_n & = & \Prob{|Z| \geq \left| \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{n}s_X} \right| }\\
& = & 2 \Prob{Z \geq \left| \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{n}s_X} \right| }\\
& \leq & \frac{2}{\sqrt{2\pi}}\frac{e^{-\frac{1}{2} \frac{\hat{\beta}_1^2}{\hat{\sigma}^2/n s^2_X}}}{\left| \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{n}s_X} \right|}
\end{eqnarray}
To clarify the behavior, let's take the logarithm and divide by $n$:
\begin{eqnarray}
\frac{1}{n}\log{P_n} & \leq & \frac{1}{n}\log{\frac{2}{\sqrt{2\pi}}}\\
\nonumber & & - \frac{1}{n}\log{\left| \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{n}s_X} \right| }\\
\nonumber & & -\frac{1}{2n}\frac{\hat{\beta}_1^2}{\hat{\sigma}^2/ns^2_X}\\
& = & \frac{1}{n}\log{\frac{2}{\sqrt{2\pi}}}\\
\nonumber & & - \frac{1}{n}\log{\left|\frac{\hat{\beta}_1}{\hat{\sigma}/s_X}\right|}\\
\nonumber & & - \frac{\log{n}}{2n}\\
\nonumber & & -\frac{\hat{\beta}_1^2}{2 \hat{\sigma}^2/s_X^2}
\end{eqnarray}
Take the limit as $n\rightarrow\infty$:
\begin{eqnarray}
\lim_{n\rightarrow\infty}{\frac{1}{n}\log{P_n}} & \leq & \lim_{n}{\frac{1}{n}\log{\frac{2}{\sqrt{2\pi}}}}\\
\nonumber & & - \lim_{n}{\frac{1}{n}\log{\left|\frac{\hat{\beta}_1}{\hat{\sigma}/s_X}\right|}}\\
\nonumber & & -\lim_{n}{\frac{\log{n}}{2n}}\\
\nonumber & & -\lim_{n}{\frac{\hat{\beta}_1^2}{2 \hat{\sigma}^2/s_X^2}}
\end{eqnarray}
Since $\hat{\beta}_1/(\hat{\sigma}/s_X) \rightarrow \beta_1/(\sigma/s_X)$,
and $n^{-1}\log{n}\rightarrow 0$,
\begin{eqnarray}
\label{eqn:upper-p-value-bound} \lim_{n\rightarrow\infty}{\frac{1}{n}\log{P_n}} & \leq & - \frac{\beta_1^2}{2\sigma^2/s^2_X}
\end{eqnarray}
I've only used the upper bound on $1-\Phi(x)$ from Eq.\
\ref{eqn:std-gaussian-bounds}; if we use the lower bound from that equation, we
get (Exercise \ref{exercise:lower-p-value-bound})
\begin{equation}
\label{eqn:lower-p-value-bound} \lim_{n\rightarrow\infty}{\frac{1}{n}\log{P_n}} \geq - \frac{\beta_1^2}{2\sigma^2/s^2_X}
\end{equation}
Putting the upper and lower limits together,
\[
\lim_{n\rightarrow\infty}{\frac{1}{n}\log{P_n}} = - \frac{\beta_1^2}{2\sigma^2/s^2_X}
\]
Turn the limit around: at least for large $n$,
\begin{equation}
\label{eqn:asymptotic-p-values} P_n \approx e^{-n\frac{\beta_1^2}{2\sigma^2/s^2_X}}
\end{equation}
Thus, {\em any} $\beta_1 \neq 0$ will (eventually) give exponentially small
$p$-values. This is why, as a saying among statisticians has it, ``the
$p$-value is a measure of sample size'': any non-zero coefficient will become
arbitrarily statistically significant with enough data. This is just another
way of saying that with enough data, we can (and will) detect even arbitrarily
small coefficients, which is what we {\em want}. The flip-side, however, is
that it's just senseless to say that one coefficient is important because it
has a really small $p$-value and another is unimportant because it's got a big
$p$-value. As we can see from Eq.\ \ref{eqn:asymptotic-p-values}, the
$p$-value runs together the magnitude of the coefficient ($|\beta_1|$), the
sample size ($n$), the noise around the regression line ($\sigma^2$), and how
spread out the data is along the $x$ axis ($s^2_X$); the last three matter
because they control how precisely we can estimate $\beta_1$. Saying ``this
coefficient must be really important, because we can measure it really
precisely'' is not smart.
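To get a feel for how fast these $p$-values shrink, we can plug illustrative
numbers into the asymptotic approximation above. Taking (arbitrarily)
$\beta_1 = 0.1$, $\sigma = 1$ and $s_X = 1$:
<<eval=FALSE>>=
# log10 of the approximate p-value, -n*beta_1^2/(2*sigma^2/s_X^2)/log(10)
n <- c(1e3, 1e4, 1e5)
-n * 0.1^2 / 2 / log(10)  # roughly -2, -22, -217: exponentially small
@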
\section{Confidence Sets and $p$-Values in R}
When we estimate a model with \texttt{lm}, R makes it easy for us to extract the
confidence intervals of the coefficients:
<<eval=FALSE>>=
confint(object, level=0.95)
@
Here \texttt{object} is the name of the fitted model object, and \texttt{level}
is the confidence level; if you want 95\% confidence, you can omit that
argument. For instance:
<<>>=
library(gamair); data(chicago)
death.temp.lm <- lm(death ~ tmpd, data=chicago)
confint(death.temp.lm)
confint(death.temp.lm, level=0.90)
@
If you want $p$-values for the coefficients\footnote{And, really, why do you?},
those are conveniently computed as part of the \texttt{summary} function:
<<>>=
coefficients(summary(death.temp.lm))
@
Notice how this actually gives us an array with four columns: the point
estimate, the standard error, the $t$ statistic, and finally the $p$-value.
Each row corresponds to a different coefficient of the model. If we want,
say, the $p$-value of the intercept, that's
<<>>=
coefficients(summary(death.temp.lm))[1,4]
@
The summary function will also print out a {\em lot} of information
about the model:
<<>>=
summary(death.temp.lm)
@
As my use of \texttt{coefficients(summary(death.temp.lm))} above suggests,
the \texttt{summary} function actually returns a complex object,
which can be stored for later access, and printed. Controlling
how it gets printed is done through the \texttt{print} function:
<<>>=
print(summary(death.temp.lm), signif.stars=FALSE, digits=3)
@
Here I am indulging in two of my pet peeves. It's been conventional (at least
since the 1980s) to decorate this sort of regression output with stars beside
the coefficients which are significant at various traditional levels. Since
(as we've just seen at tedious length) statistical significance has almost
nothing to do with real importance, this just clutters the print-out to no
advantage\footnote{In fact, I strongly recommend running \texttt{options(show.signif.stars=FALSE)} at the beginning of your R script, to turn off the stars
forever.}. Also, \texttt{summary} has a bad habit of using far more
significant\footnote{A different sense of ``significant''!} digits than
is justified by the precision of the estimates; I've reined that in.
\subsection{Coverage of the Confidence Intervals: A Demo}
Here is a little computational demonstration of how the confidence interval for
a parameter is a random quantity, and how it covers the true parameter value
with the probability we want. I'll repeat many simulations of the model from
Figure \ref{fig:sampling-distribution-of-hat-beta-1}, calculate the confidence
interval on each simulation, and plot those. I'll also keep track of how
often, in the first $m$ simulations, the confidence interval covers the truth;
this should converge to $1-\alpha$ as $m$ grows.
\begin{figure}
<<CI-coverage-demo-code, echo=FALSE>>=
# Run 1000 simulations and get the confidence interval from each
CIs <- replicate(1000, confint(lm(y~x,data=sim.gnslrm(x=x,5,-2,0.1,FALSE)))[2,])
# Plot the first 100 confidence intervals; start with the lower limits
plot(1:100, CIs[1,1:100], ylim=c(min(CIs),max(CIs)),
  xlab="Simulation number", ylab="Confidence limits for slope")
# Now the upper limits
points(1:100, CIs[2,1:100])
# Draw line segments connecting them
segments(x0=1:100, x1=1:100, y0=CIs[1,1:100], y1=CIs[2,1:100], lty="dashed")
# Horizontal line at the true coefficient value
abline(h=-2, col="grey")
@
<<eval=FALSE>>=
<<CI-coverage-demo-code>>
@
\end{figure}
\begin{figure}
<<CI-coverage-proportion-code, echo=FALSE>>=
# For each simulation, check whether the interval covered the truth
covered <- (CIs[1,] <= -2) & (CIs[2,] >= -2)
# Calculate the cumulative proportion of simulations where the interval
# contained the truth, plot vs. number of simulations.
plot(1:length(covered), cumsum(covered)/(1:length(covered)),
  xlab="Number of simulations",
  ylab="Sample coverage proportion", ylim=c(0,1))
abline(h=0.95, col="grey")
@
<<eval=FALSE>>=
<<CI-coverage-proportion-code>>
@
\end{figure}
\clearpage
\section{Further Reading}
There is a lot of literature on significance testing and $p$-values. They are
often quite badly abused, leading to a harsh reaction against them, which in
some cases goes as badly wrong as the abuses being complained of\footnote{Look,
for instance, at the exchange between
\citet{McCloskey-secret-sins,McCloskey-Ziliak-standard-error-of-regressions}
and \citet{Hoover-and-Siegler-contra-McCloskey}.}. I find the work of
D. Mayo and collaborators particularly useful here
\citep{Mayo-error,Mayo-Cox-frequentist,Mayo-Spanos-post-data}. You may also
want to read \url{http://bactra.org/weblog/1111.html}, particularly
if you find \S \ref{sec:p-value-is-a-measure-of-sample-size} interesting,
or confusing.
The correspondence between confidence sets and hypothesis tests goes back to
\citet{Neyman-1937}, which was the first formal, conscious introduction of
confidence sets. (As usual, there are precursors.) That every confidence set
comes from inverting a hypothesis test is a classical result in statistical
theory, which can be found in, e.g., \citet{Casella-Berger-2nd}. (See also
Exercises \ref{exercise:ci-from-ht} and \ref{exercise:ht-from-ci} below.)
Some confusion on this point seems to arise from people not realizing that ``does $\hat{\beta}_1$ fall inside the sampling interval for $\beta_1^*$?'' is a
test of the hypothesis that $\beta_1 = \beta_1^*$.
In later lectures, we will look at how to get confidence sets for multiple
parameters at once (when we do multiple linear regression), and how to get
confidence sets by simulation, without assuming Gaussian noise (when
we introduce the bootstrap).
\section*{Exercises}
To think through or to practice on, not to hand in.
\begin{enumerate}
\item \label{exercise:CI-for-sigma} {\em Confidence interval for $\sigma^2$:}
Start with the observation that $n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}$.
\begin{enumerate}
\item Find a formula for the $1-\alpha$ sampling interval for
$\hat{\sigma}^2$, in terms of the CDF of the $\chi^2_{n-2}$ distribution,
$\alpha$, $n$ and $\sigma^2$. (Some of these might not appear in your
answer.) Is the width of your sampling interval the same for all
$\sigma^2$, the way the width of the sampling interval for $\hat{\beta}_1$
doesn't change with $\beta_1$?
\item Fix $\alpha=0.05$, $n=40$, and plot the sampling intervals against
$\sigma^2$.
\item Find a formula for the $1-\alpha$ confidence interval for $\sigma^2$,
in terms of $\hat{\sigma}^2$, the CDF of the $\chi^2_{n-2}$ distribution,
$\alpha$ and $n$.
\end{enumerate}
\item \label{exercise:ci-from-ht} Suppose we start with a way of testing the
  hypothesis $\beta=\beta^*$ which can be applied to any $\beta^*$, and which
  has size (false alarm / type I error) probability $\alpha$ for every $\beta^*$.
  Show that the set of $\beta$ retained by these tests is a confidence set,
with confidence level $1-\alpha$. What happens if the size is $\leq \alpha$
for all $\beta^*$ (rather than exactly $\alpha$)?
\item \label{exercise:ht-from-ci} Suppose we start from a way of creating
confidence sets which we know has confidence level $1-\alpha$. We test the
hypothesis $\beta=\beta^*$ by rejecting when $\beta^*$ is outside the
confidence set, and retaining when $\beta^*$ is inside the confidence set.
Show that the size of this test is $\alpha$. What happens if the initial
confidence level is $\geq 1-\alpha$, rather than exactly $1-\alpha$?
\item \label{exercise:p-value-is-uniform-under-null} Prove that the $p$-value
$P$ is uniformly distributed under the null hypothesis. You may, throughout,
assume that the test statistic $T$ has a continuous distribution.
\begin{enumerate}
\item Show that if $Q \sim \mathrm{Unif}(0,1)$, then $P=1-Q$ has the same
distribution.
\item Let $X$ be a continuous random variable with CDF $F$. Show that $F(X)
\sim \mathrm{Unif}(0,1)$. {\em Hint:} the CDF of the uniform distribution
$F_{\mathrm{Unif}(0,1)}(x) = x$.
\item Show that $P$, as defined, is $1-F_{|T|}(|T_{obs}|)$.
\item Using the previous parts, show that $P \sim \mathrm{Unif}(0,1)$.
\end{enumerate}
\item \label{exercise:lower-p-value-bound} Use Eq.\
\ref{eqn:std-gaussian-bounds} to show Eq.\ \ref{eqn:lower-p-value-bound},
following the derivation of Eq.\ \ref{eqn:upper-p-value-bound}.
\end{enumerate}
\bibliography{locusts}
\bibliographystyle{crs}
\end{document}