\documentclass{article}
\usepackage{amsmath,amsfonts}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage[font={it,small},labelfont={sc,small}]{caption}
\usepackage{datetime}
\usepackage{fancyhdr}
% For putting the revision date on the bottom of each page
% Needs fancyhdr, datetime packages
\pagestyle{fancy}
\fancyhead{}
\fancyhead[LO,RE]{\thepage}
\fancyhead[RO,LE]{\rightmark}
\fancyfoot{}
\fancyfoot[CO,CE]{\currenttime\ \today}
\renewcommand{\headrulewidth}{0pt}
\fancypagestyle{plain}{
\fancyhf{}
\fancyhead[C]{\currenttime\ \today\\ \scriptsize{See updates and corrections at \url{http://www.stat.cmu.edu/~cshalizi/mreg/}}}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\begin{document}
\title{Lecture 3: Introducing Statistical Modeling}
\author{36-401, Fall 2015, Section B}
\date{8 September 2015}
\maketitle
<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size="small",background="white", highlight=FALSE, cache=TRUE, autodep=TRUE, tidy=TRUE, tidy.opts=list(comment=FALSE))
@
\section{A Motivating Example}
Let's start this off with a motivating example\footnote{I learned about this
data from a paper by \citet{Bettencout-et-al-growth-innovation-scaling}, but
I think their data analysis is deeply flawed.}. We'll begin by loading some
data which comes from the Bureau of Economic Analysis, on the economic output
of cities in the U.S. (\url{http://www.bea.gov/regional/gdpmetro/}).
<<>>=
bea <- read.csv("http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/03/bea-2006.csv")
@
For each city --- more precisely, each ``Metropolitan Statistical Area'', which
ignores legal divisions of cities and counties and instead is based on patterns
of commuting --- this records the name of the city, its population, its
per-capita ``gross metropolitan product'' in 2006 (the total value of goods and
services produced), and the share of the economy coming from four selected
industries.
<<>>=
dim(bea) # Should be 366 rows and 7 columns, and it is
head(bea) # Look at the beginning of the data
@
Let's add a new column, which records the {\em total} GMP, by multiplying the
output per person by the number of people:
<<>>=
bea$gmp <- bea$pcgmp * bea$pop
@
And now let's look at this visually:
<<>>=
plot(gmp ~ pop, data=bea, xlab="Population", ylab="Total GMP")
@
The \verb_plot_ command, naturally enough, makes plots. The first argument to
it tells it what to plot: here, we're telling it to plot \verb_gmp_ as a
function of \verb_pop_. (The tilde sign is used in such ``formulas'' to
indicate that what goes on the left is being treated as a function of what's on
the right.) The next argument tells R where to look up the variables in
the formula: in the data frame \verb_bea_ that we just loaded. The other
two arguments give the axis some sensible labels.
This plot shows, unsurprisingly, that larger cities have larger total economic
outputs. What is more remarkable is how closely the points fall around a
straight line. To see this, we'll re-plot the points, and use the function
\verb_abline_ to add a straight line. To do this we need an intercept (the
\verb_a_) and a slope (the \verb_b_). A reasonable guess for the intercept is
0 (since presumably a city with no inhabitants has no economy). One could
reasonably guess the slope at $4\times{10}^4$ dollars/person, say by noticing
the city with a population of about ten million (as it happens, Chicago) and
seeing where it falls on the vertical axis.
<<>>=
plot(gmp ~ pop, data=bea, xlab="Population", ylab="Total GMP")
abline(a=0,b=4e4)
@
This isn't bad at all, but it looks like it's systematically too low for the
larger cities. This is {\em suggestive} that there may be differences between
the economies of large and small cities. Let's explore this by looking
at the per-capita figures.
<<>>=
plot(pcgmp ~ pop, data=bea, xlab="Population", ylab="Per-capita GMP")
@
At this point, it becomes annoying that the larger cities in the US are so much
larger than the small ones. By using a linear scale for the horizontal axis,
we devote most of the plot to empty space around New York, Los Angeles and
Chicago, which makes it harder to see if there is any trend. A useful trick is
to switch to a logarithmic scale for that axis, where equal distances
correspond to equal {\em multiples} of population.
<<>>=
plot(pcgmp ~ pop, data=bea, xlab="Population", ylab="Per-capita GMP",
log="x")
@
Two things are noticeable from this plot. First, there is a wide range of
per-capita GMP for smaller cities, which narrows as population grows. Second,
there seems to be an increasing trend, or at least an increasing lower limit.
Let's restore the previous plot, but make it a bit less visually
cluttered.
<<>>=
# alter the plotting symbol from a hollow circle to a filled dot (pch)
# shrink the plotting symbols by a factor of 0.5 (cex)
plot(pcgmp ~ pop, data=bea, xlab="Population", ylab="Per-capita GMP",
pch=19, cex=0.5)
@
Let's now calculate our first regression line. R has a function for
estimating linear models, with which we'll become very familiar:
<<>>=
lm(pcgmp ~ pop, data=bea)
@
The first argument to \verb_lm_ is a formula, telling R which variable we're
trying to predict (the one on the left, here \verb_pcgmp_), and which variable
we're trying to predict it from (the one on the right, here \verb_pop_), and
what data set to take those variables from (\verb_bea_ again).\footnote{Much
more complicated formulas are possible, but this will do for now.} R then
{\em estimates} the coefficients of the best linear predictor --- we will
see later how it does this --- and returns those coefficients, along with
a lot of other stuff which is invisible in this view.
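We can get at that other stuff by storing the output of \verb_lm_ in a variable (here called \verb_fit_, a name of our own choosing) and examining it:
<<>>=
fit <- lm(pcgmp ~ pop, data=bea)
coefficients(fit)  # just the estimated intercept and slope
names(fit)         # the other components stored in the fitted object
@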
The \verb_abline_ command is smart enough to get an intercept and a slope
from the output of \verb_lm_, so we can use it to decorate the plot:
<<>>=
plot(pcgmp ~ pop, data=bea, xlab="Population", ylab="Per-capita GMP",
pch=19, cex=0.5)
abline(lm(pcgmp ~ pop, data=bea), col="blue")
@
But why should we believe that line? R does it, but I hope by this point in
your life you don't think ``the computer says it, so it must be right'' is ever a good idea. Why prefer that blue line over this grey line, or the green one, or for that matter over the orange curve?
<<>>=
plot(pcgmp ~ pop, data=bea, xlab="Population", ylab="Per-capita GMP",
pch=19, cex=0.5)
abline(lm(pcgmp ~ pop, data=bea), col="blue")
abline(a=4e4, b=0, col="grey")
abline(a=2e4, b=4e4/2e7, col="green")
lines(smooth.spline(x=bea$pop, y=bea$pcgmp, cv=TRUE), col="orange")
@
Why do we want to fit any lines here at all?
\section{What Are Statistical Models For?}
One basic use for these lines is as {\bf summaries}. There is a lot of detail
in those figures --- each one requires $366 \times 2 = 732$ numbers. This is a
lot of information to keep track of; it makes our heads hurt. In many
situations, we'd rather ignore all of that precise detail and get away with a
summary; we might want to {\em compress} the 732 numbers into just an intercept
and a slope that describe the general trend or over-all shape of the data.
There would be lots of ways to do that, and which ones would make sense would
depend on how we want to use the compressed summary. We might, for instance,
aim at being able to recover the original data with minimal error from the
summary, so we'd want a line which came close to the original points. But we
might also decide that it was more important for the line to come closer to
large cities than small ones (since they contain so many more people), etc.,
etc.
There is nothing wrong with data compression --- it is, in fact, one of the
fundamental technologies of modern life --- but I have little more to say about
it (here). The one thing I will say is that {\em anything} you can calculate
from the data could, in principle, be used as a summary. Even if the
calculation was originally inspired by doing some sort of statistical
inference, every statistic can be a descriptive statistic.
If we want to go beyond describing, summarizing or compressing the data, we
enter the realm of {\bf inference} --- we try to reach out and extend our
knowledge from the data we have, to other variables we have not measured, or
not measured so directly. This is inherently somewhat risky, imprecise, and
uncertain. In statistics, we aim not only to draw such inferences, but to say
something about the level of risk, imprecision, and uncertainty which
accompanies them.
You have, by this point in your educations, been thoroughly schooled in one way
in which inferences can be subject to uncertainty: when the data is just a
sample of a larger population, and we want to extrapolate from the sample to
the population. In fact, many people get taught that this is the only sort
of uncertainty statistics can handle.
If that were true, there would be no role for statistics in dealing with this
data set. It isn't any sort of sample at all --- every city in the US in 2006
really is in there. So why, then, does it seem wrong to say that the slope of
the optimal linear predictor is {\em exactly}
\Sexpr{coefficients(lm(pcgmp~pop,data=bea))[2]}? There are at least two
reasons.
One reason is that while we have measurements on the complete population, those
measurements are themselves subject to error. In this case, while the people
at the BEA try very hard to provide reliable numbers, their figures are the
result of a complicated process, at which error can creep in at many points,
from mistakes, accidents, deliberate lies, possibly-incorrect assumptions made
at various stages, the use of random sampling in some steps, etc.,
etc.\footnote{This is, among other things, a reason to want to compress the
data, rather than just memorizing it: some part of the data is just wrong.}
Generally, just about every process of measurement is subject to some error:
there are flaws in the measurement instrument or process, or we measure some
imperfect proxy for what we're really interested in, or the measurement is
perturbed by unrepeatable, irrelevant disturbances. In fact, much of the
theory of mathematical statistics generally, and linear models specifically,
was developed in the 19th century by astronomers and other physical scientists
to quantify, and where possible reduce, the impacts of measurement error.
Some sources of measurement error are {\bf systematic}: they tend to distort
the measurement in predictable, repeatable ways. They may make the measured
value larger than it should be, or smaller, or might shrink extreme values
towards more mediocre ones, etc. Ideally, these systematic errors should be
identified and modeled, so we can adjust for them. Other sources of error are
(purely or merely) {\bf statistical}: they are directionless and do not repeat,
but the {\em distribution} of errors is predictable and stable; we can handle
them with probability theory.
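The distinction can be seen in a toy simulation, where all the numbers are made up: averaging many measurements washes out the statistical errors, but leaves the systematic one untouched.
<<>>=
true.value <- 100
# systematic error: always +3; statistical error: mean-0 noise with sd 10
measurements <- true.value + 3 + rnorm(1000, mean=0, sd=10)
mean(measurements)  # close to 103, not 100: the bias survives averaging
@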
A second reason why it's reasonable to do statistical inference on a complete
data set is that even the real values are somewhat accidental, and we'd like to
know the general, underlying trends or relationships. Take the BEA data again: the {\em exact} values of
the GMPs of all American cities in 2006 were to some extent the results of
accidents, of chance, unrepeatable causes, and this would still be true even if
we could measure those exact GMPs without error. To be really concrete for a
moment, the city with the highest per-capita income in the data set is
\Sexpr{as.character(bea[which.max(bea[,"pcgmp"]),1])}. This is a center for
insurance companies, banks and hedge funds. Exactly how much money they made
in 2006 depended on things like just how many hurricanes there were in Florida,
wild fires in California, mortgages from Cleveland and Phoenix sold to small
towns in Germany, etc., etc. Even if one thinks that there is, ultimately,
some sort of deterministic explanation for all of these quantities, they're
plainly very far removed from any general relationship between a city's
population and its economic productivity. They really happen --- they are not
just measurement errors\footnote{``As I regularly find myself having to remind
cadet risk managers with newly-minted PhDs in financial econometrics, the
Great Depression did actually happen; it wasn't just a particularly
inaccurate observation of the underlying 4\% rate of return on equities.''
(\url{http://d-squareddigest.blogspot.com/2006/09/tail-events-phrase-considered-harmful.html})}
--- but they could easily have happened a bit differently. Long experience has
shown that the {\em distribution} of these accidents is often stable, and can
be modeled with probability theory\footnote{In fact, there are precise
mathematical senses in which sufficiently complicated deterministic processes
end up looking just like random ones. If this intrigues you, see
\citet{Ruelle-chance} and \citet{Smith-on-chaos}.}. When that happens, they
are often called by the more respectable name of {\bf
fluctuations}.
To sum up: whether it's due to sampling, or measurement error, or fluctuations,
we often have good reason to think that our data could have been more or less
different.
If we re-ran the experiment, or ``re-wound the tape of history'' (S. J. Gould),
the results would not have been quite the same. This means that any statistic
we calculate from the data would have been more or less different as well.
When we try to quantify uncertainty, we want to know how different our
calculations could have been.
To say anything useful here, we will need to make assumptions. Without {\em
some} assumptions, we can't really say anything at all about how different
the data could, plausibly, have been. In statistics, a lot of accumulated
experience says that useful assumptions generally take the form of {\bf
statistical models} or {\bf probability models}.
In a statistical model, we act as though the variables we measure, and possibly
others we don't measure, are random variables. (We ``model them as random
variables.'') The {\bf specification} of a statistical model says what the
random variables are, and lays down more or less narrow restrictions on their
distributions and how they relate to each other. Here, for instance, are some
{\em conceivable} statistical models for the BEA data. In all of them,
$X$ stands for a city's population and $Y$ for its per-capita GMP.
\begin{enumerate}
\item $X \sim N(6.81\times{10}^5,2.42\times{10}^{12})$; $Y|X \sim N(4.00\times{10}^4,8.50\times{10}^7)$; $X$ independent across cities; $Y$ independent across cities given their $X$'s.
\item $X \sim N(\mu_X, \sigma^2_X)$ for some mean $\mu_X$ and variance $\sigma^2_X$; $Y|X \sim N(4.00\times{10}^4,8.50\times{10}^7)$; $Y$ independent across cities given their $X$'s.
\item $X \sim N(\mu_X, \sigma^2_X)$ for some mean $\mu_X$ and variance $\sigma^2_X$; $Y|X \sim N(4.00\times{10}^4,\sigma^2_Y)$ for some variance $\sigma^2_Y$; $Y$ independent across cities given their $X$'s.
\item distribution of $X$ unspecified; $Y|X \sim N(4.00\times{10}^4, \sigma^2_Y)$ for some $\sigma^2_Y$; $Y$ independent across cities given their $X$'s.
\item distribution of $X$ unspecified; $Y|X \sim N(\beta_0 + \beta_1 x, \sigma^2_Y)$ for some $\beta_0, \beta_1, \sigma^2_Y$; $Y$ independent across cities given their $X$'s.
\item distribution of $X$ unspecified; $\Expect{Y|X=x} = \beta_0 + \beta_1 x$
for some $\beta_0$ and $\beta_1$; $\Var{Y|X=x} = \sigma^2_Y$ for some $\sigma^2_Y$; $Y$ independent across cities given their $X$'s.
\item distribution of $X$ unspecified; $\Expect{Y|X=x} = \beta_0 + \beta_1 x$
for some $\beta_0$ and $\beta_1$; $Y$ uncorrelated across cities given their $X$'s.
\end{enumerate}
As we go down this list of models, we make weaker and weaker assumptions about
the process which generated the data. This means that, with the same data, we
can infer less and less about that data-generating process. The very first
model specifies a single, complete, unambiguous distribution for the data. If
we assumed that model was true, we could then make all sorts of assertions
about the distribution of city population and economic output, without even
having to look at the data at all. Later models in the list leave more about
the data-generating process undetermined, so we can use the data to estimate
those parts of the model (e.g., the mean and variance of city populations in
model 2, or the slope $\beta_1$ in models from 5 on), or test hypotheses about
them, etc. Because a model specifies how the data variables $X$ and $Y$ are
distributed, and any statistics we calculate are functions of $X$ and $Y$, a
model also, implicitly, tells us how those statistics are
distributed\footnote{Whether we can work out that distribution in a nice closed
form is another question, but mathematically it exists, and there are ways to
cope when there's no closed form, as we'll see later in this class, and in
greater detail in 402.}. These distributions of the statistics are what
we'll use to quantify the uncertainty in our inferences.
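To make the first model's completeness concrete, here is a small simulation from it; since model 1 pins down every distribution, we need no data at all to generate a fake version of all 366 cities. (The simulation is purely illustrative; note that \verb_rnorm_ wants a standard deviation, while the model states variances.)
<<>>=
# Simulate all 366 cities from model 1; no real data needed
n <- 366
x.sim <- rnorm(n, mean=6.81e5, sd=sqrt(2.42e12))
y.sim <- rnorm(n, mean=4.00e4, sd=sqrt(8.50e7))  # Y|X doesn't depend on X here
summary(y.sim)
@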
The stronger the assumptions we make, the stronger the inferences we can draw,
{\em if the assumptions are true}. There is no virtue to strong conclusions
which rest on faulty premises. Therefore: we are going to go over how to draw
these inferences, but we are also going to go over checking model assumptions.
When confronting a data analysis problem, you first need to {\bf formulate} a
statistical model. This is partly about absorbing what's already known about
the subject\footnote{For instance, economists and geographers have long known
about an ``urban wage premium'', where otherwise-similar workers in bigger
cities get paid more \citep{Thompson-preface-to-urban-econ}.}, partly about
looking for similar problems others have dealt with and seeing what you can
learn from them (i.e., analogy and tradition), and partly about being inspired
by initial explorations of the data. Once you have a model, the two key tasks
are {\bf inference} within the model, and {\bf checking} the model.
Inference within the model is about doing calculations which presume the
model's assumptions are true: estimation, or prediction, or hypothesis testing,
or confidence intervals, or what-have-you. This is usually what people
actually want out of the data analysis. Unfortunately, these inferences
are only as good as the modeling assumptions that they're based on.
Model checking, on the other hand, is about seeing whether the assumptions are
really true. This includes formal goodness-of-fit testing, but also various
``specification tests'' (some of which we will cover), the less formal checks
called ``diagnostics'', and sheer judgment, often aided by visualizations. If
the assumptions of the model we started with turn out to be wrong, we need to
go back and revise the model, either replacing the faulty assumptions with
others, or weakening them.
People usually put within-model inference before model checking --- that's
the order our textbook uses --- but that's more because teachers, and students,
are generally more comfortable with the more cut-and-dried topic of inference.
That topic is extremely formalized and mathematical, with lots of theory to
guide us. In fact, for lots of inference problems there is an unambiguous
optimal procedure, which we should follow. Model checking, on the other hand,
is much less formalized, mathematical and algorithmic than inference, and very
little about it can be said to be definitely optimal or The Right Way To Do It.
Nonetheless, {\em assumption checking is much more important}.
Impressive-seeming inferences from strong-but-wrong assumptions don't actually
tell us anything about the world, and are useless, no matter how much technical
skill they might demonstrate. When reading other people's data analyses,
you should get into the habit of paying very close attention to how they
check their models, and you should apply that same habit to yourself.
\section{The Simple Linear Regression Model}
To make this philosophizing a bit more concrete, let's introduce the most basic
of all statistical models that is actually useful for anything, the {\bf simple
linear regression model}. This is a model with two random variables, $X$ and
$Y$, where we are trying to predict $Y$ from $X$. Here are the model's
assumptions:
\begin{enumerate}
\item The distribution of $X$ is unspecified, possibly even deterministic;
\item $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon$ is a noise
variable;
\item $\epsilon$ has mean 0, a constant variance $\sigma^2$, and is
uncorrelated with $X$ and uncorrelated across observations.
\end{enumerate}
The noise variable may represent measurement error, or fluctuations in $Y$, or
some combination of both. The assumption of {\bf additive} noise is
non-trivial --- it's not absurd to imagine that either measurement error or
fluctuations might change $Y$ multiplicatively (for instance). The assumption
of a {\bf linear functional form} for the relationship between $Y$ and $X$ is
non-trivial; lots of non-linear relationships actually exist. The assumption
of {\bf constant variance}, or {\bf homoskedasticity}, is non-trivial; the
non-correlation assumptions are non-trivial. But the assumption that the noise
has mean 0 {\em is} trivial. (Why?) Ideally, all of the non-trivial
assumptions will be checked, and we will talk later in the course about ways to
check them.
The assumptions I have just laid out, while they are non-trivial because they
could be violated (and are, in many situations), are still strong enough to let
us get a start on inference. While we will go into this in some detail next
time, let's give this at least a start here.
Remember we saw last time that the optimal linear predictor of $Y$ from $X$ has
slope $\beta_1 = \Cov{X,Y}/\Var{X}$. But both $\Cov{X,Y}$ and $\Var{X}$ are
functions of the true distribution. Rather than having that full distribution,
we merely have data points, say $(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$.
How might we {\em estimate} $\beta_1$ from this data?
An obvious approach would be to use the data to find the {\em sample}
covariance and {\em sample} variance, and take their ratio. As a reminder,
the sample variance of $X$ is
\[
s^2_X = \frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})^2}
\]
while the sample covariance is
\[
c_{XY} = \frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})(y_i - \overline{y})}
\]
(Here I am writing $\overline{x}$ for the sample average of the $x_i$, and
similarly for other variables.)\footnote{Some people prefer to define these
with denominators $n-1$ rather than $n$, to get unbiased estimates of the
population quantities. The way I am doing it will simplify some book-keeping
presently.} So we'd have a {\bf sample} (or {\bf empirical}) slope
\[
\widehat{\beta_1} = \frac{c_{XY}}{s^2_{X}}
\]
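We can check this formula against \verb_lm_ on the BEA data; R's built-in \verb_cov_ and \verb_var_ use $n-1$ denominators rather than $n$, but the factors cancel in the ratio:
<<>>=
cov(bea$pop, bea$pcgmp) / var(bea$pop)
coefficients(lm(pcgmp ~ pop, data=bea))[2]  # should agree
@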
We can't hope that $\widehat{\beta_1} = \beta_1$, but we {\em can} hope that as
$n\rightarrow\infty$, $\widehat{\beta_1} \rightarrow \beta_1$. When an
estimator converges on the truth like that, the estimator is called {\bf
consistent}, and this is the most basic property a good estimator should
have. What do we need to assume in order for $\widehat{\beta_1} \rightarrow
\beta_1$?
Let's look at the sample covariance. A little algebra
shows
\begin{equation}
\label{eqn:sample-cov-as-product-moment-minus-moment-product}
c_{XY} = \frac{1}{n}\sum_{i=1}^{n}{x_i y_i} - \bar{x}\bar{y}
\end{equation}
According to the model, $y_i = (\beta_0 + \beta_1 x_i + \epsilon_i)$.
So (after a little more algebra)
\begin{eqnarray}
\label{eqn:sample-cov-as-sample-averages}
c_{XY} & = & \frac{1}{n}\sum_{i=1}^{n}{x_i (\beta_0 + \beta_1 x_i + \epsilon_i)} - \overline{x}\overline{\beta_0 + \beta_1 x + \epsilon}\\
& = & \beta_0 \overline{x} + \beta_1 \overline{x^2} + \overline{x \epsilon} - \overline{x}\beta_0 - \beta_1 \overline{x}^2 - \bar{x}\bar{\epsilon}\\
& = & \beta_1 s^2_X +\overline{x \epsilon} - \bar{x}\bar{\epsilon}
\end{eqnarray}
Because $\epsilon$ has mean 0, as $n\rightarrow \infty$, the law of large
numbers says $\overline{\epsilon} \rightarrow 0$, and hence
$\bar{x}\bar{\epsilon} \rightarrow 0$. Because $\epsilon$ is uncorrelated with
$X$ and has mean 0, $\Expect{X\epsilon} = \Cov{X,\epsilon} +
\Expect{X}\Expect{\epsilon} = 0$, so the law of large numbers again says that
$\overline{x\epsilon} \rightarrow 0$ as well. So
\[
c_{XY} \rightarrow \beta_1 s^2_X
\]
and therefore
\[
\widehat{\beta_1} = \frac{c_{XY}}{s^2_X} \rightarrow \frac{\beta_1 s^2_X}{s^2_X} = \beta_1
\]
as desired.
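We can also watch this convergence happen in a small simulation, drawing data from a simple linear regression model with made-up parameters and re-estimating the slope as $n$ grows:
<<>>=
beta.0 <- 10; beta.1 <- 2; sigma <- 5  # made-up true parameters
slope.estimate <- function(n) {
  x <- runif(n, min=0, max=10)
  y <- beta.0 + beta.1*x + rnorm(n, mean=0, sd=sigma)
  cov(x, y)/var(x)  # n vs. n-1 denominators cancel in the ratio
}
sapply(c(10, 1e2, 1e3, 1e4, 1e5), slope.estimate)  # should approach 2
@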
As I said, this argument rests on all the model assumptions. Strictly
speaking, the estimator $\widehat{\beta_1}$ is consistent under even weaker
assumptions --- it's enough that $c_{XY} \rightarrow \Cov{X,Y}$, and $s^2_X
\rightarrow \Var{X}$. On the other hand, it would be nice to say more: we want
to know {\em how far} from the truth our estimate is likely to be, whether it
tends to over- or under-estimate the slope, etc. We will see in later
lectures how the assumptions of the simple linear regression model will let us
say something about all of these matters, and how the even stronger assumption
that the noise is Gaussian will let us be even more precise.
(We will, I promise, come back to this data set, and the question of which
regression line, if any, best describes the relationship between a city's size
and its economic output, but that, too, will have to wait for later.)
\section*{Exercises}
To think through, not to turn in.
\begin{enumerate}
\item What, if anything, makes \verb_plot(pcgmp ~ log(pop), data=bea)_ a worse
plot than \verb_plot(pcgmp ~ pop, data=bea, log="x")_?
\item Fill in the algebra for \eqref{eqn:sample-cov-as-product-moment-minus-moment-product}.
\item Fill in the algebra for \eqref{eqn:sample-cov-as-sample-averages}.
\end{enumerate}
\bibliography{locusts}
\bibliographystyle{crs}
\end{document}