Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100

The goal of this class is to train students in using statistical models to analyze data: as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

Graduate students from other departments wishing to take this course should register for it under the number "36-608".

Prerequisite: 36-401, or consent of the instructor. The latter is only granted under *very* unusual circumstances.

| Professor | Cosma Shalizi | cshalizi [at] cmu.edu |
| | 229 C Baker Hall | |
| | 268-7826 | |
| Teaching assistants | Ms. Stefa Etchegaray | |
| | Mr. Mingyu Tang | |
| | Mr. Zachary Kurtz | |
| | Mr. Cong Lu | |

- *Model evaluation*: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; the bootstrap; penalized fitting; mis-specification checks
- *Yet More Linear Regression*: what is regression, really?; review of ordinary linear regression and its limits; extensions
- *Smoothing*: kernel smoothing, including local polynomial regression; splines; additive models; kernel density estimation
- *Generalized linear and additive models*: logistic regression; generalized linear models; generalized additive models
- *Latent variables and structured data*: principal components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
- *Causality*: graphical causal models; identification of causal effects from observations; estimation of causal effects; discovering causal structure
- *Dependent data*: Markov models for time series without latent variables; hidden Markov models for time series with latent variables; longitudinal, spatial and network data

The homework will give you practice in using the techniques you are learning to analyze data, and to interpret the analyses. There will be twelve homework assignments, nearly one every week; they will all be due on Tuesdays at the beginning of class, through Coursekit, and will all count equally, totaling 60% of your grade. The lowest three homework grades will be dropped; consequently, no late homework will be accepted for any reason.

Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it. This portion of the
assignment will be graded, along with the other questions, but failure to do
the writing (or to show signs of a serious attempt at it) will result in an
automatic zero for that assignment. As always, raw computer output and R code
are not acceptable in the write-up itself, but should be put in an appendix to each assignment.
Homework may be submitted either as a PDF (preferred) or as a plain text file
(`.txt`). If you prepare your homework in Word, be sure to submit a PDF
file; `.doc`, `.docx`, etc., files will not be graded.

Unlike PDF or plain text, Word files do not display consistently across different machines, different versions of the program on the same machine, etc., so not using them eliminates any doubt about whether what we grade differs from what you think you wrote. Word files are also much more of a security hole than PDF or (especially) plain text. Finally, it is obnoxious to force people to buy commercial, closed-source software just to read what you write. (It would be obnoxious even if Microsoft paid you for marketing its wares that way, but it doesn't.)

There will be two take-home mid-term exams (10% each), due at 10:30 am on March 6th and April 17th. You will have one week to work on each midterm. There will be no homework in those weeks. There will also be a take-home final exam (20%), due at 10:30 am on May 15, which you will have two weeks to do.

Exams must also be submitted through Coursekit, under the same rules as homework.
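The weights above (homework 60%, two midterms at 10% each, final 20%, lowest three homework grades dropped) can be combined as in the following sketch. It is written in Python for illustration only, and it rests on assumptions not stated in the policy: every score is on a 0-100 scale, the kept homework scores are simply averaged, and a missed assignment counts as 0. The function name `course_grade` and the example scores are hypothetical.

```python
def course_grade(homework, midterms, final, n_dropped=3):
    """Weighted course grade on a 0-100 scale.

    Assumes every score is a percentage (0-100) and that the kept
    homework scores are averaged with equal weight; both are
    illustrative assumptions, not stated grading rules.
    """
    if len(midterms) != 2:
        raise ValueError("expected exactly two midterm scores")
    if len(homework) <= n_dropped:
        raise ValueError("need more homework scores than are dropped")
    # Drop the lowest n_dropped homework scores, then average the rest.
    kept = sorted(homework)[n_dropped:]
    hw_avg = sum(kept) / len(kept)
    # Weights from the syllabus: homework 60%, midterms 10% each, final 20%.
    return 0.60 * hw_avg + 0.10 * midterms[0] + 0.10 * midterms[1] + 0.20 * final


# Twelve made-up homework scores; the two zeros stand for missed assignments.
hw = [90, 85, 0, 100, 95, 70, 80, 0, 88, 92, 75, 60]
grade = course_grade(hw, midterms=[80, 90], final=85)
print(round(grade, 2))
```

In this example the two missed assignments and the lowest submitted one are the three that get dropped, which is why dropped grades stand in for any late-homework allowance.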

In addition to the textbooks, you are expected to read the notes. They contain valuable information which goes beyond what is in the texts, which you will need to understand to do the assignments and the exams.

R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Many of the assignments will require you to use it. You can expect no
assistance from the instructors with any other programming language or
statistical software. If you are *not* able to use R, or do not have
ready, reliable access to a computer on which you can do so, let me know at
once.

Here are some resources for learning R:

- The official intro, "An Introduction to R", available online in HTML and PDF
- John Verzani, "simpleR", in PDF
- Quick-R. This is primarily aimed at those who already know a commercial statistics package like SAS, SPSS or Stata, but it's very clear and well-organized, and others may find it useful as well.
- Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)
- Paul Teetor, The R Cookbook, explains how to use R to do many, many common tasks. (It's like the inverse to R's help: "What command does X?", instead of "What does command Y do?"). It is one of the required texts, and is available at the campus bookstore.
- The notes for 36-350, Introduction to Statistical Computing
- There are now many books about R. Some recommendable ones:
  - Joseph Adler, R in a Nutshell (O'Reilly, 2009; ISBN 9780596801700). Probably most useful for those with previous experience programming in another language.
  - W. John Braun and Duncan J. Murdoch, A First Course in Statistical Programming with R (Cambridge University Press, 2008; ISBN 978-0-521-69424-7)
  - John M. Chambers, Software for Data Analysis: Programming with R (Springer, 2008; ISBN 978-0-387-75935-7). The best book on writing clean and reliable R programs; probably more advanced than you will need.
  - Norman Matloff, The Art of R Programming (No Starch Press, 2011; ISBN 978-1-59327-384-2). Good introduction to programming for complete novices using R. Less statistics than Braun and Murdoch, more programming skills.

The complete notes (large PDF)

- January 17 (Tuesday): Lecture 1, Introduction to the class
- Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
*Reading*: Notes, chapter 1; Faraway, chapter 1 (especially up to p. 17)
- Homework 1: assignment, data
- January 19 (Thursday): Lecture 2, The truth about linear regression
- Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable effects). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
*Reading*: Notes, chapter 2 (R); Faraway, chapter 1
- January 24 (Tuesday): Lecture 3, Evaluation: Error and inference
- Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences.
*Reading*: Notes, chapter 3 (R)
- Homework 1 due: solutions
- Homework 2: assignment, R, `penn-select.csv` data file
- January 26 (Thursday): Lecture 4, Smoothing methods in regression
- The bias-variance trade-off tells us how much we should smooth. Adapting to unknown roughness with cross-validation; detailed examples. How quickly does kernel smoothing converge on the truth? Using kernel regression with multiple inputs. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
*Reading*: Notes, chapter 4 (R); Faraway, section 11.1
*Optional readings*: Hayfield and Racine, "Nonparametric Econometrics: The `np` Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
- January 31 (Tuesday): Lecture 5, The Bootstrap
- Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?
*Reading*: Notes, chapter 5 (R for figures and examples; `pareto.R`; `wealth.dat`)
- R for in-class examples
- Homework 2 due
- Homework 3 assigned: assignment, `nampd.csv` data set
- February 2 (Thursday): Lecture 6, Heteroskedasticity, weighted least squares, and variance estimation
- Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
*Reading*: Notes, chapter 6; Faraway, section 11.3
- February 7 (Tuesday): Lecture 7, Splines
- Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, *plus* a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.
*Reading*: Notes, chapter 7; Faraway, section 11.2
- Homework 3 due
- Homework 4: Assignment
- February 9 (Thursday): Lecture 8, Additive models
- The curse of dimensionality limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us *discover* anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there is hardly ever any reason to prefer linear models to additive ones, and the continued thoughtless use of linear regression is a scandal.
*Reading*: Notes, chapter 8; Faraway, chapter 12
- February 14 (Tuesday): Lecture 9, Writing R Code
- (By popular demand.)
- R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.
*Reading*: Notes, chapter 9
*Optional reading*: Slides from 36-350, Introduction to Statistical Computing, especially through lecture 15.
- R for in-class demos
- Homework 4 due
- Homework 5: assignment, R
- February 16 (Thursday): Lecture 10, Testing Regression Specifications
- Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
*Reading*: Notes, chapter 10
- February 21 (Tuesday): Lecture 11, More about Hypothesis Testing
- The logic of hypothesis testing: significance, power, the will to believe, and the (shadow) price of power. Severe tests of hypotheses: severity of rejection vs. severity of acceptance. Common abuses. Confidence sets as the "dual" to hypothesis tests. Crucial role of sampling distributions. Examples, right and wrong.
*Reading*: Notes, chapter 11
- Homework 5 due
- Homework 6: assignment, `strikes.csv` data set
- February 23 (Thursday): Lecture 12, Logistic regression
- Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models.
*Reading*: Notes, chapter 12; Faraway, chapter 2 (omitting sections 2.11 and 2.12)
- February 28 (Tuesday): Lecture 13, Generalized linear models and generalized additive models
- Poisson regression for counts; iteratively re-weighted least squares again. The general pattern of generalized linear models; over-dispersion. Generalized additive models.
*Reading*: Notes, first half of chapter 13; Faraway, section 3.1, chapter 6
- Homework 6 due
- Midterm 1: assignment. Your data-set has been e-mailed to you.
- March 1 (Thursday): Lecture 14, GLM and GAM Examples
- Building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
*Reading*: Notes, second half of chapter 13; Faraway, chapters 6 and 7 (continued from previous lecture)
- March 6 (Tuesday): Lecture 15, Multivariate Distributions
- Reminders about multivariate distributions. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; conditional distributions in multivariate Gaussians and linear regression; computational aspects, specifically in R. General methods for estimating parametric distributional models in arbitrary dimensions: moment-matching and maximum likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison by cross-validation and by likelihood ratio tests; goodness of fit by the random projection trick.
*Reading*: Notes, chapter 14
- Midterm 1 due
- March 8 (Thursday): Lecture 16, Density Estimation
- The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties: convergence on the true density if the bandwidth shrinks at the right rate; superior performance to histograms; the curse of dimensionality again. An example with cross-country economic data. Kernels for discrete variables. Estimating conditional densities; another example with the OECD data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
*Reading*: Notes, chapter 15
- March 13 and 15: Spring break
- March 20 (Tuesday): Lecture 17, Simulation
- Simulation: implementing the story encoded in the model, step by step, to produce something data-like. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random variables. How to generate random variables with specified distributions. Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
*Reading*: Notes, chapter 16; R
- Homework 7: assignment, `n90_pol.csv` data
- March 22 (Thursday): Lecture 18, Relative Distributions and Smooth Tests of Goodness-of-Fit
- Applying the right CDF to a continuous random variable makes it uniformly distributed. How do we test whether some variable is uniform? The smooth test idea, based on series expansions for the log density. Asymptotic theory of the smooth test. Choosing the basis functions for the test and its order. Smooth tests for non-uniform distributions through the transformation. Dealing with estimated parameters. Some examples. Non-parametric density estimation on [0,1]. Checking conditional distributions and calibration with smooth tests. The relative distribution idea: comparing whole distributions by seeing where one set of samples falls in another distribution. Relative density and its estimation. Illustrations of relative densities. Decomposing shifts in relative distributions.
*Reading*: Notes, chapter 17
*Optional reading*: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics"; Handcock and Morris, "Relative Distribution Methods"
- March 27 (Tuesday): Lecture 19, Principal Components Analysis
- Principal components is the simplest, oldest and most robust of dimensionality-reduction techniques. It works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the projection of the data on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
*Reading*: Notes, chapter 18; `pca.R`, `pca-examples.Rdata`, and `cars-fixed04.dat`
- Homework 7 due
- Homework 8: assignment
- March 29 (Thursday): Lecture 20, Factor Analysis
- Adding noise to PCA to get a statistical model. The factor analysis model, or linear regression with unobserved independent variables. Assumptions of the factor analysis model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. (Our first look at latent variables and conditional independence.) Geometrically, the factor model says the data have a Gaussian distribution on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
*Reading*: Notes, chapter 19; `factors.R` and `sleep.txt`
- April 3 (Tuesday): Lecture 21, Mixture Models
- From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.
*Reading*: Notes, first half of chapter 20
- Homework 8 due
- Homework 9: assignment, `MOM_data_full.txt`
- April 5 (Thursday): Lecture 22, Mixture Model Examples and Complements
- Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.
*Reading*: Notes, second half of chapter 20; `mixture-examples.R`
- April 10 (Tuesday): Lecture 23, Graphical Models
- Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?
*Reading*: Notes, chapter 21
- Homework 9 due
- Midterm 2: assignment
- April 12 (Thursday): Lecture 24, Graphical Causal Models
- Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively *producing* a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.
*Reading*: Notes, chapter 22
- April 17 (Tuesday): Lecture 25, Identifying Causal Effects from Observations
- Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects.
*Reading*: Notes, chapter 23
- Midterm 2 due
- Homework 10: assignment
- April 19 (Thursday): Carnival
- April 24 (Tuesday): Lecture 26, Estimating Causal Effects from Observations
- Estimating graphical models: substituting consistent estimators into the formulas for front and back door identification; average effects and regression; tricks to avoid estimating marginal distributions; matching and propensity scores as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least-squares. Summary recommendations for estimating causal effects.
*Reading*: Notes, chapter 24
- Homework 10 due
- Homework 11: assignment, `sesame.csv`
- April 26 (Thursday): Lecture 27, Discovering Causal Structure from Observations
- How do we get our causal graph? Comparing rival DAGs by testing selected conditional independence relations (or dependencies). Equivalence classes of graphs. Causal arrows never go away no matter what you condition on ("no causation without association"). The crucial difference between common causes and common effects: conditioning on common causes makes their effects independent, conditioning on common effects makes their causes dependent. Identifying colliders, and using them to orient arrows. Inducing orientation to enforce consistency. The SGS algorithm for discovering causal graphs; why it works. The PC algorithm: the SGS algorithm for lazy people. What about latent variables? Software: `TETRAD` and `pcalg`; examples of working with `pcalg`. Limits to observational causal discovery: universal consistency is possible (and achieved), but uniform consistency is not.
*Reading*: Notes, chapter 25
- May 1 (Tuesday): Lecture 28, Time Series I
- What time series are. Properties: autocorrelation or serial correlation; strong and weak stationarity. The correlation time, the world's simplest ergodic theorem, effective sample size. The meaning of ergodicity: a single increasingly long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending.
*Reading*: Notes, chapter 26; R for examples; `gdp-pc.csv`
- Homework 11 due
- Final exam: assignment; `macro.csv`
- Help installing `pcalg`
- May 3 (Thursday): Lecture 29, Time Series II
- Cross-validation for time series. Change-points and "structural breaks". Moving averages: spurious correlations (Yule effect) and oscillations (Slutsky effect). State-space or hidden Markov models; moving average and ARMA models as state-space models. The EM algorithm for hidden Markov models; particle filtering. Multiple time series: "dynamic" graphical models; "Granger" causality (which is not causal); the possibility of real causality.
*Reading*: Notes, chapter 27; Faraway, section 9.1
- May 15
- Final exam due
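
As an illustration of the computation-intensive methods in the schedule above, the non-parametric bootstrap from Lecture 5 can be sketched in a few lines: resample the data with replacement, re-compute the estimate on each resample, and use the spread of the re-estimates as a stand-in for sampling variability. This is a minimal sketch in Python (the course itself uses R), not code from the notes; the function name `bootstrap_se`, the number of replicates, and the sample values are all hypothetical.

```python
import random
import statistics


def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    """Non-parametric bootstrap standard error of estimator(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample n points from the data, with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # The spread of the re-estimates approximates the sampling variability.
    return statistics.stdev(replicates)


# Made-up sample, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
se_median = bootstrap_se(sample, statistics.median)
print(f"median = {statistics.median(sample)}, bootstrap SE = {se_median:.3f}")
```

Any estimator that maps a list of numbers to a single number can be passed in place of `statistics.median`. With a sample this small the result is only a rough guide, which is one of the situations where the bootstrap can mislead.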