Cosma Shalizi

36-402, Undergraduate Advanced Data Analysis

Spring 2011

This page has information about the 2011 version of the class. The 2012 version is over here.
Tuesdays and Thursdays, 10:30--11:50 Porter Hall 100
Keen-eyed fellow investigators

The goal of this class is to train students in using statistical models to analyze data: as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

Graduate students from other departments wishing to take this course should register for it under the number "36-608".


Prerequisites: 36-401, or, in unusual circumstances, an equivalent course approved by the instructor.


Professor Cosma Shalizi cshalizi [at]
229 C Baker Hall
Teaching assistants Gaia Bellone gbellone [at]
Zachary Kurtz zkurtz [at]
Shuhei Okumura sokumura [at]

Topics, Notes, Readings

Model evaluation: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; bootstrap; penalized fitting; information criteria; mis-specification checks; model averaging
Yet More Linear Regression: what is regression, really?; review of ordinary linear regression and its limits; extensions
Smoothing: kernel smoothing, including local polynomial regression; splines; additive models; classification and regression trees; kernel density estimation
GAMs: logistic regression; generalized linear models; generalized additive models.
Latent variables and structured data: principal components; factor analysis and latent variables; graphical models in general; latent cluster/mixture models; hierarchical models and partial pooling
Causality: estimating causal effects; discovering causal structure
Time series: Markov models for time series without latent variables; hidden Markov models for time series with latent variables
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.

Course Mechanics

Grades will not go away if you avert your eyes (photo by laurent KB on Flickr)

Homework will be 60% of the grade, two midterms 10% each, and the final 20%.


There will be eleven homework assignments, nearly one every week; they will all count equally, and together make up 60% of your grade. The homework will give you practice in using the techniques you are learning to analyze data, and to interpret the analyses. Communicating your results to others is as important as getting good results in the first place. Raw computer output and R code are not acceptable as a write-up; put them in an appendix to each assignment. Homework will be due, in hard copy, at the beginning of class on Tuesdays. The lowest three homework grades will be dropped; consequently, no late homework will be accepted.


There will be two take-home mid-term exams (10% each), due at 5 pm on March 1st and April 12th. (Please let me know as soon as possible if you have a conflict with either date.) You will have one week to work on each midterm. There will be no homework in those weeks, and lecture on the day they are due will be replaced with special office hours. There will also be a take-home final exam (20%), due at 10 am on May 9, which you will have two weeks to do.

Office Hours

Prof. Shalizi will hold office hours Mondays, 2--4 pm, in Baker Hall 229A, or by appointment. Ms. Bellone will hold office hours Fridays 1:30 to 2:30, and Mr. Okumura Thursdays 1--2 pm, both in Wean Hall 8110. If you want help with computation, please bring your laptop.


Blackboard will be used only for announcements, grades, and a discussion forum. Assignments and solutions will be posted here.


Julian Faraway, Extending the Linear Model with R (Chapman Hall/CRC Press, 2006, ISBN 978-1-58488-424-8) will be required. (Faraway's page on the book, with help and errata.) Adler's R in a Nutshell (O'Reilly, 2009; ISBN 9780596801700), Berk's Statistical Learning From a Regression Perspective (Springer, 2008; ISBN 9780387775005), and Venables and Ripley's Modern Applied Statistics with S (Springer, 2003; ISBN 9780387954578) will all be optional. The campus bookstore should have copies.

Collaboration, Cheating and Plagiarism

Cheating leads to desolation and ruin (photo by paddyjoe on Flickr)

Feel free to discuss all aspects of the course with one another, including homework and exams. However, the work you hand in must be your own. You must not copy mathematical derivations, computer output and input, or written descriptions from anyone or anywhere else, without reporting the source within your work. Please review the CMU Policy on Cheating and Plagiarism.

Physically Disabled and Learning Disabled Students

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] or (412) 268-2012.


R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before), and this class presumes that you have. Many of the problems will be easier with R, and some of them will require R. You should have no expectations of assistance from the instructors with programming in any other language. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Here are some resources for learning R:

Caught in a thicket of syntax (photo by missysnowkitten on Flickr)

You should read the Notes on Writing R Functions, and Re-writing Your Code. Even if you know how to do some basic coding (or more), you should read the page of Minimal Advice on Programming.


Subject to revision. Lecture notes, assignments and solutions will all be linked here, as they are available.

Identifying significant features from background (photo by Gord McKenna on Flickr)
January 11 (Tuesday): Lecture 1, Introduction to the class
Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
PDF, R, example data for the lecture
Homework 1; data set
January 13 (Thursday): Lecture 2, The truth about linear regression
Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable effects). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
January 18 (Tuesday): Lecture 3, Evaluation: Error and inference
Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection.
PDF, R for figures
Homework 1 due: solutions
Homework 2; R for problem #2
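The contrast between in-sample error and cross-validated error in Lecture 3 can be sketched in a few lines. This is a minimal illustration in Python rather than the course's R code; the simulated data, the k-nearest-neighbor smoother, and all function names are assumptions made for the example, not taken from the lecture notes.

```python
import math
import random

def knn_predict(train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error of the k-NN smoother fit on `train`, scored on `test`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in test) / len(test)

def cv_mse(data, k, n_folds=5):
    """n_folds-fold cross-validation: hold each fold out once, fit on the rest."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(train, held_out, k))
    return sum(scores) / n_folds

# Hypothetical data: a sine curve plus Gaussian noise.
rng = random.Random(402)
data = []
for _ in range(200):
    x = rng.uniform(0, 3)
    data.append((x, math.sin(2 * x) + rng.gauss(0, 0.3)))

in_sample = {k: mse(data, data, k) for k in (1, 5, 10, 40)}
cross_val = {k: cv_mse(data, k) for k in (1, 5, 10, 40)}
best_k = min(cross_val, key=cross_val.get)
for k in cross_val:
    print(k, round(in_sample[k], 3), round(cross_val[k], 3))
```

With k = 1 the smoother reproduces every training point, so its in-sample error is zero even though its held-out error is not; picking k by in-sample error would therefore always over-fit, while the cross-validated score gives a usable basis for model selection.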
January 20 (Thursday): Lecture 4, Smoothing methods in regression
The bias-variance trade-off tells us how much we should smooth. Adapting to unknown roughness with cross-validation; detailed examples. Using kernel regression with multiple inputs: multivariate kernels, product kernels. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results.
PDF notes, R
January 25 (Tuesday): Lecture 5, Heteroskedasticity, weighted least squares, and variance estimation
Average predictive comparisons. Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
PDF handout
Homework 2 due: PDF of solutions, R
Homework 3 out: Assignment
January 27 (Thursday): Lecture 6, Density estimation
The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties. An example with homework data. Estimating conditional densities; another example with homework data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
PDF notes, R for figures
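A kernel density estimate of the kind discussed in Lecture 6 places a small copy of the kernel on every data point and averages them. The sketch below is in Python, not the course's R; the Gaussian kernel, the simulated sample, and the normal-reference rule-of-thumb bandwidth are assumptions chosen for illustration.

```python
import math
import random
import statistics

def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in sample)
    return density

# Hypothetical sample from a standard Gaussian.
rng = random.Random(1)
sample = [rng.gauss(0, 1) for _ in range(300)]

# Rule-of-thumb bandwidth (an assumption; cross-validation is an alternative).
h = 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)
f_hat = gaussian_kde(sample, h)

# A Riemann sum over a wide grid checks that the estimate integrates to about 1.
step = 0.01
grid = [-6 + step * i for i in range(1201)]
total_mass = sum(f_hat(x) for x in grid) * step
print(round(total_mass, 4), round(f_hat(0.0), 3))
```

Because each bump is itself a probability density, the estimate is automatically non-negative and integrates to one; the bandwidth controls the same bias-variance trade-off as in kernel regression.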
February 1 (Tuesday): Lecture 7, Simulation
Simulation: implementing the story encoded in the model, step by step, to produce something data-like. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random numbers. Means of generating random numbers with specified distributions. Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
PDF notes, R
Homework 3 due: solutions, R
Homework 4 out: Assignment, SPhistory.short.csv
February 3 (Thursday): Lecture 8, The Bootstrap
Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?
PDF notes, R for figures and examples
pareto.R, wealth.dat
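The non-parametric bootstrap from Lecture 8 can be sketched as resampling the data with replacement and recomputing the statistic each time. This is an illustrative Python sketch, not the lecture's R code; the exponential sample, the choice of the median as the statistic, and the simple percentile interval are assumptions.

```python
import random
import statistics

def bootstrap_replicates(sample, statistic, n_boot=2000, seed=0):
    """Recompute `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

# Hypothetical data: 100 draws from an exponential distribution.
rng = random.Random(8)
sample = [rng.expovariate(1.0) for _ in range(100)]

reps = sorted(bootstrap_replicates(sample, statistics.median))
std_error = statistics.stdev(reps)            # bootstrap standard error
lower = reps[int(0.025 * len(reps))]          # basic percentile interval
upper = reps[int(0.975 * len(reps))]
print(round(statistics.median(sample), 3), round(std_error, 3),
      (round(lower, 3), round(upper, 3)))
```

The spread of the replicates stands in for the sampling distribution of the median, which has no convenient closed form here; a parametric bootstrap would differ only in drawing the resamples from a fitted model instead of from the data.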
February 8 (Tuesday): Lecture 9, Catch-up and consolidation day
Reviewing the course so far.
Homework 4 due: Solutions
Homework 5 out: Assignment
February 10 (Thursday): Lecture 10, Testing regression specifications (guest lecture by Prof. Rinaldo)
Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
PDF notes, incorporating R examples
February 15 (Tuesday): Lecture 11, Splines
Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data from homework 4, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression. Appendix: Lagrange multipliers and the correspondence between constrained and penalized optimization.
PDF notes, incorporating R examples
Homework 5 due: Solutions
Homework 6 out: Assignment; data files: gmp_2006.csv, pcgmp_2006.csv
February 17 (Thursday): Lecture 12, Additive models
The curse of dimensionality limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there is hardly ever any reason to prefer linear models to additive ones, and the continued thoughtless use of linear regression is a scandal.
PDF notes, incorporating R examples
February 22 (Tuesday): Lecture 13, More about Hypothesis Testing
Homework 6 due: PDF solutions, R code
Midterm 1 out: Exam; your data set was e-mailed to your Andrew account
February 24 (Thursday): No lecture
March 1 (Tuesday): Q & A session
Midterm 1 due (at 5 pm): PDF solutions, R, master data set
March 3 (Thursday): Consolidation and examples
With an emphasis on exam debriefing
March 8 and March 10 (Tuesday and Thursday)
Spring break
March 15 (Tuesday): Lecture 14, Logistic regression
Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models
PDF notes
Homework 7 out: PDF assignment
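Numerical maximum likelihood by Newton's method, as listed for Lecture 14, can be sketched for a logistic regression with an intercept and one predictor. The Python code below is an illustration, not the course's R code; the simulated data and the true coefficients (-0.5 and 1.5) are assumptions. For this model the Newton step coincides with an iteratively re-weighted least squares step, with weights p(1 - p).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(b0, b1, xs, ys):
    """Gradient of the log-likelihood with respect to (b0, b1)."""
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    return g0, g1

def fit_logistic_newton(xs, ys, n_iter=25):
    """Maximize the logistic log-likelihood by Newton's method from (0, 0)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0, g1 = score(b0, b1, xs, ys)
        # Negative Hessian of the log-likelihood: sums of w, w*x, w*x^2.
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 linear system by the explicit inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data simulated from a logistic model.
rng = random.Random(14)
xs = [rng.gauss(0, 1) for _ in range(500)]
ys = [1 if rng.random() < sigmoid(-0.5 + 1.5 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic_newton(xs, ys)
final_score = score(b0_hat, b1_hat, xs, ys)
print(round(b0_hat, 3), round(b1_hat, 3))
```

At the maximum the score is zero, which is the check made on `final_score`; the sketch has no safeguards for perfectly separable data, where the likelihood has no finite maximizer.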
March 17 (Thursday): Lecture 15, Generalized linear models and generalized additive models
Poisson regression and other generalized linear models; over-dispersion; generalized additive models
March 22 (Tuesday): Lecture 16, Consolidation and examples
Building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
PDF handout, snoqualmie.csv data set, R
Homework 8 out: assignment; Fair, 1978
March 24 (Thursday): Lecture 17, Principal components analysis
Principal components: the simplest, oldest and most robust of dimensionality-reduction techniques. PCA works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the coordinates of projections on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
PDF handout, pca.R for examples, cars data set, R workspace for the New York Times examples
Homework 7 due (extended due to server outage): solutions
March 29 (Tuesday): Lecture 18, Factor analysis
Adding noise to PCA to get a statistical model. The factor analysis model, or linear regression with unobserved independent variables. Assumptions of the factor analysis model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. (Our first look at latent variables and conditional independence.) Geometrically, the factor model says the data have a Gaussian distribution on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
PDF handout; lecture-18.R computational examples you should step through (not done in class); correlates of sleep in mammals data set for those examples; thomson-model.R
Homework 8 due: solutions; Li and Racine, 2004
Homework 9: assignment, fx.csv data set
March 31 (Thursday): Lecture 19, Mixture Models
From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry. Clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.
PDF handout
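The EM iteration described for Lecture 19 alternates between computing each point's probability of belonging to each component (E-step) and re-estimating the parameters with those probabilities as weights (M-step). Below is a minimal Python sketch for a two-component Gaussian mixture in one dimension; it is not the course's R code, and the simulated data, the initialization at the sample minimum and maximum, and the fixed number of iterations are all assumptions.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, n_iter=50):
    """EM for lam * N(mu1, s1^2) + (1 - lam) * N(mu2, s2^2)."""
    lam = 0.5
    mu1, mu2 = min(xs), max(xs)        # crude starting values (an assumption)
    s1 = s2 = statistics.pstdev(xs)
    loglik_path = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1.
        a = [lam * normal_pdf(x, mu1, s1) for x in xs]
        b = [(1 - lam) * normal_pdf(x, mu2, s2) for x in xs]
        loglik_path.append(sum(math.log(ai + bi) for ai, bi in zip(a, b)))
        r = [ai / (ai + bi) for ai, bi in zip(a, b)]
        # M-step: weighted maximum likelihood for each component.
        n1 = sum(r)
        n2 = len(xs) - n1
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, xs)) / n1)
        s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, xs)) / n2)
        lam = n1 / len(xs)
    return (lam, mu1, s1, mu2, s2), loglik_path

# Hypothetical data: 40% from N(0, 1) and 60% from N(5, 1).
rng = random.Random(19)
xs = [rng.gauss(0, 1) if rng.random() < 0.4 else rng.gauss(5, 1)
      for _ in range(300)]

(lam, mu1, s1, mu2, s2), loglik_path = em_two_gaussians(xs)
print(round(lam, 3), round(mu1, 3), round(mu2, 3))
```

The recorded log-likelihood path should never decrease from one iteration to the next, which is the property the lecture proves; the sketch does not guard against a component collapsing onto a single point, which can happen with less well-separated data.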
April 5 (Tuesday): Lecture 20, Mixture model examples and complements
Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; estimation by maximum likelihood; computational aspects, specifically in R.
PDF, R; bootcomp.R (patch graciously provided by Dr. Derek Young)
Homework 9 due: solutions
Midterm 2 out: Assignment; your data set was mailed to you
April 7 (Thursday): Lecture 21, Graphical models
Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth? Appendix: undirected graphical models, the Gibbs-Markov theorem; directed but cyclic graphical models. Appendix: Some basic notions of graph theory; Guthrie diagrams.
April 12 (Tuesday): Lecture 22, Graphical causal models
Statistical dependence, counterfactuals, causation. Probabilistic prediction (selecting a sub-ensemble) vs. causal prediction (generating a new ensemble). Graphical causal models, structural equation models. The causal Markov property. Faithfulness. Counterfactual prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules. Appendix: mutual information and independence; conditional mutual information and conditional independence.
PDF notes
Midterm 2 due: Solutions, R for solutions
Homework 10 out: assignment, fake-smoke.csv
April 14 (Thursday): Spring carnival
April 19 (Tuesday): Lecture 23, Estimating causal effects from observations
Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Matching and propensity scores as computational short-cuts in back-door adjustment. Summary recommendations for identifying and estimating causal effects.
PDF notes
Homework 10 due: Solutions
Homework 11 out: Assignment
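The back-door adjustment from Lecture 23 can be illustrated with a simulation in which the true causal effect is known. The Python sketch below is not the course's R code; the data-generating process (a binary confounder Z that raises both the chance of treatment X and the chance of outcome Y, with a true effect of X on Y of 0.2) is entirely hypothetical.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical causal model: Z -> X, Z -> Y, X -> Y.
rng = random.Random(23)
rows = []
for _ in range(20000):
    z = 1 if rng.random() < 0.5 else 0
    x = 1 if rng.random() < (0.8 if z else 0.2) else 0
    y = 1 if rng.random() < 0.1 + 0.2 * x + 0.5 * z else 0
    rows.append((z, x, y))

# Naive contrast: conditions on X only, so it mixes in the effect of Z.
naive = (mean(y for _, x, y in rows if x == 1)
         - mean(y for _, x, y in rows if x == 0))

def backdoor_effect(rows):
    """Average the within-stratum contrasts over the marginal distribution of Z."""
    n = len(rows)
    effect = 0.0
    for z in (0, 1):
        stratum = [(x, y) for zz, x, y in rows if zz == z]
        treated = mean(y for x, y in stratum if x == 1)
        control = mean(y for x, y in stratum if x == 0)
        effect += (len(stratum) / n) * (treated - control)
    return effect

adjusted = backdoor_effect(rows)
print(round(naive, 3), round(adjusted, 3))
```

Under this model the naive contrast should come out near 0.5, while conditioning on Z, which blocks the only back-door path, should recover something close to the true 0.2. The adjustment is only valid because Z is observed and is the sole confounder by construction.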
April 21 (Thursday): Lecture 24, Discovering causal structure from observations
How do we get our causal graph? Comparing rival DAGs by testing selected conditional independence relations (or dependencies). The crucial difference between common causes and common effects. Identifying colliders, and using them to orient arrows. Inducing orientation to enforce consistency. The SGS algorithm for discovering causal graphs; why it works. Refinements of the SGS algorithm (the PC algorithm). What about latent variables? Software: TETRAD and pcalg. Limits to observational causal discovery: universal consistency is possible (and achieved), but uniform consistency is not.
PDF notes
April 26 (Tuesday): Lecture 25, Recap on estimating causal effects
Substituting consistent estimators into the formulas for front and back door identification. Tricks to avoid estimating marginal distributions. Uncertainty in estimates of effects.
Homework 11 due: Solutions
Final exam out: Assignment
April 28 (Thursday): General review
May 9 (Monday): Final exam due at 10 am
photo by barjack on Flickr