Tuesdays and Thursdays, 10:30--11:50 Porter Hall 100

The goal of this class is to train students in using statistical models to analyze data: as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

Graduate students from other departments wishing to take this course should register for it under the number "36-608".

36-401, or, in unusual circumstances, an equivalent course approved by the instructor.

Professor | Cosma Shalizi | cshalizi [at] cmu.edu | 229 C Baker Hall | 268-7826

Teaching assistants | Gaia Bellone | gbellone [at] stat.cmu.edu

Teaching assistants | Zachary Kurtz | zkurtz [at] stat.cmu.edu

Teaching assistants | Shuhei Okumura | sokumura [at] stat.cmu.edu

- *Model evaluation*: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; bootstrap; penalized fitting; information criteria; mis-specification checks; model averaging
- *Yet More Linear Regression*: what is regression, really?; review of ordinary linear regression and its limits; extensions
- *Smoothing*: kernel smoothing, including local polynomial regression; splines; additive models; classification and regression trees; kernel density estimation
- *GAMs*: logistic regression; generalized linear models; generalized additive models
- *Latent variables and structured data*: principal components; factor analysis and latent variables; graphical models in general; latent cluster/mixture models; hierarchical models and partial pooling
- *Causality*: estimating causal effects; discovering causal structure
- *Time series*: Markov models for time series without latent variables; hidden Markov models for time series with latent variables

There will be ~~twelve~~ eleven weekly homework assignments,
nearly one every week; they will all count equally, and be 60% of your grade.
The homework will give you practice in using the techniques you are learning to
analyze data, and to interpret the analyses. Communicating your results to
others is as important as getting good results in the first place. Raw
computer output and R code are not acceptable in the body of your write-up;
put them in an appendix to each assignment.
Homework will be due, in hard-copy, at the beginning of class on Tuesdays. The
lowest three homework grades will be dropped; consequently, no late homework
will be accepted.

There will be two take-home mid-term exams (10% each), due at 5 pm on March 1st and April 12th. (Please let me know as soon as possible if you have a conflict with either date.) You will have one week to work on each midterm. There will be no homework in those weeks, and lecture on the day they are due will be replaced with special office hours. There will also be a take-home final exam (20%), due at 10 am on May 9, which you will have two weeks to do.

R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Many of the problems will be easier with R, and some of them will require R.
You should have no expectations of assistance from the instructors with
programming in any other language. If you are *not* able to use R,
or do not have ready, reliable access to a computer on which you can do so,
let me know at once.

Here are some resources for learning R:

- The official intro, "An Introduction to R", available online in HTML and PDF
- John Verzani, "simpleR", in PDF
- Quick-R. This is primarily aimed at those who already know a commercial statistics package like SAS, SPSS or Stata, but it's very clear and well-organized, and others may find it useful as well.
- Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)
- There are now many books about R. Adler's R in a Nutshell, and Venables and Ripley, will be available at the campus bookstore. John M. Chambers, Software for Data Analysis: Programming with R (Springer, 2008, ISBN 978-0-387-75935-7) is the best book on writing programs in R, but we will not have to do much actual programming.

- January 11 (Tuesday): Lecture 1, Introduction to the class
- Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
- PDF, R, example data for the lecture
- Homework 1; data set
- January 13 (Thursday): Lecture 2, The truth about linear regression
- Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable effects). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
- PDF, R
- January 18 (Tuesday): Lecture 3, Evaluation: Error and inference
- Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection.
- PDF, R for figures
- Homework 1 due: solutions
- Homework 2; R for problem #2
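
The contrast between in-sample error and cross-validated error can be seen in a few lines of R. This is only a sketch on simulated data: the sine-curve truth, the noise level, the polynomial family, and the helper name `cv_mse` are all assumptions made for illustration, not material from the lecture notes.

```r
# Hypothetical simulated data: the true regression function is a sine curve
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
df <- data.frame(x = x, y = y)

# k-fold cross-validated mean squared error for a polynomial of a given degree
cv_mse <- function(degree, data, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
  fold_mse <- sapply(1:k, function(i) {
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    fit <- lm(y ~ poly(x, degree), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
}

degrees <- 1:8
# In-sample error can only go down as the model grows
in_sample <- sapply(degrees, function(d) {
  mean(residuals(lm(y ~ poly(x, d), data = df))^2)
})
# Cross-validated error estimates generalization error instead
out_sample <- sapply(degrees, cv_mse, data = df)
best_degree <- degrees[which.min(out_sample)]
```

Plotting `in_sample` and `out_sample` against `degrees` shows the first falling steadily while the second levels off or rises, which is the over-fitting pattern the lecture describes.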
- January 20 (Thursday): Lecture 4, Smoothing methods in regression
- The bias-variance trade-off tells us how much we should smooth. Adapting to unknown roughness with cross-validation; detailed examples. Using kernel regression with multiple inputs: multivariate kernels, product kernels. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results.
- PDF notes, R
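
As a rough illustration of adapting the amount of smoothing by cross-validation, here is a hand-rolled kernel-weighted average (Nadaraya-Watson) smoother with a Gaussian kernel. The simulated data, the candidate bandwidths, and the function names `nw_smooth` and `loocv` are assumptions for this sketch; the course materials may well use a dedicated package instead.

```r
# Hypothetical simulated data with a fairly wiggly regression function
set.seed(2)
n <- 150
x <- sort(runif(n, 0, 3))
y <- cos(3 * x) + rnorm(n, sd = 0.2)

# Kernel-weighted average of y at each point in x0, Gaussian kernel, bandwidth h
nw_smooth <- function(x0, x, y, h) {
  sapply(x0, function(pt) {
    w <- dnorm((x - pt) / h)
    sum(w * y) / sum(w)
  })
}

# Leave-one-out cross-validation score for a given bandwidth
loocv <- function(h, x, y) {
  pred <- sapply(seq_along(x), function(i) {
    nw_smooth(x[i], x[-i], y[-i], h)
  })
  mean((y - pred)^2)
}

bandwidths <- c(0.02, 0.05, 0.1, 0.2, 0.5, 1)
cv_scores <- sapply(bandwidths, loocv, x = x, y = y)
h_best <- bandwidths[which.min(cv_scores)]
fitted_curve <- nw_smooth(x, x, y, h_best)
```

Small bandwidths give low bias but high variance, large ones the reverse; the cross-validation score is one way of locating the compromise without knowing the roughness of the true function in advance.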
- January 25 (Tuesday): Lecture 5, Heteroskedasticity, weighted least squares, and variance estimation
- Average predictive comparisons. Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
- PDF handout
- Homework 2 due: PDF of solutions, R
- Homework 3 out: Assignment
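
The iterative scheme for estimating the regression function and the variance function together might look roughly like the following. Everything specific here is an assumption for illustration: the simulated data, the choice of `smooth.spline` to smooth the log squared residuals, and the fixed five iterations.

```r
# Hypothetical heteroskedastic data: noise grows with |x|
set.seed(3)
n <- 500
x <- runif(n, -3, 3)
sigma <- 0.5 + abs(x)            # assumed true standard-deviation function
y <- 1 + 2 * x + rnorm(n, sd = sigma)

fit <- lm(y ~ x)                 # start from ordinary least squares
for (iter in 1:5) {
  # Smooth the log squared residuals against x to estimate the log variance.
  # (This is biased by a constant factor, which does not change WLS coefficients.)
  log_var_fit <- smooth.spline(x, log(residuals(fit)^2))
  var_hat <- exp(predict(log_var_fit, x)$y)
  # Refit, weighting each point by the inverse of its estimated variance
  fit <- lm(y ~ x, weights = 1 / var_hat)
}
coef(fit)
```

Working on the log scale keeps the estimated variances positive, which is the main reason for that design choice in this sketch.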
- January 27 (Thursday): Lecture 6, Density estimation
- The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties. An example with homework data. Estimating conditional densities; another example with homework data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
- PDF notes, R for figures
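
A small simulation can make the Glivenko-Cantelli result and kernel density estimation concrete. The standard Gaussian truth, the sample sizes, and the helper name `sup_distance` are assumptions chosen for this sketch.

```r
set.seed(4)

# Largest gap between the empirical CDF and the true CDF, evaluated on a grid
sup_distance <- function(n) {
  x <- rnorm(n)
  grid <- seq(-4, 4, length.out = 2001)
  max(abs(ecdf(x)(grid) - pnorm(grid)))
}
sizes <- c(100, 10000)
distances <- sapply(sizes, sup_distance)   # shrinks as n grows

# Kernel density estimate with a data-driven (Sheather-Jones) bandwidth
x <- rnorm(1000)
kde <- density(x, bw = "SJ")
# Riemann-sum check that the estimated density integrates to about 1
kde_integral <- sum(kde$y) * diff(kde$x[1:2])
```

The gap in `distances` falls roughly like one over the square root of the sample size, while `kde` gives a smooth density estimate that a histogram would only approximate with well-chosen bins.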
- February 1 (Tuesday): Lecture 7, Simulation
- Simulation: implementing the story encoded in the model, step by step, to produce something data-like. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random numbers. Means of generating random numbers with specified distributions. Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
- PDF notes, R
- Homework 3 due: solutions, R
- Homework 4 out: Assignment, SPhistory.short.csv
- February 3 (Thursday): Lecture 8, The Bootstrap
- Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?
- PDF notes, R for figures and examples
- pareto.R, wealth.dat
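
For readers who want to see the two flavors of bootstrap side by side, here is a minimal sketch for the standard error of a sample median. The exponential sample, the number of replicates `B`, and the choice of the basic (pivotal) interval are assumptions for illustration and are not taken from `pareto.R`.

```r
set.seed(5)
x <- rexp(200, rate = 1)          # hypothetical sample
theta_hat <- median(x)
B <- 2000                         # number of bootstrap replicates

# Non-parametric bootstrap: resample the data with replacement
boot_np <- replicate(B, median(sample(x, replace = TRUE)))

# Parametric bootstrap: simulate from the fitted exponential model
rate_hat <- 1 / mean(x)           # maximum likelihood estimate of the rate
boot_par <- replicate(B, median(rexp(length(x), rate = rate_hat)))

se_np <- sd(boot_np)
se_par <- sd(boot_par)

# Basic (pivotal) 95% confidence interval from the non-parametric replicates
ci <- unname(2 * theta_hat - quantile(boot_np, c(0.975, 0.025)))
```

When the parametric model is right, as it is by construction here, the two standard errors should be close; a large gap between them is one informal warning sign of mis-specification.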
- February 8 (Tuesday): Lecture 9, Catch-up and consolidation day
- Reviewing the course so far.
- Homework 4 due: Solutions
- Homework 5 out: Assignment
- February 10 (Thursday): Lecture 10, Testing regression specifications (guest lecture by Prof. Rinaldo)
- Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
- PDF notes, incorporating R examples
- February 15 (Tuesday): Lecture 11, Splines
- Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, *plus* a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data from homework 4, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression. Appendix: Lagrange multipliers and the correspondence between constrained and penalized optimization.
- PDF notes, incorporating R examples
- Homework 5 due: Solutions
- Homework 6 out: Assignment; data files: gmp_2006.csv, pcgmp_2006.csv
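
The effect of the curvature penalty can be explored with `smooth.spline` from base R. This is a sketch on simulated data; the sine-curve truth and the two extreme `lambda` values are arbitrary assumptions picked to show the ends of the trade-off.

```r
set.seed(6)
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)

fit_cv <- smooth.spline(x, y, cv = TRUE)         # penalty chosen by leave-one-out CV
fit_stiff <- smooth.spline(x, y, lambda = 1e3)   # heavy penalty: close to a straight line
fit_loose <- smooth.spline(x, y, lambda = 1e-9)  # light penalty: very wiggly fit

# In-sample mean squared error of each fit
mse <- function(fit) mean((y - predict(fit, x)$y)^2)
in_sample <- c(cv = mse(fit_cv), stiff = mse(fit_stiff), loose = mse(fit_loose))

# Effective degrees of freedom rise as the penalty weakens
eff_df <- c(cv = fit_cv$df, stiff = fit_stiff$df, loose = fit_loose$df)
```

The loosely penalized fit has the smallest in-sample error but would generalize worst, which is why the penalty is chosen by cross-validation and not by in-sample fit.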
- February 17 (Thursday): Lecture 12, Additive models
- The curse of dimensionality limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us *discover* anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there is hardly ever any reason to prefer linear models to additive ones, and the continued thoughtless use of linear regression is a scandal.
- PDF notes, incorporating R examples
- February 22 (Tuesday): Lecture 13, More about Hypothesis Testing
- Homework 6 due: PDF solutions, R code
- Midterm 1 out: Exam; your data set was e-mailed to your Andrew account
- February 24 (Thursday): No lecture
- March 1 (Tuesday): Q & A session
- Midterm 1 due (at 5 pm): PDF solutions, R, master data set
- March 3 (Thursday): Consolidation and examples
- With an emphasis on exam debriefing
- March 8 and March 10 (Tuesday and Thursday)
- Spring break
- March 15 (Tuesday): Lecture 14, Logistic regression
- Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models
- PDF notes
- Homework 7 out: PDF assignment
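
Iteratively re-weighted least squares for logistic regression is short enough to write out by hand and compare against `glm`. The simulated data, the true coefficients, and the convergence tolerance below are assumptions for this sketch.

```r
set.seed(7)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))   # hypothetical true coefficients
X <- cbind(1, x)                             # design matrix with intercept

beta <- c(0, 0)
for (iter in 1:25) {
  eta <- drop(X %*% beta)          # linear predictor
  mu <- plogis(eta)                # fitted probabilities
  w <- mu * (1 - mu)               # IRLS weights
  z <- eta + (y - mu) / w          # working response
  # Weighted least squares of z on X (w * X scales row i of X by w[i])
  beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}

glm_fit <- glm(y ~ x, family = binomial)
rbind(irls = unname(beta), glm = unname(coef(glm_fit)))
```

Each pass is one Newton step on the log-likelihood, rewritten as a weighted linear regression, so the hand-rolled estimates should agree with `glm` to many decimal places.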
- March 17 (Thursday): Lecture 15, Generalized linear models and generalized additive models
- Poisson regression and other generalized linear models; over-dispersion; generalized additive models
- March 22 (Tuesday): Lecture 16, Consolidation and examples
- Building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
- PDF handout, `snoqualmie.csv` data set, R
- Homework 8 out: assignment; Fair, 1978
- March 24 (Thursday): Lecture 17, Principal components analysis
- Principal components: the simplest, oldest and most robust of dimensionality-reduction techniques. PCA works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the coordinates of projections on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
- PDF handout, pca.R for examples, cars data set, R workspace for the New York Times examples
- Homework 7 due (extended due to server outage): solutions
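
The reduction of PCA to an eigenproblem can be checked directly. This sketch uses R's built-in `mtcars` data as a stand-in, which is an assumption; it is not the cars data set distributed with the lecture, and the choice of five columns is arbitrary.

```r
# Standardize a few numeric columns of a built-in data set
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]))

# Principal components as eigenvectors of the sample covariance matrix
eig <- eigen(cov(X))

# The same analysis via prcomp
pca <- prcomp(X)

# Eigenvalues are the variances along the principal components
variance_match <- all.equal(eig$values, pca$sdev^2)

# Eigenvectors agree with prcomp's loadings up to an arbitrary sign flip
loading_gap <- max(abs(abs(eig$vectors) - abs(pca$rotation)))

# Projections of the data on to the components
scores <- X %*% eig$vectors
```

The sign ambiguity in the loadings is a small reminder that the components are geometric summaries of the data, not quantities with an intrinsic direction or meaning.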
- March 29 (Tuesday): Lecture 18, Factor analysis
- Adding noise to PCA to get a statistical model. The factor analysis model, or linear regression with unobserved independent variables. Assumptions of the factor analysis model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. (Our first look at latent variables and conditional independence.) Geometrically, the factor model says the data have a Gaussian distribution on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
- PDF handout; lecture-18.R computational examples you should step through (not done in class); correlates of sleep in mammals data set for those examples; thomson-model.R
- Homework 8 due: solutions; Li and Racine, 2004
- Homework 9: assignment, `fx.csv` data set
- March 31 (Thursday): Lecture 19, Mixture Models
- From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry. Clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.
- PDF handout
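
A bare-bones EM loop for a two-component Gaussian mixture shows the alternation between soft assignments and weighted re-estimation, and lets one verify numerically that the likelihood never decreases. The simulated data, starting values, and stopping rule are assumptions for this sketch; real analyses would normally use a package and multiple starts.

```r
set.seed(9)
# Hypothetical data from two well-separated Gaussian clusters
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 0.8))

# Log-likelihood of a two-component Gaussian mixture
loglik <- function(x, lambda, mu, sigma) {
  sum(log(lambda * dnorm(x, mu[1], sigma[1]) +
          (1 - lambda) * dnorm(x, mu[2], sigma[2])))
}

# Crude starting values
lambda <- 0.5
mu <- c(min(x), max(x))
sigma <- c(sd(x), sd(x))
ll_trace <- loglik(x, lambda, mu, sigma)

for (iter in 1:200) {
  # E step: posterior probability that each point came from component 1
  d1 <- lambda * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - lambda) * dnorm(x, mu[2], sigma[2])
  r <- d1 / (d1 + d2)
  # M step: weighted maximum likelihood estimates
  lambda <- mean(r)
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
             sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
  ll_trace <- c(ll_trace, loglik(x, lambda, mu, sigma))
  if (diff(tail(ll_trace, 2)) < 1e-8) break   # stop when the likelihood stalls
}
```

Plotting `ll_trace` against iteration number gives a non-decreasing curve, the numerical counterpart of the proof that each EM step increases the likelihood.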
- April 5 (Tuesday): Lecture 20, Mixture model examples and complements
- Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; estimation by maximum likelihood; computational aspects, specifically in R.
- PDF, R; bootcomp.R (patch graciously provided by Dr. Derek Young)
- Homework 9 due: solutions
- Midterm 2 out: Assignment; your data set was mailed to you
- April 7 (Thursday): Lecture 21, Graphical models
- Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth? Appendix: undirected graphical models, the Gibbs-Markov theorem; directed but cyclic graphical models. Appendix: Some basic notions of graph theory; Guthrie diagrams.
- April 12 (Tuesday): Lecture 22, Graphical causal models
- Statistical dependence, counterfactuals, causation. Probabilistic prediction (selecting a sub-ensemble) vs. causal prediction (generating a new ensemble). Graphical causal models, structural equation models. The causal Markov property. Faithfulness. Counterfactual prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules. Appendix: mutual information and independence; conditional mutual information and conditional independence.
- PDF notes
- Midterm 2 due: Solutions, R for solutions
- Homework 10 out: assignment, `fake-smoke.csv`
- April 14 (Thursday): Spring carnival
- April 19 (Tuesday): Lecture 23, Estimating causal effects from observations
- Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Matching and propensity scores as computational short-cuts in back-door adjustment. Summary recommendations for identifying and estimating causal effects.
- PDF notes
- Homework 10 due: Solutions
- Homework 11 out: Assignment
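
Back-door adjustment can be illustrated with a simulated linear system in which a single observed variable confounds treatment and outcome. The structural equations and coefficients below are assumptions made up for this sketch, and linear regression stands in for conditioning only because the simulated model is linear by construction.

```r
set.seed(10)
n <- 5000
z <- rnorm(n)                        # confounder: causes both x and y
x <- 0.8 * z + rnorm(n)              # treatment depends on the confounder
y <- 1.0 * x + 2.0 * z + rnorm(n)    # true causal effect of x on y is 1

# Ignoring z mixes the causal effect with the association through the confounder
naive <- unname(coef(lm(y ~ x))["x"])

# Conditioning on z blocks the back-door path x <- z -> y
adjusted <- unname(coef(lm(y ~ x + z))["x"])

c(naive = naive, adjusted = adjusted)
```

The naive slope lands near 2 and the adjusted one near the true value of 1. Had `z` been unobserved, no amount of data would have removed the gap, which is the identifiability issue the lecture raises.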
- April 21 (Thursday): Lecture 24, Discovering causal structure from observations
- How do we get our causal graph? Comparing rival DAGs by testing selected conditional independence relations (or dependencies). The crucial difference between common causes and common effects. Identifying colliders, and using them to orient arrows. Inducing orientation to enforce consistency. The SGS algorithm for discovering causal graphs; why it works. Refinements of the SGS algorithm (the PC algorithm). What about latent variables? Software: `TETRAD` and `pcalg`. Limits to observational causal discovery: universal consistency is possible (and achieved), but uniform consistency is not.
- PDF notes
- April 26 (Tuesday): Lecture 25, Recap on estimating causal effects
- Substituting consistent estimators into the formulas for front and back door identification. Tricks to avoid estimating marginal distributions. Uncertainty in estimates of effects.
- Homework 11 due: Solutions
- Final exam out: Assignment
- April 28 (Thursday): General review
- May 9 (Monday): Final exam due at 10 am