36-402, Undergraduate Advanced Data Analysis
Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100
The goal of this class is to train you in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced in 36-401,
extending it to more general functional forms, and more general kinds of data,
emphasizing the computation-intensive methods introduced since the 1980s.
After taking the class, when you're faced with a new data-analysis problem, you
should be able to (1) select appropriate methods, (2) use statistical software
to implement them, (3) critically evaluate the resulting statistical models,
and (4) communicate the results of your analyses to collaborators and to others.
During the class, you will do data analyses with existing software, and
write your own simple programs to implement and extend key techniques. You
will also have to write reports about your analyses.
Graduate students from other departments wishing to take this course should
register for it under the number "36-608". Enrollment for 36-608 is very
limited, and by permission of the professor only.
Prerequisite: 36-401, or consent of the instructor. Consent is granted only under very unusual circumstances.
| Professor | Cosma Shalizi | cshalizi [at] cmu.edu |
| | | 229 C Baker Hall |
| Teaching assistants | Mr. Beau Dabbs | |
| | Ms. Francesca Matano | |
| | Mr. Mingyu Tang | |
| | Ms. Xiaolin Yang | |
Topics, Notes, Readings
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; the bootstrap;
penalized fitting; mis-specification checks
- Yet More Linear Regression: what is regression, really?;
what ordinary linear regression actually does; what it cannot do; extensions
- Smoothing: kernel smoothing, including local polynomial
regression; splines; additive models; kernel density estimation
- Generalized linear and additive models: logistic
regression; generalized linear models; generalized additive models.
- Latent variables and structured data: principal
components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
- Causality: graphical causal models; identification of
causal effects from observations; estimation of causal effects;
discovering causal structure
- Dependent data: Markov models for time
series without latent variables; hidden Markov models for time series with
latent variables; longitudinal, spatial and network data
Homework will be 60% of the grade, two midterms 10% each, and the final 20%.
The homework will give you practice in using the techniques you are learning
to analyze data, and to interpret the analyses. There will be twelve
homework assignments, nearly one every week; they will all be due on Mondays at
11:59 pm (i.e., the night before Tuesday classes), through Blackboard. All
homeworks count equally, totaling 60% of your grade. The lowest three homework
grades will be dropped; consequently, no late homework will be accepted for any
reason.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it. This portion of the
assignment will be graded, along with the other questions. As always, raw
computer output and R code are not acceptable in the body of the report; put
them in an appendix to each assignment. Homework may be submitted either as a
PDF (preferred) or as a plain text file (.txt). If you prepare your homework
in Word, be sure to submit a PDF file; .doc, .docx, etc., files will not be
accepted.
Unlike PDF or plain text, Word files do not display
consistently across different machines, different versions of the program on
the same machine, etc., so not using them eliminates any doubt that what we
grade differs from what you think you wrote. Word files are also much more of
a security hole than PDF or (especially) plain text. Finally, it is obnoxious
to force people to buy commercial, closed-source software just to read what you
write. (It would be obnoxious even if Microsoft paid you for marketing its
wares that way, but it doesn't.)
There will be two take-home mid-term exams (10% each), due at 11:59 pm on
March 4th and April 15th. You will have one week to work on each midterm.
There will be no homework in those weeks. There will also be a take-home final
exam (20%), due at 10:30 am on May 13, which you will have two weeks to do.
Exams must also be submitted through Blackboard, under the same rules as homework.
To help control the quality of the grading, every week (after the first week of
classes), six students will be selected at random, and will meet with the
professor for ten minutes each, to explain their work and to answer questions
about it. You may be selected on multiple weeks, if that's how the random
numbers come up. This is not a punishment, but a way for the
instructor to see whether the problem sets are really measuring learning of the
course material; being selected will not hurt your grade in any way.
If you want help with computing, please bring your laptop.
| Prof. Shalizi | Baker Hall 229A | Monday 11:00--12:00 |
| | Baker Hall 229C | Thursday 12:00--1:00 |
| | Baker Hall 229C | Friday 3:30--4:30 |
| Mr. Dabbs | FMS 320 | Monday 2:00--4:00 |
| | FMS 320 | Thursday 3:00--4:00 |
Blackboard will be used for submitting assignments electronically, and as a
gradebook. All properly enrolled students should have access to the Blackboard
site by the beginning of classes.
The primary textbook for the course will be the
draft Advanced Data Analysis from an
Elementary Point of View. Chapters will be linked to here as they
become needed. You are expected to read these notes, and are unlikely to be
able to do the assignments without doing so. In addition, Paul
Teetor, The R Cookbook (O'Reilly Media, 2011) is required.
Cox and Donnelly's Principles of Applied Statistics (Cambridge
University Press, 2011)
will also have required readings, but we will not use all of it. If you are
unable to purchase it, contact the professor for photocopies.
Faraway, Extending the Linear Model with R (Chapman Hall/CRC)
and Venables and Ripley's Modern Applied Statistics with S
will be optional.
(Faraway's page on the book,
with help and errata.) The campus bookstore should have copies of all of
these.
Collaboration, Cheating and Plagiarism
Feel free to discuss all aspects of the course with one another, including
homework and exams. However, the work you hand in must be your own. You must
not copy mathematical derivations, computer output and input, or written
descriptions from anyone or anywhere else, without reporting the source within
your work. This includes copying from solutions provided in previous
semesters of the course. Unacknowledged copying will lead to severe
disciplinary action. Please read the
CMU Policy on
Cheating and Plagiarism, and don't plagiarize.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Almost every assignment will require you to use it. No other form of
computational work will be accepted. If you are not able to use R, or
do not have ready, reliable access to a computer on which you can do so, let me
know at once.
Here are some resources for learning R:
Even if you know how to do some basic coding (or more), you
should read the page of Minimal
Advice on Programming.
- The official intro, "An Introduction to R", available online
- John Verzani, "simpleR"
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
- Paul Teetor, The R Cookbook, explains how to use R to
do many, many common tasks. (It's like the inverse to R's help: "What
command does X?", instead of "What does command Y do?"). It is one of the required texts, and is available at the campus bookstore.
- The notes for 36-350, Introduction to Statistical Computing
- There are now many books about R. Some recommendable ones:
- Joseph Adler, R in a Nutshell (ISBN 9780596801700). Probably most useful for those with previous experience programming in another language.
- W. John Braun and Duncan J. Murdoch, A First Course in Statistical Programming with R (Cambridge University Press, 2008; ISBN 978-0-521-69424-7)
- John M. Chambers, Software for Data Analysis: Programming with R.
The best book on writing clean and reliable R programs; probably more advanced
than you will need.
- Norman Matloff, The Art of R Programming (No Starch Press, 2011).
Good introduction to programming for complete novices using R. Less statistics
than Braun and Murdoch, more programming skills.
Some handouts on stuff all of you should already know, but where evidently some of you could use refreshers:
- Uncorrelated vs. Independent
- Propagation of Error
- Which Bootstrap When?
Other Iterations of the Class
Some material is available from versions of this class taught in
other years. Copying from any solutions provided there is not only
cheating, it is very easily detected cheating.
Subject to revision. Lecture notes, assignments and solutions
will all be linked here, as they are available.
Current revision of the complete notes
- January 15 (Tuesday): Lecture 1, Introduction to the class; regression
- Statistics is the branch of mathematical engineering which designs and
analyzes methods for learning from imperfect data. Regression is a statistical
model of functional relationships between variables. Getting relationships
right means being able to predict well. The least-squares optimal prediction
is the expectation value; the conditional expectation function is the
regression function. The regression function must be estimated from data; the
bias-variance trade-off controls this estimation. Ordinary least squares
revisited as a smoothing method. Other linear smoothers: nearest-neighbor
averaging, kernel-weighted averaging.
- Reading: Notes, chapter 1 (examples.dat for running example; ckm.csv data set for optional exercises); Cox and Donnelly, chapter 1
- Optional reading: Faraway, chapter 1 (especially up to
- Homework 1: assignment, data
- January 17 (Thursday): Lecture 2, The truth about linear regression
- Using Taylor's theorem to justify linear regression locally. Collinearity.
Consistency of ordinary least squares estimates under weak conditions. Linear
regression coefficients will change with the distribution of the input
variables: examples. Why R^2 is usually a distraction. Linear
regression coefficients will change with the distribution of unobserved
variables (omitted variable effects). Errors in variables. Transformations of
inputs and of outputs. Utility of probabilistic assumptions; the importance of
looking at the residuals. What it really means when coefficients are
significantly non-zero. What "controlled for in a linear regression" really means.
- Reading: Notes, chapter 2 (R); Notes, appendix B
- Optional reading: Faraway, rest of chapter 1
- January 22 (Tuesday): Lecture 3, Evaluation: Error and inference
- Statistical models have
three main uses: as ways of summarizing (reducing, compressing) the data; as
scientific models, facilitating actual scientific inference; and as
predictors. Both summarizing and scientific inference are linked to prediction
(though in different ways), so we'll focus on prediction. In particular for
now we focus on the average error of prediction, under some particular
measure of error. The distinction between in-sample error and generalization
error, and why the former is almost invariably optimistic about the latter.
Over-fitting. Examples of just how spectacularly one can over-fit really very
harmless data. A brief sketch of the ideas of learning theory and capacity
control. Data-set-splitting as a first attempt at practically controlling
over-fitting. Cross-validation for estimating generalization error and for
model selection. Justifying model-based inferences.
- Reading: Notes, chapter 3 (R)
- Cox and Donnelly, ch. 6 (on Blackboard)
- Homework 1 due at midnight on Monday
- Homework 2: assignment, R, penn-select.csv data file
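The cross-validation idea in Lecture 3 can be sketched in a few lines. This is a minimal illustration in plain Python (standard library only), not code from the course notes: the function names `fit_mean`, `fit_line`, and `cv_mse`, the choice of five folds, and the synthetic data are all assumptions made for the example. It compares a constant predictor against a simple linear fit by their estimated out-of-sample mean squared error.

```python
import random

def fit_mean(xs, ys):
    """Constant model: always predict the training mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares for the model y = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=5, seed=0):
    """Estimate generalization MSE by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all points
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Score the model only on points it did not see during fitting.
        total += sum((ys[i] - model(xs[i])) ** 2 for i in fold)
    return total / len(xs)

# Synthetic data (hypothetical): a linear trend plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

mse_mean = cv_mse(fit_mean, xs, ys)
mse_line = cv_mse(fit_line, xs, ys)
print("CV MSE, constant model:", mse_mean)
print("CV MSE, linear model:  ", mse_line)
```

Model selection then amounts to preferring the model with the smaller cross-validated error, here the linear one, since the data really do have a linear trend.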
- January 24 (Thursday): Lecture 4, Smoothing methods in regression
- The bias-variance trade-off tells us how much we should smooth. Adapting
to unknown roughness with cross-validation; detailed examples. How quickly
does kernel smoothing converge on the truth? Using kernel regression with
multiple inputs. Using smoothing to automatically discover interactions.
Plots to help interpret multivariate smoothing results. Average predictive comparisons.
- Reading: Notes, chapter 4 (R)
- Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
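As a rough illustration of Lecture 4's "adapting to unknown roughness with cross-validation", here is a sketch of kernel-weighted averaging with the bandwidth picked by leave-one-out cross-validation. It is plain Python rather than the np package mentioned in the readings; the Gaussian kernel, the bandwidth grid, the function names, and the sine-plus-noise data are assumptions chosen for the example.

```python
import math
import random

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the weighted average."""
    return math.exp(-0.5 * u * u)

def kernel_smooth(x0, xs, ys, h):
    """Kernel-weighted average of the ys at the point x0, with bandwidth h."""
    weights = [gaussian_kernel((x0 - x) / h) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def loocv_mse(xs, ys, h):
    """Leave-one-out cross-validated MSE for bandwidth h."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        xs_rest = xs[:i] + xs[i + 1:]
        ys_rest = ys[:i] + ys[i + 1:]
        total += (ys[i] - kernel_smooth(xs[i], xs_rest, ys_rest, h)) ** 2
    return total / n

# Synthetic data (hypothetical): one period of a sine wave plus noise.
rng = random.Random(3)
xs = [i / 59 for i in range(60)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]

bandwidths = [0.02, 0.05, 0.1, 0.3, 1.0]
scores = {h: loocv_mse(xs, ys, h) for h in bandwidths}
best_h = min(scores, key=scores.get)
print("cross-validated bandwidth:", best_h)
```

Very large bandwidths average over nearly all the data and so flatten the curve (high bias), while very small ones chase the noise (high variance); the cross-validated choice sits in between.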
- January 29 (Tuesday): Lecture 5, Simulation
- Simulation: implementing the story encoded in the model, step by step, to
produce something data-like. Stochastic models have random components and so
require some random steps. Stochastic models specified through conditional
distributions are simulated by chaining together random variables. How to
generate random variables with specified distributions. Simulation shows us
what a model predicts (expectations, higher moments, correlations, regression
functions, sampling distributions); analytical probability calculations are
short-cuts for exhaustive simulation. Simulation lets us check aspects of the
model: does the data look like typical simulation output? if we repeat our
exploratory analysis on the simulation output, do we get the same results?
Simulation-based estimation: the method of simulated moments.
- Reading: Notes, chapter 5 (but sections 5.4--5.6 are optional); R
- Homework 2 due at midnight on Monday
- Homework 3 assigned: assignment,
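One standard answer to "how to generate random variables with specified distributions" is the inverse-transform (quantile) method: apply the inverse CDF to a uniform random number. The sketch below, in plain Python, does this for an exponential distribution; the function name `rexp_inverse` and the rate of 2 are assumptions for illustration, and the notes may present the method differently.

```python
import math
import random

def rexp_inverse(n, rate, rng):
    """Draw n exponential variates by inverting the CDF F(x) = 1 - exp(-rate*x)."""
    # rng.random() lies in [0, 1), so 1 - U lies in (0, 1] and the log is defined.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

rng = random.Random(42)
draws = rexp_inverse(10000, rate=2.0, rng=rng)

# Simulation as a stand-in for calculation: the sample mean should be
# close to the analytical expectation 1/rate = 0.5.
sample_mean = sum(draws) / len(draws)
print("simulated mean:", sample_mean)
```

The same pattern (simulate, then summarize the simulated output) gives approximate moments, correlations, or sampling distributions whenever the analytical short-cut is unavailable.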
- January 31 (Thursday): Lecture 6, The Bootstrap
- Quantifying uncertainty by looking at sampling distributions. The
bootstrap principle: sampling distributions under a good estimate of the truth
are close to the true sampling distributions. Parametric bootstrapping:
simulating from a model. Non-parametric bootstrapping: re-sampling the data.
Special issues for regression: re-sampling residuals vs. re-sampling cases.
Many examples. When does the bootstrap fail?
- Reading: Notes, chapter 6 (R for figures and examples; pareto.R; wealth.dat)
- Lecture slides;
R for in-class examples
- Cox and Donnelly, chapter 8
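A minimal sketch of the non-parametric bootstrap described in Lecture 6, assuming independent observations: re-sample the data with replacement, re-compute the estimator on each re-sample, and use the spread of the re-computed values as the standard error. The function name `bootstrap_se`, the number of replicates, and the Gaussian data are illustrative assumptions.

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=1000, seed=0):
    """Bootstrap standard error of estimator(data) by re-sampling cases."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Re-sample n observations from the data, with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

# Synthetic data (hypothetical): 200 draws from a Gaussian.
rng = random.Random(7)
data = [rng.gauss(10, 2) for _ in range(200)]

se_mean = bootstrap_se(data, statistics.mean)
se_median = bootstrap_se(data, statistics.median)
print("bootstrap SE of the mean:  ", se_mean)
print("bootstrap SE of the median:", se_median)
```

For the mean, the result can be checked against the textbook formula (sample standard deviation divided by the square root of n); for the median there is no equally simple formula, which is where the bootstrap earns its keep.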
- February 5 (Tuesday): Lecture 7, Writing R Code
- R programs are built around functions: pieces of code that take inputs or
arguments, do calculations on them, and give back outputs or return values.
The most basic use of a function is to encapsulate something we've done in the
terminal, so we can repeat it, or make it more flexible. To assure ourselves
that the function does what we want it to do, we subject it to sanity-checks,
or "write tests". To make functions more flexible, we use control structures,
so that the calculation done, and not just the result, depends on the argument.
R functions can call other functions; this lets us break complex problems into
simpler steps, passing partial results between functions. Programs inevitably
have bugs: debugging is the cycle of figuring out what the bug is, finding
where it is in your code, and fixing it. Good programming habits make
debugging easier, as do some tricks. Avoiding iteration. Re-writing code
to avoid mistakes and confusion, to be clearer, and to be more flexible.
- Reading: Notes, Appendix A
- Optional reading: Slides from 36-350, introduction
to statistical computing, especially through lecture 15.
- Homework 3 due at midnight on Monday
- Homework 4 Assignment, nampd.csv data set, code for the assignment
- February 7 (Thursday): Lecture 8, Heteroskedasticity, weighted least
squares, and variance estimation
- Weighted least squares estimates, to give more emphasis to particular data
points. Heteroskedasticity and the problems it causes for inference. How
weighted least squares gets around the problems of heteroskedasticity, if we
know the variance function. Estimating the conditional variance function from
regression residuals. An iterative method for estimating the regression
function and the variance function together. Examples of conditional variance
estimation. Locally constant and locally linear modeling. Lowess.
- Reading: Notes, chapter 7
- Optional reading: Faraway, section 11.3
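To make the weighted least squares idea in Lecture 8 concrete, here is a sketch for a single predictor, using the closed-form solution in terms of weighted means. It assumes the variance function is known, so that the weights are the inverse variances; the function name and the variance function sd(x) = 0.5 + x are made up for the example.

```python
import random

def weighted_least_squares(xs, ys, ws):
    """Minimize sum_i w_i * (y_i - a - b*x_i)^2 over the intercept a and slope b."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of x
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of y
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Synthetic heteroskedastic data (hypothetical): noise sd grows with x.
rng = random.Random(5)
xs = [rng.uniform(0, 5) for _ in range(500)]
sds = [0.5 + x for x in xs]
ys = [1.0 + 3.0 * x + rng.gauss(0, s) for x, s in zip(xs, sds)]

# Weight each point by the inverse of its (here, known) noise variance.
ws = [1.0 / s ** 2 for s in sds]
a_wls, b_wls = weighted_least_squares(xs, ys, ws)
a_ols, b_ols = weighted_least_squares(xs, ys, [1.0] * len(xs))
print("WLS fit:", a_wls, b_wls)
print("OLS fit:", a_ols, b_ols)
```

Setting all weights equal recovers ordinary least squares, so the same function gives both fits for comparison. When the variance function is unknown, it would first have to be estimated, for instance from squared residuals as the lecture describes.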
- February 12 (Tuesday): Lecture 9, Splines
- Kernel regression controls the amount of smoothing indirectly by
bandwidth; why not control the irregularity of the smoothed curve directly?
The spline smoothing problem is a penalized least squares problem: minimize
mean squared error, plus a penalty term proportional to average
curvature of the function over space. The solution is always a continuous
piecewise cubic polynomial, with continuous first and second derivatives.
Altering the strength of the penalty moves along a bias-variance trade-off,
from pure linear regression at one extreme to pure interpolation at the other;
changing the strength of the penalty is equivalent to minimizing the mean
squared error under a constraint on the average curvature. To ensure
consistency, the penalty/constraint should weaken as the data grows; the
appropriate size is selected by cross-validation. An example with the data,
including confidence bands. Writing splines as basis functions, and fitting as
least squares on transformations of the data, plus a regularization term. A
brief look at splines in multiple dimensions. Splines versus kernel regression.
- Reading: Notes, chapter 8
- Optional reading: Faraway, section 11.2
- Homework 4 due at midnight on Monday
- Homework 5 Assignment
- February 14 (Thursday): Lecture 10, Additive models
- The curse of dimensionality limits the usefulness of fully non-parametric
regression in problems with many variables: bias remains under control, but
variance grows rapidly with dimensionality. Parametric models do not have this
problem, but have bias and do not let us discover anything about the
true function. Structured or constrained non-parametric regression
compromises, by adding some bias so as to reduce variance. Additive models are
an example, where each input variable has a "partial response function", which
add together to get the total regression function; the partial response
functions are unconstrained. This generalizes linear models but still evades
the curse of dimensionality. Fitting additive models is done iteratively,
starting with some initial guess about each partial response function and then
doing one-dimensional smoothing, so that the guesses correct each other until a
self-consistent solution is reached. Examples in R using the California
house-price data. Conclusion: there is hardly ever any reason to prefer linear
models to additive ones, and the continued thoughtless use of linear regression
is a scandal.
- Reading: Notes, chapter 9 (mapper.R)
- Optional reading: Faraway, chapter 12
- February 19 (Tuesday): Lecture 11, Testing Regression Specifications
- Non-parametric smoothers can be used to test parametric models. Forms of
tests: differences in in-sample performance; differences in generalization
performance; whether the parametric model's residuals have expectation zero
everywhere. Constructing a test statistic based on in-sample performance.
Using bootstrapping from the parametric model to find the null distribution of
the test statistic. An example where the parametric model is correctly
specified, and one where it is not. Cautions on the interpretation of
goodness-of-fit tests. Why use parametric models at all? Answers: speed of
convergence when correctly specified; and the scientific interpretation of
parameters, if the model actually comes from a scientific theory.
Mis-specified parametric models can predict better, at small sample sizes, than
either correctly-specified parametric models or non-parametric smoothers,
because of their favorable bias-variance characteristics; an example.
- Reading: Notes, chapter 10;
R for in-class demos
- Cox and Donnelly, chapter 7
- Homework 5 due at midnight on Monday
- Homework 6: assignment
- February 21 (Thursday): Lecture 12, More about Hypothesis Testing
- The logic of hypothesis testing: significance, power, the will to believe,
and the (shadow) price of power. Severe tests of hypotheses: severity of
rejection vs. severity of acceptance. Common abuses. Confidence sets as the
"dual" to hypothesis tests. Crucial role of sampling distributions. Examples,
right and wrong.
- Reading: Notes, chapter 11
- February 26 (Tuesday): Lecture 13, Logistic regression
- Modeling conditional probabilities; using regression to model
probabilities; transforming probabilities to work better with regression; the
logistic regression model; maximum likelihood; numerical maximum likelihood by
Newton's method and by iteratively re-weighted least squares; comparing
logistic regression to logistic-additive models.
- Reading: Notes, chapter 12
- Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
- Homework 6 due at midnight on Monday
- Midterm 1: assignment. Your data-set has been e-mailed to you.
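The "numerical maximum likelihood by Newton's method" step in Lecture 13 can be sketched for a logistic regression with one predictor, where the two-by-two linear system can be solved by hand. This is an illustrative plain-Python version with assumed names and synthetic data; in R one would ordinarily just call glm with a binomial family.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_newton(xs, ys, n_iter=25):
    """Fit P(Y=1|x) = sigmoid(b0 + b1*x) by Newton's method on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative Hessian; p*(1-p) are the weights used in re-weighted least squares.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Synthetic data (hypothetical) with true coefficients (-0.5, 1.5).
rng = random.Random(11)
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys = [1 if rng.random() < sigmoid(-0.5 + 1.5 * x) else 0 for x in xs]

b0_hat, b1_hat = logistic_newton(xs, ys)
print("estimated coefficients:", b0_hat, b1_hat)
```

Each Newton step is equivalent to a weighted least squares fit with weights p(1-p), which is why the same calculation appears under the name iteratively re-weighted least squares.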
- February 28 (Thursday): Lecture 14, Generalized linear models and
generalized additive models
- Poisson regression for counts; iteratively re-weighted least squares again.
The general pattern of generalized linear models; over-dispersion. Generalized additive models.
- Reading: Notes, first half of chapter 13
- Optional reading: Faraway, section 3.1 and chapter 6
- March 5 (Tuesday): Lecture 15, GLM and GAM Examples
- Building a weather forecaster for Snoqualmie Falls, Wash., with logistic
regression. Exploratory examination of the data. Predicting wet or dry days
from the amount of precipitation the previous day. First logistic regression
model. Finding predicted probabilities and confidence intervals for them.
Comparison to spline smoothing and a generalized additive model. Model
comparison test detects significant mis-specification. Re-specifying the
model: dry days are special. The second logistic regression model and its
comparison to the data. Checking the calibration of the second model.
- Reading: Notes, second half of chapter 13
- Optional reading: Faraway, chapters 6 and 7 (continued from previous lecture)
- Midterm 1 due at midnight on Monday
- March 7 (Thursday): Lecture 16, Multivariate Distributions
- Reminders about multivariate distributions. The multivariate Gaussian
distribution: definition, relation to the univariate or scalar Gaussian
distribution; effect of linear transformations on the parameters; plotting
probability density contours in two dimensions; using eigenvalues and
eigenvectors to understand the geometry of multivariate Gaussians; conditional
distributions in multivariate Gaussians and linear regression; computational
aspects, specifically in R. General methods for estimating parametric
distributional models in arbitrary dimensions: moment-matching and maximum
likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison
by cross-validation and by likelihood ratio tests; goodness of fit by the
random projection trick.
- Reading: Notes, chapter 14
- March 12 and 14: Spring break
- March 19 (Tuesday): Lecture 17, Density Estimation
- The desirability of estimating not just conditional means, variances, etc.,
but whole distribution functions. Parametric maximum likelihood is a solution,
if the parametric model is right. Histograms and empirical cumulative
distribution functions are non-parametric ways of estimating the distribution:
do they work? The Glivenko-Cantelli law on the convergence of empirical
distribution functions, a.k.a. "the fundamental theorem of statistics". More
on histograms: they converge on the right density, if bins keep shrinking but
the number of samples per bin keeps growing. Kernel density estimation and its
properties: convergence on the true density if the bandwidth shrinks at the
right rate; superior performance to histograms; the curse of dimensionality
again. An example with cross-country economic data. Kernels for discrete
variables. Estimating conditional densities; another example with the OECD
data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
- Reading: Notes, chapter 15
- Homework 7 assignment, n90_pol.csv data
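A kernel density estimate, as in Lecture 17, places a small copy of the kernel at every data point and averages them. The sketch below uses a Gaussian kernel and the normal-reference rule of thumb for the bandwidth (1.06 times the standard deviation times n to the power -1/5); the lecture selects bandwidths differently, so the rule, like the function name and the simulated data, is an assumption made to keep the example short.

```python
import math
import random
import statistics

def kde(x0, data, h):
    """Gaussian kernel density estimate at x0 with bandwidth h."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in data) / norm

# Synthetic data (hypothetical): 1000 draws from a standard Gaussian.
rng = random.Random(17)
data = [rng.gauss(0, 1) for _ in range(1000)]

# Normal-reference rule of thumb for the bandwidth.
h = 1.06 * statistics.stdev(data) * len(data) ** (-1 / 5)

density_at_zero = kde(0.0, data, h)
true_density_at_zero = 1.0 / math.sqrt(2.0 * math.pi)
print("estimated density at 0:", density_at_zero)
print("true density at 0:     ", true_density_at_zero)
```

The bandwidth shrinks as n grows, but slowly, which is the "shrinks at the right rate" condition for the estimate to converge on the true density.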
- March 21 (Thursday): Lecture 18, Relative Distributions and Smooth Tests
- Applying the right CDF to a continuous random variable makes it uniformly
distributed. How do we test whether some variable is uniform? The smooth test
idea, based on series expansions for the log density. Asymptotic theory of the
smooth test. Choosing the basis functions for the test and its order. Smooth
tests for non-uniform distributions through the transformation. Dealing with
estimated parameters. Some examples. Non-parametric density estimation on
[0,1]. Checking conditional distributions and calibration with smooth tests.
The relative distribution idea: comparing whole distributions by seeing where
one set of samples falls in another distribution. Relative density and its
estimation. Illustrations of relative densities. Decomposing shifts in
- Reading: Notes, chapter 16
- Optional reading: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics";
Handcock and Morris, "Relative Distribution Methods"
- March 26 (Tuesday): Lecture 19, Principal Components Analysis
- Principal components is the simplest, oldest and most robust of
dimensionality-reduction techniques. It works by finding the line (plane,
hyperplane) which passes closest, on average, to all of the data points. This
is equivalent to maximizing the variance of the projection of the data on to
the line/plane/hyperplane. Actually finding those principal components reduces
to finding eigenvalues and eigenvectors of the sample covariance matrix. Why
PCA is a data-analytic technique, and not a form of statistical inference. An
example with cars. PCA with words: "latent semantic analysis"; an example with
real newspaper articles. Visualization with PCA and multidimensional scaling.
Cautions about PCA; the perils of reification; illustration with genetic
- Reading: Notes, chapter 17;
pca.R, pca-examples.Rdata, and cars-fixed04.dat
- Homework 7 due at midnight on Monday
- Homework 8 assignment, MOM data file
- March 28 (Thursday): Lecture 20, Factor Analysis
- Adding noise to PCA to get a statistical model. The factor analysis model,
or linear regression with unobserved independent variables. Assumptions of the
factor analysis model. Implications of the model: observable variables are
correlated only through shared factors; "tetrad equations" for one factor
models, more general correlation patterns for multiple factors. (Our first
look at latent variables and conditional independence.) Geometrically, the
factor model says the data have a Gaussian distribution on some low-dimensional
plane, plus noise moving them off the plane. Estimation by heroic linear
algebra; estimation by maximum likelihood. The rotation problem, and why it is
unwise to reify factors. Other models which produce the same correlation
patterns as factor models.
- Reading: Notes, chapter 18;
- April 2 (Tuesday): Lecture 21, Mixture Models
- From factor analysis to mixture models by allowing the latent variable to
be discrete. From kernel density estimation to mixture models by reducing the
number of points with copies of the kernel. Probabilistic formulation of
mixture models. Geometry: planes again. Probabilistic clustering. Estimation
of mixture models by maximum likelihood, and why it leads to a vicious circle.
The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious
circle with iterative approximation. More on the EM algorithm: convexity,
Jensen's inequality, optimizing a lower bound, proving that each step of EM
increases the likelihood. Mixtures of regressions. Other extensions.
- Reading: Notes, first half of chapter 19
- Homework 8 due at midnight on Monday
- Homework 9 (cancelled)
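The EM iteration from Lecture 21 can be sketched for the simplest case, a mixture of two one-dimensional Gaussians. The E-step computes each point's probability of belonging to the first component given the current parameters; the M-step re-estimates the parameters as weighted means and variances. The initialization at the data's minimum and maximum, the fixed iteration count, and all names are assumptions for illustration.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_gaussians(data, n_iter=100):
    """EM for a two-component Gaussian mixture; also returns the log-likelihood path."""
    # Crude initialization (an assumption): means at the extremes, common spread.
    mu0, mu1 = min(data), max(data)
    s0 = s1 = statistics.pstdev(data)
    lam = 0.5  # mixing weight of component 0
    loglik_path = []
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each point.
        resp = []
        loglik = 0.0
        for x in data:
            a = lam * normal_pdf(x, mu0, s0)
            b = (1.0 - lam) * normal_pdf(x, mu1, s1)
            resp.append(a / (a + b))
            loglik += math.log(a + b)
        loglik_path.append(loglik)
        # M-step: weighted maximum-likelihood estimates.
        n0 = sum(resp)
        n1 = len(data) - n0
        mu0 = sum(r * x for r, x in zip(resp, data)) / n0
        mu1 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n1
        s0 = math.sqrt(sum(r * (x - mu0) ** 2 for r, x in zip(resp, data)) / n0)
        s1 = math.sqrt(sum((1.0 - r) * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1)
        lam = n0 / len(data)
    return lam, (mu0, mu1), (s0, s1), loglik_path

# Synthetic data (hypothetical): 30% from N(-2, 1), 70% from N(3, 1).
rng = random.Random(21)
data = [rng.gauss(-2, 1) for _ in range(300)] + [rng.gauss(3, 1) for _ in range(700)]

lam_hat, mu_hat, sigma_hat, loglik_path = em_two_gaussians(data)
print("mixing weight:", lam_hat)
print("means:", mu_hat, "standard deviations:", sigma_hat)
```

The recorded log-likelihood path should never decrease from one iteration to the next, which is the property the lecture proves via Jensen's inequality.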
- April 4 (Thursday): Lecture 22, Mixture Model Examples and Complements
- Precipitation in Snoqualmie Falls revisited. Fitting a two-component
Gaussian mixture; examining the fitted distribution; checking calibration.
Using cross-validation to select the number of components to use. Examination
of the selected mixture model. Suspicious patterns in the parameters of the
selected model. Approximating complicated distributions vs. revealing hidden
structure. Using bootstrap hypothesis testing to select the number of mixture components.
- Reading: Notes, second half of chapter 19; mixture-examples.R
- April 9 (Tuesday): Lecture 23, Graphical Models
- Conditional independence and dependence properties in factor models. The
generalization to graphical models. Directed acyclic graphs. DAG models.
Factor, mixture, and Markov models as DAGs. The graphical Markov property.
Reading conditional independence properties from a DAG. Creating conditional
dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with
DAGs; does asbestos whiten teeth?
- Reading: Notes, chapter 20
- Homework 9 due at midnight on Monday
- Exam 2: assignment (your data set was mailed to you)
- April 11 (Thursday): Lecture 24, Graphical Causal Models
- Probabilistic prediction is about passively selecting a sub-ensemble,
leaving all the mechanisms in place, and seeing what turns up after applying
that filter. Causal prediction is about actively producing a new
ensemble, and seeing what would happen if something were to change
("counterfactuals"). Graphical causal models are a way of reasoning about
causal prediction; their algebraic counterparts are structural equation models
(generally nonlinear and non-Gaussian). The causal Markov property.
Faithfulness. Performing causal prediction by "surgery" on causal graphical
models. The d-separation criterion. Path diagram rules for linear models.
- Reading: Notes, chapter 21
- Optional reading: Cox and Donnelly, chapter 9;
Pearl, "Causal Inference in Statistics", section 1, 2, and 3 through 3.2
- April 16 (Tuesday): Lecture 25, Identifying Causal Effects from Observations
- Reprise of causal effects vs. probabilistic conditioning. "Why think, when
you can do the experiment?" Experimentation by controlling everything
(Galileo) and by randomizing (Fisher). Confounding and identifiability. The
back-door criterion for identifying causal effects: condition on covariates
which block undesired paths. The front-door criterion for identification: find
isolated and exhaustive causal mechanisms. Deciding how many black boxes to
open up. Instrumental variables for identification: finding some exogenous
source of variation and tracing its effects. Critique of instrumental
variables: vital role of theory, its fragility, consequences of weak
instruments. Irremovable confounding: an example with the detection of social
influence; the possibility of bounding unidentifiable effects. Summary
recommendations for identifying causal effects.
- Reading: Notes, chapter 22
- Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
- Midterm 2 due at midnight on Monday
- Homework 10: assignment, sesame.csv
- April 18 (Thursday): Carnival, no class
- April 23 (Tuesday): Lecture 26, Estimating Causal Effects from Observations
- Estimating graphical models: substituting consistent estimators into the
formulas for front and back door identification; average effects and
regression; tricks to avoid estimating marginal distributions; matching
and propensity scores as computational short-cuts in
back-door adjustment. Instrumental variables estimation: the Wald estimator,
two-stage least-squares. Summary recommendations for estimating causal effects.
- Reading: Notes, chapter 23
- Homework 10 due at midnight on Monday
- Homework 11 assignment, debt.csv
- April 25 (Thursday): Lecture 27, Discovering Causal Structure from Observations
- How do we get our causal graph? Comparing rival DAGs by testing selected
conditional independence relations (or dependencies). Equivalence classes of
graphs. Causal arrows never go away no matter what you condition on ("no
causation without association"). The crucial difference between common causes
and common effects: conditioning on common causes makes their effects
independent, conditioning on common effects makes their causes dependent.
Identifying colliders, and using them to orient arrows. Inducing orientation
to enforce consistency. The SGS algorithm for discovering causal graphs; why
it works. The PC algorithm: the SGS algorithm for lazy people. What about
latent variables? Software: TETRAD and pcalg; examples of
working with pcalg. Limits to observational causal discovery:
universal consistency is possible (and achieved), but uniform consistency is not.
- Reading: Notes, chapter 24
- April 30 (Tuesday): Lecture 28, Time Series I
- What time series are. Properties: autocorrelation or serial correlation;
strong and weak stationarity. The correlation time, the world's simplest
ergodic theorem, effective sample size. The meaning of ergodicity: a single
increasingly long time series becomes representative of the whole process.
Conditional probability estimates; Markov models; the meaning of the Markov
property. Autoregressive models, especially additive autoregressions;
conditional variance estimates. Bootstrapping time series. Trends and
- Reading: Notes, chapter 25;
- Homework 11 due at midnight on Monday
- Final exam assignment; strikes.csv and macro.csv data sets
- Help installing pcalg
- May 2 (Thursday): Lecture 29, Time Series II
- Cross-validation for time series. Change-points and "structural breaks".
Moving averages: spurious correlations (Yule effect) and oscillations (Slutsky
effect). State-space or hidden Markov models; moving average and ARMA models
as state-space models. The EM algorithm for hidden Markov models; particle
filtering. Multiple time series: "dynamic" graphical models; "Granger"
causality (which is not causal); the possibility of real causality.
- May 13
- Final exam due at 10:30 am