Cosma Shalizi
36-402, Undergraduate Advanced Data Analysis
Spring 2012
Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100
This is the page for the 2012 class. You are probably looking for the 2013 class.
The goal of this class is to train students in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced
in 36-401,
extending it to more general functional forms, and more general kinds of data,
emphasizing the computation-intensive methods introduced since the 1980s.
After taking the class, when you're faced with a new data-analysis problem, you
should be able to (1) select appropriate methods, (2) use statistical software
to implement them, (3) critically evaluate the resulting statistical models,
and (4) communicate the results of your analyses to collaborators and to
non-statisticians.
Graduate students from other departments wishing to take this course should
register for it under the number "36-608".
Prerequisites
36-401, or consent of the instructor. The latter is only granted under very unusual circumstances.
Instructors
Professor: Cosma Shalizi | cshalizi [at] cmu.edu | 229 C Baker Hall | 268-7826
Teaching assistants: Ms. Stefa Etchegaray, Mr. Mingyu Tang, Mr. Zachary Kurtz, Mr. Cong Lu
Topics, Notes, Readings
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; the bootstrap;
penalized fitting; mis-specification checks
- Yet More Linear Regression: what is regression, really?;
review of ordinary linear regression and its limits; extensions
- Smoothing: kernel smoothing, including local polynomial
regression; splines; additive models; kernel density estimation
- Generalized linear and additive models: logistic
regression; generalized linear models; generalized additive models.
- Latent variables and structured data: principal
components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
- Causality: graphical causal models; identification of
causal effects from observations; estimation of causal effects;
discovering causal structure
- Dependent data: Markov models for time
series without latent variables; hidden Markov models for time series with
latent variables; longitudinal, spatial and network data
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.
Course Mechanics
Homework will be 60% of the grade, two midterms 10% each, and the final
20%.
Homework
The homework will give you practice in using the techniques you are learning
to analyze data and to interpret the analyses. There will be twelve
homework assignments, nearly one every week; they will all be due on Tuesdays
at the beginning of class, through Coursekit, and will count equally,
together making up 60% of your grade. The lowest three homework grades will be dropped;
consequently, no late homework will be accepted for any reason.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it. This portion of the
assignment will be graded, along with the other questions, but failure to do
the writing (or to show signs of a serious attempt at it) will result in an
automatic zero for that assignment. As always, raw computer output and R code
are not acceptable in the body of the write-up, but should be put in an appendix to each assignment.
Homework may be submitted either as a PDF (preferred) or as a plain text file
(.txt). If you prepare your homework in Word, be sure to submit a PDF
file; .doc, .docx, etc., files will not be graded.
Unlike PDF or plain text, Word files do not display
consistently across different machines, different versions of the program on
the same machine, etc.; avoiding them ensures that what we grade is what you
think you wrote. Word files are also much more of
a security hole than PDF or (especially) plain text. Finally, it is obnoxious
to force people to buy commercial, closed-source software just to read what you
write. (It would be obnoxious even if Microsoft paid you for marketing its
wares that way, but it doesn't.)
Exams
There will be two take-home mid-term exams (10% each), due at 10:30 am on
March 6th and April 17th. You will have one week to work on each midterm.
There will be no homework in those weeks. There will also be a take-home final
exam (20%), due at 10:30 am on May 15, which you will have two weeks to do.
Exams must also be submitted through Coursekit, under the same rules as
homework.
Quality Control
To help control the quality of the grading, every week (after the first week of
classes), six students will be selected at random, and will meet with the
professor for ten minutes each, to explain their work and to answer questions
about it. Grades on that week's assignment (possibly including the take-home
exams) will be revised (up or down) by the professor in light of performance
during these sessions. You may be selected on multiple weeks, if that's how
the random numbers come up.
Office Hours
Prof. Shalizi will hold office hours Mondays, 1:30--3:30 pm, in Baker Hall
229A, or by appointment. TA office hours will be Thursdays, 3:30--4:30 pm in
Porter Hall A20A. If you want help with computing, please bring your laptop.
Coursekit
We will be trying a replacement for Blackboard called "Coursekit" for
announcements, turning in homework, grades, and a discussion forum. An
e-mail with an invitation to the system was sent to your Andrew address;
here is our Coursekit website. Assignments
and notes will be posted here, with links on Coursekit; solutions will
be distributed as hard-copies, in class.
Textbook
Julian
Faraway, Extending the Linear Model with R (Chapman Hall/CRC
Press, 2006,
ISBN 978-1-58488-424-8),
and Paul Teetor, The R Cookbook (O'Reilly Media, 2011,
ISBN 978-0-596-80915-7)
will both be required.
(Faraway's page on the book,
with help and errata.) Venables and Ripley's Modern Applied
Statistics with S
(Springer,
2003;
ISBN 9780387954578)
will be optional. The campus bookstore should have copies of
all of these.
In addition to the textbooks, you are expected to read the notes. They
contain material that goes beyond what is in the texts, and you will need
that material to do the assignments and the exams.
Collaboration, Cheating and Plagiarism
Feel free to discuss all aspects of the course with one another, including
homework and exams. However, the work you hand in must be your own. You must
not copy mathematical derivations, computer output and input, or written
descriptions from anyone or anywhere else, without reporting the source within
your work. Unacknowledged copying will lead at the very least to an automatic
zero on that assignment, and possibly severe disciplinary action. Please read
the CMU Policy on
Cheating and Plagiarism, and don't plagiarize.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
R
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Many of the assignments will require you to use it. You can expect no
assistance from the instructors with any other programming language or
statistical software. If you are not able to use R, or do not have
ready, reliable access to a computer on which you can do so, let me know at
once.
Here are some resources for learning R:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
- Paul Teetor, The R Cookbook, explains how to use R to
do many, many common tasks. (It's like the inverse to R's help: "What
command does X?", instead of "What does command Y do?"). It is one of the required texts, and is available at the campus bookstore.
- The notes for 36-350, Introduction to
Statistical Computing
- There are now many books about R. Some recommendable ones:
- Joseph Adler, R in a Nutshell
(O'Reilly, 2009;
ISBN 9780596801700). Probably most useful for those with previous experience programming in another language.
- W. John Braun and
Duncan
J. Murdoch, A
First Course in Statistical Programming with R (Cambridge University Press, 2008; ISBN 978-0-521-69424-7)
- John M. Chambers, Software for Data Analysis:
Programming with R
(Springer, 2008,
ISBN 978-0-387-75935-7).
The best book on writing clean and reliable R programs; probably more advanced
than you will need.
- Norman
Matloff, The Art of R Programming (No Starch Press, 2011,
ISBN 978-1-59327-384-2).
A good introduction to programming, using R, for complete novices. Less statistics
than Braun and Murdoch, more emphasis on programming skills.
Even if you know how to do some basic coding (or more), you
should read the page of Minimal
Advice on Programming.
Reminders
Some handouts on stuff all of you should already know, but where evidently some of you could use refreshers:
- Uncorrelated vs. Independent
- Propagation of Error
Schedule
This was the schedule for the 2012 class. You are probably looking for the 2013 class.
The complete notes (large PDF)
- January 17 (Tuesday): Lecture 1, Introduction to the class
- Statistics is the science which studies methods for learning from imperfect
data. Regression is a statistical model of functional relationships between
variables. Getting relationships right means being able to predict well. The
least-squares optimal prediction is the expectation value; the conditional
expectation function is the regression function. The regression function must
be estimated from data; the bias-variance trade-off controls this estimation.
Ordinary least squares revisited as a smoothing method. Other linear smoothers:
nearest-neighbor averaging, kernel-weighted averaging.
- Reading: Notes, chapter 1; Faraway, chapter 1 (especially up to
p. 17)
- Homework 1: assignment, data
- January 19 (Thursday): Lecture 2, The truth about linear regression
- Using Taylor's theorem to justify linear regression locally. Collinearity.
Consistency of ordinary least squares estimates under weak conditions. Linear
regression coefficients will change with the distribution of the input
variables: examples. Why R^2 is usually a distraction. Linear
regression coefficients will change with the distribution of unobserved
variables (omitted variable effects). Errors in variables. Transformations of
inputs and of outputs. Utility of probabilistic assumptions; the importance of
looking at the residuals. What "controlled for in a linear regression" really
means.
- Reading: Notes, chapter 2 (R); Faraway, chapter 1
- January 24 (Tuesday): Lecture 3, Evaluation: Error and inference
- Goals of statistical analysis: summaries, prediction, scientific inference.
Evaluating predictions: in-sample error, generalization error; over-fitting.
Cross-validation for estimating generalization error and for model
selection. Justifying model-based inferences.
- Reading: Notes, chapter 3 (R)
- Homework 1 due: solutions
- Homework 2: assignment, R, penn-select.csv data file
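- Illustration (not part of the course notes): a minimal R sketch of 5-fold cross-validation for estimating the generalization error of a linear model. The data frame df and its columns x and y are hypothetical stand-ins, not the homework data.
    # estimate out-of-sample MSE of lm(y ~ x) by k-fold cross-validation
    cv.lm.mse <- function(df, nfolds = 5) {
      n <- nrow(df)
      fold <- sample(rep(1:nfolds, length.out = n))   # random fold labels
      fold.mses <- numeric(nfolds)
      for (k in 1:nfolds) {
        train <- df[fold != k, ]                      # fit on the other folds
        test  <- df[fold == k, ]                      # hold this fold out
        fit <- lm(y ~ x, data = train)
        fold.mses[k] <- mean((test$y - predict(fit, newdata = test))^2)
      }
      mean(fold.mses)           # CV estimate of the generalization error
    }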
- January 26 (Thursday): Lecture 4, Smoothing methods in regression
- The bias-variance trade-off tells us how much we should smooth. Adapting
to unknown roughness with cross-validation; detailed examples. How quickly
does kernel smoothing converge on the truth? Using kernel regression with
multiple inputs. Using smoothing to automatically discover interactions.
Plots to help interpret multivariate smoothing results. Average predictive
comparisons.
- Reading: Notes, chapter 4 (R); Faraway, section 11.1
- Optional readings: Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
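- Illustration (not part of the course notes): a rough R sketch of kernel regression with a cross-validated bandwidth, using the np package from the optional Hayfield and Racine reading; df and its columns x and y are hypothetical.
    library(np)
    bw  <- npregbw(y ~ x, data = df)   # bandwidth chosen by least-squares cross-validation
    fit <- npreg(bw)                   # kernel regression at that bandwidth
    summary(fit)
    plot(fit)                          # the estimated regression curve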
- January 31 (Tuesday): Lecture 5, The Bootstrap
- Quantifying uncertainty by looking at sampling distributions. The
bootstrap principle: sampling distributions under a good estimate of the truth
are close to the true sampling distributions. Parametric bootstrapping.
Non-parametric bootstrapping. Many examples. When does the bootstrap
fail?
- Reading: Notes, chapter 5 (R for figures and examples; pareto.R; wealth.dat)
- R for in-class examples
- Homework 2 due
- Homework 3 assigned: assignment,
nampd.csv data set
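- Illustration (not part of the course notes): a minimal R sketch of the non-parametric (resampling-cases) bootstrap for a regression slope; df and its columns x and y are hypothetical.
    boot.slope.ci <- function(df, B = 1000) {
      slopes <- replicate(B, {
        resample <- df[sample(nrow(df), replace = TRUE), ]  # resample rows with replacement
        coef(lm(y ~ x, data = resample))["x"]               # re-estimate the slope
      })
      quantile(slopes, c(0.025, 0.975))    # crude percentile 95% interval
    }
    boot.slope.ci(df)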
- February 2 (Thursday): Lecture 6, Heteroskedasticity, weighted least
squares, and variance estimation
- Weighted least squares estimates. Heteroskedasticity and the problems it
causes for inference. How weighted least squares gets around the problems of
heteroskedasticity, if we know the variance function. Estimating the variance
function from regression residuals. An iterative method for estimating the
regression function and the variance function together. Locally constant and
locally linear modeling. Lowess.
- Reading: Notes, chapter 6; Faraway, section 11.3
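- Illustration (not part of the course notes): an R sketch of one pass of the iterative scheme, with hypothetical df, x, and y: fit by OLS, smooth the log squared residuals to estimate the variance function, then re-fit by weighted least squares.
    fit0   <- lm(y ~ x, data = df)                        # initial OLS fit
    varfit <- smooth.spline(df$x, log(residuals(fit0)^2)) # smooth log squared residuals vs. x
    sigma2 <- exp(predict(varfit, x = df$x)$y)            # estimated conditional variance
    fit1   <- lm(y ~ x, data = df, weights = 1/sigma2)    # weighted least squares re-fit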
- February 7 (Tuesday): Lecture 7, Splines
- Kernel regression controls the amount of smoothing indirectly by bandwidth;
why not control the irregularity of the smoothed curve directly? The spline
smoothing problem is a penalized least squares problem: minimize mean squared
error, plus a penalty term proportional to average curvature of the
function over space. The solution is always a continuous piecewise cubic
polynomial, with continuous first and second derivatives. Altering the
strength of the penalty moves along a bias-variance trade-off, from pure OLS at
one extreme to pure interpolation at the other; changing the strength of the
penalty is equivalent to minimizing the mean squared error under a constraint
on the average curvature. To ensure consistency, the penalty/constraint should
weaken as the data grows; the appropriate size is selected by cross-validation.
An example with the data, including confidence bands. Writing splines as basis
functions, and fitting as least squares on transformations of the data, plus a
regularization term. A brief look at splines in multiple dimensions. Splines
versus kernel regression.
- Reading: Notes, chapter 7; Faraway, section 11.2
- Homework 3 due
- Homework 4: Assignment
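- Illustration (not part of the course notes): a minimal R sketch of spline smoothing with the penalty strength chosen by cross-validation; x and y are hypothetical numeric vectors.
    fit <- smooth.spline(x, y, cv = TRUE)      # leave-one-out CV picks the penalty
    fit$lambda                                 # the selected penalty strength
    plot(x, y)
    lines(predict(fit, x = sort(x)), lwd = 2)  # the fitted piecewise-cubic curve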
- February 9 (Thursday): Lecture 8, Additive models
- The curse of dimensionality limits the usefulness of fully non-parametric
regression in problems with many variables: bias remains under control, but
variance grows rapidly with dimensionality. Parametric models do not have this
problem, but have bias and do not let us discover anything about the
true function. Structured or constrained non-parametric regression
compromises, by adding some bias so as to reduce variance. Additive models are
an example: each input variable gets its own "partial response function", and
these add together to give the total regression function; the partial response
functions themselves are unconstrained. This generalizes linear models but still evades
the curse of dimensionality. Fitting additive models is done iteratively,
starting with some initial guess about each partial response function and then
doing one-dimensional smoothing, so that the guesses correct each other until a
self-consistent solution is reached. Examples in R using the California
house-price data. Conclusion: there is hardly ever any reason to prefer linear
models to additive ones, and the continued thoughtless use of linear regression
is a scandal.
- Reading: Notes, chapter 8; Faraway, chapter 12
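- Illustration (not part of the course notes): a minimal R sketch of an additive model via the mgcv package; the data frame houses and its column names are hypothetical, not the actual California data.
    library(mgcv)
    fit <- gam(price ~ s(income) + s(rooms) + s(latitude), data = houses)
    summary(fit)
    plot(fit, pages = 1)   # one panel per estimated partial response function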
- February 14 (Tuesday): Lecture 9, Writing R Code
- (By popular demand.)
- R programs are built around functions: pieces of code that take inputs or
arguments, do calculations on them, and give back outputs or return values.
The most basic use of a function is to encapsulate something we've done in the
terminal, so we can repeat it, or make it more flexible. To assure ourselves
that the function does what we want it to do, we subject it to sanity-checks,
or "write tests". To make functions more flexible, we use control structures,
so that the calculation done, and not just the result, depends on the argument.
R functions can call other functions; this lets us break complex problems into
simpler steps, passing partial results between functions. Programs inevitably
have bugs: debugging is the cycle of figuring out what the bug is, finding
where it is in your code, and fixing it. Good programming habits make
debugging easier, as do some tricks. Avoiding iteration. Re-writing code
to avoid mistakes and confusion, to be clearer, and to be more flexible.
- Reading: Notes, chapter 9
- Optional reading: Slides from 36-350, introduction
to statistical computing, especially through lecture 15.
- R for in-class demos
- Homework 4 due
- Homework 5: assignment, R
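- Illustration (not part of the course notes): encapsulating a calculation in a function and sanity-checking it against cases whose answers we already know.
    # mean squared error between predictions and observations
    mse <- function(predicted, actual) {
      stopifnot(length(predicted) == length(actual))  # basic input check
      mean((predicted - actual)^2)
    }
    stopifnot(mse(1:5, 1:5) == 0)           # perfect predictions give zero error
    stopifnot(mse(c(0, 0), c(1, 3)) == 5)   # (1 + 9)/2 = 5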
- February 16 (Thursday): Lecture 10, Testing Regression Specifications
- Non-parametric smoothers can be used to test parametric models. Forms of
tests: differences in in-sample performance; differences in generalization
performance; whether the parametric model's residuals have expectation zero
everywhere. Constructing a test statistic based on in-sample performance.
Using bootstrapping from the parametric model to find the null distribution of
the test statistic. An example where the parametric model is correctly
specified, and one where it is not. Cautions on the interpretation of
goodness-of-fit tests. Why use parametric models at all? Answers: speed of
convergence when correctly specified; and the scientific interpretation of
parameters, if the model actually comes from a scientific theory.
Mis-specified parametric models can predict better, at small sample sizes, than
either correctly-specified parametric models or non-parametric smoothers,
because of their favorable bias-variance characteristics; an example.
- Reading: Notes, chapter 10
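- Illustration (not part of the course notes): a rough R sketch of the testing recipe, with hypothetical df, x, and y. The test statistic is the in-sample improvement of a spline smoother over the linear model; its null distribution comes from simulating (parametric bootstrap) out of the fitted linear model.
    t.stat <- function(df) {
      lin <- lm(y ~ x, data = df)
      spl <- smooth.spline(df$x, df$y)
      mean(residuals(lin)^2) - mean((df$y - predict(spl, df$x)$y)^2)
    }
    sim.lm <- function(fit, df) {      # simulate data from the fitted linear model
      data.frame(x = df$x, y = fitted(fit) + rnorm(nrow(df), 0, summary(fit)$sigma))
    }
    lin.fit   <- lm(y ~ x, data = df)
    observed  <- t.stat(df)
    null.dist <- replicate(500, t.stat(sim.lm(lin.fit, df)))
    mean(null.dist >= observed)        # bootstrap p-value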
- February 21 (Tuesday): Lecture 11, More about Hypothesis Testing
- The logic of hypothesis testing: significance, power, the will to believe,
and the (shadow) price of power. Severe tests of hypotheses: severity of
rejection vs. severity of acceptance. Common abuses. Confidence sets as the
"dual" to hypothesis tests. Crucial role of sampling distributions. Examples,
right and wrong.
- Reading: Notes, chapter 11
- Homework 5 due
- Homework 6: assignment, strikes.csv data set
- February 23 (Thursday): Lecture 12, Logistic regression
- Modeling conditional probabilities; using regression to model
probabilities; transforming probabilities to work better with regression; the
logistic regression model; maximum likelihood; numerical maximum likelihood by
Newton's method and by iteratively re-weighted least squares; comparing
logistic regression to logistic-additive models.
- Reading: Notes, chapter 12; Faraway, chapter 2 (omitting sections 2.11 and 2.12)
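- Illustration (not part of the course notes): a minimal R sketch of logistic regression and its additive counterpart; the data frame df, with a 0/1 response rained and predictor precip.yesterday, is hypothetical.
    fit.glm <- glm(rained ~ precip.yesterday, data = df, family = binomial)
    head(predict(fit.glm, type = "response"))   # fitted probabilities, not log-odds
    library(mgcv)
    fit.gam <- gam(rained ~ s(precip.yesterday), data = df, family = binomial)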
- February 28 (Tuesday): Lecture 13, Generalized linear models and generalized additive models
- Poisson regression for counts; iteratively re-weighted least squares again.
The general pattern of generalized linear models; over-dispersion. Generalized
additive models.
- Reading: Notes, first half of chapter 13; Faraway, section 3.1, chapter 6
- Homework 6 due
- Midterm 1: assignment. Your data-set has been e-mailed to you.
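- Illustration (not part of the course notes): a minimal R sketch of Poisson regression for counts, with a crude over-dispersion check; df, with a count response deaths and covariate age, is hypothetical.
    fit <- glm(deaths ~ age, data = df, family = poisson)
    summary(fit)
    # residual deviance much larger than its degrees of freedom suggests
    # more variance than the Poisson distribution allows (over-dispersion)
    fit$deviance / fit$df.residual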
- March 1 (Thursday): Lecture 14, GLM and GAM Examples
- Building a weather forecaster for Snoqualmie Falls, Wash., with logistic
regression. Exploratory examination of the data. Predicting wet or dry days
from the amount of precipitation the previous day. First logistic regression
model. Finding predicted probabilities and confidence intervals for them.
Comparison to spline smoothing and a generalized additive model. Model
comparison test detects significant mis-specification. Re-specifying the
model: dry days are special. The second logistic regression model and its
comparison to the data. Checking the calibration of the second model.
- Reading: Notes, second half of chapter 13; Faraway, chapters 6 and 7 (continued from previous lecture)
- March 6 (Tuesday): Lecture 15, Multivariate Distributions
- Reminders about multivariate distributions. The multivariate Gaussian
distribution: definition, relation to the univariate or scalar Gaussian
distribution; effect of linear transformations on the parameters; plotting
probability density contours in two dimensions; using eigenvalues and
eigenvectors to understand the geometry of multivariate Gaussians; conditional
distributions in multivariate Gaussians and linear regression; computational
aspects, specifically in R. General methods for estimating parametric
distributional models in arbitrary dimensions: moment-matching and maximum
likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison
by cross-validation and by likelihood ratio tests; goodness of fit by the
random projection trick.
- Reading: Notes, chapter 14
- Midterm 1 due
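- Illustration (not part of the course notes): a minimal R sketch of fitting a multivariate Gaussian by matching moments and then simulating from the fit; X is a hypothetical n-by-2 numeric matrix.
    library(MASS)
    mu.hat    <- colMeans(X)      # estimated mean vector
    Sigma.hat <- cov(X)           # estimated covariance matrix
    sims <- mvrnorm(1000, mu = mu.hat, Sigma = Sigma.hat)  # draws from the fitted Gaussian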
- March 8 (Thursday): Lecture 16, Density Estimation
- The desirability of estimating not just conditional means, variances, etc.,
but whole distribution functions. Parametric maximum likelihood is a solution,
if the parametric model is right. Histograms and empirical cumulative
distribution functions are non-parametric ways of estimating the distribution:
do they work? The Glivenko-Cantelli law on the convergence of empirical
distribution functions, a.k.a. "the fundamental theorem of statistics". More
on histograms: they converge on the right density, if bins keep shrinking but
the number of samples per bin keeps growing. Kernel density estimation and its
properties: convergence on the true density if the bandwidth shrinks at the
right rate; superior performance to histograms; the curse of dimensionality
again. An example with cross-country economic data. Kernels for discrete
variables. Estimating conditional densities; another example with the OECD
data. Some issues with likelihood, maximum likelihood, and non-parametric
estimation.
- Reading: Notes, chapter 15
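- Illustration (not part of the course notes): a minimal R sketch of kernel density estimation with a data-driven bandwidth; the vector gdp.per.capita is hypothetical.
    d <- density(gdp.per.capita, bw = "SJ")   # Sheather-Jones bandwidth selection
    plot(d)                                   # the estimated density curve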
- March 13 and 15: Spring break
- March 20 (Tuesday): Lecture 17, Simulation
- Simulation: implementing the story encoded in the model, step by step, to
produce something data-like. Stochastic models have random components and so
require some random steps. Stochastic models specified through conditional
distributions are simulated by chaining together random variables. How to
generate random variables with specified distributions. Simulation shows us
what a model predicts (expectations, higher moments, correlations, regression
functions, sampling distributions); analytical probability calculations are
short-cuts for exhaustive simulation. Simulation lets us check aspects of the
model: does the data look like typical simulation output? if we repeat our
exploratory analysis on the simulation output, do we get the same results?
Simulation-based estimation: the method of simulated moments.
- Reading: Notes, chapter 16; R
- Homework 7: assignment, n90_pol.csv data
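- Illustration (not part of the course notes): a rough R sketch of checking a model by simulation, with hypothetical df, x, and y: simulate data-like output from a fitted regression with t-distributed noise, and ask whether a summary of the real data looks typical of the simulations.
    fit <- lm(y ~ x, data = df)
    simulate.once <- function(fit, df) {       # one data-like draw from the model's story
      fitted(fit) + summary(fit)$sigma * rt(nrow(df), df = 5)
    }
    sim.iqrs <- replicate(1000, IQR(simulate.once(fit, df)))
    IQR(df$y)                                  # where does the data fall ...
    quantile(sim.iqrs, c(0.025, 0.975))        # ... relative to the simulations?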
- March 22 (Thursday): Lecture 18, Relative Distributions and Smooth Tests
of Goodness-of-Fit
- Applying the right CDF to a continuous random variable makes it uniformly
distributed. How do we test whether some variable is uniform? The smooth test
idea, based on series expansions for the log density. Asymptotic theory of the
smooth test. Choosing the basis functions for the test and its order. Smooth
tests for non-uniform distributions through the transformation. Dealing with
estimated parameters. Some examples. Non-parametric density estimation on
[0,1]. Checking conditional distributions and calibration with smooth tests.
The relative distribution idea: comparing whole distributions by seeing where
one set of samples falls in another distribution. Relative density and its
estimation. Illustrations of relative densities. Decomposing shifts in
relative distributions.
- Reading: Notes, chapter 17
- Optional reading: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics";
Handcock and Morris, "Relative Distribution Methods"
- March 27 (Tuesday): Lecture 19, Principal Components Analysis
- Principal components is the simplest, oldest and most robust of
dimensionality-reduction techniques. It works by finding the line (plane,
hyperplane) which passes closest, on average, to all of the data points. This
is equivalent to maximizing the variance of the projection of the data on to
the line/plane/hyperplane. Actually finding those principal components reduces
to finding eigenvalues and eigenvectors of the sample covariance matrix. Why
PCA is a data-analytic technique, and not a form of statistical inference. An
example with cars. PCA with words: "latent semantic analysis"; an example with
real newspaper articles. Visualization with PCA and multidimensional scaling.
Cautions about PCA; the perils of reification; illustration with genetic
maps.
- Reading: Notes, chapter 18;
pca.R, pca-examples.Rdata, and cars-fixed04.dat
- Homework 7 due
- Homework 8: assignment
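- Illustration (not part of the course notes): a minimal R sketch of principal components with the built-in prcomp; X is a hypothetical numeric data frame.
    pca <- prcomp(X, scale. = TRUE)   # center and scale, then find the components
    summary(pca)                      # share of variance carried by each component
    head(pca$rotation[, 1:2])         # loadings on the first two components
    biplot(pca)                       # cases and variables projected on those components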
- March 29 (Thursday): Lecture 20, Factor Analysis
- Adding noise to PCA to get a statistical model. The factor analysis model,
or linear regression with unobserved independent variables. Assumptions of the
factor analysis model. Implications of the model: observable variables are
correlated only through shared factors; "tetrad equations" for one factor
models, more general correlation patterns for multiple factors. (Our first
look at latent variables and conditional independence.) Geometrically, the
factor model says the data have a Gaussian distribution on some low-dimensional
plane, plus noise moving them off the plane. Estimation by heroic linear
algebra; estimation by maximum likelihood. The rotation problem, and why it is
unwise to reify factors. Other models which produce the same correlation
patterns as factor models.
- Reading: Notes, chapter 19;
factors.R and
sleep.txt
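- Illustration (not part of the course notes): a minimal R sketch of a one-factor model fit by maximum likelihood with the built-in factanal; X is a hypothetical data frame of several numeric variables.
    fa <- factanal(X, factors = 1, scores = "regression")
    fa$loadings        # estimated factor loadings
    fa$uniquenesses    # noise variance of each observable
    head(fa$scores)    # estimated factor scores (mind the rotation problem)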
- April 3 (Tuesday): Lecture 21, Mixture Models
- From factor analysis to mixture models by allowing the latent variable to
be discrete. From kernel density estimation to mixture models by reducing the
number of points with copies of the kernel. Probabilistic formulation of
mixture models. Geometry: planes again. Probabilistic clustering. Estimation
of mixture models by maximum likelihood, and why it leads to a vicious circle.
The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious
circle with iterative approximation. More on the EM algorithm: convexity,
Jensen's inequality, optimizing a lower bound, proving that each step of EM
increases the likelihood. Mixtures of regressions. Other extensions.
- Reading: Notes, first half of chapter 20
- Homework 8 due
- Homework 9: assignment, MOM_data_full.txt
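- Illustration (not part of the course notes): a rough R sketch of fitting a two-component Gaussian mixture by EM; one convenient option is the mixtools package, and the vector precip is hypothetical.
    library(mixtools)
    fit <- normalmixEM(precip, k = 2)   # EM for a two-component Gaussian mixture
    fit$lambda                          # estimated mixing weights
    fit$mu                              # component means
    fit$sigma                           # component standard deviations
    plot(fit, whichplots = 2)           # histogram of the data with the fitted density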
- April 5 (Thursday): Lecture 22, Mixture Model Examples and Complements
- Precipitation in Snoqualmie Falls revisited. Fitting a two-component
Gaussian mixture; examining the fitted distribution; checking calibration.
Using cross-validation to select the number of components to use. Examination
of the selected mixture model. Suspicious patterns in the parameters of the
selected model. Approximating complicated distributions vs. revealing hidden
structure. Using bootstrap hypothesis testing to select the number of mixture
components.
- Reading: Notes, second half of chapter 20; mixture-examples.R
- April 10 (Tuesday): Lecture 23, Graphical Models
- Conditional independence and dependence properties in factor models. The
generalization to graphical models. Directed acyclic graphs. DAG models.
Factor, mixture, and Markov models as DAGs. The graphical Markov property.
Reading conditional independence properties from a DAG. Creating conditional
dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with
DAGs; does asbestos whiten teeth?
- Reading: Notes, chapter 21
- Homework 9 due
- Midterm 2: assignment
- April 12 (Thursday): Lecture 24, Graphical Causal Models
- Probabilistic prediction is about passively selecting a sub-ensemble,
leaving all the mechanisms in place, and seeing what turns up after applying
that filter. Causal prediction is about actively producing a new
ensemble, and seeing what would happen if something were to change
("counterfactuals"). Graphical causal models are a way of reasoning about
causal prediction; their algebraic counterparts are structural equation models
(generally nonlinear and non-Gaussian). The causal Markov property.
Faithfulness. Performing causal prediction by "surgery" on causal graphical
models. The d-separation criterion. Path diagram rules for linear
models.
- Reading: Notes, chapter 22
- April 17 (Tuesday): Lecture 25, Identifying Causal Effects from Observations
- Reprise of causal effects vs. probabilistic conditioning. "Why think, when
you can do the experiment?" Experimentation by controlling everything
(Galileo) and by randomizing (Fisher). Confounding and identifiability. The
back-door criterion for identifying causal effects: condition on covariates
which block undesired paths. The front-door criterion for identification: find
isolated and exhaustive causal mechanisms. Deciding how many black boxes to
open up. Instrumental variables for identification: finding some exogenous
source of variation and tracing its effects. Critique of instrumental
variables: vital role of theory, its fragility, consequences of weak
instruments. Irremovable confounding: an example with the detection of social
influence; the possibility of bounding unidentifiable effects. Summary
recommendations for identifying causal effects.
- Reading: Notes, chapter 23
- Midterm 2 due
- Homework 10: assignment
- April 19 (Thursday): Carnival (no class)
- April 24 (Tuesday): Lecture 26, Estimating Causal Effects from Observations
- Estimating graphical models: substituting consistent estimators into the
formulas for front and back door identification; average effects and
regression; tricks to avoid estimating marginal distributions; propensity
scores and matching and propensity scores as computational short-cuts in
back-door adjustment. Instrumental variables estimation: the Wald estimator,
two-stage least-squares. Summary recommendations for estimating causal
effects.
- Reading: Notes, chapter 24
- Homework 10 due
- Homework 11: assignment, sesame.csv
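- Illustration (not part of the course notes): a rough R sketch of back-door adjustment by averaging regression predictions, with a hypothetical data frame df whose columns are the response y, the treatment x, and a covariate z assumed to satisfy the back-door criterion.
    library(mgcv)
    fit <- gam(y ~ s(x) + s(z), data = df)   # flexible model for E[Y | X, Z]
    effect.at <- function(x0) {              # estimate E[Y | do(X = x0)]
      mean(predict(fit, newdata = data.frame(x = x0, z = df$z)))  # average over Z
    }
    effect.at(1) - effect.at(0)   # estimated average effect of moving X from 0 to 1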
- April 26 (Thursday): Lecture 27, Discovering Causal Structure from Observations
- How do we get our causal graph? Comparing rival DAGs by testing selected
conditional independence relations (or dependencies). Equivalence classes of
graphs. Causal arrows never go away no matter what you condition on ("no
causation without association"). The crucial difference between common causes
and common effects: conditioning on common causes makes their effects
independent, conditioning on common effects makes their causes dependent.
Identifying colliders, and using them to orient arrows. Inducing orientation
to enforce consistency. The SGS algorithm for discovering causal graphs; why
it works. The PC algorithm: the SGS algorithm for lazy people. What about
latent variables? Software: TETRAD and pcalg; examples of
working with pcalg. Limits to observational causal discovery:
universal consistency is possible (and achieved), but uniform consistency is
not.
- Reading: Notes, chapter 25
- May 1 (Tuesday): Lecture 28, Time Series I
- What time series are. Properties: autocorrelation or serial correlation;
strong and weak stationarity. The correlation time, the world's simplest
ergodic theorem, effective sample size. The meaning of ergodicity: a single
increasing long time series becomes representative of the whole process.
Conditional probability estimates; Markov models; the meaning of the Markov
property. Autoregressive models, especially additive autoregressions;
conditional variance estimates. Bootstrapping time series. Trends and
de-trending.
- Reading: Notes, chapter 26;
R for
examples; gdp-pc.csv
- Homework 11 due
- Final exam: assignment; macro.csv
- Help installing pcalg
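- Illustration (not part of the course notes): a minimal R sketch of a first-order autoregression, fit linearly and as an additive autoregression; the numeric vector y, in time order, is hypothetical.
    ar.fit <- ar(y, order.max = 1, aic = FALSE, method = "ols")  # linear AR(1)
    library(mgcv)
    d <- data.frame(now = y[-1], lag1 = y[-length(y)])  # pair each value with its lag
    aar.fit <- gam(now ~ s(lag1), data = d)             # smooth autoregression function
    plot(aar.fit)                                       # possibly nonlinear dynamics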
- May 3 (Thursday): Lecture 29, Time Series II
- Cross-validation for time series. Change-points and "structural breaks".
Moving averages: spurious correlations (Yule effect) and oscillations (Slutsky
effect). State-space or hidden Markov models; moving average and ARMA models
as state-space models. The EM algorithm for hidden Markov models; particle
filtering. Multiple time series: "dynamic" graphical models; "Granger"
causality (which is not causal); the possibility of real causality.
- Reading: Notes, chapter 27; Faraway, section 9.1
- May 15
- Final exam due