Tuesdays and Thursdays, 10:30--11:50, Wean Hall 7500

The goal of this class is to train you in using statistical models to analyze data: as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

Graduate students from other departments wishing to take this course should register for it under the number "36-608". Enrollment for 36-608 is very limited, and by permission of the professor only.

36-401, or consent of the instructor. The latter is only granted under *very* unusual circumstances.

Professors | Cosma Shalizi | cshalizi [at] cmu.edu | 229 C Baker Hall |

Professors | Xizhen Cai | xizhen [at] stat.cmu.edu | 232B Baker Hall |

Teaching assistants | Ms. Dena Asta |

Teaching assistants | Mr. Collin Eubanks |

Teaching assistants | Mr. Sangwon "Justin" Hyun |

Teaching assistants | Ms. Natalie Klein |

- *Model evaluation*: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; the bootstrap; penalized fitting; mis-specification checks
- *Yet More Linear Regression*: what is regression, really?; what ordinary linear regression actually does; what it cannot do; extensions
- *Smoothing*: kernel smoothing, including local polynomial regression; splines; additive models; kernel density estimation
- *Generalized linear and additive models*: logistic regression; generalized linear models; generalized additive models
- *Latent variables and structured data*: principal components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
- *Causality*: graphical causal models; identification of causal effects from observations; estimation of causal effects; discovering causal structure; experimental design and analysis
- *Dependent data*: Markov models for time series without latent variables; hidden Markov models for time series with latent variables; longitudinal, spatial and network data

The homework will give you practice in using the techniques you are learning to analyze data, and to interpret the analyses. There will be twelve homework assignments, nearly one every week; they will all be due on Mondays at 11:59 pm (i.e., the night before Tuesday classes), through Blackboard. All homeworks count equally, totaling 60% of your grade. The lowest three homework grades will be dropped; consequently, no late homework will be accepted for any reason whatsoever.

Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it. This portion of the
assignment will be graded, along with the other questions. As always, raw
computer output and R code are not acceptable; your document must be humanly
readable. We prefer that you submit an R Markdown or knitr file, integrating
text, figures and R code; submit *both* your knitted file and the source. If that is
not feasible, submit a PDF with text and figures, and a separate .R file with
all the commands needed to reproduce your work. Do not submit Word (.doc or
.docx) files, since they will not be graded. (You can write in Word, just be
sure to submit a PDF.)

Unlike PDF or plain text, Word files do not display consistently across different machines, different versions of the program on the same machine, etc., so not using them eliminates any doubt that what we grade differs from what you think you wrote. Word files are also much more of a security hole than PDF or (especially) plain text. Finally, it is obnoxious to force people to buy commercial, closed-source software just to read what you write. (It would be obnoxious even if Microsoft paid you for marketing its wares that way, but it doesn't.)

There will be two take-home mid-term exams (10% each), due at 11:59 pm on
March ~~2th~~ 4th and April 13th. You will have one week to work
on each midterm. There will be no homework in those weeks. There will also be
a take-home final exam (20%), due at 10:30 am on May ~~12~~ 11,
which you will have two weeks to do.

Exams must also be submitted through Blackboard, under the same rules about file formats as homework.
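
As a rough illustration of how the weights stated above combine (homework 60% with the three lowest scores dropped, two midterms at 10% each, final 20%), here is a minimal R sketch. The function name `course_grade`, the 0--100 scale, and the example scores are all made up for illustration; this is not an official grading script, and the actual calculation may differ in its details.

```r
# Hypothetical helper: combines scores using the weights stated in the syllabus.
# Assumes every score is on a 0-100 scale and that dropped homeworks are
# simply omitted before averaging the rest with equal weight.
course_grade <- function(homework, midterms, final, n_drop = 3) {
  stopifnot(length(homework) > n_drop,
            length(midterms) == 2,
            length(final) == 1)
  # Keep all but the n_drop lowest homework scores
  kept <- sort(homework, decreasing = TRUE)[seq_len(length(homework) - n_drop)]
  hw_avg <- mean(kept)
  # Homework 60%, each midterm 10%, final exam 20%
  0.60 * hw_avg + 0.10 * sum(midterms) + 0.20 * final
}

# Made-up scores, for illustration only; a missed homework is entered as 0
hw <- c(90, 85, 0, 100, 95, 70, 0, 88, 92, 60, 100, 80)
course_grade(hw, midterms = c(80, 90), final = 85)
```

Because a missed homework simply counts as a zero that may be dropped, the sketch needs no special handling for late work.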

If you want help with computing, please bring your laptop.

Monday | 1:00--2:00 | Prof. Shalizi | Baker Hall 229A |

Monday | 2:30--3:30 | Mr. Hyun | Wean Hall 8110 |

Monday | 3:30--4:30 | Ms. Klein | Wean Hall 8110 |

Wednesday | 11:00--12:00 | Ms. Asta | Wean Hall 8110 |

Thursday | 3:30--4:30 | Prof. Cai | Baker Hall 232B |

Friday | 12:00--1:00 | Mr. Eubanks | Wean Hall 8110 |

Friday | 3:30--4:30 | Prof. Shalizi | Baker Hall 229C |

If you cannot make office hours, please e-mail the professors about making an appointment.

The primary textbook for the course will be the
draft Advanced Data Analysis from an
Elementary Point of View. Chapters will be linked to here as they
become needed. You are expected to read these notes, and are unlikely to be
able to do the assignments without doing so. (There will be a prize for the
student who identifies the most errors in the notes by 1 May.) In addition,
Paul Teetor, The R Cookbook (O'Reilly Media, 2011,
ISBN 978-0-596-80915-7)
is **required** as a reference.

Cox and Donnelly, Principles of Applied Statistics (Cambridge
University Press, 2011,
ISBN 978-1-107-64445-8); Faraway, Extending
the Linear Model with R (Chapman Hall/CRC Press, 2006,
ISBN 978-1-58488-424-8; errata);
and Venables and Ripley, Modern Applied Statistics with S
(Springer,
2003;
ISBN 9780387954578)
will be **optional**. The campus bookstore should have copies of
all of these.

R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Almost every assignment will require you to use it. No other form of
computational work will be accepted. If you are *not* able to use R, or
do not have ready, reliable access to a computer on which you can do so, let me
know at once.

Here are some resources for learning R:

- The official intro, "An Introduction to R", available online in HTML and PDF
- John Verzani, "simpleR", in PDF
- Quick-R. This is primarily aimed at those who already know a commercial statistics package like SAS, SPSS or Stata, but it's very clear and well-organized, and others may find it useful as well.
- Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)
- Paul Teetor, The R Cookbook, explains how to use R to do many, many common tasks. (It's like the inverse to R's help: "What command does X?", instead of "What does command Y do?"). It is one of the required texts, and is available at the campus bookstore.
- The notes for 36-350, Introduction to Statistical Computing
- There are now many books about R. Some recommendable ones:
  - Joseph Adler, R in a Nutshell (O'Reilly, 2009; ISBN 9780596801700). Probably most useful for those with previous experience programming in another language.
  - W. John Braun and Duncan J. Murdoch, A First Course in Statistical Programming with R (Cambridge University Press, 2008; ISBN 978-0-521-69424-7)
  - John M. Chambers, Software for Data Analysis: Programming with R (Springer, 2008, ISBN 978-0-387-75935-7). The best book on writing clean and reliable R programs; probably more advanced than you will need.
  - Norman Matloff, The Art of R Programming (No Starch Press, 2011, ISBN 978-1-59327-384-2). Good introduction to programming for complete novices using R. Less statistics than Braun and Murdoch, more programming skills.

Irregular supplements to the text, covering common issues.

- "`predict` and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)

Current revision of the complete notes

- January 13 (Tuesday): Lecture 1, Introduction to the class; regression
  - *Reading*: Notes, chapter 1
  - R and `examples.dat` for examples in the notes; `ckm.csv` for optional end-of-chapter exercise.
  - *Optional reading*: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
  - Homework 1 assigned: assignment, `mobility.csv` data file.
- January 15 (Thursday): Lecture 2, The truth about linear regression
  - *Reading*: Notes, chapter 2
  - R for examples in the notes.
  - *Optional reading*: Faraway, rest of chapter 1
- January 20 (Tuesday): Lecture 3, Evaluation of Models: Error and inference
  - *Reading*: Notes, chapter 3; R for in-class demos
  - *Optional reading*: Cox and Donnelly, ch. 6
  - Homework 1 due; solutions on Blackboard
  - Homework 2: assignment, `uv.csv` data file.
- January 22 (Thursday): Lecture 4, Smoothing methods in regression
  - *Reading*: Notes, chapter 4; commented R for the notes.
  - *Optional readings*: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The `np` Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
- January 27 (Tuesday): Lecture 5, Simulation
  - *Reading*: Notes, chapter 5; R for in-class demos
  - Homework 2 due; solutions on Blackboard
  - Homework 3: assignment, `stock_history.csv`
- January 29 (Thursday): Lecture 6, The Bootstrap
  - *Reading*: Notes, chapter 6; R examples in the notes
  - `pareto.R` and `wealth.dat` for some of the chapter's examples
  - *Optional reading*: Cox and Donnelly, chapter 8
- February 3 (Tuesday): Lecture 7, Writing R Code
  - *Reading*: Notes, Appendix on writing R code
  - In-class examples: R Markdown source, knitted webpage
  - Homework 3 due (solutions on Blackboard)
  - Homework 4: assignment
- February 5 (Thursday): Lecture 8, Heteroskedasticity, weighted least squares, and variance estimation
  - *Reading*: Notes, chapter 7
  - *Optional reading*: Faraway, section 11.3
- February 10 (Tuesday): Lecture 9, Splines
  - *Reading*: Notes, chapter 8
  - *Optional reading*: Faraway, section 11.2
  - Homework 4 due (solutions on Blackboard)
  - Homework 5: assignment, `nampd.csv`, `MoM.txt`
- February 12 (Thursday): Lecture 10, Additive models
  - *Reading*: Notes, chapter 9; `mapper.R` code (commented)
  - In-class R demos
  - *Optional reading*: Faraway, chapter 12
- February 17 (Tuesday): Lecture 11, Testing Regression Specifications
  - *Reading*: Notes, chapter 10; in-class demos
  - *Optional reading*: Cox and Donnelly, chapter 7
  - Homework 5 due
  - Homework 6: assignment, `ch.csv`
- February 19 (Thursday): Lecture 12, Logistic Regression
  - *Reading*: Notes, chapter 12
  - *Optional reading*: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
- February 24 (Tuesday): Lecture 13, Generalized linear models and generalized additive models
  - *Reading*: Notes, chapter 13
  - *Optional reading*: Faraway, section 3.1 and chapter 6
  - Homework 6 due
  - Exam 1: assignment, `navc.csv` data file
- February 26 (Thursday): Lecture 14, Multivariate Distributions
  - *Reading*: Notes, chapter 15
- March 3 (Tuesday): Lecture 15, Density Estimation
  - *Reading*: Notes, chapter 16
- March 4 (Wednesday)
  - ~~Exam 1 due at 11:59 pm~~
- March 5 (Thursday): Lecture 16, Density Estimation II: Demos
  - R for in-class demos, with comments
  - Exam 1 due at 11:59 pm
- March 10 and 12: Spring break
- March 17 (Tuesday): Lecture 17, Principal Components Analysis
  - *Reading*: Notes, chapter 18
  - Homework 7: assignment, `n90_pol.csv` data file
- March 19 (Thursday): Lecture 18, canceled due to illness
- March 24 (Tuesday): Lecture 19, Factor Analysis
  - *Reading*: Notes, chapter 19
  - In-class demos of PCA and factor models
  - Homework 7 due
  - Homework 8: canceled
- March 26 (Thursday): Lecture 20, Mixture Models
  - *Reading*: Notes, chapter 21
- March 31 (Tuesday): Lecture 21, Graphical Models
  - *Reading*: Notes, chapter 22
  - Homework 9: assignment, `portfolio.csv` data file, `charles.R` mystery code
- April 2 (Thursday): Lecture 22, Missing Data
- April 7 (Tuesday): Lecture 23, Graphical Causal Models
  - *Reading*: Notes, chapter 23
  - *Optional reading*: Cox and Donnelly, chapters 6 and 9; Pearl, "Causal Inference in Statistics", sections 1, 2, and 3 through 3.2
  - Homework 9 due
  - Exam 2: assignment, `neur.csv` data set, log-likelihood code for non-parametric mixture models
- April 9 (Thursday): Lecture 24, Identifying Causal Effects from Observations
  - *Reading*: Notes, chapter 24
  - *Optional reading*: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
- April 13 (Monday)
  - Exam 2 due at 11:59 pm
- April 14 (Tuesday): Lecture 25, Estimating Causal Effects from Observations
  - *Reading*: Notes, chapter 25
  - Homework 10: assignment, `sesame.csv`
- April 16 (Thursday): Carnival, no class
- April 21 (Tuesday): Lecture 26, Discovering Causal Structure from Observations
  - *Reading*: Notes, chapter 26
  - Homework 10 due
  - Homework 11: assignment
- April 23 (Thursday): Lecture 27, Estimating Causal Effects from Experiments
  - *Reading*: Notes, chapter 27
- April 28 (Tuesday): Lecture 28, Time Series
  - *Reading*: Notes, chapter 28
  - *Optional reading*: Faraway, section 9.1
  - Homework 11 due
- April 30 (Thursday): Lecture 29, More Time Series
  - *Reading*: Notes, chapter 28
  - Final exam: assignment, `ckm-nodes.csv`, `ckm-net.dat`
- May 11 (Monday)
  - Final exam due at **10:30 am**