# Advanced Data Analysis from an Elementary Point of View

## by Cosma Rohilla Shalizi

This is a draft textbook on data analysis methods, intended for a one-semester course for advanced undergraduate students who have already taken classes in probability, mathematical statistics, and linear regression. It began as the lecture notes for 36-402 at Carnegie Mellon University.

By making this draft generally available, I am not promising to provide any assistance or even clarification whatsoever. Comments are, however, welcome.

The book is under contract to Cambridge University Press; it should be turned over to the press in early 2015 (revised from the end of 2013 or beginning of 2014). A copy of the next-to-final version will remain freely accessible here permanently.

I. Regression and Its Generalizations
1. Regression Basics
2. The Truth about Linear Regression
3. Model Evaluation
4. Smoothing in Regression
5. Simulation
6. The Bootstrap
7. Weighting and Variance
8. Splines
10. Testing Regression Specifications
12. Logistic Regression
13. Generalized Linear Models and Generalized Additive Models
14. Classification and Regression Trees
II. Multivariate Data, Distribution Estimates, and Latent Structure
15. Multivariate Distributions
16. Density Estimation
17. Relative Distributions and Smooth Tests
18. Principal Components Analysis
19. Factor Analysis
20. Nonlinear Dimensionality Reduction
21. Mixture Models
22. Graphical Models
III. Causal Inference
23. Graphical Causal Models
24. Identifying Causal Effects
25. Estimating Causal Effects
26. Discovering Causal Structure
27. Causal Inference from Experiments
IV. Dependent Data
28. Time Series
29. Time Series with Latent Variables
30. Simulation-Based Inference
31. Longitudinal, Spatial and Network Data
Problem Sets with Data
Appendices
• Reminders from Linear Algebra
• Big O and Little o Notation
• Taylor Expansions
• Propagation of Error, and Standard Errors for Derived Quantities
• Optimization Theory
• Optimization Methods
• chi-squared and the Likelihood Ratio Test
• Proof of the Gauss-Markov Theorem
• Rudimentary Graph Theory
• Uncorrelated vs. Independent
• Writing R Functions
• Random Variable Generation

Planned changes:

• The logistic/GLM/GAM chapters need to be re-organized and consolidated; more on modeling over- and under-dispersion (part I)
• Unified treatment of information-theoretic topics (relative entropy / Kullback-Leibler divergence, entropy, mutual information and independence, hypothesis-testing interpretations) in an appendix, with references from chapters on density estimation, on EM, and on independence testing
• More detailed treatment of calibration and calibration-checking (part II)
• Missing data and imputation (part II)
• Move d-separation material from the "causal models" chapter to the graphical models chapter, since it has no specifically causal content (parts II and III)?
• Expand treatment of partial identification for causal inference, including partial identification of effects by looking at all data-compatible DAGs (part III)
• Replace code boxes with knitr (in progress)
• Figure out how to cut at least 50 pages
• Make sure notation is consistent throughout: insist that vectors are always matrices, or use more geometric notation?
• Swap out some of the older problem sets for new ones from 2015

(Text last updated 5 May 2015; this page last updated 2 May 2015)