This is a draft textbook on data analysis methods, intended for a one-semester course for advanced undergraduate students who have already taken classes in probability, mathematical statistics, and linear regression. It began as the lecture notes for 36-402 at Carnegie Mellon University.

By making this draft generally available, I am not promising to provide any assistance or even clarification whatsoever. Comments are, however, welcome.

The book is under contract to Cambridge
University Press; it should be turned over to the press ~~at the end
of 2013 or beginning of 2014~~ in early 2015. A copy of the
next-to-final version will remain freely accessible here permanently.

Table of contents:

I. Regression and Its Generalizations

- Regression Basics
- The Truth about Linear Regression
- Model Evaluation
- Smoothing in Regression
- Simulation
- The Bootstrap
- Weighting and Variance
- Splines
- Additive Models
- Testing Regression Specifications
- More about Hypothesis Testing
- Logistic Regression
- Generalized Linear Models and Generalized Additive Models
- Classification and Regression Trees

II. Multivariate Data, Distribution Estimates, and Latent Structure

- Multivariate Distributions
- Density Estimation
- Relative Distributions and Smooth Tests
- Principal Components Analysis
- Factor Analysis
- Nonlinear Dimensionality Reduction
- Mixture Models
- Graphical Models

III. Causal Inference

- Graphical Causal Models
- Identifying Causal Effects
- Estimating Causal Effects
- Discovering Causal Structure
- Causal Inference from Experiments

IV. Dependent Data

- Time Series
- Time Series with Latent Variables
- Simulation-Based Inference
- Longitudinal, Spatial and Network Data

Problem Sets with Data

Appendices

- Reminders from Linear Algebra
- Big O and Little o Notation
- Taylor Expansions
- Propagation of Error, and Standard Errors for Derived Quantities
- Optimization Theory
- Optimization Methods
- chi-squared and the Likelihood Ratio Test
- Proof of the Gauss-Markov Theorem
- Rudimentary Graph Theory
- Uncorrelated vs. Independent
- Writing R Functions
- Random Variable Generation

Planned changes:

- The logistic/GLM/GAM chapters need to be re-organized and consolidated; more on modeling over- and under- dispersion (part I)
- Unified treatment of information-theoretic topics (relative entropy / Kullback-Leibler divergence, entropy, mutual information and independence, hypothesis-testing interpretations) in an appendix, with references from chapters on density estimation, on EM, and on independence testing
- More detailed treatment of calibration and calibration-checking (part II)
- Missing data and imputation (part II)
- Move d-separation material from "causal models" chapter to graphical models chapter, since it has no specifically causal content (parts II and III)?
- Expand treatment of partial identification for causal inference, including partial identification of effects by looking at all data-compatible DAGs (part III)
- Replace code boxes with `knitr` (in progress)
- Figure out how to cut at least 50 pages
- Make sure notation is consistent throughout: insist that vectors are always matrices, or use more geometric notation?
- Swap out some of the older problem sets for new ones from 2015

(Text last updated 5 May 2015; this page last updated 2 May 2015)