This is a draft textbook on data analysis methods, intended for a one-semester course for advanced undergraduate students who have already taken classes in probability, mathematical statistics, and linear regression. It began as the lecture notes for 36-402 at Carnegie Mellon University.

By making this draft generally available, I am not promising to provide any assistance or even clarification whatsoever. Comments are, however, welcome.

The book is under contract to Cambridge University Press; it should be turned over to the press at the end of 2013 or beginning of 2014. A copy of the next-to-final version will remain freely accessible here permanently.

Table of contents:

I. Regression and Its Generalizations
- Regression Basics
- The Truth about Linear Regression
- Model Evaluation
- Smoothing in Regression
- Simulation
- The Bootstrap
- Weighting and Variance
- Splines
- Additive Models
- Testing Regression Specifications
- More about Hypothesis Testing
- Logistic Regression
- Generalized Linear Models and Generalized Additive Models

II. Multivariate Data, Distribution Estimates, and Latent Structure
- Multivariate Distributions
- Density Estimation
- Relative Distributions and Smooth Tests
- Principal Components Analysis
- Factor Analysis
- Mixture Models
- Graphical Models

III. Causal Inference
- Graphical Causal Models
- Identifying Causal Effects
- Estimating Causal Effects
- Discovering Causal Structure

IV. Dependent Data
- Time Series
- Time Series with Latent Variables
- Longitudinal, Spatial and Network Data

Appendices
- A. Writing R Functions
- B. Big O and Little o Notation
- C. Chi-Squared and the Likelihood Ratio Test
- D. Proof of the Gauss-Markov Theorem
- E. Constrained and Penalized Optimization
- F. Rudimentary Graph Theory
- G. Pseudo-code for the SGS Algorithm

Planned changes:

- The logistic/GLM/GAM chapters need to be reorganized and consolidated; more on modeling over- and under-dispersion (part I)
- Add chapter on regression trees (part I)
- Unified treatment of relative entropy (Kullback-Leibler divergence) in material on density estimation and EM (part II)
- More detailed treatment of calibration and calibration-checking (part II)
- Missing data and imputation (part II)
- Merge the causal-identification and causal-estimation chapters (part III)
- Expand the treatment of partial identification for causal inference, including partially identifying effects by considering all DAGs compatible with the data (part III)
- Expand the appendix on optimization to include the asymptotics of estimation by minimization and some optimization methods (moving the Newton's-method material from the current logistic regression chapter)
- Incorporate the current homework assignments and exams as real-data examples (throughout)

(Text last updated 16 April 2013; webpage last updated 16 April 2013)