Here's the official description:

This course is an introduction to the real world of statistics and data analysis. We will explore real data sets, examine various models for the data, assess the validity of their assumptions, and determine which conclusions we can make (if any). Data analysis is a bit of an art; there may be several valid approaches. We will strongly emphasize the importance of critical thinking about the data and the question of interest. Our overall goal is to use a basic set of modeling tools to explore and analyze data and to present the results in a scientific report. A minimum grade of C in any one of the pre-requisites is required. A grade of C is required to move on to 36-402 or any 36-46x course.

This is a class on linear statistical models: the oldest, most widely used, and mathematically simplest sort of statistical model. It serves as a first course in serious data analysis, as an introduction to statistical modeling and prediction, and as an initiation into a community of inquiry which has developed over two centuries and grown to include every branch of science, technology and policy.

During the class, you will do data analyses with existing software, and begin learning to write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

Graduate students from other departments wishing to take this course should register for it under the number "36-607". Enrollment for 36-607 is very limited, and by permission of the professor only.

Mathematical statistics: one of 36-226, 36-326 or 36-625, with at least a grade of C; linear algebra: one of 21-240, 21-241 or 21-242, with at least a grade of C. These requirements will not be waived for undergraduates under any circumstances. Graduate students wishing to enroll in 36-607 will need to have had equivalent courses (as determined by the instructor).

Having previously taken 36-350, Introduction to Statistical Computing, or taking it concurrently, is strongly recommended but not required.

| Role | Name | Contact |
| --- | --- | --- |
| Professor | Dr. Cosma Shalizi | cshalizi [at] cmu.edu, Baker Hall 229C |
| Teaching assistant | Ms. Natalie Klein | |
| Teaching assistant | Ms. Amanda Luby | |
| Teaching assistant | Mr. Michael Spece-Ibañez | |

This is currently a *tentative* listing of topics, in order.

- *Simple linear regression:* Statistical prediction by least squares. Simple linear regression: using one quantitative variable to predict another. Optimal linear prediction. Estimation of the simple linear regression model. Gaussian estimation theory for the simple linear model. Assumption-checking and regression diagnostics. Prediction intervals.
- *Multiple linear regression:* Linear predictive models with multiple predictor variables. "Population" form of multiple regression. Answering "what if" questions with multiple regression models. Ordinary least squares estimation of multiple regression. Standard errors. Gaussian estimation theory, confidence and prediction intervals. Regression diagnostics. Categorical predictor variables; analysis of variance.
- *Variable selection:* Review of hypothesis testing theory from mathematical statistics. Significance tests for regression coefficients; confidence sets for coefficients. Common fallacies about "significant" coefficients, and how to avoid them. Model and variable selection.
- *Beyond strictly linear ordinary least squares:* Interaction terms. Transformation of predictor variables. Transformation of the response variable; common fallacies about transformed responses, and how to avoid them. Weighted least squares for non-constant variance; generalized least squares for time series.
- *Truly modern regression:* Prediction and cross-validation for model and variable selection. Resampling and the bootstrap for statistical inference. Regression trees.
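To preview the first topic concretely, here is a small illustrative sketch (not course material; the simulated data and variable names are made up): the closed-form least-squares estimates for simple linear regression, computed by hand and checked against R's `lm()`.

```r
## Illustrative sketch: ordinary least squares "by hand" on simulated data,
## checked against R's lm(). Not course code; the data here are invented.
set.seed(1)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 1)   # true intercept 3, true slope 2

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)              # slope = sample covariance / sample variance
b0 <- mean(y) - b1 * mean(x)          # intercept from the sample means

fit <- lm(y ~ x)                      # the same estimates via lm()
all.equal(b1, unname(coef(fit)[2]))   # TRUE
all.equal(b0, unname(coef(fit)[1]))   # TRUE
```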

There will be two in-class mid-term exams, both focusing on the theoretical portions of the course. Both exams will be cumulative. Each exam will be 15% of your final grade.

There will be three take-home projects where you will analyze real data sets, and write up your findings in the form of a scientific report. You will be graded both on the technical correctness of your work and on your ability to communicate your findings clearly; in particular, raw computer output or code is not acceptable. (Rubrics and example reports will be made available before the first DAP is assigned.)

The DAPs are exams; consequently, collaboration is *not* allowed.
Each DAP will count for 15% of your final grade.

The homework will give you practice in using the techniques you are learning
to analyze data, and to interpret the analyses. They will also include some
theory questions, requiring you to do calculations or prove results
mathematically. There will be one homework assignment every week in which
there is not an exam. Every assignment will count equally towards 25% of your
grade. Your lowest two homework grades will be dropped;
consequently, **no late homework will be accepted for any
reason**.

Communicating your results to others is as important as getting good results in the first place. A portion of the points available for every homework will be set aside to reflect the clarity of your writing, figures, data presentation, and other marks of communication. (Rubrics will be provided for each assignment.) In addition, at least two homeworks will be practice DAPs, where you will have to write reports in the same manner as the data analysis projects.

Except as otherwise noted in the schedule, all assignments will be due at 3 pm on Thursdays (i.e., at the beginning of class), through Blackboard. Late assignments are not accepted for any reason. Coming late to class because you are uploading an assignment is unacceptable.

You will submit a PDF or HTML file containing a readable version of all your
write-ups, mathematics, figures, tables, and *selected* portions of code
as relevant. **Word files will not be graded.** (You may
write in Word if you must, but you need to submit either PDF or HTML.)

You are strongly encouraged to use R Markdown to integrate text, code, images and mathematics. If you do, you will submit both the "knitted" PDF or HTML file, and the source .Rmd file. If you choose not to use R Markdown, you will submit both a humanly-readable file, as PDF or HTML, and a separate plain-text file containing all your R code, clearly commented and formatted to indicate which code section goes with which problem.
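As an illustration only (the file contents, title, and chunk name below are made up, not a course template), a minimal R Markdown file interleaving text, code, and mathematics might look like this:

````markdown
---
title: "Homework 1"
author: "Your Name"
output: pdf_document
---

# Problem 1

The fitted coefficients are computed below.

```{r problem-1}
fit <- lm(dist ~ speed, data = cars)
coef(fit)
```

The model is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.
````

Knitting this file produces a PDF in which the code chunk, its printed output, and the typeset equation all appear in place.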

If you do not use an equation editor, LaTeX, etc., you may include pictures or scans of hand-written mathematics as needed.

If you want help with computing, please bring your laptop.

| Time | Person | Location |
| --- | --- | --- |
| Mondays, 2--3 pm | Ms. Klein | Porter Hall 117 |
| Mondays, 3--4 pm | Mr. Spece-Ibañez | Porter Hall 117 |
| Wednesdays, noon--1 pm | Prof. Shalizi | Baker Hall 229C |
| Wednesdays, 4--5 pm | Ms. Luby | Porter Hall 117 |
| Thursdays, noon--1 pm | Prof. Shalizi | Baker Hall 229C |

If you cannot make the scheduled office hours, please e-mail the professor about making an appointment.

The primary textbook for the course will be Kutner, Nachtsheim and
Neter's Applied Linear Regression Models, 4th edition
(McGraw-Hill, 2004,
ISBN 0-07-238691-6).
This is **required**. (The fifth edition is also acceptable,
though if you use it, when specific problems or readings are assigned from the
text, you are responsible for ensuring that they match up with what's
intended.)

Four other books are **recommended**:

- Julian J. Faraway, Linear Models with R, second edition (CRC Press, 2014, ISBN 978-1-439-88733-2)
- Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7)
- D. R. Cox and Christl Donnelly, Principles of Applied Statistics (Cambridge University Press, 2011, ISBN 978-1-107-64445-8)
- Richard A. Berk, Regression Analysis: A Constructive Critique (Sage Press, 2004, ISBN 978-0-7619-2904-8)

R is a free, open-source software
package/programming language for statistical computing. Many of you will have
some prior exposure to the language; for the rest, now is a great time to start
learning. Almost every assignment will require you to use it. No other form
of computational work will be accepted. If you are *not* able to use R,
or do not have ready, reliable access to a computer on which you can do so, let
me know at once.
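For those with no prior exposure, here is a small, self-contained taste of an R session (illustrative only), using the `cars` data set that ships with R:

```r
## A first taste of R, using the built-in `cars` data set
## (stopping distance vs. speed for 50 cars).
data(cars)
nrow(cars)                            # number of observations: 50
summary(cars$speed)                   # five-number summary plus the mean
fit <- lm(dist ~ speed, data = cars)  # regress stopping distance on speed
coef(fit)                             # the fitted intercept and slope
```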

R Markdown is an extension to R
which lets you embed your code, and the calculations it produces, in ordinary
text, which can also be formatted, contain figures and equations, etc. Using R
Markdown is **strongly encouraged**. If you do, you need to
submit both your "knitted" file (HTML or PDF, not Word), and the original
.Rmd file.

If you choose not to use R Markdown, for all computational assignments you need to submit both a properly formatted, humanly-readable write-up, as PDF or HTML, and a raw text file containing your R code, commented so that it is clear which pieces of code go with which problem. Word files will not be graded.

Here are some resources for learning R:

- The official intro, "An Introduction to R", available online in HTML and PDF
- John Verzani, "simpleR", in PDF
- Quick-R. This is primarily aimed at those who already know a commercial statistics package like SAS, SPSS or Stata, but it's very clear and well-organized, and others may find it useful as well.
- Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)
- Paul Teetor, The R Cookbook, explains how to use R to do many, many common tasks. (It's like the inverse to R's help: "What command does X?", instead of "What does command Y do?").
- The notes for 36-350, Introduction to Statistical Computing
- There are now many books about R. Some recommendable ones:
  - Joseph Adler, R in a Nutshell (O'Reilly, 2009; ISBN 9780596801700). Probably most useful for those with previous experience programming in another language.
  - W. John Braun and Duncan J. Murdoch, A First Course in Statistical Programming with R (Cambridge University Press, 2008; ISBN 978-0-521-69424-7)
  - John M. Chambers, Software for Data Analysis: Programming with R (Springer, 2008; ISBN 978-0-387-75935-7). The best book on writing clean and reliable R programs; probably more advanced than you will need.
  - Norman Matloff, The Art of R Programming (No Starch Press, 2011; ISBN 978-1-59327-384-2). Good introduction to programming for complete novices using R. Less statistics than Braun and Murdoch, more programming skills.

- The R Markdown Cheat Sheet

In fall 2015, Section A of the class is being taught by Prof. Xizhen Cai; the two sections will be closely coordinated but are separate classes.

If you came here from a search engine, you may be looking for information on previous versions of the class, as taught by Prof. Rebecca Nugent.

- September 1, Lecture 1: Introduction to the course
- Course mechanics; random variables and probability review; statistical prediction; optimal linear prediction.
- Reading: Appendix A (on Blackboard if you do not yet have the textbook)
- Homework 1: Assignment, `fha.csv` data set
- September 3, Lecture 2: Exploratory data analysis and R
- Office hours will be held in computing labs today and on selected days next week; see Blackboard for details. Attendance at one of these is optional but strongly encouraged.
- Readings: "Introduction to R Selected Handouts for 36-401" (by Prof. Nugent), and "36-401 Fall 2015 R Introduction"
- September 8, Lecture 3: About Statistical Modeling
- An example data set. Drawing lines through scatterplots. Why prefer one line over another? Statistical models as data summaries; models as tools for inference. Sources of uncertainty in inference: sampling, measurement error, fluctuations. Models as assumptions on the data-generating process. Some examples. Inference within a model vs. checking model assumptions. Introducing the simple linear regression model.
- For LaTeX/knitr users: the .Rnw file used to generate the notes
- Reading for the week: sections 1.1--1.5 (on Blackboard)
- September 10, Lecture 4: Simple linear regression models.
- The simple linear regression model: once more with feeling. Consistency, unbiasedness and variance of the plug-in estimator. "The method of least squares". The Gaussian noise ("normal error") simple linear regression model.
- For LaTeX/knitr users: the .Rnw file used to generate the notes
- Homework 1 due; solutions on Blackboard (please don't share beyond this class)
- Homework 2: assignment
- September 15, Lecture 5: Estimating simple linear regression I
- The method of least squares. Assumptions of the method. Properties of the estimates. Predictive inference. Least-squares estimation in R. Reading: sections 1.6 and 1.7.
- .Rnw file used to generate the notes
- September 17, Lecture 6: Estimating simple linear regression II
- The Gaussian model. Assumptions of the model. Consequences: maximum likelihood estimation; properties of the MLE. Reading: section 1.8.
- R for in-class demos
- Homework 2 due
- Homework 3: assignment
- September 22, Lecture 7: Diagnostics and Transformations
- Assumption checking for the simple linear model; assumption checking for the simple linear model with Gaussian noise. Generalization out of sample. Nonlinearities: transforming the predictor; nonlinear least squares; nonparametric smoothing. Transformations of the response to make the assumptions hold; Box-Cox transformations. Cautions about transforming the response: changed interpretation, changed model of noise, utter lack of motivation for most common transformations. What the residuals look like under mis-specification.
- .Rnw file which produced the notes
- See also: supplement, based on class discussion: Interpreting models after transformations
- Readings: section 3.1--3.3 and 3.8--3.9.
- R for in-class demos
- September 24, Lecture 8: Inference in simple linear regression I
- Inference for coefficients: standard errors; confidence sets; hypothesis tests; reminders about translating between confidence sets and hypothesis tests; reminders that statistical significance is not practical importance. Readings: section 2.1--2.3.
- .Rnw file which produced the notes
- Homework 3 due
- Homework 4: assignment, `auto-mpg.csv`, `abalone.csv`
- September 29, Lecture 9: Inference in simple linear regression II
- Inference for expected values: standard errors, confidence sets. Inference for new measurements: standard errors, confidence sets. Readings: sections 2.4--2.6.
- .Rnw source file for the notes
- Supplement, based on class discussion: Interpreting models after transformations
- October 1, Lecture 10: F tests, R^2, and other distractions
- The F test for whether the slope is 0; F tests for linear models generally. Likelihood ratio tests as a more general alternative to F tests. R^2: distraction or nuisance? Correlation and regression coefficients; "does anyone know when the correlation coefficient is useful?". How to honor tradition in science. Readings: sections 2.7--2.9.
- Homework 4 due
- October 6, Lecture 11: Exam 1 review
- October 8, Lecture 12: Theory exam 1
- Data analysis project 1: project, `mobility.csv`
- October 13, Lecture 13: Linear regression and linear algebra
- Simple linear regression in matrix form. Readings: chapter 5 (all of it).
- October 15, Lecture 14: Multiple linear regression
- Linear models with multiple predictor variables. Ordinary least squares estimation. Why multiple regression doesn't just add up simple regressions. Readings: sections 6.1--6.4.
- Data analysis project 1 due
- Homework 5: assignment, `gpa.txt`, `commercial.txt`
- October 20, Lecture 15: Diagnostics and Inference
- Assumption-checking for multiple linear regression; diagnostics. Inference for ordinary least squares: sampling distributions, degrees of freedom, confidence sets and hypothesis tests. Readings: sections 6.6--6.8.
- .Rnw source file for the lecture
- October 22, Lecture 16: Polynomials and Categorical Predictors
- Dealing with non-linearities by adding polynomial terms. Cautions about polynomials. Dealing with categorical predictors by adding "dummy" or "indicator" variables. Interpretation of coefficients on categoricals. Readings: sections 8.1--8.7.
- .Rnw source file for the lecture
- Homework 5 due
- Homework 6: assignment, `SENIC` data set (see Blackboard for the excerpt from the textbook describing this file)
- October 27, Lecture 17: Multicollinearity
- Multicollinearity: what it is and why it's a problem. Identifying collinearity from pairs plots; why multicollinearity may not show up this way. Dealing with collinearity by dropping variables. Picking out multicollinearity from eigenvalues and eigenvectors; principal components regression. Ridge regression for multicollinearity and for stabilizing estimates. High dimensional regression.
- Readings: sections 7.1--7.3 and 10.1--10.5.
- .Rnw source file
- October 29, Lecture 18: Testing and Confidence Sets for Multiple Coefficients
- Tests for individual coefficients (in the context of a specific larger model). "Partial" F tests and likelihood ratio tests for groups of coefficients (in the context of a larger model). "Full" F tests and likelihood ratio tests for all the slopes at once (in the context of a larger model). Cautions about these tests. Confidence rectangles for multiple coefficients; confidence ellipsoids for multiple coefficients.
- .Rnw source file for the notes
- Readings: sections 7.3--7.4.
- Homework 6 due
- Homework 7: assignment, `water.txt` data file
- November 3, Lecture 19: Interactions
- General concept of interactions between variables. Conventional form of interactions in linear models. Interactions between numerical and categorical variables. Readings: sections 8.1--8.2.
- .Rnw source file for the lecture
- November 5, Lecture 20: Influential points and outliers
- "Influence" of a data point on OLS estimates. Outlier detection. Dealing with outliers and influential points: by deletion; by robust (non-OLS) regression. Readings: section 10.1--10.5.
- .Rnw source file for the lecture
- Homework 7 due
- Homework 8: assignment, `real-estate.csv`
- November 10, Lecture 21: Model selection
- Comparing competing models. Traditional approaches. Sound approaches. Difficulties of inference after selection. Readings: sections 9.1--9.4.
- .Rnw source file
- November 12, Lecture 22: Midterm review
- Practice Exam 2
- November 13
- Homework 8 due at 4:30 pm
- November 17, Lecture 23: Theory exam 2
- Data analysis project 2: assignment, `bikes.csv`
- November 19, Lecture 24: Non-Constant Noise Variance (special topics I)
- "Heteroskedasticity" = changing noise variance. Dealing with heteroskedasticity by weighted least squares. WLS estimation in practice. Where do the weights come from? Readings: section 11.1; lecture notes.
- November 24, Lecture 25: Correlated noise (special topics II)
- Dealing with correlations in the noise by generalized least squares. GLS estimation in practice. Where do the correlations come from? Readings: chapter 12; lecture notes.
- Data analysis project 2 due
- December 1, Lecture 26: Variable Selection (special topics III)
- Variable selection as a special case of model selection. Why p-values are very bad guides to which variables are important. Cross-validation for variable selection: leave-one-out and k-fold. Stepwise regression; stepwise regression in R. Cautions about inference after selection, again.
- Readings: Re-read lecture 21!
- December 3, Lecture 27: Regression trees (special topics IV)
- "Regressograms": regression by averaging over discretized variables. Partitioning and trees. Interpretation of regression trees. Nonlinearity and interaction; average predictive comparisons. Fitting trees with cross-validation.
- *Note:* Sections 1 and 2 of the lecture notes for today are the most relevant; section 3 is about what to do when the response variable is categorical.
- Homework 9: assignment --- due on **Tuesday, 8 December**
- December 8, Lecture 28: Bootstrap I (special topics V)
- Sampling distributions and the bootstrap principle. Resampling. Inference when Gaussian assumptions are shaky. Bootstrap standard errors and confidence intervals. Readings: section 11.5; handouts.
- **Homework 9 due**
- Data analysis project 3: assignment. Your personalized data set has been e-mailed to your Andrew address; contact the professor as soon as possible if you have any problem with the data set.
- December 10, Lecture 29: Bootstrap II (special topics VI)
- More resampling. Bootstrap prediction intervals. Bootstrap plus model selection. When will bootstrapping not work?
- December 15
- **Data analysis project 3 due at 5 pm**
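To give a flavor of the resampling material in the final lectures, here is a minimal illustrative sketch (not course code; the data set and all names are stand-ins) of case-resampling bootstrap standard errors for a regression slope, using the built-in `cars` data:

```r
## Sketch of the case-resampling bootstrap for a regression slope.
## Illustrative only; the data set and variable names are not from the course.
set.seed(36)
data(cars)
n <- nrow(cars)
boot.slopes <- replicate(1000, {
  idx <- sample(n, n, replace = TRUE)           # resample rows with replacement
  coef(lm(dist ~ speed, data = cars[idx, ]))[2] # refit and keep the slope
})
se.boot <- sd(boot.slopes)                      # bootstrap standard error
se.gauss <- coef(summary(lm(dist ~ speed, data = cars)))[2, 2]
c(bootstrap = se.boot, gaussian = se.gauss)     # the two should be comparable
```

When the Gaussian-noise assumptions hold, the two standard errors roughly agree; when they are shaky, the bootstrap version is the more trustworthy of the two.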