Cosma Shalizi

36-402, Undergraduate Advanced Data Analysis, Section A

Spring 2017

Section A
Tuesdays and Thursdays, 10:30--11:50, Wean Hall 7500
[Image: Keen-eyed fellow investigators]

The goal of this class is to train you in using statistical models to analyze data — as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

36-608

In previous years, a small number of well-prepared graduate students from other departments have been allowed to take this course, by registering for it as 36-608. (Graduate students enrolling in 36-402 will be dropped automatically from the roster.) This year, because of the number of undergraduate students needing to take 402, we have no resources to accommodate students wishing to take 608 for a grade. If space is available in the classroom, a few may be allowed to audit the course.

Section B

This year, there are two sections of 36-402. This syllabus is for Section A, taught by Prof. Shalizi; Section B is taught by Prof. Lee. The two sections are completely independent.

Prerequisites

36-401, with a grade of C or better. Exceptions are only granted for graduate students in other departments taking 36-608.

Instructors

Professor: Cosma Shalizi, cshalizi [at] cmu.edu, Baker Hall 229C
Teaching assistants: TBD

Topics

Model evaluation: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; the bootstrap; penalized fitting; mis-specification checks
Yet More Linear Regression: what is regression, really?; what ordinary linear regression actually does; what it cannot do; extensions
Smoothing: kernel smoothing, including local polynomial regression; splines; additive models; kernel density estimation
Generalized linear and additive models: logistic regression; generalized linear models; generalized additive models
Latent variables and structured data: principal components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
Causality: graphical causal models; causal inference from randomized experiments; identification of causal effects from observations; estimation of causal effects; discovering causal structure
Dependent data: Markov models for time series without latent variables; hidden Markov models for time series with latent variables; smoothing and modeling for spatial and network data
See the end of this syllabus for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.

Course Mechanics

[Image: Grades will not go away if you avert your eyes (photo by laurent KB on Flickr)]

Homework will be 50% of the grade, a midterm exam 20%, and the final exam 30%.

Lectures

You are responsible for all material covered in lecture, whether or not it is in the textbook. If you are unable to attend a particular lecture, arrange to get notes from a classmate. If you have problems coming to lecture, see the professor.

Homework

The homework will give you practice in using the techniques you are learning to analyze data, and to interpret the analyses. There will be 12 homework assignments, nearly one every week; they will all be due on Wednesdays at 11:59 pm (i.e., the night before Thursday classes), through Blackboard. All homeworks count equally, totaling 50% of your grade. The lowest three homework grades will be dropped; consequently, no late homework will be accepted for any reason whatsoever.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. As always, raw computer output and R code are not acceptable; your document must be humanly readable. You should submit an R Markdown or knitr file, integrating text, figures and R code; submit both your knitted file and the source. If that is not feasible, contact the professor as soon as possible. Microsoft Word files get an automatic grade of 0, with no feedback.

For help on using R Markdown, see "Using R Markdown for Class Reports".

Unlike PDF or plain text, Word files do not display consistently across different machines, different versions of the program on the same machine, etc., so not using them eliminates any doubt about whether what we grade is what you think you wrote. Word files are also much more of a security hole than PDF or (especially) plain text. Finally, it is obnoxious to force people to buy commercial, closed-source software just to read what you write. (It would be obnoxious even if Microsoft paid you to push its wares that way, but it doesn't.)

Exams

There will be a take-home mid-term exam (20% of your final grade), due at 11:59 pm on 9 March. You will have one week to work on the midterm, and there will be no homework that week. There will also be a take-home final exam (30%), due at 10:30 am on Monday, 8 May. These due dates will not be moved once the semester begins; please schedule job interviews and other extra-curricular activities around them.

The exams may require you to use any material already covered in the readings, lectures or assignments. All exams will be cumulative.

Exams must also be submitted through Blackboard, under the same rules about file formats as homework.

Grading

The two exams will each be curved to ensure that they are comparable in scale to the homework before calculating your final grade. You should not presume that an un-curved average of 90 guarantees you an A.
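
As a rough illustration of how the stated weights combine, here is a sketch only. It assumes that every score is expressed on a common 0 to 100 scale, that the exam scores are the ones obtained after curving, and that the nine homework grades left after dropping the lowest three are averaged equally. The symbols H, M, F, G and h_(i) are introduced here purely for illustration. Write h_(1) <= h_(2) <= ... <= h_(12) for the twelve homework scores sorted from lowest to highest, M for the curved midterm score, and F for the curved final exam score:

```latex
\[
  H = \frac{1}{9} \sum_{i=4}^{12} h_{(i)},
  \qquad
  G = 0.50\,H + 0.20\,M + 0.30\,F .
\]
% Hypothetical example: H = 85, M = 80, F = 90 gives
% G = 0.50(85) + 0.20(80) + 0.30(90) = 42.5 + 16 + 27 = 85.5 .
```

Under these assumptions G is only a weighted numerical average; this syllabus does not state how it maps to a letter grade.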

If you believe your final letter grade has been incorrectly assigned, or that a particular assignment has been incorrectly graded, tell the professor at once. Direct any questions or complaints about your grades to the professor; the teaching assistants have no authority to make changes. Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Solutions

Solutions for all homework and exams will be available, after their due dates, through Blackboard. Do not share them with anyone, even after the course has ended.

Interviews

To help the instructors get a better sense of how the class is going, every week (after the first week of classes), six students will be selected at random, and will meet with the professor for 10--15 minutes each, to explain their work and to answer questions about it. You may be selected on multiple weeks, if that's how the random numbers come up. This is not a punishment, but a way to see whether the problem sets are really measuring learning of the course material; being selected will not hurt your grade in any way (and might even help).

Office Hours

If you want help with computing, please bring a laptop.

TBD

If you cannot make office hours, please e-mail the professor about making an appointment.

Piazza

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous posting of questions and replies will be allowed, at least initially; if this is abused, anonymity will go away.

Blackboard

Blackboard will be used for submitting assignments electronically, and as a gradebook. All properly enrolled students should have access to the Blackboard site by the beginning of classes.

Textbook

The primary textbook for the course will be the draft Advanced Data Analysis from an Elementary Point of View. Chapters will be linked to here as they become needed. You are expected to read these chapters, and are unlikely to be able to do the assignments without doing so. (There will be a prize for the student who identifies the most errors by the next-to-last class, presented at the last class meeting.) In addition, Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7) is required as a reference.

Cox and Donnelly, Principles of Applied Statistics (Cambridge University Press, 2011, ISBN 978-1-107-64445-8); Faraway, Extending the Linear Model with R (Chapman & Hall/CRC Press, 2006, ISBN 978-1-58488-424-8; errata); and Venables and Ripley, Modern Applied Statistics with S (Springer, 2003; ISBN 9780387954578) will be optional. The campus bookstore should have copies of all of these.

Collaboration, Cheating and Plagiarism

[Image: Cheating leads to desolation and ruin (photo by paddyjoe on Flickr)]

Everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments, books and papers in the library, or online resources, though again, all use of these sources must be acknowledged in your work.

In general, you are free to discuss homework with other students in the class, though not to share work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students or the instructors until after the assignments are due. (Exceptions may be made, with prior permission, for tutors approved by the professor.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

During the take-home exams, you are not allowed to discuss the content of the exams with anyone other than the instructors; in particular, you may not discuss the content of the exam with other students in the course.

Any use of solutions provided for any assignment in this course in previous years is strictly prohibited, both for homework and for exams. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course.

If you are unsure about what is or is not appropriate, please ask the professor before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to contact the professor as soon as possible to discuss how your misdeeds might be rectified. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

On the first day of class, every student will receive a written copy of the university's policy on academic integrity, a written copy of these course policies, and a "homework 0" on the content of these policies. To ensure that you have time to actually read these policies, you may not turn in "homework 0" until the second class meeting, on 19 January. This assignment will not factor into your grade, but you must complete it before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate with the professor.

R

[Image: Caught in a thicket of syntax (photo by missysnowkitten on Flickr)]

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before), and this class presumes that you have. Every assignment will require you to use it. No other form of computational work will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let the professor know at once.

There is a separate page of resources for learning R.

Other Iterations of the Class

Some material is available from versions of this class taught in other years. As stated above, any use of solutions provided in earlier years is not only cheating, it is very easily detected cheating.

Schedule

Subject to revision. Lecture notes, assignments and solutions will all be linked here, as they are available.

[Image: Identifying significant features from background (photo by Gord McKenna on Flickr)]

Current revision of the complete textbook

January 17 (Tuesday): Lecture 1, Introduction to the class; regression
Reading: Chapter 1
Optional reading: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
Homework 0 (on collaboration and plagiarism) assigned
Homework 1 assigned
January 19 (Thursday): Lecture 2, The truth about linear regression
Reading: Chapter 2
Optional reading: Faraway, rest of chapter 1
Homework 0 due (at start of class)
January 24 (Tuesday): Lecture 3, Evaluation of Models: Error and inference
Reading: Chapter 3
Optional reading: Cox and Donnelly, ch. 6
Handout: "predict and Friends: Common Methods for Predictive Models in R"
January 26 (Thursday): Lecture 4, Smoothing methods in regression
Reading: Chapter 4
Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components"
Homework 1 due (at 11:59 pm the night before)
Homework 2 assigned
January 31 (Tuesday): Lecture 5, Writing R Code
Reading: Appendix on writing R code
February 2 (Thursday): Lecture 6, Simulation
Reading: Chapter 5
Homework 2 due (at 11:59 pm the night before)
Homework 3 assigned
February 7 (Tuesday): Lecture 7, The Bootstrap
Reading: Chapter 6
Optional reading: Cox and Donnelly, chapter 8
February 9 (Thursday): Lecture 8, Splines
Reading: Chapter 7
Optional reading: Faraway, section 11.2
Homework 3 due (at 11:59 pm the night before)
Homework 4 assigned
February 14 (Tuesday): Lecture 9, Additive models
Reading: Chapter 8
Optional reading: Faraway, chapter 12
February 16 (Thursday): Lecture 10, Testing Regression Specifications
Reading: Chapter 9
Optional reading: Cox and Donnelly, chapter 7
Homework 4 due (at 11:59 pm the night before)
Homework 5 assigned
February 21 (Tuesday): Lecture 11, Heteroskedasticity, weighted least squares, and variance estimation
Reading: Chapter 10
Optional reading: Faraway, section 11.3
February 23 (Thursday): Lecture 12, Logistic Regression
Reading: Chapter 11
Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
Homework 5 due (at 11:59 pm the night before)
Homework 6 assigned
February 28 (Tuesday): Lecture 13, Generalized linear models and generalized additive models
Reading: Chapter 12
Optional reading: Faraway, section 3.1 and chapter 6
March 2 (Thursday): Lecture 14, GLMs and GAMs continued
Reading and optional reading: Same as lecture 13
Homework 6 due (at 11:59 pm the night before)
Midterm exam assigned
March 7 (Tuesday): Lecture 15, Multivariate Distributions
Reading: Appendix on multivariate distributions
March 9 (Thursday): Lecture 16, Density Estimation
Reading: Chapter 14
Midterm exam due (at 11:59 pm the night before)
Homework 7 assigned
March 14 and 16: Spring break
March 21 (Tuesday): Lecture 17, Principal Components Analysis and Factor Models
Reading: Chapters 16 and 17
March 23 (Thursday): Lecture 18, Mixture Models
Reading: Chapter 19
Homework 7 due (at 11:59 pm the night before)
Homework 8 assigned
March 28 (Tuesday): Lecture 19, Missing Data
Reading: TBD
Optional reading: Cox and Donnelly, chapter 5
March 30 (Thursday): Lecture 20, Graphical Models
Reading: Chapter 20
Homework 8 due (at 11:59 pm the night before)
Homework 9 assigned
April 4 (Tuesday): Lecture 21, Graphical Causal Models
Reading: Chapter 24
Optional reading: Cox and Donnelly, chapters 6 and 9; Pearl, "Causal Inference in Statistics", sections 1, 2, and 3 through 3.2
April 6 (Thursday): Lecture 22, Identifying Causal Effects from Observations I
Reading: Chapter 25
Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
Homework 9 due (at 11:59 pm the night before)
Homework 10 assigned
April 11 (Tuesday): Lecture 23, Identifying Causal Effects from Observations II
Reading: Chapter 25
April 13 (Thursday): Lecture 24, Estimating Causal Effects from Observations
Reading: Chapter 27
Homework 10 due (at 11:59 pm the night before)
Homework 11 assigned
April 18 (Tuesday): Lecture 25, Discovering Causal Structure from Observations
Reading: Chapter 28
April 20 (Thursday): Carnival, no class
April 25 (Tuesday): Lecture 26, Limitations of Causal Inference
Reading: Chapter 28
April 27 (Thursday): Lecture 27, Time Series I
Reading: Chapter 21
Homework 11 due (at 11:59 pm the night before)
Homework 12 assigned
May 2 (Tuesday): Lecture 28, Time Series II
Reading: Chapter 21
May 4 (Thursday): Lecture 29, Principles
Homework 12 due (at 11:59 pm the night before)
Final exam assigned
May 8 (Monday)
Final exam due at 10:30 am