36350, Statistical Computing, Fall 2012

Instructors  Prof. Cosma Shalizi 
TAs  Mr. Beau Dabbs 
 Mr. Samuel Ventura 
Lecture  Mondays and Wednesdays 10:3011:20, Wean Hall 5415 
Labs  Fridays, 10:3011:20, Baker Hall 140 CE 
Office hours  Wednesdays, 11:301:30, Baker Hall 229C 


Description
Computational data analysis is an essential part of modern statistics.
Competent statisticians must not just be able to run existing programs, but to
understand the principles on which they work. They must also be able to read,
modify and write code, so that they can assemble the computational tools needed
to solve their dataanalysis problems, rather than distorting problems to fit
tools provided by others. This class is an introduction to programming,
targeted at statistics majors with minimal programming knowledge, which will
give them the skills to grasp how statistical software works, tweak it to suit
their needs, recombine existing pieces of code, and when needed create their
own programs.
Students will learn the core of ideas of programming — functions,
objects, data structures, flow control, input and output, debugging, logical
design and abstraction — through writing code to assist in numerical and
graphical statistical analyses. Students will in particular learn how to write
maintainable code, and to test code for correctness. They will then learn how
to set up stochastic simulations, how to parallelize data analyses, how to
employ numerical optimization algorithms and diagnose their limitations, and
how to work with and filter large data sets. Since code is also an important
form of communication among scientists, students will learn how to comment and
organize code.
The class will be taught in the R
language.
Prerequisites
This is an introduction to programming for statistics students. Prior exposure
to statistical thinking, to data analysis, and to basic probability concepts is
essential, as is some prior acquaintance with statistical software. Previous
programming experience is not assumed, but familiarity with the
computing system is. Formally, the prerequisites are "Computing at Carnegie
Mellon" (or consent of instructor), plus one of either 36202 or 36208, with
36225 as either a prerequisite (preferable) or corequisite (if need be).
The class may be unbearably redundant for those who already know a
lot about programming. The class will be utterly incomprehensible for
those who do not know statistics.
Calendar and topics
Subject to revision.
 Data types and data structures (lectures on 8/27, 8/29, lab on 8/31)
Lecture 1: Introduction to the
class, basic data types, vector and array data structures
Lecture 2: Matrices and matrix
operations; lists; data frames; structures of structures
Homework assignment 1, due at 11:59 pm on Thursday, 6 September
Reading for the week (before lab): lecture slides; chapters 1 and 2 of Matloff
Lab 1; solutions
 Flow control and looping (lecture on 9/5, lab)
Lecture 3: Conditioning the
calculation on the data; iteration to repeat similar calculations; avoiding
iteration with "vectorized" operations and functions.
Homework assignment 2, due at 11:59 pm on Thursday, 13 September. R for this problem set.
Reading for the week (by Thursday): lecture slides; chapters 35 of Matloff (sections marked "Extended Examples" optional); section 7.1 of Matloff
Lab 2: gmp.dat, lab02.R
 Writing and calling functions (9/10, 9/12, lab)
Lecture 4: Declaring
functions to tie together related commands. Arguments (inputs) and return
values (outputs). Named arguments and defaults.
Interfaces. R for examples, with
comments
Lecture 5: Using multiple
functions for related tasks; to reuse work; to break big problems down into
smaller ones. R for examples, with
comments
Homework 3, due at 11:59 pm on Thursday, 20 September
Reading for the week (by Thursday): sections 1.3, 7.37.5, 7.11, 7.13 of Matloff
Lab 3
 More functionwriting: topdown design, testing (9/17, 9/19, lab)
Lecture 6, Topdown design:
recursively solving problems by writing functions to integrate the work of
subfunctions that solve subproblems. Example with linear regression.
Lecture 7, Testing: purpose
of testing; tests of particular cases vs. crosschecking tests; cycling between
testing and programming.
Notes on common problems: multiple return values; using arguments and defaults; \texttt{curve}
Homework 4, due at 11:59 pm on Thursday, 27 September
Reading for the week: sections 7.6 and 7.9 in Matloff
Lab 4, lab04.R
 Debugging and scope (9/24, 9/26, lab)
Lecture 8, Debugging; characterizing and localizing bugs; common
errors; programming for debugging.
Lecture 9, Scope: Names, scoping rules and environments. Examples, including ones based on homework.
Homework 5
(code, data), due
at 11:59 pm on Thursday, 4 October
Reading for the week: chapter 13 in Matloff
Lab 5
 Functions as Objects (lecture on 10/1, lab)
Lecture 10, Functions as
Objects: in R, functions are objects like everything else, so they can be
arguments to other functions
(examples: gradient, gradient.descent), and they can be
returned by other functions (examples: mathematical operators, predictors,
making functions from expressions for plotting surfaces). R for examples
NO LECTURE on Wednesday 3 October
Office hours: 10noon on Thursday, 4 October, Baker Hall 229C
Homework 6, due at 11:59 pm on Tuesday, 9 October (note day!)
Lab 6
 Split/apply/combine (10/8, 10/10); midterm on 10/12 instead of lab
Lecture 11, The split, apply,
combine pattern, using base R. The idea of design patterns, and their benefits
in clarity and reuse. The split/apply/combine pattern: break up a large data
set into smaller meaningful pieces; apply the same analysis to each piece;
combine the answers. Iteration as a painful and clumsy way of doing
split/apply/combine. Tools for split/apply/combine in basic R:
the apply function for arrays, lapply for
lists, mapply, etc.; split. Detailed example with a
complicated data set: the relation between strikes and parliamentary politics.
Lecture 12,
Split/apply/combine II: using plyr. Abstracting the
split/apply/combine pattern: using a single function to appropriately split up
the input, apply the function, and combine the results, depending on the type
of input and output data. Syntax details. Examples: standardizing
measurements for regularlysampled spatial data; standardizing measurements for
irregularlysampled spatial data; more fun with strikes and leftwing politics.
No homework
No lab — go to midterm! (Location Baker Hall 140C and 140E, by family name, as explained in email)
Exam 1 (Each student got a random
combination of problem 1 or 2; problems 25 or 68; problem 9 or 11; problem 10 or 12)
 Refactoring, Graphics (10/15, 10/17, no lab)
Lecture 13, Refactoring:
Reworking your code to make it clearer, more meaningful, and more easily
fixed and extended.
Lecture 14, Graphics (canceled)
Homework 7, due at 11:59 pm on Thursday, 25 October (canceled)
No lab (midsemester break)
 Simulation (10/22, 10/24, lab)
Lecture 15, Simulation I: Random variable generation
Lecture 16: Simulation II: Monte Carlo, Markov chains, Markov Chain Monte Carlo
Reading: Matloff, chapter 8; R Cookbook, chapter 8;
handouts on Markov chains and Monte Carlo and on Markov chain Monte Carlo
Lab 7: moretti.csv data
Homework 8, due at 11:59 on Thursday, 1 November
 Optimization (10/31, 11/2, lab)
Lecture 17: Deterministic,
Unconstrained Optimization. The tradeoff of approximation versus time.
Basics from calculus about minima. The gradientdescent method, its pros and
cons. Taylor series as the motivation for Newton's method; Newton's method as
gradient descent with adaptive stepsize; pros and cons. Coordinate descent
instead of multivariate optimization. NelderMead/simplex method for
derivativefree optimization. Peculiarities of optimizing statistical
functionals: don't bother optimizing much within the margin of error;
asymptotic calculations for said margin. Illustrations with
optim.
Lecture 18: Stochastic and
Constrained Optimization. Difficulties of optimizing statistical functions
when the data is large. Sampling as an alternative to averaging over the whole
data. Stochastic gradient descent and stochastic Newton's method as an
application of sampling. Simulated annealing to escape local minima.
Constrained optimization: an example of why constraints matter. The method of
Lagrange multipliers for equality constraints. Lagrange multipliers as shadow
prices, indicating how much a marginal weakening of the constraint would
improve the optimum. Inequality constraints and their Lagrange multipliers.
Mathematical programming. Barrier methods for inequality constraints. The
correspondence between constrained and penalized optimization.
Optional reading: Francis
Spufford, Red
Plenty
(cf.);
Herbert Simon, The Sciences of the Artificial, especially chapters
5 and 8; Léon Bottou and Olivier
Bosquet, "The
Tradeoffs of Large Scale Learning"
Lab 8; ckm.csv
Homework 9, due at 11:59 pm on Thursday, 8 November
 Working with character data (11/7, 11/9, lab)
Lecture 19: Basics of
character manipulation. Characters and strings; length of a string vs. length
of a character vector; strings as parts of larger data structures; extracting
and replacing substrings; splitting strings into vectors; combining vectors
into strings; tabulating counts of string tokens by string type; why we need
flexible string patterns.
Data: al2.txt
Lecture 20: Regular
Expressions I. Files used for
examples: al2.txt, ANSS.csv.html, alicebob.txt, ss.html.
Lab 9; lab09.R
Homework 10, due at 11:59 pm on Thursday, 15 November
Project Options. Your
preferences are due by midnight on Wednesday, 14 November
 Regular expressions and web scraping (11/12, 11/14, lab)
Lecture 21: Regular Expressions II (same notes as lecture 20)
Lecture 22: Importing Data
from Web Pages. Easy impoprts from structured data files. Considerations in
scraping from unstructured HTML. Exercises, in building networks of political
books.
Lab 10. Data sources:
rich1.html,
rich2.html,
rich3.html,
rich4.html
Final project selection
 Reshaping data (lecture 11/19, no lab)
Lecture 23: Reshaping data and changes of representation
No homework
No lab
 Databases (11/26, 11/28, lab)
Lecture 24: Guest lecture by Prof. Nugent
Lecture 25: Databases
Lab 11
Homework 11, due at 11:59 pm on Thursday, 29 November
 Conclusion (12/3, 12/5, 12/7)
Lecture 26: Programming for statistics
No lecture 12/5: work on your projects
No lab 12/7: work on your projects
No homework: work on your projects
 Presentations: 12/14
Final presentations will be held during our final exam period,
8:3011:30 on Friday 14 December. Attendance for the whole
period is mandatory.
Course Mechanics and Grading
There will be two lectures every week (with exceptions only for holidays), and
a weekly inclass lab. There will also be homework nearly every week, an
inclass midterm in place of one of the labs, and a final group project. Grades will
be calculated as follows:
 Labs: 10%
 Homework: 30%
 Midterm: 20%
 Final project: 40%
Homework
There will be a homework assignment nearly every week. The lowest three
homework grades will be dropped; no credit will be given for late work under
any circumstances. No extensions will be given. If you cannot finish by the
deadline, turn in what you have done for partial credit.
You will be required to do programming in R. You must have reliable access
to a computer running a reasonably up to date version of R. If this is a
problem, contact the instructors as soon as possible.
All homework must be turned in electronically as plain text files (no Word,
no PDF, etc.), with file names clearly indicating your Andrew ID and the
assignment number. No credit will be given for homework with the wrong format.
For programming assignments, your code must be ready to run. Code which does
not run may or may not be given partial credit, at our discretion; the more we
have to work to figure out why your code does not run, the less credit you will
get.
Final Project
Students will be assigned to small groups to work on a final project. You
will select project topics from a list provided by the instructor. (Multiple
groups can take on the same project.) Each group will cooperate on writing
code, documenting it, writing a report, and making a presentation on the
project in during the final exam period.
Textbooks
The two required books are Norman Matloff,
The Art of R Programming: A Tour
of Statistical Software Design, and Paul
Teetor, The R
Cookbook. The first will serve as our textbook; the second is an
extremely valuable reference work. You will need both.
Two other books are optional: John
M. Chambers, Software
for Data Analysis: Programming with R, and Phil
Spector, Data
Manipulation with R.
All four should be available at the university book store, and of course
from online stores.
Some R Resources
R is a free, opensource software
package/programming language for statistical computing. There are many online
resources for learning about it and working with it, in addition to the
textbooks:
 The official intro, "An Introduction to R", available online in
HTML
and PDF
 John Verzani, "simpleR",
in PDF
 Google R Style Guide offers some rules for naming, spacing, etc., which are generally good ideas
 QuickR. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and wellorganized, and others may find
it useful as well.
 Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
 Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
The website Software Carpentry is
not specifically R related, but contains a lot of valuable advice and
information on scientific programming.
RStudio is an "integrated development
environment" (IDE) for R. Using it is not required, but strongly recommended,
because it gives everyone, on all computers, a common working environment.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 2682012.
Collaboration, Copying and Plagiarism
You are encouraged to discuss course material, including assignments, with
your classmates. All work you turn in, however, must be your own. This
includes both writing and code. Copying from other students, from
books, from websites, or from solutions for previous versions of the class, (1)
does nothing to help you learn how to program, (2) is easy for us to detect,
and (3) has serious negative consequences for you, as outlined in the
university's policy
on cheating and plagiarism. If, after reading the policy, you are unclear
on what is acceptable, please ask an instructor.
The Old 36350
If you came to this page by a search engine, you may be looking for
the datamining class which used to be numbered 36350.
It is now 36462, and is taught in the spring semester.
You might also be looking for another iteration of this
class.