36-350, Statistical Computing, Fall 2014

Instructors  Prof. Cosma Shalizi 
 Prof. Andrew Thomas 
TAs  Mr. Bryan Hooi 
 Mr. Samuel Ventura 
Lecture  Section 1, Mondays and Wednesdays 10:30-11:20, Gates 4102 
 Section 2, Mondays and Wednesdays 11:30-12:20, Gates 4102 
Labs  Sections A and B, Fridays, 10:30-11:20, Hunt Library computer labs 
 Sections C and D, Fridays, 11:30-12:20, Baker Hall 332P 
Office hours  Monday 9:20-10:20 Wean Hall 8110 (Mr. Hooi) 
 Monday 4:30-5:30 Baker Hall 132H (Prof. Thomas) 
 Thursday 1:00-2:30 Baker Hall 229A (Prof. Shalizi) 
 Friday 3:00-4:00 Wean Hall 8110 (Mr. Ventura) 


Description
Computational data analysis is an essential part of modern statistics.
Competent statisticians must be able not just to run existing programs, but to
understand the principles on which they work. They must also be able to read,
modify and write code, so that they can assemble the computational tools needed
to solve their data-analysis problems, rather than distorting problems to fit
tools provided by others. This class is an introduction to programming,
targeted at statistics majors with minimal programming knowledge, which will
give them the skills to grasp how statistical software works, tweak it to suit
their needs, recombine existing pieces of code, and when needed create their
own programs.
Students will learn the core ideas of programming (functions,
objects, data structures, flow control, input and output, debugging, logical
design and abstraction) through writing code to assist in numerical and
graphical statistical analyses. Students will in particular learn how to write
maintainable code, and to test code for correctness. They will then learn how
to set up stochastic simulations, how to parallelize data analyses, how to
employ numerical optimization algorithms and diagnose their limitations, and
how to work with and filter large data sets. Since code is also an important
form of communication among scientists, students will learn how to comment and
organize code.
The class will be taught in the R
language.
Prerequisites
This is an introduction to programming for statistics students. Prior exposure
to statistical thinking, to data analysis, and to basic probability concepts is
essential. Previous programming experience is not assumed, but
familiarity with the computing system is. Formally, the prerequisites are
"Computing at Carnegie Mellon" (or consent of instructor), plus either
36-202 or 36-208, with 36-225 as either a prerequisite (preferable) or
corequisite (if need be).
The class may be unbearably redundant for those who already know a
lot about programming. The class will be utterly incomprehensible for
those who do not know statistics.
Course Mechanics and Grading
There will be two lectures every week (with exceptions only for holidays), and
a weekly in-class lab. There will also be homework nearly every week, a
midterm programming project, and a final group project.
Grades will be calculated as follows:
 Labs: 10%
 Homework: 30%
 Midterm project: 20%
 Final project: 40%
Final grades are based on demonstrated mastery of the material, not
relative standing in the class.
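The weighting above amounts to a weighted average of the four components. As a rough sketch only (written in Python rather than R; the function name, the 0-100 scale for each component, and the example scores are all assumptions, and the syllabus does not say how the result maps to a letter grade):

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {"labs": 0.10, "homework": 0.30, "midterm": 0.20, "final": 0.40}


def course_score(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores for illustration.
example = {"labs": 100, "homework": 90, "midterm": 80, "final": 85}
print(round(course_score(example), 2))  # 10 + 27 + 16 + 34 = 87.0
```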
R and RStudio
R is a free, open-source programming
language for statistical computing. Almost all of our work in this class will
be done using R. You will need regular, reliable access to a computer running
an up-to-date version of R. If this is a problem, let the professors know
right away.
RStudio is a free, open-source R
programming environment. It contains a built-in code editor and many features
that make working with R easier, and it works the same way across different
operating systems. Use of RStudio is
required for the labs, and strongly recommended in general.
Assignment Formatting
All assignments must be turned in electronically, through Blackboard.
All assignments will involve writing a combination of code and actual prose.
You must submit your assignment in a format which allows for the combination of
the two, and the automatic execution of all your code. The easiest
way to do this is to use R Markdown.
Exceptions may be made, with prior permission, for those who want to
use Sweave or
(better) knitr. (If you don't know what
those are, plan to use R Markdown.)
Work submitted as Word files, PDFs, unformatted plain text, etc., will
receive an automatic grade of 0, without exceptions.
Every file you submit should have a name which includes your Andrew ID, and
clearly indicates the type of assignment (homework, lab, etc.) and its number.
Homework
There will be a homework assignment nearly every week. Each homework will
be graded out of three points: one point for making a good-faith effort at
every part of the assignment; one point for technically correct, working
solutions to each part; and one point for clean,
well-formatted, easily readable code.
Due dates: unless otherwise noted in the calendar, all homework is
due at 11:59 pm on Monday (originally Thursday), the week after it is assigned.
Revision: You are free to revise your homework assignments, after
they have been graded, and resubmit them to be regraded.
Labs
There will be a 50-minute lab period every week on Friday morning. The labs
will be short exercises, generally related to that week's homework. Attendance
is mandatory.
Pair programming: An important part of programming is
collaboration. To help you practice this, the labs will be done through
"pair programming". You will be randomly paired with a different partner for
each lab, and during the first half of the lab, one of you will do all
the actual typing, while the other monitors and comments; during the second
half, you will switch roles with your partner.
Midterm Project
In place of an in-class midterm exam, there will be a solo programming
project. You will have two weeks to do this project, and will have to submit a
writeup containing both your executable code and its results, and an
explanation of how you approached the problem and why you chose that approach.
As with the homework, grading will give equal weight to completeness,
correctness, and comprehensibility.
Final Project
You will be assigned to small groups to work on a final project. You will
select project topics from a list provided by the professors. (Multiple groups
can take on the same project.) Each group will cooperate on writing code,
documenting it, writing a report, and making a presentation on the project
during the final exam period.
Peer assessment: One component of your final project grade will be
based on your teammates' assessment of your contribution to the project.
Textbooks
There are three required books:
 Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design
 Phil Spector, Data Manipulation with R
 Paul Teetor, The R Cookbook
The first two will serve as our textbooks; the third is an
extremely valuable reference work. You will need all three.
Four other books are optional but recommended:
All the books should be available at the university book store, and of
course from online stores.
Some R Resources
There are many online resources for learning about R and working with it,
in addition to the textbooks:
 The official intro, "An Introduction to R", available online in HTML and PDF
 John Verzani, "simpleR", in PDF
 Google R Style Guide offers some rules for naming, spacing, etc., which are generally good ideas
 Quick-R. This is primarily aimed at those who already know a commercial statistics package like SAS, SPSS or Stata, but it's very clear and well-organized, and others may find it useful as well.
 Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."
 Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)
The website Software Carpentry is
not specifically R related, but contains a lot of valuable advice and
information on scientific programming.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
Collaboration, Copying and Plagiarism
You are encouraged to discuss course material, including assignments, with
your classmates. All work you turn in, however, must be your own. This
includes both writing and code. Copying from other students, from
books, from websites, or from solutions for previous versions of the class, (1)
does nothing to help you learn how to program, (2) is easy for us to detect,
and (3) has serious negative consequences for you, as outlined in the
university's policy
on cheating and plagiarism. If, after reading the policy, you are unclear
on what is acceptable, please ask an instructor.
The Old 36350
If you came to this page by a search engine, you may be looking for
the data-mining class which used to be numbered 36-350.
It is now 36-462, and is taught in the spring semester.
You might also be looking for another year's iteration
of this class.
Note to Instructors
If you would like to use these materials for your own class, you are welcome to
do so, with attributions to the authors (see below) and links to this page. Asking for
permission isn't necessary, though letting us know about it is appreciated.
Preferred citation: Shalizi, C. R. and Thomas, A. C. (2014), "Statistical Computing
36-350: Beginning to Advanced Techniques in R", http://www.stat.cmu.edu/~cshalizi/statcomp/14
Calendar and topics
Subject to revision.
 Data types and data structures
 Lecture 1 (25 August): Simple data types and structures
 Course mechanics; the R console; basic data types; vectors, our first data structures
 Rpres file for this lecture (the R Markdown file used to build the presentation)
 Printable PDF
 Lecture 2 (27 August): Bigger data structures
 Arrays; matrices and matrix operations; lists; data frames; structures of structures
 Rpres
 Printable PDF
 Lab 1 (29 August)
 R Markdown file for the lab
 Homework: HW 1 assigned; nothing due
 R Markdown file for the assignment
 Reading for the week: chapters 1 and 2 of Matloff
 Flow control and looping
 Lecture 3 (Sept. 3): Data Frames and Control
 Data frames for tabular data; conditioning the calculation on the
data; iteration to repeat similar calculations; avoiding iteration with
"vectorized" operations and functions.
 Rpres file for this lecture
 Printable PDF
 Lab 2 (Sept. 5)
 R Markdown file for the lab
 Homework: HW 1 due; HW 2 assigned
 R Markdown file for the assignment
 Reading for the week: Chapters 3-5 of Matloff (sections marked "Extended Examples" optional); section 7.1 of Matloff
 Text
 Lecture 4 (Sept. 8): Text basics
 Characters, strings, text data. Extracting and replacing
substrings; splitting strings; building strings; counting strings.
 Printable PDF
 Rpres file for the lecture
 Lecture 5 (Sept. 10): Regular expressions
 "Regular expressions" are patterns of strings. Rules for building
regular expressions. R functions for finding matches, splitting strings, and
substituting according to patterns.
 Handout for the
lecture, with additional examples
 Rpres file for
the lecture
 Printable PDF
 Lab 3 (Sept. 12)
 rich.html file; R Markdown file for the lab. (Do not include the text of the questions in your writeup.)
 Homework: HW 2 due; HW 3 assigned
 NHLHockeySchedule2.html file for the homework
 Reading for the week: Matloff, chapter 11; R Cookbook, chapter 7; Spector, chapter 7; handout for lecture 5
 Optional readings: Bradnam and Korf, sections 4.26-4.28, 5.3, 6.1
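The three pattern operations named for Lecture 5 (finding matches, splitting, and substituting) can be sketched briefly. This is an illustration only, written in Python's `re` module rather than with the R functions used in class; the sample string and the patterns are made up for the example.

```python
import re

# A made-up line of text, loosely modeled on an office-hours listing.
line = "Shalizi, Cosma: 1:00-2:30, Baker Hall 229A"

# Finding matches: every clock time of the form H:MM or HH:MM.
times = re.findall(r"\b\d{1,2}:\d{2}\b", line)

# Splitting: break the line wherever a comma or colon is followed by whitespace.
fields = re.split(r"[,:]\s+", line)

# Substituting: replace the building name with an abbreviation.
short = re.sub(r"Baker Hall", "BH", line)

print(times)
print(fields)
print(short)
```

Note that the colon inside "1:00" does not trigger a split, because the splitting pattern requires whitespace after the delimiter.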
 Writing and calling functions
 Lecture 6 (Sept. 15): Writing functions
 Functions tie together related commands. Arguments (inputs) and
return values (outputs). Named arguments and defaults. Interfaces.
 gmp.dat file for the example; Rpres file for presentation
 Printable PDF
 Lecture 7 (Sept. 17): Multiple functions
 Using multiple functions for related tasks; to reuse work; to
break big problems down into smaller ones.
 Printable PDF
 R Markdown source for the slides
 Lab 4 (Sept. 19)
 R Markdown source for the lab
 Homework: HW 3 due; HW 4 assigned
 gmp2013.dat data file for the last problem; R Markdown file for the assignment
 Reading for the week: sections 1.3, 7.3-7.5, 7.11, 7.13 of Matloff
 Data from elsewhere
 Lecture 8 (Sept. 22): Getting data
 Reading and writing non-R formats. Importing data from the Web.
Scraping Web pages.
 Printable PDF version of slides; R Markdown source file for the slides
 Lecture 9 (Sept. 24): Dataframes with Regression Models
 Making dataframes readable. Plotting with dataframes. Basic statistics on dataframes. Fitting linear models with lm; formulas.
Fitting generalized linear models with glm.
 R Markdown source file for the slides
 Printable PDF
 Lab 5 (Sept. 26)
 wtidreport.csv
 Homework: HW 4 due; HW 5 assigned
 R Markdown file for the assignment
 Reading for the week: Matloff, chapter 10; Spector, chapter 2 (skipping sections 2.8-2.10)
 Fitting and using statistical models
 Lecture 10 (Sept. 29): Random number generation
 Sources of actually (?) random numbers. Pseudorandom number
generators; setting the seed. Basic R functions for parametric distributions.
 Printable PDF
 Lecture 11 (Oct. 1): Distributions as models
 Empirical-distribution-related R commands. R functions for
parametric distributions: d*, p*, q*, r*.
Fitting distributions: method of moments; generalized moments; maximum
likelihood. Diagnostics for distributions.
 Printable PDF
 Lab 6 (Oct. 3)
 R Markdown source file
 Homework: HW 5 due; no new homework
 Midsemester project assigned, due at 11:59 pm on Thursday, 16 October
 midterm.zip (61 MB), archive of IMDB pages for the midterm
 Reading for the week: R Cookbook, chapter 11
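Two ideas from this week, setting the seed of a pseudo-random generator and fitting a distribution by the method of moments, fit in a few lines. The sketch below is in Python rather than R, and the function names, the choice of a gamma distribution, the parameter values and the seed are all assumptions made for illustration.

```python
import random


def simulate_gamma(n, shape, scale, seed):
    """Draw n gamma variates from a generator with a fixed seed."""
    rng = random.Random(seed)  # same seed, same pseudo-random stream
    return [rng.gammavariate(shape, scale) for _ in range(n)]


def fit_gamma_moments(xs):
    """Method of moments for the gamma distribution.

    Matches the sample mean and variance to
    mean = shape * scale and variance = shape * scale**2.
    """
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m * m / v, v / m  # (shape estimate, scale estimate)


sample = simulate_gamma(20_000, shape=2.0, scale=3.0, seed=36350)
shape_hat, scale_hat = fit_gamma_moments(sample)
print(shape_hat, scale_hat)  # should land near the true values 2 and 3
```

Re-running `simulate_gamma` with the same seed reproduces the sample exactly, which is what makes simulation results checkable.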
 Changing My Shape, I Feel Like an Accident
 Lecture 12 (Oct. 6): Transformations
 Selective access to data. Applying the same function to all parts
of a data object. Transforming the data to suit the problem. Common numerical
transformations. Summarizing subsets of the data. Sorting, and ordering
dataframes. Transposition. Merging dataframes. Reshaping
dataframes from wide to long or long to wide.
 Printable PDF of lecture;
Rpres source file for slides
 Data sets used in examples: fha.csv,
ua.txt, snoqualmie.csv
 Lecture 13 (Oct. 8): Debugging
 Debugging as differential diagnosis; characterizing and localizing
bugs; common errors; programming now for debugging later. Tests and bugs.
 Printable PDF; Rpres source file for HTML slides
 Lab 7 (Oct. 10)
 Data files for lab: ckm_nodes.csv and ckm_network.dat
 Homework: no new homework due
 Reading for the week: chapters 8 and 9 in Spector (sections 9.3 and 9.7 optional); chapter 13 in Matloff; chapters 5 and 6 in The R Cookbook
 Optional reading: Hadley Wickham, "Reshaping Data with the reshape Package", Journal of Statistical Software 21 (2007): 12
 Leet Programming Skillz
 Lecture 14 (Oct. 13): Testing
 Why we test our code; tests of particular cases vs. cross-checking
tests; cycling between testing and programming.
 Printable PDF; Rpres source file
 Lecture 15 (Oct. 15): Top-down design
 Recursively solving problems by writing functions to integrate the
work of subfunctions that solve subproblems. Advantages: demands less
thought to write or to read; simpler to debug or extend. Refactoring to make
code which wasn't designed this way look like it was. Extended example with
the jackknife.
 Printable PDF;
Rpres source file
 No lab (midsemester break)
 Homework: HW 6 assigned
 hw06supplement.R (containing deliberately buggy code); R Markdown file for the assignment
 Midsemester project due at 11:59 pm on Thursday, 16 October
 Reading: Sections 7.6, 7.9, and 14.1-14.3 in Matloff
 Optional reading: Chambers, TBD
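As a small illustration of the top-down style described for Lecture 15, here is one way a jackknife standard error might be broken into sub-functions. This is a sketch in Python, not the lecture's own R code; the function names and the data are invented for the example.

```python
import math


def leave_one_out(data):
    """Yield each copy of the data with exactly one observation removed."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:]


def jackknife_estimates(data, estimator):
    """Apply the estimator to every leave-one-out sample."""
    return [estimator(sample) for sample in leave_one_out(data)]


def jackknife_se(data, estimator):
    """Top-level function: combine the sub-functions into a standard error."""
    n = len(data)
    estimates = jackknife_estimates(data, estimator)
    center = sum(estimates) / n
    return math.sqrt((n - 1) / n * sum((e - center) ** 2 for e in estimates))


def mean(xs):
    return sum(xs) / len(xs)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # made-up numbers
se = jackknife_se(data, mean)
print(se)
```

A convenient cross-check: for the sample mean, the jackknife standard error coincides with the usual formula, the sample standard deviation divided by the square root of n.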
 Functions of functions, and optimization
 Lecture 16 (Oct. 20): Functions as objects
 In R, functions are objects like everything else, so they can be
arguments to other functions, and they can be returned by other functions.
Examples with curve, grad, gradient descent, and writing surface, a 2D counterpart to curve
 Printable PDF, Rpres source
 Lecture 17 (Oct. 22): Simple
optimization
 Basics from calculus about minima. Taylor series. Gradient
descent and Newton's method. Scaling and big-O notation. Curve-fitting by
optimization. Illustrations with optim and nls. Bonus:
Nelder-Mead, a.k.a. the simplex method; coordinate descent.
 Printable PDF;
Rpres source
 Lab 8 (Oct. 24)
 Homework: HW 6 due, HW 7 assigned
 R Markdown file for the homework
 Reading: Recipes 13.1-13.2 in The R Cookbook
 Optional reading: I.1, II.1 and II.2 in Red Plenty
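Gradient descent, and the idea of passing a function as an argument to another function, can both be seen in a short sketch. It is written in Python rather than R, uses a fixed step size for simplicity, and the test function, step size and tolerance are assumptions chosen so that the method converges.

```python
def gradient_descent(grad, start, step=0.1, tol=1e-8, max_iter=10_000):
    """Repeatedly step against the gradient until it is nearly zero.

    grad is itself a function: it takes a point and returns the gradient there.
    """
    point = list(start)
    for _ in range(max_iter):
        g = grad(point)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        point = [p - step * gi for p, gi in zip(point, g)]
    return point


# Hypothetical test function f(x, y) = (x - 1)^2 + 2 (y + 3)^2,
# whose unique minimum is at (1, -3).
def grad_f(p):
    x, y = p
    return [2 * (x - 1), 4 * (y + 3)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print(minimum)  # close to [1, -3]
```

With too large a step the iterates would overshoot and diverge; a fixed step only works here because the function is a well-scaled quadratic.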
 Optimization will continue while morale improves
 Lecture 18 (Oct. 27): Constrained optimization
 Optimization under constraints; using Lagrange multipliers to turn
constrained problems into unconstrained ones. Lagrange multipliers as "shadow
prices". Barrier methods for inequality constraints. The correspondence
between constrained and penalized optimization ("a fine is a price").
Statistical uses of penalized optimization: ridge, lasso and spline regression
as examples.
 Printable PDF,
Rpres source
 Lecture 19 (Oct. 29): Stochastic optimization
 Optimization vs. "big data". Sampling as an alternative to using all
the data at once: stochastic gradient descent et al. Peculiarities of
optimizing statistical functionals: don't bother optimizing much within the
margin of error; finding that margin.
 Lab 9 (Oct. 31)
 lab09.RData
 Homework: HW 7 due
 Reading
 Optional reading: Red Plenty; Léon Bottou and Olivier Bousquet, "The Tradeoffs of Large Scale Learning"
 Split/apply/combine
 Lecture 20 (Nov. 3): The split, apply, combine pattern, using base R
 Design patterns in general. The split/apply/combine pattern: break up
a large data set into smaller meaningful pieces; apply the same analysis to
each piece; combine the answers. Iteration as painful, clumsy
split/apply/combine. Tools for split/apply/combine in basic R:
the apply function for arrays, lapply for
lists, mapply, etc.; split; aggregate; subset.
 Lecture 21 (Nov. 5): Split/apply/combine, using plyr
 Abstracting the split/apply/combine pattern: using a single command
to appropriately split up the input, apply the function, and combine the
results, depending on the type of input and output data. Syntax details.
 Lab 10 (Nov. 7)
 debt.csv; R Markdown source file
 Homework: HW 8 assigned, none due
 hw08.RData, R Markdown file
 Options for final project announced
 Preferences due by 12 November
 Reading for the week: Cookbook, chapter 6; Spector, chapter 8
 Optional: Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis", Journal of Statistical Software 40 (2011): 1
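The split/apply/combine pattern from Lectures 20 and 21 can be written out explicitly in a few lines. The sketch below uses plain Python instead of base R or plyr; the helper names and the records are invented for illustration.

```python
from collections import defaultdict


def split(records, key):
    """Split: group records by the value of key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return groups


def split_apply_combine(records, key, func):
    """Apply func to each group and combine the results into one dict."""
    return {k: func(group) for k, group in split(records, key).items()}


# Made-up (group, value) records.
records = [("A", 1.0), ("B", 4.0), ("A", 3.0), ("B", 6.0), ("C", 5.0)]


def mean_value(group):
    return sum(value for _, value in group) / len(group)


means = split_apply_combine(records, key=lambda r: r[0], func=mean_value)
print(means)  # {'A': 2.0, 'B': 5.0, 'C': 5.0}
```

The point of the abstraction is that only `key` and `func` change from one analysis to the next; the grouping and recombining code is written once.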
 Databases
 Lecture 22 (Nov. 10): Split/Apply/Combine 3
 The high-level view of what split/apply/combine does. Thinking about how to split the data into pieces: concepts and R syntax. Thinking about the function to apply to each piece: concepts and R syntax. Illustrations.
 (No R Markdown source file for this lecture, because the lecturer couldn't figure out how to do the image manipulation in R Markdown; LaTeX source files available on request.)
 Homework: HW 8 due, HW 9 assigned
 Lecture 23 (Nov. 12): Databases
 Basic concepts of relational databases; how a database is like an R
dataframe. The client/server model. The structured query language (SQL) and
queries; SELECT and JOIN. R/SQL translations. Accessing databases through R.
 Handout: Databases, and Databases in R
 baseball.db database for examples from lecture and handout (30 MB)
 Printable PDF;
RPres source
 Final project preferences due
 Lab 11 (Nov. 14)
 Final project teams announced
 Reading for the week: Spector, chapter 3 (for databases)
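To make SELECT and JOIN concrete, here is a minimal sketch using an in-memory SQLite database. It is written with Python's `sqlite3` module rather than through R, and the two tables and their contents are hypothetical; they are not the schema of the baseball.db file used in lecture.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two small hypothetical tables linked by player_id.
conn.executescript("""
CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE batting (player_id INTEGER, year INTEGER, hits INTEGER);
INSERT INTO players VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO batting VALUES (1, 2013, 150), (1, 2014, 160), (2, 2014, 120);
""")

# JOIN matches rows across tables; GROUP BY then sums the hits per player.
rows = conn.execute("""
SELECT p.name, SUM(b.hits) AS total_hits
FROM players AS p
JOIN batting AS b ON p.player_id = b.player_id
GROUP BY p.name
ORDER BY total_hits DESC
""").fetchall()
conn.close()

print(rows)  # [('Ann', 310), ('Bob', 120)]
```

Each table plays the role of a dataframe, and the JOIN plays the role of merging two dataframes on a shared column.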
 Simulation
 Lecture 24 (Nov. 17): Simulation I: Random variable generation and Markov chains
 R Markdown source file
 Homework: HW 9 due, HW 10 assigned
 R Markdown source file
 Lecture 25 (Nov. 19): Simulation II: Monte Carlo and Markov Chain Monte Carlo
 R Markdown source file
 Lab 12 (Nov. 21)
 Reading: Matloff, chapter 8; R Cookbook, chapter 8;
handouts
 Markov Chain Monte Carlo
 Lecture 26 (Nov. 24): Simulation III: Simulations as Models
 Using simulations to replace probability calculations. Using simulations as statistical models. Live coding demo of a simulation model.
 Simulation code written by section 1 and by section 2
 Printable PDF;
Rpres source file
 Homework: HW 10 due; no new homework
 No lab (Thanksgiving break)
 Reading for the week: Handouts
 Optional reading: Charles Geyer, "Practical Markov Chain Monte Carlo", Statistical Science 7 (1992): 473-483; "One Long Run"; "Burn-In is Unnecessary"; "On the Bogosity of MCMC Diagnostics"; Andrew Gelman and Donald Rubin, "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science 7 (1992): 457-472
 Conclusion of the class
 Lecture 27 (Dec. 1): Beyond R
 Limitations of R. Connecting to other languages and specialized tools.
 Lecture 28 (Dec. 3): Computing for statistics
 No lab: work on your projects
 No homework: work on your projects
 Reading for the week: Matloff, sections 14.3-14.6, 15.1
 Optional readings: Spufford, Red Plenty, TBD; Bradnam and Korf, Unix and Perl to the Rescue, TBD; Chambers, TBD
 Final projects
Final presentations will be held during our final exam period. Attendance for the whole period is mandatory. Submission instructions will be provided closer to the deadline.