Statistics 36-462/662: Data Mining
Spring 2020
Prof. Cosma Shalizi
Tuesdays and Thursdays, 1:30–2:50, Porter Hall 100
Data mining is the art of extracting useful patterns from large bodies of data.
(Metaphorically: finding seams of actionable knowledge in the raw ore of
information.) This involves both methods for discovering possible patterns,
and methods for checking or validating candidate patterns. The ongoing growth
in our ability to use computers to store and analyze data opens opportunities
for data mining in business, science, government and elsewhere. The aim of
this course is to help you take advantage of these opportunities in a
responsible way.
After taking the class, when you're faced with a new data-mining problem,
you should be able to (1) select appropriate methods of pattern discovery and
validation, and justify your choices, (2) use and program statistical software
to implement those methods, (3) critically evaluate the results and communicate
them to professional colleagues who are not also statisticians or
data analysts, and (4) be aware of the most common ethical issues involved.
Data mining is related to statistics
and to machine learning, but has its own
aims and scope. Statistics deals with systems for creating reliable inferences
from imperfect data. Machine learning develops technologies of automated
prediction and induction. We will use tools from statistics and from machine
learning, but we will use them as tools, rather than as things to study in
their own right. We will do a lot of calculations, but will not prove many
theorems, and we will do many more experiments than calculations.
Prerequisites
For this semester, the prerequisite is
passing 36-401 (modern
regression) with a grade of C or better. The only exceptions are for
graduate students taking the graduate section (36-662; see below).
Graduate students and 36662
This course is primarily intended as an advanced elective for undergraduate
students majoring in statistics. A limited number of graduate students are
welcome to take it as 36-662. (Graduate students who register for 36-462 will
be dropped from the roster.) Formally, 662 has no prerequisites, but students
are expected to have a background at least equivalent to 36-401, and the
prerequisites to 401: linear regression modeling, probability, mathematical
statistics, linear algebra, calculus, and competence in using R for data
analysis. This class does not have the resources to help those without this
background catch up.
Undergraduates must take the course as 36-462; those who register for 662
will be dropped from the roster.
Instructors
Professor: Dr. Cosma Shalizi, cshalizi [at] cmu.edu, Baker Hall 229C
Teaching assistants (not to be bothered with e-mails, nor in their offices outside of office hours): Ms. Robin Dunn, Mr. Tudor Manole, Mr. Aleksandr Podkopaev, Mr. Ian Waudby-Smith
Topics and Techniques to be Covered
These may change during the semester.
Methods of pattern discovery (a.k.a. "heuristics")
 "Supervised" prediction methods for classification and regression
 Regression = predicting a numerical variable; classification = predicting a categorical variable
 General concepts in evaluating prediction models: the bias-variance decomposition of total error, the bias-variance tradeoff. Specific error measures for classification: misclassification rate, false positive and false negative errors; ROC curves.
 Nearest neighbor methods: basic ideas (averaging nearby cases), tradeoffs in the number of neighbors, mathematical and statistical properties of nearest neighbor prediction, computational considerations
 Tree-based methods: basic ideas (averaging similar cases), tradeoffs in the size of the tree, mathematical and statistical properties
 Methods for combining multiple models (especially trees): bagging, random forests, boosting, stacking
 Kernel-based methods: kernels as similarity measures; averaging similar cases; selection of kernels; support vector machines; mathematical and computational issues.
 "Unsupervised", not-always-predictive methods
 Dimension reduction: forming new features by combining old ones; linear dimension reduction by principal components; linear dimension reduction by factor models; linear dimension reduction by random projections; nonlinear dimension reduction by local linear embedding and by diffusion maps.
 Clustering: assigning data points to categories without knowing the right categories; geometric clustering and the k-means algorithm; probabilistic mixture models.
 Information-theoretic foundations
 Basic concepts of information theory (entropy, information); the informational view of prediction, dimension reduction and clustering; the information bottleneck; feature selection.
Methods of pattern validation
 Evaluation of predictive performance
 In-sample vs. out-of-sample and generalization performance; the concept of overfitting; analytical approximations to generalization performance; effective degrees of freedom and other measures of model complexity.
 Cross-validation
 Evaluation of uncertainty
 Bootstrap
 Inference after model selection / data mining
 Issues with multiple testing and with post-selection inference; sample splitting.
Case studies and concerns
 Recommendation engines
 "You may also like"; recommender systems based on nearest neighbors; systems based on factor models; hybrid systems; consequences of recommendation engines.
 Information retrieval
 Searching by content (text, images, genes...); choosing representations of content; the bag-of-words representation of text; evaluating search; dimension reduction and clustering for text; word2vec representation of text; incorporating user feedback.
 Fairness in prediction
 Case studies: lending decisions, criminal justice. The COMPAS/ProPublica controversy. Definitions of fairness; tradeoffs between different notions of fairness, and between fairness and predictive accuracy.
 Sources of technical failure in data mining
 Data quality. Measurement vs. reality. Weak predictive
relationships. Nonexistent predictive relationships. The curse of dimensionality. Dataset shift.
 Should we do it / what are we doing?
 Privacy (or lack thereof). Human rights concerns. Social consequences, e.g., are we building a dystopia to
make each other click on ads?
Course Mechanics and Grading
There are three reasons you will get assignments in this course. In order of
decreasing importance:
 Practice. Practice is essential to developing the skills
you are learning in this class. It also actually helps you learn, because some
things which seem murky clarify when you actually do them, and sometimes trying
to do something shows that you only thought you understood it.
 Feedback. By seeing what you can and cannot do, and what
comes easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
 Evaluation. The university is, in the end, going to
stake its reputation (and that of its faculty) on assuring the world that you
have mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be three kinds of assignment in this
course.
 Homework
 Every week will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may involve doing or applying some
theory, analyzing real data sets on the computer, and communicating the
results.
 All homework will be submitted electronically through Canvas and
Gradescope. Most weeks, homework will be due at 10:00 pm on
Thursdays. Homework assignments will always be released by Friday of
the previous week, and sometimes before.
 There are specific formatting requirements for
homework; see below.
 Online assignments
 Most homework assignments will be accompanied by shortanswer or multiple
choice questions, to be done on Canvas, before the homework itself is
due. These will typically be due on Sunday nights; consult Canvas for the
exact schedule.
 In-class exercises
 Most lectures will have in-class exercises. These will be short (10–20
minute) assignments, emphasizing problem solving, done in class in small
groups of no more than four people. The assignments will be given out in
class, and must be handed in on paper during the class period. On most days, a
randomly-selected group will be asked to present their solution to the class.
 The inclass exercises will either be problems from that week's homework,
or close enough that seeing how to do the exercise should tell you how to do
some of the problems.
There will be no exams.
Time expectations
You should expect to spend 5–7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grading
Grades will be broken down as follows:
 In-class exercises: 10%. All exercises will have equal weight. Your
lowest three exercise grades will be dropped; if you complete all
exercises with a score of at least 60%, your lowest five grades will be
dropped.
 Homework: 90%. All homework assignments will be weighted equally. Most
assignments will include online shortanswer questions, due before the
main assignment (see above); these will be incorporated into the grades for the
corresponding assignments. The online questions will amount to about 10% of
your final grade.
Your lowest two homework grades will be dropped. If you complete all
homework assignments, with a minimum score of at least 60%, your
lowest three homework grades will be dropped.
Late homework will not be accepted for any reason.
The point of dropping the low grades is to let you schedule interviews,
social engagements, and other obligations, with flexibility, and without my
having to decide what is or is not important enough to change the rules. If
something (illness, family crisis, or anything else) is interfering with
your ability to do the work of the class, come talk to me.
Grade boundaries will be as follows:
A: [90, 100]
B: [80, 90)
C: [70, 80)
D: [60, 70)
R: below 60
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that a particular
assignment was wrongly graded, tell me as soon as possible. Direct any
questions or complaints about your grades to me; the teaching assistants have
no authority to make changes. (This also goes for your final letter grade.)
Complaints that the thresholds for letter grades are unfair, that you deserve a
higher grade, etc., will accomplish much less than pointing to concrete
problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I
need to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from my
teachers?".
Lectures
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. They are also when you
will do the graded in-class assignments, which will help consolidate your
understanding of the material, and help with your homework.
Readings
There are (up to) three classes of readings associated with each lecture.
 Key readings: Usually from required textbooks,
occasionally from other sources which will be made available well in advance.
Do this reading before coming to lecture, and lecture will make a lot more
sense. (Also, the online exercises will be easy.)
 Suggested readings: Usually easier-to-read academic papers, or
portions of the non-required texts, which will deepen your understanding of what
we're doing in lecture and in the homework, but aren't critical.
 Background readings: Usually harder-to-follow or more historical papers,
which give further details about methods and applications.
TL;DR: definitely do the key readings before class; try to do the suggested readings too, they're easy; and the background readings are if you want to get into the weeds.
Electronics
Please don't use any electronic devices during lecture: no laptops, no
tablets, no phones, no recording devices, no watches that do
more than tell time. The only exception to this rule is for electronic
assistive devices, properly documented with CMU's Office of Disability
Resources.
(The no-electronics rule is not arbitrary meanness on my part. Experiments
show, pretty clearly, that students learn more in electronics-free classrooms,
not least because your device isn't distracting your neighbors, who aren't as
good at multitasking as you are.)
Missing class and emergencies
You can miss up to three class sessions during the semester without
(grade) penalty. In cases of routine illness (for example, you have a bad cold
and a day of sleep and fluids would help), stay home, take a free
day, and let yourself rest and recover, rather than dragging
yourself to class. The grading policy gives you space for this.
In cases of extended illness or emergency, we can deal with makeups for
missed work after the crisis. If extended illness or emergency
happens (including mental health crises), or a personal or family crisis,
please don't wait to get help. Just let me know that you're dealing with an
emergency, and that you are already getting help.
R, R Markdown, and Reproducibility
R is a free, opensource software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before). In this class, you will use it for
every homework. If you are not able to use R, or do not have ready,
reliable access to a computer on which you can do so, let me know at once.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it; this writing is part of
the assignment and will be graded. Raw computer output and R code are not
acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up
in R Markdown. (If you know
what knitr is and would rather use it,
ask first.) R Markdown is a system that lets you embed R code, and its output,
into a single document. This helps ensure that your work
is reproducible, meaning that other people can redo your
analysis and get the same results. It also helps ensure that what you report
in your text and figures really is the result of your code (and not some
brilliant trick that you then forget, or obscure bug that you didn't notice).
For help on using R Markdown,
see "Using R Markdown
for Class Reports". (Part of the first lecture will be going over
the basics of R Markdown.)
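As a small illustration (not an official course template; the title, author, and chunk contents are made up), an R Markdown source file interleaves a metadata header, prose with math, and executable R chunks:

````markdown
---
title: "Homework 1"
author: "Your Name"
output: pdf_document
---

Prose and math, like $\hat{\beta} = (X^T X)^{-1} X^T y$, go in the body.

```{r}
# This chunk runs when the file is knitted,
# and its printed output appears in the PDF.
summary(cars)
```
````

Knitting this file (for example, with the Knit button in RStudio) produces the PDF you would upload to Gradescope.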
For each assignment, you should generally submit two, and only two, files:
 an R Markdown source file, integrating text, generated figures and R code, uploaded to Canvas;
 the "knitted," humanlyreadable document, in PDF format, uploaded to
Gradescope. (Using PDF is
important for the grading software.)
(Using two systems is inconvenient for everyone; Gradescope is much better for efficiently sharing work across multiple graders, but will only work with PDFs.)
I will be rerunning the R Markdown file
of randomly selected students; you should expect to be picked for this about
once in the semester. You will lose points if your R Markdown file does not,
in fact, generate your knitted file (making obvious allowances for random
numbers, etc.).
Some problems in the homework will require you to do math. R Markdown
provides a simple but powerful system for typesetting math. (It's based on
the LaTeX documentpreparation system widely used in the sciences.) If you
can't get it to work, you can handwrite the math and include scans or photos
of your writing in the appropriate places in your R Markdown document. (You
should upload these additional files to Canvas.) You will, however, lose
points for doing so, starting with no penalty for homework 1, and growing to a
90% penalty (for those problems) by the last homework.
Canvas, Gradescope and Piazza
Homework will be submitted electronically through Canvas and Gradescope;
Canvas will also be used as the gradebook. Some readings and course materials
will also be distributed through Canvas.
We will be using the Piazza website for questionanswering. You will
receive an invitation within the first week of class.
Anonymous-to-other-students posting of questions and replies will be allowed,
at least initially. (Postings will not be anonymous to instructors.)
Anonymity will go away for everyone if it is abused. As noted above, homework
will generally be due at 10 pm on Thursdays; however, you should not expect
the instructors to answer questions on Piazza (or by email) after 6 pm.
Solutions
Solutions for all homework will be available, after their due date, through
Canvas. Please don't share them with anyone, even after the course has ended.
Office Hours
Mondays, 12:30–1:30: Mr. Manole, Wean Hall 3509
Tuesdays, 10:30–11:30: Mr. Waudby-Smith, Wean Hall 3715
Wednesdays, 12:30–1:30: Prof. Shalizi, Baker Hall 229C
Thursdays, 3:00–4:00: Mr. Podkopaev, Porter Hall A19A
If you want help with computing, please bring a laptop.
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately, please email me about making an appointment.
Textbooks
The following three books are required:
 D. J. Hand, Heikki Mannila
and Padhraic Smyth,
Principles of Data
Mining (Cambridge, Massachusetts: MIT Press, 2001)
 Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Books, 2016)
 Michael Kearns and Aaron Roth, The Ethical Algorithm: The Science of Socially Aware Algorithm Design (New York: Oxford University Press, 2019)
Hand et al. and O'Neil are (currently) available electronically through
the university library, but that's limited to one user at a time. You should
make sure you have access to copies of all three books.
The following books are recommended but not required.
 Richard A. Berk, Statistical Learning from a Regression Perspective (2nd edition, New York: Springer, 2016) [available online through the university library]
 Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition, Berlin: Springer-Verlag, 2009) [free online]
 Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman,
Mining of Massive Datasets (second edition, Cambridge, England: Cambridge University Press, 2014) [free online]
 Paul Teetor, The R Cookbook (Sebastopol, California: O'Reilly, 2011) [R's help files are organized around "What does command X do?"; this is organized around "I want to do Y, what commands do I need to use?"]
Collaboration, Cheating and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbook and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss, in general terms, homework with other
students in the class, though not to share or compare work; such conversations
must be acknowledged in your assignments. You may not discuss the
content of assignments with anyone other than current students, the
instructors, or your teachers in other current classes at CMU, until after the
assignments are due. (Exceptions can be made, with prior permission, for
approved tutors.) You are free to complain, in general terms, about any aspect
of the course, to whomever you like.
Any use of solutions provided for any assignment in this course in previous
semesters is strictly prohibited. This prohibition applies even to students
who are retaking the course. Do not copy the old solutions (in whole or in
part), do not "consult" them, do not read them, do not ask your friend who took
the course last year if they "happen to remember" or "can give you a hint".
Doing any of these things, or anything like these things, is cheating, it is
easily detected cheating, and those who thought they could get away with it in
the past have failed the course. Even more importantly: doing any of those
things means that the assignment doesn't give you a chance to
practice; it makes any feedback you get meaningless; and of course it makes any
evaluation based on that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will be no penalty for asking. If you do violate
these policies but then think better of it, it is your responsibility to tell
me as soon as possible to discuss how your misdeeds might be rectified.
Otherwise, violations of any sort will lead to formal disciplinary action,
under the terms of the university's
policy
on academic integrity.
You must complete "homework 0" on the content of the university's policy on
academic integrity, and on these course policies. This assignment will not
factor into your grade, but you must complete it, with a grade of at
least 80%, before you can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of
Disability Resources provides support services for both physically disabled
and learning disabled students. For individualized academic adjustment based
on a documented disability, contact Disability Resources at access [at]
andrew.cmu.edu or 412-268-6121; they will coordinate with me.
Schedule, Lecture Notes and Readings
Lecture notes and readings will be linked to here, as they become available.
Lecture topics are tentative and subject to change. Don't expect required and
suggested readings to finalize until about a week before each lecture.
PDM = Principles of Data Mining; WMD = Weapons of Math Destruction; TEA = The Ethical Algorithm
TBD = to be determined = I'll figure it out between whenever you're reading this and when the readings need to be made available.
Lecture 1 (Tuesday, 14 January): Introduction to the course
 What is data mining? how is it used? where did it come from? where does the
data come from? Some technical themes: prediction by averaging similar cases;
deciding what's a similar case; verification by predicting new data. R
Markdown. Course policy on cheating.
 Readings
 Key readings:
 Background readings:
 Homework:
 After class:
Lecture 2 (Thursday, 16 January):
Lightning review of linear regression
 Linear regression as a prediction method. Expected squared error; the
bias-variance decomposition; the conditional expectation function as the
optimal regression function. The optimal linear predictor: why it always
exists (regardless of whether anything is really linear, let alone
linear-and-Gaussian), and its general mathematical form. Ordinary least
squares as an estimator of the optimal linear predictor; OLS in matrix form.
The hat matrix and predictions. Model checking and diagnostics: when they
matter and when they don't. What linear models can and cannot do.
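As a quick illustration of the matrix form (a sketch only, with simulated data; the variable names are made up), the fitted values come from applying the hat matrix H = X(X'X)^{-1}X' to the observed responses:

```r
# Sketch: OLS in matrix form, and the hat matrix turning y into fitted values.
set.seed(36462)
n <- 100
x <- runif(n)
y <- 3 + 2 * x + rnorm(n, sd = 0.5)

X <- cbind(1, x)                          # design matrix with intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y) # OLS coefficients
H <- X %*% solve(t(X) %*% X) %*% t(X)     # the hat matrix
fitted.by.hand <- as.vector(H %*% y)      # H "puts the hat on" y

# Agreement with lm(), and one standard property of H:
fit <- lm(y ~ x)
stopifnot(isTRUE(all.equal(fitted.by.hand, unname(fitted(fit)))))
stopifnot(abs(sum(diag(H)) - 2) < 1e-8)   # trace = number of coefficients
```

The trace of the hat matrix is one way of counting degrees of freedom, a theme that returns when we measure model complexity.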
 Readings
 Key reading: PDM, sections 11.111.3
 Suggested readings:
 Berk, chapter 1
 Hastie et al., sections 2.3.1 and 2.6, and chapter 3 (especially through section 3.4)
 Background reading:
 Richard A. Berk, Regression Analysis: A Constructive Critique (Thousand Oaks, California: Sage, 2004)
 R. W. Farebrother, Fitting Linear Relationships: A History of the Calculus of Observations, 1750–1900 (New York: Springer-Verlag, 1999)
 CRS, The Truth About Linear Regression
 Norbert Wiener, Extrapolation, Interpolation and Smoothing of Stationary
Time Series: with Engineering Applications (Cambridge, Massachusetts: The Technology Press, 1949 [but first published as a classified technical report, Washington, D.C.: National Defense Research Council, 1942])
 After class:
 Slides (typo-corrected!)
 .Rmd source file for the slides
Sunday, 19 January at 10 pm
 Homework 0 due
 Online questions for homework 1 due
Lecture 3 (Tuesday, 21 January):
Nearest neighbors I: Mostly theoretical
 k-nearest neighbors for classification, regression, and general prediction.
Performance of nearest-neighbor prediction (= 1-NN) for noise-free function
learning; how and why the distance to the nearest neighbor goes to zero.
Effects of noise. Why 1-NN converges to a risk of twice the minimum possible
risk. Bias-variance tradeoff for k-nearest-neighbors.
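The shrinking-distance fact can be seen in a few lines of R (an illustrative simulation, not course material): as the training sample grows, the nearest neighbor of a fixed test point gets closer and closer.

```r
# Sketch: average distance from a test point to its nearest neighbor,
# for uniform samples on [0,1], at two sample sizes.
set.seed(462)
nn.dist <- function(n, x0 = 0.5) min(abs(runif(n) - x0))
d.small <- mean(replicate(200, nn.dist(10)))    # n = 10 training points
d.large <- mean(replicate(200, nn.dist(1000)))  # n = 1000 training points
stopifnot(d.large < d.small)  # more data => a closer nearest neighbor
c(d.small, d.large)
```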
 Readings
 Key reading: PDM, sections 6.3.3, 10.1–10.2 and 10.6
 Suggested reading: Hastie et al., chapter 13 (skimming section 13.2)
 Background readings:
 Thomas M. Cover, "Estimation by the Nearest Neighbor Rule", IEEE Transactions on Information Theory 14 (1968): 50–55 [PDF reprint via Prof. Cover]
 Thomas M. Cover, "Rates of Convergence for Nearest Neighbor Procedures", pp. 413–415 in B. K. Kinariwala and F. F. Kuo (eds.), Proceedings of the Hawaii International Conference on Systems Sciences (Honolulu: University of Hawaii Press, 1968) [PDF reprint via Prof. Cover]
 Thomas M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory 13 (1967): 21–27 [PDF reprint via Prof. Cover]
After class: Notes (.Rmd), also covering the material for lecture 4.
Lecture 4 (Thursday, 23 January):
Nearest neighbors II: Mostly statistical
 The bias-variance tradeoff for k-nearest-neighbors (reprise). The problem
of selecting the number of nearest neighbors. Selecting k by
cross-validation: v-fold versus leave-one-out. Some examples.
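Leave-one-out selection of k can be sketched in base R (illustrative only, with simulated one-dimensional data and a hand-rolled kNN predictor):

```r
# Sketch: choose k for k-nearest-neighbor regression by leave-one-out CV.
set.seed(36662)
n <- 100
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

knn.predict <- function(x0, x, y, k) {
  mean(y[order(abs(x - x0))[1:k]])  # average the k nearest cases
}
loocv.mse <- function(k) {
  mean(sapply(1:n, function(i)     # hold out each point in turn
    (y[i] - knn.predict(x[i], x[-i], y[-i], k))^2))
}
ks <- c(1, 3, 5, 10, 25, 50)
mses <- sapply(ks, loocv.mse)
best.k <- ks[which.min(mses)]      # the CV-selected number of neighbors
```

Very small k tends to give low bias but high variance, and very large k the reverse; cross-validation picks a compromise.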
 Readings
 Key reading: PDM, sections 10.10–10.11
 Homework:
 After class: Notes (.Rmd), also covering the material for lecture 3.
Sunday, 26 January
Online questions for homework 2 due at 10 pm
Lecture 5 (Tuesday, 28 January):
Trees and ensembles I
 Classification and regression trees; the CART algorithm for building trees; "entropy" and "deviance"; some examples.
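A minimal sketch of fitting a classification tree, using the rpart package (which ships with standard R installations); the data set and settings here are chosen purely for illustration:

```r
# Sketch: a CART-style classification tree on a stock data set.
library(rpart)
data(iris)
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                       # the fitted splits
preds <- predict(tree, iris, type = "class")
mean(preds == iris$Species)       # in-sample accuracy (optimistic!)
```

In-sample accuracy overstates performance, which is exactly why the validation methods above matter.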
 Reading
 Key reading: PDM, sections 5.2 and 10.5
 Suggested readings:
 Berk, chapter 3
 Hastie et al., section 9.2
 After class: the chapter on trees from Advanced Data Analysis from an Elementary Point of View
Lecture 6 (Thursday, 30 January):
Trees and ensembles II
 Improving on single trees by combining multiple trees. "Bagging", or
"bootstrap averaging": combining multiple trees from slightly different data
sets. Random forests: bagging plus random feature selection. Two perspectives
on why multiple trees help: averaging reduces variance (but less so if what's
averaged is positively correlated); the performance of the average is the
average performance plus the diversity.
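Bagging can be sketched "by hand" in a few lines (illustrative only; B and the data set are arbitrary): grow a tree on each bootstrap resample, then combine the trees' predictions by majority vote.

```r
# Sketch: bagged classification trees via bootstrap resampling.
library(rpart)
data(iris)
set.seed(101)
B <- 25
votes <- sapply(1:B, function(b) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]  # bootstrap resample
  tree <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(tree, iris, type = "class"))
})
# majority vote across the B trees, case by case
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)
```

Random forests add one more ingredient: each split only considers a random subset of the features, which de-correlates the trees and makes averaging more effective.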
 Reading
 Suggested readings:
 Berk, chapters 4 and 5
 Hastie et al., section 8.7 (bagging) and chapter 15 (random forests)
 Background reading:
 Leo Breiman, "Bagging Predictors", Machine Learning 24 (1996): 123–140
 Leo Breiman, "Random Forests", Machine Learning 45 (2001): 5–32
 Hastie et al., sections 8.8 (stacking) and 9.5 (hierarchical mixtures of experts)
 Anders Krogh and Jesper Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning", pp. 231–238 in Gerald Tesauro, David Touretzky and Todd Leen (eds.), Advances in Neural Information Processing Systems 7 [NIPS 1994] (Cambridge, Massachusetts: MIT Press, 1995)
 After class: Notes, including the algebra I shortcircuited in class (.Rmd)
 Homework
Lecture 7 (Tuesday, 4 February):
Linear classifiers and logistic regression
 The "prototype method" for classification (= which class has the closer average vector?). Linear classifiers more broadly. The idea of a decision boundary. Logistic regression: a linear classifier plus an estimated class probability.
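Logistic regression is available in base R through glm(); a sketch on simulated data (all names and settings here are illustrative):

```r
# Sketch: logistic regression = a linear classifier plus class probabilities.
set.seed(7)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
p <- plogis(-1 + 2 * x1 - x2)       # true class-1 probability
y <- rbinom(n, 1, p)

fit <- glm(y ~ x1 + x2, family = binomial)
p.hat <- predict(fit, type = "response")  # estimated class probabilities
y.hat <- as.numeric(p.hat > 0.5)          # decision boundary at probability 1/2
mean(y.hat == y)                          # training accuracy
```

Thresholding the probability at 1/2 gives a linear decision boundary in (x1, x2); changing the threshold trades off false positives against false negatives, which is what an ROC curve records.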
 Reading
 Key reading: PDM, sections 10.4 and 10.7
 Suggested reading: Hastie et al., chapter 4
 After class: Slides (.Rmd)
Lecture 8 (Thursday, 6 February):
Kernel methods, support vector machines, and random kitchen sinks
 Nonlinear classification by using a linear classifier with
a nonlinear set of features. The "kernel trick" for avoiding explicitly
calculating features. Support vector machines. Random nonlinear features.
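The kernel trick can be checked numerically in a few lines (a sketch, not course material): the polynomial kernel (1 + x·z)² equals an ordinary inner product between explicitly-built nonlinear features, but never has to build them.

```r
# Sketch: the degree-2 polynomial kernel is an inner product in feature space.
phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],  # explicit features
                     x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
poly.kernel <- function(x, z) (1 + sum(x * z))^2          # kernel: no features

x <- c(1, 2); z <- c(0.5, -1)
stopifnot(isTRUE(all.equal(poly.kernel(x, z), sum(phi(x) * phi(z)))))
```

For two input dimensions the feature space has six dimensions; for high-dimensional inputs and higher-degree kernels the explicit feature space explodes, while the kernel stays cheap to evaluate.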
 Reading
 Key reading: PDM, section 10.3
 Suggested readings:
 Berk, chapter 7
 Hastie et al., chapter 12
 Background reading:
 After lecture: Notes (.Rmd source)
 Homework
Sunday, 9 February
No online reading questions this week (the planned online questions for homework 4 have been dropped).
Lecture 9 (Tuesday, 11 February):
Information measures I: Scenesetting
 Prediction needs statistical dependence; reminder of the definition
of statistical independence and how it connects to prediction. Mutual
information quantifies dependence between random variables. Alternate take: entropy measures uncertainty or spread in a random variable; conditional entropy measures uncertainty after conditioning; mutual information is the reduction in entropy due to conditioning.
 Bonus material not covered in lecture: proofs of the "it can be shown that" claims; connection between entropy and coding;
measuring differences between distributions with divergence, a.k.a. relative entropy; connections between classification and conditional entropy; connections between classification and divergence; maximum likelihood estimation is really minimum divergence estimation; estimating entropy and information from sample data.
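These quantities are easy to compute from a joint distribution table in base R (a sketch; the joint table here is made up for illustration):

```r
# Sketch: entropy and mutual information (in bits) from a joint distribution.
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

joint <- matrix(c(0.25, 0.25, 0.05, 0.45), nrow = 2)  # P(X = i, Y = j)
H.X  <- entropy(rowSums(joint))    # marginal entropy of X
H.Y  <- entropy(colSums(joint))    # marginal entropy of Y
H.XY <- entropy(as.vector(joint))  # joint entropy
MI <- H.X + H.Y - H.XY             # mutual information I(X;Y)
stopifnot(MI >= 0)   # always nonnegative; zero iff X and Y are independent
```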
 After class: Notes (.Rmd)
Lecture 10 (Thursday, 13 February):
Information measures II: The informationtheoretic view of prediction, feature selection, and summarization
 Predictive information. Sufficient statistics. Statistical
relevance bases. The information-bottleneck method. Selecting features to maximize predictive information; to minimize
stored information. Creation of new features to capture information:
the information bottleneck again; word2vec.
 Readings
 Suggested reading:
 Naftali Tishby, Fernando C. Pereira and William Bialek, "The Information Bottleneck Method", pp. 368–377 in B. Hajek and R. S. Sreenivas (eds.), Proceedings of the 37th Annual Allerton Conference on
Communication, Control and Computing (Urbana, Illinois: University of Illinois Press, 1999), arxiv:physics/0004057
 After class: Notes (with extensive worked examples); R code for reproducing the notes (except for a few deliberately-omitted items);
nyt.frame.csv data frame
 Homework
Sunday, 16 February
Online questions for homework 5 due at 10 pm
NO CLASS on Tuesday, 18 February
Lecture 11 (Thursday, 20 February):
Prelude to dimension reduction
 Goals of dimension reduction. Linear algebra
review: vectors, matrices, inner products, projections. Eigendecomposition
of matrices.
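As a quick base-R refresher on the last item (illustrative only), a symmetric matrix A factors as V Λ V', with orthonormal eigenvectors in V:

```r
# Sketch: eigendecomposition of a symmetric 2x2 matrix.
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a symmetric matrix
e <- eigen(A)
V <- e$vectors; lambda <- e$values
stopifnot(isTRUE(all.equal(V %*% diag(lambda) %*% t(V), A)))  # A = V Lambda V'
stopifnot(isTRUE(all.equal(t(V) %*% V, diag(2))))             # V is orthonormal
```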
 Readings: None
 Homework
Sunday, 23 February
Online questions for homework 6 due at 10 pm
Lecture 12 (Tuesday, 25 February):
Linear dimension reduction
 "Principal components analysis" = finding the linear surface which comes closest to the original data points. Performing and interpreting PCA.
Example with historical complexity. "Latent semantic indexing" = PCA on
bag-of-words vectors. Multidimensional scaling and PCA. Random linear
projections as an alternative to PCA.
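In R, PCA is one function call away (a sketch on a stock data set, chosen purely for illustration):

```r
# Sketch: principal components analysis with base R's prcomp().
data(USArrests)
pca <- prcomp(USArrests, scale. = TRUE)  # scale: the variables have mixed units
summary(pca)           # proportion of variance captured per component
head(pca$x[, 1:2])     # the data projected onto the first two components
# With scaling, the component variances sum to the number of variables:
stopifnot(abs(sum(pca$sdev^2) - ncol(USArrests)) < 1e-8)
```

Keeping only the first few columns of pca$x is the dimension reduction: a low-dimensional linear summary that stays as close as possible to the original points.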
 Reading
 Key reading: PDM, chapter 3, especially sections 3.6 and 3.7
 Suggested readings:
 Background readings:
 After class: Slides, with amplifications / clarifications based on class discussion (.Rmd source file)
Lecture 13 (Thursday, 27 February):
Linear dimensionality reduction, once more with feeling
 Recapitulation of the math; worked example with real data.
 After class: Slides (.Rmd source file)
 Homework
 The original plan was to do nonlinear dimension reduction, as follows:
 Problems with using linear methods when the data lie near a curved surface. The concept of a "manifold". Exploiting manifold structure: locally linear embedding. Exploiting a pre-existing measure of similarity: diffusion maps.
 Reading
 Key readings:
 Background reading:
 Mikhail Belkin and Partha Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation",
Neural Computation 15 (2003): 1373–1396 [PDF via Prof. Belkin]
 Ann B. Lee and Larry Wasserman, "Spectral Connectivity Analysis", Journal of the American Statistical Association 105 (2010): 1241–1255, arxiv:0811.0121
 Sam T. Roweis and Lawrence K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding", Science 290 (2000): 2323–2326
 Lawrence K. Saul and Sam T. Roweis, "Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds", Journal of Machine Learning Research 4 (2003): 119–155
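The first step of the diffusion-maps idea mentioned above can be sketched directly: turn a pre-existing similarity measure (here a Gaussian kernel on distances) into a random-walk transition matrix, whose leading eigenvectors would then give the diffusion-map coordinates. This is a hypothetical illustration, not course code; the bandwidth and the points are made up.

```python
from math import exp

def diffusion_transition_matrix(points, bandwidth=1.0):
    """Row-normalized Gaussian-kernel similarities for 1-D points:
    the random-walk transition matrix used in diffusion maps."""
    n = len(points)
    K = [[exp(-((points[i] - points[j]) ** 2) / bandwidth ** 2)
          for j in range(n)] for i in range(n)]
    return [[K[i][j] / sum(K[i]) for j in range(n)] for i in range(n)]

P = diffusion_transition_matrix([0.0, 0.1, 5.0])
# Each row sums to one, and nearby points get most of the probability mass
```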
Sunday, 1 March
Online questions for homework 7 due at 10 pm
Lecture 14 (Tuesday, 3 March):
Prelude to clustering
 "Clustering" is dividing data into categories, without knowing what the
categories are, or having any examples to follow. Why this is
not totally hopeless, and some rough ideas about what good clusters
might look like. Lightning review of probability distributions: cumulative
distribution function, probability density function, median, mode, mean;
goodness-of-fit testing for distributions (chi-squared, Kolmogorov–Smirnov);
entropy and relative entropy again.
 Reading
 Key reading: PDM, sections 6.4, 9.2.1–9.2.3, and 9.2.6
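One of the reviewed tools, the Kolmogorov–Smirnov statistic, is simple enough to sketch: it is the largest gap between the empirical CDF of a sample and the hypothesized CDF. The code below is a minimal pure-Python illustration (not course code); the uniform CDF and sample are invented.

```python
def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic: maximum gap between the
    empirical CDF of the sample and the hypothesized cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Check the gap just before and just after each jump of the ECDF
        d = max(d, abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
    return d

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
print(ks_statistic([0.1, 0.4, 0.5, 0.9], uniform_cdf))  # 0.25
```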
Lecture 15 (Thursday, 5 March):
Clustering I: Clusters without Probability
 The most basic algorithm for clustering: k-means. Using clustering to find
(approximate) nearest neighbors ("locality-sensitive hashing").
 Readings
 Key readings:
 PDM, sections 9.3–9.5 (skimming 9.1 and 9.2 is recommended)
 O'Neil, Chapters 9 and 10
 Suggested readings:
 Hastie et al., sections 14.3.1–14.3.6
 Background reading:
 Aristides Gionis, Piotr Indyk and Rajeev Motwani, "Similarity Search in High Dimensions via Hashing", pp. 518–529 in M. P. Atkinson et al. (eds.),
Proceedings of the 25th International Conference on Very Large Data Bases [VLDB '99] (San Francisco: Morgan Kaufmann, 1999)
 Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data (Cambridge, UK: Cambridge University Press, 2006)
 After class: Notes
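The k-means idea can be sketched in a few lines: alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. Below is a hypothetical pure-Python implementation of this (Lloyd's algorithm) for one-dimensional data; the data and seed are invented for illustration.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm for 1-D points: alternate assignment and mean updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
print(kmeans(data, 2))  # centers near 1.0 and 5.0
```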
 Homework
 Homework 7 due at 10 pm
Homework 8 assigned
10 and 12 March: NO CLASS
Enjoy spring break.
Monday, 16 March
Online questions for homework 8 due at 10 pm (note day!)
Lecture 16 (Tuesday, 17 March): NO CLASS
The university has canceled all instruction for Monday and Tuesday.
Lecture 17 (Thursday, 19 March):
Clustering II
 Probabilistic clustering with mixture models; the EM algorithm; k-means
again. Using log-likelihood to pick the number of clusters. Checking
goodness of fit of cluster models. Advanced topics in probabilistic clustering: "mixed membership"
models, and their roles in modeling text ("topic models", "latent
Dirichlet allocation") and genetics.
 Reading
 Key reading: PDM, sections 9.2.4–9.2.5 (again), and section 9.6
 Suggested reading:
 Background readings:
 April Galyardt, "Interpreting Mixed Membership", in Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds.), Handbook of Mixed Membership Models and Their Applications (New York: Chapman and Hall, 2016) [PDF]
 David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003): 993–1022
 J. K. Pritchard, M. Stephens and P. Donnelly, "Inference of population structure using multilocus genotype data", Genetics 155 (2000): 945–959
 Homework
 Homework 8 ("graded survey") due at 10 pm
 Homework 9: Assignment
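The EM algorithm of this lecture can be sketched in its simplest setting, a two-component Gaussian mixture with known, equal variances: the E step computes each point's "responsibility" under component 1, and the M step re-estimates the mixing weight and the means. This is a hypothetical illustration, not course code; the data and starting values are made up.

```python
from math import exp, pi, sqrt

def em_two_gaussians(xs, mu1, mu2, sigma=1.0, iters=100):
    """EM for a two-component Gaussian mixture with known, equal variances."""
    def dens(x, mu):
        return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)
    w = 0.5  # mixing weight of component 1
    for _ in range(iters):
        # E step: responsibility of component 1 for each point
        r = [w * dens(x, mu1) / (w * dens(x, mu1) + (1 - w) * dens(x, mu2))
             for x in xs]
        # M step: update the weight and the two means
        w = sum(r) / len(xs)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return w, mu1, mu2

xs = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
w, mu1, mu2 = em_two_gaussians(xs, mu1=-1.0, mu2=1.0)
# The means converge near -2 and +2, and the weight near 0.5
```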
Lecture 18 (Tuesday, 24 March):
How big are the errors of our predictive models?
 Error measures or "loss functions": squared error for regression,
mislabeling ("0/1 loss") for classification, log-likelihood for probabilities.
Why we should never evaluate predictive models in-sample. (Review of why
R-squared is useless and/or misleading.) Simple (too simple?) formulas for
estimating out-of-sample prediction error. The concept of effective
degrees of freedom. Other measures of model complexity.
 Reading
 Key reading: PDM, chapter 7
 Suggested readings:
 Berk, section 1.6 (skipping 1.6.7, "Basis Functions" and 1.6.8, "The Curse of Dimensionality")
 Hastie et al., chapter 7
 Background readings:
 Notes (.Rmd)
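The three loss functions named above are easy to write down. The following minimal pure-Python definitions (not course code) use invented toy predictions.

```python
from math import log

def squared_error(y, yhat):
    """Mean squared error for regression predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def zero_one_loss(y, yhat):
    """Fraction of mislabeled cases for classification."""
    return sum(a != b for a, b in zip(y, yhat)) / len(y)

def neg_log_likelihood(y, phat):
    """Average negative log-likelihood for binary outcomes
    with predicted probabilities phat of the outcome being 1."""
    return -sum(log(p) if a == 1 else log(1 - p)
                for a, p in zip(y, phat)) / len(y)

print(squared_error([1.0, 2.0], [1.5, 2.0]))   # 0.125
print(zero_one_loss([0, 1, 1], [0, 0, 1]))     # 0.333...
print(neg_log_likelihood([1, 0], [0.9, 0.2]))
```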
Lecture 19 (Thursday, 26 March):
What kinds of errors do our predictive models make?
 Regression diagnostics, focusing on prediction error: over- and under-prediction, non-constant variance ("heteroskedasticity"). Errors for classifiers: false positives versus false negatives, the trade-off between them, and the "receiver operating characteristic" (ROC) curve. "Calibration" of probability models; "proper scoring rules".
 Reading
 Key reading: PDM, chapter 10
 Suggested readings:
 Hastie et al., section 9.2.5
 Background readings:
 Tilmann Gneiting, Fadoua Balabdaoui and Adrian E. Raftery,
"Probabilistic Forecasts, Calibration and Sharpness", Journal of the Royal Statistical Society B 69 (2007): 243–268
 Deborah G. Mayo and Aris Spanos, "Methodology in Practice: Statistical Misspecification Testing", Philosophy of Science 71 (2004): 1007–1025
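The ROC curve described above can be traced with a few lines of code: sort cases by predicted score, sweep the classification threshold, and record (false-positive-rate, true-positive-rate) pairs. This is a hypothetical pure-Python sketch, not course material; the scores and labels are invented.

```python
def roc_points(scores, labels):
    """(false-positive-rate, true-positive-rate) at every threshold,
    sweeping from strictest (predict nothing) to loosest (predict everything)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for score, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
# A perfect ranker passes through (0, 1): all positives before any negatives
```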
 Homework
Saturday, 28 March
Homework 9 due, along with online reading questions (an extension)
Lecture 20 (Tuesday, 31 March):
Cross-validation
 Splitting the sample to estimate generalization error. Cross-validation:
"v-fold" (or "k-fold") and "leave-one-out". Shortcut formulas for LOOCV.
 Readings
 Key reading: PDM, chapter 10, especially sections 10.6 and 10.10
 Suggested reading:
 Hastie et al., section 7.10
 Background reading:
 Notes: Slides (.Rmd source file)
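As a concrete illustration of v-fold cross-validation, here is a hypothetical pure-Python sketch for a generic fit/predict pair: hold out each fold in turn, fit on the rest, and average the squared prediction errors. The mean-only "model" and the toy data are invented.

```python
def kfold_cv_mse(xs, ys, k, fit, predict):
    """k-fold cross-validated mean squared error for a generic
    fit(xs, ys) -> model and predict(model, x) -> yhat pair."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]
    total = 0.0
    for held in folds:
        train = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held:
            total += (ys[i] - predict(model, xs[i])) ** 2
    return total / n

# A "model" that always predicts the training mean
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda m, x: m
mse = kfold_cv_mse([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0], 2, fit_mean, predict_mean)
```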
Lecture 21 (Thursday, 2 April):
Bootstrapping
 The concept of the bootstrap (again): imitating generalization from a sample
to a whole population by resampling the data. Bootstrap estimates of
prediction error. Bootstrap confidence intervals. Bootstrap model
averaging (again).
 Reading
 Suggested reading: Hastie et al., sec. 7.11
 Homework
 Slides (.Rmd)
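The percentile bootstrap can be sketched briefly: resample the data with replacement many times, recompute the statistic each time, and read the confidence interval off the quantiles of the resampled values. This is a hypothetical pure-Python illustration (not course code) for the mean; the data are invented.

```python
import random

def bootstrap_ci_mean(xs, level=0.95, reps=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(xs)
    # Means of `reps` resamples drawn with replacement from the data
    means = sorted(sum(rng.choices(xs, k=n)) / n for _ in range(reps))
    lo = means[int((1 - level) / 2 * reps)]
    hi = means[int((1 + level) / 2 * reps) - 1]
    return lo, hi

lo, hi = bootstrap_ci_mean([2.1, 1.9, 2.5, 2.0, 2.3, 1.8, 2.2, 2.4])
# The interval should cover the sample mean (2.15)
```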
Lecture 22 (Tuesday, 7 April):
Recommender systems I: Broad Ideas and Technical Issues
 The idea of a recommender system. Best-seller
lists. Nearest-neighbor-based approaches (whether based on users or on items).
The Netflix problem, singular value decomposition, and PCA/factor-model-style
approaches. Recommendations using social graphs. Missing-data issues.
 Readings
 Key readings:
 Suggested reading:
 Background readings:
 Gérard Biau, Benoît Cadre, Laurent Rouvière, "Statistical analysis of k-nearest neighbor collaborative recommendation", Annals of Statistics
38 (2010): 1568–1592, arxiv:1010.0499
 Maurizio Ferrari Dacrema, Paolo Cremonesi and Dietmar Jannach, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", in Proceedings of the 13th ACM Conference on Recommender Systems [RecSys 2019],
arxiv:1907.06902
 Benjamin Marlin, Richard S. Zemel, Sam Roweis, Malcolm Slaney, "Collaborative Filtering and the Missing at Random Assumption", arxiv:1206.5267
 Notes (.Rmd)
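The user-based nearest-neighbor approach mentioned above can be sketched as similarity-weighted averaging of other users' ratings. This is a hypothetical pure-Python illustration (not course code), with a made-up ratings matrix in which a zero entry means "unrated".

```python
def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def recommend(target, others, item_names):
    """Suggest the unrated item (rating 0) with the highest
    similarity-weighted average rating among the other users."""
    scores = {}
    for j, name in enumerate(item_names):
        if target[j] == 0:  # only score items the target hasn't rated
            num = sum(cosine(target, u) * u[j] for u in others)
            den = sum(cosine(target, u) for u in others)
            scores[name] = num / den if den > 0 else 0.0
    return max(scores, key=scores.get)

items = ["A", "B", "C"]
alice = [5, 0, 0]                       # has only rated item A
others = [[5, 5, 4], [1, 1, 1], [4, 5, 5]]
print(recommend(alice, others, items))
```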
Lecture 23 (Thursday, 9 April):
Recommender systems II: What's Not to Love?
 Homogenization; paradoxes of diversification; rabbit holes and echo chambers; discrimination; what the system optimizes for; when inferior recommendations are more profitable; natural monopoly (or oligopoly). Also: do recommendations do anything at all?
 Readings
 Suggested reading (in this order):
 "Amazon.com Recommendations Understand Area Woman Better Than Husband", The Onion 9 January 2007
 Tom Slee, "Online Monoculture and the End of the Niche", Whimsley 15 March 2009
 Tommy Wallach, "In Which Netflix Achieves Sentience As A Result of My Terrible Decision-Making", The Toast, 4 August 2014
 Zeynep Tufekci, "YouTube's Recommendation Algorithm Has a Dark Side",
Scientific American 320:4 (April 2019): 77
 Zeynep Tufekci, "YouTube, the Great Radicalizer", New York Times, 10 March 2018
 Notes (.Rmd)
 Homework
 Homework 11 due at 10 pm
 Homework 12: CANCELLED
Lecture 24 (Tuesday, 14 April):
Information retrieval

Finding data by content. Approaches we will neglect: metadata; "coding" or
"content analysis". Textual features. Vector representations of documents:
the bag-of-words representation; word2vec. Measures of similarity and
the importance of normalization; inverse document frequency and other tweaks to
the representation. Example with the New York Times Annotated Corpus.
The trick to searching: queries are documents and can be represented in the
same feature space. Search evaluation: precision, recall, precision–recall
curves; error rates. Incorporating user feedback. Proxies for relevance like
"engagement". Abstraction and generalizing to other
types of data.
 Readings
 Key reading: PDM, chapter 14
 Background reading:
 Marco Baroni, Georgiana Dinu and German Kruszewski, "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", pp. 238–247 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics [ACL14] (Baltimore, MD: Association for Computational Linguistics, 2014)
 Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arxiv:1402.3722
 Kian KenyonDean, "Word Embedding Algorithms as Generalized Low Rank Models and their Canonical Form", arxiv:1911.02639
 Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama, "A Mutual Information Maximization Perspective of Language Representation Learning", arxiv:1910.08350
 Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arxiv:1301.3781
 Notes (.Rmd)
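The inverse-document-frequency reweighting of bag-of-words vectors can be sketched briefly: count each word in each document, then down-weight words that appear in many documents. This hypothetical pure-Python illustration (not course code) uses an invented three-document corpus.

```python
from math import log

def tfidf(docs):
    """Bag-of-words vectors reweighted by inverse document frequency.
    Returns (vectors, vocabulary); word order in vectors follows vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    df = {w: sum(w in d.split() for d in docs) for w in vocab}
    idf = {w: log(n / df[w]) for w in vocab}
    vecs = [[d.split().count(w) * idf[w] for w in vocab] for d in docs]
    return vecs, vocab

vecs, vocab = tfidf(["the cat sat", "the dog sat", "the cat ran"])
# "the" appears in every document, so its idf (and hence its weight) is zero
```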
Thursday, 16 April: OPTIONAL lecture on epidemic modeling
Originally, there wasn't supposed to be any class this day because of
Carnival. The university now says it's a day for instruction, but I promised
you wouldn't have to do anything, so we'll cover something that is,
strictly speaking, outside the scope of the course, but statistical
and topical, and make it completely optional.
 The basic "susceptible-infectious-removed" (SIR) epidemic model. The
probability model and its deterministic limit. The idea of the "basic
reproductive number" R0 and how it relates to the rates of transmission and
removal. Why diseases do not necessarily evolve to be less lethal to their
hosts. The epidemic threshold at R0=1.
Complications: gaps between being infected and becoming infectious; the
possibility of being infectious without showing symptoms; reinfection.
Epidemics in social networks, and how network structure affects the epidemic
threshold; why high-degree people tend to be among the first infected, and
disease-control strategies based on "destroying the hubs". Statistical issues
in connecting epidemic models to data.
 Reading:
 Slides (.Rmd), revised after the lecture to fix typos, amplify points, incorporate things that came up in discussion, etc.
 Homework
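The deterministic limit of the SIR model can be simulated by simple Euler integration of its differential equations. This is a hypothetical illustration (not the lecture's code); the parameter values are chosen so that R0 = beta/gamma = 3 > 1, above the epidemic threshold.

```python
def sir(beta, gamma, s0=0.99, i0=0.01, steps=10000, dt=0.01):
    """Euler integration of the deterministic SIR equations:
    ds/dt = -beta*s*i, di/dt = beta*s*i - gamma*i, dr/dt = gamma*i."""
    s, i, r = s0, i0, 1.0 - s0 - i0
    for _ in range(steps):
        ds = -beta * s * i
        di = beta * s * i - gamma * i
        dr = gamma * i
        s, i, r = s + ds * dt, i + di * dt, r + dr * dt
    return s, i, r

# With R0 = 3, the epidemic takes off and most of the population is removed
s, i, r = sir(beta=3.0, gamma=1.0)
```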
Lecture 25 (Tuesday, 21 April):
Fairness in prediction, especially classification
 Notions of "fair" prediction: not using "protected" features; parity of
error rates across groups; calibration of prediction. "Inference" issues:
using features that carry information about protected features. Trade-offs
between different forms of fairness. Trade-offs between forms of fairness and
accuracy. Some critiques of these notions of "fairness".
 Readings
 Key readings:
 WMD, chapters 1, 5, 6 and 8
 TEA, chapter 2
 Suggested readings:
 Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, "Machine Bias", ProPublica 23 May 2016
 Alexandra Chouldechova, "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments", arxiv:1610.07524
 Sam CorbettDavies and Sharad Goel, "The Measure and
Mismeasure of Fairness: A Critical Review of Fair Machine
Learning", arxiv:1808.00023
 Simon DeDeo, "Wrong side of the tracks: Big Data and Protected Categories", pp. 3142 in Cassidy R. Sugimoto, Hamid R. Ekbia and Michael Mattioli (eds.), Big Data Is Not a Monolith (Cambridge, Mass.: MIT Press, 2016), arxiv:1412.4643
 Sharad Goel, Jake M. Hofman, M. Irmak Sirer, "Who Does What on the Web: A Large-Scale Study of Browsing Behavior", Sixth International AAAI Conference on Weblogs and Social Media [ICWSM 2012] [Preprint via Prof. Goel]
 Background reading:
 Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan, "A Survey on Bias and Fairness in Machine Learning", arxiv:1908.09635
 Notes (.Rmd)
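Error-rate parity, one of the fairness notions above, can be checked with a short computation: compare false-positive and false-negative rates across groups. Here is a hypothetical pure-Python sketch (not course code); the predictions, labels, and group memberships are invented.

```python
def group_error_rates(yhat, y, group):
    """False-positive and false-negative rates within each group,
    for checking error-rate parity across groups."""
    out = {}
    for g in set(group):
        idx = [i for i, gi in enumerate(group) if gi == g]
        fp = sum(yhat[i] == 1 and y[i] == 0 for i in idx)
        fn = sum(yhat[i] == 0 and y[i] == 1 for i in idx)
        neg = sum(y[i] == 0 for i in idx)
        pos = sum(y[i] == 1 for i in idx)
        out[g] = (fp / neg if neg else 0.0, fn / pos if pos else 0.0)
    return out

yhat  = [1, 0, 1, 0, 1, 0]
y     = [1, 0, 0, 0, 1, 1]
group = ["a", "a", "a", "b", "b", "b"]
rates = group_error_rates(yhat, y, group)
# Group "a" suffers false positives, group "b" false negatives:
# equal overall accuracy can still hide unequal error rates
```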
Lecture 26 (Thursday, 23 April):
Waste, fraud and abuse I: when is our data-mining project going to be a technical failure?
 Weak or nonexistent predictive
relationships. Changing predictive relationships. Changing distributions. The curse of dimensionality.
 Readings
 Key readings:
 PDM, chapter 2, section 6.5, section 1.7
 Suggested readings:
 Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired 16.07 (June 2008)
 Berk, chapter 9
 Jesse Frederik and Maurits Martijn, "The new dot com bubble is here: it’s called online advertising", The Correspondent 2 November 2019 [While a popular article, this is well-sourced in serious research, which it links to]
 Andrew Gelman and David Weakliem, "Of Beauty, Sex and Power",
American Scientist 97 (2009): 310
 Fernando Pereira, comments on Anderson, Earning My Turns, 23 June 2008
 Christian Grothoff and J.M. Porup, "The NSA’s SKYNET program may be killing thousands of innocent people", Ars Technica 16 February 2016
 Notes (.Rmd)
Saturday, 25 April
Homework
Lecture 27 (Tuesday, 28 April):
Waste, Fraud, and Abuse II: More sources of technical failure
 Bad data: errors in data vs. bad measurements, measuring the wrong thing.
Not accounting for model search and/or model selection; sample splitting as a
cure for post-selection inference. The "garden of forking paths" and "researcher degrees of freedom". Prediction vs. causality; taking credit
for what would have happened anyway.
 Readings
 Key reading: TEA, chapter 4
 Suggested readings:
 Background readings
 Jorge Luis Borges, "The Garden of Forking Paths" (as "El jardín de senderos que se bifurcan", 1941; translated by Anthony Kerrigan in Ficciones [New York: Grove Press, 1962] and by Andrew Hurley in Collected Fictions [New York: Viking, 1998])
 Howard Becker, Evidence (Chicago: University of Chicago Press, 2017)
 Luke OakdenRayner, "Exploring large scale public medical image datasets", arxiv:1907.12720
 Amit Sharma, Jake M. Hofman and Duncan J. Watts, "Estimating the Causal Impact of Recommendation Systems from Observational Data", pp. 453470 in Proceedings of the Sixteenth ACM Conference on Economics and Computation [EC '15], arxiv:1510.05569
Lecture 28 (Thursday, 30 April):
Waste, fraud and abuse III: what are we doing to ourselves? should we be doing this?
 Privacy and surveillance. Rights-violating applications of data mining.
Legal (probably) but still destructive applications. Is it any better if it doesn't work (well)? Some ethical checklist questions: who benefits? who is harmed? how much? how surely? are you OK with being responsible for that?
 Readings
 Key readings:
 WMD, Conclusion
 TEA, chapters 1 and 5
 Suggested readings:
 Maciej Ceglowski, "What Happens Next Will Amaze You"
 Henry Farrell, "Seeing Like a Finite-State Machine", Crooked Timber 25 November 2019
 Kashmir Hill, "The Secretive Company That Might End Privacy As We Know It", New York Times 19 January 2020 [published in the print edition as "A Facial Recognition App That Puts Privacy in Peril", p. 1]
See also:
 Betsy Swan, "Facial-Recognition Company That Works With Law Enforcement Says Entire Client List Was Stolen", Daily Beast 26 February 2020
 Stuart A. Thompson and Charlie Warzel, "Twelve Million Phones, One Dataset, Zero Privacy", New York Times 19 December 2019
 Zeynep Tufekci, "We're building a dystopia just to make people click on ads", TED
 Background readings:
 Francis Spufford, Red Plenty (London: Faber and Faber, 2010) [About the first time brilliant and wellintentioned people tried to use data, computers and optimization to make the world a better place]
 Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society (Boston: Houghton Mifflin, 1950)
Friday, 1 May
Homework 14 due at 10 pm