Statistics 36-462/662: Data Mining

Fall 2019

Prof. Cosma Shalizi
Mondays and Wednesdays 1:30--2:50 Doherty Hall 1212

Data mining is the art of extracting useful patterns from large bodies of data. (Metaphorically: finding seams of actionable knowledge in the raw ore of information.) The rapid growth of computerized data, and the computer power available to analyze it, creates great opportunities for data mining in business, medicine, science, government and elsewhere. The aim of this course is to help you take advantage of these opportunities in a responsible way.

After taking the class, when you're faced with a new problem, you should be able to (1) select appropriate methods, and justify their choice, (2) use and program statistical software to implement them, and (3) critically evaluate the results and communicate them to professional colleagues who are not also statisticians or data-analysts.

Data mining is related to statistics and to machine learning, but has its own aims and scope. Statistics deal with systems for creating reliable inferences from imperfect data. Machine learning develops technologies of automated prediction and induction. We will use tools from statistics and from machine learning, but we will use them as tools, not things to study in their own right. We will do a lot of calculations, but will not prove many theorems, and we will do many more experiments than calculations.


For this semester, the prerequisite is passing 36-401 (modern regression) with a grade of C or better. The only exceptions are for graduate students taking the graduate section (36-662; see below).

Graduate students and 36-662

This course is primarily intended as an advanced elective for undergraduate students majoring in statistics. A limited number of graduate students are welcome to take it as 36-662. (Graduate students who register for 36-462 will be dropped from the roster.) Formally, 662 has no prerequisites, but students are expected to have a background at least equivalent to 36-401, and the prerequisites to 401: linear regression modeling, probability, mathematical statistics, linear algebra, calculus, and competence in using R for data analysis. This class doesn't have the resources to help those without this background catch up.

Undergraduates must take the course as 36-462; those who register for 662 will be dropped from the roster.


Professor Dr. Cosma Shalizi cshalizi [at]
Baker Hall 229C
Teaching assistant Ms. Robin Dun not to be bothered with e-mails

Topics and Techniques to be Covered

These may change during the semester.
Searching by similarity and information retrieval
Searching by content (texts, images, genes, ...); attributes, representations and definitions of similarity and distance; choice of representation. Bag-of-words and vector-space representations of text. A first look at linear classifiers, nearest neighbors, prototypes; user feedback; evaluating searches.
Dimension reduction
Using principal components to reduce the number of features; using factor analysis to reduce the number of features; using random projections to reduce the number of features. Strengths and limitations of linear dimension-reduction methods. Nonlinear dimensionality reduction: local linear embedding, diffusion maps, random nonlinear features. Predictive embedding methods; word2vec. Case studies: the "Big 5" personality traits; latent semantic indexing; Netflix; Cambridge Analytica.
Quantitative measures of information, uncertainty, and information gain; information-theoretic feature selection.
Nearest neighbor methods
Nearest neighbors for classification, regression, and general prediction. Finding nearest neighbors: data structures; locality-sensitive hashing; LSH and information retrieval. Case studies: recommender systems / "you may also like"; matching estimates of treatment effects.
Unsupervised category-learning, a.k.a. clustering. Non-probabilistic clustering methods: k-means clustering, hierarchical clustering. Geometry of clustering, ideas about what makes for good clusters; clusters and hashing. Probabilistic clustering: mixture models, hierarchical mixture (topic) models. Case studies: demographic clustering; topic models for text; topic models for genetics.
Supervised categorization, a.k.a. classification. Linear classifiers; logistic regression. Evaluating classifiers: error rates, over- and under- fitting, cross-validation; base rates, operating characteristics, ROC curves, Neyman-Pearson classifiers. Support-vector machines and the "kernel trick." "Fairness" in classification: definitions and disputes. Case studies: credit scoring; "clinical vs. actuarial judgment"; COMPAS.
Trees and ensembles
Classification and regression trees; tree-growing algorithms. "Fast and frugal heuristics". Basic idea of ensembles: combining predictive models; the virtues of diversity. Bagging. Random forests. Boosting. Stacking. Mixtures of experts and hierarchical mixtures of experts. Case studies to be determined.
Waste, Fraud, and Abuse
When data mining will fail: bad data, wrong data, insufficient data, overwhelming false positives, impossible problems, attacking the wrong problem. Evaluating claims about data mining and calling BS. When data mining is evil. Case studies: are we building a dystopia to make each other click on ads?

Course Mechanics and Grading

There are three reasons you will get assignments in this course. In order of decreasing importance:
  1. Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows that you only thought you understood it.
  2. Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
  3. Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be three kinds of assignment in this course.

Every week will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may theory, analyzing real data sets on the computer, and communicating the results.
All homework will be submitted electronically through Canvas. Most weeks, homework will be due at 10:00 pm on Wednesday. Homework assignments will always be released by Friday of the previous week, and sometimes before.
There are specific formatting requirements for homework --- see below.
Online assignments
Most homework assignments will be accompanied by short-answer or multiple choice questions, to be done on Canvas, before the homework itself is due. These will typically be due on Sunday nights; consult Canvas for the exact schedule.
In-class exercises
Most lectures will have in-class exercises. These will be short (10--20 minutes) assignments, emphasizing problem solving, done in class in small groups of no more than four people. The assignments will be given out in class, and must be handed in on paper by the end of class. On most days, a randomly-selected group will be asked to present their solution to the class.
The in-class exercises will either be problems from that week's homework, or close enough that seeing how to do the exercise should tell you how to do some of the problems.
There will be no exams.

Time expectations

You should expect to spend 5--7 hours on assignments every week, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.


Grades will be broken down as follows:

(The point of dropping the low grades is to let you schedule interviews, social engagements, and other obligations, with flexibility, and without my having to decide what is or is not important enough to change the rules.)

Grade boundaries will be as follows:
A [90, 100]
B [80, 90)
C [70, 80)
D [60, 70)
R < 60

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think that particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".


Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. They are also when you will do the graded in-class assignments which will help consolidate your understanding of the material, and help with your homework.

There are (up to) three classes of readings associated with each lecture.

  1. Key readings: Usually from the textbook, occasionally from other sources which will be made available well in advance. Do this reading before coming to lecture, and lecture will make a lot more sense. (Also, the online exercises will be easy.)
  2. Suggested readings: Usually easier-to-read articles or academic papers, or portions of the non-required texts, which will deepen your understand of what we're doing in lecture and in the homework, but aren't quite so critical.
  3. Background readings: Usually harder-to-follow or more historical papers, which give further details about methods and applications.
TL;DR: definitely the key readings before class; try to do the suggested readings too, they're easy; and the background readings are if you want to get into the weeds.

Please don't use any electronic devices during lecture: no laptops, no tablets, no phones, no recording devices, no watches that do more than tell time. The only exception to this rule is for electronic assistive devices, properly documented with CMU's Office of Disability Resources.

(The no-electronics rule is not arbitrary meanness on my part. Experiments show, pretty clearly, that students learn more in electronics-free classrooms, not least because your device isn't distracting your neighbors, who aren't as good at multitasking as you are.)

R, R Markdown, and Reproducibility

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). In this class, you'll be using it for every homework. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code is not acceptable; your document must be humanly readable.

All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code (and not some brilliant trick that you then forget, or obscure bug that you didn't notice). For help on using R Markdown, see "Using R Markdown for Class Reports". (Part of the first lecture will be going over the basics of R Markdown.)

Format Requirements for Homework

For each assignment, you should generally submit two, and only two, files: an R Markdown source file, integrating text, generated figures and R code, and the "knitted," humanly-readable document, in either PDF (preferred) or HTML format. (I cannot read Word files, and you will lose points if you submit them.) I will be re-running the R Markdown file of randomly selected students; you should expect to be picked for this about once in the semester. You will lose points if your R Markdown file does not, in fact, generate your knitted file (making obvious allowances for random numbers, etc.).

Some problems in the homework will require you to do math. R Markdown provides a simple but powerful system for type-setting math. (It's based on the LaTeX document-preparation system widely used in the sciences.) If you can't get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. (You should upload these additional files to Canvas.) You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by the last homework.

Canvas and Piazza

Homework will be submitted electronically through Canvas, which will also be used as the gradebook. Some readings and course materials will also be distributed through Canvas.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. (Postings will not be anonymous to instructors.) Anonymity will go away for everyone if it is abused. As noted above, homework will generally be due at 10 pm on Wednesdays; however, you should not expect the instructors to answer questions on Piazza (or by e-mail) after 6 pm.


Solutions for all homework will be available, after their due date, through Canvas. Please don't share them with anyone, even after the course has ended.

Office Hours

Wednesdays 12:30--1:20 Prof. Shalizi Baker Hall 229C
Tuesdays 4:00--5:00 Ms. Dunn Baker Hall 132Q
(TA office hours are TBD but will be announced here when set)

If you want help with computing, please bring a laptop.

If you cannot make the regular office hours, or have concerns you'd rather discuss privately, please e-mail me about making an appointment.


Our textbook is It's available electronically through the university library, or you can buy a copy.

The following books are recommended but not required; portions of them may be distributed (via Canvas) as readings, as needed:

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss, in general terms, homework with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students or the instructors until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course in previous years is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

You must complete "homework 0" on the content of the university's policy on academic integrity, and on these course policies. This assignment will not factor into your grade, but you must complete it, with a grade of at least 80%, before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Disability Resources provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Disability Resources at access [at] or 412-268-6121; they will coordinate with me.

Schedule, Lecture Notes and Readings

Lecture notes and readings will be linked to here, as they become available. Lecture topics are tentative and subject to change. Don't expect required and suggested readings to finalize until about a week before each lecture.

PDM = Principles of Data Mining, our textbook
TBD = to be determined = I'll figure it out between when whenever you're reading this and when the readings need to be made available.

Lecture 1 (Monday, 26 August): Introduction to the course
What is data mining? how is it used? where did it come from? Some themes. R Markdown.
Key readings:
  • PDM, chapter 1
  • "Using R Markdown for Class Reports"
  • Suggested readings:
  • O'Neil, Introduction and Chapter 1
  • Very deep background readings:
  • James Beniger, The Control Revolution: Technological and Economic Origins of the Information Society (Cambridge, Massachusetts: Harvvard University Press, 1986)
  • Herbert Simon, The Shape of Automation for Men and Management (New York: Harper and Row, 1965) [Reprinted as The New Science of Management Decisions]
  • Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society (Boston: Houghton Mifflin, 1950)
  • JoAnne Yates, Control through Communication: The Rise of System in American Management (Baltimore: Johns Hopkins University Press, 1989)
  • After class: Slides (.Rmd source), welcome.Rmd example file
    Lecture 2 (Wednesday, 28 August): Information retrieval I
    Finding data by content. Approaches we will neglect: metadata; "coding" or "content analysis". Textual features. The bag-of-words representation. Vector representations of documents. Measures of similarity and the importance of normalization; inverse document frequency and other tweaks to the representation. Our first stab at nearest-neighbor classification. Example with the New York Times Annotated Corpus. Abstraction and generalizing to other types of data.
    Key reading: PDM, chapter 14
    After class: Handout with useful examples (.Rmd)
    Homework 0
    Assignment on Canvas only
    Due on Sunday, 1 September at 10 pm
    Lecture 3 (Wednesday, 4 September): Information retrieval II
    The trick to searching: queries are documents. Search evaluation: precision, recall, precision-recall curves; error rates. Incorporating user feedback. Proxies for relevance like "engagement". Classification: nearest neighbors, linear classifiers, prototypes. Classifier evaluation by mis-classification rate and by confusion matrices.
    After class: Slides (.Rmd)
    Homework 1
    Due on Wednesday, 4 September at 10 pm
    Assignment, data set, nytac-and-bow.R code
    Lecture 4 (Monday, 9 September): Dimension reduction I
    Goals of dimension reduction. Linear algebra review: vectors, matrices, inner products, projections. Eigen-decomposition.
    Key reading: Excerpts from Stone, Independent Component Analysis (on Canvas)
    Lecture 5 (Wednesday, 11 September): Dimension reduction II
    Principal components analysis. Distance-preserving linear projection. Performing and interpreting PCA. Example with historical complexity. "Latent semantic indexing" = PCA on bag-of-words vectors. Multidimensional scaling and PCA. The Netflix problem, singular value decomposition, and PCA.
    Key reading: PDM, chapter 3, especially sections 3.6 and 3.7
    Suggested readings:
  • Andrey Feuerverger, Yu He, and Shashi Khatri, "Statistical Significance of the Netflix Challenge", Statistical Science 27 (2012): 202--231
  • Hastie et al., section 14.5
  • Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on PCA
  • Background readings:
  • Peter Turchin et al., "Quantitative historical analysis uncovers a single dimension of complexity that structures global variation in human social organization", Proceedings of the National Academy of Sciences (USA) 115 (2018) E144--E151
  • Michael E. Wall, Andreas Rechtsteiner and Luis M. Rocha, "Singular Value Decomposition and Principal Component Analysis", physics/0208101
  • After class: Slides (.Rmd, including commented R commands for all figures)
    Homework 2
    Due on Wednesday, 11 September at 10 pm
    Lecture 6 (Monday, 16 September): Dimension reduction III
    Factor analysis: a generative model of the data (unlike PCA), which aims to preserve correlations and a linear structure. Factor analysis as a kind of linear manifold learning. Origins of factor analysis in psychometrics. Utility and drawbacks. Examples: personality testing, Netflix (again), Cambridge Analytica.
    Key reading: PDM, chapter 3, section 3.6
    Suggested readings:
  • Matthew Hindman, "This is how Cambridge Analytica’s Facebook targeting model really worked — according to the person who built it", The Conversation, 30 March 2018
  • O'Neil, chapter 6 ("Ineligible to Serve")
  • Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on factor models
  • Background readings:
  • David J. Bartholomew, Latent Variable Models and Factor Analysis
  • Annie Murphy Paul, The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves (New York: Free Press, 2004)
  • Charles Spearman, "``General Intelligence,'' Objectively Determined and Measured", American Journal of Psychology 15 (1904): 201--293 [Online]
  • Godfrey H. Thomson, The Factorial Analysis of Human Ability
  • L. L. Thurstone, "The Vectors of Mind", Psychological Review 41 (1934): 1--32 [Online]
  • After class: Slides (.Rmd)
    Lecture 7 (Wednesday, 18 September): Dimension reduction IV
    Random linear projections. Nonlinear dimension reduction: local linear embedding.
    Key readings:
  • PDM, section 6.5
  • Shalizi, Advanced Data Analysis from an Elementary Point of View, appendix on nonlinear dimensionality reduction
  • Suggested readings:
  • Sam T. Roweis and Laurence K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding", Science 290 (2000): 2323--2326
  • Lawrence K. Saul and Sam T. Roweis, "Think Globally, Fit Locally: Supervised Learning of Low Dimensional Manifolds", Journal of Machine Learning Research 4 (2003): 119--155
  • After class: Slides (.Rmd, including a working LLE function)
    Homework 3
    Due on Wednesday, 18 September at 10 pm
    Assignment; data file; eigendresses.R
    Lecture 8 (Monday, 23 September): Dimension reduction V
    Nonlinear dimension reduction: diffusion maps
    Background reading:
  • Mikhail Belkin and Partha Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation", Neural Computation 15 (2003): 1373--1396 [PDF via Prof. Belkin]
  • Ann B. Lee and Larry Wasserman, "Spectral Connectivity Analysis", Journal of the American Statistical Association 105 (2010): 1241--1255, arxiv:0811.0121
  • After class: Handout
    Lecture 9 (Wednesday, 25 September): Nearest neighbors I: Mostly theoretical
    k-nearest neighbors for classification, regression, and general prediction. Performance of nearest-neighbor prediction (=1NN) for noise-free function learning; how and why the distance to the nearest neighbor goes to zero. Effects of noise. Why 1NN converges to a risk of twice the minimum possible risk. Bias-variance trade-off for k-nearest-neighbors.
    Key reading: PDM, sections 6.3.3, 10.1--10.2, 10.6, 10.10--10.11
    Suggested readings:
  • Hastie et al., chapter 13 (skimming section 13.2)
  • Background readings:
  • Thomas M. Cover, "Estimation by the Nearest Neighbor Rule", IEEE Transactions on Information Theory 14 (1968): 50--55 [PDF reprint via Prof. Cover]
  • Thomas M. Cover, "Rates of Convergence for Nearest Neighbor Procedures", pp. 413--415 in B. K. Kinariwala and F. F. Kuo (eds.), Proceedings of the Hawaii International Conference on Systems Sciences (Honolulu: University of Hawaii Press, 1968) [PDF reprint via Prof. Cover]
  • Thomas M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory 13 (1967): 21--27 [PDF reprint via Prof. Cover]
  • After class: Notes (.Rmd source)
    Homework 4
    Due on Wednesday, 25 September at 10 pm
    Lecture 10 (Monday, 30 September): Nearest neighbors II: Mostly statistical
    The problem of selecting the number of nearest neighbors. Selecting k by cross-validation: v-fold versus leave-one-out. Nearest neighbors as a linear smoother, and the short-cut for leave-one-out CV.
    Key reading: PDM, sections 10.10--10.11
    Background reading:
  • Mona Azadkia, "Optimal choice of k for k-nearest neighbor regression", arxiv:1909.05495
  • Seymour Geisser, "The Predictive Sample Reuse Method with Applications", Journal of the American Statistical Association 70 (1975): 320--328
  • Seymour Geisser and William F. Eddy, "A Predictive Approach to Model Selection", Journal of the American Statistical Association 74 (1979): 153--160
  • M. Stone, "Cross-validatory choice and assessment of statistical predictions", Journal of the Royal Statistical Society B 36 (1974): 111--147
  • Lecture 11 (Wednesday, 2 October): Nearest neighbors III: Mostly computational
    Finding nearest neighbors: data structures; locality-sensitive hashing; random projections again; LSH and information retrieval.
    Key reading: PDM, sections 12.1--12.4
    Suggested reading: Leskovec et al., chapter 3
    Background reading:
  • Aristides Gionis, Piotr Indyk and Rajeev Motwani, "Similarity Search in High Dimensions via Hashing", pp. 518--529 in M. P. Atkinson et al. (eds.), Proceedings of the 25th International Conference on Very Large Data Bases [VLDB '99] (San Francisco: Morgan Kaufmann, 1999)
  • After class: Slides (.Rmd)
    Homework 5
    Due on Wednesday, 2 October at 10 pm
    Lecture 12 (Wednesday, 7 October): Recommender systems I: Broad Ideas and Technical Issues
    Background readings:
  • Gérard Biau, Benoît Cadre, Laurent Rouvière, "Statistical analysis of k-nearest neighbor collaborative recommendation", Annals of Statistics 38 (2010): 1568--1592, arxiv:1010.0499
  • Benjamin Marlin, Richard S. Zemel, Sam Roweis, Malcolm Slaney, "Collaborative Filtering and the Missing at Random Assumption", arxiv:1206.5267
  • Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl, "GroupLens: an open architecture for collaborative filtering of netnews", pp. 175--186 in Proceedings of the 1994 ACM conference on Computer supported cooperative work [CSCW '94] [HTML reprint via Prof. Resnick]
  • Upendra Shardanand and Pattie Maes, "Social Information Filtering: Algorithms for Automating ``Word of Mouth’’", pp. 210--217 in CHI '95: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems [PDF]
  • After class: Slides (.Rmd)
    Lecture 13 (Wednesday, 9 October): Recommender systems II: What's Not to Love?
    Lecture canceled due to instructor illness, but see below for slides
    Suggested reading (in this order):
  • " Recommendations Understand Area Woman Better Than Husband", The Onion 9 January 2007
  • Tom Slee, "Online Monoculture and the End of the Niche", Whimsley 15 March 2009
  • Tommy Wallach, "In Which Netflix Achieves Sentience As A Result of My Terrible Decision-making", The Toast, 4 August 2014
  • Zeynep Tufekci, "YouTube's Recommendation Algorithm Has a Dark Side", Scientific American 320:4 (April 2019): 77
  • ---, "YouTube, the Great Radicalizer", New York Times, 10 March 2018
  • Background reading:
  • Maurizio Ferrari Dacrema, Paolo Cremonesi and Dietmar Jannach, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", in Proceedings of the 13th ACM Conference on Recommender Systems [RecSys 2019], arxiv:1907.06902
  • After class: Slides (Rmd)
    Homework 6
    Due on Wednesday, 9 October at 10 pm
    Lecture 14 (Monday, 14 October)
    Everything about recommendation engines we were supposed to cover last Wednesday.
    After class: Slides (Rmd)
    Lecture 15 (Wednesday, 16 October): Information measures I: Scene-setting
    Definitions: entropy, mutual information, conditional mutual information. MI and statistical independence. Kullback-Leibler divergence (relative entropy). Information theory and coding. Information theory and hypothesis testing. Maximum likelihood vs. minimum divergence.
    Suggested reading: David Feldman, "A Brief Introduction to Information Theory, Excess Entropy and Computational Mechanics", Part I, "Information Theory" (pages 1--11) [PDF]
    Homework 7
    Due on Wednesday, 16 October at 10 pm
    Lecture 16 (Monday, 21 October): Information measures II: The information-theoretic view of prediction and estimation.
    Predictive information. Sufficient statistics. Statistical relevance bases. The information-bottleneck method. Information theory in parameter estimation.
    Background reading:
  • Naftali Tishby, Fernando C. Pereira and William Bialek, "The Information Bottleneck Method", pp. 368--377 in B. Hajek and R. S. Sreenivas (eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (Urbana, Illinois: University of Illinois Press, 1999), arxiv:physics/0004057
  • Handout: PDF (R code)
    Lecture 17 (Wednesday, 23 October:) Information measures III: Feature selection and predictive embeddings / dimension reduction.
    Selecting features to maximize predictive information; to minimize stored information. Creation of new features to capture information: the information bottleneck again; predictive embeddings; word2vec.
    Background readings:
  • Marco Baroni, Georgiana Dinu and German Kruszewski, "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", pp. 238--247 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics [ACL14] (Baltimore, MD: Association for Computational Linguistics, 2014)
  • Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arxiv:1402.3722
  • Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arxiv:1301.3781
  • Handout: PDF (R code)
    After class: Slides (.Rmd)
    Homework 8
    Due on Wednesday, 23 October at 10 pm
    Lecture 18 (Monday, 28 October): Clustering I
    Intro to clustering; k-means. Geometry of clusters.
    Key reading: PDM, sections 9.3--9.5 (skimming 9.1 and 9.2 is recommended)
    Suggested readings:
  • Hastie et al., sections 14.3.1--14.3.6
  • O'Neil, Chapters 9 and 10 ("No Safe Zone" and "The Targeted Citizen")
  • After class: Notes to lecture 18
    Lecture 19 (Wednesday, 30 October): Clustering II
    Hierarchical clustering. Clustering as locality-sensitive hashing.
    Suggested reading: Hastie et al., section 14.3.12
    Homework 9
    Due on Wednesday, 30 October at 10 pm
    Lecture 19 (Monday, 4 November): Clustering III
    Probabilistic clustering with mixture models; the EM algorithm; k-means again.
    Key reading: PDM, sections 9.2.4--9.2.5 (again), and section 9.6
    Suggested reading:
  • Hastie et al., section 8.5
  • Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on mixture models
  • Lecture 20 (Wednesday, 6 November): Clustering IV
    Using cross-validated log-likelihood to compare mixture models. Advanced probabilistic clustering: mixed membership; topic models in text mining and in genetics.
    Suggest reading: Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on mixture models
    Background readings:
  • David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003): 993--1022
  • April Galyardt, "Interpreting Mixed Membership", in Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds.), Handbook of Mixed Membership Models and Their Applications (New York; Chapman and Hall, 2016) [PDF]
  • J. K. Pritchard, M. Stephens and P. Donnelly, "Inference of population structure using multilocus genotype data", Genetics 155 (2000): 945--959
  • Homework 10
    Due on Wednesday, 6 November at 10 pm
    Lecture 21 (Monday, 11 November): Classification I
    Review of how to evaluate classifiers, of linear classifiers and of nearest-neighbor classification. Nonlinear classifiers: "kernel trick", support vector machines. Random nonlinear features.
    Key reading: PDM, sections 10.3, 10.4, 10.7, 10.10, 10.11
    Suggested readings:
  • Berk, chapter 7
  • Hastie et al., chapters 4 and 12
  • Background reading:
  • Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning", pp. 1313--1320 in D. Koller et al. (eds.), Advances in Neural Information Processing Systems 21 [NIPS 2008]
  • Lecture 22 (Wednesday, 13 November): Classification II
    Notions of fairness for classifiers; critiques of "fairness"
    Key reading: O'Neil, chapters 1, 5, 6 and 8
    Suggested readings:
  • Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, "Machine Bias", ProPublica 23 May 2016
  • Alexandra Chouldechova, "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments", arxiv:1610.07524
  • Sam Corbett-Davies and Sharad Goel, "The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning", arxiv:1808.00023
  • Simon DeDeo, "Wrong side of the tracks: Big Data and Protected Categories", pp. 31--42 in Cassidy R. Sugimoto, Hamid R. Ekbia and Michael Mattioli (eds.), Big Data Is Not a Monolith (Cambridge, Mass.: MIT Press, 2016), arxiv:1412.4643
  • Background reading:
  • Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan, "A Survey on Bias and Fairness in Machine Learning", arxiv:1908.09635
  • After class: Notes (.Rmd)
    Homework 11
    Due on Wednesday, 13 November at 10 pm
    Lecture 23 (Monday, 18 November): Classification III
    Classifier design, informative features, inferences
    Background readings:
  • Sharad Goel, Jake M. Hofman, M. Irmak Sirer, "Who Does What on the Web: A Large-Scale Study of Browsing Behavior", Sixth International AAAI Conference on Weblogs and Social Media [ICWSM 2012] [Preprint via Prof. Goel]
  • After class: Notes (.Rmd)
    Lecture 24 (Wednesday, 20 November): Trees and ensembles I
    Classification and regression trees: basics and examples
    Key reading: PDM, sections 5.2 and 10.5
    Suggested readings:
  • Berk, chapter 3
  • Hastie et al., section 9.2
  • Homework 12
    Due on Wednesday, 20 November at 10 om
    Assignment, compas_violence.csv data file
    Lecture 25 (Monday, 25 November): Trees and ensembles II
    Bagging and random forests
    Suggested readings:
  • Berk, chapters 4 and 5
  • Hastie et al., sections 8.7 (bagging) and chapter 15 (random forests)
  • After class: Notes (.Rmd)
    Homework 13
    Due on Tuesday, 26 November at 10 pm --- NOTE DATE
    Lecture 26 (Monday, 2 December): Waste, fraud and abuse I
    Bad data sources, curse of dimensionality, failure to replicate, and other causes of technical failures
    Key reading: PDM, chapter 2, section 6.5, section 1.7
    Suggested readings:
  • Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired 16:17 (June 2008)
  • Fernando Pereira, comments on Anderson, Earning My Turns, 23 June 2008
  • Luke Oakden-Rayner, "Exploring large scale public medical image datasets", arxiv:1907.12720
  • Christian Grothoff and J.M. Porup, "The NSA’s SKYNET program may be killing thousands of innocent people", Ars Technica 16 February 2016
  • Lecture 27 (Wednesday, 4 December): Waste, fraud and abuse II
    Key reading: O'Neil, TBD
    Suggested readings:
  • Maciej Ceglowski, "What Happens Next Will Amaze You"
  • Zeynep Tufekci, "We're building a dystopia just to make people click on ads", TED
  • Homework 14
    Due on Wednesday, 4 December at 10 pm