Overview

Statistical learning is about finding reliable, predictive patterns in data. That may sound arcane, but actually it shapes your world. What you see in your social media feed, what ads you get shown, whether you can get a credit card or a home mortgage, are all decided by statistical learning. Even crucial decisions about hiring, finance, medical care, criminal justice and literally life-or-death military actions are increasingly made using statistical learning. This course will be a fast-paced, hands-on introduction to the most important methods used in statistical learning, covering how and why the methods work (not just “what do I type in R?”, but that too), how they’re used in decision-making, how to evaluate them in practice, and how to use them responsibly. There will be a lot of data analysis, a fair amount of programming, some math (but not too much) and some reading (and thinking about what you’ve read).

Learning Objectives (accreditation bureaucrats look here)

After taking the class, when you’re faced with a new applied statistical-learning problem, you should be able to

  1. select appropriate methods of pattern discovery and validation, and justify your choices,
  2. use and program statistical software to implement those methods,
  3. critically evaluate the results,
  4. communicate the results to professional colleagues who are not also statisticians or data-analysts, and
  5. be aware of the most common ethical issues involved.

Prerequisites

For this semester, the prerequisite is passing 36-401 (modern regression) with a grade of C or better. The only exceptions are for graduate students taking the graduate section (36-662; see below).

Graduate students and 36-662

This course is primarily intended as an advanced elective for undergraduate students majoring in statistics, and for graduate students in the master of statistical practice (MSP) degree. Graduate students from other departments are also welcome, if there’s room. All graduate students must take the course as 36-662. (Those who register for 36-462 will be dropped.) Formally, 662 has no prerequisites, but students are expected to have a background at least equivalent to 36-401 and its prerequisites: linear regression modeling, probability, mathematical statistics, linear algebra, calculus, and competence in using R for data analysis. This class does not have the resources to help those without this background catch up.

Undergraduates must take the course as 36-462; those who register for 662 will be dropped.

462/662 vs. 465/665 vs. …

36-462/662 is intended as a practical course on methods and methodology, covering a lot of different techniques, the principles used to choose between techniques, and selected applications. 36-465/665, “Conceptual Foundations of Statistical Learning”, is more focused on the theoretical foundations common to all methods of statistical learning. If you are debating between this course and something from another department (e.g., 10-601), you may find the detailed topic list below helpful.

Instructors

Name E-mail Office
Professor Dr. Cosma Shalizi cshalizi [at] cmu.edu Baker Hall 229C
Teaching assistants Mr. Raghav Bansal, Ms. Victoria Lin, Mr. Luca Masserano not to be bothered with e-mails, nor in their offices

Topics and Techniques to Be Covered

  • Statistical learning as predictive modeling, guided by decision theory and optimization
    • Regression \(=\) predicting a numerical variable; classification \(=\) predicting a categorical variable
    • Review of linear and logistic regression and linear classifiers as prediction methods
    • Measuring the performance of predictive models. Regression: mean squared error, bias-variance decomposition, bias-variance trade-off. Classification: accuracy, false positives and false negatives; predictive values; ROC curves.
    • Prediction as a decision problem; elements of decision theory. “Risk” as expected error, risk-minimization as the goal of decision-making. Extension of bias-variance trade-off to approximation-estimation trade-off.
    • Optimization: general concepts; key algorithms (gradient descent, Newton’s method, stochastic gradient descent, simulated annealing); role of penalties and constraints in learning.
    • Estimating risk and prediction error: optimistic bias of fit to training data; why the bias grows with model complexity; large-sample estimates of prediction risk from optimization theory; computational estimates of prediction risk from cross-validation.
  • “Supervised” prediction methods for classification and regression
    • Nearest neighbor methods: basic ideas (averaging nearby-cases), trade-offs in the number of neighbors, mathematical and statistical properties of nearest neighbors, computational considerations
    • Tree-based methods: basic ideas (averaging similar cases), trade-offs in the size of the tree, mathematical and statistical properties
    • Methods for combining multiple models (especially trees): bagging, random forests, boosting, stacking
    • Kernel-based methods: kernels as similarity measures; averaging similar cases; selection of kernels; support vector machines; combining random functions; mathematical and computational issues.
  • “Unsupervised”, not-always-predictive methods
    • Dimension reduction: forming new features by combining old ones; linear dimension reduction by principal components, factor models and random projections; nonlinear dimension reduction by local linear embedding and by diffusion maps.
    • Clustering: assigning data points to categories without knowing the right categories; geometric clustering and the k-means algorithm; probabilistic mixture models.
  • Case studies and concerns
    • Recommendation engines: “You may also like”; recommender systems based on nearest neighbors, factor models, clustering, hybrid systems; consequences of recommendation engines.
    • Fairness in prediction: Classifiers and risk scores in commercial lending decisions, criminal justice. The COMPAS/ProPublica controversy. Definitions of fairness; trade-offs between different notions of fairness, and between fairness and predictive accuracy.
    • Policing: Predictive models of crime and their use in policing. Concerns about effectiveness, fairness, and civil liberties.
    • Sources of technical failure when applying statistical learning: data quality; bad measurement; weak predictive relationships; non-existent predictive relationships; changing predictive relationships; the curse of dimensionality.
    • Should we do it / what are we doing?: Sources of ethical and political failure when applying statistical learning. Privacy (or lack thereof). Human rights concerns. Social consequences, e.g., are we “building a dystopia to make each other click on ads”?

Course Mechanics and Grading

There are three reasons you will get assignments in this course. In order of decreasing importance:

  1. Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows that you only thought you understood it.
  2. Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
  3. Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be three kinds of assignment in this course.

  • Homework
    • Every week will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve doing or applying some theory, analyzing real data sets on the computer, and communicating the results.
    • All homework will be submitted electronically through Gradescope. Most weeks, homework will be due at 6 pm on Thursdays. Homework assignments will always be released by Friday of the previous week, and sometimes before.
    • There are specific formatting requirements for homework — see below.
  • Online assignments
    • Most homework assignments will be accompanied by short-answer or multiple choice questions, to be done on Canvas, before the homework itself is due. These will typically be due on Monday nights; consult Canvas for the exact schedule.
  • Classroom exercises
    • Most lectures will have in-class exercises. These will be short (10–20 minutes) assignments, emphasizing problem solving, done in class in small groups of no more than four people. The assignments will be given out in class, and must be turned in electronically, by Gradescope, by the end of the day. On most days, a randomly-selected group will be asked to present their solution to the class.
    • The exercises will either be problems from that week’s homework, or close enough that seeing how to do the exercise should tell you how to do some of the problems.

There will be no exams.

Time expectations

You should expect to spend 5–7 hours on assignments every week, averaging over the semester. (This follows from the university’s rules about how course credits translate into hours of student time.) If you find yourself spending much more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

  • Class exercises: 10%. All exercises will have equal weight. Your lowest five exercise grades will be dropped; if you complete all exercises with a score of at least 60%, your lowest six grades will be dropped.
  • Homework: 90%. All homework assignments will be weighted equally. Most assignments will include online short-answer questions, due before the main assignment (see above); these will be incorporated into the grades for the corresponding assignments. The online questions will amount to about 10% of your final grade.
    • Your lowest three homework grades will be dropped. If you complete all homework assignments, with a minimum score of at least 60%, your lowest four homework grades will be dropped.
    • Late homework will not be accepted for any reason.

Dropping your lowest grades lets you schedule interviews, social engagements, and other non-academic uses of your time flexibly, and without my having to decide what is or is not important enough to change the rules. If something — illness, family crisis, or anything else — is interfering with your ability to do the work of the class, come talk to me.

Letter-grade thresholds

Grade boundaries will be as follows:

Grade Score range
A [90, 100]
B [80, 90)
C [70, 80)
D [60, 70)
R [0, 60)

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading

If you think a particular assignment was wrongly graded, tell me as soon as possible. Direct questions or complaints about your grades to me; the TAs have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve or need a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, “what is the least amount of work I need to do in order to get the grade I want?” is a much worse way to approach higher education than “how can I learn the most from this class and from my teachers?”.

Lectures

Lectures will amplify the readings, provide examples and demos, and answer questions and generally discuss the material. They are also when you will do the graded in-class assignments which will help consolidate your understanding of the material, and help with your homework.

Readings

There are (up to) three kinds of readings associated with each lecture.

  • Key readings: Usually from required textbooks, occasionally from other sources which will be made available well in advance. Do this reading before coming to lecture, and lecture will make a lot more sense. (Also, the online exercises will be easy.)
  • Suggested readings: Usually easier-to-read academic papers, or portions of the non-required texts, which will deepen your understanding of what we’re doing in lecture and in the homework, but aren’t crucial. Occasionally, non-academic pieces about the material in the news.
  • Background readings: Usually harder-to-follow or more historical papers, which give further details about methods, applications, and origins.

TL;DR: definitely do the key readings before class; try to do the suggested readings too, they’re easy; the background readings are if you’re thirsty to know more.

Electronics

When we meet in person, please don’t use any electronic devices during lecture: no laptops, no tablets, no phones, no recording devices, no watches that do more than tell time. The only exception to this rule is for electronic assistive devices, properly documented with CMU’s Office of Disability Resources.

(The no-electronics rule is not arbitrary meanness on my part. Experiments show, pretty clearly, that students learn more in electronics-free classrooms, not least because your device isn’t distracting your neighbors, who aren’t as good at multitasking as you are.)

Zoom

We’re going to be meeting on Zoom for the first two weeks of the semester. Class will still meet at our regular time. You will find a link to our Zoom meetings in the class calendar on Canvas; please don’t share it outside of class. Class meetings will not be recorded, so show up, take notes, ask questions, and speak freely, as you would in person. The biggest difference is that we won’t do group exercises via Zoom, but you will instead have individual after-class exercises.

(In the unfortunate event we have to go back to Zoom after the first two weeks, these rules will still hold.)

Canvas, Gradescope and Piazza

For the most part, you will submit your work through Gradescope. This includes both your written homework assignments and the class exercises. The online assignments that precede most homeworks, however, will need to be done on Canvas. We will also use Canvas as a gradebook, and to distribute some readings that can’t be publicly posted here.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. (Postings will not be anonymous to instructors.) Anonymity will go away for everyone if it is abused. As noted above, homework will generally be due at 6 pm on Thursday; however, you should not expect the instructors to answer questions on Piazza (or by e-mail) outside normal working hours. (We may, but you shouldn’t expect it.)

R, R Markdown, and Reproducibility

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). In this class, you will use it for every homework. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week’s data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code are not acceptable; your document must be humanly readable.

All homework assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code (and not some brilliant trick that you then forget, or obscure bug that you didn’t notice). For help on using R Markdown, see “Using R Markdown for Class Reports”. (Part of the first lecture will be going over the basics of R Markdown.)
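For concreteness, here is roughly what a minimal R Markdown file looks like; the title and chunk contents are placeholders of my own, not an actual assignment:

````markdown
---
title: "Homework N"
author: "Your Name"
output: pdf_document
---

Text explaining your analysis goes here, written in ordinary prose.

```{r}
# R code goes in chunks like this; both the code and its
# output appear in the knitted PDF
summary(cars)
```
````

Knitting this file (e.g., with the Knit button in RStudio) produces a single PDF containing the prose, the code, and the code’s output.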

Format Requirements for Homework

For each assignment, you should write your homework in R Markdown, knit it to a humanly readable document in PDF format, and upload the PDF to Gradescope. This is, ordinarily, what you will be graded on. However, it is important that you keep the R Markdown file around, because every week I will randomly select a few students and ask them to send me their R Markdown files, so that I can re-run them and check that they do, in fact, produce the documents that were turned in. (You should expect to be picked about once a semester.) If they don’t match, I will have questions, and it will hurt your grade.

(Gradescope makes it much easier for multiple graders to collaborate, but it doesn’t understand R Markdown files, just PDFs.)

You’ll need to do math for some homework problems. R Markdown provides a simple but powerful system for type-setting math. (It’s based on the LaTeX document-preparation system widely used in the sciences.) If you can’t get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by the last homework.

Solutions

Solutions for all homework will be available, after their due date, through Canvas. Please don’t share them with anyone, even after the course has ended. This very much includes uploading them to websites.

Office Hours

Day Time Instructor Location
Tuesday 4:00–5:00 pm Ms. Lin Gates Hall 5716
Wednesday 1:25–2:45 pm Prof. Shalizi Baker Hall 229C
Wednesday 8:30–9:30 pm Mr. Bansal Zoom, link on Canvas
Thursday 9:00–10:00 am Mr. Masserano Zoom, link on Canvas

If you want help with computing at an in-person office hour, please bring a laptop.

If you cannot make the regular office hours, or have concerns you’d rather discuss privately, please e-mail me about making an appointment.

Textbooks

The following books are required:

  1. D. J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining. Cambridge, Massachusetts: MIT Press, 2001
  2. Richard A. Berk, Statistical Learning from a Regression Perspective (3rd edition, New York: Springer, 2020), doi:10.1007/978-3-030-40189-4
  3. Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Books, 2016)
  4. Michael Kearns and Aaron Roth, The Ethical Algorithm: The Science of Socially Aware Algorithm Design (New York: Oxford University Press, 2019)

You can download a copy of Berk from the university library via the link above; the other books are also available electronically from the library, but for limited numbers of simultaneous users. You should make sure you have access to copies of all these books.

The following books are recommended but not required:

  • Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition, Berlin: Springer-Verlag, 2009) [http://www-stat.stanford.edu/~tibs/ElemStatLearn/]
  • Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman, Mining of Massive Datasets (third edition, Cambridge, England: Cambridge University Press, 2019) [http://www.mmds.org/]
  • Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View (Cambridge: Cambridge University Press, forthcoming) [http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/]
  • Paul Teetor, The R Cookbook (Sebastopol, California: O’Reilly, 2011) [R’s help files are organized around “What does command X do?”; this is organized around “I want to do Y, what commands do I need to use?”]

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from a source; this includes all mathematical derivations, computer code and figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbooks and recommended readings, lecture slides and demos, any resources provided through the class website, solutions provided to this semester’s previous assignments in this course, books and papers in the library, or legitimate online resources (including this class’s Piazza), though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

You are free to discuss homework, in general terms, with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students, the instructors, or your teachers in other current classes at CMU, until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course in previous semesters is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not “consult” them, do not read them, do not ask your friend who took the course last year if they “happen to remember” or “can give you a hint”. Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn’t give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to formal disciplinary action, under the terms of the university’s policy on academic integrity.

You must complete “homework 0” on the content of the university’s policy on academic integrity, and on these course policies. This assignment will not factor into your grade, but you must complete it, with a grade of at least 80%, before you can get any credit for any other assignment.

Schedule, Lecture Notes and Readings

Lecture notes and readings will be linked to here, as they become available. Lecture topics are subject to change depending on how the course is going. Don’t expect required and suggested readings to finalize until about a week before each lecture.

Abbreviation Book
PDM Principles of Data Mining
Berk Statistical Learning from a Regression Perspective
WMD Weapons of Math Destruction
TEA The Ethical Algorithm
Hastie et al. The Elements of Statistical Learning
ADAfaEPoV Advanced Data Analysis from an Elementary Point of View
TBD To be determined

Orientation and reminders

Lecture 1 (Tuesday, 18 January): Introduction to the course

“Statistical learning”, strictly speaking, is about using data to find or learn rules which will predict well on new cases, usually by optimizing some function of the data. It emerged as a field between about 1980 and 2000, and especially between about 1985 and 1995, when computer scientists interested in making machines learn took ideas from statistics about how to fit and evaluate predictive models (as well as more technical tools from theoretical statistics), and applied them to new kinds of models and new problems. This course will introduce you to some of the most important methods used in practical applications of statistical learning, to the tools used to evaluate how well those methods work in particular cases (i.e., how we know we’re not fooling ourselves), and to a selection of important applications and their implications. Statistical learning is interesting intellectually because it sheds new light on puzzles about learning, knowledge, induction and scientific method which thoughtful people all over the world have wrestled with for thousands of years. It’s interesting practically because it is used to support data-driven decision-making, and increasingly to automatically make consequential decisions. In this way it’s the latest phase in a long history of trying to base decisions on data, and of collecting more and more data to aid decision-making.

Slides, R Markdown file that generated the slides, printable pdf

Readings

  • Key readings:
  • Suggested readings:
    • Ajay Agrawal, Joshua S. Gans and Avi Goldfarb, “Artificial Intelligence: The Ambiguous Labor Market Impact of Automating Prediction”, Journal of Economic Perspectives 33 (2019): 31–50, doi:10.1257/jep.33.2.31
  • Background readings, scientific origins and implications:
  • Background readings, historical roots:

Homework

Thursday, 20 January: Lecture 2, Lightning review of linear regression

Linear regression is a prediction method. For any quantitative variable \(Y\) which we’d like to predict, and any predictor variable(s) \(X\), there is an optimal regression function, \(\mu(X)\), which is optimal in the sense that it minimizes the expected squared error, \(\mathbb{E}\left[ (Y-\mu(X))^2 \right] \leq \mathbb{E}\left[ (Y-m(X))^2 \right]\) for any other function \(m\) of \(X\). Linear regression is not that optimal regression function. If we limit ourselves to linear functions of \(X\), ones which can be written in the form \(b_0 + b X\), there’s an optimal intercept \(\beta_0\) and optimal coefficient(s) \(\beta\), which minimize the expected squared error, \(\mathbb{E}\left[ (Y- (\beta_0 + \beta X))^2 \right] \leq \mathbb{E}\left[ (Y - (b_0 + b X))^2 \right]\). That optimal coefficient is always \(\beta = \mathrm{Var}\left[ X \right]^{-1} \mathrm{Cov}\left[ X,Y \right]\), and then the optimal intercept is always \(\beta_0 = \mathbb{E}\left[ Y \right] - \beta \mathbb{E}\left[ X \right]\). This is all true no matter how nonlinear the real regression function might be, and whether or not anything is Gaussian. Least squares regression tries to estimate this optimal linear model from the data. OLS can conveniently be done in matrix form, \(\hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}\). (This is really the sample version of \(\mathrm{Var}\left[ X \right]^{-1} \mathrm{Cov}\left[ X, Y \right]\).) Predictions on the training data then become \(\mathbf{x} (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}\), which reveals the crucial role of the “hat matrix”, \(\mathbf{x} (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T\). From this predictive viewpoint, lots of what we teach you in linear models classes about regression diagnostics and parameter inference doesn’t matter. 
It also becomes clear that linear regression is a very special case whose main virtues are clean math and simple calculations, not powerful predictions or accurate representation. We teach it to you as the first step on the ladder to better methods.
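As a quick check of the matrix formulas above, the following R sketch (using simulated data of my own invention, not a course data set) computes \(\hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}\) and the hat matrix by hand, and confirms that the results match R's built-in `lm()`:

```r
# Simulate a small regression problem
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 3 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)                       # design matrix, with intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # (x^T x)^{-1} x^T y
H <- X %*% solve(t(X) %*% X) %*% t(X)       # the hat matrix
fitted.by.hand <- H %*% y                   # predictions on the training data

# lm() does the same calculation internally:
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta.hat), as.numeric(coef(fit)))  # TRUE
```

(In practice you would never invert \(\mathbf{x}^T \mathbf{x}\) explicitly for large problems; `lm()` uses a more numerically stable decomposition, but the answer is the same.)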

Slides, .Rmd, printable PDF

Readings

  • Key reading:
    • PDM, sections 11.1–11.3
    • Berk, sections 1.3 and 1.4 through 1.4.3 (the rest of section 1.4 is optional)
    • ADAfaEPoV, chapter 2
  • Suggested readings:
    • Hastie et al., sections 2.3.1 and 2.6, and chapter 3 (especially through section 3.4)
  • Background reading:
    • Richard A. Berk, Regression Analysis: A Constructive Critique (Thousand Oaks, California: Sage, 2004)
    • R. W. Farebrother, Fitting Linear Relationships: A History of the Calculus of Observations, 1750–1900 (New York: Springer-Verlag, 1999), doi:10.1007/978-1-4612-0545-6
    • Cosma Rohilla Shalizi, The Truth About Linear Regression (unpublished book draft, 2015)
    • Norbert Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time-Series: with Engineering Application (Cambridge, Massachusetts: The Technology Press, 1949 [but first published as a classified technical report, Washington, D.C.: National Defense Research Council, 1942])

Tuesday, 25 January: Lecture 3, Linear Classifiers and Logistic Regression

Regression is about predicting quantitative, numerical variables. If we want to predict a qualitative, categorical variable, especially a binary one, we call that classification. A very basic way of doing classification is the prototype method: pick a point to represent each class (its “prototype”), and classify a new point as belonging to the class with the closer prototype. This is a special case of the broader family of linear classifiers, where the boundary between classes is linear. Linear classifiers, like all classifiers, can be evaluated in terms of their accuracy (= probability that a new case is correctly classified), or equivalently their mis-classification rate (= 1 − accuracy). Slightly more refined analyses look at false positive and false negative rates, and receiver operating characteristics (how those error rates vary with model parameters). Logistic regression is (despite its name) a linear classifier, plus the assumption that the probability of being in one class or another varies in a certain way with distance from the boundary. Because it allows for some uncertainty, it’s a more informative model; evaluating it involves not just classification accuracy, but also looking at whether those probabilities are properly calibrated (so that events it says are 50-50 happen about half the time, etc.).
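The prototype method can be sketched in a few lines of R. The two-class data here is simulated for illustration (class means and sample size are my own choices, not from the course):

```r
# Simulate two classes of points in the plane, centered at (0,0) and (2,2)
set.seed(1)
n <- 100
class <- rep(c(0, 1), each = n / 2)
x <- cbind(rnorm(n, mean = 2 * class), rnorm(n, mean = 2 * class))

# One prototype per class: the mean of that class's training points
proto0 <- colMeans(x[class == 0, ])
proto1 <- colMeans(x[class == 1, ])

# Classify each point by the nearer prototype (squared Euclidean distance)
d0 <- rowSums(sweep(x, 2, proto0)^2)
d1 <- rowSums(sweep(x, 2, proto1)^2)
yhat <- as.numeric(d1 < d0)

accuracy <- mean(yhat == class)   # fraction correctly classified
```

Note that the decision boundary here is the perpendicular bisector of the segment between the two prototypes, i.e., a straight line, which is why this is a linear classifier.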

Slides, .Rmd, PDF

Reading

  • Key reading: PDM, sections 10.1–10.2, 10.4, 10.7, 10.10, 10.11 (skimming that last) and 11.3
  • Suggested reading:
    • Hastie et al., chapter 4
    • ADAfaEPoV, chapter 11
  • Background readings:
    • Tilmann Gneiting, Fadoua Balabdaoui and Adrian E. Raftery, “Probabilistic Forecasts, Calibration and Sharpness”, Journal of the Royal Statistical Society B 69 (2007): 243–268, doi:10.1111/j.1467-9868.2007.00587.x
    • Deborah G. Mayo and Aris Spanos, “Methodology in Practice: Statistical Misspecification Testing”, Philosophy of Science 71 (2004): 1007–1025

Prediction, Decision, and Optimization

Thursday, 27 January: Lecture 4, Prediction as a decision problem

We’ve now seen three kinds of prediction: regression (guessing a quantitative variable), classification (guessing a categorical variable), and density estimation (guessing a probability distribution). “What prediction should I make for \(Y\) when I’ve seen that \(X=x\)?” is a decision problem. There are three elements to any decision problem: actions, states, and information. Loss functions say how bad a given action is in a given state. Squared error is a natural loss function for regression, mis-labeling or “0/1 loss” for classification, and (negative) log-likelihood for probabilities. A strategy is a rule which tells us what action to take on the basis of the available information. Risk is the expected loss incurred from following a strategy, and risk minimization is a sensible guide to decision making, though not the only imaginable one. There is always a mathematically-possible strategy which minimizes the risk, but we’re usually restricted to a certain class of feasible strategies we can actually implement; the difference between the best-possible strategy and the best feasible strategy is approximation error. Risk is defined using the true distribution of the data, but we don’t have that distribution, only data sampled from it. The gap between the distribution and the training data always introduces some estimation error into our choice of strategy. Narrowing the class of allowed strategies generally reduces this estimation error, but at the cost of increasing the approximation error. This trade-off between approximation and estimation error is fundamental to learning, and includes the bias-variance trade-off as a special case.
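The approximation-error idea can be seen numerically. In the following R sketch, the true regression function is one I made up (\(\sin(x)\)); we Monte Carlo-estimate the risk, under squared-error loss, of the best possible strategy and of the best linear strategy. The gap between the two is (an estimate of) the approximation error of the linear class:

```r
# Monte Carlo estimate of risk = expected squared-error loss
set.seed(7)
n <- 1e5
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)   # true regression function is sin(x)

# Risk of the unbeatable strategy, mu(x) = sin(x): just the noise variance
risk.optimal <- mean((y - sin(x))^2)

# Best linear strategy: beta = Cov[X,Y]/Var[X], beta0 = E[Y] - beta*E[X]
b  <- cov(x, y) / var(x)
b0 <- mean(y) - b * mean(x)
risk.linear <- mean((y - (b0 + b * x))^2)

# risk.linear exceeds risk.optimal; the excess is the approximation error
# of restricting ourselves to linear strategies
```

No amount of data shrinks this gap; only enlarging the class of strategies (e.g., moving beyond linear functions) can.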

Slides, .Rmd, PDF

Reading

  • Key reading:
    • PDM, chapter 7
    • ADAfaEPoV, chapter 3
  • Suggested readings:
    • Berk, section 1.6 (skipping 1.6.10, “Basis Functions” and 1.6.11, “The Curse of Dimensionality”)
    • Hastie et al., chapter 7
  • Background readings:
    • Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning, and Games (Cambridge, England: Cambridge University Press, 2006)
    • Tilmann Gneiting, “Making and Evaluating Point Forecasts”, Journal of the American Statistical Association 106 (2011): 746–762, doi:10.1198/jasa.2011.r10138

Homework

  • Homework 0 due at 6 pm
  • Homework 1 due at 6 pm
  • Homework 2: assignment, [spam data set as a CSV file](hw/02/spam.csv)

Tuesday, 1 February: Lecture 5, Essentials of Optimization

In statistical learning, fitting a model usually means solving an optimization problem, of the form “minimize the loss on the training data”, or “minimize the sum of loss on the training data plus a penalty for funny-looking models”. This makes optimization fundamental to statistical learning, so we need to go over the essentials. This starts with distinguishing local and global optima. After that, we need some facts you may remember from calculus: if we’re optimizing a function \(f\) of one variable \(\theta\), a necessary condition for \(\theta^*\) to be an optimum is \(\frac{df}{d\theta}(\theta^*) = 0\), unless we’re at a boundary of the space of \(\theta\). If we meet that first-order condition, a sufficient condition for a local minimum is that \(\frac{d^2 f}{d\theta^2}(\theta^*) > 0\); minima which don’t meet this second-order condition exist but are delicate and weird. If we’re optimizing a function of more than one variable, the first-order condition is that the gradient (vector of partial derivatives) \(\nabla f\) is zero, \(\nabla f(\theta^*) = 0\); the second-order condition for a minimum is that the Hessian (matrix of second partial derivatives) \(\nabla \nabla f\) is “positive definite”, \(\nabla \nabla f(\theta^*) \succ 0\). (You learned what that means in matrix algebra, but we’ll go over it again, and why it’s reasonable.) Close enough to a local optimum, most smooth functions look quadratic, so they can be approximated using only their gradient and their Hessian. This also tells us how much we miss by not being quite at the optimum, which will be useful later.
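
These conditions are easy to check numerically. A quick R sketch, on a made-up quadratic whose minimum we know by inspection:

```r
# Check the first- and second-order conditions for
# f(theta) = (theta1 - 1)^2 + 2 (theta2 + 3)^2, minimized at (1, -3)
f <- function(theta) (theta[1] - 1)^2 + 2 * (theta[2] + 3)^2
num_grad <- function(f, theta, h = 1e-6) {  # central-difference gradient
  sapply(seq_along(theta), function(i) {
    e <- replace(numeric(length(theta)), i, h)
    (f(theta + e) - f(theta - e)) / (2 * h)
  })
}
theta_star <- c(1, -3)
max(abs(num_grad(f, theta_star)))   # first-order condition: gradient is ~ 0
eigen(diag(c(2, 4)))$values         # Hessian eigenvalues: all > 0, so positive definite
```

Positive-definiteness of the Hessian is exactly “all eigenvalues strictly positive”, which is what the last line checks.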

Slides, .Rmd, printable PDF

Reading

  • Key reading:
    • PDM, sections 8.1, 8.3.1–8.3.2

Thursday, 3 February: Lecture 6, Algorithms for Optimization

In statistical learning, it’s not usually possible to set up the equations for the first- and second-order conditions, \(\nabla f(\theta) = 0\), \(\nabla\nabla f(\theta) \succ 0\), and explicitly solve for an optimum \(\theta^*\). Instead, we have to use numerical algorithms to approximately compute \(\theta^*\). The calculus we went over last time suggests the most basic and widely-used of these algorithms, gradient descent: start somewhere, find the gradient, take a small step in that direction, and repeat until the gradient approaches zero. An important refinement is Newton’s method: basically, do gradient descent, but take smaller steps where the function is more curved (as indicated by its Hessian). For large data sets, calculating the loss on all data points can take too much time, so we use a random sample of the data at each step, giving us stochastic gradient descent and stochastic Newton’s method, among other stochastic approximation algorithms. Whether these algorithms work well depends on the properties of the function we’re trying to optimize, and if it’s not very smooth we may be better off using simulated annealing: move in a random direction, but back-track if it makes things worse, and get more inclined to back-track the longer we’ve been working. It is annoying to have to think about multiple different optimization algorithms, rather than having One Algorithm to Rule Them All, but there are real trade-offs of generality vs. computing time vs. optimization error, and also fundamental limits which say that no algorithm can always out-perform others across all problems (the “no free lunch theorems”). No optimization algorithm gives us the exact optimum, which in statistical learning introduces an extra source of error, optimization error. Spending more computer time reduces the optimization error, but it’s not worth going on once the optimization error is small compared to the estimation error. Sometimes quite crude optimization is all we really need.
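
Gradient descent really is only a few lines of code. A minimal R sketch (the objective, step size, and tolerance are invented for the example, not tuned recommendations):

```r
# Minimal gradient descent: step against the gradient until it is nearly zero
grad_descent <- function(grad, theta0, step = 0.1, tol = 1e-8, maxit = 1e4) {
  theta <- theta0
  for (i in 1:maxit) {
    g <- grad(theta)
    if (sum(g^2) < tol) break   # stop when the first-order condition nearly holds
    theta <- theta - step * g
  }
  theta
}
# minimize f(theta) = (theta - 4)^2, whose gradient is 2 (theta - 4)
theta_hat <- grad_descent(function(th) 2 * (th - 4), theta0 = 0)
theta_hat   # should be very close to 4
```

Stochastic gradient descent replaces `grad(theta)` with the gradient of the loss on a random subsample at each pass through the loop.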

Slides, .Rmd, printable pdf

Reading

  • Key reading:
    • PDM, sections 8.2, 8.3.3–8.3.6, 8.5, 8.6
  • Suggested reading:
    • Léon Bottou and Olivier Bousquet, “The Tradeoffs of Large Scale Learning”, pp. 351–368 in Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning (Cambridge, Massachusetts: MIT Press, 2012), [http://leon.bottou.org/publications/pdf/mloptbook-2011.pdf]
  • Background reading:
    • Arthur E. Albert and Leland A. Gardner, Jr., Stochastic Approximation and Nonlinear Regression (Cambridge, Massachusetts: MIT Press, 1967)
    • John H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence (2nd edition, Cambridge, Massachusetts: MIT Press, 1992; first edition Ann Arbor, Michigan: University of Michigan Press, 1975)
    • Herbert Robbins and Sutton Monro, “A Stochastic Approximation Method”, Annals of Mathematical Statistics 22 (1951): 400–407, doi:10.1214/aoms/1177729586
    • David H. Wolpert and William G. Macready, “No free lunch theorems for optimization”, IEEE Transactions on Evolutionary Computation 1 (1997): 67–82, doi:10.1109/4235.585893, Santa Fe Institute Working Paper 95-02-010

Homework

Tuesday, 8 February: Lecture 7, Penalized, Constrained, and Regularized Optimization

Many statistical learning methods don’t just try to minimize the risk within some feasible class of strategies. Instead, they add penalties, which make it harder to select some strategies, unless they really improve the risk. This is an example of a general tactic in optimization. Penalties can arise because we find some sorts of solution expensive or distasteful, and want to be sure they’re worth it. Another reason we use penalties is that some optimization problems are “unstable” or “ill-posed”, so that small changes to the function or the data give wildly different optima, and adding penalties can stabilize or regularize them. In statistical learning, this especially arises when we’re dealing with high-dimensional data, where the number of variables is comparable to, or greater than, the number of data points. Even low-dimensional data can be collinear, which leads us to want regularized linear regression. The two most common penalties for linear models are the sum of the absolute values of the coefficients (\(L_1\)) and the sum of the squared coefficients (\(L_2\)); these are respectively called lasso and ridge regression, and lead to different kinds of models. A final way penalties arise is as substitutes for constraints on the optimization, restrictions which cut down the feasible set, by requiring that the parameters satisfy some set of equations and/or inequalities. Remarkably enough, it is generally possible to turn a constrained optimization problem with \(p\) variables under \(q\) constraints into an unconstrained problem with \(p+q\) variables: there is one penalty term for each constraint, and the \(q\) new variables, the Lagrange multipliers, make the penalties strong enough to enforce the constraint (“a fine is a price”). This goes the other way, too: penalties are generally equivalent to constraints. 
Optimizing under constraints is an important tool in economics and operations research, sometimes called mathematical programming, and very efficient algorithms exist for the special class of convex programming problems; this makes it very useful when we can show our learning problem can be cast as a convex programming problem. Time permitting: “Comrades, let’s optimize!”
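
Ridge regression has a closed form, which makes the effect of the penalty easy to see. A minimal R sketch on simulated, nearly-collinear data (the data and the value of lambda are invented for illustration):

```r
# Ridge regression via its closed form, (X'X + lambda I)^{-1} X'y
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.01)   # nearly collinear predictors
X <- cbind(x1, x2)
y <- x1 + x2 + rnorm(n)
ridge <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
b_ols   <- ridge(X, y, 0)    # lambda = 0 recovers ordinary least squares
b_ridge <- ridge(X, y, 10)   # the penalty shrinks the coefficient vector
sum(b_ridge^2) < sum(b_ols^2)
```

Increasing lambda always shrinks the \(L_2\) norm of the solution; the lasso’s \(L_1\) penalty instead tends to set coefficients exactly to zero, which is why it doubles as a variable-selection method.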

Slides, .Rmd

Reading

Thursday, 10 February: Lecture 8, Constrained Optimization continued

See under lecture 7.

Homework

Tuesday, 15 February: Lecture 9, Estimating Risk

The goal of decision theory is to find the strategy (from the “feasible” set of strategies) which minimizes the risk, but “risk” is defined using the true data-generating distribution, which we do not know. The “empirical risk” of a strategy is simply its mean loss on the training data, and “empirical risk minimization” is selecting the (feasible) strategy with the lowest mean loss on the training data. (Examples: least squares, maximum likelihood.) This is always biased as an estimate of the actual risk, and the bias is always in the optimistic direction. We should therefore never just evaluate predictive models in-sample. There are some simple (too simple?) formulas for estimating out-of-sample prediction error, based on the empirical risk and some measures of model complexity; these follow from the basic calculus facts about optimization, and a little probability theory.
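
The optimistic bias of the empirical risk is easy to see by simulation. An R sketch (the true curve, noise level, and polynomial degree are all invented for the example):

```r
# Empirical risk is optimistically biased: fit an over-flexible polynomial
# by least squares, then compare in-sample MSE to MSE on fresh data
set.seed(2)
make_data <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- make_data(50)
test  <- make_data(10000)     # a stand-in for the true distribution
fit <- lm(y ~ poly(x, 10), data = train)
mse_in  <- mean(residuals(fit)^2)
mse_out <- mean((test$y - predict(fit, newdata = test))^2)
mse_in < mse_out   # training error understates the true risk
```

The gap between `mse_out` and `mse_in` is the “optimism” that the complexity-based formulas in this lecture try to estimate.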

Slides, .Rmd file

Thursday, 17 February: Lecture 10, Cross-Validation

We ended the last lecture with a very general formula for estimating the risk of a model which we fit by minimizing the empirical risk on the training data. An alternative, widely used in practice, is cross-validation: divide the data into parts, fit our model on one part, see how well it can predict the other part, and then swap roles. This estimates how well our models can generalize to new data, by having them generalize to new data. There are two main forms of cross-validation, “leave-one-out” and “v-fold” (or “k-fold”), with specific advantages and disadvantages. To summarize what we’ll go over in more detail: k-fold CV tends to be harsher on more complicated models than leave-one-out, which can have advantages for model selection when one of the models we’re selecting among contains the truth. But leave-one-out often has better predictive performance, especially when none of our models is right. Leave-one-out is usually much slower, computationally, than k-fold CV, but there are short-cuts in some special cases.
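
A bare-bones v-fold CV loop, written out in base R rather than with a package (the data-generating process and the candidate models are invented for the example):

```r
# 5-fold cross-validation for choosing a polynomial degree
set.seed(3)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)
folds <- sample(rep(1:5, length.out = n))   # random fold assignments
cv_mse <- function(degree) {
  errs <- sapply(1:5, function(k) {
    fit <- lm(y ~ poly(x, degree), data = d[folds != k, ])   # train on 4 folds
    mean((d$y[folds == k] - predict(fit, newdata = d[folds == k, ]))^2)
  })
  mean(errs)   # average held-out error across the 5 folds
}
scores <- sapply(1:10, cv_mse)
best_degree <- which.min(scores)
```

Leave-one-out CV is the same loop with n folds of one point each; for linear smoothers there are algebraic short-cuts that avoid refitting n times.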

Slides, .Rmd

Reading

  • Key readings:
    • PDM, chapter 10, especially sections 10.1, 10.2 and 10.10
    • ADAfaEPoV, chapter 3
  • Suggested reading:
    • Hastie et al., sections 7.1–7.6 and 7.10, 7.12
  • Background reading:
    • Seymour Geisser, “The Predictive Sample Reuse Method with Applications”, Journal of the American Statistical Association 70 (1975): 320–328, doi:10.1080/01621459.1975.10479865
    • Seymour Geisser and William F. Eddy, “A Predictive Approach to Model Selection”, Journal of the American Statistical Association 74 (1979): 153–160, doi:10.1080/01621459.1979.10481632
    • M. Stone, “Cross-validatory choice and assessment of statistical predictions”, Journal of the Royal Statistical Society B 36 (1974): 111–147, [http://www.jstor.org/stable/2984809]

Homework

“Supervised” Methods for Regression and Classification

Tuesday, 22 February: Lecture 11, Nearest Neighbors

The simplest, and one of the oldest, nonparametric methods for prediction is the nearest neighbor method: when we want to predict what will happen in a new case, we find the case with a known outcome that most resembles the new one, and say that will happen. The \(k\)-nearest-neighbor method is only slightly more complicated: look up the \(k\) most similar cases, and average them or have them vote. (Plain nearest neighbors is then “1-nearest-neighbors”, or “1NN”.) It’s fairly straightforward to analyze how 1NN would work if there were no noise, and we’ll do so in class; the crucial fact turns out to be that the distance to the nearest neighbor goes to zero, and we’ll see how fast it does so. Adding noise back in to the problem is then fairly straightforward. The risk of 1NN converges to exactly twice the minimum possible risk, under almost no assumptions. As we increase the number of nearest neighbors, we discover a bias-variance trade-off: small \(k\) means low bias but high variance, large \(k\) means high bias but low variance.

The most important statistical issue for nearest neighbors is selecting the right number of neighbors \(k\) to use. This is not a parameter, a fact about the world which we are trying to estimate, but a control setting, something we decide on. Analyzing the bias-variance trade-off gives us a sensible way to approach how to set this control; it tells us that the best \(k\) to use should grow with the number of data points, but not too fast. Practical answers turn on cross-validation, which we’ll go over again (once more, with feeling). It turns out that for \(k\)-NN, leave-one-out CV is (predictively) optimal. Once we have a well-tuned nearest-neighbor predictor, we can use it for nonparametric regression or classification. This in turn opens up many other possibilities; one of these is matching estimates of causal effects, where (say) people who have received some treatment are matched with the most similar people who haven’t, and the difference in their outcomes estimates the effect of the treatment.
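
kNN regression is short enough to write from scratch. A sketch in base R (one predictor, absolute-distance similarity; the data are invented):

```r
# k-nearest-neighbor regression from scratch (no packages)
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    idx <- order(abs(x_train - x0))[1:k]   # indices of the k closest cases
    mean(y_train[idx])                     # average their outcomes
  })
}
set.seed(4)
x <- runif(200)
y <- x^2 + rnorm(200, sd = 0.05)
yhat <- knn_predict(x, y, x_new = c(0.2, 0.8), k = 5)
yhat   # should be near the true values 0.04 and 0.64
```

Changing `k` here is exactly the control setting discussed above: `k = 1` interpolates the training data (low bias, high variance), while `k = 200` always predicts the overall mean (high bias, low variance).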

PDF notes (.Rmd)

Readings

Thursday, 24 February: Lecture 12, Trees

Classification and regression trees; the CART algorithm for building trees; “entropy” and “deviance”; some examples.

In place of notes, please read chapter 13 of ADAfaEPoV.

Reading

  • Key reading:
    • PDM, sections 5.2 and 10.5
    • Berk, chapter 3
    • ADAfaEPoV, chapter 13
  • Suggested readings:
    • Hastie et al., section 9.2

Homework

Tuesday, 1 March: Lecture 13, Forests and Ensembles

Improving on single trees by combining multiple trees. “Bagging”, or “bootstrap averaging”: combining multiple trees from slightly different data sets. Random forests: bagging plus random feature selection. Two perspectives on why multiple trees help: averaging reduces variance (but less so if what’s averaged is positively correlated); the performance of the average is the average performance plus the diversity.
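
Bagging can be hand-rolled in a few lines of base R. This sketch uses a one-split “stump” as a stand-in for a real tree (the data, the candidate cut points, and the number of bootstrap replicates are all invented for illustration):

```r
# Bagging a high-variance base learner by averaging over bootstrap resamples
set.seed(5)
n <- 300
x <- runif(n)
y <- (x > 0.5) + rnorm(n, sd = 0.5)        # a noisy step function
stump <- function(x_tr, y_tr, x_new) {
  # one-split "tree": pick the best of a few candidate cuts by squared error
  cuts <- seq(0.1, 0.9, by = 0.1)
  sse <- sapply(cuts, function(cc) sum((y_tr - ave(y_tr, x_tr > cc))^2))
  c_best <- cuts[which.min(sse)]
  ifelse(x_new > c_best, mean(y_tr[x_tr > c_best]), mean(y_tr[x_tr <= c_best]))
}
bagged <- function(x_tr, y_tr, x_new, B = 25) {
  preds <- replicate(B, {
    i <- sample(n, replace = TRUE)         # bootstrap resample
    stump(x_tr[i], y_tr[i], x_new)
  })
  rowMeans(preds)                          # average over the B stumps
}
p <- bagged(x, y, x_new = c(0.25, 0.75))
p   # the bagged fit should be low on the left, high on the right
```

A random forest adds one more ingredient: at each split, each tree only gets to consider a random subset of the features, which de-correlates the trees and makes the averaging work harder.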

Notes (.Rmd)

Reading

  • Suggested readings:
    • Berk, chapters 4 and 5
    • Hastie et al., section 8.7 (bagging) and chapter 15 (random forests)
  • Background reading:
    • Leo Breiman, “Bagging Predictors”, Machine Learning 24 (1996): 123–140
    • Leo Breiman, “Random Forests”, Machine Learning 45 (2001): 5–32, doi:10.1023/A:1010933404324
    • Hastie et al., sections 8.8 (stacking) and 9.5 (hierarchical mixtures of experts)
    • Anders Krogh and Jesper Vedelsby, “Neural Network Ensembles, Cross Validation, and Active Learning”, pp. 231–238 in Gerald Tesauro, David Touretzky and Todd Leen (eds.) Advances in Neural Information Processing Systems 7 [NIPS 1994] (Cambridge, Massachusetts: MIT Press, 1995)

Thursday, 3 March: Lecture 14, Forests and Ensembles, continued

Same notes as last time

Reading

Same as last time.

Homework

Tuesday, 8 March, and Thursday, 10 March: NO CLASS

Enjoy spring break.

Tuesday, 15 March: Lecture 15, Kernel Machines and Support Vector Machines

Motivation: using linear methods, but on nonlinear functions of the data. The “kernel trick”; definition of a kernel function and characterization of kernels. Kernel machines for classification and regression. Some practicalities of working with kernel machines: the implicit expansion of kernel machines in terms of basis functions. Kernel ridge regression. Worked examples.

Turning to classification: It’s hard to optimize the 0/1 loss, so we turn to “surrogates” which are easier to work with. The “margin” of a data point in a classification problem is how much the machine’s output would have to change to flip the classification; the margin of the machine is the minimum margin over data points. A classifier which achieves a large margin has put the boundary between classes far from all the data points, at least in whatever implicit feature space it’s using. For kernel machines, maximizing the margin leads to an optimization problem which is computationally tractable, and sets the weight on many data points to zero; the others are the “support vectors”. Matters are a little more complicated when we allow the machine the “slack” to make a certain number of mistakes, but fundamentally similar.
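
Kernel ridge regression fits in a few lines of base R. This sketch uses a Gaussian (RBF) kernel; the bandwidth and penalty are arbitrary illustrative choices, not recommendations:

```r
# Kernel ridge regression with a Gaussian (RBF) kernel
rbf_kernel <- function(a, b, sigma = 0.2) {
  exp(-outer(a, b, "-")^2 / (2 * sigma^2))   # matrix of pairwise kernel values
}
set.seed(6)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
K <- rbf_kernel(x, x)
lambda <- 0.1
alpha <- solve(K + lambda * diag(length(x)), y)   # dual coefficients
predict_krr <- function(x_new) as.vector(rbf_kernel(x_new, x) %*% alpha)
grid <- seq(0.1, 0.9, by = 0.1)
mean((predict_krr(grid) - sin(2 * pi * grid))^2)  # small: the fit tracks the sine
```

Notice that prediction only ever touches the data through kernel evaluations — that is the kernel trick — and that every training point gets a weight `alpha`; support vector machines get sparsity by driving most of those weights to zero.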

Notes, .Rmd

Readings

  • Key reading: PDM, section 10.3
  • Suggested readings:
    • Berk, chapter 7
    • Hastie et al., sections 5.1–5.3, chapter 12
  • Background reading:
    • Hastie et al., section 5.8
    • John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis (Cambridge, England: Cambridge University Press, 2004), doi:10.1017/CBO9780511809682

Thursday, 17 March: Lecture 16, Random Nonlinear Features a.k.a. Random Kitchen Sinks

Kernels have many attractive properties, but the need to keep around all the data points, and especially to work with the \(n \times n\) kernel matrix, creates a crucial bottleneck for large data sets. Random feature machines first emerged as a way of making kernel machines more computationally efficient: by decomposing kernels into sums (or integrals) of nonlinear functions, and sampling from those functions, one can get a randomized approximation to the kernel that uses only a small fraction of the computational resources. This led to the idea of directly using random features, without even trying to approximate a kernel. A surprisingly small number of randomly-chosen nonlinear features can come remarkably close to the optimal function, while still having good generalization performance.
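
The decomposition-and-sampling idea in one R sketch: random cosine features whose inner products approximate a Gaussian kernel (the bandwidth and number of features below are arbitrary choices for illustration):

```r
# Random cosine features approximating the RBF kernel exp(-(x-y)^2 / (2 sigma^2));
# draw w ~ Normal(0, 1/sigma^2) and b ~ Uniform(0, 2 pi)
set.seed(7)
sigma <- 0.5
D <- 2000                                        # number of random features
w <- rnorm(D, sd = 1 / sigma)
b <- runif(D, 0, 2 * pi)
phi <- function(x) sqrt(2 / D) * cos(x * w + b)  # feature map for scalar x
k_exact  <- exp(-(0.3 - 0.8)^2 / (2 * sigma^2))
k_approx <- sum(phi(0.3) * phi(0.8))             # inner product of random features
abs(k_exact - k_approx)                          # small: the features mimic the kernel
```

Once the features are drawn, we can just run ordinary (or ridge) linear regression on `phi(x)`, sidestepping the \(n \times n\) kernel matrix entirely.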

Notes: Combined with notes for lecture 15.

Readings

Homework

“Unsupervised” Methods for Dimension Reduction and Clustering

Tuesday, 22 March: Lecture 17, Prelude to dimension reduction

Goals of dimension reduction. Linear algebra review: vectors, matrices, inner products, projections. Eigen-decomposition of matrices.

Notes (.Rmd)

Readings

None required, but the review of vectors and matrices from J. V. Stone’s Independent Component Analysis (Cambridge, Massachusetts: MIT Press, 2004) may be helpful. (Also available on Canvas.)

Thursday, 24 March: Lecture 18, Linear dimension reduction

“Principal components analysis” = finding the linear surface which comes closest to the original data points. Performing and interpreting PCA. Example with US geography. “Latent semantic indexing” = PCA on bag-of-words vectors. Multidimensional scaling and PCA. Real-data examples. Time permitting: Random linear projections as an alternative to PCA.
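
A quick illustration with base R’s `prcomp()`, on invented data lying near a line in the plane:

```r
# PCA as the best linear low-dimensional approximation to the data
set.seed(8)
s <- rnorm(200)
X <- cbind(s, 2 * s) + matrix(rnorm(400, sd = 0.1), ncol = 2)  # near a line
pca <- prcomp(X)
var_share <- pca$sdev[1]^2 / sum(pca$sdev^2)   # variance along PC 1
# reconstruct the data from the first component alone:
X1 <- pca$x[, 1, drop = FALSE] %*% t(pca$rotation[, 1, drop = FALSE])
X1 <- sweep(X1, 2, pca$center, "+")            # undo the centering
mean((X - X1)^2)                               # small reconstruction error
```

Here one component carries nearly all the variance, so projecting to one dimension loses almost nothing — which is exactly the situation PCA is looking for.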

Notes (.Rmd)

Reading

  • Key reading: PDM, chapter 3, especially sections 3.6 (but not pp. 83–84) and 3.7
  • Suggested readings:
    • Hastie et al., section 14.5
    • ADAfaEPoV, chapter 15
  • Background readings:
    • Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman, “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science 41 (1990): 391–407, doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9, [http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf]
    • George W. Furnas, Scott Deerwester, Susan T. Dumais, Thomas K. Landauer, Richard A. Harshman, Lynn A. Streeter and Karen E. Lochbaum, “Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure”, pp. 465–480 in Yves Chiaramella (ed.), SIGIR’88, Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New York: Association for Computing Machinery, 1988), doi:10.1145/62437.62487, [http://susandumais.com/SIGIR1988-LSI-FurnasDeewesterDumaisEtAl.pdf]
    • Thomas K. Landauer and Susan T. Dumais, “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge”, Psychological Review 104 (1997): 211–240, [http://lsa.colorado.edu/papers/plato/plato.annote.html]
    • Peter Turchin et al., “Quantitative historical analysis uncovers a single dimension of complexity that structures global variation in human social organization”, Proceedings of the National Academy of Sciences (USA) 115 (2018) E144–E151, doi:10.1073/pnas.1708800115
    • Michael E. Wall, Andreas Rechtsteiner and Luis M. Rocha, “Singular Value Decomposition and Principal Component Analysis”, arxiv:physics/0208101

Tuesday, 29 March: Lecture 19, Factor Models

PCA is just an approximation method, which assumes nothing about the data-generating process. This means that we can always use it, it’s never wrong, but it also means we can’t actually make any predictions with it, and it’s never right, either. In particular, there is no right answer to the question of “how many principal components should we use?” One way around this is to make a PCA-like model, by assuming that the difference between the actual data vectors, and their reconstruction using only a limited number of low-dimensional variables, should be random noise. This gives us factor models, whose use is called factor analysis. In these models, our high-dimensional observables are linear combinations of a low-dimensional set of unseen factors, plus noise. Unlike PCA, factor models do make predictions about unseen data, at least at the level of the distribution, and so we can sensibly answer questions like “can we predict this data better using five factors or is just one enough?”, or “does the five-factor model pass goodness-of-fit tests?” Estimating these models involves some back-and-forth between making assumptions about the noise, and using that to estimate how the factors relate to the observables, and making assumptions about the factors and using that to estimate the noise.
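
Base R’s `factanal()` fits factor models by maximum likelihood. A sketch on simulated data (the loadings and noise level are invented, so we know the answer the estimate should recover):

```r
# One-factor model: observables = loadings * factor + noise
set.seed(9)
n <- 500
f <- rnorm(n)                                   # the unseen factor
true_loadings <- c(0.9, 0.8, 0.7, 0.6)
X <- outer(f, true_loadings) + matrix(rnorm(n * 4, sd = 0.5), ncol = 4)
fit <- factanal(X, factors = 1)
L <- as.vector(fit$loadings)                    # estimated loadings
L   # all four should be large and share a sign (the sign itself is arbitrary)
```

Because this is a genuine probability model, `factanal()` also reports a goodness-of-fit test, which is exactly the kind of question (“is one factor enough?”) PCA cannot answer.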

Turning to applications, factor models are often used to give low-dimensional (really, low-rank-plus-noise) approximations to data which naturally takes the form of high-dimensional matrices. The classic cases of this are in psychology, where the technique was first developed: rows of matrices there are people (or other test-takers), columns are “items” in a mental test, and the “factors” are (supposedly) attributes such as “knowledge of arithmetic”, “narcissism”, “racial resentment” or “need for chaos”. These uses are entrenched in our society, but also deservedly controversial. Turning to more data-mining-y applications, factor models are also common in recommendation systems (playing an important role in the Netflix Prize competition) and in ad targeting (being at the core of what Cambridge Analytica claimed to do).

Reading

  • Key reading: PDM, chapter 3, section 3.6
  • Suggested readings:
    • Matthew Hindman, “This is how Cambridge Analytica’s Facebook targeting model really worked — according to the person who built it”, The Conversation, 30 March 2018
    • O’Neil, chapter 6 (“Ineligible to Serve”)
    • ADAfaEPoV, chapter 16
  • Background readings:
    • David J. Bartholomew, Latent Variable Models and Factor Analysis (Oxford: Oxford University Press, 1987)
    • Andrey Feuerverger, Yu He, and Shashi Khatri, “Statistical Significance of the Netflix Challenge”, Statistical Science 27 (2012): 202–231, doi:10.1214/11-STS368
    • Annie Murphy Paul, The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves (New York: Free Press, 2004)
    • Charles Spearman, “‘General Intelligence,’ Objectively Determined and Measured”, American Journal of Psychology 15 (1904): 201–293
    • Godfrey H. Thomson, The Factorial Analysis of Human Ability (Boston: Houghton Mifflin Company, 1939), [http://www.archive.org/details/factorialanalysi032965mbp]
    • L. L. Thurstone, “The Vectors of Mind”, Psychological Review 41 (1934): 1–32

Supplementary Notes on Nonlinear dimension reduction

Problems with using linear methods when the data lie near a curved surface. The concept of a “manifold”. Exploiting manifold structure: local linear embedding. Exploiting a pre-existing measure of similarity: diffusion maps.

Slides (.Rmd)

Reading

  • Key readings:
    • PDM, section 6.5
    • ADAfaEPoV, appendix on nonlinear dimensionality reduction for material on diffusion maps
  • Background reading:
    • Mikhail Belkin and Partha Niyogi, “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation”, Neural Computation 15 (2003): 1373–1396, doi:10.1162/089976603321780317
    • Ann B. Lee and Larry Wasserman, “Spectral Connectivity Analysis”, Journal of the American Statistical Association 105 (2010): 1241–1255, doi:10.1198/jasa.2010.tm09754, arxiv:0811.0121
    • Sam T. Roweis and Laurence K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding”, Science 290 (2000): 2323–2326, doi:10.1126/science.290.5500.2323
    • Lawrence K. Saul and Sam T. Roweis, “Think Globally, Fit Locally: Supervised Learning of Low Dimensional Manifolds”, Journal of Machine Learning Research 4 (2003): 119–155, [http://jmlr.csail.mit.edu/papers/v4/saul03a.html]

Thursday, 31 March: Lecture 20, Clustering

“Clustering” is dividing data into categories, without knowing what the categories should be, or having any examples to follow. Why this is not totally hopeless, and some rough ideas about what good clusters might look like. The most basic algorithm for clustering: k-means. Using clustering to find (approximate) nearest neighbors (“locality-sensitive hashing”). Probabilistic clustering with mixture models; the EM algorithm; k-means again. Using log-likelihood to pick the number of clusters. Checking goodness of fit of cluster models. Advanced topics in probabilistic clustering (time permitting): “mixed membership” models, and their roles in modeling text (“topic models”, “latent Dirichlet allocation”) and genetics.
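
The k-means step of this lecture in one snippet, using base R’s `kmeans()` on invented, well-separated clusters:

```r
# k-means on two simulated clusters centered at (0,0) and (5,5)
set.seed(10)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
km <- kmeans(X, centers = 2, nstart = 10)   # nstart guards against bad local optima
centers <- km$centers[order(km$centers[, 1]), ]
centers   # one fitted center near (0,0), the other near (5,5)
```

`nstart = 10` matters because k-means only finds a local optimum of its within-cluster sum of squares; rerunning from several random starts and keeping the best is standard practice.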

Reading

  • Key readings:
    • PDM, sections 6.4, 9.2.1–9.2.6, 9.3–9.5, 9.6 (skimming 9.1 is recommended)
  • Suggested reading:
    • Hastie et al., sections 14.3.1–14.3.6, section 8.5
    • ADAfaEPoV, chapter 17
  • Background reading:
    • David M. Blei, Andrew Y. Ng and Michael I. Jordan, “Latent Dirichlet Allocation”, Journal of Machine Learning Research 3 (2003): 993–1022, [http://jmlr.csail.mit.edu/papers/v3/blei03a.html]
    • April Galyardt, “Interpreting Mixed Membership”, in Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds.), Handbook of Mixed Membership Models and Their Applications (New York; Chapman and Hall, 2016), doi:10.1201/b17520, [https://pdfs.semanticscholar.org/3266/b3317f848d3c545bd9ab6c1621e356202f84.pdf]
    • Aristides Gionis, Piotr Indyk and Rajeev Motwani, “Similarity Search in High Dimensions via Hashing”, pp. 518–529 in M. P. Atkinson et al. (eds.), Proceedings of the 25th International Conference on Very Large Data Bases [VLDB ’99] (San Francisco: Morgan Kaufmann, 1999)
    • Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data (Cambridge, UK: Cambridge University Press, 2006)
    • Leskovec et al., chapters 3 and 7
    • J. K. Pritchard, M. Stephens and P. Donnelly, “Inference of population structure using multilocus genotype data”, Genetics 155 (2000): 945–959, doi:10.1093/genetics/155.2.945

Homework

For Carnival Week: Neural Networks

Tuesday, 5 April, Lecture 21: Neural Networks

A.k.a. hierarchically organized nonlinear transformations of weighted sums of input features. Basic two-layer neural networks, their training and limitations. Three-layer neural networks. Random feature models as three-layer networks with a fixed initial layer. Deep neural networks. Powers: universal approximation; really good results in some domains. Limitations and unknowns: adversarial examples, generalization despite memorization, etc. Some reflections, from someone who remembers the last time neural networks were going to take over the world.
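
A tiny taste of why universal approximation works: with hand-set (not learned) weights, two sigmoid hidden units already make a “bump” function, and sums of bumps can approximate anything reasonable. A sketch in R:

```r
# Two sigmoid hidden units whose difference is a bump on (0.3, 0.7);
# the weights here are set by hand for illustration, not trained
sigmoid <- function(z) 1 / (1 + exp(-z))
bump <- function(x) sigmoid(50 * (x - 0.3)) - sigmoid(50 * (x - 0.7))
c(bump(0.1), bump(0.5), bump(0.9))   # ~0 in the tails, ~1 in the middle
```

Training a network is then the job of gradient descent (here, “back-propagation”) applied to the weights, rather than setting them by hand.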

Notes (.Rmd)

Reading

  • Key reading:
    • PDM, sections 10.3 and 11.4
    • Berk, chapter 8
  • Suggested reading (in this order):
    • Hastie et al., chapter 11
    • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus, “Intriguing properties of neural networks”, arxiv:1312.6199
    • Anh Nguyen, Jason Yosinski, Jeff Clune, “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”, arxiv:1412.1897
    • Brandon Carter, Siddhartha Jain, Jonas Mueller, David Gifford, “Overinterpretation reveals image classification model pathologies”, arxiv:2003.08907
    • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, “Understanding deep learning (still) requires rethinking generalization”, Communications of the ACM 64 (2021): 107–115, doi:10.1145/3446776
  • Background reading:
    • Mikhail Belkin, “Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation”, arxiv:2105.14368
    • Maureen Caudill and Charles Butler, Naturally Intelligent Systems (Cambridge, Massachusetts: MIT Press, 1990)
    • Patricia Churchland and Terrence Sejnowski, The Computational Brain (Cambridge, Massachusetts: MIT Press, 1992)
    • Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning (Cambridge, Massachusetts: MIT Press, 2016)
    • Warren S. McCulloch, Embodiments of Mind (Cambridge, Massachusetts: MIT Press, 1965)
    • Melanie Mitchell, Artificial Intelligence: A Guide for Thinking Humans (New York: Farrar, Straus and Giroux, 2019)
    • Janelle Shane, “You Look Like a Thing and I Love You”: How Artificial Intelligence Works and Why It’s Making the World a Weirder Place (New York: Little, Brown, 2019)

Thursday 7 April: NO CLASS

Enjoy Carnival.

Case Studies and Concerns

Tuesday, 12 April, Lecture 22: Recommender systems I: Broad Ideas and Technical Issues

The idea of a recommender system. Best-seller lists. Nearest-neighbor-based approaches (whether based on users or on items). The Netflix problem, and PCA/factor-model style approaches. Recommendations using social graphs. Model combination. Missing-data issues and the “cold start” problem.
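
A toy sketch of user-based nearest-neighbor recommendation in R (the users, items, and ratings are all invented):

```r
# Predict a missing rating by borrowing from the most similar user
ratings <- rbind(alice = c(5, 4, 1, NA),
                 bob   = c(5, 5, 1, 2),
                 carol = c(1, 1, 5, 5))
similarity <- function(u, v) {
  ok <- !is.na(u) & !is.na(v)   # compare only on co-rated items
  cor(u[ok], v[ok])
}
sims <- sapply(c("bob", "carol"),
               function(other) similarity(ratings["alice", ], ratings[other, ]))
nearest <- names(which.max(sims))     # alice's taste matches bob's, not carol's
predicted <- ratings[nearest, 4]      # so predict bob's rating for item 4
```

Real systems average over many neighbors, weight by similarity, and adjust for each user’s mean rating; the missing-data pattern itself (who bothers to rate what) is also informative, which is where the missing-at-random worries in the readings come in.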

Notes (.Rmd)

Readings

  • Key readings:
  • Suggested reading:
    • Andrey Feuerverger, Yu He, and Shashi Khatri, “Statistical Significance of the Netflix Challenge”, Statistical Science 27 (2012): 202–231, doi:10.1214/11-STS368
  • Background readings:
    • Gérard Biau, Benoît Cadre, Laurent Rouvière, “Statistical analysis of k-nearest neighbor collaborative recommendation”, Annals of Statistics 38 (2010): 1568–1592, doi:10.1214/09-AOS759, arxiv:1010.0499
    • Maurizio Ferrari Dacrema, Paolo Cremonesi and Dietmar Jannach, “Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches”, in Proceedings of the 13th ACM Conference on Recommender Systems [RecSys 2019], doi:10.1145/3298689.3347058, arxiv:1907.06902
    • Benjamin Marlin, Richard S. Zemel, Sam Roweis, Malcolm Slaney, “Collaborative Filtering and the Missing at Random Assumption”, arxiv:1206.5267

Thursday, 14 April, Lecture 23: Recommender systems II: What’s Not to Love?

Homogenization; paradoxes of diversification; rabbit holes and echo chambers; discrimination; what does the system optimize for; when inferior recommendations are more profitable; natural monopoly (or oligopoly). Also: do recommendations do anything at all?

Slides (.Rmd)

Readings

Homework

Tuesday, 19 April: Lecture 24, Fairness in prediction, especially classification

Notions of “fair” prediction: not using “protected” features; parity of error rates across groups; calibration of prediction. “Inference” issues: using features that carry information about protected features.
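
One of these notions, parity of error rates, is easy to audit by simulation. An R sketch with an invented disparity (the scores are simply noisier for one group):

```r
# Compare false-positive rates across groups under a single score threshold
set.seed(11)
n <- 10000
group <- sample(c("A", "B"), n, replace = TRUE)
truth <- rbinom(n, 1, 0.3)
# invented disparity: group B's scores are noisier than group A's
score <- truth + rnorm(n, sd = ifelse(group == "A", 0.3, 0.8))
pred <- as.numeric(score > 0.5)
fpr <- function(g) mean(pred[group == g & truth == 0])  # rate of false alarms
c(A = fpr("A"), B = fpr("B"))   # same threshold, very different error rates
```

Nothing here uses group membership in the prediction rule, yet the error rates differ sharply by group — a simple instance of why “don’t use the protected feature” and “equalize error rates” are genuinely different criteria.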

Notes (.Rmd)

Readings

  • Key readings:
  • Suggested readings:
    • Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, “Machine Bias”, ProPublica 23 May 2016
    • Alexandra Chouldechova, “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments”, arxiv:1610.07524
    • Sam Corbett-Davies and Sharad Goel, “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning”, arxiv:1808.00023
    • Simon DeDeo, “Wrong side of the tracks: Big Data and Protected Categories”, pp. 31–42 in Cassidy R. Sugimoto, Hamid R. Ekbia and Michael Mattioli (eds.), Big Data Is Not a Monolith (Cambridge, Mass.: MIT Press, 2016), arxiv:1412.4643
    • Sharad Goel, Jake M. Hofman, M. Irmak Sirer, “Who Does What on the Web: A Large-Scale Study of Browsing Behavior”, Sixth International AAAI Conference on Weblogs and Social Media [ICWSM 2012], [https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewPaper/4660], [https://5harad.com/papers/whowhatweb.pdf]
  • Background reading:
    • Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan, “A Survey on Bias and Fairness in Machine Learning”, arxiv:1908.09635

Thursday, 21 April: Lecture 25, More fairness

Trade-offs between fairness and accuracy. The “Pareto frontier” and multi-objective optimization. Trade-offs between different forms of fairness. Some critiques of these notions of “fairness”.
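To make the “Pareto frontier” idea concrete: a candidate model is on the frontier when no other candidate beats it on both criteria at once. A minimal sketch (Python, not R; the candidate models and their scores are invented for illustration):

```python
# Hypothetical sketch: the Pareto frontier among candidate models, each
# scored by accuracy (higher is better) and a fairness gap, e.g. the
# difference in group error rates (lower is better). Numbers are made up.

def pareto_frontier(models):
    """Keep the models not dominated by any other model.

    One model dominates another if it is at least as accurate AND at
    least as fair, and strictly better on at least one criterion."""
    front = []
    for name, acc, gap in models:
        dominated = any(
            (a >= acc and g <= gap) and (a > acc or g < gap)
            for _, a, g in models
        )
        if not dominated:
            front.append(name)
    return front

candidates = [
    ("A", 0.90, 0.15),  # most accurate, least fair
    ("B", 0.85, 0.05),  # middle ground
    ("C", 0.80, 0.02),  # fairest, least accurate
    ("D", 0.82, 0.10),  # less accurate AND less fair than B
]
print(pareto_frontier(candidates))  # ['A', 'B', 'C']: D is dominated by B
```

No point on the frontier is “best”; moving along it trades accuracy for fairness, and choosing among frontier points is a value judgment, not a statistical one.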

Notes: see notes for last time.

Readings

  • Key reading: As last time.
  • Background reading:
    • G. O. Mohler, M. B. Short, Sean Malinowski, Mark Johnson, G. E. Tita, Andrea L. Bertozzi and P. J. Brantingham, “Randomized Controlled Field Trials of Predictive Policing”, Journal of the American Statistical Association 110 (2015): 1399–1411, doi:10.1080/01621459.2015.1077710
    • Sarah Brayne, Predict and Surveil: Data, Discretion, and the Future of Policing (Oxford: Oxford University Press, 2020), doi:10.1093/oso/9780190684099.001.0001
    • Virginia Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor (New York: St. Martin’s Press, 2017)
    • Andrew Guthrie Ferguson, The Rise of Big Data Policing: Surveillance, Race, and the Future of Law Enforcement (New York: New York University Press, 2017)
    • Bernard E. Harcourt, Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age (Chicago: University of Chicago Press, 2007)
    • Colleen McCue, Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis (Amsterdam: Elsevier, 2006)

Homework

Tuesday, 26 April: Lecture 26, Waste, fraud and abuse: when is our statistical-learning project going to be a technical failure?

Bad data: errors in data vs. bad measurements, measuring the wrong thing. Weak or non-existent predictive relationships. Changing predictive relationships. Changing distributions. The curse of dimensionality. Not accounting for model search and/or model selection; sample splitting as a cure for the problems of post-selection inference. The “garden of forking paths” and “researcher degrees of freedom”. Prediction vs. causality; taking credit for what would have happened anyway.
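The sample-splitting idea can be shown in a few lines: select a model on one half of the data, then estimate its error on the untouched half, so the error estimate is not flattered by the selection step. A self-contained toy sketch (Python rather than the course’s R; everything here, including the data, is synthetic):

```python
# Toy sketch of sample splitting; all data here is synthetic.
import random

random.seed(8)
n, p = 200, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# only feature 0 actually matters; the other 19 are pure noise
y = [row[0] + random.gauss(0, 1) for row in X]

half = n // 2
train, test = range(half), range(half, n)

def corr(u, v):
    """Sample correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    du = sum((a - mu) ** 2 for a in u) ** 0.5
    dv = sum((b - mv) ** 2 for b in v) ** 0.5
    return num / (du * dv)

# Selection step, on the training half ONLY: pick the single feature
# most correlated with the response.
best = max(range(p), key=lambda j: abs(corr([X[i][j] for i in train],
                                            [y[i] for i in train])))

# Fit a one-variable least-squares slope (no intercept) on the training half...
xj = [X[i][best] for i in train]
yt = [y[i] for i in train]
beta = sum(a * b for a, b in zip(xj, yt)) / sum(a * a for a in xj)

# ...and estimate its error on the held-out half, which the selection
# step never saw, so the estimate is honest about post-selection error.
mse = sum((y[i] - beta * X[i][best]) ** 2 for i in test) / len(test)
print("selected feature:", best, " held-out MSE:", round(mse, 2))
```

Estimating the error on the same half used for selection would understate it, since the search over 20 features capitalizes on chance correlations; the held-out estimate does not.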

Notes (.Rmd)

Readings

  • Key readings:
    • PDM, chapter 2, section 6.5, section 1.7
    • Berk, section 1.6.10 and chapter 9
    • TEA, chapter 4
  • Suggested readings:
  • Background readings:
    • Jorge Luis Borges, “The Garden of Forking Paths” (as “El jardín de senderos que se bifurcan”, 1941; translations by Anthony Kerrigan in Ficciones [New York: Grove Press, 1962] and by Andrew Hurley in Collected Fictions [New York: Viking, 1998])
    • Howard Becker, Evidence (Chicago: University of Chicago Press, 2017)
    • Luke Oakden-Rayner, “Exploring large scale public medical image datasets”, arxiv:1907.12720
    • Amit Sharma, Jake M. Hofman and Duncan J. Watts, “Estimating the Causal Impact of Recommendation Systems from Observational Data”, pp. 453–470 in Proceedings of the Sixteenth ACM Conference on Economics and Computation [EC ’15], doi:10.1145/2764468.2764488, arxiv:1510.05569

Thursday, 28 April: Lecture 27, Waste, fraud and abuse: what are we doing to ourselves? should we be doing this?

Privacy and surveillance. Rights-violating applications of statistical learning. Legal (probably) but still destructive applications. Is it any better if it doesn’t work (well)? Some ethical check-list questions: who benefits? who is harmed? how much? how surely? are you OK with being responsible for that?

Notes

Readings

  • Key readings:
    • WMD, Conclusion
    • TEA, chapters 1 and 5
  • Suggested readings:
  • Background readings:
    • Tim Hwang, Subprime Attention Crisis: Advertising and the Time Bomb at the Heart of the Internet (New York: Farrar, Straus and Giroux, 2020)
    • Bruce Schneier, Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World (New York: W. W. Norton, 2015)
    • Francis Spufford, Red Plenty (London: Faber and Faber, 2010) [About the first time brilliant and well-intentioned people tried to use data, computers and optimization to make the world a better place]
    • Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society (Boston: Houghton Mifflin, 1950)

Stuff the University Wants Mentioned

Accommodations for Students with Disabilities

If you need accommodations for physical and/or learning disabilities, please contact the Office of Disability Resources, via their website, [http://www.cmu.edu/disability-resources]. They will help you work out an official written accommodation plan, and help coordinate with me.

Missing class and emergencies

Because of the drop-the-lowest-grades rule for class exercises, you can miss up to five class sessions without (grade) penalty. In cases of routine illness (for example, you have a bad cold and a day of sleep and fluids would help), stay home, take a free day, and let yourself rest and recover, rather than dragging yourself to class. The grading policy gives you space for this.

In cases of extended illness or emergency, we can deal with make-ups for missed work after the crisis has passed. If you are dealing with an extended illness, a mental health crisis, or a personal or family emergency, please don’t wait to get help. Just let me know that you’re dealing with an emergency, and that you are getting help (e.g., from Counseling and Psychological Services, or whoever is appropriate).

Covid

If you get sick with Covid-19, you’ll need to stay away from class while you’re isolating; this will usually be handled like any other illness. (You will, however, need to report your case to the university’s Community Health and Well-Being office.) In the unlikely event that more substantial accommodations are necessary, please contact the Office of Disability Resources as above.

Inclusion and Respectful Participation

The university is a community of scholars, that is, of people seeking knowledge. All of our accumulated knowledge has to be re-learned by every new generation of scholars, and re-tested, which requires debate and discussion. Everyone enrolled in the course has a right to participate in the class discussions. This doesn’t mean that everything everyone says is equally correct or equally important, but it does mean that everyone needs to be treated with respect as a person, and that criticism and debate should be directed at ideas, not at people. Don’t dismiss (or credit) anyone in the course because of where they come from, and don’t use your participation in the class as a way of shutting others up. Don’t be rude, and don’t go looking for things to be offended by. Statistical methods don’t usually lead to heated debate, but some of the uses of those methods we’ll discuss do. If someone else says something you think is really wrong-headed, and you think it’s important to correct it, address why it doesn’t make sense, and listen if they offer a counter-argument.

The classroom is not a democracy; as the teacher, I have the right and the responsibility to guide the discussion in what I judge are productive directions. This may include shutting down discussions which are not helping us learn about statistics, even if those discussions might be important to have elsewhere. (You can have them elsewhere.) I will do my best to guide the course in a way which respects everyone’s dignity as a human being, as a scholar, and as a member of the university.


(The pictures decorating the syllabus are from the great Renaissance artist Albrecht Dürer; the URLs below each figure give source links to ARTSTOR.)