Conceptual Foundations of Statistical Learning

36-465/665, Spring 2021

Erecting the edifice of statistical learning theory Cosma Shalizi
Tuesdays and Thursdays, 2:20--3:40 pm (Pittsburgh time), online only

This course is an introduction to the core ideas and theories of statistical learning, and their uses in designing and analyzing machine-learning systems. Statistical learning theory studies how to fit predictive models to training data, usually by solving an optimization problem, in such a way that the model will predict well, on average, on new data. The course will focus on the key concepts and theoretical tools, at a mathematical level accessible to students who have taken 36-401 (or equivalent) and its pre-requisites. The course will also illustrate those concepts and tools by applying them to carefully selected kinds of machine learning systems (such as kernel machines).

Pre-requisites

Undergraduates taking the course as 36-465 must have taken 36-401 with a grade of C or better; this ensures familiarity with both the theory and practice of linear regression modeling. 36-401 itself has a chain of pre-requisites: linear algebra, calculus with multiple variables, undergraduate probability, and mathematical statistics. We will be making extensive use of all of these. Graduate students taking the course as 36-665 have, formally speaking, no pre-requisites, but will be expected to have a preparation at least as thorough as the undergraduates.

Learning Objectives

By the end of this course, when students encounter a new prediction problem, they should be able to:

When students encounter a new class of learning machines or statistical models, they should be able to:

We will descend in to the mathematical depths

36-465/665 vs. 36-462/662, 10-301/601 and 10-701

Students wanting exposure to a broad range of learning algorithms and their applications would be better served by other courses, especially 36-462/662 ("data mining", "methods of statistical learning"), 10-301/601 ("introduction to machine learning") or 10-701 ("introduction to machine learning" for Ph.D. students). This class is for those who want a deeper understanding of the underlying principles. It will mean a lot more math than coding, and it won't help you move up a leader-board, but it will help you understand the statistical reasons why learning machines work (when they do).

36-464 vs. 36-665

Undergraduates should register for the course as 36-465. Graduate students in statistics or other departments are welcome to take the course, but should register for it as 36-665. (Undergraduates registering for 36-465 will be dropped from the roster.)

Short Outline of Topics Covered

This is subject to revision.
  1. Prediction as a decision problem; elements of decision theory; loss functions; examples of loss functions for classification and regression; "risk" defined as expected loss on new data; the goal is a low-risk prediction rule ("probably approximately correct", PAC).
  2. Finding decision rules by optimizing the "empirical risk", or average loss on the training data; learning curves (risk vs. sample size); why the empirical risk of the estimated rule is lower than its true risk; the origin of over-fitting.
  3. Mathematical tools for controlling the difference between empirical risk, or in-sample error, and true risk, or generalization error. Deviation inequalities, and their uniform counterparts, concentration inequalities. Measures of model complexity: Rademacher complexity, Vapnik-Chervonenkis (VC) dimension. Why counting parameters isn't enough (and is often irrelevant). "Stability" arguments.
  4. Optimization: basics of optimization theory; asymptotics of optimization with small noise; "optimism" of optimization; adding penalties or constraints; why penalties and constraints are equivalent; how penalties "regularize" unstable optimization problems; effects of regularization on model complexity
  5. Model selection: comparisons based on estimated generalization error; based on bounds on generalization error ("structural risk minimization"); avoiding selection by model averaging and ensemble methods.
  6. Application to kernel machines: kernel methods for classification; kernel methods for regression. Risk bounds for kernel methods. Large-margin classifiers. Support vector machines.
  7. Application to random feature machines ("kitchen sinks") for classification and regression.
  8. Application to mixture models for density estimation.
  9. Extra topics, time permitting: predicting stochastic processes; sequential decision-making and reinforcement learning; low-regret ("on-line" or "adversarial") learning.

Course Mechanics

Discussion is vital to learning

Lectures and Remote-Only Instruction

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. You will usually find the readings more rewarding if you do the readings before lecture, rather than after (or during). Since this is an online-only class this semester, lectures will be held via Zoom; the link for each session will be on Canvas. I know that the class time is late at night or early in the morning for some of you; I nonetheless urge you to come to class and participate.

No Recordings: I will not be recording lectures. This is because the value of class meetings lies precisely in your chance to ask questions, discuss, and generally interact. (Otherwise, you could just read a book.) Recordings interfere with this in two ways:

  1. They tempt you to skip class and/or to zone out and/or try to multi-task during it. (Nobody, not even you, is really any good at multi-tasking.) Even if you do watch the recording later, you will not learn as much from it as if you had attended in the first place.
  2. People are understandably reluctant to participate when they know they're being recorded. (It's only too easy to manipulate recordings to make anyone seem dumb and/or obnoxious.) Maybe this doesn't bother you; it doesn't bother me, much, because I'm protected by academic freedom and by tenure, but a good proportion of your classmates won't participate if they're being recorded, and that diminishes the value of the class for everyone.

Recording someone without their permission is illegal in many places, and more importantly is unethical everywhere, so don't make your own recordings of the class.

(Taking notes during class is fine and I strongly encourage it; taking notes forces you to think about what you are hearing and how to organize it, which helps you understand and remember the content.)

Textbooks

To be determined (as of November 2020). The following books are strongly recommended, and I may well end up picking one of them to be the textbook; if so, I will decide well before the beginning of classes.

Assignments

There are three reasons you will get assignments in this course. In order of decreasing importance:
  1. Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows what you only thought you understood.
  2. Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
  3. Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be two kinds of assignment in this course.

After-class comprehension questions and exercises
Following every lecture, there will be a brief set of questions about the material covered in lecture. Sometimes these will be about specific points in the lecture, sometimes about specific aspects of the reading assigned to go with the lecture. These will be done on Canvas and will be due the day after each lecture. These should take no more than 10 minutes, but will be untimed (so no accommodations for extra time are necessary). If the questions ask you to do any math (and not all of them will!), a scan or photograph of hand-written math is OK, so long as the picture is clearly legible. (Black ink or dark pencil on white paper helps.)
Homework
Most weeks will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve statistical theory, analyzing real data sets on the computer, and communicating the results.
All homework will be submitted electronically through Gradescope/Canvas. Most weeks, homework will be due at 6:00 pm on Thursdays (Pittsburgh time). There will be a few weeks, clearly noted on the syllabus and on the assignments, when this won't be the case. When this results in less than seven days between an assignment's due date and the previous due date, the homework will be shortened.

Time Expectations

You should expect to spend 5--7 hours on assignments every week, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

Grade boundaries will be as follows:
A [90, 100]
B [80, 90)
C [70, 80)
D [60, 70)
R < 60

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think that particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, that you need a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Canvas and Piazza

Homework will be submitted electronically through Gradescope/Canvas. Canvas will also be used for the after-class questions, as a calendar showing all assignments and their due-dates, to distribute some readings, and as the official gradebook.

We will be using Piazza for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. Anonymity will go away for everyone if it is abused. During Piazza office hours, someone will be online to respond to questions (and follow-ups) in real time. You are welcome to post at any time, but outside of normal working hours you should expect that the instructors have lives. Cheating will lead to disaster

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbooks and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss homework with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students, the instructors, or your teachers in other current classes at CMU, until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course, or in other courses, in previous semesters is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will never be a penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how to rectify matters. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

On the first day of class, every student will receive a written copy of the university's policy on academic integrity, a written copy of these course policies, and a "homework 0" on the content of these policies. This assignment will not factor into your grade, but you must complete it before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate with me.

Inclusion and Respectful Participation

The university is a community of scholars, that is, of people seeking knowledge. All of our accumulated knowledge has to be re-learned by every new generation of scholars, and re-tested, which requires debate and discussion. Everyone enrolled in the course has a right to participate in the class discussions. This doesn't mean that everything everyone says is equally correct or equally important, but does mean that everyone needs to be treated with respect as persons, and criticism and debate should be directed at ideas and not at people. Don't dismiss (or enhance) anyone in the course because of where they come from, and don't use your participation in the class as a way of shutting up others. (Don't be rude, and don't go looking for things to be offended by.) While statistical learning theory doesn't usually lead to heated debate, some of the subjects we'll be applying it to might. If someone else is saying something you think is really wrong-headed, and you think it's important to correct it, address why it doesn't make sense, and listen if they give a counter-argument.

The classroom is not a democracy; as the teacher, I have the right and the responsibility to guide the discussion in what I judge are productive directions. This may include shutting down discussions which are not helping us learn about statistics, even if those discussions might be important to have elsewhere. (You can have them elsewhere.) I will do my best to guide the course in a way which respects everyone's dignity as a human being, as a scholar, and as a member of the university.

Detailed course calendar

Using concentration inequalities to build a barrier against fluctuations and over-fitting Links to lecture notes, assignments, etc., will go here as they become relevant.

The ordering is currently (November 2020) fairly tentative; any or all topics after mixture models might be dropped, if earlier ones end up taking longer than expected.

Lecture 1 (Tuesday, 2 February): Introduction to the course

Lecture 2 (Thursday, 4 February): Prediction as a Decison Problem

Lecture 3 (Tuesday, 9 February): Empirical Risk Minimization

Lecture 4 (Thursday, 11 February): Optimism and Over-Fitting

Lecture 5 (Tuesday, 16 February): Deviation Bounds of Probability Theory I: Markov's Inequality and Its Descendants

Lecture 6 (Thursday, 18 February): Deviation Bounds II: Bounded-Difference Inequality and Its Descndants

NO CLASS on Tuesday, 23 February

This is a university break day.

Lecture 7 (Thursday, 25 February) Concentration Inequalities

Lecture 8 (Tuesday, 2 March): Rademacher Complexity and Our First Error Bounds

Lecture 9 (Thursday, 4 March): Vapnik-Chervonenkis (VC) Dimension and Distribution-Free Bounds

Lecture 10 (Tuesday, 9 March): Optimization and Its Asymptotics

Lecture 11 (Thursday, 11 March): Stability of learning

Lecture 12 (Tuesday, 16 March): Regularization by Penalization and Constraints

Lecture 13 (Thursday, 18 March): Regularization and Model Complexity

Lecture 14 (Tuesday, 23 March): Model Selection Based on Point Estimates of Risk

Lecture 15 (Thursday, 25 March): Model Selection Based on Bounds on Risk

Lecture 16 (Tuesday, 30 March): Model Averaging

Lecture 17 (Thursday, 1 April): A Glimpse of Boosting

Lecture 18 (Tuesday, 6 April): Kernel Machines I

Lecture 19 (Thursday, 8 April): Kernel Machines II

Lecture 20 (Tuesday, 13 April): Random Feature Machines

NO CLASS on Thursday, 15 April

Spring Carnival.

Lecture 21 (Tuesday, 20 April): Mixture Models I

Lecture 22 (Thursday, 22 April): Mixture Models II

Lecture 23 (Tuesday, 27 April): Low-regret learning I

Lecture 24 (Thursday, 29 April): Low-regret learning II

Lecture 25 (Tuesday, 4 May): Predicting stochastic processes

Lecture 26 (Thursday, 6 May): Recapitulation