Conceptual Foundations of Statistical Learning
36-465/665, Spring 2021
Cosma Shalizi
Tuesdays and Thursdays, 2:20--3:40 pm (Pittsburgh time), online only
This course is an introduction to the core ideas and theories of statistical
learning, and their uses in designing and analyzing machine-learning systems.
Statistical learning theory studies how to fit predictive models to training
data, usually by solving an optimization problem, in such a way that the model
will predict well, on average, on new data. The course will focus on the key
concepts and theoretical tools, at a mathematical level accessible to students
who have taken 36-401
(or equivalent) and its pre-requisites. The course will also illustrate those
concepts and tools by applying them to carefully selected kinds of machine
learning systems (such as kernel machines).
Pre-requisites
Undergraduates taking the course as 36-465 must have taken 36-401 with a grade
of C or better; this ensures familiarity with both the theory and practice of
linear regression modeling. 36-401 itself has a chain of pre-requisites:
linear algebra, calculus with multiple variables, undergraduate probability,
and mathematical statistics. We will be making extensive use of all of these.
Graduate students taking the course as 36-665 have, formally speaking, no
pre-requisites, but will be expected to have a preparation at least as thorough
as the undergraduates.
Learning Objectives
By the end of this course, when students encounter a new prediction problem,
they should be able to:
- Analyze in terms of statistical decision theory, identifying the space of actions, the space of outcomes, the information set, and the loss function;
- Determine whether a given combination of machines and estimation methods are suitable for the prediction problem;
- Identify whether standard generalization error bounds apply to this prediction problem and learning method, and, if so, apply those results to give guarantees about the out-of-sample performance;
- Interpret and compare error bounds arising from different techniques and assumptions.
When students encounter a new class of learning machines or statistical models, they should be able to:
- Calculate standard measures of model complexity or flexibility, including the Rademacher complexity and the VC dimension;
- Implement regularized and un-regularized optimization methods for fitting models from the class, and be able to calculate the effects of regularization on model complexity;
- Apply standard heuristics to quantify the consequences of imperfect optimization on predictive performance;
- Implement techniques for selecting among models.
36-465/665 vs. 36-462/662, 10-301/601 and 10-701
Students wanting exposure to a broad range of learning algorithms and their
applications would be better served by other courses,
especially 36-462/662 ("data
mining", "methods of statistical learning"), 10-301/601 ("introduction to
machine learning") or 10-701 ("introduction to machine learning" for
Ph.D. students). This class is for those who want a deeper understanding of
the underlying principles. It will mean a lot more math than coding, and it
won't help you move up a leader-board, but it will help you understand the
statistical reasons why learning machines work (when they do).
36-464 vs. 36-665
Undergraduates should register for the course as 36-465. Graduate students
in statistics or other departments are welcome to take the course, but should
register for it as 36-665. (Undergraduates registering for 36-465 will be
dropped from the roster.)
Short Outline of Topics Covered
This is subject to revision.
- Prediction as a decision problem; elements of decision theory; loss functions; examples of loss functions for classification and regression; "risk" defined as expected loss on new data; the goal is a low-risk prediction rule ("probably approximately correct", PAC).
- Finding decision rules by optimizing the "empirical risk", or average loss on the training data; learning curves (risk vs. sample size); why the empirical risk of the estimated rule is lower than its true risk; the origin of over-fitting.
- Mathematical tools for controlling the difference between empirical risk, or in-sample error, and true risk, or generalization error. Deviation inequalities, and their uniform counterparts, concentration inequalities. Measures of model complexity: Rademacher complexity, Vapnik-Chervonenkis (VC) dimension. Why counting parameters isn't enough (and is often irrelevant). "Stability" arguments.
- Optimization: basics of optimization theory; asymptotics of optimization with small noise; "optimism" of optimization; adding penalties or constraints; why penalties and constraints are equivalent; how penalties "regularize" unstable optimization problems; effects of regularization on model complexity
- Model selection: comparisons based on estimated generalization error;
based on bounds on generalization error ("structural risk minimization"); avoiding selection by model averaging and ensemble methods.
- Application to kernel machines: kernel methods for classification; kernel methods for regression. Risk bounds for kernel methods. Large-margin classifiers. Support vector machines.
- Application to random feature machines ("kitchen sinks") for classification and regression.
- Application to mixture models for density estimation.
- Extra topics, time permitting: predicting stochastic processes; sequential decision-making and reinforcement learning; low-regret ("on-line" or "adversarial") learning.
Course Mechanics
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. You will usually find the
readings more rewarding if you do the readings before lecture, rather
than after (or during). Since this is an online-only class this semester,
lectures will be held via Zoom; the link for each session will be on Canvas. I
know that the class time is late at night or early in the morning for some of
you; I nonetheless urge you to come to class and participate.
No Recordings: I will not be recording lectures.
This is because the value of class meetings lies precisely in your chance to
ask questions, discuss, and generally interact. (Otherwise, you could just
read a book.) Recordings interfere with this in two ways:
- They tempt you to skip class and/or to zone out and/or try to multi-task
during it. (Nobody, not even you, is really any good at
multi-tasking.) Even if you do watch the recording later, you will
not learn as much from it as if you had attended in the first place.
- People are understandably reluctant to participate when they know they're
being recorded. (It's only too easy to manipulate recordings to make anyone
seem dumb and/or obnoxious.) Maybe this doesn't bother you; it doesn't bother
me, much, because I'm protected by academic freedom and by tenure, but a good
proportion of your classmates won't participate if they're being recorded,
and that diminishes the value of the class for everyone.
Recording someone without their permission is illegal in many places, and
more importantly is unethical everywhere, so don't make your own recordings
of the class.
(Taking notes during class is fine and I strongly encourage it; taking notes
forces you to think about what you are hearing and how to organize it, which
helps you understand and remember the content.)
Textbooks
To be determined (as of November 2020). The following books are strongly recommended, and I may well end up picking one of them to be the textbook; if so, I will
decide well before the beginning of classes.
- Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory
- Mathukumalli Vidyasagar, A Theory of Learning and
Generalization: With Applications to Neural Networks and Control Systems
Assignments
There are three reasons you will get assignments in this course. In order of
decreasing importance:
- Practice. Practice is essential to developing the skills you are
learning in this class. It also actually helps you learn, because some things
which seem murky clarify when you actually do them, and sometimes trying to do
something shows what you only thought you understood.
- Feedback. By seeing what you can and cannot do, and what comes
easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to stake its
reputation (and that of its faculty) on assuring the world that you have
mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be two kinds of assignment in this
course.
- After-class comprehension questions and exercises
- Following every lecture, there will be a brief set of questions
about the material covered in lecture. Sometimes these will be about specific
points in the lecture, sometimes about specific aspects of the reading assigned
to go with the lecture. These will be done on Canvas and will be due the day
after each lecture. These should take no more than 10 minutes, but will be
untimed (so no accommodations for extra time are necessary). If the questions
ask you to do any math (and not all of them will!), a scan or photograph of
hand-written math is OK, so long as the picture is clearly legible. (Black ink
or dark pencil on white paper helps.)
- Homework
- Most weeks will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may involve statistical theory, analyzing
real data sets on the computer, and communicating the results.
- All homework will be submitted electronically through Gradescope/Canvas.
Most weeks, homework will be due at 6:00 pm on Thursdays
(Pittsburgh time). There will be a few weeks, clearly noted on the syllabus
and on the assignments, when this won't be the case. When this results in less
than seven days between an assignment's due date and the previous due date, the
homework will be shortened.
Time Expectations
You should expect to spend 5--7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grading
Grades will be broken down as follows:
- After-class questions: 10%. All sets of questions will
have equal weight. The lowest 4 will be dropped, no questions asked.
- Homework: 90%. All homeworks will have equal weight. Your lowest 3
homework grades will be dropped, no questions asked. If you turn in all
homework assignments on time, for a grade of at least 60% (each), your lowest four homework grades
will be dropped. Late homework will not be accepted for any
reason.
Grade boundaries will be as follows:
A | [90, 100] |
B | [80, 90) |
C | [70, 80) |
D | [60, 70) |
R | < 60 |
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that particular
assignment was wrongly graded, tell me as soon as possible. Direct any
questions or complaints about your grades to me; the teaching assistants have
no authority to make changes. (This also goes for your final letter grade.)
Complaints that the thresholds for letter grades are unfair, that you deserve a
higher grade, that you need a higher grade, etc., will accomplish much less
than pointing to concrete problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I
need to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from my
teachers?".
Canvas and Piazza
Homework will be submitted electronically through Gradescope/Canvas. Canvas
will also be used for the after-class questions, as a calendar showing all
assignments and their due-dates, to distribute some readings, and as the
official gradebook.
We will be using Piazza for question-answering. You will receive an
invitation within the first week of class. Anonymous-to-other-students
posting of questions and replies will be allowed, at least initially.
Anonymity will go away for everyone if it is abused. During Piazza office
hours, someone will be online to respond to questions (and follow-ups) in real
time. You are welcome to post at any time, but outside of normal working
hours you should expect that the instructors have lives.
Collaboration, Cheating and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbooks and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss homework with other students in the
class, though not to share or compare work; such conversations must be
acknowledged in your assignments. You may not discuss the content of
assignments with anyone other than current students, the instructors, or your
teachers in other current classes at CMU, until after the assignments are due.
(Exceptions can be made, with prior permission, for approved tutors.) You are,
naturally, free to complain, in general terms, about any aspect of the course,
to whomever you like.
Any use of solutions provided for any assignment in this course, or in other
courses, in previous semesters is strictly prohibited. This prohibition
applies even to students who are re-taking the course. Do not copy the old
solutions (in whole or in part), do not "consult" them, do not read them, do
not ask your friend who took the course last year if they "happen to remember"
or "can give you a hint". Doing any of these things, or anything like these
things, is cheating, it is easily detected cheating, and those who thought they
could get away with it in the past have failed the course. Even more
importantly: doing any of those things means that the
assignment doesn't give you a chance to practice; it makes any
feedback you get meaningless; and of course it makes any evaluation based on
that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will never be a penalty for asking. If you do
violate these policies but then think better of it, it is your responsibility
to tell me as soon as possible to discuss how to rectify matters. Otherwise,
violations of any sort will lead to severe, formal disciplinary action, under
the terms of the university's
policy
on academic integrity.
On the first day of class, every student will receive a written copy of the
university's policy on academic integrity, a written copy of these course
policies, and a "homework 0" on the content of these policies. This assignment
will not factor into your grade, but you must complete it before you
can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate
with me.
Inclusion and Respectful Participation
The university is a community of scholars, that is, of people seeking
knowledge. All of our accumulated knowledge has to be re-learned by every new
generation of scholars, and re-tested, which requires debate and discussion.
Everyone enrolled in the course has a right to participate in the class
discussions. This doesn't mean that everything everyone says is equally
correct or equally important, but does mean that everyone needs to be treated
with respect as persons, and criticism and debate should be directed
at ideas and not at people. Don't dismiss (or enhance) anyone in the course
because of where they come from, and don't use your participation in the class
as a way of shutting up others. (Don't be rude, and don't go looking for
things to be offended by.) While statistical learning theory
doesn't usually lead to heated debate, some of the subjects we'll be
applying it to might. If someone else is saying something you think is really
wrong-headed, and you think it's important to correct it, address why it
doesn't make sense, and listen if they give a counter-argument.
The classroom is not a democracy; as the teacher, I have the right and the
responsibility to guide the discussion in what I judge are productive
directions. This may include shutting down discussions which are not helping
us learn about statistics, even if those discussions might be important to have
elsewhere. (You can have them elsewhere.) I will do my best to guide the
course in a way which respects everyone's dignity as a human being, as a
scholar, and as a member of the university.
Detailed course calendar
Links to lecture notes, assignments, etc., will go here as they become
relevant.
The ordering is currently (November 2020) fairly tentative; any or all
topics after mixture models might be dropped, if earlier ones end up taking
longer than expected.
Lecture 1 (Tuesday, 2 February):
Introduction to the course
- Homework:
- Homework 0 assigned
- Homework 1 assigned
Lecture 2 (Thursday, 4 February):
Prediction as a Decison Problem
Lecture 3 (Tuesday, 9 February):
Empirical Risk Minimization
Lecture 4 (Thursday, 11 February):
Optimism and Over-Fitting
- Homework:
- Homework 0 due
- Homework 1 due
- Homework 2 assigned
Lecture 5 (Tuesday, 16 February):
Deviation Bounds of Probability Theory I: Markov's Inequality and Its Descendants
Lecture 6 (Thursday, 18 February):
Deviation Bounds II: Bounded-Difference Inequality and Its Descndants
- Homework:
- Homework 2 due
- Homework 3 assigned
NO CLASS on Tuesday, 23 February
This is a university break day.
Lecture 7 (Thursday, 25 February)
Concentration Inequalities
- Homework:
- Homework 3 due
- Homework 4 assigned
Lecture 8 (Tuesday, 2 March):
Rademacher Complexity and Our First Error Bounds
- Controlling the maximum deviation by controlling the discrepancy between
two independent realizations of the data. Derivation of the Rademacher
complexity. Interpretation of the Rademacher complexity: how well could this
model class (seem to) correlate with random noise? Bounds in terms of the
Rademacher complexity. Empirical Rademacher complexity and its estimation.
Rademacher complexity of linear models.
- Reading
Lecture 9 (Thursday, 4 March):
Vapnik-Chervonenkis (VC) Dimension and Distribution-Free Bounds
- How many distinguishable functions are there? The "growth
function" of a model class and its growth rate. VC dimension defined. VC
dimension of linear classifiers. Using the VC dimension to derive
distribution-free bounds. Extensions to regression.
- Homework:
- Homework 4 due
- Homework 5 assigned
Lecture 10 (Tuesday, 9 March):
Optimization and Its Asymptotics
- Essentials of optimization theory: gradients, Hessians, and all that.
Small-noise asymptotics for optimization: convergence to the limiting minimum;
fluctuations around the minimum and the "sandwich covariance"; using
propagation of error to get uncertainty in predictions; "optimism" of learned
functions again.
Lecture 11 (Thursday, 11 March):
Stability of learning
- Algorithmic stability bounds on generalization error.
- Homework:
- Homework 5 due
- Homework 6 assigned
Lecture 12 (Tuesday, 16 March):
Regularization by Penalization and Constraints
- Penalized optimization. Intuitive justifications for penalties.
Qualitative effects of L1 and L2 penalties. Constrained optimization.
Equivalence of penalties and constraints via Lagrange multipliers. Lagrange
multipliers as prices.
Lecture 13 (Thursday, 18 March):
Regularization and Model Complexity
- Effects of regularization on the capacity of function spaces. Explicit
calculations for examples. Deciding on how much regularization is called
for. The method of sieves.
- Homework:
- Homework 6 due
- Homework 7 assigned
Lecture 14 (Tuesday, 23 March):
Model Selection Based on Point Estimates of Risk
- Comparisons based on the "optimism". Comparisons based on information
criteria. Comparisons based on cross-validation.
Lecture 15 (Thursday, 25 March):
Model Selection Based on Bounds on Risk
- "Structural risk minimization". Preserving over-all bounds for the
selected model.
- Homework:
- Homework 7 due
- Homework 8 assigned
Lecture 16 (Tuesday, 30 March):
Model Averaging
- How averging can improve prediction. Role of diversity in improving
performance. Why ensembles aren't as prone to fit as their size would suggest.
Lecture 17 (Thursday, 1 April):
A Glimpse of Boosting
- Old-style or primitive boosting. AdaBoost. Modern views of boosting.
- Homework:
- Homework 8 due
- Homework 9 assigned
Lecture 18 (Tuesday, 6 April):
Kernel Machines I
- Motivation: using linear methods, but on nonlinear functions of the data.
The "kernel trick"; definition of a kernel function and characterization
of kernels. Kernel machines for classification and regression. Complexity
of kernel machines.
Lecture 19 (Thursday, 8 April):
Kernel Machines II
- The notion of margin, and margin bounds on generalization error. Support vector machines for classification and regression.
- Homework:
- Homework 9 due
- Homework 10 assigned
Lecture 20 (Tuesday, 13 April):
Random Feature Machines
- Back to the using-nonlinear-features idea. Replacing kernels with random features. How to pick random features. Generalization bounds. The power of random features.
- Reading:
- Optional reading:
- Ali Rahimi and Benjamin Recht, "Random Features for Large-Scale Kernel Machines", pp. 1177--1184 in John C. Platt, Daphne Koller, Yoram Singer and Samuel T. Roweis (eds.), Advances in Neural Information Processing Systems 20 [NIPS 2007]
- Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning", pp. 1313--1320 in Daphne Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds.), Advances in Neural Information Processing Systems 21 [NIPS 2008]
- Ali Rahimi and Benjamin Recht, "Reflections on Random Kitchen Sinks", argmin blog, 5 December 2017
- Homework:
- Homework 10 due (note date)
- Homework 11 assigned
NO CLASS on Thursday, 15 April
Spring Carnival.
Lecture 21 (Tuesday, 20 April):
Mixture Models I
Lecture 22 (Thursday, 22 April):
Mixture Models II
- Homework:
- Homework 11 due
- Homework 12 assigned
Lecture 23 (Tuesday, 27 April):
Low-regret learning I
- Either a catch-up day, or an introduction to the idea of regret, of low-regret learning, and mixtures of experts
Lecture 24 (Thursday, 29 April):
Low-regret learning II
- Either a catch-up day, or competing against sequences of experts, and growing ensembles
- Homework:
- Homework 12 due
- Homework 13 assigned
Lecture 25 (Tuesday, 4 May):
Predicting stochastic processes
- Adapting learning theory from IID to dependent random variables; or a catch-up day.
Lecture 26 (Thursday, 6 May):
Recapitulation
- Either a summary of highlights of the course, or a catch-up day.
- Homework: