Conceptual Foundations of Statistical Learning
36-465/665, Spring 2021
Tuesdays and Thursdays, 2:20--3:40 pm (Pittsburgh time), online only
This course is an introduction to the core ideas and theories of statistical
learning, and their uses in designing and analyzing machine-learning systems.
Statistical learning theory studies how to fit predictive models to training
data, usually by solving an optimization problem, in such a way that the model
will predict well, on average, on new data. The course will focus on the key
concepts and theoretical tools, at a mathematical level accessible to students
who have taken 36-401
(or equivalent) and its pre-requisites. The course will also illustrate those
concepts and tools by applying them to carefully selected kinds of machine
learning systems (such as kernel machines).
Undergraduates taking the course as 36-465 must have taken 36-401 with a grade
of C or better; this ensures familiarity with both the theory and practice of
linear regression modeling. 36-401 itself has a chain of pre-requisites:
linear algebra, calculus with multiple variables, undergraduate probability,
and mathematical statistics. We will be making extensive use of all of these.
Graduate students taking the course as 36-665 have, formally speaking, no
pre-requisites, but will be expected to have a preparation at least as thorough
as the undergraduates.
By the end of this course, when students encounter a new prediction problem,
they should be able to:
- Analyze in terms of statistical decision theory, identifying the space of actions, the space of outcomes, the information set, and the loss function;
- Determine whether a given combination of machines and estimation methods are suitable for the prediction problem;
- Identify whether standard generalization error bounds apply to this prediction problem and learning method, and, if so, apply those results to give guarantees about the out-of-sample performance;
- Interpret and compare error bounds arising from different techniques and assumptions.
When students encounter a new class of learning machines or statistical models, they should be able to:
- Calculate standard measures of model complexity or flexibility, including the Rademacher complexity and the VC dimension;
- Implement regularized and un-regularized optimization methods for fitting models from the class, and be able to calculate the effects of regularization on model complexity;
- Apply standard heuristics to quantify the consequences of imperfect optimization on predictive performance;
- Implement techniques for selecting among models.
36-465/665 vs. 36-462/662, 10-301/601 and 10-701
Students wanting exposure to a broad range of learning algorithms and their
applications would be better served by other courses,
especially 36-462/662 ("data
mining", "methods of statistical learning"), 10-301/601 ("introduction to
machine learning") or 10-701 ("introduction to machine learning" for
Ph.D. students). This class is for those who want a deeper understanding of
the underlying principles. It will mean a lot more math than coding, and it
won't help you move up a leader-board, but it will help you understand the
statistical reasons why learning machines work (when they do).
36-464 vs. 36-665
Undergraduates should register for the course as 36-465. Graduate students
in statistics or other departments are welcome to take the course, but should
register for it as 36-665. (Undergraduates registering for 36-465 will be
dropped from the roster.)
Short Outline of Topics Covered
This is subject to revision.
- Prediction as a decision problem; elements of decision theory; loss functions; examples of loss functions for classification and regression; "risk" defined as expected loss on new data; the goal is a low-risk prediction rule ("probably approximately correct", PAC).
- Finding decision rules by optimizing the "empirical risk", or average loss on the training data; learning curves (risk vs. sample size); why the empirical risk of the estimated rule is lower than its true risk; the origin of over-fitting.
- Mathematical tools for controlling the difference between empirical risk, or in-sample error, and true risk, or generalization error. Deviation inequalities, and their uniform counterparts, concentration inequalities. Measures of model complexity: Rademacher complexity, Vapnik-Chervonenkis (VC) dimension. Why counting parameters isn't enough (and is often irrelevant). "Stability" arguments.
- Optimization: basics of optimization theory; asymptotics of optimization with small noise; "optimism" of optimization; adding penalties or constraints; why penalties and constraints are equivalent; how penalties "regularize" unstable optimization problems; effects of regularization on model complexity
- Model selection: comparisons based on estimated generalization error;
based on bounds on generalization error ("structural risk minimization"); avoiding selection by model averaging and ensemble methods.
- Application to kernel machines: kernel methods for classification; kernel methods for regression. Risk bounds for kernel methods. Large-margin classifiers. Support vector machines.
- Application to random feature machines ("kitchen sinks") for classification and regression.
- Application to mixture models for density estimation.
- Extra topics, time permitting: predicting stochastic processes; sequential decision-making and reinforcement learning; low-regret ("on-line" or "adversarial") learning.
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. You will usually find the
readings more rewarding if you do the readings before lecture, rather
than after (or during). Since this is an online-only class this semester,
lectures will be held via Zoom; the link for each session will be on Canvas. I
know that the class time is late at night or early in the morning for some of
you; I nonetheless urge you to come to class and participate.
No Recordings: I will not be recording lectures.
This is because the value of class meetings lies precisely in your chance to
ask questions, discuss, and generally interact. (Otherwise, you could just
read a book.) Recordings interfere with this in two ways:
- They tempt you to skip class and/or to zone out and/or try to multi-task
during it. (Nobody, not even you, is really any good at
multi-tasking.) Even if you do watch the recording later, you will
not learn as much from it as if you had attended in the first place.
- People are understandably reluctant to participate when they know they're
being recorded. (It's only too easy to manipulate recordings to make anyone
seem dumb and/or obnoxious.) Maybe this doesn't bother you; it doesn't bother
me, much, because I'm protected by academic freedom and by tenure, but a good
proportion of your classmates won't participate if they're being recorded,
and that diminishes the value of the class for everyone.
Recording someone without their permission is illegal in many places, and
more importantly is unethical everywhere, so don't make your own recordings
of the class.
(Taking notes during class is fine and I strongly encourage it; taking notes
forces you to think about what you are hearing and how to organize it, which
helps you understand and remember the content.)
To be determined (as of November 2020). The following books are strongly recommended, and I may well end up picking one of them to be the textbook; if so, I will
decide well before the beginning of classes.
- Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory
- Mathukumalli Vidyasagar, A Theory of Learning and
Generalization: With Applications to Neural Networks and Control Systems
There are three reasons you will get assignments in this course. In order of
- Practice. Practice is essential to developing the skills you are
learning in this class. It also actually helps you learn, because some things
which seem murky clarify when you actually do them, and sometimes trying to do
something shows what you only thought you understood.
- Feedback. By seeing what you can and cannot do, and what comes
easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to stake its
reputation (and that of its faculty) on assuring the world that you have
mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be two kinds of assignment in this
- After-class comprehension questions and exercises
- Following every lecture, there will be a brief set of questions
about the material covered in lecture. Sometimes these will be about specific
points in the lecture, sometimes about specific aspects of the reading assigned
to go with the lecture. These will be done on Canvas and will be due the day
after each lecture. These should take no more than 10 minutes, but will be
untimed (so no accommodations for extra time are necessary). If the questions
ask you to do any math (and not all of them will!), a scan or photograph of
hand-written math is OK, so long as the picture is clearly legible. (Black ink
or dark pencil on white paper helps.)
- Most weeks will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may involve statistical theory, analyzing
real data sets on the computer, and communicating the results.
- All homework will be submitted electronically through Gradescope/Canvas.
Most weeks, homework will be due at 6:00 pm on Thursdays
(Pittsburgh time). There will be a few weeks, clearly noted on the syllabus
and on the assignments, when this won't be the case. When this results in less
than seven days between an assignment's due date and the previous due date, the
homework will be shortened.
You should expect to spend 5--7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grades will be broken down as follows:
- After-class questions: 10%. All sets of questions will
have equal weight. The lowest 4 will be dropped, no questions asked.
- Homework: 90%. All homeworks will have equal weight. Your lowest 3
homework grades will be dropped, no questions asked. If you turn in all
homework assignments on time, for a grade of at least 60% (each), your lowest four homework grades
will be dropped. Late homework will not be accepted for any
Grade boundaries will be as follows:
|A ||[90, 100]|
|B ||[80, 90)|
|C ||[70, 80)|
|D ||[60, 70)|
|R || < 60|
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that particular
assignment was wrongly graded, tell me as soon as possible. Direct any
questions or complaints about your grades to me; the teaching assistants have
no authority to make changes. (This also goes for your final letter grade.)
Complaints that the thresholds for letter grades are unfair, that you deserve a
higher grade, that you need a higher grade, etc., will accomplish much less
than pointing to concrete problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I
need to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from my
Canvas and Piazza
Homework will be submitted electronically through Gradescope/Canvas. Canvas
will also be used for the after-class questions, as a calendar showing all
assignments and their due-dates, to distribute some readings, and as the
We will be using Piazza for question-answering. You will receive an
invitation within the first week of class. Anonymous-to-other-students posting
of questions and replies will be allowed, at least initially. Anonymity will
go away for everyone if it is abused. During Piazza office hours, someone will
be online to respond to questions (and follow-ups) in real time. You are
welcome to post at any time, but outside of normal working hours you should
expect that the instructors have lives.
|Tuesday 3:40--4:40 ||Prof. Shalizi ||Zoom (same link as lecture)|
|Wednesday 2:00--3:00 ||Prof. Shalizi ||Piazza|
|Thursday 1:20--2:20 ||Mr. Luo ||Piazza|
Collaboration, Cheating and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbooks and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss homework with other students in the
class, though not to share or compare work; such conversations must be
acknowledged in your assignments. You may not discuss the content of
assignments with anyone other than current students, the instructors, or your
teachers in other current classes at CMU, until after the assignments are due.
(Exceptions can be made, with prior permission, for approved tutors.) You are,
naturally, free to complain, in general terms, about any aspect of the course,
to whomever you like.
Any use of solutions provided for any assignment in this course, or in other
courses, in previous semesters is strictly prohibited. This prohibition
applies even to students who are re-taking the course. Do not copy the old
solutions (in whole or in part), do not "consult" them, do not read them, do
not ask your friend who took the course last year if they "happen to remember"
or "can give you a hint". Doing any of these things, or anything like these
things, is cheating, it is easily detected cheating, and those who thought they
could get away with it in the past have failed the course. Even more
importantly: doing any of those things means that the
assignment doesn't give you a chance to practice; it makes any
feedback you get meaningless; and of course it makes any evaluation based on
that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will never be a penalty for asking. If you do
violate these policies but then think better of it, it is your responsibility
to tell me as soon as possible to discuss how to rectify matters. Otherwise,
violations of any sort will lead to severe, formal disciplinary action, under
the terms of the university's
on academic integrity.
On the first day of class, every student will receive a written copy of the
university's policy on academic integrity, a written copy of these course
policies, and a "homework 0" on the content of these policies. This assignment
will not factor into your grade, but you must complete it before you
can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate
Inclusion and Respectful Participation
The university is a community of scholars, that is, of people seeking
knowledge. All of our accumulated knowledge has to be re-learned by every new
generation of scholars, and re-tested, which requires debate and discussion.
Everyone enrolled in the course has a right to participate in the class
discussions. This doesn't mean that everything everyone says is equally
correct or equally important, but does mean that everyone needs to be treated
with respect as persons, and criticism and debate should be directed
at ideas and not at people. Don't dismiss (or credit) anyone in the course
because of where they come from, and don't use your participation in the class
as a way of shutting up others. Don't be rude, and don't go looking for
things to be offended by. While statistical learning theory
doesn't usually lead to heated debate, some of the subjects we'll be
applying it to might. If someone else is saying something you think is really
wrong-headed, and you think it's important to correct it, address why it
doesn't make sense, and listen if they give a counter-argument.
The classroom is not a democracy; as the teacher, I have the right and the
responsibility to guide the discussion in what I judge are productive
directions. This may include shutting down discussions which are not helping
us learn about statistics, even if those discussions might be important to have
elsewhere. (You can have them elsewhere.) I will do my best to guide the
course in a way which respects everyone's dignity as a human being, as a
scholar, and as a member of the university.
Detailed course calendar
Links to lecture notes, assignments, etc., will go here as they become
Note that the order of topics after mixture models is somewhat tentative;
any or all of them might be dropped, if earlier ones end up taking longer than
expected, and/or replaced by others that seem more relevant.
Lecture 1 (Tuesday, 2 February):
Introduction to the course
Lecture 2 (Thursday, 4 February):
Prediction as a Decison Problem
- Kinds of prediction: regression (guessing a quantitative variable),
classification (guessing a categorical variable), ranking (guessing an ordinal
variable), density estimation (guessing a probability distribution), etc.
"What prediction should I make for Y when I've seen that X=x?" is a decision
problem. The elements of decision theory: actions, states, information. "Loss
functions" say how bad a a given action is in a given state. A "strategy" is a
rule which tells us what action to take on the basis of the available
information. "Risk" is the expected loss incurred from following a strategy.
Risk minimization as a guide to decision making. Illustration with
least-squares regression and especially linear regression. Other losses for
regression; some losses for classification. Asides (time permitting):
alternatives to minimizing risk, such as minimizing the maximum loss; where did
decision theory come from?
- Slides (.Rmd)
Lecture 3 (Tuesday, 9 February):
Empirical Risk Minimization
- The goal of decision theory is to find the strategy (from the "feasible"
set of strategies) which minimizes the risk, but "risk" is defined using the
true data-generating distribution, which we do not know. The "empirical risk"
of a strategy is simply its mean loss on the training data, and "empirical risk
minimization" is selecting the (feasible) strategy with the lowest mean loss on
the training data. (Examples: least squares, maximum likelihood.) This is an
optimization problem. By using the law of large numbers, we can say something
about which empirical risk minimization will converge on the strategy that
minimizes the true risk. By using the central limit theorem (and waving our
hands a lot), we can say something about how the empirical risk minimizer has
Gaussian fluctuations around the true risk minimizer.
- Slides (.Rmd)
Lecture 4 (Thursday, 11 February):
Optimism and Over-Fitting
- The empirical risk of any fixed strategy is an unbiased estimate of its
true risk. But the empirical risk of the empirical risk minimizer is
a negatively biased estimate of its true risk. ("The winner's
curse".) "Optimism" refers to this negative bias. The same rough calculations
which told us last time about how ERM works this time tell us about the
magnitude of the optimism. "Over-fitting" happens because larger, more
flexible models (which have less approximation error) are more prone to
- Slides (.Rmd)
- Homework 0 due
- Homework 1 due
- Homework 2 assigned
Lecture 5 (Tuesday, 16 February):
Deviation Bounds of Probability Theory I: Markov's Inequality and Its Descendants
- A random variable can't be small, on average, if large values are highly
probable. Markov's inequality turns this around, and uses the expected value
to upper-bound tail probabilities. Applying increasing functions to both sides
of the inequality lets us breed new inequalities from old ones. Chebyshev's
inequality bounds deviations from the expected value in terms of the variance.
This let Chebyshev, and us, prove the law of large numbers. Chebyshev's form
of the LLN implies that the probability of a large deviation goes to zero
inversely with the number of samples (if not faster). Using the exponential
Markov inequality, one of several related results called "the Chernoff bound",
gives deviation probabilities that go to zero exponentially fast. This
requires knowing, or at least controlling, the moment generating function, and
leads to the idea of "sub-Gaussian" random variables, essentially ones where
the law of large numbers converges at least as fast as for Gaussians.
Back-up topics, skipped in the interest of time but at the end of the slides: lower bounds on deviation probabilities from the Paley-Zygmund inequality; large deviations theory and upper and lower bounds going to zero at the same exponential rate; deriving the central limit theorem with moment generating functions; cumulant generating functions; what can happen when a single
variable can cause a large deviation in an average.
- Slides (.Rmd)
Lecture 6 (Thursday, 18 February):
Deviation Bounds II: To the Bounded-Difference Inequality
- Continuing with the exponential Markov inequality: The moment generating
function for the sum of two independent random variables is the product of
their moment generating functions. This implies that the sum of independent
sub-Gaussian random variables is also sub-Gaussian, and lets us bound their
sums and their averages. A particularly important class of sub-Gaussian random
variables turns out to be those with a finite range; "Hoeffding's inequality"
shows that these are sub-Gaussian and controls their moment generating
functions in terms of the length of the range. Immediate corollaries are
deviation inequalities for sums and averages of bounded random variables.
(These are also called "Hoeffding's inequality",
confusion because the results are so closely related.) One
application of Hoeffding's inequality is to control the empirical risk, under
the log loss, of binary classification, showing that it's close to the true
risk. Turning to functions which are not just sums or averages, we introduce
the "bounded difference property", which says that changing one argument of a
function at a time can only lead to a limited change in the value of the
function. If the arguments are independent, this limits the variance (via the
Efron-Stein inequality, which doesn't strictly require the bounded-difference
property). The McDiarmid inequality, often also called the bounded
difference inequality, is an exponential deviation bound for functions
with the bounded difference property. As an illustration, this gives us
deviation inequalities for the difference between the true cumulative
distribution function and the empirical or sample CDF, even though the empirical
CDF is not an average of the data points.
- Slides (.Rmd)
- Handout with the gory/boring mathematical details skipped in lecture: PDF (.Rmd)
NO CLASS on Tuesday, 23 February
This is a university break day.
Lecture 7 (Thursday, 25 February)
From Deviation Inequalities to Generalization Error Bounds via Uniform Convergence
- We use data to estimate what strategy to follow. Typically we do this by
empirical risk minimization. We know the empirical risk of the estimated
strategy, but what we'd really like to know is how close the true risk is to
the empirical risk ("generalization error"), and how close the true risk of the
estimated strategy is to the risk of the optimal strategy ("estimation error").
The deviation bounds we've learned how to use are not quite enough to
get at this, because the estimated strategy is, itself, a random,
data-dependent object. One way to control both generalization error and
estimation error is to move from "universal" convergence (having some deviation
bound for every strategy's risk) to "uniform" convergence (have bounds on
the maximum deviation). If the set of strategies was finite, a simple
union bound would accomplish this. For infinite spaces, the basic approach is
to approximate or "cover" the strategy space by a finite set of strategies,
apply the union bound, and handle the extra slop due to the approximation.
While theoretically fundamental, in practice it can be better to show that the
maximum deviation concentrates on its expected value, and then control
the maximum expected deviation using properties of the model space;
this approach will occupy us next week.
- Slides (.Rmd)
Lecture 8 (Tuesday, 2 March):
Rademacher Complexity and Our First Error Bounds
- Controlling the maximum deviation by controlling the discrepancy between
two independent realizations of the data. Derivation of the Rademacher
complexity. Interpretation of the Rademacher complexity: how well could this
model class (seem to) correlate with random noise? Bounds in terms of the
Rademacher complexity. Using empirical Rademacher complexity to estimate Rademacher complexity (while still getting a valid generalization error bound).
Using simulation to estimate empirical Rademacher complexity.
- Slides (correcting some backwards inequalities; .Rmd)
Lecture 9 (Thursday, 4 March):
Vapnik-Chervonenkis (VC) Dimension and Distribution-Free Bounds
- Rademacher complexity yields distribution-dependent
generalization error bounds, ones where the size of the bound involves the true
data-generating distribution. We can sometimes calculate these; for example,
for linear models. But this requires mathematical insight, so sometimes we
want to use data-dependent generalization error bounds, such as those
coming from empirical Rademacher complexity. Moving in the other direction, we
can try to come up with bounds on the Rademacher complexity that holds
regardless of the data-generating distribution. For classification, this turns
out to involve answering the question "how many different ways could
our models classify n data points, in arbitrary position?" If this
number grows only polynomially with n, we get distribution-free
generalization error bounds. When this growth rate is polynomial, the order of
the polynomial is called the "Vapnik-Chervonenkis", or "VC" dimension. Finite
VC dimension implies that empirical risk minimization will converge on the
minimium risk under any distribution. (Mathematically: under all
distributions, the maximum deviation goes to zero.) If the VC dimension is
infinite, then there are at least some distributions where ERM
will not converge on the minimum risk. (Under some
distributions, the maximum deviation does not go to zero.) All of this
generalizes beyond binary classification. Distribution-free bounds tend to be
looser than distribution-dependent ones, but they may be easier to find; rich,
complex, flexible model classes have looser bounds than impoverish, simple,
limited model classes, but they also have better optimal risks.
- Slides (.Rmd)
Lecture 10 (Tuesday, 9 March):
Probably Approximately Correct Learning
- Review of the ground we've covered so far: we want to use limited training
data, drawn from some probability distribution, to select a strategy (decision
rule) which can be expected to perform well on new data from the same
distribution. "Perform well" here means that the risk (=expected loss) of our
selected strategy is not much higher than the risk of the optimal strategy in
our class or family of strategies. This estimation error or excess risk can be
bounded by the maximum deviation, which can be bounded (probabilistically) by
the quantities like Rademacher complexity and VC dimension. When the
complexity isn't too high, we can come arbitrarily close to the risk of optimal
strategy, with arbitrarily high confidence, by collecting enough data. (With a
given amount of data, we can put high-confidence limits on the estimation
error, and those bounds shrink as the sample size grows.) This is "risk
consistency" in statistics and decision theory. Essentially the same idea
was re-invented in computer science under the more illuminating name of
"probably approximately correct" (PAC) learning. The modern field of machine learning emerged from the realization (occasionally angry) that statisticians
and computer scientists were talking about the same thing in different
language, and could share each others' tools.
- Slides (.Rmd)
Lecture 11 (Thursday, 11 March):
Optimization and Its Algorithms
- Minimizing the empirical risk means solving an optimization problem. When
we get to learning methods other than ERM, they will involve optimization as
well. We have used some basic mathematical facts about optimization already
(especially in lectures 3 and 4), but now we are going to look at how to
actually do it. Nice optimization problems have a single global
optimum and no local optima, so that the rule "keep moving as long as things
are getting better" will eventually get us to the optimum.
Specific algorithms for finding optima rely on properties of the function we're
optimizing. We'll look at "gradient descent" (take a small step in the best
direction and repeat until we get there), "Newton's method" (do gradient
descent but take smaller steps when the function is curvier), "stochastic
gradient descent" (randomly sample data points to compute the function), and
"simulated annealing" (move randomly, but back-track if it makes things worse,
and get more inclined to back-track as you go along). No optimization
algorithm gives us the exact optimum, which in statistical learning introduces
an extra source of error, the "optimization error". Spending more computer
time reduces the optimization error, but past a certain point it gets small
compared to the estimation error, and isn't worth it. Sometimes quite crude
optimization is all we really need. It is annoying to have to think
about multiple different optimization algorithms, rather than having One Algorithm to Rule Them All, but this is partly because of genuine trade-offs of generality vs. computing time vs. optimization error, but also because of quite
fundamental limits which say that no algorithm can always out-perform
others across all problems (the "no free lunch theorems").
- Slides (.Rmd)
- Optional reading:
- (*) Léon Bottou and Olivier Bosquet, "The Tradeoffs of Large Scale Learning", pp. 351--368 in Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning (Cambridge, Massachusetts: MIT Press, 2012) [PDF preprint via Dr. Bottou]
- (***) Herbert Robbins and Sutton Monro, "A Stochastic Approximation Method", Annals of Mathematical Statistics 22 (1951): 400--407
- (**) David H. Wolpert and William G. Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1997): 67--82 [Preprint version, Santa Fe Institute Working Paper 95-02-010]
Lecture 12 (Tuesday, 16 March):
Regularization by Penalization and Constraints
- Penalized optimization. Intuitive justifications for penalties.
Qualitative effects of L1 and L2 penalties; ridge regression and lasso as
illustrations. Constrained optimization, and "mathematical programming".
Equivalence of penalties and constraints via Lagrange multipliers ("a fine is a
price"). Lagrange multipliers as prices. Some remarks on algorithms for
solving "convex program" problems. Time permitting: "Comrades, let's optimize!"
- Slides (.Rmd)
- Dan Klein, "Lagrange Multipliers without Permanent Scarring", online tutorial (2001)
- Optional readings:
- Francis Spufford, Red Plenty (London: Faber and Faber, 2010) [What happened the first time brilliant, idealistic people tried to use the power of data, computers and optimization to make the world a better place]
Lecture 13 (Thursday, 18 March):
Stability of learning
- So far, our generalization error bounds have only used properties of the
model class (for distribution-free, VC-style bounds), or properties of the
model class and the data-generating distribution (distribution-dependent,
Rademacher-style bounds). The actual fitting or estimation process hasn't
figured into our calculations. This is reflected in the way our definitions
involve a maximum (or minimum) over the entire model space. But the whole
model space may not be relevant; the fitting algorithm may only ever explore a
small part of it. A fitting algorithm is said to be "stable" if its output
changes little, or not at all, under various perturbations to its input.
Stability implies generalization error bounds, in ways we'll explore.
- Slides (.Rmd)
- Optional reading:
- (**) Olivier Bousquet and André Elisseeff, "Stability and Generalization", Journal of Machine Learning Research 2 (2002): 499--526
- (*) Pedro Domingos, "Process-Oriented Estimation of Generalization Error", pp. 714--719 in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence [IJCAI 1999] [Ungated copy via Prof. Domingos]
- (**) Michael J. Kearns and Dana Ron, "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation", Neural Computation 11 (1999): 1427--1453 [Ungated copy via Prof. Kearns]
- (***) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik
Sridharan, "Learnability, Stability and Uniform
of Machine Learning Research 11 (2010): 2635--2670
Lecture 14 (Tuesday, 23 March):
Regularization and Model Complexity
- For purposes of learning, the complexity of a class of models or strategies
is not about whether any particular function in the class is easy or hard to
describe or calculate or communicate. Rather, the complexity of a class of
models is about its diversity: how many distinct
or distinguishable behaviors can it realize on limited data? This is
also known as the "capacity" of a set of functions, and quantities like
Rademacher complexity, VC dimension, etc., are all ways of measuring the
complexity of a model class. The fundamental trade-off is that "rich",
high-capacity, complex models have lower approximation error, but
higher estimation error. Regularization has the effect of directly
lowering model complexity. (A small calculation for linear models and the
L2/ridge penalty illustrates this.) Regularization helps, i.e., reduces
over-all risk, when it lowers estimation error more than it increases
approximation error. But the increase in approximation error from limited
complexity is sample-size independent, O(1), while the reduction in estimation
error is shrinks with sample size, since estimation error does, it's o(1) and
typically O(n-0.5). A fixed level of regularization
therefore becomes counter-productive for sufficiently large samples. In the
"method of sieves", we try to finesse this, by relaxing the constraint as the
sample size grows; the trick is to relax at just the right rate to make sure
the sum of estimation and approximation errors goes to zero. (That is, don't
estimate more complex models until we have enough data to do so reliably.) In
the method of sieves, we fix how much regularization we want to do in advance
of looking at the data; we'll explore next time how to pick it in a
- Slides (with typo corrections and a bonus figure illustrating the approximation-estimation trade-off; .Rmd)
- Optional reading:
Lecture 15 (Thursday, 25 March):
Model Selection Based on Point Estimates of Risk
- When we regularize (whether using penalties or constraints), we're need to
decide how much to regularize. It'd be nice if we could do this in a
way informed by the data. This is an example of the more general problem
of model selection, of using the data to pick a model class, as well
as a precise model within that class. The goal of model selection should be,
for us, to pick a model class whose risk is close to that of the best model
class, i.e., best given the limited data available. A very basic way of doing
this is data splitting or "hold-out": divide the data at random into
independent training and selection sets, do ERM (or whatever) for each model
class on the training set, and pick the class whose fitted model performs best
on the selection set. The tools we've already developed let us prove that the
hold-out will select a model class whose ERM has a true risk close to the
lowest among all the ERMs, with bounds that tighten as the selection set grows.
The drawback of the holdout, aside from its sheer idiot simplicity, is that we
need to leave more and more data in the selection set and never use it for
estimation. Cross-validation tries to better this, by swapping data points in
and out of the role of "training" and "testing", and averaging. It's more
complicated to analyze than holdout, but results have the same flavor, with
better over-all performance. Leave-one-out cross-validation, the oldest form
of CV, has the best predictive performance, but is slower than things like
k-fold CV (because we need to fit n times). For some situations, like
regression with linear smoothers, there are clever tricks where we can
calculate the leave-one-out CV estimate while only fitting each model once. In
other situations, the large-sample estimate of the "optimism" from lecture 4
actually re-appears as an approximation to leave-one-out. Akaike's information
criterion, and many of the other variants on AIC, are in turn approximations to
the optimism under various assumptions.
- Slides (.Rmd)
Lecture 16 (Tuesday, 30 March):
Model Selection Based on Risk Bounds, a.k.a. Structural Risk Minimization
- All of the model selection criteria we examined last time can be seen as
doing penalized optimization, only now what are being penalized are the model
classes. This makes our earlier optimization theory from lecture 3 and 4 less
directly applicable, because we're optimizing over discrete,
qualitatively-distinct model classes, not a single continuous space of
functions. (We can't take a gradient with respect to the order of a
polynomial.) But we can, nonetheless, use penalties to discourage certain
features of the model class, unless really justified by improved
performance in the form of empirical risk. AIC penalizes the number of
parameters. The "optimism" penalizes imprecisely-estimated models.
Cross-validation and holdout penalize models which perform very differently on
different data sets. The idea penalty would in fact be exactly the deviation
between the risk and the empirical risk of the ERM. All of those
earlier-mentioned penalties can be seen as so many ways of giving point
estimates of the deviation. But generalization error bounds tell us that the
deviation is less than such-and-such a size with high probability. This leads
us (following V. N. Vapnik) to the principle of "structural risk minimization":
arrange the model classes in a nested hierarchy of increasingly large and
elaborate models, calculate the sum of the empirical risk and the
generalization error bound for each level in the hierarchy, and pick the model
class with the best (smallest) sum. That is, use the size of the
generalization error bound as the penalty. If we have an infinite hierarchy,
but use the same confidence level (= error probability)
throughout, some step in the hierarchy will break the bound, which
would be bad. The usual practice is thus to use only a limited number of
levels in the hierarchy, but also to let that number of levels grow as we get
more data. This allows the risk of our selected model to converge on the
lowest risk attainable anywhere in the hierarchy.
- Slides (.Rmd)
- Optional reading:
- (**) Sylvain Arlot, "V-fold cross-validation improved: V-fold
- (***) John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies", IEEE Transactions on Information Theory 44 (1998): 1926--1940
- (****) V. N. Vapnik, The Nature of Statistical Learning Theory (2nd ed., Berlin: Springer-Verlag, 2000), chapter 4
Lecture 17 (Thursday, 1 April):
- Model selection is inherently noisy, and often the difference between the
best model (by our criterion) and the not-quite-best models is small and more
or less due to chance. This motivates wanting to get away from having to
select a model, to somehow using an ensemble of
models at once. The most basic sort of ensemble method is model averaging.
Some basic algebra shows that the mean squared error of an ensemble average
equals the average MSE of the individual models in the ensemble, minus
the variance of predictions across the ensemble. Similar results hold for
other convex loss functions, such as the log probability loss. Thus the risk
of the ensemble equals the average risk of its members, minus the diversity of
predictions across the ensemble. Average individual skill, and group-level
diversity, make equal contributions to the performance of the ensemble. The
risk of an ensemble can thus be better than the risk of the any model
in the ensemble, if the models can compensate for each others' systematic
mistakes. There are, in practice, multiple ways of creating ensembles where
the individual models have not-too-bad individual risks and there's diversity
across the ensemble; these include weighting by in-sample performance (possibly
penalized), fitting multiple models to different parts or resamplings of the
data set, and creating a series of models which fit data points that the other
models handled poorly. One might worry that the average of many models is
itself a model of such high complexity that it will always overfit; this
doesn't happen because ensemble methods are algorithmically stable.
- Slides (.Rmd)
Lecture 18 (Tuesday, 6 April):
Kernel Machines I
- Motivation: using linear methods, but on nonlinear functions of the data.
The "kernel trick"; definition of a kernel function and characterization
of kernels. Kernel machines for classification and regression. Complexity
of kernel machines.
- Slides (.Rmd)
Lecture 19 (Thursday, 8 April): Canceled
Class canceled due to side effects of the professor getting vaccinated the day
- Homework 9 due
- Homework 10: assignment --- please note this is due on Tuesday, 13 April (and is correspondingly shortened)
Lecture 20 (Tuesday, 13 April):
Kernel Machines II
- Some practicalities of working with kernel machines: the implicit expansion
of kernel machines in terms of basis functions; regularizing
linear-in-the-basis-function models; visualizing the kernel matrix, and
visualizing the eigenvectors of the kernel matrix as approximating the
eigenfunctions. Kernel ridge regression. A worked example.
- Slides, .Rmd source file (which you will find useful for the homework)
NO CLASS on Thursday, 15 April
Lecture 21 (Tuesday, 20 April):
Kernel Machines III, Support Vector Machines
- Classification with kernel machines. It's hard to optimize the 0/1 loss,
so we turn to "surrogates" which are easier to work with. The "margin" of a
data point in a classification problem is how much the machine's output would
have to change to flip the classification; the margin of the machine is the
minimum margin over data points. A classifier which achieves a large margin
has put the boundary between classes far from all the data points, at least in
whatever implicit feature space it's using. For kernel machines, maximizing
the margin leads to an optimization problem which is computationally tractable,
and sets the weight on many data points to zero; the others are the "support
vectors". Moreover, the empirical Rademacher complexity of support vector
machines turns out to be inversely proportional to the margin we achieve, so
that large margins imply good generalization performance. Matters are a little
more complicated when we allow the machine the "slack" to make a certain number
of mistakes, but fundamentally similar.
- Slides (.Rmd)
Lecture 22 (Thursday, 22 April):
Random Feature Machines
- Kernels have many attractive properties, but the need to keep around all
the data points, and especially to work with the [n x n] kernel
matrix, creates a crucial bottleneck for large data. Random feature machines
first emerged as a way of making kernel machines more computationally
efficient: by decomposing kernels into sums (or integrals) of nonlinear
functions, and sampling from those functions, one can get a randomized
approximation to the kernel that uses only a small fraction of the
computational resources. This led to the idea of directly using random
features, without even trying to approximate a kernel. A surprisingly
small number of randomly-chosen nonlinear features can come remarkably
close to the optimal function, while still having good generalization
- Slides (.Rmd)
- Optional reading:
- Ali Rahimi and Benjamin Recht, "Random Features for Large-Scale Kernel Machines", pp. 1177--1184 in John C. Platt, Daphne Koller, Yoram Singer and Samuel T. Roweis (eds.), Advances in Neural Information Processing Systems 20 [NIPS 2007]
- Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning", pp. 1313--1320 in Daphne Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds.), Advances in Neural Information Processing Systems 21 [NIPS 2008]
- Ali Rahimi and Benjamin Recht, "Reflections on Random Kitchen Sinks", argmin blog, 5 December 2017
Lecture 23 (Tuesday, 27 April):
Low-regret learning I
- An introduction to the idea of regret, of low-regret learning, and mixtures of experts
- Slides (.Rmd)
Lecture 24 (Thursday, 29 April):
Low-regret learning II
- The risk properties of multiplicative weight training when the data are IID. Competing against sequences of experts, and growing ensembles
- Slides (.Rmd)
- Homework 12 due
- Homework 13: assignment, due on Sunday, 9 May 2021
Lecture 25 (Tuesday, 4 May):
Summary and Conclusions
- Recapitulation of the highlights of the course; what we haven't done.
Thursday, 6 May:
Since the professor is getting vaccinated.
Sunday, 9 May
Homework 13 is due at 6 pm