Conceptual Foundations of Statistical Learning

36-465/665, Spring 2021

Erecting the edifice of statistical learning theory

Cosma Shalizi
Tuesdays and Thursdays, 2:20--3:40 pm (Pittsburgh time), online only

This course is an introduction to the core ideas and theories of statistical learning, and their uses in designing and analyzing machine-learning systems. Statistical learning theory studies how to fit predictive models to training data, usually by solving an optimization problem, in such a way that the model will predict well, on average, on new data. The course will focus on the key concepts and theoretical tools, at a mathematical level accessible to students who have taken 36-401 (or equivalent) and its pre-requisites. The course will also illustrate those concepts and tools by applying them to carefully selected kinds of machine learning systems (such as kernel machines).

Pre-requisites

Undergraduates taking the course as 36-465 must have taken 36-401 with a grade of C or better; this ensures familiarity with both the theory and practice of linear regression modeling. 36-401 itself has a chain of pre-requisites: linear algebra, calculus with multiple variables, undergraduate probability, and mathematical statistics. We will be making extensive use of all of these. Graduate students taking the course as 36-665 have, formally speaking, no pre-requisites, but will be expected to have a preparation at least as thorough as the undergraduates.

Learning Objectives

By the end of this course, when students encounter a new prediction problem, they should be able to:

Analyze in terms of statistical decision theory, identifying the space of actions, the space of outcomes, the information set, and the loss function;
Determine whether a given combination of machines and estimation methods are suitable for the prediction problem;
Identify whether standard generalization error bounds apply to this prediction problem and learning method, and, if so, apply those results to give guarantees about the out-of-sample performance;
Interpret and compare error bounds arising from different techniques and assumptions.

When students encounter a new class of learning machines or statistical models, they should be able to:

Calculate standard measures of model complexity or flexibility, including the Rademacher complexity and the VC dimension;
Implement regularized and un-regularized optimization methods for fitting models from the class, and be able to calculate the effects of regularization on model complexity;
Apply standard heuristics to quantify the consequences of imperfect optimization on predictive performance;
Implement techniques for selecting among models.

36-465/665 vs. 36-462/662, 10-301/601 and 10-701

Students wanting exposure to a broad range of learning algorithms and their applications would be better served by other courses, especially 36-462/662 ("data mining", "methods of statistical learning"), 10-301/601 ("introduction to machine learning") or 10-701 ("introduction to machine learning" for Ph.D. students). This class is for those who want a deeper understanding of the underlying principles. It will mean a lot more math than coding, and it won't help you move up a leader-board, but it will help you understand the statistical reasons why learning machines work (when they do).

36-464 vs. 36-665

Undergraduates should register for the course as 36-465. Graduate students in statistics or other departments are welcome to take the course, but should register for it as 36-665. (Undergraduates registering for 36-465 will be dropped from the roster.)

Short Outline of Topics Covered

This is subject to revision.

Prediction as a decision problem; elements of decision theory; loss functions; examples of loss functions for classification and regression; "risk" defined as expected loss on new data; the goal is a low-risk prediction rule ("probably approximately correct", PAC).
Finding decision rules by optimizing the "empirical risk", or average loss on the training data; learning curves (risk vs. sample size); why the empirical risk of the estimated rule is lower than its true risk; the origin of over-fitting.
Mathematical tools for controlling the difference between empirical risk, or in-sample error, and true risk, or generalization error. Deviation inequalities, and their uniform counterparts, concentration inequalities. Measures of model complexity: Rademacher complexity, Vapnik-Chervonenkis (VC) dimension. Why counting parameters isn't enough (and is often irrelevant). "Stability" arguments.
Optimization: basics of optimization theory; asymptotics of optimization with small noise; "optimism" of optimization; adding penalties or constraints; why penalties and constraints are equivalent; how penalties "regularize" unstable optimization problems; effects of regularization on model complexity
Model selection: comparisons based on estimated generalization error; based on bounds on generalization error ("structural risk minimization"); avoiding selection by model averaging and ensemble methods.
Application to kernel machines: kernel methods for classification; kernel methods for regression. Risk bounds for kernel methods. Large-margin classifiers. Support vector machines.
Application to random feature machines ("kitchen sinks") for classification and regression.
Application to mixture models for density estimation.
Extra topics, time permitting: predicting stochastic processes; sequential decision-making and reinforcement learning; low-regret ("on-line" or "adversarial") learning.

Course Mechanics

Lectures and Remote-Only Instruction

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. You will usually find the readings more rewarding if you do the readings before lecture, rather than after (or during). Since this is an online-only class this semester, lectures will be held via Zoom; the link for each session will be on Canvas. I know that the class time is late at night or early in the morning for some of you; I nonetheless urge you to come to class and participate.

No Recordings: I will not be recording lectures. This is because the value of class meetings lies precisely in your chance to ask questions, discuss, and generally interact. (Otherwise, you could just read a book.) Recordings interfere with this in two ways:

They tempt you to skip class and/or to zone out and/or try to multi-task during it. (Nobody, not even you, is really any good at multi-tasking.) Even if you do watch the recording later, you will not learn as much from it as if you had attended in the first place.
People are understandably reluctant to participate when they know they're being recorded. (It's only too easy to manipulate recordings to make anyone seem dumb and/or obnoxious.) Maybe this doesn't bother you; it doesn't bother me, much, because I'm protected by academic freedom and by tenure, but a good proportion of your classmates won't participate if they're being recorded, and that diminishes the value of the class for everyone.

Recording someone without their permission is illegal in many places, and more importantly is unethical everywhere, so don't make your own recordings of the class.

(Taking notes during class is fine and I strongly encourage it; taking notes forces you to think about what you are hearing and how to organize it, which helps you understand and remember the content.)

Textbooks

To be determined (as of November 2020). The following books are strongly recommended, and I may well end up picking one of them to be the textbook; if so, I will decide well before the beginning of classes.

Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning
Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory
Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems

Assignments

There are three reasons you will get assignments in this course. In order of decreasing importance:

Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows what you only thought you understood.
Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be two kinds of assignment in this course.

After-class comprehension questions and exercises: Following every lecture, there will be a brief set of questions about the material covered in lecture. Sometimes these will be about specific points in the lecture, sometimes about specific aspects of the reading assigned to go with the lecture. These will be done on Canvas and will be due the day after each lecture. These should take no more than 10 minutes, but will be untimed (so no accommodations for extra time are necessary). If the questions ask you to do any math (and not all of them will!), a scan or photograph of hand-written math is OK, so long as the picture is clearly legible. (Black ink or dark pencil on white paper helps.)
Homework: Most weeks will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve statistical theory, analyzing real data sets on the computer, and communicating the results.; All homework will be submitted electronically through Gradescope/Canvas. Most weeks, homework will be due at 6:00 pm on Thursdays (Pittsburgh time). There will be a few weeks, clearly noted on the syllabus and on the assignments, when this won't be the case. When this results in less than seven days between an assignment's due date and the previous due date, the homework will be shortened.

Time Expectations

You should expect to spend 5--7 hours on assignments every week, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

After-class questions: 10%. All sets of questions will have equal weight. The lowest 4 will be dropped, no questions asked.
Homework: 90%. All homeworks will have equal weight. Your lowest 3 homework grades will be dropped, no questions asked. If you turn in all homework assignments on time, for a grade of at least 60% (each), your lowest four homework grades will be dropped. Late homework will not be accepted for any reason.

Grade boundaries will be as follows:

A [90, 100]

B [80, 90)

C [70, 80)

D [60, 70)

R < 60

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think that particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, that you need a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Canvas and Piazza

Homework will be submitted electronically through Gradescope/Canvas. Canvas will also be used for the after-class questions, as a calendar showing all assignments and their due-dates, to distribute some readings, and as the official gradebook.

We will be using Piazza for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. Anonymity will go away for everyone if it is abused. During Piazza office hours, someone will be online to respond to questions (and follow-ups) in real time. You are welcome to post at any time, but outside of normal working hours you should expect that the instructors have lives.

Office Hours

Tuesday 3:40--4:40	Prof. Shalizi	Zoom (same link as lecture)
Wednesday 2:00--3:00	Prof. Shalizi	Piazza
Thursday 1:20--2:20	Mr. Luo	Piazza

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbooks and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss homework with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students, the instructors, or your teachers in other current classes at CMU, until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course, or in other courses, in previous semesters is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will never be a penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how to rectify matters. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

On the first day of class, every student will receive a written copy of the university's policy on academic integrity, a written copy of these course policies, and a "homework 0" on the content of these policies. This assignment will not factor into your grade, but you must complete it before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate with me.

Inclusion and Respectful Participation

The university is a community of scholars, that is, of people seeking knowledge. All of our accumulated knowledge has to be re-learned by every new generation of scholars, and re-tested, which requires debate and discussion. Everyone enrolled in the course has a right to participate in the class discussions. This doesn't mean that everything everyone says is equally correct or equally important, but does mean that everyone needs to be treated with respect as persons, and criticism and debate should be directed at ideas and not at people. Don't dismiss (or credit) anyone in the course because of where they come from, and don't use your participation in the class as a way of shutting up others. Don't be rude, and don't go looking for things to be offended by. While statistical learning theory doesn't usually lead to heated debate, some of the subjects we'll be applying it to might. If someone else is saying something you think is really wrong-headed, and you think it's important to correct it, address why it doesn't make sense, and listen if they give a counter-argument.

The classroom is not a democracy; as the teacher, I have the right and the responsibility to guide the discussion in what I judge are productive directions. This may include shutting down discussions which are not helping us learn about statistics, even if those discussions might be important to have elsewhere. (You can have them elsewhere.) I will do my best to guide the course in a way which respects everyone's dignity as a human being, as a scholar, and as a member of the university.

Detailed course calendar

Using concentration inequalities to build a barrier against fluctuations and over-fitting

Links to lecture notes, assignments, etc., will go here as they become relevant.

Note that the order of topics after mixture models is somewhat tentative; any or all of them might be dropped, if earlier ones end up taking longer than expected, and/or replaced by others that seem more relevant.

Lecture 1 (Tuesday, 2 February): Introduction to the course

Slides (.Rmd source file for generating the slides)
Homework:
- Homework 0 assigned (on Canvas)
- Homework 1: assignment

Lecture 2 (Thursday, 4 February): Prediction as a Decison Problem

Kinds of prediction: regression (guessing a quantitative variable), classification (guessing a categorical variable), ranking (guessing an ordinal variable), density estimation (guessing a probability distribution), etc. "What prediction should I make for Y when I've seen that X=x?" is a decision problem. The elements of decision theory: actions, states, information. "Loss functions" say how bad a a given action is in a given state. A "strategy" is a rule which tells us what action to take on the basis of the available information. "Risk" is the expected loss incurred from following a strategy. Risk minimization as a guide to decision making. Illustration with least-squares regression and especially linear regression. Other losses for regression; some losses for classification. Asides (time permitting): alternatives to minimizing risk, such as minimizing the maximum loss; where did decision theory come from?
Slides (.Rmd)

Lecture 3 (Tuesday, 9 February): Empirical Risk Minimization

The goal of decision theory is to find the strategy (from the "feasible" set of strategies) which minimizes the risk, but "risk" is defined using the true data-generating distribution, which we do not know. The "empirical risk" of a strategy is simply its mean loss on the training data, and "empirical risk minimization" is selecting the (feasible) strategy with the lowest mean loss on the training data. (Examples: least squares, maximum likelihood.) This is an optimization problem. By using the law of large numbers, we can say something about which empirical risk minimization will converge on the strategy that minimizes the true risk. By using the central limit theorem (and waving our hands a lot), we can say something about how the empirical risk minimizer has Gaussian fluctuations around the true risk minimizer.
Slides (.Rmd)

Lecture 4 (Thursday, 11 February): Optimism and Over-Fitting

The empirical risk of any fixed strategy is an unbiased estimate of its true risk. But the empirical risk of the empirical risk minimizer is a negatively biased estimate of its true risk. ("The winner's curse".) "Optimism" refers to this negative bias. The same rough calculations which told us last time about how ERM works this time tell us about the magnitude of the optimism. "Over-fitting" happens because larger, more flexible models (which have less approximation error) are more prone to optimism.
Slides (.Rmd)
Homework:
- Homework 0 due
- Homework 1 due
- Homework 2 assigned

Lecture 5 (Tuesday, 16 February): Deviation Bounds of Probability Theory I: Markov's Inequality and Its Descendants

A random variable can't be small, on average, if large values are highly probable. Markov's inequality turns this around, and uses the expected value to upper-bound tail probabilities. Applying increasing functions to both sides of the inequality lets us breed new inequalities from old ones. Chebyshev's inequality bounds deviations from the expected value in terms of the variance. This let Chebyshev, and us, prove the law of large numbers. Chebyshev's form of the LLN implies that the probability of a large deviation goes to zero inversely with the number of samples (if not faster). Using the exponential Markov inequality, one of several related results called "the Chernoff bound", gives deviation probabilities that go to zero exponentially fast. This requires knowing, or at least controlling, the moment generating function, and leads to the idea of "sub-Gaussian" random variables, essentially ones where the law of large numbers converges at least as fast as for Gaussians.
Back-up topics, skipped in the interest of time but at the end of the slides: lower bounds on deviation probabilities from the Paley-Zygmund inequality; large deviations theory and upper and lower bounds going to zero at the same exponential rate; deriving the central limit theorem with moment generating functions; cumulant generating functions; what can happen when a single variable can cause a large deviation in an average.
Slides (.Rmd)

Lecture 6 (Thursday, 18 February): Deviation Bounds II: To the Bounded-Difference Inequality

Continuing with the exponential Markov inequality: The moment generating function for the sum of two independent random variables is the product of their moment generating functions. This implies that the sum of independent sub-Gaussian random variables is also sub-Gaussian, and lets us bound their sums and their averages. A particularly important class of sub-Gaussian random variables turns out to be those with a finite range; "Hoeffding's inequality" shows that these are sub-Gaussian and controls their moment generating functions in terms of the length of the range. Immediate corollaries are deviation inequalities for sums and averages of bounded random variables. (These are also called "Hoeffding's inequality", ~~to cause confusion~~ because the results are so closely related.) One application of Hoeffding's inequality is to control the empirical risk, under the log loss, of binary classification, showing that it's close to the true risk. Turning to functions which are not just sums or averages, we introduce the "bounded difference property", which says that changing one argument of a function at a time can only lead to a limited change in the value of the function. If the arguments are independent, this limits the variance (via the Efron-Stein inequality, which doesn't strictly require the bounded-difference property). The McDiarmid inequality, often also called the bounded difference inequality, is an exponential deviation bound for functions with the bounded difference property. As an illustration, this gives us deviation inequalities for the difference between the true cumulative distribution function and the empirical or sample CDF, even though the empirical CDF is not an average of the data points.
Slides (.Rmd)
Handout with the gory/boring mathematical details skipped in lecture: PDF (.Rmd)
Homework:
- Homework 2 due
- Homework 3: assignment

NO CLASS on Tuesday, 23 February

This is a university break day.

Lecture 7 (Thursday, 25 February) From Deviation Inequalities to Generalization Error Bounds via Uniform Convergence

We use data to estimate what strategy to follow. Typically we do this by empirical risk minimization. We know the empirical risk of the estimated strategy, but what we'd really like to know is how close the true risk is to the empirical risk ("generalization error"), and how close the true risk of the estimated strategy is to the risk of the optimal strategy ("estimation error"). The deviation bounds we've learned how to use are not quite enough to get at this, because the estimated strategy is, itself, a random, data-dependent object. One way to control both generalization error and estimation error is to move from "universal" convergence (having some deviation bound for every strategy's risk) to "uniform" convergence (have bounds on the maximum deviation). If the set of strategies was finite, a simple union bound would accomplish this. For infinite spaces, the basic approach is to approximate or "cover" the strategy space by a finite set of strategies, apply the union bound, and handle the extra slop due to the approximation. While theoretically fundamental, in practice it can be better to show that the maximum deviation concentrates on its expected value, and then control the maximum expected deviation using properties of the model space; this approach will occupy us next week.
Slides (.Rmd)
Homework:
- Homework 3 due
- Homework 4: assignment

Lecture 8 (Tuesday, 2 March): Rademacher Complexity and Our First Error Bounds

Controlling the maximum deviation by controlling the discrepancy between two independent realizations of the data. Derivation of the Rademacher complexity. Interpretation of the Rademacher complexity: how well could this model class (seem to) correlate with random noise? Bounds in terms of the Rademacher complexity. Using empirical Rademacher complexity to estimate Rademacher complexity (while still getting a valid generalization error bound). Using simulation to estimate empirical Rademacher complexity.
Slides (correcting some backwards inequalities; .Rmd)
Reading
- Optional readings:
  - (**) Peter L. Bartlett and Shahar Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results", Journal of Machine Learning Research 3 (2002): 463--482
  - (*) Xiaojin Zhu, Timothy Rogers and Bryan Gibson, "Human Rademacher Complexity", pp. 2322--2330 in Y. Bengio, D. Schuurmans, John Lafferty, C. K. I. Williams and A. Culotta (eds.), Advances in Neural Information Processing Systems 22 (NIPS 2009) [Commentary]

Lecture 9 (Thursday, 4 March): Vapnik-Chervonenkis (VC) Dimension and Distribution-Free Bounds

Rademacher complexity yields distribution-dependent generalization error bounds, ones where the size of the bound involves the true data-generating distribution. We can sometimes calculate these; for example, for linear models. But this requires mathematical insight, so sometimes we want to use data-dependent generalization error bounds, such as those coming from empirical Rademacher complexity. Moving in the other direction, we can try to come up with bounds on the Rademacher complexity that holds regardless of the data-generating distribution. For classification, this turns out to involve answering the question "how many different ways could our models classify n data points, in arbitrary position?" If this number grows only polynomially with n, we get distribution-free generalization error bounds. When this growth rate is polynomial, the order of the polynomial is called the "Vapnik-Chervonenkis", or "VC" dimension. Finite VC dimension implies that empirical risk minimization will converge on the minimium risk under any distribution. (Mathematically: under all distributions, the maximum deviation goes to zero.) If the VC dimension is infinite, then there are at least some distributions where ERM will not converge on the minimum risk. (Under some distributions, the maximum deviation does not go to zero.) All of this generalizes beyond binary classification. Distribution-free bounds tend to be looser than distribution-dependent ones, but they may be easier to find; rich, complex, flexible model classes have looser bounds than impoverish, simple, limited model classes, but they also have better optimal risks.
Slides (.Rmd)
Homework:
- Homework 4 due
- Homework 5: assignment, data-05.csv data file

Lecture 10 (Tuesday, 9 March): Probably Approximately Correct Learning

Review of the ground we've covered so far: we want to use limited training data, drawn from some probability distribution, to select a strategy (decision rule) which can be expected to perform well on new data from the same distribution. "Perform well" here means that the risk (=expected loss) of our selected strategy is not much higher than the risk of the optimal strategy in our class or family of strategies. This estimation error or excess risk can be bounded by the maximum deviation, which can be bounded (probabilistically) by the quantities like Rademacher complexity and VC dimension. When the complexity isn't too high, we can come arbitrarily close to the risk of optimal strategy, with arbitrarily high confidence, by collecting enough data. (With a given amount of data, we can put high-confidence limits on the estimation error, and those bounds shrink as the sample size grows.) This is "risk consistency" in statistics and decision theory. Essentially the same idea was re-invented in computer science under the more illuminating name of "probably approximately correct" (PAC) learning. The modern field of machine learning emerged from the realization (occasionally angry) that statisticians and computer scientists were talking about the same thing in different language, and could share each others' tools.
Slides (.Rmd)

Lecture 11 (Thursday, 11 March): Optimization and Its Algorithms

Minimizing the empirical risk means solving an optimization problem. When we get to learning methods other than ERM, they will involve optimization as well. We have used some basic mathematical facts about optimization already (especially in lectures 3 and 4), but now we are going to look at how to actually do it. Nice optimization problems have a single global optimum and no local optima, so that the rule "keep moving as long as things are getting better" will eventually get us to the optimum. Specific algorithms for finding optima rely on properties of the function we're optimizing. We'll look at "gradient descent" (take a small step in the best direction and repeat until we get there), "Newton's method" (do gradient descent but take smaller steps when the function is curvier), "stochastic gradient descent" (randomly sample data points to compute the function), and "simulated annealing" (move randomly, but back-track if it makes things worse, and get more inclined to back-track as you go along). No optimization algorithm gives us the exact optimum, which in statistical learning introduces an extra source of error, the "optimization error". Spending more computer time reduces the optimization error, but past a certain point it gets small compared to the estimation error, and isn't worth it. Sometimes quite crude optimization is all we really need. It is annoying to have to think about multiple different optimization algorithms, rather than having One Algorithm to Rule Them All, but this is partly because of genuine trade-offs of generality vs. computing time vs. optimization error, but also because of quite fundamental limits which say that no algorithm can always out-perform others across all problems (the "no free lunch theorems").
Slides (.Rmd)
Reading:
- Optional reading:
  - (*) Léon Bottou and Olivier Bosquet, "The Tradeoffs of Large Scale Learning", pp. 351--368 in Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning (Cambridge, Massachusetts: MIT Press, 2012) [PDF preprint via Dr. Bottou]
  - (***) Herbert Robbins and Sutton Monro, "A Stochastic Approximation Method", Annals of Mathematical Statistics 22 (1951): 400--407
  - (**) David H. Wolpert and William G. Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1997): 67--82 [Preprint version, Santa Fe Institute Working Paper 95-02-010]
Homework:
- Homework 5 due
- Homework 6: assignment

Lecture 12 (Tuesday, 16 March): Regularization by Penalization and Constraints

Penalized optimization. Intuitive justifications for penalties. Qualitative effects of L1 and L2 penalties; ridge regression and lasso as illustrations. Constrained optimization, and "mathematical programming". Equivalence of penalties and constraints via Lagrange multipliers ("a fine is a price"). Lagrange multipliers as prices. Some remarks on algorithms for solving "convex program" problems. Time permitting: "Comrades, let's optimize!"
Slides (.Rmd)
Reading:
- Dan Klein, "Lagrange Multipliers without Permanent Scarring", online tutorial (2001)
- Optional readings:
  - Francis Spufford, Red Plenty (London: Faber and Faber, 2010) [What happened the first time brilliant, idealistic people tried to use the power of data, computers and optimization to make the world a better place]

Lecture 13 (Thursday, 18 March): Stability of learning

So far, our generalization error bounds have only used properties of the model class (for distribution-free, VC-style bounds), or properties of the model class and the data-generating distribution (distribution-dependent, Rademacher-style bounds). The actual fitting or estimation process hasn't figured into our calculations. This is reflected in the way our definitions involve a maximum (or minimum) over the entire model space. But the whole model space may not be relevant; the fitting algorithm may only ever explore a small part of it. A fitting algorithm is said to be "stable" if its output changes little, or not at all, under various perturbations to its input. Stability implies generalization error bounds, in ways we'll explore.
Slides (.Rmd)
Reading:
- Optional reading:
  - (**) Olivier Bousquet and André Elisseeff, "Stability and Generalization", Journal of Machine Learning Research 2 (2002): 499--526
  - (*) Pedro Domingos, "Process-Oriented Estimation of Generalization Error", pp. 714--719 in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence [IJCAI 1999] [Ungated copy via Prof. Domingos]
  - (**) Michael J. Kearns and Dana Ron, "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation", Neural Computation 11 (1999): 1427--1453 [Ungated copy via Prof. Kearns]
  - (***) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik Sridharan, "Learnability, Stability and Uniform Convergence", Journal of Machine Learning Research 11 (2010): 2635--2670
Homework:
- Homework 6 due
- Homework 7: assignment

Lecture 14 (Tuesday, 23 March): Regularization and Model Complexity

For purposes of learning, the complexity of a class of models or strategies is not about whether any particular function in the class is easy or hard to describe or calculate or communicate. Rather, the complexity of a class of models is about its diversity: how many distinct or distinguishable behaviors can it realize on limited data? This is also known as the "capacity" of a set of functions, and quantities like Rademacher complexity, VC dimension, etc., are all ways of measuring the complexity of a model class. The fundamental trade-off is that "rich", high-capacity, complex models have lower approximation error, but higher estimation error. Regularization has the effect of directly lowering model complexity. (A small calculation for linear models and the L2/ridge penalty illustrates this.) Regularization helps, i.e., reduces over-all risk, when it lowers estimation error more than it increases approximation error. But the increase in approximation error from limited complexity is sample-size independent, O(1), while the reduction in estimation error is shrinks with sample size, since estimation error does, it's o(1) and typically O(n^-0.5). A fixed level of regularization therefore becomes counter-productive for sufficiently large samples. In the "method of sieves", we try to finesse this, by relaxing the constraint as the sample size grows; the trick is to relax at just the right rate to make sure the sum of estimation and approximation errors goes to zero. (That is, don't estimate more complex models until we have enough data to do so reliably.) In the method of sieves, we fix how much regularization we want to do in advance of looking at the data; we'll explore next time how to pick it in a data-dependent way.
Slides (with typo corrections and a bonus figure illustrating the approximation-estimation trade-off; .Rmd)
Optional reading:
- (**) Stuart Geman and Chii-Ruey Hwang, "Nonparametric Maximum Likelihood Estimation by the Method of Sieves", Annals of Statistics 10 (1982): 401--414

Lecture 15 (Thursday, 25 March): Model Selection Based on Point Estimates of Risk

When we regularize (whether using penalties or constraints), we're need to decide how much to regularize. It'd be nice if we could do this in a way informed by the data. This is an example of the more general problem of model selection, of using the data to pick a model class, as well as a precise model within that class. The goal of model selection should be, for us, to pick a model class whose risk is close to that of the best model class, i.e., best given the limited data available. A very basic way of doing this is data splitting or "hold-out": divide the data at random into independent training and selection sets, do ERM (or whatever) for each model class on the training set, and pick the class whose fitted model performs best on the selection set. The tools we've already developed let us prove that the hold-out will select a model class whose ERM has a true risk close to the lowest among all the ERMs, with bounds that tighten as the selection set grows. The drawback of the holdout, aside from its sheer idiot simplicity, is that we need to leave more and more data in the selection set and never use it for estimation. Cross-validation tries to better this, by swapping data points in and out of the role of "training" and "testing", and averaging. It's more complicated to analyze than holdout, but results have the same flavor, with better over-all performance. Leave-one-out cross-validation, the oldest form of CV, has the best predictive performance, but is slower than things like k-fold CV (because we need to fit n times). For some situations, like regression with linear smoothers, there are clever tricks where we can calculate the leave-one-out CV estimate while only fitting each model once. In other situations, the large-sample estimate of the "optimism" from lecture 4 actually re-appears as an approximation to leave-one-out. Akaike's information criterion, and many of the other variants on AIC, are in turn approximations to the optimism under various assumptions.
Slides (.Rmd)
Homework:
- Homework 7 due
- Homework 8: assignment, anomaly.csv data file for Q3

Lecture 16 (Tuesday, 30 March): Model Selection Based on Risk Bounds, a.k.a. Structural Risk Minimization

All of the model selection criteria we examined last time can be seen as doing penalized optimization, only now what are being penalized are the model classes. This makes our earlier optimization theory from lecture 3 and 4 less directly applicable, because we're optimizing over discrete, qualitatively-distinct model classes, not a single continuous space of functions. (We can't take a gradient with respect to the order of a polynomial.) But we can, nonetheless, use penalties to discourage certain features of the model class, unless really justified by improved performance in the form of empirical risk. AIC penalizes the number of parameters. The "optimism" penalizes imprecisely-estimated models. Cross-validation and holdout penalize models which perform very differently on different data sets. The idea penalty would in fact be exactly the deviation between the risk and the empirical risk of the ERM. All of those earlier-mentioned penalties can be seen as so many ways of giving point estimates of the deviation. But generalization error bounds tell us that the deviation is less than such-and-such a size with high probability. This leads us (following V. N. Vapnik) to the principle of "structural risk minimization": arrange the model classes in a nested hierarchy of increasingly large and elaborate models, calculate the sum of the empirical risk and the generalization error bound for each level in the hierarchy, and pick the model class with the best (smallest) sum. That is, use the size of the generalization error bound as the penalty. If we have an infinite hierarchy, but use the same confidence level (= error probability) throughout, some step in the hierarchy will break the bound, which would be bad. The usual practice is thus to use only a limited number of levels in the hierarchy, but also to let that number of levels grow as we get more data. This allows the risk of our selected model to converge on the lowest risk attainable anywhere in the hierarchy.
Slides (.Rmd)
Optional reading:
- (**) Sylvain Arlot, "V-fold cross-validation improved: V-fold penalization", arxiv:0802.0566
- (***) John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies", IEEE Transactions on Information Theory 44 (1998): 1926--1940
- (****) V. N. Vapnik, The Nature of Statistical Learning Theory (2nd ed., Berlin: Springer-Verlag, 2000), chapter 4

Lecture 17 (Thursday, 1 April): Model Averaging

Model selection is inherently noisy, and often the difference between the best model (by our criterion) and the not-quite-best models is small and more or less due to chance. This motivates wanting to get away from having to select a model, to somehow using an ensemble of models at once. The most basic sort of ensemble method is model averaging. Some basic algebra shows that the mean squared error of an ensemble average equals the average MSE of the individual models in the ensemble, minus the variance of predictions across the ensemble. Similar results hold for other convex loss functions, such as the log probability loss. Thus the risk of the ensemble equals the average risk of its members, minus the diversity of predictions across the ensemble. Average individual skill, and group-level diversity, make equal contributions to the performance of the ensemble. The risk of an ensemble can thus be better than the risk of the any model in the ensemble, if the models can compensate for each others' systematic mistakes. There are, in practice, multiple ways of creating ensembles where the individual models have not-too-bad individual risks and there's diversity across the ensemble; these include weighting by in-sample performance (possibly penalized), fitting multiple models to different parts or resamplings of the data set, and creating a series of models which fit data points that the other models handled poorly. One might worry that the average of many models is itself a model of such high complexity that it will always overfit; this doesn't happen because ensemble methods are algorithmically stable.
Slides (.Rmd)
Homework:
- Homework 8 due
- Homework 9: assignment

Lecture 18 (Tuesday, 6 April): Kernel Machines I

Motivation: using linear methods, but on nonlinear functions of the data. The "kernel trick"; definition of a kernel function and characterization of kernels. Kernel machines for classification and regression. Complexity of kernel machines.
Slides (.Rmd)

Lecture 19 (Thursday, 8 April): Canceled

Class canceled due to side effects of the professor getting vaccinated the day before.

Homework:

Homework 9 due
Homework 10: assignment --- please note this is due on Tuesday, 13 April (and is correspondingly shortened)

Lecture 20 (Tuesday, 13 April): Kernel Machines II

Some practicalities of working with kernel machines: the implicit expansion of kernel machines in terms of basis functions; regularizing linear-in-the-basis-function models; visualizing the kernel matrix, and visualizing the eigenvectors of the kernel matrix as approximating the eigenfunctions. Kernel ridge regression. A worked example.
Slides, .Rmd source file (which you will find useful for the homework)
Homework:
- Homework 10 due (note date)
- Homework 11: assignment, COMPAS data set, file giving row numbers to use as the training set

NO CLASS on Thursday, 15 April

Spring Carnival.

Lecture 21 (Tuesday, 20 April): Kernel Machines III, Support Vector Machines

Classification with kernel machines. It's hard to optimize the 0/1 loss, so we turn to "surrogates" which are easier to work with. The "margin" of a data point in a classification problem is how much the machine's output would have to change to flip the classification; the margin of the machine is the minimum margin over data points. A classifier which achieves a large margin has put the boundary between classes far from all the data points, at least in whatever implicit feature space it's using. For kernel machines, maximizing the margin leads to an optimization problem which is computationally tractable, and sets the weight on many data points to zero; the others are the "support vectors". Moreover, the empirical Rademacher complexity of support vector machines turns out to be inversely proportional to the margin we achieve, so that large margins imply good generalization performance. Matters are a little more complicated when we allow the machine the "slack" to make a certain number of mistakes, but fundamentally similar.
Slides (.Rmd)

Lecture 22 (Thursday, 22 April): Random Feature Machines

Kernels have many attractive properties, but the need to keep around all the data points, and especially to work with the [n x n] kernel matrix, creates a crucial bottleneck for large data. Random feature machines first emerged as a way of making kernel machines more computationally efficient: by decomposing kernels into sums (or integrals) of nonlinear functions, and sampling from those functions, one can get a randomized approximation to the kernel that uses only a small fraction of the computational resources. This led to the idea of directly using random features, without even trying to approximate a kernel. A surprisingly small number of randomly-chosen nonlinear features can come remarkably close to the optimal function, while still having good generalization performance.
Slides (.Rmd)
Reading:
- Optional reading:
  - Ali Rahimi and Benjamin Recht, "Random Features for Large-Scale Kernel Machines", pp. 1177--1184 in John C. Platt, Daphne Koller, Yoram Singer and Samuel T. Roweis (eds.), Advances in Neural Information Processing Systems 20 [NIPS 2007]
  - Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning", pp. 1313--1320 in Daphne Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds.), Advances in Neural Information Processing Systems 21 [NIPS 2008]
  - Ali Rahimi and Benjamin Recht, "Reflections on Random Kitchen Sinks", argmin blog, 5 December 2017
Homework:
- Homework 11 due
- Homework 12: Assignment

Lecture 23 (Tuesday, 27 April): Low-regret learning I

An introduction to the idea of regret, of low-regret learning, and mixtures of experts
Slides (.Rmd)

Lecture 24 (Thursday, 29 April): Low-regret learning II

The risk properties of multiplicative weight training when the data are IID. Competing against sequences of experts, and growing ensembles
Slides (.Rmd)
Homework:
- Homework 12 due
- Homework 13: assignment, due on Sunday, 9 May 2021

Lecture 25 (Tuesday, 4 May): Summary and Conclusions

Recapitulation of the highlights of the course; what we haven't done.

Thursday, 6 May: NO CLASS

Since the professor is getting vaccinated.

Sunday, 9 May

Homework 13 is due at 6 pm