Statistical learning is about finding reliable, predictive patterns in data. That may sound arcane, but actually it shapes your world. What you see in your social media feed, what ads you get shown, whether you can get a credit card or a home mortgage, are all decided by statistical learning. Even crucial decisions about hiring, finance, medical care, criminal justice and literally life-or-death military actions are increasingly made using statistical learning. This course will be a fast-paced, hands-on introduction to the most important methods used in statistical learning, covering how and why the methods work (not just “what do I type in R?”, but that too), how they’re used in decision-making, how to evaluate them in practice, and how to use them responsibly. There will be a lot of data analysis, a fair amount of programming, some math (but not too much) and some reading (and thinking about what you’ve read).
After taking the class, when you’re faced with a new applied statistical-learning problem, you should be able to
For this semester, the prerequisite is passing 36-401 (modern regression) with a grade of C or better. The only exceptions are for graduate students taking the graduate section (36-662; see below).
This course is primarily intended as an advanced elective for undergraduate students majoring in statistics, and for graduate students in the master of statistical practice (MSP) degree. Graduate students from other departments are also welcome, if there’s room. All graduate students must take the course as 36-662. (Those who register for 36-462 will be dropped.) Formally, 662 has no prerequisites, but students are expected to have a background at least equivalent to 36-401 and its prerequisites: linear regression modeling, probability, mathematical statistics, linear algebra, calculus, and competence in using R for data analysis. This class does not have the resources to help those without this background catch up.
Undergraduates must take the course as 36-462; those who register for 662 will be dropped.
36-462/662 is intended as a practical course on methods and methodology, covering a lot of different techniques, the principles used to choose between techniques, and selected applications. 36-465/665, “Conceptual Foundations of Statistical Learning”, is more focused on the theoretical foundations common to all methods of statistical learning. If you are debating between this course and something from another department (e.g., 10-601), you may find the detailed topic list below helpful.
| Name | Office | ||
|---|---|---|---|
| Professor | Dr. Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C | 
| Teaching assistants | Mr. Raghav Bansal | not to be bothered with emails | not to be bothered in their offices either | 
| Ms. Victoria Lin | |||
| Mr. Luca Masserano | 
There are three reasons you will get assignments in this course. In order of decreasing importance:
To serve these goals, there will be three kinds of assignment in this course.
There will be no exams.
You should expect to spend 5–7 hours on assignments every week, averaging over the semester. (This follows from the university’s rules about how course credits translate into hours of student time.) If you find yourself spending much more time than that on the class, please come to talk to me.
Grades will be broken down as follows:
Dropping your lowest grades lets you schedule interviews, social engagements, and other non-academic uses of your time flexibly, and without my having to decide what is or is not important enough to change the rules. If something — illness, family crisis, or anything else — is interfering with your ability to do the work of the class, come talk to me.
Grade boundaries will be as follows:
| Grade | Score range | 
|---|---|
| A | [90, 100] | 
| B | [80, 90) | 
| C | [70, 80) | 
| D | [60, 70) | 
| R | [0, 60) | 
To be fair to everyone, these boundaries will be held to strictly.
If you think that particular assignment was wrongly graded, tell me as soon as possible. Direct questions or complaints about your grades to me; the TAs have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve or need a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.
As a final word of advice about grading, “what is the least amount of work I need to do in order to get the grade I want?” is a much worse way to approach higher education than “how can I learn the most from this class and from my teachers?”.
Lectures will amplify the readings, provide examples and demos, and answer questions and generally discuss the material. They are also when you will do the graded in-class assignments which will help consolidate your understanding of the material, and help with your homework.
There are (up to) three kinds of readings associated with each lecture.
TL;DR: definitely do the key readings before class; try to do the suggested readings too, they’re easy; the background readings are if you’re thirsty to know more.
When we meet in person, please don’t use any electronic devices during lecture: no laptops, no tablets, no phones, no recording devices, no watches that do more than tell time. The only exception to this rule is for electronic assistive devices, properly documented with CMU’s Office of Disability Resources.
(The no-electronics rule is not arbitrary meanness on my part. Experiments show, pretty clearly, that students learn more in electronics-free classrooms, not least because your device isn’t distracting your neighbors, who aren’t as good at multitasking as you are.)
We’re going to be meeting on Zoom for the first two weeks of the semester. Class will still meet at our regular time. You will find a link to our Zoom meetings in the class calendar on Canvas; please don’t share it outside of class. Class meetings will not be recorded, so show up, take notes, ask questions, and speak freely, as you would in person. The biggest difference is that we won’t do group exercises via Zoom, but you will instead have individual after-class exercises.
(In the unfortunate event we have to go back to Zoom after the first two weeks, these rules will still hold.)
For the most part, you will submit your work through Gradescope. This includes both your written homework assignments and the class exercises. The online assignments that proceed most homeworks, however, will need to be done on Canvas. We will also use Canvas as a gradebook, and to distribute some readings that can’t be publicly posted here.
We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. (Postings will not be anonymous to instructors.) Anonymity will go away for everyone if it is abused. As noted above, homework will generally be due at 6 pm on Thursday; however, you should not expect the instructors to answer questions on Piazza (or by e-mail) outside normal working hours. (We may, but you shouldn’t expect it.)
R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). In this class, you will use it for every homework. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.
Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week’s data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code is not acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code (and not some brilliant trick that you then forget, or obscure bug that you didn’t notice). For help on using R Markdown, see “Using R Markdown for Class Reports”. (Part of the first lecture will be going over the basics of R Markdown.)
For each assignment, you should write your homework in R Markdown, knit it to a humanly readable document in PDF format, and upload the PDF to Gradescope. This is, ordinarily, what you will be graded on. However, it is important that you keep the R Markdown file around, because every week I will randomly select a few students and ask them to send me your R Markdown so that I can re-run it, and check that it does, in fact, produce the file you turned in. (You should expect to be picked about once a semester.) If they don’t match, I will have questions, and it will hurt your grade.
(Gradescope makes it much easier for multiple graders to collaborate, but it doesn’t understand R Markdown files, just PDFs.)
You’ll need to do math for some homework problems. R Markdown provides a simple but powerful system for type-setting math. (It’s based on the LaTeX document-preparation system widely used in the sciences.) If you can’t get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by the last homework.
Solutions for all homework will be available, after their due date, through Canvas. Please don’t share them with anyone, even after the course has ended. This very much includes uploading them to websites.
| Day | Time | Instructor | Location | 
|---|---|---|---|
| Tuesday | 4:00–5:00 pm | Ms. Lin | Gates Hall 5716 | 
| Wednesday | 1:25–2:45 pm | Prof. Shalizi | Baker Hall 229C | 
| Wednesday | 8:30–9:30 pm | Mr. Bansal | Zoom, link on Canvas | 
| Thursday | 9:00–10:00 am | Mr. Masserano | Zoom, link on Canvas | 
If you want help with computing at an in-person office hour, please bring a laptop.
If you cannot make the regular office hours, or have concerns you’d rather discuss privately, please e-mail me about making an appointment.
The following books are required:
You can download a copy of Berk from the university library via the link above; the other books are also available electronically from the library, but for limited numbers of simultaneous users. You should make sure you have access to copies of all these books.
The following books are recommended but not required:
Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an source; this includes all mathematical derivations, computer code and figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbooks and recommended readings, lecture slides and demos, any resources provided through the class website, solutions provided to this semester’s previous assignments in this course, books and papers in the library, or legitimate online resources (including this class’s Piazza), though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)
In general, you are free to discuss, in general terms, homework with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students, the instructors, or your teachers in other current classes at CMU, until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are free to complain, in general terms, about any aspect of the course, to whomever you like.
Any use of solutions provided for any assignment in this course in previous semesters is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not “consult” them, do not read them, do not ask your friend who took the course last year if they “happen to remember” or “can give you a hint”. Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn’t give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to formal disciplinary action, under the terms of the university’s policy on academic integrity.
You must complete “homework 0” on the content of the university’s policy on academic integrity, and on these course policies. This assignment will not factor into your grade, but you must complete it, with a grade of at least 80%, before you can get any credit for any other assignment.
Lecture notes and readings will be linked to here, as they become available. Lecture topics are subject to change depending on how the course is going. Don’t expect required and suggested readings to finalize until about a week before each lecture.
| Abbreviation | Book | 
|---|---|
| PDM | Principles of Data Mining | 
| Berk | Statistical Learning from a Regression Perspective | 
| WMD | Weapons of Math Destruction | 
| TEA | The Ethical Algorithm | 
| Hastie et al. | The Elements of Statistical Learning | 
| ADAfaEPoV | Advanced Data Analysis from an Elementary Point of View | 
| TBD | To be determined | 
“Statistical learning”, strictly speaking, is about using data to find or learn rules from data which will predict well on new cases, usually by optimizing some function of the data. It emerged as a field between about 1980 and 2000, and especially between about 1985 and 1995, when computer scientists interested in making machines learn ideas from statistics about how to fit and evaluate predictive models (as well as more technical tools from theoretical statistics), and applied them to new kinds of models and new problems. This course will introduce you to some of the most important methods used in practical applications of statistical learning, to the tools used to evaluate how well those methods work in particular cases (i.e., how we know we’re not fooling ourselves), and a selection of important applications, and their implications. Statistical learning is interested intellectually because it sheds new light on puzzles about learning, knowledge, induction and scientific method which thoughtful people have wrestled with for thousands of years all of the world. It’s interesting practically because it is used to support data-driven decision-making, and increasingly to automatically make consequential decisions. In this way it’s the latest phase in a long history of trying to base decisions on data, and collecting more and more data to aid decision-making.
Slides, R Markdown file that generated the slides, printable pdf
calif_penn_2011.csvLinear regression is a prediction method. For any quantitative variable \(Y\) which we’d like to predict, and any predictor variable(s) \(X\), there is an optimal regression function, \(\mu(X)\), which is optimal in the sense that it minimizes the expected squared error, \(\mathbb{E}\left[ (Y-\mu(X))^2 \right] \leq \mathbb{E}\left[ (Y-m(X))^2 \right]\) for any other function \(m\) of \(X\). Linear regression is not that optimal regression function. If we limit ourselves to linear functions of \(X\), ones which can be written in the form \(b_0 + b X\), there’s an optimal intercept \(\beta_0\) and optimal coefficient(s) \(\beta\), which minimize the expected squared error, \(\mathbb{E}\left[ (Y- (\beta_0 + \beta X))^2 \right] \leq \mathbb{E}\left[ (Y - (b_0 + b X))^2 \right]\). That optimal coefficient is always \(\beta = \mathrm{Var}\left[ X \right]^{-1} \mathrm{Cov}\left[ X,Y \right]\), and then the optimal intercept is always \(\beta_0 = \mathbb{E}\left[ Y \right] - \beta \mathbb{E}\left[ X \right]\). This is all true no matter how nonlinear the real regression function might be, and whether or not anything is Gaussian. Least squares regression tries to estimate this optimal linear model from the data. OLS can conveniently be done in matrix form, \(\hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}\). (This is really the sample version of \(\mathrm{Var}\left[ X \right]^{-1} \mathrm{Cov}\left[ X, Y \right]\).) Predictions on the training data then become \(\mathbf{x} (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}\), which reveals the crucial role of the “hat matrix”, \(\mathbf{x} (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T\). From this predictive viewpoint, lots of what we teach you in linear models classes about regression diagnostics and parameter inference doesn’t matter. It also becomes clear that linear regression is a very special case whose main virtues are clean math and simple calculations, not powerful predictions or accurate representation. We teach it to you as the first step on the ladder to better methods.
Regression is about predictive quantitative, numerical variables. If we want to predict a qualitative, categorical variable, especially a binary one, we call that classification. A very basic way of doing classification is the prototype method: pick two points to represent each class, the “prototypes”, and classify a new point as belonging to the closer class. This is a special case of the broader family of linear classifiers, where the boundary between classes is linear. Linear classifiers, like all classifiers, can be evaluated in terms of their accuracy (= probability that a new case is correctly classified), or equivalently their mis-classification rate (=1-accuracy). Slightly more refined analyses look at false positive and false negative rates, and receiver operating characteristics (how those error rates vary with model parameters). Logistic regression is (despite its name) a linear classifier, plus the assumption that the probability of being in one class or another varies in a certain way with distance from the boundary. Because it allows for some uncertainty, it’s a more informative model; evaluating it involves not just classification, but also looking at whether those probabilities are properly calibrated (so that events it says are 50-50 happen about half the time, etc.).
We’ve now seen three kinds of prediction: regression (guessing a quantitative variable), classification (guessing a categorical variable), and density estimation (guessing a probability distribution). “What prediction should I make for \(Y\) when I’ve seen that \(X=x\)?” is a decision problem. There are three elements to any decision problem: actions, states, and information. Loss functions say how bad a a given action is in a given state. Squared error is a natural loss function for regression, mis-labeling or “0/1 loss” for classification, and (negative) log-likelihood for probabilities. A strategy is a rule which tells us what action to take on the basis of the available information. Risk is the expected loss incurred from following a strategy, and risk minimization is a sensible guide to decision making, though not the only imaginable one. There is always a mathematically-possible strategy which minimizes the risk, but we’re usually restricted to a certain class of feasible strategies we can actually implement; the difference between the best-possible strategy and the best feasible strategy is approximation error. Risk is defined using the true distribution of the data, but we don’t have that distribution, only data sampled from it. The gap between the distribution and the training data always introduces some estimation error into our choice of strategy. Narrowing the class of allowed strategies generally reduces this estimation error, but at the cost of increasing the approximation error. This trade-off between approximation and estimation error is fundamental to learning, and includes the bias-variance trade-off as a special case.
spam data set as a CSV file](hw/02/spam.csv]In statistical learning, fitting a model usually means solving an optimization problem, of the form “minimize the loss on the training data”, or “minimize the sum of loss on the training data plus a penalty for funny-looking models”. This makes optimization fundamental to statistical learning, so we need to go over the essentials. This starts with distinguishing local and global optima. After that, we need some facts you may remember from calculus: if we’re optimizing a function \(f\) of one variable \(\theta\), a necessary condition for \(\theta^*\) to be an optimum is \(\frac{df}{d\theta}(\theta^*) = 0\), unless we’re at a boundary of the space of \(\theta\). If we meet that first-order condition, a sufficient condition for a local minimum is that \(\frac{d^2 f}{d\theta^2}(\theta^*) > 0\); minima which don’t meet this second-order condition exist but are delicate and weird. If we’re optimizing more than one variable, the first-order condition is that the gradient (vector of partial derivatives) \(\nabla f\) is zero, \(\nabla f(\theta^*) = 0\); the second-order condition for a minimum is that the Hessian (matrix of second partial derivatives) \(\nabla \nabla f\), has to be “positive definite”, \(\nabla \nabla f(\theta^*) \succ 0\). (You learned what that means in matrix algebra, but we’ll go over it again, and why it’s reasonable.) Close enough to a local optimum, most smooth functions look quadratic, so they can be approximated using only their gradient and their Hessian. This also tells us how much we miss by not being quite at the optimum, which will be useful later.
In statistical learning, it’s not usually possible to set up the equations for the first- and second- order conditions, \(\nabla f(\theta) = 0\), \(\nabla\nabla f(\theta) \succ 0\), and explicitly solve for an optimum \(\theta^*\). Instead, we have to use numerical algorithms to approximately compute \(\theta^*\). The calculus we went over last time suggests the most basic and widely-used of these algorithms, gradient descent: start somewhere, find the gradient, take a small step in that direction, and repeat until the gradient approaches zero. An important refinement is Newton’s method: basically, do gradient descent, but take smaller steps where the function is more curved (as indicated by its Hessian). For large data sets, calculating the loss on all data points can take too much time, so we use a random sample of the data at each step, giving us stochastic gradient descent and stochastic Newton’s method, among other stochastic approximation algorithms. Whether these algorithms work well depends on the properties of the function we’re trying to optimize, and if it’s not very smooth we may be better off using simulated annealing: move in a random direction, but back-track if it makes things worse, and get more inclined to back-track the longer we’ve been working. It is annoying to have to think about multiple different optimization algorithms, rather than having One Algorithm to Rule Them All, there are real trade-offs of generality vs. computing time vs. optimization error, and also fundamental limits which say that no algorithm can always out-perform others across all problems (the “no free lunch theorems”). No optimization algorithm gives us the exact optimum, which in statistical learning introduces an extra source of error, optimization error. Spending more computer time reduces the optimization error, but it’s not worth going on once the optimization error is small compared to the estimation error. Sometimes quite crude optimization is all we really need.
Many statistical learning methods don’t just try to minimize the risk within some feasible class of strategies. Instead, they add penalties, which make it harder to select some strategies, unless they really improve the risk. This is an example of a general tactic in optimization. Penalties can arise because we find some sorts of solution expensive or distasteful, and want to be sure they’re worth it. Another reason we use penalties is that some optimization problems are “unstable” or “ill-posed”, so that small changes to the function or the data give wildly different optima, and adding penalties can stabilize or regularize them. In statistical learning, this especially arises when we’re dealing with high-dimensional data, where the number of variables is comparable to, or greater than, the number of data points. Even low-dimensional data can be collinear, which leads us want regularized linear regression. The two most common penalties for linear models are the sum of the absolute values of the coefficients (\(L_1\)) an the sum of the squared coefficients (\(L_2\)); these are respectively called lasso and ridge regression, and lead to different kinds of models. A final way penalties arise is as substitutes for constraints on the optimization, restrictions which cut down the feasible set, by requiring that the parameters require some set of equations and/or inequalities. Remarkably enough, it is generally possible to turn a constrained optimization problem with \(p\) variables under \(q\) constraints into an unconstrained problem with \(p+q\) variables: there is one penalty term for each constraint, and the new \(q\) new variables, the Lagrange multipliers, make the penalties strong enough to enforce the constraint (“a fine is a price”). This goes the other way, too: penalties are generally equivalent to constraints. Optimizing under constraints is an important tool in economics and operations research, sometimes called mathematical programming, and very efficient algorithms exist for the special class of convex programming problems; this makes it very useful when we can show our learning problem can be cast as a convex programming problem. Time permitting: “Comrades, let’s optimize!”
See under lecture 7.
The goal of decision theory is to find the strategy (from the “feasible” set of strategies) which minimizes the risk, but “risk” is defined using the true data-generating distribution, which we do not know. The “empirical risk” of a strategy is simply its mean loss on the training data, and “empirical risk minimization” is selecting the (feasible) strategy with the lowest mean loss on the training data. (Examples: least squares, maximum likelihood.) This is always biased as an estimate of the actual risk, and the bias is always in the optimistic direction. We should therefore never just evaluate predictive models in-sample. There are some simple (too simple?) formulas for estimating out-of-sample prediction error, based on the empirical risk and some measures of model complexity; these follow from the basic calculus facts about optimization, and a little probability theory.
We ended the last lecture with a very general formula for estimating the risk of a model which we fit by minimizing the empirical risk on the training data. An alternative, widely used in practice, is cross-validation: divide the data into parts, fit our model on one part, see how well it can predict the other part, and then swap roles. This estimates how well our models can generalize to new data, by having them generalize to new data. There are two main forms of cross-validation, “leave-one-out” and “v-fold” (or “k-fold”), with specific advantages and disadvantages. To summarize what we’ll go over in more detail: k-fold CV tends to be harsher on more complicated models than leave-one-out, which can have advantages for model selection when one of the models we’re selecting among contains the truth. But leave-one-out often has better predictive performance, especially when none of our models is right. Leave-one-out is usually much slower, computationally, than k-fold CV, but there are short-cuts in some special cases.
The simplest, and one of the oldest, nonparametric methods for prediction is the nearest neighbor method: when we want to predict what will happen in a new case, we find the case with a known outcome that most resembles the new one, and say that will happen. The \(k\)-nearest-neighbor method is only slightly more complicated: look up the \(k\) most similar cases, and average them or have them vote. (Plain nearest neighbors is then “1-nearest-neighbors”, or “1NN”.) It’s fairly straightforward to analyze how 1NN would work if there was no noise, and we’ll do so in class; the crucial fact turns out to be that the distance to the nearest neighbor goes to zero, and we’ll see how fast it does so. Adding noise back in to the problem is then fairly straightforward. The risk of 1NN converges to a risk of exactly twice the minimum possible risk under almost no assumptions. As we increase the number of nearest neighbors, we discover a bias-variance trade-off: small \(k\) means low bias but high variance, large \(k\) means high bias but low variance.
The most important statistical issue for nearest neighbors is selecting the right number of neighbors \(k\) to use. This is not a parameter, a fact about the world which we are trying to estimate, but a control setting, something we decide on. Analyzing the bias-variance trade-off gives us a sensible way to approach how to set this control; it tells us that the best \(k\) to use should grow with the number of data points, but not too fast. Practical answers turn on cross-validation, which we’ll go over again (once more, with feeling). It turns out that for \(k\)-NN, leave-one-out CV is (predictively) optimal. Once we have a well-tuned nearest-neighbor predictor, we can use it for nonparametric regression or classification. This in turn opens up many other possibilities; one of these is matching estimates of causal effects, where (say) people who have received some treatment are matched with the most similar people who haven’t, and the difference in their outcomes estimates the effect of the treatment.
Classification and regression trees; the CART algorithm for building trees; “entropy” and “deviance”; some examples.
In place of notes, please read chapter 13 of ADAfaEPoV.
compas_violence.csv data fileImproving on single trees by combining multiple trees. “Bagging”, or “bootstrap averaging”: combining multiple trees from slightly different data sets. Random forests: bagging plus random feature selection. Two perspectives on why multiple trees help: averaging reduces variance (but less so if what’s averaged is positively correlated); the performance of the average is the average performance plus the diversity.
Same notes as last time
Same as last time.
Enjoy spring break.
Motivation: using linear methods, but on nonlinear functions of the data. The “kernel trick”; definition of a kernel function and characterization of kernels. Kernel machines for classification and regression. Some practicalities of working with kernel machines: the implicit expansion of kernel machines in terms of basis functions. Kernel ridge regression. Worked examples.
Turning to classification: It’s hard to optimize the 0/1 loss, so we turn to “surrogates” which are easier to work with. The “margin” of a data point in a classification problem is how much the machine’s output would have to change to flip the classification; the margin of the machine is the minimum margin over data points. A classifier which achieves a large margin has put the boundary between classes far from all the data points, at least in whatever implicit feature space it’s using. For kernel machines, maximizing the margin leads to an optimization problem which is computationally tractable, and sets the weight on many data points to zero; the others are the “support vectors”. Matters are a little more complicated when we allow the machine the “slack” to make a certain number of mistakes, but fundamentally similar.
Kernels have many attractive properties, but the need to keep around all the data points, and especially to work with the \([n \times n ]\) kernel matrix, creates a crucial bottleneck for large data. Random feature machines first emerged as a way of making kernel machines more computationally efficient: by decomposing kernels into sums (or integrals) of nonlinear functions, and sampling from those functions, one can get a randomized approximation to the kernel that uses only a small fraction of the computational resources. This led to the idea of directly using random features, without even trying to approximate a kernel. A surprisingly small number of randomly-chosen nonlinear features can come remarkably close to the optimal function, while still having good generalization performance.
Notes: Combined with notes for lecture 15.
anomaly.csv data fileGoals of dimension reduction. Linear algebra review: vectors, matrices, inner products, projections. Eigen-decomposition of matrices.
None required, but the review of vectors and matrices from J. V. Stone’s Independent Component Analysis (Cambridge, Massachusetts: MIT Press, 2004) may be helpful. (Also available on Canvas.)
“Principal components analysis” = finding the linear surface which comes closest to the original data points. Performing and interpreting PCA. Example with US geography. “Latent semantic indexing” = PCA on bag-of-words vectors. Multidimensional scaling and PCA. Real-data examples. Time permitting: Random linear projections as an alternative to PCA.
PCA is just an approximation method, which assumes nothing about the data-generating process. This means that we can always use it, it’s never wrong, but it also means we can’t actually make any predictions with it, and it’s never right, either. In particular, there is no right answer to the question of “how many principal components should we use?” One way around this is to make a PCA-like model, by assuming that the difference between the actual data vectors, and their reconstruction using only a limited number of low-dimensional variables, should be random noise. This gives us factor models, whose use is called factor analysis. In these models, our high-dimensional observables are linear combinations of a low-dimensional set of unseen factors, plus noise. Unlike PCA, factor models do make predictions about unseen data, at least at the level of the distribution, and so we can sensibly answer questions like “can we predict this data better using five factors or is just one enough?”, or “does the five-factor model pass goodness-of-fit tests?” Estimating these models involves some back-and-forth between making assumptions about the noise, and using that to estimate how the factors relate to the observables, and making assumptions about the factors and using that to estimate the noise.
Turning to applications, factor models are often used to give low-dimensional (really, low-rank-plus-noise) approximations to data which naturally takes the form of high-dimensional matrices. The classic cases of this are in psychology, where the technique was first developed: rows of matrices there are people (or other test-takers), columns are “items” in a mental test, and the “factors” are (supposedly) attributes such as “knowledge of arithmetic”, “narcissism”, “racial resentment” or “need for chaos”. These uses are entrenched in our society, but also deservedly controversial. Turning to more data-mining-y applications, factor models are also common in recommendation systems (playing an important role in the Netflix Prize competition) and in ad targeting (being at the core of what Cambridge Analytica claimed to do).
Problems with using linear methods when the data lie near a curved surface. The concept of a “manifold”. Exploiting manifold structure: local linear embedding. Exploiting a pre-existing measure of similarity: diffusion maps.
“Clustering” is dividing data into categories, without knowing what the should be categories, or having an examples to follow. Why this is not totally hopeless, and some rough ideas about what good clusters might look like. The most basic algorithm for clustering: k-means. Using clustering to find (approximate) nearest neighbors (“locality-sensitive hashing”). Probabilistic clustering with mixture models; the EM algorithm; k-means again. Using log-likelihood to pick the number of clusters. Checking goodness of fit of cluster models. Advanced topics in probabilistic clustering (time permitting): “mixed membership” models, and their roles in modeling text (“topic models”, “latent Dirichlet allocation”) and genetics.
amazon-dress-jpgs.zip, eigendresses.R, ratings.csvA.k.a. hierarchically organized nonlinear transformations of weighted sums of input features. Basic, two layer neural networks, their training and limitations. Three layer neural networks. Random feature models as three-layer networks with a fixed initial layer. Deep neural networks. Powers: universal approximation; really good results in some domains. Limitations and unknowns: adversarial examples, generalization despite memorization, etc. Some reflections, from someone who remembers the last time neural networks were going to take over the world.
Enjoy Carnival
The idea of a recommender system. Best-seller lists. Nearest-neighbor-based approaches (whether based on users or on items). The Netflix problem, and PCA/factor-model style approaches. Recommendations using social graphs. Model combination. Missing-data issues and the “cold start” problem.
Homogenization; paradoxes of diversification; rabbit holes and echo chambers; discrimination; what does the system optimize for; when inferior recommendations are more profitable; natural monopoly (or oligopoly). Also: do recommendations do anything at all?
hw-11.R, MovieLense.csv