Generative artificial intelligence systems are based on statistical models of text and images. The systems are very new, but they rely on well-established ideas about modeling and inference, some of them from the early 1900s. This course will introduce students to the statistical basis of large language models and image generation models, emphasizing high-level principles over implementation details. It will also examine whether, in light of those statistical foundations, we should think of generative AI as a step towards “artificial general intelligence”, or instead as a “cultural technology”.
Undergraduates must register for 36-473; graduate students must register for 36-673. (Undergraduates who try to register for 673 will be dropped from the roster.) Lectures and many assignments will be the same for both courses, but students in 673 will also be expected to complete a project during the second half of the semester. (Details are described below, under “grading”.)
The course is intended for advanced undergraduates in statistics and closely related majors, and for MS students in statistics. Undergraduates taking 473 must have passed 36-402. MS students taking 36-673 should have preparation equivalent to 36-402 and all of its prerequisite courses in statistics and mathematics. A course in stochastic processes, such as 36-410, is helpful but not required.
By the end of the semester, students in this class should be able to:
| | Name | E-mail | Office |
|---|---|---|---|
| Professor | Dr. Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C |
| TA | Mr. Oliver Hannaoui | Not to be bothered by e-mail | TBD |
Grades for 36-473 will be based 100% on regular homework assignments. Grades for 36-673 will be based 80% on homework and 20% on a (group) project.
Homework: There will be 9 homework assignments, due on Thursdays at 6 pm. Your lowest three homework scores will be replaced by 80/100 (unless that would lower your average) — no questions asked, no excuses needed. Late homework will not be accepted for any reason; no flexibility will be offered for homework due dates.
Projects: Students taking 36-673, and only those students, will do a project. This can be done in a group of up to 4. Projects may explore some area of the literature, either going beyond what’s covered in the course or treating it in more depth; experiment on existing systems; implement new systems or ideas; or combine these. Students are encouraged to propose their own project ideas. Proposals must be approved by the professor, and preliminary proposals are due before final proposals, with time for feedback. All projects — even ones that are primarily about implementing something — need to lead to a written report and to an oral presentation. Students will be graded both on the quality of their work, and on their intelligent and polite engagement with other students’ presentations. (Do not make travel plans which conflict with the last week of classes.)
All assignments in this class have three purposes, in decreasing order of importance:
The points for individual homework problems, rubric items, etc., reflect how important the underlying concepts and skills are for you to master, not how much time or work they should take.
A grade of A will be an average of 90 or above, a B a grade of 80 or above (but below 90), and so on. Any grade below 60 will be an R.
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve or need a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.
As a final word of advice about grading, “what is the least amount of work I need to do in order to get the grade I want?” is a much worse way to approach higher education than “how can I learn the most from this class and from my teachers?”.
Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. You will generally find it helpful to do the readings before coming to class.
Do not record lectures. Exceptions to this will be made only for students with written accommodation plans, authorized by CMU’s Office of Disability Resources, requiring recordings.
Solutions for all homework assignments will be available, after their due date, through Canvas. Please don’t share them with anyone, even after the course has ended. This means do not upload them to websites. Sharing or uploading the solutions will count as academic dishonesty and will be pursued through the university’s disciplinary process.
Thursdays at 10 am in Prof. Shalizi’s office, Baker Hall 229C.
Christopher M. Bishop (with Hugh Bishop), Deep Learning: Foundations and Concepts (Cham, Switzerland: Springer-Verlag, 2024; doi:10.1007/978-3-031-45468-4) is recommended. Hardcopies are a bit expensive (currently $89.99), but we have free access to the full text through the university library — follow the link (and if need be sign in to the library).
Except for explicit group projects, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester’s previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. Websites which compile course materials are not legitimate online resources.
In general, you are free to discuss homework with other students in the class, though not to share work; such conversations must be acknowledged in your assignments. You should not seek help with assignments from people outside the class. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.
If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to formal disciplinary action, under the terms of the university’s policy on academic integrity.
AI tools have their place — why have this course otherwise? — but they are most useful when they are doing boring and repetitive, “mechanical” work which you can check. The point of assignments in a class like this is to help you acquire the knowledge and skills you need to evaluate someone else’s work, or the output of an AI. Therefore, using generative AI for any part of assignments in this class which require you to think is actively harmful to your own learning. (This is backed up by professional studies of the effect of AI on student learning, which I am happy to discuss.)
I recognize that it is going to be very hard to stop you from using these machines to do mechanical tasks (like writing boilerplate code or graph formatting). I would also be failing in my duty as a teacher if I let you replace your own thought and learning with a random sample of the lowest common denominator of the Internet. Since drawing a bright line between those two extremes is hard, I am willing, on a trial basis, to permit the use of generative AI tools in this class, on the following conditions:
You should expect to spend 5–7 hours on assignments every week, outside of class, averaging over the semester. (This follows from the university’s rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.
If you need accommodations for physical and/or learning disabilities, please contact the Office of Disability Resources, via their website, [http://www.cmu.edu/disability-resources]. They will help you work out an official written accommodation plan, and help coordinate with me.
References to “Bishop” in the readings are to the recommended textbook (see above). Readings marked with asterisks (*) are optional. Readings marked with one or more daggers (\(\dagger\)) are, as it were, especially optional, because they are technically advanced, older, or go off on a tangent to the main direction of the course; the number of daggers gives my sense of how hard they will be for most students. Readings not marked with an asterisk or dagger are mandatory.
Modern generative AI rests on large language models, and LLMs are higher-order parametric Markov models fit by maximum likelihood. The first half of the semester will be about unpacking this statement and its consequences.
Language models, in this sense, are probability models for sequences of discrete symbols. A probability model which specifies the distribution for all parts of the data is generative, because it can be used to make up, or “generate”, new data sets by sampling from the probability distribution. If the data are text or images, these models then generate new texts or new images. Not all statistical models are generative: regressions and classifiers, for instance, are not generative, because they do not specify distributions for the predictor (“independent”) variables. Probability models can also be used to compress data sets, essentially by saying how that model could have generated that particular data set. The more accurate the model’s distribution, the more efficient it is at compression. The size of this compression or encoding can be measured in the number of binary digits, or bits, needed to store it, and this turns out to be the negative log probability of the data set. Various averages and transformations of the negative log probability, with names like cross-entropy and perplexity, are often used to evaluate generative models.
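A minimal sketch of these ideas, using a made-up categorical model and a made-up data set (not anything from the course):

```python
import math

# A toy generative model for symbols: a categorical distribution.
# (Hypothetical probabilities, purely for illustration.)
model = {"a": 0.5, "b": 0.25, "c": 0.25}

data = ["a", "a", "b", "c", "a", "b"]

# Negative log probability of the data set, in bits: the length of an
# optimal encoding of the data under this model.
bits = -sum(math.log2(model[s]) for s in data)

# Cross-entropy: average bits per symbol; perplexity: 2**cross-entropy,
# interpretable as an "effective vocabulary size".
cross_entropy = bits / len(data)
perplexity = 2 ** cross_entropy
```

A more accurate model would assign the data higher probability, hence fewer bits, hence lower cross-entropy and perplexity.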
Reading:
Assignments:
pareto.R, RUR.txt

When probability models have one or more adjustable parameters, those parameters need to be estimated by matching the data. The most efficient way of doing this is usually to maximize the likelihood, the probability that the model would have generated the particular data we saw. This corresponds to looking for the shortest compression of the data — or at least the shortest achievable using that model. In previous classes, you have worked through simple examples of maximum likelihood estimation. We will go over the general theory of maximum likelihood here, including the interpretation of first and second derivatives of the log-likelihood function.
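As a sketch of the idea (with made-up observations), maximum likelihood for a Bernoulli model can be done by brute force, and recovers the familiar closed-form answer:

```python
import math

# Maximum likelihood for a Bernoulli model: the likelihood is the
# probability the model assigns to the observed data, and maximizing it
# picks the parameter under which the data compress best.
data = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up observations

def log_lik(p):
    return sum(math.log(p if x == 1 else 1 - p) for x in data)

# Crude grid search; the closed-form MLE is the sample mean.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
# p_hat is sum(data)/len(data) = 0.75
```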
Maximum likelihood estimation of generative models is what machine learning calls “self-supervised training”. We will look at the properties of self-supervised training when the probability model cannot match the true distribution of the data exactly. The upshot is that maximizing the likelihood converges on the best approximation to the true distribution which the model can achieve. This particular sense of “best approximation” involves likelihood ratios and a quantity called the “Kullback-Leibler divergence” or “relative entropy”, but we’ll also see why that’s a reasonable goal.
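The Kullback-Leibler divergence is easy to compute for categorical distributions; here is a sketch with made-up distributions:

```python
import math

# Kullback-Leibler divergence D(P||Q) between two categorical
# distributions: the expected extra bits per symbol paid for encoding data
# from P using a code optimized for Q. (Illustrative distributions.)
P = [0.5, 0.25, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

d = kl(P, Q)  # >= 0, and zero only when P == Q
```

The model that minimizes this divergence from the true distribution is exactly the "best approximation" that maximum likelihood converges to.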
Reading:
Large language models are higher-order Markov models. To understand them, we begin with first-order Markov models. A Markov chain is a sequence of discrete, random symbols where future variables are probabilistically independent of the past, given the current state of the process. This conditional independence is called the Markov property. The Markov property implies that every symbol has a certain probability (perhaps zero) of following any other symbol. These conditional probabilities can be assembled into a matrix, the transition matrix of the chain. All the other properties of the process can be worked out by doing linear algebra on the transition matrix. The eigenvectors of the transition matrix tell us which distributions are stable or invariant over time (those with eigenvalue 1), and the remaining eigenvalues tell us how quickly other distributions converge to the invariant ones. Under mild conditions, Markov chains are ergodic, meaning that statistics calculated along any one realization of the process converge on what we would calculate from an invariant distribution. This implies that correlations between very distant parts of the sequence go towards zero. Some simple calculations based on the transition matrix give us rough handles on rates of convergence.
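A sketch of the linear algebra, using a made-up two-state transition matrix:

```python
import numpy as np

# Invariant distribution of a two-state Markov chain via linear algebra.
# (Toy transition matrix; rows are current states, columns next states.)
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Row vectors pi with pi @ T = pi are left eigenvectors with eigenvalue 1.
vals, vecs = np.linalg.eig(T.T)
i = np.argmin(np.abs(vals - 1.0))
pi = np.real(vecs[:, i])
pi = pi / pi.sum()  # normalize to a probability distribution
# For this T, pi is [0.8, 0.2]; check that pi @ T equals pi.
```

The second eigenvalue of this matrix (0.5) governs how quickly other starting distributions decay toward the invariant one.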
Reading:
When the transition matrix of a Markov model is unknown, we need to estimate it from data. Once we see how to define the likelihood function for a Markov chain, we can estimate it. There is a closed-form solution for the transition matrix which maximizes the log-likelihood, and we can use our general maximum likelihood theory to calculate various properties of this estimate, such as (asymptotic) standard errors and confidence sets. This closed-form solution relies on all the elements of the transition matrix being free to vary independently (so long as each row adds to 1). When those elements are instead functions of a common set of underlying parameters, we cannot change one transition probability without affecting others. These cross-influences modify some of the properties of the MLE, again in ways which we can work out from our general theory of likelihood maximization.
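A sketch of the closed-form solution, on a made-up symbol sequence:

```python
from collections import Counter

# Closed-form MLE of a Markov chain's transition matrix: the estimated
# probability of j following i is N_ij / sum_k N_ik, the observed fraction
# of times state i was followed by state j.
seq = list("aababbaabaabab")  # made-up symbol sequence

counts = Counter(zip(seq[:-1], seq[1:]))
states = sorted(set(seq))
T_hat = {i: {j: 0.0 for j in states} for i in states}
for i in states:
    row_total = sum(counts[(i, j)] for j in states)
    for j in states:
        if row_total > 0:
            T_hat[i][j] = counts[(i, j)] / row_total
# Each row of T_hat sums to 1 (when that state was ever visited).
```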
Notes: PDF, .Rmd (with comments on the code)
Reading:
Assignments:
dicty.dat for the last two questions

For most parametric probability models, there is no closed-form expression for the maximum likelihood estimator. We might be able to get a complicated, nonlinear set of equations for the gradient which needs to equal zero, \(\nabla L_n(\hat{\theta}) = 0\), but solving those equations for the MLE \(\hat{\theta}\) is itself hard; it might not even be feasible to get those equations explicitly, and we just have a function \(L_n(\theta)\) to optimize. A very standard method for numerical optimization is gradient descent: start somewhere, \(\theta^{(0)}\), find the gradient of the objective function there, \(\nabla L_n(\theta^{(0)})\), move “downhill” in that direction a short distance, \(\theta^{(1)} = \theta^{(0)} - \eta \nabla L_n(\theta^{(0)})\), and repeat the cycle at \(\theta^{(1)}\), until we find a (local) minimum or get tired. This requires us to calculate the gradient \(\nabla L_n\), which means taking one derivative per parameter. That can be bad enough, if there are a lot of parameters, but the log-likelihood is a sum over \(n\) data points (and the cross-entropy is an average over \(n\) data points), which means calculating every derivative takes a time proportional to \(n\). If, however, we pick one observation uniformly at random from the data set, the gradient of the log probability at that observation is an unbiased (but noisy) estimate of the gradient of the cross-entropy. (If the model is IID, an “observation” is a data point; if the model is a Markov chain, an “observation” is a transition from one state to the next.) Stochastic gradient descent is (basically) choosing a new random observation to calculate the gradient at each optimization step. This makes our calculation of the MLE sloppier than if we could exactly solve the equations, but this computational error can be made small in comparison to the statistical error which is intrinsic to only having \(n\) data points.
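A sketch of stochastic gradient descent on a model where we know the answer: the MLE for the mean of a Gaussian location model (unit variance) is the sample mean, and SGD on one randomly chosen observation at a time recovers it. (Data and step-size schedule are made up for illustration.)

```python
import random

# SGD for a Gaussian location model: the full negative log-likelihood
# gradient is an average over all n points; SGD instead uses the gradient
# at one randomly chosen observation, an unbiased but noisy estimate.
random.seed(1)
data = [0.2, 1.1, 0.9, 0.6, 1.4, 0.4]  # made-up observations
target = sum(data) / len(data)          # the closed-form MLE (sample mean)

theta = 0.0
for t in range(20000):
    x = random.choice(data)
    grad = theta - x                    # d/dtheta of (theta - x)**2 / 2
    eta = 1.0 / (t + 1)                 # decaying step size
    theta = theta - eta * grad
# theta converges to the sample mean, up to SGD noise.
```

The decaying step size makes the computational error shrink; with a constant step size, theta would keep fluctuating around the MLE.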
Reading:
We have been discussing first-order Markov models, where \(X_{t:\infty}\) is independent of \(X_{0:t-2}\) given \(X_{t-1}\). In a second-order Markov model, conditioning on \(X_{t-2:t-1}\) makes \(X_t\) independent of \(X_{0:t-3}\). In a Markov model of order \(m\), conditioning on \(X_{t-m:t-1}\) makes \(X_t\) independent of \(X_{0:t-m-1}\). Every Markov model of order \(m\) includes models of smaller order, down to order 1, as special cases. For the analysis of higher-order Markov models, however, we go the other way, using what’s called the extended chain trick: For a second-order chain, \(Y_t \equiv ( X_{t}, X_{t-1} )\). The \(Y\) process is a first-order Markov process, not on \(\mathcal{X}\), the original space, but on \(\mathcal{X} \times \mathcal{X}\), the space of pairs of values. If \(\mathcal{X}\) was finite, then so is \(\mathcal{X} \times \mathcal{X}\), and we can analyze \(Y\) like any other Markov chain. It has a transition matrix, whose largest eigenvalue will be 1, corresponding to an invariant distribution. As \(t \rightarrow \infty\), the distribution of \(Y_t\) will approach that invariant distribution, regardless of the initial distribution of \(Y_0\), and the time average of any function \(h(Y_t)\) will approach its expected value under the invariant distribution. And so on and so forth: the extended chain is just another Markov chain. The extended-chain device works for higher-than-second order Markov models as well; an \(m^{\mathrm{th}}\) order chain is a chain on \(\mathcal{X}^m\).
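The extended-chain trick is a one-liner in code; here applied to a made-up realization:

```python
# The extended-chain trick: turn a second-order chain on symbols into a
# first-order chain on pairs, Y_t = (X_t, X_{t-1}).
seq = list("abaababba")  # a made-up realization of some process
Y = list(zip(seq[1:], seq[:-1]))  # Y_t = (X_t, X_{t-1})
# Y lives on the product space {a,b} x {a,b}. Its transitions are
# constrained (the first coordinate of Y_t must reappear as the second
# coordinate of Y_{t+1}), but it is a first-order chain and can be
# analyzed like any other.
```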
Statistical inference for higher-order Markov chains works just like it does for first-order chains. In a chain of order \(m\), we need to know \(\Pr{(X_t=j|X_{t-m:t-1} = w)}\) for all words \(w\) of length \(m\). (This is equivalent to knowing the transition matrix of \(Y\).) The unrestricted MLE of this probability is, as before, \(\frac{N_{wj}}{\sum_{j}{N_{wj}}}\). For parametric models, we obtain estimating equations of the same form as before. Since chains of order \(m\) contain chains of order \(m-1, m-2, \ldots 1\) as special cases, we can use likelihood ratio tests to do order selection; there are also other possibilities.
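A sketch of the unrestricted MLE for a higher-order chain, counting what follows each length-\(m\) context in a made-up sequence:

```python
from collections import Counter, defaultdict

# Estimating a second-order Markov chain: condition on the last m = 2
# symbols ("words" of length 2) and count what follows each context.
seq = list("abaababbabaabab")  # made-up symbol sequence
m = 2

counts = defaultdict(Counter)
for t in range(m, len(seq)):
    context = tuple(seq[t - m:t])
    counts[context][seq[t]] += 1

# MLE of Pr(X_t = j | last two symbols = w): N_wj / sum_j N_wj
probs = {w: {j: n / sum(c.values()) for j, n in c.items()}
         for w, c in counts.items()}
# Each context's conditional distribution sums to 1.
```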
An alternative to always conditioning on the last \(m\) observations is to sometimes use shorter histories and sometimes use longer ones, depending on context. This leads to a class of models called context trees or variable-length Markov chains, with a history going back to the 1950s.
Reading:
Assignments:
gd.R code file

Feed-forward neural networks or multi-layer perceptrons are ways of building complicated nonlinear functions by taking weighted sums and applying one fixed nonlinearity, over and over again. The units (“neurons”) of a feed-forward neural network are arranged in layers. The first layer is the vector of inputs to the function, say \(x^{(1)}_{1}, \ldots x^{(1)}_{p}\). The units in the second layer take weighted linear combinations of the inputs, and push them through a fixed nonlinear activation function or squashing function, say \(\sigma\), so the output of unit \(i\) in layer 2 is \(x^{(2)}_i = \sigma(b^{(2)}_i + \sum_{j=1}^{p}{w^{(2)}_{ij} x^{(1)}_j})\). We then do the same thing layer after layer: \(x^{(k)}_{i} = \sigma(b^{(k)}_i + \sum_{j}{w^{(k)}_{ij} x^{(k-1)}_j})\). (The number of units in each layer can vary.) The final output can be a linear combination of the outputs of the last layer, or some other simple, fixed transformation, such as converting log odds to probabilities (“softmax”).
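The forward pass is short enough to write out directly; here is a sketch with made-up weights and \(\tanh\) as the activation function:

```python
import math

# Forward pass of a tiny feed-forward network: each unit takes a weighted
# sum of the previous layer plus a bias, then applies a fixed
# nonlinearity sigma. (Made-up weights, purely illustrative.)
def sigma(z):
    return math.tanh(z)

def layer(x, W, b):
    # output_i = sigma(b_i + sum_j W_ij * x_j)
    return [sigma(bi + sum(wij * xj for wij, xj in zip(row, x)))
            for row, bi in zip(W, b)]

x1 = [1.0, -0.5]                                       # input layer
x2 = layer(x1, [[0.3, 0.7], [-0.2, 0.1]], [0.0, 0.5])  # hidden layer
out = sum(w * h for w, h in zip([1.0, -1.0], x2))      # linear output
```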
Every arrangement of units and choice of activation function, or network architecture, defines a parameterized class of functions, with the parameters being all of the \(b^{(k)}_i\) and \(w^{(k)}_{ij}\), collectively the weights of the model. Feed-forward neural networks turn out to be universal function approximators: basically any function can be approximated arbitrarily closely by some network, and the larger we make the network, the closer the approximation can get. In principle, networks with one intermediate layer between input and output are already universal approximators, but experience shows it’s often easier to work with multiple layers (deep networks). Deep or shallow, there are always multiple networks with the same architecture, but different weights, which compute exactly the same function, so the weights are unidentified. This can actually make it easier to find good weights, but does make it hard to interpret the innards of the net.
Neural networks give us functions of continuous vectors. To use them on discrete, categorical variables (such as text), one needs an embedding, mapping each categorical value to a vector. With an embedding, one can use a neural network to parametrize the transition matrix of a Markov chain, or of a higher-order Markov model. Embeddings are also unidentified.
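At its simplest, an embedding is just a lookup table from symbols to vectors; a sketch, with made-up vectors:

```python
# An embedding maps each categorical symbol to a vector, so neural
# networks (which operate on continuous inputs) can consume discrete
# data. (Made-up two-dimensional embedding vectors.)
embedding = {"a": [0.2, -1.0], "b": [0.7, 0.3], "c": [-0.4, 0.9]}
sentence = ["a", "c", "a", "b"]
vectors = [embedding[s] for s in sentence]
# Rotating all the embedding vectors together (with compensating changes
# in the downstream network) leaves the model's predictions unchanged:
# embeddings are unidentified.
```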
(These models are called neural networks because they have their roots in a very old simplified picture of how neurons in the brain work. Modern models have departed a great deal from that original picture, and not in the direction of biological realism.)
Reading:
The weights/parameters of a neural network are usually estimated, or trained, by gradient descent. Calculating the derivatives of the loss with respect to weights in the top layer is easy. (There, we are actually dealing with a linear smoother or GLM.) But it gets complicated after that. A clever manipulation of the chain rule from calculus, however, shows how to do this recursively; this is called back-propagation. There are other ways of doing automatic differentiation, but back-propagation is the oldest and still the most widespread.
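Back-propagation in miniature: for a network with a single hidden unit, the chain-rule gradient can be written out by hand and checked against a finite-difference approximation. (All numbers are made up.)

```python
import math

# Chain-rule gradient of a squared-error loss through one hidden unit,
# checked against a finite-difference approximation.
def forward(w1, w2, x, y):
    h = math.tanh(w1 * x)        # hidden unit
    yhat = w2 * h                # linear output
    return 0.5 * (yhat - y) ** 2

def grad_w1(w1, w2, x, y):
    # Chain rule, working backwards from the loss:
    h = math.tanh(w1 * x)
    yhat = w2 * h
    dL_dyhat = yhat - y
    dyhat_dh = w2
    dh_dw1 = (1 - h ** 2) * x    # derivative of tanh(w1*x) in w1
    return dL_dyhat * dyhat_dh * dh_dw1

w1, w2, x, y = 0.4, -0.7, 1.3, 0.2
eps = 1e-6
numeric = (forward(w1 + eps, w2, x, y)
           - forward(w1 - eps, w2, x, y)) / (2 * eps)
# numeric and grad_w1(...) agree to several decimal places
```

Back-propagation organizes exactly this backwards pass so that it works for any number of layers and units, reusing intermediate quantities instead of recomputing them.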
(An alternative is to randomly seed the weights for the intermediate layers and then hold them fixed, only modifying the easily-adjusted weights for the final output. With just one intermediate layer with randomized units, this can have expressive power comparable to fully-optimized networks. Current practice strongly favors fully trained deep networks over wide, shallow randomized networks, but it is not clear that the trade-offs have really been explored as thoroughly as they should have been.)
Reading:
Assignments:
A transformer is an architecture for predicting the next token in a sequence of discrete symbols, based on a context window of \(c\) previous tokens. On reading in a sequence of symbols, all but the last \(c\) of them are ignored, leaving say \(s_1, \ldots s_c\). The transformer first embeds each symbol as a vector in some Euclidean space, getting \(\vec{x}_1, \ldots \vec{x}_c\). This sequence of vectors is then smoothed, with \(\vec{x}_t\) being smoothed together with \(\vec{x}_1, \ldots \vec{x}_{t-1}\), but not with later vectors, giving \(\tilde{x}_1, \ldots \tilde{x}_c\). Each smoothed vector is then passed through the same one-hidden-layer feed-forward neural network \(f\), giving a new sequence of vectors \(f(\tilde{x}_1), \ldots f(\tilde{x}_c)\). This combination of smoothing and applying a neural network in parallel is a layer of the transformer. Layers are stacked on top of layers until patience and/or money runs out. A final nonlinear layer returns the probability distribution for \(s_{c+1}|s_{1:c}\). Used generatively, one samples a new token from this distribution, adds it to the end of the sequence, and begins the cycle over again. Because the probability for the next symbol is a function only of the last \(c\) symbols, this is a Markov chain of order \(c\).
The smoothing operation is a weighted average: \(\widetilde{x}_t = \sum_{r=1}^{t}{x_r \frac{\exp{\left\{ (\mathbf{w}_1 x_r) \cdot (\mathbf{w}_2 x_t)\right\}}}{\sum_{u}{\exp{\left\{ (\mathbf{w}_1 x_u) \cdot (\mathbf{w}_2 x_t) \right\}}}}}\), where \(\mathbf{w}_1\) and \(\mathbf{w}_2\) are matrices of learned parameters. This is a kind of kernel smoothing, a.k.a. Nadaraya-Watson smoothing, where the kernel \(K(u,v) = e^{(\mathbf{w}_1 u) \cdot (\mathbf{w}_2 v)}\) is itself learned, by adjusting the two matrices. For historical reasons, dating back to the mid-2010s, this use of kernel smoothing is called attention.
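A sketch of the smoothing formula, using scalar "vectors" and fixed made-up numbers \(w_1, w_2\) in place of the learned matrices:

```python
import math

# Causal kernel smoothing ("attention"): each smoothed value is a
# weighted average of the current and earlier values, with weights from
# the learned kernel exp((w1*x_r) * (w2*x_t)). Scalars stand in for
# vectors, and w1, w2 are made-up numbers standing in for matrices.
x = [0.5, -1.0, 0.8, 0.3]
w1, w2 = 0.9, 1.1

def smooth(x, t):
    scores = [math.exp((w1 * x[r]) * (w2 * x[t])) for r in range(t + 1)]
    z = sum(scores)
    return sum(x[r] * scores[r] / z for r in range(t + 1))

x_tilde = [smooth(x, t) for t in range(len(x))]
# x_tilde[0] == x[0]: the first position can only attend to itself.
```

Note the causal structure: position \(t\) averages only over positions \(1, \ldots, t\), which is what keeps the model a valid next-token predictor.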
Reading:
The previous sketch of the transformer architecture left out some details which seem important in practice: “temperature” in sampling; positional embeddings; multi-headed attention; layer normalization; skip-layer or residual connections. We will fill in some of those details.
Reading: Same as September 23.
Assignments:
A transformer-based large language model is a higher-order Markov chain. Supplying a prompt to the LLM means conditioning the Markov chain on having realized a certain sequence of symbols — the sequence of symbols in the prompt — and then letting the chain evolve. Prompting does not change the dynamics of the chain, but it can drive it to otherwise improbable parts of the state space. The effects of prompting decay over time; we can calculate how fast using general theory about Markov chains. In-context learning, where we give an LLM examples and ask it to generalize, is also conditioning. It, too, does not change the process, just drives it to a particular part of the state-space. Jailbreaking, again, is just conditioning, and does not change the process.
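A sketch of prompting-as-conditioning, on a toy first-order chain with made-up transition probabilities:

```python
import random

# Prompting as conditioning: a prompt fixes the realized history of the
# chain; generation then just runs the same transition dynamics forward.
random.seed(2)
T = {"a": {"a": 0.9, "b": 0.1},
     "b": {"a": 0.4, "b": 0.6}}

def generate(prompt, n):
    seq = list(prompt)       # conditioning: history is fixed, not changed
    for _ in range(n):
        state = seq[-1]      # first-order: only the last symbol matters
        r, cum = random.random(), 0.0
        for sym, p in T[state].items():
            cum += p
            if r < cum:
                seq.append(sym)
                break
    return "".join(seq)

out = generate("bba", 10)    # same dynamics, different starting region
```

Changing the prompt changes where the chain starts, not the transition probabilities; after enough steps, the distribution of new symbols forgets the prompt at a rate set by the chain's eigenvalues.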
Reading:
Influence functions are a way of measuring how much a function of a distribution would be changed by adding or subtracting a little bit of probability somewhere in the sample space. (You can think of them as a fancy kind of derivative.) They can be used to quantify how much the estimate of a probability model will be changed by deleting a particular data point. This in turn gives a way of calculating how important different source texts are to the output of a language model. We will work through this calculation for a simple Markov chain, and then see how we’d do it for an LLM, and what we’d need to know about the training process and the training corpus.
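A sketch of the deletion calculation for a simple Markov chain, with a made-up training sequence:

```python
from collections import Counter

# Deletion influence for a Markov-chain MLE: how much does removing one
# observed transition change the estimated transition probabilities?
seq = list("aababbaabaabab")  # made-up training sequence

def mle_row(pairs, state):
    c = Counter(p for p in pairs if p[0] == state)
    total = sum(c.values())
    return {j: c[(state, j)] / total for j in ["a", "b"]}

pairs = list(zip(seq[:-1], seq[1:]))
full = mle_row(pairs, "a")
# Delete one ("a","b") transition and re-estimate:
deleted = pairs.copy()
deleted.remove(("a", "b"))
loo = mle_row(deleted, "a")
influence = {j: full[j] - loo[j] for j in full}
# The influence is O(1/n): one transition among many shifts the MLE only
# a little; influence functions approximate this without refitting.
```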
Reading:
Assignments:
Transformers are higher-order, but finite-order, Markov models. It’s been known for a very long time that many simple-sounding computational tasks just cannot be done with finite-order Markov models. This means that there are many things which it is intrinsically impossible to do with transformers, no matter how big they are, or how much data they are trained on. On the other hand, one can approximate some of these tasks with finite-order Markov models (for various senses of “approximation”). How well do transformers approximate them?
Reading:
Alternatives to transformers or catch-up day.
Reading: TBD
Assignments:
Markov chains are discrete-time, discrete-state processes. There are also Markov processes with continuous states and/or continuous time. The Markov property generalizes: states in the future are probabilistically independent of states in the past given the present state. The most important continuous-state, continuous-time Markov processes are diffusion processes, where the expected rate of change of the state is a function of the current state, plus or minus some noise whose variance can also depend on the current state, say \(\frac{dX}{dt} = a(X(t)) + b(X(t))\zeta(t)\) where \(\zeta(t)\) is a noise process which always has mean zero and constant variance. The expected rate-of-change from state \(x\), \(a(x)\), is called the drift term or drift of the process, and the noise magnitude at that state \(b(x)\) is called the diffusion term. When time advances in small, discrete steps, this gives us \(X(t+h) \approx X(t) + h a(X(t)) + h b(X(t)) \zeta(t)\). Making this exact and rigorously self-consistent for continuous time (\(h\rightarrow 0\)) is tricky, and leads to stochastic calculus — a beautiful but complicated subject we will only glance at. The upshot is that diffusion processes turn out to be solutions to stochastic differential equations. Just as with Markov chains, diffusion processes can have invariant distributions, which involve balancing the drift against the diffusion.
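The small-discrete-steps version of a diffusion can be simulated directly; here is a sketch for an Ornstein-Uhlenbeck-type process with made-up parameters, where the noise contribution over a step of length \(h\) scales as \(\sqrt{h}\):

```python
import random

# Discrete-step simulation of a diffusion: drift a(x) = -x pulls the
# state toward 0, constant diffusion b(x) = 1 adds noise.
# X(t+h) = X(t) + h*a(X(t)) + sqrt(h)*b(X(t))*Z, with Z standard normal.
random.seed(3)
h, n_steps = 0.01, 5000
x = 5.0
path = [x]
for _ in range(n_steps):
    z = random.gauss(0.0, 1.0)
    x = x + h * (-x) + (h ** 0.5) * 1.0 * z
    path.append(x)
# The process forgets its starting point and settles into an invariant
# distribution that balances the inward drift against the noise (here,
# Gaussian with mean 0 and variance 1/2).
```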
Reading:
If we start with a vector representing an image (or a sound or anything else), we can apply a diffusion process to it by gradually adding more and more noise, until the original vector (image) is totally obscured in the accumulated noise, which looks like a sample from the invariant distribution of the diffusion process. We can imagine reversing time, to go back from the noise to the original, structured vector. This reverse-time process is also a diffusion process, with its own (calculable) drift and diffusion terms. Starting with random vectors representing visual noise, the new diffusion ends with highly structured vectors representing comprehensible images. To implement the de-noising diffusion process as a statistical model, we start with a collection of images, add a little noise to each one, and then estimate a function (train a neural network) to try to recover the original image from the noisy one. We then add a little more noise, and estimate a function to recover the only-slightly-noisy image from the somewhat-more-noisy image, and so on. If we now start with a totally random image/vector and apply those de-noising functions over and over again, each little step corresponds to a small time increment for the reverse diffusion process. This gives us a generative diffusion model, producing new images starting from noise.
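The forward, noising half of this story is easy to sketch; here a one-dimensional "image" is gradually shrunk and noised (schedule and numbers are made up):

```python
import random

# The forward ("noising") process of a diffusion model: repeatedly
# shrink the signal and add Gaussian noise, so the vector drifts toward
# a pure-noise invariant distribution.
random.seed(4)
x0 = 2.0        # an "image": here just one pixel value
beta = 0.02     # per-step noise level
x = x0
for t in range(500):
    z = random.gauss(0.0, 1.0)
    x = ((1 - beta) ** 0.5) * x + (beta ** 0.5) * z
# After many steps, x is approximately a standard Gaussian draw: the
# original value is obscured. A trained de-noiser learns to walk this
# process backwards, step by step, from noise to structured data.
```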
Reading:
Assignments:
A generative diffusion model defines a probability distribution over images (or over whatever else the vectors represent). This distribution smears or smooths out the distribution of the training images. We take a mixture of two or more probability distributions by taking a weighted average of the distribution functions (with weights that add up to 1). Equivalently, we randomly pick one of those base distributions and then sample from it. A mixture model is a probability model where we approximate a complicated distribution by a mixture of multiple simple distributions, such as Gaussians, the mixture components. Kernel density estimates are extreme forms of mixture models, where there is one mixture component per data point, all with equal weight and all with the same shape (the kernel). This corresponds to first picking a random data point, and then perturbing it according to the distribution of the kernel (for instance, a Gaussian). Generative diffusion models are extremely close to being kernel density estimates, with the effective kernel depending on the de-noising function.
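Sampling from a kernel density estimate makes the mixture picture concrete; a sketch with made-up one-dimensional "training images":

```python
import random

# Sampling from a kernel density estimate as a mixture: pick a training
# point uniformly at random, then perturb it with the kernel (a Gaussian
# here). This is the "extreme mixture model" described above, with one
# equally-weighted component per data point.
random.seed(5)
data = [0.1, 0.4, 0.5, 0.9, 1.2]   # made-up training data (1-D)
bandwidth = 0.1

def sample_kde():
    center = random.choice(data)    # one mixture component per data point
    return center + random.gauss(0.0, bandwidth)

draws = [sample_kde() for _ in range(1000)]
# Draws cluster around the training points, smeared out by the kernel.
```

A generative diffusion model behaves much like this, except that its effective kernel is implicit in the learned de-noising function rather than chosen in advance.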
Reading:
Assignments:
I’ll be out of town, presenting at a workshop.
Assignments:
Information-theoretic methods for diffusion density estimation, or mixture models and KDEs, continued (depending on how far we got on Oct. 28).
Reading:
Assignments:
Combining LLMs and generative diffusion models to get a joint distribution of texts and images. Sampling from the distribution of images conditional on a given piece of text. Sampling from the distribution of texts conditional on a given image.
Reading:
Prompting as conditioning, again. Why generative models are not doing anything differently when they “hallucinate” or “confabulate” than when they give correct answers.
Reading:
Assignments: