Overview/Description

Generative artificial intelligence systems are based on statistical models of text and images. The systems are very new, but they rely on well-established ideas about modeling and inference, some of them from the early 1900s. This course will introduce students to the statistical basis of large language models and image generation models, emphasizing high-level principles over implementation details. It will also examine whether, in light of those statistical foundations, we should think of generative AI as a step towards “artificial general intelligence”, or instead as a “cultural technology”.

473 vs. 673

Undergraduates must register for 36-473; graduate students must register for 36-673. (Undergraduates who try to register for 673 will be dropped from the roster.) Lectures and many assignments will be the same for both courses, but students in 673 will also be expected to complete a project during the second half of the semester. (Details are described below, under “grading”.)

Pre-requisites

The course is intended for advanced undergraduates in statistics and closely related majors, and for MS students in statistics. Undergraduates taking 473 must have passed 36-402. MS students taking 36-673 should have preparation equivalent to 36-402 and all its pre-req courses in statistics and mathematics. A course in stochastic processes, such as 36-410, is helpful but not required.

Topics

  1. Maximum likelihood estimation and data compression
    • General properties of estimating probability models by maximizing likelihood. Connection between maximizing likelihood and data compression. Correspondence between statisticians’ “maximum likelihood estimation of parameters” and computer scientists’ “self-supervised training”.
  2. Markov models
    • Definition of Markov models and of the Markov property. Markov models for text and other sequences of discrete symbols. Basic dynamical properties of Markov models. Higher-order Markov models. “Diffusion models” = Markov models in vector spaces. Connection of diffusion models with mixture models and with kernel density estimation. What happens when a Markov model is conditioned on a particular starting state.
  3. Neural networks
    • Feed-forward neural networks. Neural networks as a technique for parameterizing / approximating complicated functions. Numerical methods for function estimation / neural network training. “Encoding” or “embedding” discrete symbols in vector spaces so neural networks can work with them.
  4. Combinations
    • Maximum likelihood estimation of Markov models. Maximum likelihood estimation of Markov models parameterized by neural networks. The “transformer” architecture for large language models = parameterizing a higher-order Markov chain with a particular type of neural network, trained / estimated by maximum likelihood. “Generative diffusion” = a Markov model in the space of images, where each step is a neural network trained/estimated to undo a little bit of visual noise. Multi-modal generative AI = combinations of LLMs and generative diffusion.

Learning Objectives

By the end of the semester, students in this class should be able to:

  • Appreciate how statistical principles and methodologies form the foundation of techniques used in generative AI.
  • Identify, compare, and contrast generative and non-generative statistical models, and translate between models expressed in statistical frameworks and those in AI/machine learning contexts.
  • Implement small-scale examples of methods related to generative AI, including Markov models, neural networks, transformer architectures, and generative diffusion models.
  • Communicate the objectives, design, and results of experiments on generative AI systems effectively, both orally and in writing.

Instructors

Role        Name                   E-mail                           Office
Professor   Dr. Cosma Shalizi      cshalizi [at] cmu.edu            Baker Hall 229C
TA          Mr. Oliver Hannaoui    (not to be bothered by e-mail)   TBD

Course Policies

Assignments

Grades for 36-473 will be based 100% on regular homework assignments. Grades for 36-673 will be based 80% on homework and 20% on a (group) project.

Homework: There will be 9 homework assignments, due on Thursdays at 6 pm. Your lowest three homework scores will be replaced by 80/100 (unless that would lower your average) — no questions asked, no excuses needed. Late homework will not be accepted for any reason; no flexibility will be offered for homework due dates.

Projects: Students taking 36-673, and only those students, will do a project. This can be done in a group of up to 4. Projects may explore some area of the literature which goes beyond what’s covered in the course, or covers it in more depth; experiment on existing systems; implement new systems or ideas; or combine these. Students are encouraged to propose their own project ideas. Proposals must be approved by the professor, and preliminary proposals are due before final proposals, with time for feedback. All projects — even ones that are primarily about implementing something — need to lead to a written report and to an oral presentation. Students will be graded both on the quality of their work, and on their intelligent and polite engagement with other students’ presentations. (Do not make travel plans which conflict with the last week of classes.)

All assignments in this class have three purposes, in decreasing order of importance:

  1. Getting you to practice with the concepts and skills of the class.
  2. Giving you structured feedback on how you are using those concepts and skills.
  3. Assessing, to yourself and to the rest of the world, how well you have mastered those concepts and skills.

The points for individual homework problems, rubric items, etc., reflect how important the underlying concepts and skills are for you to master, not how much time or work they should take.

Grade Thresholds

A grade of A will be an average of 90 or above, a B an average of 80 or above (but below 90), and so on. Any average below 60 will be an R.

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve or need a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, “what is the least amount of work I need to do in order to get the grade I want?” is a much worse way to approach higher education than “how can I learn the most from this class and from my teachers?”.

Lectures

Lectures will be used to amplify the readings, provide examples and demos, answer questions, and generally discuss the material. You will generally find it helpful to do the readings before coming to class.

Do not record lectures. Exceptions to this will be made only for students with written accommodation plans, authorized by CMU’s Office of Disability Resources, requiring recordings.

Solutions

Solutions for all homework assignments will be available, after their due date, through Canvas. Please don’t share them with anyone, even after the course has ended. This means do not upload them to websites. Sharing or uploading the solutions will count as academic dishonesty and will be pursued through the university’s disciplinary process.

Office Hours

Thursdays at 10 am in Prof. Shalizi’s office, Baker Hall 229C.

Textbook

Christopher M. Bishop (with Hugh Bishop), Deep Learning: Foundations and Concepts (Cham, Switzerland: Springer-Verlag, 2024; doi:10.1007/978-3-031-45468-4) is recommended. Hardcopies are a bit expensive (currently $89.99), but we have free access to the full text through the university library — follow the link (and if need be sign in to the library).

Academic Integrity

Except for explicit group projects, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester’s previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. Websites which compile course materials are not legitimate online resources.

In general, you are free to discuss homework with other students in the class, though not to share work; such conversations must be acknowledged in your assignments. You should not seek help with assignments from people outside the class. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to formal disciplinary action, under the terms of the university’s policy on academic integrity.

Use of AI Is Discouraged

AI tools have their place — why have this course otherwise? — but they are most useful when they are doing boring and repetitive, “mechanical” work which you can check. The point of assignments in a class like this is to help you acquire the knowledge and skills you need to evaluate someone else’s work, or the output of an AI. Therefore, using generative AI for any part of an assignment which requires you to think is actively harmful to your own learning. (This is backed up by professional studies of the effect of AI on student learning, which I am happy to discuss.)

I recognize that it is going to be very hard to stop you from using these machines to do mechanical tasks (like writing boilerplate code or graph formatting). But I would also be failing in my duty as a teacher if I let you replace your own thought and learning with a random sample of the lowest common denominator of the Internet. Since drawing a bright line between those two extremes is hard, I am willing, on a trial basis, to permit the use of generative AI tools in this class, on the following conditions:

  1. Whenever possible, don’t use AI. (You’re in college, try to learn for yourself.)
  2. If you do use AI on an assignment, you must acknowledge it, and include a complete transcript of your session(s) with the AI. (Put this at the end, after the main body of your assignment.)
  3. If I judge that you are over-relying on the AI, you will receive feedback to that effect, and your grade will be marked down for that assignment. (My decisions on this are final.)
  4. Using AI without acknowledgment will be treated as plagiarism.
  5. I reserve the right to forbid the use of AI altogether, for everyone.

Time expectations

You should expect to spend 5–7 hours on assignments every week, outside of class, averaging over the semester. (This follows from the university’s rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Accommodations for Students with Disabilities

If you need accommodations for physical and/or learning disabilities, please contact the Office of Disability Resources, via their website, [http://www.cmu.edu/disability-resources]. They will help you work out an official written accommodation plan, and help coordinate with me.

Schedule of Lectures, Readings and Assignments

References to “Bishop” in the readings are to the recommended textbook (see above). Readings marked with asterisks (*) are optional. Readings marked with one or more daggers (\(\dagger\)) are, as it were, especially optional, because they are technically advanced, older, or go off on a tangent to the main direction of the course; the number of daggers gives my sense of how hard they will be for most students. Readings not marked with an asterisk or dagger are mandatory.

Maximum Likelihood and Self-Supervised Generative Models (Week 1)

Tuesday, August 26: Generative Models Are Statistical Models

Modern generative AI rests on large language models, and LLMs are higher-order parametric Markov models fit by maximum likelihood. The first half of the semester will be about unpacking this statement and its consequences.

Language models, in this sense, are probability models for sequences of discrete symbols. A probability model which specifies the distribution for all parts of the data is generative, because it can be used to make up, or “generate”, new data sets by sampling from the probability distribution. If the data are text or images, these models then generate new texts or new images. Not all statistical models are generative: regressions and classifiers, for instance, are not generative, because they do not specify distributions for the predictor (“independent”) variables. Probability models can also be used to compress data sets, essentially by saying how that model could have generated that particular data set. The more accurate the model’s distribution, the more efficient it is at compression. The size of this compression or encoding can be measured in the number of binary digits, or bits, needed to store it, and this turns out to be the negative logarithm (base 2) of the probability of the data set. Various averages and transformations of the negative log probability, with names like “cross-entropy” and “perplexity”, are often used to evaluate generative models.
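
To make the bit-counting concrete, here is a toy R sketch (the model, the sequence, and the variable names are all invented for the example): the ideal code length is the negative log (base 2) probability of the sequence, the cross-entropy is its per-symbol average, and the perplexity is two raised to the cross-entropy.

    # Code length, cross-entropy, and perplexity of a sequence under a
    # simple IID symbol model (everything here is invented for the example).
    p <- c(a = 0.5, b = 0.3, c = 0.2)        # model's symbol distribution
    x <- c("a", "a", "b", "c", "a", "b")     # observed sequence
    log2.probs <- log2(p[x])                 # log-probability of each symbol
    bits <- -sum(log2.probs)                 # ideal code length for the data set
    cross.entropy <- -mean(log2.probs)       # bits per symbol
    perplexity <- 2^cross.entropy            # effective number of equally-likely symbols
    c(bits = bits, cross.entropy = cross.entropy, perplexity = perplexity)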

Notes: PDF, .Rmd

Reading:

  • Ted Chiang, “ChatGPT is a Blurry JPEG of the Web”, The New Yorker 9 February 2023, [https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web]
  • (*) Eugene Charniak, Statistical Language Learning (Cambridge, Massachusetts: MIT Press, 1993)
  • (*) Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (2nd edition, New York: John Wiley, 2006, doi:10.1002/047174882X), chs. 1 and 2
  • (\(\dagger\)) Claude E. Shannon, “A Mathematical Theory of Communication”, Bell System Technical Journal 27 (1948): 379–423, reprinted in Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication (Urbana, Illinois: University of Illinois Press, 1963)
  • (\(\dagger\)) Peter Grünwald, “A Tutorial Introduction to the Minimum Description Length Principle”, pp. 23–79 in Peter Grünwald, In Jae Myung and Mark Pitt (eds.), Advances in Minimum Description Length: Theory and Applications (Cambridge, Massachusetts: MIT Press, 2005; doi:10.7551/mitpress/1114.001.0001), arxiv:math.ST/040607
  • (\(\dagger\dagger\)) Peter Grünwald, The Minimum Description Length Principle (Cambridge, Massachusetts: MIT Press, 2007; doi:10.7551/mitpress/4643.001.0001)
  • (\(\dagger\dagger\dagger\)) Jorma Rissanen, Stochastic Complexity in Statistical Inquiry (Singapore: World Scientific, 1989), doi:10.1142/0822
  • (\(\dagger\dagger\dagger\)) Noam Chomsky, “Three Models for the Description of Language”, IRE Transactions on Information Theory 2 (1956): 113–124, doi:10.1109/TIT.1956.1056813

Assignments:

Thursday, August 28: Maximum Likelihood Estimation, a.k.a. Self-Supervised Training

When probability models have one or more adjustable parameters, those parameters need to be estimated by matching the data. The most efficient way of doing this is usually to maximize the likelihood, the probability that the model would have generated the particular data we saw. This corresponds to looking for the shortest compression of the data — or at least the shortest achievable using that model. In previous classes, you have worked through simple examples of maximum likelihood estimation. We will go over the general theory of maximum likelihood here, including the interpretation of first and second derivatives of the log-likelihood function.

Maximum likelihood estimation of generative models is what machine learning calls “self-supervised training”. We will look at the properties of self-supervised training when the probability model cannot match the true distribution of the data exactly. The upshot is that maximizing the likelihood converges on the best approximation to the true distribution which the model can achieve. This particular sense of “best approximation” involves likelihood ratio tests, and a quantity called the “Kullback-Leibler divergence” or “relative entropy”; we’ll also see why that’s a reasonable goal.
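
As a small preview in R (a sketch with invented data, not from the course notes): fitting a deliberately wrong model by numerically maximizing the likelihood still converges on the Kullback-Leibler-closest member of the model family.

    # Numerical MLE for a Gaussian fit to deliberately non-Gaussian data.
    set.seed(1)
    x <- rexp(1000)                         # true distribution: Exponential(1)
    negloglik <- function(theta) {
        -sum(dnorm(x, mean = theta[1], sd = exp(theta[2]), log = TRUE))
    }                                       # sd on the log scale, to keep it positive
    fit <- optim(c(0, 0), negloglik)        # minimize the negative log-likelihood
    c(mean = fit$par[1], sd = exp(fit$par[2]))
    # Both approach 1, the mean and sd of Exponential(1): the KL-closest Gaussian.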

Reading:

  • Bishop, section 2.5 (and Appendix C on Lagrange multipliers, if needed)
  • (*) Stephen M. Stigler, “The Epic Story of Maximum Likelihood”, Statistical Science 22 (2007): 598–620, doi:10.1214/07-STS249
  • (*) Bruce Lindsay and Liawei Liu, “Model Assessment Tools for a Model False World”, Statistical Science 24 (2009): 303–318, doi:10.1214/09-STS302
  • (\(\dagger\)) R. A. Fisher, “On the Mathematical Foundations of Theoretical Statistics”, Philosophical Transactions of the Royal Society A 222 (1922): 309–368, doi:10.1098/rsta.1922.0009, or full text via the R. A. Fisher Archive
  • (\(\dagger\dagger\)) Rudolf Kulhavy, Recursive Nonlinear Estimation: A Geometric Approach (Berlin: Springer-Verlag, 1996), doi:10.1007/BFb0031830, chs. 1–3

Notes: PDF, .Rmd

Markov Models (Weeks 2 and 3)

Tuesday, September 2: Markov Models

Large language models are higher-order Markov models. To understand them, we begin with first-order Markov models. A Markov chain is a sequence of discrete, random symbols where future variables are probabilistically independent of the past, given the current state of the process. This conditional independence is called the Markov property. The Markov property implies that every symbol has a certain probability (perhaps zero) of following any other symbol. These conditional probabilities can be assembled into a matrix, the transition matrix of the chain. All the other properties of the process can be worked out by doing linear algebra on the transition matrix. The eigenvectors of the transition matrix give us the basic “modes” of the chain’s distributional dynamics, and the eigenvalues tell us which of those modes are stable or invariant over time (eigenvalue 1), and how quickly the unstable modes decay towards the stable ones (eigenvalues inside the unit circle). Markov chains are, under mild conditions, ergodic, meaning that statistics calculated along any one realization of the process converge on what we would calculate from an invariant distribution. This implies that correlations between very distant parts of the sequence go towards zero. Some simple calculations based on the transition matrix give us rough handles on rates of convergence.
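
Here is a minimal R sketch of these calculations, for an invented three-state chain: the invariant distribution is the (normalized) leading left eigenvector, the second-largest eigenvalue modulus controls the speed of convergence, and time-averages along one simulated trajectory match the invariant distribution.

    # Basic linear algebra on an invented 3-state transition matrix.
    P <- rbind(c(0.90, 0.05, 0.05),
               c(0.10, 0.80, 0.10),
               c(0.25, 0.25, 0.50))       # rows = current state, cols = next state
    e <- eigen(t(P))                      # invariant distributions are *left* eigenvectors
    pi.inv <- Re(e$vectors[, 1])          # eigenvalue 1 comes first (largest modulus)
    pi.inv <- pi.inv / sum(pi.inv)        # normalize to a probability distribution
    pi.inv
    abs(e$values[2])                      # distance to stationarity shrinks like this per step
    # Ergodicity: time-averages along one trajectory match the invariant distribution
    x <- numeric(1e5); x[1] <- 1
    for (t in 2:length(x)) x[t] <- sample(1:3, 1, prob = P[x[t-1], ])
    table(x) / length(x)                  # close to pi.inv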

Notes: PDF, .Rmd

Reading:

  • Peter Guttorp, Stochastic Modeling of Scientific Data (London: Chapman and Hall, 1995), ch. 2, sections 2.1–2.5 [Will be available on Canvas]
  • (*) Gely P. Basharin, Amy N. Langville and Valeriy A. Naumov, “The Life and Work of A. A. Markov”, Linear Algebra and its Applications 386 (2004): 3–26, doi:10.1016/j.laa.2003.12.041 (PDF reprint via Prof. Langville)
  • (*) Judy L. Klein, Statistical Visions in Time: A History of Time Series Analysis, 1662–1938 (Cambridge, England: Cambridge University Press, 1997)

Thursday, September 4: Estimating Markov Models with Maximum Likelihood

When the transition matrix of a Markov model is unknown, we need to estimate it from data. Once we see how to define the likelihood function for a Markov chain, we can estimate it. There is a closed-form solution for the transition matrix which maximizes the log-likelihood, and we can use our general maximum likelihood theory to calculate various properties of this estimate, such as (asymptotic) standard errors and confidence sets. This closed-form solution relies on all the elements of the transition matrix being free to vary independently (so long as each row adds to 1). When those elements are instead functions of a common set of underlying parameters, we cannot change one transition probability without affecting others. These cross-influences modify some of the properties of the MLE, again in ways which we can work out from our general theory of likelihood maximization.
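
A minimal R sketch of the closed-form, unrestricted MLE (with a made-up stand-in for a real symbol sequence): count the observed transitions, then divide each row of the count table by its total.

    # Closed-form MLE for an unrestricted transition matrix (toy data).
    x <- sample(c("a", "b", "c"), 1000, replace = TRUE)   # stand-in for a real sequence
    N <- table(head(x, -1), tail(x, -1))                  # N[i, j] = # of i -> j transitions
    P.hat <- N / rowSums(N)                               # divide each row by its total
    P.hat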

Notes: PDF, .Rmd (with comments on the code)

Reading:

  • Peter Guttorp, Stochastic Modeling of Scientific Data (London: Chapman and Hall, 1995), ch. 2, sections 2.7 and 2.9 [Will be available on Canvas]
  • Bishop, Appendix C (if you need a refresher on Lagrange multipliers)
  • (*) Judy L. Klein, Statistical Visions in Time: A History of Time Series Analysis, 1662–1938 (Cambridge, England: Cambridge University Press, 1997)
  • (\(\dagger\)) Patrick Billingsley, Statistical Inference for Markov Processes (Chicago: University of Chicago Press, 1961)

Assignments:

Tuesday, September 9: Training Markov Models by Stochastic Gradient Descent

For most parametric probability models, there is no closed-form expression for the maximum likelihood estimator. We might be able to get a complicated, nonlinear set of equations for the gradient which needs to equal zero, \(\nabla L_n(\hat{\theta}) = 0\), but solving those equations for the MLE \(\hat{\theta}\) is itself hard; it might not even be feasible to get those equations explicitly, and we just have a function \(L_n(\theta)\) to optimize. A very standard method for numerical optimization is gradient descent: start somewhere, \(\theta^{(0)}\), find the gradient of the objective function there, \(\nabla L_n(\theta^{(0)})\), move “downhill” in that direction a short distance, \(\theta^{(1)} = \theta^{(0)} - \eta \nabla L_n(\theta^{(0)})\), and repeat the cycle at \(\theta^{(1)}\), until we find a (local) minimum or get tired. This requires us to calculate the gradient \(\nabla L_n\), which means taking one derivative per parameter. That can be bad enough, if there are a lot of parameters, but the log-likelihood is a sum over \(n\) data points (and the cross-entropy is an average over \(n\) data points), which means calculating every derivative takes a time proportional to \(n\). If, however, we pick one observation uniformly at random from the data set, the gradient of the log probability at that observation is an unbiased (but noisy) estimate of the gradient of the cross-entropy. (If the model is IID, an “observation” is a data point; if the model is a Markov chain, an “observation” is a transition from one state to the next.) Stochastic gradient descent is (basically) choosing a new random observation to calculate the gradient at each optimization step. This makes our calculation of the MLE sloppier than if we could exactly solve the equations, but this computational error can be made small in comparison to the statistical error which is intrinsic to only having \(n\) data points.
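
Here is a deliberately tiny R sketch of SGD (invented data and step size), estimating a Gaussian mean by maximum likelihood; the exact gradient \(\sum_i (\theta - x_i)\) is replaced by \((\theta - x_i)\) at one randomly chosen observation per step.

    # Stochastic gradient descent on a negative log-likelihood (toy example).
    set.seed(2)
    x <- rnorm(1e4, mean = 3)
    theta <- 0                                 # arbitrary starting guess
    eta <- 0.01                                # learning rate (step size)
    for (step in 1:1e4) {
        i <- sample(length(x), 1)              # one observation, uniformly at random
        theta <- theta - eta * (theta - x[i])  # noisy downhill step
    }
    c(sgd = theta, mle = mean(x))              # both close to 3, up to optimization noise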

Reading:

  • Bishop, ch. 7, sections 7.1–7.3
  • Léon Bottou and Olivier Bousquet, “The Tradeoffs of Large Scale Learning”, pp. 351–368 of Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning (Cambridge, Massachusetts: MIT Press, 2012), doi:10.7551/mitpress/8996.003.0015, or full text via Dr. Bottou
  • (\(\dagger\)) Arthur E. Albert and Leland A. Gardner, Jr., Stochastic Approximation and Nonlinear Regression (Cambridge, Massachusetts: MIT Press, 1967), doi:10.7551/mitpress/6464.001.0001

Thursday, September 11: Higher-Order Markov Models

We have been discussing first-order Markov models, where \(X_{t:\infty}\) is independent of \(X_{0:t-2}\) given \(X_{t-1}\). In a second-order Markov model, conditioning on \(X_{t-2:t-1}\) makes \(X_t\) independent of \(X_{0:t-3}\). In a Markov model of order \(m\), conditioning on \(X_{t-m:t-1}\) makes \(X_t\) independent of \(X_{0:t-m-1}\). Every Markov model of order \(m\) includes models of smaller order, down to order 1, as special cases. For the analysis of higher-order Markov models, however, we go the other way, using what’s called the extended chain trick: For a second-order chain, \(Y_t \equiv ( X_{t}, X_{t-1} )\). The \(Y\) process is a first-order Markov process, not on \(\mathcal{X}\), the original space, but on \(\mathcal{X} \times \mathcal{X}\), the space of pairs of values. If \(\mathcal{X}\) was finite, then so is \(\mathcal{X} \times \mathcal{X}\), and we can analyze \(Y\) like any other Markov chain. It has a transition matrix, whose largest eigenvalue will be 1, corresponding to an invariant distribution. As \(t \rightarrow \infty\), the distribution of \(Y_t\) will approach that invariant distribution, regardless of the initial distribution of \(Y_0\), and the time average of any function \(h(Y_t)\) will approach its expected value under the invariant distribution. And so on and so forth: the extended chain is just another Markov chain. The extended-chain device works for higher-than-second order Markov models as well; an \(m^{\mathrm{th}}\) order chain is a chain on \(\mathcal{X}^m\).

Statistical inference for higher-order Markov chains works just like it does for first-order chains. In a chain of order \(m\), we need to know \(\Pr{(X_t=j|X_{t-m:t-1} = w)}\) for all words \(w\) of length \(m\). (This is equivalent to knowing the transition matrix of \(Y\).) The unrestricted MLE of this probability is, as before, \(\frac{N_{wj}}{\sum_{j}{N_{wj}}}\). For parametric models, we obtain estimating equations of the same form as before. Since chains of order \(m\) contain chains of order \(m-1, m-2, \ldots 1\) as special cases, we can use likelihood ratio tests to do order selection; there are also other possibilities.
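
A small R sketch of this, for an invented binary sequence: paste each length-2 history into a single extended state, and the MLE is again row-normalized counts.

    # Fitting a second-order chain via the extended-chain trick (toy sequence).
    x <- sample(c("a", "b"), 1000, replace = TRUE)
    w <- paste(head(x, -2), x[2:(length(x) - 1)], sep = "")  # length-2 histories
    j <- tail(x, -2)                                         # the following symbol
    N <- table(w, j)                                         # N[w, j] = count of w -> j
    N / rowSums(N)                                           # MLE of Pr(X_t = j | history w)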

An alternative to always conditioning on the last \(m\) observations is to sometimes use shorter histories and sometimes use longer ones, depending on context. This leads to a class of models called context trees or variable-length Markov chains, with a history going back to the 1950s.

Reading:

  • Peter Guttorp, Stochastic Modeling of Scientific Data (London: Chapman and Hall, 1995), ch. 2, section 2.8 [On Canvas]
  • (*) Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé and Ievgen Redko, “Large Language Models as Markov Chains”, arxiv:2410.02724
  • (\(\dagger\)) J. D. Foulkes, “A Class of Machines which Determine the Statistical Structure of a Sequence of Characters”, pp. 66–73 in vol. 4 of Western Electronics Convention [WESCON] Record (Institute of Radio Engineers, 1959) PDF
  • (\(\dagger\)) Peter Bühlmann and Abraham J. Wyner, “Variable Length Markov Chains”, Annals of Statistics 27 (1999): 480–513, doi:10.1214/aos/1018031204
  • (\(\dagger\dagger\)) Jorma Rissanen, “A Universal Data Compression System”, IEEE Transactions on Information Theory 29 (1983): 656–664, doi:10.1109/TIT.1983.1056741
  • (\(\dagger\dagger\dagger\)) Imre Csiszár and Paul C. Shields, “The Consistency of the BIC Markov order estimator”, Annals of Statistics 28 (2000): 1601–1619, doi:10.1214/aos/1015957472

Assignments:

Neural Networks (Week 4)

Tuesday, September 16: Feed-forward Neural Networks

Feed-forward neural networks or multi-layer perceptrons are ways of building complicated nonlinear functions by taking weighted sums and applying one fixed nonlinearity, over and over again. The units (“neurons”) of a feed-forward neural network are arranged in layers. The first layer is the vector of inputs to the function, say \(x^{(1)}_{1}, \ldots x^{(1)}_{p}\). The units in the second layer take weighted linear combinations of the inputs, and push them through a fixed nonlinear activation function or squashing function, say \(\sigma\), so the output of unit \(i\) in layer 2 is \(x^{(2)}_i = \sigma(b^{(2)}_i + \sum_{j=1}^{p}{w^{(2)}_{ij} x^{(1)}_j})\). We then do the same thing layer after layer: \(x^{(k)}_{i} = \sigma(b^{(k)}_i + \sum_{j}{w^{(k)}_{ij} x^{(k-1)}_j})\), summing over all the units \(j\) in layer \(k-1\). (The number of units in each layer can vary.) The final output can be a linear combination of the outputs of the last layer, or some other simple, fixed transformation, such as converting log odds to probabilities (“softmax”).
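
A minimal R sketch of the forward pass (sizes and weights invented, and random rather than trained, so this just evaluates some arbitrary function of the input):

    # Forward pass through two intermediate layers with a logistic activation.
    sigma <- function(z) 1 / (1 + exp(-z))            # activation / squashing function
    layer <- function(x, W, b) sigma(W %*% x + b)     # weighted sums, then nonlinearity
    set.seed(3)
    x <- c(1.0, -0.5, 2.0)                            # input vector (layer 1)
    W2 <- matrix(rnorm(4 * 3), 4, 3); b2 <- rnorm(4)  # 4 units in layer 2
    W3 <- matrix(rnorm(2 * 4), 2, 4); b3 <- rnorm(2)  # 2 units in layer 3
    layer(layer(x, W2, b2), W3, b3)                   # a linear or softmax layer could follow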

Every arrangement of units and choice of activation function, or network architecture, defines a parameterized class of functions, with the parameters being all of the \(b^{(k)}_i\) and \(w^{(k)}_{ij}\), collectively the weights of the model. Feed-forward neural networks turn out to be universal function approximators: basically any function can be approximated arbitrarily closely by some network, and the larger we make the network, the closer the approximation can get. In principle, networks with one intermediate layer between input and output are already universal approximators, but experience shows it’s often easier to work with multiple layers (deep networks). Deep or shallow, there are always multiple networks with the same architecture, but different weights, which compute exactly the same function, so the weights are unidentified. This can actually make it easier to find good weights, but does make it hard to interpret the innards of the net.

Neural networks give us functions of continuous vectors. To use them on discrete, categorical variables (such as text), one needs an embedding, mapping each categorical value to a vector. With an embedding, one can use a neural network to parametrize the transition matrix of a Markov chain, or of a higher-order Markov model. Embeddings are also unidentified.
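
Continuing the sketch, here is one (invented, untrained) way an embedding plus a network can parameterize the transition probabilities of a Markov chain over three symbols:

    # An embedding plus a network as a parameterized transition matrix.
    set.seed(8)
    symbols <- c("a", "b", "c")
    E <- matrix(rnorm(3 * 5), nrow = 3, dimnames = list(symbols, NULL))  # 5-dim embeddings
    W <- matrix(rnorm(3 * 5), 3, 5)         # maps an embedding to log odds over symbols
    softmax <- function(z) exp(z) / sum(exp(z))
    next.dist <- function(s) setNames(drop(softmax(W %*% E[s, ])), symbols)
    next.dist("a")                          # one row of the parameterized transition matrix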

(These models are called neural networks because they have their roots in a very old simplified picture of how neurons in the brain work. Modern models have departed a great deal from that original picture, and not in the direction of biological realism.)

Reading:

  • Bishop, ch. 6, sections 6.1–6.4 and 12.2.1–12.2.2
  • (*) B. D. Ripley, Pattern Recognition and Neural Networks (Cambridge, England: Cambridge University Press, 1996; doi:10.1017/CBO9780511812651), ch. 5, sections 5.0–5.2 and 5.6 (pp. 143–148 and 168–173), and skimming section 5.7 (pp. 173–180)
  • (\(\dagger\)) Maureen Caudill and Charles Butler, Naturally Intelligent Systems (Cambridge, Massachusetts: MIT Press, 1990), doi:10.7551/mitpress/4873.001.0001
  • (\(\dagger\dagger\)) Dirk Husmeier, Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions (Berlin: Springer-Verlag, 1999) doi:10.1007/978-1-4471-0847-4
  • Roots:
    • (\(\dagger\)) Warren S. McCulloch and Walter Pitts, “A logical calculus of the ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics 5 (1943): 115–133, doi:10.1007/BF02478259
    • (\(\dagger \dagger\)) Warren S. McCulloch, Embodiments of Mind (Cambridge, Massachusetts: MIT Press, 1965)
    • (\(\dagger \dagger\)) Donald O. Hebb, The Organization of Behavior: A Neuropsychological Theory (New York: Wiley, 1949), doi:10.4324/9781410612403
    • (\(\dagger \dagger \dagger\)) Friedrich A. Hayek, The Sensory Order: An Inquiry into the Foundations of Theoretical Psychology (Chicago: University of Chicago Press, 1952), doi:10.7208/9780226321301
    • (\(\dagger\)) Frank Rosenblatt, “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”, Psychological Review 65 (1958): 386–408, doi:10.1037/h0042519
    • (\(\dagger\)) David E. Rumelhart, James L. McClelland, and the PDP Research Group, Parallel Distributed Processing, volume 1: Explorations in the Microstructure of Cognition: Foundations (Cambridge, Massachusetts: MIT Press, 1986), doi:10.7551/mitpress/5236.001.0001
    • Roots of the roots:
      • (\(\dagger \dagger \dagger\)) Charles Sherrington, The Integrative Action of the Nervous System (New Haven, Connecticut: Yale University Press, 1906)
      • (\(\dagger \dagger \dagger \dagger\)) Nicolas Rashevsky, Mathematical Biophysics: Physico-Mathematical Foundations of Biology (Chicago: University of Chicago Press, 1938)
      • (\(\dagger \dagger \dagger\)) Bertrand Russell, Introduction to Mathematical Philosophy, 2nd edition (London: George Allen and Unwin, 1920) Online
  • Symmetries and identification:
    • (\(\dagger\)) Rachel Carrington, Karthik Bharath, Simon Preston, “Invariance and identifiability issues for word embeddings”, NeurIPS 2019, arxiv:1911.02656
    • (\(\dagger\dagger\)) Samuel K. Ainsworth, Jonathan Hayase and Siddhartha Srinivasa, “Git Re-Basin: Merging Models modulo Permutation Symmetries”, arxiv:2209.0483
    • (\(\dagger\dagger\)) An Mei Chen, Haw-minn Lu and Robert Hecht-Nielsen, “On the Geometry of Feedforward Neural Network Error Surfaces”, Neural Computation 5 (1993): 910–927, doi:10.1162/neco.1993.5.6.910
    • (\(\dagger\dagger\dagger\)) Vera Kurkova and Paul C. Kainen, “Functionally Equivalent Feedforward Neural Networks”, Neural Computation 6 (1994): 543–558, doi:10.1162/neco.1994.6.3.543

Thursday, September 18: Training Neural Networks

The weights/parameters of a neural network are usually estimated, or trained, by gradient descent. Calculating the derivatives of the loss with respect to weights in the top layer is easy. (There, we are actually dealing with a linear smoother or GLM.) But it gets complicated after that. A clever manipulation of the chain rule from calculus, however, shows how to do this recursively; this is called back-propagation. There are other ways of doing automatic differentiation, but back-propagation is the oldest and still the most widespread.
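
Here is a back-propagation sketch in R for a single hidden layer and squared-error loss, with the chain rule written out by hand (data, network size, and learning rate are all invented for the illustration; real systems automate the differentiation):

    # Back-propagation "by hand", with stochastic gradient steps.
    set.seed(4)
    x <- matrix(rnorm(200), 100, 2)
    y <- x[, 1] * x[, 2]                         # a toy nonlinear target
    sigma <- function(z) 1 / (1 + exp(-z))
    W <- matrix(rnorm(20, sd = 0.5), 10, 2); b <- rnorm(10)  # hidden layer
    v <- rnorm(10); a <- 0                                   # linear output layer
    eta <- 0.05
    for (step in 1:5e4) {
        i <- sample(nrow(x), 1)
        h <- drop(sigma(W %*% x[i, ] + b))       # forward: hidden activations
        yhat <- sum(v * h) + a                   # forward: output
        err <- yhat - y[i]                       # derivative of loss w.r.t. output
        grad.h <- err * v * h * (1 - h)          # chain rule, back through sigma
        v <- v - eta * err * h; a <- a - eta * err
        W <- W - eta * outer(grad.h, x[i, ]); b <- b - eta * grad.h
    }
    yhat.all <- as.vector(v %*% sigma(W %*% t(x) + b)) + a
    mean((yhat.all - y)^2)                       # training error; compare with var(y)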

(An alternative is to randomly seed the weights for the intermediate layers and then hold them fixed, only modifying the easily-adjusted weights for the final output. With just one intermediate layer with randomized units, this can have expressive power comparable to fully-optimized networks. Current practice strongly favors fully trained deep networks over wide, shallow randomized networks, but it is not clear that the trade-offs have really been explored as thoroughly as they should have been.)

Reading:

  • Bishop, ch. 8
  • (*) B. D. Ripley, Pattern Recognition and Neural Networks (Cambridge, England: Cambridge University Press, 1996; doi:10.1017/CBO9780511812651), ch. 5, sections 5.3 and 5.4 (pp. 148–163)
  • (*) W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, 4th edition (New York: Springer-Verlag, 2002; doi:10.1007/978-0-387-21706-2), sections 8.10 and 12.4 (pp. 243–249 and 342–344)
  • (\(\dagger\)) David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, “Learning representations by back-propagating errors”, Nature 323 (1986): 533–536, doi:10.1038/323533a0
  • (\(\dagger\dagger\)) Halbert White, “Learning in Artificial Neural Networks: A Statistical Perspective”, Neural Computation 1 (1989): 425–464, doi:10.1162/neco.1989.1.4.425
  • (\(\dagger\dagger\dagger\)) Yoh-Han Pao, Gwang-Hoon Park and Dejan J. Sobajic, “Learning and Generalization Characteristics of the Random Vector Functional-Link Net”, Neurocomputing 6 (1994): 163–180, doi:10.1016/0925-2312(94)90053-1
  • (\(\dagger\dagger\)) Ali Rahimi and Benjamin Recht, “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning”, pp. 1313–1320 in Daphne Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds.), Advances in Neural Information Processing Systems 21 [NIPS 2008]

Assignments:

Transformers and “Attention” (Weeks 5, 6 and 7)

Tuesday, September 23: Transformers, Part I

A transformer is an architecture for predicting the next token in a sequence of discrete symbols, based on a context window of \(c\) previous tokens. On reading in a sequence of symbols, all but the last \(c\) of them are ignored, leaving say \(s_1, \ldots s_c\). The transformer first embeds each symbol as a vector in some Euclidean space, getting \(\vec{x}_1, \ldots \vec{x}_c\). This sequence of vectors is then smoothed, with \(\vec{x}_t\) being smoothed together with \(\vec{x}_1, \ldots \vec{x}_{t-1}\), but not with later vectors, giving \(\tilde{x}_1, \ldots \tilde{x}_c\). Each smoothed vector is then passed through the same one-hidden-layer feed-forward neural network \(f\), giving a new sequence of vectors \(f(\tilde{x}_1), \ldots f(\tilde{x}_c)\). This combination of smoothing and applying a neural network in parallel is a layer of the transformer. Layers are stacked on top of layers until patience and/or money runs out. A final nonlinear layer returns the probability distribution for \(s_{c+1}|s_{1:c}\). Used generatively, one samples a new token from this distribution, adds it to the end of the sequence, and begins the cycle over again. Because the probability for the next symbol is a function only of the last \(c\) symbols, this is a Markov chain of order \(c\).

The smoothing operation is a weighted average: \(\widetilde{x}_t = \sum_{r=1}^{t}{x_r \frac{\exp{\left\{ (\mathbf{w}_1 x_r) \cdot (\mathbf{w}_2 x_t)\right\}}}{\sum_{u}{\exp{\left\{ (\mathbf{w}_1 x_u) \cdot (\mathbf{w}_2 x_t) \right\}}}}}\), where \(\mathbf{w}_1\) and \(\mathbf{w}_2\) are matrices of learned parameters. This is a kind of kernel smoothing, a.k.a. Nadaraya-Watson smoothing, where the kernel \(K(u,v) = e^{(\mathbf{w}_1 u) \cdot (\mathbf{w}_2 v)}\) is itself learned, by adjusting the two matrices. For historical reasons, dating back to the mid-2010s, this use of kernel smoothing is called attention.
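
A minimal R sketch of this smoothing step, with random, untrained parameter matrices (all sizes invented): each vector is replaced by a kernel-weighted average of itself and its predecessors.

    # "Attention" as learned, causal kernel smoothing (random parameters).
    set.seed(5)
    c.len <- 6; d <- 4
    x <- matrix(rnorm(c.len * d), c.len, d)     # row t = embedded token t
    W1 <- matrix(rnorm(d * d), d, d)
    W2 <- matrix(rnorm(d * d), d, d)
    xs <- matrix(0, c.len, d)                   # the smoothed sequence
    for (t in 1:c.len) {
        scores <- sapply(1:t, function(r) sum((W1 %*% x[r, ]) * (W2 %*% x[t, ])))
        wts <- exp(scores - max(scores)); wts <- wts / sum(wts)  # softmax weights
        xs[t, ] <- colSums(wts * x[1:t, , drop = FALSE])         # weighted average
    }
    xs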

Reading:

  • Bishop, ch. 12, sec. 12.1–12.3
  • Mary Phuong and Marcus Hutter, “Formal Algorithms for Transformers”, arxiv:2207.09238
  • (*) Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov, “Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel”, pp. 4344–4353 in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing [EMNLP 2019], doi:10.18653/v1/D19-1443, arxiv:1908.1177
  • (\(\dagger\)) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need”, arxiv:1706.03762
  • (\(\dagger\dagger\)) E. A. Nadaraya, “On Estimating Regression”, Theory of Probability and Its Applications 9 (1964): 141–142, doi:10.1137/1109020
  • (\(\dagger\dagger\)) Geoffrey S. Watson, “Smooth Regression Analysis”, Sankhyā, Series A, 26 (1964): 359–372, [http://www.jstor.org/stable/25049340]

Thursday, September 25: Transformers, Part II

The previous sketch of the transformer architecture left out some details which seem important in practice: “temperature” in sampling; positional embeddings; multi-headed attention; layer normalization; skip-layer or residual connections. We will fill in some of those details.
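
As a taste, here is what “temperature” amounts to (a toy R sketch; the distribution is invented): divide the log probabilities by a temperature \(T\) before re-normalizing, so \(T < 1\) sharpens the distribution towards its mode and \(T > 1\) flattens it.

    # Temperature in sampling: p is an invented next-token distribution.
    p <- c(0.6, 0.3, 0.1)
    with.temp <- function(p, temp) {
        q <- p^(1 / temp)        # same as exp(log(p) / temp)
        q / sum(q)               # re-normalize to a probability distribution
    }
    rbind(sharpened = with.temp(p, 0.5), original = p, flattened = with.temp(p, 2))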

Reading: Same as September 23.

Assignments:

Tuesday, September 30: Prompting Is Conditioning

A transformer-based large language model is a higher-order Markov chain. Supplying a prompt to the LLM means conditioning the Markov chain on having realized a certain sequence of symbols — the sequence of symbols in the prompt — and then letting the chain evolve. Prompting does not change the dynamics of the chain, but it can drive it to otherwise improbable parts of the state space. The effects of prompting decay over time; we can calculate how fast, using the general theory of Markov chains. In-context learning, where we give an LLM examples and ask it to generalize, is also conditioning. It, too, does not change the process, just drives it to a particular part of the state-space. Jailbreaking, again, is just conditioning, and does not change the process.
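
A small R illustration of the decay (with an invented two-state chain): the rows of \(P^t\) are the distributions \(t\) steps after conditioning on each possible starting state, and they become indistinguishable at a rate set by the second eigenvalue.

    # Conditioning on two different "prompts" (initial states) and watching decay.
    P <- rbind(c(0.9, 0.1),
               c(0.3, 0.7))
    Pt <- diag(2)
    for (t in 1:20) {
        Pt <- Pt %*% P
        if (t %in% c(1, 5, 20)) { cat("t =", t, "\n"); print(Pt) }
    }
    abs(eigen(P)$values[2])     # = 0.6; row differences shrink by this factor per step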

Reading:

  • Murray Shanahan, Kyle McDonell and Laria Reynolds, “Role play with large language models”, Nature 623 (2023): 493–498, doi:10.1038/s41586-023-06647-8
  • (*) Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé and Ievgen Redko, “Large Language Models as Markov Chains”, arxiv:2410.02724
  • (\(\dagger\)) Yotam Wolf, Noam Wies, Yoav Levine, Amnon Shashua, “Fundamental Limitations of Alignment in Large Language Models”, arxiv:2304.11082

Thursday, October 2: Influence Functions and Source Attribution

Influence functions are a way of measuring how much a function of a distribution would be changed by adding or subtracting a little bit of probability somewhere in the sample space. (You can think of them as a fancy kind of derivative.) They can be used to quantify how much the estimate of a probability model will be changed by deleting a particular data point. This in turn gives a way of calculating how important different source texts are to the output of a language model. We will work through this calculation for a simple Markov chain, and then see how we’d do it for an LLM, and what we’d need to know about the training process and the training corpus.
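
As a crude stand-in for the full calculation, here is a leave-one-out R sketch for the Markov-chain MLE (toy data invented for the example); influence functions approximate this sort of deletion effect without actually re-fitting.

    # Effect of deleting one observed transition on the estimated transition matrix.
    x <- sample(c("a", "b"), 500, replace = TRUE)
    N <- table(head(x, -1), tail(x, -1))
    P.hat <- N / rowSums(N)
    N1 <- N; N1[x[1], x[2]] <- N1[x[1], x[2]] - 1   # remove the first transition
    P.hat1 <- N1 / rowSums(N1)
    max(abs(P.hat - P.hat1))                        # small, on the order of 1/n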

Reading:

  • Aaron Fisher and Edward H. Kennedy, “Visually Communicating and Teaching Intuition for Influence Functions”, The American Statistician 75 (2021): 162–172, doi:10.1080/00031305.2020.1717620, arxiv:1810.03260
  • (*) Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamile Lukosiute, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, and Samuel R. Bowman, “Studying Large Language Model Generalization with Influence Functions”, arxiv:2308.03296

Assignments:

  • Homework 5 due (at 6 pm)

Tuesday, October 7: Limits of Transformers

Transformers are higher-order, but finite-order, Markov models. It’s been known for a very long time that many simple-sounding computational tasks just cannot be done with finite-order Markov models. This means that there are many things which it is intrinsically impossible to do with transformers, no matter how big they are, or how much data they are trained on. On the other hand, one can approximate some of these tasks with finite-order Markov models (for various senses of “approximation”). How well do transformers approximate them?

Reading:

  • (*) Kavi Gupta, Kate Sanders and Armando Solar-Lezama, “Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs”, arxiv:2501.02825
  • (\(\dagger\)) Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui and Yejin Choi, “Faith and Fate: Limits of Transformers on Compositionality”, arxiv:2305.18654
  • (\(\dagger\)) Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang, “Transformers Learn Shortcuts to Automata”, arxiv:2210.10749
  • (\(\dagger\)) Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky and Talia Ringer, “Can Transformers Learn to Solve Problems Recursively?”, arxiv:2305.14699
  • (\(\dagger\)) Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky and Talia Ringer, “Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion”, arxiv:2401.12947
  • (\(\dagger\dagger\)) Noam Chomsky, “Three Models for the Description of Language”, IRE Transactions on Information Theory 2 (1956): 113–124, doi:10.1109/TIT.1956.1056813

Thursday, October 9: Beyond Transformers

Alternatives to transformers or catch-up day.

Reading: TBD

Assignments:

Tuesday, October 14 and Thursday, October 16 — FALL BREAK, NO CLASS

Generative Diffusion Models (Weeks 8–11)

Tuesday, October 21: Diffusion Models

Markov chains are discrete-time, discrete-state processes. There are also Markov processes with continuous states and/or continuous time. The Markov property generalizes: states in the future are probabilistically independent of states in the past given the present state. The most important continuous-state, continuous-time Markov processes are diffusion processes, where the expected rate of change of the state is a function of the current state, plus or minus some noise whose variance can also depend on the current state, say \(\frac{dX}{dt} = a(X(t)) + b(X(t))\zeta(t)\) where \(\zeta(t)\) is a noise process which always has mean zero and constant variance. The expected rate-of-change from state \(x\), \(a(x)\), is called the drift term or drift of the process, and the noise magnitude at that state \(b(x)\) is called the diffusion term. When time advances in small, discrete steps, this gives us \(X(t+h) \approx X(t) + h a(X(t)) + \sqrt{h}\, b(X(t)) \zeta(t)\); the noise enters with a factor of \(\sqrt{h}\) rather than \(h\), so that its variance, not its standard deviation, accumulates in proportion to elapsed time. Making this exact and rigorously self-consistent for continuous time (\(h\rightarrow 0\)) is tricky, and leads to stochastic calculus — a beautiful but complicated subject we will only glance at. The upshot is that diffusion processes turn out to be solutions to stochastic differential equations. Just as with Markov chains, diffusion processes can have invariant distributions, which involve balancing the drift against the diffusion.
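
Here is a minimal R sketch of the discrete-time approximation (Euler-Maruyama), for the invented choice \(a(x) = -x\) and \(b(x) = 1\), an Ornstein-Uhlenbeck process whose invariant distribution balances the pull towards 0 against the noise.

    # Euler-Maruyama simulation of dX = -X dt + dW (invented parameters).
    set.seed(6)
    h <- 0.01; n <- 1e5
    x <- numeric(n); x[1] <- 5            # start far from equilibrium
    for (t in 2:n) {
        x[t] <- x[t - 1] + h * (-x[t - 1]) + sqrt(h) * rnorm(1)
    }
    c(mean = mean(tail(x, n / 2)), var = var(tail(x, n / 2)))   # ~ 0 and ~ 1/2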

Reading:

  • Bishop, sec. 20.1
  • (*) Maxim Raginsky, Stochastic Differential Equations: A Systems-Theoretic Approach, chs. 1 and 2. See PDF draft via Prof. Raginsky.
  • (\(\dagger\)) Cosma Rohilla Shalizi, Almost None of the Theory of Stochastic Processes, chs. 16–18. See PDF draft.
  • (\(\dagger\)) Jan von Plato, Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective (Cambridge, England: Cambridge University Press, 1994), doi:10.1017/CBO9780511609107

Thursday, October 23: Generative Diffusion Models Work by Undoing Noise

If we start with a vector representing an image (or a sound or anything else), we can apply a diffusion process to it by gradually adding more and more noise, until the original vector (image) is totally obscured in the accumulated noise, which looks like a sample from the invariant distribution of the diffusion process. We can imagine reversing time, to go back from the noise to the original, structured vector. This reverse-time process is also a diffusion process, with its own (calculable) drift and diffusion terms. Starting with random vectors representing visual noise, the new diffusion ends with highly structured vectors representing comprehensible images. To implement the de-noising diffusion process as a statistical model, we start with a collection of images, add a little noise to each one, and then estimate a function (train a neural network) to try to recover the original image from the noisy one. We then add a little more noise, and estimate a function to recover the only-slightly-noisy image from the somewhat-more-noisy image, and so on. If we now start with a totally random image/vector and apply those de-noising functions over and over again, each little step corresponds to a small time increment for the reverse diffusion process. This gives us a generative diffusion model, producing new images starting from noise.
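
Here is a deliberately crude one-dimensional R cartoon of the scheme (every detail is invented for the illustration; a scatterplot smoother stands in for the neural network, and the reverse step adds back only a token amount of noise, so this is a cartoon, not a calibrated sampler):

    # "Training data" from an invented bimodal distribution, standing in for images.
    set.seed(7)
    x0 <- c(rnorm(500, -2, 0.3), rnorm(500, 2, 0.3))
    K <- 30; beta <- 0.1
    xs <- matrix(0, length(x0), K + 1); xs[, 1] <- x0
    for (k in 1:K) {                      # forward process: shrink and add noise
        xs[, k + 1] <- sqrt(1 - beta) * xs[, k] + sqrt(beta) * rnorm(length(x0))
    }
    # One "denoiser" per step: a smoother estimating E[less noisy | more noisy]
    denoisers <- lapply(1:K, function(k) {
        loess(clean ~ noisy, data = data.frame(clean = xs[, k], noisy = xs[, k + 1]),
              control = loess.control(surface = "direct"))   # allow extrapolation
    })
    z <- rnorm(200)                       # reverse process: start from pure noise
    for (k in K:1) {
        z <- predict(denoisers[[k]], newdata = data.frame(noisy = z)) +
            0.1 * rnorm(length(z))        # a little fresh noise at each step
    }
    hist(z, breaks = 30)                  # the two clusters should (roughly) re-emerge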

Reading:

  • Bishop, sec. 20.2
  • (*) Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, arxiv:1503.03585
  • (*) Maxim Raginsky, Stochastic Differential Equations: A Systems-Theoretic Approach, sections 7.1 and 7.2. See PDF draft via Prof. Raginsky.

Assignments:

Tuesday, October 28: Generative Diffusion as Density Estimation

A generative diffusion model defines a probability distribution over images (or over whatever else the vectors represent). This distribution smears or smooths out the distribution of the training images. We form a mixture of two or more probability distributions by taking a weighted average of their distribution functions (with weights that add up to 1). Equivalently, we randomly pick one of those base distributions and then sample from it. A mixture model is a probability model where we approximate a complicated distribution by a mixture of multiple simple distributions, such as Gaussians, the mixture components. Kernel density estimates are extreme forms of mixture models, where there is one mixture component per data point, all with equal weight and all with the same shape (the kernel). This corresponds to first picking a random data point, and then perturbing it according to the distribution of the kernel (for instance, a Gaussian). Generative diffusion models are extremely close to being kernel density estimates, with the effective kernel depending on the de-noising function.
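
A short R illustration of the “pick a point, then perturb it” view of kernel density estimation (using a standard built-in data set):

    # Drawing from a kernel density estimate = resample a data point, add kernel noise.
    x <- faithful$eruptions                 # built-in data set of eruption durations
    h <- bw.nrd0(x)                         # R's default bandwidth rule
    new.draws <- sample(x, 1000, replace = TRUE) + rnorm(1000, sd = h)
    # Up to Monte Carlo error, new.draws has exactly the density that density(x)
    # estimates with a Gaussian kernel and bandwidth h.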

Reading:

  • Bishop, sec. 20.3
  • (*) Maxim Raginsky, Stochastic Differential Equations: A Systems-Theoretic Approach, section 7.3. See PDF draft via Prof. Raginsky.
  • (\(\dagger\dagger\)) Giulio Biroli, Tony Bonnaire, Valentin de Bortoli, and Marc Mézard, “Dynamical Regimes of Diffusion Models”, Nature Communications 15 (2024): 9957, doi:10.1038/s41467-024-54281-3, arxiv:2404.18491

Assignments:

  • Preliminary project descriptions/choices due (at 6 pm)

Thursday, October 30: NO CLASS

I’ll be out of town, presenting at a workshop.

Tuesday, November 4 — ELECTION DAY, NO CLASS

Assignments:

  • Final project descriptions due

Thursday, November 6: Information Theory and Density Estimation

Information-theoretic methods for diffusion density estimation, or mixture models and KDEs, continued (depending on how far we got on Oct. 28).

Reading:

  • (*) Xianghao Kong, Rob Brekelmans, and Greg Ver Steeg, “Information-Theoretic Diffusion”, ICLR 2023

Assignments:

  • Homework 6 due (at 6 pm)
  • Homework 7: assigned

Tuesday, November 11: Language and Image

Combining LLMs and generative diffusion models to get a joint distribution of texts and images. Sampling from the distribution of images conditional on a given piece of text. Sampling from the distribution of texts conditional on a given image.

Reading:

  • Bishop, sec. 20.4

Interpretations (Weeks 11–13)

Thursday, November 13: Hallucination or Confabulation

Prompting as conditioning, again. Why generative models are not doing anything differently when they “hallucinate” or “confabulate” than when they give correct answers.

Reading:

  • Main reading TBD
  • (\(\dagger\)) Adam Sobieszek and Tadeusz Price, “Playing Games with AIs: The Limits of GPT-3 and Similar Large Language Models”, Minds and Machines 32 (2022): 341–364, doi:10.1007/s11023-022-09602-0

Assignments:

  • Homework 7 due (at 6 pm)
  • Homework 8: assigned, due on FRIDAY, 21 November