Data over Space and Time (36-467/667)

Fall 2020

Cosma Shalizi
Tuesdays and Thursdays, 9:50--11:10, online only

This course is an introduction to the opportunities and challenges of analyzing data from processes unfolding over space and time. It will cover basic descriptive statistics for spatial and temporal patterns; linear methods for interpolating, extrapolating, and smoothing spatio-temporal data; basic nonlinear modeling; and statistical inference with dependent observations. Class work will combine practical exercises in R, some mathematics of the underlying theory, and case studies analyzing real data from various fields (history, meteorology, climatology, ecology, demography, etc.). Depending on available time and class interest, additional topics may include: statistics of Markov and hidden-Markov (state-space) models; statistics of point processes; simulation and simulation-based inference; agent-based modeling; dynamical systems theory.

This webpage will serve as the class syllabus. Course materials (notes, homework assignments, etc.) will be linked to from here, as available.

Undergraduates must register for the course as 36-467; graduate students must register for it as 36-667. If the system does let you register for the wrong section, you'll be dropped from the roster.

Pre-requisite: For undergraduates taking the course as 36-467, 36-401 with a grade of C or higher. For graduate students taking the course as 36-667, there are formally no pre-requisites, but you really will need to know how to do linear regression, both in theory (as taught in 401) and on real data using R (as also taught in 401), and the mathematics that forms 401's pre-requisites (linear algebra, calculus in multiple variables, probability, mathematical statistics). If you're not sure whether you're ready for 467, ask me!

Instructors

Professor	Dr. Cosma Shalizi	cshalizi [at] cmu [dot] edu
Teaching Assistants	Mr. Raghav Bansal	not to be bothered by e-mail
	Mr. Mateo Dulce Rubio

Goals and Learning Outcomes

(Accreditation officials look here)

The goal of this class is to train you in using statistical models to analyze interdependent data spread out over space, time, or both, using the models as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory of statistical inference for independent data taught in 36-226, and complement the theory and applications of the linear model, introduced in 36-401. After taking the class, when you're faced with a new temporal, spatial, or spatio-temporal data-analysis problem, you should be able to (1) describe the statistical challenges the problem presents, (2) select appropriate methods, (3) use statistical software to implement those methods, (4) critically evaluate the resulting statistical models, and (5) communicate the results of your analyses to collaborators and to non-statisticians.

Topics Covered

Exploratory data analysis for temporal and spatial data: Graphics; levels vs. rates, stocks vs. flows; smoothing; trends vs. fluctuations, detrending; auto- and cross- covariances; nonlinear association measures
Optimal linear prediction and its uses: Theory of optimal linear prediction; prediction for interpolation, extrapolation, and noise removal; "Wiener filter"; "kriging"; estimating optimal linear predictors
Inference with dependent data: Statistical estimation with dependent data; ergodic properties; the bootstrap; simulation-based inference
Generative models: Linear autoregressive models for time series and spatial processes; Markov chains and Markov processes; compartment (especially epidemic) models; state-space or hidden-Markov models; nonparametric, nonlinear autoregressions; Markov random fields
Possible advanced or supplementary topics: Longitudinal/panel data analysis, and regressions with dependent observations; Fourier methods; point processes; nonlinear dynamical systems theory and chaos; cellular automata and interacting particle systems; agent-based modeling; causal inference across time series; stochastic differential equations; optimal nonlinear prediction.

This class will not give much coverage to ARIMA models of time series, a subject treated extensively in 36-618.

Course Mechanics

Lectures and Remote-Only Instruction

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. You will usually find the lectures more rewarding if you do the readings beforehand, rather than after (or during). Since this is an online-only class this semester, lectures will be held via Zoom; the link for each session will be on Canvas. I know that the class time is late at night or early in the morning for many of you; I nonetheless urge you to come to class and participate.

No Recordings: I will not be recording lectures. This is because the value of class meetings lies precisely in your chance to ask questions, discuss, and generally interact. (Otherwise, you could just read a book.) Recordings interfere with this in two ways:

They tempt you to skip class and/or to zone out and/or try to multi-task during it. (Nobody is really any good at multi-tasking.) Even if you do watch the recording later, you will not learn as much from it as if you had attended in the first place.
People are understandably reluctant to participate when they know they're being recorded. (It's only too easy to manipulate recordings to make anyone seem dumb and/or obnoxious.) Maybe this doesn't bother you; it doesn't bother me, much, because I'm protected by academic freedom and by tenure, but a good proportion of your classmates won't participate if they're being recorded, and that diminishes the value of the class for everyone.

Recording someone without their permission is illegal in many places, and more importantly is unethical everywhere, so don't make your own recordings of the class.

(Taking notes during class is fine and I strongly encourage it; taking notes forces you to think about what you are hearing and how to organize it, which helps you understand and remember the content.)

Textbooks

The only required textbook is

Gidon Eshel, Spatiotemporal Data Analysis (Princeton, New Jersey: Princeton University Press, 2011, ISBN 978-0-691-12891-7, available on JSTOR).

The CMU library has electronic access to the full text, in PDF, through the JSTOR service. (You will need to either be on campus, or logged in to the university library.) Links to individual chapters will be posted as appropriate.

In addition, we will assign some sections from

Peter Guttorp, Stochastic Modeling of Scientific Data (Boca Raton, Florida: Chapman & Hall / CRC Press, 1995. ISBN 978-0-412-99281-0).

Because this book is expensive, the library doesn't have electronic access, and a lot of it is about (interesting and important) topics outside the scope of the class, it is not required. Instead, scans of the appropriate sections will be distributed via Canvas. (I am working on getting electronic access through the library, and will update students if I succeed.)

You will also be doing a lot of computational work in R, so

Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7)

is recommended. R's help files answer "What does command X do?" questions. This book is organized to answer "What commands do I use to do Y?" questions.

Assignments

There are three reasons you will get assignments in this course. In order of decreasing importance:

Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows what you only thought you understood.
Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be two kinds of assignment in this course.

After-class comprehension questions and exercises: Following every lecture, there will be a brief set of questions about the material covered in lecture. Sometimes these will be about specific points in the lecture, sometimes about specific aspects of the reading assigned to go with the lecture. These will be done on Canvas and will be due the day after each lecture. These should take no more than 10 minutes, but will be untimed (so no accommodations for extra time are necessary). If the questions ask you to do any math (and not all of them will!), a scan or photograph of hand-written math is OK, so long as the picture is clearly legible. (Black ink or dark pencil on white paper helps.)
Homework: Most weeks will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve statistical theory, analyzing real data sets on the computer, and communicating the results. All homework will be submitted electronically through Gradescope/Canvas. Most weeks, homework will be due at 6:00 pm on Thursdays (Pittsburgh time). There will be a few weeks, clearly noted on the syllabus and on the assignments, when this won't be the case. When this results in less than seven days between an assignment's due date and the previous due date, the homework will be shortened. There are specific formatting requirements for homework --- see below.

Time Expectations

You should expect to spend 5--7 hours on assignments every week, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

After-class questions: 10%. All sets of questions will have equal weight. The lowest 4 will be dropped, no questions asked.
Homework: 90%. All homeworks will have equal weight. Your lowest 3 homework grades will be dropped, no questions asked. If you turn in all homework assignments on time, for a grade of at least 60% (each), your lowest four homework grades will be dropped. Late homework will not be accepted for any reason.

Grade boundaries will be as follows:

A [90, 100]

B [80, 90)

C [70, 80)

D [60, 70)

R < 60

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think that a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments. As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Office Hours

For this semester, Zoom office hours will be times when I am available to answer questions in a Zoom chat. Piazza office hours will be times when an instructor will be logged in and answering questions on Piazza --- so if you want or need to have a back-and-forth, someone will be available.

Instructor	Day	Time (Pittsburgh)	Venue
Mr. Dulce Rubio	Mondays	9:30--10:30 am	Piazza
Mr. Bansal	Tuesdays	2:00--3:00 pm	Piazza
Prof. Shalizi	Wednesdays	2:00--3:00 pm	Zoom
Prof. Shalizi	Thursdays	2:00--3:00 pm	Piazza

If you cannot make the regular office hours, or have concerns you'd rather discuss privately (e.g., grades), please e-mail me about making an appointment to meet by Zoom.

R, R Markdown, and Reproducibility

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). No other form of computational work will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

(There are plenty of other perfectly good computing systems for data analysis --- I learned to do it using Fortran and C, so help me --- but a uniform language is a lot easier to grade, and statisticians have self-organized on R as a standard.)

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code are not acceptable; your document must be humanly readable.

All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code. For help on using R Markdown, see "Using R Markdown for Class Reports".

Format Requirements for Homework

For each assignment, you should submit a single PDF file, which is the "knitted", humanly-readable document generated by your R Markdown source file. That source file should integrate all of your text, and the R code to generate all of your numerical results, figures and tables.

Some problems in the homework will require you to do math. R Markdown provides a simple but powerful system for type-setting math. (It's based on the LaTeX document-preparation system widely used in the sciences.) If you can't get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by the final homework. For help on this aspect of using R Markdown, see "Using R Markdown for Class Reports".
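
As a minimal sketch of what typeset math looks like in an R Markdown source file: inline math goes between single dollar signs, and displayed equations go between double dollar signs. The formulas below are only illustrations, and are not taken from any assignment.

```latex
The sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

$$
\hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}
$$
```

Knitting the document renders both as formatted mathematics in the PDF.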

Every week, I will randomly select some students and ask them to send me their R Markdown files. You will lose points if your R Markdown file does not, in fact, generate your knitted file (making the obvious allowances for random numbers, etc.). You should expect to be picked for this about once in the semester, but since it will be random sampling with replacement, you may be asked for your R Markdown more than once.

Canvas and Piazza

Homework will be submitted electronically through Gradescope/Canvas. Canvas will also be used for the after-class questions, as a calendar showing all assignments and their due-dates, to distribute some readings, and as the official gradebook.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. Anonymity will go away for everyone if it is abused. During Piazza office hours, someone will be online to respond to questions (and follow-ups) in real time. You are welcome to post at any time, but outside of normal working hours you should expect that the instructors have lives.

Materials from Previous Versions of the Course

Public materials (lecture slides and notes, homework assignments and data files but not solutions, etc.) from other semesters I've taught this course can be found here. You're welcome to look at them, but lectures and assignments will change.

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbooks and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss homework with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students, the instructors, or your teachers in other current classes at CMU, until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course, or in other courses, in previous semesters is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will never be a penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how to rectify matters. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

On the first day of class, every student will receive a written copy of the university's policy on academic integrity, a written copy of these course policies, and a "homework 0" on the content of these policies. This assignment will not factor into your grade, but you must complete it before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate with me.

Inclusion and Respectful Participation

The university is a community of scholars, that is, of people seeking knowledge. All of our accumulated knowledge has to be re-learned by every new generation of scholars, and re-tested, which requires debate and discussion. Everyone enrolled in the course has a right to participate in the class discussions. This doesn't mean that everything everyone says is equally correct or equally important, but does mean that everyone needs to be treated with respect as persons, and criticism and debate should be directed at ideas and not at people. Don't dismiss (or enhance) anyone in the course because of where they come from, and don't use your participation in the class as a way of shutting up others. (Don't be rude, and don't go looking for things to be offended by.) While methods for spatio-temporal data analysis don't usually lead to heated debate, some of the subjects we'll be applying them to might. If someone else is saying something you think is really wrong-headed, and you think it's important to correct it, address why it doesn't make sense, and listen if they give a counter-argument.

The classroom is not a democracy; as the teacher, I have the right and the responsibility to guide the discussion in what I judge are productive directions. This may include shutting down discussions which are not helping us learn about statistics, even if those discussions are important to have elsewhere. I will do my best to guide the course in a way which respects everyone's dignity as a human being and as a member of the university.

Schedule

Lecture slides and/or notes will be linked to here after each class. This page will also give links to assignments and data files for homeworks (though they will also be linked to on Canvas).

Possible changes: Topics, and the order in which we cover topics, may change. The material up to the beginning of November is unlikely to alter, but the more advanced, and more miscellaneous, topics after that are less set, and will depend on students' expressed interests, what I can find good examples for, whether we need to make up for disruptions earlier in the semester, etc. I will give as much warning of any changes as I can.

Readings: Please do these before coming to class, if you possibly can --- things will make more sense if you do! (Obviously this doesn't apply to the first class.) Readings marked with a star (*) are optional, either because they're more advanced, or longer, or tangential. Readings marked with multiple stars are especially optional. There are currently some lectures for which I haven't fixed on readings, marked with "TBD"; these will be made specific at least three days before class.

Lecture 1 (Tuesday, 1 September): Introduction to the course

Welcome; course mechanics; data distributed over space and time; goals and challenges.
Readings:
- CMU's policy on academic integrity
- excerpt from Turabian's Manual for Writers (on Canvas)
- Guttorp, Introduction and Chapter 1 (on Canvas)
- Eshel, Chapter 7
Slides (R Markdown file used to make the slides)
Homework:
- Homework 0 assigned (on Canvas only)
- Homework 1: Assignment, kyoto-2020.csv data file

Lecture 2 (Thursday, 3 September): Graphics and Exploratory Analyses

Kinds of data and basic EDA by way of pictures. Data for regions (areas, intervals, periods) vs. data at points in space and time. Plotting over time: basic ideas and pitfalls. Tricks for plotting over time: index numbers (relative magnitude), differencing (rate of change over time) and summing (accumulation over time), calendar time vs. time relative to some event. Scatterplots of successive values to get at dynamics. Relationships between two variables: why scatterplots are better than plots with two vertical axes. Plotting over space: maps; types of maps; a little bit about map projections. Relationships between variables over space (scatterplots are better again). Re-doing all our usual EDA (boxplots, histograms, tables, etc.) by region in space and/or time to get at variability.
Reading:
- Handout on Stocks, flows, growth rates, etc. (.Rmd)
- Kieran Healy, "America's Ur-Choropleths", 12 June 2015
- Optional readings
  - (*) Kieran Healy, Data Visualization: A Practical Introduction (Princeton: Princeton University Press, 2019), especially Chapter 7, "Draw Maps" [Prof. Healy's website for the book includes the full text in draft form and code for reproducing examples]
  - (**) Whitney Battle-Baptiste and Britt Rusert (eds.), W. E. B. Du Bois's Data Portraits: Visualizing Black America: The Color Line at the Turn of the Twentieth Century (New York: Princeton Architectural Press, 2018) [About half of Du Bois's plots depict change over time, variation over space, or both. If you read both this and Healy's book, ask yourself how Healy would re-do Du Bois's plots.]
  - (**) Judy L. Klein, Statistical Visions in Time: A History of Time Series Analysis, 1662--1938, especially Part I [This is, among other things, an in-depth look at how different techniques for plotting time series emerged from "commercial arithmetic"]
Slides (with notes on a few points that came up during lecture), R Markdown file used to make the slides
Homework:
- Homework 0 due

Lecture 3 (Tuesday, 8 September): Smoothing, Trends, Detrending I

Smoothing by local averaging. The idea of a trend, and de-trending. Smoothing as EDA. Some of the math of smoothing: the influence (or "hat") matrix, degrees of freedom. Expanding in eigenvectors.
Readings:
- Eshel, Chapter 7
- Guttorp, Introduction and Chapter 1
Slides (.Rmd source file)
Notes on Trends, Smoothing and Detrending (.Rmd source file)
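
To give a rough sense of the influence-matrix idea from this lecture, here is a sketch for a generic linear smoother. The notation (w for the influence matrix, x for the vector of observations) is assumed for this illustration, and may differ from the slides and notes.

```latex
% A linear smoother's fitted values are linear in the observations:
\[ \hat{\mathbf{x}} = \mathbf{w} \mathbf{x},
   \qquad \hat{x}_i = \sum_{j=1}^{n} w_{ij} x_j \]
% Example: a 3-point moving average (at interior points) has weights
\[ w_{ij} = \begin{cases} 1/3 & \text{if } |i - j| \leq 1 \\
                          0   & \text{otherwise} \end{cases} \]
% The (effective) degrees of freedom of the smoother are the trace of w:
\[ \mathrm{df} = \operatorname{tr} \mathbf{w} = \sum_{i=1}^{n} w_{ii} \]
```

For local averages like this one, widening the window shrinks each diagonal entry of w, so heavier smoothing uses fewer degrees of freedom.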

Lecture 4 (Thursday, 10 September): Smoothing, Trends, Detrending II

The influence matrix as the source of all knowledge. Residuals after de-trending as estimates of the fluctuations. The Yule-Slutsky effect. Picking how much to smooth by cross-validation. Special considerations for ratios (Kafadar).
Reading:
- Eshel, chapter 8
- Optional reading:
  - (*) Karen Kafadar, "Smoothing Geographical Data, Particularly Rates of Disease", Statistics in Medicine 15 (1996): 2539--2560
Slides (.Rmd source)
Homework:
- Homework 1 due
- Homework 2: assignment, helper code for problem 1

Lecture 5 (Tuesday, 15 September): Principal Components I

The goal of principal components: finding simpler, linear structure in complicated, high-dimensional data. Math of principal components: linear approximation -> preserving variance -> eigenproblem. Reminders from linear algebra about eigenproblems. Mathematical solution to PCA. How to do PCA in R.
Reading: Eshel, chapter 4, and skim chapter 5
Slides (.Rmd)
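
As a sketch of the chain "linear approximation -> preserving variance -> eigenproblem" for the first component: the symbols below (v for the sample covariance matrix of the centered data, w for a unit vector) are notation assumed here for illustration.

```latex
% Projecting the centered data onto a unit vector w, the variance of the projections is
\[ \vec{w}^T \mathbf{v} \vec{w} \]
% Maximizing this subject to w^T w = 1, with a Lagrange multiplier lambda, gives
\[ \mathbf{v} \vec{w} = \lambda \vec{w} \]
% so w must be an eigenvector of v, and the variance it captures is
\[ \vec{w}^T \mathbf{v} \vec{w} = \lambda \]
```

The first principal component is therefore the eigenvector with the largest eigenvalue. In R, prcomp() carries out the computation.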

Lecture 6 (Thursday, 17 September): Principal Components II

Brief recap on PCA. Applying PCA to spatial data. Applying PCA to multiple time series. Applying PCA to spatio-temporal data. Interpreting PCA results. Why PCA can be good exploratory analysis, but is not statistical inference.
Reading: Eshel, chapter 11, sections 11.1--11.7 and 11.9--11.10 (i.e., skipping 11.8 and 11.11--11.12)
Slides (.Rmd)
Homework:
- Homework 2 due
- Homework 3: Assignment (with links to the data files)

Lecture 7 (Tuesday, 22 September): Optimal Linear Prediction

Mathematics of prediction. Mathematics of optimal linear prediction, in any context whatsoever. Ordinary least squares as an estimator of the optimal linear predictor. Why we need the covariance functions.
Reading: Eshel, chapter 9, sections 9.1--9.3
Slides (.Rmd)
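
A sketch of the optimal linear predictor, starting with a single predictor variable. Here Y is the quantity to be predicted and X is what we observe; these names are assumptions for the illustration.

```latex
% Minimize the expected squared error over all linear predictors b_0 + b_1 X:
\[ (\beta_0, \beta_1) = \operatorname*{argmin}_{b_0, b_1}
   \mathbb{E}\left[ (Y - b_0 - b_1 X)^2 \right] \]
% Setting the derivatives to zero gives
\[ \beta_1 = \frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]},
   \qquad \beta_0 = \mathbb{E}[Y] - \beta_1 \mathbb{E}[X] \]
% With a vector of predictors, the slopes become
\[ \vec{\beta} = \mathrm{Var}[\vec{X}]^{-1} \mathrm{Cov}[\vec{X}, Y] \]
```

Nothing here assumes that the true relationship is linear or that the noise is Gaussian. The predictor depends only on means, variances and covariances, which is why estimating covariance functions matters so much.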

Lecture 8 (Thursday, 24 September): Linear Interpolation and Extrapolation of Time Series

Applying the linear-predictor idea to time series: interpolating between observations; extrapolating into the future (or past). The concept of stationarity. Auto- and cross- covariance. Covariance functions as EDA. Basic covariance estimation in R. Removing trends; stationary fluctuations after detrending. Historical notes: Wiener and Kolmogorov.
Reading: Eshel, chapter 9, section 9.5 (skipping 9.5.3 and 9.5.4)
Slides (.Rmd, with comments on the sample code provided)
Homework:
- Homework 3 due
- Homework 4: assignment

Lecture 9 (Tuesday, 29 September): Optimal Linear Prediction for Spatial and Spatio-Temporal Data

Applying the linear-predictor idea to data spread over space or over space and time ("kriging"). The importance of estimating covariance between spatial locations. Assumptions restricting the form of the covariance and so enabling estimation: stationarity, isotropy, separability. Estimating parametric covariance functions. Examples.
Reading: No required reading
Slides (.Rmd)

Lecture 10 (Thursday, 1 October): Separating Signal and Noise with Linear Methods

Observational noise: using the linear-predictor idea to remove observational noise, a.k.a. "the Wiener filter". The mysteriously-named "nugget effect" (accounting for measurement noise that's not auto-correlated). Periodicity: noticing periodicity from time series; from autocorrelation functions. Extracting periodic components with a known period by averaging. "Climate" and "anomaly". Seasonal adjustment of time series.
Reading:
- No required reading
- Optional reading:
  - (****) Norbert Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time-Series: with Engineering Applications (Cambridge, Massachusetts: The Technology Press, 1949 [but originally published as a classified technical report, National Defense Research Committee, 1942])
Slides (.Rmd)
Homework:
- Homework 4 due
- Homework 5: assignment

Lecture 11 (Tuesday, 6 October): Linear Generative Models for Time Series

Linear generative models for random sequences: autoregressions. Deterministic dynamical systems; more fun with eigenvalues and eigenvectors. Stochastic aspects. Vector auto-regressions.
Reading:
- Eshel, sections 9.5 and 9.7
- Optional reading:
  - (**) Judy L. Klein, Statistical Visions in Time: A History of Time Series Analysis, 1662--1938, especially Part II [This studies how linear regression, a method developed to adjust for differences across a population at a single time, came to be used to predict changes over time in a single quantity, which sounds weird when you put it that way]
Slides (.Rmd)
Handout: AR(p) vs. higher-dimensional VAR(1)

Lecture 12 (Thursday, 8 October): Linear Generative Models for Spatial and Spatio-Temporal Data

Simultaneous vs. conditional autoregressions for random fields. The "Gibbs sampler" trick. Autoregressions for spatio-temporal processes.
Reading: None
Slides (.Rmd)
Homework:
- Homework 5 due
- Homework 6: assignment

Lecture 13 (Tuesday, 13 October): Statistical Inference with Dependent Data I: Really Understanding Inference with Independent Data

Reminder: why maximum likelihood and Gaussian approximations work for IID data. Consistency from convergence (law of large numbers); Gaussian approximation from fluctuations (central limit theorem). The "sandwich covariance" for general estimators. Looking ahead at how these ideas carry over to dependent data.
Reading: Guttorp, Appendix A
Slides (.Rmd)

Lecture 14 (Thursday, 15 October): Inference with Dependent Data II

Ergodic theory, a.k.a. laws of large numbers for dependent data. Basic ergodic theory for stochastic processes. Correlation times and effective sample size. Inference with autoregressions. Gestures at more advanced ergodic theory. Likelihood-based inference for dependent data.
Slides (.Rmd)
Homework:
- Homework 6 due
- Homework 7: Assignment. Note: Because I messed up posting this on time, this is now due at the same time as HW 8, and will be extra credit, replacing your lowest grade on the other homeworks.

Lecture 15 (Tuesday, 20 October): Simulation

General idea of simulating a statistical model. The "Monte Carlo method": using simulation to compute probabilities, expected values, etc.
Slides (.Rmd)
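
A sketch of the Monte Carlo idea, assuming we can draw independent samples X_1, ..., X_n from the model, and that f(X) has finite variance (both assumptions made for this illustration).

```latex
% Expected values are approximated by sample averages over simulations:
\[ \mathbb{E}[f(X)] \approx \frac{1}{n} \sum_{i=1}^{n} f(X_i) \]
% Probabilities are the special case where f is an indicator function:
\[ \Pr(X \in A) \approx \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \in A\} \]
% The standard error of the approximation shrinks like 1/sqrt(n):
\[ \sqrt{\mathrm{Var}[f(X)] / n} \]
```

Halving the error thus takes roughly four times as many simulation runs.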

Lecture 16 (Thursday, 22 October): Simulation for Inference I: The Bootstrap

The bootstrap principle: approximating the sampling distribution by simulating a good estimate of the data-generating distribution. Uncertainty via model-based bootstraps. Uncertainty via resampling bootstraps for time series and for spatial processes. Related ideas: "surrogate data" tests of null hypotheses; ensemble forecasts.
Reading:
- CRS, "The Bootstrap", American Scientist 98:3 (May-June 2010), 186--190
- Optional readings:
  - (*) Bradley Efron, "Bootstrap Methods: Another Look at the Jackknife", Annals of Statistics 7 (1979): 1--26
  - (**) Peter Bühlmann, "Bootstraps for Time Series", Statistical Science 17 (2002): 52--72
  - (***) S. N. Lahiri, Resampling Methods for Dependent Data (Berlin: Springer-Verlag, 2003)
  - (***) Elizaveta Levina and Peter J. Bickel, "Texture synthesis and nonparametric resampling of random fields", Annals of Statistics 34 (2006): 1751--1773
Slides (.Rmd)
Homework:
- ~~Homework 7 due~~
- Homework 8: assignment, lv.R
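
One way to resample a time series while keeping its short-range dependence is the moving-block bootstrap. The sketch below (Python, standard library only, not the course's R code) is just one such scheme, and the block length is a tuning choice assumed here for illustration. The function names are hypothetical.

```python
import random
import statistics

def moving_block_resample(series, block_length, rng):
    """Build one surrogate series by concatenating randomly chosen blocks.

    Blocks are contiguous stretches of the original series, so dependence
    within each block is preserved (dependence across block joins is not).
    """
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    last_start = n - block_length
    resampled = []
    while len(resampled) < n:
        start = rng.randint(0, last_start)   # randint is inclusive at both ends
        resampled.extend(series[start:start + block_length])
    return resampled[:n]

def block_bootstrap_se(series, statistic, block_length, reps=1000, seed=467):
    """Bootstrap standard error of statistic(series)."""
    rng = random.Random(seed)
    replicates = [statistic(moving_block_resample(series, block_length, rng))
                  for _ in range(reps)]
    return statistics.stdev(replicates)

if __name__ == "__main__":
    # Toy AR(1) series as stand-in data; any list of numbers would do.
    rng = random.Random(1)
    x, series = 0.0, []
    for _ in range(400):
        x = 0.7 * x + rng.gauss(0.0, 1.0)
        series.append(x)
    print("block bootstrap SE:", block_bootstrap_se(series, statistics.fmean, 20))
    print("IID bootstrap SE:  ", block_bootstrap_se(series, statistics.fmean, 1))
```

With `block_length=1` this reduces to the ordinary resampling bootstrap for independent data, which will typically understate the uncertainty in the mean of a positively autocorrelated series.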

Lecture 17 (Tuesday, 27 October): Simulation for Inference II: Matching Simulations to Data

Reminder about estimation in general. The method of moments. The method of simulated moments. "Indirect" inference: matching the parameters estimated from an "auxiliary" or "working" model. Some asymptotics.
Optional reading:
- (*) Andrew Gelman and Cosma Rohilla Shalizi, "Philosophy and the Practice of Bayesian Statistics", British Journal of Mathematical and Statistical Psychology 66 (2013): 8--38, arxiv:1006.3868
- (**) Christian Gouriéroux, Alain Monfort and E. Renault, "Indirect Inference", Journal of Applied Econometrics 8 (1993): S85--S118 [JSTOR]
- (*) Brian D. Ripley, Spatial Statistics (New York: Wiley, 1981)
Slides (.Rmd)

Lecture 18 (Thursday, 29 October): Markov Chains I

Markov chains and the Markov property. Examples. Basic properties of Markov chains; special kinds of chain. Yet more fun with eigenvalues and eigenvectors. How one trajectory evolves vs. how a population evolves. Ergodicity and central limit theorems. Higher-order Markov chains and related models. Markov chain Monte Carlo.
Reading:
- Guttorp, chapter 2, sections 2.1--2.6 (inclusive)
- Handouts on "Monte Carlo and Markov Chains" (especially Section 2), and "Markov Chain Monte Carlo" from stat. computing 2013
Slides (.Rmd)
Homework:
- Homework 7 due
- Homework 8 due
- Homework 9: assignment, R file with new functions
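
The contrast between "how one trajectory evolves" and "how a population evolves" can be sketched as follows (in Python rather than the course's R). The population's distribution is pushed forward by the transition matrix, and for an irreducible, aperiodic chain it converges to the left eigenvector with eigenvalue 1. A single long trajectory visits states with those same long-run frequencies. The two-state matrix is a made-up example and the function names are illustrative.

```python
import random

def evolve_distribution(dist, P):
    """One step of the population-level dynamics: new_dist = dist P."""
    k = len(P)
    return [sum(dist[i] * P[i][j] for i in range(k)) for j in range(k)]

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Power iteration for the left eigenvector of P with eigenvalue 1.

    Assumes the chain is irreducible and aperiodic, so the iteration
    converges to the unique invariant distribution.
    """
    k = len(P)
    dist = [1.0 / k] * k
    for _ in range(max_iter):
        new = evolve_distribution(dist, P)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    raise RuntimeError("power iteration did not converge")

def simulate_chain(P, start, steps, rng):
    """One trajectory: row P[state] gives the next-state probabilities."""
    states = list(range(len(P)))
    path, state = [start], start
    for _ in range(steps):
        state = rng.choices(states, weights=P[state])[0]
        path.append(state)
    return path

if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.5, 0.5]]           # hypothetical two-state chain
    print("invariant distribution:", stationary_distribution(P))
    path = simulate_chain(P, 0, 20_000, random.Random(467))
    print("fraction of time in state 0:", path.count(0) / len(path))
```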

Election Day (Tuesday, 3 November): NO CLASS

Lecture 19 (Thursday, 5 November): Markov Chains II

Likelihood inference for individual trajectories. Least-squares inference for population data. Conditional density estimates for continuous spaces. Model-checking.
Reading:
- Guttorp, chapter 2, sections 2.7--2.9 (inclusive)
- Maximum Likelihood Estimation for Markov Chains handout from the (no longer taught) 36-462, 2009 (uses slightly different notation from what we're using)
Slides (.Rmd)
Homework:
- Homework 9 due
- Homework 10: assignment, helper code, dicty-seq-1.dat, dicty-seq-2.dat (WARNING: the data files are large!)
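
For a single observed trajectory of a first-order chain on a finite state space, the maximum likelihood estimate of each transition probability is the observed fraction of transitions out of state i that go to state j. A minimal sketch in Python (not the course's R code), assuming states are coded as integers 0, 1, ..., and with hypothetical function names:

```python
from collections import Counter
import math

def transition_mle(path, n_states):
    """Maximum likelihood estimate of a first-order transition matrix.

    The MLE of P[i][j] is (number of i -> j transitions) divided by
    (number of transitions out of i). States with no observed outgoing
    transitions get a row of None, since the data say nothing about them.
    """
    counts = Counter(zip(path[:-1], path[1:]))
    P_hat = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        if row_total == 0:
            P_hat.append(None)
        else:
            P_hat.append([counts[(i, j)] / row_total for j in range(n_states)])
    return P_hat

def log_likelihood(path, P):
    """Log-likelihood of a trajectory given P, conditioning on the first state."""
    return sum(math.log(P[i][j]) for i, j in zip(path[:-1], path[1:]))

if __name__ == "__main__":
    path = [0, 0, 1, 0, 1, 1, 0, 0]          # toy trajectory
    P_hat = transition_mle(path, 2)
    print("estimated transition matrix:", P_hat)
    print("maximized log-likelihood:", log_likelihood(path, P_hat))
```

Comparing `log_likelihood` at the estimate with its value at any other transition matrix is a quick sanity check that the counts-based formula really is the maximizer.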

Lecture 20 (Tuesday, 10 November): Epidemic Models

The basic "susceptible-infectious-removed" (SIR) epidemic model. The probability model and its deterministic limit. The idea of the "basic reproductive number" R0 and how it relates to the rates of transmission and removal. Why diseases do not necessarily evolve to be less lethal to their hosts. The epidemic threshold when R0=1. Complications: gaps between being infected and becoming infectious; the possibility of being infectious without showing symptoms; re-infection. Epidemics in social networks, and how network structure affects the epidemic threshold; why high-degree people tend to be among the first infected, and disease-control strategies based on "destroying the hubs". Statistical issues in connecting epidemic models to data.
Reading:
- Zeynep Tufekci, "Don’t Believe the COVID-19 Models: That’s not what they’re for", The Atlantic 2 April 2020
- Optional readings:
  - (**) Mark E. J. Newman, "The spread of epidemic disease on networks", Physical Review E 66 (2002): 016128, arxiv:cond-mat/0205009
  - (*) Tom Britton, "Epidemic models on social networks -- with inference", arxiv:1908.05517
  - (**) Romualdo Pastor-Satorras and Alessandro Vespignani, "Immunization of complex networks", Physical Review E 65 (2002): 036104, arxiv:cond-mat/0107066
  - (*) Lisa Sattenspiel (with contributions by Alun Lloyd), The Geographic Spread of Infectious Diseases: Models and Applications (Princeton, New Jersey: Princeton University Press, 2009) [Full text access via JSTOR]
Slides (.Rmd)
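
The deterministic limit of the SIR model and the threshold at R0 = 1 can be illustrated with a short sketch (Python rather than the course's R). It assumes the standard equations in population fractions, with transmission rate `beta` and removal rate `gamma`, so that R0 = beta/gamma. The crude Euler integration and the parameter values are choices made only for this example.

```python
def sir_deterministic(beta, gamma, s0, i0, days, dt=0.01):
    """Euler integration of the deterministic SIR equations (population fractions):
        dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I
    Returns lists of S, I, R sampled once per day, including day 0.
    """
    s, i, r = s0, i0, 1.0 - s0 - i0
    steps_per_day = round(1 / dt)
    S, I, R = [s], [i], [r]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        S.append(s)
        I.append(i)
        R.append(r)
    return S, I, R

if __name__ == "__main__":
    gamma = 0.2
    for beta in (0.16, 0.5):               # R0 = 0.8 and R0 = 2.5
        S, I, R = sir_deterministic(beta, gamma, s0=0.999, i0=0.001, days=300)
        print(f"R0 = {beta / gamma:.1f}: peak infectious fraction = {max(I):.4f}, "
              f"final removed fraction = {R[-1]:.4f}")
```

Below the threshold the infectious fraction only declines from its initial value. Above it, the outbreak grows to a substantial peak before burning out.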

Lecture 21 (Thursday, 12 November): Compartment Models

General idea of compartment models as a special kind of Markov model. Applications in demography, epidemiology, sociology, chemistry, etc.
Reading: handout (.Rnw)
Slides (.Rmd)
Homework:
- Homework 10 due
- Homework 11: assignment, ckm_nodes.csv data file, ckm_network.dat data file (only needed for extra credit)

Lecture 22 (Tuesday, 17 November): Markov Random Fields

Markov models in space. The Gibbs-Markov equivalence. The Gibbs sampler again. Examples with the Ising model. Inference. Spatio-temporal Markov models: general idea; cellular automata.
Reading: Guttorp, chapter 4, omitting section 4.6, and skimming section 4.3
Slides (.Rmd)

Lecture 23 (Thursday, 19 November): State-Space or Hidden-Markov Models

Markov dynamics + distorting or noisy observations = Non-Markov observations. Model formulation. Inference: E-M algorithm, Kalman filter, particle filter, simulation-based methods. Spatio-temporal version: dynamic factor models.
Reading: Guttorp, section 2.12
Slides (.Rmd); the more detailed handout (.Rmd)
Homework:
- Homework 11 due
- Homework 12: assignment, mt.s1.csv data file

NO CLASS on Tuesday, 24 November

There was going to be an optional lecture on point processes, but it's become clear that there really won't be enough attendance to justify this. I'll still post the slides/notes, but we won't be meeting.
Reading:
- Guttorp, ch. 5
- Alex Reinhart, "A Review of Self-Exciting Spatio-Temporal Point Processes and Their Applications", Statistical Science 33 (2018): 299--318, arxiv:1708.02647
- Brad Luen and Philip B. Stark, "Testing Earthquake Predictions", pp. 302--315 in Deborah Nolan and Terry Speed (eds.), Probability and Statistics: Essays in Honor of David A. Freedman (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2008)
- Extra-optional more advanced reading:
  - (***) Seth Flaxman, Yee Whye Teh, and Dino Sejdinovic, "Poisson intensity estimation with reproducing kernels", AISTATS 2017, arxiv:1610.08623
  - (*) Charles Loeffler and Seth Flaxman, "Is Gun Violence Contagious? A Spatiotemporal Test", Journal of Quantitative Criminology 34 (2018): 999--1017, arxiv:1611.06713
  - (*) Alex Reinhart and Joel Greenhouse, "Self-exciting point processes with spatial covariates: modeling the dynamics of crime", Journal of the Royal Statistical Society C 67 (2018): 1305--1329, arxiv:1708.03579

Thanksgiving Day (Thursday, 26 November): NO CLASS

Lecture 24 (Tuesday, 1 December): Nonlinear Prediction I: Model-Agnostic Predictions

Using smoothing to estimate regression functions. Nonlinear autoregressions. Additive autoregressions. The "time-delay embedding" method and the question of "how many lags?" When can we expect model-agnostic methods to work?
Readings (all advanced and optional):
- (*) Jianqing Fan and Qiwei Yao, Nonlinear Time Series: Nonparametric and Parametric Methods (Berlin: Springer-Verlag, 2003) [Full-text access via Springerlink]
- (*) Holger Kantz and Thomas Schreiber, Nonlinear Time Series Analysis (2nd edition, Cambridge, UK: Cambridge University Press, 2004)
- (**) Norman H. Packard, James P. Crutchfield, J. Doyne Farmer and Robert S. Shaw, "Geometry from a Time Series", Physical Review Letters 45 (1980): 712--716
- (***) Norbert Wiener, "Nonlinear Prediction and Dynamics", vol. III, pp. 247--252 in Jerzy Neyman (ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Berkeley: University of California Press, 1956)
Slides (.Rmd)
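
A minimal sketch of time-delay embedding followed by a model-agnostic predictor, in Python rather than the course's R. The nearest-neighbor average used here is only one simple smoother, swapped in for illustration. The number of lags and neighbors are tuning choices, and the logistic-map data are a made-up test case. Function names are hypothetical.

```python
def time_delay_embedding(series, n_lags):
    """Turn a series into (lag_vectors, targets) for one-step-ahead prediction.

    Each lag vector holds the n_lags values immediately preceding its target.
    """
    X = [series[t - n_lags:t] for t in range(n_lags, len(series))]
    y = series[n_lags:]
    return X, y

def knn_predict(X, y, query, k=3):
    """Predict by averaging the targets of the k lag vectors closest to query."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), target)
        for x, target in zip(X, y)
    )
    return sum(target for _, target in by_distance[:k]) / k

if __name__ == "__main__":
    # Deterministic but irregular toy data: the logistic map with r = 3.9.
    series = [0.3]
    for _ in range(500):
        series.append(3.9 * series[-1] * (1 - series[-1]))
    train, truth = series[:-1], series[-1]
    X, y = time_delay_embedding(train, n_lags=1)
    guess = knn_predict(X, y, query=train[-1:], k=3)
    print("predicted:", guess, "actual:", truth)
```

Raising `n_lags` makes the nearest neighbors farther away for a fixed amount of data, which is one face of the "how many lags?" question.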

Lecture 25 (Thursday, 3 December): Nonlinear Prediction II: Model-Reliant Predictions

Estimating the parameters of a model. Estimating the state of a model. Extrapolating the estimated state forward in time using the estimated parameters. Ensemble-based forecasts to handle uncertainty. Issues with extremes and "functional box-plots". The importance of model-checking.
Slides (.Rmd); handout on propagation of error
Reading: optional readings on the last page of the slides
Homework:
- Homework 12 due
- Homework 13: assignment, ccw.csv data file

Lecture 26 (Tuesday, 8 December): Regressions with Dependent Observations

Reminders about why regression theory usually assumes observations are IID. Situations where this breaks down: "panel" or "longitudinal" data and correlations within a "unit" over time; correlations between countries or regions in spatial cross-sections; correlations because of shared "ancestry". Effects on linear regression: OLS is still unbiased but inefficient, and all your inferential statistics are wrong. Solution: generalized least squares would be efficient if we knew the covariance structure; ways of figuring out the covariance without knowing it to start with. Some examples, and some case studies in what goes wrong when we ignore these issues.
Reading:
- Recent, fairly easy-to-read papers that highlight important issues:
  - Morgan Kelly, "The Standard Errors of Persistence", SSRN/3398303 (2019)
  - Youjin Lee and Elizabeth L. Ogburn, "Testing for Network and Spatial Autocorrelation", pp. 91--104 in Naoki Masuda, Kwang-Il Goh, Tao Jia, Junichi Yamanoi and Hiroki Sayama (eds.), Proceedings of NetSci-X 2020: Sixth International Winter School and Conference on Network Science, arxiv:1710.03296
  - Youjin Lee and Elizabeth L. Ogburn, "Network Dependence Can Lead to Spurious Associations and Invalid Inference", Journal of the American Statistical Association forthcoming (2020), arxiv:1908.00520
  - Thomas B. Pepinsky, "On Whorfian Socioeconomics", SSRN/3321347 (2019)
- Classic, harder-to-read papers about methods:
  - (**) Peter Diggle, Kung-Yee Liang and Scott L. Zeger, Analysis of Longitudinal Data (Oxford: Oxford University Press, 1994) [This is actually pretty easy to read, if you have the time for a full-length book; the CMU library has electronic access]
  - (***) Kung-Yee Liang and Scott L. Zeger, "Longitudinal data analysis using generalized linear models", Biometrika 73 (1986): 13--22
  - (***) Scott L. Zeger and Kung-Yee Liang, "An overview of methods for the analysis of longitudinal data", Statistics in Medicine 11 (1992): 1825--1839
Slides (.Rmd)

Lecture 27 (Thursday, 10 December): Causal Inference over Time

What do statisticians mean by "causality"? What "Granger causality" is, and why it's usually not interesting. Graphical causal models. Defining causal effects in terms of "surgery" on the graph. Graphical causal models for variables evolving over time. Discovering the right graph, assuming additive dependence.
Reading: (**) Tianjiao Chu and Clark Glymour, "Search for Additive Nonlinear Time Series Causal Models", Journal of Machine Learning Research 9 (2008): 967--991
Slides (.Rmd)
Homework:
- Homework 13 due

Image credit: Pictures on this page are from my teacher David Griffeath's Particle Soup Kitchen website, except for Umberto Boccioni's Riot in the Galleria.