\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]

In our last episode…

We typically estimate a quantity \(\psi\) by minimizing some objective function \(M_n(\psi)\), \(\hat{\psi}_n = \argmin_{\psi}{M_n(\psi)}\)
Generally, \(\hat{\psi}_n \rightarrow \psi_0\) if
- \(M_n(\psi) \rightarrow m(\psi)\) as \(n\rightarrow\infty\)
- and the true \(\psi_0 = \argmin_{\psi}{m(\psi)}\)
Generally, \(\Var{\hat{\psi}_n} \rightarrow \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1}\) if
- \(\nabla M_n(\psi_0) \rightarrow \nabla m(\psi_0) = 0\)
- and \(\nabla \nabla M_n(\psi_0) \rightarrow \nabla\nabla m(\psi_0) \equiv \mathbf{h}\)
Law of large numbers gets used to ensure \(M_n \rightarrow m\)
- Central limit theorem gets used to ensure \(M_n \rightsquigarrow \mathcal{N}\)

Agenda for today

Convergence of not-too-correlated sample averages on expectations
- A “mean ergodic theorem”
- Stationary and non-stationary versions
- Notion of effective sample size
Inference for AR(1)
Some glimpses at more advanced ergodic theory (without proofs)
- Convergence of the log-likelihood
- Weak dependence and CLTs

Ergodic theory

Laws of large numbers for dependent variables are called ergodic theorems
- Blame Ludwig Boltzmann
This has absorbed a lot of mathematical talent over the last \(\approx 150\) years (Plato 1994)
We (= you) can prove a useful one over the next few minutes

Second-order stationary and not-too-correlated

Assume \(X(1), X(2), \ldots X(t), \ldots\) is (second-order) stationary:
- \(\Expect{X(t)} = \mu\) for all \(t\)
- \(\Cov{X(t), X(t+h)} = \gamma(h)\) for all \(t, h\)
Assume the sum of the covariances is finite: \[ \sum_{h=-\infty}^{\infty}{\gamma(h)} = \gamma(0)\tau < \infty \]
- Correlation time (also) refers to this \(\tau\)
- “Decay of correlations”: finite sum implies \(\gamma(h) \rightarrow 0\) as \(h\rightarrow\infty\)

Our first ergodic theorem

\[\begin{eqnarray} \overline{X}_n & \equiv & \frac{1}{n}\sum_{t=1}^{n}{X(t)}\\ \Expect{\left(\overline{X}_n - \mu\right)^2} & = & \left(\Expect{\overline{X}_n - \mu}\right)^2 + \Var{\overline{X}_n}\\ \Expect{\overline{X}_n} & = & \frac{1}{n}\sum_{t=1}^{n}{\Expect{X(t)}} = \mu\\ \Var{\overline{X}_n} & = & \frac{1}{n^2}\left(\sum_{t=1}^{n}{\Var{X(t)}} + 2\sum_{t=1}^{n-1}{\sum_{s=t+1}^{n}{\Cov{X(t), X(s)}}}\right)\\ & = & \frac{1}{n^2}\left(n \gamma(0) + \sum_{t=1}^{n}{\sum_{s\neq t}{\gamma(t-s)}}\right)\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{\gamma(t-s)}}\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=1-t}^{n-t}{\gamma(h)}} \\ & \rightarrow & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=-\infty}^{\infty}{\gamma(h)}} = \frac{1}{n^2}\sum_{t=1}^{n}{\gamma(0)\tau} = \frac{\gamma(0)\tau}{n} \end{eqnarray}\]

Our first ergodic theorem

If \(\tau < \infty\), then

\[\begin{eqnarray} \Expect{\left(\overline{X}_n - \mu\right)^2} &\rightarrow & 0 + \frac{\gamma(0)\tau}{n} \rightarrow 0 \end{eqnarray}\]

\(\Leftrightarrow\) If \(\tau < \infty\), then

\[\begin{eqnarray} \overline{X}_n \rightarrow \mu \end{eqnarray}\]
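A quick numerical check of the variance calculation, as a minimal sketch: it evaluates the exact double sum \(n^{-2}\sum_{t}\sum_{s}\gamma(t-s)\) and compares it with \(\gamma(0)\tau/n\). The autocovariance \(\gamma(h) = \gamma(0)\rho^{|h|}\), the values of `gamma0` and `rho`, and the function name `var_of_sample_mean` are all illustrative assumptions.

```python
def var_of_sample_mean(gamma, n):
    """Exact variance of the sample mean: n^-2 * sum_{t,s} gamma(t - s)."""
    # Lag 0 appears n times; each lag h >= 1 appears (n - h) times on
    # either side of the diagonal, and gamma(h) = gamma(-h).
    total = n * gamma(0)
    for h in range(1, n):
        total += 2 * (n - h) * gamma(h)
    return total / n**2


# Illustrative autocovariance (an assumption): gamma(h) = gamma0 * rho^|h|
gamma0, rho = 2.0, 0.6
tau = 1 + 2 * rho / (1 - rho)  # correlation time for this gamma


def gamma(h):
    return gamma0 * rho ** abs(h)


for n in (10, 100, 1000, 10000):
    exact = var_of_sample_mean(gamma, n)
    approx = gamma0 * tau / n
    print(f"n={n:6d}  exact={exact:.6f}  gamma(0)*tau/n={approx:.6f}")
```

For positive \(\rho\) the exact value sits slightly below \(\gamma(0)\tau/n\) at every finite \(n\), because the finite double sum leaves out some positive covariances; the ratio of the two approaches 1 as \(n\) grows.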

Effective sample size

For uncorrelated \(X(t)\), we saw last time that \[ \Expect{\left(\overline{X}_n - \mu\right)^2} = \frac{\Var{X(1)}}{n} \] We just showed that if \(\tau < \infty\), then for large \(n\) \[ \Expect{\left(\overline{X}_n - \mu\right)^2} \approx \frac{\Var{X(1)}\tau}{n} \]
Equivalently, \[ \Expect{\left(\overline{X}_n - \mu\right)^2} \approx \frac{\Var{X(1)}}{n/\tau} \]
As though we had \(n/\tau\) uncorrelated observations, instead of \(n\) dependent ones
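A small Monte Carlo sketch of the effective-sample-size idea. The process here is a first-order moving average, \(X(t) = \epsilon(t) + \theta\epsilon(t-1)\) with standard Gaussian \(\epsilon\), chosen purely for illustration (it is not from the lecture): it has \(\gamma(0) = 1+\theta^2\), \(\gamma(\pm 1) = \theta\) and no other correlations, so \(\tau = 1 + 2\theta/(1+\theta^2)\). The seed, `n`, `theta` and `reps` are arbitrary choices.

```python
import random
import statistics


def ma1_sample_mean(n, theta, rng):
    """Sample mean of X(t) = eps(t) + theta * eps(t-1), t = 1..n, eps ~ N(0, 1)."""
    prev = rng.gauss(0.0, 1.0)
    total = 0.0
    for _ in range(n):
        cur = rng.gauss(0.0, 1.0)
        total += cur + theta * prev
        prev = cur
    return total / n


rng = random.Random(42)            # fixed seed so the run is repeatable
n, theta, reps = 100, 1.0, 2000

gamma0 = 1 + theta**2              # Var[X(t)]
tau = 1 + 2 * theta / gamma0       # (gamma(0) + 2 gamma(1)) / gamma(0)
n_eff = n / tau                    # effective sample size

means = [ma1_sample_mean(n, theta, rng) for _ in range(reps)]
mc_var = statistics.pvariance(means, mu=0.0)   # the true mean is 0
predicted = gamma0 / n_eff                     # = gamma(0) * tau / n
print(f"tau={tau:.2f}  n_eff={n_eff:.1f}")
print(f"Monte Carlo Var={mc_var:.4f}  predicted={predicted:.4f}")
```

With \(\theta = 1\) the correlation time is 2, so 100 dependent observations pin down the mean about as well as 50 uncorrelated ones with the same variance.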

How sensible is \(\tau < \infty\)?

Uncorrelated processes have \(\tau = 1\)
If \(\gamma(h) = \gamma(0) \rho^{|h|}\), and \(|\rho| < 1\), then \[ \sum_{h=-\infty}^{\infty}{\gamma(h)} = \gamma(0)\left(1+2\sum_{h=1}^{\infty}{\rho^h}\right) = \gamma(0)\left(1+2\frac{\rho}{1-\rho}\right) \]
- (sum a geometric series)
- Continuity with the uncorrelated case if \(\rho \approx 0\)
An easy but implausible counter-example:
- \(X(1) \sim\) anything, \(X(t+1) = X(t)\) \(\Rightarrow\) \(\tau = \infty\)
- “Checking a newspaper by buying more copies”
More troublesome: very slow decay of correlations
- \(\gamma(h) \propto h^{-\alpha}\) is a “long-memory process”, or one with “long-range correlations”
- \(\lim_{T\rightarrow\infty}{\sum_{h=-T}^{T}{\gamma(h)}}=\infty\) if \(\alpha \leq 1\)
- But \(\gamma(h)\) is summable if \(\alpha > 1\)
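The contrast between summable and non-summable correlations can be seen numerically. This is a sketch only: `partial_tau` is a hypothetical helper that truncates the sum at lag \(T\), and the power-law form \((1+h)^{-\alpha}\) is an illustrative choice of autocorrelation with the right tail behaviour.

```python
import math


def partial_tau(rho_of_h, T):
    """Truncated correlation time: 1 + 2 * sum_{h=1}^{T} rho(h)."""
    return 1.0 + 2.0 * sum(rho_of_h(h) for h in range(1, T + 1))


# Geometric decay: the truncated sum settles at 1 + 2 rho / (1 - rho).
rho = 0.8
print("geometric:", partial_tau(lambda h: rho**h, 500), 1 + 2 * rho / (1 - rho))

# Power-law decay, using rho(h) = (1 + h)^(-alpha) as an illustrative form.
for alpha in (0.5, 1.0, 2.0):
    sums = [partial_tau(lambda h: (1 + h) ** (-alpha), T) for T in (10**2, 10**3, 10**4)]
    print(f"alpha={alpha}:", [round(s, 3) for s in sums])
```

For \(\alpha = 2\) the truncated sums level off (the limit is \(\pi^2/3 - 1\) for this particular form), while for \(\alpha \leq 1\) they keep growing with \(T\).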

Generalizing: non-stationary case

We don’t actually need stationarity!
Define \(\mu_n \equiv \frac{1}{n}\sum_{t=1}^{n}{\Expect{X(t)}}\)
Define \(V(n) \equiv \sum_{t=1}^{n}{\sum_{s=1}^{n}{\Cov{X(t), X(s)}}}\)
Then if \(V(n) = o(n^2)\) \[ \Expect{\left(\overline{X}_n - \mu_n\right)^2} \rightarrow 0 \]
- By exactly the same proof
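Spelled out, a sketch of that proof, using only the definitions of \(\mu_n\) and \(V(n)\) above (stationarity is not used anywhere):

\[\begin{eqnarray} \Expect{\overline{X}_n} & = & \frac{1}{n}\sum_{t=1}^{n}{\Expect{X(t)}} = \mu_n\\ \Expect{\left(\overline{X}_n - \mu_n\right)^2} & = & \left(\Expect{\overline{X}_n - \mu_n}\right)^2 + \Var{\overline{X}_n} = 0 + \Var{\overline{X}_n}\\ \Var{\overline{X}_n} & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{\Cov{X(t), X(s)}}} = \frac{V(n)}{n^2} \rightarrow 0 \end{eqnarray}\]

In the stationary case with \(\tau < \infty\), \(V(n) \approx n\gamma(0)\tau\), which is \(O(n)\) and so certainly \(o(n^2)\).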

Application: Stationary AR(1)

\(X(t) = a + b X(t-1) + \epsilon(t)\)
Assume stationary, so \(\Expect{X(t)} = \frac{a}{1-b}\) and \(\Cov{X(t), X(t+h)} = b^{|h|} \frac{\Var{\epsilon}}{1-b^2}\)
Then \(\tau = 1+ 2\frac{b}{1-b}\)
- N.B.: \(1+2\frac{b}{1-b} = \frac{1+b}{1-b} > 0\) for any \(b \in (-1, 1)\)
So \(\overline{X}_n \rightarrow \Expect{X(1)} = \frac{a}{1-b}\)
Similarly \[ \frac{1}{n}\sum_{t=1}^{n-h}{(X(t) - \overline{X}_n)(X(t+h)-\overline{X}_n)} \rightarrow \Cov{X(t), X(t+h)} \]
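A simulation sketch of both limits, assuming Gaussian noise and the illustrative parameter values shown (none of these specifics come from the lecture). `simulate_ar1` and `sample_autocov` are hypothetical helper names; the sample autocovariance uses the \(1/n\) normalization from the formula above.

```python
import random


def simulate_ar1(n, a, b, sigma, rng):
    """Stationary Gaussian AR(1): X(t) = a + b X(t-1) + eps(t), |b| < 1."""
    x = rng.gauss(a / (1 - b), sigma / (1 - b**2) ** 0.5)  # stationary start
    path = [x]
    for _ in range(n - 1):
        x = a + b * x + rng.gauss(0.0, sigma)
        path.append(x)
    return path


def sample_autocov(path, h):
    """n^-1 * sum_{t=1}^{n-h} (X(t) - mean)(X(t+h) - mean)."""
    n = len(path)
    mean = sum(path) / n
    return sum((path[t] - mean) * (path[t + h] - mean) for t in range(n - h)) / n


a, b, sigma, n = 1.0, 0.5, 1.0, 50_000   # illustrative parameter values
path = simulate_ar1(n, a, b, sigma, random.Random(1))

print("sample mean:", sum(path) / n, " limit:", a / (1 - b))
for h in range(4):
    limit = b**h * sigma**2 / (1 - b**2)
    print(f"h={h}: sample autocov={sample_autocov(path, h):.4f}  limit={limit:.4f}")
```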

Application: Not-necessarily-stationary AR(1)

Back in Lecture 13, saw that if \(a=0\) and we use OLS \[\begin{eqnarray} \hat{b} & = & b + \frac{\sum_{t=0}^{n-1}{X(t)\epsilon(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & b + \frac{n^{-1}\sum_{t=0}^{n-1}{X(t)\epsilon(t+1)}}{n^{-1}\sum_{t=0}^{n-1}{X^2(t)}} \end{eqnarray}\]
But now the numerator \(\rightarrow \Expect{X(t) \epsilon(t+1)} = 0\)
- because \(\Cov{X(t) \epsilon(t+1), X(t+1)\epsilon(t+2)} = 0\)
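The step can be spelled out under the usual AR assumptions, stated here explicitly because the slide does not: each innovation \(\epsilon(t+1)\) has mean zero and is independent of everything up to time \(t\), and the moments involved exist. Since \(X(t)\), \(\epsilon(t+1)\) and \(X(t+1)\) are all determined by time \(t+1\),

\[\begin{eqnarray} \Expect{X(t)\epsilon(t+1)} & = & \Expect{X(t)}\Expect{\epsilon(t+1)} = 0\\ \Cov{X(t)\epsilon(t+1), X(t+1)\epsilon(t+2)} & = & \Expect{X(t)\epsilon(t+1)X(t+1)\epsilon(t+2)} - 0\\ & = & \Expect{X(t)\epsilon(t+1)X(t+1)}\Expect{\epsilon(t+2)} = 0 \end{eqnarray}\]

The same argument works at any lag, always factoring out the latest innovation. So for the sequence \(X(t)\epsilon(t+1)\), \(V(n)\) reduces to the diagonal sum \(\sum_{t=0}^{n-1}{\Var{X(t)\epsilon(t+1)}}\), which is \(o(n^2)\) whenever those variances stay bounded.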

Application: AR(1)

Objective function at finite \(n\): \[ M_n(a,b) = \frac{1}{n}\sum_{t=1}^{n-1}{(X(t+1) - a - b X(t))^2} \]
Exercise (off-line): Assume stationarity and show that this goes to \[ m(a,b) = \Expect{(X(t+1) - a-bX(t))^2} \]
- (Can you do this under non-stationarity?)
So all our asymptotic analysis from last time applies
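A sketch of this estimator in action: minimize \(M_n(a,b)\) in closed form (ordinary least squares of \(X(t+1)\) on \(X(t)\)) on a simulated path, and check that the minimized objective is close to \(m\) at the truth, which is \(\Var{\epsilon}\). The parameter values, Gaussian noise and seed are illustrative assumptions; the path is deliberately started at 0 rather than in the stationary distribution, so it is only asymptotically stationary.

```python
import random


def fit_ar1_ols(path):
    """Minimize the mean squared one-step error over (a, b) in closed form."""
    x, y = path[:-1], path[1:]            # predictor X(t), response X(t+1)
    k = len(x)
    xbar, ybar = sum(x) / k, sum(y) / k
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b_hat = sxy / sxx
    a_hat = ybar - b_hat * xbar
    return a_hat, b_hat


def objective(path, a, b):
    """M_n(a, b) = n^-1 * sum_{t=1}^{n-1} (X(t+1) - a - b X(t))^2."""
    n = len(path)
    return sum((path[t + 1] - a - b * path[t]) ** 2 for t in range(n - 1)) / n


# Illustrative true parameters (assumptions, not from the lecture).
a, b, sigma, n = 1.0, 0.5, 1.0, 20_000
rng = random.Random(7)
path = [0.0]                              # not started in the stationary law
for _ in range(n - 1):
    path.append(a + b * path[-1] + rng.gauss(0.0, sigma))

a_hat, b_hat = fit_ar1_ols(path)
print(f"a_hat={a_hat:.3f} (true {a})  b_hat={b_hat:.3f} (true {b})")
print(f"M_n at the estimate={objective(path, a_hat, b_hat):.3f}  Var[eps]={sigma**2}")
```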

Looking beyond the simplest ergodic theorem

More general “mean-square” ergodic theorem for stationary processes: \(\overline{X}_n \rightarrow \Expect{X(1)}\) iff \(n^{-1}\sum_{h=0}^{n}{\gamma(h)} \rightarrow 0\)
“Individual” ergodic theorem: if \(X(1), \ldots X(t), \ldots\) is strongly stationary, then for any \(k\) and any function \(f(X(1), X(2), \ldots X(k))\), \[ \Prob{\frac{1}{n}\sum_{t=1}^{n-k+1}{f(X(t), \ldots X(t+k-1))} \rightarrow \Expect{f(X(1), \ldots X(k))}} = 1 \]
- i.e., sample averages converge along (almost all) individual trajectories
These also work for asymptotically stationary processes
- because the long-run limit is dominated by the stationary process we’re approaching

Convergence of the log-likelihood

Assume a pdf \(p(x(1), x(2), \ldots x(t))\) generates \(X(1), X(2), \ldots X(t)\)
Assume \(X\) is stationary
Then \[ \lim_{n\rightarrow\infty}{\frac{1}{n}\Expect{\log{p(X(1), \ldots X(n))}}} = \lambda \] exists, and \[ \frac{1}{n}\log{p(X(1), \ldots X(n))} \rightarrow \lambda \]
- “Asymptotic equipartition” or “Shannon-McMillan-Breiman” property

Convergence of the log-likelihood (II)

Assume \(X\) is generated by a distribution \(p\)
Consider a model pdf \(f(x(1), x(2), \ldots x(t); \theta)\)
Then, under stationarity, \[\begin{eqnarray} \lim_{n\rightarrow\infty}{\frac{1}{n}\Expect{\log{f(X(1), \ldots X(n);\theta)}}} & = & \lambda(\theta)\\ \frac{1}{n}\log{f(X(1), \ldots X(n); \theta)} & \rightarrow & \lambda(\theta) \end{eqnarray}\]
Some non-stationary extensions, especially if asymptotically stationary
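A sketch of both convergence statements for one concrete case, chosen for illustration: data from a stationary Gaussian AR(1), and a Gaussian AR(1) model family with \(\theta = (a, b, \sigma)\). At the true parameters the per-observation log-likelihood should approach \(-\frac{1}{2}\log{(2\pi e \Var{\epsilon})}\); at a wrong \(\theta\) it approaches a smaller \(\lambda(\theta)\). The helper name `ar1_loglik_per_obs`, the parameter values and the seed are assumptions.

```python
import math
import random


def ar1_loglik_per_obs(path, a, b, sigma):
    """n^-1 * log f(X(1..n); theta) for a stationary Gaussian AR(1) model."""
    def log_normal_pdf(x, mean, var):
        return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

    # Chain rule: f(x(1..n)) = f(x(1)) * prod_t f(x(t+1) | x(t)).
    total = log_normal_pdf(path[0], a / (1 - b), sigma**2 / (1 - b**2))
    for prev, cur in zip(path, path[1:]):
        total += log_normal_pdf(cur, a + b * prev, sigma**2)
    return total / len(path)


# Simulate from the true process (illustrative parameter values).
a, b, sigma, n = 0.0, 0.5, 1.0, 20_000
rng = random.Random(3)
x = rng.gauss(a / (1 - b), sigma / math.sqrt(1 - b**2))
path = [x]
for _ in range(n - 1):
    x = a + b * x + rng.gauss(0.0, sigma)
    path.append(x)

at_truth = ar1_loglik_per_obs(path, a, b, sigma)
at_wrong = ar1_loglik_per_obs(path, a, 0.0, sigma)   # misspecified: b = 0
limit = -0.5 * math.log(2 * math.pi * math.e * sigma**2)
print(f"at truth: {at_truth:.4f}  (limit {limit:.4f})")
print(f"at b=0:   {at_wrong:.4f}")
```

The gap between the two values is what maximum likelihood exploits: \(\lambda(\theta)\) is largest at the true parameters in this example.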

Central limit theorems and weak dependence

Suppose \((X(t-k), \ldots X(t-1))\) and \((X(t+h), \ldots X(t+h+k-1))\) approach independence as \(h\rightarrow\infty\)
Then we’ve got nearly independent “blocks” of length \(k\)
\(\overline{X}_n\) acts like average of some nearly-independent blocks, plus remainder
Lets us transfer central limit theorem to dependent data
Need to be precise about “approaching independence”

Summary

If correlations go to zero fast enough, \(\overline{X}_n\) converges to the expectation value
This is enough to have a lot of estimators converge
- And even to get asymptotic standard errors
- Effective sample size is reduced
Maximum likelihood works very generally for parametric models
Central limit theorem still holds under weak dependence / asymptotic independence

Backup: Boltzmann

(Photo credit: Tom Schneider, downloaded 2008 from an apparently-defunct website)

Backup: “Ergodic”, “Ergodicity”

Boltzmann (1964) was interested in the behavior of a physical system at constant energy
Write:
- \(X(t)\) for the state of the system at time \(t\)
- \(E(x)\) for the energy of a system in state \(x\)
- \(R =\) set of states with \(E(x) = E(X(0))\)
- \(|R| =\) volume of \(R\)
He wanted to say that for any (well-behaved) set of states \(A\), \[ \frac{1}{T}\int_{t=0}^{T}{\mathbf{1}(X(t) \in A) dt} \rightarrow \int_{R}{\mathbf{1}(x \in A) \frac{1}{|R|} dx} \]
- so the average along a path \(X(t)\) equals the average over the region of constant energy
- ergon = “work, energy” (as in “erg”, the unit of energy), hodos = “way, path” (as in “odometer”) \(\Rightarrow\) “ergodic” = energy-path
The name stuck
A process is ergodic when averages over time converge on expectation values
An ergodic theorem is one showing that some process (or function of a process, etc.) is ergodic

Backup: More on ergodic theory

Early history: Plato (1994)
Gentlest introduction, and a really good complement to this class: Grimmett and Stirzaker (1992)
Some of how ergodicity supports using probability theory in the real world: Ruelle (1991)
Ergodicity in physics: Lebowitz (1999), Castiglione et al. (2008)
Going deeper needs advanced (measure-theoretic) probability: Gray (2009) builds up what’s needed for ergodic theory
Ergodic properties for log-likelihood are part of information theory:
- Cover and Thomas (2006) is the best over-all textbook
- Gray (1990) gives detailed coverage of this topic
Mackey (1992) is really about decay-of-dependence

Backup: Weak dependence and central limit theorems

“Mixing” is a strong notion of asymptotic independence
Measure dependence between \(X(-\infty), \ldots X(t-1), X(t)\) and \(X(t+h), X(t+h+1), \ldots X(+\infty)\) by “total variation” from independence: \[ \beta(h) = \int{|p(x_{-\infty:t}, x_{t+h:\infty}) - p(x_{-\infty:t})p(x_{t+h:\infty})| dx_{-\infty:t} dx_{t+h:\infty}} \]
The process is \(\beta\)-mixing if \(\beta(h) \rightarrow 0\) as \(h\rightarrow \infty\)
Now use “blocking”:
- Divide \(X_1, \ldots X_n\) into \(2m\) blocks of length \(h\) (plus \(< h\) extra observations as remainder)
- Take the odd-numbered blocks; they’re a random variable \(Z = (Z_1, Z_2, \ldots Z_m)\)
- Imagine \(\tilde{Z} = (\tilde{Z}_1, \ldots \tilde{Z}_m)\), where the blocks have the same marginal distribution but are all independent
- For any event \(A\), \(|\Prob{Z \in A} - \Prob{\tilde{Z} \in A}| \leq m \beta(h)\) (Yu 1994)
- Since sample averages using \(\tilde{Z} \rightsquigarrow \mathcal{N}\), so do sample averages using \(Z\)
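The bookkeeping of the blocking step can be sketched as follows. `odd_blocks` is a hypothetical helper; it takes \(m = \lfloor n/(2h) \rfloor\), so the leftover remainder has fewer than \(2h\) observations in general (fewer than \(h\) when \(n\) nearly divides evenly).

```python
def odd_blocks(xs, h):
    """Split xs into 2m consecutive blocks of length h; return the odd-numbered ones.

    Returns (odd, remainder), where odd = [block 1, block 3, ...] (1-based
    numbering) and remainder holds the observations left after the 2m blocks.
    """
    m = len(xs) // (2 * h)
    blocks = [xs[i * h:(i + 1) * h] for i in range(2 * m)]
    odd = blocks[0::2]              # blocks 1, 3, 5, ...: separated by gaps of length h
    remainder = xs[2 * m * h:]
    return odd, remainder


xs = list(range(1, 24))             # stand-in for X(1), ..., X(23)
odd, remainder = odd_blocks(xs, h=5)
print(odd)                          # [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]
print(remainder)                    # [21, 22, 23]
print([sum(block) / len(block) for block in odd])   # block means: [3.0, 13.0]
```

Consecutive odd-numbered blocks are separated by a gap of \(h\) observations, which is what lets \(\beta(h)\) control how far they are from independent.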

References

Boltzmann, Ludwig. 1964. Lectures on Gas Theory. Berkeley: University of California Press.

Castiglione, Patrizia, Massimo Falcioni, Annick Lesne, and Angelo Vulpiani. 2008. Chaos and Coarse Graining in Statistical Mechanics. Cambridge, England: Cambridge University Press.

Cover, Thomas M., and Joy A. Thomas. 2006. Elements of Information Theory. 2nd ed. New York: John Wiley.

Gray, Robert M. 1990. Entropy and Information Theory. New York: Springer-Verlag. http://ee.stanford.edu/~gray/it.html.

———. 2009. Probability, Random Processes, and Ergodic Properties. 2nd ed. New York: Springer-Verlag. http://ee.stanford.edu/~gray/arp.html.

Grimmett, G. R., and D. R. Stirzaker. 1992. Probability and Random Processes. 2nd ed. Oxford: Oxford University Press.

Lebowitz, Joel L. 1999. “Statistical Mechanics: A Selective Review of Two Central Issues.” Reviews of Modern Physics 71:S346–S357. http://arxiv.org/abs/math-ph/0010018.

Mackey, Michael C. 1992. Time’s Arrow: The Origins of Thermodynamic Behavior. Berlin: Springer-Verlag.

Plato, Jan von. 1994. Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective. Cambridge, England: Cambridge University Press.

Ruelle, David. 1991. Chance and Chaos. Princeton, New Jersey: Princeton University Press.

Yu, Bin. 1994. “Rates of Convergence for Empirical Processes of Stationary Mixing Sequences.” Annals of Probability 22:94–116. https://doi.org/10.1214/aop/1176988849.

Inference II — Ergodic Theory
