# Inference II — Ergodic Theory

25 September 2018

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}}$

# In our last episode…

• We typically estimate a quantity $$\psi$$ by minimizing some objective function $$M_n(\psi)$$, $$\hat{\psi}_n = \argmin_{\psi}{M_n(\psi)}$$
• Generally, $$\hat{\psi}_n \rightarrow \psi_0$$ if
• $$M_n(\psi) \rightarrow m(\psi)$$ as $$n\rightarrow\infty$$
• and the true $$\psi_0 = \argmin_{\psi}{m(\psi)}$$
• Generally, $$\Var{\hat{\psi}_n} \rightarrow \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1}$$ if
• $$\nabla M_n(\psi_0) \rightarrow \nabla m(\psi_0) = 0$$
• and $$\nabla \nabla M_n(\psi_0) \rightarrow \nabla\nabla m(\psi_0) \equiv \mathbf{h}$$
• Law of large numbers gets used to ensure $$M_n \rightarrow m$$
• Central limit theorem gets used to ensure $$\nabla M_n(\psi_0) \rightsquigarrow \mathcal{N}$$

# Agenda for today

• Convergence of sample averages of not-too-correlated variables to expectations
• A “mean ergodic theorem”
• Stationary and non-stationary versions
• Notion of effective sample size
• Inference for AR(1)
• Some glimpses at more advanced ergodic theory (without proofs)
• Convergence of the log-likelihood
• Weak dependence and CLTs

# Ergodic theory

• Laws of large numbers for dependent variables are called ergodic theorems
• Blame Ludwig Boltzmann
• This has absorbed a lot of mathematical talent over the last $$\approx 150$$ years (Plato 1994)
• We (= you) can prove a useful one over the next few minutes

# Second-order stationary and not-too-correlated

• Assume $$X(1), X(2), \ldots X(t), \ldots$$ is (second-order) stationary:
• $$\Expect{X(t)} = \mu$$ for all $$t$$
• $$\Cov{X(t), X(t+h)} = \gamma(h)$$ for all $$t, h$$
• Assume the sum of the covariances is finite: $\sum_{h=-\infty}^{\infty}{\gamma(h)} = \gamma(0)\tau < \infty$
• This $$\tau$$ is also called the correlation time
• “Decay of correlations”: finite sum implies $$\gamma(h) \rightarrow 0$$ as $$h\rightarrow\infty$$

# Our first ergodic theorem

$\begin{eqnarray} \overline{X}_n & \equiv & \frac{1}{n}\sum_{t=1}^{n}{X(t)}\\ \Expect{\left(\overline{X}_n - \mu\right)^2} & = & \left(\Expect{\overline{X}_n - \mu}\right)^2 + \Var{\overline{X}_n}\\ \Expect{\overline{X}_n} & = & \frac{1}{n}\sum_{t=1}^{n}{\Expect{X(t)}} = \mu\\ \Var{\overline{X}_n} & = & \frac{1}{n^2}\left(\sum_{t=1}^{n}{\Var{X(t)}} + 2\sum_{t=1}^{n-1}{\sum_{s=t+1}^{n}{\Cov{X(t), X(s)}}}\right)\\ & = & \frac{1}{n^2}\left(n \gamma(0) + \sum_{t=1}^{n}{\sum_{s\neq t}{\gamma(t-s)}}\right)\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{\gamma(t-s)}}\\ & = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=1-t}^{n-t}{\gamma(h)}} \\ & \approx & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=-\infty}^{\infty}{\gamma(h)}} = \frac{1}{n^2}\sum_{t=1}^{n}{\gamma(0)\tau} = \frac{\gamma(0)\tau}{n} \end{eqnarray}$

• (last step: for large $$n$$, each inner sum over $$h$$ approaches the full sum $$\gamma(0)\tau$$)

# Our first ergodic theorem

If $$\tau < \infty$$, then

$\begin{eqnarray} \Expect{\left(\overline{X}_n - \mu\right)^2} &\rightarrow & 0 + \frac{\gamma(0)\tau}{n} \rightarrow 0 \end{eqnarray}$

$$\Leftrightarrow$$ If $$\tau < \infty$$, then

$\begin{eqnarray} \overline{X}_n \rightarrow \mu \end{eqnarray}$

• (convergence in mean square, which implies convergence in probability)
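The claim can be checked by simulation; here is a minimal Python sketch (the AR(1) process and all parameter values are assumptions for illustration) comparing the Monte Carlo mean-squared error of $$\overline{X}_n$$ against the predicted $$\gamma(0)\tau/n$$:

```python
import numpy as np

# Monte Carlo check: for a stationary AR(1) with b = 0.5 and Var[eps] = 1,
# E[(Xbar_n - mu)^2] should be close to gamma(0)*tau/n (mu = 0 here).
rng = np.random.default_rng(42)
b, n, reps = 0.5, 5000, 2000
gamma0 = 1.0 / (1 - b**2)        # stationary Var[X(t)]
tau = 1 + 2 * b / (1 - b)        # correlation time when gamma(h) = gamma(0)*b^|h|

x = rng.normal(size=(reps, n))                          # innovations eps(t)
x[:, 0] = rng.normal(scale=np.sqrt(gamma0), size=reps)  # stationary start
for t in range(1, n):                                   # X(t) = b*X(t-1) + eps(t)
    x[:, t] += b * x[:, t - 1]

mse = np.mean(x.mean(axis=1) ** 2)   # true mean is 0, so MSE = E[Xbar^2]
predicted = gamma0 * tau / n
print(mse, predicted)                # should agree up to Monte Carlo error
```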

# Effective sample size

• For uncorrelated $$X_i$$, we saw last time that $\Expect{\left(\overline{X}_n - \mu\right)^2} = \frac{\Var{X(1)}}{n}$ We just showed that if $$\tau < \infty$$, then, for large $$n$$, $\Expect{\left(\overline{X}_n - \mu\right)^2} \approx \frac{\Var{X(1)}\tau}{n}$
• Equivalently, $\Expect{\left(\overline{X}_n - \mu\right)^2} \approx \frac{\Var{X(1)}}{n/\tau}$
• As though we had $$n/\tau$$ uncorrelated observations, instead of $$n$$ dependent ones
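In practice one can estimate $$\tau$$ from the sample autocovariances and report $$n/\tau$$; a hedged sketch (the truncation lag is an arbitrary assumption, and too long a window just adds noise):

```python
import numpy as np

def effective_sample_size(x, max_lag=100):
    """Estimate n/tau by summing sample autocovariances up to max_lag
    (the choice of max_lag is an assumption, not a principled rule)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma = np.array([xc[: n - h] @ xc[h:] / n for h in range(max_lag + 1)])
    tau = 1 + 2 * np.sum(gamma[1:]) / gamma[0]
    return n / tau

# AR(1) with b = 0.5 has tau = 3, so the ESS should be near n/3
rng = np.random.default_rng(0)
n, b = 100_000, 0.5
x = rng.normal(size=n)
for t in range(1, n):
    x[t] += b * x[t - 1]
ess = effective_sample_size(x)
print(ess, n / 3)
```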

# How sensible is $$\tau < \infty$$?

• Uncorrelated processes have $$\tau = 1$$
• If $$\gamma(h) = \gamma(0) \rho^{|h|}$$, and $$|\rho| < 1$$, then $\sum_{h=-\infty}^{\infty}{\gamma(h)} = \gamma(0)\left(1+2\sum_{h=1}^{\infty}{\rho^h}\right) = \gamma(0)\left(1+2\frac{\rho}{1-\rho}\right)$
• (sum a geometric series)
• Continuity with the uncorrelated case if $$\rho \approx 0$$
• An easy but implausible counter-example:
• $$X(1) \sim$$ anything, $$X(t+1) = X(t)$$ $$\Rightarrow$$ $$\tau = \infty$$
• “Checking a newspaper by buying more copies”
• More troublesome: very slow decay of correlations
• $$\gamma(h) \propto h^{-\alpha}$$ is a “long-memory process”, or one with “long-range correlations”
• $$\lim_{T\rightarrow\infty}{\sum_{h=-T}^{T}{\gamma(h)}}=\infty$$ if $$\alpha \leq 1$$
• But $$\gamma(h)$$ is summable if $$\alpha > 1$$
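Both claims are easy to check numerically; a small sketch with assumed values of $$\rho$$ and $$\alpha$$:

```python
# Geometric decay: a truncated sum matches the closed form
# tau = 1 + 2*rho/(1 - rho); long memory with alpha = 1: partial sums diverge.
rho = 0.9
partial = 1 + 2 * sum(rho**h for h in range(1, 10_000))
closed = 1 + 2 * rho / (1 - rho)
print(partial, closed)    # both approximately 19.0

# gamma(h)/gamma(0) = h^(-alpha): partial sums grow without bound when alpha <= 1
alpha = 1.0
sums = [1 + 2 * sum(h ** -alpha for h in range(1, T + 1))
        for T in (10**2, 10**4, 10**6)]
print(sums)               # strictly increasing, no sign of settling down
```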

# Generalizing: non-stationary case

• We don’t actually need stationarity!
• Define $$\mu_n \equiv \frac{1}{n}\sum_{t=1}^{n}{\Expect{X(t)}}$$
• Define $$V(n) \equiv \sum_{t=1}^{n}{\sum_{s=1}^{n}{\Cov{X(t), X(s)}}}$$
• Then $\Expect{\left(\overline{X}_n - \mu_n\right)^2} = \frac{V(n)}{n^2}$ exactly, so if $$V(n) = o(n^2)$$, $\Expect{\left(\overline{X}_n - \mu_n\right)^2} \rightarrow 0$
• By exactly the same proof

# Application: Stationary AR(1)

• $$X(t) = a + b X(t-1) + \epsilon(t)$$
• Assume stationarity (which requires $$|b| < 1$$), so $$\Expect{X(t)} = \frac{a}{1-b}$$ and $$\Cov{X(t), X(t+h)} = b^{|h|} \frac{\Var{\epsilon}}{1-b^2}$$
• Then $$\tau = 1 + 2\frac{b}{1-b} = \frac{1+b}{1-b}$$
• N.B.: $$\frac{1+b}{1-b} > 0$$ for any $$b \in (-1, 1)$$
• So $$\overline{X}_n \rightarrow \Expect{X(1)} = \frac{a}{1-b}$$
• Similarly $\frac{1}{n}\sum_{t=1}^{n-h}{(X(t) - \overline{X}_n)(X(t+h)-\overline{X}_n)} \rightarrow \Cov{X(t), X(t+h)}$
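A simulation sketch of both convergences (the parameter values are assumptions chosen for illustration):

```python
import numpy as np

# Stationary AR(1): check Xbar -> a/(1-b) and that sample autocovariances
# approach b^|h| * Var[eps] / (1 - b^2).
rng = np.random.default_rng(1)
a, b, n = 2.0, 0.6, 200_000
mu = a / (1 - b)              # stationary mean
gamma0 = 1.0 / (1 - b**2)     # stationary variance with Var[eps] = 1

x = np.empty(n)
x[0] = mu + rng.normal(scale=np.sqrt(gamma0))   # stationary start
eps = rng.normal(size=n)
for t in range(1, n):
    x[t] = a + b * x[t - 1] + eps[t]

xbar = x.mean()
xc = x - xbar
gamma_hat = [xc[: n - h] @ xc[h:] / n for h in (0, 1, 2)]
print(xbar, mu)                                     # sample mean vs a/(1-b)
print(gamma_hat, [gamma0 * b**h for h in (0, 1, 2)])  # sample vs true autocov
```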

# Application: Not-necessarily-stationary AR(1)

• Back in Lecture 13, we saw that if $$a=0$$ and we use OLS, $\begin{eqnarray} \hat{b} & = & b + \frac{\sum_{t=0}^{n-1}{X(t)\epsilon(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & b + \frac{n^{-1}\sum_{t=0}^{n-1}{X(t)\epsilon(t+1)}}{n^{-1}\sum_{t=0}^{n-1}{X^2(t)}} \end{eqnarray}$
• But now the numerator $$\rightarrow \Expect{X(t) \epsilon(t+1)} = 0$$ by the (non-stationary) ergodic theorem
• because the summands are uncorrelated, e.g., $$\Cov{X(t) \epsilon(t+1), X(t+1)\epsilon(t+2)} = 0$$
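A quick simulation of this estimator (with an assumed value of $$b$$) shows $$\hat{b}$$ settling near the truth:

```python
import numpy as np

# OLS regression of X(t+1) on X(t) when a = 0: the "numerator" terms
# X(t)*eps(t+1) are uncorrelated and average to zero, so bhat -> b.
rng = np.random.default_rng(7)
b, n = 0.8, 100_000
x = np.empty(n)
x[0] = rng.normal(scale=1 / np.sqrt(1 - b**2))   # stationary start
for t in range(1, n):
    x[t] = b * x[t - 1] + rng.normal()

bhat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
print(bhat)   # should be close to 0.8
```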

# Application: AR(1)

• Objective function at finite $$n$$: $M_n(a,b) = \frac{1}{n}\sum_{t=1}^{n-1}{(X(t+1) - a - b X(t))^2}$
• Exercise (off-line): Assume stationarity and show that this goes to $m(a,b) = \Expect{(X(t+1) - a-bX(t))^2}$
• (Can you do this under non-stationarity?)
• So all our asymptotic analysis from last time applies
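As a partial numerical check (not a substitute for the exercise), evaluating $$M_n$$ at the true parameters of a simulated stationary AR(1) should give roughly $$m(a,b) = \Var{\epsilon}$$; the parameter values below are assumptions:

```python
import numpy as np

# Evaluate the objective M_n at the true (a, b); at the truth the residuals
# are exactly the eps(t), so M_n should approach Var[eps] = 1.
rng = np.random.default_rng(2)
a, b, n = 1.0, 0.5, 200_000
x = np.empty(n)
x[0] = a / (1 - b) + rng.normal(scale=1 / np.sqrt(1 - b**2))  # stationary start
for t in range(1, n):
    x[t] = a + b * x[t - 1] + rng.normal()

Mn = np.mean((x[1:] - a - b * x[:-1]) ** 2)   # objective at the true parameters
print(Mn)   # should be close to 1.0
```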

# Looking beyond the simplest ergodic theorem

• More general “mean-square” ergodic theorem for stationary processes: $$\overline{X}_n \rightarrow \Expect{X(1)}$$ iff $$n^{-1}\sum_{h=0}^{n}{\gamma(h)} \rightarrow 0$$
• “Individual” ergodic theorem: if $$X_1, \ldots X_t, \ldots$$ is strongly stationary, then for any $$k$$ and any function $$f(X(1), X(2), \ldots X(k))$$, $\Prob{\frac{1}{n}\sum_{t=1}^{n-k+1}{f(X(t), \ldots X(t+k-1))} \rightarrow \Expect{f(X(1), \ldots X(k))}} = 1$
• i.e., sample averages converge along (almost all) individual trajectories
• These also work for asymptotically stationary processes
• because the long-run limit is dominated by the stationary process we’re approaching

# Convergence of the log-likelihood

• Assume a pdf $$p(x(1), x(2), \ldots x(t))$$ generates $$X(1), X(2), \ldots X(t)$$
• Assume $$X$$ is stationary
• Then $\lim_{n\rightarrow\infty}{\frac{1}{n}\Expect{\log{p(X(1), \ldots X(n))}}} = \lambda$ exists, and $\frac{1}{n}\log{p(X(1), \ldots X(n))} \rightarrow \lambda$
• “Asymptotic equipartition” or “Shannon-McMillan-Breiman” property
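The simplest stationary case is i.i.d. data, where the property reduces to the law of large numbers; a sketch for $$N(0,1)$$ data, where $$\lambda = -\frac{1}{2}\log(2\pi) - \frac{1}{2}$$:

```python
import numpy as np

# AEP sketch for i.i.d. N(0,1): (1/n) log p(X(1),...,X(n)) -> lambda,
# with lambda = -0.5*log(2*pi) - 0.5 (negative differential entropy rate).
rng = np.random.default_rng(5)
lam = -0.5 * np.log(2 * np.pi) - 0.5
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    ll = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * x**2) / n
    print(n, ll, lam)   # ll should approach lam
```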

# Convergence of the log-likelihood (II)

• Assume $$X$$ is generated by a distribution $$p$$
• Consider a model pdf $$f(x(1), x(2), \ldots x(t); \theta)$$
• Then, under stationarity, $\begin{eqnarray} \lim_{n\rightarrow\infty}{\frac{1}{n}\Expect{\log{f(X(1), \ldots X(n);\theta)}}} & = & \lambda(\theta)\\ \frac{1}{n}\log{f(X(1), \ldots X(n); \theta)} & \rightarrow & \lambda(\theta) \end{eqnarray}$
• Some non-stationary extensions, especially if asymptotically stationary

# Central limit theorems and weak dependence

• Suppose $$(X(t-k), \ldots X(t-1))$$ and $$(X(t+h), \ldots X(t+h+k-1))$$ approach independence as $$h\rightarrow\infty$$
• Then we’ve got nearly independent “blocks” of length $$k$$
• $$\overline{X}_n$$ acts like average of some nearly-independent blocks, plus remainder
• Lets us transfer central limit theorem to dependent data
• Need to be precise about “approaching independence”
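An empirical glimpse of the dependent-data CLT (process and parameters are assumptions): for a stationary AR(1), $$\sqrt{n}\,\overline{X}_n$$ should look approximately $$\mathcal{N}(0, \gamma(0)\tau)$$:

```python
import numpy as np

# CLT under dependence: standardize sqrt(n)*Xbar by sqrt(gamma(0)*tau) and
# check it behaves like a standard normal (unit sd, ~95% within +/-1.96).
rng = np.random.default_rng(3)
b, n, reps = 0.5, 2000, 4000
gamma0 = 1 / (1 - b**2)
tau = 1 + 2 * b / (1 - b)

x = rng.normal(size=(reps, n))
x[:, 0] = rng.normal(scale=np.sqrt(gamma0), size=reps)   # stationary start
for t in range(1, n):
    x[:, t] += b * x[:, t - 1]

z = np.sqrt(n) * x.mean(axis=1) / np.sqrt(gamma0 * tau)
coverage = float(np.mean(np.abs(z) < 1.96))
print(z.std(), coverage)   # roughly 1 and roughly 0.95
```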

# Summary

• If correlations go to zero fast enough, $$\overline{X}_n$$ converges to the expectation value
• This is enough to have a lot of estimators converge
• And even to get asymptotic standard errors
• Effective sample size is reduced
• Maximum likelihood works very generally for parametric models
• Central limit theorem still holds under weak dependence / asymptotic independence