36-467/667

29 October 2020 (Lecture 18)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Expectwrt}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\InitDist}{\vec{p}_{\mathrm{init}}} \newcommand{\InvDist}{\vec{p}^{*}} \]

- Looked at how to simulate a generative model
- Saw how to use those simulations for inference
- Now introduce one of the most important kinds of generative models

- The **Markov property**: given the current **state** \(X(t)\), earlier states \(X(t-1), X(t-2), \ldots\) are irrelevant to the future states \(X(t+1), X(t+2), \ldots\)
- Equivalently: \[ \Prob{X(t+1)| X(t), X(t-1), X(t-2), \ldots X(0)} = \Prob{X(t+1)|X(t)} \]
- Equivalently: \[ \Prob{X(0), X(1), \ldots X(n)} = \Prob{X(0)} \Prob{X(1)|X(0)} \Prob{X(2)|X(1)} \ldots \Prob{X(n)|X(n-1)} \]
- This is an *assumption*, not a law of nature
- Some examples:
- AR(1)
- Random walk
- Deterministic dynamics + noise in the state

- Some non-examples come later

- Independent and identically distributed (IID)
- Technically Markov…
- …but even the current state doesn’t matter
- Nothing matters

- Deterministic dynamics
- Technically Markov…
- …but the current state fixes all future states *exactly*

- To generate data from / simulate a Markov process, we need to
- Draw the initial state \(X(0)\) from \(\Prob{X(0)}\)
- Draw \(X(t)\) from \(\Prob{X(t)|X(t-1)}\) — inherently sequential

- Generic code:
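The chain-specific code below calls a generic simulator `rmarkov`; a minimal sketch, following the two steps above:

```r
# Generic Markov-process simulator:
#   rinitial()     draws X(0)
#   rtransition(x) draws X(t) given X(t-1) = x
rmarkov <- function(n, rinitial, rtransition) {
  x <- vector(length = n)
  x[1] <- rinitial()
  for (t in 2:n) {
    x[t] <- rtransition(x[t - 1])  # inherently sequential
  }
  return(x)
}
```

For instance, `rmarkov(5, function() rnorm(1), function(x) 0.5 * x + rnorm(1))` simulates five steps of an AR(1).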

- A Markov **chain** is a Markov process where the states \(X(t)\) are discrete, not continuous
- Represent \(\Prob{X(t)|X(t-1)}\) in a **transition matrix**, \(\mathbf{q}_{ij} \equiv \Prob{X(t) = j| X(t-1)=i}\)
- \(\mathbf{q}\) is a **stochastic matrix**:
- All \(q_{ij} \geq 0\)
- For each row \(i\), \(\sum_{j}{q_{ij}} = 1\)

- Represent \(\Prob{X(0)}\) as a non-negative vector \(\InitDist\), summing to 1

- Example: a two-state chain with transition matrix \(\mathbf{q}=\left[\begin{array}{cc} 0.5 & 0.5 \\ 0.75 & 0.25 \end{array} \right]\)

```
rmarkovchain <- function(n, p0, q) {
  k <- length(p0)
  # Check that q is a k x k stochastic matrix
  stopifnot(k == nrow(q), k == ncol(q), all.equal(rowSums(q), rep(1, times = k)))
  rinitial <- function() { sample(1:k, size = 1, prob = p0) }
  rtransition <- function(x) { sample(1:k, size = 1, prob = q[x, ]) }
  return(rmarkov(n, rinitial, rtransition))
}
```

- It runs:

```
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
x <- rmarkovchain(1e4,c(0.5,0.5),q)
head(x)
```

`## [1] 1 1 2 2 1 2`

- How do we know it works?

```
ones <- which(x[-1e4]==1) # Why omit the last step?
twos <- which(x[-1e4]==2)
signif(table(x[ones+1])/length(ones),3)
```

```
##
## 1 2
## 0.504 0.496
```

```
signif(table(x[twos+1])/length(twos),3)
```

```
##
## 1 2
## 0.758 0.242
```

vs. \((0.5,0.5)\) and \((0.75, 0.25)\) ideally

- For each \(t\) where \(X(t) =1\), the next state \(X(t+1)\) is drawn IIDly
- So law of large numbers holds for \(X(t+1)\) conditional on \(X(t)=1\)
- and similarly conditional on \(X(t)=2\)

- So sample conditional frequencies should converge on conditional probabilities

- You could try to write \[
X(t+1) = f(X(t)) + \epsilon(t+1)
\] but it’s generally nasty
- Very non-linear \(f\)
- Distribution of \(\epsilon(t+1)\) depends on \(X(t)\)

- Alternately write \(X(t+1) = g(X(t), \epsilon(t+1))\) but then \(g\) is even more complicated

- State \(i\) and state \(j\) **communicate** if the chain can go from \(i\) to \(j\) and vice versa
- A set of states communicates if every pair of states in it communicates
- A maximal communicating set is **irreducible**
- \(=\) a **strongly connected component** in the transition graph
- \(\approx\) an **attractor** of a deterministic dynamical system

- A set is **closed** if the probability of leaving it is 0
- Finite chains break up into closed, irreducible sets + transient states
- \(\Rightarrow\) the long run is always in some (closed) irreducible set or other

From the Markov property: \[\begin{eqnarray} \Prob{X(1)} & = & \InitDist \mathbf{q}\\ \Prob{X(2)} & = & \Prob{X(1)} \mathbf{q} = \InitDist \mathbf{q}^2\\ \Prob{X(t)} & = & \Prob{X(t-1)} \mathbf{q} = \InitDist \mathbf{q}^{t} \end{eqnarray}\]

The *distribution* always evolves *linearly*
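Concretely, for the two-state chain above, iterating \(\Prob{X(t)} = \Prob{X(t-1)} \mathbf{q}\) shows the distribution settling down within a few steps (a quick sketch):

```r
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
p <- c(0.5, 0.5)   # initial distribution
for (t in 1:10) {
  p <- p %*% q     # Prob(X(t)) = Prob(X(t-1)) %*% q
  print(p)         # rapidly approaches (0.6, 0.4)
}
```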

- \(\mathbf{q}\) has eigenvalues \(\lambda_1, \ldots \lambda_k\) and *left* eigenvectors \(\vec{v}_1, \ldots \vec{v}_k\)
- Decompose the initial distribution in terms of the eigenvectors: \[ \InitDist = \sum_{j=1}^{k}{c_j \vec{v}_j} \]
- So \[ \InitDist \mathbf{q} = \sum_{j=1}^{k}{\lambda_j c_j \vec{v}_j}, \quad \text{and, iterating,} \quad \InitDist \mathbf{q}^t = \sum_{j=1}^{k}{\lambda_j^t c_j \vec{v}_j} \]

- All \(|\lambda_j| \leq 1\)
- There is always at least one \(\lambda_j = 1\)
- The eigenvectors which go with the \(\lambda=1\) eigenvalues have only non-negative entries
- so those eigenvectors can be scaled to probability distributions
- One \(\lambda_j=1\) eigenvalue per irreducible set
- The eigenvector is \(> 0\) on that set, \(=0\) off of it

- The chain is **periodic** if all paths back to a given state have lengths that are multiples of some common period \(> 1\) (e.g., 2)
- Periodicity \(\Leftrightarrow\) some eigenvalues \(\lambda \neq 1\) but \(|\lambda|=1\)

- Otherwise the chain is **aperiodic**

- Suppose the chain is aperiodic, with a single closed irreducible set; then:
- \(\lambda_1=1\), and the corresponding \(\vec{v}_1 > 0\) everywhere
- Can normalize \(\vec{v}_1\) so \(\sum_{i}{v_{1i}} = 1\)
- All other \(|\lambda_j| < 1\)

**Exercise**: Show that, as \(t\rightarrow\infty\), \[ \InitDist \mathbf{q}^t \rightarrow \vec{v}_1 \] for all \(\InitDist\) (which are non-negative and sum to 1)

- We can say a bit more: with \(\mathbf{v}\) the matrix whose rows are the left eigenvectors \(\vec{v}_j\), \[
\mathbf{q}^{t} \rightarrow \mathbf{v}^{-1} \mathrm{diag}(1, 0, \ldots 0) \mathbf{v} = \left[ \begin{array}{c} \vec{v}_1\\ \vec{v}_1\\ \vdots \\ \vec{v}_1 \end{array} \right]
\]
- i.e., each row converges to \(\vec{v}_1\)
- To see this, use the eigendecomposition \(\mathbf{q}^t = \mathbf{v}^{-1} \mathrm{diag}(\lambda_1^t, \ldots \lambda_k^t) \mathbf{v}\)

- This is the basic ergodic theorem for Markov chains

- The (left) eigenvector of \(\mathbf{q}\) with eigenvalue 1 is the **invariant** distribution: \[ \vec{v}_1 \mathbf{q} = \vec{v}_1 \]
- a.k.a. the **equilibrium** distribution
- This is important enough to deserve a special symbol, \(\InvDist\)
- If \(X(t) \sim \InvDist\), then \((X(t), X(t+1)) \sim \InvDist \otimes \mathbf{q}\)
- \(=\) invariant distribution over pairs
- define invariant distribution over longer blocks similarly

- If \(\InitDist=\InvDist\), then the process is (strictly) stationary

- Because \(\Prob{X(h)} \rightarrow \InvDist\) as \(h\rightarrow\infty\), \(X(t)\) and \(X(t+h)\) become independent as \(h\rightarrow \infty\)
- \(\Prob{X(t+h)|X(t)=x} \rightarrow \InvDist\) regardless of \(x\)
- Correlations actually decay exponentially in \(h\)

- \(\therefore\) there’s an ergodic theorem: \[ \frac{1}{n}\sum_{t=1}^{n}{f(X(t))} \rightarrow \sum_{i=1}^{k}{f(i) \InvDist_i} = \Expectwrt{X \sim \InvDist}{f(X)} \]
- Similarly for functions of pairs: \[ \frac{1}{n-1}\sum_{t=1}^{n-1}{g(X(t), X(t+1))} \rightarrow \sum_{i=1}^{k}{\sum_{j=1}^{k}{g(i,j) \InvDist_i \mathbf{q}_{ij}}} = \Expectwrt{(X, X^{\prime}) \sim \InvDist \otimes \mathbf{q}}{g(X, X^{\prime})} \]

- Define \(N_i \equiv\) number of times \(X(t)=i\)
- Define \(N_{ij} \equiv\) number of times \(X(t)=i, X(t+1)=j\)
- Then \[\begin{eqnarray} \frac{N_i}{n} & \rightarrow & \InvDist_i\\ \frac{N_{ij}}{n} & \rightarrow & \InvDist_i \mathbf{q}_{ij}\\ \frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} & \rightarrow & \frac{\InvDist_i \mathbf{q}_{ij}}{\InvDist_i} = \mathbf{q}_{ij} \end{eqnarray}\]
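The output below presumably came from code along these lines: the *left* eigenvectors of \(\mathbf{q}\) are the ordinary eigenvectors of \(\mathbf{q}^T\), and R's `eigen` lists eigenvalues in order of decreasing modulus, so \(\lambda=1\) comes first (a sketch):

```r
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
e <- eigen(t(q))   # left eigenvectors of q = eigenvectors of t(q)
e$values           # 1.00 and -0.25
v1 <- e$vectors[, 1]
v1 / sum(v1)       # rescale to a probability distribution: 0.6 0.4
```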

```
## eigen() decomposition
## $values
## [1] 1.00 -0.25
##
## $vectors
## [,1] [,2]
## [1,] 0.8320503 -0.7071068
## [2,] 0.5547002 0.7071068
```

`## [1] 0.6 0.4`

- State counts from repeated \(10^4\)-step simulation runs, each close to the ideal \((6000, 4000)\):

```
##
## 1 2
## 5983 4017
```

```
##
## 1 2
## 6038 3962
```

```
##
## 1 2
## 5956 4044
```

```
##
## 1 2
## 5987 4013
```
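These limits give the natural estimator of \(\mathbf{q}\): count transitions and row-normalize, \(\hat{\mathbf{q}}_{ij} = N_{ij}/N_i\). A self-contained sketch (re-simulating the two-state chain inline):

```r
set.seed(42)
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
n <- 1e4
x <- numeric(n)
x[1] <- sample(1:2, size = 1, prob = c(0.5, 0.5))
for (t in 2:n) {
  x[t] <- sample(1:2, size = 1, prob = q[x[t - 1], ])
}
N <- table(x[-n], x[-1])   # N[i, j] = number of observed i -> j transitions
qhat <- N / rowSums(N)     # divide each row by its total: N_ij / N_i
signif(qhat, 3)            # close to q
```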

- Because \(X(t)\) and \(X(t+h)\) become independent as \(h\rightarrow\infty\), the time average is \(\approx\) an average of IID blocks
- CLT would apply to an average of IID blocks
- For rigor, need to check that \(\approx\) becomes strict as \(n\rightarrow\infty\)

- Usually works in practice.

- For a finite chain, eventually \(X(t)\) enters an irreducible set, and never leaves
- \(\Rightarrow\) Long-run behavior of \(X(t)\) looks like it was *always* in that set
- but *which* irreducible set is random
- \(\Rightarrow\) Converges, but to a random limit

- With an infinite state space, there can still be irreducible sets (smaller than the whole space)
- If \(X(t)\) enters an irreducible set, we have nice ergodic behavior
- Might wander off to infinity, though
- Example: Random walk

- With a continuous state space, again, we can have irreducible sets
- Again, nice ergodic behavior within an irreducible set
- Generally need a lot more math to handle this…

- Markov process: condition \(X(t+1)\) on \(X(t)\) only
- \(p^{\mathrm{th}}\) order Markov process: condition \(X(t+1)\) on \(X(t-p+1), \ldots X(t-1), X(t)\)
- Example: AR(\(p\)) model

- A higher-order Markov process is a first-order Markov process on *blocks*
- How long are the blocks for a \(p^{\mathrm{th}}\) order process?
- Will revisit when looking at estimation
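As an illustration (a sketch, not from the lecture): an AR(2) is second-order Markov in \(X(t)\), but first-order Markov in the length-2 block \((X(t-1), X(t))\):

```r
# Simulate an AR(2), X(t) = a1*X(t-1) + a2*X(t-2) + noise, by carrying
# the block (X(t-2), X(t-1)) as a single first-order Markov state
rar2 <- function(n, a1 = 0.5, a2 = 0.3) {
  block <- c(0, 0)   # (X(t-2), X(t-1))
  x <- numeric(n)
  for (t in 1:n) {
    x[t] <- a1 * block[2] + a2 * block[1] + rnorm(1)
    block <- c(block[2], x[t])   # slide the block forward one step
  }
  return(x)
}
```

So the blocks for a \(p^{\mathrm{th}}\)-order process have length \(p\).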

- Jason Capehart, Seung Su Han, Alexander Murray-Watters, and Elizabeth Silver in Statistical Computing (36-350) in 2011

- Interacting or coupled Markov processes: transition rates for process 1 depend on its state *and* the state of process 2
- Continuous time: stay in a state for a random, exponentially-distributed time, then take a chain step
- Semi-Markov chain: like a continuous-time Markov chain, but with non-exponential holding times
- Hidden Markov model (HMM): \(S(t)\) is Markov, but we see \(X(t) = f(S(t)) + \mathrm{noise}\)
- \(X(t)\) itself isn’t Markov, but HMMs *are* widely applied in biology, economics, geoscience, signal processing, predictive text…
- Hidden semi-Markov models (Hoek and Elliott 2018), etc., etc.

- Markov chains are the most basic non-trivial stochastic process
- Individual trajectories are random and erratic and possibly very nonlinear
- Distributions evolve deterministically, according to a linear operator
- Invariant distributions \(=\) leading eigenvectors

- With *finite* state spaces:
- Each individual trajectory eventually enters an irreducible set, and wanders ergodically around it
- One invariant distribution per irreducible set
- Long-run averages look like expectations over the invariant distribution

- With *infinite* state spaces: usually similar, but sometimes wandering off to infinity
- Coming up: estimation and testing with Markov models

- Early 19th century: administrations of nation-states begin collecting, and publishing, statistics on mass social phenomena
- “Vital statistics”: Birth rates, death rates, marriage rates, numbers of children…
- Accidents: the stability of the number of pieces of undeliverable mail each year in Paris was thought impressive
- Crimes: theft, murder, suicide

- Mid-19th century: pioneering social scientists realize the rates of these phenomena are pretty stable
- … Even of things that arise from really profound moral choices, like murder and suicide
- … which is what you’d expect from the law of large numbers if there was just some constant probability
- … so they concluded that free will does not exist (“Society itself prepares the crime”)

- See Hacking (1990), Porter (1986) for this history
- See Cassirer (1944) for how some people realized at the time that this was crazy-bad philosophy

- P. A. Nekrasov: 19th century Russian mathematician and theologian
- Claimed law of large numbers requires independent events
- But human actions are dependent
- \(\therefore\) Law of large numbers does not apply
- \(\therefore\) Free will exists and Russian Orthodox Christianity is exactly true
- This was not even close to the strangest idea entertained by the Moscow math department (Graham and Kantor 2009)

- A. A. Markov: Russian mathematician and vocal atheist
- Leading probability theorist and a very careful mathematician
- Viewed Nekrasov as his arch-nemesis
- Read a 1902 paper by Nekrasov and got enraged
- Invented the theory of Markov chains to show LLN and CLT can hold under dependence (first published 1906), essentially to refute Nekrasov
- See Basharin, Langville, and Naumov (2004) for Markov’s life and this history

- Markov processes are the outcome of a 19th-century flamewar between determinist statisticians and mystic mathematicians

- Suppose we want to sample from a pdf or pmf \(p\)
- Exercise: Show that if **detailed balance** \(q(y|x)p(x) = q(x|y)p(y)\) holds, then \(p\) is the invariant distribution of the chain with transition rates \(q\)
- In **Markov chain Monte Carlo** (MCMC)…
- we make a Markov chain with transition rates that obey this equation
- simulate a long trajectory from this chain
- and use the simulation as an (approximate) sample from \(p\)

- Practical MCMC is based on the **Metropolis algorithm** (Metropolis et al. 1953)

1. Set \(X(0)\) however we like, and initialize \(t \leftarrow 0\).
2. Draw a **proposal** \(Z(t)\) from some conditional distribution \(r(\cdot|X(t))\): for instance a Gaussian or uniform centered on \(X(t)\), or anything else that is easy to draw from, symmetric (\(r(z|x) = r(x|z)\)), and with smooth enough noise.
3. Draw \(U(t)\) independently from a uniform distribution on \([0,1)\).
4. If \(U(t) < p(Z(t))/p(X(t))\), then \(X(t+1) = Z(t)\); otherwise \(X(t+1)=X(t)\).
5. Increase \(t\) by 1 and go to step 2.
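The steps above translate directly into R; in this sketch (an illustration, not from the lecture) the target \(p\) is a standard Gaussian density and the proposals are a symmetric Gaussian random walk:

```r
# Metropolis sampler: target density p, Gaussian random-walk proposals
metropolis <- function(n, p, x0 = 0, scale = 1) {
  x <- numeric(n)
  x[1] <- x0                           # step 1: set X(0)
  for (t in 1:(n - 1)) {
    z <- x[t] + rnorm(1, sd = scale)   # step 2: propose Z(t)
    u <- runif(1)                      # step 3: U(t) ~ Unif[0,1)
    x[t + 1] <- if (u < p(z) / p(x[t])) z else x[t]   # step 4: accept/reject
  }
  return(x)
}

x <- metropolis(1e4, dnorm)
c(mean(x), var(x))   # should be near 0 and 1 for a standard Gaussian target
```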

- Obeys detailed balance so it has the right invariant distribution

- Handouts on class website, excerpt from Guttorp (1995) on Canvas
- Grimmett and Stirzaker (1992) is really good on Markov chains, and more general Markov processes
- On invariant distributions in continuous state spaces, the standard reference is Meyn and Tweedie (1993), but Lasota and Mackey (1994) is also good, and Mackey (1992) is surprisingly readable and relevant

Basharin, Gely P., Amy N. Langville, and Valeriy A. Naumov. 2004. “The Life and Work of A. A. Markov.” *Linear Algebra and Its Applications* 386:3–26. http://langvillea.people.cofc.edu/MarkovReprint.pdf.

Cassirer, Ernst. 1944. *An Essay on Man: An Introduction to a Philosophy of Human Culture*. New Haven, Connecticut: Yale University Press.

Graham, Loren, and Jean-Michel Kantor. 2009. *Naming Infinity: A True Story of Religious Mysticism and Mathematical Creativity*. Cambridge, Massachusetts: Harvard University Press.

Grimmett, G. R., and D. R. Stirzaker. 1992. *Probability and Random Processes*. 2nd ed. Oxford: Oxford University Press.

Guttorp, Peter. 1995. *Stochastic Modeling of Scientific Data*. London: Chapman & Hall.

Hacking, Ian. 1990. *The Taming of Chance*. Cambridge, England: Cambridge University Press.

Hoek, John van der, and Robert J. Elliott. 2018. *Introduction to Hidden Semi-Markov Models*. Cambridge, England: Cambridge University Press.

Lasota, Andrzej, and Michael C. Mackey. 1994. *Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics*. Berlin: Springer-Verlag.

Mackey, Michael C. 1992. *Time’s Arrow: The Origins of Thermodynamic Behavior*. Berlin: Springer-Verlag.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” *Journal of Chemical Physics* 21:1087–92. https://doi.org/10.1063/1.1699114.

Meyn, S. P., and R. L. Tweedie. 1993. *Markov Chains and Stochastic Stability*. Berlin: Springer-Verlag.

Porter, Theodore M. 1986. *The Rise of Statistical Thinking, 1820–1900*. Princeton, New Jersey: Princeton University Press.