---
title: Markov Chains I
author: 36-467/36-667
date: 13 November 2018
output: slidy_presentation
bibliography: locusts.bib
---
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\Expectwrt}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\InitDist}{p_{\mathrm{init}}}
\newcommand{\InvDist}{p^{*}}
\]
```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white",
               cache=TRUE, autodep=TRUE,
               tidy=FALSE, warning=FALSE, message=FALSE,
               echo=TRUE)
```
## In our previous episodes...
- Looked at how to simulate a generative model
- Saw how to use those simulations for inference
- Now introduce one of the most important kinds of generative models
## Markov Processes
- The **Markov property**: Given the current **state** $X(t)$, earlier states
$X(t-1), X(t-2), \ldots$ are irrelevant to the future states $X(t+1), X(t+2), \ldots$
- Equivalently:
\[
\Prob{X(t+1)| X(t), X(t-1), X(t-2), \ldots X(0)} = \Prob{X(t+1)|X(t)}
\]
- Equivalently:
\[
\Prob{X(0), X(1), \ldots X(n)} = \Prob{X(0)} \Prob{X(1)|X(0)} \Prob{X(2)|X(1)} \ldots \Prob{X(n)|X(n-1)}
\]
- This is an _assumption_, not a law of nature
- Some examples:
+ AR(1)
+ Random walk
+ Deterministic dynamics + noise in the state
- Some non-examples come later
## The Markov property between two extremes
- Independent and identically distributed (IID)
+ Technically Markov...
+ ...but even the current state doesn't matter
+ Nothing matters
- Deterministic dynamics
+ Technically Markov...
+ ...but current state fixes all future states _exactly_
## Markov processes are generative models
- To generate data from / simulate a Markov process, we need to
+ Draw the initial state $X(0)$ from $\Prob{X(0)}$
+ Draw $X(t)$ from $\Prob{X(t)|X(t-1)}$ --- inherently sequential
- Generic code:
```{r}
rmarkov <- function(n, rinitial, rtransition) {
  x <- vector(length=n)
  x[1] <- rinitial()             # draw X(0) from the initial distribution
  for (t in 2:n) {
    x[t] <- rtransition(x[t-1])  # draw X(t) given X(t-1)
  }
  return(x)
}
```
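- For instance, a Gaussian random walk is a Markov process, so it fits this interface directly (a minimal, illustrative sketch; the random walk is not the running example of these slides):
```{r}
# Illustrative example: simulate a Gaussian random walk with rmarkov()
rw <- rmarkov(n=100,
              rinitial=function() { rnorm(1) },
              rtransition=function(x) { x + rnorm(1) })
plot(rw, type="l", xlab="t", ylab="X(t)")
```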
## Markov Chains
- A Markov **chain** is a Markov process where the states $X(t)$ are discrete, not continuous
- Represent $\Prob{X(t)|X(t-1)}$ in a **transition matrix**, $\mathbf{q}_{ij} \equiv \Prob{X(t) = j| X(t-1)=i}$
- $\mathbf{q}$ is a **stochastic matrix**:
+ All $q_{ij} \geq 0$
+ for each row $i$, $\sum_{j}{q_{ij}} = 1$
- Represent $\Prob{X(0)}$ as a non-negative vector $\InitDist$, summing to 1
## Graph vs. matrix
![](binary-graph.jpg) $\Leftrightarrow \mathbf{q}=\left[\begin{array}{cc} 0.5 & 0.5 \\ 0.75 & 0.25 \end{array} \right]$
## Your Basic Markov Chain
```{r}
rmarkovchain <- function(n, p0, q) {
  k <- length(p0)
  # sanity checks: q must be a k-by-k stochastic matrix
  stopifnot(k==nrow(q), k==ncol(q), all.equal(rowSums(q), rep(1, times=k)))
  rinitial <- function() { sample(1:k, size=1, prob=p0) }
  rtransition <- function(x) { sample(1:k, size=1, prob=q[x,]) }
  return(rmarkov(n, rinitial, rtransition))
}
```
- It runs:
```{r}
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
x <- rmarkovchain(1e4,c(0.5,0.5),q)
head(x)
```
- How do we know it works?
## Checking that it works
```{r}
ones <- which(x[-1e4]==1) # Why omit the last step?
twos <- which(x[-1e4]==2)
signif(table(x[ones+1])/length(ones),3)
signif(table(x[twos+1])/length(twos),3)
```
vs.\ $(0.5,0.5)$ and $(0.75, 0.25)$ ideally
## Why is this a check?
- For each $t$ where $X(t)=1$, the next state $X(t+1)$ is an independent draw from the same distribution (row 1 of $\mathbf{q}$)
- So law of large numbers holds for $X(t+1)$ conditional on $X(t)=1$
+ and conditional on $X(t)=2$
- So sample conditional frequencies should converge on conditional probabilities
## How trajectories evolve
- You could try to write
\[
X(t+1) = f(X(t)) + \epsilon(t+1)
\]
but it's generally nasty
+ Very non-linear $f$
+ Distribution of $\epsilon$ depends on $X(t)$
+ Alternatively, write $X(t+1) = g(X(t), \epsilon(t+1))$, but then $g$ is even more complicated
## Where trajectories end in the long run
- State $i$ and state $j$ **communicate** if the chain can go from $i$ to $j$ and vice versa
- A set of states communicates if every pair of states in it communicates
- A maximal communicating set is **irreducible**
+ $=$ a **strongly connected component** in the graph
+ $\approx$ an **attractor** of a deterministic dynamical system
- A set is **closed** if the probability of leaving it is 0
- Finite chains break up into closed irreducible sets + transient states
+ $\Rightarrow$ the long run is always in some (closed) irreducible set or other
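- A sketch of the decomposition in action, with a hypothetical 3-state chain: state 1 is transient, states 2 and 3 form a closed irreducible set
```{r}
# Hypothetical 3-state chain: state 1 can be left but never re-entered
q.red <- matrix(c(0.5, 0.25, 0.25,
                  0,   0.50, 0.50,
                  0,   0.75, 0.25), byrow=TRUE, nrow=3)
x.red <- rmarkovchain(1e4, c(1, 0, 0), q.red)
max(which(x.red == 1))  # last visit to the transient state (typically very early)
```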
## How distributions evolve
- From the Markov property:
\begin{eqnarray}
\Prob{X(1)} & = & \InitDist \mathbf{q}\\
\Prob{X(2)} & = & \Prob{X(1)} \mathbf{q} = \InitDist \mathbf{q}^2\\
\Prob{X(t)} & = & \Prob{X(t-1)} \mathbf{q} = \InitDist \mathbf{q}^{t}
\end{eqnarray}
- The _distribution_ always evolves _linearly_
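- As a quick numerical check (a sketch reusing the 2-state $\mathbf{q}$ defined earlier), start with all the probability on state 1 and iterate:
```{r}
# Evolve the distribution: Prob{X(t)} = Prob{X(t-1)} q
p <- c(1, 0)  # p_init: start in state 1 for sure
for (t in 1:5) {
  p <- p %*% q
  print(p)
}
```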
## More fun with eigenvalues and eigenvectors
- $\mathbf{q}$ has eigenvalues $\lambda_1, \ldots \lambda_k$ and _left_ eigenvectors $v_1, \ldots v_k$
- Decompose the initial distribution in terms of the eigenvectors:
\[
\InitDist = \sum_{j=1}^{k}{c_j v_j}
\]
- So
\[
\InitDist \mathbf{q} = \sum_{j=1}^{k}{\lambda_j c_j v_j}
\]
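- Numerically (a sketch, again using the 2-state $\mathbf{q}$ from earlier):
```{r}
# Decompose p_init in the left-eigenvector basis, then reconstruct p_init q
e <- eigen(t(q))                          # columns = left eigenvectors of q
c.coef <- solve(e$vectors, c(0.5, 0.5))   # coordinates c_j of p_init
drop(e$vectors %*% (e$values * c.coef))   # sum_j lambda_j c_j v_j ...
c(0.5, 0.5) %*% q                         # ... should match p_init q
```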
## Special properties of stochastic matrices
- All $|\lambda_j| \leq 1$
- There is always at least one $\lambda_j = 1$
- The eigenvectors which go with those eigenvalues have non-negative entries
- so those eigenvectors can be scaled to probability distributions
- One $\lambda_j=1$ eigenvalue per irreducible set
+ The eigenvector is $> 0$ on that set, $=0$ off of it
- The chain is **periodic** if every path from a state back to itself has a length that is a multiple of some integer $> 1$ (e.g., 2)
+ Periodicity $\Leftrightarrow$ some other eigenvalues also have $|\lambda|=1$
- Otherwise the chain is **aperiodic**
## Irreducible, aperiodic Markov chains
- $\lambda_1=1$, corresponding $v_1 > 0$ everywhere
+ Can normalize $v_1$ so $\sum_{i}{v_{1i}} = 1$
- All other $|\lambda_j| < 1$
- **Exercise**: Show that, as $t\rightarrow\infty$,
\[
\InitDist \mathbf{q}^t \rightarrow v_1
\]
for all $\InitDist$ (which are non-negative and sum to 1)
## Solution
- Because $|\lambda_j| < 1$ if $j \geq 2$ but $\lambda_1=1$,
\[
\Prob{X(t)} = \sum_{j=1}^{k}{c_j \lambda_j^t v_j} \rightarrow c_1 v_1
\]
- Probabilities must add to 1, and this fixes $c_1$:
\begin{eqnarray}
\sum_{i=1}^{k}{\Prob{X(t)=i}} & = & 1\\
\sum_{i=1}^{k}{\sum_{j=1}^{k}{c_j \lambda_j^t v_{ji}}} & = & 1\\
\sum_{i=1}^{k}{c_1 v_{1i}} & = & 1 ~ \text{(limit of both sides as} ~ t\rightarrow\infty)\\
c_1\sum_{i=1}^{k}{v_{1i}} & = & 1\\
c_1 & = & 1
\end{eqnarray}
## Irreducible, aperiodic Markov chains
- We can say a bit more:
\[
\mathbf{q}^{t} \rightarrow \mathbf{v}^{-1} \mathrm{diag}(1, 0, \ldots 0) \mathbf{v} = \left[ \begin{array}{c} v_1\\ v_1\\ \vdots \\ v_1 \end{array} \right]
\]
+ i.e., each row converges to $v_1$ (the rows of the matrix $\mathbf{v}$ here are the left eigenvectors $v_j$)
+ Use the eigendecomposition $\mathbf{q} = \mathbf{v}^{-1} \mathrm{diag}(\lambda_1, \ldots \lambda_k) \mathbf{v}$, so $\mathbf{q}^t = \mathbf{v}^{-1} \mathrm{diag}(\lambda_1^t, \ldots \lambda_k^t) \mathbf{v}$
- This is the basic ergodic theorem for Markov chains
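- A numerical sketch of the theorem: raise the 2-state $\mathbf{q}$ from earlier to a high power and look at the rows
```{r}
# Both rows of q^t should approach v_1 = (0.6, 0.4)
qt <- q
for (i in 1:50) { qt <- qt %*% q }
qt
```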
## Invariant distributions
- The (left) eigenvector of $\mathbf{q}$ with eigenvalue 1 is the **invariant** distribution:
\[
v_1 \mathbf{q} = v_1
\]
+ a.k.a. **equilibrium** distribution
- This is important enough to deserve a special symbol, $\InvDist$
- If $X(t) \sim \InvDist$, then $(X(t), X(t+1)) \sim \InvDist \otimes \mathbf{q}$
+ $=$ invariant distribution over pairs
+ define invariant distribution over longer blocks similarly
- If $\InitDist=\InvDist$, then the process is (strictly) stationary
## Ergodicity and Weak Dependence
- Because $\Prob{X(h)} \rightarrow \InvDist$ as $h\rightarrow\infty$,
$X(t)$ and $X(t+h)$ become independent as $h\rightarrow \infty$
+ $\Prob{X(t+h)|X(t)=x} \rightarrow \InvDist$ regardless of $x$
+ Correlations actually decay exponentially in $h$
- $\therefore$ there's an ergodic theorem:
\[
\frac{1}{n}\sum_{t=1}^{n}{f(X(t))} \rightarrow \sum_{i=1}^{k}{f(i) \InvDist_i} = \Expectwrt{X \sim \InvDist}{f(X)}
\]
- Similarly for functions of pairs:
\[
\frac{1}{n-1}\sum_{t=1}^{n-1}{g(X(t), X(t+1))} \rightarrow \sum_{i=1}^{k}{\sum_{j=1}^{k}{g(i,j) \InvDist_i \mathbf{q}_{ij}}} = \Expectwrt{(X, X^{\prime}) \sim \InvDist \otimes \mathbf{q}}{g(X, X^{\prime})}
\]
## Sample frequencies vs.\ probabilities {.smaller}
- Define $N_i \equiv$ number of times $X(t)=i$
- Define $N_{ij} \equiv$ number of times $X(t)=i, X(t+1)=j$
- Then
\begin{eqnarray}
\frac{N_i}{n} & \rightarrow & \InvDist_i\\
\frac{N_{ij}}{n} & \rightarrow & \InvDist_i \mathbf{q}_{ij}\\
\frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} & \rightarrow & \frac{\InvDist_i \mathbf{q}_{ij}}{\InvDist_i} = \mathbf{q}_{ij}
\end{eqnarray}
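- In code (a sketch, reusing the trajectory `x` simulated earlier):
```{r}
# Estimate q by the transition frequencies N_ij / N_i
N <- table(x[-length(x)], x[-1])    # N_ij: times X(t)=i is followed by X(t+1)=j
signif(prop.table(N, margin=1), 3)  # row-normalized: should be close to q
```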
## {.smaller}
```{r}
eigen(t(q))  # left eigenvectors of q are right eigenvectors of t(q)
eigen(t(q))$vectors[,1]/sum(eigen(t(q))$vectors[,1])  # rescale to get p*
```
## {.smaller}
```{r}
table(rmarkovchain(1e4,c(0.5,0.5),q))
table(rmarkovchain(1e4,c(0.5,0.5),q))
table(rmarkovchain(1e4,c(0,1),q))
table(rmarkovchain(1e4,c(1,0),q))
```
## Central limit theorem
- Because $X(t)$ and $X(t+h)$ become independent as $h\rightarrow\infty$,
the time average is $\approx$ an average of IID blocks
+ CLT would apply to an average of IID blocks
+ For rigor, need to check that $\approx$ becomes strict as $n\rightarrow\infty$
- Usually works in practice:
```{r}
time.avgs <- replicate(100, mean(rmarkovchain(1e4, c(0.5, 0.5), q)))
qqnorm(time.avgs); qqline(time.avgs)
```
## What if there's more than one irreducible set?
- For a finite chain, eventually $X(t)$ enters an irreducible set, and never
leaves
- $\Rightarrow$ Long-run behavior of $X(t)$ looks like it was _always_ in that set
- but _which_ irreducible set is random
- $\Rightarrow$ Converge, but to a random limit
## What if the state space isn't finite?
- There can still be irreducible sets (smaller than the whole space)
- If $X(t)$ enters an irreducible set, we have nice ergodic behavior
- Might wander off to infinity, though
+ Example: Random walk
## What if the state space is continuous?
- Again, we can have irreducible sets
- Again, nice ergodic behavior within an irreducible set
- Generally need a lot more math to handle this...
## Variations on the theme
- Higher-order Markov process: Need to condition on $X(t-p+1), \ldots X(t-1), X(t)$
+ $=$ A Markov process on blocks
+ AR$(p)$ is a $p^{\mathrm{th}}$-order Markov process
- Interacting/coupled Markov processes: transition rates for process 1 depend on its state _and_ the state of process 2
- Continuous time: stay in state for a random time, with exponential distribution, then take a chain step
- Semi-Markov chain: Like a CTMC, but non-exponential holding times
- Hidden Markov Model (HMM): $S(t)$ is Markov, but we see $X(t) = f(S(t)) + \mathrm{noise}$
+ $X(t)$ by itself isn't Markov
+ MA$(q)$ is actually an HMM!
## Summary
- Markov chains are the most basic non-trivial stochastic processes
- Individual trajectories are random and erratic
- Distributions evolve according to a linear operator
+ Invariant distributions $=$ leading eigenvectors
- With _finite_ state spaces:
+ Each individual trajectory eventually enters an irreducible set, and wanders ergodically around it
+ One invariant distribution per irreducible set
+ Long-run averages look like expectations over the invariant distribution
## Backup: Further reading
- Handout will be posted to the class website
- @Grimmett-Stirzaker is really good on Markov chains, and more general Markov processes
- On invariant distributions in continuous state spaces, the standard reference is @Meyn-Tweedie-1, but @Lasota-Mackey is also good, and @Mackey-times-arrow is surprisingly readable and relevant
## Backup: The philosophical origins of the Markov property {.smaller}
- Early 19th century: administrations of nation-states begin collecting, and publishing, statistics on mass social phenomena
+ "Vital statistics": Birth rates, death rates, marriage rates, numbers of children...
+ Accidents: the number of pieces of undeliverable mail each year in Paris was thought impressive
+ Crimes: theft, murder, suicide
- Mid-19th century: pioneering social scientists realize the rates of these phenomena are pretty stable
+ ... Even of things that arise from really profound moral choices, like murder and suicide
+ ... which is what you'd expect from the law of large numbers if there were just some constant probability
+ ... so they concluded that free will does not exist (["Society itself prepares the crime"](http://www-history.mcs.st-and.ac.uk/Extras/Quetelet_crime.html))
- See @Hacking-chance, @Porter-rise for this history
+ See @Cassirer-on-man for how some people realized at the time that this was crazy-bad philosophy
## Backup: The philosophical origins of the Markov property
- P. A. Nekrasov: 19th century Russian mathematician and theologian
+ Claimed law of large numbers requires independent events
+ But human actions are dependent
+ $\therefore$ Law of large numbers does not apply
+ $\therefore$ Free will exists and Russian Orthodox Christianity is exactly true
+ This was not even close to the strangest idea entertained by the Moscow math department [@Naming-infinity]
## Backup: The philosophical origins of the Markov property
![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/AAMarkov.jpg/369px-AAMarkov.jpg)
- A. A. Markov: Russian mathematician and vocal atheist
+ Leading probability theorist and a very careful mathematician
+ Viewed Nekrasov as his arch-nemesis
+ Read a 1902 paper by Nekrasov and got enraged
+ Invented the theory of Markov chains to show LLN and CLT can hold under dependence (first published 1906), essentially to refute Nekrasov
+ See @Basharin-et-al-on-Markov for Markov's life and this history
## Backup: TL;DR on the origins of the Markov property
- Markov processes are the outcome of a 19th-century flamewar between determinist statisticians and mystic mathematicians
## Backup: Markov chain Monte Carlo
- Suppose we want to sample from a pdf or pmf $p$
- Exercise: Show that if **detailed balance** $q(y|x)p(x) = q(x|y)p(y)$ holds, then $p$ is the invariant distribution of the chain with transition rates $q$
- In **Markov chain Monte Carlo**...
+ we make a Markov chain with transition rates that obey this equation
+ simulate a long trajectory from this chain
+ and use the simulation as an (approximate) sample from $p$
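- A numerical check of detailed balance for the 2-state $\mathbf{q}$ from earlier (two-state chains always satisfy it with their invariant distribution):
```{r}
# p* = (0.6, 0.4) for the running 2-state chain
p.star <- c(0.6, 0.4)
c(p.star[1]*q[1,2], p.star[2]*q[2,1])  # q(2|1)p*(1) and q(1|2)p*(2): both 0.3
```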
## Backup: Markov chain Monte Carlo
- Practical MCMC is based on the **Metropolis algorithm** [@Metropolis-et-al-Monte-Carlo]
1. Set $X(0)$ however we like, and initialize $t \leftarrow 0$.
2. Draw a **proposal** $Z(t)$ from some conditional distribution $r(\cdot|X(t))$ --- for instance a Gaussian or uniform centered on $X(t)$, or anything else easy to draw from and with smooth enough noise.
3. Draw $U(t)$ independently from a uniform distribution on $[0,1)$.
4. If $U(t) < p(Z(t))/p(X(t))$, then $X(t+1) = Z(t)$; otherwise $X(t+1)=X(t)$
5. Increase $t$ by 1 and go to step 2.
- With a symmetric proposal $r$ (as above), this obeys detailed balance, so it has the right invariant distribution
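- A minimal sketch in R (the function name and the Gaussian target are illustrative, not part of the algorithm's statement):
```{r}
# Metropolis sampler with a symmetric uniform proposal; target density p
rmetropolis <- function(n, p, x0, width=1) {
  x <- vector(length=n)
  x[1] <- x0
  for (t in 2:n) {
    z <- x[t-1] + runif(1, min=-width, max=width)  # proposal Z(t)
    if (runif(1) < p(z)/p(x[t-1])) {               # accept if U(t) < p(Z)/p(X)
      x[t] <- z
    } else {
      x[t] <- x[t-1]                               # otherwise stay put
    }
  }
  return(x)
}
mcmc.draws <- rmetropolis(1e4, p=dnorm, x0=0)
qqnorm(mcmc.draws); qqline(mcmc.draws)  # should be close to the Gaussian target
```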
## References