36-467/36-667

13 November 2018

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Expectwrt}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\InitDist}{p_{\mathrm{init}}} \newcommand{\InvDist}{p^{*}} \]

- Looked at how to simulate a generative model
- Saw how to use those simulations for inference
- Now introduce one of the most important kinds of generative models

The **Markov property**: Given the current **state** \(X(t)\), earlier states \(X(t-1), X(t-2), \ldots\) are irrelevant to the future states \(X(t+1), X(t+2), \ldots\). Equivalently: \[ \Prob{X(t+1)| X(t), X(t-1), X(t-2), \ldots X(0)} = \Prob{X(t+1)|X(t)} \]

Equivalently: \[ \Prob{X(0), X(1), \ldots X(n)} = \Prob{X(0)} \Prob{X(1)|X(0)} \Prob{X(2)|X(1)} \ldots \Prob{X(n)|X(n-1)} \]

This is an *assumption*, not a law of nature.

Some examples:
- AR(1)
- Random walk
- Deterministic dynamics + noise in the state

Some non-examples come later

- Independent and identically distributed (IID)
- Technically Markov…
- …but even the current state doesn’t matter
- Nothing matters

- Deterministic dynamics
- Technically Markov…
- …but current state fixes all future states *exactly*

- To generate data from / simulate a Markov process, we need to
- Draw the initial state \(X(0)\) from \(\Prob{X(0)}\)
- Draw \(X(t)\) from \(\Prob{X(t)|X(t-1)}\) — inherently sequential

- Generic code:
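The generic simulator `rmarkov`, called by the chain-specific code below, is not shown in these notes; a plausible sketch (a reconstruction, taking the number of steps, a function that draws the initial state, and a function that draws the next state from the current one) would be:

```r
# Sketch of a generic Markov-process simulator (reconstruction of the
# rmarkov() used later): rinitial() draws X(0), rtransition(x) draws
# X(t) given X(t-1) = x -- the loop is inherently sequential
rmarkov <- function(n, rinitial, rtransition) {
  x <- vector(length = n)
  x[1] <- rinitial()
  for (t in 2:n) {
    x[t] <- rtransition(x[t - 1])
  }
  return(x)
}

# e.g., a Gaussian AR(1): X(t) = 0.8 X(t-1) + N(0,1) noise
ar1 <- rmarkov(100, rinitial = function() rnorm(1),
               rtransition = function(x) 0.8 * x + rnorm(1))
```

The same skeleton works for discrete chains or continuous processes; only the two argument functions change.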

A Markov **chain** is a Markov process where the states \(X(t)\) are discrete, not continuous.

Represent \(\Prob{X(t)|X(t-1)}\) in a **transition matrix**, \(\mathbf{q}_{ij} \equiv \Prob{X(t) = j| X(t-1)=i}\)

- \(\mathbf{q}\) is a **stochastic matrix**:
  - All \(q_{ij} \geq 0\)
  - For each row \(i\), \(\sum_{j}{q_{ij}} = 1\)

Represent \(\Prob{X(0)}\) as a non-negative vector \(\InitDist\), summing to 1

For example, a two-state chain where state 1 moves to either state with probability 0.5, and state 2 moves back to state 1 with probability 0.75, has \(\mathbf{q}=\left[\begin{array}{cc} 0.5 & 0.5 \\ 0.75 & 0.25 \end{array} \right]\)

```
rmarkovchain <- function(n, p0, q) {
  k <- length(p0)
  stopifnot(k == nrow(q), k == ncol(q),
            isTRUE(all.equal(rowSums(q), rep(1, times=k))))
  rinitial <- function() { sample(1:k, size=1, prob=p0) }
  rtransition <- function(x) { sample(1:k, size=1, prob=q[x,]) }
  return(rmarkov(n, rinitial, rtransition))
}
```

- It runs:

```
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
x <- rmarkovchain(1e4,c(0.5,0.5),q)
head(x)
```

`## [1] 1 1 1 2 1 2`

- How do we know it works?

```
ones <- which(x[-1e4]==1) # Why omit the last step?
twos <- which(x[-1e4]==2)
signif(table(x[ones+1])/length(ones),3)
```

```
## 
##     1     2 
## 0.499 0.501
```

```
signif(table(x[twos+1])/length(twos),3)
```

```
## 
##     1     2 
## 0.752 0.248
```

vs. \((0.5,0.5)\) and \((0.75, 0.25)\) ideally

- For each \(t\) where \(X(t) =1\), the next state \(X(t+1)\) is drawn IIDly
- So law of large numbers holds for \(X(t+1)\) conditional on \(X(t)=1\)
- and conditional on \(X(t)=2\)

- So sample conditional frequencies should converge on conditional probabilities

- You could try to write \[
X(t+1) = f(X(t)) + \epsilon(t+1)
\] but it’s generally nasty
- Very non-linear \(f\)
- Distribution of \(\epsilon\) depends on \(X(t)\)
- Alternately write \(X(t+1) = g(X(t), \epsilon(t+1))\) but then \(g\) is even more complicated

- State \(i\) and state \(j\) **communicate** if the chain can go from \(i\) to \(j\) and vice versa
- A set of states communicates if every pair of states communicate
- A maximal communicating set is **irreducible**
  - \(=\) a **strongly connected component** in the graph
  - \(\approx\) an **attractor** of a deterministic dynamical system
- A set is **closed** if the probability of leaving it is 0
- Finite chains break up into closed irreducible sets + transient states
- \(\Rightarrow\) the long run is always in some (closed) irreducible set or other

From the Markov property: \[\begin{eqnarray} \Prob{X(1)} & = & \InitDist \mathbf{q}\\ \Prob{X(2)} & = & \Prob{X(1)} \mathbf{q} = \InitDist \mathbf{q}^2\\ \Prob{X(t)} & = & \Prob{X(t-1)} \mathbf{q} = \InitDist \mathbf{q}^{t} \end{eqnarray}\]

The *distribution* always evolves *linearly*
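To illustrate with the running two-state example, repeatedly multiplying the distribution by the transition matrix evolves it in time:

```r
# The distribution evolves linearly: p_init q^t for the two-state chain
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
p <- c(0.5, 0.5)  # initial distribution
for (t in 1:5) {
  p <- p %*% q    # Prob{X(t)} = Prob{X(t-1)} q
  print(p)
}
# the successive distributions close in rapidly on (0.6, 0.4)
```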

- \(\mathbf{q}\) has eigenvalues \(\lambda_1, \ldots \lambda_k\) and *left* eigenvectors \(v_1, \ldots v_k\)
- Decompose the initial distribution in terms of the eigenvectors: \[ \InitDist = \sum_{j=1}^{k}{c_j v_j} \]
- So \[ \InitDist \mathbf{q} = \sum_{j=1}^{k}{\lambda_j c_j v_j} \]

- All \(|\lambda_j| \leq 1\)
- There is always at least one \(\lambda_j = 1\)
- The eigenvectors which go with those eigenvalues have non-negative entries
- so those eigenvectors can be scaled to probability distributions
- One \(\lambda_j=1\) eigenvalue per irreducible set
- The eigenvector is \(> 0\) on that set, \(=0\) off of it

- The chain is **periodic** if all paths back to a given state have lengths that are multiples of some integer \(d > 1\) (e.g., 2)
  - Periodicity \(\Leftrightarrow\) some other eigenvalues also have \(|\lambda|=1\)
- Otherwise the chain is **aperiodic**

- \(\lambda_1=1\), corresponding \(v_1 > 0\) everywhere
- Can normalize \(v_1\) so \(\sum_{i}{v_{1i}} = 1\)
- All other \(|\lambda_j| < 1\)

**Exercise**: Show that, as \(t\rightarrow\infty\), \[ \InitDist \mathbf{q}^t \rightarrow v_1 \] for all \(\InitDist\) (which are non-negative and sum to 1)

- Because \(|\lambda_j| < 1\) if \(j \geq 2\) but \(\lambda_1=1\), \[ \Prob{X(t)} = \sum_{j=1}^{k}{c_j \lambda_j^t v_j} \rightarrow c_1 v_1 \]
- Probabilities must add to 1, and this fixes \(c_1\): \[\begin{eqnarray} \sum_{i=1}^{k}{\Prob{X(t)=i}} & = & 1\\ \sum_{i=1}^{k}{\sum_{j=1}^{k}{c_j \lambda_j^t v_{ji}}} & = & 1\\ \sum_{i=1}^{k}{c_1 v_{1i}} & = & 1 ~ \text{(limit of both sides as} ~ t\rightarrow\infty)\\ c_1\sum_{i=1}^{k}{v_{1i}} & = & 1\\ c_1 & = & 1 \end{eqnarray}\]

- We can say a bit more: \[
\mathbf{q}^{t} \rightarrow \mathbf{v}^{-1} \mathrm{diag}(1, 0, \ldots 0) \mathbf{v} = \left[ \begin{array}{c} v_1\\ v_1\\ \vdots \\ v_1 \end{array} \right]
\]
- i.e., each row converges to \(v_1\)
- Use the eigendecomposition of \(\mathbf{q}^t\)

- This is the basic ergodic theorem for Markov chains
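The row convergence is easy to see numerically for the two-state example, just by multiplying the matrix into itself:

```r
# Each row of q^t converges to the invariant eigenvector v1
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
qt <- diag(2)                 # q^0 = identity
for (t in 1:20) qt <- qt %*% q
qt                            # both rows are now essentially (0.6, 0.4)
```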

- The (left) eigenvector of \(\mathbf{q}\) with eigenvalue 1 is the **invariant** distribution: \[ v_1 \mathbf{q} = v_1 \]
  - a.k.a. the **equilibrium** distribution
- This is important enough to deserve a special symbol, \(\InvDist\)
- If \(X(t) \sim \InvDist\), then \((X(t), X(t+1)) \sim \InvDist \otimes \mathbf{q}\)
- \(=\) invariant distribution over pairs
- define invariant distribution over longer blocks similarly

- If \(\InitDist=\InvDist\), then the process is (strictly) stationary

- Because \(\Prob{X(h)} \rightarrow \InvDist\) as \(h\rightarrow\infty\), \(X(t)\) and \(X(t+h)\) become independent as \(h\rightarrow \infty\)
- \(\Prob{X(t+h)|X(t)=x} \rightarrow \InvDist\) regardless of \(x\)
- Correlations actually decay exponentially in \(h\)

- \(\therefore\) there’s an ergodic theorem: \[ \frac{1}{n}\sum_{t=1}^{n}{f(X(t))} \rightarrow \sum_{i=1}^{k}{f(i) \InvDist_i} = \Expectwrt{X \sim \InvDist}{f(X)} \]
- Similarly for functions of pairs: \[ \frac{1}{n-1}\sum_{t=1}^{n-1}{g(X(t), X(t+1))} \rightarrow \sum_{i=1}^{k}{\sum_{j=1}^{k}{g(i,j) \InvDist_i \mathbf{q}_{ij}}} = \Expectwrt{(X, X^{\prime}) \sim \InvDist \otimes \mathbf{q}}{g(X, X^{\prime})} \]
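As a quick check on the first limit, take \(f(x)=x\) for the running two-state chain; a self-contained sketch (with the chain simulation inlined rather than calling `rmarkovchain`), where the time average should approach \(1 \times 0.6 + 2 \times 0.4 = 1.4\):

```r
# Time average of f(X(t)) = X(t) vs. its expectation under the
# invariant distribution (0.6, 0.4) of the two-state chain
set.seed(467)
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
n <- 1e4
x <- numeric(n)
x[1] <- sample(1:2, size=1, prob=c(0.5, 0.5))
for (t in 2:n) {
  x[t] <- sample(1:2, size=1, prob=q[x[t-1], ])
}
mean(x)  # should be close to 1*0.6 + 2*0.4 = 1.4
```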

- Define \(N_i \equiv\) number of times \(X(t)=i\)
- Define \(N_{ij} \equiv\) number of times \(X(t)=i, X(t+1)=j\)
- Then \[\begin{eqnarray} \frac{N_i}{n} & \rightarrow & \InvDist_i\\ \frac{N_{ij}}{n} & \rightarrow & \InvDist_i \mathbf{q}_{ij}\\ \frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} & \rightarrow & \frac{\InvDist_i \mathbf{q}_{ij}}{\InvDist_i} = \mathbf{q}_{ij} \end{eqnarray}\]

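The output below (the eigendecomposition, the normalized leading eigenvector, and four tabulated runs of the chain) was presumably produced by code along these lines; this is a reconstruction, since the code chunks themselves were not preserved:

```r
# Left eigenvectors of q are the right eigenvectors of t(q)
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
eigen(t(q))
v1 <- eigen(t(q))$vectors[, 1]
v1/sum(v1)  # rescaled to a probability distribution: the invariant p*
# then four independent runs, tabulating how often each state occurs:
# table(rmarkovchain(1e4, c(0.5, 0.5), q))  # repeated four times
```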
```
## eigen() decomposition
## $values
## [1] 1.00 -0.25
##
## $vectors
## [,1] [,2]
## [1,] 0.8320503 -0.7071068
## [2,] 0.5547002 0.7071068
```

`## [1] 0.6 0.4`

```
##
## 1 2
## 6010 3990
```

```
##
## 1 2
## 6003 3997
```

```
##
## 1 2
## 6057 3943
```

```
##
## 1 2
## 5953 4047
```

- Because \(X(t)\) and \(X(t+h)\) become independent as \(h\rightarrow\infty\), the time average is \(\approx\) an average of IID blocks
- CLT would apply to an average of IID blocks
- For rigor, need to check that \(\approx\) becomes strict as \(n\rightarrow\infty\)

- Usually works in practice

- For a finite chain, eventually \(X(t)\) enters an irreducible set, and never leaves
- \(\Rightarrow\) Long-run behavior of \(X(t)\) looks like it was *always* in that set
- but *which* irreducible set is random
- \(\Rightarrow\) Converge, but to a random limit

- There can still be irreducible sets (smaller than the whole space)
- If \(X(t)\) enters an irreducible set, we have nice ergodic behavior
- Might wander off to infinity, though
- Example: Random walk

- Again, we can have irreducible sets
- Again, nice ergodic behavior within an irreducible set
- Generally need a lot more math to handle this…

- Higher-order Markov process: Need to condition on \(X(t-p+1), \ldots X(t-1), X(t)\)
- \(=\) A Markov process on blocks
- AR\((p)\) is a \(p^{\mathrm{th}}\) order chain

- Interacting/coupled Markov processes: transition rates for process 1 depend on its state *and* the state of process 2
- Continuous time: stay in state for a random time, with exponential distribution, then take a chain step
- Semi-Markov chain: Like a CTMC, but non-exponential holding times
- Hidden Markov Model (HMM): \(S(t)\) is Markov, but we see \(X(t) = f(S(t)) + \mathrm{noise}\)
- \(X(t)\) by itself isn’t Markov
- MA\((q)\) is actually an HMM!
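A toy HMM can be simulated by layering noisy observations on top of the running two-state chain (a sketch, with the chain simulation inlined and the noise level chosen arbitrarily):

```r
# Toy HMM: S(t) is the two-state Markov chain, but we only observe
# X(t) = S(t) + Gaussian noise; X(t) alone is not Markov
set.seed(7)
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
n <- 1000
s <- numeric(n)
s[1] <- 1
for (t in 2:n) s[t] <- sample(1:2, size=1, prob=q[s[t-1], ])
x <- s + rnorm(n, sd=0.5)  # the observed series
```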

- Markov chains are the most basic non-trivial stochastic process
- Individual trajectories are random and erratic
- Distributions evolve according to a linear operator
- Invariant distributions \(=\) leading eigenvectors

- With
*finite*state spaces:- Each individual trajectory eventually enters an irreducible set, and wanders ergodically around it
- One invariant distribution per irreducible set
- Long-run averages look like expectations over the invariant distribution

- Handout will be posted to the class website
- Grimmett and Stirzaker (1992) is really good on Markov chains, and more general Markov processes
- On invariant distributions in continuous state spaces, the standard reference is Meyn and Tweedie (1993), but Lasota and Mackey (1994) is also good, and Mackey (1992) is surprisingly readable and relevant

- Early 19th century: administrations of nation-states begin collecting, and publishing, statistics on mass social phenomena
- “Vital statistics”: Birth rates, death rates, marriage rates, numbers of children…
- Accidents: the number of pieces of undeliverable mail each year in Paris was thought impressive
- Crimes: theft, murder, suicide

- Mid-19th century: pioneering social scientists realize the rates of these phenomena are pretty stable
- … Even of things that arise from really profound moral choices, like murder and suicide
- … which is what you’d expect from the law of large numbers if there was just some constant probability
- … so they concluded that free will does not exist (“Society itself prepares the crime”)

- See Hacking (1990), Porter (1986) for this history
- See Cassirer (1944) for how some people realized at the time that this was crazy-bad philosophy

- P. A. Nekrasov: 19th century Russian mathematician and theologian
- Claimed law of large numbers requires independent events
- But human actions are dependent
- \(\therefore\) Law of large numbers does not apply
- \(\therefore\) Free will exists and Russian Orthodox Christianity is exactly true
- This was not even close to the strangest idea entertained by the Moscow math department (Graham and Kantor 2009)

- A. A. Markov: Russian mathematician and vocal atheist
- Leading probability theorist and a very careful mathematician
- Viewed Nekrasov as his arch-nemesis
- Read a 1902 paper by Nekrasov and got enraged
- Invented the theory of Markov chains to show LLN and CLT can hold under dependence (first published 1906), essentially to refute Nekrasov
- See Basharin, Langville, and Naumov (2004) for Markov’s life and this history

- Markov processes are the outcome of a 19th-century flamewar between determinist statisticians and mystic mathematicians

- Suppose we want to sample from a pdf or pmf \(p\)
- Exercise: Show that if **detailed balance** \(q(y|x)p(x) = q(x|y)p(y)\) holds, then \(p\) is the invariant distribution of the chain with transition rates \(q\)
- In **Markov chain Monte Carlo**…
  - we make a Markov chain with transition rates that obey this equation
  - simulate a long trajectory from this chain
  - and use the simulation as an (approximate) sample from \(p\)
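For the running two-state example, detailed balance can be checked directly, using the invariant distribution \((0.6, 0.4)\) found earlier:

```r
# Detailed balance for the two-state chain: q(j|i) p*(i) = q(i|j) p*(j)
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
p <- c(0.6, 0.4)
isTRUE(all.equal(q[1,2]*p[1], q[2,1]*p[2]))  # detailed balance holds here
isTRUE(all.equal(as.vector(p %*% q), p))     # and p is indeed invariant
```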

- Practical MCMC is based on the
**Metropolis algorithm**(Metropolis et al. 1953)

1. Set \(X(0)\) however we like, and initialize \(t \leftarrow 0\).
2. Draw a **proposal** \(Z(t)\) from some conditional distribution \(r(\cdot|X(t))\) — for instance a Gaussian or uniform centered on \(X(t)\), or anything else easy to draw from and with smooth enough noise.
3. Draw \(U(t)\) independently from a uniform distribution on \([0,1)\).
4. If \(U(t) < p(Z(t))/p(X(t))\), then \(X(t+1) = Z(t)\); otherwise \(X(t+1)=X(t)\).
5. Increase \(t\) by 1 and go to step 2.

- Obeys detailed balance so it has the right invariant distribution
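A minimal sketch of the steps above in R, assuming a symmetric proposal (a Gaussian random walk), which is what makes the simple acceptance ratio \(p(Z)/p(X)\) correct; the function and argument names here are made up for illustration:

```r
# Metropolis sampler with a symmetric Gaussian random-walk proposal;
# p need only be proportional to the target density
rmetropolis <- function(n, p, x0, scale = 1) {
  x <- numeric(n)
  x[1] <- x0
  for (t in 1:(n - 1)) {
    z <- rnorm(1, mean = x[t], sd = scale)  # proposal Z(t)
    u <- runif(1)                           # uniform U(t)
    if (u < p(z)/p(x[t])) {                 # accept, or stay put
      x[t + 1] <- z
    } else {
      x[t + 1] <- x[t]
    }
  }
  return(x)
}

# e.g., sampling from a standard Gaussian via its unnormalized density
set.seed(36)
samples <- rmetropolis(1e4, p = function(x) exp(-x^2/2), x0 = 0)
```

The trajectory is an approximate, autocorrelated sample from \(p\); in practice one discards an initial burn-in segment before using it.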

Basharin, Gely P., Amy N. Langville, and Valeriy A. Naumov. 2004. “The Life and Work of A. A. Markov.” *Linear Algebra and Its Applications* 386:3–26. http://langvillea.people.cofc.edu/MarkovReprint.pdf.

Cassirer, Ernst. 1944. *An Essay on Man: An Introduction to a Philosophy of Human Culture*. New Haven, Connecticut: Yale University Press.

Graham, Loren, and Jean-Michel Kantor. 2009. *Naming Infinity: A True Story of Religious Mysticism and Mathematical Creativity*. Cambridge, Massachusetts: Harvard University Press.

Grimmett, G. R., and D. R. Stirzaker. 1992. *Probability and Random Processes*. 2nd ed. Oxford: Oxford University Press.

Hacking, Ian. 1990. *The Taming of Chance*. Cambridge, England: Cambridge University Press.

Lasota, Andrzej, and Michael C. Mackey. 1994. *Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics*. Berlin: Springer-Verlag.

Mackey, Michael C. 1992. *Time’s Arrow: The Origins of Thermodynamic Behavior*. Berlin: Springer-Verlag.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” *Journal of Chemical Physics* 21:1087–92. https://doi.org/10.1063/1.1699114.

Meyn, S. P., and R. L. Tweedie. 1993. *Markov Chains and Stochastic Stability*. Berlin: Springer-Verlag.

Porter, Theodore M. 1986. *The Rise of Statistical Thinking, 1820–1900*. Princeton, New Jersey: Princeton University Press.