# Markov Chains I

13 November 2018

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Expectwrt}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\InitDist}{p_{\mathrm{init}}} \newcommand{\InvDist}{p^{*}}$

# In our previous episodes…

• Looked at how to simulate a generative model
• Saw how to use those simulations for inference
• Now introduce one of the most important kinds of generative models

# Markov Processes

• The Markov property: Given the current state $$X(t)$$, earlier states $$X(t-1), X(t-2), \ldots$$ are irrelevant to the future states $$X(t+1), X(t+2), \ldots$$

• Equivalently: $\Prob{X(t+1)| X(t), X(t-1), X(t-2), \ldots X(0)} = \Prob{X(t+1)|X(t)}$

• Equivalently: $\Prob{X(0), X(1), \ldots X(n)} = \Prob{X(0)} \Prob{X(1)|X(0)} \Prob{X(2)|X(1)} \ldots \Prob{X(n)|X(n-1)}$

• This is an assumption, not a law of nature

• Some examples:
• AR(1)
• Random walk
• Deterministic dynamics + noise in the state
• Some non-examples come later

# The Markov property between two extremes

• Independent and identically distributed (IID)
• Technically Markov…
• …but even the current state doesn’t matter
• Nothing matters
• Deterministic dynamics
• Technically Markov…
• …but current state fixes all future states exactly

# Markov processes are generative models

• To generate data from / simulate a Markov process, we need to
• Draw the initial state $$X(0)$$ from $$\Prob{X(0)}$$
• Draw $$X(t)$$ from $$\Prob{X(t)|X(t-1)}$$ — inherently sequential
• Generic code:
```r
rmarkov <- function(n, rinitial, rtransition) {
  x <- vector(length = n)
  x[1] <- rinitial()
  for (t in 2:n) {
    x[t] <- rtransition(x[t - 1])
  }
  return(x)
}
```
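As a concrete instance of this template, here is an AR(1) process (one of the examples above) simulated through `rmarkov()`; the coefficient 0.8 and standard Gaussian noise are illustrative choices, not values from the slides:

```r
# Generic Markov-process simulator, as on the slide above
rmarkov <- function(n, rinitial, rtransition) {
  x <- vector(length = n)
  x[1] <- rinitial()
  for (t in 2:n) {
    x[t] <- rtransition(x[t - 1])
  }
  return(x)
}

# AR(1): X(t) = 0.8 * X(t-1) + N(0,1) noise (illustrative parameters)
x <- rmarkov(n = 1000,
             rinitial = function() rnorm(1),
             rtransition = function(xprev) 0.8 * xprev + rnorm(1))
```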

# Markov Chains

• A Markov chain is a Markov process where the states $$X(t)$$ are discrete, not continuous

• Represent $$\Prob{X(t)|X(t-1)}$$ in a transition matrix, $$\mathbf{q}_{ij} \equiv \Prob{X(t) = j| X(t-1)=i}$$

• $$\mathbf{q}$$ is a stochastic matrix:
• All $$q_{ij} \geq 0$$
• for each row $$i$$, $$\sum_{j}{q_{ij}} = 1$$
• Represent $$\Prob{X(0)}$$ as a non-negative vector $$\InitDist$$, summing to 1

# Graph vs. matrix

(Two-state transition graph, figure omitted: arrows $$1\to 1$$ with probability 0.5, $$1 \to 2$$ with 0.5, $$2 \to 1$$ with 0.75, $$2 \to 2$$ with 0.25) $$\Leftrightarrow \mathbf{q}=\left[\begin{array}{cc} 0.5 & 0.5 \\ 0.75 & 0.25 \end{array} \right]$$

```r
rmarkovchain <- function(n, p0, q) {
  k <- length(p0)
  stopifnot(k == nrow(q), k == ncol(q),
            isTRUE(all.equal(rowSums(q), rep(1, times = k))))
  rinitial <- function() { sample(1:k, size = 1, prob = p0) }
  rtransition <- function(x) { sample(1:k, size = 1, prob = q[x, ]) }
  return(rmarkov(n, rinitial, rtransition))
}
```
• It runs:

```r
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
x <- rmarkovchain(1e4, c(0.5, 0.5), q)
head(x)
## [1] 1 1 1 2 1 2
```
• How do we know it works?

# Checking that it works

```r
ones <- which(x[-1e4] == 1)  # Why omit the last step?
twos <- which(x[-1e4] == 2)
signif(table(x[ones + 1]) / length(ones), 3)
##
##     1     2
## 0.499 0.501
signif(table(x[twos + 1]) / length(twos), 3)
##
##     1     2
## 0.752 0.248
```

vs. $$(0.5,0.5)$$ and $$(0.75, 0.25)$$ ideally

# Why is this a check?

• For each $$t$$ where $$X(t) = 1$$, the next state $$X(t+1)$$ is drawn independently from the same conditional distribution, $$\Prob{X(t+1)|X(t)=1}$$
• So law of large numbers holds for $$X(t+1)$$ conditional on $$X(t)=1$$
• and conditional on $$X(t)=2$$
• So sample conditional frequencies should converge on conditional probabilities

# How trajectories evolve

• You could try to write $X(t+1) = f(X(t)) + \epsilon(t+1)$ but it’s generally nasty
• Very non-linear $$f$$
• Distribution of $$\epsilon(t+1)$$ depends on $$X(t)$$
• Alternately write $$X(t+1) = g(X(t), \epsilon(t+1))$$ but then $$g$$ is even more complicated

# Where trajectories end in the long run

• State $$i$$ and state $$j$$ communicate if the chain can go from $$i$$ to $$j$$ and vice versa
• A set of states communicates if every pair of states in it communicates
• A maximal communicating set is irreducible
• $$=$$ a strongly connected component in the graph
• $$\approx$$ an attractor of a deterministic dynamical system
• A set is closed if the probability of leaving it is 0
• Finite chains break up into closed irreducible sets + transient states
• $$\Rightarrow$$ the long run is always in some (closed) irreducible set or other
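A minimal numeric illustration (a hypothetical 3-state chain, not from the slides): state 1 is transient, while states 2 and 3 are each one-state closed irreducible sets, so the trajectory converges to a random limit:

```r
set.seed(42)
# Row i gives P(X(t+1) = . | X(t) = i); states 2 and 3 are absorbing
q <- matrix(c(0.5, 0.25, 0.25,
              0.0, 1.00, 0.00,
              0.0, 0.00, 1.00), byrow = TRUE, nrow = 3)
n <- 200
x <- numeric(n)
x[1] <- 1  # start in the transient state
for (t in 2:n) {
  x[t] <- sample(1:3, size = 1, prob = q[x[t - 1], ])
}
# Once the chain leaves state 1, it never comes back
first.exit <- min(which(x != 1))
```

Which of the two closed sets the chain ends up in varies from run to run, but once there it stays.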

# How distributions evolve

• From the Markov property: $\begin{eqnarray} \Prob{X(1)} & = & \InitDist \mathbf{q}\\ \Prob{X(2)} & = & \Prob{X(1)} \mathbf{q} = \InitDist \mathbf{q}^2\\ \Prob{X(t)} & = & \Prob{X(t-1)} \mathbf{q} = \InitDist \mathbf{q}^{t} \end{eqnarray}$

• The distribution always evolves linearly
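To see the linear evolution numerically, take the two-state $$\mathbf{q}$$ from earlier and iterate $$p \mapsto p\mathbf{q}$$; the distribution settles down to $$(0.6, 0.4)$$ (a sketch, reusing the slides' transition matrix):

```r
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
p <- c(1, 0)  # start with all probability on state 1
for (t in 1:50) {
  p <- p %*% q  # one step of the distribution's (linear) evolution
}
p  # approximately (0.6, 0.4)
```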

# More fun with eigenvalues and eigenvectors

• $$\mathbf{q}$$ has eigenvalues $$\lambda_1, \ldots \lambda_k$$ and left eigenvectors $$v_1, \ldots v_k$$
• Decompose the initial distribution in terms of the eigenvectors: $\InitDist = \sum_{j=1}^{k}{c_j v_j}$
• So $\InitDist \mathbf{q} = \sum_{j=1}^{k}{\lambda_j c_j v_j}$

# Special properties of stochastic matrices

• All $$|\lambda_j| \leq 1$$
• There is always at least one $$\lambda_j = 1$$
• The eigenvectors which go with those eigenvalues have non-negative entries
• so those eigenvectors can be scaled to probability distributions
• One $$\lambda_j=1$$ eigenvalue per irreducible set
• The eigenvector is $$> 0$$ on that set, $$=0$$ off of it
• The chain is periodic if all paths from a given state back to itself have lengths that are multiples of some integer $$>1$$ (e.g., 2)
• Periodicity $$\Leftrightarrow$$ other eigenvalues $$|\lambda|=1$$
• Otherwise the chain is aperiodic

# Irreducible, aperiodic Markov chains

• $$\lambda_1=1$$, corresponding $$v_1 > 0$$ everywhere
• Can normalize $$v_1$$ so $$\sum_{i}{v_{1i}} = 1$$
• All other $$|\lambda_j| < 1$$

• Exercise: Show that, as $$t\rightarrow\infty$$, $\InitDist \mathbf{q}^t \rightarrow v_1$ for all $$\InitDist$$ (which are non-negative and sum to 1)

# Solution

• Because $$|\lambda_j| < 1$$ if $$j \geq 2$$ but $$\lambda_1=1$$, $\Prob{X(t)} = \sum_{j=1}^{k}{c_j \lambda_j^t v_j} \rightarrow c_1 v_1$
• Probabilities must add to 1, and this fixes $$c_1$$ (writing $$v_{ji}$$ for the $$i^{\mathrm{th}}$$ component of $$v_j$$): $\begin{eqnarray} \sum_{i=1}^{k}{\Prob{X(t)=i}} & = & 1\\ \sum_{i=1}^{k}{\sum_{j=1}^{k}{c_j \lambda_j^t v_{ji}}} & = & 1\\ \sum_{i=1}^{k}{c_1 v_{1i}} & = & 1 ~ \text{(limit of both sides as} ~ t\rightarrow\infty)\\ c_1\sum_{i=1}^{k}{v_{1i}} & = & 1\\ c_1 & = & 1 \end{eqnarray}$

# Irreducible, aperiodic Markov chains

• We can say a bit more: $\mathbf{q}^{t} \rightarrow \mathbf{v}^{-1} \mathrm{diag}(1, 0, \ldots 0) \mathbf{v} = \left[ \begin{array}{c} v_1\\ v_1\\ \vdots \\ v_1 \end{array} \right]$
• i.e., each row converges to $$v_1$$
• Use the eigendecomposition of $$\mathbf{q}^t$$
• This is the basic ergodic theorem for Markov chains
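A quick numerical check of the matrix form, using the two-state $$\mathbf{q}$$ from earlier: after a few dozen multiplications, both rows of $$\mathbf{q}^t$$ agree with $$v_1 = (0.6, 0.4)$$:

```r
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
qt <- diag(2)  # q^0 = identity
for (t in 1:30) {
  qt <- qt %*% q
}
round(qt, 6)  # each row is approximately (0.6, 0.4)
```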

# Invariant distributions

• The (left) eigenvector of $$\mathbf{q}$$ with eigenvalue 1 is the invariant distribution: $v_1 \mathbf{q} = v_1$
• a.k.a. equilibrium distribution
• This is important enough to deserve a special symbol, $$\InvDist$$
• If $$X(t) \sim \InvDist$$, then $$(X(t), X(t+1)) \sim \InvDist \otimes \mathbf{q}$$
• $$=$$ invariant distribution over pairs
• define invariant distribution over longer blocks similarly
• If $$\InitDist=\InvDist$$, then the process is (strictly) stationary

# Ergodicity and Weak Dependence

• Because $$\Prob{X(h)} \rightarrow \InvDist$$ as $$h\rightarrow\infty$$, $$X(t)$$ and $$X(t+h)$$ become independent as $$h\rightarrow \infty$$
• $$\Prob{X(t+h)|X(t)=x} \rightarrow \InvDist$$ regardless of $$x$$
• Correlations actually decay exponentially in $$h$$
• $$\therefore$$ there’s an ergodic theorem: $\frac{1}{n}\sum_{t=1}^{n}{f(X(t))} \rightarrow \sum_{i=1}^{k}{f(i) \InvDist_i} = \Expectwrt{X \sim \InvDist}{f(X)}$
• Similarly for functions of pairs: $\frac{1}{n-1}\sum_{t=1}^{n-1}{g(X(t), X(t+1))} \rightarrow \sum_{i=1}^{k}{\sum_{j=1}^{k}{g(i,j) \InvDist_i \mathbf{q}_{ij}}} = \Expectwrt{(X, X^{\prime}) \sim \InvDist \otimes \mathbf{q}}{g(X, X^{\prime})}$
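As a sketch of the first ergodic theorem in action, take $$f(x) = \mathbf{1}(x = 1)$$, whose expectation under the invariant distribution $$(0.6, 0.4)$$ of the two-state $$\mathbf{q}$$ is 0.6:

```r
set.seed(7)
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
n <- 1e5
x <- numeric(n)
x[1] <- sample(1:2, size = 1)
for (t in 2:n) {
  x[t] <- sample(1:2, size = 1, prob = q[x[t - 1], ])
}
mean(x == 1)  # time average of f; close to 0.6
```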

# Sample frequencies vs. probabilities

• Define $$N_i \equiv$$ number of times $$X(t)=i$$
• Define $$N_{ij} \equiv$$ number of times $$X(t)=i, X(t+1)=j$$
• Then $\begin{eqnarray} \frac{N_i}{n} & \rightarrow & \InvDist_i\\ \frac{N_{ij}}{n} & \rightarrow & \InvDist_i \mathbf{q}_{ij}\\ \frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} & \rightarrow & \frac{\InvDist_i \mathbf{q}_{ij}}{\InvDist_i} = \mathbf{q}_{ij} \end{eqnarray}$
```r
eigen(t(q))
## eigen() decomposition
## $values
## [1]  1.00 -0.25
##
## $vectors
##           [,1]       [,2]
## [1,] 0.8320503 -0.7071068
## [2,] 0.5547002  0.7071068
eigen(t(q))$vectors[,1] / sum(eigen(t(q))$vectors[,1])
## [1] 0.6 0.4
```
```r
table(rmarkovchain(1e4, c(0.5, 0.5), q))
##
##    1    2
## 6010 3990
table(rmarkovchain(1e4, c(0.5, 0.5), q))
##
##    1    2
## 6003 3997
table(rmarkovchain(1e4, c(0, 1), q))
##
##    1    2
## 6057 3943
table(rmarkovchain(1e4, c(1, 0), q))
##
##    1    2
## 5953 4047
```
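The convergence of $$N_{ij}/N_i$$ to $$\mathbf{q}_{ij}$$ suggests an estimator of the transition matrix from a single long trajectory (a sketch, reusing the two-state $$\mathbf{q}$$):

```r
set.seed(11)
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow = TRUE, nrow = 2)
n <- 1e5
x <- numeric(n)
x[1] <- 1
for (t in 2:n) {
  x[t] <- sample(1:2, size = 1, prob = q[x[t - 1], ])
}
counts <- table(from = x[-n], to = x[-1])  # N_ij
q.hat <- prop.table(counts, margin = 1)    # N_ij / N_i, row by row
round(q.hat, 2)
```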

# Central limit theorem

• Because $$X(t)$$ and $$X(t+h)$$ become independent as $$h\rightarrow\infty$$, the time average is $$\approx$$ an average of IID blocks
• CLT would apply to an average of IID blocks
• For rigor, need to check that $$\approx$$ becomes strict as $$n\rightarrow\infty$$
• Usually works in practice:
```r
time.avgs <- replicate(100, mean(rmarkovchain(1e4, c(0.5, 0.5), q)))
qqnorm(time.avgs); qqline(time.avgs)
```

# What if there’s more than one irreducible set?

• For a finite chain, eventually $$X(t)$$ enters an irreducible set, and never leaves
• $$\Rightarrow$$ Long-run behavior of $$X(t)$$ looks like it was always in that set
• but which irreducible set is random
• $$\Rightarrow$$ Converge, but to a random limit

# What if the state space isn’t finite?

• There can still be irreducible sets (smaller than the whole space)
• If $$X(t)$$ enters an irreducible set, we have nice ergodic behavior
• Might wander off to infinity, though
• Example: Random walk

# What if the state space is continuous?

• Again, we can have irreducible sets
• Again, nice ergodic behavior within an irreducible set
• Generally need a lot more math to handle this…

# Variations on the theme

• Higher-order Markov process: Need to condition on $$X(t-p+1), \ldots X(t-1), X(t)$$
• $$=$$ A Markov process on blocks
• AR$$(p)$$ is a $$p^{\mathrm{th}}$$ order chain
• Interacting/coupled Markov processes: transition rates for process 1 depend on its state and the state of process 2
• Continuous time: stay in state for a random time, with exponential distribution, then take a chain step
• Semi-Markov chain: Like a CTMC, but non-exponential holding times
• Hidden Markov Model (HMM): $$S(t)$$ is Markov, but we see $$X(t) = f(S(t)) + \mathrm{noise}$$
• $$X(t)$$ by itself isn’t Markov
• MA$$(q)$$ is actually an HMM!
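A sketch of the first variation: a second-order binary chain becomes an ordinary (first-order) Markov chain on the pair $$(X(t-1), X(t))$$. The transition probabilities here are made up for illustration:

```r
set.seed(3)
# P(X(t+1) = 1 | X(t-1) = a, X(t) = b), indexed by the pair (a, b)
p1 <- matrix(c(0.9, 0.5,
               0.5, 0.1), byrow = TRUE, nrow = 2)
n <- 1000
x <- numeric(n)
x[1:2] <- c(1, 1)
for (t in 3:n) {
  p <- p1[x[t - 2], x[t - 1]]  # depends on the last *two* states
  x[t] <- sample(1:2, size = 1, prob = c(p, 1 - p))
}
table(x)
```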

# Summary

• Markov chains are the most basic non-trivial stochastic process
• Individual trajectories are random and erratic
• Distributions evolve according to a linear operator
• Invariant distributions $$=$$ leading eigenvectors
• With finite state spaces:
• Each individual trajectory eventually enters an irreducible set, and wanders ergodically around it
• One invariant distribution per irreducible set
• Long-run averages look like expectations over the invariant distribution

• Handout will be posted to the class website
• Grimmett and Stirzaker (1992) is really good on Markov chains, and more general Markov processes
• On invariant distributions in continuous state spaces, the standard reference is Meyn and Tweedie (1993), but Lasota and Mackey (1994) is also good, and Mackey (1992) is surprisingly readable and relevant

# Backup: The philosophical origins of the Markov property

• Early 19th century: administrations of nation-states begin collecting, and publishing, statistics on mass social phenomena
• “Vital statistics”: Birth rates, death rates, marriage rates, numbers of children…
• Accidents: the stability of the number of pieces of undeliverable mail each year in Paris was thought especially impressive
• Crimes: theft, murder, suicide
• Mid-19th century: pioneering social scientists realize the rates of these phenomena are pretty stable
• … Even of things that arise from really profound moral choices, like murder and suicide
• … which is what you’d expect from the law of large numbers if there was just some constant probability
• … so they concluded that free will does not exist (“Society itself prepares the crime”)
• See Hacking (1990), Porter (1986) for this history
• See Cassirer (1944) for how some people realized at the time that this was crazy-bad philosophy

# Backup: The philosophical origins of the Markov property

• P. A. Nekrasov: 19th century Russian mathematician and theologian
• Claimed law of large numbers requires independent events
• But human actions are dependent
• $$\therefore$$ Law of large numbers does not apply
• $$\therefore$$ Free will exists and Russian Orthodox Christianity is exactly true
• This was not even close to the strangest idea entertained by the Moscow math department (Graham and Kantor 2009)

# Backup: The philosophical origins of the Markov property

• A. A. Markov: Russian mathematician and vocal atheist
• Leading probability theorist and a very careful mathematician
• Viewed Nekrasov as his arch-nemesis
• Read a 1902 paper by Nekrasov and got enraged
• Invented the theory of Markov chains to show LLN and CLT can hold under dependence (first published 1906), essentially to refute Nekrasov
• See Basharin, Langville, and Naumov (2004) for Markov’s life and this history

# Backup: TL;DR on the origins of the Markov property

• Markov processes are the outcome of a 19th-century flamewar between determinist statisticians and mystic mathematicians

# Backup: Markov chain Monte Carlo

• Suppose we want to sample from a pdf or pmf $$p$$
• Exercise: Show that if detailed balance $$q(y|x)p(x) = q(x|y)p(y)$$ holds, then $$p$$ is the invariant distribution of the chain with transition rates $$q$$
• In Markov chain Monte Carlo
• we make a Markov chain with transition rates that obey this equation
• simulate a long trajectory from this chain
• and use the simulation as an (approximate) sample from $$p$$
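One way to do the exercise: sum both sides of the detailed-balance identity over $$x$$,

```latex
\sum_{x} q(y|x) p(x) \;=\; \sum_{x} q(x|y) p(y)
\;=\; p(y) \sum_{x} q(x|y) \;=\; p(y)
```

i.e., applying one step of the chain to $$p$$ gives back $$p$$, which is exactly the invariance condition $$p \mathbf{q} = p$$.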

# Backup: Markov chain Monte Carlo

• Practical MCMC is based on the Metropolis algorithm (Metropolis et al. 1953)
1. Set $$X(0)$$ however we like, and initialize $$t \leftarrow 0$$.
2. Draw a proposal $$Z(t)$$ from some symmetric conditional distribution $$r(\cdot|X(t))$$, i.e., one with $$r(z|x) = r(x|z)$$ — for instance a Gaussian or uniform centered on $$X(t)$$, or anything else symmetric and easy to draw from.
3. Draw $$U(t)$$ independently from a uniform distribution on $$[0,1)$$.
4. If $$U(t) < p(Z(t))/p(X(t))$$, then $$X(t+1) = Z(t)$$; otherwise $$X(t+1)=X(t)$$
5. Increase $$t$$ by 1 and go to step 2.
• Obeys detailed balance so it has the right invariant distribution
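A minimal Metropolis sketch in R, for a hypothetical unnormalized pmf on $$\{1, \ldots, 5\}$$ proportional to $$x$$, with a symmetric nearest-neighbor proposal (all choices here are illustrative, not from the slides):

```r
set.seed(99)
p <- function(x) if (x >= 1 && x <= 5) x else 0  # unnormalized target pmf
n <- 1e5
x <- numeric(n)
x[1] <- 1
for (t in 1:(n - 1)) {
  z <- x[t] + sample(c(-1, 1), size = 1)  # symmetric proposal
  if (runif(1) < p(z) / p(x[t])) {        # accept w.p. min(1, p(z)/p(x))
    x[t + 1] <- z
  } else {
    x[t + 1] <- x[t]                      # reject: stay put
  }
}
table(x) / n  # should be close to (1:5)/15
```

Proposals outside $$\{1, \ldots, 5\}$$ have $$p = 0$$ and are always rejected, which keeps the chain on the target's support.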

# References

Basharin, Gely P., Amy N. Langville, and Valeriy A. Naumov. 2004. “The Life and Work of A. A. Markov.” Linear Algebra and Its Applications 386:3–26. http://langvillea.people.cofc.edu/MarkovReprint.pdf.

Cassirer, Ernst. 1944. An Essay on Man: An Introduction to a Philosophy of Human Culture. New Haven, Connecticut: Yale University Press.

Graham, Loren, and Jean-Michel Kantor. 2009. Naming Infinity: A True Story of Religious Mysticism and Mathematical Creativity. Cambridge, Massachusetts: Harvard University Press.

Grimmett, G. R., and D. R. Stirzaker. 1992. Probability and Random Processes. 2nd ed. Oxford: Oxford University Press.

Hacking, Ian. 1990. The Taming of Chance. Cambridge, England: Cambridge University Press.

Lasota, Andrzej, and Michael C. Mackey. 1994. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics. Berlin: Springer-Verlag.

Mackey, Michael C. 1992. Time’s Arrow: The Origins of Thermodynamic Behavior. Berlin: Springer-Verlag.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equations of State Calculations by Fast Computing Machines.” Journal of Chemical Physics 21:1087–92. https://doi.org/10.1063/1.1699114.

Meyn, S. P., and R. L. Tweedie. 1993. Markov Chains and Stochastic Stability. Berlin: Springer-Verlag.

Porter, Theodore M. 1986. The Rise of Statistical Thinking, 1820–1900. Princeton, New Jersey: Princeton University Press.