\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Expectwrt}[2]{\mathbb{E}_{#1}\left[ #2 \right]} \newcommand{\InitDist}{\vec{p}_{\mathrm{init}}} \newcommand{\InvDist}{\vec{p}^{*}} \]

In our previous episodes…

Looked at how to simulate a generative model
Saw how to use those simulations for inference
Now introduce one of the most important kinds of generative models

Markov Processes

The Markov property: Given the current state \(X(t)\), earlier states \(X(t-1), X(t-2), \ldots\) are irrelevant to the future states \(X(t+1), X(t+2), \ldots\)
Equivalently: \[ \Prob{X(t+1)| X(t), X(t-1), X(t-2), \ldots X(0)} = \Prob{X(t+1)|X(t)} \]
Equivalently: \[ \Prob{X(0), X(1), \ldots X(n)} = \Prob{X(0)} \Prob{X(1)|X(0)} \Prob{X(2)|X(1)} \ldots \Prob{X(n)|X(n-1)} \]
This is an assumption, not a law of nature
Some examples:
- AR(1)
- Random walk
- Deterministic dynamics + noise in the state
Some non-examples come later

The Markov property includes two extremes

Independent and identically distributed (IID)
- Technically Markov…
- …but even the current state doesn’t matter
- Nothing matters
Deterministic dynamics
- Technically Markov…
- …but current state fixes all future states exactly

Markov processes are generative models

To generate data from / simulate a Markov process, we need to
- Draw the initial state \(X(0)\) from \(\Prob{X(0)}\)
- Draw \(X(t)\) from \(\Prob{X(t)|X(t-1)}\) — inherently sequential
Generic code:

rmarkov <- function(n, rinitial, rtransition) {
  x <- vector(length=n)
  x[1] <- rinitial()
  for (t in 2:n) {
    x[t] <- rtransition(x[t-1])
  }
  return(x)
}

Markov Chains

A Markov chain is a Markov process where the states \(X(t)\) are discrete, not continuous
Represent \(\Prob{X(t)|X(t-1)}\) in a transition matrix, \(\mathbf{q}_{ij} \equiv \Prob{X(t) = j| X(t-1)=i}\)
\(\mathbf{q}\) is a stochastic matrix:
- All \(q_{ij} \geq 0\)
- for each row \(i\), \(\sum_{j}{q_{ij}} = 1\)
Represent \(\Prob{X(0)}\) as a non-negative vector \(\InitDist\), summing to 1

Graph vs. matrix

\(\Leftrightarrow \mathbf{q}=\left[\begin{array}{cc} 0.5 & 0.5 \\ 0.75 & 0.25 \end{array} \right]\)

Your Basic Markov Chain

rmarkovchain <- function(n, p0, q) {
  k <- length(p0)
  stopifnot(k==nrow(q),k==ncol(q),all.equal(rowSums(q),rep(1,time=k)))
  rinitial <- function() { sample(1:k,size=1,prob=p0) }
  rtransition <- function(x) { sample(1:k,size=1,prob=q[x,]) }
  return(rmarkov(n,rinitial,rtransition))
}

It runs:

q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
x <- rmarkovchain(1e4,c(0.5,0.5),q)
head(x)

## [1] 1 1 2 2 1 2

How do we know it works?

Checking that it works

ones <- which(x[-1e4]==1)  # Why omit the last step?
twos <- which(x[-1e4]==2)
signif(table(x[ones+1])/length(ones),3)

## 
##     1     2 
## 0.504 0.496

signif(table(x[twos+1])/length(twos),3)

## 
##     1     2 
## 0.758 0.242

vs. \((0.5,0.5)\) and \((0.75, 0.25)\) ideally

Why is this a check?

For each \(t\) where \(X(t) =1\), the next state \(X(t+1)\) is drawn IIDly
So law of large numbers holds for \(X(t+1)\) conditional on \(X(t)=1\)
- and similarly conditional on \(X(t)=2\)
So sample conditional frequencies should converge on conditional probabilities

How trajectories evolve

You could try to write \[ X(t+1) = f(X(t)) + \epsilon(t+1) \] but it’s generally nasty
- Very non-linear \(f\)
- Distribution of \(\epsilon(t+1)\) depends on \(X(t)\)
Alternately write \(X(t+1) = g(X(t), \epsilon(t+1))\) but then \(g\) is even more complicated

Where trajectories end in the long run

State \(i\) and state \(j\) communicate if the chain can go from \(i\) to \(j\) and vice versa
A set of states communicates if every pair of states communicate
A maximal communicating set is irreducible
- \(=\) a strongly connected component in the graph
- \(\approx\) an attractor of a deterministic dynamical system
A set is closed if the probability of leaving it is 0
Finite chains break up into closed, irreducible sets + transient states
\(\Rightarrow\) the long run is always in some (closed) irreducible set or other

How distributions evolve

From the Markov property: \[\begin{eqnarray} \Prob{X(1)} & = & \InitDist \mathbf{q}\\ \Prob{X(2)} & = & \Prob{X(1)} \mathbf{q} = \InitDist \mathbf{q}^2\\ \Prob{X(t)} & = & \Prob{X(t-1)} \mathbf{q} = \InitDist \mathbf{q}^{t} \end{eqnarray}\]
The distribution always evolves linearly

More fun with eigenvalues and eigenvectors

\(\mathbf{q}\) has eigenvalues \(\lambda_1, \ldots \lambda_k\) and left eigenvectors \(\vec{v}_1, \ldots \vec{v}_k\)
Decompose the initial distribution in terms of the eigenvectors: \[ \InitDist = \sum_{j=1}^{k}{c_j \vec{v}_j} \]
So \[ \InitDist \mathbf{q} = \sum_{j=1}^{k}{\lambda_j c_j \vec{v}_j} \]

Special properties of stochastic matrices

All \(|\lambda_j| \leq 1\)
There is always at least one \(\lambda_j = 1\)
The eigenvectors which go with the \(\lambda=1\) eigenvalues have only non-negative entries
so those eigenvectors can be scaled to probability distributions
One \(\lambda_j=1\) eigenvalue per irreducible set
- The eigenvector is \(> 0\) on that set, \(=0\) off of it
The chain is periodic if all paths back to a given state have multiples of some length (e.g., 2)
- Periodicity \(\Leftrightarrow\) some eigenvalues \(\lambda \neq 1\) but \(|\lambda|=1\)
Otherwise the chain is aperiodic

Irreducible, aperiodic Markov chains

\(\lambda_1=1\), corresponding \(\vec{v}_1 > 0\) everywhere
- Can normalize \(\vec{v}_1\) so \(\sum_{i}{v_{1i}} = 1\)
All other \(|\lambda_j| < 1\)
Exercise: Show that, as \(t\rightarrow\infty\), \[ \InitDist \mathbf{q}^t \rightarrow \vec{v}_1 \] for all \(\InitDist\) (which are non-negative and sum to 1)

Irreducible, aperiodic Markov chains

We can say a bit more: \[ \mathbf{q}^{t} \rightarrow \mathbf{v}^{-1} \mathrm{diag}(1, 0, \ldots 0) \mathbf{v} = \left[ \begin{array}{c} \vec{v}_1\\ \vec{v}_1\\ \vdots \\ \vec{v}_1 \end{array} \right] \]
- i.e., each row converges to \(\vec{v}_1\)
- Use the eigendecomposition of \(\mathbf{q}^t\)
This is the basic ergodic theorem for Markov chains

Invariant distributions

The (left) eigenvector of \(\mathbf{q}\) with eigenvalue 1 is the invariant distribution: \[ \vec{v}_1 \mathbf{q} = \vec{v}_1 \]
- a.k.a. equilibrium distribution
This is important enough to deserve a special symbol, \(\InvDist\)
If \(X(t) \sim \InvDist\), then \((X(t), X(t+1)) \sim \InvDist \otimes \mathbf{q}\)
- \(=\) invariant distribution over pairs
- define invariant distribution over longer blocks similarly
If \(\InitDist=\InvDist\), then the process is (strictly) stationary

Ergodicity and Weak Dependence

Because \(\Prob{X(h)} \rightarrow \InvDist\) as \(h\rightarrow\infty\), \(X(t)\) and \(X(t+h)\) become independent as \(h\rightarrow \infty\)
- \(\Prob{X(t+h)|X(t)=x} \rightarrow \InvDist\) regardless of \(x\)
- Correlations actually decay exponentially in \(h\)
\(\therefore\) there’s an ergodic theorem: \[ \frac{1}{n}\sum_{t=1}^{n}{f(X(t))} \rightarrow \sum_{i=1}^{k}{f(i) \InvDist_i} = \Expectwrt{X \sim \InvDist}{f(X)} \]
Similarly for functions of pairs: \[ \frac{1}{n-1}\sum_{t=1}^{n-1}{g(X(t), X(t+1))} \rightarrow \sum_{i=1}^{k}{\sum_{j=1}^{k}{g(i,j) \InvDist_i \mathbf{q}_{ij}}} = \Expectwrt{(X, X^{\prime}) \sim \InvDist \otimes \mathbf{q}}{f(X, X^{\prime})} \]

Sample frequencies vs. probabilities

Define \(N_i \equiv\) number of times \(X(t)=i\)
Define \(N_{ij} \equiv\) number of times \(X(t)=i, X(t+1)=j\)
Then \[\begin{eqnarray} \frac{N_i}{n} & \rightarrow & \InvDist_i\\ \frac{N_{ij}}{n} & \rightarrow & \InvDist_i \mathbf{q}_{ij}\\ \frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} & \rightarrow & \frac{\InvDist_i \mathbf{q}_{ij}}{\InvDist_i} = \mathbf{q}_{ij} \end{eqnarray}\]

eigen(t(q))

## eigen() decomposition
## $values
## [1]  1.00 -0.25
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.8320503 -0.7071068
## [2,] 0.5547002  0.7071068

eigen(t(q))$vectors[,1]/sum(eigen(t(q))$vectors[,1])

## [1] 0.6 0.4

table(rmarkovchain(1e4,c(0.5,0.5),q))

## 
##    1    2 
## 5983 4017

table(rmarkovchain(1e4,c(0.5,0.5),q))

## 
##    1    2 
## 6038 3962

table(rmarkovchain(1e4,c(0,1),q))

## 
##    1    2 
## 5956 4044

table(rmarkovchain(1e4,c(1,0),q))

## 
##    1    2 
## 5987 4013

Central limit theorem

Because \(X(t)\) and \(X(t+h)\) become independent as \(h\rightarrow\infty\), the time average is \(\approx\) an average of IID blocks
- CLT would apply to an average of IID blocks
- For rigor, need to check that \(\approx\) becomes strict as \(n\rightarrow\infty\)
Usually works in practice:

time.avgs <- replicate(100, mean(rmarkovchain(1e4, c(0.5, 0.5), q)))
qqnorm(time.avgs); qqline(time.avgs)

What if there’s more than one irreducible set?

For a finite chain, eventually \(X(t)\) enters an irreducible set, and never leaves
\(\Rightarrow\) Long-run behavior of \(X(t)\) looks like it was always in that set
but which irreducible set is random
\(\Rightarrow\) Converge, but to a random limit

What if the state space isn’t finite?

There can still be irreducible sets (smaller than the whole space)
If \(X(t)\) enters an irreducible set, we have nice ergodic behavior
Might wander off to infinity, though
- Example: Random walk

What if the state space is continuous?

Again, we can have irreducible sets
Again, nice ergodic behavior within an irreducible set
Generally need a lot more math to handle this…

Higher-order Markov processes

Markov process: condition \(X(t+1)\) on \(X(t)\) only
\(p^{\mathrm{th}}\) order Markov process: condition \(X(t+1)\) on \(X(t-p+1), \ldots X(t-1), X(t)\)
- Example: AR(\(p\)) model
A higher-order Markov process is a first-order Markov process on blocks
- How long are the blocks for a \(p^{\mathrm{th}}\) order process?
- Will revisit when looking at estimation

First- vs. Higher- order Markov chains

Jason Capehart, Seung Su Han, Alexander Murray-Watters, and Elizabeth Silver in Statistical Computing (36-350) in 2011

Variations on the theme

Interacting or coupled Markov processes: transition rates for process 1 depend on its state and the state of process 2
Continuous time: stay in state for a random, exponentially-distributed time, then take a chain step
Semi-Markov chain: Like a CTMC, but non-exponential holding times
Hidden Markov Model (HMM): \(S(t)\) is Markov, but we see \(X(t) = f(S(t)) + \mathrm{noise}\)
- This isn’t Markov, but it is widely applied in biology, economics, geoscience, signal processing, predictive text…
Hidden semi-Markov models (Hoek and Elliott 2018), etc., etc.

Summary

Markov chains are the most basic non-trivial stochastic process
Individual trajectories are random and erratic and possibly very nonlinear
Distributions evolve deterministically, according to a linear operator
- Invariant distributions \(=\) leading eigenvectors
With finite state spaces:
- Each individual trajectory eventually enters an irreducible set, and wanders ergodically around it
- One invariant distribution per irreducible set
- Long-run averages look like expectations over the invariant distribution
With infinite state spaces: usually similar, but sometimes wandering off to infinity
Coming up: estimation and testing with Markov models

Backup: The philosophical origins of the Markov property

Early 19th century: administrations of nation-states begin collecting, and publishing, statistics on mass social phenomena
- “Vital statistics”: Birth rates, death rates, marriage rates, numbers of children…
- Accidents: the number of pieces of undeliverable mail each year in Paris was thought impressive
- Crimes: theft, murder, suicide
Mid-19th century: pioneering social scientists realize the rates of these phenomena are pretty stable
- … Even of things that arise from really profound moral choices, like murder and suicide
- … which is what you’d expect from the law of large numbers if there was just some constant probability
- … so they concluded that free will does not exist (“Society itself prepares the crime”)
See Hacking (1990), Porter (1986) for this history
- See Cassirer (1944) for how some people realized at the time that this was crazy-bad philosophy

Backup: The philosophical origins of the Markov property

P. A. Nekrasov: 19th century Russian mathematician and theologian
- Claimed law of large numbers requires independent events
- But human actions are dependent
- \(\therefore\) Law of large numbers does not apply
- \(\therefore\) Free will exists and Russian Orthodox Christianity is exactly true
- This was not even close to the strangest idea entertained by the Moscow math department (Graham and Kantor 2009)

Backup: The philosophical origins of the Markov property

A. A. Markov: Russian mathematician and vocal atheist
- Leading probability theorist and a very careful mathematician
- Viewed Nekrasov as his arch-nemesis
- Read a 1902 paper by Nekrasov and got enraged
- Invented the theory of Markov chains to show LLN and CLT can hold under dependence (first published 1906), essentially to refute Nekrasov
- See Basharin, Langville, and Naumov (2004) for Markov’s life and this history

Backup: TL;DR on the origins of the Markov property

Markov processes are the outcome of an 19th century flamewar between determinist statisticians and mystic mathematicians

Backup: Markov chain Monte Carlo

Suppose we want to sample from a pdf or pmf \(p\)
Exercise: Show that if detailed balance \(q(y|x)p(x) = q(x|y)p(y)\) holds, then \(p\) is the invariant distribution of the chain with transition rates \(q\)
In Markov chain Monte Carlo…
- we make a Markov chain with transition rates that obey this equation
- simulate a long trajectory from this chain
- and use the simulation as an (approximate) sample from \(p\)

Backup: Markov chain Monte Carlo

Practical MCMC is based on the Metropolis algorithm (Metropolis et al. 1953)

Set \(X(0)\) however we like, and initialize \(t \leftarrow 0\).
Draw a proposal \(Z(t)\) from some conditional distribution \(r(\cdot|X_t)\) — for instance a Gaussian or uniform centered on \(X(t)\), or anything else easy to draw from and with smooth enough noise.
Draw \(U(t)\) independently from a uniform distribution on \([0,1)\).
If \(U(t) < p(Z(t)))/p(X(t)))\), then \(X(t+1) = Z(t)\), otherwise \(X(t+1)=X(t)\)
Increase \(t\) by 1 and go to step 2.

Obeys detailed balance so it has the right invariant distribution

Backup: Further reading

Handouts on class website, excerpt from Guttorp (1995) on Canvas
Grimmett and Stirzaker (1992) is really good on Markov chains, and more general Markov processes
On invariant distributions in continuous state spaces, the standard reference is Meyn and Tweedie (1993), but Lasota and Mackey (1994) is also good, and Mackey (1992) is surprisingly readable and relevant

References

Basharin, Gely P., Amy N. Langville, and Valeriy A. Naumov. 2004. “The Life and Work of A. A. Markov.” Linear Algebra and Its Applications 386:3–26. http://langvillea.people.cofc.edu/MarkovReprint.pdf.

Cassirer, Ernst. 1944. An Essay on Man: An Introduction to a Philosophy of Human Culutre. New Haven, Connecticut: Yale University Press.

Graham, Loren, and Jean-Michel Kantor. 2009. Naming Infinity: A True Story of Religious Mysticism and Mathematical Creativity. Cambridge, Massachusetts: Harvard University Press.

Grimmett, G. R., and D. R. Stirzaker. 1992. Probability and Random Processes. 2nd ed. Oxford: Oxford University Press.

Guttorp, Peter. 1995. Stochastic Modeling of Scientific Data. London: Chapman; Hall.

Hacking, Ian. 1990. The Taming of Chance. Cambridge, England: Cambridge University Press.

Hoek, John van der, and Robert J. Elliott. 2018. Introduction to Hidden Semi-Markov Models. Cambridge, England: Cambridge University Press.

Lasota, Andrzej, and Michael C. Mackey. 1994. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics. Berlin: Springer-Verlag.

Mackey, Michael C. 1992. Time’s Arrow: The Origins of Thermodynamic Behavior. Berlin: Springer-Verlag.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equations of State Calculations by Fast Computing Machines.” Journal of Chemical Physics 21:1087–92. https://doi.org/10.1063/1.1699114.

Meyn, S. P., and R. L. Tweedie. 1993. Markov Chains and Stochastic Stability. Berlin: Springer-Verlag.

Porter, Theodore M. 1986. The Rise of Statistical Thinking, 1820–1900. Princeton, New Jersey: Princeton University Press.

Markov Chains I

In our previous episodes…

Markov Processes

The Markov property includes two extremes

Markov processes are generative models

Markov Chains

Graph vs. matrix

Your Basic Markov Chain

Checking that it works

Why is this a check?

How trajectories evolve

Where trajectories end in the long run

How distributions evolve

More fun with eigenvalues and eigenvectors

Special properties of stochastic matrices

Irreducible, aperiodic Markov chains

Irreducible, aperiodic Markov chains

Invariant distributions

Ergodicity and Weak Dependence

Sample frequencies vs. probabilities

Central limit theorem

What if there’s more than one irreducible set?

What if the state space isn’t finite?

What if the state space is continuous?

Higher-order Markov processes

First- vs. Higher- order Markov chains

Variations on the theme

Summary

Backup: The philosophical origins of the Markov property

Backup: The philosophical origins of the Markov property

Backup: The philosophical origins of the Markov property

Backup: TL;DR on the origins of the Markov property

Backup: Markov chain Monte Carlo

Backup: Markov chain Monte Carlo

Backup: Further reading

References