---
title: Markov Chains I
author: 36-467/36-667
date: 13 November 2018
output: slidy_presentation
bibliography: locusts.bib
---

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\Expectwrt}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\InitDist}{p_{\mathrm{init}}}
\newcommand{\InvDist}{p^{*}}$

```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white", cache=TRUE, autodep=TRUE,
               tidy=FALSE, warning=FALSE, message=FALSE, echo=TRUE)
```

## In our previous episodes...

- Looked at how to simulate a generative model
- Saw how to use those simulations for inference
- Now we introduce one of the most important kinds of generative models

## Markov Processes

- The **Markov property**: given the current **state** $X(t)$, earlier states $X(t-1), X(t-2), \ldots$ are irrelevant to the future states $X(t+1), X(t+2), \ldots$
- Equivalently:
$$\Prob{X(t+1)| X(t), X(t-1), X(t-2), \ldots X(0)} = \Prob{X(t+1)|X(t)}$$
- Equivalently:
$$\Prob{X(0), X(1), \ldots X(n)} = \Prob{X(0)} \Prob{X(1)|X(0)} \Prob{X(2)|X(1)} \ldots \Prob{X(n)|X(n-1)}$$
- This is an _assumption_, not a law of nature
- Some examples:
    + AR(1)
    + Random walk
    + Deterministic dynamics + noise in the state
- Some non-examples come later

## The Markov property between two extremes

- Independent and identically distributed (IID)
    + Technically Markov...
    + ...but even the current state doesn't matter
    + Nothing matters
- Deterministic dynamics
    + Technically Markov...
    + ...but current state fixes all future states _exactly_

## Markov processes are generative models

- To generate data from / simulate a Markov process, we need to
    + Draw the initial state $X(0)$ from $\Prob{X(0)}$
    + Draw $X(t)$ from $\Prob{X(t)|X(t-1)}$ --- inherently sequential
- Generic code:

```{r}
rmarkov <- function(n, rinitial, rtransition) {
    x <- vector(length=n)
    x[1] <- rinitial()
    for (t in 2:n) {
        x[t] <- rtransition(x[t-1])
    }
    return(x)
}
```

## Markov Chains

- A Markov **chain** is a Markov process where the states $X(t)$ are discrete, not continuous
- Represent $\Prob{X(t)|X(t-1)}$ in a **transition matrix**, $\mathbf{q}_{ij} \equiv \Prob{X(t) = j| X(t-1)=i}$
- $\mathbf{q}$ is a **stochastic matrix**:
    + All $q_{ij} \geq 0$
    + For each row $i$, $\sum_{j}{q_{ij}} = 1$
- Represent $\Prob{X(0)}$ as a non-negative vector $\InitDist$, summing to 1

## Graph vs. matrix

![](binary-graph.jpg)

$\Leftrightarrow \mathbf{q}=\left[\begin{array}{cc} 0.5 & 0.5 \\ 0.75 & 0.25 \end{array} \right]$

## Your Basic Markov Chain

```{r}
rmarkovchain <- function(n, p0, q) {
    k <- length(p0)
    stopifnot(k==nrow(q), k==ncol(q), all.equal(rowSums(q), rep(1, times=k)))
    rinitial <- function() { sample(1:k, size=1, prob=p0) }
    rtransition <- function(x) { sample(1:k, size=1, prob=q[x,]) }
    return(rmarkov(n, rinitial, rtransition))
}
```

- It runs:

```{r}
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
x <- rmarkovchain(1e4, c(0.5, 0.5), q)
head(x)
```

- How do we know it works?

## Checking that it works

```{r}
ones <- which(x[-1e4]==1) # Why omit the last step?
twos <- which(x[-1e4]==2)
signif(table(x[ones+1])/length(ones), 3)
signif(table(x[twos+1])/length(twos), 3)
```

vs. $(0.5, 0.5)$ and $(0.75, 0.25)$ ideally

## Why is this a check?
- For each $t$ where $X(t) = 1$, the next state $X(t+1)$ is drawn IIDly
- So the law of large numbers holds for $X(t+1)$ conditional on $X(t)=1$
    + and conditional on $X(t)=2$
- So sample conditional frequencies should converge on conditional probabilities

## How trajectories evolve

- You could try to write
$$X(t+1) = f(X(t)) + \epsilon(t+1)$$
but it's generally nasty
    + Very non-linear $f$
    + Distribution of $\epsilon$ depends on $X(t-1)$
    + Alternately write $X(t+1) = g(X(t), \epsilon(t+1))$, but then $g$ is even more complicated

## Where trajectories end in the long run

- State $i$ and state $j$ **communicate** if the chain can go from $i$ to $j$ and vice versa
- A set of states communicates if every pair of states communicates
- A maximal communicating set is **irreducible**
    + $=$ a **strongly connected component** in the graph
    + $\approx$ an **attractor** of a deterministic dynamical system
- A set is **closed** if the probability of leaving it is 0
- Finite chains break up into closed irreducible sets + transient states
    + $\Rightarrow$ the long run is always in some (closed) irreducible set or other

## How distributions evolve

- From the Markov property:
\begin{eqnarray}
\Prob{X(1)} & = & \InitDist \mathbf{q}\\
\Prob{X(2)} & = & \Prob{X(1)} \mathbf{q} = \InitDist \mathbf{q}^2\\
\Prob{X(t)} & = & \Prob{X(t-1)} \mathbf{q} = \InitDist \mathbf{q}^{t}
\end{eqnarray}
- The _distribution_ always evolves _linearly_

## More fun with eigenvalues and eigenvectors

- $\mathbf{q}$ has eigenvalues $\lambda_1, \ldots \lambda_k$ and _left_ eigenvectors $v_1, \ldots v_k$
- Decompose the initial distribution in terms of the eigenvectors:
$$\InitDist = \sum_{j=1}^{k}{c_j v_j}$$
- So
$$\InitDist \mathbf{q} = \sum_{j=1}^{k}{\lambda_j c_j v_j}$$

## Special properties of stochastic matrices

- All $|\lambda_j| \leq 1$
- There is always at least one $\lambda_j = 1$
- The eigenvectors which go with those eigenvalues have non-negative entries
    + so those eigenvectors can be scaled to probability distributions
- One $\lambda_j=1$ eigenvalue per
irreducible set
    + The eigenvector is $> 0$ on that set, $=0$ off of it
- The chain is **periodic** if all paths back to a given state have lengths which are multiples of some common length (e.g., 2)
    + Periodicity $\Leftrightarrow$ other eigenvalues with $|\lambda|=1$
- Otherwise the chain is **aperiodic**

## Irreducible, aperiodic Markov chains

- $\lambda_1=1$, corresponding $v_1 > 0$ everywhere
    + Can normalize $v_1$ so $\sum_{i}{v_{1i}} = 1$
- All other $|\lambda_j| < 1$
- **Exercise**: Show that, as $t\rightarrow\infty$,
$$\InitDist \mathbf{q}^t \rightarrow v_1$$
for all $\InitDist$ (which are non-negative and sum to 1)

## Solution

- Because $|\lambda_j| < 1$ if $j \geq 2$ but $\lambda_1=1$,
$$\Prob{X(t)} = \sum_{j=1}^{k}{c_j \lambda_j^t v_j} \rightarrow c_1 v_1$$
- Probabilities must add to 1, and this fixes $c_1$:
\begin{eqnarray}
\sum_{i=1}^{k}{\Prob{X(t)=i}} & = & 1\\
\sum_{i=1}^{k}{\sum_{j=1}^{k}{c_j \lambda_j^t v_{ji}}} & = & 1\\
\sum_{i=1}^{k}{c_1 v_{1i}} & = & 1 ~ \text{(limit of both sides as} ~ t\rightarrow\infty)\\
c_1\sum_{i=1}^{k}{v_{1i}} & = & 1\\
c_1 & = & 1
\end{eqnarray}

## Irreducible, aperiodic Markov chains

- We can say a bit more:
$$\mathbf{q}^{t} \rightarrow \mathbf{v}^{-1} \mathrm{diag}(1, 0, \ldots 0) \mathbf{v} = \left[ \begin{array}{c} v_1\\ v_1\\ \vdots \\ v_1 \end{array} \right]$$
    + i.e., each row converges to $v_1$
    + Use the eigendecomposition of $\mathbf{q}^t$
- This is the basic ergodic theorem for Markov chains

## Invariant distributions

- The (left) eigenvector of $\mathbf{q}$ with eigenvalue 1 is the **invariant** distribution:
$$v_1 \mathbf{q} = v_1$$
    + a.k.a.
**equilibrium** distribution
- This is important enough to deserve a special symbol, $\InvDist$
- If $X(t) \sim \InvDist$, then $(X(t), X(t+1)) \sim \InvDist \otimes \mathbf{q}$
    + $=$ the invariant distribution over pairs
    + Define invariant distributions over longer blocks similarly
- If $\InitDist=\InvDist$, then the process is (strictly) stationary

## Ergodicity and Weak Dependence

- Because $\Prob{X(h)} \rightarrow \InvDist$ as $h\rightarrow\infty$, $X(t)$ and $X(t+h)$ become independent as $h\rightarrow \infty$
    + $\Prob{X(t+h)|X(t)=x} \rightarrow \InvDist$ regardless of $x$
    + Correlations actually decay exponentially in $h$
- $\therefore$ there's an ergodic theorem:
$$\frac{1}{n}\sum_{t=1}^{n}{f(X(t))} \rightarrow \sum_{i=1}^{k}{f(i) \InvDist_i} = \Expectwrt{X \sim \InvDist}{f(X)}$$
- Similarly for functions of pairs:
$$\frac{1}{n-1}\sum_{t=1}^{n-1}{g(X(t), X(t+1))} \rightarrow \sum_{i=1}^{k}{\sum_{j=1}^{k}{g(i,j) \InvDist_i \mathbf{q}_{ij}}} = \Expectwrt{(X, X^{\prime}) \sim \InvDist \otimes \mathbf{q}}{g(X, X^{\prime})}$$

## Sample frequencies vs. probabilities {.smaller}

- Define $N_i \equiv$ the number of times $X(t)=i$
- Define $N_{ij} \equiv$ the number of times $X(t)=i, X(t+1)=j$
- Then
\begin{eqnarray}
\frac{N_i}{n} & \rightarrow & \InvDist_i\\
\frac{N_{ij}}{n} & \rightarrow & \InvDist_i \mathbf{q}_{ij}\\
\frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} & \rightarrow & \frac{\InvDist_i \mathbf{q}_{ij}}{\InvDist_i} = \mathbf{q}_{ij}
\end{eqnarray}

## {.smaller}

```{r}
eigen(t(q))
eigen(t(q))$vectors[,1]/sum(eigen(t(q))$vectors[,1])
```

## {.smaller}

```{r}
table(rmarkovchain(1e4, c(0.5, 0.5), q))
table(rmarkovchain(1e4, c(0.5, 0.5), q))
table(rmarkovchain(1e4, c(0, 1), q))
table(rmarkovchain(1e4, c(1, 0), q))
```

## Central limit theorem

- Because $X(t)$ and $X(t+h)$ become independent as $h\rightarrow\infty$, the time average is $\approx$ an average of IID blocks
    + The CLT would apply to an average of IID blocks
    + For rigor, need to check that the $\approx$ becomes strict as $n\rightarrow\infty$
- Usually works in practice:

```{r}
time.avgs <- replicate(100, mean(rmarkovchain(1e4, c(0.5, 0.5), q)))
qqnorm(time.avgs); qqline(time.avgs)
```

## What if there's more than one irreducible set?

- For a finite chain, eventually $X(t)$ enters an irreducible set, and never leaves
- $\Rightarrow$ The long-run behavior of $X(t)$ looks like it was _always_ in that set
- But _which_ irreducible set it enters is random
- $\Rightarrow$ Converges, but to a random limit

## What if the state space isn't finite?

- There can still be irreducible sets (smaller than the whole space)
- If $X(t)$ enters an irreducible set, we have nice ergodic behavior
- Might wander off to infinity, though
    + Example: the random walk

## What if the state space is continuous?

- Again, we can have irreducible sets
- Again, nice ergodic behavior within an irreducible set
- Generally need a lot more math to handle this...

## Variations on the theme

- Higher-order Markov process: need to condition on $X(t-p+1), \ldots X(t-1), X(t)$
    + $=$ a Markov process on blocks of $p$ successive states
    + AR$(p)$ is a $p^{\mathrm{th}}$-order Markov process
- Interacting/coupled Markov processes: transition rates for process 1 depend on its own state _and_ the state of process 2
- Continuous time: stay in each state for a random time, with an exponential distribution, then take a chain step
- Semi-Markov process: like a continuous-time Markov chain, but with non-exponential holding times
- Hidden Markov model (HMM): $S(t)$ is Markov, but we see $X(t) = f(S(t)) + \mathrm{noise}$
    + This $X(t)$ isn't Markov
    + MA$(q)$ is actually an HMM!
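
The HMM variation can be sketched in a few lines of R. This is a minimal illustration, not anything defined in these slides: the function name `rhmm`, the two-state transition matrix, and the Gaussian observation noise level `sd.noise` are all illustrative assumptions.

```{r}
# Sketch of a hidden Markov model (illustrative, not from the slides):
# the hidden state S(t) is an ordinary two-state Markov chain, but we
# only observe X(t) = S(t) + Gaussian noise
rhmm <- function(n, p0, q, sd.noise) {
    k <- length(p0)
    s <- vector(length=n)
    s[1] <- sample(1:k, size=1, prob=p0)             # hidden initial state
    for (t in 2:n) {
        s[t] <- sample(1:k, size=1, prob=q[s[t-1],]) # hidden Markov step
    }
    return(list(states=s, observations=s + rnorm(n, sd=sd.noise)))
}
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
sim <- rhmm(100, c(0.5, 0.5), q, sd.noise=0.1)
```

The observed series `sim$observations` hops around the hidden states 1 and 2 without itself being Markov, which is the point of the HMM construction.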
## Summary

- Markov chains are the most basic non-trivial stochastic processes
- Individual trajectories are random and erratic
- Distributions evolve according to a linear operator
    + Invariant distributions $=$ leading eigenvectors
- With _finite_ state spaces:
    + Each individual trajectory eventually enters an irreducible set, and wanders ergodically around it
    + One invariant distribution per irreducible set
    + Long-run averages look like expectations over the invariant distribution

## Backup: Further reading

- A handout will be posted to the class website
- @Grimmett-Stirzaker is really good on Markov chains, and on more general Markov processes
- On invariant distributions in continuous state spaces, the standard reference is @Meyn-Tweedie-1, but @Lasota-Mackey is also good, and @Mackey-times-arrow is surprisingly readable and relevant

## Backup: The philosophical origins of the Markov property {.smaller}

- Early 19th century: the administrations of nation-states begin collecting, and publishing, statistics on mass social phenomena
    + "Vital statistics": birth rates, death rates, marriage rates, numbers of children...
    + Accidents: the number of pieces of undeliverable mail each year in Paris was thought impressive
    + Crimes: theft, murder, suicide
- Mid-19th century: pioneering social scientists realize that the rates of these phenomena are pretty stable
    + ... even of things that arise from really profound moral choices, like murder and suicide
    + ... which is what you'd expect from the law of large numbers if there were just some constant probability
    + ... so they concluded that free will does not exist (["Society itself prepares the crime"](http://www-history.mcs.st-and.ac.uk/Extras/Quetelet_crime.html))
- See @Hacking-chance and @Porter-rise for this history
    + See @Cassirer-on-man for how some people realized at the time that this was crazy-bad philosophy

## Backup: The philosophical origins of the Markov property

- P. A.
Nekrasov: 19th-century Russian mathematician and theologian
    + Claimed the law of large numbers requires independent events
    + But human actions are dependent
    + $\therefore$ The law of large numbers does not apply
    + $\therefore$ Free will exists and Russian Orthodox Christianity is exactly true
    + This was not even close to the strangest idea entertained by the Moscow math department [@Naming-infinity]

## Backup: The philosophical origins of the Markov property

![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/AAMarkov.jpg/369px-AAMarkov.jpg)

- A. A. Markov: Russian mathematician and vocal atheist
    + Leading probability theorist and a very careful mathematician
    + Viewed Nekrasov as his arch-nemesis
    + Read a 1902 paper by Nekrasov and got enraged
    + Invented the theory of Markov chains to show that the LLN and CLT can hold under dependence (first published 1906), essentially to refute Nekrasov
    + See @Basharin-et-al-on-Markov for Markov's life and this history

## Backup: TL;DR on the origins of the Markov property

- Markov processes are the outcome of a 19th-century flamewar between determinist statisticians and mystic mathematicians

## Backup: Markov chain Monte Carlo

- Suppose we want to sample from a pdf or pmf $p$
- **Exercise**: Show that if **detailed balance**, $q(y|x)p(x) = q(x|y)p(y)$, holds, then $p$ is the invariant distribution of the chain with transition rates $q$
- In **Markov chain Monte Carlo**...
    + we make a Markov chain with transition rates that obey this equation,
    + simulate a long trajectory from this chain,
    + and use the simulation as an (approximate) sample from $p$

## Backup: Markov chain Monte Carlo

- Practical MCMC is based on the **Metropolis algorithm** [@Metropolis-et-al-Monte-Carlo]

1. Set $X(0)$ however we like, and initialize $t \leftarrow 0$.
2. Draw a **proposal** $Z(t)$ from some conditional distribution $r(\cdot|X(t))$ --- for instance a Gaussian or uniform centered on $X(t)$, or anything else easy to draw from and with smooth enough noise.
3. Draw $U(t)$ independently from a uniform distribution on $[0,1)$.
4. If $U(t) < p(Z(t))/p(X(t))$, then $X(t+1) = Z(t)$; otherwise $X(t+1)=X(t)$.
5. Increase $t$ by 1 and go to step 2.

- This obeys detailed balance, so it has the right invariant distribution

## References