---
title: Markov Chains I
author: 36-467/36-667
date: 13 November 2018
output: slidy_presentation
bibliography: locusts.bib
---
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\Expectwrt}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\InitDist}{p_{\mathrm{init}}}
\newcommand{\InvDist}{p^{*}}
\]
```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white",
               cache=TRUE, autodep=TRUE,
               tidy=FALSE, warning=FALSE, message=FALSE,
               echo=TRUE)
```
## In our previous episodes...
- Looked at how to simulate a generative model
- Saw how to use those simulations for inference
- Now introduce one of the most important kinds of generative models
## Markov Processes
- The **Markov property**: Given the current **state** $X(t)$, earlier states
$X(t-1), X(t-2), \ldots$ are irrelevant to the future states $X(t+1), X(t+2), \ldots$
- Equivalently:
\[
\Prob{X(t+1)| X(t), X(t-1), X(t-2), \ldots X(0)} = \Prob{X(t+1)|X(t)}
\]
- Equivalently:
\[
\Prob{X(0), X(1), \ldots X(n)} = \Prob{X(0)} \Prob{X(1)|X(0)} \Prob{X(2)|X(1)} \ldots \Prob{X(n)|X(n-1)}
\]
- This is an _assumption_, not a law of nature
- Some examples:
+ AR(1)
+ Random walk
+ Deterministic dynamics + noise in the state
- Some non-examples come later
## The Markov property between two extremes
- Independent and identically distributed (IID)
+ Technically Markov...
+ ...but even the current state doesn't matter
+ Nothing matters
- Deterministic dynamics
+ Technically Markov...
+ ...but current state fixes all future states _exactly_
## Markov processes are generative models
- To generate data from / simulate a Markov process, we need to
+ Draw the initial state $X(0)$ from $\Prob{X(0)}$
+ Draw $X(t)$ from $\Prob{X(t)|X(t-1)}$ --- inherently sequential
- Generic code:
```{r}
rmarkov <- function(n, rinitial, rtransition) {
  x <- vector(length=n)
  x[1] <- rinitial()             # draw X(0) from the initial distribution
  for (t in 2:n) {
    x[t] <- rtransition(x[t-1])  # draw X(t) given X(t-1)
  }
  return(x)
}
```
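- For instance, a Gaussian random walk is a Markov process, so it fits this interface directly (a minimal, illustrative sketch; the random walk is not the running example of these slides):
```{r}
# Illustrative example: simulate a Gaussian random walk with rmarkov()
rw <- rmarkov(n=100,
              rinitial=function() { rnorm(1) },
              rtransition=function(x) { x + rnorm(1) })
plot(rw, type="l", xlab="t", ylab="X(t)")
```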
## Markov Chains
- A Markov **chain** is a Markov process where the states $X(t)$ are discrete, not continuous
- Represent $\Prob{X(t)|X(t-1)}$ in a **transition matrix**, $\mathbf{q}_{ij} \equiv \Prob{X(t) = j| X(t-1)=i}$
- $\mathbf{q}$ is a **stochastic matrix**:
+ All $q_{ij} \geq 0$
+ for each row $i$, $\sum_{j}{q_{ij}} = 1$
- Represent $\Prob{X(0)}$ as a non-negative vector $\InitDist$, summing to 1
## Graph vs. matrix
![](binary-graph.jpg) $\Leftrightarrow \mathbf{q}=\left[\begin{array}{cc} 0.5 & 0.5 \\ 0.75 & 0.25 \end{array} \right]$
## Your Basic Markov Chain
```{r}
rmarkovchain <- function(n, p0, q) {
  k <- length(p0)
  # sanity checks: q must be a k-by-k stochastic matrix
  stopifnot(k==nrow(q), k==ncol(q), all.equal(rowSums(q), rep(1, times=k)))
  rinitial <- function() { sample(1:k, size=1, prob=p0) }
  rtransition <- function(x) { sample(1:k, size=1, prob=q[x,]) }
  return(rmarkov(n, rinitial, rtransition))
}
```
- It runs:
```{r}
q <- matrix(c(0.5, 0.5, 0.75, 0.25), byrow=TRUE, nrow=2)
x <- rmarkovchain(1e4,c(0.5,0.5),q)
head(x)
```
- How do we know it works?
## Checking that it works
```{r}
ones <- which(x[-1e4]==1) # Why omit the last step?
twos <- which(x[-1e4]==2)
signif(table(x[ones+1])/length(ones),3)
signif(table(x[twos+1])/length(twos),3)
```
vs.\ $(0.5,0.5)$ and $(0.75, 0.25)$ ideally
## Why is this a check?
- For each $t$ where $X(t)=1$, the next state $X(t+1)$ is an independent draw from the same distribution (row 1 of $\mathbf{q}$)
- So law of large numbers holds for $X(t+1)$ conditional on $X(t)=1$
+ and conditional on $X(t)=2$
- So sample conditional frequencies should converge on conditional probabilities
## How trajectories evolve
- You could try to write
\[
X(t+1) = f(X(t)) + \epsilon(t+1)
\]
but it's generally nasty
+ Very non-linear $f$
+ Distribution of $\epsilon$ depends on $X(t)$
+ Alternatively, write $X(t+1) = g(X(t), \epsilon(t+1))$, but then $g$ is even more complicated
## Where trajectories end in the long run
- State $i$ and state $j$ **communicate** if the chain can go from $i$ to $j$ and vice versa
- A set of states communicates if every pair of states in it communicates
- A maximal communicating set is **irreducible**
+ $=$ a **strongly connected component** in the graph
+ $\approx$ an **attractor** of a deterministic dynamical system
- A set is **closed** if the probability of leaving it is 0
- Finite chains break up into closed irreducible sets + transient states
+ $\Rightarrow$ the long run is always in some (closed) irreducible set or other
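- A sketch of the decomposition in action, with a hypothetical 3-state chain: state 1 is transient, states 2 and 3 form a closed irreducible set
```{r}
# Hypothetical 3-state chain: state 1 can be left but never re-entered
q.red <- matrix(c(0.5, 0.25, 0.25,
                  0,   0.50, 0.50,
                  0,   0.75, 0.25), byrow=TRUE, nrow=3)
x.red <- rmarkovchain(1e4, c(1, 0, 0), q.red)
max(which(x.red == 1))  # last visit to the transient state (typically very early)
```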
## How distributions evolve
- From the Markov property:
\begin{eqnarray}
\Prob{X(1)} & = & \InitDist \mathbf{q}\\
\Prob{X(2)} & = & \Prob{X(1)} \mathbf{q} = \InitDist \mathbf{q}^2\\
\Prob{X(t)} & = & \Prob{X(t-1)} \mathbf{q} = \InitDist \mathbf{q}^{t}
\end{eqnarray}
- The _distribution_ always evolves _linearly_
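- As a quick numerical check (a sketch reusing the 2-state $\mathbf{q}$ defined earlier), start with all the probability on state 1 and iterate:
```{r}
# Evolve the distribution: Prob{X(t)} = Prob{X(t-1)} q
p <- c(1, 0)  # p_init: start in state 1 for sure
for (t in 1:5) {
  p <- p %*% q
  print(p)
}
```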
## More fun with eigenvalues and eigenvectors
- $\mathbf{q}$ has eigenvalues $\lambda_1, \ldots \lambda_k$ and _left_ eigenvectors $v_1, \ldots v_k$
- Decompose the initial distribution in terms of the eigenvectors:
\[
\InitDist = \sum_{j=1}^{k}{c_j v_j}
\]
- So
\[
\InitDist \mathbf{q} = \sum_{j=1}^{k}{\lambda_j c_j v_j}
\]
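- Numerically (a sketch, again using the 2-state $\mathbf{q}$ from earlier):
```{r}
# Decompose p_init in the left-eigenvector basis, then reconstruct p_init q
e <- eigen(t(q))                          # columns = left eigenvectors of q
c.coef <- solve(e$vectors, c(0.5, 0.5))   # coordinates c_j of p_init
drop(e$vectors %*% (e$values * c.coef))   # sum_j lambda_j c_j v_j ...
c(0.5, 0.5) %*% q                         # ... should match p_init q
```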
## Special properties of stochastic matrices
- All $|\lambda_j| \leq 1$
- There is always at least one $\lambda_j = 1$
- The eigenvectors which go with those eigenvalues have non-negative entries
- so those eigenvectors can be scaled to probability distributions
- One $\lambda_j=1$ eigenvalue per irreducible set
+ The eigenvector is $> 0$ on that set, $=0$ off of it
- The chain is **periodic** if every path from a state back to itself has a length that is a multiple of some integer $> 1$ (e.g., 2)
+ Periodicity $\Leftrightarrow$ some other eigenvalues also have $|\lambda|=1$
- Otherwise the chain is **aperiodic**
## Irreducible, aperiodic Markov chains
- $\lambda_1=1$, corresponding $v_1 > 0$ everywhere
+ Can normalize $v_1$ so $\sum_{i}{v_{1i}} = 1$
- All other $|\lambda_j| < 1$
- **Exercise**: Show that, as $t\rightarrow\infty$,
\[
\InitDist \mathbf{q}^t \rightarrow v_1
\]
for all $\InitDist$ (which are non-negative and sum to 1)
## Solution
- Because $|\lambda_j| < 1$ if $j \geq 2$ but $\lambda_1=1$,
\[
\Prob{X(t)} = \sum_{j=1}^{k}{c_j \lambda_j^t v_j} \rightarrow c_1 v_1
\]
- Probabilities must add to 1, and this fixes $c_1$:
\begin{eqnarray}
\sum_{i=1}^{k}{\Prob{X(t)=i}} & = & 1\\
\sum_{i=1}^{k}{\sum_{j=1}^{k}{c_j \lambda_j^t v_{ji}}} & = & 1\\
\sum_{i=1}^{k}{c_1 v_{1i}} & = & 1 ~ \text{(limit of both sides as} ~ t\rightarrow\infty)\\
c_1\sum_{i=1}^{k}{v_{1i}} & = & 1\\
c_1 & = & 1
\end{eqnarray}
## Irreducible, aperiodic Markov chains
- We can say a bit more:
\[
\mathbf{q}^{t} \rightarrow \mathbf{v}^{-1} \mathrm{diag}(1, 0, \ldots 0) \mathbf{v} = \left[ \begin{array}{c} v_1\\ v_1\\ \vdots \\ v_1 \end{array} \right]
\]
+ i.e., each row converges to $v_1$ (the rows of the matrix $\mathbf{v}$ here are the left eigenvectors $v_j$)
+ Use the eigendecomposition $\mathbf{q} = \mathbf{v}^{-1} \mathrm{diag}(\lambda_1, \ldots \lambda_k) \mathbf{v}$, so $\mathbf{q}^t = \mathbf{v}^{-1} \mathrm{diag}(\lambda_1^t, \ldots \lambda_k^t) \mathbf{v}$
- This is the basic ergodic theorem for Markov chains
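- A numerical sketch of the theorem: raise the 2-state $\mathbf{q}$ from earlier to a high power and look at the rows
```{r}
# Both rows of q^t should approach v_1 = (0.6, 0.4)
qt <- q
for (i in 1:50) { qt <- qt %*% q }
qt
```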
## Invariant distributions
- The (left) eigenvector of $\mathbf{q}$ with eigenvalue 1 is the **invariant** distribution:
\[
v_1 \mathbf{q} = v_1
\]
+ a.k.a. **equilibrium** distribution
- This is important enough to deserve a special symbol, $\InvDist$
- If $X(t) \sim \InvDist$, then $(X(t), X(t+1)) \sim \InvDist \otimes \mathbf{q}$
+ $=$ invariant distribution over pairs
+ define invariant distribution over longer blocks similarly
- If $\InitDist=\InvDist$, then the process is (strictly) stationary
## Ergodicity and Weak Dependence
- Because $\Prob{X(h)} \rightarrow \InvDist$ as $h\rightarrow\infty$,
$X(t)$ and $X(t+h)$ become independent as $h\rightarrow \infty$
+ $\Prob{X(t+h)|X(t)=x} \rightarrow \InvDist$ regardless of $x$
+ Correlations actually decay exponentially in $h$
- $\therefore$ there's an ergodic theorem:
\[
\frac{1}{n}\sum_{t=1}^{n}{f(X(t))} \rightarrow \sum_{i=1}^{k}{f(i) \InvDist_i} = \Expectwrt{X \sim \InvDist}{f(X)}
\]
- Similarly for functions of pairs:
\[
\frac{1}{n-1}\sum_{t=1}^{n-1}{g(X(t), X(t+1))} \rightarrow \sum_{i=1}^{k}{\sum_{j=1}^{k}{g(i,j) \InvDist_i \mathbf{q}_{ij}}} = \Expectwrt{(X, X^{\prime}) \sim \InvDist \otimes \mathbf{q}}{g(X, X^{\prime})}
\]
## Sample frequencies vs.\ probabilities {.smaller}
- Define $N_i \equiv$ number of times $X(t)=i$
- Define $N_{ij} \equiv$ number of times $X(t)=i, X(t+1)=j$
- Then
\begin{eqnarray}
\frac{N_i}{n} & \rightarrow & \InvDist_i\\
\frac{N_{ij}}{n} & \rightarrow & \InvDist_i \mathbf{q}_{ij}\\
\frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} & \rightarrow & \frac{\InvDist_i \mathbf{q}_{ij}}{\InvDist_i} = \mathbf{q}_{ij}
\end{eqnarray}
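- In code (a sketch, reusing the trajectory `x` simulated earlier):
```{r}
# Estimate q by the transition frequencies N_ij / N_i
N <- table(x[-length(x)], x[-1])    # N_ij: times X(t)=i is followed by X(t+1)=j
signif(prop.table(N, margin=1), 3)  # row-normalized: should be close to q
```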
## {.smaller}
```{r}
eigen(t(q))  # left eigenvectors of q are right eigenvectors of t(q)
eigen(t(q))$vectors[,1]/sum(eigen(t(q))$vectors[,1])  # rescale to get p*
```
## {.smaller}
```{r}
table(rmarkovchain(1e4,c(0.5,0.5),q))
table(rmarkovchain(1e4,c(0.5,0.5),q))
table(rmarkovchain(1e4,c(0,1),q))
table(rmarkovchain(1e4,c(1,0),q))
```
## Central limit theorem
- Because $X(t)$ and $X(t+h)$ become independent as $h\rightarrow\infty$,
the time average is $\approx$ an average of IID blocks
+ CLT would apply to an average of IID blocks
+ For rigor, need to check that $\approx$ becomes strict as $n\rightarrow\infty$
- Usually works in practice:
```{r}
time.avgs <- replicate(100, mean(rmarkovchain(1e4, c(0.5, 0.5), q)))
qqnorm(time.avgs); qqline(time.avgs)
```
## What if there's more than one irreducible set?
- For a finite chain, eventually $X(t)$ enters an irreducible set, and never
leaves
- $\Rightarrow$ Long-run behavior of $X(t)$ looks like it was _always_ in that set
- but _which_ irreducible set is random
- $\Rightarrow$ Converge, but to a random limit
## What if the state space isn't finite?
- There can still be irreducible sets (smaller than the whole space)
- If $X(t)$ enters an irreducible set, we have nice ergodic behavior
- Might wander off to infinity, though
+ Example: Random walk
## What if the state space is continuous?
- Again, we can have irreducible sets
- Again, nice ergodic behavior within an irreducible set
- Generally need a lot more math to handle this...
## Variations on the theme
- Higher-order Markov process: Need to condition on $X(t-p+1), \ldots X(t-1), X(t)$
+ $=$ A Markov process on blocks
+ AR$(p)$ is a $p^{\mathrm{th}}$-order Markov process
- Interacting/coupled Markov processes: transition rates for process 1 depend on its state _and_ the state of process 2
- Continuous time: stay in state for a random time, with exponential distribution, then take a chain step
- Semi-Markov chain: Like a CTMC, but non-exponential holding times
- Hidden Markov Model (HMM): $S(t)$ is Markov, but we see $X(t) = f(S(t)) + \mathrm{noise}$
+ $X(t)$ by itself isn't Markov
+ MA$(q)$ is actually an HMM!
## Summary
- Markov chains are the most basic non-trivial stochastic processes
- Individual trajectories are random and erratic
- Distributions evolve according to a linear operator
+ Invariant distributions $=$ leading eigenvectors
- With _finite_ state spaces:
+ Each individual trajectory eventually enters an irreducible set, and wanders ergodically around it
+ One invariant distribution per irreducible set
+ Long-run averages look like expectations over the invariant distribution
## Backup: Further reading
- Handout will be posted to the class website
- @Grimmett-Stirzaker is really good on Markov chains, and more general Markov processes
- On invariant distributions in continuous state spaces, the standard reference is @Meyn-Tweedie-1, but @Lasota-Mackey is also good, and @Mackey-times-arrow is surprisingly readable and relevant
## Backup: The philosophical origins of the Markov property {.smaller}
- Early 19th century: administrations of nation-states begin collecting, and publishing, statistics on mass social phenomena
+ "Vital statistics": Birth rates, death rates, marriage rates, numbers of children...
+ Accidents: the number of pieces of undeliverable mail each year in Paris was thought impressive
+ Crimes: theft, murder, suicide
- Mid-19th century: pioneering social scientists realize the rates of these phenomena are pretty stable
+ ... Even of things that arise from really profound moral choices, like murder and suicide
+ ... which is what you'd expect from the law of large numbers if there were just some constant probability
+ ... so they concluded that free will does not exist (["Society itself prepares the crime"](http://www-history.mcs.st-and.ac.uk/Extras/Quetelet_crime.html))
- See @Hacking-chance, @Porter-rise for this history
+ See @Cassirer-on-man for how some people realized at the time that this was crazy-bad philosophy
## Backup: The philosophical origins of the Markov property
- P. A. Nekrasov: 19th century Russian mathematician and theologian
+ Claimed law of large numbers requires independent events
+ But human actions are dependent
+ $\therefore$ Law of large numbers does not apply
+ $\therefore$ Free will exists and Russian Orthodox Christianity is exactly true
+ This was not even close to the strangest idea entertained by the Moscow math department [@Naming-infinity]
## Backup: The philosophical origins of the Markov property
![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/AAMarkov.jpg/369px-AAMarkov.jpg)
- A. A. Markov: Russian mathematician and vocal atheist
+ Leading probability theorist and a very careful mathematician
+ Viewed Nekrasov as his arch-nemesis
+ Read a 1902 paper by Nekrasov and got enraged
+ Invented the theory of Markov chains to show LLN and CLT can hold under dependence (first published 1906), essentially to refute Nekrasov
+ See @Basharin-et-al-on-Markov for Markov's life and this history
## Backup: TL;DR on the origins of the Markov property
- Markov processes are the outcome of a 19th-century flamewar between determinist statisticians and mystic mathematicians
## Backup: Markov chain Monte Carlo
- Suppose we want to sample from a pdf or pmf $p$
- Exercise: Show that if **detailed balance** $q(y|x)p(x) = q(x|y)p(y)$ holds, then $p$ is the invariant distribution of the chain with transition rates $q$
- In **Markov chain Monte Carlo**...
+ we make a Markov chain with transition rates that obey this equation
+ simulate a long trajectory from this chain
+ and use the simulation as an (approximate) sample from $p$
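- A numerical check of detailed balance for the 2-state $\mathbf{q}$ from earlier (two-state chains always satisfy it with their invariant distribution):
```{r}
# p* = (0.6, 0.4) for the running 2-state chain
p.star <- c(0.6, 0.4)
c(p.star[1]*q[1,2], p.star[2]*q[2,1])  # q(2|1)p*(1) and q(1|2)p*(2): both 0.3
```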
## Backup: Markov chain Monte Carlo
- Practical MCMC is based on the **Metropolis algorithm** [@Metropolis-et-al-Monte-Carlo]
1. Set $X(0)$ however we like, and initialize $t \leftarrow 0$.
2. Draw a **proposal** $Z(t)$ from some conditional distribution $r(\cdot|X(t))$ --- for instance a Gaussian or uniform centered on $X(t)$, or anything else easy to draw from and with smooth enough noise.
3. Draw $U(t)$ independently from a uniform distribution on $[0,1)$.
4. If $U(t) < p(Z(t))/p(X(t))$, then $X(t+1) = Z(t)$; otherwise $X(t+1)=X(t)$
5. Increase $t$ by 1 and go to step 2.
- With a symmetric proposal $r$ (as above), this obeys detailed balance, so it has the right invariant distribution
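- A minimal sketch in R (the function name and the Gaussian target are illustrative, not part of the algorithm's statement):
```{r}
# Metropolis sampler with a symmetric uniform proposal; target density p
rmetropolis <- function(n, p, x0, width=1) {
  x <- vector(length=n)
  x[1] <- x0
  for (t in 2:n) {
    z <- x[t-1] + runif(1, min=-width, max=width)  # proposal Z(t)
    if (runif(1) < p(z)/p(x[t-1])) {               # accept if U(t) < p(Z)/p(X)
      x[t] <- z
    } else {
      x[t] <- x[t-1]                               # otherwise stay put
    }
  }
  return(x)
}
mcmc.draws <- rmetropolis(1e4, p=dnorm, x0=0)
qqnorm(mcmc.draws); qqline(mcmc.draws)  # should be close to the Gaussian target
```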
## References