36-467/667, Fall 2020

5 November 2020 (Lecture 19)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\transition}{\mathbf{q}} \newcommand{\loglike}{L_n} \newcommand{\TrueTransition}{\transition^*} \newcommand{\InitDist}{p_{\mathrm{init}}} \newcommand{InvDist}{p^*} \]

- Markov process: future values of \(X\) independent of past given current state \(X(t)\)
- Markov chains: Markov process with \(X(t)\) discrete valued, say \(k\) different states
- Transition matrix \(\transition\), \(q_{ij} = \Prob{X(t+1) = j| X(t) = i}\)
- \(X(0)\) follows distribution \(\InitDist\)

- Invariant distribution \(\InvDist\) is the left eigenvector of \(\transition\) with eigenvalue 1
- Ergodicity: for a finite chain, distribution of \(X_t\) converges on \(\InvDist\), and time-averages converge on expectations w.r.t. \(\InvDist\)
- Often ergodicity for continuous-valued Markov processes as well

- Individual level: A whole trajectory or realization, \(X(0), X(1), X(2), \ldots X(n)\)
- Multiple whole trajectories will work very similarly

- Aggregates level: counts of how many individuals are in each state
- There are \(s\) processes all following the Markov chain, so \(X^{(c)}(t)\) is state of process \(c\) at time \(t\), \(c \in 1:s\)
- We see \(D_i(t) =\) number of processes in state \(i\) at time \(t\), \(i \in 1:k\)
- Don’t get to observe whether process \(c\) in particular has moved

- Individual-level data is much easier to deal with

- Assume individual-level data \(x(0), \ldots x(n)\)
- Using the Markov property, the likelihood is just \[ \Prob{X(0) = x(0)} \prod_{t=0}^{n-1}{\Prob{X(t+1) = x(t+1)|X(t) = x(t)}} = \InitDist(x(0))\prod_{t=0}^{n-1}{ \transition_{x(t), x(t+1)}} \]
- The log-likelihood is \[ \log{\InitDist(x(0))} + \sum_{t=0}^{n-1}{\log{\transition_{x(t), x(t+1)}}} \]
- Condition on \(X(0)\), since it’s uninformative about \(\transition\): \[ \loglike(\transition) \equiv \sum_{t=0}^{n-1}{\log{\transition_{x(t), x(t+1)}}} \]
- Normalize, take the minus sign so we’re minimizing: \[ M_n(\transition) \equiv - \frac{1}{n} \sum_{t=0}^{n-1}{\log{\transition_{x(t), x(t+1)}}} \]

- Define \(N_{ij} \equiv\) number of times state \(j\) follows state \(i\) in the data
- Define \(N_i \equiv \sum_{j}{N_{ij}}\)
- So \(\sum_{i}{N_i} = n\)

- The set of \(N_{ij}\) are the
**sufficient statistics**, because the likelihood only depends on the data through them: \[ M_n \equiv - \frac{1}{n} \sum_{t=0}^{n-1}{\log{\transition_{x(t), x(t+1)}}} = -\frac{1}{n}\sum_{i,j}{N_{ij}\log{\transition_{ij}}} \]

(sum over time vs. sum over state pairs)

- Minimize \(M_n\) and get the obvious answer: \[
\hat{\transition}_{ij} = \frac{N_{ij}}{N_i}
\]
- When we minimize, we need to ensure that each row of \(\hat{\transition}\) adds to 1
- See handout on website, or Guttorp (1995), pp. 20–21 and 60–64

- By the ergodic theorem, \[ \frac{N_{ij}}{n} \rightarrow \InvDist_i \TrueTransition_{ij} \]
- Therefore \[ \frac{N_i}{n} \rightarrow \InvDist_i \]
- Therefore \[ \frac{N_{ij}}{N_i} = \frac{N_{ij}/n}{N_i/n} \rightarrow \frac{\InvDist_i \TrueTransition_{ij}}{\InvDist_i} = \TrueTransition_{ij} \]

- May not be able to vary all the transition probabilities separately
- May have an actual theory about how the transition probabilities are functions of underlying parameters
- Compartment models (next week) exemplify both issues

- Either way, \(\transition\) is really \(\transition(\theta)\), with \(\theta\) the \(r\)-dimensional vector of parameters \[ \frac{\partial \loglike}{\partial \theta_u} = \sum_{ij}{\frac{\partial \loglike}{\partial \transition_{ij}}\frac{\partial \transition_{ij}}{\partial \theta_u}} \]
- Set this to zero and solve (or just maximize \(L_n\) numerically…)

- Generally need Guttorp’s “Conditions A” (pp. 64–65)
- which he got from Billingsley (1961) (p. 23)

- The allowed transitions are the same for all \(\theta\)
- (technical convenience)

- \(\transition_{ij}(\theta)\) has continuous \(\theta\)-derivatives up to order 3
- (authorizes Taylor expansions to 2nd order)
- (can sometimes get away with just 2nd partials)

- The matrix \(\partial \transition_{ij}/\partial \theta_u\) always has rank \(r\)
- (no redundancy in the parameter space)

- The chain is ergodic without transients for all \(\theta\)
- ( trajectories are representative samples)

- Assume all this, and that the chain was generated from \(\theta=\theta^*\)

- MLE \(\hat{\theta}\) exists
- \(\hat{\theta} \rightarrow \theta^*\) (consistency)
- Asymptotic normality: \[
\sqrt{n}\left(\hat{\theta} - \theta^*\right) \rightsquigarrow \mathcal{N}(0,I^{-1}(\theta^*))
\] with
**expected or Fisher information matrix**\[ I_{uv}(\theta) \equiv \sum_{ij}{\frac{p_i(\theta)}{\transition_{ij}(\theta)}\frac{\partial \transition_{ij}}{\partial \theta_u}\frac{\partial \transition_{ij}}{\partial \theta_v}} = -\sum_{ij}{p_i(\theta)\transition_{ij}(\theta)\frac{\partial^2 \log{\transition_{ij}(\theta)}}{\partial \theta_u \partial \theta_v}} \] (equality is not obvious)

Error estimates based on \(I(\theta^*)\) are weird: if you knew \(\theta^*\), why would you be calculating errors?

- Plug-in: Use \(I(\hat{\theta})\)
- Use the
**observed informatio matrixn**\[ J_{uv} = -\sum_{ij}{\frac{N_{ij}}{n}\frac{\partial^2 \log{\transition_{ij}(\hat{\theta})}}{\partial \theta_u \partial \theta_v}} \]- (Guttorp’s Eq. 2.207, but he’s missing the sum over state pairs)

- Use a model-based bootstrap: simulate the fitted chain, re-estimate, repeat
- (Could use a block bootstrap to resample blocks if you think it’s stationary)

\[ J_{uv} = -\frac{1}{n} \frac{\partial^2 \loglike(\hat{\theta})}{\partial \theta_u \partial \theta_v} \]

- \(nJ=\) how much the log-likelihood changes with a small change in parameters from the maximum
- \(J^{-1}\) says how much we can change the parameters before the change in log-likelihood is noticeable
- If the model is right, then \(J \rightarrow I(\hat{\theta})\), because both \(\rightarrow I(\theta^*)\)
- \(\therefore\) there are ways to use \(J-I(\hat{\theta})\) to test for mis-specification (White 1994)

- All of this is a special case of the general asymptotic theory from Lectures 13 and 14

- Markov property: for all \(t\), \[ \Prob{X(t)|X(0), X(1), \ldots X(t-1)} = \Prob{X(t)|X(t-1)} \]
- \(m^{\mathrm{th}}\)-order Markov: for all \(t\), \[ \Prob{X(t)|X(0), X(1), \ldots X(t-1)} = \Prob{X(t)|X(t-m), \ldots X(t-1)} \]
- In a Markov chain, the
*immediate*state determines the distribution of future trajectories - In a higher-order chain, the
*last \(p\) states*jointly determine the distribution of future trajectories **Extended chain device**: Define \(Y(t) = (X(t-m+1), \ldots X(t))\), and notice it’s a first-order Markov chain- The likelihood theory is
*exactly*the same, only we need to condition on the first \(m\) observations

- The likelihood theory is

- Jason Capehart, Seung Su Han, Alexander Murray-Watters, and Elizabeth Silver in Statistical Computing (36-350) in 2011

Likelihood-ratio testing is simple, for nested hypotheses:

- \(\hat{\theta}_{\mathrm{small}}\) = MLE under the smaller, more restricted hypothesis
- \(d_{\mathrm{small}}\) parameters / degrees of freedom

- \(\hat{\theta}_{\mathrm{big}}\) = MLE under larger hypothesis
- d.o.f. \(=d_{\mathrm{big}}\)

- If the smaller hypothesis is true, \[ \Lambda = 2[\loglike(\hat{\theta}_{\mathrm{big}}) - \loglike(\hat{\theta}_{\mathrm{small}})] \rightsquigarrow \chi^2_{d_{\mathrm{big}} - d_{\mathrm{small}}} \] which gives significance level
- If the bigger hypothesis is true, \(\Lambda\) follows a non-central \(\chi^2\) distribution
- Can be used to calculate power, or we could just simulate

- What about small \(n\)? Distribution mightn’t have converged
- Use model-based bootstrapping again:

- Calculate log likelihood ratio \(\Lambda\) on real data, call this \(\lambda\)
- Simulate from \(\hat{\theta}_{\mathrm{small}}\), get fake data \(Y(1), \ldots Y(n)\)
- Estimate \(\tilde{\theta}_{\mathrm{small}}\), \(\tilde{\theta}_{\mathrm{big}}\) from \(Y(1), \ldots Y(n)\)
- Calculate \(\tilde{\Lambda}\) from \(\tilde{\theta}_{\mathrm{small}}\), \(\tilde{\theta}_{\mathrm{big}}\)
- Repeat (2)–(4) \(b\) times to get sample of \(\tilde{\Lambda}\)
- \(p\)-value = \(\#\left\{\tilde{\Lambda} \geq \lambda\right\}/b\)
- Some people add 1 to numerator and denominator to avoid ever reporting 0

- Getting power is similar but simulate from \(\hat{\theta}_{\mathrm{big}}\)

- Testing parametrized transition matrices:
- d.o.f. \(=k(k-1)\) for a first-order chain
- d.o.f. \(=k^m(k-1)\) for a \(k\)-order chain
- fixed transition matrix, or fixed \(\theta^*\), has d.o.f. \(=0\)
- otherwise the parameterized transitions have d.o.f. \(=r\), the dimension of \(\theta\)

- Lower-order chains are nested inside higher-order chains, so you can test for order restrictions

- There are \(s\) individuals, \(X_c(t)\), \(c \in 1:s\), all following transition matrix \(\transition\)
- Assume transitions are independent across individuals

- All we see is \(D_i(t) =\) count of individuals in state \(i\) at time \(t\)
- \(D(t) =\) vector of all \(D_i(t)\)

- \(\Expect{D_j(t+1)|D(t)} = \sum_{i}{\transition_{ij} D_i(t)}\)
- Because each of the \(D_i(t)\) individuals moves to state \(j\) with probability \(\transition_{ij}\)

- \(\Var{D_j(t+1)|D(t)} = \sum_{i}{\transition_{ij}(1-\transition_{ij})D_i(t)}\)
- Similarly for the variance

- Linearly regress \(D_j(t+1)\) on \(D(t)\)
- Slope on \(D_i(t)\) is \(\transition_{ij}\)
- No intercept (why?)
- Using
*weighted*least squares with inverse-variance weights gives more precise estimates

- Instead of a transition matrix, have a transition
**density**or**kernel**, i.e., a conditional pdf \(\transition(x,y) = p(X(t+1) = y|X(t) =x)\) - If this is parameterized, \(\transition(x,y;\theta)\), can do the MLE as usual
- Replace sums above with integrals

- What if we want non-parameteric estimates?

- Introduce a
**kernel**\(K(u)\), a pdf for \(X\)- Usually require \(\int{u K(u) du} = 0\), \(0 < \int{u^2 K(u) du} < \infty\)

- \(\frac{1}{h}K\left(\frac{x-x_i}{h}\right)\) is a pdf centered at \(x_i\), with
**bandwidth**\(h\) - If \(X_1, X_2, \ldots X_n\) have pdf \(p\), then \[
\hat{p}_n(x) = \frac{1}{nh}\sum_{i=1}^{n}{K\left(\frac{x-x_i}{h}\right)}
\] is a non-parametric estimate of \(p\)
- Need to let \(h \rightarrow 0\) as \(n\rightarrow \infty\)

- Pick \(h\) by cross-validation
- Want high log-likelihood for unseen data

- Remember \(p(X(t+1)=y|X(t)=x) = p(X(t)=x, X(t+1)=y)/p(X(t)=x)\)
- Estimate by \[ \hat{\transition}(x,y) = \frac{\frac{1}{(n-1) h_1 h_2}\sum_{t=1}^{n}{K\left(\frac{x-x(t-1)}{h_1}\right) K\left(\frac{y-x(t)}{h_2}\right)}}{\frac{1}{(n-1) h_1}\sum_{t=1}^{n}{K\left(\frac{x-x(t-1)}{h_1}\right)}} \]
- Again, pick both \(h_1\) and \(h_2\) to get high conditional log-likelihood on unseen data
- See Hall, Racine, and Li (2004)

- Higher-order processes work similarly

- The
`np`

library lets us estimate conditional densities with the`npcdens`

function

- Maximum likelihood for chains is natural and efficient
- Even for parameterized models

- Model checking / hypothesis testing is very much as in the IID case
- For continuous processes, use conditional density estimation

- MLE says \(\hat{\transition}_{ij} = 0\) iff \(N_{ij}=0\)
- If the state space is large, \(N_{ij}\) might be zero just because we haven’t had time to see that transition
- One solution: fictitious counts
- Fix \(\mu > 0\), and set \[ \tilde{\transition}_{ij} = \frac{N_{ij}+\mu}{\sum_{j}{(N_{ij}+\mu)}} \]
- If \(\mu\) is fixed as \(n\rightarrow\infty\), \[ \frac{N_{ij}+\mu}{\sum_{j}{(N_{ij}+\mu)}} = \frac{N_{ij}/n + \mu/n}{N_i/n + k\mu/n} \rightarrow \TrueTransition_{ij} \]
- Can pick \(\mu\) by some estimate of how much we’ve under-sampled
- Or by cross-validation
- Refinement: Different \(\mu_j\) for each \(j\), perhaps reflecting \(N_j\)

- Markov chains for text typically have words for states
- Sometimes do letters or sounds (phonemes) for spelling, predictive text, speech recognition…
- Markov actually did a model of letters (vowels vs. consonants)
- but usually words

- Languages have large vocabularies
- Say \(10^4\) words at the low end to \(10^6\) at the high end
- Depends on language, how you count “word”, etc.

- So languages have even larger numbers of word sequences
- \(10^8\) to \(10^{12}\) for pairs
- Even more for higher-order chains
- A decent-length book is only \(10^5\) words, so even a corpus of \(10^3\) books can’t cover the state space

- Detailed example of making this work at scale: Pereira, Singer, and Tishby (1995)

- The MLE is still \(\hat{\transition}_{ij} = N_{ij}/N_i\)
- Problem: This gives 0 probability to visiting states outside the data
- Solutions:
- Parameterize the transition probabilities
- Replace the MLE with something else

- Outstanding example of “something else”:
- Pick \(\mu_j\) such that \(\mu_j > 0\), \(\sum_{j}{\mu_j} < \infty\)
- Set \(\hat{\transition}_{ij} \propto (N_{ij} + \mu_j)\)
- Can you work out what the denominator needs to be?

- \(X(t)\) takes only discrete values, with jumps at times \(t_1, t_2, \ldots t_n\)
- Need \(\Prob{X(t+h)=j|X(t)=i} = \transition(h)\) for all \(h\)
- Consistency condition: \(\transition_{ij}(h_1+h_2) = \sum_{k}{\transition_{ik}(h_1)\transition_{kj}(h_2)}\)
- a.k.a.
**Chapman-Kolmogorov equation**

- a.k.a.
- This forces \(\transition(h) = e^{h\mathbf{g}}\)
- Read this as \(\transition(h) = \sum_{r=0}^{\infty}{\frac{1}{r!}(h\mathbf{g})^r}\)
- Matrix \(\mathbf{g}\) is called the
**generator**of the transition matrices \(\transition(h)\) - If \(h\approx 0\), \(\transition(h) \approx \mathbf{I} + h\mathbf{g}\)

- Can do full likelihood-based inference (see Guttorp, sec. 3.7)
*or*discretize time at some small \(h\), get \(\hat{\transition}(h)\), and back out \(\hat{\mathbf{g}}\)

*Everyone*steals the theory of likelihood inference for Markov chains from Billingsley (1961)- Guttorp (1995) is easier reading, and the relevant parts are on Canvas

Billingsley, Patrick. 1961. *Statistical Inference for Markov Processes*. Chicago: University of Chicago Press.

Guttorp, Peter. 1995. *Stochastic Modeling of Scientific Data*. London: Chapman; Hall.

Hall, Peter, Jeff Racine, and Qi Li. 2004. “Cross-Validation and the Estimation of Conditional Probability Densities.” *Journal of the American Statistical Association* 99:1015–26. http://www.ssc.wisc.edu/~bhansen/workshop/QiLi.pdf.

Kalbfleisch, J. D., and J. F. Lawless. 1984. “Least-Squares Estimation of Transition Probabilities from Aggregate Data.” *The Canadian Journal of Statistics* 12:169–82. http://www.jstor.org/stable/3314745.

Pereira, Fernando, Yoram Singer, and Naftali Z. Tishby. 1995. “Beyond Word \(n\)-Grams.” In *Proceedings of the Third Workshop on Very Large Corpora*, edited by David Yarowsky and Kenneth Church, 95–106. Columbus, Ohio: Association for Computational Linguistics. http://arxiv.org/abs/cmp-lg/9607016.

White, Halbert. 1994. *Estimation, Inference and Specification Analysis*. Cambridge, England: Cambridge University Press.