---
title: Hidden Markov Models and State Estimation
author: 36-467/36-667
date: 4 December 2018
bibliography: locusts.bib
---
\[
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\]
```{r, include=FALSE}
set.seed(2018-12-3)
library(knitr)
opts_chunk$set(size="small", background="white",
cache=TRUE, autodep=TRUE,
tidy=FALSE, warning=FALSE, message=FALSE,
echo=TRUE)
```
The last few lectures have focused on Markov processes, where conditioning on
the whole past is equivalent to conditioning on the most recent value:
\[
\Prob{X(t+1)|X(1:t)} = \Prob{X(t+1)|X(t)}
\]
When this is true, we say that $X(t)$ is the **state** of the process at time
$t$, the variable which determines the whole distribution of future
observations. When this assumption holds, we can easily do likelihood-based
inference and prediction. But the Markov property commits us to
$X(t+1)$ being independent of all earlier $X$'s given $X(t)$.
The most natural route from Markov models to hidden Markov models is to ask
what happens if we don't observe the state perfectly. I have been using $X(t)$
to always stand for the time series we observe, so I am going to introduce a
new stochastic process, $S(t)$, which will be a Markov process, and say that
what we observe is $X(t)$ plus independent noise. Will this $X(t)$ be Markov?
To be concrete, consider the following set-up. $S(t)$ is a Markov chain with
two states, $-1$ and $+1$, and $\Prob{S(t+1)=+1|S(t)=+1} =
\Prob{S(t+1)=-1|S(t)=-1} q > 0.5$. This chain thus alternates between two
states, but tends to stick in the same state for a while, to be "persistent".
What we observe is $X(t) = S(t) + \epsilon(t)$, where $\epsilon(t)$ are IID
Gaussian noise variables with mean 0 and variance $1$. (Nothing particularly
turns on the choice of Gaussian noise or variance 1, etc.) The figure below
illustrates both a typical realization of the observables $X(t)$, and of the
chain $S(t)$ that generated it.
```{r}
# From lecture 20: generic, abstract Markov simulator
rmarkov <- function(n, rinitial, rtransition) {
x <- vector(length=n)
x[1] <- rinitial()
for (t in 2:n) {
x[t] <- rtransition(x[t-1])
}
return(x)
}
# Simulate a chain that's +1/-1, with persistence in state
# Pr(X(t+1)=+1|X(t)=+1) = Pr(X(t+1)=-1|X(t)=-1) = q, assumed > 0.5
# Exercise: Prove that the invariant distribution is (1/2, 1/2) for all q
rpmchain <- function(n, q) {
# Take one step of the chain
transition.with.persistence <- function (x) {
# Stay in the same state with probability q
stay <- sample(c(TRUE, FALSE), size=1, prob=c(q, 1-q))
if (stay) { return(x) }
# Or flip to the opposite state
if (!stay) { return(-x) }
}
pm.seq <- rmarkov(n, rinitial=function() { sample(c(-1, +1),
size=1) },
rtransition=transition.with.persistence)
return(pm.seq)
}
# Simulate a +1/-1 chain, but observe it through noise
noisy.rpmchain <- function(n, q, sd=1) {
pm.seq <- rpmchain(n, q)
x <- rnorm(n=n, mean=pm.seq, sd=sd)
return(list(x=x, s=pm.seq))
}
```
```{r}
sim <- noisy.rpmchain(n=100, q=0.75)
plot(sim$x, xlab="t", ylab="X(t)", xlim=c(0, 120))
lines(1:100, sim$s, col="grey")
legend("right", legend=c("Hidden state S(t)", "Observable X(t)"),
pch=c(NA, 1), lty=c("solid", "blank"), col=c("grey", "black"),
cex=0.5)
```
Now any value of $X(t)$ _could_ have come from either value of $S(t)$. (See
next figure.) Large positive values of $X(t)$ are _really unlikely_ if
$S(t)=-1$, but not impossible, and values of $X(t)$ near zero are about equally
likely under either state.
```{r}
curve(0.5*dnorm(x,-1,1)+0.5*dnorm(x,1,1), xlab="x", ylab="p(x)",
main="Marginal and Conditional Distributions of X(t)", xlim=c(-5,5), ylim=c(0, 0.5))
curve(dnorm(x,-1,1), lty="dashed", add=TRUE)
curve(dnorm(x,+1,1), lty="dotted", add=TRUE)
legend("topright", legend=c("X(t)", "X(t)|S(t)=-1", "X(t)|S(t)=+1"),
lty=c("solid", "dashed", "dotted"))
```
This has consequences when we ask about the distribution of $X(t+1)$. If $X(t)
= `r signif(max(sim$x), 2)`$, then it's very likely that $S(t)=+1$ --- so
likely that earlier history won't change much of anything. But if $X(t) = 0$,
well, it will make a difference whether $X(t-1)$ was positive or negative.
This simple observation is enough, in itself, to tell us that $X(t)$ can't be a
Markov process. In any Markov process, $X(t-1)$ is irrelevant to $X(t+1)$
given $X(t)$. In particular, if $X$ is numerical and we regress $X(t+1)$ on
both $X(t)$ and $X(t-1)$, the true coefficient on $X(t-1)$ is _exactly_ zero.
This is _not_ the case here (Exercise 1).
Quite generally, if $S(t)$ is a Markov process, and $X(t)$ is a function of
$S(t)$, plus, possibly, some extra noise, then $X(t)$ is _not_
Markov[^functionofmarkov].
[^functionofmarkov]: This observation suggests the interesting mathematical question of whether _any_ functions of a Markov process are themselves Markovian, and, if so, which. This has been studied by a variety of authors, often interested in "lumping" the states of large simulation models. @Rosenblatt-on-markov-processes and @Rogers-Pitman-markov-functions are particularly notable contributions to this problem; @Nilsson-Jacobi-Gornerup-lumpability gives a necessary-and-sufficient algebraic condition. On the related problem of characterizing the processes which are functions of Markov chains (whether or not the result is Markovian), see @Heller-functions-of-Markov-chains and @Erickson-functions-of-markov-chains.
# Definition of HMMs
Examples like these lead to a general notion of a **hidden Markov model**, or
**state-space model**. In these models, there is a **latent** or **hidden**
state $S(t)$, which follows a Markov process. We'll write
$\Prob{S(t+1)=r|S(t)=s} = q(r,s)$. As in Markov models, the transitions need
to be complemented with a distribution for the initial state. Because these
states are hidden, we don't actually get to observe them. The **observable**
or **manifest** variable $X(t)$ is a random or noisy function of $S(t)$. That
is, there's some sequence of IID noise variables, say $\epsilon(t)$, and $X(t)
= f(S(t), \epsilon(t))$ for some function $f$. As a consequence, given $S(t)$,
$X(t)$ is independent of $X(t^\prime), S(t^\prime)$ for all $t^\prime \neq t$.
An _example_ of this would be when $X(t) = a S(t) + \epsilon(t)$, but the
function can be nonlinear and the noise can be non-additive. In fact, the
exact function $f$ usually doesn't matter very much; what matters are the
conditional observation probabilities
$\Prob{X(t)=x|S(t)=s} = g(s,x)$.
Some comments are in order on this definition.
- Both $S(t)$ and $X(t)$ can be numerical or qualitative variables, or indeed
more complicated.
+ Some people make a distinction between "hidden Markov models", where $S(t)$ is discrete, and "state-space models", where $S(t)$ is continuous. Some of these people would further limit "state-space models" to situations where $X(t)$ is also continuous. Since the theory and methods are almost the same, however, I will use the terms interchangeably.
+ Obviously, when variables are continuous, probability mass functions need to get replaced by probability density functions, and sums by integrals.
- Both $S(t)$ and $X(t)$ can have multiple dimensions. I won't indicate this
in the notation.
- Any type of state variable can, potentially, be combined with any type of
observable. For instance, you can have discrete $S(t)$ and
continuous-function-valued $X(t)$. (These are sometimes used in speech
recognition, where the discrete states are the qualitative linguistic sounds,
"phonemes", people are saying, and the function-valued observations are the
actual noises they make over brief intervals of time.)
- HMMs can be extra _useful_ when the state is somehow simpler than the
observable --- say a low-dimensional or even binary state explaining a
high-dimensional vector of observables --- but that's not required by the
definition.
+ In particular, if $S(t)$ has $r$ dimensions, but we observe $k < r$ of
them, the variables we _observe_ will not be Markov, even if they're
observed perfectly.
# Some Basic Calculations for HMMs
Because the states $S(t)$ form a Markov process, we can easily calculate
the probability of a state sequence:
\begin{eqnarray}
\Prob{S(1:n) = s(1:n)} & = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{S(t)=s(t)|S(1:t-1)=s(1:t-1)}\\
& = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{S(t)=s(t)|S(t-1)=s(t-1)}\\
& = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{q(s(t-1), s(t))}
\end{eqnarray}
The first line is basic probability; the second uses the Markov property.
It's equally easy to calculate the probability of a sequence of observations,
given a sequence of states:
\begin{eqnarray}
\Prob{X(1:n)=x(1:n)|S(1:n)=s(1:n)} & = & \prod_{t=1}^{n}{\Prob{X(t)=x(t)|S(t)=s(t)}}\\
& = & \prod_{t=1}^{n}{g(s(t), x(t))}
\end{eqnarray}
This is because $X(t)$ is independent of everything else, given $S(t)$.
Putting these together gives us the joint probability of a sequence of states and a sequence of observations:
\begin{eqnarray}
\Prob{X(1:n)=x(1:n), S(1:n)=s(1:n)} & = & \Prob{S(1:n)=s(1:n)}\Prob{X(1:n)=x(1:n)|S(1:n)=s(1:n)}\\
& = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{q(s(t-1), s(t))}\prod_{t=1}^{n}{g(s(t), x(t))}
\end{eqnarray}
This, in turn, in principle gives us the probability of a sequence of observations:
\begin{eqnarray}
\Prob{X(1:n)=x(1:n)} & = & \sum_{s(1:n)}{\Prob{X(1:n)=x(1:n), S(1:n)=s(1:n)}}\\
\end{eqnarray}
This probability is a very valuable thing to know. When we make predictions, we
want
\[
\Prob{X(t)=x|X(1:t-1)=x(1:t-1)} = \frac{\Prob{X(1:t)=x(1:t)}}{\Prob{X(1:t-1)=x(1:t-1)}}
\]
so if we can calculate the probability of a sequence, and we can divide,
then we can predict. If $q$ and $g$ contain some unknown parameters $\theta$,
and we want to do inference on them, then we really should be
writing things like
\[
\Prob{X(1:n)=x(1:n); \theta}
\]
and so calculating sequence probabilities will let us do likelihood-based
inference.
Unfortunately, the sum-over-state-histories expression, $
\sum_{s(1:n)}{\Prob{X(1:n)=x(1:n), S(1:n)=s(1:n)}}$, is just not suitable for
direct evaluation. If there are even two states, there are $2^n$ possible
values of $s(1:n)$, and that's clearly not a sum you want to do if $n$ is at
all large.
# The Recursive Trick for Likelihood, Prediction, and State Estimation
#### a.k.a. "The Forward Algorithm"
There is a neat trick which lets us avoid the exponential explosion in
evaluating the likelihood, or in making predictions. The same trick will also
let us estimate the hidden state at a given time. The trick is to do all of
these things **recursively**, based on having solved the same problem with less
of the data.
Let's start with the base case. If we knew $S(1)$, it'd be easy to find
the distribution of $X(1)$:
\[
\Prob{X(1)=x(1)|S(1)=s} = g(s, x(1))
\]
It's therefore easy to find the marginal distribution of $X(1)$, i.e.,
our predictive distribution for $X(1)$:
\[
\Prob{X(1)=x(1)} = \sum_{s}{\Prob{X(1)=x(1)|S(1)=s}\Prob{S(1)=s}} = \sum_{s}{g(s, x(1)) \Prob{S(1)=s}}
\]
Now we can use the definition of conditional probability to estimate the state at time 1:
\begin{eqnarray}
\Prob{S(1)=s|X(1)=x} & = & \frac{\Prob{S(1)=s, X(1)=x(1)}}{\Prob{X(1)=x(1)}}\\
& = & \frac{\Prob{S(1)=s}\Prob{X(1)=x(1)|S(1)=s}}{\Prob{X(1)=x(1)}}\\
& = & \frac{\Prob{S(1)=s} g(s, x(1))}{\sum_{s^{\prime}}{\Prob{S(1)=s} g(s, x(1))}}
\end{eqnarray}
This is, of course, just Bayes's rule. Because this will be important
later, let's abbreviate it by $F_1(s)$.
Now that we know (the distribution of) the state at time 1, we can extrapolate
forward to time 2:
\begin{eqnarray}
\Prob{S(2)=r|X(1)=x(1)} & = & \sum_{s}{\Prob{S(2)=r, S(1)=s|X(1)=x(1)}}\\
& = & \sum_{s}{\Prob{S(2)=r|S(1)=s, X(1)=x(1)}\Prob{S(1)=s|X(1)=x(1)}}\\
& = & \sum_{s}{\Prob{S(2)=r|S(1)=s}\Prob{S(1)=s|X(1)=x(1)}}\\
& = & \sum_{s}{q(s,r) F_1(s)}
\end{eqnarray}
using the Markov property in the next-to-last line, and the definitions of $g$ and $F_1$ in the last line.
We want to extrapolate the state forward because it lets us make a prediction:
\begin{eqnarray}
\Prob{X(2)=x|X(1)=x(1)} & = & \sum_{s}{\Prob{X(2)=x, S(2)=s|X(1)=x(1)}}\\
& = & \sum_{s}{\Prob{X(2)=x|S(2)=s, X(1)=x(1)}\Prob{S(2)=s|X(1)=x(1)}}\\
& = & \sum_{s}{\Prob{X(2)=x|S(2)=s}\Prob{S(2)=s|X(1)=x(1)}}\\
& = & \sum_{s}{g(s, x)\Prob{S(2)=s|X(1)=x(1)}}
\end{eqnarray}
using the assumption that $S(t)$ screens off everything else from $X(t)$ in
the next-to-last line.
Finally, once we've seen $X(2)$, we can use that to get the distribution
of $S(2)$:
\begin{eqnarray}
F_2(s) & \equiv & \Prob{S(2)=s|X(2)=x(2), X(1)=x(1)}\\
& = & \frac{\Prob{S(2)=s, X(2)=x(2)|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}}\\
& = & \frac{\Prob{X(2)=x(2)|S(2)=s, X(1)=x(1)}\Prob{S(2)=s|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}}\\
& = & \frac{\Prob{X(2)=x(2)|S(2)=s}\Prob{S(2)=s|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}}\\
& = & \frac{g(s, x(2))\Prob{S(2)=s|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}}
\end{eqnarray}
Again, this is Bayes's rule. Notice that the role of "prior distribution" here
isn't played by $\Prob{S(1)}$ or even $\Prob{S(1)|X(1)}$, but by
$\Prob{S(2)|X(1)}$.
Now we're ready to go on to $t=3$, and to state the general, recursive
pattern. This is sometimes called the **forward algorithm**.
- Begin with $\Prob{S(t)=s|X(1:t)=x(1:t)} \equiv F_t(s)$
- Extrapolate the state forward:
\[
\Prob{S(t+1)=s|X(1:t)=x(1:t)} = \sum_{s^{\prime}}{q(s^{\prime}, s)\Prob{S(t)=s^{\prime}|X(1:t)=x(1:t)}} = \sum_{s^{\prime}}{q(s^{\prime}, s) F_t(s^{\prime})}
\]
- Predict the next observation:
\[
\Prob{X(t+1)=x|X(1:t)=x(1:t)} = \sum_{s}{g(s,x) \Prob{S(t+1)=s|X(1:t)=x(1:t)}}
\]
- Condition $S(t+1)$ on $X(t+1)$:
\[
\Prob{S(t+1)=s|X(1:t+1)=x(1:t+1)} = \frac{g(s,x) \Prob{S(t+1)=s|X(1:t)=x(1:t)}}{\Prob{X(t+1)=x|X(1:t)=x(1:t)}}
\]
- End with $F_{t+1}(s) \equiv \Prob{S(t+1)=s|X(1:t+1)=x(1:t+1)}$, or begin the cycle over again with $t+2$.
## Back to the Likelihood; Time Complexity of the Forward Algorithm
We began this by wanting an alternative to direct sum-over-state-histories
for the likelihood $\Prob{X(1:n)=x(1:n)}$. We can get that from this
approach, by multiplying together predictive distributions:
\[
\Prob{X(1:n)=x(1:n)} = \Prob{X(1)=x(1)}\prod_{t=2}^{n}{\Prob{X(t)=x(t)|X(1:t-1)=x(1:t-1)}}
\]
We haven't, however, established that this approach is any faster than
summing over histories!
To see that it is, think about how time-consuming it will be to go through the
cycle above once. To make a prediction for $X(t+1)$, we need to sum over
states $S(t+1)$, and to get the probability that $S(t+1)=s$, we need to sum
over states $S(t)$. So if there are $k$ states, we need to so $O(k^2)$
operations in that double sum. Each of the $n$ time-steps requires one pass
through the cycle, so we need up needing to do $O(k^2 n)$ operations, not
$O(k^n)$ as with direct summation.
In this analysis, it's important that the procedure is recursive, that we
calculate $F_{t+1}$ using just $F_t$ and $x(t+1)$ (and the model!), so that we
don't need to keep track of earlier states or their distributions, or even, in
fact, earlier values of $X$. To the extent they matter, they've been
summarized in $F_t$, which is a sufficient statistic for predicting $X(t+1)$.
To sum up, in an HMM it's straightforward to calculate $\Prob{X(1)=x(1)}$ and
$F_1(s) = \Prob{S(1)=s|X(1)=x(1)}$. Thereafter, we can predict by cycling
through (i) extrapolating the state distribution $F_t$ forward in time (using
the transition probabilitiess $q$), (ii) predicting the next observation (using
(i) and the observation probabilities $g$), and (iii) updating the distribution
of the state to $F_{t+1}$ (using (ii) and Bayes's rule). This lets
us make predictions, and so calculate a likelihood, or log-likelihood.
## The Backward Algorithm
You may have noticed that we're just calculating $\Prob{S(t)|X(1:t)}$ and not
$\Prob{S(t)|X(1:n)}$, let alone $\Prob{S(1:n)|X(1:n)}$. Partly, this is
because all we really _need_ for prediction or the likelihood is
$\Prob{S(t)|X(1:t)}$. Partly, however, it is because it's just a lot _easier_
to calculate this than it is to calculate $\Prob{S(t)|X(1:n)}$. Nonetheless,
there is _also_ a recursive algorithm for the latter. To use it, one must
first have run the forward algorithm, getting a sequence $\Prob{S(t)|X(1:t)}$
for each $t\in 1:n$. The last of these, $\Prob{S(n)|X(1:n)}$, is the last
probability we want, so the recursion starts at the _end_ of the time series
and works backwards in time (hence the name). I will omit all the details
here, because they're neither as illuminating as in the forward algorithm,
and because you can find them in readily-available references, such
as @Andy-Fraser-on-HMMs.
## The Kalman Filter
An important special case is when $S(t)$ and $X(t)$ are both vectors,
and everything is linear-and-Gaussian:
\begin{eqnarray}
S(t+1) & = & \mathbf{a} S(t) + \eta(t)\\
X(t) & = & \mathbf{b} S(t) + \epsilon(t)\\
\eta(t) & \sim_{IID} & \mathcal{N}(0, \mathbf{\Sigma}_{S})\\
\epsilon(t) & \sim_{IID} & \mathcal{N}(0, \mathbf{\Sigma}_X)
\end{eqnarray}
In this case, it's possible to write out all the conditional probabilities in
the recursion _exactly_ --- everything is Gaussian, with mean vectors and
variance matrices that are straightforward functions of the data. Finding
the filtering distribution then becomes known as the **Kalman filter**, after
@Kalman (and @Kalman-Bucy). One of the clearest presentations I've seen
is in @Andy-Fraser-on-HMMs, which you should read if this is at all
interesting. There is a base R implementation in the function `KalmanRun`
(with various companion functions `KalmanLike`, `KalmanForecast`, etc.).
While the Kalman-filter solution is is nicely straightforward, it is heavily
tied to the linear-and-Gaussian assumptions. There are many important
situations where that's good enough
[@McGee-Schmidt-discovery-of-kalman-filter], but it can fail in important
ways. This has led to a number of attempts to use local linear approximations
to the Kalman filter in non-linear problems, and to alternative
closed-form approximate filters [@Big-Laplace-Gauss-filter-paper].
A different direction, however, is to go to back to the core ideas of the
forward algorithm, and to simulate them. This is the "particle filter".
## The Particle Filter
While the forward algorithm is straightforward in principle, it still involves
a double sum over the state space. This can be a time-consuming when the
number of states $k$ is large. When the state-space is continuous, the sum
becomes an integral, and doing even a double integral numerically can be quite
time-consuming and/or inaccurate. In this situation, a useful trick is to
replace complicated or time-consuming calculations with simulations,
i.e., to use Monte Carlo.
Applied here, the **Monte Carlo filter**, or **particle filter**, approximates
the distribution $F_t$ with the empirical distribution of a large
number of simulations of the HMM, conditioned on the data. It works as follows[^particlefilternote].
[^particlefilternote]: See @Kunsch-on-particle-filters for a good, brief review of particle filters, and @Sequential-Monte-Carlo-book for a more exhaustive, though now a little old, treatment. The actual history of who introduced which idea when is a bit more involved than I feel like going into, but covered in both these references. @Douc-Moulines-Stoffer discusses statistical applications.
1. Make a large number of independent draws $S^*_1, S^*_2, \ldots S^*_m$ from
the distribution $\Prob{S(1)}$. Each random state value is called a
**particle**, and the number of particles $m$ is, ideally, as large as
practical. (See more on this below.) Set $t=1$.
2. For each particle $S^*_i$, calculate $g(S^*_i, x(t)) = w_{i}$. Adding these
up approximates the predictive probability,
$\Prob{X(t)=x(t)|X(1:t-1)=x(1:t-1)} \approx \sum_{i=1}^{m}{w_i}$.
3. Now resample the particles, i.e., draw $m$ of them, with replacement, from
$S^*$, with probabilities proportional to the $w_{i}$. Call the new values
$\tilde{S}_1, \tilde{S}_2, \ldots \tilde{S}_m$. Set $\hat{F}_t$ to be the
empirical distribution of the $\tilde{S}$. That is, their empirical
distribution approximates $\Prob{S(t)|X(1:t)=x(1:t)}$.
4. Increment $t$ by 1, and evolve the particles forward in time. That is, for each $\tilde{S}_i$, pick
a $S^*_i$ according to $q(\tilde{S}_i, S^*_i)$. The distribution of these new
$S^*$ approximates $\Prob{S(t+1)|X(1:t)=x(1:t)}$.
5. Go to (2).
This all lends itself wonderfully to code.
```{r}
# Particle filter for an HMM
# Inputs: time series of length n (x)
# function generating draws for initial state (rinitial)
# function generating draws for next state (rtransition)
# function giving observational probabilities (dobs)
# number of particles (m)
# Output: n*m array of particles, approximating Pr(S(t)|X(1:t))
# Presumes: states can be elements in a vector
# observations can be elements in a vector
# rinitial is a no-argument function (for compatibility with rmarkov())
# rtransition is a one-argument function (ditto)
# dobs is a two-argument function
particle.filter <- function(x, rinitial, rtransition, dobs, m) {
# How long is the time series?
n <- length(x)
# Our initial guesses at the state, before seeing anything
particles <- replicate(m, rinitial())
# A storage area for our guesses
particle.history <- matrix(0, nrow=m, ncol=n)
for (t in 1:n) {
# Calculate observationa likelihoods/weights for current observation
# given state=particles
weights <- dobs(state=particles, observation=x[t])
# Resample the particles, with probabilities (proportional to)
# the weights
particles <- sample(particles, size=m, replace=TRUE, prob=weights)
# Store the current set of guesses
particle.history[,t] <- particles
# Evolve the particles forward using the Markov transitions for the
# next time step
particles <-sapply(particles, rtransition)
}
return(particle.history)
}
```
I will leave it as an exercise to extend this function so that it also
returns (an approximation to) the sequence $\Prob{X(t)|X(1:t-1)}$.
Here is an implementation using the HMM introduced at the beginning,
where $S(t)$ is either $+1$ or $-1$, and $X(t)|S(t) \sim \mathcal{N}(S(t), 1)$.
```{r}
sim.pf <- particle.filter(x=sim$x,
rinitial=function() { sample(-1,+1, size=1) },
rtransition=function (x) {
stay <- sample(c(TRUE, FALSE), size=1, prob=c(0.75, 0.25))
if (stay) { return(x) }
if (!stay) { return(-x) }
},
dobs=function(state, observation) {
dnorm(x=observation, mean=state, sd=1)
},
m=100)
```
We have 100 particles, i.e., 100 guesses for $S(t)$ at each time point. Rather
than try to visualize this directly ---- it's rather "busy" --- I'll summarize
with lines at the 5% and 95% percentiles (so a 90% probability interval).
```{r}
plot(sim$x, xlab="t", ylab="X(t)", xlim=c(0, 120))
lines(1:100, sim$s, col="grey", lwd=2)
pf.limits <- apply(sim.pf, 2, quantile, prob=c(0.05, 0.95))
lines(1:100, pf.limits[1,], col="blue")
lines(1:100, pf.limits[2,], col="blue")
legend("right", legend=c("Hidden state S(t)", "Observable X(t)", "Particle filter"),
pch=c(NA, 1, NA), lty=c("solid", "blank", "solid"), col=c("grey", "black", "blue"),
cex=0.5)
```
Notice that there are places where the limits become very narrow --- these
are mostly situations where $X(t)$ has a large absolute value, and do it's
very easy to tell what $S(t)$ must be.
### An Analogy for the Particle Filter
A common analogy for understanding how the particle filter works is to think of
the particles as an evolving population, with the state of each particle being
its genotype. The probability $w_{i}$ that particle $i$ gives to the current
observation is then the "fitness" of particle $i$. In the resampling step,
each particle has a chance to reproduce, and the expected number of offspring
of particle $i$ is $\propto w_i$. Finally, moving the new particles according
to the Markov chain is like introducing mutations. So the whole particle
filter process is analogous to "breeding" a population with the goal of giving
high likelihood (=fitness) to the data.[^bayesasevol]
[^bayesasevol]: In one direction, you can extend this analogy to techniques of "evolutionary optimization", such as "genetic algorithms", which try to breed good solutions to complex optimization problems [@Holland-adaptation;@MM-on-GAs]. On the other, you can extend it to all forms of Bayesian inference, which are mathematically equivalent to natural selection without any sort of variation [@CRS-dynamics-of-bayes].
### Issues with the Particle Filter
1. Using a finite number of particles (= Monte Carlo samples) introduces
approximation error --- the distribution $\hat{F}_t$ isn't quite the same as
$F_t$. This error shrinks as the number of particles $m \rightarrow
\infty$. All else being equal, $m$ should always be as large as possible.
+ What might not be equal is that increasing $m$ costs more time to run.
Each step of the particle filter --- moving each particle according to
the Markov process, calculating a probability for $X(t)=x(t)$ given the
particle state, and re-sampling the particles --- takes $O(m)$ operations.
(You can parallelize the transitions and the likelihood calculations, but
not the resampling.)
+ There are diminishing returns to increasing $m$ --- the approximation
error is generally $O(1/\sqrt{m})$.
+ Exactly how large you need to make $m$ thus depends on how you
value computing time vs. numerical accuracy, which I can't tell you.
2. The particle filter will perform badly if at some time $t$, one particle $i$
gives much higher probability to $X(t)$ than all the others, since then $w_i
\gg w_j$, and all (or almost all) of the resampled particles will just be
copies of $w_i$. This "particle collapse" is most apt to be a problem when
the data does something which is very improbable according to the model,
either because of an actual outlier, o because of model mis-specification.
@Bengtsson-Bickel-Li-on-particle-filters studies this problem in some detail.
A slightly hacky solution is to add a little bit of noise on to the resampled
particles. (This is equivalent to smoothing the empirical distribution.)
3. I have presented the particle filter assuming the state-transition
probabilities $q$ and the state-observation probabilities $g$ are _known_.
Obviously, this is often not the case for real data problems, which brings
us to the topic of parameter estimation.
# Parameter Estimation
Up to now, I have been taking the state-transition probabilities
$q(s,s^{\prime})$ and the state-observation probabilities $g(s,x)$ as known
functions. This is obviously unrealistic. They will generally depend on
one or more unknown parameters $\theta$, so I should really be writing
$q(s,s^{\prime};\theta)$ and $g(s,x;\theta)$. How might we estimate $\theta$?
- _Use the likelihood_ We can use the forward algorithm to calculate, or at least approximate, $\Prob{X(1:n)=x(1:n);\theta}$ for each $\theta$, as
$\prod_{t}{\Prob{X(t)=x(t)|X(1:t-1) = x(1:t-1);\theta}}$. This is a function
of $\theta$, which we can maximize. In the usual way, it will be easier,
and more numerically stable, to work with the log-likelihood,
\[
\sum_{t}{\log{\Prob{X(t)=x(t)|X(1:t-1)=x(1:t-1);\theta}}}
\]
A fully rigorous proof that maximum likelihood estimation is consistent for
general HMMs is surprisingly tricky, but doable
[@Cappe-Moulines-Ryden-inference-in-HMMs]. As a practical matter, however,
the log-likelihood is a function, and can be optimized like anything else.
+ One issue with the likelihood-based approach is that if either $X(t)$ or
$S(t)$ is a continuous variable, we _must_ impose some parametric structure
on $g$ and/or $q$, since otherwise maximizing the likelihood leads to bad
answers. Checking these assumptions is important, and potentially tricky.
- _Alternate between guessing at states and estimation_ If we _knew_ the
sequence of states $S(t)$, it would be straightforward to estimate both $q$
and $g$. We could estimate $q$ using any technique we'd use to estimate
transition probabilities for a Markov process, and we'd estimate $g$
using any technique for conditional distributions. (We might even use
regression in both problems, if we could believe in additive noise.)
In the other direction, if we knew $q$ and $g$, we can estimate $S(t)$
for each $t$. This suggests a whole slew of iterative-approximation
methods, which start _either_ with a guess for the $S$'s or for $q$ and $g$,
and then alternate between using the current estimate for one to improve
the estimate of the other.
+ The most famous iterative-approximation approach is the
**Baum-Welch**, **forward-backward**, **expectation-maximization**, or
**EM** algorithm. It uses the backward algorithm (see above) to find
$\Prob{S(t)|X(1:n)}$, using the current estimates for $q$ and $g$. (This
is the "E" step, since a probability is an expectation.) It then does
weighted maximum likelihood to re-estimate $q$ and $g$, with every
possible sequence of states getting a weight given by its probability
under the backward algorithm. (This is the "M" step, since it's
maximizing the likelihood.) It then re-runs the backward algorithm to
re-estimate the states, and so forth to convergence. One can
show that both the E-step and the M-step increase the log-likelihood, so
this has to converge on at least a local maximum, though perhaps not
a global one. (See @CRS-ADAfaEPoV for a tutorial introduction to EM.)
+ Variants include replacing finding the whole distribution
$\Prob{S(t)|X(1:n)}$ with drawing samples from it (perhaps with the
particle filter), or even just finding the most probable sequence
$\hat{s}(1:n)$.
+ Non-parametric variants of the iterative-approximation idea are
straightforward to code up, but hard to prove things about.
- _Simulation-based inference_ Even with all the tricks for the forward
algorithm, filtering, etc., calculating the likelihood can be difficult
for complex HMMs. These can, however, be very straightforward to simulate.
All of the techniques for simulation-based inference from earlier lectures
can be brought to bear here, especially indirect inference.
### Example of Simulation-Based Inference
Let's work through an example of simulation-based inference, with our toy
problem. (The advantage of trying it out on a toy is that we can tell whether
or not it's working.) Specifically, we'll use indirect inference, with an
autoregressive auxiliary model. We'll need (i) a function to simulate the
generative model, (ii) a function to estimate the auxiliary model,
and (iii) a function which optimizes the generative model so that auxiliaries
fit to its simulation match auxiliaries fit to the data.
We wrote (i) already, `noisy.rpmchain`. We need to write (ii) and (iii).
For (ii), the estimator of the auxiliary model, we'll use an autoregressive
model. There is one[^moreparams] parameter we need to estimate, the probability
$\Prob{S(t+1)=+1|S(t)=+1} = \Prob{S(t+1)=-1|S(t)=-1} = q$. This suggests
an AR(1) should be enough. To give this a bit more information, though,
we'll fit an AR(2).
[^moreparams]: If we wanted to, we could treat all the entries in the transition matrix for $S$ as unknown, and estimate them, in the same way we're doing here, but this is just a demo. Similarly, we could treat the variance of the observational noise as unknown and estimate it, etc.
```{r}
# Estimate the auxiliary model (here, an AR(2)) for indirect inference
# Inputs: a time series (x)
# Output: Point estimate of the auxiiary model (here, a vector of length 2)
aux.estimator <- function(x) {
ar2.fit <- ar.ols(x, aic=FALSE, order=2)
return(ar2.fit$ar)
}
# Repeatedly simulate the generative model, fit the auxiliary, and average
# Inputs: parameter of generative model (q)
# number of simulations to run (s)
# length of each simulation (n)
# Output: vector of averaged auxiliary parameter estimates
aux.from.sim <- function(q, s, n=length(sim$x)) {
estimates <- replicate(s, aux.estimator(noisy.rpmchain(n=n, q=q)$x))
return(rowMeans(estimates))
}
```
We should check that there's a unique mapping from the parameter(s) in the
generative model to the auxiliary parameters (so that inverting this mapping in
principle lets us recover the generative parameter). We can do this here by
plotting how the auxiliary parameters vary as we sweep over the generative
parameter. (Obviously this sort of visual check gets harder with more
parameters on each side.)
```{r}
plottable.aux.from.sim <- Vectorize(aux.from.sim, vectorize.args="q")
ar2.vs.q <- sapply(seq(from=0.01, to=0.99, by=0.01),
plottable.aux.from.sim, s=500)
```
```{r}
plot(ar2.vs.q[1,], ar2.vs.q[2,], type="l", xlab="Coefficient on X(t-1)",
ylab="Coefficient on X(t-2)")
text(ar2.vs.q[1,1], ar2.vs.q[2,1], labels="q=0.01")
text(ar2.vs.q[1,99], ar2.vs.q[2,99], labels="q=0.99")
```
We can now follow the examples in earlier lectures to write out the
indirect-inference function. One thing worth noting here is that we know the
parameter $q$ has to be between $0$ and $1$ (it's a probability), and,
fortunately, `optim` has a good method for optimizing a single parameter
between known limits. With multiple parameters, it might be worthwhile to
use a package like `alabama`, which allows for very general constraints on
the parameters.
```{r}
toy.indirect.inference <- function(x, starting.q, s) {
n <- length(x)
# What's the auxiliary model on the actual data?
aux.data <- aux.estimator(x)
# Calculate the discrepancy between the auxiliary on the data and
# the auxiliary we get from simulating at a particular parameter value
discrepency <- function(param) {
q <- param
# Get the average of auxiliary estimates over s simulation runs
aux.mean <- aux.from.sim(q=q, s=s, n=n)
# Mean squared difference over auxiliary parameters
return(sum((aux.data-aux.mean)^2))
}
# Now minimize the discrepancy
fit <- optim(par=starting.q, fn=discrepency, method="Brent", lower=0,
upper=1)
# Return just the best-fitting value
return(fit$par)
}
```
Now let's actually run it:
```{r}
(toy.ii <- toy.indirect.inference(x=sim$x, starting.q=0.5, s=500))
```
Recall that the true value of $q$ was $0.75$, so this is doing pretty well. If
we want to get a standard error, we could either follow the theory in Lecture
19, or just do a bootstrap. (Since we know this is a stationary process, that
could either be a parametric or a non-stationary block bootstrap.)
Note that now that we could estimate the parameter of the generative
model without also _having_ to estimate the unobserved state. Having estimated
$q$, however, we can now estimate the states $S(t)$, say with the particle
filter.
## Some Important Special Cases
### Mixture Models
A sequence of IID variables is, technically, a Markov process.
So if the sequence $S(1), \ldots S(n)$ is IID, we _could_ apply the techniques
above to estimate the latent variables $S(t)$. But most of the work would be
unnecessary. Because $S(t)$ is independent of both $S(t-1)$ and $S(t+1)$,
it breaks up into a collection of independent problems, of working out
the distribution of $S(t)$ given $X(t)$ (and _only_ given $X(t)$).
When the latent variable $S(t)$ is discrete, we call this problem
a "mixture model" or a "latent class model". When $S(t)$ is continuous, the
problem has multiple names; "factor model" is common when $X(t)$ is linearly
related to $S(t)$.
@CRS-ADAfaEPoV goes over these situations in some detail.
### Deterministic Latent Dynamics
If there is actually a _deterministic_ relationship between $S(t)$ and $S(t+1)$,
say $S(t+1) = \phi(S(t))$ for some function $\phi$, that, too, is technically
a Markov process. What makes this case special is that giving the initial
value $S(1)$ fixes the whole rest of the $S(t)$ sequence. Thus there isn't
the exponential growth in the number of possible trajectories we see with
other Markov models. In fact, it makes sense to treat the initial state
$S(1)$ as an unknown but _fixed_ parameter, say $s(1)$. The likelihood is then
\[
\prod_{t=1}^{n}{g(s(t), x(t); \theta)}
\]
with the understanding that $s(t) = \phi^t(s(1))$. Of course, if the dynamics
in the function $\phi$ are very sensitive to small changes in the initial state,
optimizing this likelihood can be very difficult, and so it might make more
sense to still fall back on techniques like simulation-based inference.
# Exercises
To think through / practice on, not to hand in.
1. Think back to the first example: $S(t)$ is a Markov chain with states $+1$
and $-1$, $\Prob{S(t+1)=+1|S(t)=+1} = \Prob{S(t+1)=-1|S(t)=-1}=q$, and $X(t)
\sim \mathcal{N}(S(t), 1)$. Assume that $S(1)$ is drawn from the invariant
distribution, which is $(1/2, 1/2)$, so that the process is stationary.
Find (a) the variance of $X(t)$; (b) all the covariances between
$X(t-1)$, $X(t)$ and $X(t+1)$, and (c) the optimal coefficients for
a linear prediction of $X(t+1)$ from $X(t)$ and $X(t-1)$. (You will need
to invert a $2\times 2$ matrix.)
# References