\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

```
set.seed(2018-12-3)
library(knitr)
opts_chunk$set(size="small", background="white",
cache=TRUE, autodep=TRUE,
tidy=FALSE, warning=FALSE, message=FALSE,
echo=TRUE)
```

The last few lectures have focused on Markov processes, where conditioning on the whole past is equivalent to conditioning on the most recent value: \[
\Prob{X(t+1)|X(1:t)} = \Prob{X(t+1)|X(t)}
\] When this is true, we say that \(X(t)\) is the **state** of the process at time \(t\), the variable which determines the whole distribution of future observations. When this assumption holds, we can easily do likelihood-based inference and prediction. But the Markov property commits us to \(X(t+1)\) being independent of all earlier \(X\)’s given \(X(t)\).

The most natural route from Markov models to hidden Markov models is to ask what happens if we don’t observe the state perfectly. I have been using \(X(t)\) to always stand for the time series we observe, so I am going to introduce a new stochastic process, \(S(t)\), which will be a Markov process, and say that what we observe is \(X(t)\) plus independent noise. Will this \(X(t)\) be Markov?

To be concrete, consider the following set-up. \(S(t)\) is a Markov chain with two states, \(-1\) and \(+1\), and \(\Prob{S(t+1)=+1|S(t)=+1} = \Prob{S(t+1)=-1|S(t)=-1} = q > 0.5\). This chain thus alternates between its two states, but tends to stick in the same state for a while, to be “persistent”. What we observe is \(X(t) = S(t) + \epsilon(t)\), where the \(\epsilon(t)\) are IID Gaussian noise variables with mean 0 and variance \(1\). (Nothing particularly turns on the choice of Gaussian noise, variance 1, etc.) The figure illustrates both a typical realization of the observables \(X(t)\), and of the chain \(S(t)\) that generated it. (The source file has the code.)
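The figure's simulation can be reproduced along the following lines (a sketch of my own, not necessarily the source file's code; the function name `rhmm` and the persistence probability \(q=0.75\) are my choices):

```r
# Simulate the two-state persistent chain S(t) and the noisy observations
# X(t) = S(t) + Gaussian noise
rhmm <- function(n, q=0.75) {
    s <- numeric(n)
    s[1] <- sample(c(-1, +1), size=1)    # uniform initial state
    for (t in 2:n) {
        # Stay in the current state with probability q, otherwise flip
        s[t] <- ifelse(runif(1) < q, s[t-1], -s[t-1])
    }
    x <- s + rnorm(n, mean=0, sd=1)      # observe the state plus noise
    return(list(s=s, x=x))
}
sim <- rhmm(100)
```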

Now any value of \(X(t)\) *could* have come from either value of \(S(t)\). (See next figure.) Large positive values of \(X(t)\) are *really unlikely* if \(S(t)=-1\), but not impossible, and values of \(X(t)\) near zero are about equally likely under either state.

```
curve(0.5*dnorm(x,-1,1)+0.5*dnorm(x,1,1), xlab="x", ylab="p(x)",
main="Marginal and Conditional Distributions of X(t)", xlim=c(-5,5), ylim=c(0, 0.5))
curve(dnorm(x,-1,1), lty="dashed", add=TRUE)
curve(dnorm(x,+1,1), lty="dotted", add=TRUE)
legend("topright", legend=c("X(t)", "X(t)|S(t)=-1", "X(t)|S(t)=+1"),
lty=c("solid", "dashed", "dotted"))
```

This has consequences when we ask about the distribution of \(X(t+1)\). If \(X(t) = 3.5\), then it’s very likely that \(S(t)=+1\) — so likely that earlier history won’t change much of anything. But if \(X(t) = 0\), well, it will make a difference whether \(X(t-1)\) was positive or negative.

This simple observation is enough, in itself, to tell us that \(X(t)\) can’t be a Markov process. In any Markov process, \(X(t-1)\) is irrelevant to \(X(t+1)\) given \(X(t)\). In particular, if \(X\) is numerical and we regress \(X(t+1)\) on both \(X(t)\) and \(X(t-1)\), the true coefficient on \(X(t-1)\) is *exactly* zero. This is *not* the case here (Exercise 1).
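As a numerical sanity check (a sketch, not a full solution to the exercise), one can simulate a long realization of the two-state HMM and run exactly this regression; with persistence 0.75, a little covariance algebra suggests the population coefficient on \(X(t-1)\) is about 0.067, clearly non-zero:

```r
# Simulate a long run of the two-state HMM (persistence 0.75) and regress
# X(t+1) on X(t) and X(t-1); a Markov process would give a zero coefficient
# on the X(t-1) term
set.seed(42)
n <- 1e5
s <- numeric(n); s[1] <- 1
for (t in 2:n) { s[t] <- ifelse(runif(1) < 0.75, s[t-1], -s[t-1]) }
x <- s + rnorm(n)
fit <- lm(x[3:n] ~ x[2:(n-1)] + x[1:(n-2)])
coef(fit)    # the coefficient on x[1:(n-2)] is noticeably non-zero
```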

Quite generally, if \(S(t)\) is a Markov process, and \(X(t)\) is a function of \(S(t)\), plus, possibly, some extra noise, then \(X(t)\) is *not* Markov^{1}.

Examples like these lead to a general notion of a **hidden Markov model**, or **state-space model**. In these models, there is a **latent** or **hidden** state \(S(t)\), which follows a Markov process. We’ll write \(\Prob{S(t+1)=r|S(t)=s} = q(s,r)\). As in Markov models, the transitions need to be complemented with a distribution for the initial state. Because these states are hidden, we don’t actually get to observe them. The **observable** or **manifest** variable \(X(t)\) is a random or noisy function of \(S(t)\). That is, there’s some sequence of IID noise variables, say \(\epsilon(t)\), and \(X(t) = f(S(t), \epsilon(t))\) for some function \(f\). As a consequence, given \(S(t)\), \(X(t)\) is independent of \(X(t^\prime), S(t^\prime)\) for all \(t^\prime \neq t\). An *example* of this would be when \(X(t) = a S(t) + \epsilon(t)\), but the function can be nonlinear and the noise can be non-additive. In fact, the exact function \(f\) usually doesn’t matter very much; what matters are the conditional observation probabilities \(\Prob{X(t)=x|S(t)=s} = g(s,x)\).
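For the running two-state example, \(q\) and \(g\) are simple functions (a sketch; the persistence value 0.75 is an assumption of mine, matching the simulations later on):

```r
# Transition probabilities q(s, r) = Pr(S(t+1)=r | S(t)=s),
# with persistence 0.75
q <- function(s, r) { ifelse(s == r, 0.75, 0.25) }
# Observation densities g(s, x), for X(t) | S(t)=s ~ N(s, 1)
g <- function(s, x) { dnorm(x, mean=s, sd=1) }
```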

Some comments are in order on this definition.

- Both \(S(t)\) and \(X(t)\) can be numerical or qualitative variables, or indeed more complicated.
- Some people make a distinction between “hidden Markov models”, where \(S(t)\) is discrete, and “state-space models”, where \(S(t)\) is continuous. Some of these people would further limit “state-space models” to situations where \(X(t)\) is also continuous. Since the theory and methods are almost the same, however, I will use the terms interchangeably.
- Obviously, when variables are continuous, probability mass functions need to get replaced by probability density functions, and sums by integrals.

- Both \(S(t)\) and \(X(t)\) can have multiple dimensions. I won’t indicate this in the notation.
- Any type of state variable can, potentially, be combined with any type of observable. For instance, you can have discrete \(S(t)\) and continuous-function-valued \(X(t)\). (These are sometimes used in speech recognition, where the discrete states are the qualitative linguistic sounds, “phonemes”, people are saying, and the function-valued observations are the actual noises they make over brief intervals of time.)
- HMMs can be extra *useful* when the state is somehow simpler than the observable — say a low-dimensional or even binary state explaining a high-dimensional vector of observables — but that’s not required by the definition.
- In particular, if \(S(t)\) has \(r\) dimensions, but we observe only \(k < r\) of them, the variables we *observe* will not be Markov, even if they’re observed perfectly.

Because the states \(S(t)\) form a Markov process, we can easily calculate the probability of a state sequence: \[\begin{eqnarray} \Prob{S(1:n) = s(1:n)} & = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{\Prob{S(t)=s(t)|S(1:t-1)=s(1:t-1)}}\\ & = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{\Prob{S(t)=s(t)|S(t-1)=s(t-1)}}\\ & = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{q(s(t-1), s(t))} \end{eqnarray}\] The first line is basic probability; the second uses the Markov property; the third, the definition of \(q\).
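In code, for the two-state chain (a sketch, assuming persistence 0.75 and a uniform initial distribution):

```r
# Probability of a state sequence s(1:n): Pr(S(1)=s(1)) times the product
# of the transition probabilities q(s(t-1), s(t))
state.seq.prob <- function(s, q=0.75, p1=0.5) {
    stays <- s[-1] == s[-length(s)]    # which transitions stay put?
    return(p1 * prod(ifelse(stays, q, 1-q)))
}
state.seq.prob(c(1, 1, -1))    # 0.5 * 0.75 * 0.25 = 0.09375
```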

It’s equally easy to calculate the probability of a sequence of observations, given a sequence of states: \[\begin{eqnarray} \Prob{X(1:n)=x(1:n)|S(1:n)=s(1:n)} & = & \prod_{t=1}^{n}{\Prob{X(t)=x(t)|S(t)=s(t)}}\\ & = & \prod_{t=1}^{n}{g(s(t), x(t))} \end{eqnarray}\] This is because \(X(t)\) is independent of everything else, given \(S(t)\).

Putting these together gives us the joint probability of a sequence of states and a sequence of observations: \[\begin{eqnarray} \Prob{X(1:n)=x(1:n), S(1:n)=s(1:n)} & = & \Prob{S(1:n)=s(1:n)}\Prob{X(1:n)=x(1:n)|S(1:n)=s(1:n)}\\ & = & \Prob{S(1)=s(1)} \prod_{t=2}^{n}{q(s(t-1), s(t))}\prod_{t=1}^{n}{g(s(t), x(t))} \end{eqnarray}\]
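Putting the two pieces together in code (a sketch, with the same two-state assumptions as above; working on the log scale avoids underflow for long sequences):

```r
# Log joint probability of a state sequence s and an observation sequence x,
# for the two-state chain with N(s, 1) observations
log.joint.prob <- function(s, x, q=0.75, p1=0.5) {
    stays <- s[-1] == s[-length(s)]
    log.trans <- sum(log(ifelse(stays, q, 1-q)))      # log prod q(s(t-1), s(t))
    log.obs <- sum(dnorm(x, mean=s, sd=1, log=TRUE))  # log prod g(s(t), x(t))
    return(log(p1) + log.trans + log.obs)
}
```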

This, in turn, in principle gives us the probability of a sequence of observations: \[\begin{eqnarray} \Prob{X(1:n)=x(1:n)} & = & \sum_{s(1:n)}{\Prob{X(1:n)=x(1:n), S(1:n)=s(1:n)}}\\ \end{eqnarray}\]

This probability is a very valuable thing to know. When we make predictions, we want \[ \Prob{X(t)=x|X(1:t-1)=x(1:t-1)} = \frac{\Prob{X(1:t)=x(1:t)}}{\Prob{X(1:t-1)=x(1:t-1)}} \] so if we can calculate the probability of a sequence, and we can divide, then we can predict. If \(q\) and \(g\) contain some unknown parameters \(\theta\), and we want to do inference on them, then we really should be writing things like \[ \Prob{X(1:n)=x(1:n); \theta} \] and so calculating sequence probabilities will let us do likelihood-based inference.

Unfortunately, the sum-over-state-histories expression, \(\sum_{s(1:n)}{\Prob{X(1:n)=x(1:n), S(1:n)=s(1:n)}}\), is just not suitable for direct evaluation. If there are even two states, there are \(2^n\) possible values of \(s(1:n)\), and that’s clearly not a sum you want to do if \(n\) is at all large.

There is a neat trick which lets us avoid the exponential explosion in evaluating the likelihood, or in making predictions. The same trick will also let us estimate the hidden state at a given time. The trick is to do all of these things **recursively**, based on having solved the same problem with less of the data.

Let’s start with the base case. If we knew \(S(1)\), it’d be easy to find the distribution of \(X(1)\): \[ \Prob{X(1)=x(1)|S(1)=s} = g(s, x(1)) \] It’s therefore easy to find the marginal distribution of \(X(1)\), i.e., our predictive distribution for \(X(1)\): \[ \Prob{X(1)=x(1)} = \sum_{s}{\Prob{X(1)=x(1)|S(1)=s}\Prob{S(1)=s}} = \sum_{s}{g(s, x(1)) \Prob{S(1)=s}} \] Now we can use the definition of conditional probability to estimate the state at time 1: \[\begin{eqnarray} \Prob{S(1)=s|X(1)=x(1)} & = & \frac{\Prob{S(1)=s, X(1)=x(1)}}{\Prob{X(1)=x(1)}}\\ & = & \frac{\Prob{S(1)=s}\Prob{X(1)=x(1)|S(1)=s}}{\Prob{X(1)=x(1)}}\\ & = & \frac{\Prob{S(1)=s} g(s, x(1))}{\sum_{s^{\prime}}{\Prob{S(1)=s^{\prime}} g(s^{\prime}, x(1))}} \end{eqnarray}\] This is, of course, just Bayes’s rule. Because this will be important later, let’s abbreviate it by \(F_1(s)\).
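Numerically, for the two-state example (a sketch; the observed value \(x(1)=0.8\) and the uniform initial distribution are illustrative assumptions of mine):

```r
# Base case: predictive probability of x(1) and the filtering
# distribution F_1 over the two states
x1 <- 0.8
prior <- c(0.5, 0.5)                     # Pr(S(1) = -1), Pr(S(1) = +1)
lik <- dnorm(x1, mean=c(-1, +1), sd=1)   # g(s, x(1)) for each state
p.x1 <- sum(prior * lik)                 # Pr(X(1) = x(1))
F1 <- prior * lik / p.x1                 # Bayes's rule
F1    # puts most of the probability on S(1) = +1
```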

Now that we know (the distribution of) the state at time 1, we can extrapolate forward to time 2: \[\begin{eqnarray} \Prob{S(2)=r|X(1)=x(1)} & = & \sum_{s}{\Prob{S(2)=r, S(1)=s|X(1)=x(1)}}\\ & = & \sum_{s}{\Prob{S(2)=r|S(1)=s, X(1)=x(1)}\Prob{S(1)=s|X(1)=x(1)}}\\ & = & \sum_{s}{\Prob{S(2)=r|S(1)=s}\Prob{S(1)=s|X(1)=x(1)}}\\ & = & \sum_{s}{q(s,r) F_1(s)} \end{eqnarray}\] using the Markov property in the next-to-last line, and the definitions of \(q\) and \(F_1\) in the last line.

We want to extrapolate the state forward because it lets us make a prediction: \[\begin{eqnarray} \Prob{X(2)=x|X(1)=x(1)} & = & \sum_{s}{\Prob{X(2)=x, S(2)=s|X(1)=x(1)}}\\ & = & \sum_{s}{\Prob{X(2)=x|S(2)=s, X(1)=x(1)}\Prob{S(2)=s|X(1)=x(1)}}\\ & = & \sum_{s}{\Prob{X(2)=x|S(2)=s}\Prob{S(2)=s|X(1)=x(1)}}\\ & = & \sum_{s}{g(s, x)\Prob{S(2)=s|X(1)=x(1)}} \end{eqnarray}\] using the assumption that \(S(t)\) screens off everything else from \(X(t)\) in the next-to-last line.

Finally, once we’ve seen \(X(2)\), we can use that to get the distribution of \(S(2)\): \[\begin{eqnarray} F_2(s) & \equiv & \Prob{S(2)=s|X(2)=x(2), X(1)=x(1)}\\ & = & \frac{\Prob{S(2)=s, X(2)=x(2)|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}}\\ & = & \frac{\Prob{X(2)=x(2)|S(2)=s, X(1)=x(1)}\Prob{S(2)=s|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}}\\ & = & \frac{\Prob{X(2)=x(2)|S(2)=s}\Prob{S(2)=s|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}}\\ & = & \frac{g(s, x(2))\Prob{S(2)=s|X(1)=x(1)}}{\Prob{X(2)=x(2)|X(1)=x(1)}} \end{eqnarray}\] Again, this is Bayes’s rule. Notice that the role of “prior distribution” here isn’t played by \(\Prob{S(1)}\) or even \(\Prob{S(1)|X(1)}\), but by \(\Prob{S(2)|X(1)}\).

Now we’re ready to go on to \(t=3\), and to state the general, recursive pattern. This is sometimes called the **forward algorithm**.

- Begin with \(\Prob{S(t)=s|X(1:t)=x(1:t)} \equiv F_t(s)\)
- Extrapolate the state forward: \[ \Prob{S(t+1)=s|X(1:t)=x(1:t)} = \sum_{s^{\prime}}{q(s^{\prime}, s)\Prob{S(t)=s^{\prime}|X(1:t)=x(1:t)}} = \sum_{s^{\prime}}{q(s^{\prime}, s) F_t(s^{\prime})} \]
- Predict the next observation: \[ \Prob{X(t+1)=x|X(1:t)=x(1:t)} = \sum_{s}{g(s,x) \Prob{S(t+1)=s|X(1:t)=x(1:t)}} \]
- Condition \(S(t+1)\) on \(X(t+1)\): \[ \Prob{S(t+1)=s|X(1:t+1)=x(1:t+1)} = \frac{g(s,x(t+1)) \Prob{S(t+1)=s|X(1:t)=x(1:t)}}{\Prob{X(t+1)=x(t+1)|X(1:t)=x(1:t)}} \]
- End with \(F_{t+1}(s) \equiv \Prob{S(t+1)=s|X(1:t+1)=x(1:t+1)}\), or begin the cycle over again with \(t+2\).
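The whole cycle can be sketched as a function (my own implementation for discrete state spaces, not anything from the source file; `pi0` is the initial state distribution, `Q` the matrix of transition probabilities \(q\), `states` the vector of state values, and `dobs` the observation density \(g\)):

```r
# Forward algorithm (a sketch): recursively compute the filtering
# distributions F_t(s) = Pr(S(t)=s | X(1:t)) and the predictive
# probabilities Pr(X(t)=x(t) | X(1:t-1))
forward.algorithm <- function(x, pi0, Q, states, dobs) {
    n <- length(x); k <- length(states)
    filt <- matrix(0, nrow=n, ncol=k)    # row t will hold F_t
    pred <- numeric(n)                   # Pr(X(t)=x(t) | X(1:t-1))
    state.dist <- pi0                    # Pr(S(t) | X(1:t-1)); starts at Pr(S(1))
    for (t in 1:n) {
        w <- dobs(states, x[t])               # g(s, x(t)) for each state
        pred[t] <- sum(state.dist * w)        # predict the next observation
        filt[t, ] <- state.dist * w / pred[t] # condition S(t) on X(t) (Bayes)
        state.dist <- as.vector(filt[t, ] %*% Q) # extrapolate: sum_s q(s,r) F_t(s)
    }
    return(list(filter=filt, predictive=pred, loglik=sum(log(pred))))
}

# Try it on a short illustrative series from the two-state example
out <- forward.algorithm(x=c(0.8, -1.2, 0.3),
                         pi0=c(0.5, 0.5),
                         Q=matrix(c(0.75, 0.25, 0.25, 0.75), nrow=2),
                         states=c(-1, +1),
                         dobs=function(s, x) { dnorm(x, mean=s, sd=1) })
```

Note that each \(F_t\) is a proper distribution over the \(k\) states, and the log-likelihood comes out as the sum of the log predictive probabilities.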

We began this by wanting an alternative to direct sum-over-state-histories for the likelihood \(\Prob{X(1:n)=x(1:n)}\). We can get that from this approach, by multiplying together predictive distributions: \[ \Prob{X(1:n)=x(1:n)} = \Prob{X(1)=x(1)}\prod_{t=2}^{n}{\Prob{X(t)=x(t)|X(1:t-1)=x(1:t-1)}} \] We haven’t, however, established that this approach is any faster than summing over histories!

To see that it is, think about how time-consuming it is to go through the cycle above once. To make a prediction for \(X(t+1)\), we need to sum over states \(S(t+1)\), and to get the probability that \(S(t+1)=s\), we need to sum over states \(S(t)\). So if there are \(k\) states, we need to do \(O(k^2)\) operations in that double sum. Each of the \(n\) time-steps requires one pass through the cycle, so we end up needing to do \(O(k^2 n)\) operations, not \(O(k^n)\) as with direct summation.

In this analysis, it’s important that the procedure is recursive, that we calculate \(F_{t+1}\) using just \(F_t\) and \(x(t+1)\) (and the model!), so that we don’t need to keep track of earlier states or their distributions, or even, in fact, earlier values of \(X\). To the extent they matter, they’ve been summarized in \(F_t\), which is a sufficient statistic for predicting \(X(t+1)\).

To sum up, in an HMM it’s straightforward to calculate \(\Prob{X(1)=x(1)}\) and \(F_1(s) = \Prob{S(1)=s|X(1)=x(1)}\). Thereafter, we can predict by cycling through (i) extrapolating the state distribution \(F_t\) forward in time (using the transition probabilities \(q\)), (ii) predicting the next observation (using (i) and the observation probabilities \(g\)), and (iii) updating the distribution of the state to \(F_{t+1}\) (using (ii) and Bayes’s rule). This lets us make predictions, and so calculate a likelihood, or log-likelihood.

You may have noticed that we’re just calculating \(\Prob{S(t)|X(1:t)}\) and not \(\Prob{S(t)|X(1:n)}\), let alone \(\Prob{S(1:n)|X(1:n)}\). Partly, this is because all we really *need* for prediction or the likelihood is \(\Prob{S(t)|X(1:t)}\). Partly, however, it is because it’s just a lot *easier* to calculate this than it is to calculate \(\Prob{S(t)|X(1:n)}\). Nonetheless, there is *also* a recursive algorithm for the latter, the **backward algorithm**. To use it, one must first have run the forward algorithm, getting a sequence \(\Prob{S(t)|X(1:t)}\) for each \(t\in 1:n\). The last of these, \(\Prob{S(n)|X(1:n)}\), is the last probability we want, so the recursion starts at the *end* of the time series and works backwards in time (hence the name). I will omit the details here, both because they’re not as illuminating as the forward algorithm’s, and because you can find them in readily-available references, such as Fraser (2008).

An important special case is when \(S(t)\) and \(X(t)\) are both vectors, and everything is linear-and-Gaussian: \[\begin{eqnarray}
S(t+1) & = & \mathbf{a} S(t) + \eta(t)\\
X(t) & = & \mathbf{b} S(t) + \epsilon(t)\\
\eta(t) & \sim_{IID} & \mathcal{N}(0, \mathbf{\Sigma}_{S})\\
\epsilon(t) & \sim_{IID} & \mathcal{N}(0, \mathbf{\Sigma}_X)
\end{eqnarray}\] In this case, it’s possible to write out all the conditional probabilities in the recursion *exactly* — everything is Gaussian, with mean vectors and variance matrices that are straightforward functions of the data. Finding the filtering distribution this way is known as the **Kalman filter**, after Kalman (1960) (and Kalman and Bucy (1961)). One of the clearest presentations I’ve seen is in Fraser (2008), which you should read if this is at all interesting. There is a base R implementation in the function `KalmanRun` (with various companion functions `KalmanLike`, `KalmanForecast`, etc.).
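To see what the recursion looks like concretely, here is a minimal Kalman filter for the scalar case (my own sketch, unrelated to `KalmanRun`'s interface; `mu0` and `var0` parameterize an assumed Gaussian initial state distribution):

```r
# Scalar Kalman filter (a sketch): filtered means and variances of S(t)
# given X(1:t), for S(t+1) = a S(t) + eta(t), X(t) = b S(t) + epsilon(t)
kalman.filter <- function(x, a, b, sigma2.S, sigma2.X, mu0=0, var0=1) {
    n <- length(x)
    mu <- numeric(n); v <- numeric(n)
    pred.mu <- mu0; pred.var <- var0    # moments of Pr(S(1))
    for (t in 1:n) {
        # Condition S(t) on X(t): Bayes's rule for Gaussians
        K <- pred.var * b / (b^2 * pred.var + sigma2.X)  # the Kalman gain
        mu[t] <- pred.mu + K * (x[t] - b * pred.mu)
        v[t] <- (1 - K * b) * pred.var
        # Extrapolate the state forward to t+1
        pred.mu <- a * mu[t]
        pred.var <- a^2 * v[t] + sigma2.S
    }
    return(list(mean=mu, var=v))
}
```

When the observation noise is tiny, the filtered means track the observations almost exactly, as they should.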

While the Kalman-filter solution is nicely straightforward, it is heavily tied to the linear-and-Gaussian assumptions. There are many important situations where that’s good enough (McGee and Schmidt 1985), but it can fail in important ways. This has led to a number of attempts to use local linear approximations to the Kalman filter in non-linear problems, and to alternative closed-form approximate filters (Koyama et al. 2010).

A different direction, however, is to go to back to the core ideas of the forward algorithm, and to simulate them. This is the “particle filter”.

While the forward algorithm is straightforward in principle, it still involves a double sum over the state space. This can be time-consuming when the number of states \(k\) is large. When the state space is continuous, the sum becomes an integral, and doing even a double integral numerically can be quite time-consuming and/or inaccurate. In this situation, a useful trick is to replace complicated or time-consuming calculations with simulations, i.e., to use Monte Carlo.

Applied here, the **Monte Carlo filter**, or **particle filter**, approximates the distribution \(F_t\) with the empirical distribution of a large number of simulations of the HMM, conditioned on the data. It works as follows^{2}.

1. Make a large number of independent draws \(S^*_1, S^*_2, \ldots S^*_m\) from the distribution \(\Prob{S(1)}\). Each random state value is called a **particle**, and the number of particles \(m\) is, ideally, as large as practical. (See more on this below.) Set \(t=1\).
2. For each particle \(S^*_i\), calculate the weight \(w_{i} = g(S^*_i, x(t))\). Averaging these approximates the predictive probability, \(\Prob{X(t)=x(t)|X(1:t-1)=x(1:t-1)} \approx \frac{1}{m}\sum_{i=1}^{m}{w_i}\).
3. Now resample the particles, i.e., draw \(m\) of them, with replacement, from \(S^*\), with probabilities proportional to the \(w_{i}\). Call the new values \(\tilde{S}_1, \tilde{S}_2, \ldots \tilde{S}_m\). Set \(\hat{F}_t\) to be the empirical distribution of the \(\tilde{S}\); this distribution approximates \(\Prob{S(t)|X(1:t)=x(1:t)}\).
4. Increment \(t\) by 1, and evolve the particles forward in time. That is, for each \(\tilde{S}_i\), draw a new \(S^*_i\) according to \(q(\tilde{S}_i, \cdot)\). The distribution of these new \(S^*\) approximates \(\Prob{S(t)|X(1:t-1)=x(1:t-1)}\).
5. Go to (2).

This all lends itself wonderfully to code.

```
# Particle filter for an HMM
# Inputs: time series of length n (x)
# function generating draws for initial state (rinitial)
# function generating draws for next state (rtransition)
# function giving observational probabilities (dobs)
# number of particles (m)
# Output: m*n array of particles, approximating Pr(S(t)|X(1:t))
# Presumes: states can be elements in a vector
# observations can be elements in a vector
# rinitial is a no-argument function (for compatibility with rmarkov())
# rtransition is a one-argument function (ditto)
# dobs is a two-argument function
particle.filter <- function(x, rinitial, rtransition, dobs, m) {
# How long is the time series?
n <- length(x)
# Our initial guesses at the state, before seeing anything
particles <- replicate(m, rinitial())
# A storage area for our guesses
particle.history <- matrix(0, nrow=m, ncol=n)
for (t in 1:n) {
# Calculate observation likelihoods/weights for the current observation
# given state=particles
weights <- dobs(state=particles, observation=x[t])
# Resample the particles, with probabilities (proportional to)
# the weights
particles <- sample(particles, size=m, replace=TRUE, prob=weights)
# Store the current set of guesses
particle.history[,t] <- particles
# Evolve the particles forward using the Markov transitions for the
# next time step
particles <- sapply(particles, rtransition)
}
return(particle.history)
}
```

I will leave it as an exercise to extend this function so that it also returns (an approximation to) the sequence \(\Prob{X(t)|X(1:t-1)}\).

Here is an implementation using the HMM introduced at the beginning, where \(S(t)\) is either \(+1\) or \(-1\), and \(X(t)|S(t) \sim \mathcal{N}(S(t), 1)\).

```
sim.pf <- particle.filter(x=sim$x,
rinitial=function() { sample(c(-1,+1), size=1) },
rtransition=function (x) {
stay <- sample(c(TRUE, FALSE), size=1, prob=c(0.75, 0.25))
if (stay) { return(x) }
if (!stay) { return(-x) }
},
dobs=function(state, observation) {
dnorm(x=observation, mean=state, sd=1)
},
m=100)
```

We have 100 particles, i.e., 100 guesses for \(S(t)\) at each time point. Rather than try to visualize this directly — it’s rather “busy” — I’ll summarize with lines at the 5th and 95th percentiles (so a 90% probability interval).

```
plot(sim$x, xlab="t", ylab="X(t)", xlim=c(0, 120))
lines(1:100, sim$s, col="grey", lwd=2)
pf.limits <- apply(sim.pf, 2, quantile, prob=c(0.05, 0.95))
lines(1:100, pf.limits[1,], col="blue")
lines(1:100, pf.limits[2,], col="blue")
legend("right", legend=c("Hidden state S(t)", "Observable X(t)", "Particle filter"),
pch=c(NA, 1, NA), lty=c("solid", "blank", "solid"), col=c("grey", "black", "blue"),
cex=0.5)
```