# Linear Generative Models for Spatial and Spatio-Temporal Data

18 October 2018

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Neighbors}{N}$

# In our last episode

• Autoregressive models of order $$p$$ (AR$$(p)$$) for univariate time series:
• $$X(t) = a+\sum_{j=1}^{p}{b_j X(t-j)} + \epsilon(t)$$
• Innovations $$\epsilon$$ uncorrelated with each other and with past of $$X$$
• Generating a time series = step through that rule over and over
• Predictions = plug in to the deterministic part
• Vector autoregressions of order $$p$$ (VAR$$(p)$$) for multivariate time series:
• $$\vec{X}(t) = \vec{a} + \sum_{j=1}^{p}{\mathbf{b}_j \vec{X}(t-j)} + \vec{\epsilon}(t)$$
• All vectors have dimension $$k$$, matrices dimension $$k\times k$$
• In general:
• Can eliminate higher-order dependence in favor of first-order dependence with more variables (to keep track of memory)
• In a first-order linear system, the eigenvalues and eigenvectors tell us everything
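The second point can be made concrete with a small Python sketch (illustrative coefficients, not from the notes): an AR(2) becomes a first-order system in $(X(t), X(t-1))$ via its companion matrix, and the eigenvalues of that matrix determine stationarity.

```python
import numpy as np

# Hypothetical AR(2): X(t) = 0.5 X(t-1) + 0.3 X(t-2) + noise.
# Rewrite as a first-order system in Z(t) = (X(t), X(t-1)):
# Z(t) = B Z(t-1) + (eps(t), 0), with B the "companion matrix".
b1, b2 = 0.5, 0.3
B = np.array([[b1, b2],
              [1.0, 0.0]])

# The process is stationary iff all eigenvalues of B lie inside the unit circle
eigvals = np.linalg.eigvals(B)
print(np.abs(eigvals))  # both moduli < 1 here, so stationary
```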

# Today

• A little about estimating AR(1)
• What to do with purely spatial data?
• What to do with spatio-temporal data?

# Estimating AR(1)

• If we can estimate an AR(1), we can figure out how to estimate a VAR(1)
• If we can estimate a VAR(1), we should be able to estimate an AR(p)
• So focus on estimating an AR(1)

$X(t) = a+b X(t-1) + \epsilon(t)$

• Given data $$X(0), X(1), X(2), \ldots X(t), \ldots X(n)$$, how do we fit this?
• If we can find $$a, b$$ we can get distribution of $$\epsilon$$ from residuals
• If we can find $$b$$ we can find $$a$$ in lots of ways

# Estimating $$b$$: Yule-Walker approach

• Assume stationarity (so $$|b| < 1$$)
• Work out covariance function $$\gamma(h; b) = b^h \gamma(0)$$
• More work if doing this for an AR$$(p)$$
• Pick the $$b$$ that best fits observed $$\hat{\gamma}(h)$$
• Seems reasonable if stationary
• and if $$\hat{\gamma}(h) \rightarrow \gamma_{\mathrm{true}}(h)$$ as $$n\rightarrow \infty$$ — why do we believe that?
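For the AR(1) the Yule-Walker recipe is one line: since $\gamma(1) = b\,\gamma(0)$, estimate $b$ by the ratio of sample covariances. A minimal simulation check (illustrative values, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
b_true, n = 0.6, 5000

# Simulate a stationary AR(1) with a = 0 and Gaussian innovations
x = np.zeros(n)
for t in range(1, n):
    x[t] = b_true * x[t - 1] + rng.standard_normal()

# Yule-Walker for AR(1): gamma(1) = b * gamma(0), so b_hat = gamma_hat(1) / gamma_hat(0)
gamma0 = np.mean(x * x)
gamma1 = np.mean(x[:-1] * x[1:])
b_hat = gamma1 / gamma0
print(b_hat)  # close to 0.6 for n this large
```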

# Estimating $$b$$: Ordinary least squares approach

• Assume $$a=0$$ to simplify book-keeping
• Regress $$X(1), X(2), \ldots X(n)$$ on $$X(0), X(1), \ldots X(n-1)$$
• i.e., “shut up and calculate”
• The usual math leads to $\begin{eqnarray} \hat{b} & = & \frac{\sum_{t=0}^{n-1}{X(t)X(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & \frac{\sum_{t=0}^{n-1}{X(t)(b X(t) + \epsilon(t+1))}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & b + \frac{\sum_{t=0}^{n-1}{X(t)\epsilon(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}} \end{eqnarray}$
• $$X(t)$$ and $$\epsilon(t+1)$$ are uncorrelated so maybe the extra bit goes to zero?
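The OLS formula above is equally short in code. A sketch, again with made-up values; note the estimate is nearly identical to the Yule-Walker ratio:

```python
import numpy as np

rng = np.random.default_rng(1)
b_true, n = 0.6, 5000

# Simulate an AR(1) with a = 0
x = np.zeros(n + 1)
for t in range(1, n + 1):
    x[t] = b_true * x[t - 1] + rng.standard_normal()

# OLS slope from the display above: regress X(1..n) on X(0..n-1), no intercept
b_hat = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)
print(b_hat)  # close to 0.6
```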

# Estimating $$b$$: Maximum likelihood approach

• Assume a pdf $$f$$ for $$\epsilon(t)$$ with parameter $$\theta$$
• Assume $$\epsilon(t)$$ are IID, not just uncorrelated
• Conditional likelihood, given $$X(0)$$, is $\prod_{t=1}^{n}{f(X(t) - a - bX(t-1); \theta)}$
• Normalized conditional log-likelihood, given $$X(0)$$, is $L_n(a,b,\theta) = \frac{1}{n}\sum_{t=1}^{n}{\log{f(X(t) - a - bX(t-1); \theta)}}$
• Maximize this over $$a, b, \theta$$
• Can work if $$L_n(a,b,\theta)$$ converges to a well-behaved limit
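With Gaussian innovations the conditional MLE has a closed form (it coincides with OLS), but the generic recipe is numerical maximization of $L_n$. A sketch assuming scipy is available (all parameter values illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
a_true, b_true, sigma_true, n = 1.0, 0.5, 2.0, 2000

# Simulate an AR(1) with intercept and Gaussian innovations
x = np.zeros(n + 1)
for t in range(1, n + 1):
    x[t] = a_true + b_true * x[t - 1] + sigma_true * rng.standard_normal()

def neg_loglik(params):
    # Negative normalized conditional log-likelihood, Gaussian innovations
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize by log-sigma to keep sigma > 0
    resid = x[1:] - a - b * x[:-1]
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

fit = minimize(neg_loglik, x0=[0.0, 0.0, 0.0])
a_hat, b_hat, sigma_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
print(a_hat, b_hat, sigma_hat)
```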

# Estimating AR(1)

• We will see later what we need to assume to get either approach to work
• Stationarity will be sufficient, but OLS and MLE sometimes work even for non-stationary processes

# In space, no one can decide which way to go

• For time, it makes sense to generate the future from the past
• Space isn’t directed the same way time is
• This makes purely spatial generative models much trickier, and much harder to interpret

# Conditional autoregressive (CAR) spatial models

• Data $$X(r_1), X(r_2), \ldots X(r_n)$$
• Assume $$r_i$$ are on a regular grid
• Write $$\Neighbors(r)$$ for the locations next to $$r$$
• CAR(1): $X(r) | \{X(q): q \in \Neighbors(r)\} = \sum_{j\in\Neighbors}{b_j X(r+j)} + \epsilon(r)$, summing over the offsets $$j$$ from $$r$$ to its neighbors
• All neighborhoods have the same shape
• Each neighbor gets to have its own coefficient
• CAR$$(p)$$ is a sum of neighbors, plus neighbors’ neighbors, etc.
• Typically assume $$\epsilon$$ is IID
• Easy to include a global mean

# The Gaussian CAR model

• Assume $$\epsilon(r)$$ is IID Gaussian, variance $$\tau^2$$
• Assemble a matrix $$\mathbf{b}$$ where $$b_{ij} =$$ coefficient of $$X(r_j)$$ in the model for $$X(r_i)$$
• Then $$X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1})$$
• In-class exercise: $X(r_i)|\text{all other}~ X(r_k)\sim \mathcal{N}(\mu(r_i), \tau^2)$ What is $$\mu(r_i)$$, in terms of $$\mathbf{b}$$ and the $$X$$’s?

# The Gaussian CAR model

• Assume $$\epsilon(r)$$ is IID Gaussian, variance $$\tau^2$$
• Assemble a matrix $$\mathbf{b}$$ where $$b_{ij} =$$ coefficient of $$X(r_j)$$ in the model for $$X(r_i)$$
• Then $$X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1})$$
• Conditional law: $X(r_i)|\text{all other} ~ X(r_k) \sim \mathcal{N}(\mu(r_i), \tau^2)$ with $\mu(r_i) = \sum_{k=1}^{n}{b_{ik} X(r_k)}$
• Intuition: we’ve built in conditional independence
• Rigor: general properties of Gaussians; see back-up
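The claimed conditional law can be checked numerically: compute the conditional mean of one site from the joint Gaussian (via the usual covariance-partitioning formula) and compare it to $\sum_k b_{ik} X(r_k)$. A sketch on a nearest-neighbor ring with illustrative values:

```python
import numpy as np

# Nearest-neighbor CAR on a ring of n sites, coefficient c on each neighbor
# (n, c, tau2 are illustrative; c small enough that I - b is positive definite)
n, c, tau2 = 6, 0.2, 1.0
b = np.zeros((n, n))
for i in range(n):
    b[i, (i - 1) % n] = c
    b[i, (i + 1) % n] = c

Sigma = tau2 * np.linalg.inv(np.eye(n) - b)  # global covariance of the Gaussian CAR

# Conditional mean of X_0 given the rest, from the joint Gaussian:
# mu_0 = Sigma[0, rest] @ inv(Sigma[rest, rest]) @ x_rest
x = np.arange(1.0, n)            # arbitrary values for X_1, ..., X_{n-1}
rest = np.arange(1, n)
mu0_joint = Sigma[0, rest] @ np.linalg.solve(Sigma[np.ix_(rest, rest)], x)

# The CAR specification gives the same thing directly: mu_0 = sum_k b_{0k} x_k
mu0_car = b[0, rest] @ x
print(mu0_joint, mu0_car)  # agree up to numerical error
```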

# Some points about CAR models

• Generally, you can write down any conditional regression model you like, and this is internally consistent
• e.g., non-linear but with Gaussian noise
• e.g., logistic regression on neighbors for binary-valued $$X$$
• e.g., non-linear and non-Gaussian, etc., etc.
• Conditional prediction is straightforward from the specification
• How do we generate a new field?
• There’s a global solution for the linear-Gaussian case…
• … but finding a global solution otherwise is really hard
• The model only specifies conditional distributions

# The “Gibbs sampler” trick

• Start with some value for $$X(r_1), X(r_2), \ldots X(r_n)$$
• Calculate the distribution of $$X(r_1) | X(\Neighbors(r_1))$$
• Draw a random value $$X^*$$ from that distribution
• Update: $$X(r_1) \leftarrow X^*$$
• Now go on to $$X(r_2)$$ and condition on its neighbors:
• Calculate the distribution of $$X(r_2) | X(\Neighbors(r_2))$$
• Draw a random value $$X^*$$ from that distribution
• Update: $$X(r_2) \leftarrow X^*$$
• Iterate through to $$X(r_n)$$, then go back to $$X(r_1)$$
• Stop after a few passes through all the sites
• Optional variants: randomize the order in which you visit sites, etc.
• We’ve made a Markov chain, and this converges when the Markov chain has a unique, attracting invariant distribution
• That jargon will make more sense when we’ve done Markov chains in a few weeks
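The steps above can be sketched for the linear-Gaussian CAR on a nearest-neighbor ring (sizes and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Gibbs sampler for a nearest-neighbor Gaussian CAR on a ring:
# n sites, coefficient c per neighbor, conditional variance tau2
n, c, tau2, sweeps = 50, 0.2, 1.0, 200
x = np.zeros(n)  # arbitrary starting field

for _ in range(sweeps):       # one "sweep" = one pass through all the sites
    for i in range(n):
        # Conditional law at site i: X_i | neighbors ~ N(c*(left + right), tau2)
        mu = c * (x[(i - 1) % n] + x[(i + 1) % n])
        x[i] = mu + np.sqrt(tau2) * rng.standard_normal()

# After many sweeps, x is (approximately) a draw from the joint CAR distribution
print(x[:5])
```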

# Simultaneous autoregressions (SAR)

• A non-conditional specification: $X(r_i) = \sum_{j=1}^{n}{b_{ij} X(r_j)} + \epsilon(r)$
• If $$X(r)$$ and $$X(q)$$ are neighbors, the same values have to be used in both equations
• If $$\epsilon$$ is IID Gaussian with variance $$\tau^2$$, $X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1})$
• This is not the same as the CAR model with the same $$\mathbf{b}$$
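The difference from the CAR is easy to see numerically: solve the simultaneous equations for one draw, and compare the implied covariances. A sketch on an illustrative nearest-neighbor ring (values made up):

```python
import numpy as np

rng = np.random.default_rng(4)

# Generate one draw from a Gaussian SAR by solving the simultaneous equations:
# X = b X + eps  =>  X = (I - b)^{-1} eps
n, c, tau = 6, 0.2, 1.0
b = np.zeros((n, n))
for i in range(n):
    b[i, (i - 1) % n] = c
    b[i, (i + 1) % n] = c

eps = tau * rng.standard_normal(n)
x = np.linalg.solve(np.eye(n) - b, eps)

# The implied covariance differs from the CAR with the same b:
A = np.linalg.inv(np.eye(n) - b)
Sigma_sar = tau**2 * A @ A   # (I-b)^{-1}(I-b)^{-1}, since b is symmetric here
Sigma_car = tau**2 * A       # (I-b)^{-1}
print(np.allclose(Sigma_sar, Sigma_car))  # False
```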

# Simultaneous autoregressions (SAR)

• SAR models are self-referential and weird
• e.g., in one dimension with nearest neighbors, we have the simultaneous equations $\begin{eqnarray} X(r-1) & = & b_1 X(r-2) + b_2 X(r) + \epsilon(r-1)\\ X(r) & = & b_1 X(r-1) + b_2 X(r+1) + \epsilon(r)\\ X(r+1) & = & b_1 X(r) + b_2 X(r+2) + \epsilon(r+1)\\ \end{eqnarray}$
• Economists are bizarrely fond of SARs
• Supposed to model equilibrium, where you take account of what your neighbors do, who are taking account of what you do, etc.
• Generally, there is no joint distribution of the $$X(r_i)$$ which satisfies the simultaneous equations
• Unless everything is linear and Gaussian
• Every linear-Gaussian SAR is equivalent to some CAR (see backup)

# Spatio-temporal

• If you just want to predict $$X(r,t)$$ from $$X(\Neighbors(r), t-1)$$, you have a bunch of VARs
• If you must include dependence of $$X(r,t)$$ on $$X(q,t)$$, then try to use a conditional model, if you can

# Summary

• Spatial autoregressions come in two forms:
• Conditional autoregressions are interpretable
• Simultaneous autoregressions are bizarre
• Spatio-temporal autoregressions come in two forms:
• Dependence on the past alone is just a VAR
• Dependence on past and present is a hybrid
• Still need to see how to really estimate everything
• When do sample averages converge on expectations?
• When will log-likelihoods tend to good limits?

# Backup: Deriving the global distribution of a linear-Gaussian CAR

(after Guttorp (1995, 206–7))

• Abbreviate $$X(r_i)$$ as $$X_i$$; “all other $$X$$’s” as $$X_{-i}$$; and “all other $$X$$’s, with 0s from position $$j$$ on” as $$X_{-i;j}$$
$\begin{eqnarray} \log{\frac{p(\mathbf{X}=\mathbf{x})}{p(\mathbf{X}=\mathbf{0})}} & = & \sum_{i=1}^{n}{\log{p(X_i=x_i|X_{-i}=x_{-i;i+1})} - \log{p(X_i=0|X_{-i}=x_{-i;i+1})}} ~ \text{(Brooks representation, see backup)}\\ & = & -\frac{1}{2\tau^2}\left(\sum_{i=1}^{n}{\left(x_i - \sum_{j=1}^{i-1}{b_{ij} x_j}\right)^2 - \left(0 - \sum_{j=1}^{i-1}{b_{ij} x_j}\right)^2}\right)\\ & = & -\frac{1}{2\tau^2}\left(\sum_{i=1}^{n}{x_i^2 - 2 x_i \sum_{j=1}^{i-1}{b_{ij} x_j}}\right)\\ & = & -\frac{1}{2\tau^2}\left(\sum_{i=1}^{n}{x_i^2 - x_i (\mathbf{b} \mathbf{x})_{i}}\right) ~ \text{(symmetry of}~\mathbf{b})\\ & = & -\frac{1}{2\tau^2}\mathbf{x}^T(\mathbf{I}-\mathbf{b})\mathbf{x}\\ \end{eqnarray}$
• This is the form of a multivariate Gaussian, with variance-covariance matrix $$\tau^2(\mathbf{I}-\mathbf{b})^{-1}$$

# Backup: Deriving the global distribution of a linear-Gaussian SAR

(after Guttorp (1995, 219–21, Exercise 4.1))

• Again, abbreviate $$X(r_i)$$ as $$X_i$$, etc.
• For each $$i$$, $X_i = \sum_{j=1}^{n}{b_{ij} X_j} + \epsilon_i$
• System of simultaneous equations $$\Leftrightarrow$$ one matrix equation: $\mathbf{X} = \mathbf{b}\mathbf{X} + \mathbf{\epsilon}$
• Solve: $\mathbf{X} = (\mathbf{I}-\mathbf{b})^{-1} \mathbf{\epsilon}$
• By assumption, $$\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$$
• So by general properties of multivariate Gaussians, $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \tau^2 (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1})$, using the symmetry of $$\mathbf{b}$$

# Backup: Converting an SAR to a CAR

• Assume a linear, Gaussian SAR, with matrix $$\mathbf{b}$$
• The global distribution is $$X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1})$$
• The global distribution of a CAR with matrix $$\mathbf{\beta}$$ is $$\mathcal{N}(0, \tau^2(\mathbf{I} - \mathbf{\beta})^{-1})$$, so match the two: $\begin{eqnarray} (\mathbf{I}-\mathbf{\beta})^{-1} & = & (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1}\\ \mathbf{I}-\mathbf{\beta} & = & (\mathbf{I}-\mathbf{b})(\mathbf{I}-\mathbf{b})\\ \mathbf{I}-\mathbf{\beta} & = & \mathbf{I} - 2\mathbf{b} + \mathbf{b}^2\\ \mathbf{\beta} & = & 2\mathbf{b} - \mathbf{b}^2 \end{eqnarray}$
• So a nearest-neighbors SAR corresponds to a CAR with next-nearest-neighbors, etc.
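The conversion can be verified numerically: since $\mathbf{I}-\mathbf{\beta} = (\mathbf{I}-\mathbf{b})^2 = \mathbf{I} - 2\mathbf{b} + \mathbf{b}^2$, the matching CAR matrix is $\mathbf{\beta} = 2\mathbf{b} - \mathbf{b}^2$. A sketch on an illustrative nearest-neighbor ring:

```python
import numpy as np

# Check the SAR-to-CAR conversion beta = 2b - b^2 on a small ring
# (n and c are illustrative; c small enough that I - b is invertible)
n, c = 6, 0.2
b = np.zeros((n, n))
for i in range(n):
    b[i, (i - 1) % n] = c
    b[i, (i + 1) % n] = c

beta = 2 * b - b @ b
lhs = np.linalg.inv(np.eye(n) - beta)                           # CAR covariance shape
rhs = np.linalg.inv(np.eye(n) - b) @ np.linalg.inv(np.eye(n) - b)  # SAR covariance shape
print(np.allclose(lhs, rhs))  # True: same global Gaussian distribution

# b @ b has nonzero entries two steps away, which is the
# "nearest-neighbor SAR = next-nearest-neighbor CAR" point
```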

• Query: Can you always find an SAR matrix to match a CAR?

# Backup: The “Brooks representation” result

After Guttorp (1995, 7, Proposition 1.1)

$$\mathbf{X}$$ is an $$n$$-dimensional random vector. Assume $$p(\mathbf{x}) > 0$$ for all $$\mathbf{x}$$. Then $\frac{p(\mathbf{x})}{p(\mathbf{y})} = \prod_{i=1}^{n}{\frac{p(x_i|x_{1:i-1}, y_{i+1:n})}{p(y_i|x_{1:i-1}, y_{i+1:n})}}$ Since $$\sum_{\mathbf{x}}{p(\mathbf{x})}=1$$, this implies that the conditional distributions uniquely fix the whole distribution

# Backup: Proof of the Brooks representation result

Again, after Guttorp (1995, 7)

$\begin{eqnarray} p(\mathbf{x}) & = & p(x_n|x_{1:n-1}) p(x_{1:n-1})\\ & = & p(x_n|x_{1:n-1})p(x_{1:n-1})\frac{p(y_n|x_{1:n-1})}{p(y_n|x_{1:n-1})}\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})}p(x_{1:n-1}, y_n)\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} p(x_{n-1}| x_{1:n-2}, y_n) p(x_{1:n-2}, y_n)\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} \frac{p(x_{n-1}| x_{1:n-2}, y_n)}{p(y_{n-1}| x_{1:n-2}, y_n)}p(x_{1:n-2}, y_{n-1:n})\\ & \vdots &\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} \frac{p(x_{n-1}| x_{1:n-2}, y_n)}{p(y_{n-1}| x_{1:n-2}, y_n)} \ldots \frac{p(x_1|y_{2:n})}{p(y_1|y_{2:n})} \end{eqnarray}$

# References

Guttorp, Peter. 1995. Stochastic Modeling of Scientific Data. London: Chapman & Hall.