36-467/36-667

18 October 2018

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Neighbors}{N} \]

- Autoregressive models of order \(p\) (AR\((p)\)) for univariate time series:
- \(X(t) = a+\sum_{j=1}^{p}{b_j X(t-j)} + \epsilon(t)\)
- Innovations \(\epsilon\) uncorrelated with each other and with past of \(X\)
- Generating a time series = step through that rule over and over
- Predictions = plug in to the deterministic part
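Stepping through the rule to generate a series, and plugging in to the deterministic part to predict, can be sketched in a few lines. This is a minimal illustration, not part of the model's definition: the coefficient values and the Gaussian innovations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(467)

def simulate_ar(a, b, n, sd=1.0):
    """Simulate an AR(p): X(t) = a + sum_j b_j X(t-j) + eps(t).

    b = (b_1, ..., b_p); eps is IID Gaussian, one convenient choice
    of innovation distribution."""
    b = np.asarray(b, dtype=float)
    p = len(b)
    x = np.zeros(n + p)               # start the recursion from zeros
    for t in range(p, n + p):
        # deterministic part plus a fresh innovation, over and over
        x[t] = a + b @ x[t - p:t][::-1] + rng.normal(scale=sd)
    return x[p:]

x = simulate_ar(a=0.0, b=[0.9], n=1000)
# one-step-ahead predictions: plug in to the deterministic part
pred = 0.0 + 0.9 * x[:-1]
```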

- Vector autoregressions of order \(p\) (VAR\((p)\)) for multivariate time series:
- \(\vec{X}(t) = \vec{a} + \sum_{j=1}^{p}{\mathbf{b}_j \vec{X}(t-j)} + \vec{\epsilon}(t)\)
- All vectors have dimension \(k\), matrices dimension \(k\times k\)
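The same recursion works for a VAR(1), with vectors in place of scalars. A sketch with an invented \(2\times 2\) coefficient matrix, chosen only so a stationary solution exists:

```python
import numpy as np

rng = np.random.default_rng(36)

# VAR(1) with k = 2; the coefficient matrix is made up for the demo
k, n = 2, 500
a = np.zeros(k)
b = np.array([[0.5, 0.2],
              [0.1, 0.4]])
# all eigenvalues of b inside the unit circle => a stationary solution exists
assert np.max(np.abs(np.linalg.eigvals(b))) < 1

X = np.zeros((n, k))
for t in range(1, n):
    # same rule as the scalar AR, with a vector a and a matrix b
    X[t] = a + b @ X[t - 1] + rng.normal(size=k)
```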

- In general:
- Can eliminate higher-order dependence in favor of first-order dependence with more variables (to keep track of memory)
- In a first-order linear system, the eigenvalues and eigenvectors tell us everything
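One way to see the first bullet: an AR\((p)\) becomes a VAR(1) by stacking the last \(p\) values into one state vector, with the "companion" matrix carrying the memory; the eigenvalues of that matrix then decide stability. A sketch (the AR(2) coefficients are made up for illustration):

```python
import numpy as np

def companion(b):
    """Companion matrix: stack Z(t) = (X(t), ..., X(t-p+1)) to turn an
    AR(p) into a VAR(1), Z(t) = B Z(t-1) + (eps(t), 0, ..., 0)."""
    b = np.asarray(b, dtype=float)
    p = len(b)
    B = np.zeros((p, p))
    B[0, :] = b                      # first row: the AR coefficients
    if p > 1:
        B[1:, :-1] = np.eye(p - 1)   # sub-diagonal: shift memory down a slot
    return B

# AR(2) with made-up coefficients: X(t) = 0.5 X(t-1) + 0.3 X(t-2) + eps(t)
B = companion([0.5, 0.3])
# all eigenvalues inside the unit circle <=> the first-order system is stable
eigvals = np.linalg.eigvals(B)
```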

- A little about estimating AR(1)
- What to do with purely spatial data?
- What to do with spatio-temporal data?

- If we can estimate an AR(1), we can figure out how to estimate a VAR(1)
- If we can estimate a VAR(1), we should be able to estimate an AR(p)
- So focus on estimating an AR(1)

\[ X(t) = a+b X(t-1) + \epsilon(t) \]

- Given data \(X(0), X(1), X(2), \ldots X(t), \ldots X(n)\), how do we fit this?
- If we can find \(a, b\) we can get distribution of \(\epsilon\) from residuals
- If we can find \(b\) we can find \(a\) in lots of ways

- Assume stationarity (so \(|b| < 1\))
- Work out covariance function \(\gamma(h; b) = b^h \gamma(0)\)
- More work if doing this for an AR\((p)\)

- Pick the \(b\) that best fits observed \(\hat{\gamma}(h)\)
- Seems reasonable *if* stationary **and** if \(\hat{\gamma}(h) \rightarrow \gamma_{\mathrm{true}}(h)\) as \(n\rightarrow \infty\) — why do we believe that?
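For an AR(1), \(\gamma(1)/\gamma(0) = b\), so matching sample covariances to the model gives a simple moment estimator. A sketch on simulated data (the true \(b\) is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(18)
b_true, n = 0.7, 5000

# simulate a stationary AR(1) with a = 0
x = np.zeros(n)
for t in range(1, n):
    x[t] = b_true * x[t - 1] + rng.normal()

# gamma(h; b) = b^h gamma(0) implies b = gamma(1)/gamma(0),
# so pick the b that matches the sample covariances
xc = x - x.mean()
gamma0 = np.mean(xc * xc)             # sample gamma-hat(0)
gamma1 = np.mean(xc[:-1] * xc[1:])    # sample gamma-hat(1)
b_mom = gamma1 / gamma0
```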

- Assume \(a=0\) to simplify book-keeping
- Regress \(X(1), X(2), \ldots X(n)\) on \(X(0), X(1), \ldots X(n-1)\)
- i.e., “shut up and calculate”

- The usual math leads to \[\begin{eqnarray} \hat{b} & = & \frac{\sum_{t=0}^{n-1}{X(t)X(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & \frac{\sum_{t=0}^{n-1}{X(t)(b X(t) + \epsilon(t+1))}}{\sum_{t=0}^{n-1}{X^2(t)}}\\ & = & b + \frac{\sum_{t=0}^{n-1}{X(t)\epsilon(t+1)}}{\sum_{t=0}^{n-1}{X^2(t)}} \end{eqnarray}\]
- \(X(t)\) and \(\epsilon(t+1)\) are uncorrelated, so *maybe* the extra bit goes to zero?
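The ratio estimator above is easy to check on simulated data; a sketch, assuming \(a = 0\) and Gaussian innovations:

```python
import numpy as np

rng = np.random.default_rng(42)
b_true, n = 0.6, 5000

x = np.zeros(n + 1)
for t in range(1, n + 1):
    x[t] = b_true * x[t - 1] + rng.normal()

# regress X(1), ..., X(n) on X(0), ..., X(n-1); with a = 0 this is
# exactly the ratio from the derivation above
b_hat = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)
```

With the simulation run long enough, `b_hat` lands close to `b_true`, consistent with the extra bit going to zero.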

- Assume a pdf \(f\) for \(\epsilon(t)\) with parameter \(\theta\)
- Assume \(\epsilon(t)\) are IID, not just uncorrelated
- Conditional likelihood, given \(X(0)\), is \[ \prod_{t=1}^{n}{f(X(t) - a - bX(t-1); \theta)} \]
- Normalized conditional log-likelihood, given \(X(0)\), is \[ L_n(a,b,\theta) = \frac{1}{n}\sum_{t=1}^{n}{\log{f(X(t) - a - bX(t-1); \theta)}} \]
- Maximize this over \(a, b, \theta\)
- Can work if \(L_n(a,b,\theta)\) converges to a well-behaved limit
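For Gaussian \(f\), maximizing \(L_n\) over \(a, b\) reduces to least squares, so the conditional MLE has a closed form; a sketch on simulated data (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
a_true, b_true, sd_true, n = 1.0, 0.5, 2.0, 5000

x = np.zeros(n + 1)
x[0] = a_true / (1 - b_true)   # start near the stationary mean
for t in range(1, n + 1):
    x[t] = a_true + b_true * x[t - 1] + rng.normal(scale=sd_true)

# With Gaussian f, maximizing L_n(a, b, sigma) over (a, b) is exactly
# least squares of X(t) on X(t-1), conditioning on X(0)
A = np.column_stack([np.ones(n), x[:-1]])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, x[1:], rcond=None)
resid = x[1:] - a_hat - b_hat * x[:-1]
sd_hat = np.sqrt(np.mean(resid ** 2))   # MLE of the innovation sd

# the normalized conditional log-likelihood at the optimum:
L_n = np.mean(-0.5 * np.log(2 * np.pi * sd_hat**2)
              - 0.5 * (resid / sd_hat)**2)
```

For non-Gaussian \(f\) there is generally no closed form, and \(L_n\) has to be maximized numerically.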

- We will see later what we need to assume to get either approach to work
- Stationarity will be sufficient, but OLS and MLE sometimes work for non-stationary processes

- For time, it makes sense to generate the future from the past
- Space isn’t *directed* the same way time is
- This makes purely spatial generative models much trickier, and much harder to interpret

- Data \(X(r_1), X(r_2), \ldots X(r_n)\)
- Assume \(r_i\) are on a regular grid
- Write \(\Neighbors(r)\) for the locations next to \(r\)
- CAR(1): \[
X(r) | \{X(q): q \in \Neighbors(r)\} = \sum_{j\in\Neighbors}{b_j X(r+j)} + \epsilon(r)
\]
- All neighborhoods have the same shape
- Each neighbor gets to have its own coefficient
- CAR\((p)\) is a sum of neighbors, plus neighbors’ neighbors, etc.
- Typically assume \(\epsilon\) is IID
- Easy to include a global mean

- Assume \(\epsilon(r)\) is IID Gaussian, variance \(\tau^2\)
- Assemble a matrix \(\mathbf{b}\) where \(b_{ij} =\) coefficient of \(X(r_j)\) in the model for \(X(r_i)\)
- Then \(X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1})\)
**In-class exercise**: \[ X(r_i)|\text{all other}~ X(r_k)\sim \mathcal{N}(\mu(r_i), \tau^2) \] What is \(\mu(r_i)\), in terms of \(\mathbf{b}\) and the \(X\)’s?

- Assume \(\epsilon(r)\) is IID Gaussian, variance \(\tau^2\)
- Assemble a matrix \(\mathbf{b}\) where \(b_{ij} =\) coefficient of \(X(r_j)\) in the model for \(X(r_i)\)
- Then \(X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1})\)
- Conditional law: \[
X(r_i)|\text{all other} ~ X(r_k) \sim \mathcal{N}(\mu(r_i), \tau^2)
\] with \[
\mu(r_i) = \sum_{k=1}^{n}{b_{ik} X(r_k)}
\]
- Intuition: we’ve built in conditional independence
- Rigor: general properties of Gaussians; see back-up
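A quick numerical illustration of the joint law, using a nearest-neighbor CAR on a one-dimensional chain; the coefficient value is an arbitrary choice, small enough that \(\mathbf{I}-\mathbf{b}\) stays positive definite:

```python
import numpy as np

rng = np.random.default_rng(206)

# nearest-neighbor CAR on a 1-D chain
n, c, tau = 50, 0.4, 1.0
b = np.zeros((n, n))
for i in range(n):
    if i > 0:
        b[i, i - 1] = c   # coefficient on the left neighbor
    if i < n - 1:
        b[i, i + 1] = c   # coefficient on the right neighbor

# joint law: X ~ N(0, tau^2 (I - b)^{-1})
Sigma = tau**2 * np.linalg.inv(np.eye(n) - b)
L = np.linalg.cholesky(Sigma)    # fails unless Sigma is positive definite
x = L @ rng.normal(size=n)       # one draw from the joint distribution

# conditional mean at site i, as in the exercise: sum_k b_ik X(r_k)
mu = b @ x
```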

- Generally, you can write down any *conditional* regression model you like, and this is internally consistent
- e.g., non-linear but with Gaussian noise
- e.g., logistic regression on neighbors for binary-valued \(X\)
- e.g., non-linear and non-Gaussian, etc., etc.

- *Conditional* prediction is straightforward from the specification
- How do we generate a new field?
- There’s a global solution for the linear-Gaussian case…
- … but finding a global solution otherwise is **really hard**
- The model only specifies *conditional* distributions

- Start with *some* value for \(X(r_1), X(r_2), \ldots X(r_n)\)
- Calculate the distribution of \(X(r_1) | X(\Neighbors(r_1))\)
- Draw a random value \(X^*\) from that distribution
- Update: \(X(r_1) \leftarrow X^*\)
- Now go on to \(X(r_2)\) and condition on *its* neighbors:
- Calculate the distribution of \(X(r_2) | X(\Neighbors(r_2))\)
- Draw a random value \(X^*\) from that distribution
- Update: \(X(r_2) \leftarrow X^*\)

- Iterate through to \(X(r_n)\), then go back to \(X(r_1)\)
- Stop after a few passes through all the sites
- Optional variants: randomize the order in which you visit sites, etc.
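For the linear-Gaussian nearest-neighbor CAR on a chain, the sweep above might look like this (the coefficient, noise level, and number of passes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(219)

# Gibbs-style sweeps for the nearest-neighbor Gaussian CAR on a chain:
# X(r) | neighbors ~ N(c * (left + right), tau^2)
n, c, tau, passes = 50, 0.4, 1.0, 200
x = np.zeros(n)                  # start with *some* value

for sweep in range(passes):
    for i in range(n):           # iterate through the sites in order
        left = x[i - 1] if i > 0 else 0.0    # boundary sites have one neighbor
        right = x[i + 1] if i < n - 1 else 0.0
        mu = c * (left + right)              # conditional mean given neighbors
        x[i] = rng.normal(mu, tau)           # draw X* and update in place
```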

- We’ve made a Markov chain, and this converges when the Markov chain has a unique, attracting invariant distribution
- That jargon will make more sense when we’ve done Markov chains in a few weeks

- A *non*-conditional specification: \[ X(r_i) = \sum_{j=1}^{n}{b_{ij} X(r_j)} + \epsilon(r_i) \]
- If \(X(r)\) and \(X(q)\) are neighbors, the *same* values have to be used in both equations
- If \(\epsilon\) is IID Gaussian with variance \(\tau^2\), \[ X \sim \mathcal{N}(0, \tau^2(\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1}) \]
- This is *not the same* as the CAR model with the same \(\mathbf{b}\)
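A quick check that the two joint laws really differ for the same \(\mathbf{b}\), on a nearest-neighbor chain with an arbitrary coefficient:

```python
import numpy as np

# same nearest-neighbor b matrix on a short chain
n, c, tau = 10, 0.3, 1.0
b = np.zeros((n, n))
for i in range(n):
    if i > 0:
        b[i, i - 1] = c
    if i < n - 1:
        b[i, i + 1] = c

I = np.eye(n)
Sigma_car = tau**2 * np.linalg.inv(I - b)                         # CAR covariance
Sigma_sar = tau**2 * np.linalg.inv(I - b) @ np.linalg.inv(I - b)  # SAR covariance

# same b, different joint distributions
differ = not np.allclose(Sigma_car, Sigma_sar)
```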

- SAR models are self-referential and weird
- e.g., in one dimension with nearest neighbors, we have the *simultaneous* equations \[\begin{eqnarray} X(r-1) & = & b_1 X(r-2) + b_2 X(r) + \epsilon(r-1)\\ X(r) & = & b_1 X(r-1) + b_2 X(r+1) + \epsilon(r)\\ X(r+1) & = & b_1 X(r) + b_2 X(r+2) + \epsilon(r+1)\\ \end{eqnarray}\]
- Economists are bizarrely fond of SARs
- Supposed to model equilibrium, where you take account of what your neighbors do, who are taking account of what you do, etc.

- Generally, there is **no** joint distribution of the \(X(r_i)\) which satisfies the simultaneous equations
- *Unless* everything is linear and Gaussian
- Every SAR is equivalent to some CAR

- If you just want to predict \(X(r,t)\) from \(X(\Neighbors(r), t-1)\), you have a bunch of VARs
- If you must include dependence of \(X(r,t)\) on \(X(q,t)\), then try to use a conditional model, if you can

- Spatial autoregressions come in two forms:
- Conditional autoregressions are interpretable
- Simultaneous autoregressions are bizarre

- Spatio-temporal autoregressions come in two forms:
- Dependence on the past *alone* is just a VAR
- Dependence on past *and present* is a hybrid

- Still need to see how to really estimate everything
- When do sample averages converge on expectations?
- When will log-likelihoods tend to good limits?

(after Guttorp (1995, 206–7))

- Abbreviate \(X(r_i)\) as \(X_i\); “all other \(X\)’s” as \(X_{-i}\); and “all other \(X\)’s, with 0s from position \(j\) on” as \(X_{-i;j}\)

- This is the form of a multivariate Gaussian, with variance-covariance matrix \(\tau^2(\mathbf{I}-\mathbf{b})^{-1}\)

(after Guttorp (1995, 219–21, Exercise 4.1))

- Again, abbreviate \(X(r_i)\) as \(X_i\), etc.
- For each \(i\), \[ X_i = \sum_{j=1}^{n}{b_{ij} X_j} + \epsilon_i \]
- System of simultaneous equations \(\Leftrightarrow\) one matrix equation: \[ \mathbf{X} = \mathbf{b}\mathbf{X} + \mathbf{\epsilon} \]
- Solve: \[ \mathbf{X} = (\mathbf{I}-\mathbf{b})^{-1} \mathbf{\epsilon} \]
- By assumption, \(\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\)
- So by general properties of multivariate Gaussians, \[ \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \tau^2 (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1}) \] using the symmetry of \(\mathbf{b}\)

- Assume a linear, Gaussian SAR, with matrix \(\mathbf{b}\)
- The global distribution is \(X \sim \mathcal{N}(0, (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1})\)
- The global distribution of a CAR with matrix \(\mathbf{\beta}\) is \(\mathcal{N}(0, (\mathbf{I} - \mathbf{\beta})^{-1})\) \[\begin{eqnarray} (\mathbf{I}-\mathbf{\beta})^{-1} & = & (\mathbf{I}-\mathbf{b})^{-1}(\mathbf{I}-\mathbf{b})^{-1}\\ \mathbf{I}-\mathbf{\beta} & = & (\mathbf{I}-\mathbf{b})(\mathbf{I}-\mathbf{b})\\ \mathbf{I}-\mathbf{\beta} & = & \mathbf{I} - 2\mathbf{b} + \mathbf{b}^2\\ \mathbf{\beta} & = & 2\mathbf{b} - \mathbf{b}^2 \end{eqnarray}\]
So a nearest-neighbors SAR corresponds to a CAR with next-nearest-neighbors, etc.
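The algebra gives \(\mathbf{\beta} = 2\mathbf{b} - \mathbf{b}^2\), which we can verify numerically on a nearest-neighbor chain with an arbitrary coefficient (note that \(\mathbf{b}^2\) has next-nearest-neighbor entries):

```python
import numpy as np

# nearest-neighbor SAR matrix b on a chain
n, c = 10, 0.3
b = np.zeros((n, n))
for i in range(n):
    if i > 0:
        b[i, i - 1] = c
    if i < n - 1:
        b[i, i + 1] = c

I = np.eye(n)
beta = 2 * b - b @ b   # matching CAR matrix; b @ b reaches next-nearest neighbors

Sigma_sar = np.linalg.inv(I - b) @ np.linalg.inv(I - b)
Sigma_car = np.linalg.inv(I - beta)
match = np.allclose(Sigma_sar, Sigma_car)   # identical joint laws
```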

**Query**: Can you always find an SAR matrix to match a CAR?

After Guttorp (1995, 7, Proposition 1.1)

\(\mathbf{X}\) is an \(n\)-dimensional random vector. Assume \(p(\mathbf{x}) > 0\) for all \(\mathbf{x}\). Then \[
\frac{p(\mathbf{x})}{p(\mathbf{y})} = \prod_{i=1}^{n}{\frac{p(x_i|x_{1:i-1}, y_{i+1:n})}{p(y_i|x_{1:i-1}, y_{i+1:n})}}
\] Since \(\sum_{\mathbf{x}}{p(\mathbf{x})}=1\), this implies that the conditional distributions *uniquely* fix the whole distribution

Again, after Guttorp (1995, 7)

\[\begin{eqnarray} p(\mathbf{x}) & = & p(x_n|x_{1:n-1}) p(x_{1:n-1})\\ & = & p(x_n|x_{1:n-1})p(x_{1:n-1})\frac{p(y_n|x_{1:n-1})}{p(y_n|x_{1:n-1})}\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})}p(x_{1:n-1}, y_n)\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} p(x_{n-1}| x_{1:n-2}, y_n) p(x_{1:n-2}, y_n)\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} \frac{p(x_{n-1}| x_{1:n-2}, y_n)}{p(y_{n-1}| x_{1:n-2}, y_n)}p(x_{1:n-2}, y_{n-1:n})\\ & \vdots &\\ & = & \frac{p(x_n|x_{1:n-1})}{p(y_n|x_{1:n-1})} \frac{p(x_{n-1}| x_{1:n-2}, y_n)}{p(y_{n-1}| x_{1:n-2}, y_n)} \ldots \frac{p(x_1|y_{2:n})}{p(y_1|y_{2:n})} \end{eqnarray}\]

Guttorp, Peter. 1995. *Stochastic Modeling of Scientific Data*. London: Chapman & Hall.