36-467/36-667

6 November 2018

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]

- General strategy for simulating
- The Monte Carlo method: replace complicated probability calculations with repeated simulation
- To find \(\Expect{f(Y)}\), do \(m\) simulations \(Y_1, Y_2, \ldots, Y_m\)
- Then \(m^{-1}\sum_{1}^{m}{f(Y_i)} \approx \Expect{f(Y)}\)
- Also works for variances, probabilities, etc.
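As a minimal sketch in R (the Gaussian \(Y\) and the function \(f(y) = y^2\) are stand-ins for whatever distribution and function are actually of interest):

```r
# Monte Carlo estimate of E[f(Y)]: average f over m simulated draws of Y
# (rnorm is a stand-in for whatever distribution Y actually follows)
set.seed(36467)
m <- 1e5
Y <- rnorm(m, mean=1, sd=2)  # m independent simulations of Y
f <- function(y) { y^2 }     # the function whose expectation we want
mean(f(Y))                   # approximates E[Y^2] = Var(Y) + (E[Y])^2 = 5
```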

- The sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The **bootstrap principle**: *approximate* the sampling distribution by *simulating* from a good model, and treating the simulation run just like the data
- Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
- Stronger assumptions \(\Rightarrow\) less uncertainty *if we’re right*

Re-run the experiment (survey, census, …) and get different data

\(\therefore\) everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well

This variability is (the source of) **statistical uncertainty**

Quantifying this is a way to be honest about what we actually know

- Standard error = standard deviation of an estimator
- (could equally well use median absolute deviation, etc.)

- \(p\)-value = Probability we’d see a signal this big if there were just noise
- Confidence region = All the parameter values we can’t reject at low error rates:

*Either* the true parameter is in the confidence region, *or* we are very unlucky, *or* our model is wrong

- etc., etc.

Data \(X \sim P_{X,\theta_0}\), for some true \(\theta_0\)

We calculate a statistic \(T = \tau(X)\) so it has distribution \(P_{T,\theta_0}\)

- If we knew \(P_{T,\theta_0}\), we could calculate
- \(\Var{T}\) (and so standard error)
- \(\Expect{T}\) (and so bias)
- quantiles (and so confidence intervals or \(p\)-values), etc.

- Most of the time, \(P_{X,\theta_0}\) is very complicated
- Most of the time, \(\tau\) is a very complicated function
- We certainly don’t know \(\theta_0\)

Upshot: We don’t know \(P_{T,\theta_0}\) and can’t use it to calculate anything

- Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)
- This is the asymptotic theory we did in Lectures 14 and 15

- Modern (\(\approx 1975\)–): Use complex models and statistics, but *simulate* calculating the statistic on the model

- Find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
- Generate a simulation \(\tilde{X}\) from \(\hat{P}\), set \(\tilde{T} = \tau(\tilde{X})\)
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T,\theta_0}\)

Refinements:

- improving the initial estimate \(\hat{P}\)
- reducing the number of simulations or speeding them up
- transforming \(\tau\) so the final approximation is more stable

First step: find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)

If we are using a model, our best guess at \(P_{X,\theta_0}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters

- Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
- Repeat \(m\) times:
- Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/“shape” as real data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)

- Use empirical distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)

- Annual time series of number of lynxes trapped for the fur company in the Mackenzie River area of Canada:

- This oscillates, so an AR(1) isn’t appropriate — maybe an AR(2)

```
## , , 1
##
## [,1]
## [1,] 1.152423
## [2,] -0.606229
```

`## [1] 710.1056`

so the estimated model is \[ X(t) = 710.106 + 1.152 X(t-1) - 0.606 X(t-2) + \epsilon(t) \]

- What do some time series from this look like?
- (We wrote `ar2.sim` last time)

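`ar2.sim` itself isn’t reproduced on these slides; a minimal version consistent with how it’s called below (same argument names; the handling of starting values is an assumption) might be:

```r
# Simulate an AR(2) process:
#   X(t) = a + b[1] X(t-1) + b[2] X(t-2) + epsilon(t)
# Inputs: series length (n), intercept (a), vector of 2 AR slopes (b),
#   innovation standard deviation, vector of 2 starting values
# Output: a vector of length n
ar2.sim <- function(n, a, b, sd.innovation, x.start) {
  x <- numeric(n)
  x[1:2] <- x.start
  for (t in 3:n) {
    x[t] <- a + b[1]*x[t-1] + b[2]*x[t-2] + rnorm(1, sd=sd.innovation)
  }
  return(x)
}
```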
- Write a function which estimates the coefficients

```
# Estimate the coefficients of an AR(2) model
# Inputs: a time series (x)
# Outputs: vector containing intercept and 2 AR slopes
ar2.estimator <- function(x) {
  # Treat the input exactly the same way we treated the real data
  fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
  return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}
```

Check that this gives the right answer on the data, and a *different* answer on a simulation:

```
## a b1 b2
## 710.105589 1.152423 -0.606229
```

```
## a b1 b2
## 762.6063220 1.1102328 -0.5743644
```

Now do *lots* of simulations, and apply this function to each simulation:

```
lynx.coef.sims <- replicate(1000,
    ar2.estimator(ar2.sim(n=length(lynx),
                          a=lynx.ar2$x.intercept,
                          b=lynx.ar2$ar,
                          sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                          x.start=lynx[1:2])))
dim(lynx.coef.sims)
```

`## [1] 3 1000`

```
## [,1] [,2] [,3] [,4] [,5]
## a 830.8435812 724.7291944 598.4342755 761.794216 885.9718640
## b1 1.1771914 1.1892204 1.2535834 1.109654 0.9910510
## b2 -0.7153194 -0.6591029 -0.6459415 -0.578512 -0.5180509
```

Standard errors \(\approx\) standard deviation across simulations:

```
## a b1 b2
## 118.50209000 0.07622429 0.07498753
```

95% confidence intervals \(\approx\) quantiles across simulations:

```
## a b1 b2
## 2.5% 508.0321 0.9839003 -0.7473846
## 97.5% 963.5467 1.2820643 -0.4422757
```
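The standard errors and intervals above were presumably computed by applying `sd` and `quantile` across the rows of `lynx.coef.sims`; the pattern, shown here on random stand-in data, is:

```r
# Standard errors and crude CIs from a (coefficients x simulations) matrix,
# one row per coefficient (random stand-in data, just to show the pattern)
set.seed(36467)
coef.sims <- matrix(rnorm(3*1000), nrow=3,
                    dimnames=list(c("a","b1","b2"), NULL))
apply(coef.sims, 1, sd)                               # standard errors
apply(coef.sims, 1, quantile, probs=c(0.025, 0.975))  # 95% confidence limits
```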

- … the bootstrapping isn’t needed, because there’s an asymptotic theory which tells us about standard errors and confidence intervals
- Like what you did for the AR(1) in homework 8, only with even more book-keeping

- But we can use this approach for *any* generative time-series model
- E.g., make \(X(t)\) have a Poisson distribution rather than a Gaussian
- All we have to do is swap out the simulation

- You can apply these ideas when the “statistic” is \(X(t)\) for a fixed \(t\)
- Or when it’s \(X(t) | X(t-1) = x\) for a fixed \(x\)
- Since the model tells us how to calculate that!

- Expected values across simulations = best guess for the point prediction
- Variances and intervals incorporate modeling uncertainty

- Problem: Suppose we don’t have a trustworthy model
- Resource: We do have data, which tells us a lot about the distribution
- Solution: **Resampling**, treating the sample like a whole population

- Get data \(X\), containing \(n\) samples, find \(T=\tau(X)\)
- Fix a **block length** \(k\) and divide \(X\) into (overlapping) blocks of length \(k\)
  - So first block is \((X(1), \ldots X(k))\), second block is \((X(2), \ldots X(k+1))\), etc.
- Repeat \(m\) times:
  - Generate \(\tilde{X}\) by drawing \(n/k\) blocks from \(X\) *with replacement* (resample the data)
  - Concatenate the blocks to make \(\tilde{X}\)
  - Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use empirical distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)

```
# Convert a time series into a data frame of lagged values
# Input: a time series, maximum lag to use, whether older values go on the right
# or the left
# Output: a data frame with (order+1) columns, named lag0, lag1, ... , and
# length(ts)-order rows
design.matrix.from.ts <- function(ts, order, right.older=TRUE) {
  n <- length(ts)
  x <- ts[(order+1):n]
  for (lag in 1:order) {
    if (right.older) {
      x <- cbind(x, ts[(order+1-lag):(n-lag)])
    } else {
      x <- cbind(ts[(order+1-lag):(n-lag)], x)
    }
  }
  lag.names <- c("lag0", paste("lag", 1:order, sep=""))
  if (right.older) {
    colnames(x) <- lag.names
  } else {
    colnames(x) <- rev(lag.names)
  }
  return(as.data.frame(x))
}
}
head(lynx)
```

`## [1] 269 321 585 871 1475 2821`

```
## lag0 lag1 lag2
## 1 585 321 269
## 2 871 585 321
## 3 1475 871 585
## 4 2821 1475 871
## 5 3928 2821 1475
## 6 5943 3928 2821
```

```
# Simple block bootstrap
# Inputs: time series (ts), block length, length of output
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts, block.length, len.out=length(ts)) {
  # chop up ts into blocks
  the.blocks <- as.matrix(design.matrix.from.ts(ts, block.length-1,
                                                right.older=FALSE))
  # look carefully at design.matrix.from.ts to see why we need the -1
  # How many blocks is that?
  blocks.in.ts <- nrow(the.blocks)
  # Sanity-check
  stopifnot(blocks.in.ts == length(ts) - block.length + 1)
  # How many blocks will we need (round up)?
  blocks.needed <- ceiling(len.out/block.length)
  # Sample blocks with replacement
  picked.blocks <- sample(1:blocks.in.ts, size=blocks.needed, replace=TRUE)
  # put the blocks in the randomly-selected order
  x <- the.blocks[picked.blocks,]
  # convert from a matrix to a vector and return
  # need transpose because R goes from vectors to matrices and back column by
  # column, not row by row
  x.vec <- as.vector(t(x))
  # Silently discard unneeded extra observations at the end
  return(x.vec[1:len.out])
}
```

`## [1] 269 321 585 871 1475 2821`

`## [1] 1676 2251 1426 473 358 784`

- We can do everything the same as before, just changing how we simulate
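Concretely, the only change is which simulator feeds `ar2.estimator`. A self-contained sketch (with a condensed block resampler, equivalent in spirit to `rblockboot` above, and an illustrative block length of 8):

```r
# Resampling bootstrap for the lynx AR(2) coefficients: same recipe as the
# model-based version, but simulated series come from block-resampling the data
# (the condensed resampler and block.length=8 are illustrative choices)
block.resample <- function(ts, block.length, len.out=length(ts)) {
  n <- length(ts)
  starts <- sample(1:(n-block.length+1),          # random block start points
                   size=ceiling(len.out/block.length), replace=TRUE)
  x <- unlist(lapply(starts, function(s) ts[s:(s+block.length-1)]))
  return(x[1:len.out])                            # silently drop any excess
}
ar2.estimator <- function(x) {                    # as defined earlier
  fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
  return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}
set.seed(36467)
lynx.coef.boots <- replicate(1000,
  ar2.estimator(block.resample(lynx, block.length=8)))
apply(lynx.coef.boots, 1, sd)                              # standard errors
apply(lynx.coef.boots, 1, quantile, probs=c(0.025, 0.975)) # 95% CIs
```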

- Standard errors:

```
## a b1 b2
## 198.36971077 0.09812245 0.07406839
```

- 95% confidence intervals:

```
## a b1 b2
## 2.5% 598.7163 0.3864529 -0.35398540
## 97.5% 1365.3888 0.7632487 -0.06106865
```

(Anything funny here?)

- Preserves the distribution within blocks
- Independence across blocks
- So \(\tilde{X} \approx\) stationary, with the same short-range distribution as \(X\) but no longer-range dependence
- Should work well if \(X\) is stationary without long-range dependence
- Want to let \(k\) grow with \(n\), but still have a growing number of blocks, so \(n/k_n \rightarrow \infty\), or \(k=o(n)\)
- Some theory suggests \(k \propto n^{1/3}\)
- Politis and White (2004) provides an automatic procedure for picking \(k\)
- R implementation at <http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R>
- …which suggests a block-length of 3 for the lynxes
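For the lynx series (\(n = 114\)), the \(n^{1/3}\) rule of thumb gives a block length of the same order:

```r
n <- length(lynx)  # 114 annual observations
n^(1/3)            # about 4.85, so blocks of 4 or 5; Politis-White picks 3
```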

- Model-based: OK if the model is non-stationary
- Otherwise:
- Estimate the trend
- Bootstrap the deviations from the trend (by model or resampling)
- Add these back to the trend
- \(\tilde{X} =\) trend + bootstrapped deviations

- Model-based: simulate the model
- Resampling: use spatial blocks (e.g., rectangles)

- **Simulation**: using only \(m\) bootstrap replicates
  - Make this small by letting \(m\rightarrow\infty\)
  - Costs computing time
  - Diminishing returns: error is generally \(O(1/\sqrt{m})\)
- **Approximation**: using \(\hat{P}\) instead of \(P_{X,\theta_0}\)
  - Make this small by careful statistical modeling
- **Estimation**: only a finite number of samples
  - Make this small by being careful about what we simulate (example in backup slides)

Generally: for fixed \(n\), the resampling bootstrap shows more uncertainty than model-based bootstraps, but is less vulnerable to modeling mistakes (yet another bias-variance tradeoff)

- Bootstrapping is applying the Monte Carlo idea to the sampling distribution
- Find a good approximation to the data-generating distribution
- Simulate it many times
- Treat each simulation run just like the original data
- Distribution over simulations \(\approx\) sampling distribution

- Choice of how to simulate:
- Based on a model
- By resampling blocks

- Many, many refinements, but this is the core of it

- You think that \(X\) comes from some interesting class of processes \(\mathcal{H}_1\)
- Your skeptical friend thinks maybe it’s just some boring process in \(\mathcal{H}_0\)
- Fit a model to \(X\) using only the boring processes in \(\mathcal{H}_0\), say \(\hat{Q}\)
- Now simulate \(\hat{Q}\) to get many “surrogate data” sets \(\tilde{X}\)
- Look at the distribution of \(\tau(\tilde{X})\) over your simulations
- If \(\tau(X)\) is really unlikely under that simulation, it’s evidence against \(\mathcal{H}_0\)
- When would it be evidence *for* \(\mathcal{H}_1\)?
- The term “surrogate data” comes out of physics (Kantz and Schreiber 2004), where this recipe is used for testing the null hypothesis \(\mathcal{H}_0\) against the alternative \(\mathcal{H}_1\)
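A sketch of the recipe, with an AR(1) standing in for the boring class \(\mathcal{H}_0\) and the lag-2 autocorrelation as a hypothetical test statistic \(\tau\) (both choices are illustrative, not from the slides):

```r
# Surrogate-data test sketch: fit a "boring" AR(1) null to the data, simulate
# surrogates from it, and see where the data's statistic falls
# (AR(1) null and lag-2 autocorrelation statistic are illustrative choices)
set.seed(36467)
x <- as.numeric(lynx)
tau <- function(x) { acf(x, lag.max=2, plot=FALSE)$acf[3] }  # lag-2 autocorr.
fit0 <- ar.ols(x, order.max=1, aic=FALSE)  # Q-hat: the fitted boring model
surrogate <- function() {
  as.numeric(arima.sim(n=length(x), model=list(ar=fit0$ar[1]),
                       sd=sd(fit0$resid, na.rm=TRUE))) + mean(x)
}
tau.tilde <- replicate(1000, tau(surrogate()))  # distribution of tau under H0
# p-value: how often are surrogates at least as extreme as the data?
mean(abs(tau.tilde) >= abs(tau(x)))
```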

- Simulating a dynamical model needs an initial value \(X(0)\) (or \(X(1)\), or maybe \(X(1), X(2)\), etc.)
- What if this isn’t known precisely?
- Easy case: it follows a probability distribution
- Then draw \(X(0)\) from that distribution independently on each run
- Aggregate across runs as usual

- Less easy: \(X(0)\) within known limits but no distribution otherwise
- My advice: draw \(X(0)\) uniformly from inside the region
- Rationale 1: Maximum entropy distribution compatible with constraints
- Rationale 2: If the process is rapidly mixing, memory of the distribution of \(X(0)\) decays quickly anyway
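A toy version of the easy-plus-mixing situation (the AR(1) dynamics and the limits \([0,10]\) are purely illustrative):

```r
# Draw X(0) uniformly inside the known limits on each run, then aggregate
# across runs as usual (toy AR(1) dynamics and limits are illustrative)
set.seed(36467)
lo <- 0; hi <- 10
one.run <- function(steps=50) {
  x <- runif(1, min=lo, max=hi)        # uncertain initial condition
  for (t in 1:steps) { x <- 0.5*x + rnorm(1) }
  return(x)                            # state after `steps` iterations
}
final.states <- replicate(1000, one.run())
mean(final.states)  # rapid mixing: near the stationary mean of 0,
                    # memory of X(0)'s distribution has decayed
```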

Crude CIs use the distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)

But really we want the distribution of \(\hat{\theta}\) under \(\theta\)

Mathematical observation: Generally speaking, the distribution of \(\tilde{\theta} - \hat{\theta}\) is closer to that of \(\hat{\theta}-\theta_0\) than the distribution of \(\tilde{\theta}\) is to that of \(\hat{\theta}\)

(“centering” or “pivoting”)

Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the corresponding quantiles of \(\tilde{\theta}\); then

\[ \begin{eqnarray*} 1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\ & = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\ & \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\ & = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\ & = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0} \end{eqnarray*} \]

Basically: re-center the simulations around the empirical data
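In R, with `theta.hat` the estimate from the data and `theta.tilde` a vector of bootstrap re-estimates (stand-in names and numbers), the last line of the derivation becomes:

```r
# Pivoted ("basic") bootstrap CI: flip the bootstrap quantiles around
# 2*theta.hat (theta.hat and theta.tilde are stand-ins to show the formula)
set.seed(36467)
theta.hat <- 5
theta.tilde <- rnorm(1000, mean=5.2, sd=0.5)  # pretend bootstrap re-estimates
q <- quantile(theta.tilde, probs=c(0.025, 0.975))
pivot.ci <- c(lower = 2*theta.hat - unname(q[2]),
              upper = 2*theta.hat - unname(q[1]))
pivot.ci
```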

```
## a b1 b2
## 54.822397 1.541596 -1.151389
```

```
## a b1 b2
## 821.4949207 1.9183923 -0.8584727
```

- Bootstrap begins with Efron (1979) (drawing on an older idea from the 1940s called the “jackknife”)
- Block bootstrap for time series comes from Künsch (1989)
- Bühlmann (2002) is a good review on time series bootstraps
- Davison and Hinkley (1997) is the best over-all textbook on the bootstrap
- Lahiri (2003) is *very* theoretical

Bühlmann, Peter. 2002. “Bootstraps for Time Series.” *Statistical Science* 17:52–72. https://doi.org/10.1214/ss/1023798998.

Davison, A. C., and D. V. Hinkley. 1997. *Bootstrap Methods and Their Applications*. Cambridge, England: Cambridge University Press.

Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” *Annals of Statistics* 7:1–26. https://doi.org/10.1214/aos/1176344552.

Kantz, Holger, and Thomas Schreiber. 2004. *Nonlinear Time Series Analysis*. Second. Cambridge, England: Cambridge University Press.

Künsch, Hans R. 1989. “The Jackknife and the Bootstrap for General Stationary Observations.” *Annals of Statistics* 17:1217–41. https://doi.org/10.1214/aos/1176347265.

Lahiri, S. N. 2003. *Resampling Methods for Dependent Data*. New York: Springer-Verlag.

Politis, Dimitris N., and Halbert White. 2004. “Automatic Block-Length Selection for the Dependent Bootstrap.” *Econometric Reviews* 23:53–70. https://doi.org/10.1081/ETC-120028836.