---
title: Simulation for Inference I --- The Bootstrap
author: 36-467/36-667
date: 6 November 2018
output: slidy_presentation
bibliography: locusts.bib
---
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\]
```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white",
cache=TRUE, autodep=TRUE,
tidy=FALSE, warning=FALSE, message=FALSE,
echo=TRUE)
```
## In our last episode
- General strategy for simulating
- The Monte Carlo method: replace complicated probability calculations with repeated simulation
+ To find $\Expect{f(Y)}$, do $m$ simulations $Y_1, Y_2, \ldots, Y_m$
+ Then $m^{-1}\sum_{i=1}^{m}{f(Y_i)} \approx \Expect{f(Y)}$
+ Also works for variances, probabilities, etc.
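As a one-chunk reminder, here is a minimal illustration (not from last time): estimating $\Expect{Y^2}$ for $Y \sim N(0,1)$, whose true value is exactly 1:

```{r}
# Monte Carlo estimate of E[Y^2] for Y ~ N(0,1); the true value is 1
y <- rnorm(1e5)
mean(y^2)
```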
## Agenda
- The sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The **bootstrap principle**: _approximate_ the sampling distribution by _simulating_ from a good model, and treating the simulation run just like the data
- Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
- Stronger assumptions $\Rightarrow$ less uncertainty _if we're right_
## Statistical Uncertainty
- Re-run the experiment (survey, census, ...) and get different data
- $\therefore$ everything we calculate from data (estimates, test statistics, policies, ...) will change from trial to trial as well
- This variability is (the source of) **statistical uncertainty**
- Quantifying this is a way to be honest about what we actually know
## Measures of Uncertainty
- Standard error = standard deviation of an estimator
+ (could equally well use median absolute deviation, etc.)
- $p$-value = Probability we'd see a signal this big if there were just noise
- Confidence region = All the parameter values we can't reject at low error
rates:
1. _Either_ the true parameter is in the confidence region
2. _or_ we are very unlucky
3. _or_ our model is wrong
- etc., etc.
## The Sampling Distribution Is the Source of All Knowledge
- Data $X \sim P_{X,\theta_0}$, for some true $\theta_0$
- We calculate a statistic $T = \tau(X)$ so it has distribution $P_{T,\theta_0}$
- If we knew $P_{T,\theta_0}$, we could calculate
+ $\Var{T}$ (and so standard error)
+ $\Expect{T}$ (and so bias)
+ quantiles (and so confidence intervals or $p$-values), etc.
## The Problems
1. Most of the time, $P_{X,\theta_0}$ is very complicated
2. Most of the time, $\tau$ is a very complicated function
3. We certainly don't know $\theta_0$
Upshot: We don't know $P_{T,\theta_0}$ and can't use it to calculate anything
## The Solutions
- Classically ($\approx 1900$--$\approx 1975$): Restrict the model and the
statistic until you can calculate the sampling distribution, at least for very
large $n$
+ This is the asymptotic theory we did in Lectures 14 and 15
- Modern ($\approx 1975$--): Use complex models and statistics, but _simulate_ calculating the statistic on the model
## The Bootstrap Principle
1. Find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
2. Generate a simulation $\tilde{X}$ from $\hat{P}$, set $\tilde{T} = \tau(\tilde{X})$
3. Use the simulated distribution of the $\tilde{T}$ to approximate $P_{T,\theta_0}$
Refinements:
- improving the initial estimate $\hat{P}$
- reducing the number of simulations or speeding them up
- transforming $\tau$ so the final approximation is more stable
First step: find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
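As code, the whole principle is one pattern; `simulate.from.P.hat` and `tau` here are placeholders for whatever estimate $\hat{P}$ and statistic $\tau$ we settle on:

```{r, eval=FALSE}
# The bootstrap, schematically: simulate from the estimated distribution,
# treat each run like data, collect the statistics
# (simulate.from.P.hat and tau are placeholders, to be filled in below)
T.tilde <- replicate(1000, tau(simulate.from.P.hat()))
```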
## Model-based Bootstrap
If we are using a model, our best guess at $P_{X,\theta_0}$ is
$P_{X,\hat{\theta}}$, with our best estimate $\hat{\theta}$ of the parameters
#### The Model-based Bootstrap
- Get data $X$, estimate $\hat{\theta}$ from $X$
- Repeat $m$ times:
+ Simulate $\tilde{X}$ from $P_{X,\hat{\theta}}$ (simulate data of same size/"shape" as real data)
+ Calculate $\tilde{T} = \tau(\tilde{X})$ (treat simulated data the same as real data)
- Use empirical distribution of $\tilde{T}$ as $P_{T,\theta_0}$
## Example: lynxes in Canada
![](http://farm4.static.flickr.com/3131/2295650288_a8f7a9a51b_m.jpg)
## Example: lynxes in Canada
- Annual time series of number of lynxes trapped for the fur company in the Mackenzie River area of Canada:
```{r}
data(lynx)
plot(lynx, xlab="Year", ylab="Lynxes")
```
## Fitting an AR model to the example data
- This oscillates, so an AR(1) isn't appropriate --- maybe an AR(2)
```{r}
lynx.ar2 <- ar.ols(lynx, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
lynx.ar2$ar
lynx.ar2$x.intercept
```
so the estimated model is
\[
X(t) = `r lynx.ar2$x.intercept` + `r lynx.ar2$ar[1]` X(t-1) - `r abs(lynx.ar2$ar[2])` X(t-2) + \epsilon(t)
\]
## Simulating from the fitted model
- What do some time series from this look like?
+ (We wrote `ar2.sim` last time)
```{r, include=FALSE}
# Simulate a time series of length n from a Gaussian AR(2) model with given
# coefficients, noise level and initial condition
# Copied from lecture 17
# Inputs: length of output time series (n)
# intercept of AR(2) (a)
# vector of coefficients of AR(2) (b)
# standard deviation of the innovations (sd.innovation)
# initial two values of time series (x.start)
# Output: vector of length n
# Presumes: innovations should be Gaussian white noise
# a and sd.innovation are length-1 numerical values
# b and x.start are length-2 numerical vectors
# b has coefficient on X(t-1) first
# x.start has first value first
ar2.sim <- function(n, a, b, sd.innovation, x.start) {
x <- vector(length=n)
x[1] <- x.start[1] # Fill in the two given initial values
x[2] <- x.start[2]
for (t in 1:(n-2)) {
x[t+2] <- a+b[1]*x[t+1] + b[2]*x[t] + rnorm(1, sd=sd.innovation)
}
return(x)
}
```
```{r}
lynx.sims <- replicate(5, ar2.sim(n=length(lynx),
a=lynx.ar2$x.intercept,
b=lynx.ar2$ar,
sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
x.start=lynx[1:2]))
plot(lynx, xlab="Year", ylab="Lynxes caught")
for (i in 1:ncol(lynx.sims)) {
lines(1821:1934, lynx.sims[,i], col="grey")
}
```
## Uncertainty in the coefficients {.smaller}
- Write a function which estimates the coefficients
```{r}
# Estimate the coefficients of an AR(2) model
# Inputs: a time series (x)
# Outputs: vector containing intercept and 2 AR slopes
ar2.estimator <- function(x) {
# Treat the input exactly the same way we treated the real data
fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}
```
Check that this gives the right answer on the data, and a _different_ answer
on a simulation:
```{r}
ar2.estimator(lynx)
ar2.estimator(lynx.sims[,1])
```
## Uncertainty in the coefficients {.smaller}
Now do _lots_ of simulations, and apply this function to each simulation:
```{r}
lynx.coef.sims <- replicate(1000, ar2.estimator( ar2.sim(n=length(lynx),
a=lynx.ar2$x.intercept,
b=lynx.ar2$ar,
sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
x.start=lynx[1:2])))
dim(lynx.coef.sims)
lynx.coef.sims[1:3, 1:5]
```
## Uncertainty in the coefficients {.smaller}
Standard errors $\approx$ standard deviation across simulations:
```{r}
apply(lynx.coef.sims, 1, sd)
```
95% confidence intervals $\approx$ quantiles across simulations:
```{r}
apply(lynx.coef.sims, 1, quantile, prob=c(0.025, 0.975))
```
## For Gaussian AR$(p)$ models...
- ... the bootstrapping isn't needed, because there's an asymptotic theory
which tells us about standard errors and confidence intervals
+ Like what you did for the AR(1) in homework 8, only with even more book-keeping
- But we can use this approach for _any_ generative time-series model
+ E.g., make $X(t)$ have a Poisson distribution rather than a Gaussian
+ All we have to do is swap out the simulation
## Prediction
- You can apply these ideas when the "statistic" is $X(t)$ for a fixed $t$
- Or when it's $X(t) | X(t-1) = x$ for a fixed $x$
+ Since the model tells us how to calculate that!
- Expected values across simulations = best guess for the point prediction
- Variances and intervals incorporate modeling uncertainty
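For instance, a sketch of the one-step-ahead predictive distribution for the lynxes, conditioning on the last two observed values (this uses `lynx.ar2` from above; the 1000 replicates and the 95% level are arbitrary choices):

```{r}
# One-step-ahead prediction: plug the last two observations into the
# fitted AR(2), with a fresh innovation on each replicate
n <- length(lynx)
x.next <- replicate(1000,
    lynx.ar2$x.intercept + lynx.ar2$ar[1]*lynx[n] + lynx.ar2$ar[2]*lynx[n-1] +
        rnorm(1, sd=sd(lynx.ar2$resid, na.rm=TRUE)))
mean(x.next)                      # point prediction
quantile(x.next, c(0.025, 0.975)) # 95% predictive interval
```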
## Resampling
- Problem: Suppose we don't have a trustworthy model
- Resource: We do have data, which tells us a lot about the distribution
- Solution: **Resampling**, treat the sample like a whole population
## The Resampling ("Nonparametric") Bootstrap
- Get data $X$, containing $n$ samples, find $T=\tau(X)$
- Fix a **block length** $k$ and divide $X$ into (overlapping) blocks of length $k$
+ So first block is $(X(1), \ldots X(k))$, second block is $(X(2), \ldots X(k+1))$, etc.
- Repeat $m$ times:
+ Draw $n/k$ blocks (rounding up) from $X$ _with replacement_ (resample the data)
+ Concatenate the blocks to make $\tilde{X}$
+ Calculate $\tilde{T} = \tau(\tilde{X})$ (treat simulated data the same as real data)
- Use empirical distribution of $\tilde{T}$ as $P_{T,\theta_0}$
## Example: Back to the lynxes {.smaller}
```{r, include=FALSE}
# Source the PPW block-length estimator
source("http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R")
# This estimates a good block-length for two different time-series bootstraps,
# one with randomized block lengths (the "stationary bootstrap"), and one
# with fixed block lengths (the "circular bootstrap"); we're using the
# second one
```
```{r}
# Convert a time series into a data frame of lagged values
# Input: a time series, maximum lag to use, whether older values go on the right
# or the left
# Output: a data frame with (order+1) columns, named lag0, lag1, ... , and
# length(ts)-order rows
design.matrix.from.ts <- function(ts,order,right.older=TRUE) {
n <- length(ts)
x <- ts[(order+1):n]
for (lag in 1:order) {
if (right.older) {
x <- cbind(x,ts[(order+1-lag):(n-lag)])
} else {
x <- cbind(ts[(order+1-lag):(n-lag)],x)
}
}
lag.names <- c("lag0",paste("lag",1:order,sep=""))
if (right.older) {
colnames(x) <- lag.names
} else {
colnames(x) <- rev(lag.names)
}
return(as.data.frame(x))
}
head(lynx)
head(design.matrix.from.ts(lynx,2))
```
## Example: Back to the lynxes
```{r}
# Simple block bootstrap
# Inputs: time series (ts), block length, length of output
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts,block.length,len.out=length(ts)) {
# chop up ts into blocks
the.blocks <- as.matrix(design.matrix.from.ts(ts,block.length-1,
right.older=FALSE))
# look carefully at design.matrix.from.ts to see why we need the -1
# How many blocks is that?
blocks.in.ts <- nrow(the.blocks)
# Sanity-check
stopifnot(blocks.in.ts == length(ts) - block.length+1)
# How many blocks will we need (round up)?
blocks.needed <- ceiling(len.out/block.length)
# Sample blocks with replacement
picked.blocks <- sample(1:blocks.in.ts,size=blocks.needed,replace=TRUE)
# put the blocks in the randomly-selected order
x <- the.blocks[picked.blocks,]
# convert from a matrix to a vector and return
# need transpose because R goes from vectors to matrices and back column by
# column, not row by row
x.vec <- as.vector(t(x))
# Discard unneeded extra observations at the end silently
return(x.vec[1:len.out])
}
```
```{r}
head(lynx)
head(rblockboot(lynx, 3))
```
## Resampling the lynxes
- We can do everything the same as before, just changing how we simulate
```{r}
lynx.coef.sims.resamp <- replicate(1000, ar2.estimator(rblockboot(lynx, 3)))
```
- Standard errors:
```{r}
apply(lynx.coef.sims.resamp, 1, sd)
```
- 95% confidence intervals:
```{r}
apply(lynx.coef.sims.resamp, 1, quantile, prob=c(0.025, 0.975))
```
(Anything funny here?)
## Why should this work?
- Preserves the distribution within blocks
- Independence across blocks
- So $\tilde{X} \approx$ stationary, with the same short-range distribution as $X$ but no longer-range dependence
- Should work well if $X$ is stationary without long-range dependence
- Want to let $k$ grow with $n$, but still have a growing number of blocks, so $n/k_n \rightarrow \infty$, or $k=o(n)$
- Some theory suggests $k \propto n^{1/3}$
+ @PolitisWhite2004 provides an automatic procedure for picking $k$
+ R implementation at <http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R>
+ ...which suggests a block-length of 3 for the lynxes
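For completeness, the call producing that suggestion would look like the following, assuming the `ppw.R` script sourced earlier defines a `b.star` function (as the version at that URL does):

```{r, eval=FALSE}
# Automatic block-length selection (PPW); the circular-bootstrap column
# is the one matching the fixed-block scheme used here
# (assumes the sourced ppw.R defines b.star)
b.star(lynx)
```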
## What about non-stationary time series?
- Model-based: OK if the model is non-stationary
- Otherwise:
+ Estimate the trend
+ Bootstrap the deviations from the trend (by model or resampling)
+ Add these back to the trend
+ $\tilde{X} =$ trend + bootstrapped deviations
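A sketch of that recipe, using a linear trend purely for illustration (any trend estimator, e.g. a spline, slots in the same way; `rblockboot` is the block resampler defined earlier):

```{r}
# Bootstrap a trending series: detrend, resample the deviations in
# blocks, then add the trend back (linear trend just for illustration)
t.idx <- seq_along(lynx)
trend <- fitted(lm(as.numeric(lynx) ~ t.idx))
deviations <- as.numeric(lynx) - trend
x.tilde <- trend + rblockboot(deviations, 3)
```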
## What about spatial data?
- Model-based: simulate the model
- Resampling: use spatial blocks (e.g., rectangles)
## Sources of Error in Bootstrapping
- **Simulation** Using only $m$ bootstrap replicates
+ Make this small by letting $m\rightarrow\infty$
+ Costs computing time
+ Diminishing returns: error is generally $O(1/\sqrt{m})$
- **Approximation** Using $\hat{P}$ instead of $P_{X,\theta_0}$
+ Make this small by careful statistical modeling
- **Estimation** Only a finite number of samples
+ Make this small by being careful about what we simulate (example in backup slides)
Generally: for fixed $n$, the resampling bootstrap shows more uncertainty
than model-based bootstraps, but is less at risk from modeling mistakes
(yet another bias-variance tradeoff)
## Summary
- Bootstrapping is applying the Monte Carlo idea to the sampling distribution
+ Find a good approximation to the data-generating distribution
+ Simulate it many times
+ Treat each simulation run just like the original data
+ Distribution over simulations $\approx$ sampling distribution
- Choice of how to simulate:
+ Based on a model
+ By resampling blocks
- Many, many refinements, but this is the core of it
## Backup: The "surrogate data" method
- You think that $X$ comes from some interesting class of processes $\mathcal{H}_1$
- Your skeptical friend thinks maybe it's just some boring process in $\mathcal{H}_0$
- Fit a model to $X$ using only the boring processes in $\mathcal{H}_0$, say $\hat{Q}$
- Now simulate $\hat{Q}$ to get many "surrogate data" sets $\tilde{X}$
- Look at the distribution of $\tau(\tilde{X})$ over your simulations
- If $\tau(X)$ is really unlikely under that simulation, it's evidence against $\mathcal{H}_0$
+ When would it be evidence _for_ $\mathcal{H}_1$?
- The term "surrogate data" comes out of physics [@Kantz-Schreiber-2nd]
+ testing the null hypothesis $\mathcal{H}_0$ against the alternative $\mathcal{H}_1$
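A sketch with the lynxes: let the boring class $\mathcal{H}_0$ be AR(1) models (which can't oscillate), and take $\tau$ to be the lag-2 partial autocorrelation; both choices are just for illustration. (`arima.sim` simulates a zero-mean AR, which is harmless here because `pacf` de-means.)

```{r}
# Surrogate-data sketch: fit a "boring" AR(1), simulate surrogates from
# it, and ask how extreme the real statistic is among the surrogates
lynx.ar1 <- ar.ols(lynx, order.max=1, aic=FALSE, demean=FALSE, intercept=TRUE)
tau <- function(x) { pacf(x, plot=FALSE)$acf[2] }
surrogates <- replicate(1000,
    tau(arima.sim(n=length(lynx), model=list(ar=lynx.ar1$ar),
                  sd=sd(lynx.ar1$resid, na.rm=TRUE))))
mean(abs(surrogates) >= abs(tau(lynx))) # simulated p-value
```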
## Backup: Incorporating uncertainty in the initial conditions
- Simulating a dynamical model needs an initial value $X(0)$ (or $X(1)$, or maybe $X(1), X(2)$, etc.)
- What if this isn't known precisely?
- Easy case: it follows a probability distribution
+ Then draw $X(0)$ from that distribution independently on each run
+ Aggregate across runs as usual
- Less easy: $X(0)$ within known limits but no distribution otherwise
+ My advice: draw $X(0)$ uniformly from inside the region
+ Rationale 1: Maximum entropy distribution compatible with constraints
+ Rationale 2: If the process is rapidly mixing, memory of the distribution of $X(0)$ decays quickly anyway
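Following that advice with `ar2.sim` from earlier: draw the two initial values uniformly on each run, here from the observed range of the data (those limits are purely illustrative):

```{r}
# Propagate initial-condition uncertainty: fresh uniform draws for
# X(1), X(2) on every simulation run (limits here are illustrative)
sims <- replicate(1000,
    ar2.sim(n=length(lynx), a=lynx.ar2$x.intercept, b=lynx.ar2$ar,
            sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
            x.start=runif(2, min=min(lynx), max=max(lynx))))
```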
## Backup: Improving on the Crude Confidence Interval
Crude CIs use the distribution of $\tilde{\theta}$ under $\hat{\theta}$
But really we want the distribution of $\hat{\theta}$ under $\theta_0$
Mathematical observation: Generally speaking, the distribution of
$\tilde{\theta} - \hat{\theta}$ is closer to that of $\hat{\theta}-\theta_0$ than
the distribution of $\tilde{\theta}$ is to that of $\hat{\theta}$
("centering" or "pivoting")
## Backup: The Basic, Pivotal CI
Let $q_{\alpha/2}$ and $q_{1-\alpha/2}$ be the $\alpha/2$ and $1-\alpha/2$ quantiles of the simulated $\tilde{\theta}$:
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
```{r}
2*ar2.estimator(lynx)-apply(lynx.coef.sims.resamp, 1, quantile, prob=0.975)
2*ar2.estimator(lynx)-apply(lynx.coef.sims.resamp, 1, quantile, prob=0.025)
```
## Backup: Further reading
- Bootstrap begins with @Efron-on-boostrap-in-annals (drawing on an older idea from the 1940s called the "jackknife")
- Block bootstrap for time series comes from @Kunsch-bootstrap-for-stationary-obs
- @Buhlmann-on-bootstraps-for-time-series is a good review on time series bootstraps
- @Davison-Hinkley-bootstrap is the best over-all textbook on the bootstrap
- @Lahiri-resampling-for-dependent is _very_ theoretical
## References