# Simulation for Inference I — The Bootstrap

6 November 2018

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}$

# In our last episode

• General strategy for simulating
• The Monte Carlo method: replace complicated probability calculations with repeated simulation
• To find $$\Expect{f(Y)}$$, do $$m$$ simulations $$Y_1, Y_2, \ldots, Y_m$$
• Then $$m^{-1}\sum_{1}^{m}{f(Y_i)} \approx \Expect{f(Y)}$$
• Also works for variances, probabilities, etc.
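As a one-chunk refresher, here is a minimal sketch in R (the target expectation, $\Expect{e^Y}$ for standard-Gaussian $Y$, is our own toy example, chosen because the exact answer $e^{1/2}$ is known):

```r
# Monte Carlo estimate of E[exp(Y)] for Y ~ N(0,1);
# the exact value is exp(1/2), so we can check the approximation directly
set.seed(42)
m <- 1e5
y <- rnorm(m)
mc.estimate <- mean(exp(y))
c(monte.carlo=mc.estimate, exact=exp(0.5))
```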

# Agenda

• The sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
• The bootstrap principle: approximate the sampling distribution by simulating from a good model, and treating the simulation run just like the data
• Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
• Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
• Stronger assumptions $$\Rightarrow$$ less uncertainty if we’re right

# Statistical Uncertainty

• Re-run the experiment (survey, census, …) and get different data

• $$\therefore$$ everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well

• This variability is (the source of) statistical uncertainty

• Quantifying this is a way to be honest about what we actually know

# Measures of Uncertainty

• Standard error = standard deviation of an estimator
• (could equally well use median absolute deviation, etc.)
• $$p$$-value = probability we’d see a signal this big if there were just noise
• Confidence region = All the parameter values we can’t reject at low error rates:
1. Either the true parameter is in the confidence region
2. or we are very unlucky
3. or our model is wrong
• etc., etc.

# The Sampling Distribution Is the Source of All Knowledge

• Data $$X \sim P_{X,\theta_0}$$, for some true $$\theta_0$$

• We calculate a statistic $$T = \tau(X)$$ so it has distribution $$P_{T,\theta_0}$$

• If we knew $$P_{T,\theta_0}$$, we could calculate
• $$\Var{T}$$ (and so standard error)
• $$\Expect{T}$$ (and so bias)
• quantiles (and so confidence intervals or $$p$$-values), etc.
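To make this concrete, here is a toy example of ours (not from the lecture) where $\theta_0$ *is* known, so we can simulate $P_{T,\theta_0}$ directly and read off the bias and standard error:

```r
# T = sample median of n = 50 draws from N(theta_0, 1), with theta_0 = 1
# known; simulating the sampling distribution gives us bias and SE directly
set.seed(3)
T.sims <- replicate(10000, median(rnorm(50, mean=1)))
c(bias=mean(T.sims) - 1, std.error=sd(T.sims))
```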

# The Problems

1. Most of the time, $$P_{X,\theta_0}$$ is very complicated
2. Most of the time, $$\tau$$ is a very complicated function
3. We certainly don’t know $$\theta_0$$

Upshot: We don’t know $$P_{T,\theta_0}$$ and can’t use it to calculate anything

# The Solutions

• Classically ($$\approx 1900$$–$$\approx 1975$$): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large $$n$$
• This is the asymptotic theory we did in Lectures 14 and 15
• Modern ($$\approx 1975$$–): Use complex models and statistics, but simulate calculating the statistic on the model

# The Bootstrap Principle

1. Find a good estimate $$\hat{P}$$ for $$P_{X,\theta_0}$$
2. Generate a simulation $$\tilde{X}$$ from $$\hat{P}$$, set $$\tilde{T} = \tau(\tilde{X})$$
3. Use the simulated distribution of the $$\tilde{T}$$ to approximate $$P_{T,\theta_0}$$

Refinements:

• improving the initial estimate $$\hat{P}$$
• reducing the number of simulations or speeding them up
• transforming $$\tau$$ so the final approximation is more stable

First step: find a good estimate $$\hat{P}$$ for $$P_{X,\theta_0}$$
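The three steps can be sketched as one generic helper (a sketch of ours; the names `rboot`, `statistic`, and `simulator` are hypothetical, not from any package):

```r
# Generic bootstrap skeleton: `simulator` draws one synthetic data set from
# P-hat, `statistic` is tau; returns m simulated values of T-tilde
rboot <- function(statistic, simulator, m) {
    replicate(m, statistic(simulator()))
}

# Example: standard error of the sample mean under a fitted Gaussian model
set.seed(1)
x <- rnorm(100, mean=5, sd=2)
sim.from.fit <- function() rnorm(length(x), mean=mean(x), sd=sd(x))
tboots <- rboot(statistic=mean, simulator=sim.from.fit, m=1000)
sd(tboots)  # compare to the theoretical sd(x)/sqrt(100)
```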

# Model-based Bootstrap

If we are using a model, our best guess at $$P_{X,\theta_0}$$ is $$P_{X,\hat{\theta}}$$, with our best estimate $$\hat{\theta}$$ of the parameters

#### The Model-based Bootstrap

• Get data $$X$$, estimate $$\hat{\theta}$$ from $$X$$
• Repeat $$m$$ times:
• Simulate $$\tilde{X}$$ from $$P_{X,\hat{\theta}}$$ (simulate data of same size/“shape” as real data)
• Calculate $$\tilde{T} = \tau(\tilde{X})$$ (treat simulated data the same as real data)
• Use empirical distribution of $$\tilde{T}$$ as $$P_{T,\theta_0}$$

• Annual time series of the number of lynxes trapped for fur in the Mackenzie River area of Canada:
```r
data(lynx)
plot(lynx, xlab="Year", ylab="Lynxes")
```

# Fitting an AR model to the example data

• This oscillates, so an AR(1) isn’t appropriate — maybe an AR(2)
```r
lynx.ar2 <- ar.ols(lynx, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
lynx.ar2$ar
## , , 1
##
##            [,1]
## [1,]  1.152423
## [2,] -0.606229
lynx.ar2$x.intercept
## [1] 710.1056
```

so the estimated model is $X(t) = 710.106 + 1.1524 X(t-1) - 0.6062 X(t-2) + \epsilon(t)$

# Simulating from the fitted model

• What do some time series from this look like?
• (We wrote ar2.sim last time)
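(For completeness, here is a sketch of `ar2.sim` consistent with how it is called below; the argument names are inferred from that usage, so check against the version from last lecture.)

```r
# Simulate an AR(2) process
# Inputs: series length (n), intercept (a), vector of two AR slopes (b),
#   innovation standard deviation, vector of two starting values
# Output: a simulated series of length n
ar2.sim <- function(n, a, b, sd.innovation, x.start) {
    x <- numeric(n)
    x[1:2] <- x.start
    for (t in 3:n) {
        x[t] <- a + b[1]*x[t-1] + b[2]*x[t-2] + rnorm(1, sd=sd.innovation)
    }
    return(x)
}
```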
```r
lynx.sims <- replicate(5, ar2.sim(n=length(lynx),
    a=lynx.ar2$x.intercept, b=lynx.ar2$ar,
    sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE), x.start=lynx[1:2]))
plot(lynx, xlab="Year", ylab="Lynxes caught")
for (i in 1:ncol(lynx.sims)) {
    lines(1821:1934, lynx.sims[,i], col="grey")
}
```

# Uncertainty in the coefficients

• Write a function which estimates the coefficients

```r
# Estimate the coefficients of an AR(2) model
# Input: a time series (x)
# Output: vector containing the intercept and the 2 AR slopes
ar2.estimator <- function(x) {
    # Treat the input exactly the same way we treated the real data
    fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
    return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}
```

Check that this gives the right answer on the data, and a different answer on a simulation:

```r
ar2.estimator(lynx)
##          a         b1         b2
## 710.105589   1.152423  -0.606229
ar2.estimator(lynx.sims[,1])
##           a          b1          b2
## 762.6063220   1.1102328  -0.5743644
```

# Uncertainty in the coefficients

Now do lots of simulations, and apply this function to each simulation:

```r
lynx.coef.sims <- replicate(1000, ar2.estimator(ar2.sim(n=length(lynx),
    a=lynx.ar2$x.intercept, b=lynx.ar2$ar,
    sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
    x.start=lynx[1:2])))
dim(lynx.coef.sims)
## [1]    3 1000
lynx.coef.sims[1:3, 1:5]
##           [,1]        [,2]        [,3]       [,4]        [,5]
## a  830.8435812 724.7291944 598.4342755 761.794216 885.9718640
## b1   1.1771914   1.1892204   1.2535834   1.109654   0.9910510
## b2  -0.7153194  -0.6591029  -0.6459415  -0.578512  -0.5180509
```

# Uncertainty in the coefficients

Standard errors $$\approx$$ standard deviation across simulations:

```r
apply(lynx.coef.sims, 1, sd)
##            a           b1           b2
## 118.50209000   0.07622429   0.07498753
```

95% confidence intervals $$\approx$$ quantiles across simulations:

```r
apply(lynx.coef.sims, 1, quantile, prob=c(0.025, 0.975))
##              a        b1         b2
## 2.5%  508.0321 0.9839003 -0.7473846
## 97.5% 963.5467 1.2820643 -0.4422757
```

# For Gaussian AR$$(p)$$ models…

• … the bootstrapping isn’t needed, because there’s an asymptotic theory which tells us about standard errors and confidence intervals
• Like what you did for the AR(1) in homework 8, only with even more book-keeping
• But we can use this approach for any generative time-series model
• E.g., make $$X(t)$$ have a Poisson distribution rather than a Gaussian
• All we have to do is swap out the simulation

# Prediction

• You can apply these ideas when the “statistic” is $$X(t)$$ for a fixed $$t$$
• Or when it’s $$X(t) | X(t-1) = x$$ for a fixed $$x$$
• Since the model tells us how to calculate that!
• Expected values across simulations = best guess for the point prediction
• Variances and intervals incorporate modeling uncertainty
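A sketch of one-step-ahead prediction for the lynx series (re-fitting the model here so the chunk is self-contained; note that the interval below reflects only innovation noise, not uncertainty in the coefficients):

```r
# Predictive distribution for X(n+1), given the last two observations,
# under the fitted AR(2) model
data(lynx)
fit <- ar.ols(lynx, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
n <- length(lynx)
set.seed(5)
x.next <- fit$x.intercept + fit$ar[1]*lynx[n] + fit$ar[2]*lynx[n-1] +
    rnorm(2000, sd=sd(fit$resid, na.rm=TRUE))
# Point prediction and a 95% predictive interval
c(point=mean(x.next), quantile(x.next, c(0.025, 0.975)))
```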

# Resampling

• Problem: Suppose we don’t have a trustworthy model
• Resource: We do have data, which tells us a lot about the distribution
• Solution: Resampling: treat the sample as though it were the whole population

# The Resampling (“Nonparametric”) Bootstrap

• Get data $$X$$, containing $$n$$ samples, find $$T=\tau(X)$$
• Fix a block length $$k$$ and divide $$X$$ into (overlapping) blocks of length $$k$$
• So first block is $$(X(1), \ldots X(k))$$, second block is $$(X(2), \ldots X(k+1))$$, etc.
• Repeat $$m$$ times:
• Generate $$\tilde{X}$$ by drawing $$n/k$$ blocks from $$X$$ with replacement (resample the data)
• Concatenate the blocks to make $$\tilde{X}$$
• Calculate $$\tilde{T} = \tau(\tilde{X})$$ (treat simulated data the same as real data)
• Use empirical distribution of $$\tilde{T}$$ as $$P_{T,\theta_0}$$

# Example: Back to the lynxes

```r
# Convert a time series into a data frame of lagged values
# Inputs: a time series, maximum lag to use, whether older values go on the
#   right or the left
# Output: a data frame with (order+1) columns, named lag0, lag1, ..., and
#   length(ts)-order rows
design.matrix.from.ts <- function(ts, order, right.older=TRUE) {
    n <- length(ts)
    x <- ts[(order+1):n]
    for (lag in 1:order) {
        if (right.older) {
            x <- cbind(x, ts[(order+1-lag):(n-lag)])
        } else {
            x <- cbind(ts[(order+1-lag):(n-lag)], x)
        }
    }
    lag.names <- c("lag0", paste("lag", 1:order, sep=""))
    if (right.older) {
        colnames(x) <- lag.names
    } else {
        colnames(x) <- rev(lag.names)
    }
    return(as.data.frame(x))
}
```

```r
head(lynx)
## [1]  269  321  585  871 1475 2821
head(design.matrix.from.ts(lynx, 2))
##   lag0 lag1 lag2
## 1  585  321  269
## 2  871  585  321
## 3 1475  871  585
## 4 2821 1475  871
## 5 3928 2821 1475
## 6 5943 3928 2821
```

# Example: Back to the lynxes

```r
# Simple block bootstrap
# Inputs: time series (ts), block length, length of output
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts, block.length, len.out=length(ts)) {
    # Chop up ts into (overlapping) blocks; look carefully at
    # design.matrix.from.ts to see why we need the -1
    the.blocks <- as.matrix(design.matrix.from.ts(ts, block.length-1,
                                                  right.older=FALSE))
    # How many blocks is that?
    blocks.in.ts <- nrow(the.blocks)
    # Sanity-check
    stopifnot(blocks.in.ts == length(ts) - block.length + 1)
    # How many blocks will we need (round up)?
    blocks.needed <- ceiling(len.out/block.length)
    # Sample blocks with replacement
    picked.blocks <- sample(1:blocks.in.ts, size=blocks.needed, replace=TRUE)
    # Put the blocks in the randomly-selected order; drop=FALSE keeps this a
    # matrix even if only one block is picked
    x <- the.blocks[picked.blocks, , drop=FALSE]
    # Convert from a matrix to a vector; we need the transpose because R
    # converts between vectors and matrices column by column, not row by row
    x.vec <- as.vector(t(x))
    # Silently discard unneeded extra observations at the end
    return(x.vec[1:len.out])
}
```
```r
head(lynx)
## [1]  269  321  585  871 1475 2821
head(rblockboot(lynx, 3))
## [1] 1676 2251 1426  473  358  784
```

# Resampling the lynxes

• We can do everything the same as before, just changing how we simulate
```r
lynx.coef.sims.resamp <- replicate(1000, ar2.estimator(rblockboot(lynx, 3)))
```

• Standard errors:

```r
apply(lynx.coef.sims.resamp, 1, sd)
##            a           b1           b2
## 198.36971077   0.09812245   0.07406839
```

• 95% confidence intervals:

```r
apply(lynx.coef.sims.resamp, 1, quantile, prob=c(0.025, 0.975))
##               a        b1          b2
## 2.5%   598.7163 0.3864529 -0.35398540
## 97.5% 1365.3888 0.7632487 -0.06106865
```

(Anything funny here?)

# Why should this work?

• Preserves the distribution within blocks
• Independence across blocks
• So $$\tilde{X} \approx$$ stationary, with the same short-range distribution as $$X$$ but no longer-range dependence
• Should work well if $$X$$ is stationary without long-range dependence
• Want to let $$k$$ grow with $$n$$, but still have a growing number of blocks, so $$n/k_n \rightarrow \infty$$, or $$k=o(n)$$
• Some theory suggests $$k \propto n^{1/3}$$
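For the lynx data, this rule of thumb (taking the proportionality constant to be 1, an arbitrary choice) gives:

```r
# n = 114 annual observations, so k proportional to n^(1/3) suggests
# blocks of around length 5
data(lynx)
ceiling(length(lynx)^(1/3))
## [1] 5
```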

# What about non-stationary time series?

• Model-based: OK if the model is non-stationary
• Otherwise:
• Estimate the trend
• Bootstrap the deviations from the trend (by model or resampling)
• Add these back to the trend
• $$\tilde{X} =$$ trend + bootstrapped deviations
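The detrend-and-resample recipe above might look like this sketch (on simulated data with a linear trend, since this is not one of the lecture's examples; the loess smoother and block length 5 are our own arbitrary choices):

```r
set.seed(2)
n <- 200
t.idx <- 1:n
x <- 0.05*t.idx + as.numeric(arima.sim(model=list(ar=0.6), n=n))
# Step 1: estimate the trend (here by loess smoothing)
trend.fit <- loess(x ~ t.idx)
trend <- fitted(trend.fit)
devs <- residuals(trend.fit)
# Step 2: bootstrap the deviations by resampling blocks of length 5
block.length <- 5
starts <- sample(1:(n-block.length+1), ceiling(n/block.length), replace=TRUE)
devs.boot <- as.vector(sapply(starts,
    function(s) devs[s:(s+block.length-1)]))[1:n]
# Step 3: add the resampled deviations back onto the estimated trend
x.boot <- trend + devs.boot
```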

# What about spatial data?

• Model-based: simulate the model
• Resampling: use spatial blocks (e.g., rectangles)

# Sources of Error in Bootstrapping

• **Simulation**: using only $$m$$ bootstrap replicates
• Make this small by letting $$m\rightarrow\infty$$
• Costs computing time
• Diminishing returns: error is generally $$O(1/\sqrt{m})$$
• **Approximation**: using $$\hat{P}$$ instead of $$P_{X,\theta_0}$$
• Make this small by careful statistical modeling
• **Estimation**: only a finite number of samples
• Make this small by being careful about what we simulate (example in backup slides)
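The $$O(1/\sqrt{m})$$ rate is easy to see by simulation (our own illustration): quadrupling $$m$$ roughly halves the spread of a Monte Carlo average.

```r
# Spread of a Monte Carlo average of m standard-Gaussian draws, estimated
# from 300 independent Monte Carlo runs for each value of m; the true
# spread is 1/sqrt(m), i.e., 0.1, 0.05, 0.025
set.seed(7)
mc.spread <- function(m) sd(replicate(300, mean(rnorm(m))))
sapply(c(100, 400, 1600), mc.spread)
```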

Generally: for fixed $$n$$, the resampling bootstrap shows more uncertainty than model-based bootstraps, but is less at risk from modeling mistakes (yet another bias-variance tradeoff)

# Summary

• Bootstrapping is applying the Monte Carlo idea to the sampling distribution
• Find a good approximation to the data-generating distribution
• Simulate it many times
• Treat each simulation run just like the original data
• Distribution over simulations $$\approx$$ sampling distribution
• Choice of how to simulate:
• Based on a model
• By resampling blocks
• Many, many refinements, but this is the core of it

# Backup: The “surrogate data” method

• You think that $$X$$ comes from some interesting class of processes $$\mathcal{H}_1$$
• Your skeptical friend thinks maybe it’s just some boring process in $$\mathcal{H}_0$$
• Fit a model to $$X$$ using only the boring processes in $$\mathcal{H}_0$$, say $$\hat{Q}$$
• Now simulate $$\hat{Q}$$ to get many “surrogate data” sets $$\tilde{X}$$
• Look at the distribution of $$\tau(\tilde{X})$$ over your simulations
• If $$\tau(X)$$ is really unlikely under that simulation, it’s evidence against $$\mathcal{H}_0$$
• When would it be evidence for $$\mathcal{H}_1$$?
• The term “surrogate data” comes out of physics (Kantz and Schreiber 2004)
• It is just testing the null hypothesis $$\mathcal{H}_0$$ against the alternative $$\mathcal{H}_1$$

# Backup: Incorporating uncertainty in the initial conditions

• Simulating a dynamical model needs an initial value $$X(0)$$ (or $$X(1)$$, or maybe $$X(1), X(2)$$, etc.)
• What if this isn’t known precisely?
• Easy case: it follows a probability distribution
• Then draw $$X(0)$$ from that distribution independently on each run
• Aggregate across runs as usual
• Less easy: $$X(0)$$ within known limits but no distribution otherwise
• My advice: draw $$X(0)$$ uniformly from inside the region
• Rationale 1: Maximum entropy distribution compatible with constraints
• Rationale 2: If the process is rapidly mixing, memory of the distribution of $$X(0)$$ decays quickly anyway
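A sketch of the uniform-draw advice in practice (an illustrative AR(1); all coefficient values and the interval $$[0,10]$$ are made up for the example):

```r
# AR(1) simulator whose initial value is supplied by the caller
sim.ar1 <- function(n, a=1, b=0.5, sd.innov=1, x1) {
    x <- numeric(n)
    x[1] <- x1
    for (t in 2:n) x[t] <- a + b*x[t-1] + rnorm(1, sd=sd.innov)
    return(x)
}
# X(1) is known only to lie in [0, 10]: draw it uniformly on each run,
# then aggregate across runs as usual
set.seed(11)
finals <- replicate(1000, sim.ar1(n=20, x1=runif(1, 0, 10))[20])
c(mean=mean(finals), sd=sd(finals))
```

Because this AR(1) mixes quickly, by $$t = 20$$ the results are nearly indistinguishable from what any fixed starting value would give (Rationale 2 above).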

# Backup: Improving on the Crude Confidence Interval

Crude CIs use the distribution of $$\tilde{\theta}$$ under $$\hat{\theta}$$

But really we want the distribution of $$\hat{\theta}$$ under $$\theta_0$$

Mathematical observation: generally speaking, the distribution of $$\tilde{\theta} - \hat{\theta}$$ is closer to that of $$\hat{\theta}-\theta_0$$ than the distribution of $$\tilde{\theta}$$ is to that of $$\hat{\theta}$$

(“centering” or “pivoting”)

# Backup: The Basic, Pivotal CI

Let $$q_{\alpha/2}$$ and $$q_{1-\alpha/2}$$ be the corresponding quantiles of the simulated $$\tilde{\theta}$$; then

$\begin{eqnarray*} 1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\ & = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\ & \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\ & = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\ & = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0} \end{eqnarray*}$

Basically: re-center the simulations around the empirical data

```r
2*ar2.estimator(lynx) - apply(lynx.coef.sims.resamp, 1, quantile, prob=0.975)
##         a        b1        b2
## 54.822397  1.541596 -1.151389
2*ar2.estimator(lynx) - apply(lynx.coef.sims.resamp, 1, quantile, prob=0.025)
##           a          b1          b2
## 821.4949207   1.9183923  -0.8584727
```

# Further Reading

• The bootstrap begins with Efron (1979) (drawing on an older idea from the 1940s called the “jackknife”)
• Block bootstrap for time series comes from Künsch (1989)
• Bühlmann (2002) is a good review on time series bootstraps
• Davison and Hinkley (1997) is the best over-all textbook on the bootstrap
• Lahiri (2003) is very theoretical

# References

Bühlmann, Peter. 2002. “Bootstraps for Time Series.” Statistical Science 17:52–72. https://doi.org/10.1214/ss/1023798998.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press.

Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” Annals of Statistics 7:1–26. https://doi.org/10.1214/aos/1176344552.

Kantz, Holger, and Thomas Schreiber. 2004. Nonlinear Time Series Analysis. Second. Cambridge, England: Cambridge University Press.

Künsch, Hans R. 1989. “The Jackknife and the Bootstrap for General Stationary Observations.” Annals of Statistics 17:1217–41. https://doi.org/10.1214/aos/1176347265.

Lahiri, S. N. 2003. Resampling Methods for Dependent Data. New York: Springer-Verlag.

Politis, Dimitris N., and Halbert White. 2004. “Automatic Block-Length Selection for the Dependent Bootstrap.” Econometric Reviews 23:53–70. https://doi.org/10.1081/ETC-120028836.