---
title: Simulation for Inference I --- The Bootstrap
author: 36-467/36-667
date: 6 November 2018
output: slidy_presentation
bibliography: locusts.bib
---

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}$

```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small", background="white",
               cache=TRUE, autodep=TRUE,
               tidy=FALSE, warning=FALSE, message=FALSE,
               echo=TRUE)
```

## In our last episode

- General strategy for simulating
- The Monte Carlo method: replace complicated probability calculations with repeated simulation
    + To find $\Expect{f(Y)}$, do $m$ simulations $Y_1, Y_2, \ldots Y_m$
    + Then $m^{-1}\sum_{1}^{m}{f(Y_i)} \approx \Expect{f(Y)}$
    + Also works for variances, probabilities, etc.

## Agenda

- The sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The **bootstrap principle**: _approximate_ the sampling distribution by _simulating_ from a good model, and treat each simulation run just like the data
- Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
- Stronger assumptions $\Rightarrow$ less uncertainty _if we're right_

## Statistical Uncertainty

- Re-run the experiment (survey, census, ...) and we'd get different data
- $\therefore$ everything we calculate from the data (estimates, test statistics, policies, ...) will change from trial to trial as well
- This variability is (the source of) **statistical uncertainty**
- Quantifying it is a way of being honest about what we actually know

## Measures of Uncertainty

- Standard error = standard deviation of an estimator
    + (could equally well use median absolute deviation, etc.)
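A standard error can itself be obtained by last lecture's Monte Carlo recipe, whenever we can simulate the estimator. A minimal sketch, in Python rather than the R used in these slides, with all numbers (`n`, `m`, `sigma`) invented for illustration:

```python
import random
import statistics

# Monte Carlo estimate of the standard error of the sample mean:
# simulate m data sets of n Gaussian draws, take the mean of each,
# and compute the standard deviation across simulations.
# (Toy example; n, m, sigma are made up.)
random.seed(467)
n, m, sigma = 100, 2000, 2.0
means = [statistics.fmean(random.gauss(0, sigma) for _ in range(n))
         for _ in range(m)]
se_mc = statistics.stdev(means)   # Monte Carlo standard error
se_exact = sigma / n ** 0.5       # exact answer: sigma/sqrt(n)
```

Here the exact answer is known, so the simulation is only a sanity check; the point of the bootstrap is that the same recipe keeps working when no formula exists.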
- $p$-value = probability we'd see a signal this big if there were just noise
- Confidence region = all the parameter values we can't reject at low error rates:
    1. _Either_ the true parameter is in the confidence region
    2. _or_ we are very unlucky
    3. _or_ our model is wrong
- etc., etc.

## The Sampling Distribution Is the Source of All Knowledge

- Data $X \sim P_{X,\theta_0}$, for some true $\theta_0$
- We calculate a statistic $T = \tau(X)$, so it has distribution $P_{T,\theta_0}$
- If we knew $P_{T,\theta_0}$, we could calculate
    + $\Var{T}$ (and so standard errors)
    + $\Expect{T}$ (and so biases)
    + quantiles (and so confidence intervals or $p$-values), etc.

## The Problems

1. Most of the time, $P_{X,\theta_0}$ is very complicated
2. Most of the time, $\tau$ is a very complicated function
3. We certainly don't know $\theta_0$

Upshot: We don't know $P_{T,\theta_0}$ and can't use it to calculate anything

## The Solutions

- Classically ($\approx 1900$--$\approx 1975$): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large $n$
    + This is the asymptotic theory we did in Lectures 14 and 15
- Modern ($\approx 1975$--): Use complex models and statistics, but _simulate_ calculating the statistic on the model

## The Bootstrap Principle

1. Find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
2. Generate a simulation $\tilde{X}$ from $\hat{P}$, set $\tilde{T} = \tau(\tilde{X})$
3. Use the simulated distribution of the $\tilde{T}$ to approximate $P_{T,\theta_0}$

Refinements:

- improving the initial estimate $\hat{P}$
- reducing the number of simulations or speeding them up
- transforming $\tau$ so the final approximation is more stable

First step: find a good estimate $\hat{P}$ for $P_{X,\theta_0}$

## Model-based Bootstrap

If we are using a model, our best guess at $P_{X,\theta_0}$ is $P_{X,\hat{\theta}}$, with our best estimate $\hat{\theta}$ of the parameters

#### The Model-based Bootstrap

- Get data $X$, estimate $\hat{\theta}$ from $X$
- Repeat $m$ times:
    + Simulate $\tilde{X}$ from $P_{X,\hat{\theta}}$ (simulate data of the same size/"shape" as the real data)
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the simulated data the same as the real data)
- Use the empirical distribution of $\tilde{T}$ as $P_{T,\theta_0}$

## Example: lynxes in Canada

![](http://farm4.static.flickr.com/3131/2295650288_a8f7a9a51b_m.jpg)

## Example: lynxes in Canada

- Annual time series of the number of lynxes trapped for the fur company in the Mackenzie River area of Canada:

```{r}
data(lynx)
plot(lynx, xlab="Year", ylab="Lynxes")
```

## Fitting an AR model to the example data

- This oscillates, so an AR(1) isn't appropriate --- maybe an AR(2)

```{r}
lynx.ar2 <- ar.ols(lynx, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
lynx.ar2$ar
lynx.ar2$x.intercept
```

so the estimated model is
$X(t) = `r lynx.ar2$x.intercept` + `r lynx.ar2$ar[1]` X(t-1) - `r abs(lynx.ar2$ar[2])` X(t-2) + \epsilon(t)$

## Simulating from the fitted model

- What do some time series from this look like?
    + (We wrote `ar2.sim` last time)

```{r, include=FALSE}
# Simulate a time series of length n from a Gaussian AR(2) model with given
# coefficients, noise level and initial condition
# Copied from lecture 17
# Inputs: length of output time series (n)
#   intercept of AR(2) (a)
#   vector of coefficients of AR(2) (b)
#   standard deviation of the innovations (sd.innovation)
#   initial two values of time series (x.start)
# Output: vector of length n
# Presumes: innovations should be Gaussian white noise
#   a and sd.innovation are length-1 numerical values
#   b and x.start are length-2 numerical vectors
#   b has coefficient on X(t-1) first
#   x.start has first value first
ar2.sim <- function(n, a, b, sd.innovation, x.start) {
    x <- vector(length=n)
    x[1] <- x.start[1] # Because R indexes from 1
    x[2] <- x.start[2]
    for (t in 1:(n-2)) {
        x[t+2] <- a + b[1]*x[t+1] + b[2]*x[t] + rnorm(1, sd=sd.innovation)
    }
    return(x)
}
```

```{r}
lynx.sims <- replicate(5, ar2.sim(n=length(lynx),
                                  a=lynx.ar2$x.intercept,
                                  b=lynx.ar2$ar,
                                  sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                                  x.start=lynx[1:2]))
plot(lynx, xlab="Year", ylab="Lynxes caught")
for (i in 1:ncol(lynx.sims)) {
    lines(1821:1934, lynx.sims[,i], col="grey")
}
```

## Uncertainty in the coefficients {.smaller}

- Write a function which estimates the coefficients

```{r}
# Estimate the coefficients of an AR(2) model
# Inputs: a time series (x)
# Outputs: vector containing the intercept and the 2 AR slopes
ar2.estimator <- function(x) {
    # Treat the input exactly the same way we treated the real data
    fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
    return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}
```

Check that this gives the right answer on the data, and a _different_ answer on a simulation:

```{r}
ar2.estimator(lynx)
ar2.estimator(lynx.sims[,1])
```

## Uncertainty in the coefficients {.smaller}

Now do _lots_ of simulations, and apply this function to each simulation:

```{r}
lynx.coef.sims <- replicate(1000,
    ar2.estimator(ar2.sim(n=length(lynx),
                          a=lynx.ar2$x.intercept,
                          b=lynx.ar2$ar,
                          sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                          x.start=lynx[1:2])))
dim(lynx.coef.sims)
lynx.coef.sims[1:3, 1:5]
```

## Uncertainty in the coefficients {.smaller}

Standard errors $\approx$ standard deviations across simulations:

```{r}
apply(lynx.coef.sims, 1, sd)
```

95% confidence intervals $\approx$ quantiles across simulations:

```{r}
apply(lynx.coef.sims, 1, quantile, prob=c(0.025, 0.975))
```

## For Gaussian AR$(p)$ models...

- ... the bootstrapping isn't needed, because there's an asymptotic theory which tells us about standard errors and confidence intervals
    + Like what you did for the AR(1) in homework 8, only with even more book-keeping
- But we can use this approach for _any_ generative time-series model
    + E.g., make $X(t)$ have a Poisson distribution rather than a Gaussian
    + All we have to do is swap out the simulation

## Prediction

- You can apply these ideas when the "statistic" is $X(t)$ for a fixed $t$
- Or when it's $X(t) | X(t-1) = x$ for a fixed $x$
    + Since the model tells us how to calculate that!
- Expected values across simulations = best guess for the point prediction
- Variances and intervals incorporate modeling uncertainty

## Resampling

- Problem: Suppose we don't have a trustworthy model
- Resource: We do have data, which tells us a lot about the distribution
- Solution: **Resampling**, treat the sample like a whole population

## The Resampling ("Nonparametric") Bootstrap

- Get data $X$, containing $n$ samples, find $T = \tau(X)$
- Fix a **block length** $k$ and divide $X$ into (overlapping) blocks of length $k$
    + So the first block is $(X(1), \ldots X(k))$, the second block is $(X(2), \ldots X(k+1))$, etc.
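The overlapping blocks can be sketched in a couple of lines; Python here purely for brevity (the slides do the same job in R with `design.matrix.from.ts` below):

```python
# Illustration only: all overlapping blocks of length k from a series x,
# i.e. (x[0..k-1]), (x[1..k]), ..., matching the definition above.
def blocks(x, k):
    return [x[i:i + k] for i in range(len(x) - k + 1)]

# A length-n series yields n - k + 1 overlapping blocks:
blocks([1, 2, 3, 4, 5], 3)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```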
- Repeat $m$ times:
    + Generate $\tilde{X}$ by drawing $n/k$ blocks from $X$ _with replacement_ (resample the data)
    + Concatenate the blocks to make $\tilde{X}$
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the simulated data the same as the real data)
- Use the empirical distribution of $\tilde{T}$ as $P_{T,\theta_0}$

## Example: Back to the lynxes {.smaller}

```{r, include=FALSE}
# Source the PPW block-length estimator
source("http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R")
# This estimates a good block-length for two different time-series bootstraps,
# one with randomized block lengths (the "stationary bootstrap"), and one
# with fixed block lengths (the "circular bootstrap"); we're using the
# second one
```

```{r}
# Convert a time series into a data frame of lagged values
# Inputs: a time series (ts)
#   maximum lag to use (order)
#   whether older values go on the right or the left (right.older)
# Output: a data frame with (order+1) columns, named lag0, lag1, ..., and
#   length(ts)-order rows
design.matrix.from.ts <- function(ts, order, right.older=TRUE) {
    n <- length(ts)
    x <- ts[(order+1):n]
    for (lag in 1:order) {
        if (right.older) {
            x <- cbind(x, ts[(order+1-lag):(n-lag)])
        } else {
            x <- cbind(ts[(order+1-lag):(n-lag)], x)
        }
    }
    lag.names <- c("lag0", paste("lag", 1:order, sep=""))
    if (right.older) {
        colnames(x) <- lag.names
    } else {
        colnames(x) <- rev(lag.names)
    }
    return(as.data.frame(x))
}
head(lynx)
head(design.matrix.from.ts(lynx, 2))
```

## Example: Back to the lynxes

```{r}
# Simple block bootstrap
# Inputs: time series (ts)
#   block length (block.length)
#   length of output (len.out)
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts, block.length, len.out=length(ts)) {
    # Chop up ts into blocks
    the.blocks <- as.matrix(design.matrix.from.ts(ts, block.length-1,
                                                  right.older=FALSE))
    # Look carefully at design.matrix.from.ts to see why we need the -1
    # How many blocks is that?
    blocks.in.ts <- nrow(the.blocks)
    # Sanity-check
    stopifnot(blocks.in.ts == length(ts) - block.length + 1)
    # How many blocks will we need (round up)?
    blocks.needed <- ceiling(len.out/block.length)
    # Sample blocks with replacement
    picked.blocks <- sample(1:blocks.in.ts, size=blocks.needed, replace=TRUE)
    # Put the blocks in the randomly-selected order
    x <- the.blocks[picked.blocks,]
    # Convert from a matrix to a vector and return
    # We need the transpose because R goes from vectors to matrices and back
    # column by column, not row by row
    x.vec <- as.vector(t(x))
    # Silently discard unneeded extra observations at the end
    return(x.vec[1:len.out])
}
```

```{r}
head(lynx)
head(rblockboot(lynx, 3))
```

## Resampling the lynxes

- We can do everything the same as before, just changing how we simulate

```{r}
lynx.coef.sims.resamp <- replicate(1000, ar2.estimator(rblockboot(lynx, 3)))
```

- Standard errors:

```{r}
apply(lynx.coef.sims.resamp, 1, sd)
```

- 95% confidence intervals:

```{r}
apply(lynx.coef.sims.resamp, 1, quantile, prob=c(0.025, 0.975))
```

(Anything funny here?)

## Why should this work?

- Preserves the distribution within blocks
- Independence across blocks
- So $\tilde{X} \approx$ stationary, with the same short-range distribution as $X$ but no longer-range dependence
- Should work well if $X$ is stationary without long-range dependence
- Want to let $k$ grow with $n$, but still have a growing number of blocks, so $n/k_n \rightarrow \infty$, or $k = o(n)$
- Some theory suggests $k \propto n^{1/3}$
    + @PolitisWhite2004 provides an automatic procedure for picking $k$
    + R implementation at <http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R>
    + ...which suggests a block length of 3 for the lynxes

## What about non-stationary time series?

- Model-based: OK if the model is non-stationary
- Otherwise:
    + Estimate the trend
    + Bootstrap the deviations from the trend (by model or resampling)
    + Add these back to the trend
    + $\tilde{X} =$ trend + bootstrapped deviations

## What about spatial data?
- Model-based: simulate the model
- Resampling: use spatial blocks (e.g., rectangles)

## Sources of Error in Bootstrapping

- **Simulation**: Using only $m$ bootstrap replicates
    + Make this small by letting $m \rightarrow \infty$
    + Costs computing time
    + Diminishing returns: error is generally $O(1/\sqrt{m})$
- **Approximation**: Using $\hat{P}$ instead of $P_{X,\theta_0}$
    + Make this small by careful statistical modeling
- **Estimation**: Only having a finite number of samples
    + Make this small by being careful about what we simulate (example in the backup slides)

Generally: for fixed $n$, the resampling bootstrap shows more uncertainty than model-based bootstraps, but is less at risk from modeling mistakes (yet another bias-variance tradeoff)

## Summary

- Bootstrapping is applying the Monte Carlo idea to the sampling distribution
    + Find a good approximation to the data-generating distribution
    + Simulate it many times
    + Treat each simulation run just like the original data
    + Distribution over simulations $\approx$ sampling distribution
- Choice of how to simulate:
    + Based on a model
    + By resampling blocks
- Many, many refinements, but this is the core of it

## Backup: The "surrogate data" method

- You think that $X$ comes from some interesting class of processes $\mathcal{H}_1$
- Your skeptical friend thinks maybe it's just some boring process in $\mathcal{H}_0$
- Fit a model to $X$ using only the boring processes in $\mathcal{H}_0$, say $\hat{Q}$
- Now simulate $\hat{Q}$ to get many "surrogate data" sets $\tilde{X}$
- Look at the distribution of $\tau(\tilde{X})$ over your simulations
- If $\tau(X)$ is really unlikely under that simulation, it's evidence against $\mathcal{H}_0$
    + When would it be evidence _for_ $\mathcal{H}_1$?
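A toy illustration of the scheme (invented for these notes, in Python rather than R): take $\mathcal{H}_0$ to be "IID noise", use random shuffles of the data as the surrogates (shuffling simulates from the fitted IID model while keeping the marginal distribution), and take $\tau$ to be the lag-1 autocorrelation:

```python
import random

# Toy surrogate-data test (illustration only). H0: the series is IID noise.
# Surrogates: random shuffles of the data, which preserve the marginal
# distribution but destroy any dependence. Statistic tau: lag-1 autocorrelation.

def lag1_autocorr(x):
    n = len(x)
    mu = sum(x) / n
    num = sum((x[t] - mu) * (x[t + 1] - mu) for t in range(n - 1))
    den = sum((v - mu) ** 2 for v in x)
    return num / den

random.seed(36467)
# Data that is NOT boring: an AR(1) with coefficient 0.9
x = [0.0]
for _ in range(199):
    x.append(0.9 * x[-1] + random.gauss(0, 1))

t_obs = lag1_autocorr(x)
surrogates = []
for _ in range(500):
    y = x[:]
    random.shuffle(y)
    surrogates.append(lag1_autocorr(y))
# Fraction of surrogates at least as autocorrelated as the data;
# a tiny fraction is evidence against H0
p = sum(t >= t_obs for t in surrogates) / len(surrogates)
```

For serious use one would fit richer null models than IID noise; this only shows the shape of the computation.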
- The term "surrogate data" comes out of physics [@Kantz-Schreiber-2nd]
    + testing the null hypothesis $\mathcal{H}_0$ against the alternative $\mathcal{H}_1$

## Backup: Incorporating uncertainty in the initial conditions

- Simulating a dynamical model needs an initial value $X(0)$ (or $X(1)$, or maybe $X(1), X(2)$, etc.)
- What if this isn't known precisely?
- Easy case: it follows a probability distribution
    + Then draw $X(0)$ from that distribution independently on each run
    + Aggregate across runs as usual
- Less easy: $X(0)$ is within known limits but has no distribution otherwise
    + My advice: draw $X(0)$ uniformly from inside the region
    + Rationale 1: Maximum entropy distribution compatible with the constraints
    + Rationale 2: If the process is rapidly mixing, memory of the distribution of $X(0)$ decays quickly anyway

## Backup: Improving on the Crude Confidence Interval

Crude CIs use the distribution of $\tilde{\theta}$ under $\hat{\theta}$

But really we want the distribution of $\hat{\theta}$ under $\theta_0$

Mathematical observation: Generally speaking, the distribution of $\tilde{\theta} - \hat{\theta}$ is closer to that of $\hat{\theta} - \theta_0$ than the distribution of $\tilde{\theta}$ is to that of $\hat{\theta}$ ("centering" or "pivoting")

## Backup: The Basic, Pivotal CI

quantiles of $\tilde{\theta}$ = $q_{\alpha/2}, q_{1-\alpha/2}$

$\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0} \\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0} \\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta} - q_{\alpha/2}}{\theta_0}
\end{eqnarray*}$

Basically: re-center the simulations around the empirical data

```{r}
2*ar2.estimator(lynx) - apply(lynx.coef.sims.resamp, 1, quantile, prob=0.975)
2*ar2.estimator(lynx) - apply(lynx.coef.sims.resamp, 1, quantile, prob=0.025)
```

## Backup: Further reading

- The bootstrap begins with @Efron-on-boostrap-in-annals (drawing on an older idea from the 1940s called the "jackknife")
- The block bootstrap for time series comes from @Kunsch-bootstrap-for-stationary-obs
- @Buhlmann-on-bootstraps-for-time-series is a good review of time-series bootstraps
- @Davison-Hinkley-bootstrap is the best over-all textbook on the bootstrap
- @Lahiri-resampling-for-dependent is _very_ theoretical

## References