36-467/36-667

6 November 2018

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]

- General strategy for simulating
- The Monte Carlo method: replace complicated probability calculations with repeated simulation
- To find \(\Expect{f(Y)}\), do \(m\) simulations \(Y_1, Y_2, \ldots, Y_m\)
- Then \(m^{-1}\sum_{1}^{m}{f(Y_i)} \approx \Expect{f(Y)}\)
- Also works for variances, probabilities, etc.
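As a minimal sketch in R (the Gaussian \(Y\) and the function \(f(y) = y^2\) are stand-ins for whatever distribution and function are actually of interest):

```r
# Monte Carlo estimate of E[f(Y)]: average f over m simulated draws of Y
# (rnorm is a stand-in for whatever distribution Y actually follows)
set.seed(36467)
m <- 1e5
Y <- rnorm(m, mean=1, sd=2)  # m independent simulations of Y
f <- function(y) { y^2 }     # the function whose expectation we want
mean(f(Y))                   # approximates E[Y^2] = Var(Y) + (E[Y])^2 = 5
```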

- The sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The **bootstrap principle**: *approximate* the sampling distribution by *simulating* from a good model, and treating the simulation run just like the data
- Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
- Stronger assumptions \(\Rightarrow\) less uncertainty *if we’re right*

Re-run the experiment (survey, census, …) and get different data

\(\therefore\) everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well

This variability is (the source of) **statistical uncertainty**

Quantifying this is a way to be honest about what we actually know

- Standard error = standard deviation of an estimator
- (could equally well use median absolute deviation, etc.)

- \(p\)-value = Probability we’d see a signal this big if there were just noise
- Confidence region = All the parameter values we can’t reject at low error rates:

*Either* the true parameter is in the confidence region, *or* we are very unlucky, *or* our model is wrong

- etc., etc.

Data \(X \sim P_{X,\theta_0}\), for some true \(\theta_0\)

We calculate a statistic \(T = \tau(X)\) so it has distribution \(P_{T,\theta_0}\)

- If we knew \(P_{T,\theta_0}\), we could calculate
- \(\Var{T}\) (and so standard error)
- \(\Expect{T}\) (and so bias)
- quantiles (and so confidence intervals or \(p\)-values), etc.

- Most of the time, \(P_{X,\theta_0}\) is very complicated
- Most of the time, \(\tau\) is a very complicated function
- We certainly don’t know \(\theta_0\)

Upshot: We don’t know \(P_{T,\theta_0}\) and can’t use it to calculate anything

- Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)
- This is the asymptotic theory we did in Lectures 14 and 15

- Modern (\(\approx 1975\)–): Use complex models and statistics, but *simulate* calculating the statistic on the model

- Find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
- Generate a simulation \(\tilde{X}\) from \(\hat{P}\), set \(\tilde{T} = \tau(\tilde{X})\)
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T,\theta_0}\)

Refinements:

- improving the initial estimate \(\hat{P}\)
- reducing the number of simulations or speeding them up
- transforming \(\tau\) so the final approximation is more stable

First step: find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)

If we are using a model, our best guess at \(P_{X,\theta_0}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters

- Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
- Repeat \(m\) times:
- Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/“shape” as real data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)

- Use empirical distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)

- Annual time series of number of lynxes trapped for the fur company in the Mackenzie River area of Canada:

- This oscillates, so an AR(1) isn’t appropriate — maybe an AR(2)

```
## , , 1
##
## [,1]
## [1,] 1.152423
## [2,] -0.606229
```

`## [1] 710.1056`

so the estimated model is \[ X(t) = 710.106 + 1.152 X(t-1) - 0.606 X(t-2) + \epsilon(t) \]

- What do some time series from this look like?
- (We wrote `ar2.sim` last time)

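`ar2.sim` itself isn’t reproduced on these slides; a minimal version consistent with how it’s called below (same argument names; the handling of starting values is an assumption) might be:

```r
# Simulate an AR(2) process:
#   X(t) = a + b[1] X(t-1) + b[2] X(t-2) + epsilon(t)
# Inputs: series length (n), intercept (a), vector of 2 AR slopes (b),
#   innovation standard deviation, vector of 2 starting values
# Output: a vector of length n
ar2.sim <- function(n, a, b, sd.innovation, x.start) {
  x <- numeric(n)
  x[1:2] <- x.start
  for (t in 3:n) {
    x[t] <- a + b[1]*x[t-1] + b[2]*x[t-2] + rnorm(1, sd=sd.innovation)
  }
  return(x)
}
```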
- Write a function which estimates the coefficients

```
# Estimate the coefficients of an AR(2) model
# Inputs: a time series (x)
# Outputs: vector containing intercept and 2 AR slopes
ar2.estimator <- function(x) {
  # Treat the input exactly the same way we treated the real data
  fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
  return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}
```

Check that this gives the right answer on the data, and a *different* answer on a simulation:

```
## a b1 b2
## 710.105589 1.152423 -0.606229
```

```
## a b1 b2
## 762.6063220 1.1102328 -0.5743644
```

Now do *lots* of simulations, and apply this function to each simulation:

```
lynx.coef.sims <- replicate(1000,
    ar2.estimator(ar2.sim(n=length(lynx),
                          a=lynx.ar2$x.intercept,
                          b=lynx.ar2$ar,
                          sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                          x.start=lynx[1:2])))
dim(lynx.coef.sims)
```

`## [1] 3 1000`

```
## [,1] [,2] [,3] [,4] [,5]
## a 830.8435812 724.7291944 598.4342755 761.794216 885.9718640
## b1 1.1771914 1.1892204 1.2535834 1.109654 0.9910510
## b2 -0.7153194 -0.6591029 -0.6459415 -0.578512 -0.5180509
```

Standard errors \(\approx\) standard deviation across simulations:

```
## a b1 b2
## 118.50209000 0.07622429 0.07498753
```

95% confidence intervals \(\approx\) quantiles across simulations:

```
## a b1 b2
## 2.5% 508.0321 0.9839003 -0.7473846
## 97.5% 963.5467 1.2820643 -0.4422757
```
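The standard errors and intervals above were presumably computed by applying `sd` and `quantile` across the rows of `lynx.coef.sims`; the pattern, shown here on random stand-in data, is:

```r
# Standard errors and crude CIs from a (coefficients x simulations) matrix,
# one row per coefficient (random stand-in data, just to show the pattern)
set.seed(36467)
coef.sims <- matrix(rnorm(3*1000), nrow=3,
                    dimnames=list(c("a","b1","b2"), NULL))
apply(coef.sims, 1, sd)                               # standard errors
apply(coef.sims, 1, quantile, probs=c(0.025, 0.975))  # 95% confidence limits
```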

- … the bootstrapping isn’t needed, because there’s an asymptotic theory which tells us about standard errors and confidence intervals
- Like what you did for the AR(1) in homework 8, only with even more book-keeping

- But we can use this approach for *any* generative time-series model
- E.g., make \(X(t)\) have a Poisson distribution rather than a Gaussian
- All we have to do is swap out the simulation

- You can apply these ideas when the “statistic” is \(X(t)\) for a fixed \(t\)
- Or when it’s \(X(t) | X(t-1) = x\) for a fixed \(x\)
- Since the model tells us how to calculate that!

- Expected values across simulations = best guess for the point prediction
- Variances and intervals incorporate modeling uncertainty

- Problem: Suppose we don’t have a trustworthy model
- Resource: We do have data, which tells us a lot about the distribution
- Solution: **Resampling**, treating the sample like a whole population

- Get data \(X\), containing \(n\) samples, find \(T=\tau(X)\)
- Fix a **block length** \(k\) and divide \(X\) into (overlapping) blocks of length \(k\)
  - So first block is \((X(1), \ldots X(k))\), second block is \((X(2), \ldots X(k+1))\), etc.
- Repeat \(m\) times:
  - Generate \(\tilde{X}\) by drawing \(n/k\) blocks from \(X\) *with replacement* (resample the data)
  - Concatenate the blocks to make \(\tilde{X}\)
  - Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use empirical distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)

```
# Convert a time series into a data frame of lagged values
# Input: a time series, maximum lag to use, whether older values go on the right
# or the left
# Output: a data frame with (order+1) columns, named lag0, lag1, ... , and
# length(ts)-order rows
design.matrix.from.ts <- function(ts, order, right.older=TRUE) {
  n <- length(ts)
  x <- ts[(order+1):n]
  for (lag in 1:order) {
    if (right.older) {
      x <- cbind(x, ts[(order+1-lag):(n-lag)])
    } else {
      x <- cbind(ts[(order+1-lag):(n-lag)], x)
    }
  }
  lag.names <- c("lag0", paste("lag", 1:order, sep=""))
  if (right.older) {
    colnames(x) <- lag.names
  } else {
    colnames(x) <- rev(lag.names)
  }
  return(as.data.frame(x))
}
}
head(lynx)
```

`## [1] 269 321 585 871 1475 2821`

```
## lag0 lag1 lag2
## 1 585 321 269
## 2 871 585 321
## 3 1475 871 585
## 4 2821 1475 871
## 5 3928 2821 1475
## 6 5943 3928 2821
```

```
# Simple block bootstrap
# Inputs: time series (ts), block length, length of output
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts, block.length, len.out=length(ts)) {
  # chop up ts into blocks
  the.blocks <- as.matrix(design.matrix.from.ts(ts, block.length-1,
                                                right.older=FALSE))
  # look carefully at design.matrix.from.ts to see why we need the -1
  # How many blocks is that?
  blocks.in.ts <- nrow(the.blocks)
  # Sanity-check
  stopifnot(blocks.in.ts == length(ts) - block.length + 1)
  # How many blocks will we need (round up)?
  blocks.needed <- ceiling(len.out/block.length)
  # Sample blocks with replacement
  picked.blocks <- sample(1:blocks.in.ts, size=blocks.needed, replace=TRUE)
  # put the blocks in the randomly-selected order
  x <- the.blocks[picked.blocks,]
  # convert from a matrix to a vector and return
  # need transpose because R goes from vectors to matrices and back column by
  # column, not row by row
  x.vec <- as.vector(t(x))
  # Silently discard unneeded extra observations at the end
  return(x.vec[1:len.out])
}
```

`## [1] 269 321 585 871 1475 2821`

`## [1] 1676 2251 1426 473 358 784`

- We can do everything the same as before, just changing how we simulate
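Concretely, the only change is which simulator feeds `ar2.estimator`. A self-contained sketch (with a condensed block resampler, equivalent in spirit to `rblockboot` above, and an illustrative block length of 8):

```r
# Resampling bootstrap for the lynx AR(2) coefficients: same recipe as the
# model-based version, but simulated series come from block-resampling the data
# (the condensed resampler and block.length=8 are illustrative choices)
block.resample <- function(ts, block.length, len.out=length(ts)) {
  n <- length(ts)
  starts <- sample(1:(n-block.length+1),          # random block start points
                   size=ceiling(len.out/block.length), replace=TRUE)
  x <- unlist(lapply(starts, function(s) ts[s:(s+block.length-1)]))
  return(x[1:len.out])                            # silently drop any excess
}
ar2.estimator <- function(x) {                    # as defined earlier
  fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
  return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}
set.seed(36467)
lynx.coef.boots <- replicate(1000,
  ar2.estimator(block.resample(lynx, block.length=8)))
apply(lynx.coef.boots, 1, sd)                              # standard errors
apply(lynx.coef.boots, 1, quantile, probs=c(0.025, 0.975)) # 95% CIs
```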

- Standard errors:

```
## a b1 b2
## 198.36971077 0.09812245 0.07406839
```

- 95% confidence intervals:

```
## a b1 b2
## 2.5% 598.7163 0.3864529 -0.35398540
## 97.5% 1365.3888 0.7632487 -0.06106865
```

(Anything funny here?)

- Preserves the distribution within blocks
- Independence across blocks
- So \(\tilde{X} \approx\) stationary, with the same short-range distribution as \(X\) but no longer-range dependence
- Should work well if \(X\) is stationary without long-range dependence
- Want to let \(k\) grow with \(n\), but still have a growing number of blocks, so \(n/k_n \rightarrow \infty\), or \(k=o(n)\)
- Some theory suggests \(k \propto n^{1/3}\)
- Politis and White (2004) provides an automatic procedure for picking \(k\)
- R implementation at <http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R>
- …which suggests a block-length of 3 for the lynxes
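For the lynx series (\(n = 114\)), the \(n^{1/3}\) rule of thumb gives a block length of the same order:

```r
n <- length(lynx)  # 114 annual observations
n^(1/3)            # about 4.85, so blocks of 4 or 5; Politis-White picks 3
```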

- Model-based: OK if the model is non-stationary
- Otherwise:
- Estimate the trend
- Bootstrap the deviations from the trend (by model or resampling)
- Add these back to the trend
- \(\tilde{X} =\) trend + bootstrapped deviations

- Model-based: simulate the model
- Resampling: use spatial blocks (e.g., rectangles)

- **Simulation**: using only \(m\) bootstrap replicates
  - Make this small by letting \(m\rightarrow\infty\)
  - Costs computing time
  - Diminishing returns: error is generally \(O(1/\sqrt{m})\)
- **Approximation**: using \(\hat{P}\) instead of \(P_{X,\theta_0}\)
  - Make this small by careful statistical modeling
- **Estimation**: only a finite number of samples
  - Make this small by being careful about what we simulate (example in backup slides)

Generally: for fixed \(n\), the resampling bootstrap shows more uncertainty than model-based bootstraps, but is less vulnerable to modeling mistakes (yet another bias-variance tradeoff)

- Bootstrapping is applying the Monte Carlo idea to the sampling distribution
- Find a good approximation to the data-generating distribution
- Simulate it many times
- Treat each simulation run just like the original data
- Distribution over simulations \(\approx\) sampling distribution

- Choice of how to simulate:
- Based on a model
- By resampling blocks

- Many, many refinements, but this is the core of it

- You think that \(X\) comes from some interesting class of processes \(\mathcal{H}_1\)
- Your skeptical friend thinks maybe it’s just some boring process in \(\mathcal{H}_0\)
- Fit a model to \(X\) using only the boring processes in \(\mathcal{H}_0\), say \(\hat{Q}\)
- Now simulate \(\hat{Q}\) to get many “surrogate data” sets \(\tilde{X}\)
- Look at the distribution of \(\tau(\tilde{X})\) over your simulations
- If \(\tau(X)\) is really unlikely under that simulation, it’s evidence against \(\mathcal{H}_0\)
- When would it be evidence *for* \(\mathcal{H}_1\)?
- The term “surrogate data” comes out of physics (Kantz and Schreiber 2004), where this recipe is used for testing the null hypothesis \(\mathcal{H}_0\) against the alternative \(\mathcal{H}_1\)
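A sketch of the recipe, with an AR(1) standing in for the boring class \(\mathcal{H}_0\) and the lag-2 autocorrelation as a hypothetical test statistic \(\tau\) (both choices are illustrative, not from the slides):

```r
# Surrogate-data test sketch: fit a "boring" AR(1) null to the data, simulate
# surrogates from it, and see where the data's statistic falls
# (AR(1) null and lag-2 autocorrelation statistic are illustrative choices)
set.seed(36467)
x <- as.numeric(lynx)
tau <- function(x) { acf(x, lag.max=2, plot=FALSE)$acf[3] }  # lag-2 autocorr.
fit0 <- ar.ols(x, order.max=1, aic=FALSE)  # Q-hat: the fitted boring model
surrogate <- function() {
  as.numeric(arima.sim(n=length(x), model=list(ar=fit0$ar[1]),
                       sd=sd(fit0$resid, na.rm=TRUE))) + mean(x)
}
tau.tilde <- replicate(1000, tau(surrogate()))  # distribution of tau under H0
# p-value: how often are surrogates at least as extreme as the data?
mean(abs(tau.tilde) >= abs(tau(x)))
```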

- Simulating a dynamical model needs an initial value \(X(0)\) (or \(X(1)\), or maybe \(X(1), X(2)\), etc.)
- What if this isn’t known precisely?
- Easy case: it follows a probability distribution
- Then draw \(X(0)\) from that distribution independently on each run
- Aggregate across runs as usual

- Less easy: \(X(0)\) within known limits but no distribution otherwise
- My advice: draw \(X(0)\) uniformly from inside the region
- Rationale 1: Maximum entropy distribution compatible with constraints
- Rationale 2: If the process is rapidly mixing, memory of the distribution of \(X(0)\) decays quickly anyway
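A toy version of the easy-plus-mixing situation (the AR(1) dynamics and the limits \([0,10]\) are purely illustrative):

```r
# Draw X(0) uniformly inside the known limits on each run, then aggregate
# across runs as usual (toy AR(1) dynamics and limits are illustrative)
set.seed(36467)
lo <- 0; hi <- 10
one.run <- function(steps=50) {
  x <- runif(1, min=lo, max=hi)        # uncertain initial condition
  for (t in 1:steps) { x <- 0.5*x + rnorm(1) }
  return(x)                            # state after `steps` iterations
}
final.states <- replicate(1000, one.run())
mean(final.states)  # rapid mixing: near the stationary mean of 0,
                    # memory of X(0)'s distribution has decayed
```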

Crude CIs use the distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)

But really we want the distribution of \(\hat{\theta}\) under \(\theta\)

Mathematical observation: Generally speaking, the distribution of \(\tilde{\theta} - \hat{\theta}\) is closer to that of \(\hat{\theta}-\theta_0\) than the distribution of \(\tilde{\theta}\) is to that of \(\hat{\theta}\)

(“centering” or “pivoting”)

Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the corresponding quantiles of \(\tilde{\theta}\); then

\[ \begin{eqnarray*} 1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\ & = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\ & \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\ & = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\ & = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0} \end{eqnarray*} \]

Basically: re-center the simulations around the empirical data
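In R, with `theta.hat` the estimate from the data and `theta.tilde` a vector of bootstrap re-estimates (stand-in names and numbers), the last line of the derivation becomes:

```r
# Pivoted ("basic") bootstrap CI: flip the bootstrap quantiles around
# 2*theta.hat (theta.hat and theta.tilde are stand-ins to show the formula)
set.seed(36467)
theta.hat <- 5
theta.tilde <- rnorm(1000, mean=5.2, sd=0.5)  # pretend bootstrap re-estimates
q <- quantile(theta.tilde, probs=c(0.025, 0.975))
pivot.ci <- c(lower = 2*theta.hat - unname(q[2]),
              upper = 2*theta.hat - unname(q[1]))
pivot.ci
```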

```
## a b1 b2
## 54.822397 1.541596 -1.151389
```

```
## a b1 b2
## 821.4949207 1.9183923 -0.8584727
```

- Bootstrap begins with Efron (1979) (drawing on an older idea from the 1940s called the “jackknife”)
- Block bootstrap for time series comes from Künsch (1989)
- Bühlmann (2002) is a good review on time series bootstraps
- Davison and Hinkley (1997) is the best over-all textbook on the bootstrap
- Lahiri (2003) is *very* theoretical

Bühlmann, Peter. 2002. “Bootstraps for Time Series.” *Statistical Science* 17:52–72. https://doi.org/10.1214/ss/1023798998.

Davison, A. C., and D. V. Hinkley. 1997. *Bootstrap Methods and Their Applications*. Cambridge, England: Cambridge University Press.

Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” *Annals of Statistics* 7:1–26. https://doi.org/10.1214/aos/1176344552.

Kantz, Holger, and Thomas Schreiber. 2004. *Nonlinear Time Series Analysis*. Second. Cambridge, England: Cambridge University Press.

Künsch, Hans R. 1989. “The Jackknife and the Bootstrap for General Stationary Observations.” *Annals of Statistics* 17:1217–41. https://doi.org/10.1214/aos/1176347265.

Lahiri, S. N. 2003. *Resampling Methods for Dependent Data*. New York: Springer-Verlag.

Politis, Dimitris N., and Halbert White. 2004. “Automatic Block-Length Selection for the Dependent Bootstrap.” *Econometric Reviews* 23:53–70. https://doi.org/10.1081/ETC-120028836.