---
title: Bootstrap (with feeling this time)
author: 36-402, Section A
date: 9 February 2017
output: ioslides_presentation
---
```{r, include=FALSE}
# General set up
# Load packages we may use later
library(knitr)
# Set knitr options for knitting code into the report:
# - Save results so that code blocks aren't re-run unless code changes (cache),
# _or_ a relevant earlier code block changed (autodep), but don't re-run if the
# only thing that changed was the comments (cache.comments)
# - Don't clutter R output with messages or warnings (message, warning)
# This _will_ leave error messages showing up in the knitted report
opts_chunk$set(cache=TRUE, autodep=TRUE, cache.comments=FALSE,
               message=FALSE, warning=FALSE)
```
## The Big Picture
\[
\newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)}
\newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)}
\]
1. Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
2. The bootstrap principle: _approximate_ the sampling distribution by _simulating_ from a good model of the data, and treating the simulated data just like the real data
3. Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
4. Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
5. Stronger assumptions $\Rightarrow$ less uncertainty _if we're right_
## Statistical Uncertainty
Re-run the experiment (survey, census, ...) and get different data
$\therefore$ everything we calculate from data (estimates, test statistics, policies, ...) will change from trial to trial as well
This variability is (the source of) **statistical uncertainty**
Quantifying this is a way to be honest about what we actually know
## Measures of Uncertainty
Standard error = standard deviation of an estimator
(could equally well use median absolute deviation, etc.)
$p$-value = Probability we'd see a signal this big if there were just noise
Confidence region = All the parameter values we can't reject at low error
rates:
1. _Either_ the true parameter is in the confidence region
2. _or_ we are very unlucky
3. _or_ our model is wrong
etc., etc.
## The Sampling Distribution Is the Source of All Knowledge
Data $X \sim P_{X,\theta_0}$, for some true $\theta_0$
We calculate a statistic $T = \tau(X)$, so $T$ has some distribution $P_{T,\theta_0}$
If we knew $P_{T,\theta_0}$, we could calculate
- $\Var{T}$ (and so standard error)
- $\Expect{T}$ (and so bias)
- quantiles (and so confidence intervals or $p$-values), etc.
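To make this concrete, a toy illustration with made-up numbers: if we somehow knew $P_{T,\theta_0}$, say $T \sim N(2.5, 0.1^2)$ with true $\theta_0 = 2.5$, simulating from it would hand us each of these quantities.

```{r}
# Toy illustration (made-up numbers): draws from a known sampling
# distribution give the standard error, bias, and CI endpoints directly
theta0 <- 2.5
T.draws <- rnorm(1e5, mean=theta0, sd=0.1)
sd(T.draws)                          # standard error
mean(T.draws) - theta0               # bias (here ~ 0)
quantile(T.draws, c(0.025, 0.975))   # basis for a 95% CI
```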
## The Problems
Problem 1: Most of the time, $P_{X,\theta_0}$ is very complicated
Problem 2: Most of the time, $\tau$ is a very complicated function
Problem 3: We certainly don't know $\theta_0$
Upshot: We don't know $P_{T,\theta_0}$ and can't use it to calculate anything
## The Solutions
Classically ($\approx 1900$--$\approx 1975$): Restrict the model and the
statistic until you can calculate the sampling distribution, at least for very
large $n$
Modern ($\approx 1975$--): Use complex models and statistics, but simulate calculating the statistic on the model
## The Bootstrap Principle
1. Find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
2. Generate simulations $\tilde{X}$ from $\hat{P}$, setting $\tilde{T} = \tau(\tilde{X})$ for each
3. Use the simulated distribution of the $\tilde{T}$ to approximate $P_{T,\theta_0}$
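In code, the principle is just a loop; a minimal sketch, where `simulator` and `statistic` are placeholder names for a draw from $\hat{P}$ and for $\tau$, which the examples below instantiate:

```{r}
# Generic bootstrap skeleton: `simulator()` draws one X-tilde from
# P-hat, `statistic()` is tau; returns b simulated values of T-tilde
bootstrap <- function(simulator, statistic, b=1000) {
    replicate(b, statistic(simulator()))
}
```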
Refinements:
- improving the initial estimate $\hat{P}$
- reducing the number of simulations or speeding them up
- transforming $\tau$ so the final approximation is more stable
First step: find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
## Model-based Bootstrap
If we are using a model, our best guess at $P_{X,\theta_0}$ is
$P_{X,\hat{\theta}}$, with our best estimate $\hat{\theta}$ of the parameters
#### The Model-based Bootstrap
- Get data $X$, estimate $\hat{\theta}$ from $X$
- Repeat $b$ times:
    + Simulate $\tilde{X}$ from $P_{X,\hat{\theta}}$ (simulate data of the same size/"shape" as the real data)
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the simulated data the same as the real data)
- Use the empirical distribution of the $\tilde{T}$ as our approximation to $P_{T,\theta_0}$
## Example: Is Karakedi overweight?
```{r, fig.retina=NULL, out.width=600, echo=FALSE}
knitr::include_graphics("kara-as-snookums.png")
```
## Example (cont'd.)
```{r}
library(MASS); data(cats); summary(cats)
(q95.gaussian <- qnorm(0.95,mean=mean(cats$Bwt),sd=sd(cats$Bwt)))
```
## Example (cont'd.)
Simulate from the fitted Gaussian; bundle up estimating the 95th percentile into a function:
```{r}
# Simulate body weights from the Gaussian fitted to the real data
rcats.gaussian <- function() {
    rnorm(n=nrow(cats),
          mean=mean(cats$Bwt),
          sd=sd(cats$Bwt))
}
# Estimate the 95th percentile by fitting a Gaussian to the sample
est.q95.gaussian <- function(x) {
    m <- mean(x)
    s <- sd(x)
    return(qnorm(0.95, mean=m, sd=s))
}
```
## Example (cont'd.)
Simulate, plot the sampling distribution from the simulations
```{r}
sampling.dist.gaussian <- replicate(1000, est.q95.gaussian(rcats.gaussian()))
plot(hist(sampling.dist.gaussian,breaks=50, plot=FALSE), freq=FALSE)
lines(density(sampling.dist.gaussian),lwd=2)
abline(v=q95.gaussian,lty="dashed",lwd=4)
```
## Example (cont'd.)
Find the standard error and a crude confidence interval:
```{r}
sd(sampling.dist.gaussian)
quantile(sampling.dist.gaussian,c(0.025,0.975))
```
## Model Checking
As always, if the model isn't right, relying on the model is asking for trouble
Is the Gaussian a good model for cats' weights?
## Example (re-cont'd.)
Compare histogram to fitted Gaussian density and to a smooth density estimate
```{r, fig.align="center", echo=FALSE}
plot(hist(cats$Bwt,plot=FALSE),freq=FALSE, main="", xlab="Body weight (kg)")
curve(dnorm(x, mean=mean(cats$Bwt), sd=sd(cats$Bwt)),
      add=TRUE, col="purple")
lines(density(cats$Bwt),lty="dashed")
```
## Resampling
Problem: Suppose we don't have a trustworthy model
Resource: We do have data, which tells us a lot about the distribution
Solution: **Resampling**, treat the sample like a whole population
## The Resampling ("Nonparametric") Bootstrap
- Get data $X$, containing $n$ samples, find $T=\tau(X)$
- Repeat $b$ times:
    + Generate $\tilde{X}$ by drawing $n$ samples from $X$ _with replacement_ (resample the data)
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the resampled data the same as the real data)
- Use the empirical distribution of the $\tilde{T}$ as our approximation to $P_{T,\theta_0}$
## Example, Take 2
The model-free estimate of the 95th percentile is the 95th percentile of the data:
```{r}
(q95.np <- quantile(cats$Bwt,0.95))
```
How precise is that?
## Example, Take 2
Resampling, re-estimating, and finding the sampling distribution:
```{r}
# Draw a new sample of the same size from x, with replacement
resample <- function(x) {
    sample(x, size=length(x), replace=TRUE)
}
est.q95.np <- function(x) { quantile(x, 0.95) }
```
## Example, Take 2
```{r, fig.align="center"}
sampling.dist.np <- replicate(1000, est.q95.np(resample(cats$Bwt)))
plot(density(sampling.dist.np), main="", xlab="Body weight (kg)")
abline(v=q95.np,lty=2)
```
## Example, Take 2
Standard error, bias, and a crude CI:
```{r}
sd(sampling.dist.np)
mean(sampling.dist.np - q95.np)
quantile(sampling.dist.np,c(0.025,0.975))
```
## Bootstrapping Regressions
A regression is a model for $Y$ conditional on $X$
\[
Y = m(X) + \mathrm{noise}
\]
The model is silent about the distribution of $X$, so how do we simulate?
Options, putting less and less trust in the model:
- Hold $x_i$ fixed, set $\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}$ from the model's estimated noise distribution (e.g., Gaussian or $t$); see the sketch after this list
- Hold $x_i$ fixed, set $\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}$, noise _resampled from the residuals_
- Resample $(x_i, y_i)$ pairs (resample data-points or resample cases)
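A minimal sketch of the first option, assuming Gaussian noise with the fitted model's estimated standard deviation; the function and argument names here are illustrative:

```{r}
# Option 1 sketch: hold the x's fixed, add Gaussian noise with the
# model's estimated sd; `fit` is an lm object, `response` names the
# y column of the data frame df
sim.gaussian.noise <- function(fit, df, response) {
    new.df <- df
    new.df[[response]] <- fitted(fit) + rnorm(nrow(df), sd=sigma(fit))
    return(new.df)
}
```

For the regression fit below, this could be called as `sim.gaussian.noise(cats.lm, cats, "Hwt")`.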
## Cats' Hearts
`cats` has weights for cats' hearts, as well as bodies
![cute cat heart](2017-02-09-cat-heart.jpg)
[(Much cuter than any photo of real cats' hearts)](http://yaleheartstudy.org/site/wp-content/uploads/2012/03/cat-heart1.jpg)
How does heart weight relate to body weight?
(Useful if Kara's vet wants to know how much heart medicine to prescribe)
## Cats' Hearts
```{r, fig.align="center"}
plot(Hwt~Bwt, data=cats, xlab="Body weight (kg)", ylab="Heart weight (g)")
cats.lm <- lm(Hwt ~ Bwt, data=cats)
abline(cats.lm)
```
## Cats' Hearts
Coefficients and "standard" confidence intervals:
```{r}
coefficients(cats.lm)
confint(cats.lm)
```
## Cats' Hearts
The residuals don't look very Gaussian:
```{r, echo=FALSE, fig.align="center"}
par(mfrow=c(1,2))
plot(cats$Bwt, residuals(cats.lm), xlab="Body weight (kg)", ylab="Residual (g)")
plot(density(residuals(cats.lm)), main="", xlab="Residual (g)")
par(mfrow=c(1,1))
```
## Cats' Hearts
Resample residuals:
```{r}
# Simulate new heart weights: keep the x's, add resampled residuals
sim.cats.resids <- function() {
    new.cats <- cats
    noise <- resample(residuals(cats.lm))
    new.cats$Hwt <- fitted(cats.lm) + noise
    return(new.cats)
}
```
Re-estimate on new data:
```{r}
# Re-fit the model to a (simulated or resampled) data frame
coefs.cats.lm <- function(df) {
    fit <- lm(Hwt~Bwt, data=df)
    return(coefficients(fit))
}
```
## Cats' Hearts
Resample residuals repeatedly to get CIs:
```{r}
cats.lm.samp.dist.resids <- replicate(1000,
    coefs.cats.lm(sim.cats.resids()))
apply(cats.lm.samp.dist.resids, 1, quantile, c(0.025, 0.975))
```
## Cats' Hearts
Try resampling whole rows:
```{r}
# Resample whole rows of a data frame, with replacement
resample.data.frame <- function(df) {
    return(df[resample(1:nrow(df)),])
}
```
```{r}
cats.lm.samp.dist.cases <- replicate(1000,
    coefs.cats.lm(resample.data.frame(cats)))
apply(cats.lm.samp.dist.cases,1,quantile,c(0.025,0.975))
```
## Why Do This?
- Resampling residuals works as long as the noise is IID
    + Noise could be Gaussian...
    + Or it could be very non-Gaussian
- Resampling whole cases works as long as the observations are IID
    + Noise needn't be independent of $X$
    + Noise needn't be Gaussian
    + The linear model needn't be right
## Sources of Error in Bootstrapping
- **Simulation**: Using only $b$ bootstrap replicates
    + Make this small by letting $b\rightarrow\infty$
    + Costs computing time
    + Diminishing returns: error is generally $O(1/\sqrt{b})$
- **Approximation**: Using $\hat{P}$ instead of $P_{X,\theta_0}$
    + Make this small by careful statistical modeling
- **Estimation**: Only a finite number of samples
    + Make this small by being careful about what we simulate (example next slide)
Generally: for fixed $n$, the nonparametric bootstrap shows more uncertainty
than parametric bootstraps, but is less at risk from modeling mistakes
(yet another bias-variance tradeoff)
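As a rough check on the $O(1/\sqrt{b})$ rate, we can re-use `resample` and `est.q95.np` from the cats example: the run-to-run spread of the bootstrapped standard error at $b=1000$ should be roughly $1/\sqrt{10}$ of the spread at $b=100$.

```{r}
# Monte Carlo noise in the bootstrapped SE shrinks like 1/sqrt(b)
boot.se <- function(b) {
    sd(replicate(b, est.q95.np(resample(cats$Bwt))))
}
sd(replicate(20, boot.se(100)))    # noisier
sd(replicate(20, boot.se(1000)))   # roughly 1/sqrt(10) as large
```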
## Improving on the Crude Confidence Interval
The crude CI uses the distribution of $\tilde{\theta}$ under $\hat{\theta}$
But really we want the distribution of $\hat{\theta}$ under $\theta_0$
Mathematical observation: Generally speaking, the distribution of
$\tilde{\theta} - \hat{\theta}$ is closer to that of $\hat{\theta}-\theta_0$
than the distribution of $\tilde{\theta}$ is to that of $\hat{\theta}$
("centering" or "pivoting")
## The Basic, Pivotal CI
Let $q_{\alpha/2}$ and $q_{1-\alpha/2}$ be the $\alpha/2$ and $1-\alpha/2$ quantiles of $\tilde{\theta}$; then
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
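In code, for the nonparametric 95th-percentile example from earlier, the last line above translates directly (note that the upper and lower quantiles swap roles):

```{r}
# Basic/pivotal 95% CI: 2*theta-hat minus the (upper, lower) quantiles
# of the bootstrap distribution
2*q95.np - quantile(sampling.dist.np, c(0.975, 0.025))
```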