---
title: Bootstrap (with feeling this time)
author: 36-402, Section A
date: 9 February 2017
output: ioslides_presentation
---
```{r, include=FALSE}
# General set up
# Load packages we may use later
library(knitr)
# Set knitr options for knitting code into the report:
# - Save results so that code blocks aren't re-run unless code changes (cache),
# _or_ a relevant earlier code block changed (autodep), but don't re-run if the
# only thing that changed was the comments (cache.comments)
# - Don't clutter R output with messages or warnings (message, warning)
# This _will_ leave error messages showing up in the knitted report
opts_chunk$set(cache=TRUE, autodep=TRUE, cache.comments=FALSE,
               message=FALSE, warning=FALSE)
```
## The Big Picture
\[
\newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)}
\newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)}
\]
1. Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
2. The bootstrap principle: _approximate_ the sampling distribution by _simulating_ from a good model of the data, and treating the simulated data just like the real data
3. Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
4. Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
5. Stronger assumptions $\Rightarrow$ less uncertainty _if we're right_
## Statistical Uncertainty
Re-run the experiment (survey, census, ...) and get different data
$\therefore$ everything we calculate from data (estimates, test statistics, policies, ...) will change from trial to trial as well
This variability is (the source of) **statistical uncertainty**
Quantifying this is a way to be honest about what we actually know
## Measures of Uncertainty
Standard error = standard deviation of an estimator
(could equally well use median absolute deviation, etc.)
$p$-value = Probability we'd see a signal this big if there were just noise
Confidence region = All the parameter values we can't reject at low error
rates:
1. _Either_ the true parameter is in the confidence region
2. _or_ we are very unlucky
3. _or_ our model is wrong
etc., etc.
## The Sampling Distribution Is the Source of All Knowledge
Data $X \sim P_{X,\theta_0}$, for some true $\theta_0$
We calculate a statistic $T = \tau(X)$, so $T$ has some distribution $P_{T,\theta_0}$
If we knew $P_{T,\theta_0}$, we could calculate
- $\Var{T}$ (and so standard error)
- $\Expect{T}$ (and so bias)
- quantiles (and so confidence intervals or $p$-values), etc.
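To make this concrete, a toy illustration with made-up numbers: if we somehow knew $P_{T,\theta_0}$, say $T \sim N(2.5, 0.1^2)$ with true $\theta_0 = 2.5$, simulating from it would hand us each of these quantities.

```{r}
# Toy illustration (made-up numbers): draws from a known sampling
# distribution give the standard error, bias, and CI endpoints directly
theta0 <- 2.5
T.draws <- rnorm(1e5, mean=theta0, sd=0.1)
sd(T.draws)                          # standard error
mean(T.draws) - theta0               # bias (here ~ 0)
quantile(T.draws, c(0.025, 0.975))   # basis for a 95% CI
```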
## The Problems
Problem 1: Most of the time, $P_{X,\theta_0}$ is very complicated
Problem 2: Most of the time, $\tau$ is a very complicated function
Problem 3: We certainly don't know $\theta_0$
Upshot: We don't know $P_{T,\theta_0}$ and can't use it to calculate anything
## The Solutions
Classically ($\approx 1900$--$\approx 1975$): Restrict the model and the
statistic until you can calculate the sampling distribution, at least for very
large $n$
Modern ($\approx 1975$--): Use complex models and statistics, but simulate calculating the statistic on the model
## The Bootstrap Principle
1. Find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
2. Generate simulations $\tilde{X}$ from $\hat{P}$, setting $\tilde{T} = \tau(\tilde{X})$ for each
3. Use the simulated distribution of the $\tilde{T}$ to approximate $P_{T,\theta_0}$
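In code, the principle is just a loop; a minimal sketch, where `simulator` and `statistic` are placeholder names for a draw from $\hat{P}$ and for $\tau$, which the examples below instantiate:

```{r}
# Generic bootstrap skeleton: `simulator()` draws one X-tilde from
# P-hat, `statistic()` is tau; returns b simulated values of T-tilde
bootstrap <- function(simulator, statistic, b=1000) {
    replicate(b, statistic(simulator()))
}
```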
Refinements:
- improving the initial estimate $\hat{P}$
- reducing the number of simulations or speeding them up
- transforming $\tau$ so the final approximation is more stable
First step: find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
## Model-based Bootstrap
If we are using a model, our best guess at $P_{X,\theta_0}$ is
$P_{X,\hat{\theta}}$, with our best estimate $\hat{\theta}$ of the parameters
#### The Model-based Bootstrap
- Get data $X$, estimate $\hat{\theta}$ from $X$
- Repeat $b$ times:
    + Simulate $\tilde{X}$ from $P_{X,\hat{\theta}}$ (simulate data of the same size/"shape" as the real data)
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the simulated data the same as the real data)
- Use the empirical distribution of the $\tilde{T}$ as our approximation to $P_{T,\theta_0}$
## Example: Is Karakedi overweight?
```{r, fig.retina=NULL, out.width=600, echo=FALSE}
knitr::include_graphics("kara-as-snookums.png")
```
## Example (cont'd.)
```{r}
library(MASS); data(cats); summary(cats)
(q95.gaussian <- qnorm(0.95,mean=mean(cats$Bwt),sd=sd(cats$Bwt)))
```
## Example (cont'd.)
Simulate from the fitted Gaussian; bundle up estimating the 95th percentile into a function:
```{r}
# Simulate body weights from the Gaussian fitted to the real data
rcats.gaussian <- function() {
    rnorm(n=nrow(cats),
          mean=mean(cats$Bwt),
          sd=sd(cats$Bwt))
}
# Estimate the 95th percentile by fitting a Gaussian to the sample
est.q95.gaussian <- function(x) {
    m <- mean(x)
    s <- sd(x)
    return(qnorm(0.95, mean=m, sd=s))
}
```
## Example (cont'd.)
Simulate, plot the sampling distribution from the simulations
```{r}
sampling.dist.gaussian <- replicate(1000, est.q95.gaussian(rcats.gaussian()))
plot(hist(sampling.dist.gaussian,breaks=50, plot=FALSE), freq=FALSE)
lines(density(sampling.dist.gaussian),lwd=2)
abline(v=q95.gaussian,lty="dashed",lwd=4)
```
## Example (cont'd.)
Find the standard error and a crude confidence interval:
```{r}
sd(sampling.dist.gaussian)
quantile(sampling.dist.gaussian,c(0.025,0.975))
```
## Model Checking
As always, if the model isn't right, relying on the model is asking for trouble
Is the Gaussian a good model for cats' weights?
## Example (re-cont'd.)
Compare histogram to fitted Gaussian density and to a smooth density estimate
```{r, fig.align="center", echo=FALSE}
plot(hist(cats$Bwt,plot=FALSE),freq=FALSE, main="", xlab="Body weight (kg)")
curve(dnorm(x, mean=mean(cats$Bwt), sd=sd(cats$Bwt)),
      add=TRUE, col="purple")
lines(density(cats$Bwt),lty="dashed")
```
## Resampling
Problem: Suppose we don't have a trustworthy model
Resource: We do have data, which tells us a lot about the distribution
Solution: **Resampling**, treat the sample like a whole population
## The Resampling ("Nonparametric") Bootstrap
- Get data $X$, containing $n$ samples, find $T=\tau(X)$
- Repeat $b$ times:
    + Generate $\tilde{X}$ by drawing $n$ samples from $X$ _with replacement_ (resample the data)
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the resampled data the same as the real data)
- Use the empirical distribution of the $\tilde{T}$ as our approximation to $P_{T,\theta_0}$
## Example, Take 2
The model-free estimate of the 95th percentile is the 95th percentile of the data:
```{r}
(q95.np <- quantile(cats$Bwt,0.95))
```
How precise is that?
## Example, Take 2
Resampling, re-estimating, and finding the sampling distribution:
```{r}
# Draw a new sample of the same size from x, with replacement
resample <- function(x) {
    sample(x, size=length(x), replace=TRUE)
}
est.q95.np <- function(x) { quantile(x, 0.95) }
```
## Example, Take 2
```{r, fig.align="center"}
sampling.dist.np <- replicate(1000, est.q95.np(resample(cats$Bwt)))
plot(density(sampling.dist.np), main="", xlab="Body weight (kg)")
abline(v=q95.np,lty=2)
```
## Example, Take 2
Standard error, bias, and a crude CI:
```{r}
sd(sampling.dist.np)
mean(sampling.dist.np - q95.np)
quantile(sampling.dist.np,c(0.025,0.975))
```
## Bootstrapping Regressions
A regression is a model for $Y$ conditional on $X$
\[
Y = m(X) + \mathrm{noise}
\]
The model is silent about the distribution of $X$, so how do we simulate?
Options, putting less and less trust in the model:
- Hold $x_i$ fixed, set $\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}$ from the model's estimated noise distribution (e.g., Gaussian or $t$); see the sketch after this list
- Hold $x_i$ fixed, set $\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}$, noise _resampled from the residuals_
- Resample $(x_i, y_i)$ pairs (resample data-points or resample cases)
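A minimal sketch of the first option, assuming Gaussian noise with the fitted model's estimated standard deviation; the function and argument names here are illustrative:

```{r}
# Option 1 sketch: hold the x's fixed, add Gaussian noise with the
# model's estimated sd; `fit` is an lm object, `response` names the
# y column of the data frame df
sim.gaussian.noise <- function(fit, df, response) {
    new.df <- df
    new.df[[response]] <- fitted(fit) + rnorm(nrow(df), sd=sigma(fit))
    return(new.df)
}
```

For the regression fit below, this could be called as `sim.gaussian.noise(cats.lm, cats, "Hwt")`.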
## Cats' Hearts
`cats` has weights for cats' hearts, as well as bodies
![cute cat heart](2017-02-09-cat-heart.jpg)
[(Much cuter than any photo of real cats' hearts)](http://yaleheartstudy.org/site/wp-content/uploads/2012/03/cat-heart1.jpg)
How does heart weight relate to body weight?
(Useful if Kara's vet wants to know how much heart medicine to prescribe)
## Cats' Hearts
```{r, fig.align="center"}
plot(Hwt~Bwt, data=cats, xlab="Body weight (kg)", ylab="Heart weight (g)")
cats.lm <- lm(Hwt ~ Bwt, data=cats)
abline(cats.lm)
```
## Cats' Hearts
Coefficients and "standard" confidence intervals:
```{r}
coefficients(cats.lm)
confint(cats.lm)
```
## Cats' Hearts
The residuals don't look very Gaussian:
```{r, echo=FALSE, fig.align="center"}
par(mfrow=c(1,2))
plot(cats$Bwt, residuals(cats.lm), xlab="Body weight (kg)", ylab="Residual (g)")
plot(density(residuals(cats.lm)), main="", xlab="Residual (g)")
par(mfrow=c(1,1))
```
## Cats' Hearts
Resample residuals:
```{r}
# Simulate new heart weights: keep the x's, add resampled residuals
sim.cats.resids <- function() {
    new.cats <- cats
    noise <- resample(residuals(cats.lm))
    new.cats$Hwt <- fitted(cats.lm) + noise
    return(new.cats)
}
```
Re-estimate on new data:
```{r}
# Re-fit the model to a (simulated or resampled) data frame
coefs.cats.lm <- function(df) {
    fit <- lm(Hwt~Bwt, data=df)
    return(coefficients(fit))
}
```
## Cats' Hearts
Resample residuals repeatedly to get CIs:
```{r}
cats.lm.samp.dist.resids <- replicate(1000,
    coefs.cats.lm(sim.cats.resids()))
apply(cats.lm.samp.dist.resids, 1, quantile, c(0.025, 0.975))
```
## Cats' Hearts
Try resampling whole rows:
```{r}
# Resample whole rows of a data frame, with replacement
resample.data.frame <- function(df) {
    return(df[resample(1:nrow(df)),])
}
```
```{r}
cats.lm.samp.dist.cases <- replicate(1000,
    coefs.cats.lm(resample.data.frame(cats)))
apply(cats.lm.samp.dist.cases,1,quantile,c(0.025,0.975))
```
## Why Do This?
- Resampling residuals works as long as the noise is IID
    + Noise could be Gaussian...
    + Or it could be very non-Gaussian
- Resampling whole cases works as long as the observations are IID
    + Noise needn't be independent of $X$
    + Noise needn't be Gaussian
    + The linear model needn't be right
## Sources of Error in Bootstrapping
- **Simulation**: Using only $b$ bootstrap replicates
    + Make this small by letting $b\rightarrow\infty$
    + Costs computing time
    + Diminishing returns: error is generally $O(1/\sqrt{b})$
- **Approximation**: Using $\hat{P}$ instead of $P_{X,\theta_0}$
    + Make this small by careful statistical modeling
- **Estimation**: Only a finite number of samples
    + Make this small by being careful about what we simulate (example next slide)
Generally: for fixed $n$, the nonparametric bootstrap shows more uncertainty
than parametric bootstraps, but is less at risk from modeling mistakes
(yet another bias-variance tradeoff)
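As a rough check on the $O(1/\sqrt{b})$ rate, we can re-use `resample` and `est.q95.np` from the cats example: the run-to-run spread of the bootstrapped standard error at $b=1000$ should be roughly $1/\sqrt{10}$ of the spread at $b=100$.

```{r}
# Monte Carlo noise in the bootstrapped SE shrinks like 1/sqrt(b)
boot.se <- function(b) {
    sd(replicate(b, est.q95.np(resample(cats$Bwt))))
}
sd(replicate(20, boot.se(100)))    # noisier
sd(replicate(20, boot.se(1000)))   # roughly 1/sqrt(10) as large
```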
## Improving on the Crude Confidence Interval
The crude CI uses the distribution of $\tilde{\theta}$ under $\hat{\theta}$
But really we want the distribution of $\hat{\theta}$ under $\theta_0$
Mathematical observation: Generally speaking, the distribution of
$\tilde{\theta} - \hat{\theta}$ is closer to that of $\hat{\theta}-\theta_0$
than the distribution of $\tilde{\theta}$ is to that of $\hat{\theta}$
("centering" or "pivoting")
## The Basic, Pivotal CI
Let $q_{\alpha/2}$ and $q_{1-\alpha/2}$ be the $\alpha/2$ and $1-\alpha/2$ quantiles of $\tilde{\theta}$; then
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
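In code, for the nonparametric 95th-percentile example from earlier, the last line above translates directly (note that the upper and lower quantiles swap roles):

```{r}
# Basic/pivotal 95% CI: 2*theta-hat minus the (upper, lower) quantiles
# of the bootstrap distribution
2*q95.np - quantile(sampling.dist.np, c(0.975, 0.025))
```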