---
title: Bootstrap (with feeling this time)
author: 36-402, Section A
date: 9 February 2017
output: ioslides_presentation
---
```{r, include=FALSE}
# General set up
# Load packages we may use later
library(knitr)
# Set knitr options for knitting code into the report:
# - Save results so that code blocks aren't re-run unless code changes (cache),
# _or_ a relevant earlier code block changed (autodep), but don't re-run if the
# only thing that changed was the comments (cache.comments)
# - Don't clutter R output with messages or warnings (message, warning)
# This _will_ leave error messages showing up in the knitted report
opts_chunk$set(cache=TRUE, autodep=TRUE, cache.comments=FALSE,
               message=FALSE, warning=FALSE)
```
## The Big Picture
\[
\newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)}
\newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)}
\]
1. Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
2. The bootstrap principle: _approximate_ the sampling distribution by _simulating_ from a good model of the data, and treating the simulated data just like the real data
3. Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
4. Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
5. Stronger assumptions $\Rightarrow$ less uncertainty _if we're right_
## Statistical Uncertainty
Re-run the experiment (survey, census, ...) and get different data
$\therefore$ everything we calculate from data (estimates, test statistics, policies, ...) will change from trial to trial as well
This variability is (the source of) **statistical uncertainty**
Quantifying this is a way to be honest about what we actually know
## Measures of Uncertainty
Standard error = standard deviation of an estimator
(could equally well use median absolute deviation, etc.)
$p$-value = Probability we'd see a signal this big if there were just noise
Confidence region = All the parameter values we can't reject at low error
rates:
1. _Either_ the true parameter is in the confidence region
2. _or_ we are very unlucky
3. _or_ our model is wrong
etc., etc.
## The Sampling Distribution Is the Source of All Knowledge
Data $X \sim P_{X,\theta_0}$, for some true $\theta_0$
We calculate a statistic $T = \tau(X)$, so $T$ has some distribution $P_{T,\theta_0}$
If we knew $P_{T,\theta_0}$, we could calculate
- $\Var{T}$ (and so standard error)
- $\Expect{T}$ (and so bias)
- quantiles (and so confidence intervals or $p$-values), etc.
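To make this concrete, a toy illustration with made-up numbers: if we somehow knew $P_{T,\theta_0}$, say $T \sim N(2.5, 0.1^2)$ with true $\theta_0 = 2.5$, simulating from it would hand us each of these quantities.

```{r}
# Toy illustration (made-up numbers): draws from a known sampling
# distribution give the standard error, bias, and CI endpoints directly
theta0 <- 2.5
T.draws <- rnorm(1e5, mean=theta0, sd=0.1)
sd(T.draws)                          # standard error
mean(T.draws) - theta0               # bias (here ~ 0)
quantile(T.draws, c(0.025, 0.975))   # basis for a 95% CI
```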
## The Problems
Problem 1: Most of the time, $P_{X,\theta_0}$ is very complicated
Problem 2: Most of the time, $\tau$ is a very complicated function
Problem 3: We certainly don't know $\theta_0$
Upshot: We don't know $P_{T,\theta_0}$ and can't use it to calculate anything
## The Solutions
Classically ($\approx 1900$--$\approx 1975$): Restrict the model and the
statistic until you can calculate the sampling distribution, at least for very
large $n$
Modern ($\approx 1975$--): Use complex models and statistics, but simulate calculating the statistic on the model
## The Bootstrap Principle
1. Find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
2. Generate simulations $\tilde{X}$ from $\hat{P}$, setting $\tilde{T} = \tau(\tilde{X})$ for each
3. Use the simulated distribution of the $\tilde{T}$ to approximate $P_{T,\theta_0}$
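In code, the principle is just a loop; a minimal sketch, where `simulator` and `statistic` are placeholder names for a draw from $\hat{P}$ and for $\tau$, which the examples below instantiate:

```{r}
# Generic bootstrap skeleton: `simulator()` draws one X-tilde from
# P-hat, `statistic()` is tau; returns b simulated values of T-tilde
bootstrap <- function(simulator, statistic, b=1000) {
    replicate(b, statistic(simulator()))
}
```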
Refinements:
- improving the initial estimate $\hat{P}$
- reducing the number of simulations or speeding them up
- transforming $\tau$ so the final approximation is more stable
First step: find a good estimate $\hat{P}$ for $P_{X,\theta_0}$
## Model-based Bootstrap
If we are using a model, our best guess at $P_{X,\theta_0}$ is
$P_{X,\hat{\theta}}$, with our best estimate $\hat{\theta}$ of the parameters
#### The Model-based Bootstrap
- Get data $X$, estimate $\hat{\theta}$ from $X$
- Repeat $b$ times:
    + Simulate $\tilde{X}$ from $P_{X,\hat{\theta}}$ (simulate data of the same size/"shape" as the real data)
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the simulated data the same as the real data)
- Use the empirical distribution of the $\tilde{T}$ as our approximation to $P_{T,\theta_0}$
## Example: Is Karakedi overweight?
```{r, fig.retina=NULL, out.width=600, echo=FALSE}
knitr::include_graphics("kara-as-snookums.png")
```
## Example (cont'd.)
```{r}
library(MASS); data(cats); summary(cats)
(q95.gaussian <- qnorm(0.95,mean=mean(cats$Bwt),sd=sd(cats$Bwt)))
```
## Example (cont'd.)
Simulate from the fitted Gaussian; bundle up estimating the 95th percentile into a function:
```{r}
# Simulate body weights from the Gaussian fitted to the real data
rcats.gaussian <- function() {
    rnorm(n=nrow(cats),
          mean=mean(cats$Bwt),
          sd=sd(cats$Bwt))
}
# Estimate the 95th percentile by fitting a Gaussian to the sample
est.q95.gaussian <- function(x) {
    m <- mean(x)
    s <- sd(x)
    return(qnorm(0.95, mean=m, sd=s))
}
```
## Example (cont'd.)
Simulate, plot the sampling distribution from the simulations
```{r}
sampling.dist.gaussian <- replicate(1000, est.q95.gaussian(rcats.gaussian()))
plot(hist(sampling.dist.gaussian,breaks=50, plot=FALSE), freq=FALSE)
lines(density(sampling.dist.gaussian),lwd=2)
abline(v=q95.gaussian,lty="dashed",lwd=4)
```
## Example (cont'd.)
Find the standard error and a crude confidence interval:
```{r}
sd(sampling.dist.gaussian)
quantile(sampling.dist.gaussian,c(0.025,0.975))
```
## Model Checking
As always, if the model isn't right, relying on the model is asking for trouble
Is the Gaussian a good model for cats' weights?
## Example (re-cont'd.)
Compare histogram to fitted Gaussian density and to a smooth density estimate
```{r, fig.align="center", echo=FALSE}
plot(hist(cats$Bwt,plot=FALSE),freq=FALSE, main="", xlab="Body weight (kg)")
curve(dnorm(x, mean=mean(cats$Bwt), sd=sd(cats$Bwt)),
      add=TRUE, col="purple")
lines(density(cats$Bwt),lty="dashed")
```
## Resampling
Problem: Suppose we don't have a trustworthy model
Resource: We do have data, which tells us a lot about the distribution
Solution: **Resampling**, treat the sample like a whole population
## The Resampling ("Nonparametric") Bootstrap
- Get data $X$, containing $n$ samples, find $T=\tau(X)$
- Repeat $b$ times:
    + Generate $\tilde{X}$ by drawing $n$ samples from $X$ _with replacement_ (resample the data)
    + Calculate $\tilde{T} = \tau(\tilde{X})$ (treat the resampled data the same as the real data)
- Use the empirical distribution of the $\tilde{T}$ as our approximation to $P_{T,\theta_0}$
## Example, Take 2
The model-free estimate of the 95th percentile is the 95th percentile of the data:
```{r}
(q95.np <- quantile(cats$Bwt,0.95))
```
How precise is that?
## Example, Take 2
Resampling, re-estimating, and finding the sampling distribution:
```{r}
# Draw a new sample of the same size from x, with replacement
resample <- function(x) {
    sample(x, size=length(x), replace=TRUE)
}
est.q95.np <- function(x) { quantile(x, 0.95) }
```
## Example, Take 2
```{r, fig.align="center"}
sampling.dist.np <- replicate(1000, est.q95.np(resample(cats$Bwt)))
plot(density(sampling.dist.np), main="", xlab="Body weight (kg)")
abline(v=q95.np,lty=2)
```
## Example, Take 2
Standard error, bias, and a crude CI:
```{r}
sd(sampling.dist.np)
mean(sampling.dist.np - q95.np)
quantile(sampling.dist.np,c(0.025,0.975))
```
## Bootstrapping Regressions
A regression is a model for $Y$ conditional on $X$
\[
Y = m(X) + \mathrm{noise}
\]
The model is silent about the distribution of $X$, so how do we simulate?
Options, putting less and less trust in the model:
- Hold $x_i$ fixed, set $\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}$ from the model's estimated noise distribution (e.g., Gaussian or $t$); see the sketch after this list
- Hold $x_i$ fixed, set $\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}$, noise _resampled from the residuals_
- Resample $(x_i, y_i)$ pairs (resample data-points or resample cases)
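A minimal sketch of the first option, assuming Gaussian noise with the fitted model's estimated standard deviation; the function and argument names here are illustrative:

```{r}
# Option 1 sketch: hold the x's fixed, add Gaussian noise with the
# model's estimated sd; `fit` is an lm object, `response` names the
# y column of the data frame df
sim.gaussian.noise <- function(fit, df, response) {
    new.df <- df
    new.df[[response]] <- fitted(fit) + rnorm(nrow(df), sd=sigma(fit))
    return(new.df)
}
```

For the regression fit below, this could be called as `sim.gaussian.noise(cats.lm, cats, "Hwt")`.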
## Cats' Hearts
`cats` has weights for cats' hearts, as well as bodies
![cute cat heart](2017-02-09-cat-heart.jpg)
[(Much cuter than any photo of real cats' hearts)](http://yaleheartstudy.org/site/wp-content/uploads/2012/03/cat-heart1.jpg)
How does heart weight relate to body weight?
(Useful if Kara's vet wants to know how much heart medicine to prescribe)
## Cats' Hearts
```{r, fig.align="center"}
plot(Hwt~Bwt, data=cats, xlab="Body weight (kg)", ylab="Heart weight (g)")
cats.lm <- lm(Hwt ~ Bwt, data=cats)
abline(cats.lm)
```
## Cats' Hearts
Coefficients and "standard" confidence intervals:
```{r}
coefficients(cats.lm)
confint(cats.lm)
```
## Cats' Hearts
The residuals don't look very Gaussian:
```{r, echo=FALSE, fig.align="center"}
par(mfrow=c(1,2))
plot(cats$Bwt, residuals(cats.lm), xlab="Body weight (kg)", ylab="Residual (g)")
plot(density(residuals(cats.lm)), main="", xlab="Residual (g)")
par(mfrow=c(1,1))
```
## Cats' Hearts
Resample residuals:
```{r}
# Simulate new heart weights: keep the x's, add resampled residuals
sim.cats.resids <- function() {
    new.cats <- cats
    noise <- resample(residuals(cats.lm))
    new.cats$Hwt <- fitted(cats.lm) + noise
    return(new.cats)
}
```
Re-estimate on new data:
```{r}
# Re-fit the model to a (simulated or resampled) data frame
coefs.cats.lm <- function(df) {
    fit <- lm(Hwt~Bwt, data=df)
    return(coefficients(fit))
}
```
## Cats' Hearts
Resample residuals repeatedly to get CIs:
```{r}
cats.lm.samp.dist.resids <- replicate(1000,
    coefs.cats.lm(sim.cats.resids()))
apply(cats.lm.samp.dist.resids, 1, quantile, c(0.025, 0.975))
```
## Cats' Hearts
Try resampling whole rows:
```{r}
# Resample whole rows of a data frame, with replacement
resample.data.frame <- function(df) {
    return(df[resample(1:nrow(df)),])
}
```
```{r}
cats.lm.samp.dist.cases <- replicate(1000,
    coefs.cats.lm(resample.data.frame(cats)))
apply(cats.lm.samp.dist.cases,1,quantile,c(0.025,0.975))
```
## Why Do This?
- Resampling residuals works as long as the noise is IID
    + Noise could be Gaussian...
    + Or it could be very non-Gaussian
- Resampling whole cases works as long as the observations are IID
    + Noise needn't be independent of $X$
    + Noise needn't be Gaussian
    + The linear model needn't be right
## Sources of Error in Bootstrapping
- **Simulation**: Using only $b$ bootstrap replicates
    + Make this small by letting $b\rightarrow\infty$
    + Costs computing time
    + Diminishing returns: error is generally $O(1/\sqrt{b})$
- **Approximation**: Using $\hat{P}$ instead of $P_{X,\theta_0}$
    + Make this small by careful statistical modeling
- **Estimation**: Only a finite number of samples
    + Make this small by being careful about what we simulate (example next slide)
Generally: for fixed $n$, the nonparametric bootstrap shows more uncertainty
than parametric bootstraps, but is less at risk from modeling mistakes
(yet another bias-variance tradeoff)
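As a rough check on the $O(1/\sqrt{b})$ rate, we can re-use `resample` and `est.q95.np` from the cats example: the run-to-run spread of the bootstrapped standard error at $b=1000$ should be roughly $1/\sqrt{10}$ of the spread at $b=100$.

```{r}
# Monte Carlo noise in the bootstrapped SE shrinks like 1/sqrt(b)
boot.se <- function(b) {
    sd(replicate(b, est.q95.np(resample(cats$Bwt))))
}
sd(replicate(20, boot.se(100)))    # noisier
sd(replicate(20, boot.se(1000)))   # roughly 1/sqrt(10) as large
```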
## Improving on the Crude Confidence Interval
The crude CI uses the distribution of $\tilde{\theta}$ under $\hat{\theta}$
But really we want the distribution of $\hat{\theta}$ under $\theta_0$
Mathematical observation: Generally speaking, the distribution of
$\tilde{\theta} - \hat{\theta}$ is closer to that of $\hat{\theta}-\theta_0$
than the distribution of $\tilde{\theta}$ is to that of $\hat{\theta}$
("centering" or "pivoting")
## The Basic, Pivotal CI
Let $q_{\alpha/2}$ and $q_{1-\alpha/2}$ be the $\alpha/2$ and $1-\alpha/2$ quantiles of $\tilde{\theta}$; then
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
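In code, for the nonparametric 95th-percentile example from earlier, the last line above translates directly (note that the upper and lower quantiles swap roles):

```{r}
# Basic/pivotal 95% CI: 2*theta-hat minus the (upper, lower) quantiles
# of the bootstrap distribution
2*q95.np - quantile(sampling.dist.np, c(0.975, 0.025))
```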