9 February 2017

## The Big Picture


1. Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
2. The bootstrap principle: approximate the sampling distribution by simulating from a good model of the data, and treating the simulated data just like the real data
3. Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
4. Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
5. Stronger assumptions $$\Rightarrow$$ less uncertainty if we're right

## Statistical Uncertainty

Re-run the experiment (survey, census, …) and get different data

$$\therefore$$ everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well

This variability is (the source of) statistical uncertainty

Quantifying this is a way to be honest about what we actually know

## Measures of Uncertainty

Standard error = standard deviation of an estimator (could equally well use median absolute deviation, etc.)

$$p$$-value = the probability we'd see a signal this big if there were just noise

Confidence region = All the parameter values we can't reject at low error rates:

1. Either the true parameter is in the confidence region
2. or we are very unlucky
3. or our model is wrong

etc., etc.
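As a small illustration (my sketch, not from the notes), the "signal vs. noise" reading of a $$p$$-value can be made concrete by simulation: generate noise-only data many times and count how often the statistic comes out at least as big as what we observed. The numbers here (sample size, observed mean, Gaussian noise model) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: we observed a sample mean of 0.3 from n = 100 points
n, observed_mean = 100, 0.3

# "Just noise" model: i.i.d. N(0, 1), i.e. true mean is zero
# Simulate 10,000 noise-only datasets and compute the statistic on each
sims = rng.standard_normal((10_000, n)).mean(axis=1)

# p-value: fraction of noise-only runs with a signal at least this big
p_value = np.mean(np.abs(sims) >= abs(observed_mean))
print(p_value)  # small here: 0.3 is about 3 standard errors from 0
```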

## The Sampling Distribution Is the Source of All Knowledge

Data $$X \sim P_{X,\theta_0}$$, for some true $$\theta_0$$

We calculate a statistic $$T = \tau(X)$$ so it has distribution $$P_{T,\theta_0}$$

If we knew $$P_{T,\theta_0}$$, we could calculate

• $$\Var{T}$$ (and so standard error)
• $$\Expect{T}$$ (and so bias)
• quantiles (and so confidence intervals or $$p$$-values), etc.
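If we *did* know $$P_{T,\theta_0}$$, all three quantities could be read off by simulating from it. A minimal sketch under assumptions of my choosing (Exponential data with known $$\theta_0$$, statistic $$\tau$$ = sample median):

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n = 1.0, 50  # true scale of an Exponential, sample size

# Draw many datasets X ~ P_{X, theta0}; compute T = tau(X) on each,
# giving draws from the sampling distribution P_{T, theta0}
T = np.median(rng.exponential(scale=theta0, size=(20_000, n)), axis=1)

se = T.std()                             # standard error of T
bias = T.mean() - theta0 * np.log(2)     # true median of Exp is theta0 * ln 2
lo, hi = np.quantile(T, [0.025, 0.975])  # quantiles, basis for a 95% interval
```

The whole difficulty, as the next slide says, is that in practice we don't know $$\theta_0$$ and so can't run this simulation directly.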

## The Problems

Problem 1: Most of the time, $$P_{X,\theta_0}$$ is very complicated

Problem 2: Most of the time, $$\tau$$ is a very complicated function

Problem 3: We certainly don't know $$\theta_0$$

Upshot: We don't know $$P_{T,\theta_0}$$ and can't use it to calculate anything

## The Solutions

Classically ($$\approx 1900$$–$$\approx 1975$$): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large $$n$$

Modern ($$\approx 1975$$–): Use complex models and statistics, but simulate calculating the statistic on the model

## The Bootstrap Principle

1. Find a good estimate $$\hat{P}$$ for $$P_{X,\theta_0}$$
2. Generate a simulation $$\tilde{X}$$ from $$\hat{P}$$, set $$\tilde{T} = \tau(\tilde{X})$$
3. Use the simulated distribution of the $$\tilde{T}$$ to approximate $$P_{T,\theta_0}$$

Refinements:

• improving the initial estimate $$\hat{P}$$
• reducing the number of simulations or speeding them up
• transforming $$\tau$$ so the final approximation is more stable

First step: find a good estimate $$\hat{P}$$ for $$P_{X,\theta_0}$$

## Model-based Bootstrap

If we are using a model, our best guess at $$P_{X,\theta_0}$$ is $$P_{X,\hat{\theta}}$$, with our best estimate $$\hat{\theta}$$ of the parameters

#### The Model-based Bootstrap

• Get data $$X$$, estimate $$\hat{\theta}$$ from $$X$$
• Repeat $$b$$ times:
  • Simulate $$\tilde{X}$$ from $$P_{X,\hat{\theta}}$$ (simulate data of same size/"shape" as real data)
  • Calculate $$\tilde{T} = \tau(\tilde{X})$$ (treat simulated data the same as real data)
• Use empirical distribution of $$\tilde{T}$$ as approximation to $$P_{T,\theta_0}$$
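The steps above can be sketched directly. Assuming, purely for illustration, an Exponential model fit by maximum likelihood (the MLE of the scale is the sample mean) and $$\tau$$ = sample median:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Real" data -- here itself simulated, standing in for observations
X = rng.exponential(scale=2.0, size=100)

# Step 1: estimate theta-hat from X (MLE of the Exponential scale)
theta_hat = X.mean()

# Step 2: repeat b times -- simulate data of the same size from the
# fitted model, and treat it exactly like the real data
b = 2_000
T_tilde = np.array([np.median(rng.exponential(scale=theta_hat, size=len(X)))
                    for _ in range(b)])

# Step 3: use the empirical distribution of T-tilde for P_{T, theta0}
se = T_tilde.std()                        # bootstrap standard error
ci = np.quantile(T_tilde, [0.025, 0.975])  # bootstrap percentile interval
```

Note that each simulated dataset has the same size as the real one, and the statistic is recomputed with exactly the same recipe, per the "treat simulated data the same as real data" rule.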