5 February 2019

# The Big Picture


1. Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
2. The bootstrap principle: approximate the sampling distribution by simulating from a good model of the data, and treating the simulated data just like the real data
3. Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
4. Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
5. Stronger assumptions $$\Rightarrow$$ less uncertainty if we’re right

# Statistical Uncertainty

• Re-run the experiment (survey, census, …) and get different data
• $$\therefore$$ everything we calculate from data (estimates, test statistics, $$p$$-values, policy recommendations, …) will change from run to run
• This variability is (the source of) statistical uncertainty
• Quantifying this = honesty about what we actually know

# Measures of Uncertainty

• Standard error = standard deviation of an estimator
• (could equally well use median absolute deviation, etc.)
• $$p$$-value = Probability we’d see a signal this big if there were just noise
• Confidence region = All the parameter values we can’t reject at low error rates
• Either the true parameter is in the confidence region
• or we are very unlucky
• or our model is wrong
• etc., etc.

# The Sampling Distribution Is the Source of All Knowledge

• Data $$X \sim P_X$$ for some unknown true distribution $$P_X$$
• We calculate a statistic $$T = \tau(X)$$ so it has distribution $$P_{T}$$
• If we knew $$P_{T}$$, we could calculate
• $$\mathrm{Var}[T]$$ (and so standard error)
• $$\mathbb{E}[T]$$ (and so bias)
• quantiles (and so confidence intervals or $$p$$-values), etc.
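
Written out, as a sketch: assume $$T$$ is an estimator of some target value $$\theta_0$$ (a symbol introduced here only for illustration), and that $$T$$ has a continuous distribution. Then the quantities in the list above are

$$\mathrm{se}(T) = \sqrt{\mathrm{Var}[T]}, \qquad \mathrm{bias}(T) = \mathbb{E}[T] - \theta_0$$

and if $$q_{\alpha/2}$$ and $$q_{1-\alpha/2}$$ are the quantiles of $$T - \theta_0$$, then

$$\Pr\left(T - q_{1-\alpha/2} \leq \theta_0 \leq T - q_{\alpha/2}\right) = 1 - \alpha$$

so $$[T - q_{1-\alpha/2},\ T - q_{\alpha/2}]$$ is a $$1-\alpha$$ confidence interval for $$\theta_0$$.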

# The Difficulties

• Difficulty 1: Most of the time, $$P_{X}$$ is very complicated
• Difficulty 2: Most of the time, $$\tau$$ is a very complicated function
• $$\therefore$$ We couldn’t solve for $$P_T$$
• Difficulty 3: Actually, we don’t know $$P_X$$
• Upshot: We really don’t know $$P_{T}$$ and can’t use it to calculate anything

# The Solutions

• Classically ($$\approx 1900$$–$$\approx 1975$$): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large $$n$$

• Modern ($$\approx 1975$$–): Use complex models and statistics, but simulate calculating the statistic on the model

# The Monte Carlo Principle

• Generate a simulation $$\tilde{X}$$ from $$P_X$$
• Set $$\tilde{T} = \tau(\tilde{X})$$
• Repeat many times
• Use the simulated distribution of the $$\tilde{T}$$ to approximate $$P_{T}$$
• (As a general method, invented by Enrico Fermi in the 1930s; it spread through the Manhattan Project)
• Still needs $$P_X$$
• Works in HW 3 because we’re testing a fixed model
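
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names, the choice of $$P_X$$ (50 i.i.d. exponential draws with rate 2) and the choice of $$\tau$$ (the sample median) are all assumptions made up for the example. The key point is that `simulate` needs to know $$P_X$$ exactly.

```python
import random
import statistics


def monte_carlo_sampling_distribution(simulate, tau, reps, seed=0):
    """Approximate P_T: draw X~ from a KNOWN P_X, compute tau(X~), repeat."""
    rng = random.Random(seed)
    return [tau(simulate(rng)) for _ in range(reps)]


# Hypothetical P_X, chosen only for illustration:
# n = 50 independent Exponential(rate = 2) observations.
def simulate_exponential(rng, n=50, rate=2.0):
    return [rng.expovariate(rate) for _ in range(n)]


# tau is the sample median; t_tilde holds the simulated values of T~.
t_tilde = monte_carlo_sampling_distribution(
    simulate_exponential, statistics.median, reps=2000
)

# Summaries of the simulated distribution stand in for those of P_T.
se = statistics.stdev(t_tilde)              # approximate standard error
mean_t = statistics.fmean(t_tilde)          # approximate E[T]
cuts = statistics.quantiles(t_tilde, n=40)  # 39 cut points, 2.5% apart
lower, upper = cuts[0], cuts[-1]            # approx. 2.5% and 97.5% quantiles
```

Any other statistic can be swapped in for `statistics.median` without changing the rest, which is the attraction of the method when $$\tau$$ is too complicated to handle analytically.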

# The Bootstrap Principle

1. Find a good estimate $$\hat{P}$$ for $$P_{X}$$
2. Generate a simulation $$\tilde{X}$$ from $$\hat{P}$$, set $$\tilde{T} = \tau(\tilde{X})$$
3. Use the simulated distribution of the $$\tilde{T}$$ to approximate $$P_{T}$$
• “Pull yourself up by your bootstraps”: use $$\hat{P}$$ to get at uncertainty in itself
• Invented by Bradley Efron in the 1970s
• First step: find a good estimate $$\hat{P}$$ for $$P_{X}$$

# Model-based Bootstrap

If we are using a model, our best guess at $$P_{X}$$ is $$P_{X,\hat{\theta}}$$, with our best estimate $$\hat{\theta}$$ of the parameters

#### The Model-based Bootstrap

• Get data $$X$$, estimate $$\hat{\theta}$$ from $$X$$
• Repeat $$b$$ times:
• Simulate $$\tilde{X}$$ from $$P_{X,\hat{\theta}}$$ (simulate data of same size/“shape” as real data)
• Calculate $$\tilde{T} = \tau(\tilde{X})$$ (treat simulated data the same as real data)
• Use the empirical distribution of the $$\tilde{T}$$ as an approximation to $$P_{T}$$
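
A minimal Python sketch of this recipe follows. Everything specific in it is an assumption for illustration: the Gaussian model, the helper names `fit_gaussian` and `simulate_gaussian`, the synthetic "data" (generated here so the example is self-contained), and the choice of the sample median as $$\tau$$. The only difference from plain Monte Carlo is that we simulate from the fitted model $$P_{X,\hat{\theta}}$$ rather than from the unknown $$P_X$$.

```python
import random
import statistics


def model_based_bootstrap(data, fit, simulate, tau, b=1000, seed=0):
    """Estimate theta-hat from data, then simulate b data sets of the same
    size from the fitted model and recompute the statistic on each one."""
    rng = random.Random(seed)
    theta_hat = fit(data)
    n = len(data)
    return [tau(simulate(theta_hat, n, rng)) for _ in range(b)]


# Hypothetical model: i.i.d. Gaussian with theta = (mean, standard deviation).
def fit_gaussian(x):
    return statistics.fmean(x), statistics.stdev(x)


def simulate_gaussian(theta, n, rng):
    mu, sigma = theta
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Synthetic stand-in for real data (illustration only, not real measurements).
data_rng = random.Random(42)
x = [data_rng.gauss(10.0, 3.0) for _ in range(100)]

# Simulated data are treated exactly like the real data: same tau, same n.
t_tilde = model_based_bootstrap(
    x, fit_gaussian, simulate_gaussian, statistics.median, b=1000
)

se_hat = statistics.stdev(t_tilde)          # bootstrap standard error of the median
cuts = statistics.quantiles(t_tilde, n=40)  # 39 cut points, 2.5% apart
lower, upper = cuts[0], cuts[-1]            # approx. 2.5% and 97.5% quantiles of T~
```

Passing `fit` and `simulate` as arguments keeps the resampling loop independent of the particular model, so a different parametric family only requires a different pair of functions.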