9 February 2017

The Big Picture

\[ \newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)} \newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)} \]

  1. Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
  2. The bootstrap principle: approximate the sampling distribution by simulating from a good model of the data, and treating the simulated data just like the real data
  3. Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
  4. Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
  5. Stronger assumptions \(\Rightarrow\) less uncertainty if we're right

Statistical Uncertainty

Re-run the experiment (survey, census, …) and get different data

\(\therefore\) everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well

This variability is (the source of) statistical uncertainty

Quantifying this is a way to be honest about what we actually know

Measures of Uncertainty

Standard error = standard deviation of an estimator (could equally well use median absolute deviation, etc.)

\(p\)-value = Probability we'd see a signal this big if there were only noise

Confidence region = All the parameter values we can't reject at low error rates:

  1. Either the true parameter is in the confidence region
  2. or we are very unlucky
  3. or our model is wrong

etc., etc.
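
A toy numerical sketch of all three measures, assuming \(n\) i.i.d. Gaussian draws with known \(\sigma\) and the sample mean as the statistic (so everything has a closed form):

```python
import numpy as np
from scipy import stats

# Assumed toy setting: n i.i.d. N(mu, sigma^2) draws, statistic = sample mean
n, sigma = 100, 2.0
xbar = 0.37                                  # pretend this is the observed mean

se = sigma / np.sqrt(n)                      # standard error of the mean
p = 2 * stats.norm.sf(abs(xbar) / se)        # two-sided p-value under H0: mu = 0
ci95 = (xbar - 1.96 * se, xbar + 1.96 * se)  # 95% confidence interval
```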

The Sampling Distribution Is the Source of All Knowledge

Data \(X \sim P_{X,\theta_0}\), for some true \(\theta_0\)

We calculate a statistic \(T = \tau(X)\), so \(T\) has some distribution \(P_{T,\theta_0}\)

If we knew \(P_{T,\theta_0}\), we could calculate

  • \(\Var{T}\) (and so standard error)
  • \(\Expect{T}\) (and so bias)
  • quantiles (and so confidence intervals or \(p\)-values), etc.
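
If we can sample from \(P_{X,\theta_0}\), every item on this list is a Monte Carlo calculation. A minimal sketch, assuming (purely for illustration) that \(X\) is \(n\) i.i.d. Exponential(\(\theta_0\)) draws and \(\tau\) is the sample median:

```python
import numpy as np

rng = np.random.default_rng(402)
theta0, n = 1.0, 50                       # pretend, for once, we know theta_0
true_median = np.log(2) / theta0          # population median of Exponential(theta0)

# Draw many data sets X ~ P_{X,theta0}, compute T = tau(X) on each
T = np.array([np.median(rng.exponential(1 / theta0, size=n))
              for _ in range(10000)])

se = T.std()                                 # Var[T] -> standard error
bias = T.mean() - true_median                # E[T] -> bias
q_lo, q_hi = np.quantile(T, [0.025, 0.975])  # quantiles -> CIs, p-values
```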

The Problems

Problem 1: Most of the time, \(P_{X,\theta_0}\) is very complicated

Problem 2: Most of the time, \(\tau\) is a very complicated function

Problem 3: We certainly don't know \(\theta_0\)

Upshot: We don't know \(P_{T,\theta_0}\) and can't use it to calculate anything

The Solutions

Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)

Modern (\(\approx 1975\)–): Use complex models and statistics, but simulate calculating the statistic on the model

The Bootstrap Principle

  1. Find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
  2. Generate a simulation \(\tilde{X}\) from \(\hat{P}\), set \(\tilde{T} = \tau(\tilde{X})\)
  3. Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T,\theta_0}\)
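
In code, the principle is a single loop; a minimal sketch, where `simulate_from_P_hat` and `tau` are placeholder names standing in for steps 1 and 2:

```python
import numpy as np

def bootstrap(simulate_from_P_hat, tau, b=1000):
    """Approximate P_{T,theta_0} by the distribution of T~ = tau(X~).

    simulate_from_P_hat: function of no arguments returning one X~ from P-hat
    tau: the statistic, applied to simulated data exactly as to the real data
    """
    return np.array([tau(simulate_from_P_hat()) for _ in range(b)])
```

The variants that follow differ only in how `simulate_from_P_hat` is built.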

Refinements:

  • improving the initial estimate \(\hat{P}\)
  • reducing the number of simulations or speeding them up
  • transforming \(\tau\) so the final approximation is more stable

First step: find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)

Model-based Bootstrap

If we are using a model, our best guess at \(P_{X,\theta_0}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters

The Model-based Bootstrap

  • Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
  • Repeat \(b\) times:
    • Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/"shape" as real data)
    • Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
  • Use the empirical distribution of the \(\tilde{T}\) as the approximation to \(P_{T,\theta_0}\)
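
A minimal sketch of this recipe, assuming a Gaussian model fit by maximum likelihood and the sample median as \(\tau\):

```python
import numpy as np

rng = np.random.default_rng(36402)
x = rng.normal(10, 3, size=60)            # stand-in for the real data X

# Estimate theta-hat from X: for a Gaussian model, the MLE is (mean, sd)
mu_hat, sigma_hat = x.mean(), x.std()

# Repeat b times: simulate X~ of the same size, treat it like the real data
b = 2000
T_tilde = np.array([np.median(rng.normal(mu_hat, sigma_hat, size=len(x)))
                    for _ in range(b)])

se_median = T_tilde.std()                    # bootstrap standard error of tau
bias_median = T_tilde.mean() - np.median(x)  # bootstrap estimate of the bias
```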

Example: Is Karakedi overweight?