---
title: Significance Testing etc.
author: 36-402, Section A
date: 7 February 2019
output: slidy_presentation
bibliography: locusts.bib
---
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\StdErr}[1]{\mathrm{se}\left[ #1 \right]}
\newcommand{\EstStdErr}[1]{\widehat{\mathrm{se}}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]}
\]
```{r, include=FALSE}
set.seed(2019)
```
## Linear Regression and Hypothesis Testing (Once More with Feeling) {.smaller}
```{r}
stocks <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/hw/03/stock_history.csv")
stocks$MAPE <- with(stocks, Price/Earnings_10MA_back)
returns.on.mape.transforms <- lm(Return_10_fwd ~ I(1/MAPE) + MAPE + I(MAPE^2), data=stocks)
summary(returns.on.mape.transforms)
```
## What Is the Model? What Are the Hypotheses?
\[
R_i = \beta_0 + \beta_1\frac{1}{M_i} + \beta_2 M_i + \beta_3 M_i^2 + \epsilon_i, ~ \Expect{\epsilon|M} = 0
\]
`lm` also assumes the $\epsilon_i$ are IID Gaussian, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
- Sampling distribution of $\hat{\beta}_j$ under _all_ these assumptions:
\[
\hat{\beta}_j \sim \mathcal{N}(\beta_j, \StdErr{\hat{\beta}_j}^2)
\]
#### What Hypotheses Does `lm` Test?
- For $\beta_0$, $H_{00}$: That model is right, and $\beta_0=0$ exactly vs. That model is right, and $\beta_0 \neq 0$
- For $\beta_1$, $H_{01}$: That model is right, and $\beta_1=0$ exactly vs. That model is right, and $\beta_1 \neq 0$
- For $\beta_2$, $H_{02}$: That model is right, and $\beta_2=0$ exactly vs. That model is right, and $\beta_2 \neq 0$
- For $\beta_3$, $H_{03}$: That model is right, and $\beta_3=0$ exactly vs. That model is right, and $\beta_3 \neq 0$
## How Does R Test These Hypotheses?
- Need a **test statistic** = a function of the data with different distributions under null and alternative
+ Generally, we pick test statistics which tend to be small under the null, and large under the alternative
- R uses the **Wald test**, where the test statistic is
\[
T_j \equiv \frac{\hat{\beta}_j}{\EstStdErr{\hat{\beta}_j}}
\]
+ Works for testing $\beta_j=0$ --- how to modify to test $\beta_j=b \neq 0$?
- We know the sampling distribution of $T_j$ under these modeling assumptions _and_ the null hypothesis $H_{0j}$:
\[
T_j \sim t_{n-4} \approx \mathcal{N}(0,1) ~ \mathrm{for\ large}\ n
\]
- We get a certain value of $T_j$ on the data, say $\hat{T}_j$; the $P$-value is the probability of something at least that extreme under the null:
\[
P_j=\Prob{|T_j| \geq |\hat{T}_j|; H_{0j}}
\]
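These calculations are easy to reproduce by hand; here is a small sketch with simulated (made-up) data, checking against `summary()`:

```{r}
# Simulate a small data set where the linear-Gaussian model is exactly right
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
# Wald statistic: estimate divided by its estimated standard error
beta.hat <- coef(fit)
se.hat <- sqrt(diag(vcov(fit)))
T.stat <- beta.hat / se.hat
# Two-sided P-value from the t distribution with n - 2 degrees of freedom
P <- 2 * pt(-abs(T.stat), df = n - 2)
cbind(T.stat, P)  # matches the columns of summary(fit)$coefficients
# To test beta_j = b with b != 0, center at b first:
# (beta.hat["x"] - b) / se.hat["x"]
```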
## What "Statistically Significant" Means
- We can reliably or confidently detect a difference from null in the direction of the alternative
+ We wouldn't see this big a departure from the null, in the direction of the alternative, _if the null were true_
+ unless we were really unlucky
+ or unless the model is wrong
- "Statistically detectable" might have been a better name
## What Tends to Make Things Significant?
- $T_j = \hat{\beta}_j/\EstStdErr{\hat{\beta}_j}$
- As $n\rightarrow \infty$, $\hat{\beta}_j \rightarrow \beta_j$ and $\EstStdErr{\hat{\beta}_j} \rightarrow \StdErr{\hat{\beta}_j}$
- So if $\beta_j$ is large, $T_j$ will tend to be large
- What about $\StdErr{\hat{\beta}_j}$?
+ $\StdErr{\hat{\beta}_j} \propto \sigma$
+ $\StdErr{\hat{\beta}_j} \propto 1/\sqrt{n}$
+ As $\Var{X_j} \uparrow \infty$, $\StdErr{\hat{\beta}_j} \downarrow 0$
+ As $X_j$ becomes correlated with the other $X_k$'s, $\StdErr{\hat{\beta}_j} \uparrow \infty$ (worst case: collinearity)
+ (Really: MSE of a linear regression of $X_j$ on the other $X_k$'s)
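These scalings can be checked by simulation; the sketch below (all numbers made up for illustration) cranks up the correlation between two predictors and watches the standard error of $\hat{\beta}_1$ grow:

```{r}
# Standard error of the slope on x1 as x1 and x2 become more correlated
se.of.slope <- function(rho, n = 1000) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # cor(x1, x2) is about rho
  y <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
  sqrt(diag(vcov(lm(y ~ x1 + x2))))["x1"]
}
sapply(c(0, 0.5, 0.9, 0.99), se.of.slope)  # se blows up as rho -> 1
```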
## What Tends to Make Things Significant?
> - For fixed $n$, the detectably-different-from-zero coefficients belong to variables with
> + Large true coefficients $\beta_j$
> + Large variances of that variable, $\Var{X_j}$
> + Little correlation with the other variables
## What Tends to Make Things Significant?
```{r}
x <- with(na.omit(stocks), cbind(1/MAPE, MAPE, MAPE^2))
colnames(x) <- c("1/MAPE", "MAPE", "MAPE^2")
var(x)
cor(x)
```
## What Makes Things Significant? (cont'd)
> - As $n$ grows, **every** coefficient which isn't **exactly** zero becomes significant
> + $T_j \rightarrow \pm \infty$, **unless** $\beta_j=0$
> + $P_j \rightarrow 0$ exponentially fast
> + "The $P$-value is a measure of sample size"
> + More [here](http://bactra.org/weblog/1111.html)
> - If $\beta_j=0$, then $P_j \sim \mathrm{Unif}(0,1)$ for **all** $n$
> + Assuming the model is right
> + So about 1/20 of truly-zero coefficients will look significant at the 0.05 level
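This is easy to check by simulation: with a truly-zero coefficient (made-up data below), repeated draws give $P$-values that look uniform, and about 5% of them fall below 0.05:

```{r}
# P-value for a coefficient whose true value is exactly zero
p.under.null <- function(n = 100) {
  x <- rnorm(n)
  y <- rnorm(n)  # y does not depend on x at all, so beta_1 = 0
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}
pvals <- replicate(1000, p.under.null())
mean(pvals <= 0.05)  # should be close to 0.05
hist(pvals)          # should look roughly flat on [0, 1]
```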
## Star-Gazing Is a Bad Way to Pick Variables {.smaller}
```{r}
n <- 1e4; p <- 40
predictors <- matrix(rnorm(n*p), nrow=n, ncol=p); response <- rnorm(n) # no relationship at all
summary(lm(response ~ predictors))
```
## What's the Point of Significance Testing?
> - You have a real reason to think $\beta_j=0$ is a "live option"
> + **And** you've included the right covariates
> + **And** you're using the right kind of model
> + Retaining $H_0$ is good evidence only when the test has power to detect that $H_0$ is false
> - Similarly: you have reasons why $\beta_j=b$ might be worth considering
> - Or: you want to know whether $H_0$ is _close enough_ to true that you can't detect the difference
> + Especially: goodness-of-fit tests and diagnostics
> + "All the hypotheses that could be true" = confidence set
## Confidence Sets
- A confidence set (interval, rectangle, ellipse, ...) is a bet:
+ **Either** the true parameter is in the set, $\theta \in C$
+ **or** something really improbable happened (probability $=$ 1 $-$ confidence level)
+ (**or** the model is wrong)
- Every confidence set can be a test
+ Reject $\theta=\theta_0$ if $\theta_0 \not \in C$
+ Retain if $\theta_0 \in C$
- Every test gives a confidence set
+ $\theta_0 \in C$ if test doesn't reject $\theta_0$
+ Wald test for coefficients $\Rightarrow$ usual confidence intervals
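The last point can be checked directly in R (with simulated, made-up data): `confint()` gives exactly the interval you get by inverting the Wald test, estimate $\pm$ $t$-quantile $\times$ standard error:

```{r}
# Inverting the Wald test recovers the usual confidence interval
n <- 200
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
est <- coef(fit)["x"]
se <- sqrt(diag(vcov(fit)))["x"]
crit <- qt(0.975, df = n - 2)        # cutoff for a 5%-level two-sided Wald test
c(est - crit * se, est + crit * se)  # all the b's the test would retain
confint(fit, "x", level = 0.95)      # same interval
```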
## Simulations
- Nothing changes in the logic, just how we calculate the sampling distribution
- If you want to test a hypothesis, run simulations where that hypothesis is true, and see if they look like the data
+ E.g., we simulate with $\beta_j=0$ but all the other $\beta_k$ set to their optimal values
+ If we didn't know that $\hat{\beta}_j \sim \mathcal{N}(\beta_j, \StdErr{\hat{\beta}_j}^2)$ in the linear-Gaussian model, but we did have `rnorm()`, we could do this simulation, and we'd get $P$-values that match `lm`'s calculations
+ "Looks like" comes down to the test statistic
- In the homework, we simulated the hypothesis that $R_i = \frac{1}{M_i} + \epsilon_i$, with $\epsilon_i$ following a generalized $t$-distribution
+ Used the $t$ rather than normal to get a bit more flexibility in the shape
+ Also, $t$-distributed noise is used a lot in financial modeling
+ Many other possibilities for the noise
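A minimal sketch of this logic (a generic Gaussian example with made-up data, not the homework's $t$-noise model): simulate many data sets under the null, compute the test statistic on each, and see where the real statistic falls:

```{r}
# Simulation-based P-value for H_0: slope = 0, without using the t table
n <- 100
x <- rnorm(n)
y <- 1 + 0.3 * x + rnorm(n)   # the "observed" data
t.obs <- summary(lm(y ~ x))$coefficients["x", "t value"]
fit0 <- lm(y ~ 1)             # refit with the slope forced to zero
sim.t <- replicate(1000, {
  y.null <- fitted(fit0) + rnorm(n, sd = sigma(fit0))  # simulate under H_0
  summary(lm(y.null ~ x))$coefficients["x", "t value"]
})
p.sim <- mean(abs(sim.t) >= abs(t.obs))  # fraction of null sims as extreme
p.sim
```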
## Bootstrapping
- In the bootstrap, we don't simulate a fixed model, we craft a simulation so it's close to the data
+ Either by fitting a model and simulating from it
+ Or by resampling
- This is good at getting things like standard errors and confidence regions
+ Confidence region = set calculated from the data which contains the true parameter with high probability
+ Can always invert a confidence region to test a hypothesis
- Advantage of resampling is that it assumes very little about the true distribution
- Advantage of model-based bootstrap is that it uses the data more efficiently, _if_ the model is good
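A resampling-bootstrap sketch for the standard error of a slope (made-up data; the model-based version would simulate from the fitted model instead of resampling rows):

```{r}
# Case (row) resampling bootstrap for the standard error of a slope
n <- 200
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)
boot.slope <- replicate(1000, {
  resample <- dat[sample(nrow(dat), replace = TRUE), ]  # resample data points
  coef(lm(y ~ x, data = resample))["x"]
})
sd(boot.slope)                                  # bootstrap standard error
sqrt(diag(vcov(lm(y ~ x, data = dat))))["x"]    # close to the formula-based se
```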
## Summing Up
- When faced with a hypothesis test, ask:
+ What exactly are we testing, and why?
+ What is the null hypothesis, and what alternative are we testing against?
+ What test statistic do we use to discriminate between null and alternative?
+ How do we find the distribution of the test statistic under the null?
+ Do we have power to detect that alternative with that test statistic?
- Often, what you really want is a confidence set
+ $=$ all the specific hypotheses the data would let you retain
+ Still need to worry about model accuracy to calculate probabilities
- Simulation is a way of calculating probabilities
+ Can simulate under specific hypotheses
+ Or simulate under something close to the real distribution (bootstrap)
## References
- The correct line on hypothesis tests, and when, why and how they are useful: @Mayo-error, @Mayo-Cox-frequentist
- $P$-values tend to zero exponentially fast (unless the null is true): @Bahadur-introduces-Bahadur-efficiency, @Bahadur-limit-theorems, @van-der-Vaart-asymptotic-stats
- Crash course in linear regression modeling, including what's actually being tested, diagnostics, model-building: <http://www.stat.cmu.edu/~cshalizi/TALR/>
+ Chapters 7, 12 and 16 are especially relevant to today
- All the linear-Gaussian-model theory you could ever want: @Seber-Lee-linear-reg-2nd
## From discussion: "P-Hacking" {.smaller}
- Manipulating the data and/or analysis until you get the significant results you want
+ Or, sometimes, any significance at all (cf. the [example I tried to describe in class](https://arstechnica.com/science/2017/04/the-peer-reviewed-saga-of-mindless-eating-mindless-research-is-bad-too/))
- Three main ways it happens:
+ Deciding which data points to include or exclude as outliers
+ Deciding how to define variables --- [Recent example](http://dx.doi.org/10.1038/s41562-018-0506-1) with different measures of "screen time" and "depression"
+ Deciding which model specification to use
- Andy Gelman calls this the "garden of forking paths" (after the @Borges-ficciones story about alternate histories)
- Brutal solution to $p$-hacking by specification search: data-splitting
+ Randomly split the data into two parts
+ Do exploration, try out variations, etc., on one part, and _only_ on one part
+ _After_ you have fixed on a model, estimate it and look at inferential statistics using second part
+ Do not use the second part for anything else
+ Cf. Exercise 3.4 in Chapter 3 (computationally challenging)
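The split itself is one line of R (sketched below with made-up data and a hypothetical final model); the discipline of touching the second half only once is the hard part:

```{r}
# Data splitting: explore on one half, do inference once on the other
n <- 500
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
in.explore <- sample(n, size = n / 2)  # random half for exploration
explore <- dat[in.explore, ]
confirm <- dat[-in.explore, ]
# ... try models, transformations, outlier rules on `explore` only ...
final.model <- lm(y ~ x, data = explore)  # settle on one specification
summary(lm(y ~ x, data = confirm))        # honest p-values from the held-out half
```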
## From discussion: What about the F test?
- R reports an F test at the bottom of its output for `lm()`
#### What hypothesis does this test?
- $H_{0F}$: The linear-Gaussian model is right, and $\beta_1=\beta_2=\beta_3=0$ vs. The linear-Gaussian model is right, and not all slopes are exactly 0
- This is **not** testing "Is the linear-Gaussian model right?"
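In R, the same F statistic can be obtained by explicitly comparing the full model to the intercept-only model with `anova()` (sketched with simulated data rather than the stocks data):

```{r}
# The overall F test compares the full model to the intercept-only model
n <- 100
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
full <- lm(y ~ x + I(x^2))
null.model <- lm(y ~ 1)        # all slopes forced to zero
anova(null.model, full)        # same F and p-value as in summary(full)
summary(full)$fstatistic       # F statistic with its degrees of freedom
```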
## Bibliography