Linear Regression and Hypothesis Testing (Once More with Feeling)

stocks <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/hw/03/stock_history.csv")
stocks$MAPE <- with(stocks, Price/Earnings_10MA_back)
returns.on.mape.transforms <- lm(Return_10_fwd ~ I(1/MAPE) + MAPE + I(MAPE^2), data=stocks)
summary(returns.on.mape.transforms)

## 
## Call:
## lm(formula = Return_10_fwd ~ I(1/MAPE) + MAPE + I(MAPE^2), data = stocks)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.110743 -0.029043  0.002934  0.028354  0.099453 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.550e-02  2.612e-02   0.976    0.329    
## I(1/MAPE)    7.356e-01  1.268e-01   5.801 8.07e-09 ***
## MAPE        -2.194e-04  1.559e-03  -0.141    0.888    
## I(MAPE^2)   -3.578e-05  2.679e-05  -1.336    0.182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0421 on 1480 degrees of freedom
##   (240 observations deleted due to missingness)
## Multiple R-squared:  0.358,  Adjusted R-squared:  0.3567 
## F-statistic: 275.1 on 3 and 1480 DF,  p-value: < 2.2e-16

What Is the Model? What Are the Hypotheses?

\[ R_i = \beta_0 + \beta_1\frac{1}{M_i} + \beta_2 M_i + \beta_3 M_i^2 + \epsilon_i, ~ \Expect{\epsilon|M} = 0 \] lm also assumes \(\epsilon_i\) IID Gaussian, \(~ \mathcal{N}(0, \sigma^2)\)

Sampling distribution of \(\hat{\beta}_j\) under all these assumptions: \[ \hat{\beta}_j \sim \mathcal{N}(\beta_j, \StdErr{\beta_j}^2) \]

What Hypotheses Does `lm` Test?

For \(\beta_0\), \(H_{00}\): That model is right, and \(\beta_0=0\) exactly vs. That model is right, and \(\beta_0 \neq 0\)
For \(\beta_1\), \(H_{01}\): That model is right, and \(\beta_1=0\) exactly vs. That model is right, and \(\beta_1 \neq 0\)
For \(\beta_2\), \(H_{02}\): That model is right, and \(\beta_2=0\) exactly vs. That model is right, and \(\beta_2 \neq 0\)
For \(\beta_3\), \(H_{03}\): That model is right, and \(\beta_3=0\) exactly vs. That model is right, and \(\beta_3 \neq 0\)

How Does R Test These Hypotheses?

Need a test statistic = a function of the data with different distributions under null and alternative
- Generally, we pick test statistics which tend to be small under the null, and large under the alternative
R uses the Wald test, where the test statistic is \[ T_j \equiv \frac{\hat{\beta}_j}{\EstStdErr{\hat{\beta}_j}} \]
- Works for testing \(\beta_j=0\) — how to modify to test \(\beta_j=b \neq 0\)?
We know the sampling distribution of \(T_j\) under these modeling assumptions and the null hypothesis \(H_{0j}\): \[ T_j \sim t_{n-4} \approx N(0,1) ~ \mathrm{for\ large}\ n \]
We get a certain value of \(T_j\) on the data, say \(\hat{T}_j\) \[ P_j=\Prob{|T_j| \geq |\hat{T}_j|; H_{0j}} \]

What “Statistically Significant” Means

We can reliably or confidently detect a difference from null in the direction of the alternative
- We wouldn’t see this big a departure from the null, in the direction of the alternative, if the null were true
- unless we were really unlucky
- or unless the model is wrong
“Statistically detectable” might have been a better name

What Tends to Make Things Significant?

\(T_j = \hat{\beta}_j/\EstStdErr{\hat{\beta}_j}\)
As \(n\rightarrow \infty\), \(\hat{\beta}_j \rightarrow \beta_j\) and \(\EstStdErr{\hat{\beta}_j} \rightarrow \StdErr{\hat{\beta}_j}\)
So if \(\beta_j\) is large, \(T_j\) will tend to be large
What about \(\StdErr{\hat{\beta}_j}\)?
- \(\StdErr{\hat{\beta}_j} \propto \sigma\)
- \(\StdErr{\hat{\beta}_j} \propto 1/\sqrt{n}\)
- As \(\Var{X_j} \uparrow \infty\), \(\StdErr{\hat{\beta}_j} \downarrow 0\)
- As \(X_j\) becomes correlated with the other \(X_k\)’s, \(\StdErr{\hat{\beta}_j} \uparrow \infty\) (worst case: collinearity)
- (Really: MSE of a linear regression of \(X_j\) on the other \(X_k\)’s)

What Tends to Make Things Significant?

For fixed \(n\), the detectably-different-from-zero coefficients belong to variables with
- Large true coefficients \(\beta_j\)
- Large variances of that variable, \(\Var{X_j}\)
- Little correlation with the other variables

What Tends to Make Things Significant?

x <- with(na.omit(stocks), cbind(1/MAPE, MAPE, MAPE^2))
colnames(x) <- c("1/MAPE", "MAPE", "MAPE^2")
var(x)

##               1/MAPE         MAPE       MAPE^2
## 1/MAPE  0.0009283645   -0.1662826    -5.736245
## MAPE   -0.1662826454   42.2134165  1736.549657
## MAPE^2 -5.7362445792 1736.5496566 77562.536040

cor(x)

##            1/MAPE       MAPE     MAPE^2
## 1/MAPE  1.0000000 -0.8399673 -0.6759933
## MAPE   -0.8399673  1.0000000  0.9597010
## MAPE^2 -0.6759933  0.9597010  1.0000000

What Makes Things Significant? (cont’d)

As \(n\) grows, every coefficient which isn’t exactly zero becomes significant
- \(T_j \rightarrow \pm \infty\), unless \(\beta_j=0\)
- \(P_j \rightarrow 0\) exponentially fast
- “The \(P\)-value is a measure of sample size”
- More here
If \(\beta_j=0\), then \(P_j \sim \mathrm{Unif}(0,1)\) for all \(n\)
- Assuming the model is right
- So about 1/20 of really-0 coefficients will look significant

Star-Gazing Is a Bad Way to Pick Variables

n <- 1e4; p <- 40
predictors <- matrix(rnorm(n*p), nrow=n, ncol=p); response<-rnorm(n) # no relationship at all
summary(lm(response ~ predictors))

## 
## Call:
## lm(formula = response ~ predictors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3460 -0.6627  0.0075  0.6704  3.7687 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.0041641  0.0100825  -0.413   0.6796  
## predictors1   0.0121260  0.0100739   1.204   0.2287  
## predictors2   0.0069030  0.0100144   0.689   0.4906  
## predictors3   0.0086452  0.0101618   0.851   0.3949  
## predictors4  -0.0021411  0.0100596  -0.213   0.8315  
## predictors5   0.0007777  0.0099655   0.078   0.9378  
## predictors6   0.0081664  0.0100775   0.810   0.4177  
## predictors7  -0.0039913  0.0100751  -0.396   0.6920  
## predictors8   0.0125959  0.0100846   1.249   0.2117  
## predictors9  -0.0012746  0.0100264  -0.127   0.8988  
## predictors10  0.0207467  0.0100252   2.069   0.0385 *
## predictors11  0.0078052  0.0099835   0.782   0.4343  
## predictors12 -0.0134649  0.0100076  -1.345   0.1785  
## predictors13  0.0054900  0.0101045   0.543   0.5869  
## predictors14  0.0034054  0.0100011   0.341   0.7335  
## predictors15 -0.0043387  0.0101054  -0.429   0.6677  
## predictors16 -0.0134847  0.0101699  -1.326   0.1849  
## predictors17  0.0172360  0.0100663   1.712   0.0869 .
## predictors18  0.0097484  0.0100367   0.971   0.3314  
## predictors19  0.0045021  0.0099840   0.451   0.6520  
## predictors20  0.0073811  0.0100638   0.733   0.4633  
## predictors21  0.0183324  0.0099965   1.834   0.0667 .
## predictors22  0.0101453  0.0101195   1.003   0.3161  
## predictors23 -0.0072479  0.0100816  -0.719   0.4722  
## predictors24  0.0010313  0.0101461   0.102   0.9190  
## predictors25 -0.0064636  0.0100027  -0.646   0.5182  
## predictors26 -0.0123409  0.0101139  -1.220   0.2224  
## predictors27  0.0129432  0.0100371   1.290   0.1972  
## predictors28 -0.0081849  0.0099898  -0.819   0.4126  
## predictors29  0.0040183  0.0101693   0.395   0.6927  
## predictors30  0.0076783  0.0100893   0.761   0.4467  
## predictors31  0.0024647  0.0099502   0.248   0.8044  
## predictors32  0.0058126  0.0099462   0.584   0.5590  
## predictors33  0.0063024  0.0100313   0.628   0.5298  
## predictors34  0.0079884  0.0099747   0.801   0.4232  
## predictors35 -0.0046969  0.0100587  -0.467   0.6405  
## predictors36  0.0212055  0.0100044   2.120   0.0341 *
## predictors37 -0.0138073  0.0099478  -1.388   0.1652  
## predictors38 -0.0028815  0.0100857  -0.286   0.7751  
## predictors39 -0.0201438  0.0101221  -1.990   0.0466 *
## predictors40  0.0173406  0.0101536   1.708   0.0877 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.006 on 9959 degrees of freedom
## Multiple R-squared:  0.0043, Adjusted R-squared:  0.000301 
## F-statistic: 1.075 on 40 and 9959 DF,  p-value: 0.344

What’s the Point of Significance Testing?

You have a real reason to think \(\beta_j=0\) is a “live option”
- And you’ve included the right covariates
- And you’re using the right kind of model
- Good evidence when test has power to detect if \(H_0\) is false
Similarly: you have reasons why \(\beta_j=b\) might be worth considering
Or: you want to know whether \(H_0\) is close enough to true that you can’t detect the difference
- Especially: goodness-of-fit tests and diagnostics
- “All the hypotheses that could be true” = confidence set

Confidence Sets

A confidence set (interval, rectangle, ellipse, …) is a bet:
- Either the true parameter is in the set, \(\theta \in C\)
- or something really improbable happened (1-confidence level)
- (or the model is wrong)
Every confidence set can be a test
- Reject \(\theta=\theta_0\) if \(\theta_0 \not \in C\)
- Retain if \(\theta_0 \in C\)
Every test gives a confidence set
- \(\theta_0 \in C\) if test doesn’t reject \(\theta_0\)
- Wald test for coefficients \(\Rightarrow\) usual confidence intervals

Simulations

Nothing changes in the logic, just how we calculate the sampling distribution
If you want to test a hypothesis, run simulations where that hypothesis is true, and see if they look like the data
- E.g., we simulate with \(\beta_j=0\) but all the other \(\beta_k\) set to their optimal values
- If we didn’t know that \(\hat{\beta_j} \sim N(\beta_j, \StdErr{\hat{\beta}_j}^2)\) in the linear-Gaussian model, but we did have rnorm(), we could do this simulation, and we’d get \(P\)-values that match lm’s calculations
- “Looks like” comes down to the test statistic
In the homework, we simulated the hypothesis that \(R_i = \frac{1}{M_i} + \epsilon_i\), with \(\epsilon_i\) following a generalized \(t\)-distribution
- Used the \(t\) rather than normal to get a bit more flexibility in the shape
- Also, \(t\)-distributed noise is used a lot in financial modeling
- Many other possibilities for the noise

Bootstrapping

In the bootstrap, we don’t simulate a fixed model, we craft a simulation so it’s close to the data
- Either by fitting a model to simulate
- Or by resampling
This is good at getting things like standard errors and confidence regions
- Confidence region = our estimate would be in this set with high probability
- Can always invert a confidence region to test a hypothesis
Advantage of resampling is that it assumes very little about the true distribution
Advantage of model-based bootstrap is that it uses the data more efficiently, if the model is good

Summing Up

When faced with a hypothesis test, ask:
- What exactly are we testing, and why?
- What is the null hypothesis, and what alternative are we testing against?
- What test statistic do we use to discriminate between null and alternative?
- How do we find the distribution of the test statistic under the null?
- Do we have power to detect that alternative with that test statistic?
Often, what you really want is a confidence set
- \(=\) all the specific hypotheses the data would let you retain
- Still need to worry about model accuracy to calculate probabilities
Simulation is a way of calculating probabilities
- Can simulate under specific hypotheses
- Or simulate under something close to the real distribution (bootstrap)

References

The correct line on hypothesis tests, and when, why and how they are useful: Mayo (1996), Mayo and Cox (2006)
\(P\)-values tend to zero exponentially fast (unless the null is true): Bahadur (1967), Bahadur (1971), Vaart (1998)
Crash course in linear regression modeling, including’s actually being tested, diagnostics, model-building: [http://www.stat.cmu.edu/~cshalizi/TALR/]
- Chapters 7, 12 and 16 are especially relevant to today
All the linear-Gaussian-model theory you could ever want: Seber and Lee (2003)

From discussion: “P-Hacking”

Manipulating the data and/or analysis until you get the significant results you want
- Or, sometimes, any significance at all (cf. the example I tried to describe in class )
Three main ways it happens:
- Deciding which data points to include or exclude as outliers
- Deciding how to define variables — Recent example with different measures of “screen time” and “depression”
- Deciding which model specification to use
Andy Gelman calls this the “garden of forking paths” (after the Borges (n.d.) story about alternate histories)
Brutal solution to \(p\)-hacking by specification search: data-splitting
- Randomly split the data into two parts
- Do exploration, try out variations, etc., on one part, and only on one part
- After you have fixed on a model, estimate it and look at inferential statistics using second part
- Do not use the second part for anything else
- Cf. Exercise 3.4 in Chapter 3 (computationally challenging)

From discussion: What about the F test?

R reports an F test at the bottom of its output for lm()

What hypothesis does this test?

\(H_{0F}\): The linear-Gaussian model is right, and \(\beta_1=\beta_2=\beta_3=0\) vs. The linear-Gaussian model is right, and not all slopes are exactly 0
This is not testing “Is the linear-Gaussian model right?”

Bibliography

Bahadur, R. R. 1967. “Rates of Convergence of Estimates and Test Statistics.” Annals of Mathematical Statistics 38:303–24. https://doi.org/10.1214/aoms/1177698949.

———. 1971. Some Limit Theorems in Statistics. Philadelphia: SIAM Press.

Borges, Jorge Luis. n.d. Ficciones. New York: Grove Press.

Mayo, Deborah G. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.

Mayo, Deborah G., and D. R. Cox. 2006. “Frequentist Statistics as a Theory of Inductive Inference.” In Optimality: The Second Erich L. Lehmann Symposium, edited by Javier Rojo, 77–97. Bethesda, Maryland: Institute of Mathematical Statistics. http://arxiv.org/abs/math.ST/0610846.

Seber, George A. F., and Alan J. Lee. 2003. Linear Regression Analysis. Second. New York: Wiley.

Vaart, A. W. van der. 1998. Asymptotic Statistics. Cambridge, England: Cambridge University Press.

Significance Testing etc.

Linear Regression and Hypothesis Testing (Once More with Feeling)

What Is the Model? What Are the Hypotheses?

What Hypotheses Does lm Test?

How Does R Test These Hypotheses?

What “Statistically Significant” Means

What Tends to Make Things Significant?

What Tends to Make Things Significant?

What Tends to Make Things Significant?

What Makes Things Significant? (cont’d)

Star-Gazing Is a Bad Way to Pick Variables

What’s the Point of Significance Testing?

Confidence Sets

Simulations

Bootstrapping

Summing Up

References

From discussion: “P-Hacking”

From discussion: What about the F test?

What hypothesis does this test?

Bibliography

What Hypotheses Does `lm` Test?