# Significance Testing etc.

7 February 2019

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\StdErr}[1]{\mathrm{se}\left[ #1 \right]} \newcommand{\EstStdErr}[1]{\widehat{\mathrm{se}}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]}$

# Linear Regression and Hypothesis Testing (Once More with Feeling)

```r
stocks <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/hw/03/stock_history.csv")
stocks$MAPE <- with(stocks, Price/Earnings_10MA_back)
returns.on.mape.transforms <- lm(Return_10_fwd ~ I(1/MAPE) + MAPE + I(MAPE^2), data=stocks)
summary(returns.on.mape.transforms)
##
## Call:
## lm(formula = Return_10_fwd ~ I(1/MAPE) + MAPE + I(MAPE^2), data = stocks)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.110743 -0.029043  0.002934  0.028354  0.099453
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.550e-02  2.612e-02   0.976    0.329
## I(1/MAPE)    7.356e-01  1.268e-01   5.801 8.07e-09 ***
## MAPE        -2.194e-04  1.559e-03  -0.141    0.888
## I(MAPE^2)   -3.578e-05  2.679e-05  -1.336    0.182
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0421 on 1480 degrees of freedom
##   (240 observations deleted due to missingness)
## Multiple R-squared:  0.358,  Adjusted R-squared:  0.3567
## F-statistic: 275.1 on 3 and 1480 DF,  p-value: < 2.2e-16
```

# What Is the Model? What Are the Hypotheses?

$R_i = \beta_0 + \beta_1\frac{1}{M_i} + \beta_2 M_i + \beta_3 M_i^2 + \epsilon_i, \quad \Expect{\epsilon|M} = 0$

• lm also assumes the $$\epsilon_i$$ are IID Gaussian, $$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

• Sampling distribution of $$\hat{\beta}_j$$ under all these assumptions: $\hat{\beta}_j \sim \mathcal{N}(\beta_j, \StdErr{\hat{\beta}_j}^2)$

#### What Hypotheses Does lm Test?

• For $$\beta_0$$, $$H_{00}$$: That model is right, and $$\beta_0=0$$ exactly vs. That model is right, and $$\beta_0 \neq 0$$
• For $$\beta_1$$, $$H_{01}$$: That model is right, and $$\beta_1=0$$ exactly vs. That model is right, and $$\beta_1 \neq 0$$
• For $$\beta_2$$, $$H_{02}$$: That model is right, and $$\beta_2=0$$ exactly vs. That model is right, and $$\beta_2 \neq 0$$
• For $$\beta_3$$, $$H_{03}$$: That model is right, and $$\beta_3=0$$ exactly vs. That model is right, and $$\beta_3 \neq 0$$

# How Does R Test These Hypotheses?

• Need a test statistic = a function of the data with different distributions under null and alternative
• Generally, we pick test statistics which tend to be small under the null, and large under the alternative
• R uses the Wald test, where the test statistic is $T_j \equiv \frac{\hat{\beta}_j}{\EstStdErr{\hat{\beta}_j}}$
• Works for testing $$\beta_j=0$$ — how to modify to test $$\beta_j=b \neq 0$$?
• We know the sampling distribution of $$T_j$$ under these modeling assumptions and the null hypothesis $$H_{0j}$$: $T_j \sim t_{n-4} \approx N(0,1) ~ \mathrm{for\ large}\ n$
• We get a certain value of $$T_j$$ on the data, say $$\hat{T}_j$$; the $$p$$-value is the probability, under the null, of a value at least that extreme: $P_j=\Prob{|T_j| \geq |\hat{T}_j|; H_{0j}}$
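As a hedged sketch of what those columns of `summary(lm())` actually compute, here is the Wald calculation done by hand on simulated data (the data and names below are made up for illustration, not the stocks analysis):

```r
# Sketch: recompute lm's Wald t-statistics and two-sided p-values by hand.
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)                 # true intercept 0, true slope 0.5
fit <- lm(y ~ x)
beta.hat <- coef(fit)                   # point estimates
se.hat <- sqrt(diag(vcov(fit)))         # estimated standard errors
t.stat <- beta.hat / se.hat             # Wald statistics for H0: beta_j = 0
p.val <- 2 * pt(-abs(t.stat), df = fit$df.residual)  # two-sided t p-values
# These match the "t value" and "Pr(>|t|)" columns of summary(fit)
```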

# What “Statistically Significant” Means

• We can reliably or confidently detect a difference from null in the direction of the alternative
• We wouldn’t see this big a departure from the null, in the direction of the alternative, if the null were true
• unless we were really unlucky
• or unless the model is wrong
• “Statistically detectable” might have been a better name

# What Tends to Make Things Significant?

• $$T_j = \hat{\beta}_j/\EstStdErr{\hat{\beta}_j}$$
• As $$n\rightarrow \infty$$, $$\hat{\beta}_j \rightarrow \beta_j$$ and $$\EstStdErr{\hat{\beta}_j} \rightarrow \StdErr{\hat{\beta}_j}$$
• So if $$\beta_j$$ is large, $$T_j$$ will tend to be large
• What about $$\StdErr{\hat{\beta}_j}$$?
• $$\StdErr{\hat{\beta}_j} \propto \sigma$$
• $$\StdErr{\hat{\beta}_j} \propto 1/\sqrt{n}$$
• As $$\Var{X_j} \uparrow \infty$$, $$\StdErr{\hat{\beta}_j} \downarrow 0$$
• As $$X_j$$ becomes correlated with the other $$X_k$$’s, $$\StdErr{\hat{\beta}_j} \uparrow \infty$$ (worst case: collinearity)
• (Really: MSE of a linear regression of $$X_j$$ on the other $$X_k$$’s)
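A quick simulation sketch of these claims; the function, sample sizes, and correlation level below are all made up for illustration:

```r
# Sketch: how the slope's standard error responds to n, Var(X), and
# collinearity with the other regressors.
se.of.slope <- function(n, sd.x = 1, rho = 0) {
  z <- rnorm(n)
  x1 <- sd.x * (rho * z + sqrt(1 - rho^2) * rnorm(n))  # cor(x1, x2) ~ rho
  x2 <- z
  y <- x1 + x2 + rnorm(n)
  coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"]
}
set.seed(42)
se.base <- se.of.slope(100)               # baseline
se.bign <- se.of.slope(10000)             # larger n => smaller se (~1/sqrt(n))
se.wide <- se.of.slope(100, sd.x = 10)    # more spread in x1 => smaller se
se.coll <- se.of.slope(100, rho = 0.99)   # near-collinearity => much bigger se
```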

# What Tends to Make Things Significant?

• For fixed $$n$$, the detectably-different-from-zero coefficients belong to variables with
• Large true coefficients $$\beta_j$$
• Large variances of that variable, $$\Var{X_j}$$
• Little correlation with the other variables

# What Tends to Make Things Significant?

```r
x <- with(na.omit(stocks), cbind(1/MAPE, MAPE, MAPE^2))
colnames(x) <- c("1/MAPE", "MAPE", "MAPE^2")
var(x)
##               1/MAPE         MAPE       MAPE^2
## 1/MAPE  0.0009283645   -0.1662826    -5.736245
## MAPE   -0.1662826454   42.2134165  1736.549657
## MAPE^2 -5.7362445792 1736.5496566 77562.536040
cor(x)
##            1/MAPE       MAPE     MAPE^2
## 1/MAPE  1.0000000 -0.8399673 -0.6759933
## MAPE   -0.8399673  1.0000000  0.9597010
## MAPE^2 -0.6759933  0.9597010  1.0000000
```

# What Makes Things Significant? (cont’d)

• As $$n$$ grows, every coefficient which isn’t exactly zero becomes significant
• $$T_j \rightarrow \pm \infty$$, unless $$\beta_j=0$$
• $$P_j \rightarrow 0$$ exponentially fast
• “The $$P$$-value is a measure of sample size”
• If $$\beta_j=0$$, then $$P_j \sim \mathrm{Unif}(0,1)$$ for all $$n$$
• Assuming the model is right
• So about 1/20 of really-0 coefficients will look significant
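A sketch of that last point: regress pure noise on pure noise many times and look at the resulting p-values (sample size and replication count here are arbitrary choices):

```r
# Sketch: when H0 is exactly true (and the model right), p-values are
# Uniform(0,1), so about 1/20 of them fall below 0.05.
set.seed(27)
p.vals <- replicate(2000, {
  x <- rnorm(50); y <- rnorm(50)          # true slope exactly 0
  coef(summary(lm(y ~ x)))["x", "Pr(>|t|)"]
})
mean(p.vals < 0.05)                       # close to 1/20, as promised
```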

# Star-Gazing Is a Bad Way to Pick Variables

```r
n <- 1e4; p <- 40
predictors <- matrix(rnorm(n*p), nrow=n, ncol=p); response <- rnorm(n) # no relationship at all
summary(lm(response ~ predictors))
##
## Call:
## lm(formula = response ~ predictors)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3460 -0.6627  0.0075  0.6704  3.7687
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0041641  0.0100825  -0.413   0.6796
## predictors1   0.0121260  0.0100739   1.204   0.2287
## predictors2   0.0069030  0.0100144   0.689   0.4906
## predictors3   0.0086452  0.0101618   0.851   0.3949
## predictors4  -0.0021411  0.0100596  -0.213   0.8315
## predictors5   0.0007777  0.0099655   0.078   0.9378
## predictors6   0.0081664  0.0100775   0.810   0.4177
## predictors7  -0.0039913  0.0100751  -0.396   0.6920
## predictors8   0.0125959  0.0100846   1.249   0.2117
## predictors9  -0.0012746  0.0100264  -0.127   0.8988
## predictors10  0.0207467  0.0100252   2.069   0.0385 *
## predictors11  0.0078052  0.0099835   0.782   0.4343
## predictors12 -0.0134649  0.0100076  -1.345   0.1785
## predictors13  0.0054900  0.0101045   0.543   0.5869
## predictors14  0.0034054  0.0100011   0.341   0.7335
## predictors15 -0.0043387  0.0101054  -0.429   0.6677
## predictors16 -0.0134847  0.0101699  -1.326   0.1849
## predictors17  0.0172360  0.0100663   1.712   0.0869 .
## predictors18  0.0097484  0.0100367   0.971   0.3314
## predictors19  0.0045021  0.0099840   0.451   0.6520
## predictors20  0.0073811  0.0100638   0.733   0.4633
## predictors21  0.0183324  0.0099965   1.834   0.0667 .
## predictors22  0.0101453  0.0101195   1.003   0.3161
## predictors23 -0.0072479  0.0100816  -0.719   0.4722
## predictors24  0.0010313  0.0101461   0.102   0.9190
## predictors25 -0.0064636  0.0100027  -0.646   0.5182
## predictors26 -0.0123409  0.0101139  -1.220   0.2224
## predictors27  0.0129432  0.0100371   1.290   0.1972
## predictors28 -0.0081849  0.0099898  -0.819   0.4126
## predictors29  0.0040183  0.0101693   0.395   0.6927
## predictors30  0.0076783  0.0100893   0.761   0.4467
## predictors31  0.0024647  0.0099502   0.248   0.8044
## predictors32  0.0058126  0.0099462   0.584   0.5590
## predictors33  0.0063024  0.0100313   0.628   0.5298
## predictors34  0.0079884  0.0099747   0.801   0.4232
## predictors35 -0.0046969  0.0100587  -0.467   0.6405
## predictors36  0.0212055  0.0100044   2.120   0.0341 *
## predictors37 -0.0138073  0.0099478  -1.388   0.1652
## predictors38 -0.0028815  0.0100857  -0.286   0.7751
## predictors39 -0.0201438  0.0101221  -1.990   0.0466 *
## predictors40  0.0173406  0.0101536   1.708   0.0877 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.006 on 9959 degrees of freedom
## Multiple R-squared:  0.0043, Adjusted R-squared:  0.000301
## F-statistic: 1.075 on 40 and 9959 DF,  p-value: 0.344
```

# What’s the Point of Significance Testing?

• You have a real reason to think $$\beta_j=0$$ is a “live option”
• And you’ve included the right covariates
• And you’re using the right kind of model
• Rejection is good evidence against $$H_0$$ only when the test has power to detect the alternative if $$H_0$$ is false
• Similarly: you have reasons why $$\beta_j=b$$ might be worth considering
• Or: you want to know whether $$H_0$$ is close enough to true that you can’t detect the difference
• Especially: goodness-of-fit tests and diagnostics
• “All the hypotheses that could be true” = confidence set

# Confidence Sets

• A confidence set (interval, rectangle, ellipse, …) is a bet:
• Either the true parameter is in the set, $$\theta \in C$$
• or something really improbable happened (probability = 1 − confidence level)
• (or the model is wrong)
• Every confidence set can be a test
• Reject $$\theta=\theta_0$$ if $$\theta_0 \not \in C$$
• Retain if $$\theta_0 \in C$$
• Every test gives a confidence set
• $$\theta_0 \in C$$ if test doesn’t reject $$\theta_0$$
• Wald test for coefficients $$\Rightarrow$$ usual confidence intervals
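A sketch of that last bullet, on simulated data: inverting the Wald $$t$$-test by hand reproduces what `confint()` reports for `lm`.

```r
# Sketch: the usual 95% confidence interval = all values b of the slope
# that the Wald t-test would not reject at the 5% level.
set.seed(99)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
est  <- coef(fit)["x"]
se   <- coef(summary(fit))["x", "Std. Error"]
crit <- qt(0.975, df = fit$df.residual)   # two-sided 5% critical value
by.hand <- c(est - crit * se, est + crit * se)  # the retained values of b
# confint(fit, "x", level = 0.95) gives the same interval
```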

# Simulations

• Nothing changes in the logic, just how we calculate the sampling distribution
• If you want to test a hypothesis, run simulations where that hypothesis is true, and see if they look like the data
• E.g., we simulate with $$\beta_j=0$$ but all the other $$\beta_k$$ set to their optimal values
• If we didn’t know that $$\hat{\beta}_j \sim \mathcal{N}(\beta_j, \StdErr{\hat{\beta}_j}^2)$$ in the linear-Gaussian model, but we did have rnorm(), we could do this simulation, and we’d get $$P$$-values that match lm’s calculations
• “Looks like” comes down to the test statistic
• In the homework, we simulated the hypothesis that $$R_i = \frac{1}{M_i} + \epsilon_i$$, with $$\epsilon_i$$ following a generalized $$t$$-distribution
• Used the $$t$$ rather than normal to get a bit more flexibility in the shape
• Also, $$t$$-distributed noise is used a lot in financial modeling
• Many other possibilities for the noise
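The general recipe, as a small hedged sketch (a toy null about a mean with $$t$$-distributed noise, made up for illustration, not the homework's model):

```r
# Sketch of the simulation recipe: simulate worlds where H0 is true,
# compute the test statistic in each, and see where the data's value falls.
set.seed(5)
observed <- rt(30, df = 5) + 0.3          # stand-in for "the data"
stat.obs <- mean(observed)                # test statistic on the data
stat.sim <- replicate(5000, mean(rt(30, df = 5)))  # statistic under H0
p.sim <- mean(abs(stat.sim) >= abs(stat.obs))      # two-sided p-value
```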

# Bootstrapping

• In the bootstrap, we don’t simulate a fixed model, we craft a simulation so it’s close to the data
• Either by fitting a model to simulate
• Or by resampling
• This is good at getting things like standard errors and confidence regions
• Confidence region = a set that contains the true parameter with high probability
• Can always invert a confidence region to test a hypothesis
• Advantage of resampling is that it assumes very little about the true distribution
• Advantage of model-based bootstrap is that it uses the data more efficiently, if the model is good
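A sketch of the resampling version, on simulated data (names and settings made up): resample rows of the data with replacement, re-fit, and take the spread of the re-fitted coefficients as the standard error.

```r
# Sketch: case-resampling bootstrap for a slope's standard error.
set.seed(11)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
boot.slope <- function() {
  idx <- sample(n, replace = TRUE)        # resample rows with replacement
  coef(lm(y ~ x, data = dat[idx, ]))["x"]
}
boot.se <- sd(replicate(1000, boot.slope()))
lm.se <- coef(summary(lm(y ~ x, data = dat)))["x", "Std. Error"]
# boot.se and lm.se should be close here, since the model is right
```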

# Summing Up

• When faced with a hypothesis test, ask:
• What exactly are we testing, and why?
• What is the null hypothesis, and what alternative are we testing against?
• What test statistic do we use to discriminate between null and alternative?
• How do we find the distribution of the test statistic under the null?
• Do we have power to detect that alternative with that test statistic?
• Often, what you really want is a confidence set
• $$=$$ all the specific hypotheses the data would let you retain
• Still need to worry about model accuracy to calculate probabilities
• Simulation is a way of calculating probabilities
• Can simulate under specific hypotheses
• Or simulate under something close to the real distribution (bootstrap)

# From discussion: “P-Hacking”

• Manipulating the data and/or analysis until you get the significant results you want
• Three main ways it happens:
• Deciding which data points to include or exclude as outliers
• Deciding how to define variables — Recent example with different measures of “screen time” and “depression”
• Deciding which model specification to use
• Andy Gelman calls this the “garden of forking paths” (after the Borges (n.d.) story about alternate histories)
• Brutal solution to $$p$$-hacking by specification search: data-splitting
• Randomly split the data into two parts
• Do exploration, try out variations, etc., on one part, and only on one part
• After you have fixed on a model, estimate it and look at inferential statistics using second part
• Do not use the second part for anything else
• Cf. Exercise 3.4 in Chapter 3 (computationally challenging)
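The data-splitting recipe, as a minimal sketch (the data, split size, and final specification below are all made up for illustration):

```r
# Sketch of data splitting to guard against specification search.
set.seed(17)
n <- 500
dat <- data.frame(x = rnorm(n))
dat$y <- 0.3 * dat$x + rnorm(n)
explore.idx <- sample(n, n / 2)           # random half for exploration
explore <- dat[explore.idx, ]             # try specifications here, freely
confirm <- dat[-explore.idx, ]            # touch only once, at the very end
# ... after settling on a specification using `explore` alone:
final <- lm(y ~ x, data = confirm)        # honest p-values come from this fit
```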

# From discussion: What about the F test?

• R reports an F test at the bottom of its output for lm()

#### What hypothesis does this test?

• $$H_{0F}$$: The linear-Gaussian model is right, and $$\beta_1=\beta_2=\beta_3=0$$ vs. The linear-Gaussian model is right, and not all slopes are exactly 0
• This is not testing “Is the linear-Gaussian model right?”
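A sketch of what that F statistic is, on simulated data: it is the `anova()` comparison of the full model against the intercept-only model (i.e., all slopes forced to zero).

```r
# Sketch: summary(lm)'s F statistic = anova() of full vs. intercept-only.
set.seed(3)
x1 <- rnorm(100); x2 <- rnorm(100)
y <- 0.5 * x1 + rnorm(100)
full <- lm(y ~ x1 + x2)
slopes.zero <- lm(y ~ 1)                  # the null: all slopes exactly 0
comparison <- anova(slopes.zero, full)    # F and p match summary(full)
```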

# Bibliography

Bahadur, R. R. 1967. “Rates of Convergence of Estimates and Test Statistics.” Annals of Mathematical Statistics 38:303–24. https://doi.org/10.1214/aoms/1177698949.

———. 1971. Some Limit Theorems in Statistics. Philadelphia: SIAM Press.

Borges, Jorge Luis. n.d. Ficciones. New York: Grove Press.

Mayo, Deborah G. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.

Mayo, Deborah G., and D. R. Cox. 2006. “Frequentist Statistics as a Theory of Inductive Inference.” In Optimality: The Second Erich L. Lehmann Symposium, edited by Javier Rojo, 77–97. Bethesda, Maryland: Institute of Mathematical Statistics. http://arxiv.org/abs/math.ST/0610846.

Seber, George A. F., and Alan J. Lee. 2003. Linear Regression Analysis. Second. New York: Wiley.

Vaart, A. W. van der. 1998. Asymptotic Statistics. Cambridge, England: Cambridge University Press.