36-402, Section A

7 February 2019

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\StdErr}[1]{\mathrm{se}\left[ #1 \right]} \newcommand{\EstStdErr}[1]{\widehat{\mathrm{se}}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \]

```
stocks <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/hw/03/stock_history.csv")
stocks$MAPE <- with(stocks, Price/Earnings_10MA_back)
returns.on.mape.transforms <- lm(Return_10_fwd ~ I(1/MAPE) + MAPE + I(MAPE^2), data=stocks)
summary(returns.on.mape.transforms)
```

```
##
## Call:
## lm(formula = Return_10_fwd ~ I(1/MAPE) + MAPE + I(MAPE^2), data = stocks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.110743 -0.029043 0.002934 0.028354 0.099453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.550e-02 2.612e-02 0.976 0.329
## I(1/MAPE) 7.356e-01 1.268e-01 5.801 8.07e-09 ***
## MAPE -2.194e-04 1.559e-03 -0.141 0.888
## I(MAPE^2) -3.578e-05 2.679e-05 -1.336 0.182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0421 on 1480 degrees of freedom
## (240 observations deleted due to missingness)
## Multiple R-squared: 0.358, Adjusted R-squared: 0.3567
## F-statistic: 275.1 on 3 and 1480 DF, p-value: < 2.2e-16
```

\[
R_i = \beta_0 + \beta_1\frac{1}{M_i} + \beta_2 M_i + \beta_3 M_i^2 + \epsilon_i, ~ \Expect{\epsilon|M} = 0
\]

`lm` also assumes the \(\epsilon_i\) are IID Gaussian, \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\)

- Sampling distribution of \(\hat{\beta}_j\) under *all* these assumptions: \[ \hat{\beta}_j \sim \mathcal{N}(\beta_j, \StdErr{\hat{\beta}_j}^2) \]

What, exactly, is `lm` testing?

- For \(\beta_0\), \(H_{00}\): that model is right, and \(\beta_0=0\) exactly, vs. that model is right, and \(\beta_0 \neq 0\)
- For \(\beta_1\), \(H_{01}\): that model is right, and \(\beta_1=0\) exactly, vs. that model is right, and \(\beta_1 \neq 0\)
- For \(\beta_2\), \(H_{02}\): that model is right, and \(\beta_2=0\) exactly, vs. that model is right, and \(\beta_2 \neq 0\)
- For \(\beta_3\), \(H_{03}\): that model is right, and \(\beta_3=0\) exactly, vs. that model is right, and \(\beta_3 \neq 0\)

- Need a **test statistic** = a function of the data with different distributions under the null and the alternative
- Generally, we pick test statistics which tend to be small under the null and large under the alternative

- R uses the **Wald test**, where the test statistic is \[ T_j \equiv \frac{\hat{\beta}_j}{\EstStdErr{\hat{\beta}_j}} \]
- This works for testing \(\beta_j=0\) — how would you modify it to test \(\beta_j=b \neq 0\)?

- We know the sampling distribution of \(T_j\) under these modeling assumptions *and* the null hypothesis \(H_{0j}\): \[ T_j \sim t_{n-4} \approx \mathcal{N}(0,1) ~ \mathrm{for\ large}\ n \]
- We get a certain value of \(T_j\) on the data, say \(\hat{T}_j\); the \(p\)-value is \[ P_j=\Prob{|T_j| \geq |\hat{T}_j|; H_{0j}} \]
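To see where those columns of `summary(lm(...))` come from, we can recompute the Wald statistics and \(p\)-values by hand. This sketch uses R's built-in `cars` data rather than the stock data, purely so that it is self-contained:

```
# Recompute lm's t statistics and p-values by hand (built-in cars data)
fit <- lm(dist ~ speed, data = cars)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
T.stat <- est / se                              # Wald statistics
p.val  <- 2 * pt(-abs(T.stat), df = df.residual(fit))
cbind(T.stat, p.val)  # matches the "t value" and "Pr(>|t|)" columns
# (to test beta_j = b instead of beta_j = 0, use (est - b) / se)
```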

- We can reliably or confidently detect a difference from the null in the direction of the alternative
- We wouldn’t see this big a departure from the null, in the direction of the alternative, *if the null were true*
  - unless we were really unlucky
  - or unless the model is wrong
- “Statistically detectable” might have been a better name

- \(T_j = \hat{\beta}_j/\EstStdErr{\hat{\beta}_j}\)
- As \(n\rightarrow \infty\), \(\hat{\beta}_j \rightarrow \beta_j\) and \(\EstStdErr{\hat{\beta}_j} \rightarrow \StdErr{\hat{\beta}_j}\)
- So if \(\beta_j\) is large, \(T_j\) will tend to be large
- What about \(\StdErr{\hat{\beta}_j}\)?
- \(\StdErr{\hat{\beta}_j} \propto \sigma\)
- \(\StdErr{\hat{\beta}_j} \propto 1/\sqrt{n}\)
- As \(\Var{X_j} \uparrow \infty\), \(\StdErr{\hat{\beta}_j} \downarrow 0\)
- As \(X_j\) becomes correlated with the other \(X_k\)’s, \(\StdErr{\hat{\beta}_j} \uparrow \infty\) (worst case: collinearity)
- (Really: MSE of a linear regression of \(X_j\) on the other \(X_k\)’s)
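The \(1/\sqrt{n}\) scaling of the standard error is easy to check by simulation; the data-generating process, sample sizes, and seed below are all arbitrary illustrative choices:

```
# Standard error of a slope estimate shrinks like 1/sqrt(n)
set.seed(402)
se.of.slope <- function(n) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)   # made-up linear-Gaussian data
  coef(summary(lm(y ~ x)))["x", "Std. Error"]
}
se.of.slope(100) / se.of.slope(10000)  # should be near sqrt(10000/100) = 10
```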

- For fixed \(n\), the detectably-different-from-zero coefficients belong to variables with
- Large true coefficients \(\beta_j\)
- Large variances of that variable, \(\Var{X_j}\)
- Little correlation with the other variables

```
x <- with(na.omit(stocks), cbind(1/MAPE, MAPE, MAPE^2))
colnames(x) <- c("1/MAPE", "MAPE", "MAPE^2")
var(x)
cor(x)  # correlations among the transformed predictors
```

```
## 1/MAPE MAPE MAPE^2
## 1/MAPE 0.0009283645 -0.1662826 -5.736245
## MAPE -0.1662826454 42.2134165 1736.549657
## MAPE^2 -5.7362445792 1736.5496566 77562.536040
```

```
## 1/MAPE MAPE MAPE^2
## 1/MAPE 1.0000000 -0.8399673 -0.6759933
## MAPE -0.8399673 1.0000000 0.9597010
## MAPE^2 -0.6759933 0.9597010 1.0000000
```

- As \(n\) grows, **every** coefficient which isn’t **exactly** zero becomes significant
  - \(T_j \rightarrow \pm \infty\), **unless** \(\beta_j=0\)
  - \(P_j \rightarrow 0\) exponentially fast
  - “The \(P\)-value is a measure of sample size”
- If \(\beta_j=0\), then \(P_j \sim \mathrm{Unif}(0,1)\) for **all** \(n\)
  - Assuming the model is right
  - So about 1/20 of really-zero coefficients will look significant

```
n <- 1e4; p <- 40
predictors <- matrix(rnorm(n*p), nrow=n, ncol=p)
response <- rnorm(n)  # no relationship to the predictors at all
summary(lm(response ~ predictors))
```

```
##
## Call:
## lm(formula = response ~ predictors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3460 -0.6627 0.0075 0.6704 3.7687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0041641 0.0100825 -0.413 0.6796
## predictors1 0.0121260 0.0100739 1.204 0.2287
## predictors2 0.0069030 0.0100144 0.689 0.4906
## predictors3 0.0086452 0.0101618 0.851 0.3949
## predictors4 -0.0021411 0.0100596 -0.213 0.8315
## predictors5 0.0007777 0.0099655 0.078 0.9378
## predictors6 0.0081664 0.0100775 0.810 0.4177
## predictors7 -0.0039913 0.0100751 -0.396 0.6920
## predictors8 0.0125959 0.0100846 1.249 0.2117
## predictors9 -0.0012746 0.0100264 -0.127 0.8988
## predictors10 0.0207467 0.0100252 2.069 0.0385 *
## predictors11 0.0078052 0.0099835 0.782 0.4343
## predictors12 -0.0134649 0.0100076 -1.345 0.1785
## predictors13 0.0054900 0.0101045 0.543 0.5869
## predictors14 0.0034054 0.0100011 0.341 0.7335
## predictors15 -0.0043387 0.0101054 -0.429 0.6677
## predictors16 -0.0134847 0.0101699 -1.326 0.1849
## predictors17 0.0172360 0.0100663 1.712 0.0869 .
## predictors18 0.0097484 0.0100367 0.971 0.3314
## predictors19 0.0045021 0.0099840 0.451 0.6520
## predictors20 0.0073811 0.0100638 0.733 0.4633
## predictors21 0.0183324 0.0099965 1.834 0.0667 .
## predictors22 0.0101453 0.0101195 1.003 0.3161
## predictors23 -0.0072479 0.0100816 -0.719 0.4722
## predictors24 0.0010313 0.0101461 0.102 0.9190
## predictors25 -0.0064636 0.0100027 -0.646 0.5182
## predictors26 -0.0123409 0.0101139 -1.220 0.2224
## predictors27 0.0129432 0.0100371 1.290 0.1972
## predictors28 -0.0081849 0.0099898 -0.819 0.4126
## predictors29 0.0040183 0.0101693 0.395 0.6927
## predictors30 0.0076783 0.0100893 0.761 0.4467
## predictors31 0.0024647 0.0099502 0.248 0.8044
## predictors32 0.0058126 0.0099462 0.584 0.5590
## predictors33 0.0063024 0.0100313 0.628 0.5298
## predictors34 0.0079884 0.0099747 0.801 0.4232
## predictors35 -0.0046969 0.0100587 -0.467 0.6405
## predictors36 0.0212055 0.0100044 2.120 0.0341 *
## predictors37 -0.0138073 0.0099478 -1.388 0.1652
## predictors38 -0.0028815 0.0100857 -0.286 0.7751
## predictors39 -0.0201438 0.0101221 -1.990 0.0466 *
## predictors40 0.0173406 0.0101536 1.708 0.0877 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.006 on 9959 degrees of freedom
## Multiple R-squared: 0.0043, Adjusted R-squared: 0.000301
## F-statistic: 1.075 on 40 and 9959 DF, p-value: 0.344
```
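We can check the 1-in-20 claim directly by simulating many regressions in which the true coefficient is exactly zero and counting how often the reported \(p\)-value falls below 0.05; the sample size and number of replicates below are arbitrary:

```
# Under the null (and a correct model), p-values are Unif(0,1),
# so about 5% of truly-zero coefficients look "significant" at level 0.05
set.seed(402)
pvals <- replicate(1000, {
  x <- rnorm(50); y <- rnorm(50)              # y is unrelated to x
  coef(summary(lm(y ~ x)))["x", "Pr(>|t|)"]
})
mean(pvals < 0.05)  # should come out close to 0.05
```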

- You have a real reason to think \(\beta_j=0\) is a “live option”
- **And** you’ve included the right covariates
- **And** you’re using the right kind of model
- Good evidence when the test has power to detect that \(H_0\) is false

- Similarly: you have reasons why \(\beta_j=b\) might be worth considering
- Or: you want to know whether \(H_0\) is *close enough* to true that you can’t detect the difference
  - Especially: goodness-of-fit tests and diagnostics
- “All the hypotheses that could be true” = confidence set

- A confidence set (interval, rectangle, ellipse, …) is a bet: **either** the true parameter is in the set, \(\theta \in C\), **or** something really improbable happened (probability \(= 1 -\) confidence level)
  - (**or** the model is wrong)

- Every confidence set can be a test
- Reject \(\theta=\theta_0\) if \(\theta_0 \not \in C\)
- Retain if \(\theta_0 \in C\)

- Every test gives a confidence set
- \(\theta_0 \in C\) if test doesn’t reject \(\theta_0\)
- Wald test for coefficients \(\Rightarrow\) usual confidence intervals
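As a sketch of that inversion: the usual 95% interval from `confint` is exactly the set of values the Wald \(t\)-test would retain, which we can verify by hand (on the built-in `cars` data, for self-containedness):

```
# The usual 95% confidence interval inverts the Wald t-test
fit <- lm(dist ~ speed, data = cars)
est <- coef(fit)["speed"]
se  <- coef(summary(fit))["speed", "Std. Error"]
by.hand <- est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se
by.hand                              # endpoints of the retained values
confint(fit, "speed", level = 0.95)  # identical
```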

- Nothing changes in the logic, just how we calculate the sampling distribution
- If you want to test a hypothesis, run simulations where that hypothesis is true, and see if they look like the data
  - E.g., we simulate with \(\beta_j=0\) but all the other \(\beta_k\) set to their optimal values
- If we didn’t know that \(\hat{\beta}_j \sim \mathcal{N}(\beta_j, \StdErr{\hat{\beta}_j}^2)\) in the linear-Gaussian model, but we did have `rnorm()`, we could do this simulation, and we’d get \(P\)-values that match `lm`’s calculations
- “Looks like” comes down to the test statistic
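A sketch of that simulation logic on the built-in `cars` data, using `rnorm()` for the noise; the dataset and the number of simulations are illustrative choices:

```
# Simulation-based p-value for H0: slope = 0 in a linear-Gaussian model
set.seed(402)
fit1 <- lm(dist ~ speed, data = cars)
fit0 <- lm(dist ~ 1, data = cars)          # model with the slope set to 0
t.obs <- coef(summary(fit1))["speed", "t value"]
sim.t <- replicate(1000, {
  # simulate under the null, re-fit, record the test statistic
  sim <- data.frame(speed = cars$speed,
                    dist = fitted(fit0) + rnorm(nrow(cars), sd = summary(fit0)$sigma))
  coef(summary(lm(dist ~ speed, data = sim)))["speed", "t value"]
})
mean(abs(sim.t) >= abs(t.obs))  # essentially 0, matching lm's tiny p-value
```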

- In the homework, we simulated the hypothesis that \(R_i = \frac{1}{M_i} + \epsilon_i\), with \(\epsilon_i\) following a generalized \(t\)-distribution
- Used the \(t\) rather than normal to get a bit more flexibility in the shape
- Also, \(t\)-distributed noise is used a lot in financial modeling
- Many other possibilities for the noise

- In the bootstrap, we don’t simulate a fixed model; we craft a simulation so it’s close to the data
  - Either by fitting a model to simulate from
  - Or by resampling

- This is good at getting things like standard errors and confidence regions
- Confidence region = our estimate would be in this set with high probability
- Can always invert a confidence region to test a hypothesis

- Advantage of resampling is that it assumes very little about the true distribution
- Advantage of the model-based bootstrap is that it uses the data more efficiently, *if* the model is good
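A minimal resampling sketch, again on the built-in `cars` data; the number of resamples is an arbitrary choice:

```
# Resampling (case) bootstrap for the standard error of a slope
set.seed(402)
boot.slope <- replicate(500, {
  resample <- cars[sample(nrow(cars), replace = TRUE), ]
  coef(lm(dist ~ speed, data = resample))["speed"]
})
sd(boot.slope)  # compare to lm's Gaussian-theory standard error (about 0.42)
```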

- When faced with a hypothesis test, ask:
- What exactly are we testing, and why?
- What is the null hypothesis, and what alternative are we testing against?
- What test statistic do we use to discriminate between null and alternative?
- How do we find the distribution of the test statistic under the null?
- Do we have power to detect that alternative with that test statistic?

- Often, what you really want is a confidence set
- \(=\) all the specific hypotheses the data would let you retain
- Still need to worry about model accuracy to calculate probabilities

- Simulation is a way of calculating probabilities
- Can simulate under specific hypotheses
- Or simulate under something close to the real distribution (bootstrap)

- The correct line on hypothesis tests, and when, why and how they are useful: Mayo (1996), Mayo and Cox (2006)
- \(P\)-values tend to zero exponentially fast (unless the null is true): Bahadur (1967), Bahadur (1971), Vaart (1998)
- Crash course in linear regression modeling, including what’s actually being tested, diagnostics, and model-building: http://www.stat.cmu.edu/~cshalizi/TALR/
  - Chapters 7, 12 and 16 are especially relevant to today

- All the linear-Gaussian-model theory you could ever want: Seber and Lee (2003)

- Manipulating the data and/or analysis until you get the significant results you want
- Or, sometimes, any significance at all (cf. the example I tried to describe in class)

- Three main ways it happens:
- Deciding which data points to include or exclude as outliers
- Deciding how to define variables — recent example: different measures of “screen time” and “depression”
- Deciding which model specification to use

- Andy Gelman calls this the “garden of forking paths” (after the Borges (n.d.) story about alternate histories)
- Brutal solution to \(p\)-hacking by specification search: data splitting
  - Randomly split the data into two parts
  - Do exploration, try out variations, etc., on one part, and *only* on that part
  - *After* you have fixed on a model, estimate it and look at inferential statistics using the second part
  - Do not use the second part for anything else
  - Cf. Exercise 3.4 in Chapter 3 (computationally challenging)
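The split itself is one line of R; the discipline is in what you let yourself do with each half. The data frame and names below are hypothetical stand-ins:

```
# Data splitting: explore on one half, do inference on the other
set.seed(402)
df <- data.frame(x = rnorm(100), y = rnorm(100))   # stand-in data
in.explore <- sample(nrow(df), size = nrow(df) / 2)
explore <- df[in.explore, ]    # model search, diagnostics, tinkering
confirm <- df[-in.explore, ]   # fit the *final* model here, once
```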

- R reports an \(F\) test at the bottom of its output for `lm()`
- \(H_{0F}\): the linear-Gaussian model is right, and \(\beta_1=\beta_2=\beta_3=0\), vs. the linear-Gaussian model is right, and not all slopes are exactly 0
- This is **not** testing “Is the linear-Gaussian model right?”
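We can reproduce that \(F\) statistic by explicitly comparing the fitted model to the intercept-only model with `anova()` (shown on the built-in `cars` data for self-containedness):

```
# The overall F test compares the full model to the intercept-only model
fit  <- lm(dist ~ speed, data = cars)
fit0 <- lm(dist ~ 1, data = cars)
anova(fit0, fit)$F[2]             # same as ...
summary(fit)$fstatistic["value"]  # ... the F reported by summary()
```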

Bahadur, R. R. 1967. “Rates of Convergence of Estimates and Test Statistics.” *Annals of Mathematical Statistics* 38:303–24. https://doi.org/10.1214/aoms/1177698949.

———. 1971. *Some Limit Theorems in Statistics*. Philadelphia: SIAM Press.

Borges, Jorge Luis. n.d. *Ficciones*. New York: Grove Press.

Mayo, Deborah G. 1996. *Error and the Growth of Experimental Knowledge*. Chicago: University of Chicago Press.

Mayo, Deborah G., and D. R. Cox. 2006. “Frequentist Statistics as a Theory of Inductive Inference.” In *Optimality: The Second Erich L. Lehmann Symposium*, edited by Javier Rojo, 77–97. Bethesda, Maryland: Institute of Mathematical Statistics. http://arxiv.org/abs/math.ST/0610846.

Seber, George A. F., and Alan J. Lee. 2003. *Linear Regression Analysis*. Second. New York: Wiley.

Vaart, A. W. van der. 1998. *Asymptotic Statistics*. Cambridge, England: Cambridge University Press.