---
title: Testing a parametric regression model
date: 21 February 2019
output: slidy_presentation
---
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\]
```{r, include=FALSE}
set.seed(2019-02-21)
```
## The general problem
- You have a parametric regression model for your data
+ e.g., linear with such-and-such variables
- You are worried that it might be **misspecified**, that the true $\mu(x)$
isn't in the model
- Now that we know nonparametric regression, we can test this
## First case: linear specification is wrong
```{r}
# Make up sample data
x <- sort(runif(300, 0, 3)) # Sorting simplifies plotting later
# True regression function is log(x+1); add Gaussian noise:
yg <- log(x+1) + rnorm(length(x), 0, 0.15)
# Bind into a data frame:
gframe <- data.frame(x=x, y=yg)
```
## What this looks like
Plot it, plus the true regression curve
```{r}
plot(y~x, data=gframe, xlab="x", ylab="y", pch=16, cex=0.5)
curve(log(1+x), col="grey", add=TRUE, lwd=4)
```
## What linear fit do we get?
Fit the linear model
```{r}
glinfit <- lm(y~x, data=gframe)
print(summary(glinfit), signif.stars=FALSE, digits=2)
```
(N.B.: everything is crazily significant, but the linear model is wrong.)
## What the mis-specified linear fit looks like {.smaller}
```{r}
plot(x, yg, xlab="x", ylab="y")
curve(log(1+x), col="grey", add=TRUE, lwd=4)
abline(glinfit, lwd=4)
legend("bottomright", legend=c("Truth", "Linear"),
col=c("grey", "black"), lwd=4)
```
##
MSE of linear model, in-sample: `r signif(mean(residuals(glinfit)^2),3)`.
We'll need to do that a lot, so make it a function:
```{r}
(mse.residuals <- function(model) { mean(residuals(model)^2) })
```
- OFFLINE: Write comments giving the inputs and outputs of the function `mse.residuals`
- OFFLINE: Check that the function works (how? one possible check is sketched below)
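One possible check (a sketch, not the only option): compute the in-sample MSE by hand and compare it to what the function returns.
```{r}
# Sanity check: residuals(glinfit) should equal y - fitted(glinfit),
# so the two ways of computing the in-sample MSE should agree
all.equal(mse.residuals(glinfit),
          mean((gframe$y - fitted(glinfit))^2))
```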
##
Fit the non-parametric alternative; use splines as implemented in `gam()`:
```{r, message=FALSE, warning=FALSE}
library(mgcv)
```
```{r}
gnpr <- gam(y~s(x),data=gframe)
```
Add the fitted values from the spline to the plot:
```{r}
plot(x,yg,xlab="x",ylab="y")
curve(log(1+x),col="grey",add=TRUE,lwd=4)
abline(glinfit, lwd=4)
lines(x,fitted(gnpr),col="blue",lwd=4)
legend("bottomright", legend=c("Truth", "Linear", "Spline"),
col=c("grey", "black", "blue"), lwd=4)
```
##
Calculate the difference in MSEs:
```{r}
(D.hat <- mse.residuals(glinfit) - mse.residuals(gnpr))
```
- OFFLINE: Why do we believe this code works?
## How do we test the linear model?
- $H_0$: $\mu(x)$ is a linear function vs. $H_A$: $\mu(x)$ is not linear
- Test statistic: Difference in MSEs $=D$
+ If the linear model is right, the non-parametric model can't predict (much) better, so $D$ should be small
+ If the linear model is wrong, the non-parametric model will eventually beat it, so $D$ will be large
- Need the sampling distribution of $D$ under the null hypothesis
+ Need to simulate the linear model
+ And need to calculate $D$ over and over
## Simulate the parametric model
Simulate from the parametric model, assuming Gaussian noise
```{r}
sim.lm <- function(linfit, test.x) {
n <- length(test.x)
sim.frame <- data.frame(x=test.x)
sigma <- sqrt(mse.residuals(linfit)) # There are other ways to get sigma
y.sim <- predict(linfit,newdata=sim.frame)
y.sim <- y.sim + rnorm(n,0,sigma)
sim.frame <- data.frame(sim.frame,y=y.sim)
return(sim.frame)
}
```
- OFFLINE: Write comments describing input and output of the `sim.lm` function
- OFFLINE: How can we check that `sim.lm` is working?
- OFFLINE: How can we switch to non-Gaussian noise? (one option is sketched below)
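One possible answer to the last question (a sketch; `sim.lm.resid` is a made-up name, not part of the code above): resample the fitted model's residuals instead of drawing Gaussian noise.
```{r}
# Variant of sim.lm: add noise by resampling the observed residuals
# (a residual bootstrap) rather than assuming a Gaussian distribution
sim.lm.resid <- function(linfit, test.x) {
  n <- length(test.x)
  sim.frame <- data.frame(x=test.x)
  y.sim <- predict(linfit, newdata=sim.frame)
  y.sim <- y.sim + sample(residuals(linfit), size=n, replace=TRUE)
  data.frame(sim.frame, y=y.sim)
}
head(sim.lm.resid(glinfit, x))
```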
## Calculate the test statistic
Calculate difference in MSEs between parametric and nonparametric models on a
data frame:
```{r}
calc.D <- function(df) {
MSE.p <- mse.residuals(lm(y~x,data=df))
MSE.np <- mse.residuals(gam(y~s(x),data=df))
return(MSE.p - MSE.np)
}
```
- OFFLINE: Comments once again (with feeling this time)
- OFFLINE: Check that this function works properly (how?)
## Put the pieces together
Calculate the MSE difference on one simulation run:
```{r}
calc.D(sim.lm(glinfit,x))
```
Calculate the MSE difference on 200 simulation runs, so we get a sample from
the distribution of $D$ under the null hypothesis:
```{r}
null.samples.D <- replicate(200,calc.D(sim.lm(glinfit,x)))
```
How often does the simulation produce a gap at least as big as the one we really saw?
```{r}
mean(null.samples.D >= D.hat)
```
## Look at the distribution
Plot histogram of the sampling distribution, and the observed value:
```{r}
hist(null.samples.D,n=31,xlim=c(min(null.samples.D),1.2*D.hat),probability=TRUE,
main="Distribution of D under the null model",
xlab="D")
abline(v=D.hat, col="red")
```
## Second case: linear model is properly specified
Deliberately left uncommented so that you can explore
```{r}
y2 <- 0.2+0.5*x + rnorm(length(x),0,0.15)
# Why don't I have to make up x values again?
y2.frame <- data.frame(x=x,y=y2)
plot(x,y2,xlab="x",ylab="y",pch=16)
abline(0.2,0.5,col="grey",lwd=4)
```
##
```{r}
y2.fit <- lm(y~x,data=y2.frame)
null.samples.D.y2 <- replicate(200,calc.D(sim.lm(y2.fit,x)))
(D.hat2 <- calc.D(y2.frame))
hist(null.samples.D.y2,n=31,probability=TRUE, main="Distribution of D when the linear model is right", xlab="D")
abline(v=D.hat2, col="red")
mean(null.samples.D.y2 >= D.hat2)
```
## Extensions/lessons
- Don't need to guess at specific nonlinearities; let the smoother figure it out
- Extends to testing a linear model with multiple predictors easily
+ Must choose whether the alternative is an additive model or a fully general nonparametric regression
+ Additive: more power to detect additive nonlinearities, can easily miss nonlinear interactions
+ General model: sensitive to more mis-specifications but less power over-all (curse of dimensionality again)
- Nothing special about linear models --- the same idea works for testing any parametric regression model
## Another approach, based on residuals
\[
Y = \mu(X) + \epsilon, ~ \Expect{\epsilon|X} = 0
\]
- $\Rightarrow$ if we've got the right model, $\Expect{Y-\mu(X)|X=x} = 0$ for all $x$
- $\Rightarrow$ smooth the residuals and see if we get (close to) 0
```{r}
par(mfrow=c(1,2))
plot(gam(residuals(glinfit) ~ s(x)),residuals=TRUE,
lwd=4,ylab="Residuals of the linear model", main="Reality nonlinear")
abline(h=0,col="grey")
plot(gam(residuals(y2.fit) ~ s(x)), residuals=TRUE,
lwd=4, ylab="Residuals of the linear model", main="Reality linear")
abline(h=0,col="grey")
```
## Smoothing residuals, cont'd
- We'd need to calculate how far the smoothed-residuals curve is from 0
+ One common choice is the $L_2$ norm, $\|r\|^2_2 \equiv \int{r^2(x) dx}$, or $\int{r^2(x) f(x) dx}$ to weight by the pdf $f(x)$
+ Could approximate the weighted version by $\frac{1}{n}\sum_{i=1}^{n}{r^2(x_i)}$ (why?)
- Then simulate to get a null distribution for that test statistic (a sketch follows below)
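A sketch of what that might look like (the names `calc.T`, `T.hat`, and `null.samples.T` are made up here), reusing `sim.lm()` and the simulated-null idea from before:
```{r}
# Test statistic: average squared value of the smoothed residuals,
# approximating the f(x)-weighted L2 norm of the residual curve
calc.T <- function(df) {
  lin.resid <- residuals(lm(y~x, data=df))
  resid.smooth <- gam(lin.resid ~ s(x), data=df)
  mean(fitted(resid.smooth)^2)
}
(T.hat <- calc.T(gframe))
null.samples.T <- replicate(200, calc.T(sim.lm(glinfit, x)))
mean(null.samples.T >= T.hat)   # p-value, as before
```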
## Backup: Interactions in `gam()`
- Can extend the additive model to include interaction terms:
\[
Y = \alpha + \sum_{j=1}^{p}{f_j(x_j)} + \sum_{j=1}^{p}{\sum_{k=j+1}^{p}{f_{jk}(x_j, x_k)}} + \epsilon
\]
- Estimate $f_{jk}$ by taking partial residuals and doing a 2D smooth against $x_j$ _and_ $x_k$
- How do we do this in R? (a small example is sketched at the end of this slide)
+ `s(xj, xk)` fits a 2D ("thin plate") spline, good if `xj` and `xk` are on the same scale
+ `te(xj, xk)` fits a "tensor-product" spline, good if `xj` and `xk` are measured on _different_ scales
- When do you want interactions?
+ When it makes sense (e.g., latitude and longitude interact to determine the effect of location)
+ Diagnostic: scatter-plot residuals vs. `xj` and `xk`, look for patterns
- Reading: textbook, section 8.5
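A small illustrative sketch of the `te()` syntax (the data and variable names here are made up, not from the example above):
```{r}
# Two predictors on very different scales, with an interaction in the truth
x1 <- runif(300, 0, 1)
x2 <- runif(300, 0, 100)
y.int <- sin(2*pi*x1)*(x2/100) + rnorm(300, 0, 0.1)
int.frame <- data.frame(x1=x1, x2=x2, y=y.int)
# te() builds a tensor-product spline, appropriate when the scales differ
int.fit <- gam(y ~ te(x1, x2), data=int.frame)
plot(int.fit)   # contour plot of the fitted 2D surface
```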
## Convergence when we add interactions to a GAM
- Convergence rate for a $k$-dimensional smoother is $O(n^{-4/(4+k)})$
- MSE of true regression function $\mu$ is $\sigma^2$
- For the additive model (w/ no interactions),
\[
MSE = \sigma^2 + (\text{approximation bias of best additive model})^2 + O(pn^{-4/5})
\]
- For the model with _all_ two-way interactions,
\[
MSE = \sigma^2 + (\text{approximation bias of best two-way model})^2 + O(p^2 n^{-4/6})
\]
+ There are ${p \choose 2} = O(p^2)$ possible interactions, each converging at rate $O(n^{-4/6})$, and these dominate the $O(n^{-4/5})$ of the additive terms
- For the model with all $k$-way interactions,
\[
MSE = \sigma^2 + (\text{approximation bias})^2 + O\left({p \choose k} n^{-4/(4+k)}\right)
\]