Make up sample data; it’ll simplify things later if they’re sorted for plotting
x <- sort(runif(300, 0, 3))
Impose true regression function \(\log{(x+1)}\), with Gaussian noise:
yg <- log(x+1) + rnorm(length(x), 0, 0.15)
Bind into a data frame:
gframe <- data.frame(x=x, y=yg)
Plot it, plus the true regression curve
plot(y~x, data=gframe, xlab="x", ylab="y", pch=16, cex=0.5)
curve(log(1+x), col="grey", add=TRUE, lwd=4)
Fit the linear model, add to the plot:
glinfit <- lm(y~x, data=gframe)
print(summary(glinfit), signif.stars=FALSE, digits=2)
##
## Call:
## lm(formula = y ~ x, data = gframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.447 -0.111 -0.002 0.091 0.437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.192 0.019 10 <2e-16
## x 0.431 0.010 41 <2e-16
##
## Residual standard error: 0.15 on 298 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.85
## F-statistic: 1.7e+03 on 1 and 298 DF, p-value: <2e-16
plot(x, yg, xlab="x", ylab="y")
curve(log(1+x), col="grey", add=TRUE, lwd=4)
abline(glinfit, lwd=4)
MSE of linear model, in-sample: 0.0238.
We’ll need to do that a lot, so make it a function:
mse.residuals <- function(model) { mean(residuals(model)^2) }
Fit the non-parametric alternative:
library(mgcv)
We’ll use spline smoothing as provided by the gam
function:
gnpr <- gam(y~s(x),data=gframe)
Add the fitted values from the spline to the plot:
plot(x,yg,xlab="x",ylab="y")
curve(log(1+x),col="grey",add=TRUE,lwd=4)
abline(glinfit, lwd=4)
lines(x,fitted(gnpr),col="blue",lwd=4)