- You
*will*comment your code - Comment every function with purpose, inputs, outputs
- You
*will*use meaningful names - You
*will*indent code blocks - You
*ought*to avoid calling your main data set`data`

- vague
- already the name of a function for loading data sets from packages
- often the name of arguments in e.g. model-fitting functions
- both of those make it hard to trace what’s going on

- Use functions to tie together related commands so they can be repeated
- Arguments: what does the function need to know?
- Default values let you omit arguments that don’t change often
- Default values can always be over-ridden
- No-argument functions can be useful for simulation
- Return value
- Try to be explicit about what you return
- Use
`c()`

or`list()`

to return more than one thing

- When we’re told to
- When we’re repeating the same thing multiple times
- When we’re writing very similar code over and over
- When we’re breaking down a big job into smaller ones

- Divide the big job (“come up with a confidence band”) into smaller parts
- Write a function for each part
- Make sure the outputs of the early steps can be used as inputs to the later steps
- Recurse if need be: the sub-functions spawn sub-sub-functions, etc.
- We can often re-use the sub-functions (e.g.,
`resample()`

)

- Problem: the final piece of code you write gives an incomprehensible error message
- Strategy: make sure all the lower-level pieces are working properly
- Start debugging from the bottom
- If all the small bits work on their own, the bug comes when you put them together
- Use test cases
- “What should this do if it works?”
- “What should this
*not*do if it’s broken?” - When you meet a new bug, write a new test
- “How do I know if this is fixed?”
- Test could be very minimal: “It runs without crashing”

`traceback()`

: Show all the calls leading to the last error- Look for the last line of code you wrote
- This helps localize where things went wrong

- R has lots of ways to avoid iteration
- This is faster (yay) and easier to follow (double yay)
Compare

`numerator <- 0; denominator <- 0 for (i in 1:nrow(mobility)) { if (!is.na(mobility$Mobility[i]) && !is.na(mobility$Population[i])) { numerator <- numerator + mobility$Mobility[i]*mobility$Population[i] denominator <- denominator + mobility$Population[i] } } numerator/denominator`

with

`weighted.mean(x=mobility$Mobility,w=mobility$Population,na.rm=TRUE)`

- Some jobs are just inherently iterative
- Use it when that’s what you need to do
If you’re storing stuff as you go, create the variables which you’re storing into outside the loop

`sim.ar <- function(n,x1,phi=0,sd=1) { x <- vector(length=n) x[1] <- x1 for (t in 2:n) { x[t] <- phi*x[t-1]+rnorm(1,mean=0,sd=sd) } return(x) }`

```
stocks <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/15/hw/03/stock_history.csv")
# Check that it's alright
summary(stocks)
```

```
## Date Price Earnings Earnings_10MA_back
## Min. :1871 Min. : 64.8 Min. : 4.01 Min. : 8.51
## 1st Qu.:1907 1st Qu.: 161.8 1st Qu.: 11.94 1st Qu.:13.89
## Median :1943 Median : 237.5 Median : 18.89 Median :17.48
## Mean :1943 Mean : 442.8 Mean : 27.02 Mean :25.70
## 3rd Qu.:1979 3rd Qu.: 553.6 3rd Qu.: 36.59 3rd Qu.:37.19
## Max. :2015 Max. :2056.5 Max. :103.17 Max. :77.00
## NA's :120
## Return_cumul Return_10_fwd
## Min. : 1 Min. :-0.06
## 1st Qu.: 15 1st Qu.: 0.03
## Median : 96 Median : 0.07
## Mean : 1457 Mean : 0.07
## 3rd Qu.: 1061 3rd Qu.: 0.10
## Max. :12951 Max. : 0.20
## NA's :120
```

`dim(stocks)`

`## [1] 1724 6`

- Add the new variable MAPE, and check that it has the right summary:

```
stocks$MAPE <- stocks$Price/stocks$Earnings_10MA_back
summary(stocks$MAPE)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.79 11.70 15.90 16.60 20.00 44.20 120
```

At this point we give up on the NAs:

`stocks <- na.omit(stocks)`

- Linear fit against MAPE:

```
# Fit the regression
lm.MAPE <- lm(Return_10_fwd ~ MAPE, data=stocks)
# Check the coefficients (how do you get standard errors?)
coefficients(lm.MAPE)
```

```
## (Intercept) MAPE
## 0.138348 -0.004589
```

We will put off calculating the \(R^2\) and the CV MSE until we have more models.

- Linear fit against 1/MAPE:

```
lm.invMAPE <- lm(Return_10_fwd ~ I(1/MAPE),data=stocks)
coefficients(lm.invMAPE)
```

```
## (Intercept) I(1/MAPE)
## -0.007659 0.995904
```

- Cross-validated MSEs for these two models is easier if we first load the code for doing CV of linear models from lecture 3. (See the source for how to discard the plots that code makes as side-effects.)

`source("http://www.stat.cmu.edu/~cshalizi/uADA/15/lectures/03-in-class.R")`

The `cv.lm`

function defined there needs formulas, but we can get them from the estimated models; the default is 5-fold CV, which is what we want here.

`(cv.MSEs <- cv.lm(stocks,formulae=c(formula(lm.MAPE),formula(lm.invMAPE))))`

`## [1] 0.001870 0.001837`

- Plot the data, the two estimated models, and, last, the “simple-minded” model which is that returns should be exactly equal to 1/MAPE.

```
# 4a: scatterplot
plot(Return_10_fwd ~ MAPE, data=stocks, xlab="MAPE",
ylab="Future returns",pch=16,cex=0.2)
# 4b: Add the fitted line for regressing on MAPE
abline(lm.MAPE,col="red")
# 4c: fitted curve for regressing on 1/MAPE
# Why doesn't the next line work?
##abline(lm.invMAPE,col="blue")
# Partial credit for plotting: use isolated points
points(x=stocks$MAPE,y=predict(lm.invMAPE),col="blue",pch=16,cex=0.5)
# Full credit: re-order the data in order of increasing MAPE:
stocks <- stocks[order(stocks$MAPE),]
# We can always recover the time order later if we want it!
lines(stocks$MAPE,predict(lm.invMAPE,newdata=stocks),col="blue",lwd=2)
# 4d: the simple-minded model
curve(1/x,col="darkgrey",lwd=2,lty="solid",add=TRUE)
```

- Since there are no adjustable parameters to the simple-minded model at all, we can just look at its in-sample error as an estimate of its generalization error:

`(simple.MSE <- with(stocks,mean((Return_10_fwd - 1/MAPE)^2)))`

`## [1] 0.001896`

- Make a table with the CV and the \(R^2\):

```
library(knitr) # for the kable() table-making function
cv.MSEs <- c(cv.MSEs, simple.MSE)
r.sqs <- c(summary(lm.MAPE)$adj.r.sq, summary(lm.invMAPE)$adj.r.sq, cor(1/stocks$MAPE,stocks$Return_10_fwd)^2)
scores <- rbind(cv.MSEs, r.sqs)
rownames(scores) <- c("CV MSE", "Adjusted R^2")
colnames(scores) <- c("~MAPE", "~1/MAPE", "Simple-minded model")
kable(scores)
```

~MAPE | ~1/MAPE | Simple-minded model | |
---|---|---|---|

CV MSE | 0.0019 | 0.0018 | 0.0019 |

Adjusted R^2 | 0.3221 | 0.3338 | 0.3342 |

- Turn to the confidence intervals.

- The conventional ones:

`kable(confint(lm.invMAPE,level=.9))`

5 % | 95 % | |
---|---|---|

(Intercept) | -0.0124 | -0.0029 |

I(1/MAPE) | 0.9358 | 1.0560 |

- Resampling of residuals: For the in-class demo, re-written from scratch.

```
# Confidence intervals for parametric model coefficients from residual resampling
# Inputs: confidence level, number of bootstrap replicates, estimated model
# Output: array of upper and lower confidence limits for each coefficient
# Presumes: resample.residuals() and estimate.coefs() exist and do the right thing
residual.resample.CIs <- function(level=0.95,B,mdl=lm.invMAPE) {
# Generate bootstrap samples
# Calculate coefficients from each bootstrap sample
# Reduce the pile of bootstrapped coefficients to CIs
# Need 3 pieces of code: simulate, estimate, reduce
# Do all the simulation and re-estimation in one line
tboots <- replicate(B, estimate.coefs(resample.residuals(mdl)))
# Each column of tboots is a different replicate, each row a different coefficient
# Now reduce to an array of confidence limits
# First, get the error probability for easier math
alpha <- 1-level
# Each coefficient gets its own row so calculate quantiles separately for each row
boot.quantiles <- apply(tboots, 1, quantile, probs=c(alpha/2,1-alpha/2))
# What, come to think of it, is our point estimate of the coefficient vector?
main.coefs <- coefficients(mdl)
# Form the upper and lower confidence limits by taking an appropriate spread around
# the point estimates
upper.lim <- main.coefs + (main.coefs - boot.quantiles[1,])
lower.lim <- main.coefs + (main.coefs - boot.quantiles[2,])
# Bind them into an array
CIs <- cbind(lower.lim, upper.lim)
# Name the columns by the appropriate probabilities
colnames(CIs) <- paste(100*c(alpha/2,1-alpha/2), "%")
return(CIs)
}
```

We’ll need the utility function `resample`

:

```
# Resample a vector with replacement
# Inputs: a vector
# Output: a sample taken from the input vector of the same size, with replacement
resample <- function(x) {
return(sample(x,size=length(x),replace=TRUE))
}
```

With this, we can write the actual function for resampling residuals from an estimated model. With a view to later problems, we’ll make it work for multiple models. With a bit more work, we could have it work with many different data frames as well, but for this problem set, it’s enough to get it to work with the `stocks`

data.

```
# Resample the residuals from a given model
# Inputs: the estimated model
# Outputs: a new data frame with simulated values for the dependent variable
# Presumes: data frame is called stocks, dependent variable is Return_10_fwd
# Presumes: no NA issues making fitted(mdl) or residuals(mdl) shorter than the data frame
# Presumes: resample() function exists and works on vectors
resample.residuals <- function(mdl) {
# ATTN: Better to avoid hard-coding stocks and the choice of response variable
stocks$Return_10_fwd <- predict(mdl,newdata=stocks) + resample(residuals(mdl))
return(stocks)
}
```

Now, before we go and spend a lot of time running anything else, let’s make sure that this works. We’ll go through a series of checks.

- The output is indeed a data frame, of the right size and shape:

`is.data.frame(resample.residuals(lm.invMAPE))`

`## [1] TRUE`

`all(dim(stocks)==dim(resample.residuals(lm.invMAPE)))`

`## [1] TRUE`

`all(colnames(stocks)==colnames(resample.residuals(lm.invMAPE)))`

`## [1] TRUE`

- Resampling residuals leaves the independent variable alone:

`all(stocks$MAPE == resample.residuals(lm.invMAPE)$MAPE)`

`## [1] TRUE`

- This is true no matter what the model:

`all(stocks$MAPE == resample.residuals(lm.MAPE)$MAPE)`

`## [1] TRUE`

- Resampling does change the dependent variable:

`!all(stocks$Return_10_fwd == resample.residuals(lm.invMAPE)$Return_10_fwd)`

`## [1] TRUE`

- Resampling gives different values for the dependent variable each time:

`!all(resample.residuals(lm.invMAPE)$Return_10_fwd == resample.residuals(lm.invMAPE)$Return_10_fwd)`

`## [1] TRUE`

- The over-all distribution of the dependent variable shouldn’t change too much though:

`summary(stocks$Return_10_fwd)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0593 0.0297 0.0644 0.0647 0.1020 0.2000
```

`summary(resample.residuals(lm.invMAPE)$Return_10_fwd)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0661 0.0322 0.0643 0.0670 0.0993 0.2950
```

`summary(resample.residuals(lm.invMAPE)$Return_10_fwd)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0838 0.0272 0.0656 0.0653 0.0995 0.2960
```

The estimation function is fairly straightforward:

```
estimate.coefs <- function(data,specification=formula(lm.invMAPE)) {
coefficients(lm(specification,data))
}
```

Some checks on the estimation function:

- It should match the coefficient we estimated on the complete data

`all.equal(estimate.coefs(stocks), coefficients(lm.invMAPE))`

`## [1] TRUE`

- It should not match if we change the data!

`!all(estimate.coefs(stocks) == estimate.coefs(resample.residuals(lm.invMAPE)))`

`## [1] TRUE`

- The estimates should change every time we resample:

`!all(estimate.coefs(resample.residuals(lm.invMAPE)) == estimate.coefs(resample.residuals(lm.invMAPE)))`

`## [1] TRUE`

We can’t be too precise about how close these re-estimates should be to the original (that’s what the confidence intervals are going to be for…), but we should worry if they’re incredibly different:

`estimate.coefs(resample.residuals(lm.invMAPE))`

```
## (Intercept) I(1/MAPE)
## -0.007787 0.998331
```

Notice that by splitting up the job into a separate estimator and resampler, we can design and debug them separately; if we get stuck on one, we can turn to the other. If we have to change how things work internal in either, it doesn’t affect the other.

Having convinced ourselves that the residual resampler is working, and that the coefficient estimator is working, we can put them together for a small run:

`kable(residual.resample.CIs(level=0.9,B=20))`

5 % | 95 % | |
---|---|---|

(Intercept) | -0.0139 | -0.0023 |

I(1/MAPE) | 0.9125 | 1.0729 |

Once we think it works at a small number of replicates, run it for longer to get more accurate confidence limits:

`kable(residual.resample.CIs(level=0.9,B=1e4))`

5 % | 95 % | |
---|---|---|

(Intercept) | -0.0124 | -0.0030 |

I(1/MAPE) | 0.9366 | 1.0561 |

- It’s acceptable, though not ideal, to write this over again from the top:

```
# Confidence intervals for parametric model coefficients from resampling rows or cases
# Inputs: confidence level, number of bootstrap replicates, estimated model
# Output: array of upper and lower confidence limits for each coefficient
# Presumes: resample.rows() and estimate.coefs() exist and do the right thing
row.resample.CIs <- function(level=0.95,B,mdl=lm.invMAPE,data=stocks) {
# Do all the simulation and re-estimation in one line
tboots <- replicate(B, estimate.coefs(resample.rows(data)))
# Each column of tboots is a different replicate, each row a different coefficient
# Now reduce to an array of confidence limits
# First, get the error probability for easier math
alpha <- 1-level
# Each coefficient gets its own row so calculate quantiles separately for each row
boot.quantiles <- apply(tboots, 1, quantile, probs=c(alpha/2,1-alpha/2))
# What, come to think of it, is our point estimate of the coefficient vector?
main.coefs <- coefficients(mdl)
# Form the upper and lower confidence limits by taking an appropriate spread around
# the point estimates
upper.lim <- main.coefs + (main.coefs - boot.quantiles[1,])
lower.lim <- main.coefs + (main.coefs - boot.quantiles[2,])
# Bind them into an array
CIs <- cbind(lower.lim, upper.lim)
# Name the columns by the appropriate probabilities
colnames(CIs) <- paste(100*c(alpha/2,1-alpha/2), "%")
return(CIs)
}
```

We don’t need to write a new estimator, but we do need to write the resampler:

`resample.rows <- function(data) { return(data[resample(1:nrow(data)),]) }`

Again, we should check this. - The output should be a data frame with the same size and shape as `stocks`

:

`is.data.frame(resample.rows(stocks))`

`## [1] TRUE`

`all(dim(stocks)==dim(resample.rows(stocks)))`

`## [1] TRUE`

`all(colnames(stocks)==colnames(resample.rows(stocks)))`

`## [1] TRUE`

- The independent variable
*shouldn’t*match the original,*should*change from run to run, and*should*have about the same distribution:

`!all(stocks$MAPE == resample.rows(stocks)$MAPE)`

`## [1] TRUE`

`!all(resample.rows(stocks)$MAPE == resample.rows(stocks)$MAPE)`

`## [1] TRUE`

`summary(stocks$MAPE)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.79 11.50 15.50 16.00 18.90 44.20
```

`summary(resample.rows(stocks)$MAPE)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.02 11.40 15.40 15.90 18.70 44.20
```

- The dependent variable should also be fluctuating all the time:

`!all(stocks$Return_10_fwd == resample.rows(stocks)$Return_10_fwd)`

`## [1] TRUE`

`!all(resample.rows(stocks)$Return_10_fwd == resample.rows(stocks)$Return_10_fwd)`

`## [1] TRUE`

`summary(stocks$Return_10_fwd)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0593 0.0297 0.0644 0.0647 0.1020 0.2000
```

`summary(resample.rows(stocks)$Return_10_fwd)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0518 0.0329 0.0644 0.0648 0.1010 0.2000
```

Now we can run

`kable(row.resample.CIs(level=0.9,B=1e4))`

5 % | 95 % | |
---|---|---|

(Intercept) | -0.0117 | -0.0035 |

I(1/MAPE) | 0.9449 | 1.0441 |

However, it’s clearly not very clever to have repeated a huge chunk of code in the two CI functions. One way of simplifying this is to peel off the repeated calculation as a new function, and then use it in both:

```
# Reduce an array of bootstrap re-estimates to confidence intervals
# Inputs: array of bootstrap estimates, confidence level, vector of point estimates
# Output: two-column array giving lower and upper confidence limits for each quantity
# Presumes: the array of bootstrapped values has one column per replicate
# Presumes: the number of rows in the arrays matches the length of the vector of point estimates
reduce.boots.to.CIs <- function(the.boots,level=0.95, point.ests) {
# Double check that the dimensions of the.boots and point.ests match
stopifnot(nrow(the.boots) == length(point.ests))
# First, get the error probability for easier math
alpha <- 1-level
# Each coefficient gets its own row so calculate quantiles separately for each row
boot.quantiles <- apply(the.boots, 1, quantile, probs=c(alpha/2,1-alpha/2))
# What, come to think of it, is our point estimate of the coefficient vector?
# Form the upper and lower confidence limits by taking an appropriate spread around
# the point estimates
upper.lim <- point.ests + (point.ests - boot.quantiles[1,])
lower.lim <- point.ests + (point.ests - boot.quantiles[2,])
# Bind them into an array
CIs <- cbind(lower.lim, upper.lim)
# Name the columns by the appropriate probabilities
colnames(CIs) <- paste(100*c(alpha/2,1-alpha/2), "%")
return(CIs)
}
```

Here is a little test case, where we can work out by hand what the answers should be:

`reduce.boots.to.CIs(the.boots=rbind(1:20,2*(20:1)),level=0.9,point.ests=c(10,20))`

```
## 5 % 95 %
## [1,] 0.95 18.05
## [2,] 1.90 36.10
```

(Double-check that these are the right values.)

Now we can greatly simplify the definitions of both our functions:

```
residual.resample.CIs <- function(level=0.95,B,mdl=lm.invMAPE) {
tboots <- replicate(B, estimate.coefs(resample.residuals(mdl)))
main.coefs <- coefficients(mdl)
return(reduce.boots.to.CIs(the.boots=tboots,level=level,point.ests=main.coefs))
}
row.resample.CIs <- function(level=0.95,B,mdl=lm.invMAPE,data=stocks) {
tboots <- replicate(B, estimate.coefs(resample.rows(data)))
main.coefs <- coefficients(mdl)
return(reduce.boots.to.CIs(the.boots=tboots,level=level,point.ests=main.coefs))
}
```

Notice that these are almost identical, except for the resampling function.

```
resample.CIs <- function(level=0.95,B,resampler,main.ests,estimator) {
tboots <- replicate(B, estimator(resampler()))
return(reduce.boots.to.CIs(the.boots=tboots,level=level,point.ests=main.ests))
}
```

Notice that here both `resampler`

and `estimator`

are to be functions. The next code block illustrates:

```
# the residual-resampling CIs
resample.CIs(level=0.9,B=200,resampler=function() { resample.residuals(lm.invMAPE) },
main.ests=coefficients(lm.invMAPE),
estimator=function(data) { estimate.coefs(data, lm.invMAPE) })
```

```
## 5 % 95 %
## (Intercept) -0.01193 -0.003244
## I(1/MAPE) 0.93835 1.053856
```

```
# The case-resampling CIs
resample.CIs(level=0.9,B=200,resampler=function() { resample.rows(stocks) },
main.ests=coefficients(lm.invMAPE),
estimator=function(data) { estimate.coefs(data, lm.invMAPE) })
```

```
## 5 % 95 %
## (Intercept) -0.01211 -0.003022
## I(1/MAPE) 0.94677 1.043825
```

- We’re old friends with estimating kernel regressions by now:

`library(np)`

```
## Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-2)
## [vignette("np_faq",package="np") provides answers to frequently asked questions]
```

`np.MAPE <- npreg(Return_10_fwd ~ MAPE, data=stocks, tol=0.01, ftol=0.01)`

`np.MAPE$bws$bw # bandwidth`

`## [1] 0.5805`

`np.MAPE$bws$fval # cross-validated MSE`

`## [1] 0.001693`

- We can re-use the function for resampling residuals, because it’s been written to work generically, with any regression model that works with
`predict()`

and`residuals()`

. Detailed testing to make sure of this is omitted, but look at a plot of the output:

```
plot(Return_10_fwd ~ MAPE, data=resample.residuals(np.MAPE), pch=16, cex=0.3, col="blue")
points(stocks$MAPE, stocks$Return_10_fwd, pch=16, cex=0.3)
legend("topright",legend=c("Data","Simulated from kernel regression"),col=c("black","blue"),pch=16)
```

We also need some code to re-estimate the kernel regression on new data. We’ll do it in two stages, because `npreg`

can be tempramental when used on its own inside a function.

```
# Estimate a kernel regression of returns on MAPE from a data frame
# Inputs: the data frame
# Outputs: the estimate npregression object
# Presumes: Data frame has columns with the right names (and values)
# ATTN: It'd be nicer to not hard-code the formula
estimate.npr <- function(data) {
bws <- npregbw(Return_10_fwd ~ MAPE, data=data, tol=0.01, ftol=0.01)
fit <- npreg(bws)
return(fit)
}
```

Since we don’t really care about the whole of the kernel regression, just its predictions on selected points, we’ll write a function to calculate that:

```
evaluation.points <- data.frame(MAPE=seq(from=min(stocks$MAPE),to=max(stocks$MAPE),length.out=100))
# Evaluate an estimated model on a grid
# Inputs: the model, data frame containing the grid
# Output: vector of predicted values
# Presumes: the data frame has columns which match the predictor variables of the model
evaluate.mdl <- function(mdl, grid=evaluation.points) { return(predict(mdl, newdata=grid)) }
```

Check that `evaluate.mdl`

is doing what it should:

`all(predict(np.MAPE, newdata=evaluation.points) == evaluate.mdl(np.MAPE))`

`## [1] TRUE`

Now we can actually re-cycle the code we’ve written; start with a small number of replicates.

```
little.CIs <- resample.CIs(level=0.9,B=20,resampler=function() { resample.residuals(np.MAPE) },
main.ests=evaluate.mdl(np.MAPE),
estimator=function(data) { evaluate.mdl(estimate.npr(data)) })
```

Let’s look at this to see if it makes sense: it should have 100 rows and 2 columns; the first column, being the lower limits, should be below our point estimate, `evaluate.mdl(np.MAPE)`

, which in turn should be below the second column.

`dim(little.CIs)`

`## [1] 100 2`

`all(little.CIs[,1] < evaluate.mdl(np.MAPE))`

`## [1] TRUE`

`all(little.CIs[,2] > evaluate.mdl(np.MAPE))`

`## [1] TRUE`

Re-assured that this is working sensibly, we can now give it the time needed to actually generate the confidence limits.