---
title: "Lab 7w: The Apply Family"
author: "Statistical Computing, 36-350"
date: "Wednesday October 12, 2016"
---
Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit **your own** lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 6 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 6 questions from other labs. Your homework writeup must begin the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit **your own** homework as a knit HTML file on Blackboard, by 6pm on **Sunday October 16**. This document contains 16 of the 45 total points for Homework 6.
Don't use `apply()`, just yet!
===
- As pointed out in the last slide of the mini-lecture "The Apply Family", the `apply()` function can often be overused. You can perform a lot of tasks without it, just by using logical indexing, vectorization, and a few specialized functions like `rowSums()` and `colSums()`. In this section of the lab, don't use the `apply()` function to answer any of the questions (and don't use `for()` loops either); just use logical indexing, vectorization (or matricization), and any specialized functions that run directly on vectors or matrices, like `sum()`, `rowSums()`, `colSums()`, `max.col()`, etc. For your convenience, we preview the `state.x77` matrix below, which, recall, is a matrix of 50 states by 8 numeric variables. (You don't have to do anything yet.)
```{r}
head(state.x77)
```
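As a quick warm-up (not one of the graded questions), here is the kind of one-line answer that logical indexing makes possible, sketched on a small made-up named vector rather than on `state.x77` itself:

```{r}
# Toy named vector of (hypothetical) frost days; the comparison produces a
# logical vector, sum() counts the TRUEs, and indexing by it recovers names
x <- c(Alabama=20, Alaska=152, Hawaii=0, Maine=161)
sum(x >= 150)        # how many entries are at least 150
names(x)[x >= 150]   # which ones they are
```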
- How many states have at least 150 days of frost per year? Which ones are they? How many states are at most 10,000 square miles? How many states have illiteracy percentages between 1.5 and 2 percent? How many states have murder rates of at least 12 people per 100,000? Which ones are they? (Hint: you should only need one line of code to answer each question.)
- For each of the 8 numeric variables in `state.x77`, what is the average? For each of the variables, how many states have values above the average? For each of the variables, how many states have values above twice the average? (Hint: you're allowed to use the transpose operator, `t()`, here.)
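To see why the transpose helps, here is a sketch on a tiny made-up matrix (3 "states" by 2 "variables", not real data): comparing `t(mat)` to a vector recycles the vector down the columns of `t(mat)`, i.e., variable-by-variable, which is exactly the matching we want.

```{r}
# Tiny toy matrix: 3 "states" (rows) by 2 "variables" (columns)
mat <- matrix(c(1, 2, 9,
                4, 5, 6), nrow=3,
              dimnames=list(c("A","B","C"), c("v1","v2")))
avg <- colMeans(mat)     # per-variable averages: v1 is 4, v2 is 5
rowSums(t(mat) > avg)    # per variable, the number of states above its average
```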
- For each of the 8 numeric variables in `state.x77`, what is the geometric mean? (Hint: recall that the geometric mean of a vector of length n is the product of its entries, raised to the power of 1/n. You can actually do this with `colMeans()`, after you transform the data in a clever way; think about the relationship between sums of logs and logs of products.)
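The log trick can be checked on a toy vector first: since the log of a product is the sum of the logs, the geometric mean is `exp()` of the mean of the logs (and, column-by-column on a matrix, the same idea becomes `exp()` of `colMeans()` of the logged matrix).

```{r}
# Geometric mean of a toy vector, two ways; both give 4
x <- c(2, 8)
prod(x)^(1/length(x))   # direct definition
exp(mean(log(x)))       # via the sum-of-logs identity
```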
**Hw6 Q4 (3 points).** For each variable, what is the state that achieves the largest value? What is the state that achieves the smallest value? (Hint: you're allowed to use the transpose operator, `t()`, here.)
**Hw6 Q5 (3 points).** For each variable, what is the standard deviation, and the mean absolute deviation? (Hint: recall that the mean absolute deviation of a vector is the average of the absolute differences between its entries and the average. You're allowed to use the function `scale()`, here, which---with the input `center` set to TRUE and the input `scale` set to FALSE---returns a matrix where each column has been centered, i.e., has had its mean subtracted out of it.)
OK, now let's use `apply()`!
===
- We're curious to see how the `Frost` variable in `state.x77` correlates with the others. Write a function called `cor.v1.v2()` that takes two inputs: `v1`, a numeric vector; and `v2`, another numeric vector, whose default value is `state.x77[,"Frost"]`. Its output should be the correlation of `v1` and `v2`. (Hint: recall `cor()`.) Check that `cor.v1.v2(v1=state.x77[,"Life Exp"])` gives you 0.262068, and `cor.v1.v2(v1=state.x77[,"Frost"])` gives you 1.
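If default arguments are rusty, here is a minimal toy reminder (not the lab function itself): an argument with a default can simply be omitted in the call.

```{r}
# Toy function: b falls back to 10 when not supplied
f <- function(a, b=10) {
  a + b
}
f(5)      # uses the default
f(5, 1)   # overrides it
```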
- Using `apply()` and the function `cor.v1.v2()` that you defined in the last question, calculate the correlation between each one of the 8 variables in the `state.x77` matrix and the `Frost` variable. Display these correlations. Do any of them seem reasonably large, in absolute value (aside from the correlation of `Frost` with itself, which is trivially 1)? Does this surprise you?
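The general calling pattern for `apply()` over columns, sketched on a toy matrix: `MARGIN=2` hands each column, in turn, to `FUN`.

```{r}
# apply() with MARGIN=2 runs FUN on each column; here, the column sums
mat <- matrix(1:6, nrow=2)
apply(mat, MARGIN=2, FUN=sum)
```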
- Plot the illiteracy percentage versus days of frost, and the murder rate versus days of frost, each with appropriately labeled axes and appropriate title. Do there appear to be relationships between the variables? Is it obvious that these relationships are linear?
- The usual measure of correlation that you already know, called Pearson's correlation, is a measure of linear association. A more flexible measure of nonlinear association is called Spearman's correlation. Spearman's correlation between two vectors x and y is defined by first converting x and y, individually, to vectors of their ranks, and then computing the usual Pearson correlation between the ranks of x and the ranks of y. (E.g., if x has entries -5, 4, and 0, then it gets converted to a vector of ranks 1, 3, and 2, because -5 is the 1st smallest entry, 4 is the 3rd smallest entry, and 0 is the 2nd smallest entry.) This can actually be computed with `cor()` by specifying `method="spearman"` (the default is `method="pearson"`, and another possibility is `method="kendall"`). A demo is given below.
```{r}
plot(xvec <- seq(-3,3,length=101), xvec^3)
text(-1, 15, paste("Pearson's correlation:",
round(cor(xvec, xvec^3, method="pearson"),3))) # Default
text(-1, 20, paste("Spearman's correlation:",
round(cor(xvec, xvec^3, method="spearman"),3)))
```
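The rank-based definition can also be verified directly on a toy example: Pearson's correlation applied to the ranks matches `cor()` with `method="spearman"`.

```{r}
# Spearman's correlation equals Pearson's correlation of the ranks
x <- c(-5, 4, 0)
y <- c(1, 10, 100)
rank(x)                          # 1, 3, 2
cor(rank(x), rank(y))            # Pearson on the ranks
cor(x, y, method="spearman")     # same value
```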
Modify your function `cor.v1.v2()` so that it takes a third argument, `method`, whose default value is "pearson", but that can also be "spearman" (or "kendall"), and signifies what type of correlation we should be computing with `cor()`. Check that `cor.v1.v2(v1=state.x77[,"Life Exp"], method="spearman")` gives you 0.298391, and `cor.v1.v2(v1=state.x77[,"Frost"], method="spearman")` still gives you 1.
- Using `apply()` and the latest version of `cor.v1.v2()`, with `method="spearman"`, compute the Spearman correlation between each one of the 8 variables in the `state.x77` matrix and the `Frost` variable. Display these correlations. Again, do any of them seem reasonably large, in absolute value (aside from that of `Frost` with itself, which is trivially 1)? Does this surprise you?
**Hw6 Q6 (10 points).** The Spearman correlations between `Illiteracy` and `Frost`, as well as `Murder` and `Frost`, both look reasonably large in absolute value (so do the Pearson correlations, but let's suppose the Spearman correlations are more interesting for now). However, it's hard to judge them without assigning some notion of variability to our computed Spearman correlations. The **jackknife** is a super neat tool for doing just this. Here is a general description of how the jackknife works.
- Given a set of $n$ data points, compute an estimate $\hat\theta$ for some quantity of interest.
- For each data point $i=1,\ldots,n$, remove $i$ from the data set, and recompute an estimate $\hat\theta_{(-i)}$ using the remaining $n-1$ data points.
- The jackknife standard error of $\hat\theta$, denoted $\mathrm{se}^{\mathrm{jack}}(\hat\theta)$, is
\[
\mathrm{se}^{\mathrm{jack}}(\hat\theta) =
\sqrt{\frac{(n-1)^2}{n}} \mathrm{sd}\{\hat\theta_{(-1)},\ldots,\hat\theta_{(-n)}\}
\]
where $\mathrm{sd}$ stands for sample standard deviation.
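Before coding the correlation version, the jackknife recipe can be sketched for a simpler statistic, the sample mean of a toy vector; conveniently, for the mean the jackknife standard error reproduces the familiar $\mathrm{sd}(x)/\sqrt{n}$, which gives a built-in sanity check.

```{r}
# Jackknife standard error of the mean of a toy vector
x <- c(1, 2, 3, 4, 5)
n <- length(x)
theta.jack <- numeric(n)
for (i in 1:n) {
  theta.jack[i] <- mean(x[-i])   # recompute the estimate, leaving out point i
}
se.jack <- sqrt((n-1)^2/n) * sd(theta.jack)
se.jack          # matches the usual standard error of the mean
sd(x)/sqrt(n)
```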
Write a function called `cor.v1.v2.jack()`, which takes the same inputs as your latest version of `cor.v1.v2()`, but instead of computing a correlation between `v1` and `v2` (of the specified type, in `method`), the function `cor.v1.v2.jack()` should compute a jackknife standard error of this correlation. (Hint: you'll want to use a `for()` loop, in which you leave out one point at a time; in the body of this `for()` loop, you can in fact just call `cor.v1.v2()`. Then collect the correlations, and compute the jackknife standard error per the above formula.) Check that `cor.v1.v2.jack(v1=state.x77[,"Life Exp"], method="spearman")` gives you 0.1540541, and `cor.v1.v2.jack(v1=state.x77[,"Frost"], method="spearman")` still gives you 0. Explain why it makes sense to get 0, for this last jackknife standard error.
Finally, using `apply()` and `cor.v1.v2.jack()`, compute the jackknife standard errors of the Spearman correlations between each one of the 8 variables in the `state.x77` matrix and the `Frost` variable. Create a matrix of dimension 3 x 8, where the first row is the Spearman correlation between each variable and `Frost`, the second row is the Spearman correlation plus twice the jackknife standard error of this Spearman correlation, and the third row is the Spearman correlation minus twice the jackknife standard error of this Spearman correlation. Assign row names "Spear cor", "Spear cor + 2se", and "Spear cor - 2se". Display this matrix. You can think of each Spearman correlation plus/minus 2 jackknife standard errors as defining a (rough) 95% confidence interval---which variables have confidence intervals that do not contain zero?