--- title: "Lab 7w: The Apply Family" author: "Statistical Computing, 36-350" date: "Wednesday October 12, 2016" --- Name: Andrew ID: Collaborated with: This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit **your own** lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab. There are Homework 6 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 6 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit **your own** homework as a knit HTML file on Blackboard, by 6pm on **Sunday October 16**. This document contains 16 of the 45 total points for Homework 6. Don't use apply(), just yet! === - As pointed out in the last slide of the mini-lecture "The Apply Family", the apply() function can often be overused. You can perform a lot of tasks without it, just by using logical indexing, vectorization, and a few specialized functions like rowSums() and colSums(). In this section of the lab, don't use the apply() function to answer any of the questions (and don't use for() loops either), just use logical indexing, vectorization (or matricization), and any specialized functions that run directly on vectors or matrices like sum(), rowSums(), colSums(), max.col(), etc. For your convenience, we preview the state.x77 matrix below, which, recall is a matrix of 50 states x 8 numeric variables. (You don't have to do anything yet.) {r} head(state.x77)  - How many states have at least 150 days of frost per year? Which ones are they? How many states are at most 10,000 square miles? How many states have illiteracy percentages between 1.5 and 2 percent? How many states have murder rates of at least 12 people per 100,000? Which ones are they? (Hint: you should only need one line of code to answer each question.) - For each of the 8 numeric variables in state.x77, what is the average? For each of the variables, how many states have values above the average? For each of the variables, how many states have values above twice the average? (Hint: you're allowed to use the transpose operator, t(), here.) - For each of the 8 numeric variables in state.x77, what is the geometric mean? (Hint: recall the the geometric mean of a vector of length n is the product of its entries, all raise to the power of 1/n. You can actually do this with colMeans(), after you transform the data in a clever way; think about the relationship between sums of logs and logs of products.) **Hw6 Q4 (3 points).** For each variable, what is the state that achieves the largest value? What is the state that achieves the smallest value? (Hint: you're allowed to use the transpose operator, t(), here.) **Hw6 Q5 (3 points).** For each variable, what is the standard deviation, and the mean absolute deviation? (Hint: recall that the mean absolute deviation of a vector is the average of the absolute differences between its entries and the average. You're allowed to use the function scale(), here, which---with the input center set to TRUE and the input scale set to FALSE---returns a matrix where each column has been centered, i.e., has had its mean subtracted out of it.) OK, now let's use apply()! === - We're curious to see how the Frost variable in state.x77 correlates with the others. Write a function called cor.v1.v2() that takes two inputs: v1, a numeric vector; and v2, another numeric vector, whose default value is state.x77[,"Frost"]. Its output should be the correlation of v1 and v2. (Hint: recall cor().) Check that cor.v1.v2(v1=state.x77[,"Life Exp"]) gives you 0.262068, and cor.v1.v2(v1=state.x77[,"Frost"]) gives you 1. - Using apply() and the function cor.v1.v2() that you defined in the last question, calculate the correlation between each one of the 8 variables in the state.x77 matrix and the Frost variable. Display these correlations. Do any of them seem reasonably large, in absolute value (aside from the correlation of Frost with itself, which is trivially 1)? Does this surprise you? - Plot the illiteracy percentage versus days of frost, and the murder rate versus days of frost, each with appropriately labeled axes and appropriate title. Do there appear to be relationships between the variables? Is it obvious that these relationships are linear? - The usual measure of correlation that you already know, called Pearson's correlation, is a measure of linear association. A more flexible measure of nonlinear association is called Spearman's correlation. Spearman's correlation between two vectors x and y is defined by first converting x and y, individually, to a vector of their ranks, and then computing the usual Pearson correlation between the ranks of x and ranks of y. (E.g., if x has entries -5, 4, and 0 then it gets convered to a vector of ranks 1, 3, and 2, because -5 is the 1st smallest largest entry, 4 is the 3rd smallest entry, and 0 is the 2nd smallest entry.) This can actually be computed with cor() by specifying method="spearman" (the default is method="pearson", and another possibility is method="kendall"). A demo is given below. {r} plot(xvec <- seq(-3,3,length=101), xvec^3) text(-1, 15, paste("Pearson's correlation:", round(cor(xvec, xvec^3, method="pearson"),3))) # Default text(-1, 20, paste("Spearman's correlation:", round(cor(xvec, xvec^3, method="spearman"),3)))  Modify your function cor.v1.v2() so that it takes a third argument, method, whose default value is "pearson", but that can also be "spearman" (or "kendall"), and signifies what type of correlation we should be computing with cor(). Check that cor.v1.v2(v1=state.x77[,"Life Exp"], method="spearman") gives you 0.298391, and cor.v1.v2(v1=state.x77[,"Frost"], method="spearman") still gives you 1. - Using apply() and the latest version of cor.v1.v2(), with method="spearman", to compute the Spearman correlation between each one of the 8 variables in the state.x77 matrix and the Frost variable. Display these correlations. Again, do any of them seem reasonably large, in absolute value (aside from that of Frost with itself, which is trivially 1)? Does this surprise you? **Hw6 Q6 (10 points).** The Spearman correlations between Illiteracy and Frost, as well as Murder and Frost, both look reasonably large in absolute value (so do the Pearson correlations, but let's suppose the Spearman correlations are more interesting for now.) However, it's hard to judge them without assigning some notion of variability to our computed Spearman correlations. The **jackknife** is a super neat tool for doing just this. Here is a general description of how the jackknife works. - Given a set of $n$ data points, compute an estimate $\hat\theta$ for some quantity of interest. - For each data point $i=1,\ldots,n$, remove $i$ from the data set, and recompute an estimate $\hat\theta_{(-i)}$ using the remaining $n-1$ data points. - The jackknife standard error of $\hat\theta$, denoted $\mathrm{se}^{\mathrm{jack}}(\hat\theta)$, is $\mathrm{se}^{\mathrm{jack}}(\hat\theta) = \sqrt{\frac{(n-1)^2}{n}} \mathrm{sd}\{\hat\theta_{(-1)},\ldots,\hat\theta_{(-n)}\}$ where \$\mathrm{sd} stands for sample standard deviation. Write a function called cor.v1.v2.jack(), which takes the same inputs as your latest version of cor.v1.v2(), but instead of computing a correlation between v1 and v2 (of the specified type, in method), the function cor.v1.v2.jack() should compute a jackknife standard error of this correlation. (Hint: you'll want to use a for() loop, in which you leave out one point at a time; in the body of this for() loop, you can in fact just call cor.v1.v2(). Then collect the correlations, and compute the jackknife standard error per the above formula.) Check that cor.v1.v2.jack(v1=state.x77[,"Life Exp"], method="spearman") gives you 0.1540541, and cor.v1.v2.jack(v1=state.x77[,"Frost"], method="spearman") still gives you 0. Explain why it makes sense to get 0, for this last jackknife standard error. Finally, using apply() and cor.v1.v2.jack(), compute the jackknife standard errors of the Spearman correlations between each one of the 8 variables in the state.x77 matrix and the Frost variable. Create a matrix of dimension 3 x 8, where the first row is the Spearman correlation between each variable and Frost, the second row is the Spearman correlation plus twice the jackknife standard error of this Spearman correlation, and the third row is the Spearman correlation minus twice the jackknife standard error of this Spearman correlation. Assign row names "Spear cor", "Spear cor + 2se", and "Spear cor - 2se". Display this matrix. You can think of each Spearman correlation plus/minus 2 jackknife standard errors as defining a (rough) 95% confidence interval---which variables have confidence intervals that do not contain zero?