Name:
Andrew ID:
Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as an knitted HTML file on Canvas, by Sunday 10pm, this week.

## For reproducibility --- don't change this!
set.seed(01232018)

# Some R basics

• 1a. Let’s start easy by working through some R basics, just to brush up on them. Define a variable x.vec to contain the integers 1 through 100. Check that it has length 100. Report the data type being stored in x.vec. Add up the numbers in x.vec, by calling a built-in R function. How many arithmetic operations did this take? Challenge: show how Gauss would have done this same calculation as a 7 year old, using just 3 arithmetic operations.

• 1b. Convert x.vec into a matrix with 20 rows and 5 columns, and store this as x.mat. Here x.mat should be filled out in the default order (column major order). Check the dimensions of x.mat, and the data type as well. Compute the sums of each of the 5 columns of x.mat, by calling a built-in R function. Check (using a comparison operator) that the sum of column sums of x.mat equals the sum of x.vec.

• 1c. Extract and display rows 1, 5, and 17 of x.mat, with a single line of code. Answer the following questions, each with a single line of code: how many elements in row 2 of x.mat are larger than 40? How many elements in column 3 are in between 45 and 50? How many elements in column 5 are odd? Hint: take advantage of the sum() function applied to Boolean vectors.

• 1d. Using Boolean indexing, modify x.vec so that every even number in this vector is incremented by 10, and every odd number is left alone. This should require just a single line of code. Print out the result to the console. Challenge: show that ifelse() can be used to do the same thing, again using just a single line of code.

• 1e. Consider the list x.list created below. Complete the following tasks, each with a single line of code: extract all but the second element of x.list—seeking here a list as the final answer. Extract the first and third elements of x.list, then extract the second element of the resulting list—seeking here a vector as the final answer. Extract the second element of x.list as a vector, and then extract the first 10 elements of this vector—seeking here a vector as the final answer. Note: pay close attention to what is asked and use either single brackets [ ] or double brackets [[ ]] as appropriate.

x.list = list(rnorm(6), letters, sample(c(TRUE,FALSE),size=4,replace=TRUE))

# Prostate cancer data set

OK, moving along to more interesting things! We’re going to look again, as in lab, at the prostate cancer data set: 9 variables measured on 97 men who have prostate cancer (from the book The Elements of Statistical Learning):

1. lpsa: log PSA score
2. lcavol: log cancer volume
3. lweight: log prostate weight
4. age: age of patient
5. lbph: log of the amount of benign prostatic hyperplasia
6. svi: seminal vesicle invasion
7. lcp: log of capsular penetration
8. gleason: Gleason score
9. pgg45: percent of Gleason scores 4 or 5

To load this prostate cancer data set into your R session, and store it as a matrix pros.dat:

pros.dat =
as.matrix(read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat"))

# Computing standard deviations using iteration

• 2a. Using on-the-fly Boolean indexing, extract the rows of pros.dat that correspond to patients with SVI, and the rows that correspond to patients without it. Call the resulting matrices pros.dat.svi and pros.dat.no.svi, respectively. Display the dimensions of these matrices. Compute the column means of pros.dat.svi and pros.dat.no.svi, stored into vectors called pros.dat.svi.avg and pros.dat.no.svi.avg, respectively. For each matrix, this should require just a single call to a built-in R function. Display these column means.

• 2b. Take a look at the starter code below. The first line defines an empty vector pros.dat.svi.sd of length ncol(pros.dat) (of length 9). The second line defines an index variable i and sets it equal to 1. Write a third line of code to compute the standard deviation of the ith column of pros.dat.svi, using a built-in R function, and store this value in the ith element of pros.dat.svi.sd.

pros.dat.svi.sd = vector(length=ncol(pros.dat))
i = 1
• 2c. Repeat the calculation as in the previous question, but for patients without SVI. That is, produce three lines of code: the first should define an empty vector pros.dat.no.svi.sd of length ncol(pros.dat) (of length 9), the second should define an index variable i and set it equal to 1, and the third should fill the ith element of pros.dat.no.svi.sd with the standard deviation of the ith column of pros.dat.no.svi.

• 2d. Write a for() loop to compute the standard deviations of the columns of pros.dat.svi and pros.dat.no.svi, and store the results in the vectors pros.dat.svi.sd and pros.dat.no.svi.sd, respectively, that were created above. Note: you should have a single for() loop here, not two for loops. And if it helps, consider breaking this task down into two steps: as the first step, write a for() loop that iterates an index variable i over the integers between 1 and the number of columns of pros.dat (don’t just manually write 9 here, pull out the number of columns programmatically), with an empty body. As the second step, paste relevant pieces of your solution code from Q2b and Q2c into the body of the for() loop. Print out the resulting vectors pros.dat.svi.sd and pros.dat.no.svi.sd to the console. Comment, just briefly (informally), by visually inspecting these standard deviations and the means you computed in Q2a: which variables exhibit large differences in means between the SVI and non-SVI patients, relative to their standard deviations?

• 2e. The code below computes the standard deviations of the columns of pros.dat.svi and pros.dat.no.svi, and stores them in pros.dat.svi.sd.master and pros.dat.no.svi.sd.master, respectively, using apply(). (We’ll learn apply() and related functions a bit later in the course.) Remove eval=FALSE as an option to the Rmd code chunk, and check using all.equal() that the standard deviations you computed in the previous question equal these “master” copies. Note: use check.names=FALSE as a third argument to all.equal(), which instructs it to ignore the names of its first two arguments. (If all.equal() doesn’t succeed in both cases, then you must have done something wrong in computing the standard deviations, so go back and fix them!)

pros.dat.svi.sd.master = apply(pros.dat.svi, 2, sd)
pros.dat.no.svi.sd.master = apply(pros.dat.no.svi, 2, sd)

# Computing t-tests using vectorization

• 3a. Recall that the two-sample (unpaired) t-statistic between data sets $$X=(X_1,\ldots,X_n)$$ and $$Y=(Y_1,\ldots,Y_m)$$ is: $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}},$ where $$\bar{X}=\sum_{i=1}^n X_i/n$$ is the sample mean of $$X$$, $$s_X^2 = \sum_{i=1}^n (X_i-\bar{X})^2/(n-1)$$ is the sample variance of $$X$$, and similarly for $$\bar{Y}$$ and $$s_Y^2$$. We will compute these t-statistics for all 9 variables in our data set, where $$X$$ will play the role of one of the variables for SVI patients, and $$Y$$ will play the role of this variable for non-SVI patients. Start by computing a vector of the denominators of the t-statistics, called pros.dat.denom, according to the formula above. Take advantage of vectorization; this calculation should require just a single line of code. Make sure not to include any hard constants (e.g., don’t just manually write 21 here for $$n$$); as always, programmatically define all the relevant quantities. Copy over the definition of magic.denom from Q3d on Lab 2, these were the “magic denominators” that we used in this lab. Check using all.equal() that your computed denominators match these ones. Then compute a vector of t-statistics for the 9 variables in our data set, called pros.dat.t.stat, according to the formula above, and using pros.dat.denom. Again, take advantage of vectorization; this calculation should require just a single line of code. Print out the t-statistics to the console.

• 3b. Given data $$X$$ and $$Y$$ and the t-statistic $$T$$ as defined the last question, the degrees of freedom associated with $$T$$ is: $\nu = \frac{(\frac{s_X^2}{n}+\frac{s_Y^2}{m})^2}{\frac{(\frac{s_X^2}{n})^2}{n-1} + \frac{(\frac{s_Y^2}{m})^2}{m-1}}.$ Compute the degrees of freedom associated with each of our 9 t-statistics (from our 9 variables), storing the result in a vector called pros.dat.df. This might look like a complicated calculation, but really, it’s not too bad: it only involves arithmetic operators, and taking advantage of vectorization, the calculation should only require a single line of code. Hint: to simplify this line of code, it will help to first set short variable names for variables/quantities you will be using, as in sx = pros.dat.svi.sd, n = nrow(pros.dat.svi), and so on. Print out these degrees of freedom values to the console.

• 3c. The function pt() evaluates the distribution function of the t-distribution. E.g.,

pt(x, df=v, lower.tail=FALSE)

returns the probability that a t-distributed random variable, with v degrees of freedom, exceeds the value x. Importantly, pt() is vectorized: if x is a vector, and so is v, then the above returns, in vector format: the probability that a t-distributed variate with v[1] degrees of freedom exceeds x[1], the probability that a t-distributed variate with v[2] degrees of freedom exceeds x[2], and so on.

Call pt() as in the above line, but replace x by the absolute values of the t-statistics you computed for the 9 variables in our data set, and v by the degrees of freedom values associated with these t-statistics. Multiply the output by 2, and store it as a vector pros.dat.p.val. These are called p-values for the t-tests of mean difference between SVI and non-SVI patients, over the 9 variables in our data set. Print out the p-values to the console. Identify the variables for which the p-value is smaller than 0.05 (hence deemed to have a significant difference between SVI and non-SVI patients). Identify the variable with the smallest p-value (the most significant difference between SVI and non-SVI patients).

# Computing t-tests using iteration

• 4. The function t.test() computes a two-sample (unpaired) t-test between two data sets. E.g.,

t.test.obj = t.test(x=rnorm(10), y=rnorm(10))
names(t.test.obj)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
## [6] "null.value"  "alternative" "method"      "data.name"

computes a t-test between data sets x=rnorm(10) and y=rnorm(10) (here, just for the sake of example, these are just two sets of 10 randomly generated standard normals), and stores the output in a list called t.test.obj. The names of the list are then displayed. Note: the element named p.value contains the p-value.

Define an empty vector of length ncol(pros.dat) (of length 9) called pros.dat.p.val.master. Then write a for() loop to populate its entries with the p-values from calling t.test() to test the mean difference between SVI and non-SVI patients, over each of the 9 variables in our data set. Important: the t.test() function will throw an error when it tries to consider the mean difference in the SVI variable itself, across the two groups of SVI patients and non-SVI patients; this will occur at some value of i (i.e., the value for which pros.dat[,i] is the SVI column). To avoid this error, use an if() statement to check if the current variable being considered is SVI, and in this case, just set the p-value equal to NaN (rather than calling t.test()). Check using all.equal() that the p-values stored in pros.dat.p.val.master match the ones you computed in Q3c. Note: use check.names=FALSE as a third argument to all.equal(), which instructs it to ignore the names of its first two arguments.