---
title: "Lab 6m: Iteration Basics"
author: "Statistical Computing, 36-350"
date: "Monday October 2, 2016"
---
Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit **your own** lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 5 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 5 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit **your own** homework as a knit HTML file on Blackboard, by 6pm on **Sunday October 9**. This document contains 16 of the 45 total points for Homework 5.
Practice with `for()`
===
- Write a `for()` loop to add, elementwise, the two vectors `x`, `y` defined below. Check that the result is exactly the same as what you get with the usual (vectorized) addition operation in R.
```{r}
x = c(1.2, -1.5, 2, 2.7, 0)
y = c(0.8, -0.5, 0, -0.7, 2)
```
- Write a `for()` loop to sum up the logs of all elements of the vector `z` below that are positive. Show that you can perform the same task without a for loop, just with careful vector subsetting and one call to `sum()`. Check that the results are the same, either way you do it.
```{r}
z = c(-5.3, 1.4, 0.1, 2.9, -0.5, 10.8, -0.7)
```
- Write a `for()` loop to multiply the two matrices `a`, `b` defined below. Note: here we're asking for a proper matrix multiplication. Hence, because `a` is 3 x 2 and `b` is 2 x 4, the resulting matrix, call it `c`, should be 3 x 4. (Hint: you'll need a nested `for()` loop. If you need to remind yourself about matrix multiplication, sketch it out on a piece of paper!) Check that `c` is equal to the result of using multiplying `a`, `b` using the usual matrix multiplication operator.
```{r}
a = matrix(c(-0.15, -0.33, 0.72, -0.64, -0.01, -0.14), 3, 2)
b = matrix(c(0.66, -0.77, 0.92, -0.70, -0.71, 0.85, 0.01, -0.69), 2, 4)
```
- Write a function `mat.mult()` to perform manual matrix multiplication, via a nested `for()` loop, as you implemented in the last question. Its inputs should be: `a` and `b`, two matrices (whose dimensions we will assume are compatible). Its output should be: the matrix equal to the product of `a` and `b`. Demonstrate that your function works by calling it on two matrices and showing that the output equals the result given by the usual matrix multiplication operator. (These can be anything, but make sure they have different dimensions than the ones defined in the previous question---that way, this is a "real" test for your function.)
**Hw5 Q1 (3 points).** Below we generate two matrices `a.big`, `b.big` of dimensions 500 x 200 and 200 x 300, respectively, filled with standard normals. Time how long it takes your function `mat.mult()` to multiply these two. Also time how long it takes with the usual matrix multiplication operator. For the timings, you can use `proc.time()`, as demonstrated below, with some dummy code. (The third element of the returned vector of timings, "elapsed", is what you can report.) Is there a big difference? Also, to verify that your function `mat.mult()` is still doing the right thing, report the maximum absolute difference between its output and the result of standard matrix multiplication here.
```{r}
a.big = matrix(rnorm(500*200), 500, 200)
b.big = matrix(rnorm(200*300), 200, 300)
t0 = proc.time()
x = sqrt(log(1+exp(rnorm(100)*1:100)))
Sys.sleep(1.5) # Really, any code can go here
proc.time() - t0
```
Practice with `while()`
===
- The code below uses a `for()` loop to populate the entries of a numeric vector `log.vec` with the values log(1) through log(10). Modify it so that it uses a `while()` loop, rather than a `for()` loop, to accomplish the same purpose.
```{r}
n = 10
log.vec = vector(length=n, mode="numeric")
for (i in 1:n) {
log.vec[i] = log(i)
}
log.vec
```
- The function below `let.to.num()` takes a lower case letter and converts it to a number according to its order in the alphabet. Also, it counts spaces as 0. (For the purpose of this question, it is not vectorized---i.e., it only expects a single character.) Write a `while()` loop to complete the following task. Iterate over the characters in the string `str` below, convert each to a number using `let.to.num()`, and keep their running sum. Stop as soon as this sum exceeds 300, and display the resulting substring. To be clear, this substring should contain all the characters in `str` up until the last point where their sum is less or equal to (does not exceed) 300. (Hint: recall `substr()`.)
```{r}
let.to.num = function(char) { match(char[1], letters, no=0) }
let.to.num("c"); let.to.num("")
str = "When are you going to stop iterating over me this is getting tiring"
```
**Hw5 Q2 (7 points).** The Babylonians were apparently pretty smart. They devised the following algorithm to compute $\sqrt{x}$, using only the basic arithmetic operations (addition, subtraction, multiplication, division). We take a first guess $r$, let's say $r=x/2$ for concreteness. Either $r^2 > x$, $r^2 < x$, or $r^2 = x$. Now:
- If $r^2=x$, we can stop
- If $r^2 > x$, then $r > \sqrt{x}$, and $x/r < \sqrt{x}$
- If $r^2 < x$, then $r < \sqrt{x}$, and $x/r > \sqrt{x}$
Therefore, in the latter two cases, we can replace $r$ with average of $r$ and $x/r$, and repeat. Write a function `baby.root()` that takes as inputs `x`, the numeric variable whose square root we want to compute; and `tol`, a numeric variable whose default value is 1e-6. Your function should implement the Babylonian method of root finding described above, and stop when $|r^2 - x|$ is less than `tol` (which stands for "tolerance"). Your function should use a `while()` loop. Finally, it should output a list with two elements: `x.sqrt`, the value of $r$ in the above description at convergence (once the loop has terminated); and `n.iter` the number of iterations taken in the `while()` loop before convergence. Run your function when `x` is equal (separately) to 2, 4, 10, 99, and `tol` is kept at the default value. Compare the results to the actual square roots as computed by `sqrt()`, by computing the absolute difference between the approximated and actual square root in each case. Also display the display the numbers of iterations needed in each case.
Compare documents function
===
- In the "Function Design" mini-lecture, we sketched a function `compare.docs()` that computed a document-term matrix from a vector of strings (each being a URL containing text on the web), and then optionally printed out summaries. The code is given below, for your convenience. In Lab 5f, we completed the first step, and wrote the `get.dt.mat()` function. This is again given below, for your convenience. Note: you don't have to do anything yet, this is just a setup for the questions to come.
```{r}
compare.docs = function(str.urls, split="[[:space:]]|[[:punct:]]",
tolower=TRUE, keep.numbers=FALSE, print.summary=TRUE) {
# Compute the document-term matrix
dt.mat = get.dt.mat(str.urls, split, tolower, keep.numbers)
# Print a summary, if we're asked to
if (print.summary) print.dt.mat(dt.mat)
# Compute correlations
cor.mat = cor(t(dt.mat))
# Return a list with document-term matrix and correlations
return(list(dt.mat=dt.mat, cor.mat=cor.mat))
}
get.dt.mat = function(str.urls, split="[[:space:]]|[[:punct:]]",
tolower=TRUE, keep.numbers=FALSE) {
# First, compute all the individual word tables
wordtabs = get.wordtabs(str.urls, split, tolower, keep.numbers)
# Then, build the document-term matrix from these, and return it
return(dt.mat.from.wordtabs(wordtabs))
}
get.wordtabs = function(str.urls, split="[[:space:]]|[[:punct:]]",
tolower=TRUE, keep.numbers=FALSE) {
wordtabs = list()
for (i in 1:length(str.urls)) {
wordtabs[[i]] = get.wordtab(str.urls[i], split, tolower, keep.numbers)
k = max(max(gregexpr("/", str.urls[i])[[1]]), 1)
names(wordtabs)[i] = substr(str.urls[i], k+1, nchar(str.urls[i]))
}
return(wordtabs)
}
get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
tolower=TRUE, keep.numbers=FALSE) {
lines = readLines(str.url)
text = paste(lines, collapse=" ")
words = strsplit(text, split=split)[[1]]
words = words[words != ""]
# Convert to lower case, if we're asked to
if (tolower) words = tolower(words)
# Get rid of words with numbers, if we're asked to
if (!keep.numbers)
words = grep("[0-9]", words, inv=TRUE, val=TRUE)
table(words)
}
dt.mat.from.wordtabs = function(wordtabs) {
# First get all the unique words
all.words = c()
for (i in 1:length(wordtabs)) {
all.words = c(all.words, names(wordtabs[[i]]))
}
all.words.unique = sort(unique(all.words))
# Then build the document-term matrix
dt.mat = matrix(0, length(wordtabs), length(all.words.unique))
colnames(dt.mat) = all.words.unique
rownames(dt.mat) = names(wordtabs)
for (i in 1:length(wordtabs)) {
dt.mat[i, names(wordtabs[[i]])] = wordtabs[[i]]
}
return(dt.mat)
}
```
- Below we run `get.dt.mat()` to produce a document-term matrix `dt.mat` from the speeches given by Trump, Clinton, Pence, Kaine at the 2016 Republican and Democratic National Conventions. On just the first row of `dt.mat`, Trump's document, show how to use appropriate indexing and vectorization to compute the following quantities: the total number of words used by Trump, the total number of characters used by Trump, the most common word used by the Trump, and the longest word used by Trump. To reiterate, you should not use `for()` or `while()` loops here; but instead stick to appropriate indexing and vectorization. You might find it useful to use the variables `words` and `wlens`, computed for your convenience below. (Hint: the last quantity, the longest word used by Trump, is the trickiest to compute. But consider modifying `wlens` so that we "zero out", i.e., assign a zero value, to all words that were not used by Trump, and then find the maximum value of this modified word lengths vector.)
```{r}
dt.mat = get.dt.mat(c(
"http://www.stat.cmu.edu/~ryantibs/statcomp/data/trump.txt",
"http://www.stat.cmu.edu/~ryantibs/statcomp/data/clinton.txt",
"http://www.stat.cmu.edu/~ryantibs/statcomp/data/pence.txt",
"http://www.stat.cmu.edu/~ryantibs/statcomp/data/kaine.txt"))
words = colnames(dt.mat)
wlens = nchar(words)
```
**Hw5 Q3 (6 points).** Write a function `print.dt.mat()` that takes a document-term matrix `dt.mat` as an input, and returns nothing, but as a side effect, prints numeric summaries of the documents. Specifically, for each row of `dt.mat`, this function should print a summary on a new line in the console, reporting for that particular row: the total number of words, the total number of characters, the most common word, and the longest word. You should use a `for()` loop over the rows of `dt.mat`, and for the body of this loop, use appropriate indexing and vectorization as you did in the last quantity to compute each of these quantities. To reiterate, you should not have a nested `for()` loop; i.e., the four quantities here should be computable without looping, just as in the last question. (Hint: to print to the console, use `cat()` and `paste()`. Also take a look at the examples in the mini-lecture "Iteration Basics".)
The format of each line printed summary should be exactly as follows:
"\ -- total words: \, total chars: \, most common: \, longest: \"
where \, \, \, \ are placeholders for quantities that you will compute. Also, \ is a placeholder for the name of the current row of the document-term matrix. So, for example, when `dt.mat` is as defined in the last question, the printed out summary for its first row, Trump's speech, should be exactly as follows:
"trump.txt -- total words: 4431, total chars: 20299, most common: the, longest: representative"
Run your function `print.dt.mat()` on `dt.mat`, the document-term matrix from the Trump, Clinton, Pence, Kaine speeches. Then, run the function `compare.docs()` (finally finished!) on the appropriate vector of strings, containing URLs to the speeches from Trump, Clinton, Pence, Kaine, with the rest of the inputs set to their default values. Show all 4 rows and the first 10 columns of the resulting document-term matrix. Also display the correlation matrix.