Name:

Andrew ID:

Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit **your own** homework as a knitted HTML file on Canvas, by Sunday 10pm, this week.

```
## For reproducibility --- don't change this!
set.seed(01302018)
```

**1a.** Using proper indexing, modify the vectors defined below, as described in the comments. Each time, the solution should require just one line of code, and you should print out the new value of each vector, to show the result.

`(x = runif(10, -1, 1))`

```
## [1] -0.5703595 -0.4517724 0.4088408 -0.1856029 -0.2185913 -0.2291011
## [7] -0.3781738 0.7277174 -0.6395923 -0.9342617
```

```
# Increment the negative entries of x by 0.1
(y = sample(c(TRUE,FALSE), 8, replace=TRUE))
```

`## [1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE`

```
# Replace the FALSE entries of y by NA
(z = c("Hey", "you", "there", "what's", "going", "on"))
```

`## [1] "Hey" "you" "there" "what's" "going" "on"`

`# Paste an exclamation mark "!" at the end of the entries of z with at most 3 characters`

**1b.** The **geometric mean** of positive numbers \(x_1,\ldots,x_n>0\) is defined as \[ (x_1 \cdot x_2 \cdots x_n)^{1/n}. \] There are two strategies for computing the geometric mean of a vector of positive numbers in R. The first uses the function `prod()`, and the binary operator `^`. The second uses the functions `exp()`, `mean()`, and `log()`. Using the vector `x` defined below, implement both strategies, calling the results `m1` and `m2`, respectively. Each computation should require one line of code. Check that the results `m1` and `m2` match using `all.equal()`.

```
n = 10
x = runif(n) # Generate 10 numbers uniformly between 0 and 1
```
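As a side note on the `all.equal()` check requested above, here is a minimal sketch of how it differs from the exact comparison `==`. The names `u`, `v`, `exact.match`, and `approx.match` are hypothetical and unrelated to the homework's vectors; the point is only that two mathematically equal quantities can differ by rounding error in floating point arithmetic.

```r
# Two hypothetical quantities that are equal on paper,
# but differ slightly in floating point arithmetic
u = 0.1 + 0.2
v = 0.3

# Exact comparison is sensitive to the rounding error
exact.match = (u == v)

# all.equal() compares up to a small numerical tolerance; wrapping it in
# isTRUE() guarantees a single TRUE/FALSE value
approx.match = isTRUE(all.equal(u, v))

exact.match
approx.match
```

Here `exact.match` is `FALSE` while `approx.match` is `TRUE`, which is why a tolerance-based check is the safer choice when comparing numerical results computed in two different ways.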

**1c.** Rerun the code you wrote for the last question, but in the first line set `n=10000`, so that `x` is a vector of 10,000 numbers, distributed uniformly at random between 0 and 1. Do `m1` and `m2` match? What is the value of `m1` now? **Challenge**: can you explain what is happening to `m1` here, and why we would therefore prefer the strategy used to compute `m2`?

**1d.** Another reason to prefer the second strategy for computing the geometric mean (the one in which we use the `exp()`, `mean()`, and `log()` functions) is that it can be readily extended to computing the geometric means of rows/columns of matrices in R. Demonstrate this by computing the geometric means of each column of the matrix `x` defined below with just one line of code. (Do *not* use a `for()` loop here; restrict yourself to just three function calls, still.)

`x = matrix(runif(40), 10, 4)`

**1e.** Nested `for()` loops work just like the usual (unnested) ones you’ve already been considering; nesting just means using a `for()` loop within the body of another `for()` loop. E.g., consider

```
x = matrix(0, 5, 5)
for (i in 1:5) {
  for (j in 1:5) {
    x[i,j] = i + j^2
  }
}
x
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    5   10   17   26
## [2,]    3    6   11   18   27
## [3,]    4    7   12   19   28
## [4,]    5    8   13   20   29
## [5,]    6    9   14   21   30
```

which populates the entries of the matrix `x` by first filling out all of its first row, then all of its second row, and so on. (To see this, look at the index variables in the `for()` loops, and step through their progression: first we set `i=1` in the outer `for()` loop, then we set `j=1`, `j=2`, and so on in the inner `for()` loop, until `j=5`; then we move on to `i=2`, …)

Write a nested `for()` loop to multiply the two matrices `a` and `b` defined below, storing the result in the matrix `c`. You will have to remember how matrix multiplication works! And you must only use arithmetic operations in your solution. Hint: your solution should have a nesting of three `for()` loops (the example above had a nesting of two `for()` loops). Check using `all.equal()` that your result `c` matches `a %*% b`, which is R’s built-in way of multiplying `a` and `b`.

```
a = matrix(rnorm(15), 5, 3)
b = matrix(rnorm(12), 3, 4)
c = matrix(0, 5, 4)
```
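As a warm-up for loops that build up a result one term at a time, here is a minimal sketch of a different, simpler task: computing row sums with two nested `for()` loops. The names `m` and `row.totals` are hypothetical and not part of the homework, and this is only one way to write such a loop.

```r
# A small hypothetical matrix, filled column by column with 1, ..., 6
m = matrix(1:6, 2, 3)

# Start every running total at zero
row.totals = rep(0, nrow(m))

for (i in 1:nrow(m)) {
  for (j in 1:ncol(m)) {
    # Add the (i,j) entry to the running total for row i
    row.totals[i] = row.totals[i] + m[i,j]
  }
}

row.totals

# Compare with R's built-in row sums
all.equal(row.totals, rowSums(m))
```

The pattern to notice is that the result is initialized to zero before the loops, and each pass through the inner loop adds one more term to it.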

On to the more fun stuff! As in lab, we’re going to look at William Shakespeare’s complete works, taken from Project Gutenberg. The Shakespeare data file is up on our course website, and you can load it into your R session, as a string vector called `shakespeare.lines`, as follows:

```
shakespeare.lines =
readLines("http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt")
```

**2a.** Some lines in `shakespeare.lines` are empty, i.e., they are just equal to “”. How many such lines are there? Remove all empty lines from `shakespeare.lines`. Also, trim all “extra” white space characters in the lines of `shakespeare.lines` using the `trimws()` function. Note: if you are unsure about what `trimws()` does, try it out on some simple strings/some simple vectors of strings.

**2b.** Visit http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt in your web browser and just skim through this text file. Near the top you’ll see a table of contents. Note that “THE SONNETS” is the first play, and “VENUS AND ADONIS” is the last. Using `which()`, find the indices of the lines in `shakespeare.lines` that equal “THE SONNETS”, report the index of the *first* such occurrence, and store it as `toc.start`. Similarly, find the indices of the lines in `shakespeare.lines` that equal “VENUS AND ADONIS”, report the index of the *first* such occurrence, and store it as `toc.end`.

**2c.** Define `n = toc.end - toc.start + 1`, and create an empty string vector of length `n` called `titles`. Using a `for()` loop, populate `titles` with the titles of Shakespeare’s plays as ordered in the table of contents list, with the first being “THE SONNETS”, and the last being “VENUS AND ADONIS”. Print out the resulting `titles` vector to the console. Hint: if you define the counter variable `i` in your `for()` loop to run between 1 and `n`, then you will have to index `shakespeare.lines` carefully to extract the correct titles. Think about the following. When `i=1`, you want to extract the title of the first play in `shakespeare.lines`, which is located at index `toc.start`. When `i=2`, you want to extract the title of the second play, which is located at index `toc.start + 1`. And so on.

**2d.** Use a `for()` loop to find out, for each play, the index of the line in `shakespeare.lines` at which this play begins. It turns out that the *second* occurrence of “THE SONNETS” in `shakespeare.lines` is where this play actually begins (the first occurrence is in the table of contents), and so on, for each play title. Use your `for()` loop to fill out an integer vector called `titles.start`, containing the indices at which each of Shakespeare’s plays begins in `shakespeare.lines`. Print the resulting vector `titles.start` to the console.

**2e.** Define `titles.end` to be an integer vector of the same length as `titles.start`, whose first element is the second element in `titles.start` minus 1, whose second element is the third element in `titles.start` minus 1, and so on. What this means: we are considering the line before the second play begins to be the last line of the first play, and so on. Define the last element in `titles.end` to be the length of `shakespeare.lines`. You can solve this question either with a `for()` loop, or with proper indexing and vectorization. **Challenge**: it’s not really correct to set the last element in `titles.end` to be the length of `shakespeare.lines`, because there is a footer at the end of the Shakespeare data file. By looking at the data file visually in your web browser, come up with a way to programmatically determine the index of the last line of the last play, and implement it.

**2f.** In Q2d, you should have seen that the starting index of Shakespeare’s 38th play “THE TWO NOBLE KINSMEN” was computed to be `NA`, in the vector `titles.start`. Why? If you run `which(shakespeare.lines == "THE TWO NOBLE KINSMEN")` in your console, you will see that there is only one occurrence of “THE TWO NOBLE KINSMEN” in `shakespeare.lines`, and this occurs in the table of contents. So there was no second occurrence, hence the resulting `NA` value.

But now take a look at line 118,463 in `shakespeare.lines`: you will see that it is “THE TWO NOBLE KINSMEN:”, so this is really where this play starts, but because of the colon “:” at the end of the string, this doesn’t exactly match the title “THE TWO NOBLE KINSMEN”, as we were looking for. The advantage of using the `grep()` function, versus checking for exact equality of strings, is that `grep()` allows us to match substrings. Specifically, `grep()` returns the indices of the strings in a vector for which a substring match occurs, e.g.,

`grep(pattern="cat", x=c("cat", "canned goods", "batman", "catastrophe", "tomcat"))`

`## [1] 1 4 5`

so we can see that in this example, `grep()` was able to find substring matches to “cat” in the first, fourth, and fifth strings in the argument `x`. Redefine `titles.start` by repeating the logic in your solution to Q2d, but replacing the `which()` command in the body of your `for()` loop with an appropriate call to `grep()`. Also, redefine `titles.end` by repeating the logic in your solution to Q2e. Print out the new vectors `titles.start` and `titles.end` to the console; they should be free of `NA` values.
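To make the string functions used in this question more concrete, here is a minimal sketch on a toy vector. The strings and the names `toy.lines`, `exact.idx`, and `sub.idx` are hypothetical, not taken from the Shakespeare file. It shows what `trimws()` does, and how exact matching with `which()` differs from substring matching with `grep()`.

```r
# Hypothetical toy lines, with stray white space and a trailing colon
toy.lines = c("  A TITLE", "A TITLE:  ", "some other text", "A TITLE")

# trimws() strips leading and trailing white space from each string
toy.lines = trimws(toy.lines)
toy.lines

# Exact matching only finds strings identical to the target
exact.idx = which(toy.lines == "A TITLE")

# grep() also finds strings that merely contain the target
sub.idx = grep(pattern="A TITLE", x=toy.lines)

exact.idx
sub.idx
```

After trimming, the second string is “A TITLE:”, so it is missed by the exact comparison but picked up by `grep()`.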

**3a.** Let’s look at two of Shakespeare’s most famous tragedies. Programmatically find the index at which “THE TRAGEDY OF HAMLET, PRINCE OF DENMARK” occurs in the `titles` vector. Use this to find the indices at which this play starts and ends, in the `titles.start` and `titles.end` vectors, respectively. Call the lines of text corresponding to this play `shakespeare.lines.hamlet`. How many such lines are there? Do the same, but now for the play “THE TRAGEDY OF ROMEO AND JULIET”, and call the lines of text corresponding to this play `shakespeare.lines.romeo`. How many such lines are there?

**3b.** Repeat the analysis, outlined in Q4 of Lab 3, on `shakespeare.lines.hamlet`. That is:

- collapse `shakespeare.lines.hamlet` into one big string, separated by spaces;
- convert this string into all lower case characters;
- divide this string into words, by splitting on spaces or on punctuation marks, using `split="[[:space:]]|[[:punct:]]"` in the call to `strsplit()`;
- remove all empty words (equal to the empty string “”), and report how many words remain;
- report the 5 longest words;
- compute a word table, and report the 25 most common words and their counts;
- finally, produce a plot of the word counts versus rank.
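The first few steps listed above can be sketched on a tiny toy example. This is only an illustration under stated assumptions: the two toy sentences and all of the `toy.*` names are hypothetical, and the code is not the Lab 3 solution itself.

```r
# Two hypothetical lines of text
toy.lines = c("The cat sat.", "The dog, the cat!")

# Collapse into one big string, separated by spaces, and lower-case it
toy.all = tolower(paste(toy.lines, collapse=" "))

# Split on spaces or punctuation marks; strsplit() returns a list,
# so we take its first (and only) element
toy.words = strsplit(toy.all, split="[[:space:]]|[[:punct:]]")[[1]]

# Adjacent separators (e.g., ". ") leave empty strings behind; drop them
toy.words = toy.words[toy.words != ""]
length(toy.words)

# Word table, sorted from most to least common
toy.wordtab = table(toy.words)
toy.sorted = sort(toy.wordtab, decreasing=TRUE)
toy.sorted
```

Note that splitting on single separator characters is what produces the empty words here, which is why the removal step is needed before counting.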

**3c.** Repeat the same task as in Q3b, but on `shakespeare.lines.romeo`. Comment on any similarities/differences you see in the answers.

**Challenge.** Using a `for()` loop and the `titles.start`, `titles.end` vectors constructed above, answer the following questions. What is Shakespeare’s longest play (in terms of the number of words)? What is Shakespeare’s shortest play? In which play did Shakespeare use his longest word (in terms of the number of characters)? Are there any plays in which “the” is not the most common word?

**Challenge.** The ubiquity of Zipf’s law in text data seems kind of amazing. Go read up on Zipf’s law and tell us what you find.