Name:
Andrew ID:
Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as an knitted HTML file on Canvas, by Sunday 10pm, this week.

Huber loss function

Recall, as covered in lab, the Huber loss function (or just Huber function, for short), with cutoff $$a$$, which is defined as: $\psi_a(x) = \begin{cases} x^2 & \text{if |x| \leq a} \\ 2a|x| - a^2 & \text{if |x| > a} \end{cases}$ This function is quadratic on the interval $$[-a,a]$$, and linear outside of this interval. It transitions from quadratic to linear “smoothly”, and looks like this (when $$a=1$$):

Plotting practice, side effects

• 1a. The code for the huber() function that you should have arrived at in this week’s lab is copied below. Using huber(), reproduce the plot of the Huber function that you see above. The axes and title should be just the same, so should the Huber curve (in black), so should be the red dotted lines at the values -1 and 1, and so should the text “Linear”, “Quadratic”, “Linear”.
huber = function(x, a=1) {
ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}
• 1b. Modify the huber() function so that, as a side effect, it prints the string “Invented by the great Swiss statistician Peter Huber!” to the console. Hint: use cat(). Call your function on an input of your choosing, to demonstrate this side effect.

• 1c. Further modify your huber() function so that, as another side effect, it produces a plot of Switzerland’s national flag. Hint: look up this flag up on Google; it’s pretty simple; and you should be able to recreate it with a few calls to rect(). Call your function on an input of your choosing, to demonstrate its side effects.

Exploring function environments

• 2a. A modified version of the Huber function is given below. You can see that we’ve defined the variable x.squared in the body of the function to be the square of the input argument x. In a separate line of code (outside of the function definition), define the variable x.squared to be equal to 999. Then call huber(x=3), and display the value of x.squared. What is its value? Is this affected by the function call huber(x=3)? It shouldn’t be! Reiterate this point with several more lines of code, in which you repeatedly define x.squared to be something different (even something nonnumeric, like a string), and then call huber(x=3), and demonstrate afterwards that the value of x.squared hasn’t changed.
huber = function(x, a=1) {
x.squared = x^2
ifelse(abs(x) <= a, x.squared, 2*a*abs(x)-a^2)
}
• 2b. Similar to the last question, define the variable a to be equal to -59.6, then call huber(x=3, a=2), and show that the value of a after this function call is unchanged. And repeat a few times with different assignments for the variable a, to reiterate this point.

• 2c. The previous two questions showed you that a function’s body has its own environment in which locally defined variables, like those defined in the body itself, or those defined through inputs to the function, take priority over those defined outside of the function. However, when a variable referred to the body of a function is not defined in the local environment, the default is to look for it in the global environment (outside of the function).

Below is a “sloppy” implementation of the Huber function called huber.sloppy(), in which the cutoff a is not passed as an argument to the function. In a separate line of code (outside of the function definition), define a to be equal to 1.5 and then call huber.sloppy(x=3). What is the output? Explain. Repeat this a few times, by defining a and then calling huber.sloppy(x=3), to show that the value of a does indeed affect the function’s ouptut as expected. Challenge: try setting a equal to a string and calling huber.sloppy(x=3); can you explain what is happening?

huber.sloppy = function(x) {
ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}
• 2d. At last, a difference between = and <-, explained! Many of you have asked about this. The equal sign = and assignment operator <- are often used interchangeably in R, and some people will often say that a choice between the two is mostly a matter of stylistic taste. This is not the full story. Indeed, = and <- behave very differently when used to set input arguments in a function call. As we showed above, setting, say, a=5 as the input to huber() has no effect on the global assignment for a. However, replacing a=5 with a<-5 in the call to huber() is entirely different in terms of its effect on a. Demonstrate this, and explain what you are seeing in terms of global assignment.

• 2e. The story now gets even more subtle. It turns out that the assignment operator <- allows us to define new global variables even when we are specifying inputs to a function. Pick a variable name that has not been defined yet in your workspace, say b (or something else, if this has already been used in your R Markdown document). Call huber(x=3, b<-20), then display the value of b—this variable should now exist in the global enviroment, and it should be equal to 20! Alo, can you explain the output of huber(x=3, b<-20)?

• Challenge. The property of the assignment operator <- demonstrated in the last question, although tricky, can also be pretty useful. Leverage this property to plot the function $$y=0.05x^2 - \sin(x)\cos(x) + 0.1\exp(1+\log(x))$$ over 50 x values between 0 and 2, using only one line of code and one call to the function seq().

• 2f. Give an example to show that the property of the assignment operator <- demonstrated in the last two questions does not hold in the body of a function. That is, give an example in which <- is used in the body of a function to define a variable, but this doesn’t translate into global assignment.

Shakespeare’s complete works

Once more, as in lab (and lab/hw from Week 3), we’re going to look at that the complete works of William Shakespeare from Project Gutenberg. We’ve put this text file up at http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt.

Functions for word tables

• 3a. Compute word tables for each of Shakespeare’s plays. You should be able to do this by putting together relevant parts of the solution code from Q2 and Q3 in this week’s lab. Your result should be a list called shakespeare.wordtab.by.play of length 44, witheach component giving a word table for one of Shakespeare’s plays. Display the first 5 entries of each word table.

• 3b. Suppose we have many text documents, and we have computed word tables for each (just like we did for Shakeapeare’s plays). A document-term matrix is essentially a matrix formed by stacking the word tables along its rows. But how do we combine word tables into a matrix when they count the number of appearances of possibly different words, and hence are potentially of different lengths? The answer: we gather all unique words across all documents, into a “master” list of words, and then we expand each word table so that it has one entry per word in the master list, with 0s for words that never appeared in its corresponding document. This ensures that the word tables are all of the same length, and the document-term matrix is formed by stacking them row-wise.

Consider the code below which sketches out the implementation of a function called get.dtmat.from.wordtabs(). The only argument is wordtab.list, which is a list of word tables. There are two main steps: 1. The first step is to get all the unique words across all the word tables. The result should be stored in a string vector called master.words. The code below simply sets this to c(); replace this by your own implementation. You can see that the next line sorts master.words into alphabetical order. 2. The second step is to populate the document-term matrix. The code below defines an matrix dt.mat of the appropriate dimensions, of all 0s, and iterates over its rows one by one. Put your implementation to populate a row of dt.mat into the body of the for() loop. Hint: consider the ith row dt.mat[i,]; this is already all 0s; and so we only need to modify its entries for the words that appeared in the ith word table. This should only require one line of code; take advantage of the column names of dt.mat and use named indexing!

Once you have finished your implementation, apply get.dtmat.from.wordtabs() to shakespeare.wordtab.by.play and save the result as shakespeare.dt.mat. Its dimensions should be 44 x 25801. Display its first 10 rows and 5 columns.

get.dtmat.from.wordtabs = function(wordtab.list) {
# First get all the unique words
master.words = c() # Compute the master list here
master.words = sort(master.words)

# Then build the document-term matrix
dt.mat = matrix(0, nrow=length(wordtab.list), ncol=length(master.words))
rownames(dt.mat) = names(wordtab.list)
colnames(dt.mat) = master.words
for (i in 1:nrow(dt.mat)) {
# Populate the ith row of dt.mat here
}

return(dt.mat)
}
• 3c. Compute correlations between every pair of rows in shakespeare.dt.mat. Use the cor() function, but beware: this computes correlations between each pair of columns of its argument. Which pair of plays achieves the highest correlation, and hence are the most similar in terms of the word tables? Note: here we obviously want to exclude from consideration the correlations of each play with itself, which will be 1. Is this a surprising result?

• Challenge. Do some exploratory analysis of the correlations you computed in the last question, and describe what you are seeing. For example, you might consider hand-labeling each play as either a comedy or tragedy, and then looking at the correlations within and between these groups. You might also consider performing some kind of hierarchical clustering based on the correlations.

• Challenge. Use TF-IDF weighting on shakespeare.dt.mat. This stands for term frequency-inverse document frequency weighting; you can read about this in the course notes for Stat Computing from Fall 2016, or on the web. Compute the top 2 principle component scores of the newly-weighted document-term matrix, and produce a scatter plot of the 44 plays with respect to these 2 dimensions. Set the title and axes labels appropriately. Color the points that correspond to comedies in blue and tragedies in red. Do you see any separation between the point clusters for comedies and tragedies?