Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted PDF file on Gradescope, by Friday 9pm, this week.
This week’s agenda: creating and updating functions; understanding argument and return structures; revisiting Shakespeare’s plays; code refactoring.
The Huber loss function (or just Huber function, for short) is defined as: \[ \psi(x) = \begin{cases} x^2 & \text{if $|x| \leq 1$} \\ 2|x| - 1 & \text{if $|x| > 1$} \end{cases} \] This function is quadratic on the interval [-1,1], and linear outside of this interval. It transitions from quadratic to linear “smoothly”, and looks like this. It is often used in place of the usual squared error loss for robust estimation. For example, the sample average, \(\bar{X}\)—which given a sample \(X_1,\ldots,X_n\) minimizes the squared error loss \(\sum_{i=1}^n (X_i-m)^2\) over all choices of \(m\)—can be inaccurate as an estimate of \(\mathbb{E}(X)\) if the distribution of \(X\) is heavy-tailed. In such cases, minimizing Huber loss can give a better estimate.
huber() that takes as an input a number \(x\), and returns the Huber value \(\psi(x)\), as defined above. Hint: the body of a function is just a block of R code, e.g., in this code you can use if() and else() statements. Check that huber(1) returns 1, and huber(4) returns 7.# YOUR CODE GOES HEREhuber() function so that it takes two arguments: \(x\), a number at which to evaluate the loss, and \(a\) a number representing the cutoff value. It should now return \(\psi_a(x)\), as defined above. Check that huber(3, 2) returns 8, and huber(3, 4) returns 9.# YOUR CODE GOES HEREhuber() function so that the default value of the cutoff \(a\) is 1. Check that huber(3) returns 5.# YOUR CODE GOES HEREhuber(a=1, x=3) returns 5. Check that huber(1, 3) returns 1. Explain why these are different.# YOUR CODE GOES HEREhuber() function, so that the first input can actually be a vector of numbers, and what is returned is a vector whose elements give the Huber evaluated at each of these numbers. Hint: you might try using ifelse(), if you haven’t already, to vectorize nicely. Check that huber(x=1:6, a=3) returns the vector of numbers (1, 4, 9, 15, 21, 27).# YOUR CODE GOES HEREhuber.vals and x.vals, respectively. However, the cutoff \(a\) was, let’s say, lost. Using huber.vals, x.vals, and the definition of the Huber function, you should be able to figure out the cutoff value \(a\), at least roughly. Estimate \(a\) and explain how you got there. Hint: one way to estimate \(a\) is to do so visually, using plotting tools; there are other ways too.# YOUR CODE GOES HEREhuber() function you wrote above. The axes and title should be just the same, so should the Huber curve (in black), so should be the red dotted lines at the values -1 and 1, and so should the text “Linear”, “Quadratic”, “Linear”.# YOUR CODE GOES HEREhuber() function so that, as a side effect, it prints the string “Invented by the great Swiss statistician Peter Huber!” to the console. Hint: use cat(). Call your function on an input of your choosing, to demonstrate this side effect.# YOUR CODE GOES HEREhuber() function so that, as another side effect, it produces a plot of Switzerland’s national flag. Hint: look up this flag up on Google; it’s pretty simple; and you should be able to recreate it with a few calls to rect(). Call your function on an input of your choosing, to demonstrate its side effects.# YOUR CODE GOES HEREx.squared in the body of the function to be the square of the input argument x. In a separate line of code (outside of the function definition), define the variable x.squared to be equal to 999. Then call huber(x=3), and display the value of x.squared. What is its value? Is this affected by the function call huber(x=3)? It shouldn’t be! Reiterate this point with several more lines of code, in which you repeatedly define x.squared to be something different (even something nonnumeric, like a string), and then call huber(x=3), and demonstrate afterwards that the value of x.squared hasn’t changed.huber = function(x, a=1) {
  x.squared = x^2
  ifelse(abs(x) <= a, x.squared, 2*a*abs(x)-a^2)
}# YOUR CODE GOES HEREa to be equal to -59.6, then call huber(x=3, a=2), and show that the value of a after this function call is unchanged. And repeat a few times with different assignments for the variable a, to reiterate this point.# YOUR CODE GOES HERE3c. The previous two questions showed you that a function’s body has its own environment in which locally defined variables, like those defined in the body itself, or those defined through inputs to the function, take priority over those defined outside of the function. However, when a variable referred to the body of a function is not defined in the local environment, the default is to look for it in the global environment (outside of the function).
Below is a “sloppy” implementation of the Huber function called huber.sloppy(), in which the cutoff a is not passed as an argument to the function. In a separate line of code (outside of the function definition), define a to be equal to 1.5 and then call huber.sloppy(x=3). What is the output? Explain. Repeat this a few times, by defining a and then calling huber.sloppy(x=3), to show that the value of a does indeed affect the function’s ouptut as expected. Challenge: try setting a equal to a string and calling huber.sloppy(x=3); can you explain what is happening?
huber.sloppy = function(x) {
  ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}# YOUR CODE GOES HERE= and <-, explained! Some of you have been asking about this. The equal sign = and assignment operator <- are often used interchangeably in R, and some people will often say that a choice between the two is mostly a matter of stylistic taste. This is not the full story. Indeed, = and <- behave very differently when used to set input arguments in a function call. As we showed above, setting, say, a=5 as the input to huber() has no effect on the global assignment for a. However, replacing a=5 with a<-5 in the call to huber() is entirely different in terms of its effect on a. Demonstrate this, and explain what you are seeing in terms of global assignment.# YOUR CODE GOES HERE<- allows us to define new global variables even when we are specifying inputs to a function. Pick a variable name that has not been defined yet in your workspace, say b (or something else, if this has already been used in your R Markdown document). Call huber(x=3, b<-20), then display the value of b—this variable should now exist in the global enviroment, and it should be equal to 20! Also, can you explain the output of huber(x=3, b<-20)?# YOUR CODE GOES HERE<- demonstrated in the last question, although tricky, can also be pretty useful. Leverage this property to plot the function \(y=0.05x^2 - \sin(x)\cos(x) + 0.1\exp(1+\log(x))\) over 50 x values between 0 and 2, using only one line of code and one call to the function seq().# YOUR CODE GOES HERE<- demonstrated in the last two questions does not hold in the body of a function. That is, give an example in which <- is used in the body of a function to define a variable, but this doesn’t translate into global assignment.# YOUR CODE GOES HERERecall, as we saw in Week 4, that the complete works of William Shakespeare are available freely from Project Gutenberg. We’ve put this text file up at http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt.
get.wordtab.from.url() from lecture. Modify this function so that the string vectors lines and words are both included as named components in the returned list. For good practice, update the documentation in comments to reflect your changes. Then call this function on the URL for the Shakespeare’s complete works: http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt (with the rest of the arguments at their default values) and save the result as shakespeare.wordobj. Using head(), display the first several elements of (definitely not all of!) the lines, words, and wordtab components of shakespeare.wordobj, just to check that the output makes sense to you.# get.wordtab.from.url: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page 
# - split: string, specifying what to split on. Default is the regex pattern
#   "[[:space:]]|[[:punct:]]"
# - tolower: Boolean, TRUE if words should be converted to lower case before
#   the word table is computed. Default is TRUE
# - keep.nums: Boolean, TRUE if words containing numbers should be kept in the
#   word table. Default is FALSE
# Output: list, containing lines, words, word table, and some basic summaries
get.wordtab.from.url = function(str.url, split="[[:space:]]|[[:punct:]]",
                                tolower=TRUE, keep.nums=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
    
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
    
  # Get rid of words with numbers, if we're asked to
  if (!keep.nums) 
    words = grep("[0-9]", words, inv=TRUE, val=TRUE)
  
  # Compute the word table
  wordtab = table(words)
  
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab),
              longest.word=words[which.max(nchar(words))]))
}shakespeare.lines = shakespeare.wordobj$lines, and then rerun your solution code (or the rerun the official solution code, if you’d like) for Q5 of Lab 6 on the lines of text stored in shakespeare.lines. (Note: you don’t actually need to rerun the code for Q5d or Q5e, since the code for Q5f will accomplish the same task only without encountering NAs). You should end up with two vectors titles.start and titles.end, containing the start and end positions of each of Shakespeare’s plays in shakespeare.lines. Print out titles.start and titles.end to the console.# YOUR CODE GOES HEREshakespeare.lines.by.play of length equal to the number of Shakespeare’s plays (a number you should have already computed in the solution to the last question). Using a for() loop, and relying on titles.start and titles.end, extract the appropriate subvector of shakespeare.lines for each of Shakespeare’s plays, and store it as a component of shakespeare.lines.by.play. That is, shakespeare.lines.by.play[[1]] should contain the lines for Shakespeare’s first play, shakespeare.lines.by.play[[2]] should contain the lines for Shakespeare’s second play, and so on. Name the components of shakespeare.lines.by.play according to the titles of the plays.# YOUR CODE GOES HEREhead(), print the first 4 lines of each of Shakespeare’s plays to the console (sorry graders …). This should only require one line of code.# YOUR CODE GOES HEREget.wordtab.from.lines() to have the same argument structure as get.wordtab.from.url(), which recall you last updated in Q2a, except that the first argument of get.wordtab.from.lines() should be lines, a string vector passed by the user that contains lines of text to be processed. The body of get.wordtab.from.lines() should be the same as get.wordtab.from.url(), except that lines is passed and does not need to be computed using readlines(). The output of get.wordtab.from.lines() should be the same as get.wordtab.from.url(), except that lines does not need to be returned as a component. For good practice, incude documentation for your get.wordtab.from.lines() function in comments.# YOUR CODE GOES HEREfor() loop or one of the apply functions (your choice here), run the get.wordtab.from.lines() function on each of the components of shakespeare.lines.by.play, (with the rest of the arguments at their default values). Save the result in a list called shakespeare.wordobj.by.play. That is, shakespeare.wordobj.by.play[[1]] should contain the result of calling this function on the lines for the first play, shakespeare.wordobj.by.play[[2]] should contain the result of calling this function on the lines for the second play, and so on.# YOUR CODE GOES HEREshakespeare.total.words.by.play and shakespeare.unique.words.by.play, that contain the number of total words and number of unique words, respectively, for each of Shakespeare’s plays. Each vector should only require one line of code to compute. Hint: recall `[[`() is actually a function that allows you to do extract a named component of a list; e.g., try `[[`(shakespeare.wordobj, "number.total.words"), and you’ll see this is the same as shakespeare.wordobj[["number.total.words"]]; you should take advantage of this functionality in your apply call. What are the 5 longest plays, in terms of total word count? The 5 shortest plays?# YOUR CODE GOES HERE# YOUR CODE GOES HERE6. Look back at get.wordtab.from.lines() and get.wordtab.from.url(). Note that they overlap heavily, i.e., their bodies contain a lot of the same code. Redefine get.wordtab.from.url() so that it just calls get.wordtab.from.lines() in its body. Your new get.wordtab.from.url() function should have the same inputs as before, and produce the same output as before. So externally, nothing will have changed; we are just changing the internal structure of get.wordtab.from.url() to clean up our code base (specifically, to avoid code duplication in our case). This is an example of code refactoring.
Call your new get.wordtab.from.url() function on the URL for Shakespeare’s complete works, saving the result as shakespeare.wordobj2. Compare some of the components of shakespeare.wordobj2 to those of shakespeare.wordobj (which was computed using the old function definition) to check that your new implementation works as it should.
# YOUR CODE GOES HEREall.equal() whether shakespeare.wordobj and shakespeare.wordobj2 are the same. Likely, this will not return TRUE. (If it does, then you’ve already solved this challenge question!) Modify your get.wordtab.from.url() function from the last question, so that it still calls get.wordtab.from.lines() to do the hard work, but produces an output exactly the same as the original shakespeare.wordobj object. Demonstrate your success by calling all.equal() once again.# YOUR CODE GOES HERE