Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Thursday at 10pm this week.

This week’s agenda: creating and updating functions; understanding argument and return structures; revisiting Shakespeare’s plays; code refactoring.

Huber loss function

The Huber loss function (or just Huber function, for short) is defined as:
\[
\psi(x) = \begin{cases} x^2 & \text{if } |x| \leq 1 \\ 2|x| - 1 & \text{if } |x| > 1 \end{cases}
\]
This function is quadratic on the interval [-1,1], and linear outside of this interval. It transitions from quadratic to linear “smoothly” (at \(|x| = 1\) both pieces equal 1 and have slopes of the same magnitude, 2), and looks like this:
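A minimal sketch of this definition in R can be used to redraw the curve. The name `huber.sketch` and the vectorized `ifelse` form are assumptions for illustration, not the lab's reference solution:

```r
# Hypothetical helper: Huber function with the cutoff fixed at 1,
# exactly as in the piecewise definition above
huber.sketch = function(x) {
  # Quadratic where |x| <= 1, linear elsewhere; works on whole vectors
  ifelse(abs(x) <= 1, x^2, 2*abs(x) - 1)
}

# A few spot checks on both pieces and at the boundary
huber.sketch(c(-2, -1, 0.5, 1, 3))  # 3.00 1.00 0.25 1.00 5.00
```

Calling `curve(huber.sketch(x), from=-3, to=3)` should plot the quadratic piece joined to the two linear arms.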

It is often used in place of the usual squared error loss for robust estimation. The sample average, \(\bar{X}\)—which given a sample \(X_1,\ldots,X_n\) minimizes the squared error loss \(\sum_{i=1}^n (X_i-m)^2\) over all choices of \(m\)—can be inaccurate as an estimate of \(\mathbb{E}(X)\) if the distribution of \(X\) is heavy-tailed. In such cases, minimizing Huber loss can give a better estimate. (Interested in hearing more? Come ask one of us, or ask your 401 or 402 Professor!)
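To make the contrast concrete, here is a small sketch that compares the sample average with the minimizer of the total Huber loss \(\sum_{i=1}^n \psi(X_i-m)\). The sample values, the helper names `psi` and `huber.loss`, and the use of `optimize()` for the one-dimensional minimization are all assumptions chosen for illustration:

```r
# Huber function from the definition above (hypothetical helper name)
psi = function(x) ifelse(abs(x) <= 1, x^2, 2*abs(x) - 1)

# Made-up sample: five values near 0 plus one large outlier
x = c(-1.2, -0.5, 0.1, 0.4, 0.9, 50)

# Total Huber loss for a candidate center m
huber.loss = function(m) sum(psi(x - m))

# Minimizer of squared error loss is the sample average
m.mean = mean(x)

# Minimize the Huber loss numerically over the range of the data
m.huber = optimize(huber.loss, interval=range(x), tol=1e-8)$minimum

c(mean=m.mean, huber=m.huber)
```

In this toy sample the single outlier pulls the sample average up to roughly 8.28, while the Huber minimizer stays near 0.225, close to the bulk of the data.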

Some simple function tasks

x.vals = seq(0, 5, length=21)
huber.vals = c(0.0000, 0.0625, 0.2500, 0.5625, 1.0000, 1.5625, 2.2500,
               3.0625, 4.0000, 5.0625, 6.2500, 7.5625, 9.0000, 10.5000,
               12.0000, 13.5000, 15.0000, 16.5000, 18.0000, 19.5000, 

Shakespeare’s complete works

Recall, as in lab/hw from Week 3, that the complete works of William Shakespeare are available freely from Project Gutenberg. We’ve put this text file up at

Getting lines of text play-by-play

# get.wordtab.from.url: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page 
# - split: string, specifying what to split on. Default is the regex pattern
#   "[[:space:]]|[[:punct:]]"
# - tolower: Boolean, TRUE if words should be converted to lower case before
#   the word table is computed. Default is TRUE
# - keep.nums: Boolean, TRUE if words containing numbers should be kept in the
#   word table. Default is FALSE
# Output: list, containing word table, and some basic numeric summaries

get.wordtab.from.url = function(str.url, split="[[:space:]]|[[:punct:]]",
                                tolower=TRUE, keep.nums=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  # Get rid of words with numbers, if we're asked to
  if (!keep.nums) 
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  # Compute the word table
  wordtab = table(words)
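
The steps documented above (collapse lines, split, drop empty strings, lower-case, drop words with numbers, tabulate) can be tried without a network connection. The sketch below applies them to a character vector already in memory. The name `wordtab.sketch`, the two sample lines, and the fields of the returned list are assumptions for illustration, since the original return value is not shown here:

```r
# Hypothetical helper: same processing steps, but on lines held in memory
wordtab.sketch = function(lines, split="[[:space:]]|[[:punct:]]",
                          tolower=TRUE, keep.nums=FALSE) {
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  # Get rid of words with numbers, if we're asked to
  if (!keep.nums)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  wordtab = table(words)
  # Assumed summaries: counts of unique and total words
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab)))
}

res = wordtab.sketch(c("To be, or not to be:", "that is the 1 question."))
res$number.unique.words  # 8
res$number.total.words   # 10
```

With the defaults, "To" and "to" are counted together and the token "1" is dropped, which is why the two lines yield 10 words in total, 8 of them distinct.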

Getting word tables play-by-play

Refactoring the word table functions