Functions: Example with Pareto distribution

Distribution of wealth has very, very heavy tails
A good (but not perfect) model is the Pareto distribution, pdf \[ f(x;\alpha,x_0) = \begin{cases} \frac{\alpha-1}{x_0}\left(\frac{x}{x_0}\right)^{-\alpha} & x \ge x_0\\ 0 & x < x_0 \end{cases} \]
We often want the quantiles (where does the 99% begin? the 99.9%? the 80%?)
- The mathematical quantile function is \[ Q(p; \alpha,x_0) = x_0(1-p)^{-\frac{1}{\alpha-1}} \]
Could do stuff like

# alpha = 2.33, x_0 = 6e8, p = 0.5
6e8 * (1-0.5)^(-1/(2.33-1))

## [1] 1010391288

# alpha = 2.33, x_0 = 6e8, p = 0.4
6e8 * (1-0.4)^(-1/(2.33-1))

## [1] 880957225

# alpha = 2.5, x_0 = 1e6, p = 0.92
1e6 * (1-0.92)^(-1/(2.5-1))

## [1] 5386087

Functions: Pareto example

Bundle up the repetitive code into one function:

# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: the pth quantile
qpareto.1 <- function(p, exponent, threshold) {
  q <- threshold*((1-p)^(-1/(exponent-1)))
  return(q)
}

Now we can say:

qpareto.1(p=0.5, exponent=2.33, threshold=6e8)

## [1] 1010391288

qpareto.1(p=0.4, exponent=2.33, threshold=6e8)

## [1] 880957225

qpareto.1(p=0.92, exponent=2.5, threshold=1e6)

## [1] 5386087

Parts of a function

Arguments: what does the function need to know?
- Default values let you omit arguments that don’t change often
- Default values can always be over-ridden
- No-argument functions are possible
Return value: what does the function give back to R?
- Try to be explicit about what you return
- Use c() or list() to return more than one thing (We will see some examples of this)
The body: do these things to the arguments to produce the return value

Parts of a function: Pareto quantiles cont’d

# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
  # flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.2 <- function(p, exponent, threshold, lower.tail=TRUE) {
  if(lower.tail==FALSE) {
    p <- 1-p
  }
  q <- threshold*((1-p)^(-1/(exponent-1)))
  return(q)
}

qpareto.2(p=0.92, exponent=2.5, threshold=1e6)

## [1] 5386087

qpareto.2(p=0.92, exponent=2.5, threshold=1e6, lower.tail=FALSE)

## [1] 1057162

When should we write functions?

When we’re repeating the same thing multiple times
- Especially when: we’re repeating almost the same thing
When we’re writing very similar code over and over
When we’re breaking down a big job into smaller ones
When we’re told to

Chaining functions together

Divide the big job (“come up with a confidence band”) into smaller parts
Write a function for each part
Make sure the outputs of the early steps can be used as inputs to the later steps
Repeat (recurse) if need be: the sub-functions spawn sub-sub-functions, etc.
We can often re-use the sub-functions

Calling functions from other functions

Functions can call other functions, including other functions we’ve written
Could have avoided replicating the quantile code from qpareto.1 in qpareto.2:

# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
  # flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
  if(lower.tail==FALSE) {
    p <- 1-p
  }
  # Call qpareto.1() to do the actual calculation
  q <- qpareto.1(p, exponent, threshold)
  return(q)
}

Functions calling other functions (cont’d)

You will often be working with several intertwined functions at the same time
The called functions have to be defined when they are used
- Example: the lecture code for cv.lm() calls a function, response.name(), to figure out what variable in a lm formula is the response
- response.name() is defined just after cv.lm(), but before the code +runs_ cv.lm()
If you change (fix) the definition of a function, the new, revised one will be used thereafter (no snapshots)

Debugging

Eventually, you will break something.
Problem: the final piece of code you write gives an incomprehensible error message
Idea: Carefully characterize the bug, then narrow down where it could be until the the problem is obvious.
Strategy: make sure all the lower-level pieces are working properly
- Start debugging from the bottom
- If all the small bits work on their own, the bug comes when you put them together

Debugging: Test cases

Use test cases
- “What should this do if it works?”
- “What should this not do if it’s broken?”
When you meet a new bug, write a new test
- “How do I know if this is fixed?”
- Test could be very minimal: “It runs without crashing”

Debugging: Some tools

traceback(): Show all the calls leading to the last error
- Look for the last line of code you wrote
- This helps localize where things went wrong
print(): Make R send output to the console
- Most useful a bit before where things crash/go crazy
- Remember to get rid of it when the bug is fixed!
browser(): A convenient way to mess around inside a function
- When called within a function, gives you an R shell there
- Avoids problems with local variables disappearing
- A common trick: Executes [Dangerous Expression] and calls browser() if there is an error.
```
tryCatch({[Dangerous Expression]},warning=function(w){browser()},
       error=function(e){browser()})
```

Debugging: Pareto example continued

I want to generate a Pareto random variable
- If \(F(x)\) is the CDF for a distribution and \(U\) is a uniform random variable, then \(F^{-1}(U)\) has the \(F\) distribution.
New version of the quantile function that sanity-checks the arguments:

# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
  # flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
  stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
  q <- qpareto.3(p,exponent,threshold,lower.tail)
  return(q)
}

Try it with a bad input

qpareto.4(p=0.92,exponent=2.5,threshold=-1,lower.tail=FALSE)

## Error: threshold > 0 is not TRUE

Debugging: Pareto example continued

Now define a random-Pareto function. This version is (intentionally) buggy!

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto <- function(n,exponent,threshold) {
  x <- vector(length=n)
  for (i in 1:n) {
    x[i] <- qpareto.4(p=rnorm(1),exponent=exponent,threshold=threshold)
  }
  return(x)
}

Debugging: Pareto example continued

Our buggy code gives us errors!

rpareto(10)

## Error in qpareto.4(p = rnorm(1), exponent = exponent, threshold = threshold): argument "exponent" is missing, with no default

rpareto(n=10,exponent=2.5,threshold=1)

## Error: p >= 0 is not TRUE

Debugging: The in-class exercise

EXERCISE: Where is the bug? Why doesn’t rpareto(n=10,exponent=2.5,threshold=1) work?

qpareto.1 <- function(p, exponent, threshold) {
  q <- threshold*((1-p)^(-1/(exponent-1)))
  return(q)
}

qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
  if(lower.tail==FALSE) {
    p <- 1-p
  }
  q <- qpareto.1(p, exponent, threshold)
  return(q)
}

qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
  stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
  q <- qpareto.3(p,exponent,threshold,lower.tail)
  return(q)
}

rpareto <- function(n,exponent,threshold) {
  x <- vector(length=n)
  for (i in 1:n) {
    x[i] <- qpareto.4(p=rnorm(1),exponent=exponent,threshold=threshold)
  }
  return(x)
}

Exercise: Solution

We’ve checked the qpareto functions
The error message is that p >= 0 is not true
The error message comes from qpareto.4()
So a negative value of p has to be going into qpareto.4() when it’s called by rpareto()
And in fact that code says p=rnorm(1), and rnorm() draws from the standard normal, which is negative half the time
That should be p=runif(1) (uniform distribution on \([0,1]\)), not p=rnorm(1)
- As mentioned, this is the bug I actually made when I first wrote rpareto in 2004

Avoid iteration

Iterating in R is often quite slow
- Not (always) true in other languages
R has lots of ways to avoid iteration
This is often faster! And easier to follow!

Avoid iteration: Pareto example cont’d

A very iterative way of making many random Pareto draws:

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.1 <- function(n,exponent,threshold) {
  x <- c()
  for (i in 1:n) {
    x <- c(x,qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
  }
  return(x)
}

system.time(rpareto.1(n=1e4,exponent=2.5,threshold=1))

##    user  system elapsed 
##   0.323   0.018   0.343

Avoid iteration: Pareto example cont’d

A not-so-iterative way:

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.2 <- function(n,exponent,threshold) {
  x = replicate(n, qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
  return(x)
}


system.time(rpareto.2(n=1e4,exponent=2.5,threshold=1))

##    user  system elapsed 
##   0.122   0.001   0.123

Avoiding iteration: Vectorizing

Lots of R functions are happy with vector arguments, and vectorize along them
Our qpareto() functions can handle vector p arguments just fine:

qpareto.4(c(.1,.2,.3), 2.5, 1)

## [1] 1.072766 1.160397 1.268434

So use that:

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.3 <- function(n,exponent,threshold) {
  x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
  return(x)
}

system.time(rpareto.3(10000,exponent=2.5,threshold=1))

##    user  system elapsed 
##   0.000   0.000   0.001

Avoiding iteration: `ifelse`

use ifelse to do conditional things to entire vectors

# Calculate Huber's loss function
# Input: vector of numbers x
# Return: x^2 for |x|<1, 2|x|-1 otherwise
huber <- function(x) {
  return(ifelse(abs(x) <= 1, x^2, 2*abs(x)-1))
}
x = seq(from=-3,to=3,length.out=100)
plot(x,huber(x), type="l")

Avoiding iteration: matrix and vector operations

# Get some long vectors
n = 1e6
a = rnorm(n)
b = rnorm(n)
# Find the inner product with a for() loop
system.time({
  total = 0
  for(i in 1:length(a)){total = total + a[i]*b[i]}
})

##    user  system elapsed 
##   0.058   0.000   0.058

# Find the inner product with matrix operations
system.time({
a %*% b
})

##    user  system elapsed 
##   0.005   0.000   0.005

Avoiding iteration: `apply` and friends

Use apply to apply a function to rows or columns of a matrix
Use sapply to apply a function to every element of a vector

#Make a 100x100 matrix of uniform [0,1] variables
X = matrix(runif(10000),nrow=100,ncol=100)
#Find the maximum of each column
column.maxs = apply(X, MARGIN = 2, FUN = max)
#Plot a histogram.  Do you know the distribution?
hist(column.maxs,20)

The tidyverse package defines a lot of new functions to push this idea to logical extremes

Do not be afraid of iteration

Some jobs are just inherently iterative
Use it when that’s what you need to do
If you’re storing stuff as you go, create the variables which you’re storing into outside the loop

sim.ar <- function(n,x1,phi=0,sd=1) {
  x <- vector(length=n)
  x[1] <- x1
  for (t in 2:n) {
    x[t] <- phi*x[t-1]+rnorm(1,mean=0,sd=sd)
  }
  return(x)
}

Returning vectors and lists

A function can return one thing. Often you want to return more than one value:
- For several values of the same type, you can return a vector
- For several different sorts of values, you can return a list. You can even assign names!
Returning a vector: Look at rpareto():

rpareto.3 <- function(n,exponent,threshold) {
  x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
  return(x) #If n>1, this is a vector!
}

Returning a list: look at the end of cv_bws_npreg() from lecture 4:

  return(list(best.bw=best.bw,CV_MSEs=CV_MSEs,
              fold_MSEs=fold_MSEs))

Design advice

Design from the top down
- Start from what you want to do (not how)
- Break that task into a few (2–6) smaller tasks
- Keep breaking tasks into smaller tasks until you can’t go further
Code from the bottom up
- Start with the smallest, simplest tasks
- Write a function for each one, and test it
- Then write functions that call those functions (and test them)
- Keep going until you have top-level functions for the whole task

What you need to remember

Use meaningful names and Comment your code
Write functions to bundle related commands together
Split big jobs into a bunch of related functions
Practice debugging
Avoid iteration, in favor of acting on whole objects

Aside: Style: Things not to do with your main data frame

Don’t use attach()
- Makes it really hard to track down errors
- Can shorten your code using with() instead
Try not to call your main data set data
- vague
- the name of a function for loading data sets from packages
- often the name of arguments in model-fitting functions
- both of those make it hard to trace what’s going on

# Not what you want to read:
data <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/hw/02/uval.csv")
data.lm <- lm(growth~log(gdp)+underval, data=data)

Aside: No-argument functions

No-argument functions can be useful for simulation

# Function to return the biggest correlation of 5 length-10 Gaussian vectors
# Takes no arguments, so it can be used in a simulation
random.correlation.max = function(){
  #Draw a 10x5 matrix of Gaussians
  X = matrix(rnorm(50),nrow=10,ncol=5)
  #Compute the correlation matrix
  cor.matrix = cor(X)
  #Find the biggest entry, making sure to avoid the diagonal (which is all 1)
  return(max(cor.matrix[upper.tri(cor.matrix)]))
}

Draw 100 samples and plot them

hist(replicate(100,random.correlation.max()),20,col='grey', main="Max. Correlation",xlab='correlation')

Aside: Functions refer to the current definition of other functions

A very simple function that adds one to its argument

add.one = function(x){
  x+1
}

A function that calls add.one() to do its addition

add.stuff = function(x){
  return(add.one(x))
}

add.stuff(3)

## [1] 4

A terrible job rewriting add.one():

add.one = function(x){
  x+2
}

Now what?

add.stuff(3)

## [1] 5

Minimal Advice on Programming in R

Why do you need to do programming?

Agenda for today

Style: Why?

Style: Commandments

Functions

Functions: Example with Pareto distribution

Functions: Pareto example

Parts of a function

Parts of a function: Pareto quantiles cont’d

When should we write functions?

Chaining functions together

Calling functions from other functions

Functions calling other functions (cont’d)

Debugging

Debugging: Test cases

Debugging: Some tools

Debugging: Pareto example continued

Debugging: Pareto example continued

Debugging: Pareto example continued

Debugging: The in-class exercise

Exercise: Solution

Avoid iteration

Avoid iteration: Pareto example cont’d

Avoid iteration: Pareto example cont’d

Avoiding iteration: Vectorizing

Avoiding iteration: `ifelse`

Avoiding iteration: matrix and vector operations

Avoiding iteration: `apply` and friends

Do not be afraid of iteration

Returning vectors and lists

Design advice

What you need to remember

Aside: Style: Things not to do with your main data frame

Aside: No-argument functions

Aside: Functions refer to the current definition of other functions

Minimal Advice on Programming in R

Why do you need to do programming?

Agenda for today

Style: Why?

Style: Commandments

Functions

Functions: Example with Pareto distribution

Functions: Pareto example

Parts of a function

Parts of a function: Pareto quantiles cont’d

When should we write functions?

Chaining functions together

Calling functions from other functions

Functions calling other functions (cont’d)

Debugging

Debugging: Test cases

Debugging: Some tools

Debugging: Pareto example continued

Debugging: Pareto example continued

Debugging: Pareto example continued

Debugging: The in-class exercise

Exercise: Solution

Avoid iteration

Avoid iteration: Pareto example cont’d

Avoid iteration: Pareto example cont’d

Avoiding iteration: Vectorizing

Avoiding iteration: ifelse

Avoiding iteration: matrix and vector operations

Avoiding iteration: apply and friends

Do not be afraid of iteration

Returning vectors and lists

Design advice

What you need to remember

Aside: Style: Things not to do with your main data frame

Aside: No-argument functions

Aside: Functions refer to the current definition of other functions

Avoiding iteration: `ifelse`

Avoiding iteration: `apply` and friends