---
title: "Minimal Advice on Programming in R"
author: 36-402, Section A
date: 29 January 2019 (Lecture 5)
output: slidy_presentation
---
```{r}
library(knitr)
opts_chunk$set(size="small", cache=TRUE,
autodep=TRUE, warnings=TRUE, error=TRUE)
```
## Why do _you_ need to do programming?
> - Programs for data analysis are tools or machines
> + Understand your tools!
> - If you know how to make new tools, you always have the right tools
> + Don't distort your problem to match old tools
> - Programming is making those tools
> - Programming is also making your ideas clear
## Agenda for today
1. Style
2. Functions
3. Debugging
4. Avoiding iteration
5. Design advice
## Style: Why?
> - Code is about telling the computer what to do
> - Code is _also_ about communicating with other people
> + Including communicating with _yourself_, later
## Style: Commandments
- Use meaningful names for variables, data-sets, functions, etc.
+ Avoid magic numbers
- Indent your code blocks
- **Comment your code**
+ Comment everything that seems clever or tricky
+ Comment every trick you took from somewhere else
+ Comment every function with purpose, inputs, outputs
## Functions
- Use functions to tie together related commands so they can be repeated
+ Less work (in the long run)
+ Fewer errors (get it right _once_, make improvements _once_)
+ Clearer: allows for abstraction
+ Lets you build complicated procedures out of modular, re-usable parts
## Functions: Example with Pareto distribution
- Distribution of wealth has very, very heavy tails
- A good (but not perfect) model is the **Pareto** distribution, pdf
\[
f(x;\alpha,x_0) = \begin{cases}
\frac{\alpha-1}{x_0}\left(\frac{x}{x_0}\right)^{-\alpha} & x \ge x_0\\
0 & x < x_0
\end{cases}
\]
- We often want the quantiles (where does the 99% begin? the 99.9%? the 80%?)
+ The **mathematical** quantile function is
\[
Q(p; \alpha,x_0) = x_0(1-p)^{-\frac{1}{\alpha-1}}
\]
- Could do stuff like
```{r}
# alpha = 2.33, x_0 = 6e8, p = 0.5
6e8 * (1-0.5)^(-1/(2.33-1))
# alpha = 2.33, x_0 = 6e8, p = 0.4
6e8 * (1-0.4)^(-1/(2.33-1))
# alpha = 2.5, x_0 = 1e6, p = 0.92
1e6 * (1-0.92)^(-1/(2.5-1))
```
## Functions: Pareto example
- Bundle up the repetitive code into one function:
```{r}
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: the pth quantile
qpareto.1 <- function(p, exponent, threshold) {
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
```
- Now we can say:
```{r}
qpareto.1(p=0.5, exponent=2.33, threshold=6e8)
qpareto.1(p=0.4, exponent=2.33, threshold=6e8)
qpareto.1(p=0.92, exponent=2.5, threshold=1e6)
```
## Parts of a function
- Arguments: what does the function need to know?
+ Default values let you omit arguments that don't change often
+ Default values can always be over-ridden
+ [No-argument functions are possible](#noargs)
- Return value: what does the function give back to R?
+ Try to be explicit about what you return
+ Use `c()` or `list()` to return more than one thing (We will see some examples of this)
- The body: do these things to the arguments to produce the return value
## Parts of a function: Pareto quantiles cont'd
```{r}
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.2 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
```
```{r}
qpareto.2(p=0.92, exponent=2.5, threshold=1e6)
qpareto.2(p=0.92, exponent=2.5, threshold=1e6, lower.tail=FALSE)
```
## When should we write functions?
> - When we're repeating the same thing multiple times
> + Especially when: we're repeating **almost** the same thing
> - When we're writing very similar code over and over
> - When we're breaking down a big job into smaller ones
> - When we're told to
## Chaining functions together
- Divide the big job ("come up with a confidence band") into smaller parts
- Write a function for each part
- Make sure the outputs of the early steps can be used as inputs to the later steps
- Repeat (recurse) if need be: the sub-functions spawn sub-sub-functions, etc.
- We can often re-use the sub-functions
## Calling functions from other functions
- Functions can call other functions, including other functions we've written
- Could have avoided replicating the quantile code from `qpareto.1` in `qpareto.2`:
```{r}
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
# Call qpareto.1() to do the actual calculation
q <- qpareto.1(p, exponent, threshold)
return(q)
}
```
## Functions calling other functions (cont'd)
- You will often be working with several intertwined functions at the same time
- The called functions have to be defined when they are _used_
+ Example: the lecture code for `cv.lm()` calls a function, `response.name()`,
to figure out what variable in a `lm` formula is the response
+ `response.name()` is defined just after `cv.lm()`, but before the code +runs_ `cv.lm()`
- If you change (fix) the definition of a function, the new, revised one will
be used thereafter (no snapshots)
## Debugging
> - Eventually, you will break something.
> - **Problem**: the final piece of code you write gives an incomprehensible error message
> - **Idea**: Carefully characterize the bug, then narrow down where it could be until the the problem is obvious.
> - Strategy: make sure all the lower-level pieces are working properly
> + Start debugging from the bottom
> + If all the small bits work on their own, the bug comes when you put them together
## Debugging: Test cases
> - Use test cases
> + "What should this do if it works?"
> + "What should this _not_ do if it's broken?"
> - When you meet a new bug, write a new test
> + "How do I know if this is fixed?"
> + Test could be very minimal: "It runs without crashing"
## Debugging: Some tools
- `traceback()`: Show all the calls leading to the last error
+ Look for the last line of code you wrote
+ This helps localize where things went wrong
- `print()`: Make R send output to the console
+ Most useful a bit before where things crash/go crazy
+ Remember to get rid of it when the bug is fixed!
- `browser()`: A convenient way to mess around inside a function
+ When called within a function, gives you an R shell there
+ Avoids problems with local variables disappearing
+ A common trick: Executes `[Dangerous Expression]` and calls `browser()` if there is an error.
```
tryCatch({[Dangerous Expression]},warning=function(w){browser()},
error=function(e){browser()})
```
## Debugging: Pareto example continued
- I want to generate a Pareto random variable
+ If $F(x)$ is the CDF for a distribution and $U$ is a uniform random variable, then $F^{-1}(U)$ has the $F$ distribution.
- New version of the quantile function that sanity-checks the arguments:
```{r}
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
q <- qpareto.3(p,exponent,threshold,lower.tail)
return(q)
}
```
- Try it with a bad input
```{r}
qpareto.4(p=0.92,exponent=2.5,threshold=-1,lower.tail=FALSE)
```
## Debugging: Pareto example continued
- Now define a random-Pareto function. This version is (intentionally) buggy!
```{r}
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto <- function(n,exponent,threshold) {
x <- vector(length=n)
for (i in 1:n) {
x[i] <- qpareto.4(p=rnorm(1),exponent=exponent,threshold=threshold)
}
return(x)
}
```
## Debugging: Pareto example continued
- Our buggy code gives us errors!
```{r}
rpareto(10)
```
```{r}
rpareto(n=10,exponent=2.5,threshold=1)
```
## Debugging: The in-class exercise {.smaller}
EXERCISE: Where is the bug? Why doesn't `rpareto(n=10,exponent=2.5,threshold=1)` work?
```{r, eval=FALSE}
qpareto.1 <- function(p, exponent, threshold) {
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
q <- qpareto.1(p, exponent, threshold)
return(q)
}
qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
q <- qpareto.3(p,exponent,threshold,lower.tail)
return(q)
}
rpareto <- function(n,exponent,threshold) {
x <- vector(length=n)
for (i in 1:n) {
x[i] <- qpareto.4(p=rnorm(1),exponent=exponent,threshold=threshold)
}
return(x)
}
```
## Exercise: Solution
- We've checked the `qpareto` functions
- The error message is that `p >= 0` is not true
- The error message comes from `qpareto.4()`
- So a negative value of `p` has to be going into `qpareto.4()` when it's called
by `rpareto()`
- And in fact that code says `p=rnorm(1)`, and `rnorm()` draws from the standard normal, which is negative half the time
- That should be `p=runif(1)` (uniform distribution on $[0,1]$), not `p=rnorm(1)`
+ As mentioned, this is the bug I actually made when I first wrote `rpareto` in 2004
## Avoid iteration
> - Iterating in R is often quite slow
> + Not (always) true in other languages
> + Basically: every time you assign in R, you make a copy
> - R has lots of ways to avoid iteration
> - This is often faster! And easier to follow!
## Avoid iteration: Pareto example cont'd
- A very iterative way of making many random Pareto draws:
```{r}
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.1 <- function(n,exponent,threshold) {
x <- c()
for (i in 1:n) {
x <- c(x,qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
}
return(x)
}
```
```{r}
system.time(rpareto.1(n=1e4,exponent=2.5,threshold=1))
```
## Avoid iteration: Pareto example cont'd
- A not-so-iterative way:
```{r}
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.2 <- function(n,exponent,threshold) {
x = replicate(n, qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
return(x)
}
system.time(rpareto.2(n=1e4,exponent=2.5,threshold=1))
```
## Avoiding iteration: Vectorizing
- Lots of R functions are happy with _vector_ arguments, and **vectorize** along them
- Our `qpareto()` functions can handle vector `p` arguments just fine:
```{r}
qpareto.4(c(.1,.2,.3), 2.5, 1)
```
- So use that:
```{r}
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.3 <- function(n,exponent,threshold) {
x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
return(x)
}
```
```{r}
system.time(rpareto.3(10000,exponent=2.5,threshold=1))
```
## Avoiding iteration: `ifelse`
- use `ifelse` to do conditional things to entire vectors
```{r, fig.width=4, fig.height=4}
# Calculate Huber's loss function
# Input: vector of numbers x
# Return: x^2 for |x|<1, 2|x|-1 otherwise
huber <- function(x) {
return(ifelse(abs(x) <= 1, x^2, 2*abs(x)-1))
}
x = seq(from=-3,to=3,length.out=100)
plot(x,huber(x), type="l")
```
## Avoiding iteration: matrix and vector operations
```{r}
# Get some long vectors
n = 1e6
a = rnorm(n)
b = rnorm(n)
# Find the inner product with a for() loop
system.time({
total = 0
for(i in 1:length(a)){total = total + a[i]*b[i]}
})
# Find the inner product with matrix operations
system.time({
a %*% b
})
```
## Avoiding iteration: `apply` and friends
- Use `apply` to apply a function to rows or columns of a matrix
- Use `sapply` to apply a function to every element of a vector
```{r}
#Make a 100x100 matrix of uniform [0,1] variables
X = matrix(runif(10000),nrow=100,ncol=100)
#Find the maximum of each column
column.maxs = apply(X, MARGIN = 2, FUN = max)
#Plot a histogram. Do you know the distribution?
hist(column.maxs,20)
```
- The `tidyverse` package defines a _lot_ of new functions to push this idea to logical extremes
## Do not be afraid of iteration
- Some jobs are just inherently iterative
- Use it when that's what you need to do
- If you're storing stuff as you go, create the variables which you're storing into
outside the loop
```{r, eval=FALSE}
sim.ar <- function(n,x1,phi=0,sd=1) {
x <- vector(length=n)
x[1] <- x1
for (t in 2:n) {
x[t] <- phi*x[t-1]+rnorm(1,mean=0,sd=sd)
}
return(x)
}
```
## Returning vectors and lists
- A function can return one *thing*. Often you want to return more than one value:
+ For several values of the same type, you can return a vector
+ For several different sorts of values, you can return a list. You can even assign names!
- Returning a vector: Look at `rpareto()`:
```{r, eval=FALSE}
rpareto.3 <- function(n,exponent,threshold) {
x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
return(x) #If n>1, this is a vector!
}
```
- Returning a list: look at the end of `cv_bws_npreg()` from lecture 4:
```{r, eval=FALSE}
return(list(best.bw=best.bw,CV_MSEs=CV_MSEs,
fold_MSEs=fold_MSEs))
```
## Design advice
- Design from the top down
+ Start from _what_ you want to do (not how)
+ Break that task into a few (2--6) smaller tasks
+ Keep breaking tasks into smaller tasks until you can't go further
- Code from the bottom up
+ Start with the smallest, simplest tasks
+ Write a function for each one, and test it
+ Then write functions that call _those_ functions (and test them)
+ Keep going until you have top-level functions for the whole task
## What you need to remember
1. **Use meaningful names** and **Comment your code**
2. Write functions to bundle related commands together
3. Split big jobs into a bunch of related functions
4. Practice debugging
5. Avoid iteration, in favor of acting on whole objects
## Aside: Style: Things not to do with your main data frame
- Don't use `attach()`
+ Makes it really hard to track down errors
+ Can shorten your code using `with()` instead
- Try not to call your main data set `data`
- vague
- the name of a function for loading data sets from packages
- often the name of arguments in model-fitting functions
- both of those make it hard to trace what's going on
```{r,eval=FALSE}
# Not what you want to read:
data <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/hw/02/uval.csv")
data.lm <- lm(growth~log(gdp)+underval, data=data)
```
## Aside: No-argument functions {#noargs}
- No-argument functions can be useful for simulation
```{r, fig.height=3,fig.width=4}
# Function to return the biggest correlation of 5 length-10 Gaussian vectors
# Takes no arguments, so it can be used in a simulation
random.correlation.max = function(){
#Draw a 10x5 matrix of Gaussians
X = matrix(rnorm(50),nrow=10,ncol=5)
#Compute the correlation matrix
cor.matrix = cor(X)
#Find the biggest entry, making sure to avoid the diagonal (which is all 1)
return(max(cor.matrix[upper.tri(cor.matrix)]))
}
```
- Draw 100 samples and plot them
```{r}
hist(replicate(100,random.correlation.max()),20,col='grey', main="Max. Correlation",xlab='correlation')
```
## Aside: Functions refer to the _current_ definition of other functions
- A very simple function that adds one to its argument
```{r}
add.one = function(x){
x+1
}
```
- A function that calls `add.one()` to do its addition
```{r}
add.stuff = function(x){
return(add.one(x))
}
```
```{r}
add.stuff(3)
```
- A terrible job rewriting `add.one()`:
```{r}
add.one = function(x){
x+2
}
```
- Now what?
```{r}
add.stuff(3)
```