36-402, Section A

29 January 2019 (Lecture 5)

- Programs for data analysis are tools or machines
- Understand your tools!

- If you know how to make new tools, you always have the right tools
- Don’t distort your problem to match old tools

- Programming is making those tools
- Programming is also making your ideas clear

- Style
- Functions
- Debugging
- Avoiding iteration
- Design advice

- Code is about telling the computer what to do
- Code is
*also*about communicating with other people- Including communicating with
*yourself*, later

- Including communicating with

- Use meaningful names for variables, data-sets, functions, etc.
- Avoid magic numbers

- Indent your code blocks
**Comment your code**- Comment everything that seems clever or tricky
- Comment every trick you took from somewhere else
- Comment every function with purpose, inputs, outputs

- Use functions to tie together related commands so they can be repeated
- Less work (in the long run)
- Fewer errors (get it right
*once*, make improvements*once*) - Clearer: allows for abstraction
- Lets you build complicated procedures out of modular, re-usable parts

- Distribution of wealth has very, very heavy tails
- A good (but not perfect) model is the
**Pareto**distribution, pdf \[ f(x;\alpha,x_0) = \begin{cases} \frac{\alpha-1}{x_0}\left(\frac{x}{x_0}\right)^{-\alpha} & x \ge x_0\\ 0 & x < x_0 \end{cases} \] - We often want the quantiles (where does the 99% begin? the 99.9%? the 80%?)
- The
**mathematical**quantile function is \[ Q(p; \alpha,x_0) = x_0(1-p)^{-\frac{1}{\alpha-1}} \]

- The
- Could do stuff like

`## [1] 1010391288`

`## [1] 880957225`

`## [1] 5386087`

- Bundle up the repetitive code into one function:

```
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: the pth quantile
qpareto.1 <- function(p, exponent, threshold) {
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
```

- Now we can say:

`## [1] 1010391288`

`## [1] 880957225`

`## [1] 5386087`

- Arguments: what does the function need to know?
- Default values let you omit arguments that don’t change often
- Default values can always be over-ridden
- No-argument functions are possible

- Return value: what does the function give back to R?
- Try to be explicit about what you return
- Use
`c()`

or`list()`

to return more than one thing (We will see some examples of this)

- The body: do these things to the arguments to produce the return value

```
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.2 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
```

`## [1] 5386087`

`## [1] 1057162`

- When we’re repeating the same thing multiple times
- Especially when: we’re repeating
**almost**the same thing

- Especially when: we’re repeating
- When we’re writing very similar code over and over
- When we’re breaking down a big job into smaller ones
- When we’re told to

- Divide the big job (“come up with a confidence band”) into smaller parts
- Write a function for each part
- Make sure the outputs of the early steps can be used as inputs to the later steps
- Repeat (recurse) if need be: the sub-functions spawn sub-sub-functions, etc.
- We can often re-use the sub-functions

- Functions can call other functions, including other functions we’ve written
- Could have avoided replicating the quantile code from
`qpareto.1`

in`qpareto.2`

:

```
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
# Call qpareto.1() to do the actual calculation
q <- qpareto.1(p, exponent, threshold)
return(q)
}
```

- You will often be working with several intertwined functions at the same time
- The called functions have to be defined when they are
*used*- Example: the lecture code for
`cv.lm()`

calls a function,`response.name()`

, to figure out what variable in a`lm`

formula is the response `response.name()`

is defined just after`cv.lm()`

, but before the code +runs_`cv.lm()`

- Example: the lecture code for
- If you change (fix) the definition of a function, the new, revised one will be used thereafter (no snapshots)

- Eventually, you will break something.
**Problem**: the final piece of code you write gives an incomprehensible error message**Idea**: Carefully characterize the bug, then narrow down where it could be until the the problem is obvious.- Strategy: make sure all the lower-level pieces are working properly
- Start debugging from the bottom
- If all the small bits work on their own, the bug comes when you put them together

- Use test cases
- “What should this do if it works?”
- “What should this
*not*do if it’s broken?”

- When you meet a new bug, write a new test
- “How do I know if this is fixed?”
- Test could be very minimal: “It runs without crashing”

`traceback()`

: Show all the calls leading to the last error- Look for the last line of code you wrote
- This helps localize where things went wrong

`print()`

: Make R send output to the console- Most useful a bit before where things crash/go crazy
- Remember to get rid of it when the bug is fixed!

`browser()`

: A convenient way to mess around inside a function- When called within a function, gives you an R shell there
- Avoids problems with local variables disappearing
- A common trick: Executes
`[Dangerous Expression]`

and calls`browser()`

if there is an error.

`tryCatch({[Dangerous Expression]},warning=function(w){browser()}, error=function(e){browser()})`

- I want to generate a Pareto random variable
- If \(F(x)\) is the CDF for a distribution and \(U\) is a uniform random variable, then \(F^{-1}(U)\) has the \(F\) distribution.

- New version of the quantile function that sanity-checks the arguments:

```
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
q <- qpareto.3(p,exponent,threshold,lower.tail)
return(q)
}
```

- Try it with a bad input

`## Error: threshold > 0 is not TRUE`

- Now define a random-Pareto function. This version is (intentionally) buggy!

```
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto <- function(n,exponent,threshold) {
x <- vector(length=n)
for (i in 1:n) {
x[i] <- qpareto.4(p=rnorm(1),exponent=exponent,threshold=threshold)
}
return(x)
}
```

- Our buggy code gives us errors!

`## Error in qpareto.4(p = rnorm(1), exponent = exponent, threshold = threshold): argument "exponent" is missing, with no default`

`## Error: p >= 0 is not TRUE`

EXERCISE: Where is the bug? Why doesn’t `rpareto(n=10,exponent=2.5,threshold=1)`

work?

```
qpareto.1 <- function(p, exponent, threshold) {
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
q <- qpareto.1(p, exponent, threshold)
return(q)
}
qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
q <- qpareto.3(p,exponent,threshold,lower.tail)
return(q)
}
rpareto <- function(n,exponent,threshold) {
x <- vector(length=n)
for (i in 1:n) {
x[i] <- qpareto.4(p=rnorm(1),exponent=exponent,threshold=threshold)
}
return(x)
}
```

- We’ve checked the
`qpareto`

functions - The error message is that
`p >= 0`

is not true - The error message comes from
`qpareto.4()`

- So a negative value of
`p`

has to be going into`qpareto.4()`

when it’s called by`rpareto()`

- And in fact that code says
`p=rnorm(1)`

, and`rnorm()`

draws from the standard normal, which is negative half the time - That should be
`p=runif(1)`

(uniform distribution on \([0,1]\)), not`p=rnorm(1)`

- As mentioned, this is the bug I actually made when I first wrote
`rpareto`

in 2004

- As mentioned, this is the bug I actually made when I first wrote

- Iterating in R is often quite slow
- Not (always) true in other languages

- R has lots of ways to avoid iteration
- This is often faster! And easier to follow!

- A very iterative way of making many random Pareto draws:

```
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.1 <- function(n,exponent,threshold) {
x <- c()
for (i in 1:n) {
x <- c(x,qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
}
return(x)
}
```

```
## user system elapsed
## 0.323 0.018 0.343
```

- A not-so-iterative way:

```
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.2 <- function(n,exponent,threshold) {
x = replicate(n, qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
return(x)
}
system.time(rpareto.2(n=1e4,exponent=2.5,threshold=1))
```

```
## user system elapsed
## 0.122 0.001 0.123
```

Lots of R functions are happy with

*vector*arguments, and**vectorize**along themOur

`qpareto()`

functions can handle vector`p`

arguments just fine:

`## [1] 1.072766 1.160397 1.268434`

- So use that:

```
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.3 <- function(n,exponent,threshold) {
x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
return(x)
}
```

```
## user system elapsed
## 0.000 0.000 0.001
```

`ifelse`

- use
`ifelse`

to do conditional things to entire vectors

```
# Get some long vectors
n = 1e6
a = rnorm(n)
b = rnorm(n)
# Find the inner product with a for() loop
system.time({
total = 0
for(i in 1:length(a)){total = total + a[i]*b[i]}
})
```

```
## user system elapsed
## 0.058 0.000 0.058
```

```
## user system elapsed
## 0.005 0.000 0.005
```

`apply`

and friends- Use
`apply`

to apply a function to rows or columns of a matrix - Use
`sapply`

to apply a function to every element of a vector

```
#Make a 100x100 matrix of uniform [0,1] variables
X = matrix(runif(10000),nrow=100,ncol=100)
#Find the maximum of each column
column.maxs = apply(X, MARGIN = 2, FUN = max)
#Plot a histogram. Do you know the distribution?
hist(column.maxs,20)
```

- The
`tidyverse`

package defines a*lot*of new functions to push this idea to logical extremes

- Some jobs are just inherently iterative
- Use it when that’s what you need to do
- If you’re storing stuff as you go, create the variables which you’re storing into outside the loop

- A function can return one
*thing*. Often you want to return more than one value:- For several values of the same type, you can return a vector
- For several different sorts of values, you can return a list. You can even assign names!

- Returning a vector: Look at
`rpareto()`

:

```
rpareto.3 <- function(n,exponent,threshold) {
x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
return(x) #If n>1, this is a vector!
}
```

- Returning a list: look at the end of
`cv_bws_npreg()`

from lecture 4:

- Design from the top down
- Start from
*what*you want to do (not how) - Break that task into a few (2–6) smaller tasks
- Keep breaking tasks into smaller tasks until you can’t go further

- Start from
- Code from the bottom up
- Start with the smallest, simplest tasks
- Write a function for each one, and test it
- Then write functions that call
*those*functions (and test them) - Keep going until you have top-level functions for the whole task

**Use meaningful names**and**Comment your code**- Write functions to bundle related commands together
- Split big jobs into a bunch of related functions
- Practice debugging
- Avoid iteration, in favor of acting on whole objects

- Don’t use
`attach()`

- Makes it really hard to track down errors
- Can shorten your code using
`with()`

instead

- Try not to call your main data set
`data`

- vague
- the name of a function for loading data sets from packages
- often the name of arguments in model-fitting functions
- both of those make it hard to trace what’s going on

- No-argument functions can be useful for simulation

```
# Function to return the biggest correlation of 5 length-10 Gaussian vectors
# Takes no arguments, so it can be used in a simulation
random.correlation.max = function(){
#Draw a 10x5 matrix of Gaussians
X = matrix(rnorm(50),nrow=10,ncol=5)
#Compute the correlation matrix
cor.matrix = cor(X)
#Find the biggest entry, making sure to avoid the diagonal (which is all 1)
return(max(cor.matrix[upper.tri(cor.matrix)]))
}
```

- Draw 100 samples and plot them

- A very simple function that adds one to its argument

- A function that calls
`add.one()`

to do its addition

`## [1] 4`

- A terrible job rewriting
`add.one()`

:

- Now what?

`## [1] 5`