Writing and Using Functions
========================================================
author: 36-350
date: 15 September 2014
transition: none
font-family: 'Garamond'
Agenda
===
- Defining functions: Tying related commands into bundles
- Interfaces: Controlling what the function can see and do
- Example: Parameter estimation code
Why Functions?
===
Data structures tie related values into one object
Functions tie related commands into one object
In both cases: easier to understand, easier to work with, easier to build into larger things
For example
===
```{r}
# "Robust" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x)
# Outputs: vector with x^2 for small entries, 2|x|-1 for large ones
psi.1 <- function(x) {
psi <- ifelse(x^2 > 1, 2*abs(x)-1, x^2)
return(psi)
}
```
Our functions get used just like the built-in ones:
```{r}
z <- c(-0.5,-5,0.9,9)
psi.1(z)
```
===
Go back to the declaration and look at the parts:
```{r}
# "Robust" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x)
# Outputs: vector with x^2 for small entries, |x| for large ones
psi.1 <- function(x) {
psi <- ifelse(x^2 > 1, 2*abs(x)-1, x^2)
return(psi)
}
```
**Interfaces**: the **inputs** or **arguments**; the **outputs** or **return value**
Calls other functions `ifelse()`, `abs()`, operators
`^` and `>`
could also call other functions we've written
`return()` says what the output is
alternately, return the last evaluation; I like explicit returns better
**Comments**: Not required by R, but a Very Good Idea
One-line description of purpose; listing of arguments; listing of outputs
What should be a function?
===
- Things you're going to re-run, especially if it will be re-run with changes
- Chunks of code you keep highlighting and hitting return on
- Chunks of code which are small parts of bigger analyses
- Chunks which are very similar to other chunks
will say more about design later
Named and default arguments
===
```{r}
# "Robust" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x), scale for crossover (c)
# Outputs: vector with x^2 for small entries, 2c|x|-c^2 for large ones
psi.2 <- function(x,c=1) {
psi <- ifelse(x^2 > c^2, 2*c*abs(x)-c^2, x^2)
return(psi)
}
```
```{r}
identical(psi.1(z), psi.2(z,c=1))
```
===
Default values get used if names are missing:
```{r}
identical(psi.2(z,c=1), psi.2(z))
```
Named arguments can go in any order when explicitly tagged:
```{r}
identical(psi.2(x=z,c=2), psi.2(c=2,x=z))
```
Checking Arguments
===
_Problem_: Odd behavior when arguments aren't as we expect
```{r}
psi.2(x=z,c=c(1,1,1,10))
psi.2(x=z,c=-1)
```
===
_Solution_: Put little sanity checks into the code
```{r}
# "Robust" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x), scale for crossover (c)
# Outputs: vector with x^2 for small entries, 2c|x|-c^2 for large ones
psi.3 <- function(x,c=1) {
# Scale should be a single positive number
stopifnot(length(c) == 1,c>0)
psi <- ifelse(x^2 > c^2, 2*c*abs(x)-c^2, x^2)
return(psi)
}
```
Arguments to `stopifnot()` are a series of expressions which should all
be TRUE; execution halts, with error message, at _first_ FALSE
(try it!)
What the function can see and do
===
- Each function has its own environment
- Names here over-ride names in the global environment
- Internal environment starts with the named arguments
- Assignments inside the function only change the internal environment
There _are_ ways around this, but they are difficult and best avoided; see Chambers, ch. 5, if you must
- Names undefined in the function are looked for in the environment the function gets called from
_not_ the environment of definition
Internal environment examples
===
```{r}
x <- 7
y <- c("A","C","G","T","U")
adder <- function(y) { x<- x+y; return(x) }
adder(1)
x
y
```
Internal environment examples cont'd.
===
```{r}
circle.area <- function(r) { return(pi*r^2) }
circle.area(c(1,2,3))
truepi <- pi
pi <- 3 # Valid in 1800s Indiana, or drowned R'lyeh
circle.area(c(1,2,3))
pi <- truepi # Restore sanity
circle.area(c(1,2,3))
```
Respect the Interfaces!
===
Interfaces mark out a controlled inner environment for our code
Interact with the rest of the system only at the interface
Advice: arguments explicitly give the function all the information
Reduces risk of confusion and error
Exception: true universals like $\pi$
Likewise, output should only be through the return value
More about breaking up tasks and about environments later
Further reading: Herbert Simon, _The Sciences of the Artificial_
Example: Fitting a Model
===
Fact: bigger cities tend to produce more economically per capita
A proposed statistical model (Geoffrey West et al.):
\[
Y = y_0 N^{a} + \mathrm{noise}
\]
where $Y$ is the per-capita "gross metropolitan product" of a city, $N$ is
its population, and $y_0$ and $a$ are parameters
Evidence
===
```{r}
gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
plot(pcgmp~pop, data=gmp, log="x", xlab="Population",
ylab="Per-Capita Economic Output ($/person-year)",
main="US Metropolitan Areas, 2006")
curve(6611*x^(1/8),add=TRUE,col="blue")
```
===
Want to fit the model
\[
Y = y_0 N^{a} + \mathrm{noise}
\]
Take $y_0=6611$ for today
Approximate the derivative of error w.r.t $a$ and move against it
\[
\begin{eqnarray*}
MSE(a) & \equiv & \frac{1}{n}\sum_{i=1}^{n}{(Y_i - y_0 N_i^a)^2}\\
MSE^{\prime}(a) & \approx & \frac{MSE(a+h) - MSE(a)}{h}\\
a_{t+1} - a_{t} & \propto & -MSE^{\prime}(a)
\end{eqnarray*}
\]
===
An actual first attempt at code:
```{r}
maximum.iterations <- 100
deriv.step <- 1/1000
step.scale <- 1e-12
stopping.deriv <- 1/100
iteration <- 0
deriv <- Inf
a <- 0.15
while ((iteration < maximum.iterations) && (deriv > stopping.deriv)) {
iteration <- iteration + 1
mse.1 <- mean((gmp$pcgmp - 6611*gmp$pop^a)^2)
mse.2 <- mean((gmp$pcgmp - 6611*gmp$pop^(a+deriv.step))^2)
deriv <- (mse.2 - mse.1)/deriv.step
a <- a - step.scale*deriv
}
list(a=a,iterations=iteration,converged=(iteration < maximum.iterations))
```
What's wrong with this?
===
- Not _encapsulated_: Re-run by cutting and pasting code --- but how
much of it? Also, hard to make part of something larger
- _Inflexible_: To change initial guess at $a$, have to edit, cut,
paste, and re-run
- _Error-prone_: To change the data set, have to edit, cut, paste,
re-run, and hope that all the edits are consistent
- _Hard to fix_: should stop when _absolute value_ of derivative is
small, but this stops when large and negative. Imagine having five copies of
this and needing to fix same bug on each.
Will turn this into a function and then improve it
===
First attempt, with logic fix:
```{r}
estimate.scaling.exponent.1 <- function(a) {
maximum.iterations <- 100
deriv.step <- 1/1000
step.scale <- 1e-12
stopping.deriv <- 1/100
iteration <- 0
deriv <- Inf
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
mse.1 <- mean((gmp$pcgmp - 6611*gmp$pop^a)^2)
mse.2 <- mean((gmp$pcgmp - 6611*gmp$pop^(a+deriv.step))^2)
deriv <- (mse.2 - mse.1)/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
```
===
_Problem_: All those magic numbers!
_Solution_: Make them defaults
```{r}
estimate.scaling.exponent.2 <- function(a, y0=6611,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
iteration <- 0
deriv <- Inf
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
mse.1 <- mean((gmp$pcgmp - y0*gmp$pop^a)^2)
mse.2 <- mean((gmp$pcgmp - y0*gmp$pop^(a+deriv.step))^2)
deriv <- (mse.2 - mse.1)/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
```
===
_Problem:_ Why type out the same calculation of the MSE twice?
_Solution:_ Declare a function
```{r}
estimate.scaling.exponent.3 <- function(a, y0=6611,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
iteration <- 0
deriv <- Inf
mse <- function(a) { mean((gmp$pcgmp - y0*gmp$pop^a)^2) }
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
deriv <- (mse(a+deriv.step) - mse(a))/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
```
`mse()` declared inside the function, so it can see `y0`, but it's not added to the global environment
===
_Problem:_ Locked in to using specific columns of `gmp`; shouldn't have to re-write just to compare two data sets
_Solution:_ More arguments, with defaults
```{r}
estimate.scaling.exponent.4 <- function(a, y0=6611,
response=gmp$pcgmp, predictor = gmp$pop,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
iteration <- 0
deriv <- Inf
mse <- function(a) { mean((response - y0*predictor^a)^2) }
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
deriv <- (mse(a+deriv.step) - mse(a))/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
```
===
Respecting the interfaces: We could turn the `while()` loop into
a `for()` loop, and nothing outside the function would care
```{r}
estimate.scaling.exponent.5 <- function(a, y0=6611,
response=gmp$pcgmp, predictor = gmp$pop,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
mse <- function(a) { mean((response - y0*predictor^a)^2) }
for (iteration in 1:maximum.iterations) {
deriv <- (mse(a+deriv.step) - mse(a))/deriv.step
a <- a - step.scale*deriv
if (abs(deriv) <= stopping.deriv) { break() }
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
```
What have we done?
===
The final code is shorter, clearer, more flexible, and more re-usable
_Exercise:_ Run the code with the default values to get
an estimate of $a$; plot the curve along with the data points
_Exercise:_ Randomly remove one data point --- how much does the estimate change?
_Exercise:_ Run the code from multiple starting points --- how different
are the estimates of $a$?
Summary
===
- **Functions** bundle related commands together into objects: easier to
re-run, easier to re-use, easier to combine, easier to modify, less risk of error, easier to think about
- **Interfaces** control what the function can see (arguments,
environment) and change (its internals, its return value)
- **Calling** functions we define works just like calling built-in
functions: named arguments, defaults
Next time: working with many functions