# Last week: Functions

• Function: formal encapsulation of a block of code; generally makes your code easier to understand, to work with, and to modify
• Functions are absolutely critical for writing (good) code for medium or large projects
• A function’s structure consists of three main parts: inputs, body, and output
• R allows the function designer to specify default values for any of the inputs
• R doesn’t allow the designer to return multiple outputs, but can return a list
• Side effects are things that happen as a result of a function call, but that aren’t returned as an output
• Top-down design means breaking a big task into small parts, implementing each of these parts, and then putting them together

# Part I

Simulation basics

# Why simulate?

R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

• Often, simulations can be easier than hand calculations
• Often, simulations can be made more realistic than hand calculations

# Sampling from a given vector

To sample from a given vector, use sample()

sample(x=letters, size=10) # Without replacement, the default
##  [1] "n" "j" "x" "u" "w" "f" "r" "p" "b" "h"
sample(x=c(0,1), size=10, replace=TRUE) # With replacement
##  [1] 1 1 1 1 1 0 0 1 1 0
sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE
##  [1]  4 10  1  2  9  8  5  3  6  7

# Random number generation

To sample from a normal distribution, we have the utility functions:

• rnorm(): generate normal random variables
• pnorm(): normal distribution function, $$\Phi(x)=P(Z \leq x)$$
• dnorm(): normal density function, $$\phi(x)=\Phi'(x)$$
• qnorm(): normal quantile function, $$q(y)=\Phi^{-1}(y)$$, i.e., $$\Phi(q(y))=y$$

Replace “norm” with the name of another distribution, all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.

# Random number examples

Standard normal random variables (mean 0 and variance 1)

n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z)  # Check: sample mean is approximately 0
## [1] 0.07550971
var(z)   # Check: sample variance is approximately 1
## [1] 1.0815

# Estimated distribution function

To compute empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function (CDF)—use ecdf()

x = seq(-3, 3, length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
## [1] "ecdf"     "stepfun"  "function"
ecdf.fun(0)
## [1] 0.54
# We can plot it
plot(x, ecdf.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))

# Interlude: Kolmogorov-Smirnov test

One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The Kolmogorov-Smirnoff (KS) statistic is: $\sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)|$ Here $$F_n$$ is the ECDF of $$X_1,\ldots,X_n \sim F$$, and $$G_n$$ is the ECDF of $$Y_1,\ldots,Y_n \sim G$$. Under the null hypothesis $$F=G$$ (two distributions are the same), as $$n \to \infty$$, the KS statistic approaches the supremum of a Brownian bridge: $\sup_{t \in [0,1]} |B(t)|$

Here $$B$$ is a Gaussian process with $$B(0)=B(1)=0$$, mean $$\mathbb{E}(B(t))=0$$ for all $$t$$, and covariance function $$\mathrm{Cov}(B(s), B(t)) = s(1-t)$$

n = 500
t = 1:n/n
Sig = t %o% (1-t)
Sig = pmin(Sig, t(Sig))
eig = eigen(Sig)
Sig.half = eig$vec %*% diag(sqrt(eig$val)) %*% t(eig$vec) B = Sig.half %*% rnorm(n) plot(t, B, type="l") Two remarkable facts about the KS test: 1. It is distribution-free, meaning that the null distribution doesn’t depend on $$F,G$$! 2. We can actually compute the null distribution and use this test, e.g., via ks.test(): ks.test(rnorm(n), rt(n, df=1)) # Normal versus t1 ## ## Two-sample Kolmogorov-Smirnov test ## ## data: rnorm(n) and rt(n, df = 1) ## D = 0.142, p-value = 8.365e-05 ## alternative hypothesis: two-sided ks.test(rnorm(n), rt(n, df=10)) # Normal versus t10 ## ## Two-sample Kolmogorov-Smirnov test ## ## data: rnorm(n) and rt(n, df = 10) ## D = 0.06, p-value = 0.3291 ## alternative hypothesis: two-sided # Estimated density function To compute histogram—a basic estimator of the density based on binning—use hist() hist.obj = hist(z, breaks=30, plot=FALSE) class(hist.obj) # It's a list ## [1] "histogram" hist.obj$breaks # These are the break points that were used
##  [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4
## [15] -0.2  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4
## [29]  2.6  2.8
hist.obj\$density # These are the estimated probabilities
##  [1] 0.05 0.00 0.00 0.00 0.00 0.00 0.05 0.25 0.05 0.35 0.30 0.20 0.40 0.50
## [15] 0.55 0.35 0.10 0.45 0.15 0.35 0.10 0.05 0.30 0.20 0.05 0.05 0.10 0.00
## [29] 0.05

# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("pink","black"))

# Part II

Pseudorandomness and seeds

# Same function call, different results

Not surprisingly, we get different draws each time we call rnorm()

mean(rnorm(n))
## [1] 0.01911518
mean(rnorm(n))
## [1] -0.07090431
mean(rnorm(n))
## [1] -0.01171319
mean(rnorm(n))
## [1] -0.03441602

# Is it really random?

Random numbers generated in R (in any language) are not “truly” random; they are what we call pseudorandom

• These are numbers generated by computer algorithms that very closely mimick “truly” random numbers
• The study of such algorithms is an interesting research area in its own right!
• The default algorithm in R (and in nearly all software languages) is called the “Mersenne Twister”
• Type ?Random in your R console to read more about this (and to read how to change the algorithm used for pseudorandom number generation, which you should never really have to do, by the way)

# Setting the random seed

All pseudorandom number generators depend on what is called a seed value

• This puts the random number generator in a well-defined “state”, so that the numbers it generates, from then on, will be reproducible
• The seed is just an integer, and can be set with set.seed()
• The reason we set it: so that when someone else runs our simulation code, they can see the same—albeit, still random—results that we do

# Seed examples

# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414
set.seed(0); rnorm(5)
## [1]  1.2629543 -0.3262334  1.3297993  1.2724293  0.4146414

# Different seeds, different numbers
set.seed(1); rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
set.seed(2); rnorm(5)
## [1] -0.89691455  0.18484918  1.58784533 -1.13037567 -0.08025176
set.seed(3); rnorm(5)
## [1] -0.9619334 -0.2925257  0.2587882 -1.1521319  0.1957828

# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1]  1.2629543 -0.3262334  1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995

# Part III

Iteration and simulation

# Drug effect model

• Let’s start with a motivating example: suppose we had a model for the way a drug affected certain patients
• All patients will undergo chemotherapy. We believe those who aren’t given the drug experience a reduction in tumor size of percentage $X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1)$
• And those who were given the drug experience a reduction in tumor size of percentage $X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2)$ (Here $$\mathrm{Exp}$$ denotes the exponential distribution, and $$\mathrm{Unif}$$ the uniform distribution)

# What would you do?

What would you do if you had such a model, and your scientist collaborators asked you: how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?

• Answer used to be: get out your pen and paper, make some approximations
• Answer is now: simulate from the model, no approximations required!

# So, let’s simulate!

# Simulate, supposing 60 subjects in each group
set.seed(0)
n = 60
mu.drug = 2
mu.nodrug = runif(n, min=0, max=1)
x.drug = 100*rexp(n, rate=1/mu.drug)
x.nodrug = 100*rexp(n, rate=1/mu.nodrug)

# Find the range of all the measurements together, and define breaks
x.range = range(c(x.nodrug,x.drug))
breaks = seq(min(x.range),max(x.range),length=20)

# Produce hist of the non drug measurements, then drug measurements on top
hist(x.nodrug, breaks=breaks, probability=TRUE, xlim=x.range,
col="lightgray", xlab="Percentage reduction in tumor size",
main="Comparison of tumor reduction")

# Plot a histogram of the drug measurements, on top
legend("topright", legend=c("No drug","Drug"), lty=1, lwd=3, col=1:2)