- Function: formal encapsulation of a block of code; generally makes your code easier to understand, to work with, and to modify
- Functions are absolutely critical for writing (good) code for medium or large projects
- A function’s structure consists of three main parts: inputs, body, and output
- R allows the function designer to specify default values for any of the inputs
- R doesn’t allow the designer to return multiple outputs, but can return a list
- Side effects are things that happen as a result of a function call, but that aren’t returned as an output
- Top-down design means breaking a big task into small parts, implementing each of these parts, and then putting them together

*Simulation basics*

R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

- Often, simulations can be easier than hand calculations
- Often, simulations can be made more realistic than hand calculations

To sample from a given vector, use `sample()`

`sample(x=letters, size=10) # Without replacement, the default`

`## [1] "n" "j" "x" "u" "w" "f" "r" "p" "b" "h"`

`sample(x=c(0,1), size=10, replace=TRUE) # With replacement`

`## [1] 1 1 1 1 1 0 0 1 1 0`

`sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE`

`## [1] 4 10 1 2 9 8 5 3 6 7`

To sample from a normal distribution, we have the utility functions:

`rnorm()`

: generate normal random variables`pnorm()`

: normal distribution function, \(\Phi(x)=P(Z \leq x)\)`dnorm()`

: normal density function, \(\phi(x)=\Phi'(x)\)`qnorm()`

: normal quantile function, \(q(y)=\Phi^{-1}(y)\), i.e., \(\Phi(q(y))=y\)

Replace “norm” with the name of another distribution, all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.

Standard normal random variables (mean 0 and variance 1)

```
n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z) # Check: sample mean is approximately 0
```

`## [1] 0.07550971`

`var(z) # Check: sample variance is approximately 1`

`## [1] 1.0815`

To compute empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function (CDF)—use `ecdf()`

```
x = seq(-3, 3, length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
```

`## [1] "ecdf" "stepfun" "function"`

`ecdf.fun(0)`

`## [1] 0.54`

```
# We can plot it
plot(x, ecdf.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
```

One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The *Kolmogorov-Smirnoff (KS) statistic* is: \[
\sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)|
\] Here \(F_n\) is the ECDF of \(X_1,\ldots,X_n \sim F\), and \(G_n\) is the ECDF of \(Y_1,\ldots,Y_n \sim G\). Under the null hypothesis \(F=G\) (two distributions are the same), as \(n \to \infty\), the KS statistic approaches the supremum of a Brownian bridge: \[
\sup_{t \in [0,1]} |B(t)|
\]

Here \(B\) is a Gaussian process with \(B(0)=B(1)=0\), mean \(\mathbb{E}(B(t))=0\) for all \(t\), and covariance function \(\mathrm{Cov}(B(s), B(t)) = s(1-t)\)

```
n = 500
t = 1:n/n
Sig = t %o% (1-t)
Sig = pmin(Sig, t(Sig))
eig = eigen(Sig)
Sig.half = eig$vec %*% diag(sqrt(eig$val)) %*% t(eig$vec)
B = Sig.half %*% rnorm(n)
plot(t, B, type="l")
```

Two remarkable facts about the KS test:

It is

*distribution-free*, meaning that the null distribution doesn’t depend on \(F,G\)!We can actually compute the null distribution and use this test, e.g., via

`ks.test()`

:

`ks.test(rnorm(n), rt(n, df=1)) # Normal versus t1`

```
##
## Two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 1)
## D = 0.142, p-value = 8.365e-05
## alternative hypothesis: two-sided
```

`ks.test(rnorm(n), rt(n, df=10)) # Normal versus t10`

```
##
## Two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 10)
## D = 0.06, p-value = 0.3291
## alternative hypothesis: two-sided
```

To compute histogram—a basic estimator of the density based on binning—use `hist()`

```
hist.obj = hist(z, breaks=30, plot=FALSE)
class(hist.obj) # It's a list
```

`## [1] "histogram"`

`hist.obj$breaks # These are the break points that were used`

```
## [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4
## [15] -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4
## [29] 2.6 2.8
```

`hist.obj$density # These are the estimated probabilities`

```
## [1] 0.05 0.00 0.00 0.00 0.00 0.00 0.05 0.25 0.05 0.35 0.30 0.20 0.40 0.50
## [15] 0.55 0.35 0.10 0.45 0.15 0.35 0.10 0.05 0.30 0.20 0.05 0.05 0.10 0.00
## [29] 0.05
```

```
# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("pink","black"))
```

*Pseudorandomness and seeds*

Not surprisingly, we get different draws each time we call `rnorm()`

`mean(rnorm(n))`

`## [1] 0.01911518`

`mean(rnorm(n))`

`## [1] -0.07090431`

`mean(rnorm(n))`

`## [1] -0.01171319`

`mean(rnorm(n))`

`## [1] -0.03441602`

Random numbers generated in R (in any language) are not “truly” random; they are what we call **pseudorandom**

- These are numbers generated by computer algorithms that very closely mimick “truly” random numbers
- The study of such algorithms is an interesting research area in its own right!
- The default algorithm in R (and in nearly all software languages) is called the “Mersenne Twister”
- Type
`?Random`

in your R console to read more about this (and to read how to change the algorithm used for pseudorandom number generation, which you should never really have to do, by the way)

All pseudorandom number generators depend on what is called a **seed** value

- This puts the random number generator in a well-defined “state”, so that the numbers it generates, from then on, will be reproducible
- The seed is just an integer, and can be set with
`set.seed()`

- The reason we set it: so that when someone else runs our simulation code, they can see the same—albeit, still random—results that we do

```
# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
```

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

`set.seed(0); rnorm(5)`

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

`set.seed(0); rnorm(5)`

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

```
# Different seeds, different numbers
set.seed(1); rnorm(5)
```

`## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078`

`set.seed(2); rnorm(5)`

`## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176`

`set.seed(3); rnorm(5)`

`## [1] -0.9619334 -0.2925257 0.2587882 -1.1521319 0.1957828`

```
# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
```

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

`set.seed(0); rnorm(3); rnorm(2); rnorm(1)`

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

`set.seed(0); rnorm(3); rnorm(2); rnorm(1)`

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

*Iteration and simulation*

- Let’s start with a motivating example: suppose we had a model for the way a drug affected certain patients
- All patients will undergo chemotherapy. We believe those who aren’t given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1) \]
- And those who were given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2) \] (Here \(\mathrm{Exp}\) denotes the exponential distribution, and \(\mathrm{Unif}\) the uniform distribution)

What would you do if you had such a model, and your scientist collaborators asked you: *how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?*

- Answer used to be: get out your pen and paper, make some approximations
- Answer is now: simulate from the model, no approximations required!

```
# Simulate, supposing 60 subjects in each group
set.seed(0)
n = 60
mu.drug = 2
mu.nodrug = runif(n, min=0, max=1)
x.drug = 100*rexp(n, rate=1/mu.drug)
x.nodrug = 100*rexp(n, rate=1/mu.nodrug)
```

```
# Find the range of all the measurements together, and define breaks
x.range = range(c(x.nodrug,x.drug))
breaks = seq(min(x.range),max(x.range),length=20)
# Produce hist of the non drug measurements, then drug measurements on top
hist(x.nodrug, breaks=breaks, probability=TRUE, xlim=x.range,
col="lightgray", xlab="Percentage reduction in tumor size",
main="Comparison of tumor reduction")
# Plot a histogram of the drug measurements, on top
hist(x.drug, breaks=breaks, probability=TRUE, col=rgb(1,0,0,0.2), add=TRUE)
# Draw estimated densities on top, for each dist
lines(density(x.nodrug), lwd=3, col=1)
lines(density(x.drug), lwd=3, col=2)
legend("topright", legend=c("No drug","Drug"), lty=1, lwd=3, col=1:2)
```