Statistical Computing, 36-350

Monday October 8, 2018

- Function: formal encapsulation of a block of code; generally makes your code easier to understand, to work with, and to modify
- Functions are absolutely critical for writing (good) code for medium or large projects
- A functionâ€™s structure consists of three main parts: inputs, body, and output
- R allows the function designer to specify default values for any of the inputs
- R doesnâ€™t allow the designer to return multiple outputs, but can return a list
- Side effects are things that happen as a result of a function call, but that arenâ€™t returned as an output
- Top-down design means breaking a big task into small parts, implementing each of these parts, and then putting them together

*Simulation basics*

R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

- Often, simulations can be easier than hand calculations
- Often, simulations can be made more realistic than hand calculations

To sample from a given vector, use `sample()`

`sample(x=letters, size=10) # Without replacement, the default`

`## [1] "k" "r" "n" "f" "d" "s" "w" "c" "x" "b"`

`sample(x=c(0,1), size=10, replace=TRUE) # With replacement`

`## [1] 0 0 1 1 0 0 0 1 1 1`

`sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE`

`## [1] 9 6 7 1 2 10 3 5 8 4`

To sample from a normal distribution, we have the utility functions:

`rnorm()`

: generate normal random variables`pnorm()`

: normal distribution function, \(\Phi(x)=P(Z \leq x)\)`dnorm()`

: normal density function, \(\phi(x)= \Phi'(x)\)`qnorm()`

: normal quantile function, \(q(y)=\Phi^{-1}(y)\), i.e., \(\Phi(q(y))=y\)

Replace â€śnormâ€ť with the name of another distribution, all the same functions apply. E.g., â€śtâ€ť, â€śexpâ€ť, â€śgammaâ€ť, â€śchisqâ€ť, â€śbinomâ€ť, â€śpoisâ€ť, etc.

Standard normal random variables (mean 0 and variance 1)

```
n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z) # Check: sample mean is approximately 0
```

`## [1] 0.06769794`

`var(z) # Check: sample variance is approximately 1`

`## [1] 0.9970664`

To compute empirical cumulative distribution function (ECDF)â€”the standard estimator of the cumulative distribution function (CDF)â€”use `ecdf()`

```
x = seq(-3,3,length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
```

`## [1] "ecdf" "stepfun" "function"`

`ecdf.fun(0)`

`## [1] 0.43`

```
# We can plot it
plot(x, ecdf.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
```

One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The *Kolmogorov-Smirnoff (KS) statistic* is: \[
\sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)|
\] Here \(F_n\) is the ECDF of \(X_1,\ldots,X_n \sim F\), and \(G_n\) is the ECDF of \(Y_1,\ldots,Y_n \sim G\). Under \(F=G\) (two distributions are the same), as \(n \to \infty\), the KS statistic approaches the supremum of a Brownian bridge: \[
\sup_{t \in [0,1]} |B(t)|
\]

Here \(B\) is a Gaussian process with \(B(0)=B(1)=0\), mean \(\mathbb{E}(B(t))=0\) for all \(t\), and covariance function \(\mathrm{Cov}(B(s), B(t)) = s(1-t)\)

```
n = 500
t = 1:n/n
Sig = t %o% (1-t)
Sig = pmin(Sig, t(Sig))
eig = eigen(Sig)
Sig.half = eig$vec %*% diag(sqrt(eig$val)) %*% t(eig$vec)
B = Sig.half %*% rnorm(n)
plot(t, B, type="l")
```

Two remarkable facts about the KS test:

It is

*distribution-free*, meaning that the null distribution doesnâ€™t depend on \(F,G\)!We can actually compute the null distribution and use this test, e.g., via

`ks.test()`

:

`ks.test(rnorm(n), rt(n, df=1)) # Normal versus t1`

```
##
## Two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 1)
## D = 0.14, p-value = 0.0001109
## alternative hypothesis: two-sided
```

`ks.test(rnorm(n), rt(n, df=10)) # Normal versus t10`

```
##
## Two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 10)
## D = 0.058, p-value = 0.3696
## alternative hypothesis: two-sided
```

To compute histogramâ€”a basic estimator of the density based on binningâ€”use `hist()`

```
hist.obj = hist(z, breaks=30, plot=FALSE)
class(hist.obj) # It's a list
```

`## [1] "histogram"`

`hist.obj$breaks # These are the break points that were used`

```
## [1] -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2
## [15] 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
## [29] 3.2
```

`hist.obj$density # These are the estimated probabilities`

```
## [1] 0.05 0.05 0.05 0.05 0.20 0.25 0.25 0.05 0.30 0.25 0.30 0.35 0.40 0.60
## [15] 0.25 0.50 0.35 0.25 0.10 0.20 0.00 0.05 0.10 0.00 0.00 0.00 0.00 0.05
```

```
# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("pink","black"))
```

*Pseudorandomness and seeds*

Not surprisingly, we get different draws each time we call `rnorm()`

`mean(rnorm(n))`

`## [1] -0.08407018`

`mean(rnorm(n))`

`## [1] 0.06926718`

`mean(rnorm(n))`

`## [1] 0.02865116`

`mean(rnorm(n))`

`## [1] 0.03032017`

Random numbers generated in R (in any language) are not â€śtrulyâ€ť random; they are what we call **pseudorandom**

- These are numbers generated by computer algorithms that very closely mimick â€śtrulyâ€ť random numbers
- The study of such algorithms is an interesting research area in its own right!
- The default algorithm in R (and in nearly all software languages) is called the â€śMersenne Twisterâ€ť
- Type
`?Random`

in your R console to read more about this (and to read how to change the algorithm used for pseudorandom number generation, which you should never really have to do, by the way)

All pseudorandom number generators depend on what is called a **seed** value

- This puts the random number generator in a well-defined â€śstateâ€ť, so that the numbers it generates, from then on, will be reproducible
- The seed is just an integer, and can be set with
`set.seed()`

- The reason we set it: so that when someone else runs our simulation code, they can see the sameâ€”albeit, still randomâ€”results that we do

```
# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
```

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

`set.seed(0); rnorm(5)`

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

`set.seed(0); rnorm(5)`

`## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414`

```
# Different seeds, different numbers
set.seed(1); rnorm(5)
```

`## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078`

`set.seed(2); rnorm(5)`

`## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176`

`set.seed(3); rnorm(5)`

`## [1] -0.9619334 -0.2925257 0.2587882 -1.1521319 0.1957828`

```
# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
```

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

`set.seed(0); rnorm(3); rnorm(2); rnorm(1)`

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

`set.seed(0); rnorm(3); rnorm(2); rnorm(1)`

`## [1] 1.2629543 -0.3262334 1.3297993`

`## [1] 1.2724293 0.4146414`

`## [1] -1.53995`

*Iteration and simulation*

- Letâ€™s start with a motivating example: suppose we had a model for the way a drug affected certain patients
- All patients will undergo chemotherapy. We believe those who arenâ€™t given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1) \]
- And those who were given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2) \] (Here \(\mathrm{Exp}\) denotes the exponential distribution, and \(\mathrm{Unif}\) the uniform distribution)

What would you do if you had such a model, and your scientist collaborators asked you: *how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?*

- Answer used to be: get out your pen and paper, make some approximations
- Answer is now: simulate from the model, no approximations required!

```
# Simulate, supposing 60 subjects in each group
set.seed(0)
n = 60
mu.drug = 2
mu.nodrug = runif(n, min=0, max=1)
x.drug = 100*rexp(n, rate=1/mu.drug)
x.nodrug = 100*rexp(n, rate=1/mu.nodrug)
```

```
# Find the range of all the measurements together, and define breaks
x.range = range(c(x.nodrug,x.drug))
breaks = seq(min(x.range),max(x.range),length=20)
# Produce hist of the non drug measurements, then drug measurements on top
hist(x.nodrug, breaks=breaks, probability=TRUE, xlim=x.range,
col="lightgray", xlab="Percentage reduction in tumor size",
main="Comparison of tumor reduction")
# Plot a histogram of the drug measurements, on top
hist(x.drug, breaks=breaks, probability=TRUE, col=rgb(1,0,0,0.2), add=TRUE)
# Draw estimated densities on top, for each dist
lines(density(x.nodrug), lwd=3, col=1)
lines(density(x.drug), lwd=3, col=2)
legend("topright", legend=c("No drug","Drug"), lty=1, lwd=3, col=1:2)
```

- One single simulation is not always trustworthy (depends on the situation, of course)
- In general, simulations should be repeated and aggregate results reportedâ€”requires iteration!
- To make random number draws reproducible, we must set the seed with
`set.seed()`

- More than this, to make simulation results reproducible, we need to follow
**good programming practices** - Gold standard: any time you show a simulation result (a figure, a table, etc.), you have code that can be run (by anyone) to produce exactly the same result

- Writing a
**function**to complete a single run of your simulation is often very helpful - This allows the simulation itself to be intricate (e.g., intricate steps, several simulation parameters), but makes running the simulation simple
- Then you can use
**iteration**to run your simulation over and over again - Good design practice: write another function for this last part (running your simulation many times)

Consider the code below for a generic simulation. Think about how you would frame this for the drug effect example, which youâ€™ll revisit in lab

```
# Function to do one simulation run
one.sim = function(param1, param2=value2, param3=value3) {
# Possibly intricate simulation code goes here
}
# Function to do repeated simulation runs
rep.sim = function(nreps, param1, param2=value2, param3=value3, seed=NULL) {
# Set the seed, if we need to
if(!is.null(seed)) set.seed(seed)
# Run the simulation over and over
sim.objs = vector(length=nreps, mode="list")
for (r in 1:nreps) {
sim.objs[r] = one.sim(param1, param2, param3)
}
# Aggregate the results somehow, and then return something
}
```

Sometimes simulations take a long time to run, and we want to save intermediate or final output, for quick reference later

There two different ways of saving things from R (there are more than two, but here are two useful ones):

`saveRDS()`

: allows us to save single R objects (like a vector, matrix, list, etc.), in (say) .rds format. E.g.,`saveRDS(my.mat, file="my.matrix.rds")`

`save()`

: allows us to save any number of R objects in (say) .rdata format. E.g.,`save(mat.x, mat.y, list.z, file="my.objects.rdata")`

Note: there is a big difference between how these two treat variable names

Corresponding to the two different ways of saving, we have two ways of loading things into R:

`readRDS()`

: allows us to load an object that has been saved by`savedRDS()`

, and**assign a new variable name**. E.g.,`my.new.mat = readRDS("my.matrix.rds")`

`load()`

: allows us to load all objects that have been saved through`save()`

,**according to their original variables names**. E.g.,`load("my.objects.rdata")`

- Running simulations is an integral part of being a statistician in the 21st century
- R provides us with a utility functions for simulations from a wide variety of distributions
- To make your simulation results reproducible, you must set the seed, using
`set.seed()`

- There is a natural connection between iteration, functions, and simulations
- Saving and loading results can be done in two formats: rds and rdata formats