- Function: formal encapsulation of a block of code; generally makes your code easier to understand, to work with, and to modify
- Functions are absolutely critical for writing (good) code for medium or large projects
- A function's structure consists of three main parts: inputs, body, and output
- R allows the function designer to specify default values for any of the inputs
- R doesn't allow the designer to return multiple outputs, but a function can return a list
- Side effects are things that happen as a result of a function call, but that arenâ€™t returned as an output
- Top-down design means breaking a big task into small parts, implementing each of these parts, and then putting them together
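The bullet points above can be sketched in a few lines. This is a hypothetical example (the function name and contents are made up for illustration), showing a default input, a list output, and a side effect:

```r
# Hypothetical example: a default input, a list output, and a side effect
summarize.vec = function(v, trim=0) {  # trim gets a default value of 0
  cat("Summarizing a vector of length", length(v), "\n")  # side effect: printing
  return(list(mean=mean(v, trim=trim), sd=sd(v)))  # two results bundled in a list
}

out = summarize.vec(c(1,2,3,4,5))
out$mean  # 3
```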

*Simulation basics*

R gives us access to simulation tools that few other languages can match. Why simulate? Welcome to the 21st century! Two reasons:

- Often, simulations can be easier than hand calculations
- Often, simulations can be made more realistic than hand calculations

To sample from a given vector, use `sample()`

`sample(x=letters, size=10) # Without replacement, the default`

`## [1] "q" "e" "l" "p" "h" "x" "y" "f" "d" "m"`

`sample(x=c(0,1), size=10, replace=TRUE) # With replacement`

`## [1] 0 1 1 1 0 0 1 0 0 0`

`sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE`

`## [1] 4 10 8 7 5 2 1 9 6 3`
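Not shown above: `sample()` also takes a `prob` argument, giving weighted sampling. A small sketch (the seed is chosen arbitrarily, for reproducibility):

```r
set.seed(1)  # arbitrary seed, for reproducibility
s = sample(x=c(0,1), size=10, replace=TRUE, prob=c(0.9,0.1))  # 0 is 9x as likely as 1
s
```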

To work with the normal distribution, we have the utility functions:

`rnorm()`

: generate normal random variables

`pnorm()`

: normal distribution function, \(\Phi(x)=P(Z \leq x)\)

`dnorm()`

: normal density function, \(\phi(x)=\Phi'(x)\)

`qnorm()`

: normal quantile function, \(q(y)=\Phi^{-1}(y)\), i.e., \(\Phi(q(y))=y\)

Replace "norm" with the name of another distribution, and all the same functions apply. E.g., "t", "exp", "gamma", "chisq", "binom", "pois", etc.
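A few deterministic sanity checks on this family of functions (these identities follow directly from the definitions above, so no randomness is needed):

```r
pnorm(0)           # 0.5: half the standard normal mass lies below 0
qnorm(0.5)         # 0: the quantile function inverts the CDF
pnorm(qnorm(0.9))  # 0.9, since Phi(q(y)) = y
pexp(1, rate=1)    # same pattern for the exponential: 1 - exp(-1)
```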

Standard normal random variables (mean 0 and variance 1)

```
n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z) # Check: sample mean is approximately 0
```

`## [1] 0.01706529`

`var(z) # Check: sample variance is approximately 1`

`## [1] 1.094709`
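A follow-up check, not in the original notes: with a much larger sample, the estimates get much closer to 0 and 1 (law of large numbers). The seed here is arbitrary, for reproducibility:

```r
set.seed(0)            # arbitrary seed, for reproducibility
z.big = rnorm(1000000) # a much bigger sample than n=100
mean(z.big)            # much closer to 0 than before
var(z.big)             # much closer to 1 than before
```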

To compute the empirical cumulative distribution function (ECDF), the standard estimator of the cumulative distribution function (CDF), use `ecdf()`

```
x = seq(-3,3,length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
```

`## [1] "ecdf" "stepfun" "function"`

`ecdf.fun(0)`

`## [1] 0.41`

```
# We can plot it
plot(x, ecdf.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
```
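Beyond the plot, we can measure the largest gap between the ECDF and the true CDF over our grid. This is a sketch that regenerates the data with an arbitrary seed so it runs standalone:

```r
set.seed(33)  # arbitrary seed, for reproducibility
z = rnorm(100)
x = seq(-3, 3, length=100)
ecdf.fun = ecdf(z)
max(abs(ecdf.fun(x) - pnorm(x)))  # small, and shrinks as n grows
```

This sup-distance between distribution functions is exactly the quantity that the KS statistic below is built on.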

One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The *Kolmogorov-Smirnov (KS) statistic* is: \[
\sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)|
\] Here \(F_n\) is the ECDF of \(X_1,\ldots,X_n \sim F\), and \(G_n\) is the ECDF of \(Y_1,\ldots,Y_n \sim G\). Under \(F=G\) (i.e., the two distributions are the same), as \(n \to \infty\), the KS statistic approaches the supremum of a Brownian bridge: \[
\sup_{t \in [0,1]} |B(t)|
\]

Here \(B\) is a Gaussian process with \(B(0)=B(1)=0\), mean \(\mathbb{E}(B(t))=0\) for all \(t\), and covariance function \(\mathrm{Cov}(B(s), B(t)) = s(1-t)\)

```
n = 500
t = 1:n/n               # grid of time points in (0,1]
Sig = t %o% (1-t)       # outer product: entry (i,j) is t[i]*(1-t[j])
Sig = pmin(Sig, t(Sig)) # keep min(s,t)*(1-max(s,t)), the bridge covariance
eig = eigen(Sig)        # eigendecomposition of the covariance matrix
Sig.half = eig$vec %*% diag(sqrt(eig$val)) %*% t(eig$vec) # matrix square root
B = Sig.half %*% rnorm(n) # gives iid normals the bridge's covariance
plot(t, B, type="l")
```
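An alternative construction, not in the notes above but a standard identity: if \(W\) is Brownian motion, then \(B(t) = W(t) - tW(1)\) has the Brownian bridge covariance \(s(1-t)\). This gives a cheap O(n) simulation that avoids the eigendecomposition:

```r
set.seed(1)  # arbitrary seed, for reproducibility
n = 500
t = 1:n/n
W = cumsum(rnorm(n, sd=sqrt(1/n)))  # random-walk approximation to Brownian motion
B = W - t*W[n]                      # subtracting t*W(1) pins the path to 0 at t=1
plot(t, B, type="l")
```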

Two remarkable facts about the KS test:

- It is *distribution-free*, meaning that the null distribution doesn't depend on \(F,G\)!
- We can actually compute the null distribution and use this test, e.g., via `ks.test()`:

`ks.test(rnorm(n), rt(n, df=1)) # Normal versus t1`

```
##
## Two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 1)
## D = 0.134, p-value = 0.0002523
## alternative hypothesis: two-sided
```

`ks.test(rnorm(n), rt(n, df=10)) # Normal versus t10`

```
##
## Two-sample Kolmogorov-Smirnov test
##
## data: rnorm(n) and rt(n, df = 10)
## D = 0.056, p-value = 0.4131
## alternative hypothesis: two-sided
```
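The output of `ks.test()` is itself an object we can inspect, so (as with any test object in R) the statistic and p-value can be grabbed programmatically. A sketch, rerunning the first comparison with an arbitrary seed:

```r
set.seed(2)     # arbitrary seed, for reproducibility
n = 500
res = ks.test(rnorm(n), rt(n, df=1))
res$statistic   # the test statistic D
res$p.value     # the p-value, ready for programmatic use
</imports>
```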

To compute a histogram, a basic estimator of the density based on binning, use `hist()`

```
hist.obj = hist(z, breaks=30, plot=FALSE)
class(hist.obj) # It's a list
```

`## [1] "histogram"`

`hist.obj$breaks # These are the break points that were used`

```
## [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4
## [15] -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4
## [29] 2.6 2.8 3.0
```

`hist.obj$density # These are the estimated probabilities`

```
## [1] 0.05 0.00 0.00 0.00 0.15 0.15 0.15 0.05 0.15 0.20 0.15 0.10 0.30 0.25
## [15] 0.35 0.70 0.45 0.50 0.30 0.35 0.15 0.10 0.10 0.10 0.05 0.05 0.05 0.00
## [29] 0.00 0.05
```
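A quick sanity check, not in the notes: the `density` values are scaled so that the bar areas (density times bin width) sum to exactly 1, as a density estimate should. Data is regenerated with an arbitrary seed so this runs standalone:

```r
set.seed(3)  # arbitrary seed, for reproducibility
z = rnorm(100)
hist.obj = hist(z, breaks=30, plot=FALSE)
sum(hist.obj$density * diff(hist.obj$breaks))  # 1, up to rounding error
```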

```
# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("pink","black"))
```