# Last week: Functions

• Function: formal encapsulation of a block of code; generally makes your code easier to understand, to work with, and to modify
• Functions are absolutely critical for writing (good) code for medium or large projects
• A function’s structure consists of three main parts: inputs, body, and output
• R allows the function designer to specify default values for any of the inputs
• R doesn’t allow the designer to return multiple outputs, but can return a list
• Side effects are things that happen as a result of a function call, but that aren’t returned as an output
• Top-down design means breaking a big task into small parts, implementing each of these parts, and then putting them together

# Part I

Simulation basics

# Why simulate?

R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:

• Often, simulations can be easier than hand calculations
• Often, simulations can be made more realistic than hand calculations

# Sampling from a given vector

To sample from a given vector, use sample()

sample(x=letters, size=10) # Without replacement, the default
##  [1] "q" "e" "l" "p" "h" "x" "y" "f" "d" "m"
sample(x=c(0,1), size=10, replace=TRUE) # With replacement
##  [1] 0 1 1 1 0 0 1 0 0 0
sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE
##  [1]  4 10  8  7  5  2  1  9  6  3

# Random number generation

To sample from a normal distribution, we have the utility functions:

• rnorm(): generate normal random variables
• pnorm(): normal distribution function, $$\Phi(x)=P(Z \leq x)$$
• dnorm(): normal density function, $$\phi(x)= \Phi'(x)$$
• qnorm(): normal quantile function, $$q(y)=\Phi^{-1}(y)$$, i.e., $$\Phi(q(y))=y$$

Replace “norm” with the name of another distribution, all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.

# Random number examples

Standard normal random variables (mean 0 and variance 1)

n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z)  # Check: sample mean is approximately 0
## [1] 0.01706529
var(z)   # Check: sample variance is approximately 1
## [1] 1.094709

# Estimated distribution function

To compute empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function (CDF)—use ecdf()

x = seq(-3,3,length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
## [1] "ecdf"     "stepfun"  "function"
ecdf.fun(0)
## [1] 0.41
# We can plot it
plot(x, ecdf.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))

# Interlude: Kolmogorov-Smirnov test

One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The Kolmogorov-Smirnoff (KS) statistic is: $\sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)|$ Here $$F_n$$ is the ECDF of $$X_1,\ldots,X_n \sim F$$, and $$G_n$$ is the ECDF of $$Y_1,\ldots,Y_n \sim G$$. Under $$F=G$$ (two distributions are the same), as $$n \to \infty$$, the KS statistic approaches the supremum of a Brownian bridge: $\sup_{t \in [0,1]} |B(t)|$

Here $$B$$ is a Gaussian process with $$B(0)=B(1)=0$$, mean $$\mathbb{E}(B(t))=0$$ for all $$t$$, and covariance function $$\mathrm{Cov}(B(s), B(t)) = s(1-t)$$

n = 500
t = 1:n/n
Sig = t %o% (1-t)
Sig = pmin(Sig, t(Sig))
eig = eigen(Sig)
Sig.half = eig$vec %*% diag(sqrt(eig$val)) %*% t(eig$vec) B = Sig.half %*% rnorm(n) plot(t, B, type="l") Two remarkable facts about the KS test: 1. It is distribution-free, meaning that the null distribution doesn’t depend on $$F,G$$! 2. We can actually compute the null distribution and use this test, e.g., via ks.test(): ks.test(rnorm(n), rt(n, df=1)) # Normal versus t1 ## ## Two-sample Kolmogorov-Smirnov test ## ## data: rnorm(n) and rt(n, df = 1) ## D = 0.134, p-value = 0.0002523 ## alternative hypothesis: two-sided ks.test(rnorm(n), rt(n, df=10)) # Normal versus t10 ## ## Two-sample Kolmogorov-Smirnov test ## ## data: rnorm(n) and rt(n, df = 10) ## D = 0.056, p-value = 0.4131 ## alternative hypothesis: two-sided # Estimated density function To compute histogram—a basic estimator of the density based on binning—use hist() hist.obj = hist(z, breaks=30, plot=FALSE) class(hist.obj) # It's a list ## [1] "histogram" hist.obj$breaks # These are the break points that were used
##  [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4
## [15] -0.2  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4
## [29]  2.6  2.8  3.0
hist.obj\$density # These are the estimated probabilities
##  [1] 0.05 0.00 0.00 0.00 0.15 0.15 0.15 0.05 0.15 0.20 0.15 0.10 0.30 0.25
## [15] 0.35 0.70 0.45 0.50 0.30 0.35 0.15 0.10 0.10 0.10 0.05 0.05 0.05 0.00
## [29] 0.00 0.05

# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("pink","black"))