Last week: Functions

Part I

Simulation basics

Why simulate?

R gives us access to unparalleled simulation tools (compared to other languages). Why simulate? Welcome to the 21st century! Two big reasons: simulations are often much easier than the corresponding hand calculations, and they can often be made more realistic than hand calculations allow.

Sampling from a given vector

To sample from a given vector, use sample()

sample(x=letters, size=10) # Without replacement, the default
##  [1] "q" "e" "l" "p" "h" "x" "y" "f" "d" "m"
sample(x=c(0,1), size=10, replace=TRUE) # With replacement
##  [1] 0 1 1 1 0 0 1 0 0 0
sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE
##  [1]  4 10  8  7  5  2  1  9  6  3

Random number generation

To sample from a normal distribution, we have the utility functions:

rnorm(): to generate normal random variables
dnorm(): to evaluate the normal density function
pnorm(): to evaluate the normal cumulative distribution function (CDF)
qnorm(): to evaluate the normal quantile function (inverse CDF)

Replace "norm" with the name of another distribution and all the same functions apply, e.g., "t", "exp", "gamma", "chisq", "binom", "pois", etc.
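For instance (a small sketch; the values in the comments for the deterministic calls are approximate, and the random draws change on every run):

rnorm(n=3, mean=0, sd=1)       # Generate 3 standard normal random variables
dnorm(x=0)                     # Normal density at 0: 1/sqrt(2*pi), about 0.399
pnorm(q=1.96)                  # Normal CDF at 1.96: about 0.975
qnorm(p=0.975)                 # Normal quantile (inverse CDF): about 1.96
rbinom(n=3, size=10, prob=0.5) # Same pattern, now for Binomial(10, 0.5) draws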

Random number examples

Standard normal random variables (mean 0 and variance 1)

n = 100
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z)  # Check: sample mean is approximately 0
## [1] 0.01706529
var(z)   # Check: sample variance is approximately 1
## [1] 1.094709
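As an illustrative follow-up (the variable name z.big is just for this sketch), the approximations improve as we increase the sample size:

z.big = rnorm(100000) # Much larger sample
mean(z.big)           # Even closer to 0
var(z.big)            # Even closer to 1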

Estimated distribution function

To compute the empirical cumulative distribution function (ECDF), the standard estimator of the cumulative distribution function (CDF), use ecdf()

x = seq(-3,3,length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
## [1] "ecdf"     "stepfun"  "function"
ecdf.fun(0) # Evaluate the ECDF at 0: the fraction of the sample that is <= 0
## [1] 0.41
# We can plot it 
plot(x, ecdf.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2, 
       col=c("red","black"))

Interlude: Kolmogorov-Smirnov test

One of the most celebrated tests in statistics is due to Kolmogorov in 1933. The Kolmogorov-Smirnov (KS) statistic is: \[ \sqrt{\frac{n}{2}} \sup_{x} |F_n(x)-G_n(x)| \] Here \(F_n\) is the ECDF of \(X_1,\ldots,X_n \sim F\), and \(G_n\) is the ECDF of \(Y_1,\ldots,Y_n \sim G\). Under the null \(F=G\) (i.e., the two distributions are the same), as \(n \to \infty\), the KS statistic converges in distribution to the supremum of a Brownian bridge: \[ \sup_{t \in [0,1]} |B(t)| \]


Here \(B\) is a Brownian bridge: a Gaussian process with \(B(0)=B(1)=0\), mean \(\mathbb{E}(B(t))=0\) for all \(t\), and covariance function \(\mathrm{Cov}(B(s), B(t)) = s(1-t)\) for \(s \leq t\). We can simulate one realization of \(B\) on a grid:

n = 500
t = 1:n/n                           # Grid of time points in (0,1]
Sig = t %o% (1-t)                   # Outer product: entry (i,j) is t_i*(1-t_j)
Sig = pmin(Sig, t(Sig))             # Covariance matrix: min(s,t)*(1-max(s,t))
eig = eigen(Sig)
Sig.half = eig$vec %*% diag(sqrt(eig$val)) %*% t(eig$vec) # Matrix square root
B = Sig.half %*% rnorm(n)           # One realization of the Brownian bridge
plot(t, B, type="l")
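As a rough check of the convergence claim (just a sketch; nsim, ks.null, and bb.sup are names made up here), we can compare the simulated null distribution of the scaled KS statistic to simulated values of \(\sup_t |B(t)|\), reusing Sig.half from above:

nsim = 500
ks.null = replicate(nsim, sqrt(n/2) * ks.test(rnorm(n), rnorm(n))$statistic)
bb.sup = replicate(nsim, max(abs(Sig.half %*% rnorm(n)))) # Sup of simulated bridges
quantile(ks.null, probs=c(0.5, 0.9, 0.99))
quantile(bb.sup, probs=c(0.5, 0.9, 0.99))  # Should be roughly similar to the above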


Two remarkable facts about the KS test:

  1. It is distribution-free, meaning that the null distribution doesn’t depend on \(F,G\)!

  2. We can actually compute the null distribution and use this test, e.g., via ks.test():

ks.test(rnorm(n), rt(n, df=1)) # Normal versus t1
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  rnorm(n) and rt(n, df = 1)
## D = 0.134, p-value = 0.0002523
## alternative hypothesis: two-sided
ks.test(rnorm(n), rt(n, df=10)) # Normal versus t10
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  rnorm(n) and rt(n, df = 10)
## D = 0.056, p-value = 0.4131
## alternative hypothesis: two-sided
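To connect this back to the formula above (a sketch; x.sim, y.sim, and the other names are just for illustration), we can compute the supremum directly from the two ECDFs. Note that ks.test() reports the unscaled supremum \(\sup_x |F_n(x)-G_n(x)|\) as its statistic D:

set.seed(1)
x.sim = rnorm(n); y.sim = rt(n, df=1)        # Two samples of size n = 500
Fn = ecdf(x.sim); Gn = ecdf(y.sim)           # Their ECDFs
grid = sort(c(x.sim, y.sim))                 # The sup is attained at a data point
D.hand = max(abs(Fn(grid) - Gn(grid)))       # Unscaled sup, as reported by ks.test()
c(D.hand, unname(ks.test(x.sim, y.sim)$statistic)) # These two should agree
sqrt(n/2) * D.hand                           # The scaled KS statistic from the formula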

Estimated density function

To compute a histogram, a basic estimator of the density based on binning, use hist()

hist.obj = hist(z, breaks=30, plot=FALSE) 
class(hist.obj) # It's a list
## [1] "histogram"
hist.obj$breaks # These are the break points that were used
##  [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4
## [15] -0.2  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4
## [29]  2.6  2.8  3.0
hist.obj$density # These are the estimated density values (not probabilities) in each bin
##  [1] 0.05 0.00 0.00 0.00 0.15 0.15 0.15 0.05 0.15 0.20 0.15 0.10 0.30 0.25
## [15] 0.35 0.70 0.45 0.50 0.30 0.35 0.15 0.10 0.10 0.10 0.05 0.05 0.05 0.00
## [29] 0.00 0.05

# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2, 
       col=c("pink","black"))