36-350
29 October 2014
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\ER}{f} \newcommand{\TrueR}{f_0} \newcommand{\ERM}{\hat{\theta}_n} \newcommand{\EH}{\widehat{\mathbf{H}}_n} \newcommand{\tr}[1]{\mathrm{tr}\left( #1 \right)} \]
Optional reading: Bottou and Bousquet, “The Tradeoffs of Large Scale Learning”
Typical statistical objective function, mean-squared error: \[ f(\theta) = \frac{1}{n}\sum_{i=1}^{n}{{\left( y_i - m(x_i,\theta)\right)}^2} \]
Getting a value of \( f \) is \( O(n) \), \( \nabla f \) is \( O(np) \), \( \mathbf{H} \) is \( O(np^2) \)
Not bad when \( n=100 \) or even \( n=10^4 \), but when \( n={10}^9 \) or \( n={10}^{12} \), even a single evaluation is too expensive; we can't afford to find out which way to move
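A quick sketch of what \( O(n) \) means here (the linear model \( m(x,\theta)=\theta x \) and the timing comparison are just for illustration): every evaluation of \( f \) has to touch all \( n \) points.

# Illustration (assumed model m(x, theta) = theta*x): one evaluation of the MSE
# touches every data point, so its cost grows linearly with n
mse <- function(theta, df) { mean((df$y - theta*df$x)^2) }
df.small <- data.frame(x=runif(1e5), y=runif(1e5))
df.big   <- data.frame(x=runif(1e7), y=runif(1e7))
system.time(mse(2, df.small))
system.time(mse(2, df.big))   # ~100 times the data, ~100 times the work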
Pick one data point \( I \) at random (uniform on \( 1:n \))
Loss there, \( {\left( y_I - m(x_I,\theta)\right)}^2 \), is random, but
\[ \Expect{{\left( y_I - m(x_I,\theta)\right)}^2} = f(\theta) \]
\( \therefore \) Don't optimize with all the data, optimize with random samples
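A quick Monte Carlo check of that expectation (a sketch; the linear model \( m(x,\theta)=\theta x \) and the choice \( \theta = 0.5 \) are just for illustration):

# The loss at one uniformly chosen point is an unbiased estimate of f(theta):
# averaging it over many draws recovers the full-data MSE, up to Monte Carlo error
set.seed(42)
df <- data.frame(x=runif(1000))
df$y <- df$x + rnorm(1000, 0, 0.1)
f.full <- mean((df$y - 0.5*df$x)^2)               # f(theta) at theta = 0.5, all data
I <- sample(1:nrow(df), size=1e5, replace=TRUE)   # many one-point samples
f.sampled <- mean((df$y[I] - 0.5*df$x[I])^2)      # average single-point loss
c(f.full, f.sampled)                              # these should agree closely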
Draw lots of one-point samples and let their noise cancel out: at step \( t \), pick a random point \( I_t \) and update \[ \theta_{t+1} = \theta_t - \frac{\eta}{t}\, \nabla_\theta {\left( y_{I_t} - m(x_{I_t},\theta_t)\right)}^2 \] for some small base step size \( \eta \)
Shrinking the step size like \( 1/t \) ensures that the noise in the successive steps dies down
(Variants: put points in some random order, only check progress after going over each point once, adjust \( 1/t \) rate, average a couple of random data points (“mini-batch”), etc.)
stoch.grad.descent <- function(f, theta, df, max.iter=1e6, rate=1e-6) {
  for (t in 1:max.iter) {
    g <- stoch.grad(f, theta, df)   # noisy gradient from one random data point
    theta <- theta - (rate/t)*g     # shrinking step size, as above
  }
  return(theta)
}
stoch.grad <- function(f, theta, df) {
  stopifnot(require(numDeriv))       # numDeriv::grad() for numerical gradients
  i <- sample(1:nrow(df), size=1)    # pick one data point uniformly at random
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  stoch.grad <- grad(noisy.f, theta) # gradient of the loss at that one point
  return(stoch.grad)
}
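One of the variants mentioned above, mini-batching, only changes the sampling line; a sketch (the name stoch.grad.batch and the batch.size argument are mine, assuming f averages its loss over whatever rows it is given, as the MSE does):

stoch.grad.batch <- function(f, theta, df, batch.size=10) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=batch.size)   # a small random subset instead of one point
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  return(grad(noisy.f, theta))               # averaging over the batch lowers the variance
}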
Stochastic Newton's method, a.k.a. 2nd-order stochastic gradient descent
+ all the Newton-ish tricks to avoid having to recompute the Hessian
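A minimal sketch of the idea, reusing the one-point sampling above (numDeriv's hessian() supplies the noisy Hessian; the damping or regularization one would want in practice is omitted):

# One stochastic Newton step: noisy gradient and noisy Hessian from a single random point
stoch.newton.step <- function(f, theta, df) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=1)
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  g <- grad(noisy.f, theta)        # noisy gradient, as before
  H <- hessian(noisy.f, theta)     # noisy Hessian: O(p^2) work, but no factor of n
  return(theta - solve(H, g))      # Newton step; in practice H needs damping to be safe
}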
Pros:
Each update needs only one (or a few) data points, so the cost per iteration does not grow with \( n \)
No need to hold the whole data set in memory at once
Cons:
The gradient estimates are noisy, so individual steps can move in the wrong direction
The shrinking step size means many iterations are needed for high precision
Often low computational cost to get within statistical error of the optimum
We're minimizing \( f \) and aiming at \( \hat{\theta} \)
\( f \) is a function of the data, which are full of useless details
We hope there's some true \( f_0 \), with minimum \( \theta_0 \)
but we know \( f \neq f_0 \)
Past some point, getting a better \( \hat{\theta} \) isn't helping us find \( \theta_0 \)
(why push optimization to \( \pm {10}^{-6} \) if \( f \) only matches \( f_0 \) to \( \pm 1 \)?)
# True risk f0(b) = E[(Y - bX)^2] = 0.1^2 + (1/3)*(b-1)^2
# when X ~ Unif(0,1) and Y = X + N(0, 0.1^2)
f0 <- function(b) { 0.1^2 + (1/3)*(b-1)^2 }
# Empirical risk: in-sample MSE as a function of the slope b
f <- Vectorize(FUN=function(b, df) { mean((df$y - b*df$x)^2) }, vectorize.args="b")
# Simulate n points from the true model
simulate_df <- function(n) {
  x <- runif(n)
  y <- x + rnorm(n, 0, 0.1)
  return(data.frame(x=x, y=y))
}
# Black curve: the true risk f0; grey curves: empirical risk f from 100
# independent simulated data sets of size n = 30
curve(f0(b=x), from=0, to=2)
replicate(100, curve(f(b=x, df=simulate_df(30)),
                     add=TRUE, col="grey", lwd=0.1))
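The grey curves scatter around the black \( f_0 \) curve; to make the precision point concrete, a sketch of where each simulated empirical risk puts its minimum (optimize() is my choice of optimizer here, not part of the lecture code):

# The minimizers of the grey curves scatter around theta_0 = 1; their spread is the
# statistical error, and it is far larger than an optimization tolerance of 1e-6
argmins <- replicate(100, optimize(f, interval=c(0,2), df=simulate_df(30))$minimum)
summary(argmins)
sd(argmins)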