36-350

29 October 2014

- Why there's such a thing as too much data
- How to make noise our friend
- How to make slop our friend

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\ER}{f} \newcommand{\TrueR}{f_0} \newcommand{\ERM}{\hat{\theta}_n} \newcommand{\EH}{\widehat{\mathbf{H}}_n} \newcommand{\tr}[1]{\mathrm{tr}\left( #1 \right)} \]

Optional reading: Bottou and Bousquet, “The Tradeoffs of Large Scale Learning”

Typical statistical objective function, mean-squared error: \[ f(\theta) = \frac{1}{n}\sum_{i=1}^{n}{{\left( y_i - m(x_i,\theta)\right)}^2} \]

Getting a value of \( f \) is \( O(n) \), \( \nabla f \) is \( O(np) \), \( \mathbf{H} \) is \( O(np^2) \)

- worse still if \( m \) slows down with \( n \)

Not bad when \( n=100 \) or even \( n=10^4 \), but if \( n={10}^9 \) or \( n={10}^{12} \) we don't even know which way to move
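
A quick sketch of the linear scaling (the model and the sizes here are just for illustration):

```
# Sketch: one evaluation of the MSE objective costs O(n)
mse <- function(theta, x, y) { mean((y - theta*x)^2) }
for (n in c(1e5, 1e6, 1e7)) {
  x <- runif(n); y <- x + rnorm(n, 0, 0.1)
  print(system.time(mse(1, x, y))["elapsed"])  # grows roughly linearly in n
}
```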

Pick *one* data point \( I \) at random (uniform on \( 1:n \)). The loss there, \( {\left( y_I - m(x_I,\theta)\right)}^2 \), is random, but

\[ \Expect{{\left( y_I - m(x_I,\theta)\right)}^2} = f(\theta) \]

- Generally, if \( f(\theta) = n^{-1}\sum_{i=1}^{n}{f_i(\theta)} \) and \( f_i \) are well-behaved, \[ \begin{eqnarray*} \Expect{f_I(\theta)} & = & f(\theta)\\ \Expect{\nabla f_I(\theta)} & = & \nabla f(\theta)\\ \Expect{\nabla^2 f_I(\theta)} & = & \mathbf{H}(\theta) \end{eqnarray*} \]

\( \therefore \) Don't optimize with all the data, optimize with random samples
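
A quick sanity check in R (the data and the model are illustrative): averaging many single-point losses recovers the full-sample MSE.

```
# Sketch: the single-point loss is an unbiased estimate of f
set.seed(42)
n <- 1000
x <- runif(n); y <- 2*x + rnorm(n, 0, 0.1)   # illustrative data
m <- function(x, theta) { theta*x }           # illustrative model
f <- function(theta) { mean((y - m(x, theta))^2) }
f.I <- function(theta) {                      # loss at one random point
  I <- sample(n, size=1)
  (y[I] - m(x[I], theta))^2
}
mean(replicate(1e4, f.I(1.5)))  # Monte Carlo estimate of E[f_I(1.5)]
f(1.5)                          # should be close to the line above
```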

Draw lots of one-point samples, let their noise cancel out:

- Start with initial guess \( \theta \), learning rate \( \eta \)
- While ((not too tired) and (making adequate progress))
    - At \( t^{\mathrm{th}} \) iteration, pick random \( I \) uniformly
    - Set \( \theta\leftarrow \theta - t^{-1}\eta\nabla f_I(\theta) \)

- Return final \( \theta \)

Shrinking the step-size like \( 1/t \) lets the noise in the successive gradients average out: the steps remain big enough to reach the optimum from anywhere (\( \sum_{t}{t^{-1}} = \infty \)), while the total variance the noise injects stays finite (\( \sum_{t}{t^{-2}} < \infty \))

(Variants: put points in some random order, only check progress after going over each point once, adjust the \( 1/t \) rate, average a couple of random data points at a time (“mini-batch”; see the sketch after the code below), etc.)

```
stoch.grad.descent <- function(f, theta, df, max.iter=1e6, rate=1e-6) {
  for (t in 1:max.iter) {
    g <- stoch.grad(f, theta, df)   # noisy gradient from one random point
    theta <- theta - (rate/t)*g     # shrinking step size, as above
  }
  return(theta)
}
stoch.grad <- function(f, theta, df) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=1)   # pick one data point at random
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  return(grad(noisy.f, theta))      # numerical gradient at that point
}
```
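
The mini-batch variant needs only a small change to `stoch.grad`; the batch size `B` below is an illustrative choice.

```
# Sketch of the mini-batch variant: average the loss over B random points
stoch.grad.batch <- function(f, theta, df, B=10) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=B)   # B distinct random rows
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  return(grad(noisy.f, theta))      # still unbiased, but lower variance
}
```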

Stochastic Newton's method, a.k.a. 2nd-order stochastic gradient descent

- Start with initial guess \( \theta \)
- While ((not too tired) and (making adequate progress))
    - At \( t^{\mathrm{th}} \) iteration, pick uniformly-random \( I \)
    - \( \theta \leftarrow \theta - t^{-1}\mathbf{H}_{I}^{-1}(\theta) \nabla f_{I}(\theta) \)

- Return final \( \theta \)

+ all the Newton-ish tricks to avoid having to recompute the Hessian
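
A minimal sketch of one such update, in the style of `stoch.grad` above, using `numDeriv`'s `hessian`; since a single data point usually gives a singular Hessian, this version uses a small batch (the size `B` is illustrative).

```
# Sketch: one stochastic Newton step, theta <- theta - (1/t) H_I^{-1} grad f_I
stoch.newton.step <- function(f, theta, df, t, B=10) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=B)
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  g <- grad(noisy.f, theta)            # noisy gradient on the batch
  H <- hessian(noisy.f, theta)         # noisy Hessian on the same batch
  return(theta - (1/t)*solve(H, g))    # solve() rather than explicit inverse
}
```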

Pros:

- Each iteration is fast (and constant in \( n \))
- Never need to hold all data in memory
- Does converge eventually

Cons:

- Noise *does* reduce precision: more iterations to get within \( \epsilon \) of the optimum than non-stochastic GD or Newton


Often low computational cost to get within *statistical* error of
the optimum

We're minimizing \( f \) and aiming at \( \hat{\theta} \)

\( f \) is a function of the data, which are full of useless details

- e.g., \( f(\theta) = n^{-1}\sum_{i=1}^{n}{(y_i-m(x_i;\theta))^2} \)

We hope there's some true \( f_0 \), with minimum \( \theta_0 \)

- e.g., \( f_0(\theta) = \Expect{(Y-m(X;\theta))^2} = \int{(y-m(x;\theta))^2 p(x,y) dx dy} \)

but we know \( f \neq f_0 \)

- because we've only got a finite sample, which includes noise

Past some point, getting a better \( \hat{\theta} \) isn't helping us find \( \theta_0 \)

(why push optimization to \( \pm {10}^{-6} \) if \( f \) only matches \( f_0 \) to \( \pm 1 \)?)

- An example:
    - true model: \( Y=X + \) Gaussian noise with mean 0, s.d. 0.1, \( X \) uniform between 0 and 1
    - Try regression through the origin (\( Y=bX \), \( b \) unknown)

- We'll plot the population objective function, and a bunch of sample objective functions, each from a different draw of \( n=30 \) points

```
f0 <- function(b) { 0.1^2 + (1/3)*(b-1)^2 }   # population MSE: noise variance + (b-1)^2 E[X^2]
f <- Vectorize(FUN=function(b, df) { mean((df$y - b*df$x)^2) },
               vectorize.args="b")            # sample MSE on one data frame
simulate_df <- function(n) {
  x <- runif(n)
  y <- x + rnorm(n, 0, 0.1)
  return(data.frame(x=x, y=y))
}
```

```
curve(f0(b=x), from=0, to=2)
replicate(100, curve(f(b=x, df=simulate_df(30)),
                     add=TRUE, col="grey", lwd=0.1))
```
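
The sample curves' minima scatter around \( b=1 \); since regression through the origin has the closed-form minimizer \( \hat{b} = \sum{x_i y_i}/\sum{x_i^2} \), we can measure that scatter directly (an illustrative check):

```
# Sketch: how much do sample minimizers scatter around the true b = 1?
bhats <- replicate(1000, with(simulate_df(30), sum(x*y)/sum(x^2)))
sd(bhats)  # about 0.03 here, so optimizing b much past ~1e-2 buys nothing
```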