36-350

29 October 2014

- Why there's such a thing as too much data
- How to make noise our friend
- How to make slop our friend

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\ER}{f} \newcommand{\TrueR}{f_0} \newcommand{\ERM}{\hat{\theta}_n} \newcommand{\EH}{\widehat{\mathbf{H}}_n} \newcommand{\tr}[1]{\mathrm{tr}\left( #1 \right)} \]

Optional reading: Bottou and Bousquet, “The Tradeoffs of Large Scale Learning”

Typical statistical objective function, mean-squared error: \[ f(\theta) = \frac{1}{n}\sum_{i=1}^{n}{{\left( y_i - m(x_i,\theta)\right)}^2} \]

Getting a value of \( f \) is \( O(n) \), \( \nabla f \) is \( O(np) \), \( \mathbf{H} \) is \( O(np^2) \)

- worse still if \( m \) slows down with \( n \)

Not bad when \( n=100 \) or even \( n=10^4 \), but if \( n={10}^9 \) or \( n={10}^{12} \) we don't even know which way to move
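
A quick sketch of the linear scaling (the model and the sizes here are just for illustration):

```
# Sketch: one evaluation of the MSE objective costs O(n)
mse <- function(theta, x, y) { mean((y - theta*x)^2) }
for (n in c(1e5, 1e6, 1e7)) {
  x <- runif(n); y <- x + rnorm(n, 0, 0.1)
  print(system.time(mse(1, x, y))["elapsed"])  # grows roughly linearly in n
}
```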

Pick *one* data point \( I \) at random (uniform on \( 1:n \)). The loss there, \( {\left( y_I - m(x_I,\theta)\right)}^2 \), is random, but

\[ \Expect{{\left( y_I - m(x_I,\theta)\right)}^2} = f(\theta) \]

- Generally, if \( f(\theta) = n^{-1}\sum_{i=1}^{n}{f_i(\theta)} \) and \( f_i \) are well-behaved, \[ \begin{eqnarray*} \Expect{f_I(\theta)} & = & f(\theta)\\ \Expect{\nabla f_I(\theta)} & = & \nabla f(\theta)\\ \Expect{\nabla^2 f_I(\theta)} & = & \mathbf{H}(\theta) \end{eqnarray*} \]

\( \therefore \) Don't optimize with all the data, optimize with random samples
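
A quick sanity check in R (the data and the model are illustrative): averaging many single-point losses recovers the full-sample MSE.

```
# Sketch: the single-point loss is an unbiased estimate of f
set.seed(42)
n <- 1000
x <- runif(n); y <- 2*x + rnorm(n, 0, 0.1)   # illustrative data
m <- function(x, theta) { theta*x }           # illustrative model
f <- function(theta) { mean((y - m(x, theta))^2) }
f.I <- function(theta) {                      # loss at one random point
  I <- sample(n, size=1)
  (y[I] - m(x[I], theta))^2
}
mean(replicate(1e4, f.I(1.5)))  # Monte Carlo estimate of E[f_I(1.5)]
f(1.5)                          # should be close to the line above
```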

Draw lots of one-point samples, let their noise cancel out:

- Start with initial guess \( \theta \), learning rate \( \eta \)
- While ((not too tired) and (making adequate progress))
    - At \( t^{\mathrm{th}} \) iteration, pick random \( I \) uniformly
    - Set \( \theta\leftarrow \theta - t^{-1}\eta\nabla f_I(\theta) \)

- Return final \( \theta \)

Shrinking the step-size like \( 1/t \) lets the noise in the successive gradients average out: the steps remain big enough to reach the optimum from anywhere (\( \sum_{t}{t^{-1}} = \infty \)), while the total variance the noise injects stays finite (\( \sum_{t}{t^{-2}} < \infty \))

(Variants: put points in some random order, only check progress after going over each point once, adjust the \( 1/t \) rate, average a couple of random data points at a time (“mini-batch”; see the sketch after the code below), etc.)

```
stoch.grad.descent <- function(f, theta, df, max.iter=1e6, rate=1e-6) {
  for (t in 1:max.iter) {
    g <- stoch.grad(f, theta, df)   # noisy gradient from one random point
    theta <- theta - (rate/t)*g     # shrinking step size, as above
  }
  return(theta)
}
stoch.grad <- function(f, theta, df) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=1)   # pick one data point at random
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  return(grad(noisy.f, theta))      # numerical gradient at that point
}
```
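
The mini-batch variant needs only a small change to `stoch.grad`; the batch size `B` below is an illustrative choice.

```
# Sketch of the mini-batch variant: average the loss over B random points
stoch.grad.batch <- function(f, theta, df, B=10) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=B)   # B distinct random rows
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  return(grad(noisy.f, theta))      # still unbiased, but lower variance
}
```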

Stochastic Newton's method, a.k.a. 2nd-order stochastic gradient descent

- Start with initial guess \( \theta \)
- While ((not too tired) and (making adequate progress))
    - At \( t^{\mathrm{th}} \) iteration, pick uniformly-random \( I \)
    - \( \theta \leftarrow \theta - t^{-1}\mathbf{H}_{I}^{-1}(\theta) \nabla f_{I}(\theta) \)

- Return final \( \theta \)

+ all the Newton-ish tricks to avoid having to recompute the Hessian
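
A minimal sketch of one such update, in the style of `stoch.grad` above, using `numDeriv`'s `hessian`; since a single data point usually gives a singular Hessian, this version uses a small batch (the size `B` is illustrative).

```
# Sketch: one stochastic Newton step, theta <- theta - (1/t) H_I^{-1} grad f_I
stoch.newton.step <- function(f, theta, df, t, B=10) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=B)
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  g <- grad(noisy.f, theta)            # noisy gradient on the batch
  H <- hessian(noisy.f, theta)         # noisy Hessian on the same batch
  return(theta - (1/t)*solve(H, g))    # solve() rather than explicit inverse
}
```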

Pros:

- Each iteration is fast (and constant in \( n \))
- Never need to hold all data in memory
- Does converge eventually

Cons:

- Noise *does* reduce precision: more iterations to get within \( \epsilon \) of the optimum than non-stochastic GD or Newton


Often low computational cost to get within *statistical* error of
the optimum

We're minimizing \( f \) and aiming at \( \hat{\theta} \)

\( f \) is a function of the data, which are full of useless details

- e.g., \( f(\theta) = n^{-1}\sum_{i=1}^{n}{(y_i-m(x_i;\theta))^2} \)

We hope there's some true \( f_0 \), with minimum \( \theta_0 \)

- e.g., \( f_0(\theta) = \Expect{(Y-m(X;\theta))^2} = \int{(y-m(x;\theta))^2 p(x,y) dx dy} \)

but we know \( f \neq f_0 \)

- because we've only got a finite sample, which includes noise

Past some point, getting a better \( \hat{\theta} \) isn't helping us find \( \theta_0 \)

(why push optimization to \( \pm {10}^{-6} \) if \( f \) only matches \( f_0 \) to \( \pm 1 \)?)

- An example:
    - true model: \( Y=X + \) Gaussian noise with mean 0, s.d. 0.1, \( X \) uniform between 0 and 1
    - Try regression through the origin (\( Y=bX \), \( b \) unknown)

- We'll plot the population objective function, and a bunch of sample objective functions, each from a different draw of \( n=30 \) points

```
f0 <- function(b) { 0.1^2 + (1/3)*(b-1)^2 }   # population MSE: noise variance + (b-1)^2 E[X^2]
f <- Vectorize(FUN=function(b, df) { mean((df$y - b*df$x)^2) },
               vectorize.args="b")            # sample MSE on one data frame
simulate_df <- function(n) {
  x <- runif(n)
  y <- x + rnorm(n, 0, 0.1)
  return(data.frame(x=x, y=y))
}
```

```
curve(f0(b=x), from=0, to=2)
replicate(100, curve(f(b=x, df=simulate_df(30)),
                     add=TRUE, col="grey", lwd=0.1))
```
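
The sample curves' minima scatter around \( b=1 \); since regression through the origin has the closed-form minimizer \( \hat{b} = \sum{x_i y_i}/\sum{x_i^2} \), we can measure that scatter directly (an illustrative check):

```
# Sketch: how much do sample minimizers scatter around the true b = 1?
bhats <- replicate(1000, with(simulate_df(30), sum(x*y)/sum(x^2)))
sd(bhats)  # about 0.03 here, so optimizing b much past ~1e-2 buys nothing
```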