Agenda

Reading: Recipes 13.1 and 13.2 in The R Cookbook

Optional reading: 1.1, 2.1 and 2.2 in Red Plenty

Examples of Optimization Problems

Optimization Problems

Given an objective function \(f: \mathcal{D} \mapsto \mathbb{R}\), find

\[ \theta^* = \operatorname*{argmin}_{\theta}{f(\theta)} \]

Basics: maximizing \(f\) is minimizing \(-f\):

\[ \operatorname*{argmax}_{\theta}{f(\theta)}= \operatorname*{argmin}_{\theta}{-f(\theta)} \]

If \(h\) is strictly increasing (e.g., \(\log\)), then

\[ \operatorname*{argmin}_{\theta}{f(\theta)} = \operatorname*{argmin}_{\theta}{h(f(\theta))} \]

Considerations

You remember calculus, right?

Suppose \(x\) is one-dimensional and \(f\) is smooth. If \(x^*\) is an interior minimum / maximum / extremum point, then

\[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]

If \(x^*\) is a minimum, \[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} \geq 0 \]

You remember calculus, right?

This all carries over to multiple dimensions:

At an interior extremum, \[ \nabla f(\theta^*) = 0 \]

At an interior minimum, \[ \nabla^2 f(\theta^*) \geq 0 \] meaning for any vector \(v\), \[ v^T \nabla^2 f(\theta^*) v \geq 0 \] \(\nabla^2 f =\) the Hessian, \(\mathbf{H}\)

\(\theta^*\) might just be a local minimum

Gradients and Changes to f

\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \]

\[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \]

Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \]

\(\nabla f(\theta_0)\) points in the direction of fastest ascent at \(\theta_0\)
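As a quick numerical check of that claim, the numDeriv package (used again below) can compute gradients; the function f here is just an illustrative example:

library(numDeriv)
f <- function(theta) { theta[1]^2 + 3*theta[2]^2 }
grad(func=f, x=c(1,1))  # approximately c(2,6), the direction of fastest ascent at (1,1)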

Gradient Descent

  1. Start with initial guess for \(\theta\), step-size \(\eta\)
  2. While ((not too tired) and (making adequate progress))
    • Find gradient \(\nabla f(\theta)\)
    • Set \(\theta \leftarrow \theta - \eta \nabla f(\theta)\)
  3. Return final \(\theta\) as approximate \(\theta^*\)

Variations: adaptively adjust \(\eta\) to ensure improvement, or search along the gradient direction for the minimum (a line search)
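A minimal sketch of this loop in R (the function name, arguments, and default values of eta, max.iter, and tol are all illustrative choices, not a canonical implementation):

# Gradient descent: grad.f computes the gradient of the objective
gradient.descent <- function(grad.f, theta, eta=0.01, max.iter=1000, tol=1e-8) {
  for (iter in 1:max.iter) {
    step <- eta * grad.f(theta)          # move against the gradient
    if (sqrt(sum(step^2)) < tol) break   # stop once progress is inadequate
    theta <- theta - step
  }
  theta                                  # final theta as approximate theta*
}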

Pros and Cons of Gradient Descent

Pro:

  • Moves in the direction of greatest immediate improvement
  • If \(\eta\) is small enough, reaches a local minimum eventually, and then stops

Cons:

  • “Small enough” \(\eta\) can be really, really small
  • Slow zig-zagging if the components of \(\nabla f\) are of very different sizes
  • Only guaranteed to find a local minimum

How much work do we need?

Scaling

Big-\(O\) notation:

\[ h(x) = O(g(x)) \]

means

\[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \]

for some \(c \neq 0\)

e.g., \(x^2 - 5000x + 123456778 = O(x^2)\)

e.g., \(e^{x}/(1+e^{x}) = O(1)\)
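Checking this last example against the definition:

\[ \lim_{x\rightarrow\infty}{\frac{e^x/(1+e^x)}{1}} = \lim_{x\rightarrow\infty}{\frac{1}{e^{-x}+1}} = 1 \neq 0 \]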

Useful for looking at overall scaling while hiding the details

Also done when the limit is \(x\rightarrow 0\)

How Much Work is Gradient Descent?

Pro:

  • Each iteration is cheap: evaluating \(\nabla f\) means taking \(p\) partial derivatives, so \(O(p)\) work per step

Con:

  • Can need many iterations, and when \(f\) is an average over \(n\) data points, each gradient evaluation is really \(O(np)\)

Taylor Series

What if we do a quadratic approximation to \(f\)?

\[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

These are special cases of the general idea of Taylor approximation

Simplifies if \(x_0\) is a minimum since then \(f^{\prime}(x_0) = 0\):

\[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]
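For instance, expanding \(\cos x\) around its minimum at \(x_0 = \pi\), where \(\cos''(\pi) = -\cos\pi = 1\):

\[ \cos x \approx \cos\pi + \frac{1}{2}(x-\pi)^2 \cos''(\pi) = -1 + \frac{(x-\pi)^2}{2} \]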

Minimizing a Quadratic

If we know \[ f(x) = ax^2 + bx + c \]

we minimize exactly:

\[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \]

If \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \]

then \[ x^* = x_0 - a^{-1}b \]

Newton’s Method

Taylor-expand for the value at the minimum \(\theta^*\) \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \cdot \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \]

Take gradient, set to zero, solve for \(\theta^*\): \[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \]

Works exactly if \(f\) is quadratic (and \(\mathbf{H}^{-1}\) exists, etc.)

If \(f\) isn’t quadratic, keep pretending it is until we get close to \(\theta^*\), when it will be nearly true

Newton’s Method: The Algorithm

  1. Start with guess for \(\theta\)
  2. While ((not too tired) and (making adequate progress))
    • Find gradient \(\nabla f(\theta)\) and Hessian \(\mathbf{H}(\theta)\)
    • Set \(\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)\)
  3. Return final \(\theta\) as approximation to \(\theta^*\)

Like gradient descent, but with inverse Hessian giving the step-size

“This is about how far you can go with that gradient”
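A minimal R sketch, on the same pattern as the gradient descent sketch above (hess.f, which returns the Hessian matrix, is assumed; solve(H, g) computes \(\mathbf{H}^{-1} g\) without forming the inverse explicitly):

# Newton's method: grad.f gives the gradient, hess.f the Hessian
newton.method <- function(grad.f, hess.f, theta, max.iter=100, tol=1e-8) {
  for (iter in 1:max.iter) {
    step <- solve(hess.f(theta), grad.f(theta))  # Newton step H^{-1} grad
    if (sqrt(sum(step^2)) < tol) break
    theta <- theta - step
  }
  theta
}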

Advantages and Disadvantages of Newton’s Method

Pros:

  • The Hessian sets both the direction and the size of the step, adapting to the local shape of \(f\)
  • Exact for quadratics, and converges in few iterations near the minimum of a smooth function

Advantages and Disadvantages of Newton’s Method

Cons:

  • Needs the \(p(p+1)/2\) distinct second derivatives in \(\mathbf{H}\), on top of the gradient
  • Solving for \(\mathbf{H}^{-1} \nabla f\) costs \(O(p^3)\) per iteration
  • Can behave badly far from the minimum, where the quadratic approximation is poor

Getting Around the Hessian

Want to use the Hessian to improve convergence

Don’t want to have to keep computing the Hessian at each step

Approaches:

  • Re-compute the Hessian only every few iterations, re-using it in between
  • Quasi-Newton methods, like BFGS (used below): build up an approximation to the Hessian from the sequence of gradient evaluations

Other Methods

There are also derivative-free methods; two examples, Nelder-Mead and coordinate descent, appear after the Summary

Curve-Fitting by Optimizing

We have data \((x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)\)

We also have possible curves, \(r(x;\theta)\)

e.g., \(r(x) = x \cdot \theta\)

e.g., \(r(x) = \theta_1 x^{\theta_2}\)

e.g., \(r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}\) for fixed “basis” functions \(b_j\)

Curve-Fitting by Optimizing

Least-squares curve fitting:

\[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]

“Robust” curve fitting: \[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]
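One common choice of \(\psi\) is Huber’s loss, which is quadratic for small residuals but only linear for large ones, so outliers count for less than under squared error; a minimal sketch, with an illustrative threshold k=1:

# Huber loss: squared-error core, absolute-error tails
psi.huber <- function(r, k=1) {
  ifelse(abs(r) <= k, r^2/2, k*(abs(r) - k/2))
}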

Optimization in R: optim()

optim(par, fn, gr, method, control, hessian)

The return value contains the location ($par) and the value ($value) of the optimum, convergence diagnostics, and possibly $hessian

Optimization in R: optim()

gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp   # population = total GMP / per-capita GMP
library(numDeriv)              # provides grad() for numerical gradients
mse <- function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse <- function(theta) { grad(func=mse, x=theta) }
theta0 <- c(5000, 0.15)        # initial guess for (theta_1, theta_2)
fit1 <- optim(theta0, mse, grad.mse, method="BFGS", hessian=TRUE)

fit1: Newton-ish BFGS method

fit1[1:3]
## $par
## [1] 6493.2564    0.1277
## 
## $value
## [1] 61853983
## 
## $counts
## function gradient 
##       63       11

fit1: Newton-ish BFGS method

fit1[4:6]
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##           [,1]      [,2]
## [1,]      52.5 4.422e+06
## [2,] 4422070.4 3.757e+11

nls

optim is a general-purpose optimizer

So is nlm — try them both if one doesn’t work

nls is for nonlinear least squares

nls

nls(formula, data, start, control, [[many other options]])

Returns an nls object, with fitted values, prediction methods, etc.

The default optimization method is Gauss-Newton, a variant of Newton’s method specialized to least squares

fit2: Fitting the Same Model with nls()

fit2 <- nls(pcgmp~y0*pop^a,data=gmp,start=list(y0=5000,a=0.1))
summary(fit2)
## 
## Formula: pcgmp ~ y0 * pop^a
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## y0 6.49e+03   8.57e+02    7.58  2.9e-13 ***
## a  1.28e-01   1.01e-02   12.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7890 on 364 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 1.75e-07

fit2: Fitting the Same Model with nls()

plot(pcgmp~pop, data=gmp)            # scatter-plot of the data
pop.order <- order(gmp$pop)          # sort by population so lines() draws left to right
lines(gmp$pop[pop.order], fitted(fit2)[pop.order])                    # nls fit
curve(fit1$par[1]*x^fit1$par[2], add=TRUE, lty="dashed", col="blue")  # optim fit

[Plot: per-capita GMP versus population, with the nls fit (solid line) and the optim fit (dashed blue curve)]

Summary

  1. Trade-offs: complexity of iteration vs. number of iterations vs. precision of approximation
    • Gradient descent: less complex iterations, more guarantees, less adaptive
    • Newton: more complex iterations, but few of them for good functions, more adaptive, less robust
  2. Start with pre-built code like optim or nls, implement your own as needed

Nelder-Mead, a.k.a. the Simplex Method

Try to cage \(\theta^*\) with a simplex of \(p+1\) points

Order the trial points, \(f(\theta_1) \leq f(\theta_2) \leq \ldots \leq f(\theta_{p+1})\)

\(\theta_{p+1}\) is the worst guess — try to improve it

Center of the not-worst points: \(\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}\)

Nelder-Mead, a.k.a. the Simplex Method

Try to improve the worst guess \(\theta_{p+1}\)

  1. Reflection: Try \(\theta_0 - (\theta_{p+1}-\theta_0)\), across the center from \(\theta_{p+1}\)
    • if it’s better than \(\theta_p\) but not better than \(\theta_1\), replace the old \(\theta_{p+1}\) with it
    • Expansion: if the reflected point is the new best, try \(\theta_0 - 2(\theta_{p+1}-\theta_0)\); replace the old \(\theta_{p+1}\) with the better of the reflected and the expanded point
  2. Contraction: If the reflected point is worse than \(\theta_p\), try \(\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}\); if the contracted value is better, replace \(\theta_{p+1}\) with it
  3. Reduction: If all else fails, \(\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}\)
  4. Go back to (1) until we stop improving or run out of time

Making Sense of Nelder-Mead

The Moves:

  • Reflection moves the worst point away from where the function is large, through the middle of the others
  • Expansion keeps going in that direction when reflection works well
  • Contraction and reduction pull the simplex in toward the better points when reflection overshoots

Making Sense of Nelder-Mead

Pros:

  • Needs only function evaluations, not gradients or Hessians
  • Works even when \(f\) is not smooth
  • Simple and quite robust

Con:

  • Can need many more iterations than gradient methods
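Since Nelder-Mead is optim()’s default method, it is easy to try on the mse objective defined above; fit3 is just an illustrative name:

fit3 <- optim(theta0, mse, method="Nelder-Mead")  # no gradient function needed
fit3$par  # compare with fit1$par from BFGS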

Coordinate Descent

Gradient descent, Newton’s method, simplex, etc., adjust all coordinates of \(\theta\) at once — gets harder as the number of dimensions \(p\) grows

Coordinate descent: never do more than one-dimensional optimization; minimize over one coordinate at a time, holding the others fixed, and cycle through the coordinates (see the sketch after this slide)

Coordinate Descent

Cons:

  • A full adjustment of \(\theta\) takes a whole sweep through the coordinates, and many sweeps may be needed
  • Converges slowly, or stalls, when the coordinates are strongly coupled

Pros:

  • Each one-dimensional sub-problem is easy, and fast, reliable 1D optimizers (like R’s optimize()) are well understood
  • No derivatives required
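A minimal sketch in R, using optimize() for each one-dimensional minimization; the search bounds lower and upper are assumed, illustrative limits:

# Coordinate descent: cycle through coordinates, minimizing one at a time
coord.descent <- function(f, theta, lower=-1e6, upper=1e6,
                          max.sweeps=100, tol=1e-8) {
  for (sweep in 1:max.sweeps) {
    old <- theta
    for (j in 1:length(theta)) {
      f.1d <- function(z) { th <- theta; th[j] <- z; f(th) }  # f along coordinate j
      theta[j] <- optimize(f.1d, lower=lower, upper=upper)$minimum
    }
    if (sqrt(sum((theta - old)^2)) < tol) break  # sweep changed little: done
  }
  theta
}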