Reading: Recipes 13.1 and 13.2 in The R Cookbook

Optional reading: 1.1, 2.1 and 2.2 in Red Plenty

Examples of Optimization Problems

Optimization Problems

Given an objective function \(f: \mathcal{D} \to \mathbb{R}\), find

\[ \theta^* = \operatorname*{argmin}_{\theta}{f(\theta)} \]

Basics: maximizing \(f\) is minimizing \(-f\):

\[ \operatorname*{argmax}_{\theta}{f(\theta)}= \operatorname*{argmin}_{\theta}{-f(\theta)} \]

If \(h\) is strictly increasing (e.g., \(\log\)), then

\[ \operatorname*{argmin}_{\theta}{f(\theta)} = \operatorname*{argmin}_{\theta}{h(f(\theta))} \]


You remember calculus, right?

Suppose \(x\) is one-dimensional and \(f\) is smooth. If \(x^*\) is an interior minimum / maximum / extremum point, then

\[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]

If \(x^*\) is a minimum, \[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0 \]

You remember calculus, right?

This all carries over to multiple dimensions:

At an interior extremum, \[ \nabla f(\theta^*) = 0 \]

At an interior minimum, \[ \nabla^2 f(\theta^*) \geq 0 \] meaning for any vector \(v\), \[ v^T \nabla^2 f(\theta^*) v \geq 0 \] \(\nabla^2 f =\) the Hessian, \(\mathbf{H}\)

\(\theta^*\) might just be a local minimum
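A quick numerical check of these conditions on a toy objective, using the numDeriv package (my choice; it also supplies the grad() used with optim() later):

library(numDeriv)
f <- function(theta) { (theta[1]-1)^2 + 3*(theta[2]+2)^2 }  # toy objective
theta.star <- c(1, -2)                 # its interior minimum
grad(f, theta.star)                    # numerically zero
eigen(hessian(f, theta.star))$values   # both positive (6 and 2): a minimum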

Gradients and Changes to f

\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \]

\[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \]

Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \]

\(\nabla f(\theta_0)\) points in the direction of fastest ascent at \(\theta_0\)

Gradient Descent

  1. Start with initial guess for \(\theta\), step-size \(\eta\)
  2. While ((not too tired) and (making adequate progress))
    • Find gradient \(\nabla f(\theta)\)
    • Set \(\theta \leftarrow \theta - \eta \nabla f(\theta)\)
  3. Return final \(\theta\) as approximate \(\theta^*\)

Variations: adaptively adjust \(\eta\) to ensure improvement, or search along the gradient direction for the minimum
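A minimal sketch of the loop in R (function and argument names are mine, not a library's):

grad.descent <- function(f, grad.f, theta0, eta = 0.01, max.iter = 1e4, tol = 1e-8) {
  theta <- theta0
  for (i in 1:max.iter) {
    theta.new <- theta - eta * grad.f(theta)           # step against the gradient
    converged <- abs(f(theta) - f(theta.new)) < tol    # adequate progress?
    theta <- theta.new
    if (converged) break
  }
  theta
}
# e.g., grad.descent(function(th) sum((th-3)^2), function(th) 2*(th-3), c(0,0))
# converges to (approximately) c(3, 3)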

Pros and Cons of Gradient Descent



How much work do we need?


Big-\(O\) notation:

\[ h(x) = O(g(x)) \]

means

\[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \]

for some \(c \neq 0\)

e.g., \(x^2 - 5000x + 123456778 = O(x^2)\)

e.g., \(e^{x}/(1+e^{x}) = O(1)\)

Useful to look at over-all scaling, hiding details

Also done when the limit is \(x\rightarrow 0\)
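A numeric illustration of the first example (my code): the ratio \(h(x)/x^2\) settles down to \(c = 1\) as \(x\) grows:

h <- function(x) { x^2 - 5000*x + 123456778 }
h(c(1e4, 1e6, 1e8)) / c(1e4, 1e6, 1e8)^2   # 1.735, 0.995, 1.000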

How Much Work is Gradient Descent?



Taylor Series

What if we do a quadratic approximation to \(f\)?

\[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Special cases of the general idea of Taylor approximation

Simplifies if \(x_0\) is a minimum since then \(f^{\prime}(x_0) = 0\):

\[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]
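A one-dimensional check of the "parabola" claim: near the minimum of \(\cos\) at \(x_0 = \pi\), where \(\cos(x_0) = -1\) and the second derivative is \(1\):

x <- pi + 0.1
cos(x)                 # -0.9950042
-1 + 0.5*(x - pi)^2    # -0.995, nearly identical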

Minimizing a Quadratic

If we know \[ f(x) = ax^2 + bx + c \]

we minimize exactly:

\[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \]

If \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \]

then \[ x^* = x_0 - a^{-1}b \]

Newton’s Method

Taylor-expand for the value at the minimum \(\theta^*\) \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \cdot \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \]

Take gradient, set to zero, solve for \(\theta^*\): \[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \]

Works exactly if \(f\) is quadratic
and \(\mathbf{H}^{-1}\) exists, etc.

If \(f\) isn’t quadratic, keep pretending it is until we get close to \(\theta^*\), when it will be nearly true

Newton’s Method: The Algorithm

  1. Start with guess for \(\theta\)
  2. While ((not too tired) and (making adequate progress))
    • Find gradient \(\nabla f(\theta)\) and Hessian \(\mathbf{H}(\theta)\)
    • Set \(\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)\)
  3. Return final \(\theta\) as approximation to \(\theta^*\)

Like gradient descent, but with inverse Hessian giving the step-size

“This is about how far you can go with that gradient”
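A minimal sketch in R, in the same style as the gradient-descent sketch above (a serious implementation would also guard against a singular or indefinite Hessian):

newton <- function(grad.f, hess.f, theta0, max.iter = 100, tol = 1e-8) {
  theta <- theta0
  for (i in 1:max.iter) {
    g <- grad.f(theta)
    if (sqrt(sum(g^2)) < tol) break            # gradient essentially zero: stop
    theta <- theta - solve(hess.f(theta), g)   # Newton step; solve() avoids forming H^{-1} explicitly
  }
  theta
}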

Advantages and Disadvantages of Newton’s Method




Getting Around the Hessian

Want to use the Hessian to improve convergence

Don’t want to have to keep computing the Hessian at each step

Quasi-Newton methods (like the BFGS method used below) build up an approximation to the Hessian from successive gradient evaluations


Other Methods

Curve-Fitting by Optimizing

We have data \((x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)\)

We also have possible curves, \(r(x;\theta)\)

e.g., \(r(x) = x \cdot \theta\)

e.g., \(r(x) = \theta_1 x^{\theta_2}\)

e.g., \(r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}\) for fixed “basis” functions \(b_j\)
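When \(r\) is linear in \(\theta\), as in the last example, the least-squares fit (next slide) is just linear regression; a sketch with an assumed polynomial basis \(b_1(x)=x\), \(b_2(x)=x^2\), \(b_3(x)=x^3\):

x <- runif(100, 0, 10)
y <- 2*x - 0.1*x^3 + rnorm(100)   # simulated data for illustration
B <- cbind(x, x^2, x^3)           # n x q matrix of basis-function values
coef(lm(y ~ B - 1))               # theta-hat; "- 1" drops the separate intercept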

Curve-Fitting by Optimizing

Least-squares curve fitting:

\[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]

“Robust” curve fitting, where \(\psi\) is a loss function that grows more slowly than the square (so outliers count for less): \[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]
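A hedged sketch for the power-law curve, with a Huber-type \(\psi\) (one common choice; the \(\psi\) and all names here are mine), minimized with optim() from the next slide:

psi.huber <- function(u, k = 1) {
  ifelse(abs(u) <= k, u^2/2, k*abs(u) - k^2/2)   # quadratic core, linear tails
}
robust.loss <- function(theta, x, y) {
  mean(psi.huber(y - theta[1]*x^theta[2]))       # r(x;theta) = theta_1 x^theta_2
}
# e.g., optim(c(1, 1), robust.loss, x = x, y = y)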

Optimization in R: optim()

optim(par, fn, gr, method, control, hessian)

The return value contains the location ($par) and the value ($value) of the optimum, convergence diagnostics, and possibly $hessian

Optimization in R: optim()

gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
library(numDeriv)  # provides the numerical gradient grad()
mse <- function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse <- function(theta) { grad(func=mse, x=theta) }
theta0 <- c(5000, 0.10)  # initial guess (undefined in the original; values assumed)
fit1 <- optim(theta0, mse, grad.mse, method="BFGS", hessian=TRUE)

fit1: Newton-ish BFGS method

## $par
## [1] 6493.2564    0.1277
## $value
## [1] 61853983
## $counts
## function gradient 
##       63       11

fit1: Newton-ish BFGS method

## $convergence
## [1] 0
## $message
## NULL
## $hessian
##           [,1]      [,2]
## [1,]      52.5 4.422e+06
## [2,] 4422070.4 3.757e+11


optim is a general-purpose optimizer

So is nlm — try them both if one doesn’t work

nls is for nonlinear least squares


nls(formula, data, start, control, [[many other options]])

Returns an nls object, with fitted values, prediction methods, etc.

The default optimization method is a version of Newton’s method for least squares (Gauss-Newton)

fit2: Fitting the Same Model with nls()

fit2 <- nls(pcgmp~y0*pop^a, data=gmp, start=list(y0=5000, a=0.1))
summary(fit2)
## Formula: pcgmp ~ y0 * pop^a
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## y0 6.49e+03   8.57e+02    7.58  2.9e-13 ***
## a  1.28e-01   1.01e-02   12.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7890 on 364 degrees of freedom
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 1.75e-07

fit2: Fitting the Same Model with nls()

pop.order <- order(gmp$pop)

[Figure: per-capita GMP versus population, with the curve fitted by nls() overlaid]
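A sketch of how such a figure can be re-drawn from fit2 (plotting details assumed):

plot(pcgmp ~ pop, data = gmp, xlab = "Population", ylab = "Per-capita GMP")
lines(gmp$pop[pop.order], fitted(fit2)[pop.order], col = "blue", lwd = 2)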


  1. Trade-offs: complexity of iteration vs. number of iterations vs. precision of approximation
    • Gradient descent: less complex iterations, more guarantees, less adaptive
    • Newton: more complex iterations, but few of them for good functions, more adaptive, less robust
  2. Start with pre-built code like optim or nls, implement your own as needed

Nelder-Mead, a.k.a. the Simplex Method

Try to cage \(\theta^*\) with a simplex of \(p+1\) points

Order the trial points so that \(f(\theta_1) \leq f(\theta_2) \leq \ldots \leq f(\theta_{p+1})\)

\(\theta_{p+1}\) is the worst guess — try to improve it

Center of the not-worst points: \(\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}\)

Nelder-Mead, a.k.a. the Simplex Method

Try to improve the worst guess \(\theta_{p+1}\)

  1. Reflection: Try \(\theta_0 - (\theta_{p+1}-\theta_0)\), across the center from \(\theta_{p+1}\)
    • if it’s better than \(\theta_p\) but not better than \(\theta_1\), replace the old \(\theta_{p+1}\) with it
    • Expansion: if the reflected point is the new best, try \(\theta_0 - 2(\theta_{p+1}-\theta_0)\); replace the old \(\theta_{p+1}\) with the better of the reflected and the expanded point
  2. Contraction: If the reflected point is worse than \(\theta_p\), try \(\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}\); if the contracted value is better, replace \(\theta_{p+1}\) with it
  3. Reduction: If all else fails, \(\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}\)
  4. Go back to (1) until we stop improving or run out of time
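In R, this is the default method of optim(), so no gradient is needed; reusing the mse objective from fit1 above (starting values assumed):

fit.nm <- optim(c(5000, 0.10), mse)   # method = "Nelder-Mead" is optim()'s default
fit.nm$par                            # should land near fit1$par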

Making Sense of Nelder-Mead

The Moves:

Making Sense of Nelder-Mead


Con: can need many more iterations than gradient methods

Coordinate Descent

Gradient descent, Newton’s method, simplex, etc., adjust all coordinates of \(\theta\) at once — gets harder as the number of dimensions \(p\) grows

Coordinate descent: never do more than 1D optimization
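A minimal sketch, cycling through the coordinates and handing each 1D problem to R's built-in optimize() (names and the box constraints are mine):

coord.descent <- function(f, theta0, lower, upper, n.sweeps = 50) {
  theta <- theta0
  for (s in 1:n.sweeps) {
    for (j in seq_along(theta)) {
      f.1d <- function(z) { th <- theta; th[j] <- z; f(th) }   # vary coordinate j only
      theta[j] <- optimize(f.1d, lower = lower[j], upper = upper[j])$minimum
    }
  }
  theta
}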

Coordinate Descent