Agenda

Basics of optimization
Gradient descent
Newton’s method
Curve-fitting
R: optim, nls

Reading: Recipes 13.1 and 13.2 in The R Cookbook

Optional reading: 1.1, 2.1 and 2.2 in Red Plenty

Examples of Optimization Problems

Minimize mean-squared error of regression surface (Gauss, c. 1800)
Maximize likelihood of distribution (Fisher, c. 1918)
Maximize output of plywood from given supplies and factories (Kantorovich, 1939)
Maximize output of tanks from given supplies and factories; minimize number of bombing runs to destroy factory (c. 1939–1945)
Maximize return of portfolio for given volatility (Markowitz, 1950s)
Minimize cost of airline flight schedule (Kantorovich…)
Maximize reproductive fitness of an organism (Maynard Smith)

Optimization Problems

Given an objective function $f: \mathcal{D} \mapsto R$, find

\[ \theta^* = \operatorname*{argmin}_{\theta}{f(\theta)} \]

Basics: maximizing $f$ is minimizing $-f$:

\[ \operatorname*{argmax}_{\theta}{f(\theta)}= \operatorname*{argmin}_{\theta}{-f(\theta)} \]

If $h$ is strictly increasing (e.g., $\log$), then

\[ \operatorname*{argmin}_{\theta}{f(\theta)} = \operatorname*{argmin}_{\theta}{h(f(\theta))} \]

Considerations

Approximation: How close can we get to $\theta^*$, and/or $f(\theta^*)$?
Time complexity: How many computer steps does that take?
Varies with precision of approximation, niceness of $f$, size of $\mathcal{D}$, size of data, method…
Most optimization algorithms use successive approximation, so distinguish number of iterations from cost of each iteration

You remember calculus, right?

Suppose $x$ is one dimensional and $f$ is smooth. If $x^*$ is an interior minimum / maximum / extremum point

\[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]

If $x^*$ a minimum, \[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0 \]

You remember calculus, right?

This all carries over to multiple dimensions:

At an interior extremum, \[ \nabla f(\theta^*) = 0 \]

At an interior minimum, \[ \nabla^2 f(\theta^*) \geq 0 \] meaning for any vector $v$, \[ v^T \nabla^2 f(\theta^*) v \geq 0 \] $\nabla^2 f =$ the Hessian, $\mathbf{H}$

$\theta$ might just be a local minimum

Gradients and Changes to f

\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \]

\[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \]

Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \]

$\nabla f(\theta_0)$ points in the direction of fastest ascent at $\theta_0$

Gradient Descent

Start with initial guess for $\theta$, step-size $\eta$
While ((not too tired) and (making adequate progress))
- Find gradient $\nabla f(\theta)$
- Set $\theta \leftarrow \theta - \eta \nabla f(\theta)$
Return final $\theta$ as approximate $\theta^*$

Variations: adaptively adjust $\eta$ to make sure of improvement or search along the gradient direction for minimum

Pros and Cons of Gradient Descent

Pro:

Moves in direction of greatest immediate improvement
If $\eta$ is small enough, gets to a local minimum eventually, and then stops

Cons:

“small enough” $\eta$ can be really, really small
Slowness or zig-zagging if components of $\nabla f$ are very different sizes

How much work do we need?

Scaling

Big-$O$ notation:

\[ h(x) = O(g(x)) \]

means

\[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \]

for some $c \neq 0$

e.g., $x^2 - 5000x + 123456778 = O(x^2)$

e.g., $e^{x}/(1+e^{x}) = O(1)$

Useful to look at over-all scaling, hiding details

Also done when the limit is $x\rightarrow 0$

How Much Work is Gradient Descent?

Pro:

For nice $f$, $f(\theta) \leq f(\theta^*)+\epsilon$ in $O(\epsilon^{-2})$ iterations
- For very nice $f$, only $O(\log{\epsilon^{-1}})$ iterations
To get $\nabla f(\theta)$, take $p$ derivatives, $\therefore$ each iteration costs $O(p)$

Con:

Taking derivatives can slow down as data grows — each iteration might really be $O(np)$

Taylor Series

What if we do a quadratic approximation to $f$?

\[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Special cases of general idea of Taylor approximation

Simplifies if $x_0$ is a minimum since then $f^{\prime}(x_0) = 0$:

\[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]

Minimizing a Quadratic

If we know \[ f(x) = ax^2 + bx + c \]

we minimize exactly:

\[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \]

If \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \]

then \[ x^* = x_0 - a^{-1}b \]

Newton’s Method

Taylor-expand for the value at the minimum $\theta^*$ \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \]

Take gradient, set to zero, solve for $\theta^*$: \[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \]

Works exactly if $f$ is quadratic
and $\mathbf{H}^{-1}$ exists, etc.

If $f$ isn’t quadratic, keep pretending it is until we get close to $\theta^*$, when it will be nearly true

Newton’s Method: The Algorithm

Start with guess for $\theta$
While ((not too tired) and (making adequate progress))
- Find gradient $\nabla f(\theta)$ and Hessian $\mathbf{H}(\theta)$
- Set $\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)$
Return final $\theta$ as approximation to $\theta^*$

Like gradient descent, but with inverse Hessian giving the step-size

“This is about how far you can go with that gradient”

Advantages and Disadvantages of Newton’s Method

Pros:

Step-sizes chosen adaptively through 2nd derivatives, much harder to get zig-zagging, over-shooting, etc.
Also guaranteed to need $O(\epsilon^{-2})$ steps to get within $\epsilon$ of optimum
Only $O(\log\log{\epsilon^{-1}})$ for very nice functions
Typically many fewer iterations than gradient descent

Advantages and Disadvantages of Newton’s Method

Cons:

Hopeless if $\mathbf{H}$ doesn’t exist or isn’t invertible
Need to take $O(p^2)$ second derivatives plus $p$ first derivatives
Need to solve $\mathbf{H} \theta_{\mathrm{new}} = \mathbf{H} \theta_{\mathrm{old}} - \nabla f(\theta_{\mathrm{old}})$ for $\theta_{\mathrm{new}}$
- inverting $\mathbf{H}$ is $O(p^3)$, but cleverness gives $O(p^2)$ for solving for $\theta_{\mathrm{new}}$

Getting Around the Hessian

Want to use the Hessian to improve convergence

Don’t want to have to keep computing the Hessian at each step

Approaches:

Use knowledge of the system to get some approximation to the Hessian, use that instead of taking derivatives (“Fisher scoring”)
Use only diagonal entries ($p$ unmixed 2nd derivatives)
Use $\mathbf{H}(\theta)$ at initial guess, hope $\mathbf{H}$ changes very slowly with $\theta$
Re-compute $\mathbf{H}(\theta)$ every $k$ steps, $k > 1$
Fast, approximate updates to the Hessian at each step (BFGS)

Other Methods

Lots!
See bonus slides at end for for “Nedler-Mead”, a.k.a. “the simplex method”, which doesn’t need any derivatives
See bonus slides for the meta-method “coordinate descent”

Curve-Fitting by Optimizing

We have data $(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$

We also have possible curves, $r(x;\theta)$

e.g., $r(x) = x \cdot \theta$

e.g., $r(x) = \theta_1 x^{\theta_2}$

e.g., $r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}$ for fixed “basis” functions $b_j$

Curve-Fitting by Optimizing

Least-squares curve fitting:

\[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]

“Robust” curve fitting: \[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]

Optimization in R: optim()

optim(par, fn, gr, method, control, hessian)

fn: function to be minimized; mandatory
par: initial parameter guess; mandatory
gr: gradient function; only needed for some methods
method: defaults to a gradient-free method (``Nedler-Mead’’), could be BFGS (Newton-ish)
control: optional list of control settings
- (maximum iterations, scaling, tolerance for convergence, etc.)
hessian: should the final Hessian be returned? default FALSE

Return contains the location ($par) and the value ($val) of the optimum, diagnostics, possibly $hessian

Optimization in R: optim()

gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
library(numDeriv)
mse <- function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse <- function(theta) { grad(func=mse,x=theta) }
theta0=c(5000,0.15)
fit1 <- optim(theta0,mse,grad.mse,method="BFGS",hessian=TRUE)

fit1: Newton-ish BFGS method

fit1[1:3]

## $par
## [1] 6493.2564    0.1277
## 
## $value
## [1] 61853983
## 
## $counts
## function gradient 
##       63       11

fit1: Newton-ish BFGS method

fit1[4:6]

## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##           [,1]      [,2]
## [1,]      52.5 4.422e+06
## [2,] 4422070.4 3.757e+11

nls

optim is a general-purpose optimizer

So is nlm — try them both if one doesn’t work

nls is for nonlinear least squares

nls

nls(formula, data, start, control, [[many other options]])

formula: Mathematical expression with response variable, predictor variable(s), and unknown parameter(s)
data: Data frame with variable names matching formula
start: Guess at parameters (optional)
control: Like with optim (optional)

Returns an nls object, with fitted values, prediction methods, etc.

The default optimization is a version of Newton’s method

fit2: Fitting the Same Model with nls()

fit2 <- nls(pcgmp~y0*pop^a,data=gmp,start=list(y0=5000,a=0.1))
summary(fit2)

## 
## Formula: pcgmp ~ y0 * pop^a
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## y0 6.49e+03   8.57e+02    7.58  2.9e-13 ***
## a  1.28e-01   1.01e-02   12.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7890 on 364 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 1.75e-07

fit2: Fitting the Same Model with nls()

plot(pcgmp~pop,data=gmp)
pop.order <- order(gmp$pop)
lines(gmp$pop[pop.order],fitted(fit2)[pop.order])
curve(fit1$par[1]*x^fit1$par[2],add=TRUE,lty="dashed",col="blue")

plot of chunk unnamed-chunk-5

Summary

Trade-offs: complexity of iteration vs. number of iterations vs. precision of approximation
- Gradient descent: less complex iterations, more guarantees, less adaptive
- Newton: more complex iterations, but few of them for good functions, more adaptive, less robust
Start with pre-built code like optim or nls, implement your own as needed

Nelder-Mead, a.k.a. the Simplex Method

Try to cage $\theta^*$ with a simplex of $p+1$ points

Order the trial points, $f(\theta_1) \leq f(\theta_2) \ldots \leq f(\theta_{p+1})$

$\theta_{p+1}$ is the worst guess — try to improve it

Center of the not-worst = $\theta_0 = \frac{1}{n}\sum_{i=1}^{n}{\theta_i}$

Nelder-Mead, a.k.a. the Simplex Method

Try to improve the worst guess $\theta_{p+1}$

Reflection: Try $\theta_0 - (\theta_{p+1}-\theta_0)$, across the center from $\theta_{p+1}$
- if it’s better than $\theta_p$ but not than $\theta_1$, replace the old $\theta_{p+1}$ with it
- Expansion: if the reflected point is the new best, try $\theta_0 - 2(\theta_{p+1}-\theta_0)$; replace the old $\theta_{p+1}$ with the better of the reflected and the expanded point
Contraction: If the reflected point is worse that $\theta_p$, try $\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}$; if the contracted value is better, replace $\theta_{p+1}$ with it
Reduction: If all else fails, $\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}$
Go back to (1) until we stop improving or run out of time

Making Sense of Nedler-Mead

The Moves:

Reflection: try the opposite of the worst point
Expansion: if that really helps, try it some more
Contraction: see if we overshot when trying the opposite
Reduction: if all else fails, try making each point more like the best point

Making Sense of Nedler-Mead

Pros:

Each iteration $\leq 4$ values of $f$, plus sorting
(and sorting is at most $O(p\log{p})$, usually much better)
No derivatives used, can even work for dis-continuous $f$

Con: - Can need many more iterations than gradient methods

Coordinate Descent

Gradient descent, Newton’s method, simplex, etc., adjust all coordinates of $\theta$ at once — gets harder as the number of dimensions $p$ grows

Coordinate descent: never do more than 1D optimization

Start with initial guess $\theta$
While ((not too tired) and (making adequate progress))
- For $i \in (1:p)$
  - do 1D optimization over $i^{\mathrm{th}}$ coordinate of $\theta$, holding the others fixed
  - Update $i^{\mathrm{th}}$ coordinate to this optimal value
Return final value of $\theta$

Coordinate Descent

Cons:

Needs a good 1D optimizer
Can bog down for very tricky functions, especially with lots of interactions among variables

Pros:

Can be extremely fast and simple

Lecture 17: Numerical Optimization

36-350

22 October 2014

Agenda

Examples of Optimization Problems

Optimization Problems

Considerations

You remember calculus, right?

You remember calculus, right?

Gradients and Changes to f

Gradient Descent

Pros and Cons of Gradient Descent

Scaling

How Much Work is Gradient Descent?

Taylor Series

Minimizing a Quadratic

Newton’s Method

Newton’s Method: The Algorithm

Advantages and Disadvantages of Newton’s Method

Advantages and Disadvantages of Newton’s Method

Getting Around the Hessian

Other Methods

Curve-Fitting by Optimizing

Curve-Fitting by Optimizing

Optimization in R: optim()

Optimization in R: optim()

fit1: Newton-ish BFGS method

fit1: Newton-ish BFGS method

nls

nls

fit2: Fitting the Same Model with nls()

fit2: Fitting the Same Model with nls()

Summary

Nelder-Mead, a.k.a. the Simplex Method

Nelder-Mead, a.k.a. the Simplex Method

Making Sense of Nedler-Mead

Making Sense of Nedler-Mead

Coordinate Descent

Coordinate Descent