optim, nls
Reading: Recipes 13.1 and 13.2 in The R Cookbook
Optional reading: 1.1, 2.1 and 2.2 in Red Plenty
Given an objective function \(f: \mathcal{D} \rightarrow \mathbb{R}\), find
\[ \theta^* = \operatorname*{argmin}_{\theta}{f(\theta)} \]
Basics: maximizing \(f\) is minimizing \(-f\):
\[ \operatorname*{argmax}_{\theta}{f(\theta)}= \operatorname*{argmin}_{\theta}{-f(\theta)} \]
If \(h\) is strictly increasing (e.g., \(\log\)), then
\[ \operatorname*{argmin}_{\theta}{f(\theta)} = \operatorname*{argmin}_{\theta}{h(f(\theta))} \]
Approximation: How close can we get to \(\theta^*\), and/or \(f(\theta^*)\)?
Time complexity: How many computer steps does that take?
Varies with precision of approximation, niceness of \(f\), size of \(\mathcal{D}\), size of data, method…
Most optimization algorithms use successive approximation, so distinguish number of iterations from cost of each iteration
Suppose \(x\) is one-dimensional and \(f\) is smooth. If \(x^*\) is an interior extremum (minimum or maximum), then
\[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]
If \(x^*\) is a minimum, \[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0 \]
This all carries over to multiple dimensions:
At an interior extremum, \[ \nabla f(\theta^*) = 0 \]
At an interior minimum, \[ \nabla^2 f(\theta^*) \geq 0 \] meaning for any vector \(v\), \[ v^T \nabla^2 f(\theta^*) v \geq 0 \] \(\nabla^2 f =\) the Hessian, \(\mathbf{H}\)
\(\theta^*\) might just be a local minimum
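These conditions can be checked numerically; a small sketch with a toy function of our own, using the numDeriv package that the examples below also load:

library(numDeriv)   # numerical derivatives: grad() and hessian()
f <- function(th) { (th[1]-1)^2 + th[1]*th[2] + (th[2]+2)^2 }
th.star <- c(8/3, -10/3)                      # solves the first-order conditions by hand
grad(func=f, x=th.star)                       # approximately c(0, 0)
eigen(hessian(func=f, x=th.star))$values      # both positive: an interior minimum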
\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \]
\[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \]
Locally, the function looks linear; to minimize a linear function, move down the slope
Multivariate version: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \]
\(\nabla f(\theta_0)\) points in the direction of fastest ascent at \(\theta_0\)
Gradient descent: repeatedly take a small step downhill, \(\theta \leftarrow \theta - \eta \nabla f(\theta)\), for a step size \(\eta > 0\)
Variations: adaptively adjust \(\eta\) to make sure of improvement, or search along the gradient direction for the minimum
Pro:
- Requires only the gradient, and always moves in the direction of fastest immediate improvement
- With a small enough \(\eta\), eventually reaches a local minimum
Cons:
- “Small enough” \(\eta\) can be very small, making progress slow
- Zig-zags when the components of \(\nabla f\) have very different sizes
- Only finds a local minimum
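A minimal gradient-descent sketch with a fixed step size; the function name grad.descent, the default eta, and the stopping rule are our own choices, not from the notes, and the gradient comes from the numDeriv package used in the examples below:

library(numDeriv)                         # numerical gradients via grad()
grad.descent <- function(f, theta, eta=0.01, max.iter=1000, tol=1e-8) {
  for (i in 1:max.iter) {
    step <- eta * grad(func=f, x=theta)   # eta times the local gradient
    theta <- theta - step                 # move downhill
    if (max(abs(step)) < tol) { break }   # stop once the steps become negligible
  }
  theta
}
grad.descent(function(th) { (th[1]-1)^2 + 2*(th[2]+3)^2 }, c(0,0))  # approx c(1, -3)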
How much work do we need?
Big-\(O\) notation:
\[ h(x) = O(g(x)) \]
means
\[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \]
for some \(c \neq 0\)
e.g., \(x^2 - 5000x + 123456778 = O(x^2)\)
e.g., \(e^{x}/(1+e^{x}) = O(1)\)
Useful to look at overall scaling, hiding the details
Also done when the limit is \(x\rightarrow 0\)
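To see the first example numerically (a small illustration of our own), the ratio \(h(x)/g(x)\) settles down to a constant as \(x\) grows:

x <- 10^(1:7)
(x^2 - 5000*x + 123456778)/x^2   # tends to the constant c = 1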
Pro:
- Each iteration of gradient descent is cheap: one gradient means \(p\) partial derivatives, each \(O(n)\) with \(n\) data points, so \(O(np)\) per step
Con:
- The number of iterations can be large, growing with the precision demanded and with how badly-behaved \(f\) is
What if we do a quadratic approximation to \(f\)?
\[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]
Both the linear and the quadratic approximations are special cases of the general idea of Taylor approximation
Simplifies if \(x_0\) is a minimum since then \(f^{\prime}(x_0) = 0\):
\[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]
Near a minimum, smooth functions look like parabolas
Carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]
If we know \[ f(x) = ax^2 + bx + c \]
we minimize exactly:
\[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \]
If \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \]
then \[ x^* = x_0 - a^{-1}b \]
Taylor-expand for the value at the minimum \(\theta^*\) \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \cdot \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \]
Take gradient, set to zero, solve for \(\theta^*\): \[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \]
Works exactly if \(f\) is quadratic
and \(\mathbf{H}^{-1}\) exists, etc.
If \(f\) isn’t quadratic, keep pretending it is until we get close to \(\theta^*\), when it will be nearly true
Like gradient descent, but with inverse Hessian giving the step-size
“This is about how far you can go with that gradient”
Pros:
- The Hessian sets both the direction and the size of the step, so there is much less zig-zagging or overshooting
- Typically needs many fewer iterations than gradient descent
Cons:
- Hopeless if \(\mathbf{H}\) doesn't exist or isn't invertible
- Each iteration needs \(O(p^2)\) second derivatives on top of the gradient, plus solving a \(p \times p\) linear system
- Far from \(\theta^*\), the quadratic approximation can be poor, so the steps can misbehave
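A minimal Newton's-method sketch under the same assumptions (smooth \(f\), invertible Hessian); the function name and stopping rule are ours, with derivatives again from numDeriv:

newton <- function(f, theta, max.iter=100, tol=1e-8) {
  for (i in 1:max.iter) {
    # solve H delta = -gradient for the step, rather than inverting H explicitly
    delta <- solve(hessian(func=f, x=theta), -grad(func=f, x=theta))
    theta <- theta + delta
    if (max(abs(delta)) < tol) { break }
  }
  theta
}
newton(function(th) { (th[1]-1)^2 + 2*(th[2]+3)^2 }, c(0,0))  # quadratic f: one step lands on c(1, -3)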
Want to use the Hessian to improve convergence
Don’t want to have to keep computing the Hessian at each step
Approaches:
- Use a fixed matrix in place of the true Hessian
- Re-compute the Hessian only every few steps
- Build up an approximation to the Hessian from successive gradients: quasi-Newton methods such as BFGS, the method="BFGS" option to optim used below
We have data \((x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)\)
We also have possible curves, \(r(x;\theta)\)
e.g., \(r(x) = x \cdot \theta\)
e.g., \(r(x) = \theta_1 x^{\theta_2}\)
e.g., \(r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}\) for fixed “basis” functions \(b_j\)
Least-squares curve fitting:
\[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]
“Robust” curve fitting, where the loss \(\psi\) grows more slowly than the square, so outliers count for less (e.g., Huber's loss): \[ \hat{\theta} = \operatorname*{argmin}_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]
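A sketch of the robust criterion with Huber's loss playing the role of \(\psi\); the toy data, the constant c=1.345, and all the names here are illustrative assumptions of ours, not from the notes:

huber <- function(u, c=1.345) { ifelse(abs(u) <= c, u^2/2, c*(abs(u) - c/2)) }
set.seed(1)                                    # reproducible toy data
x <- seq(1, 100, length.out=50)
y <- 2*x^0.5 + rnorm(50, sd=0.5)
y[25] <- 100                                   # plant one gross outlier
huber.crit <- function(theta) { mean(huber(y - theta[1]*x^theta[2])) }
optim(c(1, 0.7), huber.crit)$par               # close to c(2, 0.5) despite the outlier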
optim(par, fn, gr, method, control, hessian)
- fn: function to be minimized; mandatory
- par: initial parameter guess; mandatory
- gr: gradient function; only needed for some methods
- method: defaults to a gradient-free method (“Nelder-Mead”), could be BFGS (Newton-ish)
- control: optional list of control settings
- hessian: should the final Hessian be returned? default FALSE
Return contains the location ($par) and the value ($value) of the optimum, diagnostics, possibly $hessian
gmp <- read.table("gmp.dat")           # metropolitan-area data: gmp and pcgmp (per-capita GMP)
gmp$pop <- gmp$gmp/gmp$pcgmp           # implied population
library(numDeriv)                      # numerical derivatives: grad()
# mean squared error of the power-law curve pcgmp = theta[1]*pop^theta[2]
mse <- function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse <- function(theta) { grad(func=mse, x=theta) }
theta0 <- c(5000, 0.15)                # initial guess at the parameters
fit1 <- optim(theta0, mse, grad.mse, method="BFGS", hessian=TRUE)
fit1[1:3]
## $par
## [1] 6493.2564 0.1277
##
## $value
## [1] 61853983
##
## $counts
## function gradient
## 63 11
fit1[4:6]
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] 52.5 4.422e+06
## [2,] 4422070.4 3.757e+11
optim is a general-purpose optimizer; so is nlm. Try them both if one doesn't work.
nls is for nonlinear least squares
nls(formula, data, start, control, [[many other options]])
- formula: mathematical expression with response variable, predictor variable(s), and unknown parameter(s)
- data: data frame with variable names matching formula
- start: guess at parameters (optional)
- control: like with optim (optional)
Returns an nls object, with fitted values, prediction methods, etc.
The default optimization algorithm is Gauss-Newton, a version of Newton's method
fit2 <- nls(pcgmp~y0*pop^a,data=gmp,start=list(y0=5000,a=0.1))
summary(fit2)
##
## Formula: pcgmp ~ y0 * pop^a
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## y0 6.49e+03 8.57e+02 7.58 2.9e-13 ***
## a 1.28e-01 1.01e-02 12.61 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7890 on 364 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 1.75e-07
plot(pcgmp~pop, data=gmp)                              # per-capita GMP against population
pop.order <- order(gmp$pop)                            # sort by population so the line draws cleanly
lines(gmp$pop[pop.order], fitted(fit2)[pop.order])     # nls fit
curve(fit1$par[1]*x^fit1$par[2], add=TRUE, lty="dashed", col="blue")  # optim fit, for comparison
Use optim or nls; implement your own as needed
Nelder-Mead, a.k.a. the simplex method: try to cage \(\theta^*\) with a simplex of \(p+1\) points
Order the trial points, \(f(\theta_1) \leq f(\theta_2) \leq \ldots \leq f(\theta_{p+1})\)
\(\theta_{p+1}\) is the worst guess — try to improve it
Center of the not-worst points: \(\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}\)
The moves, each trying to improve the worst guess \(\theta_{p+1}\):
- Reflection: try the mirror image of \(\theta_{p+1}\) on the far side of \(\theta_0\)
- Expansion: if the reflected point is the new best, try moving even further in that direction
- Contraction: if the reflected point is still the worst, try a point partway between \(\theta_{p+1}\) and \(\theta_0\)
- Shrinkage: if nothing helps, pull all the points in toward the best point \(\theta_1\)
Pros:
- Needs no derivatives, only function evaluations
- Works even when \(f\) is not smooth
Con:
- Can need many more iterations than gradient methods
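Since Nelder-Mead is optim's default method, the power-law fit from above can be re-run without any gradient at all; a one-line variation on fit1 (the name fit3 is ours):

fit3 <- optim(theta0, mse)   # no gr argument: optim falls back to Nelder-Mead
fit3$par                     # lands near fit1$par, typically after more function evaluations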
Gradient descent, Newton’s method, simplex, etc., adjust all coordinates of \(\theta\) at once — gets harder as the number of dimensions \(p\) grows
Coordinate descent: never do more than a 1D optimization; cycle through the coordinates, minimizing over each one with the others held fixed
Cons:
- Needs a reliable 1D optimizer for the sub-problems
- Can need many sweeps through the coordinates when they are strongly coupled
Pros:
- Each 1D sub-problem is fast and easy, as in the sketch below
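A minimal coordinate-descent sketch, assuming box constraints so each 1D problem can go to base R's optimize(); the function name and the sweep/tolerance defaults are our own:

coord.descent <- function(f, theta, lower, upper, max.sweeps=100, tol=1e-8) {
  for (s in 1:max.sweeps) {
    theta.old <- theta
    for (j in seq_along(theta)) {
      # minimize over coordinate j, holding the other coordinates fixed
      f.1d <- function(z) { th <- theta; th[j] <- z; f(th) }
      theta[j] <- optimize(f.1d, lower=lower[j], upper=upper[j])$minimum
    }
    if (max(abs(theta - theta.old)) < tol) { break }  # a full sweep changed nothing
  }
  theta
}
coord.descent(function(th) { (th[1]-1)^2 + 2*(th[2]+3)^2 }, c(0,0), c(-10,-10), c(10,10))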