Lecture 17: Numerical Optimization
====
author: 36-350
date: 22 October 2014
transition: None
font-family: Garamond

Agenda
===

- Basics of optimization
- Gradient descent
- Newton's method
- Curve-fitting
- R: optim, nls

Reading: Recipes 13.1 and 13.2 in _The R Cookbook_

Optional reading: 1.1, 2.1 and 2.2 in _Red Plenty_

Examples of Optimization Problems
===

- Minimize mean-squared error of regression surface (Gauss, c. 1800)
- Maximize likelihood of distribution (Fisher, c. 1918)
- Maximize output of plywood from given supplies and factories (Kantorovich, 1939)
- Maximize output of tanks from given supplies and factories; minimize number of bombing runs to destroy factory (c. 1939--1945)
- Maximize return of portfolio for given volatility (Markowitz, 1950s)
- Minimize cost of airline flight schedule (Kantorovich...)
- Maximize reproductive fitness of an organism (Maynard Smith)

Optimization Problems
===

Given an **objective function** $f: \mathcal{D} \mapsto \mathbb{R}$, find

$\DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \theta^* = \argmin_{\theta}{f(\theta)}$

Basics: maximizing $f$ is the same as minimizing $-f$:

$\argmax_{\theta}{f(\theta)}= \argmin_{\theta}{-f(\theta)}$

If $h$ is strictly increasing (e.g., $\log$), then

$\argmin_{\theta}{f(\theta)} = \argmin_{\theta}{h(f(\theta))}$

Considerations
===

- Approximation: How close can we get to $\theta^*$, and/or to $f(\theta^*)$?
- Time complexity: How many computer steps does that take?
  + Varies with the precision of the approximation, the niceness of $f$, the size of $\mathcal{D}$, the size of the data, the method...
- Most optimization algorithms use **successive approximation**, so distinguish the number of iterations from the cost of each iteration

You remember calculus, right?
===

Suppose $x$ is one-dimensional and $f$ is smooth. If $x^*$ is an **interior** minimum / maximum / extremum point,

${\left. \frac{df}{dx} \right|}_{x=x^*} = 0$

If $x^*$ is a minimum,

${\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0$

You remember calculus, right?
===

This all carries over to multiple dimensions:

At an **interior extremum**, $\nabla f(\theta^*) = 0$

At an **interior minimum**, $\nabla^2 f(\theta^*) \geq 0$, meaning that for any vector $v$,

$v^T \nabla^2 f(\theta^*) v \geq 0$

$\nabla^2 f =$ the **Hessian**, $\mathbf{H}$

$\theta^*$ might just be a **local** minimum

Gradients and Changes to f
===

$f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}}$

$f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0)$

Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version:

$f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0)$

$\nabla f(\theta_0)$ points in the direction of fastest ascent at $\theta_0$

Gradient Descent
===

1. Start with an initial guess for $\theta$ and a step-size $\eta$
2. While ((not too tired) and (making adequate progress))
    - Find the gradient $\nabla f(\theta)$
    - Set $\theta \leftarrow \theta - \eta \nabla f(\theta)$
3. Return the final $\theta$ as the approximate $\theta^*$

Variations: adaptively adjust $\eta$ to make sure of improvement, or search along the gradient direction for the minimum (a sketch of the basic loop in R follows)
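
Gradient Descent: A Sketch in R
===

A minimal sketch of the loop above, assuming we already have a function that computes $\nabla f$; the name `grad.descent`, the toy objective, and the tuning constants are made up for illustration, not a production implementation

```{r}
# Bare-bones gradient descent: step downhill until progress stalls
grad.descent <- function(f, grad.f, theta, eta = 0.01,
                         max.iter = 1000, tol = 1e-6) {
  for (i in 1:max.iter) {
    new.theta <- theta - eta * grad.f(theta)        # step against the gradient
    if (abs(f(new.theta) - f(theta)) < tol) break   # "making adequate progress"?
    theta <- new.theta
  }
  theta
}

# Toy objective with known minimum at (1, -2)
quad <- function(theta) { sum((theta - c(1, -2))^2) }
quad.grad <- function(theta) { 2 * (theta - c(1, -2)) }
grad.descent(quad, quad.grad, theta = c(0, 0))
```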
Pros and Cons of Gradient Descent
===

Pros:

- Moves in the direction of greatest immediate improvement
- If $\eta$ is small enough, gets to a local minimum eventually, and then stops

Cons:

- "Small enough" $\eta$ can be really, really small
- Slowness or zig-zagging if the components of $\nabla f$ are of very different sizes

How much work do we need?

Scaling
===

Big-$O$ notation:

$h(x) = O(g(x))$ means $\lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c$ for some $c \neq 0$

e.g., $x^2 - 5000x + 123456778 = O(x^2)$

e.g., $e^{x}/(1+e^{x}) = O(1)$

Useful to look at the overall scaling, hiding the details

Also done when the limit is $x\rightarrow 0$

How Much Work is Gradient Descent?
===

Pros:

- For nice $f$, $f(\theta) \leq f(\theta^*)+\epsilon$ in $O(\epsilon^{-2})$ iterations
  + For _very_ nice $f$, only $O(\log{\epsilon^{-1}})$ iterations
- To get $\nabla f(\theta)$, take $p$ derivatives, $\therefore$ each iteration costs $O(p)$

Con:

- Taking derivatives can slow down as the data grows --- each iteration might really be $O(np)$

Taylor Series
===

What if we do a quadratic approximation to $f$?

$f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0)$

A special case of the general idea of Taylor approximation

Simplifies if $x_0$ is a minimum, since then $f^{\prime}(x_0) = 0$:

$f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0)$

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case:

$f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0)$

Minimizing a Quadratic
===

If we know $f(x) = ax^2 + bx + c$, we can minimize exactly:

$\begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*}$

If $f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c$, then

$x^* = x_0 - a^{-1}b$

Newton's Method
===

Taylor-expand for the value *at the minimum* $\theta^*$:

$f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \cdot \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta)$

Take the gradient, set it to zero, and solve for $\theta^*$:

$\begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*}$

Works *exactly* if $f$ is quadratic (and $\mathbf{H}^{-1}$ exists, etc.)

If $f$ isn't quadratic, keep pretending it is until we get close to $\theta^*$, when it will be nearly true

Newton's Method: The Algorithm
===

1. Start with a guess for $\theta$
2. While ((not too tired) and (making adequate progress))
    - Find the gradient $\nabla f(\theta)$ and the Hessian $\mathbf{H}(\theta)$
    - Set $\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)$
3. Return the final $\theta$ as an approximation to $\theta^*$

Like gradient descent, but with the inverse Hessian giving the step-size

"This is about how far you can go with that gradient"

(A sketch in R, parallel to the gradient-descent one, follows)
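
Newton's Method: A Sketch in R
===

A minimal sketch of the algorithm above, using numerical derivatives from the numDeriv package (also used in the optim() example later); the function name `newton`, the test function, and the stopping rule are illustrative assumptions, with none of the safeguards a serious implementation would need

```{r}
library(numDeriv)   # provides grad() and hessian()

newton <- function(f, theta, max.iter = 100, tol = 1e-6) {
  for (i in 1:max.iter) {
    g <- grad(f, theta)      # numerical gradient
    H <- hessian(f, theta)   # numerical Hessian
    step <- solve(H, g)      # solve H %*% step = g rather than inverting H
    theta <- theta - step
    if (sqrt(sum(step^2)) < tol) break
  }
  theta
}

# Rosenbrock's banana-shaped test function, with its minimum at (1, 1)
rosenbrock <- function(theta) {
  (1 - theta[1])^2 + 100*(theta[2] - theta[1]^2)^2
}
newton(rosenbrock, theta = c(0, 0))
```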
Advantages and Disadvantages of Newton's Method
===

Pros:

- Step-sizes chosen adaptively through the 2nd derivatives, so much harder to get zig-zagging, over-shooting, etc.
- Also guaranteed to need $O(\epsilon^{-2})$ steps to get within $\epsilon$ of the optimum
  + Only $O(\log\log{\epsilon^{-1}})$ for very nice functions
- Typically many fewer iterations than gradient descent

Advantages and Disadvantages of Newton's Method
===

Cons:

- Hopeless if $\mathbf{H}$ doesn't exist or isn't invertible
- Need to take $O(p^2)$ second derivatives *plus* $p$ first derivatives
- Need to solve $\mathbf{H} \theta_{\mathrm{new}} = \mathbf{H} \theta_{\mathrm{old}} - \nabla f(\theta_{\mathrm{old}})$ for $\theta_{\mathrm{new}}$
  + inverting $\mathbf{H}$ is $O(p^3)$, but cleverness gives $O(p^2)$ for solving for $\theta_{\mathrm{new}}$

Getting Around the Hessian
===

Want to use the Hessian to improve convergence

Don't want to have to keep computing the Hessian at each step

Approaches:

- Use knowledge of the system to get some approximation to the Hessian, and use that instead of taking derivatives ("Fisher scoring")
- Use only the diagonal entries (the $p$ unmixed 2nd derivatives)
- Use $\mathbf{H}(\theta)$ at the initial guess, and hope $\mathbf{H}$ changes *very* slowly with $\theta$
- Re-compute $\mathbf{H}(\theta)$ every $k$ steps, $k > 1$
- Fast, approximate updates to the Hessian at each step (BFGS)

Other Methods
===

- Lots!
- See the bonus slides at the end for "Nelder-Mead", a.k.a. "the simplex method", which doesn't need any derivatives
- See the bonus slides for the meta-method "coordinate descent"

Curve-Fitting by Optimizing
===

We have data $(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$

We also have possible curves, $r(x;\theta)$

e.g., $r(x) = x \cdot \theta$

e.g., $r(x) = \theta_1 x^{\theta_2}$

e.g., $r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}$ for fixed "basis" functions $b_j$

Curve-Fitting by Optimizing
===

Least-squares curve fitting:

$\hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}}$

"Robust" curve fitting:

$\hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}}$

(One way to code such objectives in R is sketched on the next slide)
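
Coding the Objective Functions: A Sketch
===

A sketch of what these objectives might look like in R for the power-law curve $r(x;\theta) = \theta_1 x^{\theta_2}$; the Huber loss below is just one common choice of $\psi$, and the cutoff `k` is an arbitrary illustrative value

```{r}
# Curve family: r(x; theta) = theta[1] * x^theta[2]
r <- function(x, theta) { theta[1] * x^theta[2] }

# Least-squares objective
mse.loss <- function(theta, x, y) { mean((y - r(x, theta))^2) }

# Huber loss: quadratic for small residuals, linear for large ones
huber <- function(u, k = 1) {
  ifelse(abs(u) <= k, u^2/2, k*abs(u) - k^2/2)
}

# "Robust" objective, with psi = Huber loss
robust.loss <- function(theta, x, y, k = 1) { mean(huber(y - r(x, theta), k = k)) }
```

Either function can then be handed to a general-purpose minimizer such as optim() (next slides), e.g., optim(c(5000, 0.15), robust.loss, x = x, y = y)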
Optimization in R: optim()
===

```
optim(par, fn, gr, method, control, hessian)
```

- par: initial parameter guess; mandatory
- fn: function to be minimized; mandatory
- gr: gradient function; only needed for some methods
- method: defaults to a gradient-free method ("Nelder-Mead"), could be BFGS (Newton-ish)
- control: optional list of control settings
  + (maximum iterations, scaling, tolerance for convergence, etc.)
- hessian: should the final Hessian be returned? default FALSE

Return contains the location ($par) and the value ($value) of the optimum, diagnostics, possibly $hessian

Optimization in R: optim()
===

```{r}
gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
library(numDeriv)
mse <- function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse <- function(theta) { grad(func=mse, x=theta) }
theta0 <- c(5000, 0.15)
fit1 <- optim(theta0, mse, grad.mse, method="BFGS", hessian=TRUE)
```

fit1: Newton-ish BFGS method
===

```{r}
fit1[1:3]
```

fit1: Newton-ish BFGS method
===

```{r}
fit1[4:6]
```

nls
===

optim is a general-purpose optimizer

So is nlm --- try them both if one doesn't work

nls is for nonlinear least squares

nls
===

```
nls(formula, data, start, control, [[many other options]])
```

- formula: Mathematical expression with the response variable, predictor variable(s), and unknown parameter(s)
- data: Data frame with variable names matching the formula
- start: Guess at the parameters (optional)
- control: Like with optim (optional)

Returns an nls object, with fitted values, prediction methods, etc.

The default optimization is a version of Newton's method

fit2: Fitting the Same Model with nls()
===

```{r}
fit2 <- nls(pcgmp ~ y0*pop^a, data=gmp, start=list(y0=5000, a=0.1))
summary(fit2)
```

fit2: Fitting the Same Model with nls()
===

```{r}
plot(pcgmp ~ pop, data=gmp)
pop.order <- order(gmp$pop)
lines(gmp$pop[pop.order], fitted(fit2)[pop.order])
curve(fit1$par[1]*x^fit1$par[2], add=TRUE, lty="dashed", col="blue")
```

Summary
===

1. Trade-offs: complexity of iteration vs. number of iterations vs. precision of approximation
    + Gradient descent: less complex iterations, more guarantees, less adaptive
    + Newton: more complex iterations, but few of them for good functions, more adaptive, less robust
2. Start with pre-built code like optim or nls, implement your own as needed

Nelder-Mead, a.k.a. the Simplex Method
===

Try to cage $\theta^*$ with a **simplex** of $p+1$ points

Order the trial points, $f(\theta_1) \leq f(\theta_2) \leq \ldots \leq f(\theta_{p+1})$

$\theta_{p+1}$ is the worst guess --- try to improve it

Center of the not-worst points: $\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}$

Nelder-Mead, a.k.a. the Simplex Method
===

Try to improve the worst guess $\theta_{p+1}$

1. **Reflection**: Try $\theta_0 - (\theta_{p+1}-\theta_0)$, across the center from $\theta_{p+1}$
    + if it's better than $\theta_p$ but not better than $\theta_1$, replace the old $\theta_{p+1}$ with it
    + **Expansion**: if the reflected point is the new best, try $\theta_0 - 2(\theta_{p+1}-\theta_0)$; replace the old $\theta_{p+1}$ with the better of the reflected and the expanded point
2. **Contraction**: If the reflected point is worse than $\theta_p$, try $\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}$; if the contracted value is better, replace $\theta_{p+1}$ with it
3. **Reduction**: If all else fails, $\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}$
4. Go back to (1) until we stop improving or run out of time

Making Sense of Nelder-Mead
===

The moves:

- Reflection: try the opposite of the worst point
- Expansion: if that really helps, try it some more
- Contraction: see if we overshot when trying the opposite
- Reduction: if all else fails, try making each point more like the best point

Making Sense of Nelder-Mead
===

Pros:

- Each iteration needs $\leq 4$ values of $f$, plus sorting (and sorting is at most $O(p\log{p})$, usually much better)
- No derivatives used; can even work for discontinuous $f$

Con:

- Can need _many_ more iterations than gradient methods

Coordinate Descent
===

Gradient descent, Newton's method, simplex, etc., adjust all the coordinates of $\theta$ at once --- this gets harder as the number of dimensions $p$ grows

**Coordinate descent**: never do more than a 1D optimization

- Start with an initial guess $\theta$
- While ((not too tired) and (making adequate progress))
    + For $i \in (1:p)$
        * do a 1D optimization over the $i^{\mathrm{th}}$ coordinate of $\theta$, holding the others fixed
        * update the $i^{\mathrm{th}}$ coordinate to this optimal value
- Return the final value of $\theta$ (a sketch in R follows the next slide)

Coordinate Descent
===

Cons:

- Needs a good 1D optimizer
- Can bog down for very tricky functions, especially with lots of interactions among variables

Pros:

- Can be extremely fast and simple
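
Coordinate Descent: A Sketch in R
===

A minimal sketch of coordinate descent using the built-in 1D optimizer optimize(); the search interval, the fixed number of sweeps, and the toy objective (which has a mild interaction between the coordinates) are all illustrative assumptions

```{r}
coord.descent <- function(f, theta, lower = -10, upper = 10, sweeps = 20) {
  p <- length(theta)
  for (s in 1:sweeps) {
    for (i in 1:p) {
      # 1D objective in coordinate i, holding the other coordinates fixed
      f.1d <- function(z) { theta[i] <- z; f(theta) }
      theta[i] <- optimize(f.1d, lower = lower, upper = upper)$minimum
    }
  }
  theta
}

# Toy objective: quadratic bowl plus an interaction term; minimum at (1.6, -2.4)
quad.int <- function(theta) {
  (theta[1] - 1)^2 + (theta[2] + 2)^2 + 0.5*theta[1]*theta[2]
}
coord.descent(quad.int, theta = c(0, 0))
```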