Lecture 6, Optimization Algorithms

36-462/662, Spring 2022

3 February 2022

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\optimum}{\optimand^*} \newcommand{\ObjFunc}{{M}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \]

Previously

Today

What about more than one dimension?

No slope in any direction: the first-order condition

No slope in any direction

First-order condition or first-order conditions?

The function increases in every direction: the second-order condition

Positive-definite matrices

The first- and second-order conditions for minima

For \(\optimum\) to be a local minimum, we need \(\nabla \ObjFunc(\optimum) = 0\) (the first-order condition) and we need the Hessian \(\Hessian(\optimum)\) to be positive semi-definite (the second-order condition); \(\nabla \ObjFunc(\optimum) = 0\) together with a positive-definite \(\Hessian(\optimum)\) is sufficient for a strict local minimum.

Near a minimum, nice functions look quadratic
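One way to spell this out is a second-order Taylor expansion around \(\optimum\); the linear term vanishes because the gradient is zero there:

\[ \ObjFunc(\optimand) \approx \ObjFunc(\optimum) + \frac{1}{2}(\optimand - \optimum)^T \Hessian(\optimum) (\optimand - \optimum) \]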

Minimizing risk vs. minimizing empirical risk

Finding the minimum: optimization algorithms

How do we build an optimization algorithm?

Optimizing by equation-solving
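As an illustration (the quadratic objective and the particular matrices here are made up for the example, not taken from the slides): if \(\ObjFunc(\optimand) = \frac{1}{2}\optimand^T \mathbf{A} \optimand - \mathbf{b}^T \optimand\) with \(\mathbf{A}\) positive-definite, the first-order condition \(\nabla \ObjFunc(\optimand) = \mathbf{A}\optimand - \mathbf{b} = 0\) is a linear system we can hand to a solver:

A <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)  # a positive-definite matrix (illustrative)
b <- c(1, -1)
theta.star <- solve(A, b)  # solves A %*% theta = b, i.e., the first-order condition
theta.star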

Pros and cons of the solve-the-equations approach

Go back to the calculus

Constant-step-size gradient descent

while ((not too tired) and (making adequate progress)) {
   Find \(\nabla \ObjFunc(\optimand^{(t)})\)
   Set \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - a \nabla \ObjFunc(\optimand^{(t)})\)
}
return (final \(\optimand\))
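A minimal runnable R version of this loop (the gradient function grad.f, the default step size, and the stopping rule are illustrative assumptions; the pseudocode above leaves them abstract):

# Constant-step-size gradient descent: a sketch
gradient.descent <- function(grad.f, theta0, a = 0.1, max.iter = 1000, tol = 1e-8) {
    theta <- theta0
    for (t in 1:max.iter) {
        step <- a * grad.f(theta)           # move against the gradient
        theta <- theta - step
        if (sqrt(sum(step^2)) < tol) break  # crude "making adequate progress" check
    }
    return(theta)
}
# Example: minimize sum(theta^2), whose gradient is 2*theta
gradient.descent(grad.f = function(theta) 2*theta, theta0 = c(1, 1))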

Constant-step-size gradient descent

Gradient descent is basic, but powerful

Beyond gradient descent: Newton’s method
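Newton's method replaces the fixed step size with the inverse Hessian, \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \Hessian^{-1}(\optimand^{(t)}) \nabla \ObjFunc(\optimand^{(t)})\). A minimal R sketch, assuming we can supply both the gradient and the Hessian (the function names are illustrative):

# Newton's method: a sketch
newton.method <- function(grad.f, hess.f, theta0, max.iter = 100, tol = 1e-8) {
    theta <- theta0
    for (t in 1:max.iter) {
        step <- solve(hess.f(theta), grad.f(theta))  # Hessian^{-1} times gradient, without forming the inverse
        theta <- theta - step
        if (sqrt(sum(step^2)) < tol) break
    }
    return(theta)
}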

Pros of Newton’s method

Cons of Newton’s method

Gradient methods with big data

\[ \EmpRisk(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \theta))} \]

A way out: sampling is an unbiased estimate
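In symbols: write \(\EmpRisk_i(\optimand) = \Loss(y_i, s(x_i; \optimand))\) for the loss on the \(i^{\mathrm{th}}\) data point, so \(\EmpRisk = \frac{1}{n}\sum_{i=1}^{n}{\EmpRisk_i}\). If \(I\) is drawn uniformly on \(1:n\), then

\[ \Expect{\nabla \EmpRisk_I(\optimand)} = \frac{1}{n}\sum_{i=1}^{n}{\nabla \EmpRisk_i(\optimand)} = \nabla \EmpRisk(\optimand) \]

so the gradient computed from one randomly chosen data point is an unbiased (if noisy) estimate of the full gradient.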

Stochastic gradient descent

  1. Start with initial guess \(\optimand^{(0)}\), adjustment rate \(a\)
  2. While ((not too tired) and (making adequate progress))
    1. At \(t^{\mathrm{th}}\) iteration, pick random \(I\) uniformly on \(1:n\)
    2. Set \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \frac{a}{t}\nabla \EmpRisk_{I}(\optimand^{(t)})\)
  3. Return final \(\optimand\)
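A minimal runnable R version of this recipe (the per-observation gradient function grad.loss.i and the data layout are illustrative assumptions; the \(a/t\) decay of the step size follows step 2.2 above):

# Stochastic gradient descent: a sketch
# grad.loss.i(theta, x.i, y.i) should return the gradient of the loss on one data point
sgd <- function(grad.loss.i, x, y, theta0, a = 0.1, max.iter = 10000) {
    theta <- theta0
    n <- nrow(x)
    for (t in 1:max.iter) {
        i <- sample(1:n, size = 1)  # pick a data point uniformly at random
        theta <- theta - (a/t) * grad.loss.i(theta, x[i, ], y[i])
    }
    return(theta)
}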

Stochastic gradient descent (2)

Pros and cons of stochastic gradient methods

More optimization algorithms

Estimation error vs. optimization error

Estimation error vs. optimization error (2)

\[ \text{risk} = \text{minimal risk} + \text{approximation error} + \text{estimation error} + \text{optimization error} \]

Don’t bother optimizing more precisely than the noise in the data will support (Bottou and Bousquet 2012)

What do we do in R?

optim(par, fn, gr, method, ...)
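Here par is the vector of starting parameter values, fn the function to be minimized, gr an optional function returning the gradient (if omitted, optim() falls back on finite-difference approximations for the gradient-based methods), and method picks the algorithm (e.g., "BFGS" below); see help(optim) for the full argument list.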

No, really, what do we do in R?

my.fn <- function(t) {
    # Negative of a damped cosine in the distance from the origin;
    # global minimum of -1 at t = c(0, 0)
    - exp(-0.25*sqrt(t[1]^2+t[2]^2))*cos(sqrt(t[1]^2+t[2]^2))
}

No, really, what do we do in R?

my.fit <- optim(par=c(1,1), fn=my.fn, method="BFGS") # Starting here is dumb!
str(my.fit)
## List of 5
##  $ par        : num [1:2] 4.96e-10 4.96e-10
##  $ value      : num -1
##  $ counts     : Named int [1:2] 41 13
##   ..- attr(*, "names")= chr [1:2] "function" "gradient"
##  $ convergence: int 0
##  $ message    : NULL
my.fit$par    # Location of the minimum
## [1] 4.956205e-10 4.956205e-10
my.fit$value  # Value at the minimum
## [1] -1

What if optim() isn’t enough?

Summing up

Backup: More about second-order conditions

Backup: Representation vs. Reality

Backup: Why are there so many different optimization algorithms?

References

Albert, Arthur E., and Leland A. Gardner, Jr. 1967. Stochastic Approximation and Nonlinear Regression. Cambridge, Massachusetts: MIT Press.

Bottou, Léon, and Olivier Bousquet. 2012. “The Tradeoffs of Large Scale Learning.” In Optimization for Machine Learning, edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, 351–68. Cambridge, Massachusetts: MIT Press. http://leon.bottou.org/publications/pdf/mloptbook-2011.pdf.

Culberson, Joseph C. 1998. “On the Futility of Blind Search: An Algorithmic View of ‘No Free Lunch’.” Evolutionary Computation 6:109–27. http://www.cs.ualberta.ca/~joe/Abstracts/TR96-18.html.

Nevel’son, M. B., and R. Z. Has’minskiĭ. n.d. Stochastic Approximation and Recursive Estimation. Providence, Rhode Island: American Mathematical Society.

Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” Annals of Mathematical Statistics 22:400–407. https://doi.org/10.1214/aoms/1177729586.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323:533–36. https://doi.org/10.1038/323533a0.

Traub, J. F., and A. G. Werschulz. 1998. Complexity and Information. Lezioni Lincee. Cambridge, England: Cambridge University Press.

Wolpert, David H., and William G. Macready. 1997. “No Free Lunch Theorems for Optimization.” IEEE Transactions on Evolutionary Computation 1:67–82. https://doi.org/10.1109/4235.585893.


  1. After L. O. Hesse, 1811–1874