Lecture 6, Optimization Algorithms

36-462/662, Spring 2022

3 February 2022

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\optimum}{\optimand^*} \newcommand{\ObjFunc}{{M}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \]

Previously

Today

What about more than one dimension?

No slope in any direction: the first-order condition

No slope in any direction

First-order condition or first-order conditions?

The function increases in every direction: the second-order condition

Positive-definite matrices

The first- and second-order conditions for minima

For \(\optimum\) to be a local minimum, we need \(\nabla \ObjFunc(\optimum) = 0\) (the first-order condition) and we need the Hessian \(\Hessian(\optimum)\) to be positive semi-definite (the second-order condition); \(\nabla \ObjFunc(\optimum) = 0\) together with a positive-definite \(\Hessian(\optimum)\) is sufficient for a strict local minimum.

Near a minimum, nice functions look quadratic
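One way to spell this out is a second-order Taylor expansion around \(\optimum\); the linear term vanishes because the gradient is zero there:

\[ \ObjFunc(\optimand) \approx \ObjFunc(\optimum) + \frac{1}{2}(\optimand - \optimum)^T \Hessian(\optimum) (\optimand - \optimum) \]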

Minimizing risk vs. minimizing empirical risk

Finding the minimum: optimization algorithms

How do we build an optimization algorithm?

Optimizing by equation-solving
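As an illustration (the quadratic objective and the particular matrices here are made up for the example, not taken from the slides): if \(\ObjFunc(\optimand) = \frac{1}{2}\optimand^T \mathbf{A} \optimand - \mathbf{b}^T \optimand\) with \(\mathbf{A}\) positive-definite, the first-order condition \(\nabla \ObjFunc(\optimand) = \mathbf{A}\optimand - \mathbf{b} = 0\) is a linear system we can hand to a solver:

A <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)  # a positive-definite matrix (illustrative)
b <- c(1, -1)
theta.star <- solve(A, b)  # solves A %*% theta = b, i.e., the first-order condition
theta.star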

Pros and cons of the solve-the-equations approach

Go back to the calculus

Constant-step-size gradient descent

while ((not too tired) and (making adequate progress)) {
   Find \(\nabla \ObjFunc(\optimand^{(t)})\)
   Set \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - a \nabla \ObjFunc(\optimand^{(t)})\)
}
return (final \(\optimand\))
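A minimal runnable R version of this loop (the gradient function grad.f, the default step size, and the stopping rule are illustrative assumptions; the pseudocode above leaves them abstract):

# Constant-step-size gradient descent: a sketch
gradient.descent <- function(grad.f, theta0, a = 0.1, max.iter = 1000, tol = 1e-8) {
    theta <- theta0
    for (t in 1:max.iter) {
        step <- a * grad.f(theta)           # move against the gradient
        theta <- theta - step
        if (sqrt(sum(step^2)) < tol) break  # crude "making adequate progress" check
    }
    return(theta)
}
# Example: minimize sum(theta^2), whose gradient is 2*theta
gradient.descent(grad.f = function(theta) 2*theta, theta0 = c(1, 1))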

Constant-step-size gradient descent

Gradient descent is basic, but powerful

Beyond gradient descent: Newton’s method
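Newton's method replaces the fixed step size with the inverse Hessian, \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \Hessian^{-1}(\optimand^{(t)}) \nabla \ObjFunc(\optimand^{(t)})\). A minimal R sketch, assuming we can supply both the gradient and the Hessian (the function names are illustrative):

# Newton's method: a sketch
newton.method <- function(grad.f, hess.f, theta0, max.iter = 100, tol = 1e-8) {
    theta <- theta0
    for (t in 1:max.iter) {
        step <- solve(hess.f(theta), grad.f(theta))  # Hessian^{-1} times gradient, without forming the inverse
        theta <- theta - step
        if (sqrt(sum(step^2)) < tol) break
    }
    return(theta)
}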

Pros of Newton’s method

Cons of Newton’s method

Gradient methods with big data

\[ \EmpRisk(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \theta))} \]

A way out: sampling is an unbiased estimate
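In symbols: write \(\EmpRisk_i(\optimand) = \Loss(y_i, s(x_i; \optimand))\) for the loss on the \(i^{\mathrm{th}}\) data point, so \(\EmpRisk = \frac{1}{n}\sum_{i=1}^{n}{\EmpRisk_i}\). If \(I\) is drawn uniformly on \(1:n\), then

\[ \Expect{\nabla \EmpRisk_I(\optimand)} = \frac{1}{n}\sum_{i=1}^{n}{\nabla \EmpRisk_i(\optimand)} = \nabla \EmpRisk(\optimand) \]

so the gradient computed from one randomly chosen data point is an unbiased (if noisy) estimate of the full gradient.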

Stochastic gradient descent

  1. Start with initial guess \(\optimand^{(0)}\), adjustment rate \(a\)
  2. While ((not too tired) and (making adequate progress))
    1. At \(t^{\mathrm{th}}\) iteration, pick random \(I\) uniformly on \(1:n\)
    2. Set \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \frac{a}{t}\nabla \EmpRisk_{I}(\optimand^{(t)})\)
  3. Return final \(\optimand\)
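A minimal runnable R version of this recipe (the per-observation gradient function grad.loss.i and the data layout are illustrative assumptions; the \(a/t\) decay of the step size follows step 2.2 above):

# Stochastic gradient descent: a sketch
# grad.loss.i(theta, x.i, y.i) should return the gradient of the loss on one data point
sgd <- function(grad.loss.i, x, y, theta0, a = 0.1, max.iter = 10000) {
    theta <- theta0
    n <- nrow(x)
    for (t in 1:max.iter) {
        i <- sample(1:n, size = 1)  # pick a data point uniformly at random
        theta <- theta - (a/t) * grad.loss.i(theta, x[i, ], y[i])
    }
    return(theta)
}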

Stochastic gradient descent (2)

Pros and cons of stochastic gradient methods

More optimization algorithms

Estimation error vs. optimization error

Estimation error vs. optimization error (2)

\[ \text{risk} = \text{minimal risk} + \text{approximation error} + \text{estimation error} + \text{optimization error} \]

Don’t bother optimizing more precisely than the noise in the data will support (Bottou and Bousquet 2012)

What do we do in R?

optim(par, fn, gr, method, ...)
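Here par is the vector of starting parameter values, fn the function to be minimized, gr an optional function returning the gradient (if omitted, optim() falls back on finite-difference approximations for the gradient-based methods), and method picks the algorithm (e.g., "BFGS" below); see help(optim) for the full argument list.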

No, really, what do we do in R?

my.fn <- function(t) {
    # Negative of a damped cosine in the distance from the origin;
    # global minimum of -1 at t = c(0, 0)
    - exp(-0.25*sqrt(t[1]^2+t[2]^2))*cos(sqrt(t[1]^2+t[2]^2))
}

No, really, what do we do in R?

my.fit <- optim(par=c(1,1), fn=my.fn, method="BFGS") # Starting here is dumb!
str(my.fit)
## List of 5
##  $ par        : num [1:2] 4.96e-10 4.96e-10
##  $ value      : num -1
##  $ counts     : Named int [1:2] 41 13
##   ..- attr(*, "names")= chr [1:2] "function" "gradient"
##  $ convergence: int 0
##  $ message    : NULL
my.fit$par    # Location of the minimum
## [1] 4.956205e-10 4.956205e-10
my.fit$value  # Value at the minimum
## [1] -1

What if optim() isn’t enough?

Summing up

Backup: More about second-order conditions

Backup: Representation vs. Reality

Backup: Why are there so many different optimization algorithms?

References

Albert, Arthur E., and Leland A. Gardner, Jr. 1967. Stochastic Approximation and Nonlinear Regression. Cambridge, Massachusetts: MIT Press.

Bottou, Léon, and Olivier Bousquet. 2012. “The Tradeoffs of Large Scale Learning.” In Optimization for Machine Learning, edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, 351–68. Cambridge, Massachusetts: MIT Press. http://leon.bottou.org/publications/pdf/mloptbook-2011.pdf.

Culberson, Joseph C. 1998. “On the Futility of Blind Search: An Algorithmic View of ‘No Free Lunch’.” Evolutionary Computation 6:109–27. http://www.cs.ualberta.ca/~joe/Abstracts/TR96-18.html.

Nevel’son, M. B., and R. Z. Has’minskiĭ. n.d. Stochastic Approximation and Recursive Estimation. Providence, Rhode Island: American Mathematical Society.

Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” Annals of Mathematical Statistics 22:400–407. https://doi.org/10.1214/aoms/1177729586.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323:533–36. https://doi.org/10.1038/323533a0.

Traub, J. F., and A. G. Werschulz. 1998. Complexity and Information. Lezioni Lincee. Cambridge, England: Cambridge University Press.

Wolpert, David H., and William G. Macready. 1997. “No Free Lunch Theorems for Optimization.” IEEE Transactions on Evolutionary Computation 1:67–82. https://doi.org/10.1109/4235.585893.


  1. After L. O. Hesse, 1811–1874