\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \newcommand{\Indicator}[1]{\mathbb{I}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{{M}} \newcommand{\optimum}{\optimand^*} \newcommand{\Hessian}{\mathbf{h}} \]

Previously

Risk of a strategy $s$ is $\Risk = \Expect{\Loss(Y, s(X))}$, expected loss on new data
Empirical risk of strategy $s$ is $\EmpRisk(s) = n^{-1}\sum_{i=1}^{n}{\Loss(Y_i, s(X_i))}$, average loss on old data
We want to find the empirical risk minimizer $\hat{s}$, \[ \hat{s} \equiv \argmin_{s \in \ModelClass}{\EmpRisk(s)} \]
We’re now going to start opening up the black box of $\argmin$

Optimization: some jargon

The function we’re trying to optimize is the objective function, let’s say $\ObjFunc$ today
The argument to $\ObjFunc$ is (say) $\optimand$
- Some people call this the optimand
The set of possible values of $\optimand$ is $\OptDomain$, the domain or feasible set, whose dimension is (say) $\OptDim$
Optimization can be minimization or maximization, as we like; we’ll stick with minimizing

Local vs. global minima

$\optimand$ is a global minimum when $\altoptimand \neq \optimand$ $\Rightarrow$ $\ObjFunc(\altoptimand) \geq \ObjFunc(\optimand)$
- Not necessarily unique!
$\optimand$ is a local minimum when $\ObjFunc(\optimand) \leq \ObjFunc(\altoptimand)$ whenever $\altoptimand$ is close enough to $\optimand$
- Every global minimum is also a local minimum
- If a global minimum exists and there’s only one local minimum anywhere, that local minimum is the global minimum
Lots of local minima tend to make it harder to find the global minimum

“The” minimum: value vs. location

If $\optimum$ is a global minimum, then $\ObjFunc(\optimum)$ is the value of the minimum or minimal value, in symbols \[ \min_{\optimand \in \OptDomain}{\ObjFunc(\optimand)} \]
But $\optimum$ itself is the location of the global minimum, in symbols \[ \argmin_{\optimand \in \OptDomain}{\ObjFunc(\optimand)} \]
Example: the minimal value of $(x-1)^2$ is 0, but the location of the minimum is $x=1$
Transformations: If $L$ is an increasing function, then $L(\ObjFunc(\optimand))$ has the same location for its minimum, but a different value
- Example: log-likelihood vs. likelihood
Both value and location can change with $\OptDomain$
- important later, when we look at constraints
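To make the likelihood vs. log-likelihood example concrete, here is a minimal sketch. The data (7 successes in 10 Bernoulli trials), the crude grid search, and all function names are assumptions made for illustration. Taking the log is an increasing transformation of the objective, so the location of the minimum should not move, while the minimal value does.

```python
import math

def neg_likelihood(theta, successes=7, trials=10):
    """Negative Bernoulli likelihood of `successes` out of `trials` (illustrative data)."""
    return -(theta ** successes) * (1 - theta) ** (trials - successes)

def neg_log_likelihood(theta, successes=7, trials=10):
    """Negative log-likelihood: an increasing transformation of neg_likelihood."""
    return -(successes * math.log(theta)
             + (trials - successes) * math.log(1 - theta))

# Crude grid over the interior of [0, 1]; the endpoints are left out
# because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]

location_plain = min(grid, key=neg_likelihood)
location_log = min(grid, key=neg_log_likelihood)

print(location_plain, location_log)        # same location for both objectives
print(neg_likelihood(location_plain),
      neg_log_likelihood(location_log))    # but different minimal values
```

Grid search is used here only because it needs no calculus; it is not meant as a recommended optimizer.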

Finding the optimum: calculus basics

Assume for now that $\optimand$ is a continuous variable, and $\ObjFunc$ is a nice, continuous function
- We’ll talk about not-so-nice situations later
In fact, assume for now that $\optimand$ is just a single real number
Some things you probably remember from calculus about minima
- Isn’t $\frac{d \ObjFunc}{ d\optimand} = 0$?
- Isn’t $\frac{d^2 \ObjFunc }{ d\optimand^2} > 0$?
Yes, pretty much

The first order condition

At an interior minimum $\optimum$, $\frac{d \ObjFunc }{ d\optimand}(\optimum) = 0$
- If $\ObjFunc$ had a slope, we could keep decreasing $\ObjFunc$ by moving past $\optimum$ in one direction or the other

The first order condition

The tangent line to $\ObjFunc$ is flat at the minimum $\optimum$

The first order condition and boundary optima

At an interior minimum $\optimum$, $\frac{d \ObjFunc }{ d\optimand}(\optimum) = 0$
- If $\ObjFunc$ had a slope, we could keep decreasing $\ObjFunc$ by moving past $\optimum$ in one direction or the other
This reasoning fails at the boundaries of $\OptDomain$
- Easy example: $\OptDomain = [0,1]$, $\ObjFunc(\optimand) = 1-\optimand$
- Boundary optima can have zero slope though

The first order condition and boundary optima

The minimum on this domain is at the right-hand boundary, and the tangent line is not flat

But, except at boundaries, we need $\frac{d\ObjFunc }{ d\optimand}(\optimum) = 0$
This is called the first-order condition for a minimum
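The boundary example $\ObjFunc(\optimand) = 1-\optimand$ on $[0,1]$ can be checked numerically. This is only a sketch: the grid resolution, the step size `h`, and the helper names are arbitrary choices, not part of the lecture.

```python
def objective(theta):
    """The boundary example: M(theta) = 1 - theta on [0, 1]."""
    return 1.0 - theta

def slope(f, theta, h=1e-6):
    """Backward difference, so we never step outside [0, 1] at theta = 1."""
    return (f(theta) - f(theta - h)) / h

grid = [i / 100 for i in range(101)]   # the feasible set [0, 1], discretized
best = min(grid, key=objective)

print(best)                    # location of the minimum: the right-hand boundary
print(slope(objective, best))  # roughly -1, not 0
```

The minimum sits at the edge of the feasible set even though the derivative there is not zero, which is exactly where the first-order condition stops being necessary.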

The second order condition

Maxima, as well as minima, have zero derivatives, and so can inflection points
A sufficient condition for a point with $d\ObjFunc / d\optimand = 0$ to be a minimum: $d^2 \ObjFunc / d\optimand^2 > 0$
- This is called the second order condition
- Sufficient, but not necessary: $\optimand^4$ has a minimum at $\optimand = 0$, even though $d^2\ObjFunc/d\optimand^2 = 12\optimand^2 = 0$ there
- Minima which don’t meet the second-order condition tend to be weird and fragile, like this
Generally, we can find local minima in one dimension by using the first- and second- order conditions together:
- Find all the solutions to $\frac{d\ObjFunc }{ d\optimand}(\optimum) = 0$
- Keep those with $\frac{d^2 \ObjFunc }{ d\optimand^2}(\optimum) > 0$
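The two-step recipe above can be sketched on a hypothetical objective, $\ObjFunc(\optimand) = \optimand^4 - 2\optimand^2$ (chosen for illustration because its derivative factors by hand). The derivatives are written out manually; nothing here is automated calculus.

```python
def objective(theta):
    """Hypothetical example: M(theta) = theta^4 - 2 theta^2."""
    return theta ** 4 - 2 * theta ** 2

def first_derivative(theta):
    return 4 * theta ** 3 - 4 * theta

def second_derivative(theta):
    return 12 * theta ** 2 - 4

# Step 1: solutions of dM/dtheta = 4 theta (theta^2 - 1) = 0, found by factoring.
critical_points = [-1.0, 0.0, 1.0]
assert all(first_derivative(t) == 0 for t in critical_points)

# Step 2: keep the critical points where the second derivative is positive.
local_minima = [t for t in critical_points if second_derivative(t) > 0]

print(local_minima)   # theta = 0 is dropped: the second derivative there is -4
```

Here $\optimand = 0$ passes the first-order condition but fails the second-order one, because it is a local maximum.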

A bit more insight into the second-order condition

Remember the definition of a derivative: \[ \frac{df}{dx}(x_0) \equiv \lim_{x \rightarrow x_0}{\frac{f(x) - f(x_0)}{x-x_0}} \]
Turn this around: for $x \approx x_0$, \[ f(x) \approx f(x_0) + (x-x_0)\frac{df}{dx}(x_0) \]
This is a first-order Taylor approximation
Second-order Taylor approximation: for $x \approx x_0$, \[ f(x) \approx f(x_0) + (x-x_0)\frac{df}{dx}(x_0) + \frac{1}{2}(x-x_0)^2 \frac{d^2f}{dx^2}(x_0) \]
First-order condition says: $\frac{d\ObjFunc}{d\optimand}(\optimum) = 0$
So, near $\optimum$, \[ \ObjFunc(\optimand) \approx \ObjFunc(\optimum) + \frac{1}{2}(\optimand - \optimum)^2 \frac{d^2 \ObjFunc}{d\optimand^2}(\optimum) \]
“Generic minima look, locally, like parabolas”
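A quick numerical look at the parabola claim, as a sketch under assumptions: the objective $\ObjFunc(\optimand) = \optimand - \log{\optimand}$ is a made-up example with its minimum at $\optimand = 1$, where the second derivative equals 1. The gap between the function and its second-order Taylor approximation should shrink much faster than the offset from the minimum.

```python
import math

def objective(theta):
    """Hypothetical example: M(theta) = theta - log(theta), minimized at theta = 1."""
    return theta - math.log(theta)

THETA_STAR = 1.0          # dM/dtheta = 1 - 1/theta vanishes here
SECOND_DERIVATIVE = 1.0   # d^2M/dtheta^2 = 1/theta^2, evaluated at theta = 1

def parabola(theta):
    """Second-order Taylor approximation around the minimum."""
    return objective(THETA_STAR) + 0.5 * SECOND_DERIVATIVE * (theta - THETA_STAR) ** 2

for offset in (0.1, 0.01, 0.001):
    theta = THETA_STAR + offset
    error = abs(objective(theta) - parabola(theta))
    print(f"offset={offset:<6} M={objective(theta):.8f} "
          f"parabola={parabola(theta):.8f} error={error:.2e}")
```

For this function the error behaves like the cube of the offset, which is the first term the parabola leaves out.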

Generic minima look, locally, like parabolas

$\ObjFunc(\optimand)$ (solid) vs. $\ObjFunc(\optimum) + \frac{1}{2}(\optimand - \optimum)^2 \frac{d^2 \ObjFunc}{d\optimand^2}(\optimum)$ (dashed) around the local minimum $\optimum$

Break for in-class exercise (15 min.)

Groups of $\leq 4$; on your own is OK
- Turn in by 6 pm today on Gradescope
- In a group, pick one person as the scribe, make sure the scribe knows everyone’s Andrew ID

Suppose \[ \ObjFunc(\optimand) = -q\log{\optimand} - (1-q)\log{(1-\optimand)} \] with $0 < q< 1$, $\OptDomain = [0,1]$

Write out the first-order condition for $\optimum$ (but don’t solve for it yet)
Solve for $\optimum$ in terms of $q$
Write out the second-order condition — how do we know that $\optimum$ is really the optimum?
Sketch the value of the optimum, $\ObjFunc(\optimum)$, as $q$ goes from $0$ to $1$
- Hint: $0\log{0} = 0$

What about more than one dimension?

Usually $\optimand$ is a vector of $\OptDim > 1$ dimensions
We can’t, usually, do a separate optimization on each dimension
What should happen at an interior minimum $\optimum$?
$\ObjFunc$ should have no slope at $\optimum$ in any direction
- Otherwise, we could lower the value of the function by moving
$\ObjFunc$ should increase as we move away from $\optimum$ in every direction

No slope in any direction: the first-order condition

Pick your favorite direction $\vec{v}$, a vector of length 1, say $(v_1, v_2, \ldots, v_\OptDim)$
The slope of $\ObjFunc$ in that direction, at $\optimand$, is (chain rule) \[ \sum_{i=1}^{\OptDim}{v_i \frac{\partial \ObjFunc}{\partial \optimand_i}(\optimand)} = \vec{v} \cdot \nabla \ObjFunc(\optimand) \]
Here $\nabla \ObjFunc(\optimand)$ is the gradient of $\ObjFunc$ at $\optimand$, the vector of partial derivatives \[ \nabla \ObjFunc(\optimand) = \left[\begin{array}{ccc} \frac{\partial \ObjFunc}{\partial \optimand_1}(\optimand) & \ldots & \frac{\partial \ObjFunc}{\partial \optimand_\OptDim}(\optimand) \end{array} \right] \]

No slope in any direction at $\optimum$ means: $\vec{v} \cdot \nabla \ObjFunc(\optimum) = 0$ for all $\vec{v} \neq 0$
And that means: $\nabla \ObjFunc(\optimum) = 0$
The first-order condition is: “the gradient vanishes at the optimum”
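The identity "slope in direction $\vec{v}$ equals $\vec{v} \cdot \nabla \ObjFunc$" can be checked numerically. This is a sketch: the objective, the point, the direction, and the step size `h` are all illustrative assumptions, and the gradient is worked out by hand for this particular function.

```python
def objective(theta):
    """Hypothetical example: M(theta) = theta_1^2 + theta_1 theta_2 + 3 theta_2^2."""
    t1, t2 = theta
    return t1 ** 2 + t1 * t2 + 3 * t2 ** 2

def gradient(theta):
    """Vector of partial derivatives, worked out by hand for this M."""
    t1, t2 = theta
    return (2 * t1 + t2, t1 + 6 * t2)

def directional_slope(f, theta, direction, h=1e-5):
    """Central-difference slope of f at theta along a unit vector."""
    forward = [t + h * v for t, v in zip(theta, direction)]
    backward = [t - h * v for t, v in zip(theta, direction)]
    return (f(forward) - f(backward)) / (2 * h)

theta = (1.0, 2.0)
v = (0.6, 0.8)   # length 1

via_gradient = sum(vi * gi for vi, gi in zip(v, gradient(theta)))
via_difference = directional_slope(objective, theta, v)
print(via_gradient, via_difference)   # both close to 12.8
```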

First-order condition or first-order conditions?

We have one vector equation $\nabla \ObjFunc(\optimum) = 0$
This is the same as a system of $\OptDim$ equations for the partial derivatives: \[\begin{eqnarray*} \frac{\partial \ObjFunc}{\partial \optimand_1}(\optimum) & = & 0\\ & \vdots & \\ \frac{\partial \ObjFunc}{\partial \optimand_\OptDim}(\optimum) & = & 0 \end{eqnarray*}\]
This is good because we also have $\OptDim$ unknowns, $\optimum = \left[ \begin{array}{ccc} \optimum_1 & \ldots & \optimum_\OptDim \end{array}\right]$
$\OptDim$ equations for $\OptDim$ unknowns $\Rightarrow$ typically a solution
- Typically a unique solution if all the equations are linear in $\optimum$
- Often not unique when the equations are nonlinear in $\optimum$
- But still, there are solutions!
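When the objective is quadratic, the first-order conditions are linear and can be solved directly. A minimal sketch for $\OptDim = 2$, using a hypothetical objective and a hand-rolled Cramer's rule solver (the helper `solve_2x2` is an illustration, not a library function):

```python
def solve_2x2(a, b):
    """Solve a @ x = b for a 2x2 matrix `a` by Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("singular system: no unique solution")
    x1 = (b[0] * a22 - a12 * b[1]) / det
    x2 = (a11 * b[1] - b[0] * a21) / det
    return (x1, x2)

# Hypothetical objective:
#   M(theta) = theta_1^2 + theta_1 theta_2 + theta_2^2 - 3 theta_1 - 3 theta_2
# Setting each partial derivative to zero gives two linear equations:
#   2 theta_1 +   theta_2 = 3
#     theta_1 + 2 theta_2 = 3
coefficients = ((2.0, 1.0), (1.0, 2.0))
right_hand_side = (3.0, 3.0)

theta_star = solve_2x2(coefficients, right_hand_side)
print(theta_star)   # (1.0, 1.0)
```

Two equations, two unknowns, one solution; with a nonlinear gradient the same count of equations and unknowns holds, but several solutions are possible.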

The function increases in every direction: the second-order condition

Second-order Taylor series for vectors: \[ \ObjFunc(\optimand) \approx \ObjFunc(\optimum) + (\optimand - \optimum) \cdot \nabla \ObjFunc(\optimum) + \frac{1}{2}(\optimand - \optimum) \cdot \left(\nabla\nabla \ObjFunc (\optimum) \right) (\optimand - \optimum) \]
Here $\nabla\nabla\ObjFunc(\optimum)$ is the matrix of second partial derivatives, $\frac{\partial^2 \ObjFunc}{\partial \optimand_i \partial \optimand_j}$, a.k.a. the Hessian
First-order condition says the gradient term is zero at $\optimum$, so \[ \ObjFunc(\optimand) \approx \ObjFunc(\optimum) + \frac{1}{2}(\optimand - \optimum) \cdot \left(\nabla\nabla \ObjFunc (\optimum) \right) (\optimand - \optimum) \]
$\optimum$ is a minimum means that, for every $\optimand \neq \optimum$ close to $\optimum$: \[ (\optimand - \optimum) \cdot \left(\nabla\nabla \ObjFunc(\optimum)\right) (\optimand - \optimum) > 0 \]

Positive-definite matrices

A square matrix $\mathbf{h}$ is positive-definite when, for any non-zero vector $\vec{v}$, \[ \vec{v} \cdot \mathbf{h} \vec{v} > 0 \]
- If we only have $\vec{v} \cdot \mathbf{h} \vec{v} \geq 0$ then $\mathbf{h}$ is only non-negative-definite (or positive semi-definite)
Not the same as $\mathbf{h}$ only having positive entries!
- E.g., $\mathbf{p} = \left[\begin{array}{cc} 1 & -0.5\\ -0.5 & 1\end{array}\right]$ is positive-definite
- E.g., $\mathbf{n} = \left[\begin{array}{cc} 0.5 & 1\\ 1 & 0.5\end{array}\right]$ is not positive-definite
We write this as $\mathbf{h} \succ 0$
- Non-negative-definite is $\mathbf{h} \succeq 0$
For symmetric matrices: $\mathbf{h}$ is positive definite $\Leftrightarrow$ all eigenvalues of $\mathbf{h}$ are $>0$
- The Hessian matrix $\nabla\nabla\ObjFunc$ is always symmetric (why?)
- We’ll do a refresher on eigenvalues in a few weeks before we really need them
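The two example matrices $\mathbf{p}$ and $\mathbf{n}$ can be checked with the eigenvalue criterion. The sketch below handles only symmetric $2 \times 2$ matrices, using the closed-form eigenvalue formula; the function names are made up for this illustration.

```python
import math

def eigenvalues_symmetric_2x2(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], in closed form."""
    (a, b), (_, d) = m
    center = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return (center - radius, center + radius)

def is_positive_definite(m):
    """For a symmetric matrix: positive-definite iff all eigenvalues are > 0."""
    return min(eigenvalues_symmetric_2x2(m)) > 0

def quadratic_form(m, v):
    """Compute v . (m v)."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

p = ((1.0, -0.5), (-0.5, 1.0))
n = ((0.5, 1.0), (1.0, 0.5))

print(eigenvalues_symmetric_2x2(p), is_positive_definite(p))   # (0.5, 1.5) True
print(eigenvalues_symmetric_2x2(n), is_positive_definite(n))   # (-0.5, 1.5) False
print(quadratic_form(n, (1.0, -1.0)))                          # -1.0
```

The last line exhibits a specific vector with $\vec{v} \cdot \mathbf{n} \vec{v} < 0$, even though every entry of $\mathbf{n}$ is positive.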

The first- and second- order conditions for minima

For $\optimum$ to be a local minimum,

First-order condition: “The gradient must vanish”, $\nabla \ObjFunc(\optimum) = 0$
- Necessary, except at a boundary
Second-order condition: “The Hessian should be positive-definite”, $\nabla \nabla \ObjFunc(\optimum) \succ 0$
- Sufficient; minima where it’s violated are weird and a-typical
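
A small worked example of why the first-order condition is not enough on its own (the function is chosen only for illustration): take $\ObjFunc(\optimand) = \optimand_1^2 - \optimand_2^2$, so that \[ \nabla \ObjFunc(\optimand) = \left[\begin{array}{cc} 2\optimand_1 & -2\optimand_2 \end{array}\right], \qquad \nabla\nabla \ObjFunc(\optimand) = \left[\begin{array}{cc} 2 & 0\\ 0 & -2\end{array}\right] \]
The gradient vanishes at $\optimand = (0,0)$, but with $\vec{v} = (0,1)$ we get $\vec{v} \cdot \nabla\nabla\ObjFunc \vec{v} = -2 < 0$, so the Hessian is not positive-definite
The origin is a saddle point, not a minimum: $\ObjFunc$ increases along $\optimand_1$ and decreases along $\optimand_2$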

Near a minimum, nice functions look quadratic

Go back to the Taylor approximation: if $\optimum$ is a local minimum, so $\nabla \ObjFunc(\optimum) = 0$, then \[ \ObjFunc(\optimand) \approx \ObjFunc(\optimum) + \frac{1}{2}(\optimand-\optimum) \cdot \left(\nabla\nabla \ObjFunc(\optimum)\right) (\optimand - \optimum) \]
Consequence: if we come close to the minimum, so $\|\optimand - \optimum\| = \epsilon \ll 1$, then \[ \ObjFunc(\optimand) \approx \ObjFunc(\optimum) + O(\epsilon^2) \]
If we can get $\epsilon$-close to the location of the optimum, we get $O(\epsilon^2)$-close to the value of the optimum (and $\epsilon^2 \ll \epsilon \ll 1$)
- Turned around, to get within $\delta$ of the value of the optimum, we only need to get within $O(\sqrt{\delta})$ of the location (and $\delta \ll \sqrt{\delta} \ll 1$)
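
The $\epsilon$ vs. $\epsilon^2$ relationship can be seen numerically. This is a sketch on a hypothetical quadratic objective with its minimum at $(1,1)$; the direction of approach is an arbitrary unit vector. If the claim holds, the gap in value divided by $\epsilon^2$ should stay roughly constant as $\epsilon$ shrinks.

```python
def objective(theta):
    """Hypothetical example, minimized at theta = (1, 1) with M = -3."""
    t1, t2 = theta
    return t1 ** 2 + t1 * t2 + t2 ** 2 - 3 * t1 - 3 * t2

THETA_STAR = (1.0, 1.0)
DIRECTION = (0.6, 0.8)   # unit vector; any other direction would do

def value_gap(epsilon):
    """M(theta) - M(theta_star) for a point at distance epsilon from theta_star."""
    theta = tuple(t + epsilon * v for t, v in zip(THETA_STAR, DIRECTION))
    return objective(theta) - objective(THETA_STAR)

for epsilon in (0.1, 0.01, 0.001):
    gap = value_gap(epsilon)
    print(f"epsilon={epsilon:<6} gap={gap:.3e} gap/epsilon^2={gap / epsilon ** 2:.4f}")
```

The constant ratio, about 1.48 here, is the Hessian factor $\frac{1}{2}\vec{v} \cdot \nabla\nabla\ObjFunc \vec{v}$ for this direction.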

Minimizing risk vs. minimizing empirical risk

We want to minimize risk, \[ \optimum = \argmin_{\optimand \in \OptDomain}{\Risk(\optimand)} = \argmin_{\optimand \in \OptDomain}{\Expect{\Loss(Y, s(X; \optimand))}} \] where $s(\cdot; \optimand)$ is the strategy with parameter $\optimand$
We can minimize empirical risk, \[ \widehat{\optimand} = \argmin_{\optimand \in \OptDomain}{\EmpRisk(\optimand)} = \argmin_{\optimand \in \OptDomain}{\frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \optimand))}} \]
We’re going to see later that \[ \|\widehat{\optimand} - \optimum\| = O(1/\sqrt{n}) \]
- Basically: because of the law of large numbers
- Assuming $\optimand$ has a finite dimension which doesn’t change with $n$
Consequence: \[ \Risk(\widehat{\optimand}) \approx \Risk(\optimum) + O(1/n) \] with factors from the Hessian buried inside the big O
$\Rightarrow$ Minimizing the empirical risk comes closer and closer to minimizing the true risk
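
A simulation sketch of the $O(1/n)$ claim, under strong simplifying assumptions that are not from the lecture: there is no $X$, the strategy is a single constant prediction $\optimand$, the loss is squared error, and $Y$ is Gaussian with mean 2 and variance 1. In that setting the true risk is known in closed form, the risk minimizer is the mean, and the empirical risk minimizer is the sample mean. If the excess risk is $O(1/n)$, then $n$ times the excess risk should hover around a constant.

```python
import random

MU, SIGMA = 2.0, 1.0   # assumed data-generating distribution: Y ~ Normal(MU, SIGMA^2)

def risk(theta):
    """True risk under squared loss: E[(Y - theta)^2] = SIGMA^2 + (theta - MU)^2."""
    return SIGMA ** 2 + (theta - MU) ** 2

def empirical_risk_minimizer(sample):
    """Under squared loss the empirical risk is minimized by the sample mean."""
    return sum(sample) / len(sample)

def mean_scaled_excess_risk(n, reps, rng):
    """Average of n * (risk(theta_hat) - risk(theta_star)) over repeated samples."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
        theta_hat = empirical_risk_minimizer(sample)
        total += n * (risk(theta_hat) - risk(MU))
    return total / reps

rng = random.Random(0)   # fixed seed only so the run is repeatable
for n in (50, 500):
    print(n, mean_scaled_excess_risk(n, reps=400, rng=rng))
```

In this toy setting the scaled excess risk has expectation $\sigma^2 = 1$ for every $n$, so both printed values should be near 1.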

Morals to remember, about minimizing smooth functions

Local vs. global minima
First-order condition: “the gradient vanishes”, $\nabla \ObjFunc(\optimum)=0$
- Except at boundaries
Second-order condition: “the Hessian is positive-definite”, $\nabla\nabla \ObjFunc(\optimum) \succ 0$
- Except for weird, a-typical situations
“Near a minimum, nice functions look quadratic”
$\Rightarrow$ Coming within $O(\epsilon)$ of the location of the minimum puts us within $O(\epsilon^2)$ of the value of the minimum

Next time: actual algorithms

How do we get the computer to actually use all this calculus?
- Algorithms for optimization based on these and related ideas
What happens because the computer can’t do calculus exactly?
- Optimization error and its consequences

Backup: What if $\nabla\nabla\ObjFunc \succeq 0$?

What if the Hessian is only non-negative-definite, or positive-semi-definite?
Then there’s (at least) one direction $\vec{v}$ where \[ \vec{v} \cdot \nabla\nabla\ObjFunc \vec{v} = 0 \]
This suggests that if we start at $\optimum$ and take a small enough step in the direction $\vec{v}$, we don’t (necessarily) increase $\ObjFunc$
We can have this when there is a continuous set of minima
- Imagine a bowl where the base is raised in the middle — there’s a ring of minima around the center
This is a weird and delicate situation
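A concrete instance of such a bowl (an illustrative choice of function): \[ \ObjFunc(\optimand) = \left(\optimand_1^2 + \optimand_2^2 - 1\right)^2 \]
Every point on the unit circle is a global minimum, with $\ObjFunc = 0$
At the minimum $\optimum = (1, 0)$, \[ \nabla\nabla\ObjFunc(\optimum) = \left[\begin{array}{cc} 8 & 0\\ 0 & 0\end{array}\right] \]
Taking $\vec{v} = (0,1)$, the direction tangent to the circle, gives $\vec{v} \cdot \nabla\nabla\ObjFunc(\optimum) \vec{v} = 0$: the Hessian is non-negative-definite but not positive-definite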

Backup: Big-O notation

$f(x) = O(g(x))$ as $x\rightarrow \infty$ means: there’s some $C> 0$ so $|f(x)| \leq C g(x)$ for all sufficiently big $x$
- E.g., $10000000 + e^{-x} = O(1)$
- E.g., $37 x^2 + 42x + 1421 = O(x^2)$
- “Is at most of the order of”, sometimes abbreviated “is of the order of”
- For relevance, typically try to give the tightest bound we can; $37x^2 = O(x^4)$ is true, but that’s not informative
- Use the same notation for limits $x \rightarrow 0$
Small $o$ notation: $f(x) = o(g(x))$ means: $\lim{\frac{f(x)}{g(x)}} = 0$
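To make the constant $C$ explicit in the big-$O$ example $37 x^2 + 42x + 1421 = O(x^2)$ (a worked check, with the threshold $x \geq 1$ picked for convenience): for $x \geq 1$ we have $x \leq x^2$ and $1 \leq x^2$, so \[ 37x^2 + 42x + 1421 \leq 37x^2 + 42x^2 + 1421x^2 = 1500 x^2 \]
So $C = 1500$ works for all $x \geq 1$; any larger $C$ works too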

Backup: What do I mean when I say “weird, a-typical”?

The set $D$ is dense in another set $A$ when there’s a point in $D$ arbitrarily close to every point in $A$
- E.g., the rationals are dense in $[0,1]$
The set $N$ is nowhere dense in $A$ when it’s not dense in any open subset of $A$
- Open intervals: think of say $(1/4, 3/4)$, as opposed to $[1/4, 3/4]$
- On the line, open sets are, roughly, unions of a countable number of open intervals; similarly in $\mathbb{R}^d$
The set $M$ is meager if it’s a countable union of nowhere-dense sets
- The rational numbers are meager, because there’s only (!) a countable infinity of them, and each of them is nowhere-dense
A set is typical if its complement is meager
- In particular, a set is typical if it’s both open and dense
- The irrational numbers in $[0,1]$ are typical
Local minima of smooth functions with positive second derivatives are typical, those with zero second derivatives are not typical
- If you start from a minimum which does have a positive second derivative, you can continuously adjust the function by arbitrarily small amounts and it still has a minimum at the same location with a positive second derivative (set is open and dense)
- If you find a function with a zero second derivative, there are arbitrarily small tweaks to the function where you now have the same minimum but a positive second derivative
  - e.g., $x^4$ vs $x^4 + \epsilon x^2$, for $\epsilon > 0$ as small as you like
These notions come from topology, which started by asking what properties of shapes stay the same under smooth transformations

Optimization — Basics from Calculus
