\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\optimand}{\theta} \newcommand{\optimum}{\optimand^*} \newcommand{\OptimalParameter}{\optimand^*} \newcommand{\ERM}{\hat{\optimand}} \newcommand{\ObjFunc}{{M}} \newcommand{\Hessian}{\mathbf{k}} \]

In previous episodes

Decision problem with available actions \(A\), state \(Y\), information \(X\), loss function \(\Loss(y,a)\)
Set \(S\) of “available” strategies, each \(s \in S\) is a rule saying what action to take for each value of the information \(x\)
Strategies are parameterized by a vector \(\optimand\), so \(s(x;\theta)\) is action suggested by parameter \(\theta\) when \(X=x\)
Risk of a strategy is its expected loss, \(\Risk(\optimand) \equiv \Expect{\Loss(Y, s(X;\optimand))}\)
Want to minimize the risk, \(\optimum = \argmin_{\optimand}{\Risk(\optimand)}\)
- The resulting strategy is not necessarily the true or overall-best one, just the best within the class \(S\)
Optimization: for any smooth objective function \(\ObjFunc\), for \(\optimum\) to be a minimum,
- First-order condition: \(\nabla \ObjFunc(\optimum) = 0\)
- Second-order condition: \(\nabla \nabla \ObjFunc(\optimum) \succ 0\)
- Penalties \(\Leftrightarrow\) constraints, reducing \(S\)
- Specific algorithms based on first- and second- order conditions

The fundamental issue

We want \[ \optimum = \argmin_{\theta}{\Risk(\theta)} \] but this involves the true distribution of \(X\) and \(Y\) which we don’t know, so we can’t actually minimize it
We have data \((x_1, y_1), \ldots, (x_n, y_n)\) sampled from the true distribution
We define empirical risk as average loss on this data \[ \EmpRisk(\optimand) \equiv \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \optimand))} \]
Empirical risk is something we can calculate and minimize: \[ \ERM = \argmin_{\optimand}{\frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \optimand))}} \]
- Notice: \(\optimum\) is non-random; \(\ERM\) is a function of the data, thus random
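To make the definitions concrete, here is a minimal Python sketch (not from the lecture). It assumes squared-error loss, a strategy class of constant predictions, and simulated Gaussian data; the names `empirical_risk`, `grid` and `theta_hat` are illustrative only. It finds \(\ERM\) by a crude grid search, which under these assumptions should land next to the sample mean.

```python
import numpy as np

def empirical_risk(theta, y, loss=lambda y, a: (y - a) ** 2):
    """Average loss on the sample when the strategy always predicts theta."""
    return np.mean(loss(y, theta))

rng = np.random.default_rng(0)
y = rng.normal(loc=7.0, scale=1.0, size=20)   # illustrative data, not real

# Crude ERM: evaluate the empirical risk on a grid and take the minimizer
grid = np.linspace(4.0, 10.0, 6001)            # grid spacing is 0.001
risks = np.array([empirical_risk(t, y) for t in grid])
theta_hat = grid[np.argmin(risks)]

print("ERM on the grid:", theta_hat)
print("sample mean:    ", y.mean())
```

Re-running with a different seed gives a different `theta_hat`, which is the sense in which \(\ERM\) is random while \(\optimum\) is not.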

Today’s agenda

What can we say about the difference between \(\ERM\) and \(\OptimalParameter\)?
- Are we approaching the best parameters?
What can we say about the difference between \(\Risk(\ERM)\) and \(\Risk(\OptimalParameter)\)?
- Never mind about the parameters, are we approaching the best risk?
What can we say about the difference between \(\Risk(\ERM)\) and \(\EmpRisk(\ERM)\)?
- What risk can we actually expect on new data?

Empirical risk converges on true risk

Assume the \((X_i, Y_i)\) are IID
Remember the law of large numbers: if \(Z_i\) are IID, and \(h\) is any fixed function, then \[ \frac{1}{n}\sum_{i=1}^{n}{h(Z_i)} \rightarrow \Expect{h(Z)} \]
- Assuming \(\Expect{h(Z)}\) is well-defined
In fact: \[\begin{eqnarray} \Expect{\frac{1}{n}\sum_{i=1}^{n}{h(Z_i)}} & = & \Expect{h(Z)}\\ \Var{\frac{1}{n}\sum_{i=1}^{n}{h(Z_i)}} & = & \frac{1}{n}\Var{h(Z)} \end{eqnarray}\]
For a fixed parameter \(\optimand\), \[\begin{eqnarray} \EmpRisk(\optimand) & \rightarrow & \Risk(\optimand)\\ \Expect{\EmpRisk(\optimand)} & = & \Risk(\optimand)\\ \Var{\EmpRisk(\optimand)} & = & \frac{1}{n}\Var{\Loss(Y, s(X;\optimand))} \equiv \frac{\sigma^2_{\optimand}}{n} \end{eqnarray}\]
The central limit theorem also applies: \[ \EmpRisk(\optimand) \rightsquigarrow \mathcal{N}(\Risk(\optimand), \sigma^2_{\optimand}/n) \]
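A small simulation can check the mean and variance formulas above. This is a sketch under assumed conditions, not part of the lecture: \(Y \sim \mathcal{N}(7,1)\), squared-error loss, and a constant prediction fixed at \(\optimand = 6\) before seeing any data. Under those assumptions \(\Risk(\optimand) = 2\) and \(\sigma^2_{\optimand} = 6\), so \(n\) times the variance of the empirical risk should hover near 6 for every \(n\).

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 6.0          # fixed parameter, chosen without looking at data
n_reps = 20000       # number of simulated data sets per sample size

def emp_risk_draws(n):
    """Empirical risk of theta on n_reps independent data sets of size n."""
    y = rng.normal(7.0, 1.0, size=(n_reps, n))
    return np.mean((y - theta) ** 2, axis=1)

true_risk = 1.0 + (theta - 7.0) ** 2   # variance of Y plus squared bias

results = {}
for n in (10, 40, 160):
    draws = emp_risk_draws(n)
    results[n] = (draws.mean(), draws.var())
    # mean should be near true_risk; n * variance should be roughly constant
    print(n, draws.mean(), n * draws.var())
```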

“The usual asymptotics”

Big idea: empirical risk \(=\) true risk plus noise that gets small with \(n\)
\(\Rightarrow\) ERM \(=\) true risk minimizer plus noise that gets small with \(n\)
- Also called small-noise asymptotics
We’ll use some calculus to get
- Convergence of \(\ERM\) to \(\optimum\)
- Fluctuations of \(\ERM\) around \(\optimum\)

The empirical risk minimizer is the true minimizer plus noise

\(\ERM\) minimizes \(\EmpRisk\): \[\begin{eqnarray} 0 & = & \nabla \EmpRisk(\ERM)\\ & \approx & \nabla \EmpRisk(\optimum) + \nabla \nabla \EmpRisk(\optimum) (\ERM - \optimum) ~ (\text{Taylor expand around}\ \optimum)\\ \ERM & \approx & \optimum - \left( \nabla \nabla \EmpRisk(\optimum) \right)^{-1} \nabla \EmpRisk(\optimum) \end{eqnarray}\]
\(\optimum\) is fixed, so \[\begin{eqnarray} \EmpRisk(\optimum) & \rightarrow & \Risk(\optimum) ~ \text{(LLN)}\\ \nabla \EmpRisk(\optimum) & \rightarrow & \nabla \Risk(\optimum) ~ \text{(usually)}\\ \nabla \nabla \EmpRisk(\optimum) & \rightarrow & \nabla \nabla \Risk(\optimum) \equiv \Hessian ~ \text{(usually)} \end{eqnarray}\]
So \[ \ERM \approx \optimum - \Hessian^{-1} \nabla \EmpRisk(\optimum) \]

Asymptotic convergence and unbiasedness

\[ \ERM \approx \optimum - \Hessian^{-1} \nabla \EmpRisk(\optimum) \]

But \(\nabla \EmpRisk(\optimum) \rightarrow \nabla \Risk(\optimum) = 0\)
So \(\ERM \rightarrow \optimum\) (asymptotic convergence)
Usually \(\Expect{\nabla \EmpRisk} = \nabla \Expect{\EmpRisk} = \nabla\Risk\), so also \[ \Expect{\ERM} \approx \optimum - 0 = \optimum \]
\(\therefore\) \(\ERM\) is (asymptotically) unbiased

Sandwich covariance

\[\begin{eqnarray} \ERM & \approx & \optimum - \Hessian^{-1} \nabla \EmpRisk(\optimum)\\ \Var{\ERM} & \approx & \Var{\optimum - \Hessian^{-1} \nabla \EmpRisk(\optimum)}\\ & = & \Var{\Hessian^{-1} \nabla \EmpRisk(\optimum)}\\ & = & \Hessian^{-1} \Var{\nabla \EmpRisk(\optimum)} \Hessian^{-1} \end{eqnarray}\]

This is the “sandwich covariance matrix” for \(\ERM\)
- a.k.a. “sandwich variance”

Sandwich covariance (2)

Remember \(\EmpRisk\) is a sample average of IID terms, so \[\begin{eqnarray} \Var{\nabla \EmpRisk(\optimum)} & = & \Var{\nabla \left(\frac{1}{n}\sum_{i=1}^{n}{\Loss(Y_i, s(X_i;\optimum))} \right)} \\ & = & \Var{\frac{1}{n}\sum_{i=1}^{n}{\nabla \Loss(Y_i, s(X_i;\optimum))}}\\ & = & \frac{1}{n^2}n\Var{\nabla \Loss(Y, s(X;\optimum))}\\ & \equiv & \frac{\mathbf{j}}{n} \end{eqnarray}\] so \[ \Var{\ERM} \approx \frac{1}{n} \Hessian^{-1} \mathbf{j} \Hessian^{-1} \]

How far is \(\ERM\) from \(\optimum\)?

Just saw \(\ERM\) is asymptotically unbiased
\(\Var{\ERM} = O(1/n)\)
So \(\Expect{\|\ERM - \optimum\|^2} = O(1/n)\)
So \[ \ERM = \optimum + O(1/\sqrt{n}) \]

Gaussian fluctuations

The CLT holds for \(\EmpRisk(\optimand)\) at any fixed \(\optimand\), and usually there is a CLT for \(\nabla \EmpRisk\) as well: \[ \nabla \EmpRisk(\optimum) \rightsquigarrow \mathcal{N}(0, \mathbf{j}/n) \]
So: \[ \ERM \rightsquigarrow \mathcal{N}(\optimum, n^{-1} \Hessian^{-1} \mathbf{j} \Hessian^{-1}) \]

2 cheers for ERM

Hooray! ERM converges on the best-in-class parameters \(\optimum\)
Hooray! ERM converges pretty fast in parametric problems, \(O(1/\sqrt{n})\)

Some reasons to be a bit more hesitant

Who cares about \(\optimum\)? We care about risk!
Is \(\Risk(\ERM)\) converging on \(\Risk(\optimum)\)? If so, how quickly?
Is \(\EmpRisk(\ERM)\) a good estimate of \(\Risk(\ERM)\)? How much worse could the true risk be than the empirical risk?

How do we find those magic matrices?

Approximations/estimates, since \(\EmpRisk \rightarrow \Risk\) and \(\ERM \rightarrow \optimum\): \[\begin{eqnarray} \Hessian & \equiv & \nabla\nabla \Risk(\optimum)\\ & \approx & \nabla \nabla \Risk(\ERM)\\ & \approx & \nabla \nabla \EmpRisk(\ERM)\\ \mathbf{j} & \equiv & \Var{\nabla \Loss(Y, s(X;\optimum))}\\ & \approx & \Var{\nabla \Loss(Y, s(X;\ERM))}\\ & \approx & \frac{1}{n}\sum_{i=1}^{n}{\left( \nabla \Loss(y_i, s(x_i;\ERM))\right) \left( \nabla \Loss(y_i, s(x_i;\ERM))\right)^T} \end{eqnarray}\]
- (last expression \(=\) sample variance-covariance matrix for \(\nabla \Loss\))
Just need to compute \(\nabla \Loss\) and \(\nabla \nabla \Loss\)
- Often we did that to find \(\ERM\) in the first place
You’ll work through an example in HW 5
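The plug-in recipe above can be sketched in a few lines of NumPy. This is an illustration under assumptions that are not in the lecture: linear strategies \(s(x;\optimand) = x \cdot \optimand\), squared-error loss, and simulated data; the function name `sandwich_covariance` is made up for this sketch. With this loss, \(\nabla \Loss = -2(y - x\cdot\optimand)x\) and \(\nabla\nabla \Loss = 2 x x^T\).

```python
import numpy as np

def sandwich_covariance(X, y):
    """Plug-in sandwich covariance for least squares under squared-error loss."""
    n, p = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # the ERM
    resid = y - X @ theta_hat
    # Per-observation gradient of (y - x.theta)^2 at theta_hat, one row each
    grads = -2.0 * resid[:, None] * X
    k_hat = 2.0 * X.T @ X / n        # Hessian of the empirical risk
    j_hat = grads.T @ grads / n      # average outer product of gradients
    k_inv = np.linalg.inv(k_hat)
    return theta_hat, k_inv @ j_hat @ k_inv / n

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)   # simulated, well-specified

theta_hat, V = sandwich_covariance(X, y)
print("estimate:", theta_hat)
print("sandwich standard errors:", np.sqrt(np.diag(V)))
```

In this simulated setting the model is correct and the noise has constant variance 1, so the result should be close to the classical \((\mathbf{x}^T\mathbf{x})^{-1}\); the sandwich form is the one that stays valid when those assumptions fail.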

But what about the predictions?

\(\optimum\) is the best parameter value, so best predictions are \(s(x;\optimum) = \OptimalModel(x)\)
- “Best” = risk-minimizing = expected loss minimizing
\(\ERM\) estimates \(\optimum\), so \(s(x;\ERM) = \hat{s}(x)\) estimates the best prediction
Uncertainty about \(\optimum\) \(\Rightarrow\) uncertainty about best predictions
Variance of \(\ERM\) \(\Rightarrow\) variance for best predictions
- “Propagation of error”, “uncertainty propagation”, “the delta method”
- Another problem in HW 5
- More fun with \(\nabla\)s and matrices

But what about the predictions? (2)

To preview the homework a little, use a Taylor expansion: \[\begin{eqnarray} s(x;\ERM) & \approx & s(x;\optimum) + \nabla s(x;\optimum) (\ERM - \optimum)\\ & = & s(x;\optimum) + \nabla s(x;\optimum) O(1/\sqrt{n})\\ & = & s(x;\optimum) + O(1/\sqrt{n}) \rightarrow s(x;\optimum)\\ \end{eqnarray}\]
We’ll be more precise in the homework, rather than just \(O(1/\sqrt{n})\)

What about the risk?

Taylor expand again: \[\begin{eqnarray} \Risk(\ERM) & \approx & \Risk(\optimum) + \frac{1}{2} (\ERM - \optimum) \cdot \Hessian (\ERM - \optimum)\\ & = & \Risk(\optimum) + O(1/\sqrt{n})O(1/\sqrt{n})\\ & = & \Risk(\optimum) + O(1/n) \end{eqnarray}\]
Remember that in Lecture 4 we said \[ \Risk(\optimand) = \Risk_0 + (\Risk(\optimum) - \Risk_0) + (\Risk(\optimand) - \Risk(\optimum)) = \text{minimum risk} + \text{approximation error} + \text{estimation error} \]
Generically, estimation error for ERM is \(O(1/n)\)

Being more precise about the risk

For any fixed \(\optimand\), \[ \Expect{\EmpRisk(\optimand)} = \Risk(\optimand) \]
Define \(\gamma(\optimand) \equiv \EmpRisk(\optimand) - \Risk(\optimand)\), the deviation or fluctuation at \(\optimand\)
\(\Expect{\gamma(\optimand)} = 0\) for any fixed \(\optimand\), and \(\Var{\gamma(\optimand)} = O(1/n)\)
But what about \(\gamma(\ERM)\)? \(\ERM\) is not a fixed strategy!!!

Empirical risk minimization is optimistic

Intuition: \(\ERM\) picked to fit this data, so it partly adapts to real patterns that will show up in new data, and partly to the accidents of the training data
Math: \[ \ERM = \argmin_{\optimand}{\left( \Risk(\optimand) + \gamma(\optimand) \right)} \]
Picking \(\optimand\) to minimize \(\EmpRisk(\optimand)\) is partly about finding the strategy with the smallest true risk, minimizing \(\Risk(\optimand)\)
Picking \(\optimand\) to minimize \(\EmpRisk(\optimand)\) is partly about finding the luckiest strategy, one with very negative \(\gamma(\optimand)\)
- Remember \(\Expect{\gamma(\optimand)} = 0\) for each \(\optimand\)
Implication: \(\Expect{\gamma(\ERM)} < 0\)
Implication: \[\begin{eqnarray} \Expect{\EmpRisk(\ERM)} & = & \Expect{\Risk(\ERM)} + \Expect{\gamma(\ERM)}\\ & < & \Expect{\Risk(\ERM)} \end{eqnarray}\]
The strategy we pick by minimizing the in-sample loss will, on average, do worse on new data than it did on the training data

A toy example (1)

\(Y \sim \mathcal{N}(7, 1)\), \(n=20\) data points (tick marks on the axis):

A toy example (2)

Use squared error, with \(\ModelClass =\) all constants, so we just want to predict the expected value of \(Y\)
True risk when predicting \(\theta\) is \(1+(\theta-7)^2\) (why?)
- What if \(Y\) is Gaussian but with a different variance? What if it was non-Gaussian but with the same expectation and variance?
True risk is minimized at the expected value:

A toy example (3)

Empirical risk is minimized at the sample mean:

A toy example (4)

The difference between the two curves is the risk deviation \(\gamma(\theta)\):
\(\gamma(\theta)\) is really a random function (a stochastic process), and this is one draw from its distribution (one realization of the process)

A toy example (5)

Repeat the simulation many times, to evaluate \(\gamma(\theta)\) for any fixed \(\theta\)
- At each \(\theta\), get an (approximately) Gaussian distribution centered at 0 (why?)

A toy example (6)

Things look different for the \(\ERM\) chosen by empirical risk minimization:
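The contrast can be reproduced with a short simulation. This sketch follows the toy setup (\(Y \sim \mathcal{N}(7,1)\), \(n=20\), squared error, constant predictions); the number of replications and the variable names are choices made here, not taken from the lecture. In this toy problem \(\ERM\) is the sample mean, and a direct calculation gives \(\Expect{\gamma(\ERM)} = -2/n\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_reps = 20, 100000

def true_risk(theta):
    """Risk of predicting the constant theta when Y ~ N(7, 1)."""
    return 1.0 + (theta - 7.0) ** 2

y = rng.normal(7.0, 1.0, size=(n_reps, n))   # one row per simulated data set

# Deviation at a parameter fixed in advance
theta_fixed = 7.0
gamma_fixed = np.mean((y - theta_fixed) ** 2, axis=1) - true_risk(theta_fixed)

# Deviation at the empirical risk minimizer (the sample mean of each row)
theta_hat = y.mean(axis=1)
emp_at_erm = np.mean((y - theta_hat[:, None]) ** 2, axis=1)
gamma_erm = emp_at_erm - true_risk(theta_hat)

print("mean gamma at fixed theta:", gamma_fixed.mean())   # should be near 0
print("mean gamma at the ERM:    ", gamma_erm.mean())     # should be near -2/n
```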

Estimating the true risk from the empirical risk

We want to know \(\Risk(\ERM)\), and we know \(\EmpRisk(\ERM)\)
- Equivalently, the true and empirical risk of the fitted strategy \(\hat{s} = s(\cdot;\ERM)\)
We saw a little while ago that \[ \Risk(\ERM) \approx \Risk(\optimum) + \frac{1}{2} (\ERM - \optimum) \cdot \Hessian (\ERM - \optimum) \]
We don’t know \(\Risk(\optimum)\) but \(\optimum\) is fixed so \(\Risk(\optimum) = \Expect{\EmpRisk(\optimum)}\)
We don’t know \(\EmpRisk(\optimum)\) but we can Taylor expand: \[\begin{eqnarray} \EmpRisk(\optimum) & \approx & \EmpRisk(\ERM) + \frac{1}{2} (\optimum - \ERM) \cdot \left( \nabla \nabla \EmpRisk(\ERM) (\optimum - \ERM)\right)\\ & = & \EmpRisk(\ERM) + \frac{1}{2} (\ERM - \optimum) \cdot \left( \nabla \nabla \EmpRisk(\ERM) (\ERM - \optimum)\right)\\ & \approx & \EmpRisk(\ERM) + \frac{1}{2} (\ERM - \optimum) \cdot \left( \nabla \nabla \Risk(\ERM) (\ERM - \optimum)\right)\\ & = & \EmpRisk(\ERM) + \frac{1}{2} (\ERM - \optimum) \cdot \left( \Hessian (\ERM - \optimum)\right)\\ \end{eqnarray}\]

Estimating the true risk from the empirical risk (2)

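Put the two Taylor series together, using \(\EmpRisk(\optimum)\) in place of \(\Risk(\optimum)\) (they differ only by the mean-zero fluctuation \(\gamma(\optimum)\)): \[ \Risk(\ERM) \approx \EmpRisk(\ERM) + (\ERM - \optimum) \cdot \Hessian (\ERM - \optimum) \]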
Now (see back-up) \[ \Expect{\Risk(\ERM)} \approx \Expect{\EmpRisk(\ERM)} + n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} \]
- Recall: \(\tr{\mathbf{a}} \equiv \sum_{i}{a_{ii}}=\) sum of the diagonal entries of \(\mathbf{a}\) \(=\) sum of the eigenvalues of \(\mathbf{a}\)
So we can estimate \[ \Risk(\ERM) \approx \EmpRisk(\ERM) + n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} \]

Estimating the true risk from the empirical risk (3)

\[ \Risk(\ERM) \approx \EmpRisk(\ERM) + n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} \]

An (asymptotically) unbiased estimate of the true risk based on the empirical risk
The 2nd term is always \(> 0\)
The 2nd term measures how “flat” the minimum is (\(\Hessian^{-1}\)) and how much noise there is in the function we’re optimizing (\(\mathbf{j}\))
Lots of other formulas are special cases of this:
- Akaike information criterion (AIC) for model selection with maximum likelihood
  - If we use the log loss and our model is right, then \(\tr{(\mathbf{j}\Hessian^{-1})} =\) number of dimensions in \(\theta\)
- Mallows \(C_p\) for linear regression, \(\EmpRisk(\ERM) + 2\sigma^2 p/n\) (when we estimate \(p\) coefficients and the true noise around the regression line has variance \(\sigma^2\))
- Generalized Mallows \(C_p\) for linear smoothers, \(\EmpRisk(\ERM) + 2\sigma^2 \mathrm{df}(\ERM)/n\)
  - \(\mathrm{df}(\ERM) \equiv \frac{1}{\sigma^2}\sum_{i=1}^{n}{\Cov{Y_i, s(X_i; \ERM)}}\)
The special cases let us see the \(\tr{(\mathbf{j}\Hessian^{-1})}\) term is something like “how flexible is the model?”
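As a hedged worked example (not in the original slides), take the toy problem from earlier: squared-error loss, constant predictions, \(Y \sim \mathcal{N}(7,1)\), so \(\optimum = 7\). Then \[\begin{eqnarray} \nabla \Loss(y, \optimand) & = & -2(y - \optimand)\\ \Hessian & = & \nabla\nabla \Risk(\optimum) = 2\\ \mathbf{j} & = & \Var{-2(Y - 7)} = 4\\ n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} & = & \frac{2}{n} \end{eqnarray}\] This agrees with Mallows \(C_p\) for \(p = 1\) and \(\sigma^2 = 1\). In this case the answer can also be checked exactly: \(\ERM\) is the sample mean, \(\Expect{\Risk(\ERM)} = 1 + 1/n\) and \(\Expect{\EmpRisk(\ERM)} = 1 - 1/n\), so the gap is exactly \(2/n\).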

The moral of all this math: approximation-estimation trade-off

Remember how we broke up the risk: \[\begin{eqnarray} \Risk(\ERM) & = & \Risk_0 + (\Risk(\OptimalParameter) - \Risk_0) + (\Risk(\ERM) - \Risk(\OptimalParameter))\\ & = & \text{(optimal risk)} + \text{(approximation error)} + \text{(estimation error)} \end{eqnarray}\]
What happens if we have two sets of strategies, \(\ModelClass_1\) and \(\ModelClass_2\), and \(\ModelClass_1 \subset \ModelClass_2\)?
Usually two optimal models, \(\OptimalParameter_1 \in \ModelClass_1\) and \(\OptimalParameter_2 \in \ModelClass_2\), and two estimates, \(\ERM_1\) and \(\ERM_2\)
Because \(\ModelClass_1 \subset \ModelClass_2\):
1. \(\EmpRisk(\ERM_1) \geq \EmpRisk(\ERM_2)\): empirical risk can only get better by optimizing over more strategies
2. \(\Risk(\OptimalParameter_1) \geq \Risk(\OptimalParameter_2)\): true risk can only get better by optimizing over more strategies
3. \(\max_{s\in\ModelClass_1}{|\gamma(s)|} \leq \max_{s\in\ModelClass_2}{|\gamma(s)|}\): the maximum deviation can only get bigger by searching over more strategies
Thing (1) means ERM always prefers the bigger, more flexible model class
Thing (2) means that bigger, more flexible model classes will have smaller approximation error
Thing (3) means that bigger, more flexible model classes will have bigger estimation error
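Thing (1) is easy to see numerically. The sketch below is an illustration with made-up ingredients, none of which come from the lecture: a sine-curve regression function, Gaussian noise, and nested classes of polynomials of degree 0 through 8 fit by least squares. The training risk can only go down as the degree grows; the risk on a large fresh sample is printed alongside for comparison, with no claim about where it bottoms out.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(4)

def make_data(n):
    """Simulated regression data; the sine curve is an arbitrary choice."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(10000)    # large fresh sample, stands in for true risk

train_risk, test_risk = [], []
for degree in range(0, 9):
    # Least-squares polynomial fit = ERM over polynomials of this degree
    coefs = P.polyfit(x_train, y_train, degree)
    train_risk.append(np.mean((y_train - P.polyval(x_train, coefs)) ** 2))
    test_risk.append(np.mean((y_test - P.polyval(x_test, coefs)) ** 2))

for degree, (tr, te) in enumerate(zip(train_risk, test_risk)):
    print(degree, round(tr, 3), round(te, 3))
```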

Over-fitting

Over-fitting = fitting a model that’s bigger (more flexible, more powerful) than the one which will predict best (given your data)
Over-fitting happens because optimism increases with model size
- Equivalently, because estimation error grows with model size
Some amount of optimism is built in to empirical risk minimization
- We asked the strategy to optimize performance on the training data, and it did so
ERM is very prone to over-fitting

Avoiding over-fitting

Good practical advice: constrain the strategies using what you know (or believe…) about the situation
Good practical advice: keep it simple
- Tension: sometimes we know things are complicated (e.g. language modeling)
Impose penalties, so we minimize \(\EmpRisk(s) + g(s)\) for some \(g(s)\) which is something like \(\Expect{|\gamma(s)|}\)
Get tighter control on over-fitting
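One familiar instance of penalized empirical risk is ridge regression, sketched below. This is a swapped-in illustration, not the lecture's method: the penalty \(g\) here is \(\lambda \|\optimand\|^2\), chosen for its closed form rather than as an estimate of \(\Expect{|\gamma(s)|}\), and the data, the function name `ridge_erm` and the values of \(\lambda\) are all made up.

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Minimize mean squared error plus lam * ||theta||^2 (closed form)."""
    n, p = X.shape
    # Setting the gradient to zero gives (X'X/n + lam I) theta = X'y/n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(6)
n, p = 40, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)     # only the first feature matters

lams = [0.0, 0.1, 1.0, 10.0]
thetas = [ridge_erm(X, y, lam) for lam in lams]
norms = [float(np.linalg.norm(t)) for t in thetas]
emp_risks = [float(np.mean((y - X @ t) ** 2)) for t in thetas]

for lam, nm, er in zip(lams, norms, emp_risks):
    print(lam, round(nm, 3), round(er, 3))
```

Larger penalties shrink the parameter vector and raise the empirical risk, which is the penalties-as-constraints trade from the earlier lectures.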

Why the optimism is not the end of the story

\[ \Risk(\ERM) \approx \EmpRisk(\ERM) + n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} \]

An (asymptotically) unbiased estimate of the true risk based on the empirical risk
What about the variance? What about the distribution?

Summing up

The empirical risk of a fixed strategy is an unbiased estimate of its true risk
The strategy \(\ERM\) picked by ERM isn’t fixed but random
The empirical risk of \(\ERM\) is a negatively-biased or optimistic estimate of the true risk of \(\ERM\)
Larger, more flexible models have smaller approximation error, but more optimism and so more over-fitting
The usual asymptotics give us rough, asymptotic formulas for the optimism / generalization error

Backup: Expectations of quadratic forms, and the optimism formula

We want \[ \Expect{(\ERM - \optimum) \cdot \left( \Hessian (\ERM - \optimum)\right)} \]
A quadratic form is an expression like \(Z \cdot \mathbf{b} Z\), where \(Z\) is a random vector and \(\mathbf{b}\) is a non-random matrix
- \(\Expect{Z} = \mu\) (a vector) and \(\Var{Z} = \Sigma\) (a matrix)
- The quadratic form is a random scalar
Fun math fact: \[ \Expect{Z \cdot \mathbf{b} Z} = \tr{(\mathbf{b} \Sigma)} + \mu \cdot \mathbf{b} \mu \]
- It will be character-building to prove this without looking it up (hint: trace is cyclic)
Applied here, \(Z = \ERM-\optimum\)
- \(\Expect{Z} = \Expect{\ERM} - \optimum \rightarrow 0\) (as \(n\rightarrow\infty\))
- \(\Var{Z} \approx n^{-1} \Hessian^{-1}\mathbf{j}\Hessian^{-1}\) (for large \(n\))
So \[ \Expect{(\ERM - \optimum) \cdot \left( \Hessian (\ERM - \optimum)\right)} \approx \tr{(\Hessian n^{-1} \Hessian^{-1} \mathbf{j} \Hessian^{-1})} = n^{-1} \tr{(\mathbf{j}\Hessian^{-1})} \]
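The quadratic-form identity used above can be sanity-checked by Monte Carlo before (or after) proving it. This is a sketch with arbitrary made-up inputs: the mean vector, covariance and matrix \(\mathbf{b}\) below are illustrative, and the Gaussian distribution is only a convenient choice, since the identity needs just the mean and variance of \(Z\).

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -0.5, 2.0])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)      # symmetric positive-definite covariance
b = rng.normal(size=(3, 3))      # any fixed (non-random) matrix

Z = rng.multivariate_normal(mu, Sigma, size=200000)

# Quadratic form Z_i . (b Z_i) for each simulated vector Z_i
q = np.einsum("ij,jk,ik->i", Z, b, Z)
monte_carlo = q.mean()
formula = np.trace(b @ Sigma) + mu @ b @ mu

print("Monte Carlo:", monte_carlo)
print("formula:    ", formula)
```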

Empirical Risk Minimization, Asymptotics, Optimism