Empirical Risk Minimization, Asymptotics, Optimism

36-462/662, Spring 2022

14 February 2022 (Lecture 9)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\optimand}{\theta} \newcommand{\optimum}{\optimand^*} \newcommand{\OptimalParameter}{\optimand^*} \newcommand{\ERM}{\hat{\optimand}} \newcommand{\ObjFunc}{{M}} \newcommand{\Hessian}{\mathbf{k}} \]

In previous episodes

The fundamental issue

Today’s agenda

Empirical risk converges on true risk

“The usual asymptotics”

The empirical risk minimizer is the true minimizer plus noise

Asymptotic convergence and unbiasedness

\[ \ERM \approx \optimum - \Hessian^{-1} \nabla \EmpRisk(\optimum) \]
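
A sketch of where this approximation comes from, assuming \(\Hessian\) denotes the Hessian of the true risk at \(\optimum\), that the empirical Hessian is close to it for large \(n\), and that \(\optimum\) is an interior minimum. The ERM sets the gradient of the empirical risk to zero, so a first-order Taylor expansion of that gradient around \(\optimum\) gives

\[ 0 = \nabla \EmpRisk(\ERM) \approx \nabla \EmpRisk(\optimum) + \Hessian \left( \ERM - \optimum \right) \]

Solving for \(\ERM\) recovers the display above. If expectation and differentiation can be exchanged, \(\Expect{\nabla \EmpRisk(\optimum)} = \nabla \Risk(\optimum) = 0\), so the noise term has mean zero; this is the sense in which \(\ERM\) is asymptotically unbiased.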

Sandwich covariance

\[\begin{eqnarray} \ERM & \approx & \optimum - \Hessian^{-1} \nabla \EmpRisk(\optimum)\\ \Var{\ERM} & \approx & \Var{\optimum - \Hessian^{-1} \nabla \EmpRisk(\optimum)}\\ & = & \Var{\Hessian^{-1} \nabla \EmpRisk(\optimum)}\\ & = & \Hessian^{-1} \Var{\nabla \EmpRisk(\optimum)} \Hessian^{-1} \end{eqnarray}\]
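
The last step uses the symmetry of \(\Hessian\), so that \(\left(\Hessian^{-1}\right)^{T} = \Hessian^{-1}\). As a check on the formula, consider a one-parameter example chosen purely for illustration (the symbols \(\mu\) and \(\sigma^2\) are assumptions introduced here): IID observations \(y_1, \ldots, y_n\) with mean \(\mu\) and variance \(\sigma^2\), and squared-error loss \(\Loss(y, \optimand) = (y - \optimand)^2\).

\[\begin{eqnarray} \optimum & = & \mu\\ \Hessian & = & 2\\ \nabla \EmpRisk(\optimum) & = & -\frac{2}{n}\sum_{i=1}^{n}{(y_i - \mu)}\\ \Var{\nabla \EmpRisk(\optimum)} & = & \frac{4\sigma^2}{n}\\ \Var{\ERM} & \approx & \frac{1}{2} \cdot \frac{4\sigma^2}{n} \cdot \frac{1}{2} = \frac{\sigma^2}{n} \end{eqnarray}\]

Here \(\ERM\) is the sample mean, whose exact variance is \(\sigma^2/n\), so in this case the sandwich formula gives the exact answer.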

Sandwich covariance (2)

How far is \(\ERM\) from \(\optimum\)?

Gaussian fluctuations

2 cheers for ERM

  1. Hooray! ERM converges on the best-in-class parameters \(\optimum\)
  2. Hooray! ERM converges pretty fast in parametric problems, \(O(1/\sqrt{n})\)

Some reasons to be a bit more hesitant

How do we find those magic matrices?

But what about the predictions?

But what about the predictions? (2)

What about the risk?

Being more precise about the risk

Empirical risk minimization is optimistic

A toy example (1)

A toy example (2)

A toy example (3)

A toy example (4)

A toy example (5)

A toy example (6)

Estimating the true risk from the empirical risk

Estimating the true risk from the empirical risk (2)

Estimating the true risk from the empirical risk (3)

\[ \Risk(\ERM) \approx \EmpRisk(\ERM) + n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} \]
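
A small worked case may make the correction term concrete. It assumes that \(\mathbf{j}\) denotes the variance of the gradient of the loss at a single data point, evaluated at \(\optimum\), so that \(\Var{\nabla \EmpRisk(\optimum)} = \mathbf{j}/n\) for IID data. The example itself is chosen for illustration, with \(\mu\) and \(\sigma^2\) as assumed symbols: IID observations with mean \(\mu\) and variance \(\sigma^2\), squared-error loss \(\Loss(y, \optimand) = (y - \optimand)^2\), and hence \(\ERM = \bar{y}\).

\[\begin{eqnarray} \mathbf{j} & = & \Var{-2(Y - \mu)} = 4\sigma^2\\ \Hessian & = & 2\\ n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} & = & \frac{2\sigma^2}{n} \end{eqnarray}\]

This can be checked directly: \(\Expect{\Risk(\ERM)} = \sigma^2 + \sigma^2/n\), while \(\Expect{\EmpRisk(\ERM)} = \sigma^2 - \sigma^2/n\), so the expected gap between true and empirical risk is exactly \(2\sigma^2/n\).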

The moral of all this math: approximation-estimation trade-off

Over-fitting

Avoiding over-fitting

Why the optimism is not the end of the story

\[ \Risk(\ERM) \approx \EmpRisk(\ERM) + n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} \]

Summing up

Backup: Expectations of quadratic forms, and the optimism formula
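
A sketch of the argument, under stated assumptions: \(\mathbf{j}\) is taken to be the variance of the gradient of the loss at a single data point at \(\optimum\), so that \(\Var{\ERM} \approx n^{-1}\Hessian^{-1}\mathbf{j}\Hessian^{-1}\); the symbols \(X\), \(\mathbf{a}\), \(m\) and \(\Sigma\) are introduced here only for the general identity. For a random vector \(X\) with mean \(m\) and variance matrix \(\Sigma\), and a fixed matrix \(\mathbf{a}\),

\[ \Expect{X^T \mathbf{a} X} = \tr{\left(\mathbf{a}\Sigma\right)} + m^T \mathbf{a} m \]

Applying this with \(X = \ERM - \optimum\) (mean approximately zero) and \(\mathbf{a} = \Hessian\), and using second-order Taylor expansions of \(\Risk\) around \(\optimum\) and of \(\EmpRisk\) around \(\ERM\) (with the empirical Hessian approximated by \(\Hessian\)):

\[\begin{eqnarray} \Expect{(\ERM - \optimum)^T \Hessian (\ERM - \optimum)} & \approx & n^{-1}\tr{\left(\Hessian \Hessian^{-1}\mathbf{j}\Hessian^{-1}\right)} = n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)}\\ \Expect{\Risk(\ERM)} & \approx & \Risk(\optimum) + \frac{1}{2} n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)}\\ \Expect{\EmpRisk(\ERM)} & \approx & \Risk(\optimum) - \frac{1}{2} n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)}\\ \Expect{\Risk(\ERM)} - \Expect{\EmpRisk(\ERM)} & \approx & n^{-1}\tr{\left(\mathbf{j}\Hessian^{-1}\right)} \end{eqnarray}\]

The third line uses \(\Expect{\EmpRisk(\optimum)} = \Risk(\optimum)\): the empirical risk is unbiased at any fixed parameter value, and becomes optimistic only once the parameter is tuned to the same data.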