36-462/662, Spring 2022
3 February 2022
\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\optimum}{\optimand^*} \newcommand{\ObjFunc}{{M}} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \]
For \(\optimum\) to be a local minimum, the gradient must vanish there, \(\nabla \ObjFunc(\optimum) = 0\), and the Hessian matrix of second derivatives \(\Hessian(\optimum)\) must be positive semi-definite; if \(\Hessian(\optimum)\) is strictly positive definite, \(\optimum\) is a strict local minimum.
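To see why (a sketch, assuming \(\ObjFunc\) is twice differentiable near \(\optimum\)), take a second-order Taylor expansion around \(\optimum\): \[ \ObjFunc(\optimum + \delta) \approx \ObjFunc(\optimum) + \delta^T \nabla \ObjFunc(\optimum) + \frac{1}{2} \delta^T \Hessian(\optimum) \delta = \ObjFunc(\optimum) + \frac{1}{2} \delta^T \Hessian(\optimum) \delta \] Once the gradient is zero, small displacements \(\delta\) fail to decrease the objective exactly when the quadratic form \(\delta^T \Hessian(\optimum) \delta\) is never negative, i.e., when \(\Hessian(\optimum)\) is positive semi-definite.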
There are many well-tested, general-purpose optimization routines (R's built-in optimizer, optim(), is one of these). For us, the objective function \(\ObjFunc(\optimand)\) is usually the empirical risk, the in-sample average loss of the model \(s(\cdot; \optimand)\): \[ \EmpRisk(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \theta))} \]
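As a concrete illustration (a hypothetical sketch, not necessarily the lecture's example): with squared-error loss and a linear model \(s(x; \theta) = x^T \theta\), the empirical risk is just an ordinary R function of \(\theta\), which is all a general-purpose optimizer needs to see.

```r
# Hypothetical illustration: empirical risk for a linear model under squared-error loss
set.seed(2022)
x <- matrix(rnorm(200), ncol = 2)   # simulated predictors (100 rows, 2 columns)
y <- x %*% c(2, -1) + rnorm(100)    # simulated responses with true coefficients (2, -1)

empirical.risk <- function(theta) {
  mean((y - x %*% theta)^2)         # (1/n) * sum of loss(y_i, s(x_i; theta))
}

empirical.risk(c(0, 0))   # risk of a poor guess
empirical.risk(c(2, -1))  # risk near the truth: roughly the noise variance, about 1
```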
\[ \text{risk} = \text{minimal risk} + \text{approximation error} + \text{estimation error} + \text{optimization error} \]
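Written out with the symbols defined above (a sketch: \(\OptimalStrategy\) is the best possible strategy, \(\OptimalModel\) the best model in our class \(\ModelClass\), \(\optimum\) the exact minimizer of the objective \(\ObjFunc = \EmpRisk\), \(\outputoptimand\) what the optimizer actually returns, and \(\Risk(\optimand)\) is shorthand for the risk of \(s(\cdot; \optimand)\)), this is the telescoping identity \[ \Risk(\outputoptimand) = \Risk(\OptimalStrategy) + \left[ \Risk(\OptimalModel) - \Risk(\OptimalStrategy) \right] + \left[ \Risk(\optimum) - \Risk(\OptimalModel) \right] + \left[ \Risk(\outputoptimand) - \Risk(\optimum) \right] \] with the four terms on the right matching the four named above.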
Don’t bother optimizing more precisely than the noise in the data will support (Bottou and Bousquet 2012).
optim()
optim(par, fn, gr, method, ...)
par = Initial guess at the “parameters” = a vector, our \(\optimand^{(0)}\)
fn = Function to be minimized, our \(\ObjFunc(\optimand)\)
gr = Function to calculate the gradient, our \(\nabla \ObjFunc(\optimand)\)
method = Which optimization algorithm? Two common choices:
Nelder-Mead, a.k.a. the simplex method: doesn't use derivatives, so it can be good for discontinuous functions, but it is inefficient for smooth ones
BFGS: a Newton-type method, but with clever tricks to not spend quite so much time computing and inverting Hessians
... = Lots of extra settings, including things like the “tolerance” (how small an improvement in fn / \(\ObjFunc\) to bother with)
The result of optim() is a list of five components; str() of an example run is shown below, followed by its par and value components, and a call that could have produced this output is sketched after it.
## List of 5
## $ par : num [1:2] 4.96e-10 4.96e-10
## $ value : num -1
## $ counts : Named int [1:2] 41 13
## ..- attr(*, "names")= chr [1:2] "function" "gradient"
## $ convergence: int 0
## $ message : NULL
## [1] 4.956205e-10 4.956205e-10
## [1] -1
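The output above is consistent with a call along the following lines (a reconstruction, not necessarily the code used in lecture: the objective \(-e^{-(\theta_1^2 + \theta_2^2)}\), the starting point, and the choice of BFGS are guesses based on the reported minimum of \(-1\) near the origin and the nonzero gradient count):

```r
# Hypothetical objective: a smooth function whose minimum, -1, sits at the origin
f <- function(theta) { -exp(-sum(theta^2)) }
# Its gradient, so the optimizer can use exact derivatives
grad.f <- function(theta) { 2 * theta * exp(-sum(theta^2)) }

fit <- optim(par = c(1, 1), fn = f, gr = grad.f, method = "BFGS")
str(fit)     # the "List of 5" above: par, value, counts, convergence, message
fit$par      # location of the minimum, approximately (0, 0)
fit$value    # objective value at the minimum, -1
```

A convergence code of 0 is optim()'s way of saying it thinks it converged, and counts records how many times fn and gr were evaluated.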
What if optim() isn't enough?

Albert, Arthur E., and Leland A. Gardner, Jr. 1967. Stochastic Approximation and Nonlinear Regression. Cambridge, Massachusetts: MIT Press.
Bottou, Léon, and Olivier Bousquet. 2012. “The Tradeoffs of Large Scale Learning.” In Optimization for Machine Learning, edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, 351–68. Cambridge, Massachusetts: MIT Press. http://leon.bottou.org/publications/pdf/mloptbook-2011.pdf.
Culberson, Joseph C. 1998. “On the Futility of Blind Search: An Algorithmic View of ‘No Free Lunch’.” Evolutionary Computation 6:109–27. http://www.cs.ualberta.ca/~joe/Abstracts/TR96-18.html.
Nevel’son, M. B., and R. Z. Has’minskiĭ. n.d. Stochastic Approximation and Recursive Estimation. Providence, Rhode Island: American Mathematical Society.
Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” Annals of Mathematical Statistics 22:400–407. https://doi.org/10.1214/aoms/1177729586.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323:533–36. https://doi.org/10.1038/323533a0.
Traub, J. F., and A. G. Werschulz. 1998. Complexity and Information. Lezioni Lincee. Cambridge, England: Cambridge University Press.
Wolpert, David H., and William G. Macready. 1997. “No Free Lunch Theorems for Optimization.” IEEE Transactions on Evolutionary Computation 1:67–82. https://doi.org/10.1109/4235.585893.
After L. O. Hesse, 1811–1874.