Regularizing Optimization with Penalties and Constraints

36-462/662 Spring 2022

8 February 2022 (Lecture 7)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\optimum}{\optimand^*} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \]

Previously

Thinking about ordinary least squares

Thinking about ordinary least squares (2)

\[ \hat{\beta} = (\mathbf{x}^T\mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \]
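
A minimal numerical sketch of this closed-form estimate (my own synthetic example, not from the slides; Python/numpy, with the intercept handled by a column of ones in \(\mathbf{x}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic design matrix (first column of ones = intercept) and a known beta
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = x @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS estimate: solve (x^T x) beta_hat = x^T y
# (solving the linear system is numerically preferable to forming the inverse)
beta_hat = np.linalg.solve(x.T @ x, x.T @ y)
print(beta_hat)  # should be close to beta_true
```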

Thinking about ordinary least squares (3)

Thinking about ordinary least squares (4)

Thinking about ordinary least squares (5)

Penalties

Penalties (2)

Some pictures

Some pictures (2)

Some pictures (3)

Some pictures (4)

Some pictures (5)

What does the penalty do?

What specifically does the \(L_2\) penalty do?

What about \(L_1\)?

What about \(L_1\)? (2)

What about \(L_1\)? (3)

\(\lambda=1/4\)

What about \(L_1\)? (4)

\(\lambda=4\)

What about \(L_1\) and \(L_2\)?

Penalties \(\Leftrightarrow\) Constraints

Constrained optimization in general

  1. Use the constraint equation \(\Penalty(\optimand) = c\) to eliminate a degree of freedom
    • i.e., write one coordinate in \(\optimand\) as a function of the others and of \(c\)
    • Do unconstrained optimization over the remaining degrees of freedom (a minimal sketch follows this list)
    • What about the \(\leq\) case?!?
  2. Add a new variable and do unconstrained optimization over a larger problem
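
As a minimal sketch of strategy 1 (eliminating a degree of freedom), take a toy problem of my own, not from the slides: minimize \(\ObjFunc(\optimand) = (\optimand_1 - 2)^2 + (\optimand_2 - 1)^2\) subject to \(\Penalty(\optimand) = \optimand_1 + \optimand_2 = c\). The constraint lets us write \(\optimand_2 = c - \optimand_1\) and do an unconstrained, one-dimensional optimization over \(\optimand_1\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

c = 1.0  # constraint level: Omega(theta) = theta_1 + theta_2 = c

def objective(theta):
    # M(theta) = (theta_1 - 2)^2 + (theta_2 - 1)^2
    return (theta[0] - 2.0) ** 2 + (theta[1] - 1.0) ** 2

def reduced_objective(theta1):
    # Eliminate theta_2 via the constraint: theta_2 = c - theta_1
    return objective(np.array([theta1, c - theta1]))

res = minimize_scalar(reduced_objective)   # unconstrained 1-D optimization
theta_star = np.array([res.x, c - res.x])  # recover the eliminated coordinate
print(theta_star)                          # here, (1.0, 0.0)
```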

Lagrange multipliers

Lagrange multipliers (2)

Lagrange multipliers are prices

Lagrange multipliers vs. penalties

Lagrange multipliers turn constrained optimization into penalized optimization
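
One way to see this, in the notation above: hold the multiplier \(\lambda \geq 0\) fixed and look at

\[ \Lagrangian(\optimand, \lambda) = \ObjFunc(\optimand) + \lambda\left(\Penalty(\optimand) - c\right) \]

Minimizing \(\Lagrangian\) over \(\optimand\) at that fixed \(\lambda\) is exactly the penalized problem \(\argmin_{\optimand}{\left\{ \ObjFunc(\optimand) + \lambda \Penalty(\optimand) \right\}}\), since the constant \(-\lambda c\) doesn't change where the minimum is.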

Many constraints

Inequality constraints

Summing up on constraints and Lagrange multipliers

Mathematical programming

Mathematical programming (2)

What do constraints/penalties do to learning and risk?

Summing up

Backup: More about why \(L_1\) promotes sparsity but \(L_2\) doesn’t

Backup: \(L_q\) penalties

Backup: Intercepts, standardized variables

Backup: Inverting \(\mathbf{x}^T\mathbf{x}\) and eigenvalues

Backup: Interior point methods for convex programming

\[\begin{eqnarray*} \optimum & = & \argmin_{\optimand \in \OptDomain}{\ObjFunc(\optimand)}\\ & \text{subject to} &\\ \Penalty(\optimand) & \leq & c \end{eqnarray*}\]
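
A minimal log-barrier sketch for a problem of this form (a toy example of my own, not the slides' implementation): replace the hard constraint with a barrier term \(-\mu \log{\left(c - \Penalty(\optimand)\right)}\), minimize the smooth surrogate, and repeat while shrinking \(\mu\) toward zero, so the solution is pushed toward the constrained optimum.

```python
import numpy as np
from scipy.optimize import minimize

c = 1.0  # constraint level: Omega(theta) = ||theta||_2^2 <= c

def M(theta):
    # Toy objective whose unconstrained minimum, (2, 1), violates the constraint
    return (theta[0] - 2.0) ** 2 + (theta[1] - 1.0) ** 2

def Omega(theta):
    return np.sum(theta ** 2)

def barrier_objective(theta, mu):
    slack = c - Omega(theta)
    if slack <= 0:
        return np.inf  # infeasible points are "infinitely bad"
    # Smooth surrogate: M(theta) - mu * log(c - Omega(theta))
    return M(theta) - mu * np.log(slack)

theta = np.zeros(2)  # start strictly inside the feasible region
for mu in [1.0, 0.1, 0.01, 0.001]:
    # Re-solve the surrogate as mu shrinks, warm-starting from the last solution
    theta = minimize(lambda th: barrier_objective(th, mu), theta,
                     method="Nelder-Mead").x
print(theta)  # approaches the constrained optimum, roughly (0.89, 0.45)
```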

Backup: “Comrades, let’s optimize!”

References

Dorfman, Robert, Paul A. Samuelson, and Robert M. Solow. 1958. Linear Programming and Economic Analysis. New York: McGraw-Hill.

Gneezy, Uri, and Aldo Rustichini. 2000. “A Fine Is a Price.” Journal of Legal Studies 29:1–17. https://doi.org/10.1086/468061.

Kantorovich, L. V. 1965. The Best Use of Economic Resources. Cambridge, Massachusetts: Harvard University Press.

Spufford, Francis. 2010. Red Plenty. London: Faber and Faber.