36-465/665, Spring 2021

22 April 2021 (Lecture 22)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \DeclareMathOperator{\sgn}{sgn} \newcommand{\Margin}{M} \]

- Kernel machines: \[ s(x) = b_0 + \sum_{i=1}^{n}{\alpha_i K(x, x_i)} = b_0 + \sum_{j=1}^{\infty}{\beta_j \phi_j(x)} = b_0 + \langle \beta, \phi(x) \rangle \]
- Kernel matrix \(\mathbf{K} =\) the \([n\times n]\) matrix where \(K_{ij} = K(x_i, x_j)\)
- If \(\| \beta \| \leq c\) then (HW10Q1) \[
\EmpRademacher_n(\ModelClass) \leq \frac{c\sqrt{\tr{\mathbf{K}}}}{n}
\]
- And (HW10Q2) \(\|\beta\|^2 = \alpha \cdot \mathbf{K} \alpha\)
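- One way to see the second identity is to write \(\beta\) in its representer form, \(\beta = \sum_{i=1}^{n}{\alpha_i \phi(x_i)}\), and expand the squared norm: \[ \|\beta\|^2 = \left\langle \sum_{i=1}^{n}{\alpha_i \phi(x_i)}, \sum_{j=1}^{n}{\alpha_j \phi(x_j)} \right\rangle = \sum_{i=1}^{n}{\sum_{j=1}^{n}{\alpha_i \alpha_j \langle \phi(x_i), \phi(x_j) \rangle}} = \alpha \cdot \mathbf{K} \alpha \]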

- Regression: use kernel ridge regression, \[ \min_{\beta}{\frac{1}{n}\sum_{i=1}^{n}{(y_i - s(x_i;\beta))^2} + \lambda \|\beta\|^2} = \min_{\alpha \in \mathbb{R}^n}{\frac{1}{n}\sum_{i=1}^{n}{(y_i - s(x_i;\alpha))^2} + \lambda \left(\alpha \cdot \mathbf{K} \alpha\right)} \]
- Solution: \(\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}\)
- Great for least-squares regression, awkward for other tasks
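As an illustrative sketch of the closed-form solution in action (Python rather than the R used later in these notes; the Gaussian kernel, the toy data, the small hand-rolled solver, and the omission of the intercept \(b_0\) are all my choices here, not part of the lecture):

```python
import math

def rbf(x, z, h=1.0):
    # Gaussian (radial basis function) kernel K(x, z) = exp(-(x - z)^2 / (2 h^2))
    return math.exp(-((x - z) ** 2) / (2.0 * h ** 2))

def solve(A, y):
    # Small Gaussian-elimination solver for A alpha = y, with partial pivoting
    n = len(y)
    A = [row[:] for row in A]
    y = list(y)
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        y[k], y[p] = y[p], y[k]
        for r in range(k + 1, n):
            m = A[r][k] / A[k][k]
            for c in range(k, n):
                A[r][c] -= m * A[k][c]
            y[r] -= m * y[k]
    alpha = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(A[k][c] * alpha[c] for c in range(k + 1, n))
        alpha[k] = (y[k] - s) / A[k][k]
    return alpha

def kernel_ridge(xs, ys, lam=1e-3):
    # Fit alpha = (K + lambda I)^{-1} y; predict with s(x) = sum_i alpha_i K(x, x_i)
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve(K, ys)
    return lambda x: sum(a * rbf(x, xi) for a, xi in zip(alpha, xs))

xs = [0.0, 2.0, 4.0, 6.0, 8.0]
ys = [0.0, 1.0, 0.0, -1.0, 0.0]
s = kernel_ridge(xs, ys)
```

With a small \(\lambda\) the fit nearly interpolates the training points, and the whole computation is one \(n \times n\) linear solve.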

- Classification: classes \(y_i\) are \(\pm 1\), maximize the margin by minimizing \[
\sum_{i=1}^{n}{\sum_{j=1}^{n}{\alpha_i \alpha_j y_i y_j K(x_i, x_j)}}
\] subject to \(\sum_{i=1}^{n}{\alpha_i y_i} = 0\), \(\sum_{i=1}^{n}{\alpha_i}=1\), \(\alpha_i \geq 0\)
- Slightly different objective if we allow for some slack/mis-classification

- Solution: support vector machine: \[
\sgn{\left(b_0 + \sum_{i=1}^{n}{\alpha_i y_i K(x, x_i)} \right)}
\]
- Many/most \(\alpha_i = 0\) (**sparse**); if \(\alpha_i \neq 0\) then \(x_i\) is a **support vector**

- The space of functions of the form \(\sum_{j=1}^{\infty}{\beta_j \phi_j(x)}\) is huge!
- Not *every* function, but *lots* of functions, and even if the best function isn’t in this space, it can often be closely approximated
- But we only have to find the \(n\) weights \(\alpha_i\)
- Finite-dimensional approximations to an infinite-dimensional space

- Kernels measure similarity and there are lots of good ways to do that
- Radial basis functions, polynomials, splines…
- Specialized kernels for strings (think: text, DNA), wave-forms (think: speech, music), images, graphs/networks, …

- Learning-theoretic results, so there are both generalization error bounds *and* oracle inequalities
- We can make guarantees about out-of-sample performance
- We can make guarantees about near-optimality

- Kernel machines became big around 1995 with the first support vector machines, and pretty much ruled until say 2012 or 2013
- Deep neural networks trained by gradient descent *may* be kernel machines in disguise (Domingos 2020)

- Need to think hard about the kernel function \(K\)
- Computationally, everything comes down to the kernel matrix \(\mathbf{K}\)
- \(\mathbf{K}\) is \([n\times n]\)
- Not so bad when \(n=10^2\) or \(10^3\) or \(10^4\)
- Awkward when \(n=10^6\) (8 TB of memory just to store one matrix!)
- Inverting a matrix takes \(O(n^3)\) time, but even just calculating the objective function for an SVM takes \(O(n^2)\) time
- Very elaborate algorithms can invert in \(O(n^{2.37\ldots})\) time (but these can actually be slower at small \(n\))

- Also, you need to keep around all \(n\) weights \(\alpha_1, \ldots \alpha_n\)
- Even with SVMs, the number of support vectors is typically \(\propto n\)

- Come up with a big collection of basis functions, say \(\psi(x;w)\), where the \(w\) are **indices** or **parameters**
- Example: \(\psi(x;w) = \cos{(w_1 \cdot x + w_2)}\) (cosine basis or Fourier basis)
- Example: \(\psi(x;w) = \Indicator{\|x - w\| \leq h}\) (bumps)

- **Randomly** draw values \(W_1, W_2, \ldots W_q\) from some distribution \(\rho\) over \(w\)
- Now fit \[ \sum_{j=1}^{q}{\beta_j \psi(x;W_j)} \]

- Bochner’s theorem: If \(K(x,z)=K(x-z)\) is a kernel over \(\mathbb{R}^d\), then there’s a probability distribution \(\rho\) over \(\mathbb{R}^d\) and a constant \(C>0\) where \[ K(x,z) = C \int_{\mathbb{R}^d}{\rho(w)\cos{(w\cdot(x-z))} dw} \]
- Note: \(K(x,z)/C\) is another kernel with the same feature space, so we can set \(C=1\) “without loss of generality”

- \(\rho\) is the **Fourier transform** of \(K(x-z)\)
- Implication: If \(W \sim \rho\), then \[ K(x,z) = \Expect{\cos{(W \cdot (x-z))}} \]
- Implication: if \(W_1, W_2, \ldots W_q\) are all \(\sim \rho\), then \[ \frac{1}{q}\sum_{j=1}^{q}{\cos{(W_j \cdot(x-z))}} \approx \Expect{\cos{(W \cdot (x-z))}} = K(x,z) \]
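To make that implication concrete: for the Gaussian kernel \(K(x,z) = e^{-\|x-z\|^2/2}\), \(\rho\) is the standard Gaussian, and a Monte Carlo average of cosines at Gaussian-distributed frequencies recovers the kernel (an illustrative Python sketch; the kernel choice, the test points, and the sample size are my own):

```python
import math
import random

random.seed(36465)

def gaussian_kernel(x, z):
    # K(x, z) = exp(-||x - z||^2 / 2); its Fourier transform rho is the
    # standard Gaussian density N(0, I)
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / 2.0)

def mc_kernel(x, z, q=20000):
    # (1/q) sum_j cos(W_j . (x - z)) with W_j ~ N(0, I) approximates K(x, z)
    d = len(x)
    total = 0.0
    for _ in range(q):
        w = [random.gauss(0.0, 1.0) for _ in range(d)]
        total += math.cos(sum(wi * (a - b) for wi, a, b in zip(w, x, z)))
    return total / q

x, z = (0.3, -0.5), (1.0, 0.2)
```

With \(q = 20000\) draws, the Monte Carlo error is on the order of \(1/\sqrt{q} \approx 0.007\).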

- Remember \(K(x,z) = \langle \phi(x), \phi(z) \rangle\)
- After some play with trig. identities, two sets of random features will work: \[
\psi_i(x) = (\cos{(W_i \cdot x)}, \sin{(W_i \cdot x)})
\] or \[
\psi_i(x) = \cos{(W_i \cdot x + b_i)}
\] where \(b_i \sim \mathrm{Unif}(-\pi, \pi)\)
- More common to use the 2nd approach (random “phase” \(b_i\))

- Define \[ \psi(x) = (\psi_1(x), \psi_2(x), \ldots \psi_q(x)) \]
- Then \[ \Expect{\psi(x) \cdot \psi(z)} = K(x,z) \]
- Since \(\cos\) is bounded between \(-1\) and \(1\), we can use Hoeffding’s inequality to show \(\psi(x) \cdot \psi(z)\) is close to \(K(x,z)\) with high probability, the failure probability shrinking like \(e^{-q\epsilon^2}\)
- In fact \(\Prob{\max_{x,z}{|\psi(x) \cdot \psi(z) - K(x,z)| \geq \epsilon}} = O(\epsilon^{-1}e^{-q\epsilon^2})\) (omitting constants in exponent etc.) (Rahimi and Recht 2008, claim 1)
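The same approximation, packaged as an explicit feature map using the random-phase form. The \(\sqrt{2/q}\) scaling, which makes \(\Expect{\psi(x) \cdot \psi(z)} = K(x,z)\) come out exactly via the product-to-sum identity for cosines, is an assumption of this sketch, not stated above:

```python
import math
import random

random.seed(465)

q, d = 5000, 2
# Frequencies W_j ~ N(0, I) (the Fourier transform of the Gaussian kernel)
# and random phases b_j ~ Unif(-pi, pi)
W = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(q)]
b = [random.uniform(-math.pi, math.pi) for _ in range(q)]

def features(x):
    # psi_j(x) = sqrt(2/q) * cos(W_j . x + b_j)
    return [math.sqrt(2.0 / q) * math.cos(sum(w * xi for w, xi in zip(Wj, x)) + bj)
            for Wj, bj in zip(W, b)]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

x, z = (0.0, 1.0), (0.5, 0.5)
approx = dot(features(x), features(z))
exact = math.exp(-((0.0 - 0.5) ** 2 + (1.0 - 0.5) ** 2) / 2.0)
```

Here `approx` and `exact` agree to within the expected \(O(1/\sqrt{q})\) Monte Carlo error.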

- Pick your favorite kernel \(K\) and a number \(q > 0\)
- Find \(K\)’s Fourier transform \(\rho\)
- People have been tabulating Fourier transforms since Fourier, almost 200 years ago; you can usually look it up
- Conveniently, the Fourier transform of a Gaussian is also a Gaussian

- Draw \(W_1, W_2, \ldots W_q\) from \(\rho\) and \(b_1, b_2, \ldots b_q\) from \(\mathrm{Unif}(-\pi, \pi)\)
- Calculate the **random Fourier features of the data** \[ \psi(x) = (\cos{(W_1 \cdot x + b_1)}, \cos{(W_2 \cdot x + b_2)}, \ldots \cos{(W_q \cdot x + b_q)}) \]
- Form the \([n \times q]\) matrix \(\mathbf{\psi}\)
- Now run your favorite linear method on \(\mathbf{\psi}\)
- \(\mathbf{\psi}\mathbf{\psi}^T \approx \mathbf{K}\), a **low-rank** (rank-\(q\)) approximation; there are situations where we want to use that, because even though it’s \([n \times n]\) there are special computational techniques for low-rank matrices, but it’s more common to just work with \(\mathbf{\psi}\)

- The **random kitchen sink** idea of Rahimi and Recht (2009)
- Pick your favorite parameterized collection of functions \(\psi(x;w)\), a distribution \(\rho(w)\), an integer \(q\), and a \(C > 0\)
- Draw \(W_1, \ldots W_q\) from \(\rho\)
- Calculate \(\psi(x_i) = (\psi(x_i;W_1), \ldots \psi(x_i, W_q))\)
- Do constrained ERM over linear models in these features: \[ \hat{\beta} = \argmin_{\beta: \max_{i \in 1:q}{|\beta_i| \leq C/q}}{\frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, \beta \cdot \psi(x_i))}} \]
- If the loss function \(\Loss\) is reasonable, then we have an oracle inequality: \[
\Prob{\Risk(\hat{s}) - \Risk(\OptimalModel) \geq O\left(\left(\frac{1}{\sqrt{n}}+\frac{1}{\sqrt{q}}\right) C \sqrt{\log{(2/\delta)}}\right)} \leq \delta
\]
- You’ll work this through in HW13, with all the constants
- Basic idea: The random features give a good approximation to the optimal machine (that’s where the \(O(1/\sqrt{q})\) comes from), and linear models on the random features have low Rademacher complexity (that’s where the \(O(1/\sqrt{n})\) comes from)
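A toy version of that constrained ERM (an illustrative Python sketch: squared-error loss, 1-D data, and projected gradient descent to enforce the box constraint \(|\beta_j| \leq C/q\); the optimizer and all the specific numbers are my choices, not from the lecture):

```python
import math
import random

random.seed(13)

# Toy 1-D regression data: y = sin(2x) on [0, pi]
n = 100
xs = [math.pi * i / (n - 1) for i in range(n)]
ys = [math.sin(2.0 * x) for x in xs]

# Random cosine features psi_j(x) = cos(W_j x + b_j)
q, C = 30, 30.0
W = [random.gauss(0.0, 2.0) for _ in range(q)]
b = [random.uniform(-math.pi, math.pi) for _ in range(q)]
psi = [[math.cos(W[j] * x + b[j]) for j in range(q)] for x in xs]

def risk(beta):
    # Empirical risk: mean squared error of the linear-in-features model
    return sum((y - sum(bj * f for bj, f in zip(beta, row))) ** 2
               for y, row in zip(ys, psi)) / n

beta = [0.0] * q
lr = 0.01
for _ in range(500):
    # Gradient step on the empirical risk...
    resid = [sum(bj * f for bj, f in zip(beta, row)) - y
             for y, row in zip(ys, psi)]
    grad = [2.0 / n * sum(r * row[j] for r, row in zip(resid, psi))
            for j in range(q)]
    # ...then project back onto the box |beta_j| <= C/q
    beta = [max(-C / q, min(C / q, bj - lr * g)) for bj, g in zip(beta, grad)]
```

The clamp in the last line is the projection step; when \(C\) is large the constraint rarely binds.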

- Want the function space to be rich
- Random Fourier features are good, so are random bumps

- Typically make \(C\) large enough that the constraint stays inactive
- Or just ignore it altogether…

- Can also do ridge or lasso, or make the linear part the input to a generalized linear model, etc.

```
library(expandFunctions)
# raptMake() draws the random affine projection: q = 30 random weight
# vectors W_j (Gaussian, sd = 1) for p = 2 inputs, plus offsets b_j ~ Unif(-pi, pi)
my.rFfs <- raptMake(p = 2, q = 30, WdistOpt = list(sd = 1),
    bDistOpt = list(min = -pi, max = pi))
# rapt() applies the projections W_j . x + b_j to each row of the data
dim(rapt(as.matrix(df[, c("x1", "x2")]), my.rFfs))
```

`## [1] 200 30`

`df.augmented <- data.frame(y = df$y, cos(rapt(as.matrix(df[, c("x1", "x2")]), my.rFfs)))`

Here are the first few \(W_j\)

```
## [,1] [,2]
## [1,] 1.0917135 -1.83151364
## [2,] 0.4475455 1.16703196
## [3,] 0.7942617 -1.64220647
## [4,] -0.6990022 -0.43673732
## [5,] -1.3940621 -0.48644337
## [6,] 0.9985419 -0.09824253
```

and the corresponding \(b_j\)

```
## [,1]
## [1,] 0.4192803
## [2,] -1.4453380
## [3,] 0.4489669
## [4,] -2.5730839
## [5,] -2.2053458
## [6,] 0.4401365
```

- The features have random orientations, wavelengths, and phase offsets
- Any model we build is going to be a linear combination of these random cosine waves

```
glm.rks <- glm(y ~ ., data = df.augmented, family = "binomial")
table(predict(glm.rks, type = "response") >= 0.5, df.augmented$y)
```

```
##
## 0 1
## FALSE 100 0
## TRUE 0 100
```

- Kernel machines are great in many ways, *but* there are computational bottlenecks for large \(n\)
- Every prediction needs \(O(n)\) calculations and requires remembering \(O(n)\) weights \(\alpha_i\)
- The \([n \times n]\) kernel matrix \(\mathbf{K}\) needs lots of memory and time

- Random feature machines emerged as an approximation to kernel machines
- Each prediction needs \(O(q)\) calculations and requires remembering \(q\) weights \(\beta_j\)
- Working with linear-in-the-features methods needs only \([q\times q]\) matrices

- Random feature machines turned out to be useful in their own right, not just as approximations to kernel machines
- Very simple computationally (especially if you use random Fourier features)
- Highly expressive with not-too-many random features
- Good generalization properties
- Good in-practice performance

Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” E-print, arxiv:2012.00152. https://arxiv.org/abs/2012.00152.

Rahimi, Ali, and Benjamin Recht. 2008. “Random Features for Large-Scale Kernel Machines.” In *Advances in Neural Information Processing Systems 20 (NIPS 2007)*, edited by John C. Platt, Daphne Koller, Yoram Singer, and Samuel T. Roweis, 1177–84. Red Hook, New York: Curran Associates. http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.

———. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In *Advances in Neural Information Processing Systems 21 (NIPS 2008)*, edited by Daphne Koller, D. Schuurmans, Y. Bengio, and L. Bottou, 1313–20. Red Hook, New York: Curran Associates, Inc. https://papers.nips.cc/paper/2008/hash/0efe32849d230d7f53049ddc4a4b0c60-Abstract.html.