---
title: Lecture 25 --- Forests and Other Ensembles
date: 25 November 2019
bibliography: locusts.bib
output:
  html_document:
    toc: true
---

```{r, include=FALSE}
library(knitr)
# Set knitr options for knitting code into the report:
# - Save results so that code blocks aren't re-run unless code changes (cache),
# _or_ a relevant earlier code block changed (autodep), but don't re-run if the
# only thing that changed was the comments (cache.comments)
# - Don't clutter R output with messages or warnings (message, warning)
  # This _will_ leave error messages showing up in the knitted report
# - Center figures by default
opts_chunk$set(cache=TRUE, autodep=TRUE, cache.comments=FALSE,
               message=FALSE, warning=FALSE, fig.align="center",
               background="white", highlight=FALSE, tidy=TRUE,
               echo=FALSE)

```


\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\]

# Trees

## Recap on the CART procedure for growing trees

- If there are fewer than $n_{min}$ cases left, stop
- Consider all possible binary splits based on _one_ feature alone
  + For classification outcomes, consider the reduction in $H[Y|X]$
  + For regression, consider the reduction in $\Var{Y|X}$
- Does the best split improve the prediction by at least $\delta$?
  + If no, stop
  + If yes, recursively apply the procedure to cases on either side of the split
- Prune the tree by cross-validation
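
The split search in the second step can be sketched in a few lines of base R. This is only an illustration for a regression outcome and a single numeric feature, not the code of any tree package; the function name `best_split` and the toy data are made up for the example. A full CART implementation would repeat this over every feature and then recurse on each side of the winning split.

```r
# Find the best binary split of a numeric feature x for a regression outcome y,
# scoring each candidate threshold by the reduction in the sum of squared
# errors around the node mean (equivalently, in the within-node variance).
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  xs <- sort(unique(x))
  if (length(xs) < 2) return(NULL)  # nothing to split on
  # Candidate thresholds: midpoints between consecutive distinct values of x
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  reduction <- sapply(thresholds, function(s) {
    sse(y) - sse(y[x <= s]) - sse(y[x > s])
  })
  best <- which.max(reduction)
  list(threshold = thresholds[best], reduction = reduction[best])
}

# Toy data with a single jump between x = 5 and x = 6
toy_x <- 1:10
toy_y <- c(rep(0, 5), rep(3, 5))
toy_split <- best_split(toy_x, toy_y)
```

Comparing `toy_split$reduction` to a threshold plays the role of the $\delta$ check in the procedure above.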

## Virtues and limits of trees

- Trees can approximate any function you like, to any degree of approximation
  you like
  + Proof: It's what you do to plot a curve on a screen with pixels
- Fully nonparametric, perfectly happy to handle interactions
  + In fact, by default every variable gets to "interact" with every other
- Really good approximation may nonetheless need a really large number of leaves
- More leaves means less bias but also more variance
  + True even if the splits are all treated as fixed
  + Even more true if the splits are found by adaptively growing the tree

```{r}
x <- c(runif(24, min=0, max=10), runif(5, min=0, max=1))
y <- 10*dgamma(x, shape=2, scale=1) + rnorm(29, sd=0.1)
df <- data.frame(x=x, y=y)
df[30,] <- c(8, 2.5)
plot(df)
library(tree)
demo.tree <- tree(y~x, data=df, minsize=1)
plotting.grid <- data.frame(x=seq(from=0, to=10, length.out=100))
lines(plotting.grid$x,
      predict(demo.tree, newdata=plotting.grid))
boring.tree <- tree(y~x, data=df)
lines(plotting.grid$x,
      predict(boring.tree, newdata=plotting.grid),
      lty="dotted")
```

_Some simulated regression data (dots), plus a regression tree fit to the data by growing a big tree without pruning (solid line), and another regression tree fit with the default control settings (dotted line).  Notice how the more stable, dotted-line tree misses the outlier (if it_ is _an outlier and not a genuine feature of the data-generating process), but also misses some of the apparent structure of the regression curve (if it_ is _structure and not just seeing patterns in noise)._


- Getting really good performance may require a really big tree (with low bias), but those have very high variance
- How can we grow big trees with low variance?

# Forests

- "Forest" methods combine the predictions from multiple trees
- Forests are a special case of **ensemble methods**, which combine multiple models to improve on what any one of them could do
- The simplest sort of combination is averaging
  + For classification, whenever I write "averaging", read "voting"
  + (Or, if you like, average the conditional probabilities and then threshold the average probability)
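
Voting and probability-averaging usually agree, but not always. A tiny sketch, with three made-up predicted probabilities standing in for the outputs of three classifiers at one point:

```r
# Predicted probabilities of class 1 from three hypothetical classifiers,
# all evaluated at the same point
probs <- c(0.9, 0.4, 0.4)

# Majority vote: threshold each classifier first, then count the votes
votes <- as.integer(probs > 0.5)
vote_class <- as.integer(mean(votes) > 0.5)

# Probability averaging: average first, then threshold the average
avg_class <- as.integer(mean(probs) > 0.5)
```

Here the vote goes to class 0 (two of the three classifiers lean that way), while the averaged probability is above one half, so averaging picks class 1: one confident classifier outweighs two mildly dissenting ones.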

# Three leading forms of ensemble methods

1. "Bagging": randomly perturb the data, grow a tree to the new data, average
2. "Random forests": combine bagging with _random_ feature selection
3. "Boosting": sequentially fit models to the _errors_ of earlier models

## Bagging, or "bootstrap aggregating" [@Breiman-bagging]

The bagging procedure is simplicity itself:

- Start with a data set $D = (X_1, Y_1), \ldots (X_n, Y_n)$
- Fix the number of trees $m$ we want in the forest
- For $k \in 1:m$
  + Generate $\tilde{D}_k$ by resampling the $(X_i, Y_i)$ $n$ times, with replacement
    * That is, $\tilde{D}_k$ is a resampling ("nonparametric") bootstrap simulation of the data-generating process
	* With high probability, some data points are repeated in $\tilde{D}_k$, and some do not appear at all
  + Grow a tree $\tau_k$ from $\tilde{D}_k$
    * Typically without pruning
- Make predictions by averaging the $\tau_k$
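
The whole procedure fits in a short function. The sketch below stays in base R, so it uses a hand-written one-split regression "stump" as the base learner instead of a full tree; the names `fit_stump`, `predict_stump`, `bag` and `predict_bag`, and the simulated data, are all assumptions made for illustration.

```r
# A deliberately simple base learner: a regression stump (one split, two leaves)
fit_stump <- function(data) {
  xs <- sort(unique(data$x))
  if (length(xs) < 2) {
    # Degenerate resample: predict the overall mean everywhere
    return(list(threshold = Inf, left = mean(data$y), right = mean(data$y)))
  }
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(thresholds, function(s) {
    l <- data$y[data$x <= s]
    r <- data$y[data$x > s]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  s <- thresholds[which.min(sse)]
  list(threshold = s,
       left = mean(data$y[data$x <= s]),
       right = mean(data$y[data$x > s]))
}
predict_stump <- function(stump, newx) {
  ifelse(newx <= stump$threshold, stump$left, stump$right)
}

# Bagging: fit the learner to m bootstrap resamples of the rows
bag <- function(data, m, fit = fit_stump) {
  lapply(seq_len(m), function(k) {
    rows <- sample(nrow(data), size = nrow(data), replace = TRUE)
    fit(data[rows, , drop = FALSE])
  })
}
# Predict by averaging the m fitted models (one column per model)
predict_bag <- function(models, newx, predict_one = predict_stump) {
  preds <- matrix(unlist(lapply(models, predict_one, newx = newx)),
                  nrow = length(newx))
  rowMeans(preds)
}

set.seed(25)
sim <- data.frame(x = runif(50, 0, 10))
sim$y <- sin(sim$x) + rnorm(50, sd = 0.2)
forest <- bag(sim, m = 100)
grid <- seq(0, 10, length.out = 20)
bagged_pred <- predict_bag(forest, grid)
```

Passing a different `fit` and `predict_one` pair swaps in another base learner without touching the resampling or the averaging.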

### A little demo

```{r}
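# Resample a vector with replacement, keeping its length
resample <- function(x) { sample(x, size=length(x), replace=TRUE) }
# Resample the rows of a data frame (nonparametric bootstrap of the cases)
resample.data.frame <- function(df) {
    rows <- seq_len(nrow(df))
    return(df[resample(rows),])
}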
```

```{r}
par(mfrow=c(2,2))
bag.1 <- resample.data.frame(df)
bag.2 <- resample.data.frame(df)
bag.3 <- resample.data.frame(df)
resample.over.original <- function(df, bag) {
    plot(df, col="lightgrey")
    points(jitter(bag$x), jitter(bag$y))
}    
plot(df)
resample.over.original(df, bag.1)
resample.over.original(df, bag.2)
resample.over.original(df, bag.3)
par(mfrow=c(1,1))
```

_Original data for the running example (top left) and three bootstrap resamplings; in each resampling, the full data set is shown in light grey (for comparisons), and the coordinates are slightly "jittered", so that a repeatedly-sampled point appears as multiple points very close to each other._

```{r}
tree.1 <- tree(y~x, data=bag.1, minsize=1)
tree.2 <- tree(y~x, data=bag.2, minsize=1)
tree.3 <- tree(y~x, data=bag.3, minsize=1)
par(mfrow=c(2,2))
plot(demo.tree); text(demo.tree, digits=2)
plot(tree.1); text(tree.1, digits=2)
plot(tree.2); text(tree.2, digits=2)
plot(tree.3); text(tree.3, digits=2)
par(mfrow=c(1,1))
```

_Tree fit to the full data (top left), plus the three trees fit to the three bootstrap resamplings from the previous figure._

```{r}
par(mfrow=c(2,2))
plot(df)
lines(plotting.grid$x,
      predict(demo.tree, newdata=plotting.grid))
resample.over.original(df, bag.1)
lines(plotting.grid$x,
      predict(tree.1, newdata=plotting.grid))
resample.over.original(df, bag.2)
lines(plotting.grid$x,
      predict(tree.2, newdata=plotting.grid))
resample.over.original(df, bag.3)
lines(plotting.grid$x,
      predict(tree.3, newdata=plotting.grid))
par(mfrow=c(1,1))
```

_Original data (top left), plus the same three resamplings, with the regression function estimated by the tree fit to each data set._

```{r}
plot(df)
lines(plotting.grid$x,
      predict(demo.tree, newdata=plotting.grid))
lines(plotting.grid$x,
      predict(tree.1, newdata=plotting.grid), lwd=0.3,
      col="blue")
lines(plotting.grid$x,
      predict(tree.2, newdata=plotting.grid),
      lwd=0.3, col="blue")
lines(plotting.grid$x,
      predict(tree.3, newdata=plotting.grid),
      lwd=0.3, col="blue")
bagged <- rowMeans(cbind(predict(tree.1, newdata=plotting.grid),
                         predict(tree.2, newdata=plotting.grid),
                         predict(tree.3, newdata=plotting.grid)))
lines(plotting.grid$x,
      bagged,
      col="blue")
lines(plotting.grid$x,
      predict(boring.tree, newdata=plotting.grid),
      lty="dotted")
```

_Full data (points), plus regression tree fit to them (black), plus the three trees fit to bootstrap resamplings (thin blue lines), plus the average of the three bootstrapped trees, i.e., the bagged model (thick blue line).  Notice how
the impact of the outlier is attenuated by bagging, but the main features of the unstable-but-sensitive big tree have been preserved, and the bagged curve
shows more detail than just fitting a stable-but-insensitive tree (dotted black line); for instance, bagging picks up that the curve rises for small values of $x$._

### Why does this help?

- Bootstrapping perturbs the data, but in ways which resemble re-running
the experiment
- Even if each tree $\tau_k$ is over-fit to $\tilde{D}_k$, what they have in common
will be the main features of the data set; what's peculiar to each resampling
will tend to average out
- So the average of the trees shouldn't be (much) more biased than a single tree fit to the whole data, but it should have less variance, i.e., less sensitivity to accidents of the original data

### Why it might be a bit surprising that bagging helps

Making predictions using a bag of trees is equivalent to making predictions
using one giant tree with a huge number of leaves.  (Can you prove this, and
explain why if each tree has $r$ leaves, a forest of $m$ trees is equivalent to
one tree with $O(r^m)$ leaves?)  And we know that giant trees should be really
unstable, with really high variance.  So it might seem that we haven't
gained anything, but we really do get better, more stable predictions
from bagging.  The trick is that we don't get our predictions by growing
just _any_ giant tree --- only trees that arise by averaging many smaller
trees are allowable.  This however suggests that we should be really
careful about what we mean when we say things like "simpler models are
usually better", or even "simpler models are usually more stable"
[@Domingos-on-Occams-Razor;@Domingos-process-oriented-evaluation].


### How big a forest?

- $m$ in the range of 100--1000 is often plenty
- Statistically, increasing $m$ can only reduce the variance
- But there are diminishing returns to increasing $m$, for an interesting reason...

#### Some basic math about averaging

- Suppose we've got two random variables, let's say $T_1$ and $T_2$, with equal variance $\sigma^2 > 0$, and correlation $\rho$
- What's $\Var{(T_1 + T_2)/2}$?
\begin{eqnarray}
\Var{\frac{T_1 + T_2}{2}} & = & \frac{1}{4}\Var{T_1+T_2}\\
& = & \frac{1}{4}\left(\Var{T_1} + \Var{T_2} + 2\Cov{T_1, T_2}\right)\\
& = & \frac{\sigma^2}{2} + \frac{2\rho\sigma^2}{4} = \frac{\sigma^2}{2}(1+\rho)
\end{eqnarray}
  + If $\rho=0$, we get the familiar result, that averaging two variables halves the variance
  + If $\rho > 0$, we get more variance than that, by a factor of $1+\rho$
    * If $T_1$ is above [below] its mean, $T_2$ tends to also be above [below] its mean, and so fluctuations for the average are exaggerated
  + If $\rho < 0$, we'd get _less_ variance than if they were uncorrelated
    * If one variable is above its mean the other tends to be below its mean, suppressing fluctuations for the average
- If we average $m$ variables $T_1, \ldots T_m$, all with variance $\sigma^2$ and all with correlation $\rho$ with each other, we get
\begin{eqnarray}
\Var{m^{-1}\sum_{k=1}^{m}{T_k}} & = & \frac{\sigma^2}{m} + \frac{1}{m^2}\sum_{k=1}^{m}{\sum_{l \neq k}{\Cov{T_k, T_l}}}\\
& = & \frac{\sigma^2}{m} + \frac{m(m-1)}{m^2}\rho\sigma^2\\
& = & \sigma^2\left(\frac{1}{m} + \rho\frac{m-1}{m}\right)
\end{eqnarray}
  + Even as $m\rightarrow\infty$, this goes to $\rho\sigma^2$, not to 0
    * Unless $\rho=0$
	* $\rho$ can't be negative if $m\rightarrow\infty$ (see backup)


```{r, echo=FALSE}
curve(1/x, from=1, to=1000, log="xy", xlab="m",
      main="Variance of averaging m correlated terms",
      ylab="Variance of average")
curve(1/x + 0.01*(x-1)/x, add=TRUE, lty="dashed")
curve(1/x + 0.1*(x-1)/x, add=TRUE, lty="dotted")
legend("bottomleft", legend=c(expression(rho==0),
                              expression(rho==0.01),
                              expression(rho==0.1)),
       lty=c("solid", "dashed", "dotted"))
```

_Variance of averaging $m$ terms of equal variance $\sigma^2=1$, each with correlation $\rho$ with the others.  Notice that the variance declines monotonically with $m$, but, unless $\rho = 0$, it asymptotes to a non-zero value._
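
The formula for the variance of an average of equally correlated terms is easy to check by simulation. The sketch below builds equicorrelated variables from a shared component plus independent noise, which is just one convenient construction and only works for $0 \leq \rho \leq 1$; the function names are made up for the example.

```r
# Simulate averages of m equicorrelated variables with unit variance:
# T_k = sqrt(rho) * Z_0 + sqrt(1 - rho) * Z_k, with Z_0, Z_1, ..., Z_m IID N(0,1),
# so that Var[T_k] = 1 and Cor[T_k, T_l] = rho for k != l
sim_average <- function(m, rho, reps = 20000) {
  shared <- rnorm(reps)                        # Z_0, one per replication
  own <- matrix(rnorm(reps * m), nrow = reps)  # Z_k, one column per variable
  # The vector `shared` is recycled down each column, so row i gets shared[i]
  vars <- sqrt(rho) * shared + sqrt(1 - rho) * own
  rowMeans(vars)
}
# The formula derived above
theory <- function(m, rho, sigma2 = 1) sigma2 * (1 / m + rho * (m - 1) / m)

set.seed(25)
m <- 50
rho <- 0.1
empirical <- var(sim_average(m, rho))
predicted <- theory(m, rho)
```

With these settings `empirical` and `predicted` should agree to within simulation error, and `theory(m, rho)` approaches `rho` as `m` grows.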


#### Why this matters here

- Each tree from bagging has the same distribution
  + $\tau_k$ is a random tree, because $\tilde{D}_k$ is a random data set
  + All bootstraps have the same distribution, so $\tau_k$ and $\tau_l$ have the same distribution
  + For any point $x$, $T_k \equiv \tau_k(x)$ is a random variable
  + $T_k$ will have some variance $\sigma^2$
- Trees will be positively correlated
  + i.e., if $\tau_k(x) > \Expect{\tau(x)}$, then we should tend to expect $\tau_l(x) > \Expect{\tau(x)}$
    * Whatever quirk of the initial sample $D$ led to over-estimating the function at $x$ on run $k$, it's probably shared with other resamplings of $D$
	* (Conditional on $D$, the $\tau_k$ are IID)
- All trees are equally correlated
  + In the sense that $\Cov{\tau_k(x), \tau_l(x)}$ has to be the same for all $k\neq l$
    * Could change with $x$
  + This follows because, given $D$, all the resamplings $\tilde{D}_k$ and $\tilde{D}_l$ are IID
- So at any point $x$, $\tau_k(x)$ has some variance $\sigma^2(x)$ and some correlation $\rho(x)$ across trees, and so
\[
\Var{\frac{1}{m}\sum_{k=1}^{m}{\tau_k(x)}} = \sigma^2(x)\left(\frac{1}{m} + \rho(x)\frac{m-1}{m}\right) \rightarrow \sigma^2(x) \rho(x)
\]
  + (Resampling the same fixed data set over and over can't give us unlimited information)
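
One way to see where the limiting variance comes from is the law of total covariance, conditioning on the original sample $D$. The short derivation below uses only the fact that the $\tau_k$ are IID given $D$. For $k \neq l$,
\begin{eqnarray}
\Cov{\tau_k(x), \tau_l(x)} & = & \Expect{\Cov{\tau_k(x), \tau_l(x) \mid D}} + \Cov{\Expect{\tau_k(x) \mid D}, \Expect{\tau_l(x) \mid D}}\\
& = & 0 + \Var{\Expect{\tau_k(x) \mid D}}
\end{eqnarray}
The first term vanishes because of conditional independence, and the second is a variance because $\Expect{\tau_k(x) \mid D} = \Expect{\tau_l(x) \mid D}$. So $\rho(x)\sigma^2(x) = \Var{\Expect{\tau_k(x) \mid D}}$: the variance that survives as $m\rightarrow\infty$ is the variance, across possible original samples, of the "ideal" bagged prediction we would get from infinitely many resamplings of $D$. In particular it cannot be negative.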

## What can we do to reduce the correlation between trees?

- Tweak the data
  + But bootstrap is so natural...
- Add noise to the prediction
  + Seems drastic...
- "Random forests" [@Breiman-random-forests]
  + Paper considered a whole range of ways of randomly building lots of trees, including bagging
  + Introduced the idea of combining bagging with **random feature selection**
  + We now call that combination "random forests"

### The random forests algorithm of @Breiman-random-forests

- Given:
  + Data set $D=(X_1, Y_1), \ldots (X_n, Y_n)$
  + Desired forest size $m$
  + Number of features to consider for each split $q < p=\dim(X)$
- repeat $m$ times:
  + Generate a bootstrap sample $\tilde{D}_k$
  + Run CART to build a tree, but at each interior node, pick $q$ features from $X$, and _only_ consider splits on those features
    * Re-select features independently at each node, with no memory of what was done in other trees, or elsewhere in this tree
	* Don't bother pruning
  + Return tree $\tau_k$
- To predict, average over the $\tau_k$
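
The only new ingredient compared to bagging is the feature sampling at each node. A minimal sketch of one such node, in base R, for numeric features and a regression outcome; `split_one_feature` and `rf_node_split` are hypothetical names, and sampling the $q$ features without replacement is an assumption of this sketch.

```r
# Best split on one numeric feature, scored by reduction in sum of squared errors
split_one_feature <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  xs <- sort(unique(x))
  if (length(xs) < 2) return(list(threshold = NA, reduction = -Inf))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  reduction <- sapply(thresholds, function(s) {
    sse(y) - sse(y[x <= s]) - sse(y[x > s])
  })
  best <- which.max(reduction)
  list(threshold = thresholds[best], reduction = reduction[best])
}

# One random-forest node: sample q of the p columns of X, then pick the best
# split among the sampled features only
rf_node_split <- function(X, y, q) {
  candidates <- sample(ncol(X), size = q)   # q distinct column indices
  splits <- lapply(candidates, function(j) split_one_feature(X[, j], y))
  reductions <- sapply(splits, function(s) s$reduction)
  winner <- which.max(reductions)
  list(feature = candidates[winner],
       threshold = splits[[winner]]$threshold,
       candidates = candidates)
}

set.seed(25)
p <- 9
X <- matrix(rnorm(200 * p), ncol = p)
y <- 2 * (X[, 1] > 0) + rnorm(200, sd = 0.5)   # only feature 1 matters
node <- rf_node_split(X, y, q = floor(sqrt(p)))
```

When feature 1 is not among `node$candidates`, the node has to split on a pure-noise feature; that is exactly the mechanism which makes different trees less alike.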

Some practicalities:

- Number of features to pick $q$ shouldn't be too big
  + $q=\sqrt{p}$ is a common default and often works pretty well
  + I am surprised at how often $q=1$ is competitive
  + Of course, if $p=1$ to start with, random forests becomes good old-fashioned bagging
    * So there's nothing for me to show you here with the running example
- Number of trees: $m=100$ or $m=1000$ often plenty
  + As with bagging, increasing $m$ can only improve things statistically, _but_ there are diminishing returns, and computational costs


### What random forests does

- Different trees will make different random feature selections
- This keeps them from being _too_ similar
- Using bootstrapped versions of the same data forces them to all be trying to solve the same prediction problem
- Results in a good balance of not-too-much bias, not-too-much variance, and low(er) correlation across trees
- RF is simple, fast, and often very competitive in terms of sheer predictive performance
  + Computation: CART is fast, bootstrapping is fast, growing the forest is "embarrassingly parallel"
  + Sheer performance is often close to the best available with much more computational effort

## Boosting

- Bagging and random forests are parallel: no communication or cross-talk between growing tree 1 and tree 2, etc.
- **Boosting** is a **sequential** procedure, where tree $k$ is grown in a way that tries to compensate for the mistakes of trees $1, 2, \ldots k-1$

### The basic boosting procedure

- Given: data set $D=(X_1, Y_1), \ldots (X_n, Y_n)$, number of rounds of
  boosting $m$
  + You can make $m$ adapt to the data if you really want
- Initialize all data points to equal weight $w_i^{(1)} = 1$ for $i \in 1:n$
- for $k \in 1:m$
  + Grow tree $\tau_k$ by doing a _weighted_ fit to the data, with weights $w^{(k)}$
  + Calculate the error $\tau_k$ makes on each data point, $e_{ik}$
  + Multiply weights, so $w_i^{(k+1)} = w_i^{(k)} f(e_{ik})$ for some function $f$
    * $f$ should increase with the size of the error, and be close to or actually 0 for 0 error
	* Good choices of $f$ depend on whether we're doing classification or regression, etc.
- Return the set of trees $\tau_k$
- Make predictions by averaging over the trees
  + Possibly weight tree $\tau_k$ by how well it fit the data
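
To make the loop concrete, here is a sketch for binary classification with labels in $\{-1, +1\}$ and two-leaf trees ("stumps") on a single feature. It uses one specific, widely used choice of weight update and tree weights, that of AdaBoost.M1, in which misclassified points have their weight multiplied by a factor greater than 1 and correctly classified points keep their weight (before renormalizing). That choice, the function names, and the toy data are assumptions of this example, not the only way to fill in $f$.

```r
# Weighted stump: choose the threshold and orientation with the smallest
# weighted error rate; threshold -Inf gives a constant prediction
fit_weighted_stump <- function(x, y, w) {
  xs <- sort(unique(x))
  thresholds <- c(-Inf, (head(xs, -1) + tail(xs, -1)) / 2)
  best <- list(err = Inf)
  for (s in thresholds) {
    for (direction in c(-1, 1)) {
      pred <- ifelse(x > s, direction, -direction)
      err <- sum(w * (pred != y)) / sum(w)
      if (err < best$err) {
        best <- list(threshold = s, direction = direction, err = err)
      }
    }
  }
  best
}
predict_stump <- function(stump, x) {
  ifelse(x > stump$threshold, stump$direction, -stump$direction)
}

boost <- function(x, y, m) {
  w <- rep(1, length(y))                    # equal initial weights
  stumps <- vector("list", m)
  alpha <- numeric(m)
  for (k in seq_len(m)) {
    stumps[[k]] <- fit_weighted_stump(x, y, w)
    miss <- predict_stump(stumps[[k]], x) != y
    err <- min(max(stumps[[k]]$err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha[k] <- log((1 - err) / err)        # better stumps get a bigger vote
    w <- w * exp(alpha[k] * miss)           # up-weight the misclassified points
    w <- w / sum(w) * length(w)             # renormalize; does not change the fit
  }
  list(stumps = stumps, alpha = alpha)
}
# Weighted vote over all the stumps
predict_boost <- function(model, x) {
  score <- rep(0, length(x))
  for (k in seq_along(model$stumps)) {
    score <- score + model$alpha[k] * predict_stump(model$stumps[[k]], x)
  }
  ifelse(score >= 0, 1, -1)
}

# Toy problem: class +1 only in a middle interval of x
toy_x <- seq(0.5, 9.5, by = 1)
toy_y <- ifelse(toy_x > 3 & toy_x < 7, 1, -1)
model <- boost(toy_x, toy_y, m = 40)
train_err <- mean(predict_boost(model, toy_x) != toy_y)
first_err <- mean(predict_stump(model$stumps[[1]], toy_x) != toy_y)
```

No single stump can carve out an interval, so the first stump necessarily makes mistakes on the training data, but a weighted vote of several stumps can represent the interval.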

### What boosting does

- The first tree tries to fit all the data equally
- The _second_ tree doesn't care much about data points that the first tree fit.
Instead, it adjusts itself to try to fit the errors of tree 1.  (Weights go up with error.)
- The _third_ tree doesn't care much about data points that _either_ of the first two trees fit, and _really_ doesn't care about data points that _both_ of them fit, but focuses on data points _neither_ of them fit
- And so on for the fourth, seventh, 183rd tree...
- Each model in the sequence tries to correct the errors of the models that came before it


### Relationship between trees in boosting

- Using the same data...
- But with very different weights...
- Weight moves to places the earlier trees didn't predict well, which can create a kind of anti-correlation in the predictions
- Not easily analyzed with variances and correlations (unlike bagging)
  + Trees are dependent, changing variance, changing correlations
  + Weight tends to accumulate on points which are very hard for _any_ tree
    to predict
  + Think of boosting as a game between a Learner and an Adversary
    * Learner scores points if it can grow a tree that fits all the data
    * Adversary scores points if it can propose points that give Learner trouble
    * Game converges to an equilibrium or stalemate

### In practice...

- Boosting typically uses **weak learners**
  + Like a tree with only two leaves
  + Highly biased, but very stable
- Combining many _different_ weak learners creates a strong learner
  + Boosting specifically uses each learner to compensate for the faults of the others
- Use a lot of rounds of boosting
  + Each weak learner is easy to fit, after all...
- Often highly competitive performance
  + Maybe not _quite_ as good as random forests

### Note on regression boosting and gradient boosting

- For regression, boosting is (often) equivalent to fitting a model,
and then fitting a model to the residuals of the first model, and then fitting
a model to the residuals of the second model, etc., and adding them together [@tEoSL-2nd]
- This idea can be generalized to other loss functions than squared error,
with the derivative of the loss taking the role of the residuals; this is **gradient boosting**

```{r}
par(mfrow=c(2,2))
plot(df)
boost.0 <- prune.tree(tree(y~x, data=df), best=3)
lines(plotting.grid$x,
      predict(boost.0, newdata=plotting.grid))
df$r0 <- residuals(boost.0)
plot(r0 ~ x, data=df)
boost.1 <- prune.tree(tree(r0 ~ x, data=df), best=3)
lines(plotting.grid$x,
      predict(boost.1, newdata=plotting.grid))
df$r1 <- residuals(boost.1)
plot(r1 ~ x, data=df)
boost.2 <- prune.tree(tree(r1 ~ x, data=df), best=3)
lines(plotting.grid$x,
      predict(boost.2, newdata=plotting.grid))
plot(y~x, data=df)
boosted <- rowSums(cbind(predict(boost.0, newdata=plotting.grid),
                          predict(boost.1, newdata=plotting.grid),
                          predict(boost.2, newdata=plotting.grid)))
lines(plotting.grid$x,
      boosted)
par(mfrow=c(1,1))
```

_Illustration of boosting.  Top left: running-example data, plus a regression tree with only three leaves.  Top right: Residuals from the first model, plus a three-leaf tree fit to those residuals.  Bottom left: residuals from the second model, plus a three-leaf tree fit to_ those _residuals.  Bottom right: Original data again, plus the sum of the three estimated trees.  In practice, one would use many more than three steps of boosting._
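
The same fit-the-residuals idea, run for many rounds, can be sketched in base R with stumps as the weak learner. The function names, the `shrinkage` argument (left at 1, which corresponds to simply adding the trees together; smaller values are a common refinement), and the simulated data are assumptions of this example.

```r
# Least-squares regression stump on one numeric feature, fit to a response r
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(thresholds, function(s) {
    sum((r[x <= s] - mean(r[x <= s]))^2) + sum((r[x > s] - mean(r[x > s]))^2)
  })
  s <- thresholds[which.min(sse)]
  list(threshold = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}
predict_stump <- function(stump, x) {
  ifelse(x <= stump$threshold, stump$left, stump$right)
}

# Repeatedly fit a stump to the current residuals and add it to the running fit
boost_residuals <- function(x, y, rounds, shrinkage = 1) {
  fitted <- rep(mean(y), length(y))      # start from the constant fit
  mse <- numeric(rounds)
  for (k in seq_len(rounds)) {
    stump <- fit_stump(x, y - fitted)    # weak learner sees only the residuals
    fitted <- fitted + shrinkage * predict_stump(stump, x)
    mse[k] <- mean((y - fitted)^2)
  }
  list(fitted = fitted, mse = mse)
}

set.seed(25)
sim_x <- runif(100, 0, 10)
sim_y <- 10 * dgamma(sim_x, shape = 2, scale = 1) + rnorm(100, sd = 0.1)
fit <- boost_residuals(sim_x, sim_y, rounds = 200)
```

Because each stump is a least-squares fit to the current residuals, `fit$mse` can never increase from one round to the next. That is a statement about training error only, which is why the number of rounds is usually chosen by looking at held-out data.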


# Not just trees

- Bagging and boosting are often applied to all kinds of models, not just CART
  + Random forests are harder to combine with other models, because what's the equivalent of splitting on features?
- There's no point to using bagging with linear estimators
  + E.g., ordinary least squares for a linear regression
  + Because: taking $m\rightarrow\infty$ just gives you back the linear estimator on the full data!
- Bagging tends to improve very nonlinear but unstable methods
  + Like trees
  + Or like linear regression _with variable selection_
- Boosting tends to improve nonlinear, low-capacity methods
  + Like trees with very few leaves
  + Or simple smoothing methods
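
The point about linear estimators is easiest to see with the simplest estimator that is linear in the data, the sample mean, which is used here as a stand-in for the least-squares case. The data are simulated and the variable names are arbitrary.

```r
set.seed(25)
obs <- rexp(50)     # any fixed data set would do
m <- 5000           # number of bootstrap resamples

# "Bag" the sample mean: compute it on each resample, then average
boot_means <- replicate(m, mean(sample(obs, replace = TRUE)))
bagged_mean <- mean(boot_means)
plain_mean <- mean(obs)
```

`bagged_mean` differs from `plain_mean` only by Monte Carlo noise, which shrinks as `m` grows, so all the resampling has bought nothing.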

# Diversity in ensemble methods (after @Krogh-Vedelsby-neural-network-ensembles )

- Think about regression for the moment
- We're trying to estimate $\mu$, and we've got $m$ models, with predictions $T_1, \ldots T_m$
  + Our average prediction is $\overline{T} \equiv \frac{1}{m}\sum_{i=1}^{m}{T_i}$
  + The variance around this is $V$
\begin{eqnarray}
V & \equiv &\frac{1}{m}\sum_{i=1}^{m}{(T_i - \overline{T})^2}\\
& = & \frac{1}{m}\sum_{i=1}^{m}{(T_i - \mu + \mu - \overline{T})^2}\\
& = & \frac{1}{m}\left[\sum_{i=1}^{m}{(T_i-\mu)^2} + \sum_{i=1}^{m}{(\mu-\overline{T})^2} + 2\sum_{i=1}^{m}{(T_i-\mu) (\mu-\overline{T})}\right]\\
& = & \frac{1}{m}\sum_{i=1}^{m}{(T_i - \mu)^2} + (\mu-\overline{T})^2 + 2(\mu-\overline{T})\frac{1}{m}\sum_{i=1}^{m}{(T_i-\mu)}\\
& = & \frac{1}{m}\sum_{i=1}^{m}{(T_i - \mu)^2} + (\mu-\overline{T})^2 + 2(\mu-\overline{T})(\overline{T}-\mu)\\
& = & \frac{1}{m}\sum_{i=1}^{m}{(T_i - \mu)^2} - (\mu-\overline{T})^2\\
(\mu-\overline{T})^2 & = & \frac{1}{m}\sum_{i=1}^{m}{(T_i - \mu)^2} - V
\end{eqnarray}
- The final left-hand side is the squared error of the average estimator
  + The RHS is the averaged squared error of the $m$ different estimators...
  + ... _minus_ the variance of the estimators
  + All else being equal, better individual models will improve the ensemble
  + All else being equal, _more diverse_ individual models will improve the ensemble
- In a slogan,
\[
\mathrm{(performance\ of\ group)} = \mathrm{(average\ individual\ performance)} + \mathrm{(diversity\ of\ group)}
\]
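
The decomposition is an algebraic identity, so it holds exactly for any set of predictions. A quick numerical check, with a made-up target and ten made-up predictions:

```r
set.seed(25)
mu <- 3                               # the target
preds <- mu + rnorm(10, mean = 0.5)   # predictions of 10 hypothetical models

ensemble_error <- (mu - mean(preds))^2
avg_individual_error <- mean((preds - mu)^2)
# V: spread of the predictions around their own average, with a 1/m factor
# (note that var() would use 1/(m-1) instead)
diversity <- mean((preds - mean(preds))^2)
```

Up to floating-point error, `ensemble_error` equals `avg_individual_error - diversity`, and so it can never exceed `avg_individual_error`.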

#### Some notes on the math

- This works equally well if you're doing a weighted average of predictors --- you just need to define $V$ as the weighted variance within the ensemble (exercise!)
- Similar math for any loss function that has a bias-variance decomposition
  + The variance $V$ here is the variance across the ensemble, not the variance which was going to $\rho\sigma^2$ as $m$ grew above

# Further reading

- Look at the suggested readings from @Berk-on-stat-learn and @tEoSL-2nd for technical details on bagging, random forests and boosting
- @Schapire-Freund-boosting is a great reference on boosting, and one of the best-written books on machine learning I've ever read
- On diversity, see @scotte-Difference (especially on how efforts to get the "best" individuals can reduce the group diversity and therefore lower performance)


# Backup topics

## The minimum (average) correlation among $m$ variables

Suppose we have $m$ variables $T_1, \ldots T_m$, each with variance $\sigma^2$, and $\Cov{T_k, T_l} = \rho_{kl} \sigma^2$.  The variance of the average is
\begin{eqnarray}
\Var{\frac{1}{m}\sum_{k=1}^{m}{T_k}} & = & \frac{\sum_{k=1}^{m}{\Var{T_k}}}{m^2} + \frac{1}{m^2}\sum_{k=1}^{m}{\sum_{l\neq k}{\Cov{T_k, T_l}}}\\
& = & \frac{\sigma^2}{m} + \frac{\sigma^2}{m^2}\sum_{k=1}^{m}{\sum_{l\neq k}{\rho_{kl}}}\\
& = & \sigma^2\left(\frac{1}{m} + \frac{m(m-1)}{m^2}\overline{\rho}\right)
\end{eqnarray}
where $\overline{\rho}$ is the average of all the $\rho_{kl}$.  Since
this is a variance, it must be $\geq 0$, which implies
\begin{eqnarray}
\frac{1}{m} + \frac{m-1}{m}\overline{\rho} & \geq & 0\\
(m-1)\overline{\rho} & \geq & -1\\
\overline{\rho} & \geq & \frac{-1}{m-1}
\end{eqnarray}
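
For the equal-correlation case, the same bound shows up numerically as the point where the correlation matrix stops being positive semi-definite. A small check in base R, with $m=5$ chosen arbitrarily and helper names made up for the example:

```r
# Correlation matrix with every off-diagonal entry equal to rho
equicorrelation <- function(m, rho) {
  R <- matrix(rho, nrow = m, ncol = m)
  diag(R) <- 1
  R
}
# Smallest eigenvalue; a valid correlation matrix needs this to be >= 0
min_eig <- function(m, rho) {
  min(eigen(equicorrelation(m, rho), symmetric = TRUE,
            only.values = TRUE)$values)
}

m <- 5
bound <- -1 / (m - 1)
at_bound <- min_eig(m, bound)             # on the boundary
below_bound <- min_eig(m, bound - 0.05)   # more negative than the bound allows
above_bound <- min_eig(m, bound + 0.05)   # inside the allowed range
```

The smallest eigenvalue is zero (up to rounding) at the bound, negative below it, and positive above it. The eigenvector responsible is the vector of all ones, i.e., the sum of the variables, which is the quantity whose variance was used in the derivation.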

Notes/exercises:

0. Since this is the minimum average correlation, if all variables have equal correlation $\rho$, this is also the minimum value of $\rho$.
1. The assumption that all $T_k$ have the same variance $\sigma^2$ is not essential --- it just simplifies the book-keeping.  (Going through the book-keeping will be a character-building exercise.)
2. You should be able to create a family of $m$ variables with equal variance, and with correlation exactly $-1/(m-1)$, _without_ Googling...


## Exchangeable variables and random limits for averages

- A sequence of random variables $T_1, T_2, \ldots T_m, \ldots$ is **exchangeable** when the distribution doesn't change if we permute them
  + IID sequences are exchangeable, but not vice versa
  + Every _infinite_ exchangeable sequence is a mixture of IID sequences
    * Meaning: $T_k \perp T_l | D$ for some random variable $D$
    * For us, bootstrap resamplings $\tilde{D}_k$ are exchangeable, and IID conditional on the _original_ sample $D$
- Exchangeability implies that $(T_k, T_l)$ has the same distribution as $(T_1, T_2)$ for any $k\neq l$
  + (Can you prove this?)
- This in turn implies that $\Cov{T_k, T_l} = \rho \sigma^2$, where $\sigma^2 = \Var{T_k}$ and $\rho$ is some constant correlation
- So, as we've seen,
\[
\Var{\frac{1}{m}\sum_{k=1}^{m}{T_k}} \rightarrow \rho\sigma^2
\]
- But $\Expect{m^{-1}\sum_{k}{T_k}} = \Expect{T_1}$
- $m^{-1}\sum_{k}{T_k}$ converges, but it converges to a _random_ limit, and the variance of those limits is $\rho\sigma^2$
- For much, much more, see @Kallenberg-symmetries

## The name "boosting"

- The name "boosting" comes from the idea of "boosting a weak learner to a strong learner"
- The original idea was (see, e.g., @Kearns-Vazirani):
  + We're trying to classify balanced data (with $Y=0$ and $Y=1$ equally probable), and perfect classification is possible, if only we knew the right rule
  + We can be (say) 90% confident that our "weak learner", trained on $n_0$ data points, will classify with accuracy (say) 51%, and won't get better with more data
    * (Exact numbers don't matter, just: only a little better than chance, with only a fixed confidence level)
  + Now collect $n=mn_0$ data points, break them up randomly into $m$ chunks, run the weak learner once on each chunk, and use the majority vote
  + By taking $m$ big enough, we can be arbitrarily confident that the majority vote has an arbitrarily low error rate
    * More formally, for any $\epsilon > 0$, $\delta > 0$, there is an $m$ such that with at least $mn_0$ data points, the majority vote has an error rate $\leq \epsilon$ with probability at least $1-\delta$
- This is actually closer to bagging than to modern boosting...
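
The arithmetic behind the majority-vote argument can be sketched with the binomial distribution. This is an idealization: it assumes the $m$ votes at a given point are independent and that each is right with the same probability 0.51, which is stronger than what the set-up above guarantees. The function name is made up for the example.

```r
# Probability that a majority of m independent voters, each correct with
# probability acc, is correct (m odd, so there are no ties)
majority_accuracy <- function(m, acc = 0.51) {
  stopifnot(m %% 2 == 1)
  # Majority is correct when more than (m - 1) / 2 voters are correct
  1 - pbinom((m - 1) / 2, size = m, prob = acc)
}

sizes <- c(1, 101, 1001, 10001)
acc_by_size <- sapply(sizes, majority_accuracy)
```

The accuracy of the vote climbs steadily with `m`, but slowly: with individual accuracy this close to chance, it takes on the order of ten thousand voters to get the majority above 95%.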

# References