---
title: Lecture 23 --- More About Classifier Design
author: 36-462/662, Fall 2019
date: 18 November 2019
bibliography: locusts.bib
output:
  html_document:
    toc: true
---

\[
\newcommand{\Yhat}{\hat{Y}}
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\newcommand{\indep}{\perp}
\]

# Recapitulation and notation-fixing

We're trying to predict a categorical response, label
or class $Y$, usually binary ($0$ or $1$), using a feature or vector of
features, $X$.  The prediction $\Yhat$ is a function of the feature $x$, so
we'll also write $\Yhat(x)$ for the guess at the label of a case with features
$x$.  We will be very interested in $\Prob{Y=1|X=x}$, and abbreviate it by
$p(x)$.  The over-all mis-classification error rate, or inaccuracy, is $\Prob{Y
\neq \Yhat}$.  This can be decomposed two ways.  One way is in terms of the
false positive rate, the false negative rate, and the "base rate" at
which the two classes occur:
\[
\Prob{Y \neq \Yhat} = \Prob{\Yhat=1|Y=0} \Prob{Y=0} + \Prob{\Yhat=0|Y=1}\Prob{Y=1} = FPR \times  \Prob{Y=0} + FNR \times \Prob{Y=1}
\]
The other decomposition is in terms of (one minus) the negative predictive value,
(one minus) the positive predictive value, and the probability of making each prediction:
\[
\Prob{Y \neq \Yhat} = \Prob{Y=1|\Yhat=0}\Prob{\Yhat=0} + \Prob{Y=0|\Yhat=1}\Prob{\Yhat=1} = (1-NPV) \Prob{\Yhat=0} + (1-PPV) \Prob{\Yhat=1}
\]
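To check that the two decompositions really do give the same number, here is a small sketch using a made-up table of counts (the numbers are purely hypothetical, chosen for easy arithmetic). It computes the error rate directly, then by conditioning on the true class, then by conditioning on the prediction.

```r
# Hypothetical confusion counts (made up for illustration):
# rows are the true class Y, columns are the prediction Yhat
counts <- matrix(c(80, 10,
                    5,  5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Y = c("0", "1"), Yhat = c("0", "1")))
joint <- counts / sum(counts)          # joint distribution of (Y, Yhat)

# Direct calculation: total probability of the two off-diagonal cells
error.direct <- joint["0", "1"] + joint["1", "0"]

# Decomposition 1: condition on the true class
p.y <- rowSums(joint)                  # base rates P(Y=0), P(Y=1)
fpr <- joint["0", "1"] / p.y["0"]      # P(Yhat=1 | Y=0)
fnr <- joint["1", "0"] / p.y["1"]      # P(Yhat=0 | Y=1)
error.by.class <- fpr * p.y["0"] + fnr * p.y["1"]

# Decomposition 2: condition on the prediction
p.yhat <- colSums(joint)                          # P(Yhat=0), P(Yhat=1)
miss.given.neg <- joint["1", "0"] / p.yhat["0"]   # P(Y=1 | Yhat=0)
miss.given.pos <- joint["0", "1"] / p.yhat["1"]   # P(Y=0 | Yhat=1)
error.by.prediction <- miss.given.neg * p.yhat["0"] +
  miss.given.pos * p.yhat["1"]

c(direct = error.direct,
  by.class = unname(error.by.class),
  by.prediction = unname(error.by.prediction))
```

All three entries come out to 0.15 for this table, even though the conditional error rates that go into them are quite different.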

If we want to minimize the inaccuracy, we should use the rule[^threshold] that
$\Yhat=1$ if $p(x) \geq 0.5$ and $\Yhat=0$ if $p(x) < 0.5$.  If we attach a
cost to each error, say $c_+$ to false positives and $c_-$ to false negatives,
then we should set $\Yhat=1$ if $p(x) \geq t(c_+, c_-)$ and $\Yhat=0$ if $p(x)
< t(c_+, c_-)$.  The shape of the threshold function $t$ is such that $t=0.5$
when and only when $c_+ = c_-$.  (Can you remember, or work out, how to
find the threshold in terms of the costs $c_+$ and $c_-$?)

[^threshold]: Strictly, if $p(x)=0.5$, it doesn't matter what $\Yhat$ is, but I will always write $\geq$ for definiteness.

# Classification needs different distributions of features across classes

$p(x)$ is the distribution of the _class_ conditional on the _features_,
$\Prob{Y=1|X=x}$.  It turns out that the success of classification depends on
the "inverse" conditional probability, $\Prob{X=x|Y=y}$.  Let's introduce
the abbreviations
\begin{eqnarray}
f(x) \equiv \Prob{X=x|Y=1}\\
g(x) \equiv \Prob{X=x|Y=0}
\end{eqnarray}
(Some people would write $f_+$ and $f_-$, or $f_1$ and $f_0$, etc.)  Now
let's re-write the conditional probability of the class given the features, in
terms of the distribution of features in each class:
\begin{eqnarray}
p(x) & \equiv & \Prob{Y=1|X=x}\\
& = & \frac{\Prob{Y=1, X=x}}{\Prob{X=x}}\\
& = & \frac{\Prob{Y=1, X=x}}{\Prob{Y=1, X=x} + \Prob{Y=0, X=x}}\\
& = & \frac{\Prob{X=x|Y=1}\Prob{Y=1}}{\Prob{X=x|Y=1}\Prob{Y=1} + \Prob{X=x|Y=0}\Prob{Y=0}}\\
& = & \frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + g(x)\Prob{Y=0}}
\end{eqnarray}

This is a bit of a mouthful, but we can get a handle on it by considering
some extreme cases.

## How distributions of features matter to classification

### Identical distributions $\Leftrightarrow$ No information $\Leftrightarrow$ No useful classifier

Suppose that $f(x) = g(x)$ for all $x$.  Then we'd get
\[
p(x) = \frac{\Prob{Y=1}}{\Prob{Y=1}+\Prob{Y=0}} = \Prob{Y=1}
\]
which is just the base rate.  So we'd have $\Prob{Y=1|X=x} = \Prob{Y=1}$.
But this is the same as saying that $Y$ and $X$ are statistically
independent, $Y \indep X$, or that the mutual information is 0, $I[X;Y] = 0$.
Whatever threshold $t$ we apply, we'd make the _same_ prediction for
everyone, regardless of their features.

### Non-overlapping distributions $\Leftrightarrow$ Easy classification

On the other hand, suppose that at some point $x$,  $f(x) > 0$ but $g(x) = 0$.   Then we have
\[
p(x) =\frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + 0 \times \Prob{Y=0}} = 1
\]
and we should set $\Yhat=1$ here no matter what threshold we're using.  Similarly
if $f(x) = 0$ but $g(x) > 0$, we should set $\Yhat=0$
for any threshold.  If every point $x$ fell under one or the other of these
two cases, we'd say that the two distributions had "disjoint support", and
we could classify every point with certainty.

### In general...

Now let's consider the more general case, where we're applying a threshold
$t$ to $p(x)$.  The feature point $x$ is one where we set $\Yhat(x)=1$ when $p(x) \geq t$, or
\begin{eqnarray}
p(x) & \geq & t\\
\frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + g(x)\Prob{Y=0}} & \geq & t\\
\frac{1}{1 + \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}}} & \geq & t\\
1 + \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}} & \leq & \frac{1}{t}\\
\frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}} & \leq & \frac{1-t}{t}\\
\frac{f(x) \Prob{Y=1}}{g(x)\Prob{Y=0}} & \geq & \frac{t}{1-t}\\
\frac{f(x)}{g(x)} \frac{\Prob{Y=1}}{\Prob{Y=0}} & \geq & \frac{t}{1-t}
\end{eqnarray}

Notice that this involves the base-rates ($\Prob{Y=0}$, $\Prob{Y=1}$) as well
as the distribution of features in each class.  One way to interpret this last
equation is that the ratio $f(x)/g(x)$ gauges the evidence (one way or the
other) that the features $x$ provide about the class membership of a
particular case.  How strong this evidence has to be before it pushes us into a
decision depends on the threshold ($t$), but also on what we know about the
base-rates.  If the base-rates are very lop-sided, say $\Prob{Y=0} \gg
\Prob{Y=1}$, then we'd need stronger evidence (higher ratio $f(x)/g(x)$) to
push us into saying $Y=1$, i.e., to setting $\Yhat=1$.
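Here is a minimal numerical sketch of that equivalence. The distributions `f` and `g`, the base rate and the threshold are all invented for illustration; the point is only that thresholding $p(x)$ at $t$ and thresholding the likelihood ratio times the prior odds at $t/(1-t)$ pick out the same points.

```r
# Hypothetical class-conditional distributions over a feature with five values
x <- 1:5
f <- c(0.05, 0.10, 0.20, 0.30, 0.35)   # P(X=x | Y=1), assumed
g <- c(0.40, 0.30, 0.15, 0.10, 0.05)   # P(X=x | Y=0), assumed
base.rate <- 0.2                        # P(Y=1), assumed
threshold <- 0.5                        # the threshold t applied to p(x)

# Posterior probability of class 1, from the formula for p(x)
p <- f * base.rate / (f * base.rate + g * (1 - base.rate))

# Rule 1: threshold the posterior probability
yhat.posterior <- as.integer(p >= threshold)

# Rule 2: threshold the likelihood ratio times the prior odds
prior.odds <- base.rate / (1 - base.rate)
yhat.ratio <- as.integer((f / g) * prior.odds >= threshold / (1 - threshold))

data.frame(x, posterior = round(p, 3), likelihood.ratio = round(f / g, 3),
           yhat.posterior, yhat.ratio)
```

Both rules say $\Yhat=1$ only at $x=5$. Notice $x=4$: the likelihood ratio there is 3, which favors class 1, but with a base rate of 0.2 that evidence isn't strong enough to get $p(x)$ over 0.5.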

## Information again

Let's revert to information theory.  We've seen that we can't classify at all
when $I[X;Y] = 0$ (at least not using $X$).  More generally, $I[X;Y] =
H[Y]-H[Y|X]$, and $H[Y|X]$ is precisely the entropy of the distribution $p(X)$
(averaged over $X$).  So larger values of $I[X;Y]$ mean smaller values of
$H[Y|X]$, which in turn means that $p(x)$ is closer to 0 or 1 for more values
of $x$, which means that more and more accurate classification is possible.
When we consider classification trees, we will look directly at trying to
(greedily) minimize $H[Y|X]$.
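To make the connection to the two extreme cases concrete, here is a sketch which computes $I[X;Y] = H[Y] - H[Y|X]$ for a discrete feature. The function names (`entropy`, `mutual.information`) and the example distributions are my own choices for illustration, not anything standard.

```r
# Entropy (in bits) of a probability vector, treating 0*log(0) as 0
entropy <- function(prob) {
  prob <- prob[prob > 0]
  -sum(prob * log2(prob))
}

# Mutual information I[X;Y] = H[Y] - H[Y|X] for binary Y and discrete X,
# given f(x) = P(X=x|Y=1), g(x) = P(X=x|Y=0) and the base rate P(Y=1)
mutual.information <- function(f, g, base.rate) {
  p.x <- f * base.rate + g * (1 - base.rate)   # marginal P(X=x)
  keep <- p.x > 0                              # skip impossible feature values
  p <- f[keep] * base.rate / p.x[keep]         # posterior p(x)
  # H[Y|X]: entropy of (p(x), 1-p(x)), averaged over the distribution of X
  h.y.given.x <- sum(p.x[keep] *
                     sapply(p, function(q) entropy(c(q, 1 - q))))
  entropy(c(base.rate, 1 - base.rate)) - h.y.given.x
}

# Identical distributions: no information
mi.same <- mutual.information(f = c(0.5, 0.5), g = c(0.5, 0.5),
                              base.rate = 0.3)
# Disjoint supports: X determines Y, so I[X;Y] = H[Y]
mi.disjoint <- mutual.information(f = c(1, 0), g = c(0, 1), base.rate = 0.3)
# Something in between
mi.partial <- mutual.information(f = c(0.8, 0.2), g = c(0.3, 0.7),
                                 base.rate = 0.3)
c(same = mi.same, partial = mi.partial, disjoint = mi.disjoint)
```

The identical-distributions case gives (numerically) zero, the disjoint-support case gives all of $H[Y]$, and the intermediate case falls in between.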

### Feature selection

Folklore in data mining is that you should worry more about finding the right
features than about exactly what model or classifier you apply to them.  We've
now seen the rational basis for this folklore.  Informative features are ones
whose distributions differ substantially across classes.  With un-informative
features, it doesn't matter which technique we apply, there's just no
information to be had.  Once you've found informative features, it's fairly
routine to try out different standard techniques (linear classifiers, nearest
neighbors, trees, etc., etc.) on the same feature set.  In any particular
problem, some of these techniques will have an easier time extracting
information in the features.  But the information has to be there in the
features to begin with.

# Trading off false positive rates against false negative rates

Another way to tackle the problem of classifier design is to frankly admit that
we have _two_ competing objectives, and try to balance them.  One objective is
to have low false positive rates, and the other objective is to have low false
negative rates.  For historical and rhetorical reasons, it's conventional
here to consider one minus the false negative rate,
\[
\Prob{\Yhat=1|Y=1}
\]
which is also called the "power", and written $\beta$.

## The optimization problem

When we choose a classifier, we get a benefit in the form of power, $\beta$, and
pay a cost in terms of the false positive rate, FPR.  To get a handle on this,
remember that for every point $x$, either $\Yhat(x) = 1$ or $\Yhat(x) = 0$.
Call the set of points where $\Yhat(x) = 1$, $S$.  (This is called the
**acceptance region** or **decision region**, and its boundary is called the
**decision boundary** or **critical boundary**.)  Then
\[
\beta(S) \equiv \Prob{\Yhat=1|Y=1} = \sum_{x \in S}{f(x)} ~ \text{or} ~ \int_{S}{f(x) dx}
\]
depending on whether the features are discrete or continuous.  (I did the
continuous version in class, but will do the discrete version here, just to
encourage mental flexibility on your part.)  This is our over-all benefit
to using a classifier with the decision region $S$.  Against this, the cost,
in the form of false positives, is
\[
FPR(S) = \Prob{\Yhat=1|Y=0} = \sum_{x \in S}{g(x)}
\]

Every possible classifier gives us some combination of power and false positive
rate.  We can imagine plotting power against false positive rate for each
classifier.  Some classifiers are bad: they have high false positive rates and
low power.  Many other classifiers will "dominate" those bad ones, because they
have a lower false positive rate and higher power.  But then there will be
classifiers which are harder to rank: one has a strictly higher power, but the
other has a strictly lower false positive rate.  There it's ambiguous which one
we should prefer.  On our plot, we'll see a curve along which increasing the
benefit (power) can only happen if we also increase the false positive rate.
This curve is called the **possibility frontier**, or the **Pareto
frontier**[^Pareto].  To get some grasp of this, think about saying $\Yhat=1$
for every $x$, i.e., setting $S$ to be the whole feature space.  This would
give $\beta=1$, but also $FPR=1$.  Against that, setting $\Yhat=0$, or
shrinking $S$ to the empty set, would give $FPR=0$, but also $\beta=0$.  So
we'd usually[^nonoverlapping] expect the frontier to connect $(0,0)$ to
$(1,1)$.


[^Pareto]: After Vilfredo Pareto, an economist who pioneered the study of optimization under competing objectives.

[^nonoverlapping]: If the distributions $f$ and $g$ don't overlap, we can get 0 FPR with positive power, and/or power 1 without FPR 1.

Now the usual way to deal with two competing objectives is to introduce a
**price**, letting us say how much one unit of one objective is worth in
terms of the other.  Here the two objectives are having high power and low
false positive rate, so we need to set a price for power in terms of
false positives.  Let's call this price $r$.  So we want to maximize
\[
\beta - r \times FPR
\]
or
\[
\sum_{x\in S}{f(x)} - r \sum_{x \in S}{g(x)} = \sum_{x \in S}{f(x) - rg(x)}
\]
Now each summand is either positive, negative or 0.  To make the _sum_ as big
as possible, we should include every $x$ where the summand is positive in the
set $S$, and exclude every $x$ where the summand is negative.  (We don't care
about points where the summand is 0.)  That is, what we want to do is,
\[
\Yhat(x) = \left\{ \begin{array}{cc} 1 & f(x) - rg(x) \geq 0\\
                                     0 & f(x) - rg(x) < 0
	               \end{array}
	        \right.
\]
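As a sanity check on this argument, here is a sketch with a feature that takes only six values, so that we can afford to try every possible region $S$. The distributions and the price $r$ are invented for illustration.

```r
# Hypothetical distributions over six feature values (assumed for illustration)
f <- c(0.05, 0.10, 0.15, 0.20, 0.20, 0.30)   # P(X=x | Y=1)
g <- c(0.30, 0.25, 0.20, 0.10, 0.10, 0.05)   # P(X=x | Y=0)
r <- 1.5                                     # price, assumed

# The rule from the text: put x in S exactly when f(x) - r*g(x) >= 0
in.S <- (f - r * g) >= 0
power <- sum(f[in.S])
fpr <- sum(g[in.S])
net.benefit <- power - r * fpr

# Brute-force check: evaluate power - r*FPR for all 2^6 possible regions
all.subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(f))))
benefit.of <- function(S) sum(f[S]) - r * sum(g[S])
best.by.search <- max(apply(all.subsets, 1, benefit.of))

c(power = power, fpr = fpr, net.benefit = net.benefit,
  best.by.search = best.by.search)
```

With these numbers the rule selects the last three feature values, giving power 0.7 and a false positive rate of 0.25, for a net benefit of 0.325; the exhaustive search can't do any better.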

## Constraining false positive rates

We might start instead by saying that we want to maximize power with a _constraint_ on the false positive rate.  That is, we set as our problem
\[
\max_{S: FPR(S) \leq \alpha}{\beta(S)}
\]
where $\alpha$ is our maximum allowed false positive rate: maybe $\alpha=0.05$,
or $\alpha=0.01$, or $\alpha=10^{-6}$ if we really don't like false positives.

(Graphically, if we go back to our plot, with power on the horizontal axis and
false positive rate on the vertical axis, we'd be drawing a horizontal line at
$FPR=\alpha$, only considering points below that line, and then just taking the
right-most point, which is the highest power classifier that obeys the constraint.)

The usual trick for solving constrained optimization problems is to turn
them into unconstrained ones by using a Lagrange multiplier.  We had
the old **objective function**
\[
\beta(S) = \sum_{x \in S}{f(x)}
\]
subject to the constraint
\[
FPR(S) \leq \alpha
\]
or
\[
FPR(S) - \alpha \leq 0
\]
or
\[
\sum_{x\in S}{g(x)} - \alpha \leq 0
\]
The **Lagrangian** combines the objective function plus a **Lagrange multiplier** times the constraint equation:
\[
\beta(S) - r(FPR(S) - \alpha)
\]
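The actual problem is now to optimize the Lagrangian over _both_ $S$ and the Lagrange multiplier, maximizing over $S$ and minimizing over $r \geq 0$ (so that any region which violates the constraint is penalized without limit),
\[
\max_{S}{\min_{r \geq 0}{\beta(S) - r(FPR(S) - \alpha)}}
\]
Equivalently,
\[
\max_{S}{\min_{r \geq 0}{\sum_{x \in S}{f(x)} - r\sum_{x \in S}{g(x)} + r\alpha}}
\]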

Now, when we do the maximization, there will be _some_ value of the multiplier
$r$ which will enforce the constraint $\alpha$, say $r^*$.  And once we know
$r^*$ what we're doing is just
\[
\max_{S}{\sum_{x\in S}{f(x) - r^* g(x)}}
\]
and we've seen how to do that: include $x$ in $S$ if, but only if,
\[
\frac{f(x)}{g(x)} \geq r^*
\]
In other words, the Lagrange multiplier looks just like a price.  Economists
call a price which enforces a constraint a "shadow price", and so $r^*$ is the
shadow price of power.

### Summing up on the power vs. FPR trade-off

If we want to enforce a limit on the false positive rate, $FPR \leq \alpha$,
but then maximize the power, we should follow the rule
\[
\Yhat(x) = \left\{ \begin{array}{cc} 1 & f(x)/g(x) \geq r\\
                                     0 & f(x)/g(x) < r
	               \end{array}
	       \right.
\]
for some threshold $r$ which is a function of $\alpha$.

Notice that _this_ classification rule does not involve the base rates at all.
This is sensible, since neither the false positive rate nor the power involves
the base rate.
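For a concrete case, here is a sketch where both class-conditional distributions are Gaussian with unit variance (an assumption made purely so that everything has a closed form). Then $f(x)/g(x) = e^{x - 1/2}$ is increasing in $x$, so thresholding the ratio is the same as thresholding $x$, and we can read off the threshold that goes with a given $\alpha$.

```r
# Assumed class-conditional densities: X|Y=1 ~ N(1,1), X|Y=0 ~ N(0,1)
f <- function(x) dnorm(x, mean = 1, sd = 1)
g <- function(x) dnorm(x, mean = 0, sd = 1)
alpha <- 0.05                       # maximum allowed false positive rate

# f(x)/g(x) = exp(x - 1/2) is increasing in x, so the region where the
# ratio exceeds a threshold has the form S = {x : x >= cutoff}
cutoff <- qnorm(1 - alpha, mean = 0, sd = 1)  # P(X >= cutoff | Y=0) = alpha
r.star <- f(cutoff) / g(cutoff)               # implied threshold on f/g

fpr <- pnorm(cutoff, mean = 0, sd = 1, lower.tail = FALSE)
power <- pnorm(cutoff, mean = 1, sd = 1, lower.tail = FALSE)
c(cutoff = cutoff, r.star = r.star, fpr = fpr, power = power)
```

With $\alpha = 0.05$ this gives a cutoff near 1.64, a ratio threshold near 3.1, and power of roughly 0.26. The base rate $\Prob{Y=1}$ never appears anywhere in the calculation.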


### Likelihood ratios, hypothesis tests, the Neyman-Pearson lemma

In your other statistics classes, you'll have seen a lot of hypothesis tests,
many of which involve the likelihood ratio, or (equivalently) the log of the
likelihood ratio.  The way we've analyzed classifiers is exactly parallel to a
hypothesis test.  $g(x)$ is the likelihood of the features $X=x$ under the null
hypothesis that $Y=0$, while $f(x)$ is the likelihood under the alternative
hypothesis that $Y=1$.

It may seem intuitive that a test should compare the likelihoods, but why
compare them through their ratio rather than say $f(x) - g(x)$, or for that
matter $f^2(x) - g^2(x)$?  We have now seen the answer: the ratio is, uniquely,
what we need to worry about if we want to maximize power while controlling the
false positive rate.  As a result about hypothesis testing, this was first
shown by Jerzy Neyman and Egon Pearson in the 1930s, and so it's sometimes
called "the Neyman-Pearson lemma" or "the Neyman-Pearson theorem".

# Classifier design

Our analyses of classifier performance have suggested three approaches to how we
might design classifiers.

1. Estimate $p(x)=\Prob{Y=1|X=x}$ and threshold that function.
2. Estimate $\frac{f(x)}{g(x)} = \frac{\Prob{X=x|Y=1}}{\Prob{X=x|Y=0}}$, perhaps by first estimating $f(x)$ and $g(x)$, and threshold that function.
3. Pick the region $S$ where $\Yhat=1$ (or, equivalently, the boundary
of this region) to have low inaccuracy.

Strategy (1) relies on what's called the **posterior probability** of being in
each class.  (The **prior probability** of being in class 1 is just the base
rate, $\Prob{Y=1}$.)  Strategy (2) relies on the likelihood ratio, and is
sometimes called a "Neyman-Pearson" approach.  Strategy (3) is what we might
call a "direct" approach, though I don't think it has a common name.

As always, there are advantages and disadvantages to each approach.

## The posterior-probability (feature-conditional) approach

**Pros**:

- We just need to estimate the distribution
of a one-dimensional, indeed binary, variable $Y$, conditional on the features
$X$.
   + Basically, we need to have some way of defining "similar" features $x$,
and then estimate $\Prob{Y=1|X=x}$ by the proportion of $Y=1$ cases among
the similar $x$'s.
   + We will see in the next lecture some ways of dividing
up the feature space which aim to have this property, and to make it fast
and easy to find similar $x$'s.
- If we care about our _average_ or _expected_ costs, we've seen that the optimal thing to do is to threshold $p(x)$, so this is the right way to control expected costs.
- It's sensitive to the base rates, so it can achieve low over-all inaccuracy.

**Cons**:

- It relies (implicitly) on the base rates, so if the data are very un-balanced,
it will be hard for it to advance over just always predicting the more
probable class.
- It relies (implicitly) on the base rates, so there'll
be trouble if those change.
- We need to decide on the relative costs of the two kinds of errors.  (This
choice is implicit in the choice of the threshold we apply to $p(x)$.)

## The Neyman-Pearson (class-conditional) approach

**Pros**:

- It's insensitive to base rates, so it doesn't really care about un-balanced data or
about changes in the base rates.
- If we specifically need to keep a lid on the false positive rate, it's exactly the right thing to do.
  + Imagine "false positive" is something like "Keeping a harmless person in
    jail", or "Denying a loan to someone who'd repay it".
- We really only need the _ratio_ $f(x)/g(x)$, and there are tricks for estimating that without first estimating $f(x)$ and $g(x)$.

**Cons**:

- It's insensitive to the base rates, so it usually won't have as high an over-all accuracy as the posterior-probability approach.
- We have to decide on an acceptable false-positive rate $\alpha$, or equivalently a price for power in terms of false positives, and this can feel arbitrary.

## The direct (choose-a-region) approach

```{r, echo=FALSE}
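# Note: R evaluates 2019-11-18 as arithmetic, so the seed is 1990
set.seed(2019-11-18)
# Thirty points, uniformly distributed over the unit square
x1 <- runif(n=30)
x2 <- runif(n=length(x1))
# P(Y=1|X=x) falls off with distance from the origin
probs <- exp(-(x1^2+x2^2))
# Draw each label independently, with its own probability of being 1
y <- rbinom(n=length(x1), size=1, prob=probs)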
plot(0, xlim=c(0,1), ylim=c(0,1), xlab=expression(x[1]),
     ylab=expression(x[2]),
     type="n")
text(x=x1, y=x2, labels=y)
```

To illustrate, imagine seeing data like the figure above, and trying to
classify it with a linear classifier.  That means trying to find a straight
line which divides the positive, $Y=1$ points from the negative, $Y=0$ points.
Typically, when people try to do this, they aim to find the line with the
minimum classification error.  You can convince yourself that there is no
single straight line which will achieve an error rate of 0 on that data, but
also that some lines are better than others.  In symbols, we have a
family[^class] $\mathcal{S}$ of possible decision regions $S$, and we want to
find
\[ \min_{S \in \mathcal{S}}{\frac{1}{n}\sum_{i=1}^{n}{\mathbb{1}\left(Y_i \neq \mathbb{1}(X_i \in S)\right)}} \]
where the indicator function $\mathbb{1}(X_i \in S) =1$ if $X_i \in S$ and
$=0$ otherwise.

[^class]: The more usual jargon word here is "class", which I used in lecture, but this collides with "class" for whether $Y=1$ or $Y=0$, so I'll try to avoid it.
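As a crude sketch of what this minimization looks like in practice, the following code searches a grid of straight lines for the one with the lowest training error. It simulates its own data from the same kind of model as the figure (with a different, arbitrary seed, so the points are not the ones plotted), and the grid search is a stand-in I chose for simplicity, not how anyone would fit a linear classifier seriously.

```r
set.seed(462)   # arbitrary seed, for reproducibility only
n <- 30
x1 <- runif(n)
x2 <- runif(n)
y <- rbinom(n, size = 1, prob = exp(-(x1^2 + x2^2)))

# Family of regions: S = {x : a*x1 + b*x2 <= c}, with (a, b) = (cos, sin) of
# an angle.  Angles cover the full circle, so either side of each line can
# be the positive side.
candidates <- expand.grid(angle = seq(0, 2 * pi, length.out = 73),
                          offset = seq(-1.5, 1.5, by = 0.05))

# Fraction of training points where Y_i disagrees with 1(X_i in S)
training.error <- function(angle, offset) {
  yhat <- as.integer(cos(angle) * x1 + sin(angle) * x2 <= offset)
  mean(y != yhat)
}
candidates$error <- mapply(training.error,
                           candidates$angle, candidates$offset)

best <- candidates[which.min(candidates$error), ]
# Error of always guessing the more common label, for comparison
baseline <- min(mean(y), 1 - mean(y))
best
```

Because the grid includes lines which put the whole unit square on one side, the best line can never do worse on the training data than always guessing the more common label (`baseline`).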

There is nothing magic about using linear classifiers.  We might, for instance,
instead try to divide the points by enclosing all the positive points inside a
rectangle, and then (in this case) we could do better than with purely linear
separators.  This amounts to changing the family of possible regions
$\mathcal{S}$.  Many families are defined through complicated functions of the
features --- a popular choice is to take a bunch of nonlinear transformations
of the features, say $\phi_1(x), \phi_2(x), \ldots, \phi_p(x)$, bundle them into
a vector $\phi(x)$, and then apply linear classifiers[^circle] to $\phi$.

[^circle]: As an example, if $x$ is two dimensional, and $\phi(x) = (x_1, x_2, x_1^2, x_2^2)$, a linear classifier applied to $\phi$ can pick out points inside (or outside) a circle, which you couldn't do with the raw features.  (What would the linear classifier for $\phi$ look like?)

In general, the bigger and more flexible we make $\mathcal{S}$, the lower we
will make the in-sample error, and the lower the generalization error _can_ be.
But there are two costs to using very flexible families.

1. There is a computational cost to searching for the best $S$ over a big family.
2. There is a real danger of over-fitting, or, as we say, "memorizing the noise" in the training data.

(1) is real but task-specific; (2) is a more general statistical issue, so I'll
elaborate on it.  Suppose we could use _very_ flexible regions, say a union of
a large number of square boxes.  Then we can get a region $S$ which looks like
the next figure.

```{r, echo=FALSE}
plot(0, xlim=c(0,1), ylim=c(0,1), xlab=expression(x[1]),
     ylab=expression(x[2]),
     type="n")
# Draw little grey boxes around each 1 (a horrible idea as a real
# classifier)
positive.points <- which(y==1)
box.width <- 0.1
rect(xleft=x1[positive.points]-box.width/2,
     ybottom=x2[positive.points]-box.width/2,
     xright=x1[positive.points]+box.width/2,
     ytop=x2[positive.points]+box.width/2,
     col="lightgrey", border=NA)
text(x=x1, y=x2, labels=y)
```

If there is any noise in the labels at all, the $S$ we get will, in part,
reflect that noise.  We say that $S$ "memorizes" the noise, as well as any true
signal about where the classification boundary should be.  The next figure
shows the same classification boundary, with the same values of the features,
but with $Y$ drawn independently from the same distribution $p(x)$, and you can
see that it's not doing so well --- it now misses some positive points and
includes some negative points.

```{r, echo=FALSE}
y.prime <- rbinom(n=length(x1), size=1, prob=probs)
plot(0, xlim=c(0,1), ylim=c(0,1), xlab=expression(x[1]),
     ylab=expression(x[2]),
     type="n")
rect(xleft=x1[positive.points]-box.width/2,
     ybottom=x2[positive.points]-box.width/2,
     xright=x1[positive.points]+box.width/2,
     ytop=x2[positive.points]+box.width/2,
     col="lightgrey", border=NA)
text(x=x1, y=x2, labels=y.prime, col="red")
```

So flexibility or "capacity" has an advantage (we can fit more) but also a
disadvantage (our results are less stable and more vulnerable to noise).
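To put rough numbers on this, here is a sketch which rebuilds the little-boxes classifier on freshly simulated data (same form of $p(x)$ as the figures, but its own arbitrary seed, so not the plotted points). It compares the error on the labels used to build the boxes with the average error over many independent re-draws of the labels at the same feature values.

```r
set.seed(662)   # arbitrary seed, for reproducibility only
n <- 30
x1 <- runif(n)
x2 <- runif(n)
probs <- exp(-(x1^2 + x2^2))        # same form of p(x) as in the figures
y <- rbinom(n, size = 1, prob = probs)

# The "little boxes" region: a point is in S if it lies within half a box
# width, in both coordinates, of some training point labeled 1
box.width <- 0.1
centers <- which(y == 1)
in.boxes <- function(u1, u2) {
  any(abs(u1 - x1[centers]) <= box.width / 2 &
      abs(u2 - x2[centers]) <= box.width / 2)
}
yhat <- as.integer(mapply(in.boxes, x1, x2))

# Error on the labels the boxes were built from
error.in.sample <- mean(y != yhat)

# Average error over many fresh draws of the labels at the same features
fresh.errors <- replicate(1000,
                          mean(rbinom(n, size = 1, prob = probs) != yhat))
c(in.sample = error.in.sample, fresh.labels = mean(fresh.errors))
```

Every positive training point sits in its own box, so the only in-sample errors are negative points which happen to land inside a neighbor's box. I'd expect the error on fresh labels to be a good deal larger, but the exact numbers depend on the seed.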


One way to combat the vulnerability to noise is to impose some sort of
geometric constraint.  A common one is to require a certain minimum distance
between any point and the classification boundary --- to insist on only using
classifiers with a large "margin".  Such geometric constraints rule out shapes
with very irregular, erratic boundaries, which is (one reason) why people talk
about this as "regularizing" the problem of finding an optimal classifier.
Just as we saw above, constraining classifiers to have a large margin is the
same as penalizing classifiers based on their margin.  Either way, we would
typically use cross-validation to pick the size of the constraint or the
penalty.
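Here is a minimal sketch of using cross-validation to pick a tuning parameter. To keep it self-contained in base R, I use the box classifier from the figures rather than a margin-based one, with the box width playing the role of the constraint; the simulated data, the candidate widths and the number of folds are all arbitrary choices of mine.

```r
set.seed(2019)   # arbitrary seed, for reproducibility only
n <- 200
x1 <- runif(n)
x2 <- runif(n)
y <- rbinom(n, size = 1, prob = exp(-(x1^2 + x2^2)))
sim <- data.frame(x1, x2, y)

# Predict 1 at a test point if it falls in a box of the given width
# centered on some positive training point
predict.boxes <- function(train, test, width) {
  centers <- train[train$y == 1, ]
  sapply(seq_len(nrow(test)), function(i) {
    as.integer(any(abs(test$x1[i] - centers$x1) <= width / 2 &
                   abs(test$x2[i] - centers$x2) <= width / 2))
  })
}

widths <- c(0.02, 0.05, 0.1, 0.2, 0.4, 0.8)      # candidate box widths
n.folds <- 5
fold <- sample(rep(1:n.folds, length.out = n))   # random fold assignment

# Held-out error for each width, averaged over the folds
cv.error <- sapply(widths, function(w) {
  mean(sapply(1:n.folds, function(k) {
    test <- sim[fold == k, ]
    train <- sim[fold != k, ]
    mean(test$y != predict.boxes(train, test, w))
  }))
})
best.width <- widths[which.min(cv.error)]
rbind(widths, cv.error)
```

The same loop works for any classifier with a tuning knob: swap `predict.boxes` for the fitting-and-predicting code, and `widths` for the grid of constraint or penalty sizes.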

(People have developed various ways of measuring model capacity, to help
quantify this trade-off.  Many of them come down to variations on seeing how
well the family could seem to classify labels which were pure noise, i.e., how
well it would seem to do when $p(x)=0.5$ for all $x$.  These are important for
theory, have helped guide the development of new classifiers, and provide
important sanity checks on how well we can hope to do: see for instance
@Bartlett-Mendelson-on-Rademacher-complexity.  But in practice, people
overwhelmingly use cross-validation to assess their classifiers.)


**Pros**:

- In the direct, region-finding approach, we don't _explicitly_ need to estimate
  any probabilities.  We are still, implicitly, guessing at where $p(x)
  \geq 0.5$ (or whatever our threshold is), but we're just trying to find that
  boundary, rather than estimating the whole function $p(x)$.
  + We can also apply the shape-finding approach in a more Neyman-Pearson way:
    we'd first find the regions in $\mathcal{S}$ with acceptably low false
    positive rates, and then among them find the region with the highest power.
- For many families, there are very efficient algorithms for quickly finding
  the optimal classification region (perhaps with regularization).

**Cons**:

- Even though we're not explicitly considering costs of different kinds of
  errors, a certain ratio between the costs is implicit in the error rate
  we decide to minimize.  We might not like that ratio if we think about
  it consciously.
- Because we're not estimating any probabilities, just finding a decision
  region leaves us uncertain about what to do if costs shift (or we change our
  mind about costs).
- Because we're not estimating any probabilities, we don't have any sense of
  _confidence_ in our classifications (are we 99% sure this is a positive case,
  or just 51% sure?).

# Summing up so far

1. We can use features $X$ to predict class labels $Y$ when, and only when,
   the distribution $\Prob{X=x|Y=y}$ changes with the label $y$.  The bigger
   the difference in those distributions, the more ability we will have to
   do classification on the basis of $X$.
2. Error rates involve the ratios of distributions
   $\Prob{X=x|Y=1}/\Prob{X=x|Y=0}$ (and possibly the base rates as well).
3. Classifier design strategies can be understood as either trying to estimate
   (and threshold) the posterior probability $\Prob{Y=y|X=x}$, as trying to
   estimate (and threshold) the likelihood ratio
   $\Prob{X=x|Y=1}/\Prob{X=x|Y=0}$, or as trying to directly find a decision
   boundary with good error rates.  All three strategies have their
   advantages and disadvantages.

# Predicting demographics from web search

@Goel-Hofman-who-does-what-on-the-web used the full browsing history of about a
quarter of a million (US) Web users primarily to examine how different
demographic groups --- defined by age, sex, race, education and household
income --- used the Web differently.  If we think of each demographic category
as a label $Y$, and which websites were visited (and how often) as features
$X$, they were primarily interested in $\Prob{X=x|Y=y}$, and how this differed
across demographic categories.  For instance, people with a post-graduate
degree visited news sites about three times as often as people with only a high
school degree.  (What's $X$ and what's $Y$ in that example?)  It may or may
not surprise you to learn that they found large differences in browsing
behavior across demographic groups.  To steal an example from the paper,
men are much more likely than women to visit ESPN, and women are more likely
than men to visit Lanc&ocirc;me.

You can now see where this is going.  By point (1) in our summary above, the
fact that $\Prob{X=x|Y=y}$ differs across classes $y$ means that we can use
browsing behavior ($X$) to predict demographic classes ($Y$).  Someone
who knows what websites you browse can predict your age, sex, race, education,
and household income.  To demonstrate this, @Goel-Hofman-who-does-what-on-the-web used the 10,000 most popular websites, creating a binary feature for each site, $X_i=1$ if site $i$ was visited at all during the study and $X_i=0$ if not.
They then used a linear classifier on these features, with one of the geometric
margin constraints I mentioned.  The next figure shows how well they were
able to predict each of the five demographic variables. 

![](goel-et-al.png)

_Detail of Figure 8 from @Goel-Hofman-who-does-what-on-the-web, showing the ability of a (regularized) linear classifier to predict demographic variables based on web browsing history.  Dots show the achieved accuracy, and the $\times$ shows the frequency of the more common class._

I include this not because the precise accuracies matter ---
there's no reason to think this is the best performance attainable, even with
these features --- but rather to prove the point that this kind of prediction
can be done.  It doesn't matter _why_ different demographic groups have
different browsing habits, just _that_ those distinctions make a difference.
This lets us (or our machines) work backwards from browsing to
accurate-but-not-perfect inferences about demographic categories.

## Inference

Now imagine a recidivism prediction system which does not, officially or
explicitly, consider sex, but _does_ have access to the defendant's web
browsing history.  (No such system exists, to the best of my knowledge, but there's
no intrinsic limit on its creation.)  We know, from Goel et al., that sex can
be predicted with about 80% accuracy from browsing history (at least).  A
nefarious designer who wanted to include sex as a predictor for recidivism, but
to hide doing so, could therefore use browsing history to predict sex, and then
include predicted sex in their model.  A less nefarious designer might end up
doing something equivalent without even realizing it, say by slightly
increasing the predicted risk of those who visit ESPN and slightly reducing the
prediction for those who visit Lanc&ocirc;me.  Either designer might, when
pressed, say that they're not claiming to say _why_ ESPN predicts recidivism,
but facts are facts, and are you going to argue with the math?

In fact, we can go further.  We know that younger people have a higher risk of
violence than older people, that poorer people have a higher risk than richer
people, that men have a higher risk than women, that blacks have a higher risk
than whites[^levoy-allen], that less educated people have a higher risk than more educated
people[^crime].  A system which just used Web browsing to sort people into
these five demographic categories could[^hedging], therefore, achieve
non-trivial predictive power.  You can even imagine designing such a system
innocently, where we just try to boil down a large number of features into
(say) a five-dimensional space, before using them to predict violence, without
realizing that those five dimensions correspond to age, sex, race, income and
education.

[^levoy-allen]: There are multiple reasons for this association.  One is a long-standing history (cf. @Dollard-on-Southerntown) of segregating African-Americans into neighborhoods which are under-policed (in the sense that violence often goes unpunished by the forces of the law) and over-policed (in the sense that interactions with the police are often hostile).  This sets up a dynamic where people in those neighborhoods don't trust the police, which makes the police ineffective, which makes being known for willingness to use violence a survival strategy, which etc., etc.  @Leovy-ghettoside gives a good account of this feedback loop from (mostly) the side of the police; @Allen-cuz gives a glimpse of what it looks like from the other side.

[^crime]: Cathy O'Neil would remind us that many of these would flip around if we considered risk of _financial_ crimes rather than violence.

[^hedging]: I say "could", because there's some error in all these classifications, and it's _possible_ that these errors would cancel out the ability to predict violence from demographics.

None of this really relies on the features being Web browsing history;
anything whose distribution differs across demographic groups will do.

# Some more reading


On the Neyman-Pearson approach to classifiers, see @Scott-Nowak-NP-approach,
@Rigollet-Xin-NP-classification and
@Tong-plug-in-Neyman-Pearson-classification.  The Neyman-Pearson lemma itself
goes back to @Neyman-Pearson-np-lemma (where they never call it a lemma).  The
heuristic cost-benefit derivation of it is, so far as I know, my own invention.


# References