\[ \newcommand{\Yhat}{\hat{Y}} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\indep}{\perp} \]

We’re trying to predict a categorical response, label or class \(Y\), usually binary (\(0\) or \(1\)), using a feature or vector of features, \(X\). The prediction \(\Yhat\) is a function of the feature \(x\), so we’ll also write \(\Yhat(x)\) for the guess at the label of a case with features \(x\). We will be very interested in \(\Prob{Y=1|X=x}\), and abbreviate it by \(p(x)\). The over-all mis-classification error rate, or inaccuracy, is \(\Prob{Y \neq \Yhat}\). This can be decomposed two ways. One way is in terms of the false positive rate, the false negative rate, and the “base rate” at which the two classes occur: \[ \Prob{Y \neq \Yhat} = \Prob{\Yhat=1|Y=0} \Prob{Y=0} + \Prob{\Yhat=0|Y=1}\Prob{Y=1} = FPR \times \Prob{Y=0} + FNR \times \Prob{Y=1} \] The other decomposition is in terms of (one minus) the negative predictive value, (one minus) the positive predictive value, and the probability of making each prediction: \[ \Prob{Y \neq \Yhat} = \Prob{Y=1|\Yhat=0}\Prob{\Yhat=0} + \Prob{Y=0|\Yhat=1}\Prob{\Yhat=1} = (1-NPV) \Prob{\Yhat=0} + (1-PPV) \Prob{\Yhat=1} \]
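As a quick numerical check of the two decompositions, here is a minimal Python sketch. The joint probabilities are made-up numbers, not taken from any data set; the only point is that conditioning on the true class and conditioning on the prediction both recover the same inaccuracy. Note that \(\Prob{Y=1|\Yhat=0}\) is one minus the negative predictive value, and \(\Prob{Y=0|\Yhat=1}\) is one minus the positive predictive value.

```python
# Hypothetical joint probabilities P(Y=y, Yhat=yhat); the numbers are made up.
joint = {
    (0, 0): 0.60,  # true negatives
    (0, 1): 0.10,  # false positives
    (1, 0): 0.05,  # false negatives
    (1, 1): 0.25,  # true positives
}

p_y0 = joint[(0, 0)] + joint[(0, 1)]        # P(Y=0)
p_y1 = joint[(1, 0)] + joint[(1, 1)]        # P(Y=1)
p_pred0 = joint[(0, 0)] + joint[(1, 0)]     # P(Yhat=0)
p_pred1 = joint[(0, 1)] + joint[(1, 1)]     # P(Yhat=1)

fpr = joint[(0, 1)] / p_y0                  # P(Yhat=1 | Y=0)
fnr = joint[(1, 0)] / p_y1                  # P(Yhat=0 | Y=1)
ppv = joint[(1, 1)] / p_pred1               # P(Y=1 | Yhat=1)
npv = joint[(0, 0)] / p_pred0               # P(Y=0 | Yhat=0)

inaccuracy = joint[(0, 1)] + joint[(1, 0)]  # P(Y != Yhat), computed directly
by_true_class = fpr * p_y0 + fnr * p_y1
by_prediction = (1 - npv) * p_pred0 + (1 - ppv) * p_pred1

print(inaccuracy, by_true_class, by_prediction)
```

All three printed numbers should agree up to floating-point rounding.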

If we want to minimize the inaccuracy, we should use the rule^{1} that \(\Yhat=1\) if \(p(x) \geq 0.5\) and \(\Yhat=0\) if \(p(x) < 0.5\). If we attach a cost to each error, say \(c_+\) to false positives and \(c_-\) to false negatives, then we should set \(\Yhat=1\) if \(p(x) \geq t(c_+, c_-)\) and \(\Yhat=0\) if \(p(x) < t(c_+, c_-)\). The shape of the threshold function \(t\) is such that \(t=0.5\) when and only when \(c_+ = c_-\). (Can you remember, or work out, how to find the threshold in terms of the costs \(c_+\) and \(c_-\)?)

\(p(x)\) is the distribution of the *class* conditional on the *features*, \(\Prob{Y=1|X=x}\). It turns out that the success of classification depends on the “inverse” conditional probability, \(\Prob{X=x|Y=y}\). Let’s introduce the abbreviations \[\begin{eqnarray}
f(x) \equiv \Prob{X=x|Y=1}\\
g(x) \equiv \Prob{X=x|Y=0}
\end{eqnarray}\] (Some people would write \(f_+\) and \(f_-\), or \(f_1\) and \(f_0\), etc.) Now let’s re-write the conditional probability of the class given the features, in terms of the distribution of features in each class: \[\begin{eqnarray}
p(x) & \equiv & \Prob{Y=1|X=x}\\
& = & \frac{\Prob{Y=1, X=x}}{\Prob{X=x}}\\
& = & \frac{\Prob{Y=1, X=x}}{\Prob{Y=1, X=x} + \Prob{Y=0, X=x}}\\
& = & \frac{\Prob{X=x|Y=1}\Prob{Y=1}}{\Prob{X=x|Y=1}\Prob{Y=1} + \Prob{X=x|Y=0}\Prob{Y=0}}\\
& = & \frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + g(x)\Prob{Y=0}}
\end{eqnarray}\]
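To make the last line of that derivation concrete, here is a small Python sketch for discrete features. The distributions `f` and `g` and the base rate are hypothetical numbers chosen for illustration, and `posterior` is just a name I have made up for the formula.

```python
def posterior(f_x, g_x, base_rate):
    """P(Y=1 | X=x) from f(x) = P(X=x|Y=1), g(x) = P(X=x|Y=0) and P(Y=1)."""
    numerator = f_x * base_rate
    denominator = numerator + g_x * (1.0 - base_rate)
    if denominator == 0:
        raise ValueError("x has probability zero under both classes")
    return numerator / denominator

# Hypothetical class-conditional distributions over three feature values.
f = {"a": 0.6, "b": 0.3, "c": 0.1}   # P(X=x | Y=1)
g = {"a": 0.1, "b": 0.3, "c": 0.6}   # P(X=x | Y=0)
base_rate = 0.2                      # P(Y=1)

p = {x: posterior(f[x], g[x], base_rate) for x in f}
for x in sorted(p):
    print(x, p[x])
```

At the point `"b"`, where the two class-conditional probabilities are equal, the posterior is just the base rate; a point with `g_x = 0` and `f_x > 0` gets posterior 1 whatever the base rate is.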

This is a bit of a mouthful, but we can get a handle on it by considering some extreme cases.

Suppose that \(f(x) = g(x)\) for all \(x\). Then we’d get \[
p(x) = \frac{\Prob{Y=1}}{\Prob{Y=1}+\Prob{Y=0}} = \Prob{Y=1}
\] which is just the base rate. So we’d have \(\Prob{Y=1|X=x} = \Prob{Y=1}\). But this is the same as saying that \(Y\) and \(X\) are statistically independent, \(Y \indep X\), or that the mutual information is 0, \(I[X;Y] = 0\). Whatever threshold \(t\) we apply, we’d make the *same* prediction for everyone, regardless of their features.

On the other hand, suppose that at some point \(x\), \(f(x) > 0\) but \(g(x) = 0\). Then we have \[ p(x) =\frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + 0 \times \Prob{Y=0}} = 1 \] and we should set \(\Yhat=1\) here no matter what threshold we’re using. Similarly if \(f(x) = 0\) but \(g(x) > 0\), we should set \(\Yhat=0\) for any threshold. If every point \(x\) fell under one or the other of these two cases, we’d say that the two distributions had “disjoint support”, and we could classify every point with certainty.

Now let’s consider the more general case, where we’re applying a threshold \(t\) to \(p(x)\). The feature point \(x\) is one where we set \(\Yhat(x)=1\) when \(p(x) \geq t\), or \[\begin{eqnarray} p(x) & \geq & t\\ \frac{f(x) \Prob{Y=1}}{f(x) \Prob{Y=1} + g(x)\Prob{Y=0}} & \geq & t\\ \frac{1}{1 + \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}}} & \geq & t\\ 1 + \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}} & \leq & \frac{1}{t}\\ \frac{g(x)\Prob{Y=0}}{f(x) \Prob{Y=1}} & \leq & \frac{1-t}{t}\\ \frac{f(x) \Prob{Y=1}}{g(x)\Prob{Y=0}} & \geq & \frac{t}{1-t}\\ \frac{f(x)}{g(x)} \frac{\Prob{Y=1}}{\Prob{Y=0}} & \geq & \frac{t}{1-t} \end{eqnarray}\]

Notice that this involves the base-rates (\(\Prob{Y=0}\), \(\Prob{Y=1}\)) as well as the distribution of features in each class. One way to interpret this last equation is that the ratio \(f(x)/g(x)\) gauges the evidence (one way or the other) that the features \(x\) provide about the class membership of a particular case. How strong this evidence has to be before it pushes us into a decision depends on the threshold (\(t\)), but also on what we know about the base-rates. If the base-rates are very lop-sided, say \(\Prob{Y=0} \gg \Prob{Y=1}\), then we’d need stronger evidence (higher ratio \(f(x)/g(x)\)) to push us into saying \(Y=1\), i.e., to setting \(\Yhat=1\).
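To see how quickly the required evidence grows as the positive class becomes rarer, we can rearrange the last inequality as \(f(x)/g(x) \geq \frac{t}{1-t} \frac{\Prob{Y=0}}{\Prob{Y=1}}\) and tabulate the right-hand side. This is only a sketch: the function name is my own, and it assumes \(0 < t < 1\) and a base rate strictly between 0 and 1.

```python
def required_likelihood_ratio(t, base_rate):
    """Smallest f(x)/g(x) for which p(x) >= t, given P(Y=1) = base_rate.

    Assumes 0 < t < 1 and 0 < base_rate < 1.
    """
    threshold_odds = t / (1.0 - t)
    prior_odds = base_rate / (1.0 - base_rate)
    return threshold_odds / prior_odds

# Same threshold t = 0.5, increasingly lop-sided base rates.
for base_rate in (0.5, 0.1, 0.01):
    print(base_rate, required_likelihood_ratio(0.5, base_rate))
```

With balanced classes any ratio above 1 is enough at \(t=0.5\), but when only one case in a hundred is positive the features have to be roughly 99 times more probable under \(Y=1\) than under \(Y=0\).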

Let’s revert to information theory. We’ve seen that we can’t classify at all when \(I[X;Y] = 0\) (at least not using \(X\)). More generally, \(I[X;Y] = H[Y]-H[Y|X]\), and \(H[Y|X]\) is precisely the entropy of the distribution \(p(X)\) (averaged over \(X\)). So larger values of \(I[X;Y]\) mean smaller values of \(H[Y|X]\), which in turn means that \(p(x)\) is closer to 0 or 1 for more values of \(x\), which means that more and more accurate classification is possible. When we consider classification trees, we will look directly at trying to (greedily) minimize \(H[Y|X]\).
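Here is a sketch of computing \(I[X;Y] = H[Y] - H[Y|X]\) for discrete features, working in bits. The function names are my own, and it assumes `f` and `g` are dictionaries over the same feature values. The two extreme cases from above come out as expected: identical distributions give zero information, and disjoint supports give the full entropy of \(Y\).

```python
from math import log2

def binary_entropy(q):
    """Entropy, in bits, of a binary variable with P(1) = q."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1.0 - q) * log2(1.0 - q)

def mutual_information(f, g, base_rate):
    """I[X;Y] = H[Y] - H[Y|X] for discrete features.

    f[x] = P(X=x|Y=1), g[x] = P(X=x|Y=0), base_rate = P(Y=1).
    Assumes f and g have the same keys.
    """
    h_y_given_x = 0.0
    for x in f:
        p_x = f[x] * base_rate + g[x] * (1.0 - base_rate)   # P(X=x)
        if p_x > 0:
            p_of_x = f[x] * base_rate / p_x                 # p(x) = P(Y=1|X=x)
            h_y_given_x += p_x * binary_entropy(p_of_x)     # average over X
    return binary_entropy(base_rate) - h_y_given_x

# Hypothetical distributions, chosen only to show the two extremes.
same = {"a": 0.6, "b": 0.3, "c": 0.1}
only_a = {"a": 1.0, "b": 0.0}
only_b = {"a": 0.0, "b": 1.0}

print(mutual_information(same, same, 0.3))       # uninformative features
print(mutual_information(only_a, only_b, 0.5))   # disjoint support
```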

Folklore in data mining is that you should worry more about finding the right features than about exactly what model or classifier you apply to them. We’ve now seen the rational basis for this folklore. Informative features are ones whose distributions differ substantially across classes. With un-informative features, it doesn’t matter which technique we apply, there’s just no information to be had. Once you’ve found informative features, it’s fairly routine to try out different standard techniques (linear classifiers, nearest neighbors, trees, etc., etc.) on the same feature set. In any particular problem, some of these techniques will have an easier time extracting information in the features. But the information has to be there in the features to begin with.

Another way to tackle the problem of classifier design is to frankly admit that we have *two* competing objectives, and try to balance them. One objective is to have low false positive rates, and the other objective is to have low false negative rates. For historical and rhetorical reasons, it’s conventional here to consider one minus the false negative rate, \[
\Prob{\Yhat=1|Y=1}
\] which is also called the “power”, and written \(\beta\).

When we choose a classifier, we get a benefit in the form of power, \(\beta\), and pay a cost in terms of the false positive rate, FPR. To get a handle on this, remember that for every point \(x\), either \(\Yhat(x) = 1\) or \(\Yhat(x) = 0\). Call the set of points where \(\Yhat(x) = 1\), \(S\). (This is called the **acceptance region** or **decision region**, and its boundary is called the **decision boundary** or **critical boundary**.) Then \[
\beta(S) \equiv \Prob{\Yhat=1|Y=1} = \sum_{x \in S}{f(x)} ~ \text{or} ~ \int_{S}{f(x) dx}
\] depending on whether the features are discrete or continuous. (I did the continuous version in class, but will do the discrete version here, just to encourage mental flexibility on your part.) This is our over-all benefit to using a classifier with the decision region \(S\). Against this, the cost, in the form of false positives, is \[
FPR(S) = \Prob{\Yhat=1|Y=0} = \sum_{x \in S}{g(x)}
\]

Every possible classifier gives us some combination of power and false positive rate. We can imagine plotting power against false positive rate for each classifier. Some classifiers are bad: they have high false positive rates and low power. Many other classifiers will “dominate” those bad ones, because they have a lower false positive rate and higher power. But then there will be classifiers which are harder to rank: one has a strictly higher power, but the other has a strictly lower false positive rate. There it’s ambiguous which one we should prefer. On our plot, we’ll see a curve along which increasing the benefit (power) can only happen if we also increase the false positive rate. This curve is called the **possibility frontier**, or the **Pareto frontier**^{2}. To get some grasp of this, think about saying \(\Yhat=1\) for every \(x\), i.e., setting \(S\) to be the whole feature space. This would give \(\beta=1\), but also \(FPR=1\). Against that, setting \(\Yhat=0\), or shrinking \(S\) to the empty set, would give \(FPR=0\), but also \(\beta=0\). So we’d usually^{3} expect the frontier to connect \((0,0)\) to \((1,1)\).
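For a small discrete feature space we can enumerate every decision region \(S\) and pick out the frontier by brute force. The sketch below uses made-up distributions `f` and `g` on four feature values; it is only feasible because there are just \(2^4\) regions to check.

```python
from itertools import combinations

# Hypothetical class-conditional distributions on four feature values.
f = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}   # P(X=x | Y=1)
g = {"a": 0.05, "b": 0.15, "c": 0.3, "d": 0.5}   # P(X=x | Y=0)

def power(S):
    """beta(S): sum of f(x) over the decision region S."""
    return sum(f[x] for x in S)

def false_positive_rate(S):
    """FPR(S): sum of g(x) over the decision region S."""
    return sum(g[x] for x in S)

# Every subset of the feature space is a possible decision region.
points = sorted(f)
regions = [frozenset(c) for k in range(len(points) + 1)
           for c in combinations(points, k)]
performance = {S: (false_positive_rate(S), power(S)) for S in regions}

def is_dominated(S):
    """True if another region has FPR no higher and power no lower,
    and differs from S in at least one of the two."""
    fpr_s, pow_s = performance[S]
    return any(fpr_t <= fpr_s and pow_t >= pow_s
               and (fpr_t, pow_t) != (fpr_s, pow_s)
               for fpr_t, pow_t in performance.values())

frontier = sorted(performance[S] for S in regions if not is_dominated(S))
for fpr_s, pow_s in frontier:
    print(f"FPR = {fpr_s:.2f}, power = {pow_s:.2f}")
```

In this example the surviving regions run from the empty set at \((0,0)\) to the whole feature space at \((1,1)\), and power only goes up along the frontier when the false positive rate does too.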

Now the usual way to deal with two competing objectives is to introduce a **price**, letting us say how much one unit of one objective is worth in terms of the other. Here the two objectives are having high power and low false positive rate, so we need to set a price for power in terms of false positives. Let’s call this price \(r\). So we want to maximize \[
\beta - r \times FPR
\] or \[
\sum_{x\in S}{f(x)} - r \sum_{x \in S}{g(x)} = \sum_{x \in S}{f(x) - rg(x)}
\] Now each summand is either positive, negative or 0. To make the *sum* as big as possible, we should include every \(x\) where the summand is positive in the set \(S\), and exclude every \(x\) where the summand is negative. (We don’t care about points where the summand is 0.) That is, what we want to do is, \[
\Yhat(x) = \left\{ \begin{array}{cc} 1 & f(x) - rg(x) \geq 0\\
0 & f(x) - rg(x) < 0
\end{array}
\right.
\]
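We can check the summand-by-summand argument against an exhaustive search, at least on a toy problem. In the sketch below the distributions are hypothetical and the function names are my own; the brute-force search over all subsets is only there to confirm that the simple rule does as well as anything else could.

```python
from itertools import combinations

# Hypothetical class-conditional distributions on four feature values.
f = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}   # P(X=x | Y=1)
g = {"a": 0.05, "b": 0.15, "c": 0.3, "d": 0.5}   # P(X=x | Y=0)

def objective(S, r):
    """beta(S) - r * FPR(S), written as a single sum over S."""
    return sum(f[x] - r * g[x] for x in S)

def price_rule_region(r):
    """The region given by the rule: include x when f(x) - r g(x) >= 0."""
    return {x for x in f if f[x] - r * g[x] >= 0}

def best_region_by_search(r):
    """Brute force: try every subset of the feature space."""
    points = sorted(f)
    regions = [set(c) for k in range(len(points) + 1)
               for c in combinations(points, k)]
    return max(regions, key=lambda S: objective(S, r))

for r in (0.25, 1.0, 4.0):
    rule = price_rule_region(r)
    search = best_region_by_search(r)
    print(r, sorted(rule), objective(rule, r), objective(search, r))
```

As the price \(r\) rises, false positives become more expensive and the region selected by the rule shrinks.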

We might start instead by saying that we want to maximize power with a *constraint* on the false positive rate. That is, we set as our problem \[
\max_{S: FPR(S) \leq \alpha}{\beta(S)}
\] where \(\alpha\) is our maximum allowed false positive rate: maybe \(\alpha=0.05\), or \(\alpha=0.01\), or \(\alpha=10^{-6}\) if we really don’t like false positives.

(Graphically, if we go back to our plot of false positive rate versus power, we’d be drawing a horizontal line at \(FPR=\alpha\), only considering points below that line, and then just taking the right-most point, which is the highest power classifier that obeys the constraint.)

The usual trick for solving constrained optimization problems is to turn them into unconstrained ones by using a Lagrange multiplier. We had the old **objective function** \[
\beta(S) = \sum_{x \in S}{f(x)}
\] subject to the constraint \[
FPR(S) \leq \alpha
\] or \[
FPR(S) - \alpha \leq 0
\] or \[
\sum_{x\in S}{g(x)} - \alpha \leq 0
\] The **Lagrangian** combines the objective function plus a **Lagrange multiplier** times the constraint equation: \[
\beta(S) - r(FPR(S) - \alpha)
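\] The actual problem is now to optimize the Lagrangian over *both* \(S\) and the Lagrange multiplier: we maximize over \(S\), but minimize over \(r \geq 0\), so that violating the constraint is always penalized, \[
\max_{S}{\min_{r \geq 0}{\beta(S) - r(FPR(S) - \alpha)}}
\] Equivalently, \[
\max_{S}{\min_{r \geq 0}{\sum_{x \in S}{f(x)} - r\sum_{x \in S}{g(x)} + r\alpha}}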
\]

Now, when we do the maximization, there will be *some* value of the multiplier \(r\) which will enforce the constraint \(\alpha\), say \(r^*\). And once we know \(r^*\) what we’re doing is just \[
\max_{S}{\sum_{x\in S}{f(x) - r^* g(x)}}
\] and we’ve seen how to do that: include \(x\) in \(S\) if, but only if, \[
\frac{f(x)}{g(x)} \geq r^*
\] In other words, the Lagrange multiplier looks just like a price. Economists call a price which enforces a constraint a “shadow price”, and so \(r^*\) is the shadow price of power.

If we want to enforce a limit on the false positive rate, \(FPR \leq \alpha\), but then maximize the power, we should follow the rule \[ \Yhat(x) = \left\{ \begin{array}{cc} 1 & f(x)/g(x) \geq r\\ 0 & f(x)/g(x) < r \end{array} \right. \] for some threshold \(r\) which is a function of \(\alpha\).
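For discrete features, one way to find such a threshold is to rank the feature values by likelihood ratio and keep adding them to the decision region for as long as the false positive rate stays within the limit. The sketch below does this with hypothetical distributions. It assumes no two feature values share the same ratio, and in the discrete case the achieved FPR can fall short of \(\alpha\) (hitting \(\alpha\) exactly would generally need a randomized rule, which I do not attempt here).

```python
def neyman_pearson_region(f, g, alpha):
    """Threshold the likelihood ratio f(x)/g(x) subject to FPR <= alpha.

    f[x] = P(X=x|Y=1), g[x] = P(X=x|Y=0), over the same feature values.
    Assumes every likelihood ratio is distinct (no ties).
    Returns (region, achieved FPR, achieved power).
    """
    def ratio(x):
        return float("inf") if g[x] == 0 else f[x] / g[x]

    region, fpr, power = set(), 0.0, 0.0
    # Visit points from strongest evidence for Y=1 to weakest.
    for x in sorted(f, key=ratio, reverse=True):
        if fpr + g[x] > alpha:
            break                      # adding x would violate the constraint
        region.add(x)
        fpr += g[x]
        power += f[x]
    return region, fpr, power

# Hypothetical class-conditional distributions on four feature values.
f = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}   # P(X=x | Y=1)
g = {"a": 0.05, "b": 0.15, "c": 0.3, "d": 0.5}   # P(X=x | Y=0)

region, fpr, power = neyman_pearson_region(f, g, alpha=0.25)
print(sorted(region), fpr, power)
```

Nothing in this function takes the base rate \(\Prob{Y=1}\) as an input.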

Notice that *this* classification rule does not involve the base rates at all. This is sensible, since neither the false positive rate nor the power involves the base rate.

In your other statistics classes, you’ll have seen a lot of hypothesis tests, many of which involve the likelihood ratio, or (equivalently) the log of the likelihood ratio. The way we’ve analyzed classifiers is exactly parallel to a hypothesis test. \(g(x)\) is the likelihood of the features \(X=x\) under the null hypothesis that \(Y=0\), while \(f(x)\) is the likelihood under the alternative hypothesis that \(Y=1\).

It may seem intuitive that a test should compare the likelihoods, but why compare them through their ratio rather than say \(f(x) - g(x)\), or for that matter \(f^2(x) - g^2(x)\)? We have now seen the answer: the ratio is, uniquely, what we need to worry about if we want to maximize power while controlling the false positive rate. As a result about hypothesis testing, this was first shown by Jerzy Neyman and Egon Pearson in the 1930s, and so it’s sometimes called “the Neyman-Pearson lemma” or “the Neyman-Pearson theorem”.

Our analysis of classifier performance has suggested three approaches to how we might design classifiers.

1. Estimate \(p(x)=\Prob{Y=1|X=x}\) and threshold that function.
2. Estimate \(\frac{f(x)}{g(x)} = \frac{\Prob{X=x|Y=1}}{\Prob{X=x|Y=0}}\), perhaps by first estimating \(f(x)\) and \(g(x)\), and threshold that function.
3. Pick the region \(S\) where \(\Yhat=1\) (or, equivalently, the boundary of this region) to have low inaccuracy.

Strategy (1) relies on what’s called the **posterior probability** of being in each class. (The **prior probability** of being in class 1 is just the base rate, \(\Prob{Y=1}\).) Strategy (2) relies on the likelihood ratio, and is sometimes called a “Neyman-Pearson” approach. Strategy (3) is what we might call a “direct” approach, though I don’t think it has a common name.

As always, there are advantages and disadvantages to each approach.

**Pros** of strategy (1):

- We just need to estimate the distribution of a one-dimensional, indeed binary, variable \(Y\), conditional on the features \(X\).
- Basically, we need to have some way of defining “similar” features \(x\), and then estimate \(\Prob{Y=1|X=x}\) by the proportion of \(Y=1\) cases among the similar \(x\)’s.
- We will see in the next lecture some ways of dividing up the feature space which aim to have this property, and to make it fast and easy to find similar \(x\)’s.

- If we care about our *average* or *expected* costs, we’ve seen that the optimal thing to do is to threshold \(p(x)\), so this is the right way to control expected costs.
- It’s sensitive to the base rates, so it can achieve low over-all inaccuracy.

**Cons**:

- It relies (implicitly) on the base rates, so if the data are very un-balanced, it will be hard for it to improve on just always predicting the more probable class.
- It relies (implicitly) on the base rates, so there’ll be trouble if those change.
- We need to decide on the relative costs of the two kinds of errors. (This choice is implicit in the choice of the threshold we apply to \(p(x)\).)

**Pros** of strategy (2):

- It’s insensitive to base rates, so it doesn’t really care about un-balanced data or about changes in the base rates.
- If we specifically need to keep a lid on the false positive rate, it’s exactly the right thing to do.
- Imagine “false positive” is something like “Keeping a harmless person in jail”, or “Denying a loan to someone who’d repay it”.

- We really only need the *ratio* \(f(x)/g(x)\), and there are tricks for estimating that without first estimating \(f(x)\) and \(g(x)\).

**Cons**:

- It’s insensitive to the base rates, so it usually won’t have as high an over-all accuracy as the posterior-probability approach.
- We have to decide on an acceptable false-positive rate \(\alpha\), or equivalently a price for power in terms of false positives, and this can feel arbitrary.

To illustrate, imagine seeing data like the figure above, and trying to classify it with a linear classifier. That means trying to find a straight line which divides the positive, \(Y=1\) points from the negative, \(Y=0\) points. Typically, when people try to do this, they aim to find the line with the minimum classification error. You can convince yourself that there is no single straight line which will achieve an error rate of 0 on that data, but also that some lines are better than others. In symbols, we have a family^{4} \(\mathcal{S}\) of possible decision regions \(S\), and we want to find \[ \min_{S \in \mathcal{S}}{\frac{1}{n}\sum_{i=1}^{n}{\mathbb{1}\left(Y_i \neq \mathbb{1}(X_i \in S)\right)}} \] where the indicator function \(\mathbb{1}(X_i \in S) =1\) if \(X_i \in S\) and \(=0\) otherwise, and likewise the outer indicator is 1 when the label and the prediction disagree.
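The same idea is easiest to see in one dimension, where a “linear classifier” is just a cut-off \(c\) and the family of regions is \(\{x: x \geq c\}\). The sketch below searches that family for the lowest training error. The data are made up, and this is a toy analogue of the two-dimensional problem rather than a serious fitting algorithm.

```python
def training_error(threshold, xs, ys):
    """Fraction of cases where y differs from the indicator of x >= threshold."""
    mistakes = sum(1 for x, y in zip(xs, ys) if y != int(x >= threshold))
    return mistakes / len(xs)

def best_threshold(xs, ys):
    """Search the family of regions {x >= c} for the lowest training error.

    Only cut-offs at the observed points matter, plus one cut-off above
    all of them, which corresponds to always predicting 0.
    """
    candidates = sorted(set(xs)) + [max(xs) + 1.0]
    return min(candidates, key=lambda c: training_error(c, xs, ys))

# Hypothetical one-dimensional data in which the classes overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

c = best_threshold(xs, ys)
print(c, training_error(c, xs, ys))
```

Because the classes overlap, no cut-off gets the training error down to 0, but some cut-offs are clearly better than others.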

There is nothing magic about using linear classifiers. We might, for instance, instead try to divide the points by enclosing all the positive points inside a rectangle, and then (in this case) we could do better than with purely linear separators. This amounts to changing the family of possible regions \(\mathcal{S}\). Many families are defined through complicated functions of the features — a popular choice is to take a bunch of nonlinear transformations of the features, say \(\phi_1(x), \phi_2(x), \ldots \phi_p(x)\), bundle them into a vector \(\phi(x)\), and then apply linear classifiers^{5} to \(\phi\).

In general, the bigger and more flexible we make \(\mathcal{S}\), the lower we will make the in-sample error, and the lower the generalization error *can* be. But there are two costs to using very flexible families.

1. There is a computational cost to searching for the best \(S\) over a big family.
2. There is a real danger of over-fitting, or, as we say, “memorizing the noise” in the training data.

(1) is real but task-specific; (2) is a more general statistical issue, so I’ll elaborate on it. Suppose we could use *very* flexible regions, say a union of a large number of square boxes. Then we can get a region \(S\) which looks like the next figure.

If there is any noise in the labels at all, the \(S\) we get will, in part, reflect that noise. We say that \(S\) “memorizes” the noise, as well as any true signal about where the classification boundary should be. The next figure shows the same classification boundary, with the same values of the features, but with \(Y\) drawn independently from the same distribution \(p(x)\), and you can see that it’s not doing so well — it now misses some positive points and includes some negative points.

So flexibility or “capacity” has an advantage (we can fit more) but also a disadvantage (our results are less stable and more vulnerable to noise).
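Here is a small simulation of that trade-off, under assumptions I have made up for illustration: a one-dimensional feature, a hypothetical \(p(x)\) equal to 0.9 on the right half of the unit interval and 0.1 on the left, and a “flexible” rule which simply memorizes the training label at each training point (the analogue of putting a tiny box around every positive point). The labels are then re-drawn at the same feature values.

```python
import random

rng = random.Random(0)           # fixed seed so the run is repeatable
n = 200
xs = [rng.random() for _ in range(n)]

def p(x):
    """Hypothetical P(Y=1|X=x): 0.9 on the right half, 0.1 on the left."""
    return 0.9 if x >= 0.5 else 0.1

def draw_labels():
    """Draw one label per feature value, independently, from p(x)."""
    return [int(rng.random() < p(x)) for x in xs]

train = draw_labels()
fresh = draw_labels()            # same features, new noise in the labels

memorized = dict(zip(xs, train))

def flexible(x):
    """Very flexible rule: reproduce whatever label was seen in training."""
    return memorized[x]

def simple(x):
    """Rigid rule: a single cut-off at 0.5."""
    return int(x >= 0.5)

def error(rule, ys):
    """Fraction of points where the rule disagrees with the labels ys."""
    return sum(rule(x) != y for x, y in zip(xs, ys)) / n

for name, rule in (("flexible", flexible), ("simple", simple)):
    print(name, error(rule, train), error(rule, fresh))
```

The memorizing rule has no training error by construction, but that perfect record does not carry over to the re-drawn labels; the rigid rule makes mistakes in training, but we would expect it to make them at a similar rate on the fresh labels.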

One way to combat the vulnerability to noise is to impose some sort of geometric constraint. A common one is to require a certain minimum distance between any point and the classification boundary — to insist on only using classifiers with a large “margin”. Such geometric constraints rule out shapes with very irregular, erratic boundaries, which is (one reason) why people talk about this as “regularizing” the problem of finding an optimal classifier. Just as we saw above, constraining classifiers to have a large margin is the same as penalizing classifiers based on their margin. Either way, we would typically use cross-validation to pick the size of the constraint or the penalty.

(People have developed various ways of measuring model capacity, to help quantify this trade-off. Many of them come down to variations on seeing how well the family could seem to classify labels which were pure noise, i.e., how well it would seem to do when \(p(x)=0.5\) for all \(x\). These are important for theory, have helped guide the development of new classifiers, and provide important sanity checks on how well we can hope to do: see for instance Bartlett and Mendelson (2002). But in practice, people overwhelmingly use cross-validation to assess their classifiers.)

**Pros** of strategy (3):

- In the direct, region-finding approach, we don’t *explicitly* need to estimate any probabilities. We are still, implicitly, guessing at where \(p(x) \geq 0.5\) (or whatever our threshold is), but we’re just trying to find that boundary, rather than estimating the whole function \(p(x)\).
- We can also apply the shape-finding approach in a more Neyman-Pearson way: we’d first find the regions in \(\mathcal{S}\) with acceptably low false positive rates, and then among them find the region with the highest power.

- For many families, there are very efficient algorithms for quickly finding the optimal classification region (perhaps with regularization).

**Cons**:

- Even though we’re not explicitly considering costs of different kinds of errors, a certain ratio between the costs is implicit in the error rate we decide to minimize. We might not like that ratio if we think about it consciously.
- Because we’re not estimating any probabilities, just finding a decision region leaves us uncertain about what to do if costs shift (or we change our mind about costs).
- Because we’re not estimating any probabilities, we don’t have any sense of *confidence* in our classifications (are we 99% sure this is a positive case, or just 51% sure?).

1. We can use features \(X\) to predict class labels \(Y\) when, and only when, the distribution \(\Prob{X=x|Y=y}\) changes with the label \(y\). The bigger the difference in those distributions, the more ability we will have to do classification on the basis of \(X\).
2. Error rates involve the ratios of distributions \(\Prob{X=x|Y=1}/\Prob{X=x|Y=0}\) (and possibly the base rates as well).
3. Classifier design strategies can be understood as either trying to estimate (and threshold) the posterior probability \(\Prob{Y=y|X=x}\), as trying to estimate (and threshold) the likelihood ratio \(\Prob{X=x|Y=1}/\Prob{X=x|Y=0}\), or as trying to directly find a decision boundary with good error rates. All three strategies have their advantages and disadvantages.

Goel, Hofman, and Sirer (2012) used the full browsing history of about a quarter of a million (US) Web users primarily to examine how different demographic groups — defined by age, sex, race, education and household income — used the Web differently. If we think of each demographic category as a label \(Y\), and which websites were visited (and how often) as features \(X\), they were primarily interested in \(\Prob{X=x|Y=y}\), and how this differed across demographic categories. For instance, people with a post-graduate degree visited news sites about three times as often as people with only a high school degree. (What’s \(X\) and what’s \(Y\) in that example?) It may or may not surprise you to learn that they found large differences in browsing behavior across demographic groups. To steal an example from the paper, men are much more likely than women to visit ESPN, and women are more likely than men to visit Lancôme.

You can now see where this is going. By point (1) in our summary above, the fact that \(\Prob{X=x|Y=y}\) differs across classes \(y\) means that we can use browsing behavior (\(X\)) to predict demographic classes (\(Y\)). Someone who knows what websites you browse can predict your age, sex, race, education, and household income. To demonstrate this, Goel, Hofman, and Sirer (2012) used the 10,000 most popular websites, creating a binary feature for each site, \(X_i=1\) if site \(i\) was visited at all during the study and \(X_i=0\) if not. They then used a linear classifier on these features, with one of the geometric margin constraints I mentioned. The next figure shows how well they were able to predict each of the five demographic variables.

*Detail of Figure 8 from Goel, Hofman, and Sirer (2012), showing the ability of a (regularized) linear classifier to predict demographic variables based on web browsing history. Dots show the achieved accuracy, and the \(\times\) shows the frequency of the more common class.*

I include this not because the precise accuracies matter — there’s no reason to think this is the best performance attainable, even with these features — but rather to prove the point that this kind of prediction can be done. It doesn’t matter *why* different demographic groups have different browsing habits, just *that* those distinctions make a difference. This lets us (or our machines) work backwards from browsing to accurate-but-not-perfect inferences about demographic categories.

Now imagine a recidivism prediction system which does not, officially or explicitly, consider sex, but *does* have access to the defendant’s web browsing history. (No such system exists, to the best of my knowledge, but there’s no intrinsic limit on its creation.) We know, from Goel et al., that sex can be predicted with about 80% accuracy from browsing history (at least). A nefarious designer who wanted to include sex as a predictor for recidivism, but to hide doing so, could therefore use browsing history to predict sex, and then include predicted sex in their model. A less nefarious designer might end up doing something equivalent without even realizing it, say by slightly increasing the predicted risk of those who visit ESPN and slightly reducing the prediction for those who visit Lancôme. Either designer might, when pressed, say that they’re not claiming to say *why* ESPN predicts recidivism, but facts are facts, and are you going to argue with the math?

In fact, we can go further. We know that younger people have a higher risk of violence than older people, that poorer people have a higher risk than richer people, that men have a higher risk than women, that blacks have a higher risk than whites^{6}, that less educated people have a higher risk than more educated people^{7}. A system which just used Web browsing to sort people into these five demographic categories could^{8}, therefore, achieve non-trivial predictive power. You can even imagine designing such a system innocently, where we just try to boil down a large number of features into (say) a five-dimensional space, before using them to predict violence, without realizing that those five dimensions correspond to age, sex, race, income and education.

None of this really relies on the features being Web browsing history; anything whose distribution differs across demographic groups will do.

On the Neyman-Pearson approach to classifiers, see Scott and Nowak (2005), Rigollet and Tong (2011) and Tong (2013). The Neyman-Pearson lemma itself goes back to Neyman and Pearson (1933) (where they never call it a lemma). The heuristic cost-benefit derivation of it is, so far as I know, my own invention.

Allen, Danielle S. 2017. *Cuz: The Life and Times of Michael A.* New York: Liveright.

Bartlett, Peter L., and Shahar Mendelson. 2002. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” *Journal of Machine Learning Research* 3:463–82. http://jmlr.csail.mit.edu/papers/v3/bartlett02a.html.

Dollard, John. 1937. *Caste and Class in a Southern Town*. New Haven, Connecticut: Yale University Press.

Goel, Sharad, Jake M. Hofman, and M. Irmak Sirer. 2012. “Who Does What on the Web: A Large-Scale Study of Browsing Behavior.” In *Sixth International AAAI Conference on Weblogs and Social Media [ICWSM 2012]*, edited by John G. Breslin, Nicole B. Ellison, James G. Shanahan, and Zeynep Tufekci. AAAI Press. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4660.

Leovy, Jill. 2015. *Ghettoside: A True Story of Murder in America*. New York: Spiegel & Grau.

Neyman, Jerzy, and Egon S. Pearson. 1933. “On the Problem of the Most Efficient Test of Statistical Hypotheses.” *Philosophical Transactions of the Royal Society of London A* 231:289–337. https://doi.org/10.1098/rsta.1933.0009.

Rigollet, Philippe, and Xin Tong. 2011. “Neyman-Pearson Classification, Convexity and Stochastic Constraints.” *Journal of Machine Learning Research* 12:2831–55. http://jmlr.org/papers/v12/rigollet11a.html.

Scott, Clayton, and Robert Nowak. 2005. “A Neyman-Pearson Approach to Statistical Learning.” *IEEE Transactions on Information Theory* 51:3806–19. https://doi.org/10.1109/TIT.2005.856955.

Tong, Xin. 2013. “A Plug-in Approach to Neyman-Pearson Classification.” *Journal of Machine Learning Research* 14:3011–40. http://jmlr.org/papers/v14/tong13a.html.

^{1} Strictly, if \(p(x)=0.5\), it doesn’t matter what \(\Yhat\) is, but I will always write \(\geq\) for definiteness.

^{2} After Vilfredo Pareto, an economist who pioneered the study of optimization under competing objectives.

^{3} If the distributions \(f\) and \(g\) don’t overlap, we can get 0 FPR with positive power, and/or power 1 without FPR 1.

^{4} The more usual jargon word here is “class”, which I used in lecture, but this collides with “class” for whether \(Y=1\) or \(Y=0\), so I’ll try to avoid it.

^{5} As an example, if \(x\) is two dimensional, and \(\phi(x) = (x_1, x_2, x_1^2, x_2^2)\), a linear classifier applied to \(\phi\) can pick out points inside (or outside) a circle, which you couldn’t do with the raw features. (What would the linear classifier for \(\phi\) look like?)

^{6} There are multiple reasons for this association. One is a long-standing history (cf. Dollard (1937)) of segregating African-Americans into neighborhoods which are under-policed (in the sense that violence often goes unpunished by the forces of the law) and over-policed (in the sense that interactions with the police are often hostile). This sets up a dynamic where people in those neighborhoods don’t trust the police, which makes the police ineffective, which makes being known for willingness to use violence a survival strategy, which etc., etc. Leovy (2015) gives a good account of this feedback loop from (mostly) the side of the police; Allen (2017) gives a glimpse of what it looks like from the other side.

^{7} Cathy O’Neil would remind us that many of these would flip around if we considered risk of *financial* crimes rather than violence.

^{8} I say “could”, because there’s some error in all these classifications, and it’s *possible* that these errors would cancel out the ability to predict violence from demographics.