\[ \newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. (Fisher 1938, 17)

1 Overview

This lecture covers common sources of technical failure in data mining projects — the kinds of issues which lead to them just not working. (Whether they would be worth doing even if they did work is another story, for the last lecture.) We’ll first look at four sources of technical failure which are pretty amenable to mathematical treatment:

  1. Failing to generalize because of over-fitting
  2. Weak or non-existent predictive relationships between our features and the outcome we’re trying to predict.
  3. Changing relationships between our features and our outcome.
  4. The intrinsic difficulty of learning relationships between lots of variables.

The second half of the notes covers three distinct issues about measurement, model design and interpretation:

  1. Building models on bad data
  2. Tweaking the model specification to get the result we want
  3. Taking credit for what was going to happen anyway

They’re less mathematical but actually more fundamental than the first set of issues.

2 Failing to generalize by over-fitting the available data

  • Many, many, many data-mining projects have failed because they didn’t cross-validate
    • (or use some other sane way of estimating prediction error)
  • This is less common than it used to be, but it’s still an issue in many fields (economics, psychology, medicine…), and in some branches of industry
  • Or: they did cross-validation badly
    • Inadvertently using the testing data in setting up the model (say, centering and standardizing the features to mean 0 and variance 1 using the whole data set; see the sketch after this list)
    • Deliberate cheating to win prediction contests by reverse-engineering the testing set
      • Exhibit A, Exhibit B, …
      • (There are many reasons I don’t use Kaggle in any of my courses)
    • Inadvertently failing to cross-validate properly, as in the NSA’s SKYNET project for deciding who to kill via drones
      • You should be able to work out the flaw in their cross-validation scheme from reading this news article (so why didn’t they?)
  • Even if you do cross-validate, you can still fail to generalize
    • Often due to data-set shift (covered last time)
    • It’s hard to know if your results will generalize to a new context until you actually try to generalize them (Yarkoni 2019)
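
To make the “inadvertently using the testing data” point concrete, here is a minimal sketch (not from the original notes; it assumes scikit-learn is available and uses synthetic data) contrasting a leaky analysis, which standardizes with the whole data set before cross-validating, with an honest one that refits the scaler inside each training fold. With mere centering and scaling the numerical difference is usually small, but the same pipeline discipline is what prevents big leaks from steps like feature selection.

```python
# Hypothetical illustration: "leaky" vs. "honest" cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # toy features
y = rng.integers(0, 2, size=200)    # toy labels (pure noise, so ~0.5 is the honest accuracy)

# Leaky: the scaler sees the test folds before cross-validation does
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# Honest: the scaler is re-fit on each training fold only
honest = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)), X, y, cv=5)

print("leaky CV accuracy:", leaky.mean(), " honest CV accuracy:", honest.mean())
```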

3 Weak or non-existent signals

3.1 Weak predictive relationships are harder to estimate than strong ones

  • The weaker the relationship between the outcome and the features, the more data we need to distinguish it from pure noise
    • Suppose \(X\) and \(Y\) are nearly independent statistically
    • Which means that \(\Prob{Y=y|X=x} \approx \Prob{Y=y}\) for all \(x\)
    • Which means that \(\Prob{Y=y|X=x_1} \approx \Prob{Y=y|X=x_2}\) for any two \(x_1, x_2\)
    • Which means we’ll need a lot of data to learn the difference between those two distributions
  • But: even if there is no relationship, most methods will do their best to find something
    • e.g., in a linear model, even if all true slope coefficients \(=0\), your estimates (with finite data) won’t be exactly 0
    • CART (with pruning) and lasso are two of the few methods I know which will, in fact, say “there’s nothing to see here”
  • The weak-signal problem is exacerbated if comparing already-low rates or proportions
    • Recall that if you estimate a proportion \(p\) on \(n\) samples, the variance is \(p(1-p)/n\). If \(p\) is small, \(1-p \approx 1\), and the variance is about \(p/n\).
    • The relative error, or \(\frac{\text{standard error}}{\text{parameter}}\), will then be about \(\sqrt{p/n}/p = 1/\sqrt{np}\). Notice that as \(p\) gets smaller, the relative error gets bigger (at constant \(n\))
    • Similar algebra for determining differences between small proportions — the relative error for the difference between two small probabilities can get very large!
    • One place this shows up: online advertising, where we know the rates are low, and so seeing whether changes make any difference can require really huge data sizes, and is probably too costly even for the largest companies
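
A back-of-the-envelope sketch of the last two points, treating the \(1/\sqrt{np}\) formula above as exact and using a rough three-standard-error detection criterion (my numbers, not the notes’):

```python
def n_for_relative_error(p, rel_err):
    """Sample size so that (standard error)/p = 1/sqrt(n*p) equals rel_err."""
    return 1.0 / (p * rel_err ** 2)

def n_per_group_to_detect_lift(p, lift, z=3.0):
    """Rough per-group size so a relative lift is z standard errors of the
    difference between two proportions near p: lift*p >= z*sqrt(2p/n)."""
    return 2.0 * z ** 2 / (lift ** 2 * p)

for p in [0.1, 0.01, 0.001]:  # e.g., click-through rates
    print(f"p={p}: n for 10% relative error ~ {n_for_relative_error(p, 0.1):.1e}, "
          f"n per group to detect a 5% lift ~ {n_per_group_to_detect_lift(p, 0.05):.1e}")
```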

3.2 Sometimes there is no predictive relationship

  • Sometimes the signal just doesn’t exist
    • Phrenology was a 19th century pseudo-science which claimed to predict intelligence, character, etc., from the shape of heads, and in particular how pronounced bumps on certain parts of the skull were
      • It was taken astonishingly seriously for a very long time, by people all over the world and the political spectrum, and one can still trace its influence in (e.g.) how we picture “criminality” (Pick 1989)
      • Almost all of the purported correlations were just stuff phrenologists made up
        • Which is almost a legitimate form of scientific method (“conjectures and refutations”, “hypothetico-deductivism”, “guess and check”), except that…
        • They didn’t properly test any of their conjectures, and when other people did start to actually, systematically check, none of the correlations held up
      • There was a little kernel of scientific truth here: specific mental functions do depend on specific regions of the brain. This was discovered by studying the effects of localized brain damage, which is still the source of most of our knowledge about what different brain regions are for (Shallice and Cooper 2011). These discoveries in the early 1800s helped inspire phrenology (Harrington 1989), but it quickly got out of control
        • There was one leap from “damaging/removing this brain region impairs this function” to “size of this brain region controls how strong this function is”
        • There was another leap from “size of this brain region” to “size of bump on the skull”
      • Phrenology has (deservedly) become a by-word for pseudo-science
      • So when you see people trying to predict sexual orientation from photographs, or whether someone will be a good employee from a brief video, etc., you should be very skeptical, and there is indeed every reason to think that these methods are (currently) just BS
    • You should think very hard about why you are trying to relate these features to this outcome, and what your data would look like if there was in fact no signal there

4 Changing relationships between features and outcomes

  • Many names: “Data-set shift”, “Distribution shift”, “Covariate shift”, etc.
  • Think about how the risk of a strategy \(s\) depends on the distribution: \[ \Expect{\ell(Y,s(X))} = \int{p(x) \left( \int{ \ell(y, s(x)) p(y|x) dy}\right) dx} \] where the inner expectation is the conditional risk, given \(X=x\)
  • If the distributions \(p(y|x)\) and/or \(p(x)\) change, then the risk of a fixed strategy changes, and the optimal strategy may change with those risks
  • Let’s distinguish some cases (following Quiñonero-Candela et al. (2009))

4.1 Covariate shift

“Covariate shift” = \(\Prob{Y|X}\) stays the same but \(\Prob{X}\) changes

  • Not an issue if your estimate of \(\Prob{Y|X}\) is very accurate
  • But most models are really dealing with “local” approximations
    • e.g., linear regression is really trying to find the best linear approximation to the true regression function \(\Expect{Y|X}\) (lecture 2)
    • The best linear approximation changes as \(\Prob{X}\) changes, unless that true regression function is really linear
  • In general, the best local approximation in some model class will change as \(\Prob{X}\) changes
  • \(\therefore\) The model we learned with the old \(\Prob{X}\) won’t work as well under the new distribution of the covariate
  • One potential way to cope is by weighting data points
    • If you know that the old pdf is \(p(x)\) and the new pdf is \(q(x)\), give data point \(i\) a weight proportional to \(q(x_i)/p(x_i)\) when fitting your model (a small sketch follows this list)
    • There are some nice theoretical results about this (Cortes, Mansour, and Mohri 2010)
    • But it can be hazardous, since if \(p(x_i)\) is small and \(q(x_i)\) is not-so-small you’re giving a lot of weight to just a handful of points (“the small denominator problem”)
    • And if you need to estimate the ratio \(q(x)/p(x)\), there are ways to do so, but they introduce extra risk (see, e.g., Sugiyama and Kawanabe’s work on covariate-shift adaptation)
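
Here is a minimal numpy/scipy sketch of the re-weighting idea (my toy example, not the cited papers’): the true regression function is nonlinear, the covariate distribution shifts from \(N(0,1)\) to \(N(1.5,1)\), and weighting the old data by \(q(x)/p(x)\) (both densities assumed known here) pulls the fitted line toward the best linear approximation under the new distribution.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 5000
x_old = rng.normal(0.0, 1.0, n)                    # training X ~ p = N(0,1)
y_old = np.sin(2 * x_old) + rng.normal(0, 0.3, n)  # nonlinear truth + noise

w = norm.pdf(x_old, 1.5, 1.0) / norm.pdf(x_old, 0.0, 1.0)  # q(x)/p(x), new X ~ N(1.5,1)

def wls(x, y, w):
    """Weighted least squares for y ~ a + b*x."""
    A = np.column_stack([np.ones_like(x), x]) * np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(A, y * np.sqrt(w), rcond=None)
    return beta

plain = wls(x_old, y_old, np.ones(n))
reweighted = wls(x_old, y_old, w)

# Evaluate both fits under the *new* covariate distribution
x_new = rng.normal(1.5, 1.0, 20000)
y_new = np.sin(2 * x_new) + rng.normal(0, 0.3, 20000)
for name, b in [("unweighted", plain), ("reweighted", reweighted)]:
    print(name, "MSE on new distribution:",
          np.mean((y_new - (b[0] + b[1] * x_new)) ** 2))
```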

4.2 Prior probability shift or class balance shift

“Prior probability shift” or “class balance shift” = \(\Prob{X|Y}\) stays the same but \(\Prob{Y}\) changes

  • This will change \(\Prob{Y|X}\), and so where you should draw your classification boundaries to minimize mis-classifications, or the expected cost of mis-classifications
    • Remember that those boundaries depend on \(\Prob{Y|X}\), which in turn is a function of \(\Prob{X|Y}\) and \(\Prob{Y}\)
    • It also, obviously, changes regressions
  • Re-weighting based on \(X\) doesn’t help here
  • You could try re-weighting on \(Y\), or directly adjusting predicted class probabilities for the new class frequencies (a small sketch follows this list)
  • Artificially balancing your data to have equal numbers of positive and negative cases \((\Prob{Y=0} \approx \Prob{Y=1})\) can help learn the difference between \(\Prob{X|Y=0}\) and \(\Prob{X|Y=1}\), but don’t expect that your error rate will really generalize to unbalanced data “in the wild”.
  • The “Neyman-Pearson” approach, of setting a limit on acceptable false positives and then minimizing false negatives, is more robust to this kind of shift than is just minimizing mis-classifications.
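
A minimal sketch of the prior-correction trick mentioned above (valid exactly when \(\Prob{X|Y}\) really is unchanged, which is the defining assumption of this kind of shift): rescale the old predicted class probabilities by the ratio of new to old class frequencies and re-normalize.

```python
import numpy as np

def adjust_posteriors(probs_old, pi_old, pi_new):
    """P_new(Y=k|X) is proportional to P_old(Y=k|X) * pi_new[k] / pi_old[k]."""
    unnorm = np.asarray(probs_old) * (np.asarray(pi_new) / np.asarray(pi_old))
    return unnorm / unnorm.sum(axis=1, keepdims=True)

# A model trained on artificially balanced data (50/50), deployed where
# positives are only 5% of cases: a "70% positive" prediction drops to ~11%.
probs_old = np.array([[0.30, 0.70],
                      [0.80, 0.20]])
print(adjust_posteriors(probs_old, pi_old=[0.5, 0.5], pi_new=[0.95, 0.05]))
```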

4.3 Concept drift

“Concept drift”1 = \(\Prob{X}\) stays the same, but \(\Prob{Y|X}\) changes, or, similarly, \(\Prob{Y}\) stays the same but \(\Prob{X|Y}\) changes

  • That’s history (more or less).
  • An outstanding example: Google Flu Trends
    • Estimated the prevalence of influenza from search-engine activity
    • Stopped working after a while (Lazer et al. 2014b)
    • It’s never recovered (Lazer et al. 2014a).
    • Because the actual relationship between “this many people searched for these words” and “this many people have the flu” changed

4.4 Coping mechanisms

  • I’ve mentioned weighting
  • Another family of coping strategies consists of incremental learning approaches that continually revise the model
    • One idea is to run many models in parallel:
      • Model 1 gets trained on data points \(1, 2, 3, \ldots n\) (increasing as \(n\) grows)
      • Model 2 gets trained on data points \(k+1, k+2, \ldots n\) (also increasing as \(n\) grows)
      • … Model \(r\) gets trained on data points \(rk+1, rk+2, \ldots n\)
      • Add a new model every \(k\) data points
      • Weight each model based on how well it’s done, and shift weight to the better models
      • Copes with many different kinds of shift, at some cost in efficiency if there are no shifts (Shalizi et al. 2011); a toy version is sketched after this list
  • Note that cross-validation within a data set won’t detect this; you really do need to compare different data sets!
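
A toy version of the parallel-models scheme, in the spirit of (but much cruder than) Shalizi et al. (2011): each “model” here is just the running mean of its own window, a new one starts every \(k\) points, and exponential weighting by squared error shifts weight onto whichever models track the current regime.

```python
import numpy as np

rng = np.random.default_rng(2)
y_stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])  # mean shifts at t=500

k, eta = 50, 0.5
starts, weights, preds = [], np.array([]), []

for t, y_t in enumerate(y_stream):
    if t % k == 0:                                # launch a new model every k points
        starts.append(t)
        weights = np.append(weights, 1.0 if t == 0 else weights.mean())
    # each "model" predicts the mean of its own training window so far
    fits = np.array([y_stream[s:t].mean() if t > s else 0.0 for s in starts])
    preds.append(np.dot(weights / weights.sum(), fits))    # weighted-ensemble prediction
    weights = weights * np.exp(-eta * (fits - y_t) ** 2)   # shift weight to the better models
    weights = weights / weights.max()                      # rescale for numerical stability

preds = np.array(preds)
print("MSE before the shift:", np.mean((preds[400:500] - y_stream[400:500]) ** 2))
print("MSE well after the shift:", np.mean((preds[900:] - y_stream[900:]) ** 2))
```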

5 Curse of dimensionality

  • Q: \(p\) features uniformly distributed on \([0,1]^p\), \(n=10^9\), what’s the expected number of points within \(\pm 0.005\) (on every axis) of the mid-point, as a function of \(p\)?
  • A: The (hyper-) volume of the target region is \((0.005\times 2)^p = 10^{-2p}\), so the expected number of points in the region is \(10^9 10^{-2p} = 10^{9-2p}\)
    • For \(p=1\), that’s \(10^7\), an immense amount of data
    • For \(p=3\), that’s \(10^3\), still a very respectable sample size
    • For \(p=4\), that’s \(10^1=10\), not nothing but not a lot
    • For \(p=10\), that’s \(10^{-11}\), meaning there’s a substantial probability of not having even one point in the target region
    • For \(p=100\), that’s \(10^{-191}\)
    • Not even worth calculating with thousands of features
  • There are lots of domains where we have thousands or hundreds of thousands of features: images, audio, genetics, brain scans, advertising tracking on the Internet…
  • Why that little calculation matters: Say we’re trying to estimate a \(p\)-dimensional function by averaging all the observations we have which are within a distance \(h\) of the point we’re interested in
    • The bias we get from averaging is \(O(h^2)\), regardless of the dimension
      • (Taylor-expand the function out to second order; to see this in detail, see Shalizi (n.d.), ch. 4)
    • If we’ve got \(n\) samples, we expect \(O(nh^p)\) of them to be within the region we’re averaging over, so the variance of the average will be \(O(n^{-1}h^{-p})\)
    • Total error is bias squared plus variance, so it’s \(O(h^4) + O(n^{-1}h^{-p})\)
    • We can control \(h\), so adjust it to minimize the error: \[\begin{eqnarray} O(h^3) - O(n^{-1}h^{-p-1}) & = & 0\\ O(h^{p+4}) & = & O(n^{-1})\\ h & = & O(n^{-1/(p+4)}) \end{eqnarray}\]
    • The total error we get is \(O(n^{-4/(p+4)})\)
    • This is great if \(p=1\), but miserable if \(p=100\)
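
Both calculations are easy to check numerically, treating the unknown constants hidden in the \(O(\cdot)\)’s as 1 (so this is only a rough sketch):

```python
n = 1e9
for p in [1, 3, 4, 10, 100]:
    # expected sample points within +/- 0.005 of the midpoint on every axis
    print(f"p = {p:3d}: expected neighbors = {n * 0.01 ** p:.1e}")

target = 0.01
for p in [1, 4, 10, 20, 100]:
    # invert the O(n^{-4/(4+p)}) rate: sample size needed to reach a given error
    print(f"p = {p:3d}: n for error {target} ~ {target ** (-(p + 4) / 4):.1e}")
```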

  • The curse of dimensionality is that the amount of data we need grows exponentially with the number of features we use
    • Above argument is just for averaging, but:
      • Almost all predictive modeling methods boil down to “average nearby points” (they just differ in how “nearby” gets defined, or maybe the precise form of averaging)
      • A more complicated argument shows that \(O(n^{-4/(4+p)})\) is in fact the best we can generally hope for (see backup)
      • The basic reason is that the number of possible functions explodes as \(p\) increases, but the amount of information in our sample does not
  • There are basically three ways to escape the curse:
    1. Hope that you already know the right function to use (it’s linear, or quadratic, or maybe an additive combination of smooth, 1-D functions)
      • If we can’t have this, maybe we can at least hope that we know what features really matter, so we don’t just blindly throw in irrelevant ones
    2. Hope that while there are \(p\) features, so the \(X\) vectors “live” in a \(p\)-dimensional space, the features are very strongly dependent on each other, so that the \(X\) vectors are on, or very close to, a \(q\)-dimensional subspace, with \(q \ll p\)
      • Use dimension reduction to try to find this subspace, or
      • Use a prediction method which automatically adapts to the “intrinsic” dimension, like k-nearest-neighbors (Kpotufe 2011) or some kinds of kernel regression (Kpotufe and Garg 2013)
    3. Use a strongly-constrained parametric model, as in (1), even though it’s wrong, and live with some bias that won’t go away even as \(n\rightarrow\infty\)
      • This is probably the best reason to still use linear models
  • The curse of dimensionality means that blindly doing statistical learning with tons of features will not end well

6 Bad Data

6.1 Unrepresentative samples

  • Data mining projects typically work with convenience samples
  • A “convenience sample” of all your website’s users is fine — if they’re representative of future users
    • You often hope that they won’t be, because you want your audience to grow!
      • Back to data-set shift…
  • For example: US Twitter users are not a representative sample of Americans
    • Broadly speaking they’re younger, richer, more educated and more urban…
      • In most contexts they’re also more likely to be white, but oddly not for geo-tagged tweets (Malik et al. 2015; Malik 2018)
      • The demographic skew changed over time (Diaz et al. 2016)

6.2 Record linkage

  • Data sets often come from merging multiple sources
  • How do you know which records in different sources refer to the same person / firm / TV show / event?
    • Is “C. Shalizi” the same as “C. R. Shalizi”, “Cosma Shalizi”, “Cosma Rohilla Shalizi”, “Cosmar Shalizi”, “C. Shallizzi”, “Cosmar Shalizi”, “Cosimo Shallice”?
  • “Record linkage” or “entity resolution” is a huge problem and indeed an active area of research in statistics & databases (a toy string-matching sketch appears after this list)
  • It’s often done very badly in practice, with very little review
  • O’Neil (2016), pp. 150–154, gives some (sadly typical) horror stories of record linkage gone awry
  • This issue could arise with any statistical project, but is more often a concern for data mining projects that deliberately want to merge multiple sources of information
    • Record linkage can also be tackled as a statistical learning problem (e.g., Bhattacharya and Getoor (2007))
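
To give a flavor of why this is hard, here is a toy sketch of one ingredient of record linkage, scoring name pairs with a normalized string similarity (real entity resolution uses far more than this: other fields, blocking, probabilistic models, clerical review; “A. N. Other” is a made-up non-match):

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["C. Shalizi", "C. R. Shalizi", "Cosma Shalizi", "Cosma Rohilla Shalizi",
         "Cosmar Shalizi", "C. Shallizzi", "Cosimo Shallice", "A. N. Other"]

def similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for a, b in combinations(names, 2):
    s = similarity(a, b)
    if s > 0.75:                      # the threshold is itself a modeling choice
        print(f"{s:.2f}  {a!r} ~ {b!r}")
```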

6.3 Artifacts

  • Our systems are really good at picking up on associations between features and labels that are present in the data
  • Some of these are not at all what we want, and can lead to bizarre-seeming failures
  • For example, why does a leading image classifier categorize this as “sheep”?
  • A more serious example: a medical X-ray classifier that learns to classify patients by disease based not on the content of the X-ray images but on the extra markings on the slides, which showed what kind of machine was used, which hospital the image came from, etc.
  • A recent example from an (unpublished) student project in our department: extremely good classification of biopsy tissue samples, imaged under a microscope, as healthy or diseased, which cross-validated well but failed to generalize to genuinely new data…
    • When the pathologists created the microscopic images, they did the healthy and diseased samples in separate batches
    • All the healthy samples were illuminated from the left, all the diseased samples from the right (different microscopes, IIRC)
    • The classifier learned to tell which direction the light was coming from
  • Artifacts interact with the problem of relying on proxies, and with the problem of data-set shift

6.4 Sheerly erroneous observations

  • Impossible values (e.g., body temperature of 0)
  • Missing values
    • Missing values “coded” as 0, -1, 99, 999…
    • Some but not all missing values coded as 88 when it was supposed to be 99 (Kahn and Udry 1986)
  • Random noise in the variables is not so bad, but will still weaken the predictive relationship between \(Y\) and \(X\)

6.5 Reliance on (weak) proxies

  • What a judge (should) care about is whether a defendant will actually commit an act of violence if released; what our data shows is whether someone is arrested for violence
    • Arrest for violence is a proxy for the act of violence
    • The proxy is subject to errors (both false negatives and false positives)
  • Lots of data-mining projects are forced to use proxies for things that we really care about, like arrests vs. acts
  • Many others use proxies because they can
  • Every step in the chain of inference from proxy to proxy to … to feature-we’d-actually-want-to-measure-if-only-we-could to outcome, introduces error
    • Even if all the errors are random and undirected, it weakens the predictive relationship between our actual features (the proxies at the beginning of the chain) and the response
    • Errors may not be random at all

7 “The Garden of Forking Paths”

  • “The Garden of Forking Paths” is a short story by Jorge Luis Borges, which depicts life as being like a maze in a garden, where the path splits at every choice you could make; every path is a different possible life (Borges, n.d.)
    • This story actually helped influence Herbert Simon, our local culture hero and Nobel Laureate / Turing Award winner, as he helped invent artificial intelligence (Simon 1991)
    • It’s also a classic of world literature and you should read it if you haven’t
  • Gelman and Loken (2014) used this as a simile for the choices that go in to building statistical models, and how they can influence the results
  • We have to make lots of choices in our predictive models:
    • What variable, exactly, are we trying to predict?
    • What variables, exactly, are we using to make the predictions?
    • What class of models, exactly, are we using to connect features to outcomes?
    • How, exactly, are we evaluating the predictions?
    • Which cases, exactly, get included and which excluded from the data set?
  • These choices are also called “researcher degrees of freedom”
  • Varying these choices can give very different answers to the same question

7.1 The COMPAS example, again

  • We’re trying to predict violence. Does that mean:
    • A binary indicator for being arrested for violence?
    • A binary indicator for being convicted for violence?
    • Time to arrest for violence (maxing at 2 years)?
    • Time to conviction for violence (maxing at 2 years)?
    • Some measure of how much violence someone commits, combining number of incidents and (somehow) their severity?
    • Why 2 years, exactly?
    • Precisely which crimes count as “violent”?
      • Homicide, assault, rape, robbery, all seem pretty clear
      • Should we try to capture the idea that some offenses are violent, but not as violent as multiple homicide?
      • What about unlicensed possession of a handgun and intent to commit homicide — violent or not-violent?
      • What about an unlicensed handgun and drug-dealing?
    • The COMPAS study looks at re-arrest within two years. What about those who were imprisoned for more than 2 years — do we correct for them somehow?
  • We want to use prior criminal history as a predictor. Does that mean:
    • Number of prior arrests for anything at all?
    • Number of prior arrests for violence specifically?
    • Separate counts for prior arrests for violent and non-violent offenses?
    • Number of prior convictions (not just arrests), ditto?
    • Prior arrests and prior convictions, separately?
    • Time since last arrest or conviction?
    • Time spent in jail? Time spent in prison?
  • Model choice is something we’re familiar with, but still:
    • Linear models? Logistic models? With or without interactions? Which interactions?
    • Trees? Grown and pruned how, exactly?
  • We want to see how good the predictions are. Does that mean
    • Using a separate validation set?
    • Using cross-validation?
    • Using mis-classification rate?
    • Using the log-probability (entropy) loss?
    • Using a different measure of probabilistic calibration, like the Brier score?
    • Using the area under the FPR-FNR curve?
    • Treating this like an information-retrieval problem and looking at the precision-recall curve?
  • Who gets counted in the data? It’s easy to say “Everyone arrested in Broward County between date X and date Y”, but:
    • What if someone’s arrested more than once?
      • One record per arrest?
      • Only use the first record?
      • Only use the last record?
      • How did we do the record linkage to see that it’s the same person anyway?
    • What if the data is incomplete? Do we include records which are missing:
      • The COMPAS score?
      • Our measure of violent recidivism?
      • Our measure of the criminal record?
      • Our measure of race?
      • The criminal charge?
      • Records where the COMPAS score is dated before the arrest? (An error in entering the dates [which one? both?], a score generated by a previous arrest, a different arrestee with a similar name…?)

7.2 Choice of Measures

  • The points about precisely defining the variables are really about how we measure things like “violence” and “criminal history”
  • These are often somewhat nebulous ideas, but if we want to include them in a model, we need to make them precise, and we need to connect them to data
    • This is sometimes called “operationalizing” the ideas, or giving them an “operational definition”
      • “Operational definition” is a very misleading phrase based on 100-year-old philosophy; avoid it
  • The picture you should have in mind is something like \[ M = f(X) + \epsilon \] where \(X\) is what we’d like to measure and \(M\) are the features we can actually observe.
    • In a good measurement, \(f\) is a well-behaved, well-known 1-1 function and \(\epsilon\) is a small amount of purely-random noise
    • In a bad measurement, \(f\) is a poorly-known, ill-behaved function, maybe not even invertible, and \(\epsilon\) might really be a function of all sorts of other features and latent factors, possibly interacting with the \(X\) we want to measure
    • Of course we can have multiple indicators of \(X\), say \[ M_i = f_i(X) + \epsilon_i \] and then we can try to estimate or infer the latent variable \(X\) from the observables \(M\)
  • We can get very different results from different choices of measures!
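
A small simulation of the multiple-indicators picture (my toy example, assuming a linear one-factor structure, and using the first principal component as a crude stand-in for a proper factor model): the combined score tracks the latent \(X\) much better than the noisiest single indicator does.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=n)                           # the latent quantity we care about
loadings = np.array([1.0, 0.8, 0.6, 0.5, 0.3])   # each indicator's sensitivity f_i
noise_sd = np.array([0.3, 0.3, 0.5, 1.0, 2.0])   # some indicators are much noisier
M = X[:, None] * loadings + rng.normal(size=(n, 5)) * noise_sd

M_c = M - M.mean(axis=0)
_, _, Vt = np.linalg.svd(M_c, full_matrices=False)
score = M_c @ Vt[0]                              # first principal component

print("|corr(PC1, X)| =", abs(np.corrcoef(score, X)[0, 1]))   # sign of a PC is arbitrary
print("|corr(noisiest indicator, X)| =", abs(np.corrcoef(M[:, 4], X)[0, 1]))
```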

7.3 An example of choices of measures

Orben and Przybylski (2019)

  • Goal is to predict “adolescent well-being” from “digital technology use”
  • Used linear regression
    • Because: psychologists
    • But also because: no funny business
  • Choice of 3 data sets, with different features
    • I’ll just present one of them, “YRBS”
  • Measure of adolescent well-being: “Mean of any possible combination of five items concerning mental health and suicidal ideation”
    • Or, instead of taking means, “code all cohort members who answered ‘yes’ to one or more as 1 and all others as 0”
  • Measure of digital technology use: “Two questions concerning electronic device use and TV use, or the mean of these questions”
  • Choice of including covariates or not
  • Ended up with 372 possible model specifications

Top panel: Standardized linear regression coefficient for the measure of adolescent well-being on digital technology use, with nominally-significant coefficients shown in black and nominally-not-significant ones in red. The dotted line is the median across all 372 specifications. The bottom panel explains the specifications, showing which variables went in to the measure of technology use, which variables went in to the measure of well-being, whether the well-being measure was a mean or a max, and whether demographic features were included as controls in the regression. (Orben and Przybylski 2019, Fig. 1)

  • For the other studies, with larger sample sizes but also more features, the range of coefficients was even larger
    • And, incidentally, the median coefficient across specifications was even closer to 0
    • Wearing glasses, OTOH, has a much narrower, more consistently negative range of coefficients for predicting adolescent well-being…
  • Now imagine expanding this to include trees, random forests, nearest neighbors (what metric?), kernel machines (which kernel?), neural networks (which architecture?)…
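
To see how much the forking paths alone can matter, here is a toy specification-curve sketch on purely synthetic data (my construction, not Orben and Przybylski’s data or code): many defensible ways of building the outcome and exposure measures, one fixed data-generating process with a tiny true effect, and a wide spread of estimated coefficients.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n = 2000
tech = rng.normal(size=(n, 2))                               # two technology-use items
wellbeing = rng.normal(size=(n, 5)) - 0.02 * tech[:, [0]]    # five items, tiny true effect
demog = rng.normal(size=n)                                   # one "control" covariate

def std_slope(y, x, controls=None):
    """Standardized OLS coefficient of y on x, optionally with a control column."""
    cols = [np.ones_like(x), x] + ([controls] if controls is not None else [])
    beta = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0]
    return beta[1] * x.std() / y.std()

coefs = []
for r in range(1, 6):
    for items in itertools.combinations(range(5), r):          # which well-being items to average
        y = wellbeing[:, list(items)].mean(axis=1)
        for x in (tech[:, 0], tech[:, 1], tech.mean(axis=1)):  # how to measure "use"
            for controls in (None, demog):                     # include the control or not
                coefs.append(std_slope(y, x, controls))

coefs = np.array(coefs)
print(len(coefs), "specifications; coefficients from", round(coefs.min(), 3),
      "to", round(coefs.max(), 3), "; median", round(np.median(coefs), 3))
```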

7.4 Some ways to cope, maybe

  1. Make some choices, hope that they’re reasonable, and hope that nobody challenges you on them.
  2. Do some actual science to figure out what good measures are
  3. Put in the time to try many different specifications, and hope they all give similar results
    • And/or hope that you can explain why some of the specifications really don’t make sense after all, so you can ignore them
  4. Throw in lots of related measures, and do a lot of dimension reduction and/or clustering
    • E.g., if you’ve got 5 different ways of measuring depression, and they really are measuring the same thing, then they should be strongly correlated with each other, and when you do dimension reduction they should collapse on to one dimension
      • Factor models are often better than PCA here, especially if some of the measurements are just a lot noisier than others
    • But maybe some measurement relationships are nonlinear, or some latent variables are categorical…
  5. Hope that different choices of measures give you similar predictions
    • Easier to check this with different choices of how to measure the features
    • But you could still look at, say, different measures of outcomes and how they correlate with each other
  • I wish I could give you a simple answer with a flow-chart here, but nobody can (not honestly, not yet)

8 Prediction vs. Action

  • All the data-mining we’ve talked about is about prediction: “if we see \(X\), then we should expect to see \(Y\)”
  • But usually the customer wants to take an action: “if we do \(X\), then we should expect \(Y\) to happen”
  • There is a gap here

8.1 Back to recommendation engines

  • Collaborative filtering (or matrix factorization or…) predicts “If you like such-and-such movies, then you will probably like this movie”
  • But making a recommendation is an action
  • The people running the engine want to know “If we recommend this movie to you, you will probably buy it” (or watch it and put up with an ad, etc.)
    • There is already a gap here between predicting the rating given other ratings, and predicting behavior given recommendation
  • In fact, they really want to know “If we recommend this movie to you, you are more likely to buy it than if we didn’t recommend it to you”
  • A small personal example:
    • The late Jane Haddam was one of my favorite mystery novelists
    • In 2019, I bought a couple of her books on Kindle to replace falling-apart paperbacks from the 1990s
    • Kindle then recommended Jane Haddam novels to me for months
    • This was an accurate prediction of what novels I like, but didn’t make me any more likely to buy those books (from Kindle or anywhere else)
  • This is really a causal inference problem: \[ \Expect{\text{Revenue}| X, do(R_Y=1)} - \Expect{\text{Revenue}|X, do(R_Y=0)} \] where \(X\) are the user features and \(R_Y\) is the indicator for recommending item \(Y\)
    • Or the distribution of revenue, etc., etc.

8.2 Some causal inference methods

  • Experiments are easy to analyze
    • If you’re trying out just one change, it’s the t-test
      • If you’re a “data scientist”, you re-invent the t-test 100 years late and call it “A/B testing”
    • If you’re trying multiple changes all at once, you also need 100-year-old statistics:
      • Linear regression with dummy variables (=“analysis of variance”) to separate the effects of different, overlapping treatments
      • Experimental design to figure out how to efficiently combine multiple changes
  • But experiments are hard to implement
    • The people running the recommendation engine don’t like it being messed with and/or randomly turned off
    • Users don’t like being experimented on
    • If effects are small you need a big experiment
    • The organization needs to hire a statistician and not just a data scientist
  • Could match otherwise-similar items or users that did or did not get recommendations and compare them
    • Matching cases with equal probability of getting a recommendation (the “propensity score” of Rosenbaum and Rubin (1983)) can be especially useful; a toy propensity-weighting sketch follows this list
    • Doesn’t work when recommendations are a deterministic function of the attributes you’re trying to match on (no “overlap”)
  • Many, many other tools for causal inference, all with their own assumptions
    • Sharma, Hofman, and Watts (2015) used independent variation in the traffic to some but not all recommended products to estimate that about 75% of click-throughs that happen via Amazon recommendations would’ve happened anyway
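
A toy sketch of the “would’ve happened anyway” problem (my synthetic example; the true propensity is known here, whereas real analyses have to estimate it, e.g. by logistic regression): recommendations are targeted at users who already like the item, so the naive recommended-vs-not comparison dwarfs the true effect, while inverse-propensity weighting, a close cousin of propensity-score matching, roughly recovers it.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
affinity = rng.normal(size=n)                       # how much the user already likes the item
p_rec = 1 / (1 + np.exp(-1.5 * affinity))           # the engine targets likely buyers
rec = rng.random(n) < p_rec
true_effect = 0.02                                  # recommendation adds ~2 points of buy probability
p_buy = np.clip(0.3 + 0.1 * affinity + true_effect * rec, 0.01, 0.99)
buy = rng.random(n) < p_buy

naive = buy[rec].mean() - buy[~rec].mean()

w = rec / p_rec + (~rec) / (1 - p_rec)              # inverse-propensity weights
ipw = (np.sum(w * rec * buy) / np.sum(w * rec)
       - np.sum(w * ~rec * buy) / np.sum(w * ~rec))

print(f"naive difference: {naive:.3f}   IPW estimate: {ipw:.3f}   truth: ~{true_effect}")
```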

8.3 This isn’t just an issue with recommendation engines

  • Targeting marketing efforts at people who are already likely to buy your stuff is a common problem with all kinds of marketing (Rubin and Waterman 2006)
    • After all, “99% of the people who saw our ads bought our product!” sounds good
    • If 98% of them would’ve bought it anyway, maybe that ad wasn’t necessary…
    • But if you’re paid by the click-through, you’ll be very motivated to find that 98%…
  • Similar issues with all kinds of pattern discovery
    • E.g., “High school students who take AP classes are more likely to go to college, so we should encourage more students to take AP classes”
    • See also the discussion of “artifacts” above

9 Further reading

9.1 Data-set shift

  • Quiñonero-Candela et al. (2009) is still a very good reference on a range of techniques for dealing with data-set shift

9.2 On measurement

  • Cox and Donnelly (2011) contains a sound and not-too-technical discussion of a lot of the issues that go into good statistical measurement, and ways they can go wrong
  • Becker (2017) is a really good book about some of the difficulties of measuring human behavior, and a guide to the work which can go in to understanding what your measurements actually mean
  • Some of the most careful thinking about measurement has come out of psychology
    • I particularly recommend Borsboom (2005) (and Borsboom (2006))
    • Psychology has also seen a lot of very sloppy thinking about measurement, a lot of which persists (Flake and Fried 2019)
    • Jacobs and Wallach (2019) relate ideas about measurement coming from psychology to the questions about fairness we talked about in lecture 24

10 Backup: Why it’s so hard to beat the curse of dimensionality

  • We’ve seen that local-averaging methods will have a risk that shrinks like \(O(n^{-4/(4+p)})\)
  • What about other methods?
    • We don’t know what the true function is, so should explore what the maximum risk will be, over all not-too-crazy functions
      • If we use linear regression, its risk is \(O(pn^{-1})\) if the true function is linear…
      • … but \(O(1)\) (constant in \(n\)) if the true function is non-linear
  • You can show that the minimum possible maximum risk (“minimax risk”) is \(O(n^{-4/(4+p)})\) (Györfi et al. 2002)
    • (That rate assumes the function has at least 2 derivatives)
    • So local averaging is doing about as well as you can hope in general
    • Getting away from “in general” means: ruling out some functions a priori, without even looking at the data
  • Why is that the minimax rate? The precise argument is technical, but the intuition is information-theoretic
    • We observe the function plus noise, \(Y=f(X)+\epsilon\), at \(n\) points
    • From \((Y_1, \ldots Y_n)\), we want to estimate the function \(f\)
    • This is like a communication channel: the receiver sees \((Y_1, \ldots Y_n)\) and wants to recover the “signal” \(f\)
    • How many bits of information are there in the \(Y\) values? \(nH[Y]\) bits
    • How many possible functions \(f\) are there? Clearly, infinitely many…
      • So let’s limit the range of \(f\) to some interval of length (say) \(\tau\)
      • And let’s even say we don’t care about the exact value of \(f(x)\), we just want to know whether it fits into a discrete bin of width \(\delta\)
      • So the number of possible functions at \(n\) points is going to be at most \((\tau/\delta)^n\)
      • And the amount of information in the \(Y\) values is at most \(n\log{(\tau/\delta)}\)
      • BUT we’re assuming the functions are smooth, so knowing \(f(x)\) limits the possible values of \(f(x^{\prime})\) if \(x^{\prime}\) is near by; it has to be within some distance \(\kappa < \tau\) of \(f(x)\), so after discretizing there are \(\kappa/\delta\) choices available
      • When I (the sender / Nature) set \(f(x)\), I don’t have complete freedom to set \(f(x^{\prime})\), it has to be close to what I chose for \(f(x)\), and you (the receiver / Learner) can use that
      • But every dimension for \(x\) gives me, the sender, an additional degree of freedom on which I can alter \(f(x^{\prime})\);
      • there will be \(\kappa/\delta\) choices for what to do when moving along axis 1, and \(\kappa/\delta\) independent choices for what to do when moving along axis 2
    • The upshot is that the number of effectively different functions grows exponentially with \(p\), it’s \(O((\kappa/\delta)^p)\)
    • Decoding (learning) requires more information at the receiver than in the signal, so we need more and more samples (growing \(n\)) to pick out the right function
    • Actually getting the rate requires doing this math more precisely
      • In particular, figuring out \(\kappa\), and just how quickly you can let \(\delta \rightarrow 0\) while still having enough information in the \(Y\)’s to pick out the right function

10.1 Backup: Alternative geometric formulations of the curse of dimensionality

  • We looked at how many sampled data points we can expect to find within a small distance of a given location in the \(X\) space
  • Alternative formulation 1: as \(p\) grows, the distance to the nearest neighbor approaches the distance to the average neighbor (at fixed \(n\))
    • Helps understand why nearest-neighbor methods struggle in high dimensions (a quick simulation appears at the end of this section)
  • Most of the volume of a \(p\)-dimensional sphere (cube, etc.) is in a thin shell of width \(\epsilon\) near its surface
    • E.g., for a disk (=2D sphere) of radius 1, the area is \(\pi\), the area not within the shell of width \(\epsilon\) is \(\pi(1-\epsilon)^2\), so the fraction not close to the surface is \((1-\epsilon)^2\)
    • For a sphere of radius 1, the volume is \(\frac{4}{3}\pi\), the volume not close to the surface is \(\frac{4}{3}\pi(1-\epsilon)^3\), the fraction not close to the surface is \((1-\epsilon)^3\)
    • In general, in \(p\) dimensions, the fraction of the volume more than \(\epsilon\) away from the surface is \((1-\epsilon)^p \rightarrow 0\) as \(p\rightarrow\infty\) (for any \(\epsilon > 0\))
  • Alternative formulation 2: A small “amplification” or “blow-up” of any set with positive probability contains most of the probability
    • Because: the volume of a ball grows so rapidly with its radius
  • These are pretty generic features of high-dimensional distributions, not specific to uniforms (Boucheron, Lugosi, and Massart 2013)
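
A quick simulation of alternative formulation 1, plus the boundary-shell calculation for the cube (uniform data; my toy example): in high dimensions the nearest neighbor is barely closer than a typical point, and almost all of the volume is near the boundary.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
for p in [1, 2, 10, 100, 1000]:
    X = rng.random((n, p))                            # n uniform points in [0,1]^p
    d = np.linalg.norm(X[0] - X[1:], axis=1)          # distances from one point to all the others
    print(f"p = {p:4d}: nearest/average distance = {d.min() / d.mean():.3f}, "
          f"fraction of the cube within 0.05 of its boundary = {1 - 0.9 ** p:.3f}")
```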

References

Anderson, Chris. 2008. “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Wired 16 (17). http://www.wired.com/2008/06/pb-theory/.

Becker, Howard S. 2017. Evidence. Chicago: University of Chicago Press.

Bhattacharya, Indrajit, and Lise Getoor. 2007. “Collective Entity Resolution in Relational Data.” ACM Transactions on Knowledge Discovery from Data 1 (1):5. https://doi.org/10.1145/1217299.1217304.

Borges, Jorge Luis. n.d. Ficciones. New York: Grove Press.

Borsboom, Denny. 2005. Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge, England: Cambridge University Press.

———. 2006. “The Attack of the Psychometricians.” Psychometrika 71:425–40. https://doi.org/10.1007/s11336-006-1447-6.

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199535255.001.0001.

Cortes, Corinna, Yishay Mansour, and Mehryar Mohri. 2010. “Learning Bounds for Importance Weights.” In Advances in Neural Information Processing 23 [Nips 2010], edited by John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and A. Culotta, 442–50. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/4156-learning-bounds-for-importance-weighting.

Cox, D. R., and Christl A. Donnelly. 2011. Principles of Applied Statistics. Cambridge, England: Cambridge University Press. https://doi.org/10.1017/CBO9781139005036.

Diaz, Fernando, Michael Gamon, Jake M. Hofman, Emre Kiciman, and David Rothschild. 2016. “Online and Social Media Data as an Imperfect Continuous Panel Survey.” PLOS One 11:e0145406. https://doi.org/10.1371/journal.pone.0145406.

Fisher, R. A. 1938. “Presidential Address to the First Indian Statistical Congress.” Sankhya 4:14–17. https://www.jstor.org/stable/40383882.

Flake, Jessica Kay, and Eiko I. Fried. 2019. “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them.” E-print, PsyArXiV:hs7wm. https://doi.org/10.31234/osf.io/hs7wm.

Gelman, Andrew, and Eric Loken. 2014. “The Statistical Crisis in Science.” American Scientist 102:460–65. https://doi.org/10.1511/2014.111.460.

Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.

Harrington, Anne. 1989. Medicine, Mind and the Double Brain: A Study in Nineteenth Century Thought. Princeton, New Jersey: Princeton University Press.

Jacobs, Abigail Z., and Hanna Wallach. 2019. “Measurement and Fairness.” E-print, arxiv:1912.05511. https://arxiv.org/abs/1912.05511.

Kahn, Joan R., and J. Richard Udry. 1986. “Marital Coital Frequency: Unnoticed Outliers and Unspecified Interactions Lead to Erroneous Conclusions.” American Sociological Review 51:734–37. https://doi.org/10.2307/2095496.

Kpotufe, Samory. 2011. “K-Nn Regression Adapts to Local Intrinsic Dimension.” In Advances in Neural Information Processing Systems 24 [Nips 2011], edited by John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando Pereira, and Kilian Q. Weinberger, 729–37. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/4455-k-nn-regression-adapts-to-local-intrinsic-dimension.

Kpotufe, Samory, and Vikas Garg. 2013. “Adaptivity to Local Smoothness and Dimension in Kernel Regression.” In Advances in Neural Information Processing Systems 26 [Nips 2013], edited by C. J. C. Burges, Léon Bottou, Max Welling, Zoubin Ghahramani, and Kilian Q. Weinberger, 3075–83. Red Hook, New York: Curran Associates. https://papers.nips.cc/paper/5103-adaptivity-to-local-smoothness-and-dimension-in-kernel-regression.

Malik, Momin M. 2018. “Bias and Beyond in Digital Trace Data.” PhD thesis, Pittsburgh, Pennsylvania: Carnegie Mellon University. http://reports-archive.adm.cs.cmu.edu/anon/isr2018/abstracts/18-105.html.

Malik, Momin M., Hemank Lamba, Constantine Nakos, and Jürgen Pfeffer. 2015. “Population Bias in Geotagged Tweets.” In Papers from the 2015 ICWSM Workshop on Standards and Practices in Large-Scale Social Media Research [Icwsm-15 Spsm], 18–27. Association for the Advancement of Artificial Intelligence. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10662.

O’Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York: Crown.

Orben, Amy, and Andrew K. Przybylski. 2019. “The Association Between Adolescent Well-Being and Digital Technology Use.” Nature Human Behaviour 3:173–82. https://doi.org/10.1038/s41562-018-0506-1.

Pick, Daniel. 1989. Faces of Degeneration: A European Disorder, C. 1848 – C. 1918. Cambridge: Cambridge University Press.

Quiñonero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, eds. 2009. Dataset Shift in Machine Learning. Cambridge, Massachusetts: MIT Press.

Rosenbaum, Paul, and Donald Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70:41–55. http://www.jstor.org/stable/2335942.

Rubin, Donald B., and Richard P. Waterman. 2006. “Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology.” Statistical Science 21:206–22. https://doi.org/10.1214/088342306000000259.

Shalizi, Cosma Rohilla. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.

Shalizi, Cosma Rohilla, Abigail Z. Jacobs, Kristina Lisa Klinkner, and Aaron Clauset. 2011. “Adapting to Non-Stationarity with Growing Expert Ensembles.” Statistics Department, CMU. http://arxiv.org/abs/1103.0949.

Shallice, Tim, and Richard P. Cooper. 2011. The Organisation of Mind. Oxford: Oxford University Press.

Sharma, Amit, Jake M. Hofman, and Duncan J. Watts. 2015. “Estimating the Causal Impact of Recommendation Systems from Observational Data.” In Proceedings of the Sixteenth ACM Conference on Economics and Computation [Ec ’15], edited by Michal Feldman, Michael Schwarz, and Tim Roughgarden, 453–70. New York: The Association for Computing Machinery. https://doi.org/10.1145/2764468.2764488.

Simon, Herbert. 1991. Models of My Life. New York: Basic Books.

Yarkoni, Tal. 2019. “The Generalizability Crisis.” E-print, psyArXiv:jqw35. https://doi.org/10.31234/osf.io/jqw35.


  1. Why “concept drift”? Because some of the early work on classifiers in machine learning came out of work in artificial intelligence on learning “concepts”, which in turn was inspired by psychology, and the idea was that you’d mastered a concept, like “circle” or “triangle”, if you could correctly classify instances as belonging to the concept or not; this meant learning a mapping from the features \(X\) to binary labels. If the concept changed over time, the right mapping would change; hence “concept drift”.