\[ \newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. (Fisher 1938, 17)
This lecture covers common sources of technical failure in data mining projects — the kinds of issues which lead to them just not working. (Whether they would be worth doing even if they did work is another story, for the last lecture.) We’ll first look at four sources of technical failure which are pretty amenable to mathematical treatment:
The second half of the notes covers three distinct issues about measurement, model design and interpretation:
They’re less mathematical but actually more fundamental than the first set of issues.
“Covariate shift” = \(\Prob{Y|X}\) stays the same but \(\Prob{X}\) changes
“Prior probability shift” or “class balance shift” = \(\Prob{X|Y}\) stays the same but \(\Prob{Y}\) changes
“Concept drift”[1] = \(\Prob{X}\) stays the same, but \(\Prob{Y|X}\) changes, or, similarly, \(\Prob{Y}\) stays the same but \(\Prob{X|Y}\) changes
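To make covariate shift concrete, here is a minimal numerical sketch (not from the notes; the setup and all variable names are illustrative). \(Y = X + \text{noise}\), so \(\Prob{Y|X}\) is fixed, but the test distribution of \(X\) is shifted relative to the training distribution. A naive average of the training \(Y\)'s misses the test-time mean of \(Y\); reweighting each training point by the density ratio \(p_{\text{test}}(x)/p_{\text{train}}(x)\) (importance weighting, known in closed form here, but something that has to be estimated in practice) recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    # Gaussian density, written out so no extra libraries are needed
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 100_000
x_train = rng.normal(0.0, 1.0, n)             # training: X ~ N(0, 1)
y_train = x_train + rng.normal(0.0, 0.1, n)   # Pr(Y|X) is the same everywhere

# Test distribution: X ~ N(1, 1), so the test-time mean of Y is 1, not 0.
# Importance weights w(x) = p_test(x) / p_train(x).
w = normal_pdf(x_train, 1.0, 1.0) / normal_pdf(x_train, 0.0, 1.0)

naive = y_train.mean()                      # ignores the shift, close to 0
weighted = np.average(y_train, weights=w)   # corrects for it, close to 1
```

The same weighting idea extends from estimating a mean to fitting a whole model: minimize the importance-weighted training loss instead of the unweighted one (see Quiñonero-Candela et al. 2009 and Cortes, Mansour, and Mohri 2010 for what can go wrong when the weights are heavy-tailed).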
Figure (from Orben and Przybylski 2019, Fig. 1). Top panel: standardized linear regression coefficient for the measure of adolescent well-being on digital technology use, with nominally significant coefficients shown in black and nominally non-significant ones in red; the dotted line is the median across all 372 specifications. Bottom panel: the specifications themselves, showing which variables went into the measure of technology use, which variables went into the measure of well-being, whether the well-being measure was a mean or a max, and whether demographic features were included as controls in the regression.
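The logic of the figure can be sketched in a few lines of code. This is a hypothetical toy version, not Orben and Przybylski's actual analysis: the data are pure synthetic noise, there are only 72 specifications rather than 372, and all names are made up. The point is the mechanics of looping over analyst choices (which items go into each measure, mean vs. max aggregation, controls or not) and looking at the whole distribution of coefficients rather than one hand-picked estimate:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: 3 candidate technology-use items, 4 candidate
# well-being items, one demographic control.  All independent noise,
# so the "true" coefficient is zero in every specification.
n = 2_000
tech = rng.normal(size=(n, 3))
well = rng.normal(size=(n, 4))
demog = rng.normal(size=n)

coefs = []
for t_items in itertools.combinations(range(3), 2):        # tech items used
    for w_items in itertools.combinations(range(4), 2):    # well-being items used
        for agg in (np.mean, np.max):                      # mean vs. max aggregation
            for controls in (False, True):                 # include demographics?
                x = tech[:, list(t_items)].mean(axis=1)
                y = agg(well[:, list(w_items)], axis=1)
                # Standardize so coefficients are comparable across specs
                x = (x - x.mean()) / x.std()
                y = (y - y.mean()) / y.std()
                cols = [np.ones(n), x] + ([demog] if controls else [])
                X = np.column_stack(cols)
                beta = np.linalg.lstsq(X, y, rcond=None)[0]
                coefs.append(beta[1])      # coefficient on technology use

median_coef = np.median(coefs)             # the "dotted line" in the figure
```

Plotting the sorted `coefs`, colored by nominal significance, gives the top panel; a grid marking which choices each specification made gives the bottom panel.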
Anderson, Chris. 2008. “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Wired 16 (17). http://www.wired.com/2008/06/pb-theory/.
Becker, Howard S. 2017. Evidence. Chicago: University of Chicago Press.
Bhattacharya, Indrajit, and Lise Getoor. 2007. “Collective Entity Resolution in Relational Data.” ACM Transactions on Knowledge Discovery from Data 1 (1):5. https://doi.org/10.1145/1217299.1217304.
Borges, Jorge Luis. n.d. Ficciones. New York: Grove Press.
Borsboom, Denny. 2005. Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge, England: Cambridge University Press.
———. 2006. “The Attack of the Psychometricians.” Psychometrika 71:425–40. https://doi.org/10.1007/s11336-006-1447-6.
Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199535255.001.0001.
Cortes, Corinna, Yishay Mansour, and Mehryar Mohri. 2010. “Learning Bounds for Importance Weights.” In Advances in Neural Information Processing Systems 23 [NIPS 2010], edited by John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and A. Culotta, 442–50. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/4156-learning-bounds-for-importance-weighting.
Cox, D. R., and Christl A. Donnelly. 2011. Principles of Applied Statistics. Cambridge, England: Cambridge University Press. https://doi.org/10.1017/CBO9781139005036.
Diaz, Fernando, Michael Gamon, Jake M. Hofman, Emre Kiciman, and David Rothschild. 2016. “Online and Social Media Data as an Imperfect Continuous Panel Survey.” PLOS One 11:e0145406. https://doi.org/10.1371/journal.pone.0145406.
Fisher, R. A. 1938. “Presidential Address to the First Indian Statistical Congress.” Sankhya 4:14–17. https://www.jstor.org/stable/40383882.
Flake, Jessica Kay, and Eiko I. Fried. 2019. “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them.” E-print, PsyArXiv:hs7wm. https://doi.org/10.31234/osf.io/hs7wm.
Gelman, Andrew, and Eric Loken. 2014. “The Statistical Crisis in Science.” American Scientist 102:460–65. https://doi.org/10.1511/2014.111.460.
Györfi, László, Michael Kohler, Adam Krzyżak, and Harro Walk. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag.
Harrington, Anne. 1989. Medicine, Mind and the Double Brain: A Study in Nineteenth Century Thought. Princeton, New Jersey: Princeton University Press.
Jacobs, Abigail Z., and Hanna Wallach. 2019. “Measurement and Fairness.” E-print, arxiv:1912.05511. https://arxiv.org/abs/1912.05511.
Kahn, Joan R., and J. Richard Udry. 1986. “Marital Coital Frequency: Unnoticed Outliers and Unspecified Interactions Lead to Erroneous Conclusions.” American Sociological Review 51:734–37. https://doi.org/10.2307/2095496.
Kpotufe, Samory. 2011. “k-NN Regression Adapts to Local Intrinsic Dimension.” In Advances in Neural Information Processing Systems 24 [NIPS 2011], edited by John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando Pereira, and Kilian Q. Weinberger, 729–37. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/4455-k-nn-regression-adapts-to-local-intrinsic-dimension.
Kpotufe, Samory, and Vikas Garg. 2013. “Adaptivity to Local Smoothness and Dimension in Kernel Regression.” In Advances in Neural Information Processing Systems 26 [NIPS 2013], edited by C. J. C. Burges, Léon Bottou, Max Welling, Zoubin Ghahramani, and Kilian Q. Weinberger, 3075–83. Red Hook, New York: Curran Associates. https://papers.nips.cc/paper/5103-adaptivity-to-local-smoothness-and-dimension-in-kernel-regression.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014a. “Google Flu Trends Still Appears Sick: An Evaluation of the 2013–2014 Flu Season.” Electronic pre-print, SSRN/2408560. https://doi.org/10.2139/ssrn.2408560.
———. 2014b. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343:1203–5. https://doi.org/10.1126/science.1248506.
Malik, Momin M. 2018. “Bias and Beyond in Digital Trace Data.” PhD thesis, Pittsburgh, Pennsylvania: Carnegie Mellon University. http://reports-archive.adm.cs.cmu.edu/anon/isr2018/abstracts/18-105.html.
Malik, Momin M., Hemank Lamba, Constantine Nakos, and Jürgen Pfeffer. 2015. “Population Bias in Geotagged Tweets.” In Papers from the 2015 ICWSM Workshop on Standards and Practices in Large-Scale Social Media Research [ICWSM-15 SPSM], 18–27. Association for the Advancement of Artificial Intelligence. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10662.
O’Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York: Crown.
Orben, Amy, and Andrew K. Przybylski. 2019. “The Association Between Adolescent Well-Being and Digital Technology Use.” Nature Human Behaviour 3:173–82. https://doi.org/10.1038/s41562-018-0506-1.
Pick, Daniel. 1989. Faces of Degeneration: A European Disorder, c. 1848 – c. 1918. Cambridge: Cambridge University Press.
Quiñonero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, eds. 2009. Dataset Shift in Machine Learning. Cambridge, Massachusetts: MIT Press.
Rosenbaum, Paul, and Donald Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70:41–55. http://www.jstor.org/stable/2335942.
Rubin, Donald B., and Richard P. Waterman. 2006. “Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology.” Statistical Science 21:206–22. https://doi.org/10.1214/088342306000000259.
Shalizi, Cosma Rohilla. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.
Shalizi, Cosma Rohilla, Abigail Z. Jacobs, Kristina Lisa Klinkner, and Aaron Clauset. 2011. “Adapting to Non-Stationarity with Growing Expert Ensembles.” E-print, arxiv:1103.0949. http://arxiv.org/abs/1103.0949.
Shallice, Tim, and Richard P. Cooper. 2011. The Organisation of Mind. Oxford: Oxford University Press.
Sharma, Amit, Jake M. Hofman, and Duncan J. Watts. 2015. “Estimating the Causal Impact of Recommendation Systems from Observational Data.” In Proceedings of the Sixteenth ACM Conference on Economics and Computation [EC ’15], edited by Michal Feldman, Michael Schwarz, and Tim Roughgarden, 453–70. New York: The Association for Computing Machinery. https://doi.org/10.1145/2764468.2764488.
Simon, Herbert. 1991. Models of My Life. New York: Basic Books.
Yarkoni, Tal. 2019. “The Generalizability Crisis.” E-print, PsyArXiv:jqw35. https://doi.org/10.31234/osf.io/jqw35.
[1] Why “concept drift”? Because some of the early work on classifiers in machine learning came out of work in artificial intelligence on learning “concepts”, which in turn was inspired by psychology. The idea was that you had mastered a concept, like “circle” or “triangle”, if you could correctly classify instances as belonging to the concept or not; this meant learning a mapping from the features \(X\) to binary labels. If the concept changed over time, the right mapping would change; hence “concept drift”.