Statistics 36-462/662: Data Mining
Fall 2019
Prof. Cosma Shalizi
Mondays and Wednesdays 1:30--2:50 Doherty Hall 1212
Data mining is the art of extracting useful patterns from large bodies of data.
(Metaphorically: finding seams of actionable knowledge in the raw ore of
information.) The rapid growth of computerized data, and the computer power
available to analyze it, creates great opportunities for data mining in
business, medicine, science, government and elsewhere. The aim of this course
is to help you take advantage of these opportunities in a responsible way.
After taking the class, when you're faced with a new problem, you should be
able to (1) select appropriate methods, and justify their choice, (2) use and
program statistical software to implement them, and (3) critically evaluate the
results and communicate them to professional colleagues who are not also
statisticians or data-analysts.
Data mining is related to statistics
and to machine learning, but has its own
aims and scope. Statistics deal with systems for creating reliable inferences
from imperfect data. Machine learning develops technologies of automated
prediction and induction. We will use tools from statistics and from machine
learning, but we will use them as tools, not things to study in their own
right. We will do a lot of calculations, but will not prove many theorems, and
we will do many more experiments than calculations.
Prerequisites
For this semester, the prerequisite is
passing 36-401 (modern
regression) with a grade of C or better. The only exceptions are for
graduate students taking the graduate section (36-662; see below).
Graduate students and 36-662
This course is primarily intended as an advanced elective for undergraduate
students majoring in statistics. A limited number of graduate students are
welcome to take it as 36-662. (Graduate students who register for 36-462 will
be dropped from the roster.) Formally, 662 has no prerequisites, but students
are expected to have a background at least equivalent to 36-401, and the
prerequisites to 401: linear regression modeling, probability, mathematical
statistics, linear algebra, calculus, and competence in using R for data
analysis. This class doesn't have the resources to help those without this
background catch up.
Undergraduates must take the course as 36-462; those who register for 662
will be dropped from the roster.
Instructors
Professor | Dr. Cosma Shalizi | cshalizi [at] cmu.edu |
| | Baker Hall 229C |
Teaching assistant | Ms. Robin Dun | not to be bothered with e-mails |
Topics and Techniques to be Covered
These may change during the semester.
- Searching by similarity and information retrieval
- Searching by content (texts, images, genes, ...); attributes,
representations and definitions of similarity and distance; choice of
representation. Bag-of-words and vector-space representations of text. A
first look at linear classifiers, nearest neighbors, prototypes; user feedback;
evaluating searches.
- Dimension reduction
- Using principal components to reduce the number of features; using factor
analysis to reduce the number of features; using random projections to reduce
the number of features. Strengths and limitations of linear
dimension-reduction methods. Nonlinear dimensionality reduction: local linear
embedding, diffusion maps, random nonlinear features. Predictive embedding
methods; word2vec. Case studies: the "Big 5" personality traits;
latent semantic indexing; Netflix; Cambridge Analytica.
- Information
- Quantitative measures of information,
uncertainty, and information gain; information-theoretic feature selection.
- Nearest neighbor methods
- Nearest neighbors for classification, regression, and general prediction.
Finding nearest neighbors: data structures; locality-sensitive hashing; LSH and
information retrieval. Case studies: recommender systems / "you may also
like"; matching estimates of treatment effects.
- Clustering
- Unsupervised category-learning, a.k.a. clustering. Non-probabilistic
clustering methods: k-means clustering, hierarchical clustering. Geometry of
clustering, ideas about what makes for good clusters; clusters and hashing.
Probabilistic clustering: mixture models, hierarchical mixture (topic) models.
Case studies: demographic clustering; topic
models for text; topic models for genetics.
- Classifiers
- Supervised categorization, a.k.a. classification. Linear classifiers;
logistic regression. Evaluating classifiers: error rates, over- and under-
fitting, cross-validation; base rates, operating characteristics, ROC curves,
Neyman-Pearson classifiers. Support-vector machines and the "kernel trick."
"Fairness" in classification: definitions and disputes. Case studies: credit
scoring; "clinical vs. actuarial judgment"; COMPAS.
- Trees and ensembles
- Classification and regression trees; tree-growing algorithms. "Fast and
frugal heuristics". Basic idea of ensembles: combining predictive models; the
virtues of diversity. Bagging. Random forests. Boosting. Stacking.
Mixtures of experts and hierarchical mixtures of experts.
Case studies to be determined.
- Waste, Fraud, and Abuse
- When data mining will fail: bad data, wrong data, insufficient data,
overwhelming false positives, impossible problems, attacking the wrong problem.
Evaluating claims about data mining and calling BS. When data mining is evil.
Case studies: are we building a dystopia to make each other click on ads?
Course Mechanics and Grading
There are three reasons you will get assignments in this course. In order of
decreasing importance:
- Practice. Practice is essential to developing the skills
you are learning in this class. It also actually helps you learn, because some
things which seem murky clarify when you actually do them, and sometimes trying
to do something shows that you only thought you understood it.
- Feedback. By seeing what you can and cannot do, and what
comes easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to
stake its reputation (and that of its faculty) on assuring the world that you
have mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be three kinds of assignment in this
course.
- Homework
- Every week will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may theory, analyzing real data sets on
the computer, and communicating the results.
- All homework will be submitted electronically through Canvas. Most weeks,
homework will be due at 10:00 pm on Wednesday. Homework
assignments will always be released by Friday of the previous week, and
sometimes before.
- There are specific formatting requirements for
homework --- see below.
- Online assignments
- Most homework assignments will be accompanied by short-answer or multiple
choice questions, to be done on Canvas, before the homework itself is
due. These will typically be due on Sunday nights; consult Canvas for the
exact schedule.
- In-class exercises
- Most lectures will have in-class exercises. These will be short (10--20
minutes) assignments, emphasizing problem solving, done in class in small
groups of no more than four people. The assignments will be given out in
class, and must be handed in on paper by the end of class. On most days, a
randomly-selected group will be asked to present their solution to the class.
- The in-class exercises will either be problems from that week's homework,
or close enough that seeing how to do the exercise should tell you how to do
some of the problems.
There will be no exams.
Time expectations
You should expect to spend 5--7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grading
Grades will be broken down as follows:
- In-class exercises: 10%. All exercises will have equal weight. Your
lowest two exercise grades will be dropped; if you complete all
exercises with a score of at least 50%, your lowest four grades will be
dropped.
- Homework: 90%. All homework assignments will be weighted equally. Most
assignments will include online short-answer questions, due before the
main assignment (see above); these will be incorporated into the grades for the
corresponding assignments. The online questions will amount to about 10% of
your final grade.
Your lowest two homework grades will be dropped. If you complete all
homework assignments, with a minimum score of at least 60%, your
lowest three homework grades will be dropped.
Late homework will not be accepted for any reason.
(The point of dropping the low grades is to let you schedule interviews,
social engagements, and other obligations, with flexibility, and without my
having to decide what is or is not important enough to change the rules.)
Grade boundaries will be as follows:
A | [90, 100] |
B | [80, 90) |
C | [70, 80) |
D | [60, 70) |
R | < 60 |
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that particular
assignment was wrongly graded, tell me as soon as possible. Direct any
questions or complaints about your grades to me; the teaching assistants have
no authority to make changes. (This also goes for your final letter grade.)
Complaints that the thresholds for letter grades are unfair, that you deserve a
higher grade, etc., will accomplish much less than pointing to concrete
problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I
need to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from my
teachers?".
Lectures
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. They are also when you
will do the graded in-class assignments which will help consolidate your
understanding of the material, and help with your homework.
Readings
There are (up to) three classes of readings associated with each lecture.
- Key readings: Usually from the textbook,
occasionally from other sources which will be made available well in advance.
Do this reading before coming to lecture, and lecture will make a lot more
sense. (Also, the online exercises will be easy.)
- Suggested readings: Usually easier-to-read articles or academic
papers, or portions of the non-required texts, which will deepen your
understand of what we're doing in lecture and in the homework, but aren't quite
so critical.
- Background readings: Usually harder-to-follow or more historical papers,
which give further details about methods and applications.
TL;DR: definitely the key readings before class;
try to do the suggested readings too, they're easy; and the background readings are if you want to get into the weeds.
Electronics
Please don't use any electronic devices during lecture: no laptops, no
tablets, no phones, no recording devices, no watches that do more than tell
time. The only exception to this rule is for electronic assistive devices,
properly documented with CMU's Office of Disability Resources.
(The no-electronics rule is not arbitrary meanness on my part. Experiments
show,
pretty
clearly, that students learn more in electronics-free classrooms, not least
because your device isn't distracting your neighbors, who aren't as good at
multitasking as you are.)
R, R Markdown, and Reproducibility
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before). In this class, you'll be using it for
every homework. If you are not able to use R, or do not have ready,
reliable access to a computer on which you can do so, let me know at once.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it; this writing is part of
the assignment and will be graded. Raw computer output and R code is not
acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up
in R Markdown. (If you know
what knitr is and would rather use it,
ask first.) R Markdown is a system that lets you embed R code, and its output,
into a single document. This helps ensure that your work
is reproducible, meaning that other people can re-do your
analysis and get the same results. It also helps ensure that what you report
in your text and figures really is the result of your code (and not some
brilliant trick that you then forget, or obscure bug that you didn't notice).
For help on using R Markdown,
see "Using R Markdown
for Class Reports". (Part of the first lecture will be going over
the basics of R Markdown.)
For each assignment, you should generally submit two, and only two, files:
an R Markdown source file, integrating text, generated figures and R code, and
the "knitted," humanly-readable document, in either PDF (preferred) or HTML
format. (I cannot read Word files, and you will lose points if you submit
them.) I will be re-running the R Markdown file of randomly selected students;
you should expect to be picked for this about once in the semester. You will
lose points if your R Markdown file does not, in fact, generate your knitted
file (making obvious allowances for random numbers, etc.).
Some problems in the homework will require you to do math. R Markdown
provides a simple but powerful system for type-setting math. (It's based on
the LaTeX document-preparation system widely used in the sciences.) If you
can't get it to work, you can hand-write the math and include scans or photos
of your writing in the appropriate places in your R Markdown document. (You
should upload these additional files to Canvas.) You will, however, lose
points for doing so, starting with no penalty for homework 1, and growing to a
90% penalty (for those problems) by the last homework.
Canvas and Piazza
Homework will be submitted electronically through Canvas, which will also be
used as the gradebook. Some readings and course materials will also be
distributed through Canvas.
We will be using the Piazza website for question-answering. You will
receive an invitation within the first week of class.
Anonymous-to-other-students posting of questions and replies will be allowed,
at least initially. (Postings will not be anonymous to instructors.)
Anonymity will go away for everyone if it is abused. As noted above, homework
will generally be due at 10 pm on Wednesdays; however, you should not expect
the instructors to answer questions on Piazza (or by e-mail) after 6 pm.
Solutions
Solutions for all homework will be available, after their due date, through
Canvas. Please don't share them with anyone, even after the course has ended.
Office Hours
Wednesdays | 12:30--1:20 | Prof. Shalizi | Baker Hall 229C |
Tuesdays | 4:00--5:00 | Ms. Dunn | Baker Hall 132Q |
(TA office hours are TBD but will be announced here when set)
If you want help with computing, please bring a laptop.
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately, please e-mail me about making an appointment.
Textbook
Our textbook is
It's available electronically through the university library, or you can
buy a copy.
The following books are recommended but not required;
portions of them may be distributed (via Canvas) as readings, as needed:
- Richard A. Berk, Statistical Learning from a Regression Perspective (2nd edition, New York: Springer, 2016) [available online through the university library]
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition, Berlin: Springer-Verlag, 2009) [free online]
- Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman,
Mining of Massive Datasets (second edition, Cambridge, England: Cambridge University Press, 2014) [free online]
- Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Books, 2016) [Available
online through the university library]
- Paul Teetor, The R Cookbook (Sebastopol, California: O'Reilly, 2011) [R's help files are organized around "What does command X do?"; this is organized around "I want to do Y, what commands do I need to use?"]
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbook and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss, in general terms, homework with other
students in the class, though not to share or compare work; such conversations
must be acknowledged in your assignments. You may not discuss the
content of assignments with anyone other than current students or the
instructors until after the assignments are due. (Exceptions can be made, with
prior permission, for approved tutors.) You are, naturally, free to complain,
in general terms, about any aspect of the course, to whomever you like.
Any use of solutions provided for any assignment in this course in previous
years is strictly prohibited. This prohibition applies even to students who
are re-taking the course. Do not copy the old solutions (in whole or in part),
do not "consult" them, do not read them, do not ask your friend who took the
course last year if they "happen to remember" or "can give you a hint". Doing
any of these things, or anything like these things, is cheating, it is easily
detected cheating, and those who thought they could get away with it in the
past have failed the course. Even more importantly: doing any of those things
means that the assignment doesn't give you a chance to practice; it
makes any feedback you get meaningless; and of course it makes any evaluation
based on that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will be no penalty for asking. If you do violate
these policies but then think better of it, it is your responsibility to tell
me as soon as possible to discuss how your mis-deeds might be rectified.
Otherwise, violations of any sort will lead to severe, formal disciplinary
action, under the terms of the university's
policy
on academic integrity.
You must complete "homework 0" on the content of the university's policy on
academic integrity, and on these course policies. This assignment will not
factor into your grade, but you must complete it, with a grade of at
least 80%, before you can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of
Disability Resources provides support services for both physically disabled
and learning disabled students. For individualized academic adjustment based
on a documented disability, contact Disability Resources at access [at]
andrew.cmu.edu or 412-268-6121; they will coordinate with me.
Schedule, Lecture Notes and Readings
Lecture notes and readings will be linked to here, as they become available.
Lecture topics are tentative and subject to change. Don't expect required and
suggested readings to finalize until about a week before each lecture.
PDM = Principles of Data Mining, our textbook
TBD = to be determined = I'll figure it out between when whenever you're reading this and when the readings need to be made available.
- Lecture 1 (Monday, 26 August): Introduction to the course
- What is data mining? how is it used? where did it come from? Some themes. R Markdown.
- Key readings:
- PDM, chapter 1
- "Using R Markdown for Class Reports"
- Suggested readings:
- O'Neil, Introduction and Chapter 1
- Very deep background readings:
- James Beniger, The Control Revolution: Technological and Economic Origins of the Information Society (Cambridge, Massachusetts: Harvvard University Press, 1986)
- Herbert Simon, The Shape of Automation for Men and Management (New York: Harper and Row, 1965) [Reprinted as The New Science of Management Decisions]
- Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society (Boston: Houghton Mifflin, 1950)
- JoAnne Yates, Control through Communication: The Rise of System in American Management (Baltimore: Johns Hopkins University Press, 1989)
- After class: Slides (.Rmd source), welcome.Rmd example file
- Lecture 2 (Wednesday, 28 August): Information retrieval I
- Finding data by content. Approaches we will neglect: metadata; "coding" or
"content analysis". Textual features. The bag-of-words representation.
Vector representations of documents. Measures of similarity and the importance
of normalization; inverse document frequency and other tweaks to the
representation. Our first stab at nearest-neighbor classification. Example
with
the New
York Times Annotated Corpus. Abstraction and generalizing to
other types of data.
- Key reading: PDM, chapter 14
- After class: Handout with useful examples (.Rmd)
- Homework 0
- Assignment on Canvas only
- Due on Sunday, 1 September at 10 pm
- Lecture 3 (Wednesday, 4 September): Information retrieval II
- The trick to searching: queries are documents. Search evaluation:
precision, recall, precision-recall curves; error rates. Incorporating
user feedback. Proxies for relevance like "engagement". Classification:
nearest neighbors, linear classifiers, prototypes. Classifier evaluation by
mis-classification rate and by confusion matrices.
- After class: Slides (.Rmd)
- Homework 1
- Due on Wednesday, 4 September at 10 pm
- Assignment, nyt_corpus.zip data set, nytac-and-bow.R code
- Lecture 4 (Monday, 9 September): Dimension reduction I
- Goals of dimension reduction. Linear algebra review: vectors, matrices,
inner products, projections. Eigen-decomposition.
- Key reading: Excerpts from Stone, Independent Component Analysis (on Canvas)
- Lecture 5 (Wednesday, 11 September): Dimension reduction II
- Principal components analysis. Distance-preserving linear projection.
Performing and interpreting PCA. Example with historical complexity.
"Latent semantic indexing" = PCA on bag-of-words vectors. Multidimensional scaling and PCA. The Netflix problem, singular value decomposition, and PCA.
- Key reading: PDM, chapter 3, especially sections 3.6 and 3.7
- Suggested readings:
- Andrey Feuerverger, Yu He, and Shashi Khatri, "Statistical Significance of the Netflix Challenge", Statistical Science 27 (2012): 202--231
- Hastie et al., section 14.5
- Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on PCA
- Background readings:
- Peter Turchin et al., "Quantitative historical analysis uncovers a single dimension of complexity that structures global variation in human social organization", Proceedings of the National Academy of Sciences (USA) 115 (2018) E144--E151
- Michael E. Wall, Andreas Rechtsteiner and Luis M. Rocha, "Singular
Value Decomposition and Principal Component Analysis",
physics/0208101
- After class: Slides (.Rmd, including commented R commands for all figures)
- Homework 2
- Due on Wednesday, 11 September at 10 pm
- Assignment
- Lecture 6 (Monday, 16 September): Dimension reduction III
- Factor analysis: a generative model of the data (unlike PCA), which aims to
preserve correlations and a linear structure. Factor analysis as a
kind of linear manifold learning. Origins of factor analysis in psychometrics.
Utility and drawbacks. Examples: personality testing, Netflix (again),
Cambridge Analytica.
- Key reading: PDM, chapter 3, section 3.6
- Suggested readings:
- Matthew Hindman, "This is how Cambridge Analytica’s Facebook targeting model really worked — according to the person who built it", The Conversation, 30 March 2018
- O'Neil, chapter 6 ("Ineligible to Serve")
- Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on factor models
- Background readings:
- David J. Bartholomew, Latent Variable Models and Factor Analysis
- Annie Murphy Paul, The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves (New York: Free Press, 2004)
- Charles Spearman, "``General Intelligence,'' Objectively Determined
and Measured", American Journal of Psychology 15
(1904): 201--293 [Online]
- Godfrey H. Thomson, The Factorial Analysis of
Human Ability
- L. L. Thurstone, "The Vectors of Mind", Psychological Review 41 (1934): 1--32 [Online]
- After class: Slides (.Rmd)
- Lecture 7 (Wednesday, 18 September): Dimension reduction IV
- Random linear projections. Nonlinear dimension reduction: local
linear embedding.
- Key readings:
- PDM, section 6.5
- Shalizi, Advanced Data Analysis from an Elementary Point of View, appendix on nonlinear dimensionality reduction
- Suggested readings:
- Sam T. Roweis and Laurence K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding", Science 290 (2000): 2323--2326
- Lawrence K. Saul and Sam T. Roweis, "Think Globally, Fit Locally: Supervised Learning of Low Dimensional Manifolds", Journal of Machine Learning Research 4 (2003): 119--155
- After class: Slides (.Rmd, including a working LLE function)
- Homework 3
- Due on Wednesday, 18 September at 10 pm
- Assignment; data file; eigendresses.R
- Lecture 8 (Monday, 23 September): Dimension reduction V
- Nonlinear dimension reduction: diffusion maps
- Background reading:
- Mikhail Belkin and Partha Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation",
Neural Computation 15 (2003): 1373--1396 [PDF via Prof. Belkin]
- Ann B. Lee and Larry Wasserman, "Spectral Connectivity Analysis", Journal of the American Statistical Association 105 (2010): 1241--1255, arxiv:0811.0121
- After class: Handout
- Lecture 9 (Wednesday, 25 September): Nearest neighbors I: Mostly theoretical
- k-nearest neighbors for classification, regression, and general prediction.
Performance of nearest-neighbor prediction (=1NN) for noise-free function
learning; how and why the distance to the nearest neighbor goes to zero.
Effects of noise. Why 1NN converges to a risk of twice the minimum possible
risk. Bias-variance trade-off for k-nearest-neighbors.
- Key reading: PDM, sections 6.3.3, 10.1--10.2, 10.6, 10.10--10.11
- Suggested readings:
- Hastie et al., chapter 13 (skimming section 13.2)
- Background readings:
- Thomas M. Cover, "Estimation by the Nearest Neighbor Rule", IEEE Transactions on Information Theory 14 (1968): 50--55 [PDF reprint via Prof. Cover]
- Thomas M. Cover, "Rates of Convergence for Nearest Neighbor Procedures", pp. 413--415 in B. K. Kinariwala and F. F. Kuo (eds.), Proceedings of the Hawaii International Conference on Systems Sciences (Honolulu: University of Hawaii Press, 1968) [PDF reprint via Prof. Cover]
- Thomas M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory 13 (1967): 21--27 [PDF reprint via Prof. Cover]
- After class: Notes (.Rmd source)
- Homework 4
- Due on Wednesday, 25 September at 10 pm
- Assigment
- Lecture 10 (Monday, 30 September): Nearest neighbors II: Mostly statistical
- The problem of selecting the number of nearest neighbors. Selecting k by
cross-validation: v-fold versus leave-one-out. Nearest neighbors as
a linear smoother, and the short-cut for leave-one-out CV.
- Key reading: PDM, sections 10.10--10.11
- Background reading:
- Mona Azadkia, "Optimal choice of k for k-nearest neighbor regression", arxiv:1909.05495
- Seymour Geisser, "The Predictive Sample Reuse Method with Applications", Journal of the American Statistical Association 70 (1975): 320--328
- Seymour Geisser and William F. Eddy, "A Predictive Approach to Model Selection", Journal of the American Statistical Association 74 (1979): 153--160
- M. Stone, "Cross-validatory choice and assessment of statistical predictions", Journal of the Royal Statistical Society B 36 (1974): 111--147
- Lecture 11 (Wednesday, 2 October): Nearest neighbors III: Mostly computational
- Finding nearest neighbors: data structures; locality-sensitive hashing;
random projections again; LSH and information retrieval.
- Key reading: PDM, sections 12.1--12.4
- Suggested reading: Leskovec et al., chapter 3
- Background reading:
- Aristides Gionis, Piotr Indyk and Rajeev Motwani, "Similarity Search in High Dimensions via Hashing", pp. 518--529 in M. P. Atkinson et al. (eds.),
Proceedings of the 25th International Conference on Very Large Data Bases [VLDB '99] (San Francisco: Morgan Kaufmann, 1999)
- After class: Slides (.Rmd)
- Homework 5
- Due on Wednesday, 2 October at 10 pm
- Assignment
- Lecture 12 (Wednesday, 7 October): Recommender systems I: Broad Ideas and Technical Issues
- Background readings:
- Gérard Biau, Benoît Cadre, Laurent Rouvière, "Statistical analysis of k-nearest neighbor collaborative recommendation", Annals of Statistics
38 (2010): 1568--1592, arxiv:1010.0499
- Benjamin Marlin, Richard S. Zemel, Sam Roweis, Malcolm Slaney, "Collaborative Filtering and the Missing at Random Assumption", arxiv:1206.5267
- Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl, "GroupLens: an open architecture for collaborative filtering of netnews",
pp. 175--186 in Proceedings of the 1994 ACM conference on Computer supported cooperative work [CSCW '94] [HTML reprint via Prof. Resnick]
- Upendra Shardanand and Pattie Maes, "Social Information Filtering: Algorithms for
Automating ``Word of Mouth’’", pp. 210--217 in CHI '95: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems [PDF]
- After class: Slides (.Rmd)
- Lecture 13 (Wednesday, 9 October):
Recommender systems II: What's Not to Love?
- Lecture canceled due to instructor illness, but see below for slides
- Suggested reading (in this order):
- "Amazon.com Recommendations Understand Area Woman Better Than Husband", The Onion 9 January 2007
- Tom Slee, "Online Monoculture and the End of the Niche", Whimsley 15 March 2009
- Tommy Wallach, "In Which Netflix Achieves Sentience As A Result of My Terrible Decision-making", The Toast, 4 August 2014
- Zeynep Tufekci, "YouTube's Recommendation Algorithm Has a Dark Side",
Scientific American 320:4 (April 2019): 77
- ---, "YouTube, the Great Radicalizer", New York Times, 10 March 2018
- Background reading:
- Maurizio Ferrari Dacrema, Paolo Cremonesi and Dietmar Jannach, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", in Proceedings of the 13th ACM Conference on Recommender Systems [RecSys 2019],
arxiv:1907.06902
- After class: Slides
(Rmd)
- Homework 6
- Due on Wednesday, 9 October at 10 pm
- Assignment
- Lecture 14 (Monday, 14 October)
- Everything about recommendation engines we were supposed to cover last Wednesday.
- After class: Slides
(Rmd)
- Lecture 15 (Wednesday, 16 October): Information measures I: Scene-setting
- Definitions: entropy, mutual information, conditional mutual
information. MI and statistical independence. Kullback-Leibler
divergence (relative entropy). Information theory and coding. Information
theory and hypothesis testing. Maximum likelihood vs. minimum divergence.
- Suggested reading: David Feldman, "A Brief Introduction to
Information Theory, Excess Entropy and Computational Mechanics", Part I, "Information Theory" (pages 1--11) [PDF]
- Homework 7
- Due on Wednesday, 16 October at 10 pm
- Assignment
- Lecture 16 (Monday, 21 October): Information measures II: The information-theoretic view of prediction and estimation.
- Predictive information. Sufficient statistics. Statistical
relevance bases. The information-bottleneck method. Information
theory in parameter estimation.
- Background reading:
- Naftali Tishby, Fernando C. Pereira and William Bialek, "The Information Bottleneck Method", pp. 368--377 in B. Hajek and R. S. Sreenivas (eds.), Proceedings of the 37th Annual Allerton Conference on
Communication, Control and Computing (Urbana, Illinois: University of Illinois Press, 1999), arxiv:physics/0004057
- Handout: PDF (R code)
- Lecture 17 (Wednesday, 23 October:) Information measures III: Feature selection and predictive embeddings / dimension reduction.
- Selecting features to maximize predictive information; to minimize
stored information. Creation of new features to capture information:
the information bottleneck again; predictive embeddings; word2vec.
- Background readings:
- Marco Baroni, Georgiana Dinu and German Kruszewski, "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", pp. 238--247 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics [ACL14] (Baltimore, MD: Association for Computational Linguistics, 2014)
- Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arxiv:1402.3722
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arxiv:1301.3781
- Handout: PDF (R code)
- After class: Slides (.Rmd)
- Homework 8
- Due on Wednesday, 23 October at 10 pm
- Assignment
- Lecture 18 (Monday, 28 October): Clustering I
- Intro to clustering; k-means. Geometry of clusters.
- Key reading: PDM, sections 9.3--9.5 (skimming 9.1 and 9.2 is recommended)
- Suggested readings:
- Hastie et al., sections 14.3.1--14.3.6
- O'Neil, Chapters 9 and 10 ("No Safe Zone" and "The Targeted Citizen")
- After class: Notes to lecture 18
- Lecture 19 (Wednesday, 30 October): Clustering II
- Hierarchical clustering. Clustering as locality-sensitive hashing.
- Suggested reading: Hastie et al., section 14.3.12
- Homework 9
- Due on Wednesday, 30 October at 10 pm
- Assignment
- Lecture 19 (Monday, 4 November): Clustering III
- Probabilistic clustering with mixture models; the EM algorithm; k-means
again.
- Key reading: PDM, sections 9.2.4--9.2.5 (again), and section 9.6
- Suggested reading:
- Hastie et al., section 8.5
- Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on mixture models
- Lecture 20 (Wednesday, 6 November):
Clustering IV
- Using cross-validated log-likelihood to compare mixture models. Advanced probabilistic clustering: mixed membership; topic models in text mining and in genetics.
- Suggest reading: Shalizi, Advanced Data Analysis from an Elementary Point of View, chapter on mixture models
- Background readings:
- David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003): 993--1022
- April Galyardt, "Interpreting Mixed Membership", in Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds.), Handbook of Mixed Membership Models and Their Applications (New York; Chapman and Hall, 2016) [PDF]
- J. K. Pritchard, M. Stephens and P. Donnelly, "Inference of population structure using multilocus genotype data", Genetics 155 (2000): 945--959
- Homework 10
- Due on Wednesday, 6 November at 10 pm
- Lecture 21 (Monday, 11 November):
Classification I
- Review of how to evaluate classifiers, of linear classifiers and of nearest-neighbor classification.
Nonlinear classifiers: "kernel trick", support vector machines.
Random nonlinear features.
- Key reading: PDM, sections 10.3, 10.4, 10.7, 10.10, 10.11
- Suggested readings:
- Berk, chapter 7
- Hastie et al., chapters 4 and 12
- Background reading:
- Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning", pp. 1313--1320
in D. Koller et al. (eds.), Advances in Neural Information Processing Systems 21 [NIPS 2008]
- Lecture 22 (Wednesday, 13 November):
Classification II
- Notions of fairness for classifiers; critiques of "fairness"
- Key reading: O'Neil, chapters 1, 5, 6 and 8
- Suggested readings:
- Sam Corbett-Davies and Sharad Goel, "The Measure and
Mismeasure of Fairness: A Critical Review of Fair Machine
Learning", arxiv:1808.00023
- Simon DeDeo, "Wrong side of the tracks: Big Data and Protected Categories", pp. 31--42 in Cassidy R. Sugimoto, Hamid R. Ekbia and Michael Mattioli (eds.), Big Data Is Not a Monolith (Cambridge, Mass.: MIT Press, 2016), arxiv:1412.4643
- Background reading:
- Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan, "A Survey on Bias and Fairness in Machine Learning", arxiv:1908.09635
- Homework 11
- Due on Wednesday, 13 November at 10 pm
- Assignment
- Lecture 23 (Monday, 18 November):
Classification III
- Case studies in classifiers and their applications
- Background readings:
- Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, "Machine Bias", ProPublica 23 May 2016
- Alexandra Chouldechova, "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments", arxiv:1610.07524
- Sharad Goel, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, and Duncan J. Watts, "Predicting consumer behavior with Web search",
Proceedings of the National Academy of Sciences
(USA) 107 (2010): 17486--17490
- Lecture 24 (Wednesday, 20 November):
Trees and ensembles I
- Classification and regression trees: basics and examples
- Key reading: PDM, sections 5.2 and 10.5
- Suggested readings:
- Berk, chapter 3
- Hastie et al., section 9.2
- Homework 12
- Due on Wednesday, 20 November at 10 om
- Assignment, compas_violence.csv data file
- Lecture 25 (Monday, 25 November):
Trees and ensembles II
- Bagging and random forests
- Suggested readings:
- Berk, chapters 4 and 5
- Hastie et al., sections 8.7 (bagging) and chapter 15 (random forests)
- Homework 13
- Due on Tuesday, 26 November at 10 pm --- NOTE DATE
- Lecture 26 (Monday, 2 December): Waste, fraud and abuse I
- Bad data sources, curse of dimensionality, failure to replicate, and other causes of technical failures
- Key reading: PDM, chapter 2, section 6.5, section 1.7
- Suggested readings:
- Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired 16:17 (June 2008)
- Fernando Pereira, comments on Anderson, Earning My Turns, 23 June 2008
- Luke Oakden-Rayner, "Exploring large scale public medical image datasets", arxiv:1907.12720
- Christian Grothoff and J.M. Porup, "The NSA’s SKYNET program may be killing thousands of innocent people", Ars Technica 16 February 2016
- Lecture 27 (Wednesday, 4 December): Waste, fraud and abuse II
- Key reading: O'Neil, TBD
- Suggested readings:
- Maciej Ceglowski, "What Happens Next Will Amaze You"
- Zeynep Tufekci, "We're building a dystopia just to make people click on ads", TED
- Homework 14
- Due on Wednesday, 4 December at 10 pm