Cosma Shalizi
Fall 2008
MW 10:30--11:20 Porter Hall 226B
F 10:30--11:20 Doherty Hall 1217
Data mining is the art of extracting useful patterns from large bodies of
data: finding seams of actionable knowledge in the raw ore of information. The
rapid growth of computerized data, and the computer power available to analyze
it, creates great opportunities for data mining in business, medicine, science,
government, etc. The aim of this course is to help you take advantage of these
opportunities in a responsible way. After taking the class, when you're faced
with a new problem, you should be able to (1) select appropriate methods, and
justify their choice, (2) use and program statistical software to implement
them, and (3) critically evaluate the results and communicate them to
colleagues in business, science, etc.
Data mining is related to statistics and to machine learning, but has its own
aims and scope. Statistics is a mathematical science, studying how reliable
inferences can be drawn from imperfect data. Machine learning is a branch of
engineering, developing a technology of automated induction. We will freely
use tools from statistics and from machine learning, but we will use them as
tools, not things to study in their own right. We will do a lot of
calculations, but will not prove many theorems, and we will do even more
experiments than calculations.
Outline, Notes, Readings
This is a rough outline of the material. The first five items should take us
to the middle of the semester; items (6) prediction and (7) classification will
definitely be covered, and the others will depend on time and class interests.
- Searching by similarity: Searching by content (texts, images,
genes, ...); attributes, representations and definitions of similarity and
distance; choice of representation; multi-dimensional scaling;
classifications; image search and invariants; user feedback; evaluating
searches
- Information: information and uncertainty; classes and attributes;
interactions among attributes; relative distributions
- Clustering: supervised and unsupervised learning; categorization;
unsupervised category-learning, a.k.a. clustering; k-means
clustering; hierarchical clustering; geometry of clusters; what makes a good
cluster?
- Data-reduction and feature-enhancement: Standardizing data; using
principal components to eliminate attributes; using factor analysis to
eliminate attributes; limits and pitfalls of PCA and factor analysis;
nonlinear dimensionality reduction
- Regression: Review of linear regression; transformations to
linearity; the truth about linear regression; local linear regression;
polynomial regression; kernel regression; other non-parametric methods
- Prediction: Evaluating predictive models; over-fitting and capacity
control; regression trees; classification trees; neural networks; combining
predictive models; forests; how to gamble if you must
- Classification: Supervised categorization; linear classifiers;
logistic regression; the kernel trick; base rates and multiple testing;
spam, fraud, credit cards, profiling
- Change over time: trends and time series models; Markov models;
hidden Markov models; adapting to change and failure to adapt
- Modeling interventions: Estimating causal impacts without
experiments; matching; graphical causal models and Tetrad.
- Waste and Abuse: what happens when the data are bad; what happens
with the wrong kinds of data; situations when data mining will fail; trying
not to be evil; some failures
Lecture Notes
- Introduction to the course (25 August)
- Information retrieval and similarity searching (25 August)
- Multidimensional scaling and a first glance at classification (27 August; see the R sketch below)
- A little about page-rank (29 August)
Homework #1, due 8 September: assignment, R, newsgroups.tgz data file; solutions
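To make the distance and scaling ideas from these first lectures concrete, here is a minimal R sketch using only base functions; the feature matrix is a random placeholder, not one of the course data sets.

    ## Similarity search and classical MDS on a placeholder feature matrix.
    ## Each row of X stands in for one document or image, each column for a feature.
    X <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

    D <- dist(X)                              # Euclidean distances between all pairs of rows
    nearest <- order(as.matrix(D)[1, ])[2]    # the row most similar to row 1 (skipping itself)

    mds <- cmdscale(D, k = 2)                 # classical multidimensional scaling to 2 dimensions
    plot(mds, xlab = "Coordinate 1", ylab = "Coordinate 2",
         main = "MDS map of the feature vectors")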
- Image search, abstraction and
invariance; the
accompanying slides
(8 September)
- Finding informative features
(10 September)
- Information and interaction among
features (12 September)
Homework #2, due 22 September (originally 19 September): assignment, solutions, solutions code
Note: Information theory,
axiomatic foundations, connections to statistics — elaboration
on some points raised in lecture (12 September)
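To make the entropy and mutual-information calculations from the 10 and 12 September lectures concrete, here is a minimal R sketch; the two-way table of counts is made up purely for illustration.

    ## Entropy and mutual information from a (made-up) feature-by-class table of counts.
    counts <- matrix(c(30, 10, 5, 55), nrow = 2,
                     dimnames = list(feature = c("absent", "present"),
                                     class   = c("ham", "spam")))

    p.xy <- counts / sum(counts)     # joint distribution of feature and class
    p.x  <- rowSums(p.xy)            # marginal distribution of the feature
    p.y  <- colSums(p.xy)            # marginal distribution of the class

    entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }   # in bits

    H.class <- entropy(p.y)                                # uncertainty about the class
    mi <- entropy(p.x) + entropy(p.y) - entropy(p.xy)      # mutual information I(feature; class)
    cat("H(class) =", H.class, "bits; I(feature; class) =", mi, "bits\n")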
- Categorization: types of categorization, basic classifiers and finding simple clusters in data (15 September)
- Clustering: hierarchical clustering;
how many clusters? (17 September; an R sketch of k-means and hierarchical clustering follows this group of lectures)
- Yet more clustering
(19 September; slides)
- Making better features:
transformations; projections and principal components analysis (22
September)
- Mathematics of principal components
analysis; interpretations and limitations of PCA (24 September)
- Yet more on linear dimensionality
reduction: PCA + information retrieval = Latent semantic indexing. Factor
analysis: motivations, historical roots, preliminaries to estimation
(26 September)
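Looking back at the clustering lectures (15 through 19 September), here is a minimal R sketch of k-means and hierarchical clustering using base functions; the feature matrix is simulated, not a course data set.

    ## k-means and hierarchical clustering on a placeholder feature matrix.
    set.seed(1)
    X <- scale(matrix(rnorm(100 * 4), nrow = 100, ncol = 4))   # standardize the features first

    km <- kmeans(X, centers = 3, nstart = 20)   # k-means with 3 clusters and 20 random restarts
    table(km$cluster)                           # how many points land in each cluster
    km$tot.withinss                             # total within-cluster sum of squares

    hc <- hclust(dist(X), method = "complete")  # agglomerative (complete-linkage) clustering
    plot(hc, labels = FALSE)                    # dendrogram; cut it to get a flat clustering
    cutree(hc, k = 3)                           # cluster assignments if we cut at 3 clusters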
Homework #3, due 3 October: assignment
- More on Factor Analysis:
estimation and the rotation problem (29 September)
- Principal Components versus Factor
Analysis: worked examples, basic goodness-of-fit testing for factor
analysis; R code for lecture
(1 October)
- The truth about principal components
and factor analysis: strengths, limitations, factor models as graphical
models, factor models and mixture models, Thomson's sampling model; R code for Thomson's model (3 October)
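To accompany the principal-components and factor-analysis lectures, a minimal sketch using base R's prcomp and factanal; the data are simulated with a known two-factor structure purely for illustration.

    ## PCA and maximum-likelihood factor analysis on simulated two-factor data.
    set.seed(2)
    scores.true <- matrix(rnorm(200 * 2), 200, 2)            # two hidden factors
    lam <- matrix(runif(6 * 2), 6, 2)                        # made-up loadings
    X <- scores.true %*% t(lam) + matrix(rnorm(200 * 6, sd = 0.5), 200, 6)

    pca <- prcomp(X, scale. = TRUE)   # principal components of the standardized variables
    summary(pca)                      # proportion of variance carried by each component
    pc.scores <- pca$x[, 1:2]         # the data projected on the first two components

    fa <- factanal(X, factors = 2, scores = "regression")
    print(fa$loadings)                # estimated loadings (after the default varimax rotation)
    fa$PVAL                           # p-value of the goodness-of-fit test for 2 factors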
Homework #4, due Friday, 10 October: assignment, nci.kmeans, nci.pca2.kmeans; solutions
- Regression: predicting quantitative
features: point prediction; expectations and mean-square optimality;
regression functions; regression as smoothing; linear regression as linear
smoothing; other kinds of linear smoothers; nearest-neighbor regression; kernel
regression. R code for
figures, data for running example.
(6 October)
- The truth about linear regression:
optimal linear prediction; shifting distributions and omitted variables; rights
and obligations of probabilistic assumptions; abuses of linear regression; how
to hurt angels (8 October)
- Extending linear regression:
weighted least-squares, heteroskedasticity, local linear
regression. R code for
figures, data for
running example (10 October)
- Review session: no handout (13 October)
- Midterm: exam
and solutions (15 October)
- Evaluating predictive models:
in-sample and generalization error; over-fitting and under-fitting; model
selection, capacity control,
cross-validation. R for figures. (20
October)
- Using cross-validation: mechanics and examples (22 October)
- Using non-parametric smoothing: adaptive smoothing, testing parametric
forms (24 October)
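A minimal sketch (not from the course handouts) tying the smoothing and cross-validation lectures together: fit a Nadaraya-Watson kernel regression with base R's ksmooth, and pick the bandwidth by five-fold cross-validation. The data and the candidate bandwidths are made up for illustration.

    ## Kernel regression with the bandwidth chosen by 5-fold cross-validation.
    set.seed(3)
    x <- runif(200, 0, 10)
    y <- sin(x) + rnorm(200, sd = 0.3)          # made-up curve plus noise

    bandwidths <- c(0.1, 0.3, 0.5, 1, 2)
    folds <- sample(rep(1:5, length.out = length(x)))   # random assignment to 5 folds

    cv.error <- sapply(bandwidths, function(h) {
      mean(sapply(1:5, function(k) {
        train <- folds != k
        ## Fit on the training folds, predict at the held-out x's (ksmooth sorts them).
        fit <- ksmooth(x[train], y[train], kernel = "normal",
                       bandwidth = h, x.points = x[!train])
        mean((y[!train][order(x[!train])] - fit$y)^2, na.rm = TRUE)
      }))
    })
    best <- bandwidths[which.min(cv.error)]     # bandwidth with the smallest held-out error
    plot(x, y)
    lines(ksmooth(x, y, kernel = "normal", bandwidth = best), col = "blue")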
Homework #5, due Friday, 31 October: assignment; solutions, R for solutions
- Prediction trees 1: mostly
regression trees, plus a "classification tree we can believe in" (27 October)
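A minimal regression-tree sketch, assuming the rpart package (one standard R implementation of CART-style trees; the course's own code may differ). It uses the built-in cars data, not a course data set.

    ## Regression tree for stopping distance as a function of speed (built-in cars data).
    library(rpart)
    data(cars)
    tree <- rpart(dist ~ speed, data = cars,
                  control = rpart.control(minsplit = 10, cp = 0.01))
    printcp(tree)                     # cross-validated error for each candidate tree size
    best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
    pruned <- prune(tree, cp = best.cp)   # prune back to the size cross-validation prefers
    plot(pruned); text(pruned)            # draw the pruned tree
    predict(pruned, newdata = data.frame(speed = 15))   # predicted stopping distance at 15 mph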
Homework #6, due Friday, 7 November: assignment; solutions
- Prediction trees 2: classification trees (29 October and 3 November)
- Bootstrapping, Bagging, and Random Forests (5 November)
- Combining Predictive Models and the Power of Diversity (7 November)
- Linear Classifiers and the Perceptron Algorithm (10 November)
- Logistic Regression and Newton's Method (12 November)
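A minimal sketch of the two classifiers from these lectures: a hand-rolled perceptron loop, and logistic regression fit with R's built-in glm (whose iteratively reweighted least squares is Newton's method for this model). The data are simulated purely for illustration.

    ## Perceptron and logistic regression on simulated two-dimensional data.
    set.seed(4)
    n <- 200
    X <- cbind(1, matrix(rnorm(n * 2), n, 2))       # first column of 1s is the intercept
    y <- ifelse(drop(X %*% c(-0.5, 2, -1)) + rnorm(n, sd = 0.5) > 0, 1, -1)  # labels in {-1, +1}

    ## Perceptron: sweep through the data, nudging w whenever a point is misclassified.
    w <- rep(0, ncol(X))
    for (pass in 1:25) {
      for (i in 1:n) {
        if (y[i] * sum(w * X[i, ]) <= 0) w <- w + y[i] * X[i, ]
      }
    }
    mean(sign(drop(X %*% w)) == y)                  # training accuracy of the perceptron

    ## Logistic regression fits a linear classifier by maximum likelihood.
    df <- data.frame(y01 = (y + 1) / 2, x1 = X[, 2], x2 = X[, 3])
    fit <- glm(y01 ~ x1 + x2, data = df, family = binomial)
    coef(fit)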
Homework #7, due Friday, 21 November: assignment; solutions
- Neural Networks: The Mathematical Reality (14 November)
- Neural Networks: The Biological Myth (17 November)
- Support Vector Machines (19 November)
- Support vector machines continued (21 November; same handout as previous)
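A minimal support-vector-machine sketch, assuming the e1071 package (a common R interface to libsvm; not necessarily what the course handouts use). It uses the built-in iris data rather than a course data set.

    ## SVM with a radial (Gaussian) kernel on the built-in iris data.
    library(e1071)
    set.seed(5)
    train <- sample(nrow(iris), 100)             # random 100/50 train/test split
    fit <- svm(Species ~ ., data = iris[train, ],
               kernel = "radial", cost = 1, gamma = 0.5)
    pred <- predict(fit, newdata = iris[-train, ])
    table(predicted = pred, actual = iris$Species[-train])   # confusion matrix on the test set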
Homework #8, due Monday, 1 December: assignment; solutions
- The Lecture Full of Fail: the wrong data, lying data, covariate shift,
low base-rates and overwhelming false positives, response; waste, fraud and
abuse (24 November; notes forthcoming)
Homework #9, due 15 December: assignment; solutions
Readings
- David P. Feldman, "Introduction to Information Theory", ch. 1
- To accompany lecture 5
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing
Attribute Interactions", arxiv:cs.AI/0308002
- To accompany lecture 6
- Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman, "Indexing by Latent Semantic Analysis" [PDF]
- To accompany lecture 12 (optional)
- Thomas K. Landauer and Susan T. Dumais, "A Solution to Plato's Problem: The Latent Semantic
Analysis Theory of Acquisition, Induction, and Representation of Knowledge"
[PDF]
- To accompany lecture 12 (optional)
- L. L. Thurstone, "The Vectors of Mind"
- To accompany lecture 12 (optional)
- David Hand, "Classifier Technology and the Illusion of Progress",
arxiv:math.ST/0606441
- To accompany lecture 34
- Bernard E. Harcourt, "Against Prediction: Sentencing, Policing, and Punishing in an Actuarial Age", ssrn/756945
- To accompany lecture 34
Course Mechanics
Details on grading, exams, etc., can be found in
the full syllabus.
Blackboard
Homework, solutions, grades, class announcements, etc., will be handled through
Blackboard; let me know if you cannot access the course site.
Textbook
Our textbook is Principles of Data Mining
by Hand, Mannila and Smyth. It
should be at the campus bookstore by 26 August, but you can also buy it online
(I
like Powell's),
or directly from MIT
Press.
R
R is a free, open-source software
package/programming language for statistical computing. (It is a dialect of a
language called S, whose commercial version is S-plus.) You can expect at
least one assignment every week which uses R. If you do not have ready access
to a computer which can run it, contact me at once.
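For a quick taste of what working in R looks like, here is a tiny made-up session (nothing in it is specific to the course):

    ## A two-minute R session: make some data, fit a line, plot it.
    x <- c(1.2, 2.5, 3.1, 4.8, 5.0)    # vectors are the basic data type
    y <- 2 * x + rnorm(5)              # simulate a noisy linear relationship
    summary(lm(y ~ x))                 # fit and summarize a linear regression
    plot(x, y); abline(lm(y ~ x))      # scatterplot with the fitted line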
Here are some resources for learning R:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR", in PDF
- Quick-R. This is primarily aimed
at those who already know a commercial statistics package like SAS, SPSS or
Stata, but it's very clear and well-organized, and others may find it useful as well.
- There are now many books about R. Two I can recommend, having read them,
are Braun and Murdoch's A First Course in Statistical Programming with
R (official
site, Powell's)
and Venables and Ripley's Modern Applied Statistics with S
(official
site, Powell's).
Braun and Murdoch is more suitable for beginners; Venables and Ripley has a lot
of useful but more advanced material.