Cosma Shalizi
Fall 2009
Important update, December 2011
If you are looking for the latest version of this class, it is 36-462, taught
by Prof. Tibshirani in the spring of 2012. 36-350 is now the course number
for Introduction to Statistical Computing.
Data mining is the art of extracting useful patterns from large bodies of
data; finding seams of actionable knowledge in the raw ore of information. The
rapid growth of computerized data, and the computer power available to analyze
it, creates great opportunities for data mining in business, medicine, science,
government, etc. The aim of this course is to help you take advantage of these
opportunities in a responsible way. After taking the class, when you're faced
with a new problem, you should be able to (1) select appropriate methods, and
justify their choice, (2) use and program statistical software to implement
them, and (3) critically evaluate the results and communicate them to
colleagues in business, science, etc.
Data mining is related to statistics and to machine learning, but has its own
aims and scope. Statistics is a mathematical science, studying how reliable
inferences can be drawn from imperfect data. Machine learning is a branch of
engineering, developing a technology of automated induction. We will freely
use tools from statistics and from machine learning, but we will use them as
tools, not things to study in their own right. We will do a lot of
calculations, but will not prove many theorems, and we will do even more
experiments than calculations.
Outline, Notes, Readings
This is a rough outline of the material; details may change depending on
time and class interests.
- Searching by similarity: Searching by content (texts, images,
genes, ...); attributes, representations and definitions of similarity and
distance; choice of representation; multi-dimensional scaling;
classifications; image search and invariants; user feedback; evaluating
searches
- Information: information and uncertainty; classes and attributes;
interactions among attributes
- Clustering: supervised and unsupervised learning; categorization;
unsupervised category-learning, a.k.a. clustering; k-means
clustering; hierarchical clustering; geometry of clusters; what makes a good
cluster?
- Data-reduction and feature-enhancement: Standardizing data; using
principal components to eliminate attributes; using factor analysis to
eliminate attributes; limits and pitfalls of PCA and factor analysis;
nonlinear dimensionality reduction: local linear embedding, diffusion maps
- Regression Review of linear regression; transformations to
linearity; the truth about linear regression; local linear regression;
polynomial regression; kernel regression; additive models; other
non-parametric methods
- Prediction: Evaluating predictive models; over-fitting and
capacity control; regression trees; classification trees; combining
predictive models; forests; how to gamble if you must
- Classification: Supervised categorization; linear classifiers;
logistic regression; the kernel trick; base rates, Neyman-Pearson
classifiers, ROC curves
- Distributions: Histograms and the fundamental theorem of
statistics; kernel density estimation; conditional density estimation;
relative distributions; mixture models, probabilistic clustering, the EM
algorithm; clustering with confidence; large numbers of rare events
- Modeling interventions: Estimating causal impacts without
experiments; matching; graphical causal models and Tetrad.
- Waste and Abuse: when data mining will fail: bad data, wrong
data, insufficient data, overwhelming false positives, impossible problems,
attacking the wrong problem; when data mining is evil; some failures
Lecture Notes and Supplementary Readings
Note: Solutions are temporarily off-line, since some of the exercises are being re-used in this semester's class. Feel free to write me directly, however.
- Introduction to the course (24 August)
What is data mining? how is it used? where did it come from? Some themes.
- Information retrieval and searching
by similarity (26 August) Finding data by content. Approaches we will
neglect: metadata, coding. Textual features. The bag-of-words representation.
Vector representations of documents. Measures of similarity and the importance
of normalization. Our first stab at classification. Example with
the New
York Times Annotated Corpus.
- IR continued (28 August). The
trick to searching: queries are documents. Search evaluation: precision,
recall, precision-recall curves; error rates. Classification: nearest
neighbors and prototypes; classifier evaluation by mis-classification rate and
by confusion matrices. Inverse document frequency weighting. Visualizing
high-dimensional data by multi-dimensional scaling. Miscellaneous topics:
stemming, incorporating user feedback.
Homework 1, due 4 September: assignment,
R, data
- Page Rank (31 August). Links as
pre-existing feedback. How to exploit link information? The random walk on
the graph; using the ergodic theorem. Eigenvector formulation of page-rank.
Combining page-rank with textual features. Other applications. Further
reading on information retrieval.
- Image Search, Abstraction and
Invariance (2 September). Similarity search for images. Back to
representation design. The advantages of abstraction: simplification,
recycling. The bag-of-colors representation. Examples. Invariants.
Searching for images by searching text. An example in practice.
Slides for this lecture.
- Information Theory I (4
September). Good features help us guess what we can't represent. Good
features discriminate between different values of unobserved variables.
Quantifying uncertainty with entropy. Quantifying reduction in uncertainty/
discrimination with mutual information. Ranking features based on mutual
information. Examples, with code, of informative words for
the Times. Code.
Supplementary reading: David
P. Feldman, Brief Tutorial on
Information Theory, chapter 1
Homework 2, due 11 September: assignment
- Information Theory II (9
September). Dealing with multiple features. Joint entropy, the chain rule for
entropy. Information in multiple features. Conditional information, chain
rule for information, conditional independence. Interactions, positive and
negative, and redundancy. Greedy feature selection with low redundancy.
Example, with code, of selecting words for the Times. Sufficient
statistics and the information
bottleneck. Code.
Supplementary reading; Aleks Jakulin and Ivan Bratko, "Quantifying and
Visualizing Attribute
Interactions", arxiv:cs.AI/0308002
- Categorization; Clustering I (11
September). Dividing the world up into categories. Classification: known
categories with labeled examples. Taxonomy of learning problems (supervised,
unsupervised, semi-supervised, feedback, ...). Clustering: discovering unknown
categories from unlabeled data. Benefits of clustering, with an digression on
where official classes come from. Basic criterion for good clusters: lots of
information about features from little information about cluster. Practical
considerations: compactness, separation, parsimony, balance. Doubts about
parsimony and balance. The k-means clustering algorithm, or unlabeled
prototype classification: analysis, geometry, search. Appendix: geometric
aspects of the prototype and nearest-neighbor method.
Homework 3, due 18 September: assignment
- Clustering II (14 September).
Distances between partitions; variation-of-information distance.
Hierarchical clustering by agglomeration and its varieties. Picking the
number of clusters by merging costs. Performance of different clustering
methods on various doodles. Why we would like to pick the number of clusters
by predictive performance, and why it is hard to do at this stage. Reifying clusters.
- Transformations: Rescaling and
Low-Dimensional Summaries (16 September). Improving on our original
features. Re-scaling, standardization, taking logs, etc., of individual
features. Forcing things to be Gaussian considered harmful. Low-dimensional
summaries by combining features. Exploiting geometry to eliminate redundancy.
Projections on to linear subspaces. Searching for structure-preserving
projections.
- Principal Components I (18
September). Principal components are the directions of maximum variance.
Derivation of principal components as the best approximation to the data in a
linear subspace. Equivalence to variance maximization. Avoiding explicit
optimization by finding eigenvalues and eigenvectors of the covariance matrix.
Example of principal components with cars; how to tell a sports car from a
minivan. The standard recipe for doing PCA. Cautions in interpreting
PCA. Data-set used in the notes.
Homework 4, due 25 September: assignment
- Principal Components II (21
September). PCA + information retrieval = latent semantic indexing; why LSI is
a Good Idea. PCA and multidimensional scaling.
- Factor Analysis (23 and 25
September). From PCA to factor analysis by adding noise. Roots of factor
analysis in causal discovery: Spearman's general factor model and the tetrad
equations. Problems with estimating factor models: more unknowns than
equations. Solution 1, "principal factors", a.k.a. estimation through heroic
feats of linear algebra. Solution 2, maximum likelihood, a.k.a. estimation
through imposing distributional assumptions. The rotation problem: the factor
model is unidentifiable; the number of factors may be meaningful, but
the individual factors are not.
- The Truth about PCA and Factor
Analysis (28 September) PCA is data reduction without any probabilistic
assumptions about where the data came from. Picking number of components.
Faking predictions from PCA. Factor analysis makes stronger, probabilistic
assumptions, and delivers stronger, predictive conclusions --- which could be
wrong. Using probabilistic assumptions and/or predictions to pick how many
factors. Factor analysis as a first, toy instances of a graphical causal
model. The rotation problem once more with feeling. Factor models and mixture
models. Factor models and Thomson's sampling model: an outstanding fit to a
model with a few factors is actually evidence of a huge number
of badly measured latent variables. Final advice: it all depends, but
if you can only do one, try PCA.
R code for the Thomson sampling
model.
- Nonlinear Dimensionality Reduction I:
Locally Linear Embedding (5 and 7 October). Failure of PCA and all other
linear methods for nonlinear structures in data; spirals, for example.
Approximate success of linear methods on small parts of nonlinear structures.
Manifolds: smoothly curved surfaces embedded in higher-dimensional Euclidean
spaces. Every manifold looks like a linear subspace on a sufficiently small
scale, so we should be able to patch together many small local linear
approximations into a global manifold. Local linear embedding: approximate
each vector in the data as a weighted linear combination of its k
nearest neighbors, then find the low-dimensional vectors best reconstructed by
these weights. Turning the optimization problems into linear algebra. Coding
up LLE. A spiral rainbow. R.
- Nonlinear Dimensionality Reduction II:
Diffusion Maps (9 October). Making a graph from the data; random walks on
this graph. The diffusion operator, a.k.a. Laplacian. How the Laplacian
encodes the shape of the data. Eigenvectors of the Laplacian as coordinates.
Connection to page-rank. Advantages when data are not actually on a manifold.
Example.
Pre-midterm review (12 October): highlights of the course to date; no handout.
MIDTERM (14 October): exam
Homework 5, due 23 October: assignment
- Regression I: Basics (19
October). Guessing a real-valued random variable; why expectation values are
mean-square optimal point forecasts. The regression function; why its
estimation must involve assumptions beyond the data. The bias-variance
decomposition and the bias-variance trade-off. First example of improving
prediction by introducing variance. Ordinary least squares linear regression
as smoothing. Other linear smoothers: k-nearest-neighbors and kernel
regression. How much should we
smooth? R, data
for running example
- Regression II: The Truth About Linear
Regression (21 October). Linear regression is optimal linear (mean-square)
prediction; we do this because we hope a linear approximation will work well
enough over a small range. What linear regression does: decorrelate the input
features, then correlate them separately with the response and add up. The
extreme weakness of the probabilistic assumptions needed for this to make
sense. Difficulties of linear regression; collinearity, errors in variables,
shifting distributions of inputs, omitted variables. The usual extra
probabilistic assumptions and their implications. Why you should always
looking at residuals. Why you generally shouldn't use regression for causal
inference. How to torment angels. Likelihood-ratio tests for restrictions of
nice models.
- Regression III: Extending Linear
Regression (23 October). Weighted least squares. Heteroskedasticity:
variance is not the same everywhere. Going to consult the oracle. Weighted
least squares as a solution to heteroskedasticity. Nonparametric estimation of
the variance function. Local polynomial regression: local constants = kernel
regression, local linear regression, higher-order local polynomials. Lowess =
locally-linear smoothing for scatter plots. The oracles fall silent.
Homework 6, due Friday, 30 October: assignment, data set
- Evaluating Predictive Models (26
and 28 October). In-sample, out-of-sample and generalization loss or error;
risk as expected loss on new data. Under-fitting, over-fitting, and examples
with polynomials. Methods of model selection and controlling over-fitting:
empirical risk minimization, penalization, constraints/sieves, formal learning
theory, cross-validation. Limits of
generalization. R for creating figures.
- Smoothing
Methods in Regression (30 October). How much smoothing should we do?
Approximation by local averaging. How much smoothing we should do to
find the unknown curve depends on how smooth the curve really is,
which is unknown. Adaptation as a partial substitute for actual knowledge.
Cross-validation for adapting to unknown smoothness. Application: testing
parametric regression models by comparing them to nonparametric fits. The
bootstrap principle. Why ever bother with parametric
regressions? R
code for some of the examples.
Homework 7, due Friday, 6
November: assignment
- Additive
Models (2 November). A nice feature of linear models: partial responses,
partial residuals, and backfitting estimations. Additive models: regression
curve is a sum of partial response functions; partial residuals and the
backfitting trick generalize. Parametric and non-parametric rates of
convergence. The curse of dimensionality for unstructured nonparametric
models. Additive models as a compromise, introducing bias to reduce variance.
Example with the data from homework 6.
- Classification and Regression
Trees (4 and 6 November). Prediction trees. A classification tree we can
believe in. Prediction trees combine simple local models with recursive
partitioning; adaptive nearest neighbors. Regression trees: example; a little
math; pruning by cross-validation; more R mechanics. Classification trees:
basics; measuring error by mis-classification; weighted errors; likelihood;
Neyman-Pearson classifiers. Uncertainty for trees.
Homework 8, due 5 pm on Monday, 16 November: assignment
- Combining Models 1: Bagging and Model Averaging (9 November)
- Combining Models 2: Diversity and Boosting (11 November)
- Linear Classifiers (16 November).
Geometry of linear classifiers. The perceptron algorithm for learning linear
classifiers. The idea of "margin".
- Logistic Regression (18
November). Attaching probabilities to linear classifiers: why would we want
to? why would we use the logistic transform to do so? More-than-binary
logistic regression. Maximizing the likelihood; Newton's method for
optimization. Generalized linear models and generalized additive models;
testing GLM specifications with GAMs.
- Support Vector Machines (20
November). Turning nonlinear problems into linear ones by expanding into
high-dimensional feature spaces. The dual representation of linear
classifiers: weight training points, not features. Observation: in the dual
representation, only inner products of vectors matter. The kernel trick:
kernel functions let us compute inner products in feature spaces without
computing the features. Some bounds on the generalization error of linear
classifiers based on "margin" and the number of training points with non-zero
weight ("support vectors"). Learning support vector machines by trading
in-sample performance against bounds on over-fitting.
Homework 9, due at 5 pm on Monday, 30 November: assignment
- Density Estimation (23 November).
Histograms as distribution estimates. Glivenko-Cantelli, "the fundamental
theorem of statistics". Histograms as density estimates; selecting density
estimates by cross-validation. Kernel density estimates. Why kernels are
better than histograms. Curse of dimensionality again. Hint at alternatives
to kernel density estimates.
- Mixture Models, Latent Variables and
the EM Algorithm (30 November). Compressing and restricting density
estimates. Mixtures of limited numbers of distributions. Mixture models as
probabilistic clustering; finally an answer to "how many clusters?" The EM
algorithm as an iterative way of maximizing likelihood with latent variables.
Analogy to k-means. More theory of the EM algorithm. Applications: density
mixtures, signal processing/state estimation, mixtures of regressions, mixtures
of experts; topic models and probabilistic latent semantic analysis. A glance
at non-parametric mixture models.
- Graphical Causal Models (2 December). Distinction between causation and
association, and between causal and probabilistic prediction. Some examples.
Directed acyclic graphs and causal models. The Markov property. Conditional
independence via separation. Faithfulness.
- Causal Inference (4 December).
Estimating causal effects; control for confounding. Discovering causal
structure: the SGS algorithm and its variants. Limitations.
Take-home final exam, due 15
December: assignment; data
sets: expressdb_cleaned
(20
Mb), HuIyer_TFKO_expression
(20 Mb). With great thanks to Dr. Timothy Danford.
Previous versions of the course
With old notes, can be found here.
Course Mechanics
Details on grading, exams, etc., can be found in
the full syllabus.
Blackboard
Homework, solutions, grades, class announcements, etc., will be handled through
Blackboard; let me know if you cannot access the course site.
Textbook
Our required textbook is Principles of Data
Mining
by Hand, Mannila
and Smyth. It should be at the
campus bookstore already, but you can also buy it online (I
like Powell's),
or directly from MIT
Press.
Berk's Statistical Learning from a Regression Perspective (Powell's; publisher) is
an optional book, which covers some topics (mostly from the
second half of the course) in greater detail.
R
R is a free, open-source software
package/programming language for statistical computing. (It is a dialect of a
language called S, whose commercial version is S-plus.) You can expect at
least one assignment every week which uses R. If you do not have ready access
to a computer which can run it, contact me at once.
Here are some resources for learning R:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
- There are now many books about R. Three I can recommend, having
read them, are:
- Braun and Murdoch's A First Course in Statistical
Programming with R
(official
site, Powell's), suitable for absolute beginners
- Venables and Ripley's Modern Applied Statistics with
S (official
site, Powell's), useful but more advanced material
- John M. Chambers, Software for Data Analysis:
Programming with R
(official
site, Powell's)
is the best book on writing programs in R.
The campus bookstore should have copies of Braun and Murdoch, and
of Chambers, but they are optional.
You should also read the page
of Minimal Advice on
Programming.