Sivaraman Balakrishnan | 36-462 Data Mining

Data mining is the science of discovering structure and making predictions in data sets (typically, large ones). Data mining spans the fields of statistics and computer science. Since this is a course in statistics, we will adopt a statistical perspective for the majority of the course. Data mining also involves a good deal of both applied work (programming, problem solving, data analysis) and theoretical work (learning, understanding, and evaluating methodologies). We will try to maintain a balance between the two.

Upon completing this course, you should be able to tackle new data mining problems by:

selecting the appropriate methods and justifying your choices;
implementing these methods programmatically (using, say, the R programming language) and evaluating your results; and
explaining your results to a researcher outside of statistics or computer science.

Syllabus

The syllabus provides information on grading, class policies, etc.

Lecture Notes and Annotated Slides

Lecture 1: Overview
Lecture 2: Introduction to Supervised Learning
Lecture 3: Logistic Regression
Lecture 4: Regularization
Lecture 5: More on Regularization
Lecture 6: Generative Models, LDA and QDA
Lecture 7: Naive Bayes, SVMs
Lecture 8: SVMs and Kernels
Lecture 9: Decision Trees
Lecture 10: Boosting
Lecture 11: Bagging and Random Forests
Lecture 12: More on Random Forests
Lecture 13: Linear Algebra Review
Lecture 14: Midterm Review
Lecture 15: PCA
Lecture 16: k-means
Lecture 17: Hierarchical Clustering
Lecture 18: Mixture Models
Lecture 19: Clustering Graphs
Lecture 20: Spectral Clustering
Lecture 21: Dimension Reduction
Lecture 22: Hyperparameter Selection
Lecture 23: Neural Networks 1
Lecture 24: Neural Networks 2
Lecture 25: Review