36-462 Data Mining
Data mining is the science of discovering structure and making predictions in data sets (typically large ones). Data mining spans the fields of statistics and computer science. Since this is a course in statistics, we will adopt a statistical perspective for the majority of the course. Data mining also involves a good deal of both applied work (programming, problem solving, data analysis) and theoretical work (learning, understanding, and evaluating methodologies). We will try to maintain a balance between the two.
Upon completing this course, you should be able to tackle new data mining problems by:
- selecting the appropriate methods and justifying your choices;
- implementing these methods programmatically (using, say, the R programming language) and evaluating your results;
- explaining your results to a researcher outside of statistics or computer science.
Syllabus
The syllabus provides information on grading, class policies, etc.
Lecture Notes and Annotated Slides
- Lecture 1: Overview
- Lecture 2: Introduction to Supervised Learning
- Lecture 3: Logistic Regression
- Lecture 4: Regularization
- Lecture 5: More on Regularization
- Lecture 6: Generative Models, LDA and QDA
- Lecture 7: Naive Bayes, SVMs
- Lecture 8: SVMs and Kernels
- Lecture 9: Decision Trees
- Lecture 10: Boosting
- Lecture 11: Bagging and Random Forests
- Lecture 12: More on Random Forests
- Lecture 13: Linear Algebra Review
- Lecture 14: Midterm Review
- Lecture 15: PCA
- Lecture 16: k-means
- Lecture 17: Hierarchical Clustering
- Lecture 18: Mixture Models
- Lecture 19: Clustering Graphs
- Lecture 20: Spectral Clustering
- Lecture 21: Dimension Reduction
- Lecture 22: Hyperparameter Selection
- Lecture 23: Neural Networks 1
- Lecture 24: Neural Networks 2
- Lecture 25: Review