Cosma Shalizi

Lecture 1, 26 August 2019 — Welcome to the course

- Course mechanics
- All of the details are in the syllabus

- General orientation to the course

- Finding useful patterns in large collections of data
- a.k.a. “knowledge discovery in databases”

So many things!

- Everything important about data mining in a small package
- Finding similar items by content
- Using features to define “similar content”
- Making predictions about patterns
- Evaluating predictions about patterns
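One concrete way to “use features to define similar content” is to represent each item as a vector of feature counts and compare vectors by the cosine of the angle between them. A minimal sketch (the documents and word counts below are invented for illustration):

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors:
    1.0 means identical direction, 0.0 means nothing in common."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy word-count vectors for three documents over a 3-word vocabulary
doc_a = [3, 0, 1]
doc_b = [2, 0, 1]   # similar word profile to doc_a
doc_c = [0, 5, 0]   # shares no words with doc_a
```

With these vectors, `doc_a` comes out far more similar to `doc_b` than to `doc_c`, matching the intuition that items with overlapping features have “similar content”.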

- More than 3D is very hard for us to grasp
*Maybe* 5, if you use color and animation well

- Somehow reduce the huge number of features to something more manageable
*but* still intelligible
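One standard way to do this reduction is principal components analysis: project the data onto the few directions of largest variance. A sketch on synthetic data (the dimensions, sample size, and noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 points in 5 dimensions that mostly vary along one hidden direction
latent = rng.normal(size=(200, 1))
X = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))

# Center the data, then get principal directions from the SVD
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_explained = s**2 / np.sum(s**2)

# One manageable, still-intelligible feature instead of five
scores = Xc @ Vt[:1].T
```

Here the first component captures almost all of the variance, so replacing five features with one loses little, which is the best case for dimension reduction.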

- How much information does one feature give us about another?
- How much information does the model give us about predictions?
- How much information does the data give us about the models?
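Questions like these are usually made precise with mutual information. A plug-in estimate for two discrete features, on toy data (the example pairs are invented):

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """I(X;Y) in bits, estimated from a list of (x, y) observations,
    using empirical joint and marginal frequencies."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# X determines Y exactly: one binary feature gives 1 bit about the other
copies = [(0, 0), (1, 1)] * 50
# X and Y independent: one feature gives no information about the other
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
```

The two toy samples bracket the possibilities: perfectly dependent features share one full bit, independent features share none.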

- Predicting “the new case will do what similar cases did” is surprisingly powerful
- Need good ways to define “similar” and to find similar cases in big data sets
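The simplest version of this idea is 1-nearest-neighbor prediction: find the stored case closest to the new one and copy its answer. A sketch, taking squared Euclidean distance as the (assumed) notion of “similar” and invented training points:

```python
def nearest_neighbor_predict(train, query):
    """Predict the label of the single closest training point (1-NN):
    'the new case will do what the most similar case did'."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    _, label = min(train, key=lambda pair: dist2(pair[0], query))
    return label

# Toy labeled cases: (feature vector, outcome)
train = [((0.0, 0.0), "low"), ((0.2, 0.1), "low"),
         ((1.0, 1.0), "high"), ((0.9, 1.1), "high")]
```

The linear scan over `train` is fine for a toy example; with big data sets the hard part, as the bullet says, is finding the similar cases quickly.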

- Trees: simple, binary-choice models for prediction
- Plus ways of combining many trees to get “forests”
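The smallest such model is a decision stump, a tree with a single yes/no question about one feature; refitting stumps to bootstrap resamples and taking a majority vote gives a toy “forest”. A sketch, not the course's exact algorithm (the data, number of trees, and scoring rule are arbitrary choices):

```python
import random
from collections import Counter

def fit_stump(data):
    """One binary choice: try each (feature, threshold) pair, keep the one
    whose two sides are most label-pure, predict the majority on each side."""
    best = None
    for j in range(len(data[0][0])):
        for x, _ in data:
            left = [y for xv, y in data if xv[j] <= x[j]]
            right = [y for xv, y in data if xv[j] > x[j]]
            if not left or not right:
                continue
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                best = (score, j, x[j],
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # degenerate resample: fall back to overall majority
        y = Counter(y for _, y in data).most_common(1)[0][0]
        return lambda _: y
    _, j, t, yl, yr = best
    return lambda x: yl if x[j] <= t else yr

def fit_forest(data, n_trees=25, seed=0):
    """Bagging: fit stumps on bootstrap resamples, combine by majority vote."""
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(data) for _ in data])
              for _ in range(n_trees)]
    return lambda x: Counter(s(x) for s in stumps).most_common(1)[0][0]

# Toy one-feature data with two well-separated classes
data = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"),
        ((5.0,), "b"), ((6.0,), "b"), ((7.0,), "b")]
```

Each stump is a weak, unstable predictor; the vote over many resampled stumps is steadier, which is the point of combining trees into forests.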

- “You may also like”
- Clustering, dimension reduction, classification, nearest neighbors, …
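A bare-bones “you may also like”: recommend the item that most often co-occurs with something the user already has. The shopping baskets below are invented for illustration:

```python
from collections import Counter

def also_like(baskets, item):
    """Recommend the item most often appearing in the same basket as `item`."""
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(other for other in basket if other != item)
    return co.most_common(1)[0][0]

baskets = [{"bread", "butter"},
           {"bread", "butter", "jam"},
           {"bread", "milk"},
           {"beer", "chips"}]
```

Real recommender systems dress this up with the tools in the bullet above — clustering users, reducing the item space, nearest-neighbor search — but co-occurrence counting is the core intuition.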

- Sometimes data mining just won’t work
- Bad data
- Overwhelming data and the curse of dimensionality
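One face of the curse of dimensionality: in high dimensions, the nearest and farthest points are nearly the same distance away, so “find the similar cases” loses its grip. A quick numerical sketch (the sample size, dimensions, and reference point are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_spread(d, n=500):
    """Spread (max - min) / min of distances from n uniform random points
    in the unit cube [0, 1]^d to the cube's center.  When this ratio is
    near zero, 'nearest' is barely nearer than 'farthest'."""
    X = rng.random((n, d))
    dists = np.linalg.norm(X - 0.5, axis=1)
    return (dists.max() - dists.min()) / dists.min()
```

In 2 dimensions some points land right next to the center and some far away, so the spread is large; by 1,000 dimensions the distances concentrate and the spread collapses.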

- Sometimes data mining is just wrong

- Statistics
- Exploratory data analysis
- Regression, especially regression by matching
- Principal components and factor analysis
- Classifiers
- Clustering

- Computer science
- Databases
- Pattern recognition (= classifiers)
- Dimension reduction (= PCA and factor analysis)
- Clustering (= clustering)
- Nearest neighbors (= matching)
- Machine learning (= predictive statistical modeling)