---
title: 36-462/36-662, Data Mining
author: Cosma Shalizi
date: Lecture 1, 26 August 2019 --- Welcome to the course
output: slidy_presentation
---
## Welcome!
![](https://live.staticflickr.com/311/32174260996_3d3f68747b_z_d.jpg)
## Agenda for today
> - Course mechanics
> + All of the details are in the syllabus
> - General orientation to the course
## What is data mining?
- Finding useful patterns in large collections of data
+ a.k.a. "knowledge discovery in databases"
## What are we going to learn about?
So many things!
## Information retrieval
![](https://live.staticflickr.com/3281/5826310009_1834b058ab_z.jpg)
> - Everything important about data mining in a small package
> + Finding similar items by content
> + Using features to define "similar content"
> + Making predictions about patterns
> + Evaluating predictions about patterns
## Dimension reduction
![](https://live.staticflickr.com/3560/3487720211_1df38f25e8.jpg)
> - Data in more than 3 dimensions is very hard for us to grasp
> + _Maybe_ 5 if you use color and animation well
> - Somehow reduce the huge number of features to something more manageable _but_ still intelligible
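As a taste of what this looks like in R, here is a minimal sketch using the built-in `prcomp` function on the built-in `mtcars` data (an illustration of the idea only, not the course's own code):

```r
# Sketch: compress 11 numeric features down to 2 principal components
pc <- prcomp(mtcars, scale. = TRUE)   # standardize, then rotate to new axes
summary(pc)$importance[, 1:2]         # share of variance captured by PC1, PC2
head(pc$x[, 1:2])                     # each car as a point in just 2 dimensions
```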
## Information measures
- How much information does one feature give us about another?
- How much information does the model give us about predictions?
- How much information does the data give us about the models?
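Questions like these are made precise with quantities such as mutual information. A toy sketch in base R, using two discrete features from the built-in `mtcars` data (an illustration, not the course's definitions):

```r
# Sketch: how much information (in bits) does one discrete feature give about another?
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
joint <- prop.table(table(mtcars$cyl, mtcars$am))  # joint distribution of two features
mi <- entropy(rowSums(joint)) + entropy(colSums(joint)) - entropy(c(joint))
mi  # mutual information between number of cylinders and transmission type
```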
## Nearest neighbors
- Predicting "the new case will do what similar cases did" is surprisingly powerful
- Need good ways to define "similar" and to find similar cases in big data sets
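A toy sketch of the nearest-neighbor idea in base R, using Euclidean distance on the built-in `mtcars` data (the function name and feature choice are made up for illustration):

```r
# Sketch: 1-nearest-neighbor prediction -- copy the outcome of the closest case
nn_predict <- function(train_x, train_y, new_x) {
  diffs <- train_x - matrix(new_x, nrow(train_x), length(new_x), byrow = TRUE)
  dists <- sqrt(rowSums(diffs^2))     # Euclidean distance to every training case
  train_y[which.min(dists)]           # outcome of the single most similar case
}
# Predict a car's mpg from the most similar other car by weight and horsepower
x <- as.matrix(mtcars[, c("wt", "hp")])
nn_predict(x[-1, ], mtcars$mpg[-1], x[1, ])
```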
## Clustering and classifiers
## Prediction and decision trees
![](classification-tree-we-can-believe-in.jpg)
- Simple, binary-choice models for prediction
- Plus ways of combining many trees to get "forests"
## Recommendation engines
- "You may also like"
- Clustering, dimension reduction, classification, nearest neighbors, ...
## Waste, fraud and abuse
![](https://web.archive.org/web/20080901072600if_/http://failblog.files.wordpress.com/2008/01/camerafail.jpg)
> - Sometimes data mining just won't work
> + Bad data
> + Overwhelming data and the curse of dimensionality
## Waste, fraud and abuse
![](https://web.archive.org/web/20080901072600if_/http://failblog.files.wordpress.com/2008/01/camerafail.jpg)
> - Sometimes data mining is just wrong
## Where did this come from?
> - Statistics
> + Exploratory data analysis
> + Regression, especially regression by matching
> + Principal components and factor analysis
> + Classifiers
> + Clustering
> - Computer science
> + Databases
> + Pattern recognition (= classifiers)
> + Dimension reduction (= PCA and factor analysis)
> + Clustering (= clustering)
> + Nearest neighbors (= matching)
> + Machine learning (= predictive statistical modeling)
## Where did this _really_ come from?
```{r, fig.retina=NULL, out.width=400, echo=FALSE}
knitr::include_graphics("mound-city-1870.jpg")
```
## Where did this _really_ come from?
![](http://farm3.static.flickr.com/2411/2404562785_5b887699de.jpg?v=0)
## Where did this _really_ come from?
> - Computers made it _really easy_ to collect _and analyze_ the data
> + Even when we don't know what we're looking for
> - Tension: flexibility to find many different patterns vs. vulnerability to noise and coincidence
> + We'll look at a lot of different ways of finding patterns
> + We'll also look at how to avoid fooling ourselves
## What will you need to know?
> - 36-401, modern regression
> - $=$ Linear statistical models **in R**
> - $=$ Actual experience with predictive modeling of data
> - $+$ Mathematical statistics (for notions of inference and error)
> - $+$ Probability (for notions of distributions and risk)
> - $+$ Linear algebra through eigenvalues and eigenvectors (**essential** for multivariate data)
> - $+$ Calculus (**essential** for optimization)
## Course mechanics
- Class meetings
- Readings
- Homework
- Class homepage: <http://www.stat.cmu.edu/~cshalizi/dm/19>
+ Full syllabus with all the details
+ Links to course assignments, due dates, etc.
+ What to read
- Canvas: submitting work, gradebook, some readings
- Piazza: question-answering
## Class meetings
> - Lecture: me explaining and demonstrating stuff, you asking questions
> - In-class exercises: you checking your understanding
> - No electronics
## In-class exercises
- Short ($< 20$ minute) problem-solving exercises
- Pencil-and-paper, not electronics
- Groups of up to 4
- Most if not all class meetings
## Reading
> - Most class meetings will have **required** reading: do it!
> - Many will have _recommended_ reading: try to do it
> - Most will have background reading: if you get interested
## Reading: Textbook
![_Principles of Data Mining_](https://mitpress.mit.edu/sites/default/files/styles/large_book_cover/http/mitp-content-server.mit.edu%3A18180/books/covers/cover/%3Fcollid%3Dbooks_covers_0%26isbn%3D9780262082907%26type%3D.jpg?itok=GrWC_te-)
## Reading: Recommended
![_Statistical Learning from a Regression Perspective_](https://media.springernature.com/w306/springer-static/cover-hires/book/978-3-319-44048-4)
## Reading: Recommended
![_Elements of Statistical Learning_](https://web.stanford.edu/~hastie/ElemStatLearn/CoverII_small.jpg)
## Reading: Recommended
```{r, fig.retina=NULL, out.width=400, echo=FALSE}
knitr::include_graphics("https://mathbabe.files.wordpress.com/2016/02/weaponsmath-r4-6-06.jpg")
```
## Homework
- Implementing methods on actual data
- Working out some of the mathematical details
- Practicing interpreting and communicating the results
- One assignment per week, 14 in all
+ Released by Friday each week (sometimes earlier)
+ Usually due Wednesdays at 10 pm via Canvas
## Homework
- 10% of each homework will be graded on the quality & clarity of your communication
+ There will be a rubric for this on each assignment
- Most (if not all) homeworks will also have an online assignment
+ Easy if you've done the reading
+ Usually due Sundays at 10 pm via Canvas
+ Online assignments will be about 10% of your grade
## Grading
- Homework: 90% of your grade
+ **Lowest 2** homework+online grades dropped automatically
+ No late homework
+ If you do all 14 homeworks with a minimum grade of $\geq 60$%, lowest **3** grades get dropped
- In-class exercises: 10% of your grade
+ **Lowest 2** dropped automatically
+ If you do all exercises with minimum of $\geq 50$%, lowest **4** dropped
- No exams
- Grade boundaries: 90 for an A, 80 for a B, etc.
## Time expectations
> - This is a 9 **credit-hour** class
> - You spend 3 hours in lecture each week
> - $\Rightarrow$ **6 hours** working on the class outside of lecture each week
> + averaged over the semester
> + Talk to me if it's taking much longer than that
## Cheating, collaboration & plagiarism
> - Don't
> - You can talk to each other, you can read what you like, but everything you turn in **must** be your own work
> + Exception: Don't read old solutions, or share this year's
> + Exception: Working together is OK for in-class exercises
> - Full policy in the syllabus
> - You will need to complete HW 0, on the class cheating policy, before your other work is graded
## Homework format
- We will use [R Markdown](http://rmarkdown.rstudio.com/) to integrate your code directly into your writing
+ Write a source file that's mostly ordinary text, plus the R code you want
+ "Knit" to an HTML or PDF with the text plus the output of the code (figures, tables, numbers)
- Ensures **computational reproducibility**: your results really came from the code that you say/think they came from
- You will turn in both your raw R Markdown file _and_ the knitted PDF or HTML
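For concreteness, here is a minimal sketch of what an R Markdown source file might look like (the title, text, and chunk are made up for illustration):

````
---
title: "Homework 1"
author: "Your Name"
output: pdf_document
---

The average fuel efficiency in the built-in `mtcars` data is
`r mean(mtcars$mpg)` miles per gallon.

```{r scatterplot, echo=TRUE}
# Code chunks run when the file is knitted; their output appears in the PDF
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
```
````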
## Switch to R Studio
Specifically `welcome.Rmd`
## Some lessons from the demo
- Using R Markdown is almost no extra work over just using R and copying-and-pasting your results in
+ Eventually it's _less_ work than copying-and-pasting _and_ it's more reliable
- Regression trees are a characteristic data-mining model
+ Very flexible (can fit _anything_ with a big enough tree)
+ Good at things linear models are bad at (thresholds, interactions)
+ Almost never realistic (... but when is the linear model right?)
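A regression tree like the one in the demo can be fit in a few lines; here is a sketch using the `rpart` package (the formula and data are illustrative, not necessarily what the demo used):

```r
# Sketch: fit a regression tree of binary splits (assumes rpart is installed)
library(rpart)
# Predict miles per gallon from weight and horsepower in the built-in mtcars data
tree <- rpart(mpg ~ wt + hp, data = mtcars)
print(tree)                             # the fitted sequence of binary splits
predict(tree, newdata = mtcars[1:3, ])  # predictions for the first three cars
```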
## Next time: Information retrieval
- Do the reading!