Rebecca Nugent
Department of Statistics
Carnegie Mellon University

Undergraduate Summer School
Park City Math Institute
Midway, UT, July 2016

  • OkCupid
  • Mammogram
  • Hipparcos
  • 2010 World Cup
  • CMU Class TVTime Data
  • Banknote Authentication
  • Generate Curved Data
  • Olive Oil
  • Aggregate/7 Groups
OkCupid: 60,000 online dating profiles
(Kim, Escobedo-Land, Journal of Statistics Education, Vol 23, Number 2 (2015)) Paper

Profiles (csv file, 150 MB)

Codebook/Explanation of Variables

R code to get started
Continuous Variables and their Distributions: Code
Relationships and Joint Distributions for Two Continuous Variables: Code
Mammogram: information about 816 patients with mammographic masses (tumors in breast tissue). The goal of the original analysis was to develop classification models that predict whether a mass is malignant without requiring an invasive biopsy.

M. Elter, R. Schulz-Wendtland, and T. Wittenberg (2007). The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process, Medical Physics 34(11), p. 4164-4172.

Data

Hipparcos Stars: information collected by the Hipparcos satellite on 2719 stars; variables include the visual band magnitude (Vmag), the parallactic angle (Plx), and the color of the star. Note that log luminosity (relative to the sun) is calculated as logL = (15 - Vmag - 5 log Plx) / 2.5. Source: Penn State Center for Astrostatistics
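
The log luminosity formula can be checked by hand with a few lines of code. The following is a minimal Python sketch, not the code used in class; it assumes the logarithm is base 10 (the usual convention for magnitudes), and the function name `log_luminosity` and the example star are hypothetical.

```python
import math


def log_luminosity(vmag, plx):
    """Return logL = (15 - Vmag - 5*log(Plx)) / 2.5.

    Assumption: the log in the formula is base 10.
    """
    if plx <= 0:
        raise ValueError("Plx must be positive to take its logarithm")
    return (15 - vmag - 5 * math.log10(plx)) / 2.5


# Illustrative values only, not taken from the Hipparcos data set.
example = log_luminosity(vmag=5.0, plx=100.0)
print(example)
```

With Vmag = 5 and Plx = 100 the numerator is 15 - 5 - 10 = 0, so the star would have the same luminosity as the sun under this formula.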

Freeman, Richards, Schafer, Lee (2008). Astrostatistics: The Final Frontier, Chance, Vol 21, No. 3, p.31-35. Paper

Data       Variable List       Hertzsprung-Russell Diagram Info       Code used in class
2010 World Cup: statistics for all players and all countries for the 2010 World Cup provided by Opta (world-leading live sports data company) and the Guardian Data Blog

Data   Variable List
CMU Class data: demonstration data generated by a group of Carnegie Mellon undergraduates in a regression course. The goal was to predict the average amount of time spent watching TV/movies per day. This is a convenience sample from a very imprecise, unvalidated survey; it should be used only for demos, not for research.

Data   Original Survey   Example Code
Banknote Authentication: famous classifier dataset. Images were taken of 1372 banknotes, some counterfeit and some genuine. Wavelet transformation tools were used to extract the following descriptive features of the images: Variance, Skewness, Kurtosis, Entropy. We also have the true label for whether or not a banknote is genuine (Yes = 1, No = 0).

Data
Generating Curved Data: code to generate randomly scattered data (with appropriate, selected error) around splines created by the user through point-and-click
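
The idea of scattering data around a user-drawn curve can be sketched without the interactive part. The Python example below is an illustration only and is not a port of curvelib.R or generatesplinedata.R: it assumes hard-coded control points in place of point-and-click, a uniform Catmull-Rom spline (the spline type used by the course code is not stated here), and Gaussian error. The names `catmull_rom` and `scatter_around_curve` are hypothetical.

```python
import random


def catmull_rom(p0, p1, p2, p3, t):
    """Point on the uniform Catmull-Rom segment from p1 to p2, for t in [0, 1]."""
    return tuple(
        0.5 * (2 * b
               + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t ** 2
               + (-a + 3 * b - 3 * c + d) * t ** 3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def scatter_around_curve(points, n_per_segment=50, sd=0.1, seed=None):
    """Sample points along a spline through `points` and add Gaussian noise."""
    rng = random.Random(seed)
    # Repeat the end points so the curve starts and ends at the given points.
    padded = [points[0]] + list(points) + [points[-1]]
    data = []
    for i in range(len(padded) - 3):
        for _ in range(n_per_segment):
            t = rng.random()  # random position along this segment
            x, y = catmull_rom(padded[i], padded[i + 1],
                               padded[i + 2], padded[i + 3], t)
            data.append((x + rng.gauss(0, sd), y + rng.gauss(0, sd)))
    return data


# Hypothetical control points standing in for mouse clicks.
clicks = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0)]
sample = scatter_around_curve(clicks, n_per_segment=100, sd=0.15, seed=1)
print(len(sample))
```

Increasing `sd` makes the curved structure harder to see, which is a convenient way to produce test data of varying difficulty for clustering or curve-fitting demos.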

curvelib.R       generatesplinedata.R
Olive Oil Data: a famous olive oil data set used in clustering and classification. We have eight different chemical composition measurements on 572 Italian olive oils. Additionally, there are two sets of labels, the Region and the Area. There are three different Regions (Southern Italy, Sardinia, and Northern Italy). Each Region comprises a group of Areas: Southern Italy = North Apulia, Calabria, South Apulia, and Sicily; Sardinia = Inland Sardinia and Coastal Sardinia; Northern Italy = Umbria, East Liguria, and West Liguria.

Data
"Aggregation": example clustering data set from the University of Eastern Finland Computer Science Department (Gionis et al., 2007)
788 2-D observations, seven groups (labels in the third column)

Data

© 2016 Department of Statistics, Carnegie Mellon University. Website design adapted from a design by Mikhail Popov.