---
title: 'Lecture 8: Getting Data'
author: "36-350"
date: "22 September 2014"
output: ioslides_presentation
font-family: Garamond
transition: none
---

## In Previous Episodes

- Seen functions to load data in passing
- Learned about string manipulation and regexp

## Agenda

- Getting data into and out of the system when it's already in R format
- Import and export when the data is already very structured and machine-readable
- Dealing with less structured data
- Web scraping

## Reading Data from R

- You can load and save R objects
    + R has its own format for this, which is shared across operating systems
    + It's an open, documented format if you really want to pry into it
- `save(thing, file="name")` saves `thing` in a file called `name` (conventional extension: `rda` or `Rda`)
- `load("name")` loads the object or objects stored in the file called `name`, _with their old names_

##

```{r}
gmp <- read.table("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/06/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
not_gmp
```

##

- We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
- Many packages come with saved data objects; there's the convenience function `data()` to load them

```{r}
data(cats,package="MASS")
summary(cats)
```

_Note_: `data()` returns the name of the loaded data file!

## Non-R Data Tables

- Tables full of data, just not in the R file format
- Main function: `read.table()`
    + Presumes space-separated fields, one line per row
    + Main argument is the file name or URL
    + Returns a dataframe
    + Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file...
- `read.csv()` is a short-cut to set the options for reading comma-separated value (CSV) files
    + Spreadsheets will usually read and write CSV

## Writing Dataframes

- Counterpart functions `write.table()`, `write.csv()` write a dataframe into a file
- Drawback: takes a lot more disk space than what you get from `load` or `save`
- Advantage: can communicate with other programs, or even be edited manually

## Less Friendly Data Formats

- The `foreign` package on CRAN has tools for reading data files from lots of non-R statistical software
- Spreadsheets are special

## Spreadsheets Considered Harmful

- Spreadsheets look like they should be dataframes
- Real spreadsheets are full of ugly irregularities
    + Values or formulas?
    + Headers, footers, side-comments, notes
    + Columns change meaning half-way down
    + Whole separate programming languages apparently intended mostly to spread malware
- Ought-to-be-notorious source of errors in both industry ([1](http://ftalphaville.ft.com/2013/01/17/1342082/a-tempest-in-a-spreadsheet/), [2](http://baselinescenario.com/2013/02/09/the-importance-of-excel/)) and science (e.g., Reinhart and Rogoff)

## Spreadsheets, If You Have To

- Save the spreadsheet as a CSV; `read.csv()`
- Save the spreadsheet as a CSV; edit in a text editor; `read.csv()`
- Use `read.xls()` from the `gdata` package
    + Tries very hard to work like `read.csv()`, can take a URL or filename
    + Can skip down to the first line that matches some pattern, select different sheets, etc.
    + You may still need to do a lot of tidying up afterwards

##

```{r}
require(gdata, quietly=TRUE)
```

##

```{r}
setwd("~/Downloads/")
gmp_2008_2013 <- read.xls("gdp_metro0914.xls",pattern="U.S.")
head(gmp_2008_2013)
```

## Semi-Structured Files, Odd Formats

- Files with metadata (e.g., earthquake catalog)
- Non-tabular arrangement
- Generally, write a function to read in one (or a few) lines and split them into some nicer format
    + Generally involves a lot of regexps
    + Functions are easier to get right than code blocks in loops

## In Praise of Capture Groups

- Parentheses don't just group for quantifiers; they also create _capture groups_, which the regexp engine remembers
- Can be referred to later (`\1`, `\2`, etc.)
- Can also be used to simplify getting stuff out
- Examples in the handout on regexps, but let's reinforce the point

## Scraping the Rich

- Remember that the lines giving net worth looked like

```