---
title: 'Lecture 8: Getting Data'
author: "36-350"
date: "22 September 2014"
output: ioslides_presentation
font-family: Garamond
transition: none
---

## In Previous Episodes

- Seen functions to load data in passing
- Learned about string manipulation and regexps

## Agenda

- Getting data into and out of the system when it's already in R format
- Import and export when the data is already very structured and machine-readable
- Dealing with less structured data
- Web scraping

## Reading Data from R

- You can load and save R objects
    + R has its own format for this, which is shared across operating systems
    + It's an open, documented format, if you really want to pry into it
- `save(thing, file="name")` saves `thing` in a file called `name` (conventional extension: `rda` or `Rda`)
- `load("name")` loads the object or objects stored in the file called `name`, _with their old names_

##

```{r}
gmp <- read.table("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/06/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
not_gmp
```

##

- We can load or save more than one object at once; this is how RStudio loads your whole workspace when you start, and offers to save it when you're done
- Many packages come with saved data objects; there's the convenience function `data()` to load them

```{r}
data(cats,package="MASS")
summary(cats)
```

_Note_: `data()` returns the name of the loaded data file!

## Non-R Data Tables

- Tables full of data, just not in the R file format
- Main function: `read.table()`
    + Presumes space-separated fields, one line per row
    + Main argument is the file name or URL
    + Returns a dataframe
    + Lots of options for things like the field separator, column names, forcing or guessing column types, skipping lines at the start of the file...
- `read.csv()` is a short-cut that sets the options for reading comma-separated value (CSV) files
    + Spreadsheets will usually read and write CSV

## Writing Dataframes

- The counterpart functions `write.table()` and `write.csv()` write a dataframe into a file
- Drawback: takes a lot more disk space than what you get from `load` or `save`
- Advantage: can communicate with other programs, or even be edited manually

## Less Friendly Data Formats

- The `foreign` package on CRAN has tools for reading data files from lots of non-R statistical software
- Spreadsheets are special

## Spreadsheets Considered Harmful

- Spreadsheets look like they should be dataframes
- Real spreadsheets are full of ugly irregularities
    + Values or formulas?
    + Headers, footers, side-comments, notes
    + Columns change meaning half-way down
    + Whole separate programming languages, apparently intended mostly to spread malware
- Ought-to-be-notorious source of errors in both industry ([1](http://ftalphaville.ft.com/2013/01/17/1342082/a-tempest-in-a-spreadsheet/), [2](http://baselinescenario.com/2013/02/09/the-importance-of-excel/)) and science (e.g., Reinhart and Rogoff)

## Spreadsheets, If You Have To

- Save the spreadsheet as a CSV; `read.csv()`
- Save the spreadsheet as a CSV; edit in a text editor; `read.csv()` (sketch on the next slide)
- Use `read.xls()` from the `gdata` package
    + Tries very hard to work like `read.csv()`; can take a URL or filename
    + Can skip down to the first line that matches some pattern, select different sheets, etc.
    + You may still need to do a lot of tidying up afterwards
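##

A minimal sketch of the save-as-CSV route; the file name, the `skip` count, and the `amount` column are all hypothetical:

```{r, eval=FALSE}
# After exporting the spreadsheet as expenses.csv (hypothetical file):
expenses <- read.csv("expenses.csv", skip=2,      # skip two rows of header junk
                     stringsAsFactors=FALSE,
                     na.strings=c("", "N/A"))     # treat blanks and N/A as missing
# Drop footer and side-comment rows, which have no value in the amount column
expenses <- expenses[!is.na(expenses$amount), ]
```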
##

```{r}
require(gdata, quietly=TRUE)
```

##

```{r}
setwd("~/Downloads/")
gmp_2008_2013 <- read.xls("gdp_metro0914.xls",pattern="U.S.")
head(gmp_2008_2013)
```

## Semi-Structured Files, Odd Formats

- Files with metadata (e.g., the earthquake catalog)
- Non-tabular arrangements
- Generally, write a function to read in one line (or a few) and split it into some nicer format
    + Generally involves a lot of regexps
    + Functions are easier to get right than code blocks inside loops

## In Praise of Capture Groups

- Parentheses don't just group for quantifiers; they also create _capture groups_, which the regexp engine remembers
- These can be referred to later (`\1`, `\2`, etc.)
- They can also be used to simplify getting stuff out
- Examples in the handout on regexps, but let's reinforce the point

## Scraping the Rich

Remember that the lines giving net worth looked like

```
$72 B
```

or

```
$5,3 B
```

## One regexp which catches this:

```{r}
richhtml <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/labs/03/rich.html")
worth_pattern <- "\\$[0-9,]+ B"
worth_lines <- grep(worth_pattern, richhtml)
length(worth_lines)
```

(that last command checks that we have the right number of matches)

##

Just using this gives us strings, including the markers we used to pin down where the information was:

```{r}
worth_matches <- regexpr(worth_pattern, richhtml)
worths <- regmatches(richhtml, worth_matches)
head(worths)
```

Now we'd need to get rid of the anchoring `$` and ` B`; we could use `substr`, but...

##

Adding a capture group doesn't change what we match:

```{r}
worth_capture <- worth_pattern <- "\\$([0-9,]+) B"
capture_lines <- grep(worth_capture, richhtml)
identical(worth_lines, capture_lines)
```

but it _does_ have an advantage

## Using `regexec`

```{r}
worth_matches <- regmatches(richhtml[capture_lines],
                            regexec(worth_capture, richhtml[capture_lines]))
worth_matches[1:2]
```

A list with one element per matching line, giving the whole match and then each parenthesized matching sub-expression

##

Functions make the remaining manipulation easier:

```{r}
second_element <- function(x) { return(x[2]) }
worth_strings <- sapply(worth_matches, second_element)
comma_to_dot <- function(x) { return(gsub(pattern=",",replacement=".",x)) }
worths <- as.numeric(sapply(worth_strings, comma_to_dot))
head(worths)
```

_Exercise_: Write _one_ function which takes a single line, gets the capture group, and converts it to a number

## Web Scraping

1. Take a webpage designed for humans to read
2. Have the computer extract the information we actually want
3. Iterate as appropriate

Take in unstructured pages, return rigidly formatted data

##

!["and then a miracle happens"](http://imgc-cn.artprintimages.com/images/P-473-488-90/60/6079/KTUD100Z/posters/sidney-harris-i-think-you-should-be-more-explicit-here-in-step-two-cartoon.jpg)

## Being More Explicit in Step 2

- The information we want is _somewhere_ in the page, possibly in the HTML
- There are usually markers surrounding it, probably in the HTML
- We now know how to pick apart HTML using regular expressions

##

- Figure out _exactly_ what we want from the page
- Understand how the information is organized on the page
    + What does a human use to find it?
    + Where do those cues appear in the HTML source?
- Write a function to automate information extraction (see the sketch on the next slide)
    + Generally, this means regexps
    + Parenthesized capture groups are helpful
    + The function may need to iterate
    + You may need more than one function
- Once you've got it working for one page, iterate over the relevant pages
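##

A minimal sketch of such an extraction function; the price pattern and the page it would run on are made up for illustration, not taken from a real site:

```{r, eval=FALSE}
# Pull out prices like $19.99 from the lines of one page's HTML
extract_prices <- function(html_lines) {
  price_pattern <- "\\$([0-9]+\\.[0-9]{2})"  # capture group keeps just the digits
  hits <- grep(price_pattern, html_lines)    # which lines match at all?
  matches <- regmatches(html_lines[hits],
                        regexec(price_pattern, html_lines[hits]))
  as.numeric(sapply(matches, function(m) { m[2] }))  # element 2 = capture group
}
```

The same skeleton, `grep` to find the lines, `regexec` plus `regmatches` to pull out the capture groups, `sapply` to tidy up, reappears in most scraping jobs; only the pattern changes from page to page.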
## Example: Book Networks

Famous example from [Valdis Krebs](http://www.orgnet.com/divided1.html)

![network of political books](krebs-2003.png)

##

- Two books are linked if they're bought together at Amazon
- Amazon gives this information away (to try to drive sales)
- How would we replicate this?

##

http://www.amazon.com/dp/0387747303/

![Part of the Amazon page for _Data Manipulation with R_](spector.png)

##

- Do we want "frequently bought together", or "customers who bought this also bought that"? Or even "what else do customers buy after viewing this"?
    + Let's say "customers who bought this also bought that"
- Now look carefully at the HTML
    + There are over 14,000 lines in the HTML file for this page; you'll need a text editor
    + Fortunately most of it's irrelevant
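##

Rather than scrolling through 14,000 lines by hand, we can ask R where the marker is; a hedged sketch (Amazon's page layout changes often, so the exact output will vary):

```{r, eval=FALSE}
# Fetch the page source and locate the marker phrase
spector <- readLines("http://www.amazon.com/dp/0387747303/")
length(spector)  # confirm it's a huge file
grep("Customers Who Bought This Item Also Bought", spector)
```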

##

```
Customers Who Bought This Item Also Bought
```

##

Here's the first of the also-bought books:

```
• ggplot2: Elegant Graphics for Data …
```
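##

A hedged sketch of pulling the also-bought titles out of lines like these; the link markup assumed here is illustrative, since Amazon's real HTML is messier and changes without notice:

```{r, eval=FALSE}
# Assume each also-bought title is the text of a link: <a href="...">Title</a>
also_bought <- function(html_lines) {
  title_pattern <- "<a href=\"[^\"]+\">([^<]+)</a>"
  hits <- grep(title_pattern, html_lines)
  matches <- regmatches(html_lines[hits],
                        regexec(title_pattern, html_lines[hits]))
  sapply(matches, function(m) { m[2] })  # keep the captured title text
}
```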