22 September 2014

## In Previous Episodes

• Seen functions to load data in passing
• Learned about string manipulation and regexp

## Agenda

• Getting data into and out of the system when it's already in R format
• Import and export when the data is already very structured and machine-readable
• Dealing with less structured data
• Web scraping

## Loading and Saving R Objects

• You can load and save R objects
• R has its own format for this, which is shared across operating systems
• It's an open, documented format if you really want to pry into it
• save(thing, file="name") saves thing in a file called name (conventional extension: rda or Rda)
• load("name") loads the object or objects stored in the file called name, with their old names
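Both save() and load() handle multiple objects at once; a minimal sketch (the object and file names here are made up for illustration):

```r
x <- 1:10
y <- "hello"
save(x, y, file="two-things.Rda")  # one file, two objects
rm(x, y)
load("two-things.Rda")             # both come back, under their old names
```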
gmp <- read.table("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/06/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
## [1] FALSE
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
## [1] "MSA"   "gmp"   "pcgmp" "pop"
not_gmp
## [1] "gmp"

• We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
• Many packages come with saved data objects; there's the convenience function data() to load them

data(cats,package="MASS")
summary(cats)
##  Sex       Bwt             Hwt
##  F:47   Min.   :2.00   Min.   : 6.30
##  M:97   1st Qu.:2.30   1st Qu.: 8.95
##         Median :2.70   Median :10.10
##         Mean   :2.72   Mean   :10.63
##         3rd Qu.:3.02   3rd Qu.:12.12
##         Max.   :3.90   Max.   :20.50

Note: data() returns the name of the loaded data file!

## Non-R Data Tables

• Tables full of data, just not in the R file format
• Main function: read.table()
• Presumes space-separated fields, one line per row
• Main argument is the file name or URL
• Returns a dataframe
• Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file…
• read.csv() is a short-cut which sets the options for reading comma-separated value (CSV) files
• Spreadsheets will usually read and write CSV

## Writing Dataframes

• The counterpart functions write.table() and write.csv() write a dataframe into a file
• Drawback: takes a lot more disk space than the binary format used by save()
• Advantage: can communicate with other programs, or even be edited manually

## Less Friendly Data Formats

• The foreign package on CRAN has tools for reading data files from lots of non-R statistical software
• Spreadsheets are special

## Spreadsheets Considered Harmful

• Spreadsheets look like they should be dataframes
• Real spreadsheets are full of ugly irregularities
• Values or formulas?
• Headers, footers, side-comments, notes
• Columns change meaning half-way down
• Whole separate programming languages, apparently intended mostly to spread malware
• Ought-to-be-notorious source of errors in both industry (1, 2) and science (e.g., Reinhart and Rogoff)

## Spreadsheets, If You Have To

• Save the spreadsheet as a CSV; read.csv()
• Save the spreadsheet as a CSV; edit in a text editor; read.csv()
• Use read.xls() from the gdata package
• Tries very hard to work like read.csv(), can take a URL or filename
• Can skip down to the first line that matches some pattern, select different sheets, etc.
• You may still need to do a lot of tidying up afterwards

require(gdata, quietly=TRUE)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
##
## The following object is masked from 'package:stats':
##
##     nobs
##
## The following object is masked from 'package:utils':
##
##     object.size

setwd("~/Downloads/")
gmp_2008_2013 <- read.xls("gdp_metro0914.xls",pattern="U.S.")
head(gmp_2008_2013)
##       U.S..metropolitan.areas X13.269.057 X12.994.636 X13.461.662
## 1                 Abilene, TX       5,725       5,239       5,429
## 2                   Akron, OH      28,663      27,761      28,616
## 3                  Albany, GA       4,795       4,957       4,928
## 4                  Albany, OR       3,235       3,064       3,050
## 5 Albany-Schenectady-Troy, NY      40,365      42,454      42,969
## 6             Albuquerque, NM      37,359      38,110      38,801
##   X13.953.082 X14.606.938 X15.079.920 .......
## 1       5,761       6,143       6,452     252
## 2      29,425      31,012      31,485      80
## 3       4,938       5,122       5,307     290
## 4       3,170       3,294       3,375     363
## 5      43,663      45,330      46,537      58
## 6      39,967      41,301      41,970      64

## Semi-Structured Files, Odd Formats

• Files with metadata (e.g., earthquake catalog)
• Non-tabular arrangement
• Generally, write a function to read in one (or a few) lines and split it into some nicer format
• Generally involves a lot of regexps
• Functions are easier to get right than code blocks in loops

## In Praise of Capture Groups

• Parentheses don't just group for quantifiers; they also create capture groups, which the regexp engine remembers
• Capture groups can be referred to later (\1, \2, etc.)
• They can also be used to simplify getting stuff out
• Examples in the handout on regexps, but let's reinforce the point

## Scraping the Rich

• Remember that the lines giving net worth looked like

    <td class="worth">$72 B</td>

or

    <td class="worth">$5,3 B</td>

One regexp which catches this:

richhtml <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/labs/03/rich.html")
worth_pattern <- "\\$[0-9,]+ B"
worth_lines <- grep(worth_pattern, richhtml)
length(worth_lines)
## [1] 100

(the last line checks that we have the right number of matches)

Just using this gives us strings, including the markers we used to pin down where the information was:

worth_matches <- regexpr(worth_pattern, richhtml)
worths <- regmatches(richhtml, worth_matches)
head(worths)
## [1] "$72 B" "$58,5 B" "$41 B" "$36 B"   "$36 B" "$35,4 B"

Now we'd need to get rid of the anchoring $ and B; we could use substr, but…

Adding a capture group doesn't change what we match:

worth_capture <- "\\$([0-9,]+) B"
capture_lines <- grep(worth_capture, richhtml)
identical(worth_lines, capture_lines)
## [1] TRUE

but it does have an advantage

## Using regexec

worth_matches <- regmatches(richhtml[capture_lines],
regexec(worth_capture, richhtml[capture_lines]))
worth_matches[1:2]
## [[1]]
## [1] "$72 B" "72"
##
## [[2]]
## [1] "$58,5 B" "58,5"

A list with one element per matching line, each giving the whole match and then each parenthesized matching sub-expression

Functions make the remaining manipulation easier:

second_element <- function(x) { return(x[2]) }
worth_strings <- sapply(worth_matches, second_element)
comma_to_dot <- function(x) {
return(gsub(pattern=",",replacement=".",x))
}
worths <- as.numeric(sapply(worth_strings, comma_to_dot))
head(worths)
## [1] 72.0 58.5 41.0 36.0 36.0 35.4

Exercise: Write one function which takes a single line, gets the capture group, and converts it to a number
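One possible solution, for checking your own attempt, which just combines the regexec() extraction and the comma-to-dot conversion from above into a single function:

```r
worth_of_line <- function(line) {
  # whole match first, then the capture group
  m <- regmatches(line, regexec("\\$([0-9,]+) B", line))[[1]]
  as.numeric(gsub(pattern=",", replacement=".", m[2]))
}
worth_of_line('<td class="worth">$58,5 B</td>')  # 58.5
```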

## Web Scraping

1. Take a webpage designed for humans to read
2. Have the computer extract the information we actually want
3. Iterate as appropriate

Take in unstructured pages, return rigidly formatted data

## Being More Explicit in Step 2

• The information we want is somewhere in the page, possibly in the HTML
• There are usually markers surrounding it, probably in the HTML
• We now know how to pick apart HTML using regular expressions
• Figure out exactly what we want from the page
• Understand how the information is organized on the page
• What does a human use to find it?
• Where do those cues appear in the HTML source?
• Write a function to automate information extraction
• Generally, this means regexps
• Parenthesized capture groups are helpful
• The function may need to iterate
• You may need more than one function
• Once you've got it working for one page, iterate over relevant pages
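The steps above can be sketched as one generic extraction function (the name is ours, and it assumes the pattern has exactly one capture group and that each match sits on a single line):

```r
# Hypothetical skeleton for step 2: pull one capture group out of a page
extract_from_page <- function(url, pattern) {
  page <- readLines(url)                  # page as a vector of lines
  hits <- grep(pattern, page)             # lines containing the marker
  m <- regmatches(page[hits], regexec(pattern, page[hits]))
  sapply(m, function(x) x[2])             # keep just the capture group
}
```

Applied to the rich list with worth_capture as the pattern, this would reproduce worth_strings from above.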

## Example: Book Networks

• Two books are linked if they're bought together at Amazon

• Amazon gives this information away (to try to drive sales)

• How would we replicate this?

• Do we want "frequently bought together", or "customers who bought this also bought that"? Or even "what else do customers buy after viewing this"?
• Let's say "customers who bought this also bought that"
• Now look carefully at the HTML
• There are over 14,000 lines in the HTML file for this page; you'll need a text editor
• Fortunately most of it's irrelevant
<div class="shoveler" id="purchaseShvl">
<h2>Customers Who Bought This Item Also Bought</h2>
</div>

<div class="shoveler-pagination" style="display:none">

<span>&nbsp;</span>
<span>
Page <span class="page-number"></span>  of  <span class="num-pages"></span>
<span class="start-over"><span class="a-text-separator"></span><a href="#" onclick="return false;" class="start-over-link">Start over</a></span>
</span>
</div>

<div class="shoveler-button-wrapper" id="purchaseButtonWrapper">
<a class="back-button" href="#Back" style="display:none" onclick="return false;"><span class="auiTestSprite s_shvlBack"><span>Back</span></span></a>
<div class="shoveler-content">
<ul tabindex="-1">

Here's the first of the also-bought books:

<li>
<div class="new-faceout p13nimp"  id="purchase_0387981403" data-asin="0387981403" data-ref="pd_sim_b_1">

<a href="/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403/ref=pd_sim_b_1?ie=UTF8&refRID=1HZ0VDHEFFX3EM2WNWRH"  class="sim-img-title" > <div class="product-image">
<img src="http://ecx.images-amazon.com/images/I/31I22xsT%2BXL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
</div>
<span title="ggplot2: Elegant Graphics for Data Analysis (Use R!)">ggplot2: Elegant Graphics for Data &#133;</span> </a>

<div class="byline">
<span class="carat">&#8250</span> 

We could extract the ISBN from this, and then go on to the next book, and so forth…

<div id="purchaseSimsData" class="sims-data"
style="display:none" data-baseAsin="0387747303"
data-deviceType="desktop" data-featureId="pd_sim" data-isAUI="1" data-pageId="0387747303" data-pageRequestId="1HZ0VDHEFFX3EM2WNWRH" data-reftag="pd_sim_b" data-vt="0387747303"
data-wdg="book_display_on_website"
data-widgetName="purchase">0387981403,0596809158,1593273843,1449316956,
0387938362,144931208X,0387790535,0387886974,0470973927,0387759689,
1439810184,1461413648,1461471370,1782162143,1441998896,1429224622,
1612903436,1441996494,1461468485,1617291560,1439831769,0321888030,1449319793,
1119962846,0521762936,1446200469,1449358659,1935182390,0123814855,1599941651,
0387759352,1461476178,0387773169,0387922970,0073523321,141297514X,1439840954,
1612900275,1449339735,052168689X,0387781706,1584884509,0387848576,1420068725,
1441915753,1466572841,1107422221,111844714X,0716762196,0133412938,1482203537,
0963488406,1466586966,0470463635,1493909827,1420079336,0321898656,1461422981,
158488424X,1441926127,1466570229,1590475348,1430266406,0071794565,0071623663,
111866146X,1441977864,1782160604,1449340377,1449309038,0963488414,0137444265,
1461406846,0073014664,1449370780,144197864X,3642201911,0534243126,1461443423,
158488651X,1449357105,1118208781,1420099604,1107057132,1449355730,1118356853,
1449361323,0470890819,0387245448,0521518148,0521169828,1584888490,1461464455,
0387781889,0387759581,0387717617,0123748569,188652923X,0155061399,0201076160</div>

In this case there's a big block which gives us the ISBNs of all the also-bought books

Strategy:

• Load the page as text
• Search for the regexp which begins this block, contains at least one ISBN, and then ends
• Extract the sequence of ISBNs as a string, split on comma
• Record in a dataframe that Data Manipulation's ISBN is also bought with each of those ISBNs
• Snowball sampling: Go to the webpage of each of those books and repeat
• Stop when we get tired…
• Or when Amazon gets annoyed with us
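The extraction step in this strategy might be sketched as follows, assuming the ISBN block always sits inside a div whose class is sims-data (the pattern and function name here are our own guesses from this one page, not anything Amazon promises):

```r
# Hypothetical sketch: extract the also-bought ISBNs from a saved product page
also_bought <- function(html_lines) {
  flat <- paste(html_lines, collapse="")  # the <div> spans many lines
  block <- regmatches(flat,
    regexec("class=\"sims-data\"[^>]*>([0-9X,]+)</div>", flat))[[1]]
  strsplit(block[2], ",")[[1]]            # one ISBN per element
}
```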

## More Considerations on Web Scraping

• You should really look at the site's robots.txt file and respect it
• See https://github.com/hadley/rvest for a prototype of a package to automate a lot of the work of scraping webpages