Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:

readLines(): reading in lines of text from a file or webpage; returns vector of strings
read.table(): read in a data table from a file or webpage; returns data frame
read.csv(): like the above, but assumes comma separated data; returns data frame

This week we’ll focus on read.table(), read.csv() and their counterparts write.table(), write.csv(), respectively

Reading in data from a previous R session

Reminder that we’ve seen two ways to read/write in data in specialized R formats:

readRDS(), saveRDS(): functions for reading/writing single R objects from/to a file
load(), save(): functions for reading/writing any number of R objects from/to a file

Advantage: these can be a lot more memory efficient than what we’ll cover this week. Disadvantage: they’re limited in scope in the sense that they can only communicate with R

`read.table()`, `read.csv()`

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:

read.table(file=file.name, sep=" "), to read data from a local file on your computer called file.name, assuming (say) space separated data
read.table(file=webpage.link, sep="\t"), to read data from a webpage up at webpage.link, assuming (say) tab separated data

The function read.csv() is just a shortcut for using read.table() with sep=",". (But note: these two actually differ on some of the default inputs!)

Examples of reading in data

# This data table is comma separated, so we can use read.csv()
strike.df = 
  read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/strikes.csv")
head(strike.df)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

sapply(strike.df, class)

##         country            year   strike.volume    unemployment 
##        "factor"       "integer"       "integer"       "numeric" 
##       inflation left.parliament  centralization         density 
##       "numeric"       "numeric"       "numeric"       "numeric"

# This data table is tab separated, so let's specify sep="\t"
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/anss.dat",
  sep="\t")
head(anss.df)

##           V1          V2       V3        V4     V5   V6
## 1       Date        Time      Lat       Lon  Depth  Mag
## 2 2002/01/01 10:39:06.82 -55.2140 -129.0000  10.00 6.00
## 3 2002/01/01 11:29:22.73   6.3030  125.6500 138.10 6.30
## 4 2002/01/02 14:50:33.49 -17.9830  178.7440 665.80 6.20
## 5 2002/01/02 17:22:48.76 -17.6000  167.8560  21.00 7.20
## 6 2002/01/03 07:05:27.67  36.0880   70.6870 129.30 6.20

sapply(anss.df, class)

##       V1       V2       V3       V4       V5       V6 
## "factor" "factor" "factor" "factor" "factor" "factor"

# Oops! It comes with column names, so let's set header=TRUE
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/anss.dat",
  sep="\t", header=TRUE)
head(anss.df)

##         Date        Time     Lat      Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000  10.0 6.0
## 2 2002/01/01 11:29:22.73   6.303  125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983  178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600  167.856  21.0 7.2
## 5 2002/01/03 07:05:27.67  36.088   70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664  168.004  10.0 6.6

sapply(anss.df, class)

##      Date      Time       Lat       Lon     Depth       Mag 
##  "factor"  "factor" "numeric" "numeric" "numeric" "numeric"

Helpful input arguments

The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general—e.g., header defaults to TRUE in read.csv() but FALSE in read.table())

header: boolean, TRUE is the first line should be interpreted as column names
sep: string, specifies what separates the entries; empty string "" is interpreted to mean any white space
quote: string, specifies what set of characters signify the beginning and end of quotes; empty string "" disables quotes altogether

Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table()

Another example of reading in data

# This data table is tab separated, and it comes with column names
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.m.dat",
  sep="\t", header=TRUE)

## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 981 did not have 8 elements

# Oops! It turns out it has some apostrophe marks in the City column, 
# so we have to set quote=""
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.m.dat",
  sep="\t", header=TRUE, quote="")
head(sprint.m.df)

##   Rank Time Wind        Name Country Birthdate     City       Date
## 1    1 9.58  0.9  Usain Bolt     JAM  21.08.86   Berlin 16.08.2009
## 2    2 9.63  1.5  Usain Bolt     JAM  21.08.86   London 05.08.2012
## 3    3 9.69  0.0  Usain Bolt     JAM  21.08.86  Beijing 16.08.2008
## 4    3 9.69  2.0   Tyson Gay     USA  09.08.82 Shanghai 20.09.2009
## 5    3 9.69 -0.1 Yohan Blake     JAM  26.12.89 Lausanne 23.08.2012
## 6    6 9.71  0.9   Tyson Gay     USA  09.08.82   Berlin 16.08.2009

unique(grep("'", sprint.m.df$City, value=TRUE)) # This is the troublemaker

## [1] "Villeneuve d'Ascq"

`write.table()`, `write.csv()`

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:

write.table(my.df, file="my.df.txt", sep=" ", quote=FALSE), to write my.dat to the text file “my.df.txt” (to be created in your working directory), with (say) space separation, and no quotes around the entries
write.csv(my.df, file="my.df.csv", quote=FALSE), to write my.dat to the text file “my.df.csv” (to be created in your working directory), with comma separation, and no quotes

Note that quote=FALSE, signifying that no quotes should be put around the printed data entries, seems always preferable (default is quote=TRUE). Also, setting row.names=FALSE and col.names=FALSE will disable the printing of row and column names (defaults are row.names=TRUE and col.names=TRUE)

Reordering data

Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:

The function order() takes in a vector, and returns the vector of indices that put this vector in increasing order
Set the input decreasing=TRUE in order() to get decreasing order
We can apply compute an appropriate vector of indices, and then use this on rows of our data frame to reorder all of the columns simultaneously

Examples of reordering

# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.m.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.m.df.slow = sprint.m.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.m.df.slow)

##      Rank  Time Wind         Name Country Birthdate             City
## 2682 2691 10.09  1.8  Mel Lattany     USA  10.08.59 Colorado Springs
## 2683 2691 10.09 -0.9  Mel Lattany     USA  10.08.59      Z&uuml;rich
## 2684 2691 10.09  1.3   Carl Lewis     USA  01.07.61           Walnut
## 2685 2691 10.09  0.6 Calvin Smith     USA  08.01.61           Athens
## 2686 2691 10.09 -1.7   Carl Lewis     USA  01.07.61     Indianapolis
## 2687 2691 10.09 -0.9 Calvin Smith     USA  08.01.61      Z&uuml;rich
##            Date
## 2682 30.07.1978
## 2683 19.08.1981
## 2684 25.04.1982
## 2685 14.05.1982
## 2686 02.07.1982
## 2687 18.08.1982

# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.m.df$Name) # By sprinter name
sprint.m.df.name = sprint.m.df[i.name,] # Reorder rows by name
head(sprint.m.df.name)

##      Rank  Time Wind            Name Country Birthdate          City
## 1373 1281 10.03  0.5 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 1552 1456 10.04  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 2326 2145 10.07  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 568   491  9.96  2.0     Aaron Brown     CAN  27.05.92     Montverde
## 1114 1011 10.01  1.8     Aaron Brown     CAN  27.05.92     Montverde
## 1836 1683 10.05  1.9     Aaron Brown     CAN  27.05.92        Eugene
##            Date
## 1373 20.06.2009
## 1552 25.06.2005
## 2326 13.08.2011
## 568  11.06.2016
## 1114 11.06.2016
## 1836 05.06.2013

Part II

Merging data

Merging data frames

Suppose you have two data frames X, Y, and you want to combine them

Simplest case: the data frames have exactly the same number of rows, that the rows represent exactly the same units, and you want all columns from both; just use, data.frame(X,Y)
Next best case: you know that the two data frames have the same rows, but you only want certain columns from each; just use, e.g., data.frame(X$col1,X$col5,Y$col3)
Next best case: same number of rows but in different order; put one of them in same order as the other, with order(). Alternatively, use merge()
Worse cases: different numbers of rows … hard to line up rows …

Example: big city drivers

People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true?

Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:

fha = read.csv(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/fha.csv",
  colClasses=c("character","double","double","double"), na.strings="NA")
nrow(fha)

## [1] 498

head(fha)

##                                   City Population Miles.of.Road
## 1         New York--Newark, NY--NJ--CT   18351295         43893
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996         24877
## 3                      Chicago, IL--IN    8608208         25905
## 4                            Miami, FL    5502379         15641
## 5         Philadelphia, PA--NJ--DE--MD    5441567         19867
## 6    Dallas--Fort Worth--Arlington, TX    5121892         21610
##   Daily.Miles.Traveled
## 1               286101
## 2               270807
## 3               172708
## 4               125899
## 5                99190
## 6               125389

Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:

ua = read.csv(file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/ua.dat", 
              sep=";")
nrow(ua)

## [1] 3598

head(ua)

##   UACE           NAME   POP    HU AREALAND AREALANDSQMI AREAWATER
## 1   37  Abbeville, LA 19824  8460 29222871        11.28    300497
## 2   64  Abbeville, SC  5243  2578 11315197         4.37     19786
## 3   91 Abbotsford, WI  3966  1616  5363441         2.07     13221
## 4  118   Aberdeen, MS  4666  2050  7416616         2.86     52732
## 5  145   Aberdeen, SD 25977 12114 33002447        12.74    247597
## 6  172   Aberdeen, WA 29856 13139 39997951        15.44   1929689
##   AREAWATERSQMI POPDEN LSADC
## 1          0.12 1757.0    76
## 2          0.01 1200.1    76
## 3          0.01 1915.2    76
## 4          0.02 1629.4    76
## 5          0.10 2038.6    76
## 6          0.75 1933.3    76

Difficulties in navigating the merge

$\approx 500$ cities versus $\approx 4000$ “urbanized areas”
fha orders cities by population, ua is alphabetical by name
Both have place names, but those don’t always agree
Not even common names for the shared columns

Lesson: find the unique identifier

But both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:

length(unique(fha$Population)) == nrow(fha)

## [1] TRUE

ua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))

## [1] 0

First way to merge

Reorder area column in second table by population, append to first table

ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)

## [1] 498

head(df1)

##                                   City Population Roads Mileage    Area
## 1         New York--Newark, NY--NJ--CT   18351295 43893  286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996 24877  270807 1736.02
## 3                      Chicago, IL--IN    8608208 25905  172708 2442.75
## 4                            Miami, FL    5502379 15641  125899 1238.61
## 5         Philadelphia, PA--NJ--DE--MD    5441567 19867   99190 1981.37
## 6    Dallas--Fort Worth--Arlington, TX    5121892 21610  125389 1779.13

Second way to merge

Use the merge() function

df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)

## [1] 498

tail(df2, 3)

##     Population                                 City Miles.of.Road
## 496    8608208                      Chicago, IL--IN         25905
## 497   12150996 Los Angeles--Long Beach--Anaheim, CA         24877
## 498   18351295         New York--Newark, NY--NJ--CT         43893
##     Daily.Miles.Traveled  UACE                                 NAME
## 496               172708 16264                      Chicago, IL--IN
## 497               270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498               286101 63217         New York--Newark, NY--NJ--CT
##          HU   AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332      2442.75 105649916         40.79 3524.0    75
## 497 4217448 4496266014      1736.02  61141327         23.61 6999.3    75
## 498 7263095 8935981360      3450.20 533176599        205.86 5318.9    75

`merge()`

The merge() function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join two data frames x, y, by matching the columns “SomeXCol” and “SomeYCol”

Default (no by.x, by.y specified) is to match all columns with common names
Output will be a new data frame that has all the columns of both data frames
If you know databases, then merge() is doing a JOIN

Using `order()` and manual tricks versus `merge()`

Reordering is easier to grasp; merge() takes some learning
Reordering is simplest when there’s only one column to merge on; merge() handles many columns
Reorderng is simplest when the data frames are the same size; merge() handles different sizes automatically

So, do bigger cities mean more driving?

# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
     xlab="Miles driven per person per day",
     ylab="City area in square miles")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")

Summary

Read in data from a previous R session with readRDS(), load()
Read in data from the outside with read.table(), read.csv()
Can sometimes be tricky to get arguments right in read.table(), read.csv()
Helps sometimes take a look at the original data files to see their structure
For reordering data, use order(), rev(), and proper indexing
For merging data, use merge(); but can do it manually using reordering tricks

Reading in, Reordering, and Merging Data

Last week: Simulation

Part I

Reading in data from the outside

Reading in data from a previous R session

`read.table()`, `read.csv()`

Examples of reading in data

Helpful input arguments

Another example of reading in data

`write.table()`, `write.csv()`

Reordering data

Examples of reordering

Part II

Merging data frames

Example: big city drivers

Difficulties in navigating the merge

Lesson: find the unique identifier

First way to merge

Second way to merge

`merge()`

Using `order()` and manual tricks versus `merge()`

So, do bigger cities mean more driving?

Summary

Reading in, Reordering, and Merging Data

Last week: Simulation

Part I

Reading in data from the outside

Reading in data from a previous R session

read.table(), read.csv()

Examples of reading in data

Helpful input arguments

Another example of reading in data

write.table(), write.csv()

Reordering data

Examples of reordering

Part II

Merging data frames

Example: big city drivers

Difficulties in navigating the merge

Lesson: find the unique identifier

First way to merge

Second way to merge

merge()

Using order() and manual tricks versus merge()

So, do bigger cities mean more driving?

Summary

`read.table()`, `read.csv()`

`write.table()`, `write.csv()`

`merge()`

Using `order()` and manual tricks versus `merge()`