Reading in, Reordering, and Merging Data

Statistical Computing, 36-350

Monday October 14, 2019

Last week: Simulation

Part I

Reading in and reordering data

Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:

This week we’ll focus on read.table(), read.csv() and their counterparts write.table(), write.csv(), respectively

Reading in data from a previous R session

Reminder that we’ve seen two ways to read/write in data in specialized R formats:

Advantage: these can be a lot more memory efficient than what we’ll cover this week. Disadvantage: they’re limited in scope in the sense that they can only communicate with R

read.table(), read.csv()

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:

The function read.csv() is just a shortcut for using read.table() with sep=",". (But note: these two actually differ on some of the default inputs!)

Examples of reading in data

# This data table is comma separated, so we can use read.csv()
strike.df = 
  read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/strikes.csv")
head(strike.df)
##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA
sapply(strike.df, class)
##         country            year   strike.volume    unemployment 
##        "factor"       "integer"       "integer"       "numeric" 
##       inflation left.parliament  centralization         density 
##       "numeric"       "numeric"       "numeric"       "numeric"
# This data table is tab separated, so let's specify sep="\t"
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/anss.dat",
  sep="\t")
head(anss.df)
##           V1          V2       V3        V4     V5   V6
## 1       Date        Time      Lat       Lon  Depth  Mag
## 2 2002/01/01 10:39:06.82 -55.2140 -129.0000  10.00 6.00
## 3 2002/01/01 11:29:22.73   6.3030  125.6500 138.10 6.30
## 4 2002/01/02 14:50:33.49 -17.9830  178.7440 665.80 6.20
## 5 2002/01/02 17:22:48.76 -17.6000  167.8560  21.00 7.20
## 6 2002/01/03 07:05:27.67  36.0880   70.6870 129.30 6.20
sapply(anss.df, class)
##       V1       V2       V3       V4       V5       V6 
## "factor" "factor" "factor" "factor" "factor" "factor"
# Oops! It comes with column names, so let's set header=TRUE
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/anss.dat",
  sep="\t", header=TRUE)
head(anss.df)
##         Date        Time     Lat      Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000  10.0 6.0
## 2 2002/01/01 11:29:22.73   6.303  125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983  178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600  167.856  21.0 7.2
## 5 2002/01/03 07:05:27.67  36.088   70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664  168.004  10.0 6.6
sapply(anss.df, class)
##      Date      Time       Lat       Lon     Depth       Mag 
##  "factor"  "factor" "numeric" "numeric" "numeric" "numeric"

Helpful input arguments

The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general—e.g., header defaults to TRUE in read.csv() but FALSE in read.table())

Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table()

Another example of reading in data

# This data table is tab separated, and it comes with column names
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.dat",
  sep="\t", header=TRUE)
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 981 did not have 8 elements
# Oops! It turns out it has some apostrophe marks in the City column, 
# so we have to set quote=""
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.dat",
  sep="\t", header=TRUE, quote="")
head(sprint.m.df)
##   Rank Time Wind        Name Country Birthdate     City       Date
## 1    1 9.58  0.9  Usain Bolt     JAM  21.08.86   Berlin 16.08.2009
## 2    2 9.63  1.5  Usain Bolt     JAM  21.08.86   London 05.08.2012
## 3    3 9.69  0.0  Usain Bolt     JAM  21.08.86  Beijing 16.08.2008
## 4    3 9.69  2.0   Tyson Gay     USA  09.08.82 Shanghai 20.09.2009
## 5    3 9.69 -0.1 Yohan Blake     JAM  26.12.89 Lausanne 23.08.2012
## 6    6 9.71  0.9   Tyson Gay     USA  09.08.82   Berlin 16.08.2009
unique(grep("'", sprint.m.df$City, value=TRUE)) # This is the troublemaker
## [1] "Villeneuve d'Ascq"

write.table(), write.csv()

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:

Note that quote=FALSE, signifying that no quotes should be put around the printed data entries, seems always preferable (default is quote=TRUE). Also, setting row.names=FALSE and col.names=FALSE will disable the printing of row and column names (defaults are row.names=TRUE and col.names=TRUE)

Reordering data

Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:

Examples of reordering

# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.m.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.m.df.slow = sprint.m.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.m.df.slow)
##      Rank  Time Wind         Name Country Birthdate             City
## 2682 2691 10.09  1.8  Mel Lattany     USA  10.08.59 Colorado Springs
## 2683 2691 10.09 -0.9  Mel Lattany     USA  10.08.59      Zürich
## 2684 2691 10.09  1.3   Carl Lewis     USA  01.07.61           Walnut
## 2685 2691 10.09  0.6 Calvin Smith     USA  08.01.61           Athens
## 2686 2691 10.09 -1.7   Carl Lewis     USA  01.07.61     Indianapolis
## 2687 2691 10.09 -0.9 Calvin Smith     USA  08.01.61      Zürich
##            Date
## 2682 30.07.1978
## 2683 19.08.1981
## 2684 25.04.1982
## 2685 14.05.1982
## 2686 02.07.1982
## 2687 18.08.1982
# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.m.df$Name) # By sprinter name
sprint.m.df.name = sprint.m.df[i.name,] # Reorder rows by name
head(sprint.m.df.name)
##      Rank  Time Wind            Name Country Birthdate          City
## 1373 1281 10.03  0.5 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 1552 1456 10.04  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 2326 2145 10.07  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 568   491  9.96  2.0     Aaron Brown     CAN  27.05.92     Montverde
## 1114 1011 10.01  1.8     Aaron Brown     CAN  27.05.92     Montverde
## 1836 1683 10.05  1.9     Aaron Brown     CAN  27.05.92        Eugene
##            Date
## 1373 20.06.2009
## 1552 25.06.2005
## 2326 13.08.2011
## 568  11.06.2016
## 1114 11.06.2016
## 1836 05.06.2013

Part II

Merging data

Merging data frames

Suppose you have two data frames X, Y, and you want to combine them

Example: big city drivers

People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true?

Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:

fha = read.csv(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/fha.csv",
  colClasses=c("character","double","double","double"), na.strings="NA")
nrow(fha)
## [1] 498
head(fha)
##                                   City Population Miles.of.Road
## 1         New York--Newark, NY--NJ--CT   18351295         43893
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996         24877
## 3                      Chicago, IL--IN    8608208         25905
## 4                            Miami, FL    5502379         15641
## 5         Philadelphia, PA--NJ--DE--MD    5441567         19867
## 6    Dallas--Fort Worth--Arlington, TX    5121892         21610
##   Daily.Miles.Traveled
## 1               286101
## 2               270807
## 3               172708
## 4               125899
## 5                99190
## 6               125389

Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:

ua = read.csv(file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/ua.dat", 
              sep=";")
nrow(ua)
## [1] 3598
head(ua)
##   UACE           NAME   POP    HU AREALAND AREALANDSQMI AREAWATER
## 1   37  Abbeville, LA 19824  8460 29222871        11.28    300497
## 2   64  Abbeville, SC  5243  2578 11315197         4.37     19786
## 3   91 Abbotsford, WI  3966  1616  5363441         2.07     13221
## 4  118   Aberdeen, MS  4666  2050  7416616         2.86     52732
## 5  145   Aberdeen, SD 25977 12114 33002447        12.74    247597
## 6  172   Aberdeen, WA 29856 13139 39997951        15.44   1929689
##   AREAWATERSQMI POPDEN LSADC
## 1          0.12 1757.0    76
## 2          0.01 1200.1    76
## 3          0.01 1915.2    76
## 4          0.02 1629.4    76
## 5          0.10 2038.6    76
## 6          0.75 1933.3    76

Difficulties in navigating the merge

Lesson: find the unique identifier

But both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:

length(unique(fha$Population)) == nrow(fha)
## [1] TRUE
ua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))
## [1] 0

First way to merge

Reorder area column in second table by population, append to first table

ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)
## [1] 498
head(df1)
##                                   City Population Roads Mileage    Area
## 1         New York--Newark, NY--NJ--CT   18351295 43893  286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996 24877  270807 1736.02
## 3                      Chicago, IL--IN    8608208 25905  172708 2442.75
## 4                            Miami, FL    5502379 15641  125899 1238.61
## 5         Philadelphia, PA--NJ--DE--MD    5441567 19867   99190 1981.37
## 6    Dallas--Fort Worth--Arlington, TX    5121892 21610  125389 1779.13

Second way to merge

Use the merge() function

df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)
## [1] 498
tail(df2, 3)
##     Population                                 City Miles.of.Road
## 496    8608208                      Chicago, IL--IN         25905
## 497   12150996 Los Angeles--Long Beach--Anaheim, CA         24877
## 498   18351295         New York--Newark, NY--NJ--CT         43893
##     Daily.Miles.Traveled  UACE                                 NAME
## 496               172708 16264                      Chicago, IL--IN
## 497               270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498               286101 63217         New York--Newark, NY--NJ--CT
##          HU   AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332      2442.75 105649916         40.79 3524.0    75
## 497 4217448 4496266014      1736.02  61141327         23.61 6999.3    75
## 498 7263095 8935981360      3450.20 533176599        205.86 5318.9    75

merge()

The merge() function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join two data frames x, y, by matching the columns “SomeXCol” and “SomeYCol”

Using order() and manual tricks versus merge()

So, do bigger cities mean more driving?

# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
     xlab="Miles driven per person per day",
     ylab="City area in square miles")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")

Summary