%>% operator)dplyr is a package for data wrangling, with several key verbs (functions)slice() and filter(): subset rows based on numbers or conditionsselect(): select columnsarrange(): order rows by one or multiple columnsrename() and mutate(): rename or create columnsmutate_at(): apply a function to given columns%>%)Tidyverse functions are at their best when composed together using the pipe (%>%) operator!
library(tidyverse)
mtcars %>%
filter(., (mpg >= 14 & disp >= 200) | (drat <= 3)) %>%
head(., 2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Pro tip: use ctrl + shift + m in RStudio as a shortcut for typing %>%
Mastering the tidyr verbs
tidyr verbsOur tidyr journey starts of with learning the following verbs (functions):
gather(): make “wide” data longerspread(): make “long” data widerseparate(): split a single column into multiple columnsunite(): combine multiple columns into a single columnKey takeaway: as with dplyr, think of data frames as nouns and tidyr verbs as actions that you apply to manipulate them—especially natural when using pipes
gather()Use gather() to make “wide” data longer:
library(EDAWR) # Load some nice data sets
EDAWR::cases %>%
head(., 3)
## country 2011 2012 2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
EDAWR::cases %>%
gather(., "year", "n", 2:4) %>%
head(., 5)
## country year n
## 1 FR 2011 7000
## 2 DE 2011 5800
## 3 US 2011 15000
## 4 FR 2012 6900
## 5 DE 2012 6000
year columnntidyr did all the heavy lifting of the transposing workspread()Use spread() to make “long” data wider:
EDAWR::pollution %>%
head(., 5)
## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
EDAWR::pollution %>%
spread(., size, amount)
## city large small
## 1 Beijing 121 56
## 2 London 22 16
## 3 New York 23 14
tidyr did all the heavy lifting ofspread() and gather() are inversesseparate()EDAWR::storms %>%
head(., 3)
## # A tibble: 3 x 4
## storm wind pressure date
## <chr> <int> <int> <date>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
storms2 <- EDAWR::storms %>%
separate(., date, c("y", "m", "d"))
storms2
## # tibble [6 × 6]
## storm wind pressure y m d
## <chr> <int> <int> <chr> <chr> <chr>
## 1 Alberto 110 1007 2000 08 03
## 2 Alex 45 1009 1998 07 27
## 3 Allison 65 1005 1995 06 03
## 4 Ana 40 1013 1997 06 30
## 5 Arlene 50 1010 1999 06 11
## 6 Arthur 45 1010 1996 06 17
unite()Use unite() to combine multiple columns into a single column:
storms2 %>%
unite(., date, y, m, d, sep = "-")
## # A tibble: 6 x 4
## storm wind pressure date
## <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
Note that unite() and separate() are inverse operations
dplyr: group_by() and summarize()
dplyr and SQLdplyr you should find SQL very natural, and vice versa!group_by() and summarize(), which are used to aggregate data (now)left_join() and inner_join() verbs (shortly)group_by()Use group_by() to define a grouping of rows based on a column:
mtcars %>%
group_by(., cyl) %>%
head(., 6)
## # A tibble: 6 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
dplyr verbs actsummarise()Use summarise() (or summarize() for us Americans) to apply functions to rows—ungrouped or grouped—of a data frame:
# Ungrouped
mtcars %>%
summarise(.,
mpg=mean(mpg),
hp=mean(hp))
## mpg hp
## 1 20.09062 146.6875
# Grouped by number of cylinders
mtcars %>%
group_by(., cyl) %>%
summarise(.,
mpg=mean(mpg),
hp=mean(hp))
## # A tibble: 3 x 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
mtcars %>%
group_by(., cyl) %>%
summarise(.,
mpg_mean=mean(mpg),
mpg_max=max(mpg),
hp_mean=mean(hp),
hp_max=max(hp))
## # A tibble: 3 x 5
## cyl mpg_mean mpg_max hp_mean hp_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 33.9 82.6 113
## 2 6 19.7 21.4 122. 175
## 3 8 15.1 19.2 209. 335
ungroup()Use ungroup() to remove groupings from a data frame:
mtcars %>%
group_by(., cyl) %>%
ungroup(.) %>%
summarise(.,
hp=mean(hp),
mpg=mean(mpg))
## # A tibble: 1 x 2
## hp mpg
## <dbl> <dbl>
## 1 147. 20.1
In last week’s lab we encountered the following example:
str.url <- "http://www.stat.cmu.edu/~ryantibs/statcomp-F18/data/trump.txt"
str.url %>%
readLines %>%
paste(collapse=" ") %>%
strsplit(split="[[:space:]]|[[:punct:]]") %>%
unlist(x = .) %>%
.[. != ""] %>%
table(.) %>%
sort(decreasing=TRUE) %>%
head(.)
## .
## the and of to our will
## 189 146 127 126 90 83
This required a tricky indexing operation using %>% on vectors: .[. != ""]
We can perform the same operations using dplyr if we convert to a data frame:
str.url %>%
readLines(.) %>%
paste(collapse=" ") %>%
strsplit(split="[[:space:]]|[[:punct:]]") %>%
unlist(.) %>%
as_data_frame() %>% # Convert to a data frame (dplyr version)
rename(word_list = value) %>% # Rename the column to `word_list`
filter(word_list != "") %>% # Now indexing is easy!
group_by(word_list) %>%
summarise(word_count = n()) %>%
arrange(desc(word_count)) %>%
slice(1:6)
## # A tibble: 6 x 2
## word_list word_count
## <chr> <int>
## 1 the 189
## 2 and 146
## 3 of 127
## 4 to 126
## 5 our 90
## 6 will 83
dplyr: left_join() and inner_join()
A “join” operation in database terminology is a merging of two data frames for us. There are 4 types of joins:
Column values that cannot be filled in are assigned NA values
It helps to visualize the join types:
tab1 = data.frame(name = c("Alexis", "Bernie", "Charlie"),
children = 1:3,
stringsAsFactors = FALSE)
tab2 = data.frame(name = c("Alexis", "Bernie", "David"),
age = c(54, 34, 63),
stringsAsFactors = FALSE)
tab1
## name children
## 1 Alexis 1
## 2 Bernie 2
## 3 Charlie 3
tab2
## name age
## 1 Alexis 54
## 2 Bernie 34
## 3 David 63
inner_join()Suppose we want to join tab1 and tab2 by name, but keep only customers in intersection:
inner_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
left_join()Suppose we want to join tab1 and tab2 by name, but keep all customers from tab1:
left_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
## 3 Charlie 3 NA
right_join()Suppose we want to join tab1 and tab2 by name, but keep all customers from tab2:
right_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
## 3 David NA 63
full_join()Finally, suppose we want to join tab1 and tab2 by name, and keep all customers from both:
full_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
## 3 Charlie 3 NA
## 4 David NA 63
tidyr is a package for manipulating the structure of data framesgather(): make wide data longerspread(): make long data widerunite() and separate(): combine or split columnsdplyr has advanced functionality that mirrors SQLgroup_by(): create groups of rows according to a conditionsummarise(): apply computations across groups of rows*_join() where * = inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates how many rows to keep