---
title: "Split-Apply-Combine"
author: "Statistical Computing, 36-350"
date: "Monday November 9, 2016"
---
Reminder: iterating in R without `for()`
===
We've learned some tools in R for iteration without explicit `for()` loops:
- Indexing with conditionals + vectorization
- `apply()`: apply a function to rows or columns of a matrix or data frame
- `lapply()`: apply a function to elements of a list or vector
- `sapply()`: same as the above, but simplify the output (if possible)
- `tapply()`: apply a function to levels of a factor vector
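As a quick refresher, here is a minimal sketch of each tool on toy inputs. The objects `m`, `my.list`, `vals`, and `grp` are made up for illustration and are not part of the lecture data:

```{r}
big = (1:10)[(1:10) > 7]             # Indexing with a conditional: keeps 8, 9, 10

m = matrix(1:6, nrow=2)              # 2 x 3 matrix, rows are (1,3,5) and (2,4,6)
row.sums = apply(m, 1, sum)          # Apply sum() to each row (use 2 for columns)

my.list = list(a=1:3, b=4:6)
list.means = lapply(my.list, mean)   # Returns a list with elements a and b
vec.means = sapply(my.list, mean)    # Same values, simplified to a named vector

vals = c(1, 2, 3, 4)
grp = factor(c("x", "y", "x", "y"))
grp.sums = tapply(vals, INDEX=grp, FUN=sum) # Sum of vals within each level of grp
```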
Split-apply-combine
===
Today we will learn a general strategy that can be summarized in three conceptual steps:
- **Split** whatever data object we have into meaningful chunks
- **Apply** the function of interest to each element in this division
- **Combine** the results into a new object of the desired structure
These are conceptual steps; often the apply and combine steps can be performed for us by a single call to the appropriate function from the `apply()` family
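The three steps can be sketched on a small example. This one assumes the `iris` data set that ships with R (it is not the data used in this lecture) and computes the mean petal length per species, first step by step and then with apply and combine merged into one call:

```{r}
# Split: petal lengths divided into one numeric vector per species
chunks = split(iris$Petal.Length, f=iris$Species)

# Apply: compute the mean of each chunk, giving a list
chunk.means = lapply(chunks, mean)

# Combine: collapse the list into a named numeric vector
means.vec = unlist(chunk.means)

# Apply and combine in a single call
means.sapply = sapply(chunks, mean)
all.equal(means.vec, means.sapply) # Both routes agree
```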
Simple but powerful
===
Does split-apply-combine sound simple? It is, but it's very powerful when combined with the right data structures
- As usual, compared to explicit `for()` loops, often requires far less code
- Fits nicely with our previous ideas of **top-down function design**, simply at a broader scope
- Sets you in the right direction towards learning how to use MapReduce/Hadoop for really, really big data sets
Strikes data set
===
Data set on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables:
- `country`, `year`: country and year of data collection
- `strike.volume`: days on strike per 1000 workers
- `unemployment`: unemployment rate
- `inflation`: inflation rate
- `left.parliament`: leftwing share of the government
- `centralization`: centralization of unions
- `density`: density of unions
```{r}
strikes.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/strikes.csv")
dim(strikes.df) # Since 18 × 35 = 630, some years missing from some countries
head(strikes.df)
```
An interesting question
===
Is there a relationship between a country's ruling party alignment (left versus right) and the volume of strikes?
![washington](http://www.stat.cmu.edu/~ryantibs/statcomp/lectures/washington63.jpg)
![madison](http://www.stat.cmu.edu/~ryantibs/statcomp/lectures/madison11.jpg)
How could we approach this?
- Worst way: by hand, write 18 separate code blocks
- Bad way: explicit `for()` loop, where we loop over countries
- Best way: split appropriately, then use `sapply()`
Work with just one chunk of data
===
Step 0: design a function that will work on one chunk of data
So let's write code to do regression on the data from (say) just Italy
```{r}
strikes.df.italy = strikes.df[strikes.df$country=="Italy", ] # Data for Italy
dim(strikes.df.italy)
head(strikes.df.italy)
italy.lm = lm(strike.volume ~ left.parliament, data=strikes.df.italy)
summary(italy.lm)
plot(strikes.df.italy$left.parliament, strikes.df.italy$strike.volume,
     main="Italy strike volume versus leftwing alignment",
     ylab="Strike volume", xlab="Leftwing alignment")
abline(coef(italy.lm), col=2)
```
(Continued)
===
Now let's turn this into a function
```{r}
my.strike.lm = function(country.df) {
  coef(lm(strike.volume ~ left.parliament, data=country.df))
}
my.strike.lm(strikes.df.italy) # New way
coef(italy.lm) # Old way, same result
```
Split our data into appropriate chunks
===
Step 1: split our data into appropriate chunks, each of which can be handled by our function. Here, the function `split()` is often helpful: `split(df, f=my.factor)` splits a data frame `df` into several data frames, defined by constant levels of the factor `my.factor`
So we want to split `strikes.df` into 18 smaller data frames, each of which has the data for just one country
```{r}
strikes.by.country = split(strikes.df, f=strikes.df$country)
class(strikes.by.country) # It's a list
names(strikes.by.country) # It has one element for each country
head(strikes.by.country$Italy) # Same as what we saw before
```
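One property worth checking after any split is that the chunks partition the rows: every row lands in exactly one chunk. A minimal sketch, assuming the built-in `mtcars` data (not the strikes data) split by number of cylinders:

```{r}
cars.by.cyl = split(mtcars, f=mtcars$cyl)
names(cars.by.cyl)                  # One element per distinct value of cyl
chunk.rows = sapply(cars.by.cyl, nrow) # Number of rows in each chunk
sum(chunk.rows) == nrow(mtcars)     # Chunk sizes add up to the full data frame
```

The same check applies to `strikes.by.country` and `strikes.df`.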
Apply our function and combine the results
===
Steps 2 and 3: apply our function to each chunk of data, and combine the results. Here, the functions `lapply()` or `sapply()` are often helpful
So we want to apply `my.strike.lm()` to each data frame in `strikes.by.country`. Think about what the output will be from each function call: a vector of length 2 (intercept and slope), so we can use `sapply()` to combine the results into a matrix with one column per country
```{r}
strikes.coefs = sapply(strikes.by.country, FUN=my.strike.lm)
strikes.coefs
# We don't care about the intercepts, only the slopes (2nd row).
# Some are positive, some are negative! Let's plot them:
plot(1:ncol(strikes.coefs), strikes.coefs[2,], xaxt="n",
     xlab="", ylab="Regression coefficient",
     main="Countrywise labor activity by leftwing score")
axis(side=1, at=1:ncol(strikes.coefs),
     labels=colnames(strikes.coefs), las=2, cex.axis=0.5)
abline(h=0, col="grey")
```
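For comparison, here is a sketch of what the explicit `for()` loop version involves. So that it runs without downloading anything, it assumes the built-in `mtcars` data split by `cyl`, and `my.mpg.lm()` is a hypothetical stand-in that plays the role of `my.strike.lm()`:

```{r}
# Hypothetical per-chunk function: intercept and slope of mpg on weight
my.mpg.lm = function(chunk.df) {
  coef(lm(mpg ~ wt, data=chunk.df))
}
cars.by.cyl = split(mtcars, f=mtcars$cyl)

# Explicit loop: preallocate the result, label it, fill it column by column
loop.coefs = matrix(NA, nrow=2, ncol=length(cars.by.cyl))
colnames(loop.coefs) = names(cars.by.cyl)
for (i in 1:length(cars.by.cyl)) {
  loop.coefs[,i] = my.mpg.lm(cars.by.cyl[[i]])
}

# Split-apply-combine: one line, with row and column names carried along
sapply.coefs = sapply(cars.by.cyl, FUN=my.mpg.lm)
all.equal(unname(loop.coefs), unname(sapply.coefs)) # Same numbers either way
```

The loop needs the output shape worked out in advance, while `sapply()` infers it from what the function returns.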