# The Apply Family

Wednesday October 12, 2016

R offers a family of apply functions, which allow you to apply a function across different chunks of data. They offer an alternative to explicit iteration with a for() loop, and can be simpler and faster than a for() loop, though not always

Below is a summary. We’ll cover apply() today, and the rest next time

• apply(): apply a function to rows or columns of a matrix or data frame
• lapply(): apply a function to elements of a list or vector
• sapply(): same as the above, but simplify the output (if possible)
• tapply(): apply a function to subsets of a vector, grouped by the levels of a factor
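
As a rough preview of how the four functions differ, here is a minimal sketch on small made-up inputs. The objects m, lst, vals, and grp are hypothetical, introduced only for illustration:

```r
# Small made-up inputs, for illustration only
m = matrix(1:6, nrow=2)             # 2 x 3 matrix, rows are (1,3,5) and (2,4,6)
lst = list(a=1:3, b=4:6)            # List with two numeric vectors
vals = c(10, 20, 30, 40)            # Numeric vector
grp = factor(c("u", "v", "u", "v")) # Group label for each entry of vals

row.sums = apply(m, MARGIN=1, FUN=sum) # One sum per row of m
list.means = lapply(lst, mean)         # A list, one mean per element of lst
vec.means = sapply(lst, mean)          # Same values, simplified to a named vector
grp.means = tapply(vals, grp, mean)    # One mean per level of grp
```

Note that lapply() and sapply() compute the same values here; only the shape of what comes back differs.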

# apply(), rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:

• apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
• apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x
x = matrix(rnorm(9), 3, 3) # Create a 3 x 3 matrix of random normals
x
##            [,1]      [,2]       [,3]
## [1,]  1.7999350  1.529366  0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,]  0.3861865  1.944046 -0.7270550
apply(x, MARGIN=1, FUN=min) # Smallest entry in each row
## [1]  0.7187638 -1.9895921 -0.7270550
apply(x, MARGIN=1, FUN=sum) # Sum of entries in each row
## [1]  4.048065 -2.592469  1.603177
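
To make the connection with explicit iteration concrete, here is a minimal sketch that computes row minimums with a for() loop and then with apply(). The matrix x.demo and the seed are made up for illustration, so the numbers will not match the output above:

```r
set.seed(1) # Arbitrary seed, only so the sketch is reproducible
x.demo = matrix(rnorm(9), 3, 3)

# Explicit iteration: preallocate the result, then fill it in row by row
row.min.loop = numeric(nrow(x.demo))
for (i in 1:nrow(x.demo)) {
  row.min.loop[i] = min(x.demo[i, ])
}

# Same computation with apply()
row.min.apply = apply(x.demo, MARGIN=1, FUN=min)

all.equal(row.min.loop, row.min.apply) # Both give the row minimums
```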

# (Continued)

head(state.x77) # Matrix of states data, 50 states x 8 variables
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad
##    21198.0     6315.0        2.8       73.6       15.1       67.3
##      Frost       Area
##      188.0   566432.0
apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad
##          5          2         18         11          1         44
##      Frost       Area
##         28          2
apply(state.x77, MARGIN=2, FUN=summary) # Summary of each col, get back matrix!
##         Population Income Illiteracy Life Exp Murder HS Grad  Frost   Area
## Min.           365   3098      0.500    67.96  1.400   37.80   0.00   1049
## 1st Qu.       1080   3993      0.625    70.12  4.350   48.05  66.25  36990
## Median        2838   4519      0.950    70.68  6.850   53.25 114.50  54280
## Mean          4246   4436      1.170    70.88  7.378   53.11 104.50  70740
## 3rd Qu.       4968   4814      1.575    71.89 10.680   59.15 139.80  81160
## Max.         21200   6315      2.800    73.60 15.100   67.30 188.00 566400

# Applying a custom function

For a custom function, we can just define it beforehand, and then use apply() as usual

# Our custom function
my.fun = function(v) {
  q1 = quantile(v, probs=0.1)
  q2 = quantile(v, probs=0.9)
  cat(paste("The 0.1 quantile is", q1, "! "))
  cat(paste("The 0.9 quantile is", q2, "!\n"))
  v.mean = mean(v) # Regular mean
  v.trimmed.mean = mean(v[q1 <= v & v <= q2]) # Trimmed mean!
  c(v.mean, v.trimmed.mean)
}

mat = apply(state.x77, MARGIN=2, FUN=my.fun) # We get back a matrix
## The 0.1 quantile is 632.3 ! The 0.9 quantile is 10781.2 !
## The 0.1 quantile is 3623.3 ! The 0.9 quantile is 5117.5 !
## The 0.1 quantile is 0.6 ! The 0.9 quantile is 2.11 !
## The 0.1 quantile is 69.048 ! The 0.9 quantile is 72.582 !
## The 0.1 quantile is 2.67 ! The 0.9 quantile is 11.66 !
## The 0.1 quantile is 40.96 ! The 0.9 quantile is 62.96 !
## The 0.1 quantile is 20 ! The 0.9 quantile is 168.4 !
## The 0.1 quantile is 7795.5 ! The 0.9 quantile is 114216.5 !
mat # First row is the mean, second row is the trimmed mean
##      Population   Income Illiteracy Life Exp Murder HS Grad    Frost
## [1,]   4246.420 4435.800    1.17000 70.87860 7.3780 53.1080 104.4600
## [2,]   3384.275 4430.075    1.07381 70.91775 7.2975 53.3375 104.6829
##          Area
## [1,] 70735.88
## [2,] 56575.72

# Applying a custom function “on-the-fly”

Instead of defining a custom function beforehand, we can just define it “on-the-fly”. Sometimes this is more convenient

# Compute trimmed means, defining custom function on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) {
  q1 = quantile(v, probs=0.1)
  q2 = quantile(v, probs=0.9)
  mean(v[q1 <= v & v <= q2]) # Trimmed mean!
})
##  Population      Income  Illiteracy    Life Exp      Murder     HS Grad
##  3384.27500  4430.07500     1.07381    70.91775     7.29750    53.33750
##       Frost        Area
##   104.68293 56575.72500

# Applying a function that takes extra arguments

Sometimes the function we want to apply over rows or columns of a matrix takes extra arguments (besides the row or column itself). We can pass these as additional inputs to apply(), as in: apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2), where the two extra arguments extra.arg.1, extra.arg.2 are passed along to my.fun()

# Function that gets indices of the biggest 3 entries of v, then returns the
# corresponding 3 elements of names.v
top.3.names = function(v, names.v) { names.v[order(v, decreasing=TRUE)[1:3]] }
# Now we'll run this function on each column of state.x77. Note: here v will
# be a column, and for names.v, we'll pass in rownames(state.x77), i.e., the
# state names
apply(state.x77, MARGIN=2, FUN=top.3.names, names.v=rownames(state.x77))
##      Population   Income        Illiteracy       Life Exp    Murder
## [1,] "California" "Alaska"      "Louisiana"      "Hawaii"    "Alabama"
## [2,] "New York"   "Connecticut" "Mississippi"    "Minnesota" "Georgia"
## [3,] "Texas"      "Maryland"    "South Carolina" "Utah"      "Louisiana"
##      HS Grad  Frost           Area
## [1,] "Utah"   "Nevada"        "Alaska"
## [2,] "Alaska" "North Dakota"  "Texas"
## [3,] "Nevada" "New Hampshire" "California"
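
Extra arguments work the same way for built-in functions. The following is a minimal sketch, not part of the original example: it passes probs to quantile() and na.rm to mean() through apply(). The matrix m.na is made up for illustration:

```r
# Named extra arguments after FUN are passed along to FUN on every call
qs = apply(state.x77, MARGIN=2, FUN=quantile, probs=c(0.1, 0.9))
dim(qs) # 2 x 8: one row per requested quantile, one column per variable

# Same idea with a made-up matrix containing a missing value
m.na = matrix(c(1, NA, 3, 4), nrow=2) # Columns are (1, NA) and (3, 4)
means.default = apply(m.na, MARGIN=2, FUN=mean)           # NA for first column
means.narm = apply(m.na, MARGIN=2, FUN=mean, na.rm=TRUE)  # 1 and 3.5
```

Naming the extra arguments, as with probs= and na.rm= here, avoids relying on argument order.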

# What’s the return value?

What kind of data type will apply() give us? That depends on the function we pass. In summary, with FUN=my.fun:

• If my.fun() returns a single value, then apply() will return a vector
• If my.fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN=1 or MARGIN=2)
• If my.fun() returns different length output for different inputs, then apply() will return a list
• If my.fun() returns a list, then apply() will return a list
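
The four cases above can be checked on a small made-up matrix. This is a minimal sketch; the matrix m and the anonymous functions are hypothetical, chosen only to trigger each case:

```r
m = matrix(1:12, nrow=3) # 3 x 4 matrix, rows are (1,4,7,10), (2,5,8,11), (3,6,9,12)

# Single value per row: we get back a vector of length 3
r1 = apply(m, MARGIN=1, FUN=sum)

# k = 2 values per call: we get back a matrix with 2 rows, for either MARGIN
r2.rows = apply(m, MARGIN=1, FUN=range) # 2 x 3, one column per row of m
r2.cols = apply(m, MARGIN=2, FUN=range) # 2 x 4, one column per column of m

# Different length output for different rows: we get back a list
r3 = apply(m, MARGIN=1, FUN=function(v) { v[v > 4] })

# Each call returns a list: we get back a list
r4 = apply(m, MARGIN=1, FUN=function(v) { list(min=min(v), max=max(v)) })
```

In the matrix case, results for MARGIN=1 come back as columns, so a transpose with t() may be needed to line them up with the rows of the input.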

# Optimized functions for special tasks

Don’t overuse the apply paradigm! There are lots of special functions that are optimized for particular tasks, and will be both simpler and faster than using apply(). E.g.,

• rowSums(), colSums(): for computing row, column sums of a matrix
• rowMeans(), colMeans(): for computing row, column means of a matrix
• max.col(): for finding the maximum position in each row of a matrix

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?

x
##            [,1]      [,2]       [,3]
## [1,]  1.7999350  1.529366  0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,]  0.3861865  1.944046 -0.7270550
# DON'T do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { sum(v>0) })
## [1] 3 0 2
# DO do this (much faster, simpler for big matrices)
rowSums(x > 0)
## [1] 3 0 2
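
As a rough check of these claims, the sketch below compares the optimized functions against their apply() counterparts on a larger made-up matrix (big is hypothetical, and the seed is arbitrary). It uses ties.method="first" in max.col() as an assumption to match which.max(), since the default breaks ties at random:

```r
set.seed(1) # Arbitrary seed, for reproducibility only
big = matrix(rnorm(1e5), nrow=1000) # 1000 x 100 made-up matrix

# The optimized functions agree with their apply() counterparts
same.sums = all.equal(rowSums(big), apply(big, MARGIN=1, FUN=sum))
same.means = all.equal(colMeans(big), apply(big, MARGIN=2, FUN=mean))
same.pos = all(max.col(big, ties.method="first") ==
               apply(big, MARGIN=1, FUN=which.max))

# Rough timings; exact numbers will vary by machine and matrix size
system.time(apply(big, MARGIN=1, function(v) { sum(v > 0) }))
system.time(rowSums(big > 0))
```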