Statistical Computing, 36-350
Wednesday October 12, 2016
R offers a family of apply functions, which allow you to apply a function across different chunks of data. They offer an alternative to explicit iteration with a for() loop, and can be simpler and faster than a for() loop, though not always
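To make the comparison with a for() loop concrete, here is a minimal sketch that computes row maxima both ways. The matrix m and the names row.max.loop and row.max.apply are made up for this illustration and do not appear elsewhere in these notes.

```r
# Hypothetical 4 x 3 matrix, filled column by column with 1, ..., 12
m = matrix(1:12, nrow=4, ncol=3)

# Explicit iteration: preallocate the result, then fill it in row by row
row.max.loop = numeric(nrow(m))
for (i in 1:nrow(m)) {
  row.max.loop[i] = max(m[i, ])
}

# The same computation with apply(): MARGIN=1 means "over rows"
row.max.apply = apply(m, MARGIN=1, FUN=max)

# Both hold the largest entry of each row
all(row.max.loop == row.max.apply)
```

The apply() version avoids the bookkeeping (preallocation and indexing), which is usually where the gain in simplicity comes from; whether it is also faster depends on the function being applied.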
Below is a summary. We’ll cover apply() today, and the rest next time
apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

apply(), rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:
apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x

x = matrix(rnorm(9), 3, 3) # Create a 3 x 3 matrix of random normals
x
##            [,1]      [,2]       [,3]
## [1,]  1.7999350  1.529366  0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,]  0.3861865  1.944046 -0.7270550
apply(x, MARGIN=1, FUN=min) # Smallest entry in each row
## [1]  0.7187638 -1.9895921 -0.7270550
apply(x, MARGIN=1, FUN=sum) # Sum of entries in each row
## [1]  4.048065 -2.592469  1.603177
head(state.x77) # Matrix of states data, 50 states x 8 variables
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##    21198.0     6315.0        2.8       73.6       15.1       67.3 
##      Frost       Area 
##      188.0   566432.0
apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##          5          2         18         11          1         44 
##      Frost       Area 
##         28          2
apply(state.x77, MARGIN=2, FUN=summary) # Summary of each col, get back matrix!
##         Population Income Illiteracy Life Exp Murder HS Grad  Frost   Area
## Min.           365   3098      0.500    67.96  1.400   37.80   0.00   1049
## 1st Qu.       1080   3993      0.625    70.12  4.350   48.05  66.25  36990
## Median        2838   4519      0.950    70.68  6.850   53.25 114.50  54280
## Mean          4246   4436      1.170    70.88  7.378   53.11 104.50  70740
## 3rd Qu.       4968   4814      1.575    71.89 10.680   59.15 139.80  81160
## Max.         21200   6315      2.800    73.60 15.100   67.30 188.00 566400

For a custom function, we can just define it beforehand, and then use apply() as usual
# Our custom function
my.fun = function(v) {
  q1 = quantile(v, probs=0.1) # 0.1 quantile of v
  q2 = quantile(v, probs=0.9) # 0.9 quantile of v
  cat(paste("The 0.1 quantile is", q1, "! "))
  cat(paste("The 0.9 quantile is", q2, "!\n"))
  v.mean = mean(v) # Regular mean
  v.trimmed.mean = mean(v[q1 <= v & v <= q2]) # Trimmed mean!
  c(v.mean, v.trimmed.mean)
}
mat = apply(state.x77, MARGIN=2, FUN=my.fun) # We get back a matrix
## The 0.1 quantile is 632.3 ! The 0.9 quantile is 10781.2 !
## The 0.1 quantile is 3623.3 ! The 0.9 quantile is 5117.5 !
## The 0.1 quantile is 0.6 ! The 0.9 quantile is 2.11 !
## The 0.1 quantile is 69.048 ! The 0.9 quantile is 72.582 !
## The 0.1 quantile is 2.67 ! The 0.9 quantile is 11.66 !
## The 0.1 quantile is 40.96 ! The 0.9 quantile is 62.96 !
## The 0.1 quantile is 20 ! The 0.9 quantile is 168.4 !
## The 0.1 quantile is 7795.5 ! The 0.9 quantile is 114216.5 !
mat # First row is the mean, second row is the trimmed mean
##      Population   Income Illiteracy Life Exp Murder HS Grad    Frost
## [1,]   4246.420 4435.800    1.17000 70.87860 7.3780 53.1080 104.4600
## [2,]   3384.275 4430.075    1.07381 70.91775 7.2975 53.3375 104.6829
##          Area
## [1,] 70735.88
## [2,] 56575.72

Instead of defining a custom function beforehand, we can just define it “on-the-fly”. Sometimes this is more convenient
# Compute trimmed means, defining custom function on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) { 
  q1 = quantile(v, probs=0.1)
  q2 = quantile(v, probs=0.9)
  mean(v[q1 <= v & v <= q2]) # Trimmed mean!
})
##  Population      Income  Illiteracy    Life Exp      Murder     HS Grad 
##  3384.27500  4430.07500     1.07381    70.91775     7.29750    53.33750 
##       Frost        Area 
##   104.68293 56575.72500

Sometimes the function we want to apply over rows or columns of a matrix takes extra arguments (besides the row or column itself). We can pass these as inputs to apply(), as in: apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2), for two extra arguments extra.arg.1, extra.arg.2 to be passed to my.fun()
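As a smaller sketch of the same mechanism, the extra argument can also belong to a built-in function. Here probs is forwarded by apply() to quantile() on every column. The names q and q.wrapped are made up for this illustration.

```r
# state.x77 ships with base R (datasets package)
# quantile() receives each column as its first argument; probs is an extra
# argument that apply() passes along unchanged on every call
q = apply(state.x77, MARGIN=2, FUN=quantile, probs=c(0.1, 0.9))
dim(q) # One row per requested quantile, one column per variable

# Equivalent call that wraps the extra argument in an on-the-fly function
q.wrapped = apply(state.x77, MARGIN=2,
                  FUN=function(v) { quantile(v, probs=c(0.1, 0.9)) })
all.equal(q, q.wrapped)
```

Naming the extra argument (probs=...) is a safe habit, since it avoids any confusion with the arguments of apply() itself.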
# Function that gets indices of the biggest 3 entries of v, then returns the
# corresponding 3 elements of names.v
top.3.names = function(v, names.v) { names.v[order(v, decreasing=TRUE)[1:3]] }
# Now we'll run this function on each column of state.x77. Note: here v will
# be a column, and for names.v, we'll pass in rownames(state.x77), i.e., the
# state names
apply(state.x77, MARGIN=2, FUN=top.3.names, names.v=rownames(state.x77))
##      Population   Income        Illiteracy       Life Exp    Murder     
## [1,] "California" "Alaska"      "Louisiana"      "Hawaii"    "Alabama"  
## [2,] "New York"   "Connecticut" "Mississippi"    "Minnesota" "Georgia"  
## [3,] "Texas"      "Maryland"    "South Carolina" "Utah"      "Louisiana"
##      HS Grad  Frost           Area        
## [1,] "Utah"   "Nevada"        "Alaska"    
## [2,] "Alaska" "North Dakota"  "Texas"     
## [3,] "Nevada" "New Hampshire" "California"

What kind of data type will apply() give us? It depends on what function we pass. To summarize, say, with FUN=my.fun:
If my.fun() returns a single value, then apply() will return a vector
If my.fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN=1 or MARGIN=2)
If my.fun() returns different length output for different inputs, then apply() will return a list
If my.fun() returns a list, then apply() will return a list

Don’t overuse the apply paradigm! There are lots of special functions that are optimized and will be both simpler and faster than using apply(). E.g.,
rowSums(), colSums(): for computing row, column sums of a matrix
rowMeans(), colMeans(): for computing row, column means of a matrix
max.col(): for finding the maximum position in each row of a matrix

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
x
##            [,1]      [,2]       [,3]
## [1,]  1.7999350  1.529366  0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,]  0.3861865  1.944046 -0.7270550
# DON'T do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { sum(v>0) })
## [1] 3 0 2
# DO do this (much faster, simpler for big matrices)
rowSums(x > 0)
## [1] 3 0 2
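
As a recap, the sketch below checks the output-type rules stated earlier on a small matrix, and confirms that a dedicated function agrees with its apply() counterpart. The matrix m and the names r1, r2, r3 are made up for this illustration.

```r
# Hypothetical 4 x 3 matrix; its rows are (1,5,9), (2,6,10), (3,7,11), (4,8,12)
m = matrix(1:12, nrow=4, ncol=3)

# One value per row: apply() returns a vector of length nrow(m)
r1 = apply(m, MARGIN=1, FUN=sum)

# k = 2 values per row: apply() returns a matrix with 2 rows,
# and one column for each row of m
r2 = apply(m, MARGIN=1, FUN=range)

# Different-length output per row (1, 1, 2, 2 entries exceed 6):
# apply() returns a list
r3 = apply(m, MARGIN=1, FUN=function(v) { v[v > 6] })

length(r1)  # 4
dim(r2)     # 2 4
class(r3)   # "list"

# The dedicated function gives the same row sums as the apply() call
all.equal(r1, rowSums(m))
```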