nchar(), substr(): functions for substring extractions and replacementsstrsplit(), paste(): functions for splitting and combining stringstable(): function to get word counts, useful way of summarizing text dataData frames
The format for the “classic” data table in statistics: data frame. Lots of the “really-statistical” parts of the R programming language presume data frames
Difference between data frames and lists? Each column in a data frame must have the same length (each element in the list can be of different lengths)
Use data.frame(), similar to how we create lists
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
## nums chars bools
## 1 0.1 a FALSE
## 2 0.2 b FALSE
## 3 0.3 c FALSE
## 4 0.4 d TRUE
## 5 0.5 e TRUE
## 6 0.6 f TRUE
# Note, a list can have different lengths for different elements!
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE TRUE TRUE FALSE TRUE FALSE
my.df[,1] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df[,"nums"] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$nums # Doesn't work for a matrix, but works for a list
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$chars # Note: this one has been converted into a factor data type
## [1] a b c d e f
## Levels: a b c d e f
as.character(my.df$chars) # Converting it back to a character data type
## [1] "a" "b" "c" "d" "e" "f"
Often times it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame
class(state.x77) # Built-in matrix of states data, 50 states x 8 variables
## [1] "matrix"
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
class(state.region) # Factor of regions for the 50 states
## [1] "factor"
head(state.region)
## [1] South West West South West West
## Levels: Northeast South North Central West
class(state.division) # Factor of divisions for the 50 states
## [1] "factor"
head(state.division)
## [1] East South Central Pacific Mountain
## [4] West South Central Pacific Mountain
## 9 Levels: New England Middle Atlantic ... Pacific
# Combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
## [1] "data.frame"
head(state.df) # Note that the first 8 columns name carried over from state.x77
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area Region Division
## Alabama 50708 South East South Central
## Alaska 566432 West Pacific
## Arizona 113417 West Mountain
## Arkansas 51945 South West South Central
## California 156361 West Pacific
## Colorado 103766 West Mountain
To add columns: we can either use data.frame(), or directly define a new named column
# First way: use data.frame() to concatenate on a new column
state.df = data.frame(state.df, Cool=sample(c(T,F), nrow(state.df), rep=TRUE))
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central FALSE
## Alaska West Pacific FALSE
## Arizona West Mountain TRUE
## Arkansas South West South Central TRUE
# Second way: just directly define a new named column
state.df$Score = sample(1:100, nrow(state.df), replace=TRUE)
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool Score
## Alabama South East South Central FALSE 86
## Alaska West Pacific FALSE 92
## Arizona West Mountain TRUE 66
## Arkansas South West South Central TRUE 23
To delete columns: we can either use negative integer indexing, or set a column to NULL
# First way: use negative integer indexing
state.df = state.df[,-ncol(state.df)]
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central FALSE
## Alaska West Pacific FALSE
## Arizona West Mountain TRUE
## Arkansas South West South Central TRUE
# Second way: just directly set a column to NULL
state.df$Cool = NULL
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division
## Alabama South East South Central
## Alaska West Pacific
## Arizona West Mountain
## Arkansas South West South Central
With matrices or data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing
# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state.df[state.df$Division == "New England", "Frost"])
## [1] 145.3333
mean(state.df[state.df$Division == "Pacific", "Frost"]) # Those wimps!
## [1] 49.6
subset()The subset() function provides a convenient alternative way of accessing rows for data frames
# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE
# Same calculation as in the last slide, using subset()
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333
mean(subset(state.df, Division == "Pacific")$Frost) # Wimps
## [1] 49.6
apply()
R offers a family of apply functions, which allow you to apply a function across different chunks of data. Offers an alternative to explicit iteration using for() loop; can be simpler and faster, though not always. Summary of functions:
apply(): apply a function to rows or columns of a matrix or data framelapply(): apply a function to elements of a list or vectorsapply(): same as the above, but simplify the output (if possible)tapply(): apply a function to levels of a factor vectorapply(), rows or columns of a matrix or data frameThe apply() function takes inputs of the following form:
apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame xapply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame xapply(state.x77, MARGIN=2, FUN=min) # Minimum entry in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 365.00 3098.00 0.50 67.96 1.40 37.80
## Frost Area
## 0.00 1049.00
apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 21198.0 6315.0 2.8 73.6 15.1 67.3
## Frost Area
## 188.0 566432.0
apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 5 2 18 11 1 44
## Frost Area
## 28 2
apply(state.x77, MARGIN=2, FUN=summary) # Summary of each col, get back matrix!
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Min. 365.00 3098.00 0.500 67.9600 1.400 37.800 0.00
## 1st Qu. 1079.50 3992.75 0.625 70.1175 4.350 48.050 66.25
## Median 2838.50 4519.00 0.950 70.6750 6.850 53.250 114.50
## Mean 4246.42 4435.80 1.170 70.8786 7.378 53.108 104.46
## 3rd Qu. 4968.50 4813.50 1.575 71.8925 10.675 59.150 139.75
## Max. 21198.00 6315.00 2.800 73.6000 15.100 67.300 188.00
## Area
## Min. 1049.00
## 1st Qu. 36985.25
## Median 54277.00
## Mean 70735.88
## 3rd Qu. 81162.50
## Max. 566432.00
For a custom function, we can just define it before hand, and the use apply() as usual
# Our custom function: trimmed mean
trimmed.mean = function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean)
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
We’ll learn more about functions later (don’t worry too much at this point about the details of the function definition)
Instead of defining a custom function before hand, we can just define it “on-the-fly”. Sometimes this is more convenient
# Compute trimmed means, defining this on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
})
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
Can tell apply() to pass extra arguments to the function in question. E.g., can use: apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2), for two extra arguments extra.arg.1, extra.arg.2 to be passed to my.fun()
# Our custom function: trimmed mean, with user-specified percentiles
trimmed.mean = function(v, p1, p2) {
q1 = quantile(v, prob=p1)
q2 = quantile(v, prob=p2)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean, p1=0.01, p2=0.99)
## Population Income Illiteracy Life Exp Murder
## 3974.125000 4424.520833 1.136735 70.882708 7.341667
## HS Grad Frost Area
## 53.131250 104.895833 61860.687500
What kind of data type will apply() give us? Depends on what function we pass. Summary, say, with FUN=my.fun():
my.fun() returns a single value, then apply() will return a vectormy.fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN=1 or MARGIN=2)my.fun() returns different length outputs for different inputs, then apply() will return a listmy.fun() returns a list, then apply() will return a listWe’ll grapple with this on the lab. One advantage of **ply() functions in the plyr package: there is a much more transparent return object type
Don’t overuse the apply paradigm! There’s lots of special functions that optimized are will be both simpler and faster than using apply(). E.g.,
rowSums(), colSums(): for computing row, column sums of a matrixrowMeans(), colMeans(): for computing row, column means of a matrixmax.col(): for finding the maximum position in each row of a matrixCombining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })
## [1] 2 2 3
# Do this insted (much faster, simpler)
rowSums(x > 0)
## [1] 2 2 3
lapply(), vapply(), tapply()
lapply(), elements of a list or vectorThe lapply() function takes inputs as in: lapply(x, FUN=my.fun), to apply my.fun() across elements of a list or vector x. The output is always a list
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE TRUE TRUE FALSE TRUE FALSE
lapply(my.list, FUN=mean) # Get a warning: mean() can't be applied to chars
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $nums
## [1] 0.35
##
## $chars
## [1] NA
##
## $bools
## [1] 0.5
lapply(my.list, FUN=summary)
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 3 3
sapply(), elements of a list or vectorThe sapply() function works just like lapply(), but tries to simplify the return value whenever possible. E.g., most common is the conversion from a list to a vector
sapply(my.list, FUN=mean) # Simplifies the result, now a vector
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## nums chars bools
## 0.35 NA 0.50
sapply(my.list, FUN=summary) # Can't simplify, so still a list
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 3 3
tapply(), levels of a factor vectorThe function tapply() takes inputs as in: tapply(x, INDEX=my.index, FUN=my.fun), to apply my.fun() to subsets of entries in x that share a common level in my.index
# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)
## Northeast South North Central West
## 132.7778 64.6250 138.8333 102.1538
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=sd)
## Northeast South North Central West
## 30.89408 31.30682 23.89307 68.87652
split(), split by levels of a factorThe function split() split up the rows of a data frame by levels of a factor, as in: split(x, f=my.index) to split a data frame x according to levels of my.index
# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list
## [1] "list"
names(state.by.reg) # This has 4 elements for the 4 regions
## [1] "Northeast" "South" "North Central" "West"
class(state.by.reg[[1]]) # Each element is a data frame
## [1] "data.frame"
# For each region, display the first 3 rows of the data frame
lapply(state.by.reg, FUN=head, 3)
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## Maine 1058 3694 0.7 70.39 2.7 54.7 161
## Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103
## Area
## Connecticut 4862
## Maine 30920
## Massachusetts 7826
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Illinois 11197 5107 0.9 70.14 10.3 52.6 127 55748
## Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097
## Iowa 2861 4628 0.5 72.56 2.3 59.0 140 55941
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Area
## Alaska 566432
## Arizona 113417
## California 156361
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) {
return(apply(df, MARGIN=2, mean))
})
## $Northeast
## Population Income Illiteracy Life.Exp Murder
## 5495.111111 4570.222222 1.000000 71.264444 4.722222
## HS.Grad Frost Area
## 53.966667 132.777778 18141.000000
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4208.12500 4011.93750 1.73750 69.70625 10.58125 44.34375
## Frost Area
## 64.62500 54605.12500
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4803.00000 4611.08333 0.70000 71.76667 5.27500 54.51667
## Frost Area
## 138.83333 62652.00000
##
## $West
## Population Income Illiteracy Life.Exp Murder
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00
## HS.Grad Frost Area
## 6.200000e+01 1.021538e+02 1.344630e+05
subset(): function for extracting rows of a data frame meeting a conditionsplit(): function for splitting up rows of a data frame, according to a factor variableapply(): function for applying a given routine to rows or columns of a matrix or data framelapply(): similar, but used for applying a routine to elements of a vector or listsapply(): similar, but will try to simplify the return type, in comparison to lapply()tapply(): function for applying a given routine to groups of elements in a vector or list, according to a factor variable