---
title: "Data Frames and Apply"
author: "Statistical Computing, 36-350"
date: "Tuesday September 14, 2021"
---
Last week: Indexing and iteration
===
- Three ways to index vectors, matrices, data frames, lists: integers, Booleans, names
- Boolean on-the-fly indexing can be very useful
- Named indexing will be especially useful for data frames and lists
- Indexing lists can be a bit tricky (beware of the difference between `[ ]` and `[[ ]]`)
- `if()`, `else if()`, `else`: standard conditionals
- `ifelse()`: shortcut for using `if()` and `else` in combination
- `switch()`: shortcut for using `if()`, `elseif()`, and `else` in combination
- `for()`, `while()`: standard loop constructs
- Don't overuse explicit `for()` loops, vectorization is your friend!
- apply and map functions: can also be very useful (we'll see them today and next week, respectively)
Part I
===
*Data frames*
Data frames
===
The format for the "classic" data table in statistics: **data frame**. Lots of the "really-statistical" parts of the R programming language presume data frames
- Think of each row as an observation/case
- Think of each column as a variable/feature
- Not just a matrix because variables can have different types
- Both rows and columns can be assigned names
Difference between data frames and lists? Each column in a data frame must have the same length (each element in the list can be of different lengths)
Creating a data frame
===
Use `data.frame()`, similar to how we create lists
```{r}
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
# Recall, a list can have different lengths for different elements!
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
```
Indexing a data frame
===
- By rows/columns: similar to how we index matrices
- By columns only: similar to how we index lists
```{r}
my.df[,1] # Also works for a matrix
my.df[,"nums"] # Also works for a matrix
my.df$nums # Doesn't work for a matrix, but works for a list
my.df$chars # Ditto
```
Creating a data frame from a matrix
===
Often times it's helpful to start with a matrix, and add columns (of different data types) to make it a data frame
```{r}
class(state.x77) # Built-in matrix of states data, 50 states x 8 variables
head(state.x77)
class(state.region) # Factor of regions for the 50 states
head(state.region)
class(state.division) # Factor of divisions for the 50 states
head(state.division)
```
---
```{r}
# Combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
head(state.df) # Note that the first 8 columns name carried over from state.x77
```
Adding columns to a data frame
===
To add columns: we can either use `data.frame()`, or directly define a new named column
```{r}
# First way: use data.frame() to concatenate on a new column
state.df = data.frame(state.df, Cool=sample(c(T,F), nrow(state.df), rep=TRUE))
head(state.df, 4)
# Second way: just directly define a new named column
state.df$Score = sample(1:100, nrow(state.df), replace=TRUE)
head(state.df, 4)
```
Deleting columns from a data frame
===
To delete columns: we can either use negative integer indexing, or set a column to `NULL`
```{r}
# First way: use negative integer indexing
state.df = state.df[,-ncol(state.df)]
head(state.df, 4)
# Second way: just directly set a column to NULL
state.df$Cool = NULL
head(state.df, 4)
```
Reminder: Boolean indexing
===
With matrices or data frames, we'll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing
```{r}
# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state.df[state.df$Division == "New England", "Frost"])
mean(state.df[state.df$Division == "Pacific", "Frost"]) # Those wimps!
```
`subset()`: extract rows based on a condition
===
The `subset()` function provides a convenient alternative way of accessing rows for data frames
```{r}
# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
# Same calculation as in the last slide, using subset()
mean(subset(state.df, Division == "New England")$Frost)
mean(subset(state.df, Division == "Pacific")$Frost) # Wimps
```
Part II
===
*`apply()`*
The apply family
===
R offers a family of **apply functions**, which allow you to apply a function across different chunks of data. Offers an alternative to explicit iteration using `for()` loop; can be simpler and faster, though not always. Summary of functions:
- `apply()`: apply a function to rows or columns of a matrix or data frame
- `lapply()`: apply a function to elements of a list or vector
- `sapply()`: same as the above, but simplify the output (if possible)
- `tapply()`: apply a function to levels of a factor vector
`apply()`: rows or columns of a matrix or data frame
===
The `apply()` function takes inputs of the following form:
- `apply(x, MARGIN=1, FUN=my.fun)`, to apply `my.fun()` across rows of a matrix or data frame `x`
- `apply(x, MARGIN=2, FUN=my.fun)`, to apply `my.fun()` across columns of a matrix or data frame `x`
```{r}
apply(state.x77, MARGIN=2, FUN=min) # Minimum entry in each column
apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column
apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column
apply(state.x77, MARGIN=2, FUN=summary) # Summary of each col, get back matrix!
```
Applying a custom function
===
For a custom function, we can just define it before hand, and the use `apply()` as usual
```{r}
# Our custom function: trimmed mean
trimmed.mean = function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean)
```
We'll learn more about functions later (don't worry too much at this point about the details of the function definition)
Applying a custom function "on-the-fly"
===
Instead of defining a custom function before hand, we can just define it "on-the-fly". Sometimes this is more convenient
```{r}
# Compute trimmed means, defining this on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
})
```
Applying a function that takes extra arguments
===
Can tell `apply()` to pass **extra arguments** to the function in question. E.g., can use: `apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2)`, for two extra arguments `extra.arg.1`, `extra.arg.2` to be passed to `my.fun()`
```{r}
# Our custom function: trimmed mean, with user-specified percentiles
trimmed.mean = function(v, p1, p2) {
q1 = quantile(v, prob=p1)
q2 = quantile(v, prob=p2)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean, p1=0.01, p2=0.99)
```
What's the return argument?
===
What kind of data type will `apply()` give us? Depends on what function we pass. Summary, say, with `FUN=my.fun()`:
- If `my.fun()` returns a single value, then `apply()` will return a vector
- If `my.fun()` returns k values, then `apply()` will return a matrix with k rows (note: this is true *regardless* of whether `MARGIN=1` or `MARGIN=2`)
- If `my.fun()` returns different length outputs for different inputs, then `apply()` will return a list
- If `my.fun()` returns a list, then `apply()` will return a list
We'll grapple with this on the lab. This is one main advantage of `purrr` package: there is a much more transparent return object type
Optimized functions for special tasks
===
**Don't overuse** the apply paradigm! There's lots of special functions that **optimized** are will be both simpler and faster than using `apply()`. E.g.,
- `rowSums()`, `colSums()`: for computing row, column sums of a matrix
- `rowMeans()`, `colMeans()`: for computing row, column means of a matrix
- `max.col()`: for finding the maximum position in each row of a matrix
Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
```{r}
x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })
# Do this insted (much faster, simpler)
rowSums(x > 0)
```
Part III
===
*`lapply()`, `sapply()`, `tapply()`*
`lapply()`: elements of a list or vector
===
The `lapply()` function takes inputs as in: `lapply(x, FUN=my.fun)`, to apply `my.fun()` across elements of a list or vector `x`. The output is always a list
```{r}
my.list
lapply(my.list, FUN=mean) # Get a warning: mean() can't be applied to chars
lapply(my.list, FUN=summary)
```
`sapply()`: elements of a list or vector
===
The `sapply()` function works just like `lapply()`, but tries to **simplify** the return value whenever possible. E.g., most common is the conversion from a list to a vector
```{r}
sapply(my.list, FUN=mean) # Simplifies the result, now a vector
sapply(my.list, FUN=summary) # Can't simplify, so still a list
```
`tapply()`: levels of a factor vector
===
The function `tapply()` takes inputs as in: `tapply(x, INDEX=my.index, FUN=my.fun)`, to apply `my.fun()` to subsets of entries in `x` that share a common level in `my.index`
```{r}
# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=sd)
```
`split()`: split by levels of a factor
===
The function `split()` split up the rows of a data frame by levels of a factor, as in: `split(x, f=my.index)` to split a data frame `x` according to levels of `my.index`
```{r}
# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list
names(state.by.reg) # This has 4 elements for the 4 regions
class(state.by.reg[[1]]) # Each element is a data frame
```
---
```{r}
# For each region, display the first 3 rows of the data frame
lapply(state.by.reg, FUN=head, 3)
```
---
```{r}
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) {
return(apply(df, MARGIN=2, mean))
})
```
Summary
===
- Data frames are a representation of the "classic" data table in R: rows are observations/cases, columns are variables/features
- Each column can be a different data type (but must be the same length)
- `subset()`: function for extracting rows of a data frame meeting a condition
- `split()`: function for splitting up rows of a data frame, according to a factor variable
- `apply()`: function for applying a given routine to rows or columns of a matrix or data frame
- `lapply()`: similar, but used for applying a routine to elements of a vector or list
- `sapply()`: similar, but will try to simplify the return type, in comparison to `lapply()`
- `tapply()`: function for applying a given routine to groups of elements in a vector or list, according to a factor variable