36-350
6 October 2014
data("cats",package="MASS")
Problem: get the positions in a vector / columns in a matrix / rows in a dataframe matching some condition
head(cats$Hwt[cats$Sex=="M"])
[1] 6.5 6.5 10.1 7.2 7.6 7.9
head(cats[cats$Sex=="M","Hwt"])
[1] 6.5 6.5 10.1 7.2 7.6 7.9
cats$Sex=="M"
is a Boolean vector, as long as cats$Sex
training_rows <- sample(1:nrow(cats),size=nrow(cats)/2)
head(training_rows)
[1] 79 7 55 94 90 103
cats.trained <- cats[training_rows,]
head(cats.trained)
Sex Bwt Hwt
79 M 2.7 8.0
7 F 2.1 8.1
55 M 2.2 9.1
94 M 2.8 13.5
90 M 2.8 10.2
103 M 3.0 11.6
males <- cats$Sex=="M"
Such Boolean index vectors can be saved and re-used (e.g., with the apply tricks below)
What does not work: movies$gross[movies$genre], if you are trying to get all the gross revenues of the movies of a certain genre
What people often write instead is an explicit loop:
gross.comedy <- c()   # start with an empty vector
for (i in 1:nrow(movies)) {
if (movies$genre[i]=="comedy") {
gross.comedy <- c(gross.comedy, movies$gross[i])
}
}
Simpler, faster, and clearer:
movies$gross[movies$genre=="comedy"]
or
movies[movies$genre=="comedy","gross"]
dim(is.na(cats)) # checks each element for being NA
[1] 144 3
mean(log(cats$Hwt))
[1] 2.339
rnorm(n=5,mean=-2:2,sd=(1:5)/5)
[1] -1.8038 -1.0411 -0.2527 0.3155 4.3772
grep(pattern, x) returns the positions of the elements of x matching the regular expression pattern
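For instance, using the built-in state.name vector (an added illustration, not from the original notes):
grep("New", state.name)               # positions of states whose names contain "New"
grep("New", state.name, value=TRUE)   # the matching names themselves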
The apply family of functions: sapply or lapply apply a function to each element of a vector or list
mean.omitting.one <- function(i,x) {mean(x[-i])}
jackknifed.means <- sapply(1:nrow(cats),mean.omitting.one,x=cats$Bwt)
length(jackknifed.means)
[1] 144
sd(jackknifed.means)
[1] 0.003394
sapply tries to return a vector or an array (with one column per entry in the original vector); if that's not what you want, use lapply, which just returns a list
To apply a function FUN to every row of an array or dataframe X:
apply(X,1,FUN)
rows_with_NAs <- apply(is.na(movies),1,any)
apply tries to return a vector or an array; it will return a list if it can't
apply assumes FUN will work on a row of X; you might need to write a little adapter function to make that true:
# Make 3rd and 5th cols. of X the 1st and 2nd args. of f
apply(X,1,function(z){f(z[3],z[5])})
To apply FUN to every column of an array or dataframe X:
apply(X,2,FUN)
apply(cats[,2:3],2,median)
Bwt Hwt
2.7 10.1
Suppose you have a function f which takes 2+ arguments, and vectors x, y, …, z; to compute f(x[1],y[1],...,z[1]), f(x[2],y[2],...,z[2]), etc., use
mapply(FUN=f,x,y,z)
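For instance, with a made-up function f (an added illustration, not from the original notes):
f <- function(x,y,z) { x + y*z }
mapply(FUN=f, 1:3, 4:6, 7:9)   # f(1,4,7), f(2,5,8), f(3,6,9), i.e. 29 42 57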
You go to analysis with the data you have, not the data you want.
The variables in the data are often either not what's most relevant to the analysis, or they're not arranged conveniently, or both
Satisfying model assumptions is a big issue here
\( \therefore \) often want to transform the data to make it closer to the data we wish we had to start with
Lossless transformations: the original data could be recovered exactly
(at least in principle; function is invertible; same \( \sigma \)-algebra)
Lossy transformations irreversibly destroy some information
(function is not invertible; new \( \sigma \)-algebra is coarser)
Many common transformations are lossless
Many useful transformations are lossy, sometimes very lossy
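A quick contrast using the cats data loaded above (an added illustration): log is lossless on positive values, while rank throws away the actual magnitudes.
all.equal(exp(log(cats$Bwt)), cats$Bwt)   # TRUE: the original weights come back
head(rank(cats$Bwt))                      # ranks only; the weights themselves are gone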
Because you're documenting your transformations in commented code
yes?
and kept a safe copy of the original data on the disk
yes?
and your disk is backed up regularly
YES?!?
you can use even very lossy transformations without fear
log: Because \( Y = f(X)g(Z) \Leftrightarrow \log{Y} = \log{f(X)} + \log{g(Z)} \), taking logs lets us use linear or additive models when the real relationship is multiplicative
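A small simulated illustration (made-up data, not from the notes): if \( Y = 2 X^3 \times \mathrm{noise} \), then \( \log{Y} \) is linear in \( \log{X} \) with slope 3.
x <- runif(100, min=1, max=10)
y <- 2 * x^3 * exp(rnorm(100, sd=0.1))
coef(lm(log(y) ~ log(x)))   # intercept near log(2), slope near 3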
How would you take the log of a whole column? (log is vectorized, so just apply it to the column)
Centering and scaling columns:
head(scale(cats[,-1],center=TRUE,scale=TRUE))
Bwt Hwt
1 -1.491 -1.4912
2 -1.491 -1.3269
3 -1.491 -0.4644
4 -1.285 -1.4091
5 -1.285 -1.3680
6 -1.285 -1.2448
center=TRUE \( \Rightarrow \) subtract the mean; alternately, FALSE, or a vector of values to subtract
scale=TRUE \( \Rightarrow \) divide by the standard deviation, after centering; same options
With both defaults, scale produces “Z-scores”
Successive differences: diff(x); differences between x[t] and x[t-k]: diff(x,lag=k)
Cumulative sums, products, maxima, minima: cumsum, cumprod, cummax, cummin
There is no built-in cummean, but it is easy to write one (see the sketch below)
Moving averages: rollmean from the zoo package; see Recipe 14.10 in The R Cookbook
More general rolling calculations: rollapply
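A minimal cummean, written in terms of cumsum (this helper is not part of base R):
cummean <- function(x) { cumsum(x)/seq_along(x) }
cummean(1:5)   # 1.0 1.5 2.0 2.5 3.0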
rank(x) outputs the rank of each element of x within the vector, 1 being the smallest:
head(cats$Hwt)
[1] 7.0 7.4 9.5 7.2 7.3 7.6
head(rank(cats$Hwt))
[1] 4.0 11.0 50.5 6.5 9.0 12.5
To map data onto the quantiles of a reference distribution, feed the empirical CDF values into that distribution's quantile function, e.g., a Gaussian with mean 100 and s.d. 15:
qnorm(ecdf(x)(x),mean=100,sd=15)
Any quantile function could be used in place of qnorm there (name due to L. Wasserman)
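For instance, with the cats data from above (an added illustration; note that the largest observation maps to +Inf, because its empirical CDF value is exactly 1):
iqish.Hwt <- qnorm(ecdf(cats$Hwt)(cats$Hwt), mean=100, sd=15)
summary(iqish.Hwt)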
Dividing out a trend, e.g., an exponential trend for GDP:
gdp_trend <- gdp[1]*exp(growth.rate*(0:(length(gdp)-1)))   # exponential trend line
gdp_vs_trend <- gdp/gdp_trend                              # ratio of actual values to trend
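Since gdp and growth.rate aren't defined in these notes, here is a made-up series just to show the shape of the calculation:
growth.rate <- 0.02                                          # assumed trend growth rate
gdp <- 100*exp(growth.rate*(0:19))*exp(rnorm(20, sd=0.01))   # 20 periods of fake GDP
gdp_trend <- gdp[1]*exp(growth.rate*(0:(length(gdp)-1)))
summary(gdp/gdp_trend)                                       # ratios hover around 1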
Use residuals when the trend is a regression model:
head(residuals(lm(Hwt ~ Bwt, data=cats)))
1 2 3 4 5 6
-0.7115 -0.3115 1.7885 -0.9149 -0.8149 -0.5149
aggregate takes a dataframe, a list containing the variable(s) to group the rows by, and a scalar-valued summarizing function:
aggregate(cats[,-1],by=cats[1],mean)
Sex Bwt Hwt
1 F 2.36 9.202
2 M 2.90 11.323
Note: no comma in cats[1]; we're treating the dataframe as a list of vectors
Each grouping variable in the by list must be as long as the number of rows of the data
aggregate doesn't work on vectors, but it has a cousin, tapply:
tapply(cats$Hwt,INDEX=cats$Sex,max)
F M
13.0 20.5
tapply(cats$Hwt,cats$Sex,summary)
$F
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.30 8.35 9.10 9.20 10.10 13.00
$M
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.5 9.4 11.4 11.3 12.8 20.5
More complicated actions on subsets usually need the split/apply pattern, which we'll get to in a few weeks
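A small preview, using the cats data from above: split the heart weights by sex, then apply a summarizing function to each piece.
hwt.by.sex <- split(cats$Hwt, cats$Sex)            # a list with components "F" and "M"
sapply(hwt.by.sex, quantile, probs=c(0.25, 0.75))  # quartiles for each sex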
order takes in a vector, and returns the vector of indices which would put it in order (increasing by default)
Use the decreasing=TRUE option to change that
The output of order can be saved to re-order multiple dataframes in the same way
head(cats,4)
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
head(order(cats$Hwt))
[1] 31 48 49 1 13 4
head(cats[order(cats$Hwt),],4)
Sex Bwt Hwt
31 F 2.4 6.3
48 M 2.0 6.5
49 M 2.0 6.5
1 F 2.0 7.0
rank(x) does not deliver the same thing as order(x)!
sort returns the sorted vector, not the ordering:
head(sort(cats$Hwt))
[1] 6.3 6.5 6.5 7.0 7.1 7.2
To get just the position of the smallest or largest element, use which.min or which.max:
which.min(cats$Hwt) == order(cats$Hwt)[1]
[1] TRUE
t(x) transposes a matrix or dataframe x; aperm does the same for higher-dimensional arrays
Combining dataframes: you have two dataframes, say movies.info and movies.biz, and you want to combine them into one dataframe, say movies
Best case: the rows already line up, and you can just bind the columns together:
movies <- data.frame(movies.info, movies.biz)
or, being explicit about which columns you want:
movies <- data.frame(year=movies.info$year,
avg_rating=movies.info$avg_rating,
num_rates=movies.info$num_raters,
genre=movies.info$genre,
gross=movies.biz$gross)
Next best case: same number of rows, but in a different order; use merge
Worse cases: different numbers of rows… merge again
Claim: People in larger cities travel more
More precise claim: miles driven per person per day increases with the area of the city
Distance driven, and city population: table HM-71 in the 2011 “Highway Statistics Series” [http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm]
fha <- read.csv("fha.csv",na.strings="NA",
colClasses=c("character","double","double","double"))
nrow(fha)
[1] 498
colnames(fha)
[1] "City" "Population" "Miles.of.Road"
[4] "Daily.Miles.Traveled"
Area and population of “urbanized areas”: [http://www2.census.gov/geo/ua/ua_list_all.txt]
ua <- read.csv("ua.txt",sep=";")
nrow(ua)
[1] 3598
colnames(ua)
[1] "UACE" "NAME" "POP" "HU"
[5] "AREALAND" "AREALANDSQMI" "AREAWATER" "AREAWATERSQMI"
[9] "POPDEN" "LSADC"
This isn't a simple case: fha orders cities by population, while ua is alphabetical by name (and has many more rows)
But both use the same Census figures for population, and it turns out every settlement (in the top 498) has a unique Census population:
length(unique(fha$Population)) == nrow(fha)
[1] TRUE
identical(fha$Population,sort(ua$POP[1:nrow(fha)],decreasing=TRUE))
[1] FALSE
Option 1: re-order the 2nd table by population
ua <- ua[order(ua$POP,decreasing=TRUE),]
df1 <- data.frame(fha, area=ua$AREALANDSQMI[1:nrow(fha)])
# Neaten up names
colnames(df1) <- c("City","Population","Roads","Mileage","Area")
nrow(df1)
[1] 498
head(df1)
City Population Roads Mileage Area
1 New York--Newark, NY--NJ--CT 18351295 43893 286101 3450
2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877 270807 1736
3 Chicago, IL--IN 8608208 25905 172708 2443
4 Miami, FL 5502379 15641 125899 1239
5 Philadelphia, PA--NJ--DE--MD 5441567 19867 99190 1981
6 Dallas--Fort Worth--Arlington, TX 5121892 21610 125389 1779
Option 2: Use the merge function
df2 <- merge(x=fha,y=ua,
by.x="Population",by.y="POP")
nrow(df2)
[1] 498
tail(df2,3)
Population City Miles.of.Road
496 8608208 Chicago, IL--IN 25905
497 12150996 Los Angeles--Long Beach--Anaheim, CA 24877
498 18351295 New York--Newark, NY--NJ--CT 43893
Daily.Miles.Traveled UACE NAME
496 172708 16264 Chicago, IL--IN
497 270807 51445 Los Angeles--Long Beach--Anaheim, CA
498 286101 63217 New York--Newark, NY--NJ--CT
HU AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
496 3459257 6.327e+09 2443 105649916 40.79 3524 75
497 4217448 4.496e+09 1736 61141327 23.61 6999 75
498 7263095 8.936e+09 3450 533176599 205.86 5319 75
by.x and by.y say which columns need to match to do a merge
The default is to match on all columns with shared names (check colnames on both dataframes)
In database terms, merge is doing a JOIN
You'd think merging on names would be easy…
df2.1 <- merge(x=fha,y=ua,by.x="City", by.y="NAME")
nrow(df2.1)
[1] 492
We can force unmatched rows of either dataframe to be included, with NA values as appropriate:
df2.2 <- merge(x=fha,y=ua,by.x="City",by.y="NAME",all.x=TRUE)
nrow(df2.2)
[1] 498
Database speak: takes us from a “natural join” to a “left outer join”
Where are the mis-matches?
df2.2$City[is.na(df2.2$POP)]
[1] "Aguadilla--Isabela--San Sebastián, PR"
[2] "Danville, VA – NC"
[3] "Florida--Imbéry--Barceloneta, PR"
[4] "Juana DÃaz, PR"
[5] "Mayagüez, PR"
[6] "San Germán--Cabo Rojo--Sabana Grande, PR"
On investigation, fha.csv and ua.txt use 2 different encodings for accent characters, and one writes things like VA -- NC where the other says VA--NC
merge takes some learning, but merge handles matching on many columns, and (as we saw) merge handles rows that don't match up
# Convert 1,000s of miles to miles
df1$Mileage <- 1000*df1$Mileage
# Plot daily miles per person vs. city area
plot(Mileage/Population ~ Area, data=df1, log="x",
xlab="City area in square miles",
ylab="Miles driven per person per day")
# Impressively flat regression line
abline(lm(Mileage/Population~Area,data=df1),col="blue")
So far: apply and friends for doing the same thing to all parts of the data; merge for combining dataframes
Common to have data where some variables identify units, and others are measurements
Wide form: columns for ID variables plus 1 column per measurement
Narrow form: columns for ID variables, plus 1 column identifying measurement, plus 1 column giving value
Often want to convert from wide to narrow, or change what's ID and what's measure
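A tiny made-up example of the two layouts (not from the notes): two students, two test scores each.
wide <- data.frame(student=c("a","b"), math=c(90,75), reading=c(85,80))
narrow <- data.frame(student=rep(c("a","b"),2),
subject=rep(c("math","reading"),each=2),
score=c(90,75,85,80))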
The reshape package introduced data-reshaping tools; the reshape2 package simplifies lots of common uses
melt turns a wide dataframe into a narrow one
dcast turns a narrow dataframe into a wide one
acast turns a narrow dataframe into a wide array
snoqualmie.csv has precipitation every day in Snoqualmie, WA for 36 years (1948–1983):
snoq <- read.csv("snoqualmie.csv",header=FALSE)
colnames(snoq) <- 1:366
snoq[1:3,1:6]
1 2 3 4 5 6
1 136 100 16 80 10 66
2 17 14 0 0 1 11
3 1 35 13 13 18 122
snoq$year <- 1948:1983
library(reshape2)
snoq.melt <- melt(snoq,id.vars="year",
variable.name="day",value.name="precip")
head(snoq.melt)
year day precip
1 1948 1 136
2 1949 1 17
3 1950 1 1
4 1951 1 34
5 1952 1 0
6 1953 1 2
Being sorted by day of the year and then by year is a bit odd
snoq.melt.chron <- snoq.melt[order(snoq.melt$year,snoq.melt$day),]
head(snoq.melt.chron)
year day precip
1 1948 1 136
37 1948 2 100
73 1948 3 16
109 1948 4 80
145 1948 5 10
181 1948 6 66
Most years have 365 days so some missing values:
sum(is.na(snoq.melt.chron$precip[snoq.melt.chron$day==366]))
[1] 27
Tidy with na.omit
:
snoq.melt.chron <- na.omit(snoq.melt.chron)
Today's precipitation vs. next day's:
snoq.pairs <- data.frame(snoq.melt.chron[-nrow(snoq.melt.chron),],
precip.next=snoq.melt.chron$precip[-1])
head(snoq.pairs)
year day precip precip.next
1 1948 1 136 100
37 1948 2 100 16
73 1948 3 16 80
109 1948 4 80 10
145 1948 5 10 66
181 1948 6 66 88
dcast turns back into wide form, with a formula of IDs ~ measures:
snoq.recast <- dcast(snoq.melt,year~...)
dim(snoq.recast)
[1] 36 367
snoq.recast[1:4,1:4]
year 1 2 3
1 1948 136 100 16
2 1949 17 14 0
3 1950 1 35 13
4 1951 34 183 11
acast casts into an array rather than a dataframe
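A minimal sketch reusing snoq.melt from above:
snoq.array <- acast(snoq.melt, year~day, value.var="precip")
dim(snoq.array)   # 36 x 366, with the years as row names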
Hadley Wickham, "Reshaping Data with the reshape Package", Journal of Statistical Software 21 (2007): 12, [http://www.jstatsoft.org/v21/i12]