Lecture 12: Transforming and Reshaping Data ======================================================== author: 36-350 date: 6 October 2014 font-family: Garamond transition: None In Previous Episodes === - Accessing vectors, arrays, and data frames - Applying a function across a vector, array, or data frame - Extracting data values from more-or-less formatted text - Functions to automate repetitive tasks Agenda === - Review of selective access - Review of applying functions - Lossless vs. lossy transformations - Common transformations of numerical data - Re-ordering data frames - Merging data frames Not Our Only Example, Really === {r} data("cats",package="MASS")  Access Tricks === Problem: get the positions in a vector / columns in a matrix / rows in a dataframe matching some condition - Vector of Boolean indicators {r} head(cats$Hwt[cats$Sex=="M"]) head(cats[cats$Sex=="M","Hwt"])  - N.B., cats$Sex=="M" is a Boolean vector, as long as cats$Sex Access Tricks (cont'd.) === - Vector of index numbers {r} training_rows <- sample(1:nrow(cats),size=nrow(cats)/2) head(training_rows) cats.trained <- cats[training_rows,] head(cats.trained)  Access Tricks (cont'd.) === - Vectors of Booleans and vectors of indices can be stored and re-used {r} males <- cats$Sex=="M"  - See also apply tricks below Access Tricks: Don't Do These === - Non-binary, non-integer vectors do not make good indices; don't say {r,eval=FALSE} movies$gross[movies$genre]  if you are trying to get all the gross revenues of some movies of a certain genre - Loops are a last resort, not a first; don't say {r,eval=FALSE} for (i in 1:nrow(movies)) { if (movies$genre[i]=="comedy") { gross.comedy <- c(gross.comedy, movies$gross[i]) } }  Access Tricks: Don't Do These (cont'd.) === - In either case, say {r,eval=FALSE} movies$gross[movies$genre=="comedy"]  or {r,eval=FALSE} movies[movies$genre=="comedy","gross"]  Apply Tricks === - Lots of functions will _automatically_ apply themselves to each element in a vector or dataframe; they are **vectorized** {r} dim(is.na(cats)) # checks each element for being NA  - Math functions are vectorized {r} mean(log(cats$Hwt))  Apply Tricks (cont'd.) === - Distribution-related functions are vectorized {r} rnorm(n=5,mean=-2:2,sd=(1:5)/5)  - Lots of the text functions vectorize across target strings, not patterns (e.g., grep, regexp) - If the function doesn't, or that's not quite what you want, turn to the apply family of functions Apply Tricks: Vectors === - Apply the same function to every element in a vector: sapply or lapply {r} mean.omitting.one <- function(i,x) {mean(x[-i])} jackknifed.means <- sapply(1:nrow(cats),mean.omitting.one,x=cats$Bwt) length(jackknifed.means) sd(jackknifed.means)  - sapply tries to return a vector or an array (with one column per entry in the original vector) - If that doesn't make sense, use lapply, which just returns a list Apply Tricks: Rows === - Apply the same function FUN to every row of an array or dataframe X: apply(X,1,FUN) {r,eval=FALSE} rows_with_NAs <- apply(is.na(movies),1,any)  - apply tries to return a vector or an array; will return a list if it can't - apply assumes FUN will work on a row of X; might need to write a little adapter function to make that true {r,eval=FALSE} # Make 3rd and 5th cols. of X the 1st and 2nd args. of f apply(X,1,function(z){f(z[3],z[5])})  Apply Tricks: Columns === - Apply the same function FUN to every column of an array or dataframe X apply(X,2,FUN) {r} apply(cats[,2:3],2,median)  - Same notes as applying across rows Apply Tricks: Multiple Vectors or Columns === - Given: function f which takes 2+ arguments; vectors x, y, ... z - Wanted; f(x[1],y[1],..., z[1]), f(x[2],y[2],...,z[2]), etc. - Solution: {r,eval=FALSE} mapply(FUN=f,x,y,z)  - Will recycle the vectors to the length of the longest if needed - Often very useful when the vectors are columns, not necessarily from the same object Transformations === > You go to analysis with the data you have, not the data you want. The variables in the data are often either not what's most relevant to the analysis, or they're not arranged conveniently, or both Satisfying model assumptions is a big issue here$\therefore$often want to _transform_ the data to make it closer to the data we wish we had to start with **Lossless** transformations: the original data could be recovered exactly (at least in principle; function is invertible; same$\sigma$-algebra) **Lossy** transformations irreversibly destroy some information (function is not invertible; new$\sigma$-algebra is coarser) Lossless vs. Lossy === Many common transformations are lossless Many useful transformations are lossy, sometimes very lossy Because you're documenting your transformations in commented code yes? and kept a safe copy of the original data on the disk yes? and your disk is backed up regularly YES?!? you can use even very lossy transformations without fear Some Common Transformations of Numerical Data === - log: Because$Y = f(X)g(Z) \Leftrightarrow \log{Y} = \log{f(X)} + \log{g(X)}$, taking logs lets us use linear or additive models when the real relationship is multiplicative + How would you take the log of a whole column? Numerical Transformations (cont'd.) === - Z-scores, centering and scaling: {r} head(scale(cats[,-1],center=TRUE,scale=TRUE))  - center=TRUE$\Rightarrow$subtract the mean; alternately, FALSE or a vector - scale=TRUE$\Rightarrow$divide by standard deviation, after centering; same options + Defaults in scale produce "Z-scores" Numerical Transformations (cont'd.) === - Successive differences: diff(x); differences between x[t] and x[t-k], diff(x,lag=k) + Vectorizes over columns of a matrix - Cumulative totals etc.: cumsum, cumprod, cummax, cummin + Exercise: write cummean - Rolling means: rollmean from the zoo package; see Recipe 14.10 in _The R Cookbook_ + See also Recipe 14.12 on rollapply Numerical Transformations (cont'd.) === - Magnitudes to ranks: rank(x) outputs the **rank** of each element of x within the vector, 1 being the smallest: {r} head(cats$Hwt) head(rank(cats$Hwt))  Numerical Transformations (cont'd.) === - "Para-normal" values: Based on the percentile, where would this be if it were Gaussian/normal? {r,eval=FALSE} qnorm(ecdf(x)(x),mean=100,sd=15)  - Obviously nothing magic about using qnorm there - This is how IQ tests are scored; raw scores are highly skewed and don't follow bell curves at all - "Gaussian copula" = run this trick on two or more variables and then measure the correlations name due to L. Wasserman Numerical Transformations (cont'd.) === - Extracting deviations from a trend + Calculate the predicted value per trend + Take the difference {r,eval=FALSE} gdp_trend <- gdp[1]*exp(growth.rate*(0:length(gdp)-1)) gdp_vs_trend <- gdp/gdp_rend  - Use residuals when the trend is a regression model: {r} head(residuals(lm(Hwt ~ Bwt, data=cats)))  Summarizing Subsets === - aggregate takes a dataframe, a _list_ containing the variable(s) to group the rows **by**, and a _scalar_ -valued summarizing function: {r} aggregate(cats[,-1],by=cats[1],mean)  Note: No comma in cats[1]; treating dataframe as a list of vectors - Each vector in the by list must be as long as the number of rows of the data Summarizing Subsets (cont'd.) === - aggregate doesn't work on vectors, but it has a cousin, tapply: {r} tapply(cats$Hwt,INDEX=cats$Sex,max)  - tapply can return more than just a scalar value: {r} tapply(cats$Hwt,cats$Sex,summary)  Summarizing Subsets (cont'd.) === More complicated actions on subsets usually need the split/apply pattern, which we'll get to in a few weeks Re-Organizing === - Even if the numbers (or strings, etc.) are fine, they may not be arranged very conveniently - Lots of data manipulation involves re-arrangement: + sorting arrays and dataframes by certain columns + exchanging rows and columns + merging dataframes + Turning short, wide dataframes into long, narrow ones, and vice versa Re-Ordering === order takes in a vector, and returns the vector of indices which would put it in order (increasing by default) - Use the decreasing=TRUE option to change that - Output of order can be saved to re-order multiple dataframes the same way order (cont'd.) === {r} head(cats,4) head(order(cats$Hwt)) head(cats[order(cats$Hwt),],4)  Related to order === - rank(x) does _not_ deliver the same thing as order(x)! - sort returns the sorted vector, not the ordering {r} head(sort(cats$Hwt))  - To just get the index of the smallest or largest element, use which.min or which.max {r} which.min(cats$Hwt) == order(cats$Hwt)[1]  Flipping Arrays === - To transpose, converting rows to columns, use t(x) + Use cautiously on dataframes! - Use aperm similarly for higher-dimensional arrays Merging Dataframees === You have two dataframes, say movies.info and movies.biz, and you want to combine them into one dataframe, say movies - Simplest case: the dataframes have exactly the same number of rows, that the rows represent exactly the same units, and you want all columns from both {r,eval=FALSE} movies <- data.frame(movies.info, movies.biz)  Merging Dataframes (cont'd.) === - Next best case: you know that the two dataframes have the same rows, but you only want certain columns from each {r,eval=FALSE} movies <- data.frame(year=movies.info$year, avg_rating=movies.info$avg_rating, num_rates=movies.info$num_raters, genre=movies.info$genre, gross=movies.biz$gross)  Merging Dataframes (cont'd.) === - Next best case: same number of rows but in different order + Put one of them in the same order as the other + Use merge - Worse cases: different numbers of rows... + Cleverer re-ordering tricks + Use merge An Example That Is Not Part of the Midterm === Claim: People in larger cities travel more More precise claim: miles driven per person per day increases with the area of the city Example of Merging (cont'd.) === Distance driven, and city population: table HM-71 in the 2011 "Highway Statistics Series" [http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm] {r} fha <- read.csv("fha.csv",na.strings="NA", colClasses=c("character","double","double","double")) nrow(fha) colnames(fha)  Example of Merging (cont'd.) === Area and population of "urbanized areas": [http://www2.census.gov/geo/ua/ua_list_all.txt] {r} ua <- read.csv("ua.txt",sep=";") nrow(ua) colnames(ua)  Example of Merging (cont'd.) === This isn't a simple case, because: 1.$\approx 500$cities vs.$\approx 4000$"urbanized areas" 2. fha orders cities by population, ua is alphabetical by name 3. Both have place-names, but those don't always agree 4. Not even common names for the shared columns _But_ both use the same Census figures for population, and it turns out every settlement (in the top 498) has a unique Census population: {r} length(unique(fha$Population)) == nrow(fha) identical(fha$Population,sort(ua$POP[1:nrow(fha)],decreasing=TRUE))  Example of Merging (cont'd.) === Option 1: re-order the 2nd table by population {r} ua <- ua[order(ua$POP,decreasing=TRUE),] df1 <- data.frame(fha, area=ua$AREALANDSQMI[1:nrow(fha)]) # Neaten up names colnames(df1) <- c("City","Population","Roads","Mileage","Area") nrow(df1) head(df1)  Example of Merging (cont'd.) === Option 2: Use the merge function {r} df2 <- merge(x=fha,y=ua, by.x="Population",by.y="POP") nrow(df2) tail(df2,3)  merge === - by.x and by.y say which columns need to match to do a merge + Default: merge on all columns with shared names - New dataframe has _all_ the columns of _both_ dataframes + Here, should really delete the ones we don't need and tidy colnames - If you know databases, then merge is doing a JOIN + If you don't know what that means, wait until November Example of Merging (cont'd.) === You'd think merging on names would be easy... {r} df2.1 <- merge(x=fha,y=ua,by.x="City", by.y="NAME") nrow(df2.1)  Example of Merging (cont'd.) === We can force unmatched rows of either dataframe to be included, with NA values as appropriate: {r} df2.2 <- merge(x=fha,y=ua,by.x="City",by.y="NAME",all.x=TRUE) nrow(df2.2)  Database speak: takes us from a "natural join" to a "left outer join" Example of Merging (cont'd.) === Where are the mis-matches? {r} df2.2$City[is.na(df2.2$POP)]  On investigation, fha.csv and ua.txt use 2 different encodings for accent characters, and one writes things like VA -- NC and the other says VA--NC Using order+data.frame vs. merge === - Re-ordering is easier to grasp; merge takes some learning - Re-ordering is simplest when there's only one column to merge on; merge handles many columns - Re-orderng is simplest when the dataframes are the same size; merge handles So, Do Bigger Cities Mean More Driving? === {r} # Convert 1,000s of miles to miles df1$Mileage <- 1000*df1$Mileage # Plot daily miles per person vs. area plot(Mileage/Population ~ Area, data=df1, log="x", xlab="Miles driven per person per day", ylab="City area in square miles") # Impressively flat regression line abline(lm(Mileage/Population~Area,data=df1),col="blue")  Take-Aways === - Boolean vectors and vectors of indices to access selected parts of the data - apply and friends for doing the same thing to all parts of the data - Numerical transformations - Re-ordering dataframes - Merging dataframes with merge Bonus Topic: Reshaping === - Common to have data where some variables identify units, and others are measurements - **Wide** form: columns for ID variables plus 1 column per measurement + Good for things like correlating measurements, or running regressions - **Narrow** form: columns for ID variables, plus 1 column identifying measurement, plus 1 column giving value + Good for summarizing, subsetting Often want to convert from wide to narrow, or change what's ID and what's measure reshape2 === - reshape package introduced data-reshaping tools - reshape2 package simplifies lots of common uses - melt turns a wide dataframe into a narrow one - dcast turns a narrow dataframe into a wide one + acast turns a narrow dataframe into a wide array Reshaping Example === - snoqualmie.csv has precipitation every day in Snoqualmie, WA for 36 years (1948--1983) - One row per year, one column per day, units of $1/100$ inch From P. Guttorp, _Stochastic Modeling of Scientific Data_ (London: Chapman and Hall, 1995) {r} snoq <- read.csv("snoqualmie.csv",header=FALSE) colnames(snoq) <- 1:366 snoq[1:3,1:6] snoq$year <- 1948:1983  Reshaping Example (cont'd.) === {r,echo=FALSE} require(reshape2)  {r} snoq.melt <- melt(snoq,id.vars="year", variable.name="day",value.name="precip") head(snoq.melt)  Reshaping Example (cont'd.) === Being sorted by day of the year and then by year is a bit odd {r} snoq.melt.chron <- snoq.melt[order(snoq.melt$year,snoq.melt$day),] head(snoq.melt.chron)  Reshaping Example (cont'd.) === Most years have 365 days so some missing values: {r} sum(is.na(snoq.melt.chron$precip[snoq.melt.chron$day==366]))  Tidy with na.omit: {r} snoq.melt.chron <- na.omit(snoq.melt.chron)  Reshaping Example (cont'd.) === Today's precipitation vs. next day's: {r} snoq.pairs <- data.frame(snoq.melt.chron[-nrow(snoq.melt.chron),], precip.next=snoq.melt.chron$precip[-1]) head(snoq.pairs)  Reshaping Example (cont'd.) === - dcast turns back into wide form, with a formula of IDs ~ measures {r} snoq.recast <- dcast(snoq.melt,year~...) dim(snoq.recast) snoq.recast[1:4,1:4]  - acast casts into an array rather than a dataframe Reshaping (cont'd.) === - The formula could also specify multiple ID variables (including original measure variables), different measure variables (including original ID variables)... - Also possible to apply functions to aggregates which all have the same IDs, select subsets of the data, etc. - Strongly recommended reading: > Hadley Wickham, "Reshaping Data with the reshape Package", _Journal of Statistical Software_ **21** (2007): 12, [http://www.jstatsoft.org/v21/i12]