Data Frames and Control
========================================================
author: 36-350
date: 3 September 2014
font-family: 'Garamond'

Agenda
===

- Making and working with data frames
- Conditionals: switching between different calculations
- Iteration: doing something over and over
- Vectorizing: avoiding explicit iteration

In Our Last Thrilling Episode
===

- Vectors: series of values all of the same type
  `v[5]`, `v["name"]`
- Arrays: multi-dimensional generalization of vectors
  `a[5,6,2]`, `a[,6,]`, `a[rowname, colname, layername]`
- Matrices: special 2D arrays with matrix math
  `m[5,6]`, `m[,6]`, `m[,colname]`
- Lists: series of values of mixed types
  `l[[3]]`, `l$name`
- Dataframes: hybrid of matrix and list

Dataframes, Encore
===

- 2D tables of data
- Each case/unit is a row
- Each variable is a column
- Variables can be of any type (numbers, text, Booleans, ...)
- Both rows and columns can get names

Creating an example dataframe
===

```{r}
library(datasets)
states <- data.frame(state.x77, abb=state.abb, region=state.region,
                     division=state.division)
```

Here `data.frame()` combines a pre-existing matrix (`state.x77`), a vector of characters (`state.abb`), and two vectors of qualitative categorical variables (**factors**; `state.region`, `state.division`)

Column names are preserved or guessed if not explicitly set
===

```{r}
colnames(states)
states[1,]
```

Dataframe access
===

- By row and column index
```{r}
states[49,3]
```

- By row and column names
```{r}
states["Wisconsin","Illiteracy"]
```

Dataframe access (cont'd.)
===

- All of a row:
```{r}
states["Wisconsin",]
```

Exercise: what class is `states["Wisconsin",]`?

Dataframe access (cont'd.)
===

- All of a column:
```{r}
head(states[,3])
head(states[,"Illiteracy"])
head(states$Illiteracy)
```

Dataframe access (cont'd.)
===

- Rows matching a condition:
```{r}
states[states$division=="New England", "Illiteracy"]
states[states$region=="South", "Illiteracy"]
```

Replacing values
===

Parts or all of the dataframe can be assigned to:
```{r}
summary(states$HS.Grad)
states$HS.Grad <- states$HS.Grad/100
summary(states$HS.Grad)
states$HS.Grad <- 100*states$HS.Grad
```

with()
===

What percentage of literate adults graduated HS?
```{r}
head(100*(states$HS.Grad/(100-states$Illiteracy)))
```

`with()` takes a data frame and evaluates an expression "inside" it:
```{r}
with(states, head(100*(HS.Grad/(100-Illiteracy))))
```

Data arguments
===

Lots of functions take `data` arguments, and look variables up in that data frame:
```{r}
plot(Illiteracy~Frost, data=states)
```

$R^2 = 0.45$, $p \approx 10^{-7}$

Conditionals
===

Have the computer decide what to do next

- Mathematically:
$|x| = \left\{ \begin{array}{cl} x & \mathrm{if}~x\geq 0 \\ -x &\mathrm{if}~ x < 0\end{array}\right. ~,~ \psi(x) = \left\{ \begin{array}{cl} x^2 & \mathrm{if}~|x|\leq 1\\ 2|x|-1 &\mathrm{if}~ |x| > 1\end{array}\right.$

Exercise: plot $\psi$ in R

- Computationally:
```
if the country code is not "US",
  multiply prices by current exchange rate
```

if()
===

Simplest conditional:
```
if (x >= 0) {
  x
} else {
  -x
}
```

The condition in `if` needs to give _one_ `TRUE` or `FALSE` value

The `else` clause is optional

One-line actions don't need braces
```
if (x >= 0) x else -x
```

Nested if()
===

`if` can *nest* arbitrarily deeply:
```
if (x^2 < 1) {
  x^2
} else {
  if (x >= 0) {
    2*x-1
  } else {
    -2*x-1
  }
}
```

Can get ugly though

Combining Booleans: && and ||
===

`&` and `|` work like `+` or `*`: they combine terms element-wise

Flow control wants *one* Boolean value, and to skip calculating what's not needed

`&&` and `||` give _one_ Boolean, lazily:
```{r}
(0 > 0) && (all.equal(42%%6, 169%%13))
```

This *never* evaluates the complex expression on the right

Use `&&` and `||` for control, `&` and `|` for subsetting

Iteration
===

Repeat similar actions multiple times:
```{r}
table.of.logarithms <- vector(length=7,mode="numeric")
table.of.logarithms
for (i in 1:length(table.of.logarithms)) {
  table.of.logarithms[i] <- log(i)
}
table.of.logarithms
```

for()
===

```
for (i in 1:length(table.of.logarithms)) {
  table.of.logarithms[i] <- log(i)
}
```

`for` increments a **counter** (here `i`) along a vector (here `1:length(table.of.logarithms)`) and **loops through** the **body** until it runs through the vector

It "**iterates over** the vector"

N.B., there is a better way to do this job!

The body of the for() loop
===

Can contain just about anything, including:
- `if()` clauses
- other `for()` loops (nested iteration)

Nested iteration example
===

```
c <- matrix(0, nrow=nrow(a), ncol=ncol(b))
if (ncol(a) == nrow(b)) {
  for (i in 1:nrow(c)) {
    for (j in 1:ncol(c)) {
      for (k in 1:ncol(a)) {
        c[i,j] <- c[i,j] + a[i,k]*b[k,j]
      }
    }
  }
} else {
  stop("matrices a and b non-conformable")
}
```

while(): conditional iteration
===

```
while (max(x) - 1 > 1e-06) {
  x <- sqrt(x)
}
```

The condition in the argument to `while` must be a single Boolean value (like `if`)

The body is looped over until the condition is `FALSE`, so it can loop forever

The loop never begins unless the condition starts `TRUE`

for() vs. while()
===

`for()` is better when the number of times to repeat (values to iterate over) is clear in advance

`while()` is better when you can recognize when to stop once you're there, even if you can't guess it to begin with

Every `for()` could be replaced with a `while()`

Exercise: show this

Avoiding iteration
===

R has many ways of _avoiding_ iteration, by acting on whole objects
- It's conceptually clearer
- It leads to simpler code
- It's faster (sometimes a little, sometimes drastically)

Vectorized arithmetic
===

How many languages add 2 vectors:
```
# pre-allocate the result; note that vector()'s first argument is the mode
c <- vector(mode="numeric", length=length(a))
for (i in 1:length(a)) {
  c[i] <- a[i] + b[i]
}
```

How R adds 2 vectors:
```
a+b
```

or a triple `for()` loop for matrix multiplication vs.
`a %*% b`

Advantages of vectorizing
===

- Clarity: the syntax is about _what_ we're doing
- Concision: we write less
- Abstraction: the syntax hides _how the computer does it_
- Generality: same syntax works for numbers, vectors, arrays, ...
- Speed: modifying big vectors over and over is slow in R; work gets done by optimized low-level code

Vectorized calculations
===

Many functions are set up to vectorize automatically
```{r}
abs(-3:3)
log(1:7)
```

See also `apply()` from last week

We'll come back to this in great detail later

Vectorized conditions: ifelse()
===

```
ifelse(x^2 > 1, 2*abs(x)-1, x^2)
```

The 1st argument is a Boolean vector; each element of the result is picked from the 2nd argument where it is `TRUE` and from the 3rd where it is `FALSE`

Summary
===

- Dataframes
- `if`, nested `if`, `switch`
- Iteration: `for`, `while`
- Avoiding iteration with whole-object ("vectorized") operations

What Is Truth?
===

0 counts as `FALSE`; other numeric values count as `TRUE`; the strings "TRUE" and "FALSE" count as you'd hope; most everything else gives an error

Advice: Don't play games here; try to make sure control expressions are getting Boolean values

Conversely, in arithmetic, `FALSE` is 0 and `TRUE` is 1
```{r}
mean(states$Murder > 7)
```

switch()
===

Simplify nested `if` with `switch()`: give a variable to select on, then a value for each option
```
switch(type.of.summary,
       mean=mean(states$Murder),
       median=median(states$Murder),
       histogram=hist(states$Murder),
       "I don't understand")
```

Exercise (off-line)
===

Set `type.of.summary` to, successively, "mean", "median", "histogram", and "mode", and explain what happens

Unconditional iteration
===

```
repeat {
  print("Help! I am Dr. Morris Culpepper, trapped in an endless loop!")
}
```

"Manual" control over iteration
===

```
repeat {
  if (watched) {
    next
  }
  print("Help! I am Dr. Morris Culpepper, trapped in an endless loop!")
  if (rescued) {
    break
  }
}
```

`break` exits the loop; `next` skips the rest of the body and goes back into the loop

Both work with `for()` and `while()` as well

Exercise: how would you replace `while()` with `repeat()`?
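
Skipping and stopping inside for(): a sketch
===

A minimal sketch, not part of the original deck, of `next` and `break` used inside a `for()` loop rather than `repeat`. The vector `values`, the cutoff of 40, and the name `roots` are all made up for illustration.

```{r}
# Hypothetical data, invented for this illustration
values <- c(4, -1, 9, 16, -25, 36, 49)
roots <- c()
for (v in values) {
  if (v < 0) {
    next    # skip negative entries and move on to the next element
  }
  if (v > 40) {
    break   # leave the loop entirely at the first value above 40
  }
  roots <- c(roots, sqrt(v))   # reached only for values that pass both checks
}
roots   # 2 3 4 6
```

Growing `roots` one element at a time is fine for a toy example like this, but it is the kind of repeated vector modification that gets slow on big inputs.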