# Learning R, Lesson 1
# H. Seltman 1/23/07

# What is R and what are its strengths and weaknesses?

# R is a free, popular, reliable, extensible, interpreter-based programming, graphics and statistics
# package. It is an implementation of "S" available for all standard computer operating systems.
# Unlike the expensive, commercial implementation, S-PLUS, R is not menu based. It can perform
# standard statistical analyses easily, can be programmed for simple or complex non-standard
# analyses, and is often the first place new statistical procedures are available. It has many
# graph options, and can be programmed for essentially any graphing task. It is reasonably quick
# for an interpreted language. For new complex tasks, it is often used as a prototyping system
# before porting to C or Fortran. The command line interface is good for trying quick ideas, while
# the use of script files is highly recommended for creating documented and reusable code.
# Datasets too large to fit in memory are not well handled. Although there are no menus, it is
# usually possible to learn a small subset of R, then learn individual additional parts as needed.

# Everything in R is an "object", named or unnamed. Objects created in the "workspace" are saved
# to disk when you quit R, provided you choose to save the workspace. Objects range from a single
# number to large datasets to a set of results from a complex analysis of data.

# The two main types of objects are language elements and data. Language elements include
# functions, expressions and formulas. Data objects are all vectors of various types (modes), even
# if of length one. The main modes are numeric, character (string), logical (TRUE/FALSE), and list
# (each element of which is an object of any type). Every object has a mode and a length. All of
# the data modes allow missing elements coded as NA, which have the same mode as the rest of the
# vector and contribute to its length. A special object called NULL is a zero-length object
# with mode NULL (essentially not yet determined).
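
# A minimal sketch of the mode, length, NA and NULL ideas described above. The vectors num, chr,
# lgl and lst are made up for illustration and are not part of the original lesson.

```r
# Hypothetical example vectors (names are illustrative only)
num = c(1.5, NA, 3)          # numeric vector with one missing value
chr = c("a", NA)             # character vector with one missing value
lgl = c(TRUE, FALSE, NA)     # logical vector with one missing value
lst = list(1, "b", TRUE)     # list whose elements have different modes

mode(num); length(num)       # "numeric" and 3: the NA still counts toward the length
mode(chr); length(chr)       # "character" and 2
mode(lgl); length(lgl)       # "logical" and 3
mode(lst); length(lst)       # "list" and 3

mode(5); length(5)           # a single number is a numeric vector of length one
mode(NULL); length(NULL)     # "NULL" and 0
```

# Running the lines one at a time at the prompt shows that NA never changes the mode of a vector.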
# Objects may have additional properties called "attributes", each of which is a named object.
# Most attributes have corresponding functions to read and set them for any object.

# A special attribute called "class" is heavily used in R for "object-oriented programming".
# The class of any simple data structure or a function is its mode. More complex objects
# have special classes such as "factor", "matrix", "array", "data.frame", "lm" (linear
# model), etc. The classes allow one function, such as print() or summary(), to behave
# differently and appropriately for different classes of objects.

# When you feed "commands" to R, the commands are all "expressions" (including function evaluations)
# that are evaluated, with results optionally printed (normally on the screen) and/or assigned
# to a named object. Object names are case sensitive; should start with a letter or period;
# may contain letters, numbers, periods and underscores; and cannot contain a dash, space, etc.
# Meaningful names are recommended. Everything after a "#" character is ignored.

# The R prompt is ">". If you enter a syntactically incorrect command, you'll get an error
# and can try again. If you enter an incomplete command, you'll get a "+" prompt and will
# need to complete the command before continuing. Generally this involves adding one or more
# right parentheses (or quotes if unmatched quotes were entered).

# Assignment can be done with an equal sign, in which case nothing is printed. Typing
# a non-assignment expression is an implicit call to the print() function. Like most
# calculators and languages, multiplication and division have higher precedence than
# addition and subtraction. Use parentheses as needed.

# The following are "reserved" names which you cannot use: if else repeat while function for in
# next break TRUE FALSE NULL NA Inf NaN.
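
# A minimal sketch of the attribute and class ideas described earlier in this section. The
# objects v and f are made up for illustration and are not part of the original lesson.

```r
# Hypothetical example objects (names are illustrative only)
v = c(10, 20, 30)
names(v) = c("a", "b", "c")  # names() on the left of "=" sets the "names" attribute
attributes(v)                # lists the attributes: here only $names
class(v)                     # "numeric": for a simple vector the class matches the mode

f = factor(c("lo", "hi", "lo"))
mode(f)                      # "numeric": a factor is stored as numeric codes
class(f)                     # "factor": the class attribute selects the method to use
summary(v)                   # numeric summary (minimum, quartiles, mean, maximum)
summary(f)                   # counts per level, because class(f) is "factor"
```

# The two summary() calls use the same function name but give different kinds of output,
# which is the class-based behavior the text describes.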
# The following have special meaning and should be avoided as names: c, q, s, t, C, D, F, I, T,
# diff, length, mean, pi, range, rank, time, tree and var. Use of these may cause programs to
# fail or (rarely) give wrong answers.

3+5*2
print(3+5*2)   # apply function "print" to expression "3+5*2"
(3+5)/2
x=3+5*2
y=(3+5)/2
x
y
x-y
x^y
length(x)      # apply function "length" to object "x"
mode(x)
class(x)
sqrt(y)
log(100)
log10(100)

# Two common ways to create a vector are the concatenate function, c(), and the
# sequence operator, ":". Standard arithmetic operators work on each element of
# a vector or pair of vectors. You will often want to use parentheses with the
# sequence operator because it has high precedence.
x=c(5,2,10,3.5,1,10)
x
y=2:7
y
x^2
x+y
x/y
z=4
2:z+1      # see next section for what is happening here
2:(z+1)    # this is probably what you wanted

# The "recycling rule" applies when a two-vector operation is applied to vectors of
# unequal length. This may be intended or unintended, the latter sometimes giving
# a warning. Here's how the rule works: if one vector is shorter than the other,
# it is repeated as needed. If the longer length is not a whole multiple of the
# shorter length (fractional replication), a warning is given.
y
y+1
y+c(1,2)   # c(1,2) is used three times
y+1:4      # 6 is not a multiple of 4, so a warning is given

# Functions may be built-in, loaded from public "packages", or be user-written. They
# have complex "calling rules" to be discussed later, but are intuitive for the simple
# situations. Here are some functions that take a vector argument and return a
# vector of length one.
mean(x)
mean(3,5)      # Danger!! Not what you wanted!! Only the first argument is treated as data.
mean(c(3,5))
sd(x)
var(x)
median(x)
sum(x)
prod(x)        # product
cumsum(x)      # note: returns a vector as long as x, not length one (see the next group)
cumprod(x)     # likewise

# Here are some functions that (somewhat confusingly) take several vectors, concatenate
# them together and return a length 1 or length 2 vector.
min(x)
min(9,7,5)
min(9,c(7,5))
max(x)
range(x)

# Here are some functions that return results of the same length as the arguments. Note
# that "p" means "parallel" (i.e., element by element, here).
x
cumsum(x)    # cumulative sum
cumprod(x)   # cumulative product
y
pmin(x,y)
pmax(x,y)

# Logical operators work elementwise and produce TRUE or FALSE (or NA) at each element.
y==3         # double equal is comparison, not assignment
y!=3         # not equal to
y<3
y<=3
y>3 & y<5    # "and"
y<3 | y>5    # "or"

# Subsetting of vectors is used often, and has many forms, all of which involve
# use of square brackets. First consider indexing with positive numbers:
y
y[4]
y[4:5]
y[4:9]       # positions beyond the length of y give NA
y[c(1,3,5)]
y[c(6,5,4,4,5,6)]

# Negative numbers drop elements:
y
y[-4]
y[c(-2,-5)]
y[c(-5,-2)]
y[-c(2,5)]
y[-12]       # an out-of-range negative index is ignored, so all of y is returned

# Logical indices must match in length or be recycled.
y
y[c(TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)]
y[c(TRUE,FALSE)]
y[c(TRUE,FALSE,FALSE)]
x
x<10
x[x<10]
x>mean(x)
y[x>mean(x)]
tmp=y[x>mean(x)]
tmp*2

# Vectors can have a "names" attribute with names for each element. The names
# can be used for subsetting.
x
LETTERS      # built-in variable of 26 single-character strings
1:length(x)
seq(along=x) # preferred form to be discussed later
LETTERS[seq(along=x)]
names(x)=LETTERS[seq(along=x)]  # attribute-setting functions go on the left of the "="
x
x["E"]
x[c("F","B","D")]

# Names can be entered in the c() command.
y=c(one=1, 2:5, six=6, last=7)
y
y["six"]
y[c(3,6)]

# A practical example: two-sample t-test (with correction for unequal variance)
x1=c(2,5,3,4,8,10)
x2=c(5,4,6,2,7,5,9,6)
length(x1); length(x2)   # semicolon allows multiple expressions on a line
mean(x1); mean(x2); sd(x1); sd(x2)
t.test(x1,x2)
rslt=t.test(x1,x2)
print(rslt)              # print() is optional at the prompt (but needed in scripts run with source())
names(rslt)              # the position of each element can depend on the R version
mode(rslt)               # list elements can be of different modes
rslt[c(1,8)]             # subsetting lists with [] gives a list
mode(rslt[c(1,8)])
mode(rslt[1])
rslt[[1]]                # subsetting with [[]] gives the underlying element
mode(rslt[[1]])
rslt[[8]]
mode(rslt[[8]])
rslt$method              # lists (not vectors) may be subsetted using the $ shortcut
rslt$p.val               # abbreviations may be used, but may be dangerous
rslt$p                   # abbreviations must be unique (this ambiguous one returns NULL)

# Power demo using rnorm() to generate random samples
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value

# Use help.start() to get extensive help.
# Use help(some.function) to see if a function exists and get brief help.
help.start()   # Search Engine & Keywords is very useful
help(rnorm)

# Use q() to quit. Normally you answer yes to "Save workspace image".

#################################
####### Practice problems #######

#1) Find the mean and variance of 3.5, 12.0, 8.6 and 4.4. Find the s.d. without using
#   the function sd().
#2) Let vec1=rnorm(100,mean=10,sd=2). Check its length. Make a single expression that
#   pulls out those values bigger than 12. Make an expression that counts them.
#   Repeat for counting how many are between 8 and 12, and between 6 and 14.
#3) Make a vector "vec2" equal to the centered version of vec1. Verify that it has
#   mean 0 and sd matching vec1. Make a vector "vec3" equal to the "Z-score" for vec1
#   by dividing vec2 by the sd of either vector.
#   Verify that the variance is 1.0.
#4) Print out the values of vec3 that are bigger than the square root of 3.
#5) Find the minimum and maximum of each of the 3 vectors.
#6) Make the shortest expression that can pull out elements 10, 11, 12, 88, 89, 90, and
#   91 from one of the vectors.
#7) Print out various parts of this expression to see if you can figure out what it is
#   doing.
sum((-4:4)>sqrt(4:6))
#8) Figure this one out.
x=(1:100)[c(FALSE,TRUE)]   # not the easiest way to do this
x=x[floor(x/3)*3==x]
x=x[floor(sqrt(x))==sqrt(x)]
x
#9) Figure this one out.
p=c(t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value,
    t.test(rnorm(8),rnorm(8))$p.value)
length(p)
mean(p)
range(p)
#10) Figure this one out. Use help() as needed.
n=500          # or set to any other number
e=numeric(n)   # creates a numeric vector of length n filled with zeros
e=rep(NA,n)    # my preferred way to better detect errors
# for loops will be discussed another time
for (i in 1:n) e[i]=diff(t.test(rnorm(10+i),rnorm(10+i))$estimate)
plot(e, type="l")