# Learning R, Lesson 1
# H. Seltman 1/23/07
# What is R and what are its strengths and weaknesses?
# R is a free, popular, reliable, extensible, interpreter-based programming, graphics and statistics
# package. It is an implementation of "S" available for all standard computer operating systems.
# Unlike the expensive, commercial implementation, S-PLUS, R is not menu based. It can perform
# standard statistical analyses easily, can be programmed for simple or complex non-standard
# analyses, and is often the first place new statistical procedures are available. It has many
# graph options, and can be programmed for essentially any graphing task. It is reasonably quick
# for an interpreted language. For new complex tasks, it is often used as a prototyping system
# before porting to C or Fortran. The command line interface is good for trying quick ideas, while
# the use of script files is highly recommended for creating documented and reusable code. Very
# large datasets are not well handled. Although there are no menus, it is usually possible to learn
# a small subset of R, then learn individual additional parts as needed.
# Everything in R is an "object", named or unnamed. Objects created in the "workspace" can be saved
# to disk when you quit R. Objects range from a single number to large datasets to a set of
# results from a complex analysis of data.
# The two main types of objects are language elements and data. Language elements include
# functions, expressions and formulas. Data objects are all vectors of various types (modes), even
# if of length one. The main modes are numeric, character (string), logical (TRUE/FALSE), and list
# (each element of which is an object of any type). Every object has a mode and a length. All of
# the data modes allow missing elements coded as NA which have the same mode as the rest of the
# vector and contribute to its length. A special object called NULL is a zero length object
# with mode NULL (essentially not yet determined). Objects may have additional properties called
# "attributes", each of which is a named object. Most attributes have corresponding functions to
# read and set them for any object.
# A special attribute called "class" is heavily used in R for "object-oriented programming".
# The class of any simple data structure or a function is its mode. More complex objects
# have special classes such as "factor", "matrix", "array", "data.frame", "lm" (linear
# model), etc. The classes allow one function, such as print() or summary(), to behave
# differently and appropriately for different classes of objects.
# When you feed "commands" to R, the commands are all "expressions" (including function evaluations)
# that are evaluated, with results optionally printed (normally on the screen) and/or assigned
# to a named object. Object names are case sensitive; should start with a letter or period;
# may contain letters, numbers, periods and underscores; and cannot contain a dash, space, etc.
# Meaningful names are recommended. Everything after a "#" character is ignored.
# The R prompt is ">". If you enter a syntactically incorrect command, you'll get an error
# and can try again. If you enter an incomplete command, you'll get a "+" prompt and will
# need to complete the command before continuing. Generally this involves adding one or more
# right parentheses (or quotes if unmatched quotes were entered).
# Assignment can be done with an equal sign, in which case nothing is printed. Typing
# a non-assignment expression is an implicit call to the print() function. As in most
# calculators and languages, multiplication and division have higher precedence than
# addition and subtraction. Use parentheses as needed.
# The following are "reserved" names which you cannot use: if else repeat while function for in
# next break TRUE FALSE NULL NA Inf NaN. The following have special meaning and should be
# avoided as names: c, q, s, t, C, D, F, I, T, diff, length, mean, pi, range, rank, time,
# tree and var. Use of these may cause programs to fail or (rarely) give wrong answers.
3+5*2
print(3+5*2) # apply function "print" to expression "3+5*2"
(3+5)/2
x=3+5*2
y=(3+5)/2
x
y
x-y
x^y
length(x) # apply function "length" to object "x"
mode(x)
class(x)
sqrt(y)
log(100)
log10(100)
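# A minimal sketch of the object properties described at the top of this lesson
# (mode, length, NA, NULL, attributes, class). The names "v", "w" and "f" are
# made up here for illustration only.
v=c(1.5,NA,3) # NA is a missing value; it still counts toward the length
mode(v)
length(v)
is.na(v) # TRUE where an element is missing
mean(v) # NA, because one element is missing
mean(v,na.rm=TRUE) # drop missing values before averaging
w=c("a",NA) # NA takes on the mode of the rest of the vector
mode(w)
length(NULL) # NULL has length zero
mode(NULL)
f=factor(c("lo","hi","lo")) # a more complex object with a special class
class(f)
attributes(f) # the "levels" and "class" attributes
levels(f) # levels() reads (and can set) the "levels" attribute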
# Two common ways to create a vector are the concatenate function, c(), and the
# sequence operator, ":". Standard arithmetic operators work on each element of
# a vector or pair of vectors. You will often want to use parentheses with the
# sequence operator because it has high precedence.
x=c(5,2,10,3.5,1,10)
x
y=2:7
y
x^2
x+y
x/y
z=4
2:z+1 # evaluated as (2:z)+1 because ":" binds tighter than "+"
2:(z+1) # this is probably what you wanted
# The "recycling rule" applies when a two-vector operation is applied to vectors of
# unequal length. This may be intended or unintended, the latter sometimes giving
# a warning. Here's how the rule works: if a vector is too small to match the other
# vector, it is replicated as needed. If fractional replication is needed, a warning
# is given.
y
y+1
y+c(1,2)
y+1:4
# Functions may be built-in, loaded from public "packages", or be user-written. They
# have complex "calling rules" to be discussed later, but are intuitive for the simple
# situations. Here are some functions that take a vector argument and return a
# vector of length one.
mean(x)
mean(3,5) # Danger!! Not what you wanted!! The 5 is taken as the "trim" argument, so the result is 3
mean(c(3,5))
sd(x)
var(x)
median(x)
sum(x)
prod(x) # product
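cumsum(x) # cumulative sum: returns a full-length vector, not a single value
cumprod(x) # cumulative product: likewise full length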
# Here are some functions that (somewhat confusingly) take several vectors, concatenate
# them together and return a length 1 or length 2 vector.
min(x)
min(9,7,5)
min(9,c(7,5))
max(x)
range(x)
# Here are some functions that return results of the same length as the arguments. Note
# that "p" means "parallel" (i.e., element by element, here).
x
cumsum(x) # cumulative sum
cumprod(x)
y
pmin(x,y)
pmax(x,y)
# Logical operators work elementwise and produce TRUE or FALSE (or NA) at each element.
y==3 # double equal is comparison, not assignment
y!=3 # not equal to
y<3
y<=3
y>3 & y<5 # "and"
y<3 | y>5 # "or"
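# The NA case mentioned above does not arise with y, so here is a small sketch
# using a made-up vector "u" that contains a missing value.
u=c(1,NA,5)
u>3 # a comparison involving NA gives NA
u>3 & u<10
is.na(u) # use is.na() to find missing values
u==NA # all NA, which is rarely what you want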
# Subsetting of vectors is used often, and has many forms, all of which involve
# use of square brackets. First consider indexing with a vector of positive numbers:
y
y[4]
y[4:5]
y[4:9] # positions beyond the end of y come back as NA
y[c(1,3,5)]
y[c(6,5,4,4,5,6)]
# Negative numbers drop elements:
y
y[-4]
y[c(-2,-5)]
y[c(-5,-2)]
y[-c(2,5)]
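y[-12] # dropping a position that does not exist changes nothing
#y[c(-2,5)] # gives an error: negative and positive indices cannot be mixed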
# Logical indices must match in length or be recycled.
y
y[c(TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)]
y[c(TRUE,FALSE)]
y[c(TRUE,FALSE,FALSE)]
x
x<10
x[x<10]
x>mean(x)
y[x>mean(x)]
tmp=y[x>mean(x)]
tmp*2
# Vectors can have a "names" attribute with names for each element. The names
# can be used for subsetting.
x
LETTERS # built in variable of 26 single character strings
1:length(x)
seq(along=x) # preferred form to be discussed later
LETTERS[seq(along=x)]
names(x)=LETTERS[seq(along=x)] # attribute setting functions go on the left of the "="
x
x["E"]
x[c("F","B","D")]
# Names can be entered in the c() command.
y=c(one=1, 2:5, six=6, last=7)
y
y["six"]
y[c(3,6)]
# A practical example: two-sample t-test (with correction for unequal variance)
x1=c(2,5,3,4,8,10)
x2=c(5,4,6,2,7,5,9,6)
length(x1); length(x2) # semicolon allows multiple expressions on a line
mean(x1); mean(x2); sd(x1); sd(x2)
t.test(x1,x2)
rslt=t.test(x1,x2)
print(rslt) # print() is optional (except in scripts)
names(rslt)
mode(rslt) # list elements can be of different modes
rslt[c(1,8)] # subsetting lists with [] gives a list
mode(rslt[c(1,8)])
mode(rslt[1])
rslt[[1]] # subsetting with [[]] gives the underlying vector mode
mode(rslt[[1]])
rslt[[8]] # positions can shift between R versions; selecting by name is safer
mode(rslt[[8]])
rslt$method # lists (not atomic vectors) may be subsetted using the $ shortcut
rslt$p.val # abbreviations may be used, but may be dangerous
rslt$p # NULL: abbreviations must be unique, and "p" matches both parameter and p.value
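# A hedged aside on the abbreviation danger: the small list "lst" below is made up
# for illustration (its values are arbitrary). It contrasts $ abbreviation with
# [[ ]] and a quoted name, which requires an exact match by default.
lst=list(parameter=8,p.value=0.2)
lst$p.v # works: only one name starts with "p.v"
lst$p # NULL: "p" matches both names
lst[["p.value"]] # exact name
lst[["p.v"]] # NULL: [[ ]] does not abbreviate by default
lst[c("parameter","p.value")] # [] with names gives a list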
# Power demo using rnorm() to generate random samples
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
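# A rough sketch of turning repeated runs like these into a power estimate: repeat
# the simulation many times and record how often p < 0.05. The names "nsim" and
# "pvals" and the 0.05 cutoff are choices made here for illustration;
# replicate() simply re-evaluates its expression the given number of times.
nsim=1000
pvals=replicate(nsim, t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value)
length(pvals)
mean(pvals<0.05) # proportion of simulated data sets rejecting at the 0.05 level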
# Use help.start() to get extensive help.
# Use help(some.function) to see if a function exists and get brief help.
help.start() # Search Engine & Keywords is very useful
help(rnorm)
# Use q() to quit. Normally you answer yes to "Save workspace image".
#################################
####### Practice problems #######
#1) Find the mean and variance of 3.5, 12.0, 8.6 and 4.4. Find the s.d. without using
# the function sd().
#2) Let vec1=rnorm(100,mean=10,sd=2). Check its length. Make a single expression that
# pulls out those values bigger than 12. Make an expression that counts them.
# Repeat for counting how many are between 8 and 12, and between 6 and 14.
#3) Make a vector "vec2" equal to the centered version of vec1. Verify that it has
# mean 0 and sd matching vec1. Make a vector "vec3" equal to the "Z-score" for vec1
# by dividing vec2 by the sd of either vector. Verify that the variance is 1.0.
#4) Print out the values of vec3 that are bigger than the square root of 3.
#5) Find the minimum and maximum of each of the 3 vectors.
#6) Make the smallest expression that can pull out elements 10, 11, 12, 88, 89, 90, and
# 91 from one of the vectors.
#7) Print out various parts of this expression to see if you can figure out what it is
# doing.
sum((-4:4)>sqrt(4:6))
#8) Figure this one out.
x=(1:100)[c(FALSE,TRUE)] # not the easiest way to do this
x=x[floor(x/3)*3==x]
x=x[floor(sqrt(x))==sqrt(x)]
x
#9) Figure this one out.
p=c(t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value,
t.test(rnorm(8),rnorm(8))$p.value)
length(p)
mean(p)
range(p)
#10) Figure this one out. Use help() as needed.
n=500 # or set to any other number
e=numeric(n) # creates a numeric vector of length n filled with zeros
e=rep(NA,n) # my preferred way to better detect errors
# for loops will be discussed another time
for (i in 1:n) e[i]=diff(t.test(rnorm(10+i),rnorm(10+i))$estimate)
plot(e, type="l")