# Learning R, Lesson 1
# H. Seltman 1/23/07

# What is R and what are its strengths and weaknesses?
# R is a free, popular, reliable, extensible, interpreter-based programming, graphics and statistics
# package.  It is an implementation of "S" available for all standard computer operating systems.
# Unlike the expensive, commercial implementation, S-PLUS, R is not menu based. It can perform
# standard statistical analyses easily, can be programmed for simple or complex non-standard
# analyses, and is often the first place new statistical procedures are available.  It has many
# graph options, and can be programmed for essentially any graphing task.  It is reasonably quick
# for an interpreted language.  For new complex tasks, it is often used as a prototyping system
# before porting to C or Fortran.   The command line interface is good for trying quick ideas, while
# the use of script files is highly recommended for creating documented and reusable code.  Very
# large datasets are not well handled.  Although there are no menus, it is usually possible to learn
# a small subset of R, then learn individual additional parts as needed.

# Everything in R is an "object", named or unnamed.   Objects created in the "workspace" can be
# saved to disk when you quit R.  Objects range from a single number to large datasets to a set of
# results from a complex analysis of data.

# The two main types of objects are language elements and data.  Language elements include
# functions, expressions and formulas.  Data objects are all vectors of various types (modes), even
# if of length one.  The main modes are numeric, character (string), logical (TRUE/FALSE), and list
# (each element of which is an object of any type).  Every object has a mode and a length.  All of
# the data modes allow missing elements coded as NA which have the same mode as the rest of the
# vector and contribute to its length.  A  special object called NULL is a zero length object
# with mode NULL (essentially not yet determined).  Objects may have additional properties called 
# "attributes", each of which is a named object.  Most attributes have corresponding functions to
# read and set them for any object.
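
# A minimal sketch of the ideas above.  The names "v" and "w" are made up for
# illustration only and are not used elsewhere in this lesson.

v=c(1.5,NA,3)            # NA marks a missing element
mode(v)                  # "numeric": the NA takes the mode of the rest of the vector
length(v)                # 3: the NA counts toward the length
is.na(v)                 # FALSE TRUE FALSE
mode(NULL)               # "NULL"
length(NULL)             # 0
w=1:3
attributes(w)            # NULL: no attributes yet
names(w)=c("a","b","c")  # names() reads or sets the "names" attribute
attributes(w)            # now a list holding the "names" attribute
attr(w,"names")          # attr() reads any attribute by name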

# A special attribute called "class" is heavily used in R for "object-oriented programming".
# The class of any simple data structure or a function is its mode.  More complex objects
# have special classes such as "factor", "matrix", "array", "data.frame", "lm" (linear
# model), etc.  The classes allow one function, such as print() or summary(), to behave
# differently and appropriately for different classes of objects.
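
# As a small illustration of class-based behavior, the same call to summary()
# gives different kinds of output below.  The names "num" and "fac" are
# hypothetical, and factor() is not otherwise covered in this lesson.

num=c(1,2,2,3)
fac=factor(c("a","b","b","c"))
class(num)     # "numeric"
class(fac)     # "factor"
summary(num)   # numeric summary: minimum, quartiles, mean, maximum
summary(fac)   # counts for each level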


# When you feed "commands" to R, the commands are all "expressions" (including function evaluations)
# that are evaluated, with results optionally printed (normally on the screen) and/or assigned
# to a named object.  Object names are case sensitive; should start with a letter or period;
# may contain letters, numbers, periods and underscores; and cannot contain a dash, space, etc.
# Meaningful names are recommended.  Everything after a "#" character is ignored.

# The R prompt is ">".  If you enter a syntactically incorrect command, you'll get an error
# and can try again.  If you enter an incomplete command, you'll get a "+" prompt and will
# need to complete the command before continuing.  Generally this involves adding one or more
# right parentheses (or quotes if unmatched quotes were entered).

# Assignment can be done with an equal sign, in which case nothing is printed.  Typing
# a non-assignment expression is an implicit call to the print() function.  Like most
# calculators and languages, multiplication and division have higher precedence than
# addition and subtraction.  Use parentheses as needed.
# The following are "reserved" names which you cannot use: if else repeat while function for in
# next break TRUE FALSE NULL NA Inf NaN.  The following have special meaning and should be
# avoided as names: c, q, s, t, C, D, F, I, T, diff, length, mean, pi, range, rank, time,
# tree and var.  Use of these may cause programs to fail or (rarely) give wrong answers.

3+5*2
print(3+5*2) # apply function "print" to expression "3+5*2"
(3+5)/2
x=3+5*2
y=(3+5)/2
x
y
x-y
x^y
length(x)  # apply function "length" to object "x"
mode(x)
class(x)
sqrt(y)
log(100)
log10(100)


# Two common ways to create a vector are the concatenate function, c(), and the
# sequence operator, ":".  Standard arithmetic operators work on each element of
# a vector or pair of vectors.  You will often want to use parentheses with the
# sequence operator because it has high precedence.

x=c(5,2,10,3.5,1,10)
x
y=2:7
y
x^2
x+y
x/y
z=4
2:z+1   # see next section for what is happening here
2:(z+1) # this is probably what you wanted


# The "recycling rule" applies when a two-vector operation is applied to vectors of
# unequal length.  This may be intended or unintended, the latter sometimes giving
# a warning.  Here's how the rule works: if a vector is too short to match the other
# vector, it is replicated as needed.  If fractional replication is needed (the longer
# length is not a multiple of the shorter), a warning is given.

y
y+1
y+c(1,2)
y+1:4
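
# To make the recycling explicit, rep() can write out the replicated vector by
# hand.  This is only a sketch of the effect, not of how R does it internally.

y=2:7                     # same y as above
y+c(1,2)                  # recycled: 1,2,1,2,1,2 is added
y+rep(c(1,2),times=3)     # same result with the replication written out
y+rep(1:4,length.out=6)   # same values as y+1:4, but without the warning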


# Functions may be built-in, loaded from public "packages", or be user-written.  They
# have complex "calling rules" to be discussed later, but are intuitive for the simple
# situations.  Here are some functions that take a vector argument and return a
# vector of length one.

mean(x)
mean(3,5) # Danger!! Not what you wanted!! Returns 3: only the first argument is the data
mean(c(3,5))
sd(x)
var(x)
median(x)
sum(x)
prod(x) # product


# Here are some functions that (somewhat confusingly) take several vectors, concatenate
# them together and return a length 1 or length 2 vector.

min(x)
min(9,7,5)
min(9,c(7,5))
max(x)
range(x)


# Here are some functions that return results of the same length as the arguments.  Note
# that "p" means "parallel" (i.e., element by element, here).

x
cumsum(x)  # cumulative sum
cumprod(x)
y
pmin(x,y)
pmax(x,y)
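
# The contrast between min() and pmin() is easiest to see on two short vectors.
# The names "a" and "b" are invented for this sketch.

a=c(1,8,3)
b=c(4,2,6)
min(a,b)    # 1: one overall minimum of all six values
pmin(a,b)   # 1 2 3: the smaller value at each position
pmax(a,b)   # 4 8 6: the larger value at each position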


# Comparison and logical operators work elementwise and produce TRUE or FALSE (or NA) at each element.

y==3  # double equal is comparison, not assignment
y!=3  # not equal to
y<3
y<=3
y>3 & y<5   # "and"
y<3 | y>5 # "or"
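
# Logical vectors are often summarized rather than printed.  A brief sketch,
# using the same y (in arithmetic, TRUE counts as 1 and FALSE as 0):

y=2:7        # same y as above
sum(y>3)     # 4: how many elements are greater than 3
any(y>6)     # TRUE: at least one element is greater than 6
all(y>1)     # TRUE: every element is greater than 1
which(y>5)   # 5 6: positions (not values) of the TRUE elements
!(y>3)       # "not"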


# Subsetting of vectors is used often, and has many forms, all of which involve
# use of square brackets.  First consider indexing by a vector of positive numbers:

y
y[4]
y[4:5]
y[4:9]  # positions past the end of the vector give NA
y[c(1,3,5)]
y[c(6,5,4,4,5,6)]


# Negative numbers drop elements:

y
y[-4]
y[c(-2,-5)]
y[c(-5,-2)]
y[-c(2,5)]
y[-12] # out-of-range negative indices are silently ignored (no error)


# Logical indices must match in length or be recycled.

y
y[c(TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)]
y[c(TRUE,FALSE)]
y[c(TRUE,FALSE,FALSE)]
x
x<10
x[x<10]
x>mean(x)
y[x>mean(x)]
tmp=y[x>mean(x)]
tmp*2


# Vectors can have a "names" attribute with names for each element.  The names
# can be used for subsetting.

x
LETTERS # built in variable of 26 single character strings
1:length(x)
seq(along=x)  # preferred form to be discussed later
LETTERS[seq(along=x)]
names(x)=LETTERS[seq(along=x)]  # attribute setting functions go on the left of the "="
x
x["E"]
x[c("F","B","D")]


# Names can be entered in the c() command.

y=c(one=1, 2:5, six=6, last=7)
y
y["six"]
y[c(3,6)]


# A practical example: two-sample t-test (with correction for unequal variance)
x1=c(2,5,3,4,8,10)
x2=c(5,4,6,2,7,5,9,6)
length(x1); length(x2)  # semicolon allows multiple expressions on a line
mean(x1); mean(x2); sd(x1); sd(x2)
t.test(x1,x2)
rslt=t.test(x1,x2)
print(rslt)   # print() is optional (except in scripts)
names(rslt)
mode(rslt)   # list elements can be of different modes
rslt[c(1,8)] # subsetting lists with [] gives a list (check names(rslt) for the positions)
mode(rslt[c(1,8)])
mode(rslt[1])
rslt[[1]]    # subsetting with [[]] gives the element itself, not a list
mode(rslt[[1]])
rslt[[8]]
mode(rslt[[8]])
rslt$method # lists (not vectors) may be subsetted using the $ shortcut
rslt$p.val  # abbreviations may be used, but may be dangerous
rslt$p      # abbreviations must be unique; this one is ambiguous and gives NULL
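
# The same subsetting rules can be tried on a small hand-made list.  This is an
# illustrative sketch; "lst" and its element names are invented here.

lst=list(label="demo", values=c(2,4,6), flag=TRUE)
names(lst)
lst[2]          # single brackets: a list of length one
mode(lst[2])    # "list"
lst[[2]]        # double brackets: the numeric vector itself
mode(lst[[2]])  # "numeric"
lst$values      # same as lst[["values"]]
lst$val         # partial matching works because "val" matches only one name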


# Power demo using rnorm() to generate random samples
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value
t.test(rnorm(10,mean=1,sd=1),rnorm(12,mean=2,sd=1))$p.value


# Use help.start() to get extensive help.
# Use help(some.function) to see if a function exists and get brief help.
help.start() # Search Engine & Keywords is very useful
help(rnorm)


# Use q() to quit.  Normally you answer yes to "Save workspace image".


#################################
####### Practice problems #######

#1) Find the mean and variance of 3.5, 12.0, 8.6 and 4.4.  Find the s.d. without using
#   the function sd().

#2) Let vec1=rnorm(100,mean=10,sd=2).  Check its length.  Make a single expression that
#   pulls out those values bigger than 12.  Make an expression that counts them.
#   Repeat for counting how many are between 8 and 12, and between 6 and 14.

#3) Make a vector "vec2" equal to the centered version of vec1.  Verify that it has
#   mean 0 and sd matching vec1.  Make a vector "vec3" equal to the "Z-score" for vec1
#   by dividing vec2 by the sd of either vector.  Verify that the variance is 1.0.

#4) Print out the values of vec3 that are bigger than the square root of 3.

#5) Find the minimum and maximum of each of the 3 vectors.

#6) Make the smallest expression that can pull out elements 10, 11, 12, 88, 89, 90, and
#   91 from one of the vectors.

#7) Print out various parts of this expression to see if you can figure out what it is
#   doing.
   sum((-4:4)>sqrt(4:6))

#8) Figure this one out.
   x=(1:100)[c(FALSE,TRUE)]  # not the easiest way to do this
   x=x[floor(x/3)*3==x]
   x=x[floor(sqrt(x))==sqrt(x)]
   x

#9) Figure this one out.
   p=c(t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value,
       t.test(rnorm(8),rnorm(8))$p.value)
   length(p)
   mean(p)
   range(p)

#10) Figure this one out.  Use help() as needed.
    n=500  # or set to any other number
    e=numeric(n)  # creates a numeric vector of length n filled with zeros
    e=rep(NA,n)   # my preferred way to better detect errors
    # for loops will be discussed another time
    for (i in 1:n) e[i]=diff(t.test(rnorm(10+i),rnorm(10+i))$estimate)
    plot(e, type="l")