##############################################################
#
# HANDOUT TO ACCOMPANY RWG CH 1 ON VARIABLE DISTRIBUTIONS AND
# EXPLORATORY DATA ANALYSIS
#
# BRIAN JUNKER   SEP 3, 2001
#
##############################################################

We are going to begin on about p 6 of the "Introduction to Splus"
handout that we worked through part of in the Wean cluster last week.
Then we are going to explore the concord data set in more detail,
using

 - some of Splus's built-in EDA features

 - some specially-written EDA functions for Splus that more closely
   match the material shown in RWG Chapter 1

GETTING STARTED (AND STOPPED)
=============================

To start up Splus in UNIX, we type

   % Splus -e

at the UNIX prompt. The "-e" allows us to edit command lines as we
type them. (Alternatively, you could write commands in a text editor
and cut-and-paste them into the Splus command line.)

The two most useful functions for reading in data are

   read.table()
   scan()

If you wanted to learn more about one of these you would say

   help(read.table)
   help(scan)

at the Splus prompt. In most UNIX versions of Splus, if you wanted
the help to appear in a separate window you would type

   help(read.table, window=T)

and if you want a menu of help topics to choose from you would type

   help.start()

or

   help.start(gui="motif")

Oh, yes, and while I am thinking about it, the way to get out of
Splus is to type

   q()

THE CONCORD DATA
================

The concord data set comes from Hamilton's "Regression with Graphics"
text (RWG, p.2):

   During the 1970's, the city of Concord, New Hampshire, experienced
   a growing demand for water, despite a roughly stable population.
   In late 1979 this rising demand, together with unusually dry
   weather, led to a shortage in water supply. In late summer 1980,
   as the shortage worsened, the Concord Water Department and
   municipal officials began a campaign (described by one observer
   as a "media blitz") to persuade citizens to use less water. Over
   the next year, water use declined by about 15%, which was
   naturally interpreted as evidence of the conservation campaign's
   success.

An overall decrease of 15% does not mean that everyone decreased
their water usage by 15%: some citizens might have decreased their
demand by more or less than 15%, and some might even have increased
their usage. The 1981 Concord Water Survey examined such variations
in water savings. Questionnaires were distributed to a random sample
of Concord households, asking about demographic characteristics,
opinions, and water conservation behavior. These questionnaires were
then matched with Water Department records (from meter readings) of
the amount of water each household had actually used during the
summers of 1979, 1980 and 1981, two years before and one year after
the water conservation campaign.

The Concord data set consists of information on 496 households and
how they responded to the scarcity of a basic natural resource, and
to the campaign by public officials to conserve it. The data look
like this (see concord1.dat):

   case water81 water80 water79 income educat retire peop81 cpeop peop80
 1    5    2300    3500    3500     35     18     no      3     0      3
 2    6    2500    4600    4100     60     20     no      5     0      5
 3    7    4000    3800    5500     30     16     no      6     1      5
 4    8     500    1500    1700     20     11    yes      1    -1      2
   [ etc, for 496 lines ]

The variables are (see concord1.txt):

   1. case      case ID number
   2. water81   Summer 1981 Water Use
   3. water80   Summer 1980 Water Use
   4. water79   Summer 1979 Water Use
   5. income    Income in Thousands
   6. educat    Education in Years
   7. retire    Head of house retired?
   8. peop81    # of People Resident, 1981
   9. cpeop     Increase in # of People
  10. peop80    # of People Resident, 1980

READING IN THE CONCORD DATA
===========================

Since the data set has the variable names in the first line (see
concord1.dat) we can tell Splus's "read.table" command to use this
"header line" to name the variables.

> conc _ read.table("concord1.dat", header=T)

We can print out a few lines of the concord data set using some of
the special matrix indexing notation that we learned about in lab
last week:

> conc[1:10,]
   case water81 water80 water79 income educat retire peop81 cpeop peop80
 1    5    2300    3500    3500     35     18     no      3     0      3
 2    6    2500    4600    4100     60     20     no      5     0      5
 3    7    4000    3800    5500     30     16     no      6     1      5
 4    8     500    1500    1700     20     11    yes      1    -1      2
 5    9    4400    4300    3700     50     20     no      2     0      2
 6   10    1900    1500    1600     18     13     no      5     0      5
 7   11    3600    3500    3500     25     14    yes      3     0      3
 8   12    2200    2400    4900     21     20    yes      2     0      2
 9   13    2300    1900    2900     13     11     no      4     0      4
10   14    2200    2500    3900     30     20     no      2     0      2

DIGRESSION

If the variable names had not been there, we could have read the
data without the "header" argument, and then used the "names"
function to assign variable names. To illustrate, I created the data
file concord0.dat, which is identical to concord1.dat except the
header line is missing:

> conc _ read.table("concord0.dat")
> conc[1:3,]
  V1   V2   V3   V4 V5 V6  V7 V8 V9 V10
1  5 2300 3500 3500 35 18  no  3  0   3
2  6 2500 4600 4100 60 20  no  5  0   5
3  7 4000 3800 5500 30 16  no  6  1   5
> names(conc) _ c("case", "water81", "water80", "water79", "income",
+                 "educat", "retire", "peop81", "cpeop", "peop80")
> conc[1:3,]
  case water81 water80 water79 income educat retire peop81 cpeop peop80
1    5    2300    3500    3500     35     18     no      3     0      3
2    6    2500    4600    4100     60     20     no      5     0      5
3    7    4000    3800    5500     30     16     no      6     1      5

END OF DIGRESSION

Recall that "read.table" reads the data into an Splus data structure
called a "data frame". There are two ways to think of a data frame:

   1. as a matrix where different columns can have different data
      types [e.g. numerical, as in "water81", and text, as in
      "retire"];

   2. as a list of vectors, where each vector is one column (one
      variable) in the data set.

To see the "list" nature of a data frame, we can try:

> as.list(conc[1:3,])
$case:
[1] 5 6 7

$water81:
[1] 2300 2500 4000

$water80:
[1] 3500 4600 3800

 .
 .    [ I have omitted some of the output to save space in this file ]
 .

$cpeop:
[1] 0 0 1

$peop80:
[1] 3 5 5

Most of the time it is easiest to think of a data frame as a matrix
[meaning 1. above], but for some Splus functions, thinking of the
"list" nature of a data frame is better.

Two general pieces of information about a data frame are

 * its dimensions as a matrix, and

 * the names of its variables (columns):

> dim(conc)
[1] 496  10
> names(conc)
[1] "case"    "water81" "water80" "water79" "income"  "educat"  "retire"
[8] "peop81"  "cpeop"   "peop80"

To get used to the different ways of referring to data in a data
frame, consider

   conc$water80[1:8]
   conc[1:8, 3]
   conc[1:8, "water80"]

These all refer to the same 8 numbers in "conc", namely

[1] 3500 4600 3800 1500 4300 1500 3500 2400

DATA SUMMARIES AND GRAPHS
=========================

Medians, boxplots, and other summaries of variables are easy to
obtain in Splus, for example:

   median(conc$water80)
   summary(conc$water80)
   boxplot(conc$water80 - conc$water81, style.bxp="old")
   boxplot(conc$water79, conc$water80, conc$water81, style.bxp="old")
   median(conc$water79)
   summary(conc$water79)

Something is wrong with water79! Examine that variable.
> conc$water79
  [1]  3500  4100  5500  1700  3700  1600  3500  4900  2900  3900  4600  5200
 [13]  5400 14500  2800  2000  1300  1800     .  1200  1700  3700  3000  1000
 [25]  6200  1600  2900  1000  1300   700  2800  2800  1100   500     .  2500

  [ many lines omitted to save space ]

[445]  4100  3500     .  3100  4200  4600  7100  4700  3100  2400  6000  3300
[457]  2600  4300  3800   700     .     .  1800  1700  7000  7300  3500     .
[469]     .     .     .  6300  4200  8100   600  1800     .     .     .     .
[481]  4300     .  3000  7300  1700     .     .     .  4500  1800     .     .
[493]     .     .     .     .

Note the periods, which shouldn't be there. Technically, read.table()
saw a mix of numbers and non-numeric items (periods) in that column
of the file and created a categorical "factor" with 82 levels
corresponding to the 82 different values in the column. That is why
the boxplot only goes up to 82. We will learn about factors later in
the course. For now, we can re-read the file, instructing Splus to
treat periods as missing data:

> conc _ read.table("concord1.dat", header=T, na.strings=".")

and we can recheck the variable that was giving us problems before
with

> conc$water79
  [1]  3500  4100  5500  1700  3700  1600  3500  4900  2900  3900  4600  5200
 [13]  5400 14500  2800  2000  1300  1800    NA  1200  1700  3700  3000  1000
 [25]  6200  1600  2900  1000  1300   700  2800  2800  1100   500    NA  2500

  [ many lines omitted to save space ]

[445]  4100  3500    NA  3100  4200  4600  7100  4700  3100  2400  6000  3300
[457]  2600  4300  3800   700    NA    NA  1800  1700  7000  7300  3500    NA
[469]    NA    NA    NA  6300  4200  8100   600  1800    NA    NA    NA    NA
[481]  4300    NA  3000  7300  1700    NA    NA    NA  4500  1800    NA    NA
[493]    NA    NA    NA    NA

Several statistics commands have the optional argument "na.rm" to
remove the NA's when calculating their result:

> median(conc$water79)
> median(conc$water79, na.rm=T)

The "boxplot" command already knows to remove NA's before plotting;
now the plots we made before make more sense:

> boxplot(conc$water79, conc$water80, conc$water81, style.bxp="old")
> summary(conc$water79)

and here is a prettier version [see help(boxplot) and help(title) to
understand some of the arguments used to dress up the plot]:

> boxplot(conc$water79, conc$water80, conc$water81,
+         names=c("1979","1980","1981"), ylab="Cubic Feet of Water",
+         main="Concord Water Study", style.bxp="old")

I have saved this picture in the file "water-use-boxes.ps".

DIGRESSION

Sometimes we want to save those pretty graphs for printing later, or
for including in report writeups, etc. There are many ways to do
this in Splus, but the easiest by far is to use the function "ps()",
which we can download from the Splus area of
www.stat.cmu.edu/~brian/401. After downloading the file "ps.q" we
can type

> source("ps.q")     # you only have to do this once, since Splus
                     # saves data and function definitions until
                     # you "rm()" them!
> ps("pretty-graph")

or, if you want to both save the file and print it, you can type

> ps("pretty-graph", printit=T)   # or just ps("pretty-graph", T)

In either case, the graph is saved as "pretty-graph.ps", a graphics
file in "PostScript" format. This is a printer-friendly format that
can be viewed and printed with "ghostview" in UNIX and with "gsview"
in Windows (these programs are installed on public cluster machines
on campus). PostScript files can also be incorporated in documents
created with LaTeX (more later in the course) and MS-WORD (I am not
a WORD devotee, but I will try to figure out how to do this later in
the course).

END OF DIGRESSION

SPLUS IS ALSO A PROGRAMMING LANGUAGE
====================================

A "for" loop is a statement of the form

   for (argument in range) { ... }
It causes the following statement (or statements, if the curly
braces are used) to be repeated with the argument taking on each of
the values in the range.

   for (i in 2:4) print(median(conc[,i], na.rm=T))

   for (n in 1:10) print(c(n, n^2, n^3))

   for (j in 1:ncol(conc)) {
      cat(names(conc)[j], ":\n")
      print(summary(conc[,j]))
      cat("\n")
   }

The next one also uses the "if (condition) { ... }" statement; it is
a bit harder if you haven't studied programming, but still very
useful:

   for (J in 1:ncol(conc)) {
      if (is.numeric(conc[,J])) {
         cat(names(conc)[J], ":")
         cat(" Median =", median(conc[,J]))
         cat(" Mean =", round(mean(conc[,J]), 3), "\n")
      }
   }

An alternative to "for" loops for data frames is lapply(). (There is
also apply() for matrices.) It is much quicker than a "for" loop, so
consider using it when you have a slow "for" loop.

   lapply(conc, is.numeric)

The output is too spread out; one way to fix it is to use "unlist"
to collapse the elements of the list into a single vector:

   unlist(lapply(conc, is.numeric))

Here are some more examples (column 7, "retire", is omitted since it
is not numeric; it is text, "yes" or "no"):

   unlist(lapply(conc[, c(2:6, 8:10)], median))
   unlist(lapply(conc[, c(2:3, 5:6, 8:10)], median))

Notice how some NA's in water79 ruin the calculation of the median.
For many functions in Splus, this can be fixed by telling the
function to ignore or omit NA's in the calculation:

   unlist(lapply(conc[, c(2:6, 8:10)], median, na.rm=T))
   round(unlist(lapply(conc[, c(2:6, 8:10)], mean, na.rm=T)), 2)

There are lots of parentheses in that last example! The simplest way
to "decode" all that is to read the command from the inside out.
This takes some practice, but will pay off in the end. In fact it is
sometimes helpful to execute each piece, starting from the inside,
to see what is there:

   conc[, c(2:6, 8:10)]
   lapply(.Last.value, mean, na.rm=T)
   unlist(.Last.value)
   round(.Last.value, 2)

[Note the use of .Last.value to recall the value of the last
calculation without having had to assign it to a variable name!]

We can create new functions in Splus as well. One example is the
"ps" function that we used above to save a PostScript version of the
parallel boxplots. But it is not hard to make functions of your own.
For example, Splus (weirdly!) doesn't have a function that computes
the standard deviation of a variable, but it does have one that
computes the variance. We can fix that by defining a new function:

   sd _ function (x) {
      return(sqrt(var(x)))
   }

A problem with this function is that (again, weirdly!) var does not
have a "na.rm=T" option to remove NA's. For example,

> sd(conc$water79)
> var(conc$water79)
> var(conc$water79, na.rm=T)

To fix this, we can redefine "sd" to omit lines with NA's if we want
it to:

   sd _ function(x, na.rm=F) {
      x0 _ x
      if (na.rm==T) x0 _ na.omit(x)
      return(sqrt(var(x0)))
   }

Let's try it:

> sd(conc$water79)
Problem in .C("S_Var1_NA",: There are 47 missing value(s) in x
and/or y passed to cor or var with na.method="fail". See the help
file for other options for handling missing values., while calling
subroutine S_Var1_NA
Use traceback() to see the call stack
> sd(conc$water79, na.rm=T)
1895.684

Another weird "feature" of Splus is that the "summary" function
doesn't include the SD, though it includes the 5-number summary and
the mean:

> summary(conc$water79)
 Min. 1st Qu. Median Mean 3rd Qu.  Max. NA's
  200    1700   2500 2974    3800 14500   47

So let's make a new summary function, and add to it both the SD and
the IQR for the data:

   mysummary _ function (x, na.rm=F) {
      if (na.rm==T) {
         x0 _ na.omit(x)
      } else {
         x0 _ x
      }
      xsumm _ quantile(x0)    # gives the standard 5-number summary
      xiqr  _ xsumm[4] - xsumm[2]
      xsumm _ c(xsumm[1:3], xiqr, mean(x0), sd(x0), xsumm[4:5])
      names(xsumm) _ c("Min", "Q1", "Median", "IQR", "Mean",
                       "Std Dev", "Q3", "Max")
      return(xsumm)
   }

Let's try it:

> mysummary(conc$water79)
Problem in quantile(x0): Missing values and NaN's not allowed if
na.rm=F
Use traceback() to see the call stack
> mysummary(conc$water79, na.rm=T)
 Min   Q1 Median  IQR     Mean  Std Dev   Q3   Max
 200 1700   2500 2100 2974.165 1895.684 3800 14500

EXPLORING THE VARIABLES IN THE CONCORD DATA SET
===============================================

Now we will look at the sample distributions of the variables in the
Concord data set more closely. Before producing them, I will show
you a trick that saves typing. If you "attach" a data frame, like
this:

> attach(conc)

the columns become individual variables (vectors) that you can
access directly. When you are done, "detach()":

> detach()

So...

> attach(conc)
> names(conc)
[1] "case"    "water81" "water80" "water79" "income"  "educat"  "retire"
[8] "peop81"  "cpeop"   "peop80"

> stem(water79)

Removed 47 NAs

N = 449   Median = 2500
Quartiles = 1700, 3800

Decimal point is 3 places to the right of the colon

   0 : 2334444
   0 : 555556667777888888888
   1 : 0000000000000001111111111222223333333333333444444444
   1 : 555555555556666666666666777777777777777778888888888888999999999999
   2 : 00000000000000000001111111111111122222222222233333333333334444444444
   2 : 55555555555555666666666667777777777888888888889999999
   3 : 0000000011111111111111222222333333333444444
   3 : 55555555666667777777888888888889999999
   4 : 00000111122222223333444
   4 : 555666666677778999
   5 : 0000001122234444
   5 : 5556677779
   6 : 00001222344
   6 : 556788
   7 : 01233
   7 : 6678

High: 8100 8300 8600 8700 11500 11800 12600 14500

> stem(income)

N = 496   Median = 21
Quartiles = 15, 30

Decimal point is 1 place to the right of the colon

   0 : 22333
   0 : 455555555555555555555555555555555
   0 : 66667777
   0 : 8888899
   1 : 0000000000000000000000000000000000000000000000111
   1 : 22223333333
   1 : 444444555555555555555555555555555555555555555555555555555555555
   1 : 77777
   1 : 888888999
   2 : 00000000000000000000000000000000000000000000000000000000011111
   2 : 222333333
   2 : 444455555555555555555555555555555555555555555555555555555555555555555555x
   2 : 66667
   2 : 8899
   3 : 00000000000000000000000000000000000000000000000000001
   3 : 22333
   3 : 555555555555555555555555555
   3 : 677
   3 : 89
   4 : 0000000000000001
   4 :
   4 : 5555555555555555555555
   4 :
   4 :
   5 : 00000000
   5 : 233
   5 : 5
   5 :
   5 : 8
   6 : 000

High: 68 72 90 100

> stem(peop80)

N = 496   Median = 3
Quartiles = 2, 4

Decimal point is at the colon

   1 : 0000000000000000000000000000000000000000000000000000000000000000
   1 :
   2 : 000000000000000000000000000000000000000000000000000000000000000000000000x
   2 :
   3 : 000000000000000000000000000000000000000000000000000000000000000000000000x
   3 :
   4 : 000000000000000000000000000000000000000000000000000000000000000000000000x
   4 :
   5 : 0000000000000000000000000000000000000000000000000000
   5 :
   6 : 0000000000000000000
   6 :
   7 : 000000000
   7 :
   8 : 000000

High: 9 10 10 10

> stem(retire)
Problem in Summary.factor(x, na.rm = na.rm): A factor is not a
numeric object
Use traceback() to see the call stack

As you can see, the stem and leaf diagrams are useful for some
things but not all.
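For a categorical variable like "retire", a frequency table is a
more natural summary than a stem and leaf diagram. (This aside is
mine, using only Splus's built-in table() function.)

> table(retire)              # counts of "no" and "yes"
> table(retire, peop81)      # cross-tabulated with household size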
As an alternative to the stem and leaf diagrams, we could make
histograms of each of the numeric variables in the data set. Splus
has a nice histogram function, "hist()", that can do this:

> motif()    # may be needed if you don't already have a graphics
             # window open in Splus
> par(mfrow=c(3,3))
> for (i in names(conc)[-1]) hist(get(i), xlab=i)

[Alternatively we could have typed:

   for (i in 2:10) hist(conc[,i], xlab=names(conc)[i])            ]

However, RWG likes to superimpose a normal curve over each
histogram, to get a feel for issues like

 - symmetry vs skewed left/right
 - multiple modes
 - heavy vs light tails

etc. We can download the function "ndhist" from the Splus area of
the 401 web site to get plots like these:

> source("ndhist.q")
> par(mfrow=c(3,3))
> for (i in names(conc)[-c(1,7)]) ndhist(na.omit(get(i)), xlab=i)

I have saved this plot in "all-histograms.ps".

We looked at some boxplots before; here are some more:

> par(mfrow=c(3,3))
> for (i in names(conc)[-c(1,7)]) boxplot(get(i), style.bxp="old")

I have saved this plot in "all-boxplots.ps".

In some cases we are interested in comparing values of a continuous
variable (like "water79" or "income") in different groups identified
by a categorical variable (like "retire"). This can be accomplished
by combining "boxplot()" with "split()" as follows:

> boxplot(split(income, retire), style.bxp="old")

If the variables are already separated, we need only list them for
boxplot, and not bother with "split()":

> boxplot(water79, water80, water81, style.bxp="old",
+         names=c("1979","1980","1981"))

DIGRESSION

You may remember that the basic idea of a boxplot is something like
this:

                 +---+-----+
   * * *  -------|   |     |-------------    *          o
                 +---+-----+
  OUTLIERS      Q1  MED    Q3   WHISKER   OUTLIER    EXTREME
                                                     OUTLIER

The simplest way to make a boxplot by hand is to

 * sort the data using a stem and leaf plot;

 * compute the first and third quartiles and the median, and plot
   the box;

 * compute the value IQR = Q3 - Q1 (this is the width of the box);

 * compute the
    - inner fences at Q1 - (1.5)*IQR and Q3 + (1.5)*IQR
    - outer fences at Q1 - (3.0)*IQR and Q3 + (3.0)*IQR

 * plot the whiskers out to the lowest and highest values inside the
   inner fences;

 * decorate the outliers according to whether they fall inside or
   outside each set of fences.

Tukey, who invented boxplots (and stem and leaf diagrams, and many
other useful EDA plots), has rather folksy names for the versions of
quartiles he likes to calculate: he calls Q1 and Q3 the lower and
upper hinges, and the IQR the hinge spread.

END OF DIGRESSION

SYMMETRY PLOTS AND QUANTILE-NORMAL PLOTS
========================================

Although the above plots are useful for getting a general feel for
the sample distribution of the data, often we want to focus on
either

 * whether the data distribution is approximately symmetric, or

 * whether the data distribution is approximately normally
   distributed.

RWG suggests several useful plots for this purpose. We show how to
make them in Splus here.

First we consider *just* the question of symmetry. If a data set is
perfectly symmetrical about its median, like this,

   10 20 30 40 50 60 70 80 90

what this means is that the absolute distances from the median match
up:

   40 30 20 10  0 10 20 30 40

so if we plot the distances "below" the median against the distances
"above" the median, we will get a straight line (the line y = x,
actually!). This is the idea behind RWG's "symmetry plot".
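The symplot() function that we download next is the version to use
for coursework, but to fix the idea, here is a minimal sketch of how
such a function might be written (the name "symplot0" and the
details are mine, not from symplot.q):

   symplot0 _ function(x, ...) {
      x _ sort(x[!is.na(x)])       # drop NA's and sort
      n _ length(x)
      m _ median(x)
      i _ 1:floor(n/2)             # pair i-th smallest with i-th largest
      plot(m - x[i], x[n + 1 - i] - m,
           xlab="Distance below median",
           ylab="Distance above median", ...)
      abline(0, 1)                 # reference line y = x
   }

For a perfectly symmetric sample the points fall on the line y = x;
points drifting above the line suggest a longer upper tail (right
skew), and points below the line a longer lower tail (left skew).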
We can make a symmetry plot by downloading "symplot.q" from the 401
website, and using "source()" to define it in Splus for us:

> source("symplot.q")
> par(mfrow=c(1,1))
> symplot(c(10,20,30,40,50,60,70,80,90))

> par(mfrow=c(2,2))
> x _ rnorm(100); ndhist(x); symplot(x)
> x _ rchisq(100,4); ndhist(x); symplot(x)

We can see how symmetric the water usage data are by making symmetry
plots. We also make histograms to help us interpret the plots:

> par(mfrow=c(3,2))
> for (i in c("water79","water80","water81")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    symplot(na.omit(get(i)), main=i)
+ }

> par(mfrow=c(2,2))
> for (i in c("income","educat")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    symplot(na.omit(get(i)), main=i)
+ }

> par(mfrow=c(3,2))
> for (i in c("peop80","cpeop","peop81")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    symplot(na.omit(get(i)), main=i)
+ }

Another type of plot that is sometimes useful is the "quantile
plot", which can be used

 * to assess whether data are uniformly distributed; and

 * to read off "quantiles" or percentiles from the plot.

DIGRESSION: Quantiles are percentiles expressed as fractions instead
of percents. So: the 0.25 quantile = the 25th percentile = Q1.
END OF DIGRESSION

To make a quantile plot, we sort the data and plot each value
against its quantile. The function "qplot.q" from the 401 website
does this:

> source("qplot.q")

If the plot lies along a straight line, the data are approximately
uniform. Otherwise, not. Here are some examples:

> par(mfrow=c(3,2))
> for (i in c("water79","educat","case")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    qplot(na.omit(get(i)), main=i)
+ }

I have saved the last plot as "qplots.ps". This last plot shows that
"case" (the case number assigned to each household in the survey) is
indeed almost perfectly uniformly distributed. Why??

Finally, the most often used plot of this type in applied statistics
is the "quantile-normal plot". To make a quantile-normal plot we
sort the data, and then plot each data point against the
corresponding quantile of the normal distribution. Splus has good
facilities for making quantile-normal plots (a.k.a. "normal
probability plots"):

> par(mfrow=c(3,2))
> for (i in c("water79","educat","case")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    qqnorm(na.omit(get(i)), ylab=i)
+    qqline(na.omit(get(i)))
+ }

An alternative form of the normal probability plot is available from
the 401 website:

> source("qqn.q")
> par(mfrow=c(3,2))
> for (i in c("water79","educat","case")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    qqn(na.omit(get(i)), ylab=i)
+ }

I have saved this last plot as "qqn.ps".

Aside from having to type only one command (qqn) instead of two
(qqnorm and qqline), the main advantage of using the "qqn" version
is that you can use it to identify extreme values on the plot:

> qqn(water80, labels=case)

[ click the middle mouse button when you are done identifying points ]

POWER TRANSFORMATIONS
=====================

There are many reasons to transform data. Power transformations

   x^p   or   (x^p - 1)/p      (p=0 <==> log)

are often used:

 - to reduce skewness in X and/or Y [ enhance symmetry or normality ]
 - to reduce nonlinearity (in Y)
 - to correct nonconstant spread

Skewness:
---------

To assess and correct skewness and other tail problems it is often
useful to think about quantile-normal plots (normal probability
plots) and symmetry plots. The idea is to make these plots for each
of several transformations of the data, and pick the transformation
that makes the data look the best.
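One convenient way to organize this search is to loop over a short
"ladder" of candidate powers. The little loop below is a sketch of
that idea (my own illustration, not one of the downloaded 401
functions), reusing the qqn() plot from the previous section; p = 0
stands in for the log:

   powers _ c(1, 0.5, 0.3, 0)     # a short ladder of powers to try
   w _ na.omit(water79)
   par(mfrow=c(2,2))
   for (p in powers) {
      if (p == 0) wp _ log(w) else wp _ w^p
      qqn(wp, ylab=paste("water79 ^", p))
   }

Whichever panel comes closest to a straight line suggests the
transformation to try; the examples below work through exactly this
comparison for income and water79.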
As a rule:

 * Right skew can be fixed by raising the data to a fractional power
   (a power between 0 and 1); taking the log of the data corresponds
   to raising to the 0'th power. If the log doesn't work, try 1/x,
   etc.

 * Left skew can be fixed by raising the data to a power greater
   than one; raising e to the power of the data (exp(x)) is the most
   extreme of these transformations.

 * In general we should choose powers that might be related to
   physical or social laws (e.g. squares, logs, and square roots,
   not .234535 powers).

We illustrate with the income data from the Concord study.

> par(mfrow=c(3,2))
> ndhist(income, xlab="income")
> qqn(income, ylab="income")
> ndhist(sqrt(income), xlab="sqrt(income)")
> qqn(sqrt(income), ylab="sqrt(income)")
> ndhist(log(income), xlab="log(income)")   # Nb., "log" = natural log
> qqn(log(income), ylab="log(income)")

I have added histograms of the original and transformed data so you
can see how the data are moving as the transformation changes. Among
these three transformations, the logarithm is a little too much (it
creates left skewing) and the square root doesn't quite fix the
right skewing. So next I'd probably try some fractional powers less
than 1/2.

We could do something similar with water usage in 1979:

> par(mfrow=c(4,2))
> w79 _ na.omit(water79)
> ndhist(w79, xlab="w79")
> qqn(w79, ylab="w79")
> ndhist(sqrt(w79), xlab="sqrt(w79)")
> qqn(sqrt(w79), ylab="sqrt(w79)")
> ndhist(w79^(0.3), xlab="w79^(0.3)")
> qqn(w79^(0.3), ylab="w79^(0.3)")
> ndhist(log(w79), xlab="log(w79)")   # Nb., "log" = natural log
> qqn(log(w79), ylab="log(w79)")

I have saved this plot as "powers-skewing.ps".

Clearly the logarithm is too strong a transformation. It is
difficult to decide between the square root (0.5 power) and the 0.3
power transformation.

Improving Linearity
-------------------

I'll talk about improving linearity with transformations after we've
discussed regression a little bit.

Nonconstant Spread
------------------

Again, let's see what this is like, with water consumption in 1979,
split up according to whether the head of household is retired:

> par(mfrow=c(2,1))
> boxplot(split(water79, retire), style.bxp="old")
> title("water79")
> boxplot(split(sqrt(water79), retire), style.bxp="old")
> title("sqrt(water79)")

I have saved this plot as "nonconst-spread.ps". It appears that the
square root transformation fixed some of the nonconstant spread
problems. We can verify this numerically with our "mysummary"
function, created above:

> lapply(split(water79, retire), mysummary, na.rm=T)
$no:
 Min   Q1 Median  IQR     Mean  Std Dev   Q3   Max
 300 2000   2800 2200 3302.236 1940.318 4200 14500

$yes:
 Min   Q1 Median  IQR     Mean  Std Dev   Q3  Max
 200 1100   1800 1625 2219.118 1549.409 2725 8300

> lapply(split(sqrt(water79), retire), mysummary, na.rm=T)
$no:
      Min       Q1   Median      IQR    Mean  Std Dev       Q3      Max
 17.32051 44.72136 52.91503 20.08605 55.3539 15.45787 64.80741 120.4159

$yes:
      Min       Q1   Median      IQR     Mean  Std Dev      Q3      Max
 14.14214 33.16625 42.42641 19.03365 44.56374 15.32703 52.1999 91.10434

The problem of nonconstant variance or nonconstant spread usually
arises when spread increases with mean or median. In nice cases one
can eyeball the power transformation one needs by plotting
x = log(medians) vs y = log(IQR's), for example.
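To make that last idea concrete, here is a minimal sketch of such a
"spread-versus-level" plot for water79 split by retire (my own
illustration; the rule of thumb that the suggested power is roughly
1 minus the slope of this plot is standard Tukey-style EDA, not
something derived in this handout):

   # spread-versus-level: log(IQR) against log(median), by group
   meds _ unlist(lapply(split(water79, retire), median, na.rm=T))
   iqrs _ unlist(lapply(split(water79, retire),
                 function(x) diff(quantile(x, c(0.25, 0.75), na.rm=T))))
   plot(log(meds), log(iqrs), xlab="log(median)", ylab="log(IQR)")
   b _ diff(log(iqrs)) / diff(log(meds))   # slope between the two groups
   1 - b                                   # rule of thumb: suggested power

With the medians and IQR's in the mysummary() output above (medians
2800 and 1800, IQR's 2200 and 1625), the slope is about 0.69,
suggesting a power of about 0.3 -- reassuringly close to the
w79^(0.3) transformation we tried earlier.

-------------------------------------------------------------------------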