##############################################################
#
# HANDOUT TO ACCOMPANY RWG CH 1 ON VARIABLE DISTRIBUTIONS AND
# EXPLORATORY DATA ANALYSIS
#
# BRIAN JUNKER   SEP 3, 2001
#
##############################################################

We are going to begin on about p 6 of the "Introduction to Splus"
handout that we worked through part of in the Wean cluster last week.
Then we are going to explore the concord data set in more detail,
using

 - some of Splus's built-in EDA features

 - some specially-written EDA functions for Splus that more closely
   match the material shown in RWG Chapter 1

GETTING STARTED (AND STOPPED)
=============================

To start up Splus in UNIX, we type

   % Splus -e

at the UNIX prompt. The "-e" allows us to edit command lines as we
type them. (Alternatively, you could write commands in a text editor
and cut-and-paste them into the Splus command line.)

The two most useful functions for reading in data are

   read.table()
   scan()

If you wanted to learn more about one of these you would say

   help(read.table)
   help(scan)

at the Splus prompt. In most UNIX versions of Splus, if you wanted
the help to appear in a separate window you would type

   help(read.table, window=T)

and if you want a menu of help topics to choose from you would type

   help.start()

or

   help.start(gui="motif")

Oh, yes, and while I am thinking about it, the way to get out of
Splus is to type

   q()

THE CONCORD DATA
================

The concord data set comes from Hamilton's "Regression with Graphics"
text (RWG, p.2):

   During the 1970's, the city of Concord, New Hampshire, experienced
   a growing demand for water, despite a roughly stable population.
   In late 1979 this rising demand, together with unusually dry
   weather, led to a shortage in water supply. In late summer 1980,
   as the shortage worsened, the Concord Water Department and
   municipal officials began a campaign (described by one observer
   as a "media blitz") to persuade citizens to use less water. Over
   the next year, water use declined by about 15%, which was
   naturally interpreted as evidence of the conservation campaign's
   success.

An overall decrease of 15% does not mean that everyone decreased
their water usage by 15%: some citizens might have decreased their
demand by more or less than 15%, and some might even have increased
their usage. The 1981 Concord Water Survey examined such variations
in water savings. Questionnaires were distributed to a random sample
of Concord households, asking about demographic characteristics,
opinions, and water conservation behavior. These questionnaires were
then matched with Water Department records (from meter readings) of
the amount of water each household had actually used during the
summers of 1979, 1980 and 1981, two years before and one year after
the water conservation campaign.

The Concord data set consists of information on 496 households and
how they responded to the scarcity of a basic natural resource, and
to the campaign by public officials to conserve it. The data look
like this (see concord1.dat):

   case water81 water80 water79 income educat retire peop81 cpeop peop80
 1    5    2300    3500    3500     35     18     no      3     0      3
 2    6    2500    4600    4100     60     20     no      5     0      5
 3    7    4000    3800    5500     30     16     no      6     1      5
 4    8     500    1500    1700     20     11    yes      1    -1      2
   [ etc, for 496 lines ]

The variables are (see concord1.txt):

   1. case      case ID number
   2. water81   Summer 1981 Water Use
   3. water80   Summer 1980 Water Use
   4. water79   Summer 1979 Water Use
   5. income    Income in Thousands
   6. educat    Education in Years
   7. retire    Head of house retired?
   8. peop81    # of People Resident, 1981
   9. cpeop     Increase in # of People
  10. peop80    # of People Resident, 1980

READING IN THE CONCORD DATA
===========================

Since the data set has the variable names in the first line (see
concord1.dat) we can tell Splus's "read.table" command to use this
"header line" to name the variables.

> conc _ read.table("concord1.dat", header=T)

We can print out a few lines of the concord data set using some of
the special matrix indexing notation that we learned about in lab
last week:

> conc[1:10,]
   case water81 water80 water79 income educat retire peop81 cpeop peop80
 1    5    2300    3500    3500     35     18     no      3     0      3
 2    6    2500    4600    4100     60     20     no      5     0      5
 3    7    4000    3800    5500     30     16     no      6     1      5
 4    8     500    1500    1700     20     11    yes      1    -1      2
 5    9    4400    4300    3700     50     20     no      2     0      2
 6   10    1900    1500    1600     18     13     no      5     0      5
 7   11    3600    3500    3500     25     14    yes      3     0      3
 8   12    2200    2400    4900     21     20    yes      2     0      2
 9   13    2300    1900    2900     13     11     no      4     0      4
10   14    2200    2500    3900     30     20     no      2     0      2

DIGRESSION

If the variable names had not been there, we could have read the
data without the "header" argument, and then used the "names"
function to assign variable names. To illustrate, I created the data
file concord0.dat, which is identical to concord1.dat except the
header line is missing:

> conc _ read.table("concord0.dat")
> conc[1:3,]
  V1   V2   V3   V4 V5 V6  V7 V8 V9 V10
1  5 2300 3500 3500 35 18  no  3  0   3
2  6 2500 4600 4100 60 20  no  5  0   5
3  7 4000 3800 5500 30 16  no  6  1   5
> names(conc) _ c("case", "water81", "water80", "water79", "income",
+                 "educat", "retire", "peop81", "cpeop", "peop80")
> conc[1:3,]
  case water81 water80 water79 income educat retire peop81 cpeop peop80
1    5    2300    3500    3500     35     18     no      3     0      3
2    6    2500    4600    4100     60     20     no      5     0      5
3    7    4000    3800    5500     30     16     no      6     1      5

END OF DIGRESSION

Recall that "read.table" reads the data into an Splus data structure
called a "data frame". There are two ways to think of a data frame:

   1. as a matrix where different columns can have different data
      types [e.g. numerical, as in "water81", and text, as in
      "retire"];

   2. as a list of vectors, where each vector is one column (one
      variable) in the data set.

To see the "list" nature of a data frame, we can try:

> as.list(conc[1:3,])
$case:
[1] 5 6 7

$water81:
[1] 2300 2500 4000

$water80:
[1] 3500 4600 3800

 .
 .    [ I have omitted some of the output to save space in this file ]
 .

$cpeop:
[1] 0 0 1

$peop80:
[1] 3 5 5

Most of the time it is easiest to think of a data frame as a matrix
[meaning 1. above], but for some Splus functions, thinking of the
"list" nature of a data frame is better.

Two general pieces of information about a data frame are

 * its dimensions as a matrix, and

 * the names of its variables (columns):

> dim(conc)
[1] 496  10
> names(conc)
[1] "case"    "water81" "water80" "water79" "income"  "educat"  "retire"
[8] "peop81"  "cpeop"   "peop80"

To get used to the different ways of referring to data in a data
frame, consider

   conc$water80[1:8]
   conc[1:8, 3]
   conc[1:8, "water80"]

These all refer to the same 8 numbers in "conc", namely

[1] 3500 4600 3800 1500 4300 1500 3500 2400

DATA SUMMARIES AND GRAPHS
=========================

Medians, boxplots, and other summaries of variables are easy to
obtain in Splus, for example:

   median(conc$water80)
   summary(conc$water80)
   boxplot(conc$water80 - conc$water81, style.bxp="old")
   boxplot(conc$water79, conc$water80, conc$water81, style.bxp="old")
   median(conc$water79)
   summary(conc$water79)

Something is wrong with water79! Examine that variable.
> conc$water79
  [1]  3500  4100  5500  1700  3700  1600  3500  4900  2900  3900  4600  5200
 [13]  5400 14500  2800  2000  1300  1800     .  1200  1700  3700  3000  1000
 [25]  6200  1600  2900  1000  1300   700  2800  2800  1100   500     .  2500

  [ many lines omitted to save space ]

[445]  4100  3500     .  3100  4200  4600  7100  4700  3100  2400  6000  3300
[457]  2600  4300  3800   700     .     .  1800  1700  7000  7300  3500     .
[469]     .     .     .  6300  4200  8100   600  1800     .     .     .     .
[481]  4300     .  3000  7300  1700     .     .     .  4500  1800     .     .
[493]     .     .     .     .

Note the periods, which shouldn't be there. Technically, read.table()
saw a mix of numbers and non-numeric items (periods) in that column
of the file and created a categorical "factor" with 82 levels
corresponding to the 82 different values in the column. That is why
the boxplot only goes up to 82. We will learn about factors later in
the course. For now, we can re-read the file, instructing Splus to
treat periods as missing data:

> conc _ read.table("concord1.dat", header=T, na.strings=".")

and we can recheck the variable that was giving us problems before
with

> conc$water79
  [1]  3500  4100  5500  1700  3700  1600  3500  4900  2900  3900  4600  5200
 [13]  5400 14500  2800  2000  1300  1800    NA  1200  1700  3700  3000  1000
 [25]  6200  1600  2900  1000  1300   700  2800  2800  1100   500    NA  2500

  [ many lines omitted to save space ]

[445]  4100  3500    NA  3100  4200  4600  7100  4700  3100  2400  6000  3300
[457]  2600  4300  3800   700    NA    NA  1800  1700  7000  7300  3500    NA
[469]    NA    NA    NA  6300  4200  8100   600  1800    NA    NA    NA    NA
[481]  4300    NA  3000  7300  1700    NA    NA    NA  4500  1800    NA    NA
[493]    NA    NA    NA    NA

Several statistics commands have the optional argument "na.rm" to
remove the NA's when calculating their result:

> median(conc$water79)
> median(conc$water79, na.rm=T)

The "boxplot" command already knows to remove NA's before plotting;
now the plots we made before make more sense:

> boxplot(conc$water79, conc$water80, conc$water81, style.bxp="old")
> summary(conc$water79)

and here is a prettier version [see help(boxplot) and help(title) to
understand some of the arguments used to dress up the plot]:

> boxplot(conc$water79, conc$water80, conc$water81,
+         names=c("1979","1980","1981"), ylab="Cubic Feet of Water",
+         main="Concord Water Study", style.bxp="old")

I have saved this picture in the file "water-use-boxes.ps".

DIGRESSION

Sometimes we want to save those pretty graphs for printing later, or
for including in report writeups, etc. There are many ways to do
this in Splus, but the easiest by far is to use the function "ps()",
which we can download from the Splus area of
www.stat.cmu.edu/~brian/401. After downloading the file "ps.q" we
can type

> source("ps.q")     # you only have to do this once, since Splus
                     # saves data and function definitions until
                     # you "rm()" them!
> ps("pretty-graph")

or, if you want to both save the file and print it, you can type

> ps("pretty-graph", printit=T)   # or just ps("pretty-graph", T)

In either case, the graph is saved as "pretty-graph.ps", a graphics
file in "PostScript" format. This is a printer-friendly format that
can be viewed and printed with "ghostview" in UNIX and with "gsview"
in Windows (these programs are installed on public cluster machines
on campus). PostScript files can also be incorporated in documents
created with LaTeX (more later in the course) and MS-WORD (I am not
a WORD devotee, but I will try to figure out how to do this later in
the course).

END OF DIGRESSION

SPLUS IS ALSO A PROGRAMMING LANGUAGE
====================================

A "for" loop is a statement of the form

   for (argument in range) { ... }
It causes the following statement (or statements, if the curly
braces are used) to be repeated with the argument taking on each of
the values in the range.

   for (i in 2:4) print(median(conc[,i], na.rm=T))

   for (n in 1:10) print(c(n, n^2, n^3))

   for (j in 1:ncol(conc)) {
      cat(names(conc)[j], ":\n")
      print(summary(conc[,j]))
      cat("\n")
   }

The next one also uses the "if (condition) { ... }" statement; it is
a bit harder if you haven't studied programming, but still very
useful:

   for (J in 1:ncol(conc)) {
      if (is.numeric(conc[,J])) {
         cat(names(conc)[J], ":")
         cat(" Median =", median(conc[,J]))
         cat(" Mean =", round(mean(conc[,J]), 3), "\n")
      }
   }

An alternative to "for" loops for data frames is lapply(). (There is
also apply() for matrices.) It is much quicker than a "for" loop, so
consider using it when you have a slow "for" loop.

   lapply(conc, is.numeric)

The output is too spread out; one way to fix it is to use "unlist"
to collapse the elements of the list into a single vector:

   unlist(lapply(conc, is.numeric))

Here are some more examples (column 7, "retire", is omitted since it
is not numeric; it is text, "yes" or "no"):

   unlist(lapply(conc[, c(2:6, 8:10)], median))
   unlist(lapply(conc[, c(2:3, 5:6, 8:10)], median))

Notice how some NA's in water79 ruin the calculation of the median.
For many functions in Splus, this can be fixed by telling the
function to ignore or omit NA's in the calculation:

   unlist(lapply(conc[, c(2:6, 8:10)], median, na.rm=T))
   round(unlist(lapply(conc[, c(2:6, 8:10)], mean, na.rm=T)), 2)

There are lots of parentheses in that last example! The simplest way
to "decode" all that is to read the command from the inside out.
This takes some practice, but will pay off in the end. In fact it is
sometimes helpful to execute each piece, starting from the inside,
to see what is there:

   conc[, c(2:6, 8:10)]
   lapply(.Last.value, mean, na.rm=T)
   unlist(.Last.value)
   round(.Last.value, 2)

[Note the use of .Last.value to recall the value of the last
calculation without having had to assign it to a variable name!]

We can create new functions in Splus as well. One example is the
"ps" function that we used above to save a PostScript version of the
parallel boxplots. But it is not hard to make functions of your own.
For example, Splus (weirdly!) doesn't have a function that computes
the standard deviation of a variable, but it does have one that
computes the variance. We can fix that by defining a new function:

   sd _ function (x) {
      return(sqrt(var(x)))
   }

A problem with this function is that (again, weirdly!) var does not
have a "na.rm=T" option to remove NA's. For example,

> sd(conc$water79)
> var(conc$water79)
> var(conc$water79, na.rm=T)

To fix this, we can redefine "sd" to omit lines with NA's if we want
it to:

   sd _ function(x, na.rm=F) {
      x0 _ x
      if (na.rm==T) x0 _ na.omit(x)
      return(sqrt(var(x0)))
   }

Let's try it:

> sd(conc$water79)
Problem in .C("S_Var1_NA",: There are 47 missing value(s) in x
and/or y passed to cor or var with na.method="fail". See the help
file for other options for handling missing values., while calling
subroutine S_Var1_NA
Use traceback() to see the call stack
> sd(conc$water79, na.rm=T)
1895.684

Another weird "feature" of Splus is that the "summary" function
doesn't include the SD, though it includes the 5-number summary and
the mean:

> summary(conc$water79)
 Min. 1st Qu. Median Mean 3rd Qu.  Max. NA's
  200    1700   2500 2974    3800 14500   47

So let's make a new summary function, and add to it both the SD and
the IQR for the data:

   mysummary _ function (x, na.rm=F) {
      if (na.rm==T) {
         x0 _ na.omit(x)
      } else {
         x0 _ x
      }
      xsumm _ quantile(x0)    # gives the standard 5-number summary
      xiqr  _ xsumm[4] - xsumm[2]
      xsumm _ c(xsumm[1:3], xiqr, mean(x0), sd(x0), xsumm[4:5])
      names(xsumm) _ c("Min", "Q1", "Median", "IQR", "Mean",
                       "Std Dev", "Q3", "Max")
      return(xsumm)
   }

Let's try it:

> mysummary(conc$water79)
Problem in quantile(x0): Missing values and NaN's not allowed if
na.rm=F
Use traceback() to see the call stack
> mysummary(conc$water79, na.rm=T)
 Min   Q1 Median  IQR     Mean  Std Dev   Q3   Max
 200 1700   2500 2100 2974.165 1895.684 3800 14500

EXPLORING THE VARIABLES IN THE CONCORD DATA SET
===============================================

Now we will look at the sample distributions of the variables in the
Concord data set more closely. Before producing them, I will show
you a trick that saves typing. If you "attach" a data frame, like
this:

> attach(conc)

the columns become individual variables (vectors) that you can
access directly. When you are done, "detach()":

> detach()

So...

> attach(conc)
> names(conc)
[1] "case"    "water81" "water80" "water79" "income"  "educat"  "retire"
[8] "peop81"  "cpeop"   "peop80"

> stem(water79)

Removed 47 NAs

N = 449   Median = 2500
Quartiles = 1700, 3800

Decimal point is 3 places to the right of the colon

   0 : 2334444
   0 : 555556667777888888888
   1 : 0000000000000001111111111222223333333333333444444444
   1 : 555555555556666666666666777777777777777778888888888888999999999999
   2 : 00000000000000000001111111111111122222222222233333333333334444444444
   2 : 55555555555555666666666667777777777888888888889999999
   3 : 0000000011111111111111222222333333333444444
   3 : 55555555666667777777888888888889999999
   4 : 00000111122222223333444
   4 : 555666666677778999
   5 : 0000001122234444
   5 : 5556677779
   6 : 00001222344
   6 : 556788
   7 : 01233
   7 : 6678

High: 8100 8300 8600 8700 11500 11800 12600 14500

> stem(income)

N = 496   Median = 21
Quartiles = 15, 30

Decimal point is 1 place to the right of the colon

   0 : 22333
   0 : 455555555555555555555555555555555
   0 : 66667777
   0 : 8888899
   1 : 0000000000000000000000000000000000000000000000111
   1 : 22223333333
   1 : 444444555555555555555555555555555555555555555555555555555555555
   1 : 77777
   1 : 888888999
   2 : 00000000000000000000000000000000000000000000000000000000011111
   2 : 222333333
   2 : 444455555555555555555555555555555555555555555555555555555555555555555555x
   2 : 66667
   2 : 8899
   3 : 00000000000000000000000000000000000000000000000000001
   3 : 22333
   3 : 555555555555555555555555555
   3 : 677
   3 : 89
   4 : 0000000000000001
   4 :
   4 : 5555555555555555555555
   4 :
   4 :
   5 : 00000000
   5 : 233
   5 : 5
   5 :
   5 : 8
   6 : 000

High: 68 72 90 100

> stem(peop80)

N = 496   Median = 3
Quartiles = 2, 4

Decimal point is at the colon

   1 : 0000000000000000000000000000000000000000000000000000000000000000
   1 :
   2 : 000000000000000000000000000000000000000000000000000000000000000000000000x
   2 :
   3 : 000000000000000000000000000000000000000000000000000000000000000000000000x
   3 :
   4 : 000000000000000000000000000000000000000000000000000000000000000000000000x
   4 :
   5 : 0000000000000000000000000000000000000000000000000000
   5 :
   6 : 0000000000000000000
   6 :
   7 : 000000000
   7 :
   8 : 000000

High: 9 10 10 10

> stem(retire)
Problem in Summary.factor(x, na.rm = na.rm): A factor is not a
numeric object
Use traceback() to see the call stack

As you can see, the stem and leaf diagrams are useful for some
things but not all.
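For a categorical variable like "retire", a frequency table is a
more natural summary than a stem and leaf diagram. (This aside is
mine, using only Splus's built-in table() function.)

> table(retire)              # counts of "no" and "yes"
> table(retire, peop81)      # cross-tabulated with household size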
As an alternative to the stem and leaf diagrams, we could make
histograms of each of the numeric variables in the data set. Splus
has a nice histogram function, "hist()", that can do this:

> motif()    # may be needed if you don't already have a graphics
             # window open in Splus
> par(mfrow=c(3,3))
> for (i in names(conc)[-1]) hist(get(i), xlab=i)

[Alternatively we could have typed:

   for (i in 2:10) hist(conc[,i], xlab=names(conc)[i])            ]

However, RWG likes to superimpose a normal curve over each
histogram, to get a feel for issues like

 - symmetry vs skewed left/right
 - multiple modes
 - heavy vs light tails

etc. We can download the function "ndhist" from the Splus area of
the 401 web site to get plots like these:

> source("ndhist.q")
> par(mfrow=c(3,3))
> for (i in names(conc)[-c(1,7)]) ndhist(na.omit(get(i)), xlab=i)

I have saved this plot in "all-histograms.ps".

We looked at some boxplots before; here are some more:

> par(mfrow=c(3,3))
> for (i in names(conc)[-c(1,7)]) boxplot(get(i), style.bxp="old")

I have saved this plot in "all-boxplots.ps".

In some cases we are interested in comparing values of a continuous
variable (like "water79" or "income") in different groups identified
by a categorical variable (like "retire"). This can be accomplished
by combining "boxplot()" with "split()" as follows:

> boxplot(split(income, retire), style.bxp="old")

If the variables are already separated, we need only list them for
boxplot, and not bother with "split()":

> boxplot(water79, water80, water81, style.bxp="old",
+         names=c("1979","1980","1981"))

DIGRESSION

You may remember that the basic idea of a boxplot is something like
this:

                 +---+-----+
   * * *  -------|   |     |-------------    *          o
                 +---+-----+
  OUTLIERS      Q1  MED    Q3   WHISKER   OUTLIER    EXTREME
                                                     OUTLIER

The simplest way to make a boxplot by hand is to

 * sort the data using a stem and leaf plot;

 * compute the first and third quartiles and the median, and plot
   the box;

 * compute the value IQR = Q3 - Q1 (this is the width of the box);

 * compute the
    - inner fences at Q1 - (1.5)*IQR and Q3 + (1.5)*IQR
    - outer fences at Q1 - (3.0)*IQR and Q3 + (3.0)*IQR

 * plot the whiskers out to the lowest and highest values inside the
   inner fences;

 * decorate the outliers according to whether they fall inside or
   outside each set of fences.

Tukey, who invented boxplots (and stem and leaf diagrams, and many
other useful EDA plots), has rather folksy names for the versions of
quartiles he likes to calculate: he calls Q1 and Q3 the lower and
upper hinges, and the IQR the hinge spread.

END OF DIGRESSION

SYMMETRY PLOTS AND QUANTILE-NORMAL PLOTS
========================================

Although the above plots are useful for getting a general feel for
the sample distribution of the data, often we want to focus on
either

 * whether the data distribution is approximately symmetric, or

 * whether the data distribution is approximately normally
   distributed.

RWG suggests several useful plots for this purpose. We show how to
make them in Splus here.

First we consider *just* the question of symmetry. If a data set is
perfectly symmetrical about its median, like this,

   10 20 30 40 50 60 70 80 90

what this means is that the absolute distances from the median match
up:

   40 30 20 10  0 10 20 30 40

so if we plot the distances "below" the median against the distances
"above" the median, we will get a straight line (the line y = x,
actually!). This is the idea behind RWG's "symmetry plot".
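The symplot() function that we download next is the version to use
for coursework, but to fix the idea, here is a minimal sketch of how
such a function might be written (the name "symplot0" and the
details are mine, not from symplot.q):

   symplot0 _ function(x, ...) {
      x _ sort(x[!is.na(x)])       # drop NA's and sort
      n _ length(x)
      m _ median(x)
      i _ 1:floor(n/2)             # pair i-th smallest with i-th largest
      plot(m - x[i], x[n + 1 - i] - m,
           xlab="Distance below median",
           ylab="Distance above median", ...)
      abline(0, 1)                 # reference line y = x
   }

For a perfectly symmetric sample the points fall on the line y = x;
points drifting above the line suggest a longer upper tail (right
skew), and points below the line a longer lower tail (left skew).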
We can make a symmetry plot by downloading "symplot.q" from the 401
website, and using "source()" to define it in Splus for us:

> source("symplot.q")
> par(mfrow=c(1,1))
> symplot(c(10,20,30,40,50,60,70,80,90))

> par(mfrow=c(2,2))
> x _ rnorm(100); ndhist(x); symplot(x)
> x _ rchisq(100,4); ndhist(x); symplot(x)

We can see how symmetric the water usage data are by making symmetry
plots. We also make histograms to help us interpret the plots:

> par(mfrow=c(3,2))
> for (i in c("water79","water80","water81")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    symplot(na.omit(get(i)), main=i)
+ }

> par(mfrow=c(2,2))
> for (i in c("income","educat")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    symplot(na.omit(get(i)), main=i)
+ }

> par(mfrow=c(3,2))
> for (i in c("peop80","cpeop","peop81")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    symplot(na.omit(get(i)), main=i)
+ }

Another type of plot that is sometimes useful is the "quantile
plot", which can be used

 * to assess whether data are uniformly distributed; and

 * to read off "quantiles" or percentiles from the plot.

DIGRESSION: Quantiles are percentiles expressed as fractions instead
of percents. So: the 0.25 quantile = the 25th percentile = Q1.
END OF DIGRESSION

To make a quantile plot, we sort the data and plot each value
against its quantile. The function "qplot.q" from the 401 website
does this:

> source("qplot.q")

If the plot lies along a straight line, the data are approximately
uniform. Otherwise, not. Here are some examples:

> par(mfrow=c(3,2))
> for (i in c("water79","educat","case")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    qplot(na.omit(get(i)), main=i)
+ }

I have saved the last plot as "qplots.ps". This last plot shows that
"case" (the case number assigned to each household in the survey) is
indeed almost perfectly uniformly distributed. Why??

Finally, the most often used plot of this type in applied statistics
is the "quantile-normal plot". To make a quantile-normal plot we
sort the data, and then plot each data point against the
corresponding quantile of the normal distribution. Splus has good
facilities for making quantile-normal plots (a.k.a. "normal
probability plots"):

> par(mfrow=c(3,2))
> for (i in c("water79","educat","case")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    qqnorm(na.omit(get(i)), ylab=i)
+    qqline(na.omit(get(i)))
+ }

An alternative form of the normal probability plot is available from
the 401 website:

> source("qqn.q")
> par(mfrow=c(3,2))
> for (i in c("water79","educat","case")) {
+    ndhist(na.omit(get(i)), xlab=i)
+    qqn(na.omit(get(i)), ylab=i)
+ }

I have saved this last plot as "qqn.ps".

Aside from having to type only one command (qqn) instead of two
(qqnorm and qqline), the main advantage of using the "qqn" version
is that you can use it to identify extreme values on the plot:

> qqn(water80, labels=case)

[ click the middle mouse button when you are done identifying points ]

POWER TRANSFORMATIONS
=====================

There are many reasons to transform data. Power transformations

   x^p   or   (x^p - 1)/p      (p=0 <==> log)

are often used:

 - to reduce skewness in X and/or Y [ enhance symmetry or normality ]
 - to reduce nonlinearity (in Y)
 - to correct nonconstant spread

Skewness:
---------

To assess and correct skewness and other tail problems it is often
useful to think about quantile-normal plots (normal probability
plots) and symmetry plots. The idea is to make these plots for each
of several transformations of the data, and pick the transformation
that makes the data look the best.
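One convenient way to organize this search is to loop over a short
"ladder" of candidate powers. The little loop below is a sketch of
that idea (my own illustration, not one of the downloaded 401
functions), reusing the qqn() plot from the previous section; p = 0
stands in for the log:

   powers _ c(1, 0.5, 0.3, 0)     # a short ladder of powers to try
   w _ na.omit(water79)
   par(mfrow=c(2,2))
   for (p in powers) {
      if (p == 0) wp _ log(w) else wp _ w^p
      qqn(wp, ylab=paste("water79 ^", p))
   }

Whichever panel comes closest to a straight line suggests the
transformation to try; the examples below work through exactly this
comparison for income and water79.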
As a rule:

 * Right skew can be fixed by raising the data to a fractional power
   (a power between 0 and 1); taking the log of the data corresponds
   to raising to the 0'th power. If the log doesn't work, try 1/x,
   etc.

 * Left skew can be fixed by raising the data to a power greater
   than one; raising e to the power of the data (exp(x)) is the most
   extreme of these transformations.

 * In general we should choose powers that might be related to
   physical or social laws (e.g. squares, logs, and square roots,
   not .234535 powers).

We illustrate with the income data from the Concord study.

> par(mfrow=c(3,2))
> ndhist(income, xlab="income")
> qqn(income, ylab="income")
> ndhist(sqrt(income), xlab="sqrt(income)")
> qqn(sqrt(income), ylab="sqrt(income)")
> ndhist(log(income), xlab="log(income)")   # Nb., "log" = natural log
> qqn(log(income), ylab="log(income)")

I have added histograms of the original and transformed data so you
can see how the data are moving as the transformation changes. Among
these three transformations, the logarithm is a little too much (it
creates left skewing) and the square root doesn't quite fix the
right skewing. So next I'd probably try some fractional powers less
than 1/2.

We could do something similar with water usage in 1979:

> par(mfrow=c(4,2))
> w79 _ na.omit(water79)
> ndhist(w79, xlab="w79")
> qqn(w79, ylab="w79")
> ndhist(sqrt(w79), xlab="sqrt(w79)")
> qqn(sqrt(w79), ylab="sqrt(w79)")
> ndhist(w79^(0.3), xlab="w79^(0.3)")
> qqn(w79^(0.3), ylab="w79^(0.3)")
> ndhist(log(w79), xlab="log(w79)")   # Nb., "log" = natural log
> qqn(log(w79), ylab="log(w79)")

I have saved this plot as "powers-skewing.ps".

Clearly the logarithm is too strong a transformation. It is
difficult to decide between the square root (0.5 power) and the 0.3
power transformation.

Improving Linearity
-------------------

I'll talk about improving linearity with transformations after we've
discussed regression a little bit.

Nonconstant Spread
------------------

Again, let's see what this is like, with water consumption in 1979,
split up according to whether the head of household is retired:

> par(mfrow=c(2,1))
> boxplot(split(water79, retire), style.bxp="old")
> title("water79")
> boxplot(split(sqrt(water79), retire), style.bxp="old")
> title("sqrt(water79)")

I have saved this plot as "nonconst-spread.ps". It appears that the
square root transformation fixed some of the nonconstant spread
problems. We can verify this numerically with our "mysummary"
function, created above:

> lapply(split(water79, retire), mysummary, na.rm=T)
$no:
 Min   Q1 Median  IQR     Mean  Std Dev   Q3   Max
 300 2000   2800 2200 3302.236 1940.318 4200 14500

$yes:
 Min   Q1 Median  IQR     Mean  Std Dev   Q3  Max
 200 1100   1800 1625 2219.118 1549.409 2725 8300

> lapply(split(sqrt(water79), retire), mysummary, na.rm=T)
$no:
      Min       Q1   Median      IQR    Mean  Std Dev       Q3      Max
 17.32051 44.72136 52.91503 20.08605 55.3539 15.45787 64.80741 120.4159

$yes:
      Min       Q1   Median      IQR     Mean  Std Dev      Q3      Max
 14.14214 33.16625 42.42641 19.03365 44.56374 15.32703 52.1999 91.10434

The problem of nonconstant variance or nonconstant spread usually
arises when spread increases with mean or median. In nice cases one
can eyeball the power transformation one needs by plotting
x = log(medians) vs y = log(IQR's), for example.
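To make that last idea concrete, here is a minimal sketch of such a
"spread-versus-level" plot for water79 split by retire (my own
illustration; the rule of thumb that the suggested power is roughly
1 minus the slope of this plot is standard Tukey-style EDA, not
something derived in this handout):

   # spread-versus-level: log(IQR) against log(median), by group
   meds _ unlist(lapply(split(water79, retire), median, na.rm=T))
   iqrs _ unlist(lapply(split(water79, retire),
                 function(x) diff(quantile(x, c(0.25, 0.75), na.rm=T))))
   plot(log(meds), log(iqrs), xlab="log(median)", ylab="log(IQR)")
   b _ diff(log(iqrs)) / diff(log(meds))   # slope between the two groups
   1 - b                                   # rule of thumb: suggested power

With the medians and IQR's in the mysummary() output above (medians
2800 and 1800, IQR's 2200 and 1625), the slope is about 0.69,
suggesting a power of about 0.3 -- reassuringly close to the
w79^(0.3) transformation we tried earlier.

-------------------------------------------------------------------------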