R notes for discrete multivariate, Tue Nov 09 09:39:02 2004 =========================================================== Useful R packages: (From L Thompson's guide) [see also http://cran.r-project.org/ for more info!] MASS -- (VR bundle, Venables and Ripley) - install as VR! rmutil (J. Lindsey) -- (used with gnlm) http://alpha.luc.ac.be/~lucp0753/rcode.html gnlm (J. Lindsey) -- http://alpha.luc.ac.be/~lucp0753/rcode.html repeated (J. Lindsey)-- http://alpha.luc.ac.be/~lucp0753/rcode.html SuppDists (B. Wheeler)-- (used in chapter 1) combinant (V. Carey)-- (used in chapters 1, 3) methods -- (used in chapter 3) Bhat (E. Luebeck) -- (used throughout) mgcv (S. Wood) -- (used for fitting GAM models) modreg (B. Ripley) -- (used for fitting GAM models) gee and geepack (J. Yan)-- (used in chapter 11) yags (V. Carey) -- (used in chapter 11) gllm -- (used for generalized log linear models) GlmmGibbs (Myles and Clayton -- (used for generalized linear mixed models) glmmML (G. Broström)-- (used for generalized linear mixed models) CoCoAn (S. Dray) -- (used for correspondence analysis) e1071 (A. Weingessel)-- (used for latent class analysis) vcd (M. Friendly) -- (used in chapters 2, 3, 5, 9 and 10) brlr (D. Firth) -- (used in chapter 6) BradleyTerry (D. Firth)-- (used in chapter 10) ordinal (P. Lindsey)-- (used in chapter 7) http://euridice.tue.nl/~plindsey/rlibs.html design (F. Harrell) -- (used throughout) http://hesweb1.med.virginia.edu/biostat Hmisc (F. Harrell) -- (used throughout) VGAM (T. Yee) -- (used in chapters 7 and 9) http://www.stat.auckland.ac.nz/~yee/VGAM/download.shtml mph (J. Lang) -- (used in chapters 7, 10) http://www.stat.uiowa.edu/~jblang/mph.fitting /mph.fit.documentation.htm#availability exactLoglinTest (B. Caffo)-- (used in chapters 9, 10) http://www.biostat.jhsph.edu/~bcaffo/downloads.htm R packages can be installed from Windows using the install.packages function. This function can be called from the console or from the pull-down "Packages" menu. If all else fails, download the zip or tar file from CRAN and use "packages -> install from local zip file" etc. in the R console. A package is loaded in the same way 3 as for S-PLUS. As well, the command help(package=pkg) can used to invoke the help menu for package pkg. To detach a library or package, named library.name or pkg, respectively, one can issue the following commands, among others. detach("library.name") # S-PLUS, e.g. detach("Design") detach("package:pkg") # R, e.g. detach("package:Design") =================================================================== SOME ANALYSES OF THE HORSESHOE CRAB DATA IN AGRESTI CH's 4-5 First we have to get the data in a form R can read it. Agresti provides many data sets for his book at his website http://www.stat.ufl.edu/~aa/cda/cda.html in the files agresti-sas-datasets.htm agresti-homework-tables.htm The sas file contains data sets and initial analyses using SAS. In our case we simply find the right data set, and cut and paste into text files that can be used for R. It is useful to grab variable names for the header line in an R data file as well. I have done this and saved the data as horseshoe-data.txt. =================================================================== crabs <- read.table("horseshoe-data.txt",header=T) crabs <- cbind(crabs,sat=ifelse(crabs$satell>0,1,0)) crabs[1:4,] color spine width satell weight sat 1 3 3 28.3 8 3050 1 2 4 3 22.5 0 1550 0 3 2 1 26.0 9 2300 1 4 4 3 24.8 0 2100 0 ==================================================================== First we will look briefly at a Poisson analysis of the number of satellites (males in females harem besides the male that shares the female's nest). A simple analysis might be based on carapace width of the females for example. As a first analysis we build up a figure like Fig 4.4, Agresti p. 129: ==================================================================== par(mfrow=c(1,1)) width.bins <- cut(crabs$width,breaks=c(0,seq(23.25, 29.25),Inf)) plot.y<-aggregate(crabs$satell, by=list(width=width.bins), mean)$x plot.x<-aggregate(crabs$width, by=list(width=width.bins), mean)$x plot(x=plot.x, y=plot.y, ylab="Number of Satellites", xlab="Width (cm)", bty="L", axes=F, type="p", pch=16) axis(2, at=0:5) axis(1, at=seq(20,34,2)) ==================================================================== To get a better idea for the trend, let's look at a nonparametric smooth of the data. A standard modern smoothing model is the "gam" model (gam="generalized additive model"). It is a glm like we have been talking about except that instead of the structural part being eta_i = sum_j ( beta_j x_ij ) the structural part is eta_i = sum_j ( s_j(x_ij) ) where s_j() are piecewise polynomial smoothing splines. Since the s_j() are piecewise polynomial a similar MLE theory (penalized for the complexity of the polynomials) holds for them (except that the fitting algorithm typically iterates through IRLS for each s_j() holding the others fixed). Fortunately these are easy to fit in R: =================================================================== library(mgcv) # fitting using the grouped data gamfit<-gam(plot.y~s(plot.x, k=6, fx=TRUE, bs="cr"), family=poisson(link=log)) lines(x=plot.x,y=gamfit$fitted.values) par(mfrow=c(1,1)) plot.gam(gamfit,SE=T) # for fitted model with conf bands # Nb., details on controlling the degree of smoothing are in # help(gam) and help(s) ############################################### # fitting using the raw data gamfit2 <- gam(satell ~ s(width,k=5,fx=TRUE,bs="cr"), family=poisson,data=crabs) attach(crabs) ind <- order(width) lines(x=width[ind],y=gamfit2$fitted.values[ind]) detach(crabs) par(mfrow=c(1,1)) plot.gam(gamfit2,SE=T) # for fitted model with conf bands =================================================================== Now let's try a Poisson regression glm for this data =================================================================== log.fit<-glm(satell~width, family=poisson(link=log),data=crabs) summary(log.fit) 1-pchisq(567.88,171) =================================================================== The residual deviance (or just "deviance") is the LR test of fit against the saturated model. The null deviance is the deviance or LR test against a "constant" model i.e. a model with intercept only. Therefore the difference, log.fit$null.deviance-log.fit$deviance [1] 64.91309 1-pchisq(64.91,1) gives an LR test of the regression part of the model (just the slope on weight, in this case). CONCLUSION: the model doesn't fit very well, but it's better than the null model! We can also look at fitted values and compare the fitted values with the data... ================================================================== attributes(summary(log.fit)) summary(log.fit)$coefficients attributes(log.fit) log.fit$fitted.values fitted(log.fit) # predict at given width value predict.glm(log.fit, type="response", newdata=data.frame(width=26.3)) par(mfrow=c(1,1)) plot(x=plot.x, y=plot.y, ylab="Number of Satellites", xlab="Width (cm)", bty="L", axes=F, type="p", pch=16) axis(2, at=0:5) axis(1, at=seq(20,34,2)) lines(crabs$width[ind],log.fit$fitted.values[ind]) ############################################################# # Now let's compare with a Poisson glm with identity link (Agresti fig 4.5) id.fit<-glm(satell~width, family=poisson(link=identity),data=crabs, start=coef(log.fit)) summary(id.fit) lines(crabs$width[ind],id.fit$fitted.values[ind],lty=2) # could also compare with # lines(crabs$width[ind],gamfit2$fitted.values[ind], lty=3) # finally Compare AICs library(MASS) summary.glm(log.fit)$aic summary.glm(id.fit)$aic extractAIC(log.fit)[[2]] ================================================================== Finally it might be worth looking at some residual plots.... ================================================================== dev.resid <- resid(log.fit, type="deviance") pear.res <- resid(log.fit, type="pearson") pear.std <- resid(log.fit, type="pearson")/sqrt(1-lm.influence(log.fit)$hat) par(mfrow=c(2,2)) plot(pear.res, xlab="observation",ylab="Pearson Residuals") abline(h=0) plot(pear.std, xlab="observation",ylab="Standardized Pearson Residuals") abline(h=0) plot(dev.resid,xlab="observation",ylab="Deviance Residuals") abline(h=0) ############# par(mfrow=c(2,2)) plot(log.fit) ================================================================== ================================================================== ================================================================== Now let's look at a logistic regression for the presence of satellites... crabs[1:4,] color spine width satell weight sat 1 3 3 28.3 8 3050 1 2 4 3 22.5 0 1550 0 3 2 1 26.0 9 2300 1 4 4 3 24.8 0 2100 0 First the basic model and some basic exploratory plots.... =================================================================== logit.fit<-glm(sat ~ width, family=binomial, data=crabs) width.bins <- cut(crabs$width, breaks=c(0,seq(23.25, 29.25),Inf)) prop<-aggregate(crabs$sat, by=list(width=width.bins), mean)$x plot.x<-aggregate(crabs$width, by=list(width=width.bins), mean)$x par(mfrow=c(1,1)) plot(y=crabs$sat,x=crabs$width, xlab=expression(paste("Width, ", italic(x), "(cm)")), ylab=expression(paste("Proportion having satellites,", {pi}, "(x)")), axes=F, type="n") axis(2, at=seq(0,1,.2)) axis(1, at=seq(20,34,2)) lines(y=prop, x=plot.x, pch=16, type="p") ind<-order(crabs$width) lines(x=crabs$width[ind],y=predict(logit.fit, type="response")[ind], type="l", lty=3) # library(mgcv) # for a gam fit gamlogit<-gam(sat~s(width, k=3, fx=TRUE, bs="cr"), family=binomial,data=crabs) lines(x=crabs$width[ind],y=gamlogit$fitted.values[ind],type="l",lty=1) # in this case the gam fit doesn't help - too much wiggle in the data! # plot.gam(gamlogit,SE=T) # for fitted model with conf bands ########################## # # or think about confidence envelopes... crab.predict<-predict(logit.fit, type="response", se=T) ind<-order(crabs$width) par(mfrow=c(1,1)) plot(crabs$width[ind],crab.predict$fit[ind], axes=F, type="l", xlim=c(20,33), ylab="Probability of satellite", xlab=expression(paste("Width, ", italic(x),"(cm)"))) axis(2, at=seq(0,1,.2)) axis(1, at=seq(20,32,2)) lines(crabs$width[ind],crab.predict$fit[ind]-1.96*crab.predict$se[ind],lty=3) lines(crabs$width[ind],crab.predict$fit[ind]+1.96*crab.predict$se[ind],lty=3) ===================================================================== Start thinking about building a model... summary(logit.fit2<-glm(sat ~ width + as.factor(color), family=binomial, data=crabs)) logit.fit$deviance [1] 194.4527 logit.fit2$deviance [1] 187.4570 logit.fit$deviance - logit.fit2$deviance [1] 6.995631 unique(crabs$color) [1] 3 4 2 5 1-pchisq(6.99,4) [1] 0.1364176 ====================================================================