R Tips and Links



Helpful R Links

  1. My R Class Notes    (Advanced)
  2. Official R Site Search     RSeek (Google-type search for R related material)
  3. Documentation (incl. Download)   (Hint: Try creating a bookmark to "C:\Program Files\R\rw2001\doc\html\rwin.html" in Windows; substitute your most recent version for "rw2001"; the link to "Search Engine and Keywords" is most helpful.)
  4. Packages     Crantastic Package Page     R Example Graph Library
  5. FAQ     R-help for asking questions Bug reporting    
  6. R Inferno: problems and solutions
  7. R Studio: an integrated development environment for R
  8. R color chart
  9. Intro to R
  10. R Language Definition
  11. Operator Precedence
  12. R Data Import/Export
  13. R Reference Guide (pdf, 450 pages, 12MB)
  14. R Data Import/Export
  15. A good proposed Programming Style Guide
  16. Areas of statistics: CRAN Task Views
  17. Lumley's R Fundamentals
  18. R Reference Card
  19. Peng's Debugging in R (pdf)
  20. R Programming Resource Center
  21. Mathematical annotation in plots
  22. R-News
  23. R wiki
  24. Spherula (R quick reference, scripts, book notes, etc.)
  25. OmegaHat: interfaces to other languages
  26. RExcel: communicate with rcom/MS Excel   See dtfFromExcel() & getExcel() under Utility Functions, below, for a robust wrapper. An alternative is the RODBC package.
  27. Rtools package to help with C within R
  28. Nabble R forum
  29. Kickstarting R
  30. Using R for psychological research (personality-project)
  31. York U R tips
  32. Books    Resources
  33. Theresa Scott's tutorial
  34. CUNY R Tutorial
  35. ILSTU 1-Page R Tutorial (Windows)
  36. Tips for Creating, Modifying, and Checking Data Frames
  37. Practical Regression and Anova using R
  38. Non-Parametric Inference with R (Larry and Chad)
  39. Example of running repeated measures in R to match SPSS (etc)
  40. Possible link to your copy of the R Manuals (Copy the link location and paste into your browser)
  41. nlme() mixed models guide (pdf)

Tips for Working in R

  1. Use help.start() to bring up help in a browser; the link to "Search Engine and Keywords" is most useful.
  2. When looking at complex expressions, decode them by working from the inside out. E.g., here is a decomposition of some code to make a density plot of the product of two normal random variables using a sample of size 20. (Note the optional use of "tmp" to store the random numbers, so that every line of the decomposition works on the same values.)
         plot(density(apply(matrix(rnorm(40),20), 1, prod)))
         tmp = rnorm(40)   # a vector of 40 standard normal variates
         tmp
         matrix(tmp,20)  # put into a matrix of 20 rows and 40/20=2 columns
         apply(matrix(tmp,20), 1, prod)  # the 20 products
         density(apply(matrix(tmp,20), 1, prod)) # the density estimate
         plot(density(apply(matrix(tmp,20), 1, prod))) # the plot
        
  3. Keep a text record of all working R commands needed to re-run your analysis. Ideally you should be able to source() the file and recreate your work, e.g. if your client finds an error in the data (which happens 98% of the time according to Seltman's Law of Data Analysis).
  4. Under Linux, "ESS" (Emacs Speaks S) is usually the most efficient way to work. Briefly, you start emacs, then use "Alt-X R" to start R from within emacs. The ESS menu (with keyboard shortcuts) allows you to automatically run code that you write, among other features. See the ESS home page for details.
  5. Write out TRUE and FALSE, because T and F can be redefined.
  6. When defining and redefining columns of a data.frame, make liberal use of summary() and table(..., exclude=NULL) to verify that you accomplished what you tried to accomplish.
  7. Important: Remember that table() ignores missing data; use table(..., exclude=NULL) to also see missing data.
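    A small illustration of tips 6 and 7, using a made-up vector with two missing values. The useNA argument of table() is shown as a similar, more explicit option; it is not mentioned in the tips themselves.

```r
# Hypothetical data with two missing values
x <- c("a", "b", NA, "a", NA)

counts_default <- table(x)                   # NA values are silently dropped
counts_with_na <- table(x, exclude = NULL)   # NA gets its own cell
counts_use_na  <- table(x, useNA = "ifany")  # same idea, stated explicitly

print(counts_default)
print(counts_with_na)
print(counts_use_na)
```

    Comparing sum(table(x)) with length(x) is a quick way to notice that some values were dropped.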
  8. You can use the .First function to automatically load libraries that you frequently use, or to perform other startup tasks. E.g.
     .First = function() {library(nlme); options(locatorBell=FALSE)} 
    will load the nlme library every time R starts up in the current directory. It also turns off the annoying sound associated with the locator() function.
  9. Use this function to find large, unneeded objects that can be removed to free up space:
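         sizes = function() {
           env = parent.frame()   # the environment sizes() was called from
           ob = objects(name=env)
           # look each object up in that same environment, so that sizes()
           # also works when called from inside another function
           rslt = sapply(ob, function(x) {object.size(get(x, envir=env))})
           return(sort(rslt))
         }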
         
  10. Contrast testing in R (using C() or contrasts()) ignores your scaling, so although the t-values and p-values are correct, the estimates and standard errors (and any confidence intervals you construct from them) are incorrect. To do this correctly, use fit.contrast() in package "gmodels". E.g.
         x = factor(rep(LETTERS[1:3], each=20)); y = rnorm(60)
         m1 = aov(y~x)
         library(gmodels)
         cont = rbind(AvsBC = c(1, -1/2, -1/2), BvsC = c(0, 1, -1))
         fit.contrast(m1, "x", cont, conf.int=0.95)
         
  11. If you are working on a public computer without write access to where most of R lives, you can still install packages to a private space (Windows example shown here, but it is similar on Linux). Make a directory you can write to, e.g., c:\\myPackages. In R, to install, e.g., package "mice" use
         install.packages("mice", "c:\\myPackages")
         
    Then each R session use
         library(mice, lib.loc="c:\\myPackages")
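
    A possible variation, sketched here with a temporary directory standing in for your own writable folder: put the private directory at the front of the library search path with .libPaths(), so that later install.packages() and library() calls use it by default. The directory name is only an example.

```r
# A writable directory for private packages (a temporary one, for illustration)
my_lib <- file.path(tempdir(), "myPackages")
dir.create(my_lib, showWarnings = FALSE)

# Put the private directory first on the library search path
.libPaths(c(my_lib, .libPaths()))
print(.libPaths())

# After this, install.packages("mice") and library(mice) would use my_lib
# by default (not run here because installing needs network access).
```

    The change lasts only for the current session, so the .libPaths() call would need to be repeated (or placed in a startup file) each time.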
         
  12. A system for documenting data analysis projects:

    Here is an idea for making R code that stores comments and results in a separate, readable file. This is especially nice when you might need to re-source() your code due to changes in the data or analysis (i.e., essentially always). Optionally, you can run reportLatex() after all of your report() commands to create a .tex file that is formatted better and incorporates graphical output (see below).

    (An alternative is Sweave. Unlike Sweave, report does not require you to understand LaTeX, and it has only a single command to learn.)

    The code and more documentation are at report.R.

    Put these two lines near the top of your code:

         if (!exists("report")) source("http://www.stat.cmu.edu/~hseltman/files/report.R")
         report("Start of my report on project X", new=TRUE, prefix="myProjectX")
         
    This creates a file named "myProjectXYYYY-MM-DD.txt" with the quoted string in the first argument as the text at the top of the file. You can include "\n" in the first argument to write multiple lines in one call. You can optionally add the argument useTime=TRUE to include the time of creation along with the date in the file name if you want to keep multiple versions from the same day.

    Note that the variable "reportFileName" is created in your global environment and you should not delete this variable, at least while you are working on any one report.

    Now anytime in your code, you can include code of the form

     report(x) 
    or
     report(x, ..., z)
    to cause the value of x (or all of the variables x through z) to go to both the screen and the report file. This constructs the report on-the-fly as you work through your analysis. (If multiple arguments are used with report() and they are all strings or single numbers, then they are pasted together without any spaces between them (i.e., using sep="")).

    Note that you can manually erase errors from the report file using a text editor.

    Note that with a little planning, you will be in the situation such that if you re-source() your whole .R file, e.g., after correcting an error in the data or analysis, you will end up with a brand new, complete, readable report of the entire analysis with no effort.

    Note that the screen width affects the output by controlling the usual R text wrapping, e.g., with table(). Normally, you will want to keep the screen width around 60-70 characters to make it easier to read the report.

    Note that whenever you run report(x, new=TRUE, ...), if the report file name matches an existing file, the old file is deleted.

    Here are some examples that demonstrate what you can do:

         report("\nDemographics")
         report(table(age, gender))
         report(paste("Number of visits =", nrow(dat)))
         report("\nSuccess by treatment")
         report(with(dat, table(success, treatment, exclude=NULL)))
         report("\nYears of education:")
         report(summary(demog$educ))
         report(paste("\nDropping", sum(noVisits|oneVisit), 
                      "subjects with no CERAD's or only 1 visit"))
         report(expression(str(my.data.frame)))
         
    The last example uses "expression()" because the "str()" function breaks the usual R rules and uses "cat()" rather than returning its result as an object. The "stem()" function is another example.
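
    To see the behavior being described, here is a small base-R sketch with a made-up data.frame. It does not use report(); capture.output() is simply one standard way to collect text that a function prints rather than returns.

```r
# Hypothetical data.frame for illustration
my.data.frame <- data.frame(id = 1:3, group = c("a", "b", "a"))

# str() prints its description and returns NULL invisibly,
# so there is no result object to pass along
returned <- str(my.data.frame)
print(is.null(returned))

# capture.output() collects the printed lines as a character vector
printed <- capture.output(str(my.data.frame))
print(printed)
```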

    There are three helper functions in report.R.

    1. matForm(x, cols=12) converts a vector (string, numeric or factor) into a string matrix with a specific number of columns (even if length(x)%%cols!=0), so that long vectors don't ruin the appearance of the report.
    2. total(tab, margins=1:2) adds totals to the result of table()
    3. pct(tab, margins=1:2) adds percents to the result of table()

    Note that pct() and total() can both be used on the same table, in either order. Each respects the results of the other to avoid the incorrect and/or confusing output that could result from, e.g., including data and their total when computing percents.
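
    For readers without report.R at hand, base R offers roughly comparable tools. The sketch below uses addmargins() and prop.table() on made-up data; these are not the total() and pct() helpers themselves, and their output format will differ.

```r
# Hypothetical data for illustration
gender  <- c("f", "m", "f", "f", "m", "m", "f", "m")
success <- c("yes", "no", "yes", "no", "no", "yes", "yes", "no")
tab <- table(gender, success)

# Row and column totals appended to the table
with_totals <- addmargins(tab)
print(with_totals)

# Percents within each row (margin 1), computed from the counts only
row_pct <- round(100 * prop.table(tab, 1), 1)
print(row_pct)
```

    Note that the percents are computed from the original table, not from the version with totals, which is the kind of mix-up the note above warns about.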

    Optionally, you can use reportLatex() (code and description in reportLatex.R) to convert your .txt file into a .tex (Latex) file. This can incorporate plots as follows: when you are going through your analysis use the report text "See ... in myPlotFile.pdf", e.g.,

         plot(rnorm(20), type="b", main="Random normals", xlab="time", ylab="x")
         fname = "rnorm.pdf"
         dev.copy(pdf, fname); dev.off()
         # Important: Be sure to put a blank between "in" and the end quote
         #            since  sep="" will be in effect.
         report("\nSee 20 Gaussians in ", fname)
         
    With or without these special graphics commands, when you run reportLatex() a .tex file is created with the same base name as your .txt report file. Note that you can manually edit the .tex file at this point if desired.

    You then process this .tex file with pdflatex myReportFile.tex in Linux (or however else you know to process Latex files on any operating system) to produce the .pdf report file.

    If you used the special graphics indicator text "See ... in someFileName.pdf", then the plots will be included in the report, and the caption of the figure will be the text between "See" and "in". Also the caption will include figure numbers starting at "Figure 1".

    If you prefer to use a different graphics file type than "pdf" (as long as it is compatible with whatever version of pdflatex or latex you are using), just run the optional form, e.g., reportLatex(extension=".pdf"), substituting your graphics extension for "pdf".

Tips for Programming in R

  1. End each function with return() or invisible() rather than using implicit returns. This conforms to standard programming practice in most other languages and makes your program easier to read.
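
    A minimal sketch of the two endings; the function names are made up for illustration.

```r
# Explicit return(): the value is printed when the call is typed at the prompt
center <- function(x) {
  centered <- x - mean(x)
  return(centered)
}

# invisible(): the value is still returned, but not auto-printed,
# which suits functions called mainly for their side effects
describe <- function(x) {
  cat("n =", length(x), " mean =", mean(x), "\n")
  invisible(range(x))
}

centered <- center(c(1, 2, 3))
r <- describe(c(1, 2, 3))   # prints the summary line only
print(r)                    # the range is available if you ask for it
```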

  2. Start each function with checks of the arguments. It takes a little extra time but will usually repay you (or other users of the function) by pointing out the source of errors. Here is an example:
         myfun = function(dtf, name, p=0.5) {
           if (is.matrix(dtf)) dtf = data.frame(dtf)
           if (!is.data.frame(dtf)) stop("dtf must be a data.frame or matrix")
           if (!is.character(name) || length(name)!=1) stop("name must be a single character string")
           if (p<=0 || p>=1) stop("p must be in the interval (0,1)")
           ...
           return(rslt)
         }
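
    Since the example above elides the body, here is a complete, runnable variant under assumed details: col_quantile() is a hypothetical function that returns a quantile of one named column, and the extra checks (column exists, p is a single number) are additions for illustration.

```r
# Hypothetical function: quantile p of column `name` in `dtf`
col_quantile <- function(dtf, name, p = 0.5) {
  if (is.matrix(dtf)) dtf <- data.frame(dtf)
  if (!is.data.frame(dtf)) stop("dtf must be a data.frame or matrix")
  if (!is.character(name) || length(name) != 1)
    stop("name must be a single character string")
  if (!name %in% names(dtf)) stop("name must be a column of dtf")
  if (!is.numeric(p) || length(p) != 1 || p <= 0 || p >= 1)
    stop("p must be a single number in the interval (0,1)")
  rslt <- quantile(dtf[[name]], probs = p, names = FALSE)
  return(rslt)
}

good <- col_quantile(data.frame(x = 1:9), "x")
print(good)

# A bad argument produces a message that points at the cause
msg <- tryCatch(col_quantile(data.frame(x = 1:9), "x", p = 2),
                error = function(e) conditionMessage(e))
print(msg)
```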
         
  3. Allow for stopping and restarting of functions with long loops (e.g., MCMC).
    A good trick is to set up your function (or even just a loop) as follows:
         myfun = function() {
           if (file.exists("myresults.dat")) {
              ...load and use old results...
           }
           ...
           for (i in 1:10000) {
             if (file.exists("stop")) {
               write.table(myresults, file="myresults.dat")
               stop("Early stop due to detection of stop file")
             }
             ...
           }
           ...
           return(...)
         }
         
    Then, you can create a file called "stop" at any time (e.g., in Linux using "echo stop>stop" at the Linux prompt) and the function will gracefully stop at the start of the next loop iteration. Without too much work, you can probably set up your function to automatically continue wherever you left off. Just remember to delete or rename the "stop" file before running the function again.
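
    One way the stop-and-resume idea might be filled in is sketched here. Everything specific is an assumption for illustration: run_sim() is a made-up function, the "work" is a running sum, the state is saved with saveRDS() rather than write.table(), and the function returns quietly instead of calling stop().

```r
run_sim <- function(n_iter = 100, dir = tempdir()) {
  state_file <- file.path(dir, "myresults.rds")
  stop_file  <- file.path(dir, "stop")

  # Resume from saved state if present, otherwise start fresh
  if (file.exists(state_file)) {
    state <- readRDS(state_file)
  } else {
    state <- list(i = 0, total = 0)
  }

  while (state$i < n_iter) {
    # Check for the stop file at the start of every iteration
    if (file.exists(stop_file)) {
      saveRDS(state, state_file)
      message("Early stop at iteration ", state$i, "; state saved")
      return(invisible(state))
    }
    state$i <- state$i + 1
    state$total <- state$total + state$i   # stand-in for the real work
  }

  # Finished: remove any leftover saved state
  if (file.exists(state_file)) file.remove(state_file)
  return(state)
}

# Demonstration in a scratch directory
d <- tempfile("ckpt"); dir.create(d)
file.create(file.path(d, "stop"))
first <- run_sim(100, d)             # stop file present: saves state and returns
file.remove(file.path(d, "stop"))
second <- run_sim(100, d)            # picks up the saved state and finishes
print(second$total)
```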

  4. Avoid using "attach" as a way to save typing. The major problem is that modification of old elements or creation of new ones is not saved when you quit (and "save workspace") R. This leads to insidious errors. One alternative is to use with(), e.g., something like:
         with(mydtf, plot(x, y, col=gender))
         
    where the columns of "mydtf" are x, y, and gender.
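
    A short sketch of how attach() can mislead, using a made-up data.frame: an assignment to an attached column name creates a separate variable in the workspace and leaves the data.frame unchanged.

```r
mydtf <- data.frame(x = 1:3, y = c(2, 4, 6))

attach(mydtf)
x <- x * 10        # creates a new workspace variable x; mydtf$x is untouched
detach(mydtf)

print(mydtf$x)     # still 1 2 3
print(x)           # 10 20 30

# with() reads the columns without attaching; assign back explicitly
mydtf$ratio <- with(mydtf, y / x)
print(mydtf)
```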

  5. Working with "non-visible functions": If you try, e.g., methods(logLik), you will find some methods (e.g., logLik.glm) that are marked with an asterisk and are "non-visible". Here is how to get a copy of those functions. Use getAnywhere(logLik.glm) to find that it is in the "namespace" called "stats". Then mylogLik.glm=stats:::logLik.glm will get you a copy of the function.
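
    The steps just described can be tried as follows. getS3method() is added here as another route to the same function; it is not mentioned in the tip itself.

```r
# List the methods; non-visible ones are flagged in the printed output
print(methods(logLik))

# Find where the method lives
found <- getAnywhere("logLik.glm")
print(found$where)

# Two ways to get a copy of the function
mylogLik.glm  <- stats:::logLik.glm
mylogLik.glm2 <- getS3method("logLik", "glm")
print(class(mylogLik.glm))
```

    The ::: operator reaches into a package's internals, so code that relies on it may break when the package changes.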

  6. (Advanced) To make a nice user interface with dialog boxes, etc. consider the Tcl/Tk package. Here is a good introduction. Here is the R help. Here is a primer with an update. You might prefer a higher-level package called rpanel, described here and here, with this home page, and this package reference, and this cute little example which needs spacer.gif. Here are more R examples. Here are more R examples. Here is a nice discussion about using tcl/tk vs. java package interfaces to R. And here is a (non-R) Tcl/Tk Electronic Reference.

Problems and Solutions

  1. Problem: Loading dates, e.g., from Excel, and working with dates is poorly documented.     Solution: Load datetest.csv, then try the examples in Rdates.R.
  2. Problem: Each click for the locator() annoyingly causes the bell to ring.
        Solution: options(locatorBell=FALSE)
  3. Problem: Create a new data.frame column that is a complex code based on old columns.
        Solution: Create a function for one subject and apply() it to all subjects. This is much more efficient than a for loop. E.g.
         myfun = function(x) {
           # Argument x should contain one row, columns a,b,e,f.
           # The result is the mean of a and f unless b is missing or not positive,
           # in which case the min of e and f is returned.
           if (is.na(x[2]) || x[2]<=0) {
             return(min(c(x[3],x[4])))
           } else {
             return((x[1]+x[4])/2)
           }
         }
         dtf$new = apply(dtf[,c("a","b","e","f")], 1, myfun)
         
    An alternative is as follows. The optional first line may prevent some wonky errors, and is good practice.
         dtf$new = NA  # in general, NA is safer than 0, protecting against bad logic
         Sel = is.na(dtf$b) | dtf$b<=0
         dtf[Sel, "new"] = pmin(dtf$e[Sel], dtf$f[Sel])
         Sel = !is.na(dtf$b) & dtf$b>0
         dtf[Sel, "new"] = (dtf$a[Sel]+dtf$f[Sel])/2
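
    A third possibility, sketched with made-up data, is a single vectorized ifelse() call. It is compact, but the step-by-step version above may be easier to check when the logic has many branches.

```r
# Hypothetical data with columns a, b, e, f
dtf <- data.frame(a = c(1, 2, 3, 4),
                  b = c(1, NA, -1, 2),
                  e = c(10, 20, 30, 40),
                  f = c(5, 25, 35, 45))

# Where b is missing or not positive take the smaller of e and f;
# otherwise take the mean of a and f
dtf$new <- with(dtf, ifelse(is.na(b) | b <= 0,
                            pmin(e, f),
                            (a + f) / 2))
print(dtf)
```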
         
  4. Problem: Analyze (all) subsets of a data.frame
       Solution: To analyze a single subset of a data.frame, you can use an index vector (logical or numeric) as the "row.selector" (first argument) of the form "incdata[row.selector, col.selector]". For example, the expression median(incdata[incdata$sex=="female", "income"]) calculates the median income of just the female subjects in data.frame "incdata".

    But writing a separate expression for each of several categories is awkward and inefficient, so the methods below present efficient alternatives. If no subsetting variable exists, consider using the R function cut() or Problem/Solution #3 to create it.

    Here is code you can paste into R to generate a sample data.frame to use as an example:

         n = 20
         incdata = data.frame(sex=c("male","female")[1+rbinom(n,1,0.5)],
                           race=c("black","white","hispanic","Asian")[1+rbinom(n,3,0.5)],
                           income=round(rnorm(n,50000,15000)),
                           networth=pmax(0,round(rnorm(n,50000,30000))))
         
    • The aggregate() command is designed to apply a built-in or user-defined function to each of one or several columns of a data.frame after automatically dividing the data.frame by the levels of one or more "by" variables.
           aggregate(incdata[,c("income","networth")], by=list(gender=incdata$sex,race=incdata$race), FUN=median)
           
      To get the results of a two-way aggregate into a table form, see aggregate.table() in the gdata package.

    • The tapply() command, or a combination of the split() and lapply() commands, can be used similarly to aggregate(). (In fact, aggregate() is a wrapper for tapply().) The split command can be applied to a vector, matrix, or data.frame, and will correspondingly produce a list of vectors, matrices, or data.frames based on the splitting variable(s). Try, for example,
                with(incdata, split(income, race))
                split(incdata[,3:4], list(gender=incdata$sex, race=incdata$race))
                
      Two things to note that differ from using aggregate are that list() is not needed when a single split variable is used, and that the result includes empty elements for categories with no data. The second step is to use lapply() or sapply() on the split() list to carry out some function of the data, either a built-in function, a function that you have created, or an on-the-fly "anonymous" function. The difference between lapply() and sapply() is that the latter will return a simple, non-list object, if possible. This example calculates the mean of the ratio of income to net worth for each sex/race subgroup using an anonymous function on the two-column elements of the list:
                tmp = split(incdata[,3:4], list(gender=incdata$sex, race=incdata$race))
                lapply(tmp, function(x) mean(x[,1]/x[,2]))
                sapply(tmp, function(x) mean(x[,1]/x[,2]))
                
      This example calculates the range of income for each race:
                with(incdata, sapply(split(income, race), function(x) diff(range(x))))
                
      The simpler version uses tapply():
                with(incdata, tapply(income, race, function(x) diff(range(x))))
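
    When no grouping variable exists yet, cut() can create one from a numeric column, as mentioned at the start of this solution. The sketch below uses made-up incomes and arbitrary bracket boundaries.

```r
# Hypothetical incomes and net worths
income   <- c(12000, 35000, 52000, 78000, 41000, 95000)
networth <- c(5000, 20000, 60000, 90000, 30000, 150000)

# Intervals are (0,40000], (40000,80000], (80000,Inf]
bracket <- cut(income, breaks = c(0, 40000, 80000, Inf),
               labels = c("low", "middle", "high"))
print(table(bracket))

# The new factor can then be used like any other grouping variable
med <- tapply(networth, bracket, median)
print(med)
```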
                
  5. Problem: R is too slow.
        Solution: Some options to try:
    • First try some version of the "apply" command instead of "for" loops. (Note that sapply(1:n,FUN) can be used to run FUN n times if the first argument of FUN is a dummy (unused) argument.)
    • Consider using parallel computing, e.g., SNOW: (web page) (package) or Parallel R (paper) (package)
    • More experimental is Luke Tierney's R byte compiler. The help file is here. The zipped tar file is here. To load on Linux use something like gunzip foo.tar.gz to convert foo.tar.gz to foo.tar, then tar xvf foo.tar to extract all of the files. Use the Linux command R CMD INSTALL -l compiler compiler to install the package (-l compiler is needed only if you don't have admin permissions). In R, use library(compiler, lib.loc="compiler") to load the library for use (each time). (Change lib.loc= if the main directory is not named compiler in your current directory.)
    • Consider writing all or part of your code in C (see here).
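
    The sapply(1:n, FUN) idiom from the first bullet is sketched here next to the equivalent for loop. The function one_run() is made up for illustration, and replicate() is shown as a related base-R shortcut. No timing claim is intended; how much speed is gained depends on the task.

```r
# One simulation run; the argument i is a dummy and is not used
one_run <- function(i) mean(rnorm(10))
n <- 5

# for-loop version
set.seed(1)
res_loop <- numeric(n)
for (i in 1:n) res_loop[i] <- one_run(i)

# sapply version: same computation, no explicit loop bookkeeping
set.seed(1)
res_sapply <- sapply(1:n, one_run)

# replicate() avoids the dummy argument altogether
set.seed(1)
res_rep <- replicate(n, mean(rnorm(10)))

print(cbind(res_loop, res_sapply, res_rep))
```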

Useful Functions

Note: In R, use source("somefunction.R"), including the quotes, to make the functions in somefunction.R available.


All links active 8/2/2005. Please report missing links, errors, and suggestions to

