R Beowulf Controller

Motivation and implementation overview: Statisticians often develop a new analysis method, then want to test its performance using simulated data, often in comparison to a standard analysis method. The entire simulate/analyze procedure is usually repeated hundreds of times, and then some summary output is compiled across all of the simulations. This setup seems ideal for "parallel processing" using our (Unix) Beowulf cluster.

Parallel processing would normally involve a major rewrite of the software, incorporating SNOW for R and/or MPI for C. This page presents a method and software for repeatedly running a "series" of programs consisting of some number of elements including quick R functions (which don't need to be run in parallel), longer R programs (to be run in parallel), and executable programs (e.g., C or Fortran) to be run in parallel, all using a Beowulf cluster.

I start with the idea that each series can be run independently in parallel, as long as for any one series all of its elements are run in a strict order. Because all series run in a common file space, any files created by a series or used as input for an individual series must be numbered by a "series number" to keep the series independent. Also, most commonly, summary statistics are created for each series and compiled together into a summary output file. The method described here allows use of existing executable programs and R functions and programs with little or no modification.
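
The file-numbering rule can be illustrated with a minimal sketch. The helper name series.file is hypothetical and not part of BeoR.R; it only shows one way to embed the series number in a file name, following the Sim#.dat pattern used in Example 1.

```r
# Hypothetical helper (not part of BeoR.R): build a file name that embeds
# the series number so that concurrent series never share a file.
series.file <- function(prefix, series, ext = "dat") {
  paste0(prefix, series, ".", ext)
}

# Each series reads and writes only its own files.
series.file("Sim", 7)     # "Sim7.dat"
series.file("Sim", 1:3)   # "Sim1.dat" "Sim2.dat" "Sim3.dat"
```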


Brief instructions: For those who like to jump in rather than read the details, here is a quick start. The steps required to run a simulation study on banshee in the CMU statistics department are:
  1. Create a working directory not on the Beowulf cluster (because the cluster is neither backed up nor guaranteed to be persistent).
  2. Copy BeoR.R into the directory and modify it according to the instructions in the top half. Run the program in R with SYNTAX.CHECKING.ONLY=TRUE, making any needed corrections until it is error free.
  3. ssh into banshee, go to the directory within which you want to work, and sftp the following files onto banshee: the .R file(s) with the R function(s), the executable files for your non-R program(s), and your customized version of BeoR.R. (I recommend using an sftp batchfile for this.)
  4. Use "nice R CMD BATCH BeoR.R&" to start the simulation study. You probably want to try it once with SERIES.REP=1 as a final check of your analysis, before setting the desired number of series repetitions. You can use "tail BeoR.Rout" to monitor the status of the analysis. BeoR.R will e-mail you when your job is done.
  5. Use sftp to copy your output summary file back to your non-Beowulf file space for analysis.


Some more details: The guts of BeoR are designed to keep one C or (parallel) R process running on each Beowulf cluster slave node at all times. In the spare time while the C processes are running, the brief R functions are run. If nothing can be done at any moment, BeoR goes to sleep to avoid using too many system resources. The main guidelines BeoR follows are:
  1. For each set of "elements" which constitutes one simulation and analysis cycle (called a "series" in BeoR.R), the order of execution is guaranteed. Each element must be one of three types, but there may be any number of each type in any pre-set order. The element types are quick R functions (R), longer R programs run in parallel (P), and executable programs, e.g., C or Fortran, run in parallel (E).
  2. Each series runs independently of every other series, managed by BeoR. You don't need to write any "parallel" code, but you must ensure that any files unique to a series include the series number in their name to keep them unique. Results are stored in the summary output file in the order they are completed, which depends on cluster load.
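
The two guidelines can be sketched with a toy model. This is not the controller code in BeoR.R: the function name simulate.schedule, the fixed "tick" per node job, and the first-come node assignment are all assumptions made for illustration, and the real controller sleeps between checks instead of counting ticks.

```r
# Toy model of the scheduling idea; this is NOT the code in BeoR.R.
# "P" and "E" elements occupy one node for one tick; "R" elements run
# immediately on the controller and need no node.
simulate.schedule <- function(series.definition, n.series, n.nodes) {
  n.elem <- length(series.definition)
  pos <- rep(1L, n.series)    # index of the next element for each series
  started <- character(0)     # log of elements in the order they start
  ticks <- 0L
  while (any(pos <= n.elem)) {
    ticks <- ticks + 1L
    free <- n.nodes           # all nodes are free at the start of a tick
    for (s in seq_len(n.series)) {
      # Quick R functions run on the controller, in order, without a node.
      while (pos[s] <= n.elem && startsWith(series.definition[pos[s]], "R")) {
        started <- c(started, paste0("series", s, ":", series.definition[pos[s]]))
        pos[s] <- pos[s] + 1L
      }
      # A "P" or "E" element needs a free node; otherwise the series waits.
      if (pos[s] <= n.elem && free > 0L) {
        started <- c(started, paste0("series", s, ":", series.definition[pos[s]]))
        pos[s] <- pos[s] + 1L
        free <- free - 1L
      }
    }
  }
  list(ticks = ticks, started = started)
}

# Two series on two nodes: both advance together and finish in 4 ticks.
res <- simulate.schedule(c("P1", "E1", "E2", "R1"), n.series = 2, n.nodes = 2)
res$ticks
res$started
```

Within each series the elements always start in the declared order, while the interleaving between series depends on how many nodes are free.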

As described above, your main task in getting ready to use BeoR is to create a customized copy of BeoR.R (with the same or a different name) and to enter the required and optional information describing your particular situation in the places provided and documented in the top half of the file. The lower half of the BeoR.R file contains the cluster controller program. Normally you will not need to modify or understand it; it is complex mainly so that the top half of the file can stay simple and easy to set up for any new problem.

The instructions for completing each entry in the top half of BeoR.R are included in that file, but here is some supplementary information for each entry:


Example 1

Example one is a silly little example for which each series has these elements:
  1. element "P1" is an R program called SimP.R that generates random data with 3 columns and a user-specified number of rows and places it into a file named Sim#.dat, where # represents the series number. Because I put "SimP.R" in the PARALLEL.R.PROGRAM.NAMES vector, it will be run in parallel.
  2. elements "E1" and "E2" are both (the same) C program called Trig (with source code in Trig.c). Trig calculates the sum and mean of either the arc-tangent of the random data from "P1" for a series, or the absolute value of the arc-tangent of the random data, depending on the setting of one of its runstring arguments. If it is doing absolute values, it writes its output into a serially numbered data file containing the letter A for absolute; otherwise the file name has S for signed.
  3. element "R1" is an R function (not program) called analyze1 (found in Analyze1.R) which analyzes the output of "E1" and "E2", in this case simply concatenating all four results into a single vector, which it returns, allowing BeoR to add the vector to the summary output file.
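
To make the flow of one series concrete, here is a stand-in written entirely in R. It is not the code of SimP.R, Trig, or analyze1: the function name run.toy.series, the use of rnorm, and the file layout are assumptions, and the arc-tangent step that Trig performs in C is done here in R.

```r
# Stand-in for one series of Example 1, for illustration only.
run.toy.series <- function(series, n.rows, dir = tempdir()) {
  # "P1": random data with 3 columns, written to Sim#.dat
  sim.file <- file.path(dir, paste0("Sim", series, ".dat"))
  dat <- matrix(rnorm(3 * n.rows), ncol = 3)
  write.table(dat, sim.file, row.names = FALSE, col.names = FALSE)

  # "E1" and "E2": sum and mean of atan(x), signed and absolute
  x <- as.matrix(read.table(sim.file))
  signed <- c(sum(atan(x)), mean(atan(x)))
  absolute <- c(sum(abs(atan(x))), mean(abs(atan(x))))

  # "R1": concatenate all four results into one vector for the summary file
  c(signed, absolute)
}

set.seed(1)
result <- run.toy.series(series = 1, n.rows = 10)
length(result)   # 4: signed sum, signed mean, absolute sum, absolute mean
```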

Detailed Description of how to run Example 1:

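  1. Make a directory for this project, e.g., Example1, in your home directory or in some subdirectory. Copy the example files into the directory: BeoRExample1.R, SimP.R, Analyze1.R, and Trig (the same files that are transferred in step 3).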
  2. Examine the "Required and optional user input" portion at the top of BeoRExample1.R to see how this analysis was defined. Note that BeoRExample1.R has SYNTAX.CHECKING.ONLY=TRUE. Perform syntax checking on a non-Beowulf computer using the Unix command R CMD BATCH BeoRExample1.R. Examine the R batch output in BeoRExample1.Rout to verify that there are no errors. If everything is OK, the last line will read "Syntax appears correct. Now set SYNTAX.CHECKING.ONLY=FALSE and try to run on the Beowulf cluster."
  3. Change SYNTAX.CHECKING.ONLY to FALSE and leave SERIES.REP at 1 for further testing on the Beowulf cluster. In the CMU Statistics department, use ssh banshee to get onto the Beowulf cluster. Use sftp myMachine to set up a secure ftp session to read files from your machine. Use cd myDirectory followed by get BeoRExample1.R, get SimP.R, get Trig, get Analyze1.R and, finally, exit. This will copy the files onto the Beowulf cluster.
  4. Run your analysis on the Beowulf cluster using nice R CMD BATCH BeoRExample1.R&. Use top or ps -fu myUsername to see when the one series is complete. Then examine BeoRExample1.Rout to verify that there are no errors. Also, examine TrigResults.txt to confirm successful analysis (it should have a header and one row of 4 numbers).
  5. Change SERIES.REP in BeoRExample1.R to the desired number of series and re-run the "nice...&" command above.
  6. Note that every time you try tail BeoRExample1.Rout (or if you let tail -f BeoRExample1.Rout run for a while) you will see output telling you which series are currently running on each node, and a tabulation for each element showing how many series are currently at that element. You can also examine your results in TrigResults.txt as the series are being analyzed using wc TrigResults.txt or cat TrigResults.txt or tail TrigResults.txt.
  7. You will receive e-mail when the whole analysis is complete. Then you would use sftp to transfer the summary output file and any other files you want back to your usual filespace for further analysis.
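
Once the summary file is back in your usual filespace, a first look might go like the sketch below. The layout is assumed, not confirmed by BeoR.R: the sketch supposes a whitespace-delimited file with a header row and a leading series-number column, and the column names are invented. A small stand-in file is created so the example runs on its own.

```r
# Build a small stand-in for a summary file; column names and layout are assumed.
summary.file <- tempfile(fileext = ".txt")
writeLines(c("series sumS meanS sumA meanA",
             "3 0.5 0.05 4.1 0.41",
             "1 -1.2 -0.12 3.9 0.39",
             "2 0.3 0.03 4.4 0.44"), summary.file)

# Rows arrive in completion order, so sort by series number before analysis.
results <- read.table(summary.file, header = TRUE)
results <- results[order(results$series), ]

nrow(results)             # number of completed series
colMeans(results[, -1])   # averages of each statistic across series
```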


Details on error messages from syntax checking mode

Here are possible error messages produced during syntax checking. For any that are not completely obvious, hints about what to fix are given.
  1. The variable SYNTAX.CHECKING.ONLY must exist and equal TRUE or FALSE.
  2. You erased the R.FILES= line!!!
  3. R.FILES must be a vector of quoted, non-blank strings.
  4. R file --- does not exist in -----.
  5. Bad R file ----: some error message is shown which tells why the R file could not be source()ed by R.
  6. Bad R file ----: some error message is shown which tells why the R file could not be source()ed by R.
  7. FIRST.SERIES must be a single integer >=1
  8. R.FUNCTION.NAMES must be a vector of quoted function names or just c() (Variable does not exist.)
  9. R.FUNCTION.NAMES must be a vector of quoted function names (Names not quoted.)
  10. R.ARGUMENT.LISTS must be a vector of quoted argument lists (Variable does not exist.)
  11. R.ARGUMENT.LISTS must be a vector of quoted argument lists matching R.FUNCTION.NAMES (Need one runstring per R function.)
  12. ---- is not an existing R function
  13. ---- must define at least one argument (Functions in R.FUNCTION.NAMES must have at least one argument.)
  14. Error evaluating R.ARGUMENT.LISTS for -----: some error message is shown which tells why the argument list could not be evaluated.
  15. Missing argument(s): ---- (These arguments are required by the R function, but not given in the argument list.)
  16. Extra argument(s): ---- (These arguments are in the argument list, but not allowed for by the R function.)
  17. PARALLEL.R.PROGRAM.NAMES must be a vector of quoted R program names or just c() (Variable missing.)
  18. PARALLEL.R.PROGRAM.NAMES must be a vector of quoted program names
  19. File ----- does not exist (R source file for a parallel R program does not exist in the current directory.)
  20. ----- is a directory, not an R file (For PARALLEL.R.PROGRAM.NAME, a directory was given instead of a source file name.)
  21. ----- is an executable, not an R source file (Perhaps it should be on the EXEC.PROGRAM.NAMES list instead.)
  22. EXEC.PROGRAM.NAMES must be a vector of quoted executable program names or just c() (Variable missing.)
  23. EXEC.PROGRAM.NAMES must be a vector of quoted program names
  24. EXEC.RUNSTRINGS must be a vector of quoted expressons, matching EXEC.PROGRAM.NAMES (Need one runstring per program.)
  25. EXEC.INPUT must be a vector of quoted expressons, matching EXEC.PROGRAM.NAMES (Need one input redirect per program.)
  26. EXEC.OUTPUT must be a vector of quoted expressons, matching EXEC.PROGRAM.NAMES (Need one output redirect per program.)
  27. Program ----- was not found (Executable is not in the current directory. Is "./" missing?)
  28. ----- is a directory, not a program
  29. ----- is not executable
  30. Error evaluating runstring for executable ----- : some error message is shown which tells why the runstring could not be evaluated.
  31. The runstring for program ----- must evaluate to a string, not -----.
  32. The runstring for ----- must be of length one, not --: Your vector is shown where a single value is needed.
  33. Error evaluating input redirect for executable ----- : some error message is shown which tells why the input redirect could not be evaluated.
  34. The input redirect for program ----- must evaluate to a string, not -----.
  35. The input redirect for ----- must be of length one, not --: Your vector is shown where a single value is needed.
  36. Error evaluating ouput redirect for executable ----- : some error message is shown which tells why the output redirect could not be evaluated.
  37. The ouput redirect for program ----- must evaluate to a string, not -----.
  38. The ouput redirect for ----- must be of length one, not --: Your vector is shown where a single value is needed.
  39. Variable "OUTFILE" does not exist.
  40. Variable "OUTFILE" must be a non-blank character string.
  41. PREPEND.SERIES must exist and be set to TRUE or FALSE.
  42. APPEND.OUTFILE must exist and be set to TRUE or FALSE.
  43. SERIES REP must be a number >=1.
  44. SLEEP.SECONDS must be a number >=2.
  45. Variable ----- must exist. (Ancillary functions do nothing, but cannot be missing.)
  46. Variable ----- must be a function. (Ancillary functions must be functions, even if they do nothing.)
  47. SERIES.DEFINITION must exist and be a character string vector with at least one value.
  48. Bad format for SERIES.DEFINITION.
  49. Each element of SERIES.DEFINITION must start with R, P or E.
  50. Each element of SERIES.DEFINITION must have only a number following the letter.
  51. ----- has no elements, but SERIES.DEFINITION contains some ----- elements.
  52. ----- has -- elements, but SERIES.DEFINITION does not contain exactly L# through L#. (Some elements are declared above but not used in the SERIES.DEFINITION.)
  53. DEBUG must be TRUE or FALSE
  54. Can't find system program mpdtrace--are you really on a Beowulf cluster? (mpdtrace is a required cluster program that lists slave node names.)
  55. Problem with mpdtrace (mpdtrace is a required cluster program that lists slave node names.)
  56. Stopping because no/only # nodes are available (A minimum of 5 slave nodes is required, but not available.)
  57. Can't read $USER from operating system (System environment variable USER is required, but not present.)
  58. PRAGMA.USE.NODE.LOAD not yet implemented (Currently this variable may only be set to 0.)
  59. Failure calling L# for series -- (A problem occurred while attempting to run the specified element.)
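
As an illustration of how messages 47, 49, and 50 relate to the format of SERIES.DEFINITION, here is a hypothetical re-creation of those three checks. The function name check.series.definition is invented, and the actual checks in BeoR.R may be implemented differently.

```r
# Hypothetical re-creation of three format checks; not the code in BeoR.R.
check.series.definition <- function(series.definition) {
  if (!is.character(series.definition) || length(series.definition) < 1) {
    return("SERIES.DEFINITION must exist and be a character string vector with at least one value.")
  }
  # Every element must begin with one of the three element-type letters.
  if (!all(substr(series.definition, 1, 1) %in% c("R", "P", "E"))) {
    return("Each element of SERIES.DEFINITION must start with R, P or E.")
  }
  # After the letter, only digits are allowed.
  if (!all(grepl("^[RPE][0-9]+$", series.definition))) {
    return("Each element of SERIES.DEFINITION must have only a number following the letter.")
  }
  "OK"
}

check.series.definition(c("P1", "E1", "E2", "R1"))   # "OK" (the Example 1 series)
check.series.definition(c("P1", "X1"))               # bad leading letter
check.series.definition(c("P1", "E1a"))              # extra text after the number
```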