R Beowulf Controller
Motivation and implementation overview: Statisticians often develop a new analysis method, then want to test its performance
using simulated data, often in comparison to a standard analysis method. The entire simulate/analyze
procedure is usually repeated hundreds of times and then some summary output is compiled across all
of the simulations. This setup seems ideal for "parallel processing" using our (Unix) Beowulf cluster.
Parallel processing would normally
involve a major rewrite of the software, incorporating SNOW for R and/or MPI for C. This page presents
a method and software for repeatedly running a "series" of programs consisting of some number of elements
including quick R functions (which don't need to be run in parallel), longer R programs (to be run in parallel),
and executable programs (e.g., C or Fortran) to be run in parallel, all using a Beowulf cluster.
I start with the idea that each series can be run independently in parallel, as long as for any one series
all of its elements are run in a strict order. Because all series run in a common file space, any
files created by the series or used as input for individual series must be numbered by a "series number"
to keep the series independent. Also, most commonly, summary statistics are created for each series,
and compiled together into a summary output file. The method described here allows use of existing
executable programs and R functions and programs with little or no modification.
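As a minimal sketch of the file-naming idea (the helper name seriesFile and the file prefixes are hypothetical, not part of BeoR.R), a series-specific file name can be built in R with paste:

```r
# Hypothetical helper: build a file name that embeds the series number.
seriesFile <- function(prefix, series, ext = ".dat") {
  paste(prefix, series, ext, sep = "")
}

series  <- 7                                    # the current series number
infile  <- seriesFile("Sim", series)            # input file for series 7
outfile <- seriesFile("Rslt", series, ".txt")   # output file for series 7

# Two different series never share a file name.
differentNames <- seriesFile("Sim", 7) != seriesFile("Sim", 8)
```

Because every element of a series builds its file names from the same series number, the elements of one series can find each other's files without colliding with any other series.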
Brief instructions: For those who like to jump in rather than read the details, here is a
quick start. The steps required to run a simulation study on banshee in the CMU statistics department are:
- Create a working directory not on the Beowulf cluster (because the cluster is neither backed up
nor guaranteed to be persistent).
- Copy BeoR.R into the directory and modify it according to the instructions in the top half.
Run the program in R with SYNTAX.CHECKING.ONLY=TRUE, making any needed corrections until it is error free.
- ssh into banshee, go to the directory within which you want to work, and sftp the following files
onto banshee: the .R file(s) with the R function(s), the executable files for your non-R program(s), and
your customized version of BeoR.R. (I recommend using an sftp batchfile for this.)
- Use "nice R CMD BATCH BeoR.R&" to start the simulation study. You probably want to try it once
with SERIES.REP=1 as a final check of your analysis, before setting the desired number of series
repetitions. You can use "tail BeoR.Rout" to
monitor the status of the analysis. BeoR.R will e-mail you when your job is done.
- Use sftp to copy your output summary file back to your non-Beowulf file space for analysis.
Some more details: The guts of BeoR are designed to keep one C or (parallel) R process running on each
Beowulf cluster slave node at all times. In the spare time while the C processes are running, the brief R functions
are run. If nothing can be done at any moment, BeoR goes to sleep to avoid using too many system resources.
The main guidelines BeoR follows are:
- For each set of "elements" which constitutes one simulation and analysis
cycle (called a "series" in BeoR.R), the order of execution is guaranteed. The elements
must be one of three types, but there may be any number of each type in any pre-set order.
The element types are
- Brief R function calls which are run on the master node (not in parallel)
- Longer R program runs which are managed to run in parallel
- Executable programs which are managed to run in parallel
- Each series runs independently of every other series, managed by BeoR. You don't need to write
any "parallel" code, but you must assure that any files unique to a series include the series number
in their name to keep them unique. Results are stored in
the summary output file in the order they are completed, which depends on cluster load.
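The scheduling idea can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the BeoR.R controller: time advances in whole ticks, quick R elements take no time, every parallel element of a given series takes the same number of ticks, and the function name simulateSchedule and its arguments are made up for illustration.

```r
# Toy simulation of "one job per node, strict order within a series".
# elements: ordered element names, first letter R (quick), P or E (parallel).
# jobTicks[s]: ticks that each parallel element of series s needs (assumed).
simulateSchedule <- function(elements, jobTicks, nNodes) {
  nSeries    <- length(jobTicks)
  nextElem   <- rep(1, nSeries)            # next element index per series
  running    <- rep(FALSE, nSeries)        # is the series occupying a node?
  done       <- rep(FALSE, nSeries)
  nodeSeries <- rep(NA_integer_, nNodes)   # series on each node (NA = idle)
  nodeLeft   <- rep(0, nNodes)             # ticks left on each node
  finished   <- integer(0)                 # series in order of completion
  while (!all(done)) {
    # Advance running jobs by one tick and free nodes whose job has ended.
    for (n in which(!is.na(nodeSeries))) {
      nodeLeft[n] <- nodeLeft[n] - 1
      if (nodeLeft[n] == 0) {
        s <- nodeSeries[n]
        running[s] <- FALSE
        nextElem[s] <- nextElem[s] + 1
        nodeSeries[n] <- NA_integer_
      }
    }
    # Move each waiting series forward as far as it can go.
    for (s in which(!running & !done)) {
      # Quick "R" elements run immediately on the master node.
      while (nextElem[s] <= length(elements) &&
             substr(elements[nextElem[s]], 1, 1) == "R") {
        nextElem[s] <- nextElem[s] + 1
      }
      if (nextElem[s] > length(elements)) {
        done[s] <- TRUE
        finished <- c(finished, s)
      } else {
        idle <- which(is.na(nodeSeries))
        if (length(idle) > 0) {            # start the next parallel element
          nodeSeries[idle[1]] <- s
          nodeLeft[idle[1]] <- jobTicks[s]
          running[s] <- TRUE
        }
        # Otherwise the series waits; a real controller would sleep here.
      }
    }
  }
  finished
}

completionOrder <- simulateSchedule(c("P1", "E1", "E2", "R1"),
                                    jobTicks = c(3, 1, 2), nNodes = 2)
```

With these toy durations the series do not finish in numerical order, which is the reason the rows of the summary output file are not necessarily in series order.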
As described above, your main task in getting ready to use BeoR is to create a customized copy of
BeoR.R (with the same or a different name), and to enter the optional information describing your
particular situation in the places provided and documented in the top half of the file. The lower
half of the BeoR.R file has the cluster controller program. Normally you will not need to modify
or understand this. It is complex by design, so that the top half of the file can keep a simple form
that allows easy setup for any new problem.
The instructions for completing each entry in the top half of BeoR.R are included in that
file, but here is some supplementary information for each entry:
- rm(list=objects()): You may delete this line if you don't want BeoR.R to start "clean", but I
don't recommend it because your analysis may become inadvertently "broken" in the future.
- "define your own variables": Often you will run BeoR.R several times for various settings of some
parameter(s). It is most convenient to define them here so that only this section needs to be changed
between runs. (I suggest using the all capital letters convention with periods between words.)
- "R.FILES": Change c() to a vector of quoted filenames (e.g., c("myfile1.R","myfile2.R")).
These files will be loaded into R using source(). The variables and functions present in these R
files will be available to both the BeoR.R program itself and any R programs that are run in
parallel on the nodes. Normally they will contain the definitions of any R functions called for each
series (see R.FUNCTION.NAMES, below); the definitions of any functions called by the PARALLEL.R.PROGRAM.NAMES
programs (see below); and constants needed by the PARALLEL.R.PROGRAM.NAMES programs,
the function calls to the R.FUNCTION.NAMES functions, and the argument lists for the
EXEC.PROGRAM.NAMES programs.
- "R.FUNCTION.NAMES": This is where the names of any quick R functions that are elements of
your series are declared. Change c() to a vector of quoted function names (e.g.,
c("myfun1","myfun2")). These functions are not parallelized, and are intended for quick
operations that are too brief to be worth parallelizing. They must operate independently on a series,
knowing only the series number and any other constants passed to their argument lists, and the fact that
earlier elements in the series have been run already, and later elements have not yet been run.
If a function returns a vector, that vector is appended to the summary output file (OUTFILE), otherwise
the return value is ignored. Communication between elements of a series, if needed, should be
via reading and writing of uniquely named files that include the series number.
- "R.ARGUMENT.LISTS": This is where the argument list(s) for the (quick) R function(s) are defined.
Often the value rep('list(count=n)',NRF), where NRF is the number of R functions defined
in R.FUNCTION.NAMES, can be used. This variable must always be in the form of a
vector, each element of which contains a quoted list of comma separated equalities. Each equality has an argument of
the function on the left and an assigned value on the right. The assigned value can be the variable
series which always is the current series number, or a constant, or one of the "define your own"
variables, or an expression combining any of these. Because the R functions normally create and/or analyze
unique files, they need series to create the unique file names.
- "PARALLEL.R.PROGRAM.NAMES": This is a vector of quoted file names, each of which refers to an
R program that constitutes an element in the series that should be run in parallel. These programs
have access to any constants or functions defined in "R.FILES", but any R variables created by
these programs are not available to other elements of the series. See BeoR.R for important details
on how your PARALLEL.R.PROGRAM.NAMES programs obtain the series number and assure that they obtain
their own unique random numbers. Like any other elements in a series, these programs normally
operate by reading and/or writing files containing unique names constructed from the series number.
- "EXEC.PROGRAM.NAMES": This variable contains the names of any C, Fortran, etc. programs that
are elements of a series and can be run in parallel across series. These programs may obtain the
series number using their runstring (see EXEC.RUNSTRINGS, below). Alternatively, these programs
may use input/output redirection (< and >) in which you specify input and output filenames
constructed using EXEC.INPUT and EXEC.OUTPUT (below) that encode the series number. Either way,
these programs normally operate by reading and/or writing files containing unique names constructed
from the series number.
- "EXEC.RUNSTRINGS": This variable contains a vector (with length matching EXEC.PROGRAM.NAMES) that
defines the runstring for each EXEC.PROGRAM.NAMES element. Each element of the EXEC.RUNSTRINGS
vector is a string which, when unquoted, can be evaluated to produce a string that is the
runstring for the corresponding program. Most commonly, this is just the series number, in
which case EXEC.RUNSTRINGS=rep('paste(series)',NEP), where NEP is the number of executable
programs, can be used. If needed, more complex runstrings can be easily constructed using the
paste function (but in quotes).
- "EXEC.INPUT": This variable contains a vector (with length matching EXEC.PROGRAM.NAMES) that
defines the input redirection filename for each EXEC.PROGRAM.NAMES element. Each element of the EXEC.INPUT
vector is a string which, when unquoted, can be evaluated to produce a string that is the
input redirection filename for the corresponding program. Most commonly, this is constructed using
paste, e.g., EXEC.INPUT=rep('paste("Input",series,".dat",sep="")',NEP), where NEP is the number of executable
programs. Use '' for any unneeded value(s).
- "EXEC.OUTPUT": This variable contains a vector (with length matching EXEC.PROGRAM.NAMES) that
defines the output redirection filename for each EXEC.PROGRAM.NAMES element. Each element of the EXEC.OUTPUT
vector is a string which, when unquoted, can be evaluated to produce a string that is the
output redirection filename for the corresponding program. Most commonly, this is constructed using
paste, e.g., EXEC.OUTPUT=c('paste("RsltA",series,".txt",sep="")', 'paste("RsltB",series,".txt",sep="")').
Use '' for any unneeded value(s).
- "OUTFILE": Presumably all summary results for each series go into a single file for
final analysis. If you define OUTFILE and if any of the R.FUNCTION.NAMES functions returns a vector,
then BeoR will append the vector to OUTFILE. Often it is convenient to set the variable
OUTFILE equal to a paste command that
automatically defines a different summary file name for each BeoR run based on the "define your
own" variable values.
In addition, set PREPEND.SERIES to TRUE if you want the summary output file to put the series number in the
first position of each line. Also, set APPEND.OUTFILE to FALSE if you want any old version to
be erased before running BeoR.
- "FIRST.SERIES and SERIES.REP": SERIES.REP is set to the number of desired series in
the whole simulation set. If, for some reason, such as the desire to add results to an existing
simulation, you want to start series at a number other than 1, change FIRST.SERIES.
If SERIES.REP=1, as recommended initially for debugging purposes, then DEBUG (see below)
is automatically set to TRUE.
- "SLEEP.SECONDS": This is the number of seconds that BeoR sleeps when it finds nothing to do.
Setting this number too low wastes system resources, adversely affecting other users. You should
set SLEEP.SECONDS equal to a few percent of the normal time it takes to run the fastest parallel program.
- "initialSetup": You can redefine this function (or erase it and use a version in your R.FILES)
if you have some R tasks that you want to perform each time BeoR runs. An example is to
create a summary output file containing only a column header line, if none exists.
- "perIterationSetup": You can redefine this function (or erase it and use a version in your R.FILES)
to perform some setup for each series
just before the first series element is run. An example would be deleting old data files using the
R unlink() function.
- "perIterationCleanup": You can redefine this function (or erase it and use a version in
your R.FILES) to perform some cleanup for each series just after the last element is run.
An example would be deleting data and/or analysis files if you don't plan to save them.
- "finalCleanup": You can redefine this function (or erase it and use a version in your R.FILES)
to perform a final cleanup after all series are run. An example is cleaning up the nodexx files
used to tell the parallel R programs their series number.
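To make the "quoted expression" entries more concrete, here is a minimal sketch of how such entries might look and how a quoted string can be turned into a value for one series. The variable names R.FUNCTION.NAMES, R.ARGUMENT.LISTS, EXEC.PROGRAM.NAMES, EXEC.RUNSTRINGS, EXEC.INPUT, EXEC.OUTPUT and OUTFILE come from the descriptions above. Everything else is an assumption for illustration: the values, the program names, the stand-in function analyze, and the helper evalQuoted, which uses the standard R idiom eval(parse(text=...)) and is not taken from the BeoR.R source.

```r
# "Define your own" variable (hypothetical value).
N.ROWS <- 50

# Entries in the style described above; names and values are made up.
R.FUNCTION.NAMES   <- c("analyze")
R.ARGUMENT.LISTS   <- c('list(series=series, n=N.ROWS)')
EXEC.PROGRAM.NAMES <- c("./progA", "./progB")
NEP <- length(EXEC.PROGRAM.NAMES)
EXEC.RUNSTRINGS <- rep('paste(series)', NEP)
EXEC.INPUT  <- rep('paste("Input",series,".dat",sep="")', NEP)
EXEC.OUTPUT <- c('paste("RsltA",series,".txt",sep="")',
                 'paste("RsltB",series,".txt",sep="")')
OUTFILE <- paste("Summary", N.ROWS, ".txt", sep="")

# Assumed mechanism: evaluate a quoted expression with `series` bound
# to the current series number; other names come from the caller.
evalQuoted <- function(text, series) {
  env <- parent.frame()
  eval(parse(text = text), list(series = series), env)
}

# Stand-in for a quick R function that returns a vector.
analyze <- function(series, n) c(series, n)

series    <- 3
argList   <- evalQuoted(R.ARGUMENT.LISTS[1], series)
runstring <- evalQuoted(EXEC.RUNSTRINGS[1], series)
infile    <- evalQuoted(EXEC.INPUT[1], series)
outfile   <- evalQuoted(EXEC.OUTPUT[2], series)

# Call the quick R function by name with the evaluated argument list.
result <- do.call(R.FUNCTION.NAMES[1], argList)
```

Keeping the entries as quoted strings means the same text can be re-evaluated for every series, each time with a different value of series.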
Example 1
Example one is a silly little example for which each series has these elements:
- element "P1" is an R program called SimP.R that generates random data with 3 columns
and a user-specified number of rows and places it into a file
named Sim#.dat, where # represents the series number. Because I put "SimP.R" in the
PARALLEL.R.PROGRAM.NAMES vector, it will be run in parallel.
- elements "E1" and "E2" are both (the same) C program called Trig (with source code in
Trig.c). Trig calculates the sum and mean of either the arc-tangent of the random
data from "P1" for a series, or the absolute value of the arc-tangent of the random data, depending
on the setting of one of its runstring arguments. If it is doing absolute values, it writes
its output into a serially numbered data file containing the letter A for absolute, otherwise the
file name has S for signed.
- element "R1" is an R function (not program) called analyze1 (found in Analyze1.R) which
analyzes the output of "E1" and "E2", in this case
simply concatenating all four results into a single vector, which it returns, allowing BeoR to
add the vector to the summary output file.
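For readers who want a feel for what an "R1"-style element does, here is a hypothetical stand-in. It is not the contents of Analyze1.R: the function name analyzeSketch, the file names TrigS#.txt and TrigA#.txt, and the assumption that each file holds just a sum and a mean are all made up for illustration.

```r
# Hypothetical stand-in for an analysis function like analyze1().
# Reads the two (assumed) result files for one series and returns
# their four numbers as a single vector.
analyzeSketch <- function(series, dir = ".") {
  signedFile <- file.path(dir, paste("TrigS", series, ".txt", sep = ""))
  absFile    <- file.path(dir, paste("TrigA", series, ".txt", sep = ""))
  c(scan(signedFile, quiet = TRUE), scan(absFile, quiet = TRUE))
}

# Demo with made-up numbers written to a temporary directory.
demoDir <- tempdir()
writeLines("1.5 0.15", file.path(demoDir, "TrigS1.txt"))
writeLines("4.0 0.40", file.path(demoDir, "TrigA1.txt"))
result <- analyzeSketch(1, demoDir)
```

Because the function returns a vector, a controller following the rules described for R.FUNCTION.NAMES would append that vector to the summary output file.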
Detailed Description of how to run Example 1:
- Make a directory for this project, e.g., Example1, in your home directory or in
some subdirectory. Copy these files into the directory:
- the modification of BeoR.R to define the whole analysis: BeoRExample1.R
- the parallel R program: SimP.R
- the C executable: Trig
- the R function for "analysis" of results: Analyze1.R
- Examine the "Required and optional user input" portion at the top of BeoRExample1.R to see how
this analysis was defined. Note that BeoRExample1.R has SYNTAX.CHECKING.ONLY=TRUE. Perform syntax checking
on a non-Beowulf computer using the Unix command R CMD BATCH BeoRExample1.R. Examine
the R batch output in BeoRExample1.Rout to verify that there are no errors. If everything is OK, the
last line will read "Syntax appears correct. Now set SYNTAX.CHECKING.ONLY=FALSE and
try to run on the Beowulf cluster."
- Change SYNTAX.CHECKING.ONLY to FALSE and leave SERIES.REP at 1 for further testing on the Beowulf
cluster. In the CMU Statistics department, use ssh banshee to get onto the Beowulf cluster.
Use sftp myMachine to set up a secure ftp session to read files from your machine. Use cd myDirectory
followed by get BeoRExample1.R, get SimP.R, get Trig, get Analyze1.R
and, finally, exit. This will copy the files onto the Beowulf cluster.
- Run your analysis on the Beowulf cluster
using nice R CMD BATCH BeoRExample1.R&. Use top or ps -fu myUsername to
see when the one series is complete. Then examine BeoRExample1.Rout to assure that there are no errors.
Also, examine TrigResults.txt to assure successful analysis (it should have a header and one row of 4 numbers).
- Change SERIES.REP in BeoRExample1.R to the desired number of series and re-run the "nice...&" command above.
- Note that every time you try tail BeoRExample1.Rout (or if you let tail -f BeoRExample1.Rout run
for a while) you will see output telling you which series are currently running on each node, and a
tabulation for each element showing how many series are currently at that element. You can also examine your
results in TrigResults.txt as the series are being analyzed using wc TrigResults.txt or
cat TrigResults.txt or tail TrigResults.txt.
- You will receive e-mail when the whole analysis is complete. Then you would use sftp to transfer
the summary output file and any other files you want back to your usual filespace for further analysis.
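Once the summary output file is back in your usual file space, a first look at it in R might go as follows. This is only a sketch: the demo file below stands in for TrigResults.txt, and its column names and numbers are invented, since the real header is defined by your own setup.

```r
# Made-up demo file standing in for a summary output file with a
# header line and one row of four numbers per completed series.
summaryFile <- tempfile(fileext = ".txt")
writeLines(c("sumS meanS sumA meanA",
             "1.5 0.15 4.0 0.40",
             "-0.7 -0.07 3.2 0.32"), summaryFile)

results <- read.table(summaryFile, header = TRUE)
nDone   <- nrow(results)          # number of series recorded so far
means   <- colMeans(results)      # average of each statistic across series
```

Comparing nDone with SERIES.REP is a quick way to confirm that every series contributed a row before doing any further analysis.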
Details on error messages from syntax checking mode
Here are possible error messages produced during syntax checking. For any that are not
completely obvious, hints about what to fix are given.
- The variable SYNTAX.CHECKING.ONLY must exist and equal TRUE or FALSE.
- You erased the R.FILES= line!!!
- R.FILES must be a vector of quoted, non-blank strings.
- R file --- does not exist in -----.
- Bad R file ----: some error message is shown which tells why the R file could not be source()ed by R.
- FIRST.SERIES must be a single integer >=1
- R.FUNCTION.NAMES must be a vector of quoted function names or just c() (Variable does not exist.)
- R.FUNCTION.NAMES must be a vector of quoted function names (Names not quoted.)
- R.ARGUMENT.LISTS must be a vector of quoted argument lists (Variable does not exist.)
- R.ARGUMENT.LISTS must be a vector of quoted argument lists matching R.FUNCTION.NAMES (Need one argument list per R function.)
- ---- is not an existing R function
- ---- must define at least one argument (Functions in R.FUNCTION.NAMES must have at least one argument.)
- Error evaluating R.ARGUMENT.LISTS for -----: some error message is shown which tells why the argument list could not be evaluated.
- Missing argument(s): ---- (These arguments are required by the R function, but not given in the argument list.)
- Extra argument(s): ---- (These arguments are in the argument list, but not allowed for by the R function.)
- PARALLEL.R.PROGRAM.NAMES must be a vector of quoted R program names or just c() (Variable missing.)
- PARALLEL.R.PROGRAM.NAMES must be a vector of quoted program names
- File ----- does not exist (R source file for a parallel R program does not exist in the current directory.)
- ----- is a directory, not an R file (For PARALLEL.R.PROGRAM.NAME, a directory was given instead of a source file name.)
- ----- is an executable, not an R source file (Perhaps it should be on the EXEC.PROGRAM.NAMES list instead.)
- EXEC.PROGRAM.NAMES must be a vector of quoted executable program names or just c() (Variable missing.)
- EXEC.PROGRAM.NAMES must be a vector of quoted program names
- EXEC.RUNSTRINGS must be a vector of quoted expressions, matching EXEC.PROGRAM.NAMES (Need one runstring per program.)
- EXEC.INPUT must be a vector of quoted expressions, matching EXEC.PROGRAM.NAMES (Need one input redirect per program.)
- EXEC.OUTPUT must be a vector of quoted expressions, matching EXEC.PROGRAM.NAMES (Need one output redirect per program.)
- Program ----- was not found (Executable is not in the current directory. Is "./" missing?)
- ----- is a directory, not a program
- ----- is not executable
- Error evaluating runstring for executable ----- : some error message is shown which tells why the runstring could not be evaluated.
- The runstring for program ----- must evaluate to a string, not -----.
- The runstring for ----- must be of length one, not --: Your vector is shown where a single value is needed.
- Error evaluating input redirect for executable ----- : some error message is shown which tells why the input redirect could not be evaluated.
- The input redirect for program ----- must evaluate to a string, not -----.
- The input redirect for ----- must be of length one, not --: Your vector is shown where a single value is needed.
- Error evaluating output redirect for executable ----- : some error message is shown which tells why the output redirect could not be evaluated.
- The output redirect for program ----- must evaluate to a string, not -----.
- The output redirect for ----- must be of length one, not --: Your vector is shown where a single value is needed.
- Variable "OUTFILE" does not exist.
- Variable "OUTFILE" must be a non-blank character string.
- PREPEND.SERIES must exist and be set to TRUE or FALSE.
- APPEND.OUTFILE must exist and be set to TRUE or FALSE.
- SERIES.REP must be a number >=1.
- SLEEP.SECONDS must be a number >=2.
- Variable ----- must exist. (Ancillary functions do nothing, but cannot be missing.)
- Variable ----- must be a function. (Ancillary functions must be functions, even if they do nothing.)
- SERIES.DEFINITION must exist and be a character string vector with at least one value.
- Bad format for SERIES.DEFINITION.
- Each element of SERIES.DEFINITION must start with R, P or E.
- Each element of SERIES.DEFINITION must have only a number following the letter.
- ----- has no elements, but SERIES.DEFINITION contains some ----- elements.
- ----- has -- elements, but SERIES.DEFINITION does not contain exactly L# through L#. (Some elements are declared above
but not used in the SERIES.DEFINITION.)
- DEBUG must be TRUE or FALSE
- Can't find system program mpdtrace--are you really on a Beowulf cluster? (mpdtrace is a required cluster
program that lists slave node names.)
- Problem with mpdtrace (mpdtrace is a required cluster program that lists slave node names.)
- Stopping because no/only # nodes are available (A minimum of 5 slave nodes is required, but not available.)
- Can't read $USER from operating system (System environmental variable USER is required, but not present.)
- PRAGMA.USE.NODE.LOAD not yet implemented (Currently this variable may only be set to 0.)
- Failure calling L# for series -- (A problem occurred while attempting to run the specified element.)
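The SERIES.DEFINITION messages above imply a simple format: each element is one of the letters R, P or E followed by a number, and the numbers for each letter must cover the declared elements. A minimal sketch of such a check follows. It is an illustration only, not the checking code in BeoR.R: the function name checkSeriesDefinition, its arguments, and the wording of the final mismatch message are assumptions.

```r
# Hypothetical format check for a SERIES.DEFINITION-style vector.
# nR, nP, nE: number of declared R functions, parallel R programs
# and executable programs.
checkSeriesDefinition <- function(def, nR, nP, nE) {
  if (!is.character(def) || length(def) < 1)
    return("SERIES.DEFINITION must exist and be a character string vector with at least one value.")
  if (!all(substr(def, 1, 1) %in% c("R", "P", "E")))
    return("Each element of SERIES.DEFINITION must start with R, P or E.")
  if (!all(grepl("^[RPE][0-9]+$", def)))
    return("Each element of SERIES.DEFINITION must have only a number following the letter.")
  counts <- c(R = nR, P = nP, E = nE)
  for (type in names(counts)) {
    # Numbers used for this letter, e.g. E1 and E2 give c(1, 2).
    used <- sort(as.integer(substring(def[substr(def, 1, 1) == type], 2)))
    if (!identical(used, seq_len(counts[[type]])))
      return(paste("Declared", type, "elements do not match SERIES.DEFINITION."))
  }
  "OK"
}

ok        <- checkSeriesDefinition(c("P1", "E1", "E2", "R1"), nR = 1, nP = 1, nE = 2)
badLetter <- checkSeriesDefinition(c("P1", "X1"), nR = 0, nP = 1, nE = 0)
badNumber <- checkSeriesDefinition(c("P1", "E1a"), nR = 0, nP = 1, nE = 1)
unused    <- checkSeriesDefinition(c("P1", "E1"), nR = 0, nP = 1, nE = 2)
```

The first call mirrors the element names used in Example 1. The last call shows the case where an element is declared but never used in the definition.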