Summary and Version Information
|Description||R statistical analysis package|
|Categories||Numerical Analysis, Research|
|3.0.3||R/3.0.3||Non-HPC Glue systems|
|3.1.2||R/3.1.2||Non-HPC Glue systems|
|3.2.2||R/3.2.2||Non-HPC Glue systems|
|3.3.2||R/3.3.2||Non-HPC Glue systems|
|N||Rmpi built with openmpi/1.8.6 and gcc/4.9.3|
|3.5.1||R/3.5.1||Non-HPC Glue systems|
*: A package labelled as "available" on an HPC cluster can be used on the compute nodes of that cluster. Even software not listed as available on an HPC cluster is generally available on the login nodes of the cluster (assuming it is available for the appropriate OS version; e.g. RedHat Linux 6 for the two Deepthought clusters). This is because the compute nodes do not use AFS and so have their own copies of the AFS software tree, onto which we only install packages as requested. Contact us if you need a version listed as not available on one of the clusters.
In general, you need to prepare your Unix environment to be able to use this software. To do this, either:
tap TAPFOO
or
module load MODFOO
where TAPFOO and MODFOO are one of the tags in the tap and module columns above, respectively. The tap command will print a short usage text (use -q to suppress this; this is needed in startup dot files); you can get a similar text with module help MODFOO. For more information, see the documentation on the tap and module commands.
For packages which are libraries which other codes get built against, see the section on compiling codes for more help.
Tap/module commands listed with a version of current will set up for what we considered the most current stable and tested version of the package installed on the system. The exact version is subject to change with little if any notice, and might be platform dependent. Versions labelled new would represent a newer version of the package which is still being tested by users; if stability is not a primary concern you are encouraged to use it. Those with versions listed as old set up for an older version of the package; you should only use this if the newer versions are causing issues. Old versions may be dropped after a while. Again, the exact versions are subject to change with little if any notice.
In general, you can abbreviate the module tags. If no version is given, the default current version is used. For packages with compiler/MPI/etc dependencies, if a compiler module or MPI library was previously loaded, it will try to load the build of the package that matches them. If you specify the compiler/MPI dependency, it will attempt to load the compiler/MPI library for you if needed.
R's capabilities can be significantly enhanced through the addition of
modules. Code can then enable the library with the library function.
The supported R interpreters on the system have a selection of modules
preinstalled. If a module you are interested in is not in that
list, you can either install a personal copy of the module for yourself,
or request that it be installed system wide. We will make reasonable efforts
to accommodate such requests as staffing resources allow.
Installing modules yourself
The method for installing R packages is usually fairly straightforward, but not all packages install in the same manner. Most will follow the procedure below:
- module load R/X.Y.Z to select the version of R you wish to use
- Create the directory to hold your R modules, if you have not already done
so. The default is in the directory R underneath your home directory, but you might wish to put it elsewhere; this will have subdirectories for R version and platform added.
- Unless you opted for the default directory ~/R, you need to tell R what directory you are using. To do this, you must set the environment variable R_LIBS_USER. Multiple directories can be listed; separate the paths with the colon (:) character. This needs to be set whenever you wish to use the modules in R, so you will generally want to set it in your startup dot files.
- There are two standard methods for installing a package, one from the
command line, and one from inside R itself. Assuming you are putting the modules in
~/myRpkgs and installing the package foo, the commands would be:
- From the command line, you will first need to download a tarball
with the source code for the package. Many packages can be found
at the Comprehensive R Archive Network
(CRAN). Assuming you downloaded foo.tar.gz to the current directory, you could then install it with:
R CMD INSTALL -l ~/myRpkgs foo.tar.gz
- From within R, the install.packages function will connect to CRAN and download and install the package all in one step, with:
install.packages("foo", lib="~/myRpkgs", repos="http://cran.r-project.org")
If all goes well, the package is now installed in the directory you specified and should be available for use by your R scripts.
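The directory setup from the steps above can be sketched in shell as follows. This is only an illustration under stated assumptions: ~/myRpkgs is the example directory used above, and the variable name pkg_dir is hypothetical. The snippet creates the directory and adds it to R_LIBS_USER without discarding directories that are already listed, so it should be safe to place in a startup dot file.

```shell
# Hypothetical variable name; ~/myRpkgs is the example directory from the text.
pkg_dir="${HOME}/myRpkgs"

# Create the directory if it does not already exist.
mkdir -p "${pkg_dir}"

# Add the directory to R_LIBS_USER unless it is already listed.
# Multiple directories are separated by colons.
case ":${R_LIBS_USER:-}:" in
    *":${pkg_dir}:"*)
        ;;  # already present, nothing to do
    *)
        R_LIBS_USER="${pkg_dir}${R_LIBS_USER:+:${R_LIBS_USER}}"
        ;;
esac
export R_LIBS_USER

# Show the resulting search path for personal R modules.
echo "R_LIBS_USER=${R_LIBS_USER}"
```

With R_LIBS_USER set this way, a package installed with R CMD INSTALL -l ~/myRpkgs foo.tar.gz should end up in the subdirectory ~/myRpkgs/foo, which is a quick way to check that the install landed where you expected.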
Of course, not all packages install quite that easily. If you are comfortable building modules, the error messages will hopefully provide reasonable guidance as to how to proceed. Otherwise, you can request that Division of Information Technology staff install it, but that might take time depending on staff availability.
Using R and MPI
Users of one of the high-performance computing (HPC) clusters will likely be interested in running R codes that span multiple processors, often over multiple nodes. This is generally done using MPI. There are a number of R packages that deal with MPI, including
- doSNOW: provides a
Most users seem to prefer the snow package, which is presumably higher level and therefore easier to use than Rmpi. There are assorted guides to using R with the snow package on the web, including:
- Glenn Lockwood's page on R and HPC clusters
- University of Chicago's R page
- Bioinfomagician's page on Rmpi
- Simon Fraser University's snow page
Below are just a few tips gleaned from these pages, etc. that users at UMD might find helpful.
- For best results, use the same version of compiler and MPI as used for
building R and its MPI packages. In particular:
- R/3.1.2: this is gcc/4.6.1 and openmpi/1.6.5
- R/3.2.2: this is gcc/4.9.3 and openmpi/1.8.6
- R/3.3.2: this is gcc/4.9.3 and openmpi/1.8.6
You should module load the compiler first (not needed for gcc/4.6.1) and then the OpenMPI library.
- When using snow or one of its derivatives (e.g. doSNOW), you should launch
your code with something like
#!/bin/bash
#Request 5 hours
#SBATCH -t 5:00:00
#Request 4 GiB per CPU-core
#SBATCH --mem-per-cpu=4096
#Request 40 cores
#SBATCH -n 40

#Get our profile (and define module command)
. ~/.profile

#Load required modules
module load gcc/4.9.3
module load openmpi/1.8.6
module load R/3.3.2

cd MY_WORK_DIRECTORY

#Make sure OpenMP is not "on"
OMP_NUM_THREADS=1
export OMP_NUM_THREADS

#NOTE THE -np 1 below!!!!
mpirun -np 1 R CMD BATCH my_R_code.R
NOTE the use of -np 1 in the above. Although that looks suspicious (telling mpirun to only start one MPI task when we asked for 40 cores), it is actually correct for most uses of the snow (and derivative) libraries. This is because snow will spawn its own workers. If you ask mpirun to launch more than 1 MPI task, or omit the -np 1 altogether (which effectively is asking for mpirun to launch the number of tasks given in the #SBATCH -n line, 40 in this case), you will end up running e.g. 40 copies of the same code, each of which will try to spawn about 40 workers via snow, resulting in a mess (at best very sluggish performance, and more likely weird errors).
snow-based R code will at some point invoke the makeCluster function. This takes a parameter specifying the size of the "cluster" to create. Typically, one wants this size to be one less than the number of cores requested from Slurm. This is because the process running the R code which spawns the workers is already consuming one CPU core, so if you try to spawn a number of workers equal to the number of cores requested of Slurm, one core will be oversubscribed, which causes issues. I typically see an error about there being an insufficient number of "slots" available, and typically the R script just hangs (doing nothing, but not dying until the job is killed for exceeding its walltime, and thereby wasting a lot of SUs). Typically, it is better to do something like: