The popular statistics software R is installed and ready to use on the HPC, which eases:
* running jobs that take a long time
* running jobs that require a lot of memory
* and of course, running jobs in parallel
At the moment, two libraries are available on the cluster for parallelization:
- Rmpi, an interface (wrapper) to MPI (Message-Passing Interface)
- doMPI, a foreach parallel adaptor for the Rmpi package
NOTES:
* Rmpi (which is required by both libraries) was compiled using openmpi, so make sure the modules are available
* the R script must be executable (`` chmod 766 r-script-name ``)
* add the path to the Rscript executable as a shebang on the first line of the R script
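The last two notes can be tried out with a short shell sketch. The script name `hello_example.R` and the `#!/usr/bin/env Rscript` shebang are assumptions made for illustration; on the cluster the shebang should point to the actual Rscript executable.

```shell
# Work in a scratch directory so no existing file is touched.
scratch=$(mktemp -d)
script="$scratch/hello_example.R"   # hypothetical script name

# Write a tiny R script whose first line is the shebang.
# The interpreter path is an assumption; adjust it to the Rscript on your system.
cat > "$script" <<'EOF'
#!/usr/bin/env Rscript
cat("hello from R\n")
EOF

# Make the script executable, using the mode from the note above.
chmod 766 "$script"

# Check both requirements before submitting a job.
echo "first line: $(head -n 1 "$script")"
if [ -x "$script" ]; then
  echo "script is executable"
fi
```

Running the two checks before submitting a job catches a missing shebang or missing execute permission early, rather than after the job has waited in the queue.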
Using the Rmpi package requires careful scripting to distribute the work and handle the communication. In addition, a script is needed to make the workers available to R at startup. This part usually ships as a .Rprofile, located either in the working directory or in the HOME directory.
There are multiple ways to make the .Rprofile available:
* copy the .Rprofile into the working directory (e.g. `` cp ~/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile ./.Rprofile ``)
* place the .Rprofile in the HOME directory
* set the environment variable R_PROFILE so that it points to any file holding the required contents; this can be done within the job file, in “.bashrc”, or in a module file
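The environment variable option might look like the following sketch. The file name `Rprofile_rmpi` and the scratch directory are hypothetical, and the file written here is only a stand-in; on the cluster you would copy the Rprofile shipped with Rmpi instead.

```shell
# Hypothetical location for the profile; a scratch directory is used here.
profile_dir=$(mktemp -d)
profile_file="$profile_dir/Rprofile_rmpi"

# Stand-in for the Rprofile shipped with Rmpi (copy the real file on the cluster).
printf '# contents of the Rmpi Rprofile go here\n' > "$profile_file"

# Point R at the profile, e.g. in the job file or in ~/.bashrc.
export R_PROFILE="$profile_file"

# Sanity check before starting the job.
if [ -r "$R_PROFILE" ]; then
  echo "R_PROFILE points to a readable file: $R_PROFILE"
else
  echo "R_PROFILE is not readable: $R_PROFILE" >&2
fi
```

Setting the variable in the job file keeps the choice local to one job, whereas setting it in “.bashrc” affects every R session started from that account.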
Using the doMPI package enables parallelization with foreach loops in R. See the package documentation for details.
For the jobs to start properly, there must not be a .Rprofile active in the current environment (not in the working directory, not in HOME, and not via R_PROFILE).
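A job file could guard against a leftover .Rprofile with a check along these lines. This is a minimal sketch: the function name `active_rprofiles` is hypothetical, and it only looks at the three locations named in this document.

```shell
# Print every .Rprofile source that is currently active.
# The working and home directories are passed as arguments so the
# function can also be tried out on scratch directories.
active_rprofiles() {
  check_workdir=$1
  check_homedir=$2
  [ -e "$check_workdir/.Rprofile" ] && echo "$check_workdir/.Rprofile"
  [ -e "$check_homedir/.Rprofile" ] && echo "$check_homedir/.Rprofile"
  [ -n "${R_PROFILE:-}" ] && echo "R_PROFILE=$R_PROFILE"
  return 0
}

# Possible use in a job file, before the line that starts the R MPI job:
found=$(active_rprofiles "$PWD" "$HOME")
if [ -n "$found" ]; then
  echo "a .Rprofile is active, the doMPI job may not start properly:" >&2
  echo "$found" >&2
else
  echo "no .Rprofile active"
fi
```

Whether to abort the job or only print a warning when something is found is a design choice; the sketch only warns.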
An example job script for doMPI:
#!/bin/bash
# show commands being executed, for debugging
set -x
# Set job parameters
######################################################################
## check what MPI implementation is loaded before setting the parameter
## module list ??
#BSUB -a openmpi
# Set number of CPUs
#BSUB -n 16
# name of the output file
#BSUB -o 16_cpus.out
# Start R MPI job
mpirun.lsf ./dompi_example.R
And the R script that it calls:
<file bash dompi_example.R>
#!/usr/bin/env Rscript
# (shebang assumed here; point it at the Rscript executable on your system)
################################################################################
rm(list=ls())

library(reshape)  # general stuff
library(cluster)  # used for the cluster algorithms
library(fpc)      # used for the cluster validation
library(doMPI)    # parallel backend built on the Rmpi package
library(foreach)  # looping framework used with the backend
################################################################################

# create some random data in a matrix, make them reproducible
set.seed(1)
data = matrix(c(runif(300, min=100, max=1000), rnorm(300, mean=-100, sd=2)), ncol=2)

# create the distance matrix the cluster algorithm requires
dist_mat = dist(data, method="euclidean")

# compute the cluster dendrogram for the data
cluster_fit = hclust(dist_mat, method="ward")

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
# The following part runs distributed over the nodes specified in the
# job script: compute cluster validation indices for each cluster
# class; we only check classes 2 to nrow(data)/2

# we do not give a number here, which means we take as many workers as
# we get, managed by the startup script
cl = startMPIcluster()
registerDoMPI(cl)  # register cl for the foreach framework

cluster_groups = foreach(i=2:(nrow(data)/2), .combine=cbind,
                         .packages=c("cluster", "fpc")) %dopar% {
  cluster.stats(dist_mat, cutree(cluster_fit, k=i), G2=TRUE)
}

closeCluster(cl)
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #

# extract the indices of interest, given as a list of names as returned by
# the function "cluster.stats"; see its help page for more information
indices = c("cluster.number", "dunn", "dunn2", "g2")
cluster_performance = as.data.frame(t(cluster_groups[which(row.names(cluster_groups) %in% indices), ]))
row.names(cluster_performance) = NULL

## save the R workspace as ".RData" in the working dir (optional of course)
save.image()
################################################################################
</file>
doMPI provides a simple framework that allows running parallel applications with almost no extra effort. However, parallelization does not always pay off. If a single iteration is fast, a serial implementation makes more sense because the communication overhead will slow down the entire process.
A more flexible framework is provided by the Rmpi implementation, although it is not as easy to implement. A popular example can be found at this page.