
The popular statistics software R is installed and ready to use on the HPC cluster. It eases:

* running jobs that take a long time
* running jobs that require a lot of memory
* and of course, running jobs in parallel

At the moment, two different libraries are available on the cluster for parallelization:

- Rmpi, an interface (wrapper) to MPI (Message-Passing Interface)

- doMPI, a foreach parallel adaptor for the Rmpi package

NOTES:

* Rmpi (which is required by both libraries) was compiled using openmpi, so make sure the corresponding modules are loaded

* the R-script must be executable (`` chmod u+x r-script-name ``)

* add the path to the R interpreter as a shebang in the top line of the script, e.g. `` #!/usr/bin/env Rscript ``
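The two preparation steps can be sketched in the shell as follows; the file name `r-script-name` is just a placeholder for your own script:

```shell
# create a minimal R script whose first line is the shebang
# (file name and contents are placeholders for the example)
printf '%s\n' '#!/usr/bin/env Rscript' 'cat("hello\n")' > r-script-name

# make it executable for the owner
chmod u+x r-script-name

# verify the shebang and the executable bit
head -n 1 r-script-name
test -x r-script-name && echo "executable"
```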

Rmpi

Using the package Rmpi requires sound scripting in order to distribute the work and handle the communication. In addition, a script is necessary to make the workers available to R at the beginning. This part is usually shipped as an .Rprofile, located either in the working directory or in the HOME directory.

There are multiple ways to ensure that:

* copy the .Rprofile into the working directory (e.g. cp ~/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile ./.Rprofile)

* keep the .Rprofile in your HOME directory

* set the environment variable R_PROFILE so that it points to any file holding the required contents; this might be done within the jobfile, in “.bashrc”, or in a module file
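The last option can be sketched as follows, reusing the Rmpi profile path from the copy example above (adjust the library path to your R version and setup):

```shell
# point R at a profile file outside the working/HOME directory
# (the library path is the one from the example above and may differ on your system)
export R_PROFILE="$HOME/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile"

# R will now source this file at startup
echo "$R_PROFILE"
```

The same `export` line can be placed in the jobfile or in “.bashrc”, as noted above.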

doMPI

Using the package doMPI enables parallelization with foreach loops in R. See the package documentation for details.

In order to start the jobs properly, there must not be a .Rprofile available within the current environment.

An example for doMPI and a corresponding job script.

start_par_R.sh
	#!/bin/bash
	# show commands being executed, for debugging
	set -x
 
	# Set job parameters
	######################################################################
	## check what MPI implementation is loaded before setting the parameter
	## module list ??
	#BSUB -a openmpi
 
	# Set number of CPUs
	#BSUB -n 16
 
	# name of the output file
	#BSUB -o 16_cpus.out
 
 
	# Start R MPI job
	mpirun.lsf ./dompi_example.R

And the R-script that is called.

<file rsplus dompi_example.R>

#!/usr/bin/env Rscript
################################################################################################
rm(list=ls())		# clear the workspace
################################################################################################
library(reshape)	# general stuff
library(cluster)	# used for the cluster algorithms
library(fpc)		# used for the cluster validation
library(doMPI)		# framework for the Rmpi package
library(foreach)	# parallel backend 
################################################################################################
# create some random data in a matrix, make them reproducible 
set.seed(1)
data=matrix(c(runif(300, min=100, max=1000), rnorm(300, mean=-100, sd=2)), ncol=2)
# create distance matrix the cluster algorithm requires
dist_mat=dist(data, method="euclidean")
#compute the cluster Dendrogramm for the data
cluster_fit=hclust(dist_mat, method="ward")
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
# 	This part runs distributed over the nodes specified in the
# 	jobscript: compute cluster validation indices for each number
#	of cluster classes; we only check classes 2 to nrow(data)/2
cl = startMPIcluster()	# no worker count given, so we take as many workers as we
			# get - managed by the startup script
registerDoMPI(cl) 	# register cl for the foreach framework
cluster_groups=foreach(i=c(2:(nrow(data)/2)), .combine=cbind, 
			  .packages=c("cluster","fpc"))%dopar%
{
	cluster.stats(dist_mat, cutree(cluster_fit, k=i), G2=T)
}
closeCluster(cl)
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
# extract the indices of interest, i.e. a list of names as returned by the
# function "cluster.stats", see its help page for more information
indices=c("cluster.number", "dunn", "dunn2", "g2")
cluster_performance=as.data.frame(t(cluster_groups[row.names(cluster_groups) %in% indices,]))
row.names(cluster_performance)=NULL
## save the R workspace as ".Rdata" in the working dir (optional of course)
save.image()
###############################################################################################
################################################################################################

</file>

Attention

doMPI provides a simple framework that allows running parallel applications with almost no extra effort. However, this is not always beneficial. If a single iteration is very fast, a serial implementation makes more sense, because the communication overhead will slow down the entire process.

A more flexible framework is provided by the Rmpi implementation, which is not as easy to implement. A popular example can be found at this page.

hpc-cluster/r.1400446346.txt.gz · Last modified: 2014/05/18 20:52 by sluedtke