hpc-cluster:r

The popular statistics software R is installed on the HPC cluster. Running R there makes it easier to:

* run jobs that take a long time

* run jobs that require a lot of memory

* and, of course, run jobs in parallel

At the moment, two libraries are available on the cluster for parallelization:

- Rmpi, an interface (wrapper) to MPI (Message Passing Interface)

- doMPI, a foreach parallel adaptor for the Rmpi package

NOTES:

* Rmpi (which is required by both libraries) was compiled using openmpi, so make sure the openmpi modules are available

* the R script must be executable (`` chmod 766 r-script-name ``)

* add the path to the Rscript executable as a shebang on the first line of the R script, as in the example dompi_example.R
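The last two notes can be checked before a job is submitted. The following is a minimal sketch, assuming a POSIX-like shell; the helper name `check_r_script` is hypothetical and not part of the cluster setup.

```shell
#!/bin/bash
# check_r_script FILE  (hypothetical helper, for illustration only)
# Returns 0 if FILE is executable and its first line is a shebang
# that names Rscript; prints a short reason and returns 1 otherwise.
check_r_script() {
    script="$1"
    if [ ! -x "$script" ]; then
        echo "not executable: $script" >&2
        return 1
    fi
    # Only the first line of the script matters for the shebang.
    first_line="$(head -n 1 "$script")"
    case "$first_line" in
        '#!'*Rscript*) return 0 ;;
        *) echo "no Rscript shebang in first line: $script" >&2
           return 1 ;;
    esac
}
```

It could be called as `check_r_script ./dompi_example.R && echo "ready to submit"` from the directory holding the script.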

Rmpi

Using the package Rmpi requires careful scripting to distribute the work and handle the communication. In addition, a startup script is needed to make the workers available to R at the beginning of a session. This part is usually handled by the .Rprofile that ships with Rmpi, which R reads from either the working directory or the HOME directory.

There are several ways to make this profile available:

* copy the .Rprofile into the working directory (e.g. cp ~/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile ./.Rprofile)

* place the .Rprofile in the HOME directory

* set the environment variable R_PROFILE so that it points to a file holding the required contents; this can be done in the job file, in “.bashrc”, or in a module file
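The third option can be sketched for a job file as follows. This is only an illustration: the function name `use_rmpi_profile` is hypothetical, and the profile path is the one quoted in the first option, which depends on the installed R version.

```shell
#!/bin/bash
# Path of the Rprofile shipped with Rmpi, as quoted in the list above.
# Adjust it to the R version installed in your library directory.
rmpi_profile="$HOME/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile"

# use_rmpi_profile FILE  (hypothetical helper, for illustration only)
# Exports R_PROFILE pointing at FILE, but only if FILE is readable,
# so that a wrong path is noticed before the job starts.
use_rmpi_profile() {
    profile="$1"
    if [ ! -r "$profile" ]; then
        echo "Rmpi profile not readable: $profile" >&2
        return 1
    fi
    R_PROFILE="$profile"
    export R_PROFILE
}
```

In a job file, `use_rmpi_profile "$rmpi_profile" || exit 1` would go before the line that starts R. Exporting the variable there keeps the setting local to that job, whereas “.bashrc” applies it to every session.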

doMPI

Using the package doMPI enables parallelization with foreach loops in R. See the package documentation for details.

For the jobs to start properly, there must not be a .Rprofile available within the current environment.
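Whether a profile is still active can be checked in the job script before R is started. The sketch below looks only at the three locations mentioned on this page (R_PROFILE, the working directory, HOME); the function name `list_rprofiles` is hypothetical.

```shell
#!/bin/bash
# list_rprofiles  (hypothetical helper, for illustration only)
# Prints one line for every profile setting found in the locations
# mentioned on this page; prints nothing if the environment is clean.
list_rprofiles() {
    if [ -n "${R_PROFILE:-}" ]; then
        echo "R_PROFILE=$R_PROFILE"
    fi
    for f in "./.Rprofile" "$HOME/.Rprofile"; do
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
}
```

A job script could then stop early with `[ -z "$(list_rprofiles)" ] || { list_rprofiles >&2; exit 1; }` instead of starting a doMPI job that will not come up properly.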

Below is an example for doMPI, starting with the corresponding job script.

start_par_R.sh
	#!/bin/bash
	# show the commands being executed, for debugging
	set -x
 
	# Set job parameters
	######################################################################
	## check what MPI implementation is loaded before setting the parameter
	## module list ??
	#BSUB -a openmpi
 
	# Set number of CPUs
	#BSUB -n 16
 
	# name of the output file
	#BSUB -o 16_cpus.out
 
 
	# Start R MPI job
	mpirun.lsf ./dompi_example.R

And the R-script that is called.

dompi_example.R
#!/usr/local/bin/Rscript
 
# that is the path to Rscript on the cluster, might be different on your local machine
 
################################################################################################
 
#       Filename: dompi_example.R
 
#       Author: Stefan Luedtke
 
#       Created: Saturday 18 August 2012 14:16:36 CEST
 
#       Last modified: Tuesday 04 March 2014 15:52:00 CET
 
################################################################################################
#       PURPOSE         ########################################################################
# 
# This script illustrates the use of the "doMPI" package on the HP-Cluster of GFZ.
#
# Of course, a basic knowledge of R is required. 
#
# First, we run a hierarchical clustering algorithm on some random data. Second,
# a group of validation indices is computed for each level of the resulting tree.
# Although the first part is quite fast, the computation of the validation indices
# might take a long time, so it is done in parallel.
 
########################
 
# The "doMPI" package, together with the foreach package, is an easy-to-apply framework for
# parallel processing, but it must be handled with care. We will come back to this later.
 
########################
 
# Note: Since the word "cluster" can be used in two different contexts, we define here:
# when we talk about the machine we use "hpcluster"; when we talk about the algorithms,
# validation and so on, we stick to the word "cluster".
 
################################################################################################
################################################################################################
 
	################################################################################################
	################################################################################################
 
	################################################################################################
 
	rm(list=ls())
 
	################################################################################################
	library(reshape)	# general stuff
 
	library(cluster)	# used for the cluster algorithms
	library(fpc)		# used for the cluster validation
 
	library(doMPI)		# parallel backend built on the Rmpi package
	library(foreach)	# framework for the foreach loops
 
	################################################################################################
 
 
	# create some random data in a matrix, make them reproducible 
 
	set.seed(1)
	data=matrix(c(runif(300, min=100, max=1000), rnorm(300, mean=-100, sd=2)), ncol=2)
 
 
	# create distance matrix the cluster algorithm requires
	dist_mat=dist(data, method="euclidean")
 
 
	# compute the cluster dendrogram for the data
	# (from R 3.1.0 on, the method "ward" is named "ward.D")
	cluster_fit=hclust(dist_mat, method="ward")
 
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
 
	# 	This part runs distributed over the nodes specified in the
	# 	job script: compute cluster validation indices for each number
	#	of classes; we only check 2 to nrow(data)/2 classes
 
	cl = startMPIcluster()	# we do not give a number here, which means we take as many workers
				# as we get, managed by the startup script
 
	registerDoMPI(cl) 	# register cl for the foreach framework
 
 
	cluster_groups=foreach(i=c(2:(nrow(data)/2)), .combine=cbind, 
				  .packages=c("cluster","fpc"))%dopar%
	{
		cluster.stats(dist_mat, cutree(cluster_fit, k=i), G2=TRUE)
	}
 
	closeCluster(cl)
 
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
	#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
 
	# extract the indices of interest, given as a list of names as returned by the function
	# "cluster.stats", see the help for more information
 
	indices=c("cluster.number", "dunn", "dunn2", "g2")
 
	cluster_performance=as.data.frame(t(cluster_groups[row.names(cluster_groups) 
				%in% indices,]))
 
	row.names(cluster_performance)=NULL
 
 
	## save the R workspace as ".RData" in the working dir (optional of course)
	save.image()
 
	## finalize MPI and quit R at the very end of the script
	mpi.quit()
	###############################################################################################
	################################################################################################

Attention

doMPI provides a simple framework that allows running parallel applications with almost no extra effort. However, parallelization does not always pay off: if a single iteration is already fast, a serial implementation makes more sense, because the communication overhead slows down the entire process.

A more flexible framework is provided by Rmpi itself, which is not as easy to implement. A popular example can be found at this page.

hpc-cluster/r.1400446769.txt.gz · Last modified: 2014/05/18 20:59 by sluedtke