DataWiki

This is an old revision of the document!

Rmpi

Using the Rmpi package requires careful scripting to distribute the work and handle the communication. In addition, a startup script is necessary to make the workers available to R at the beginning. This part is usually shipped as an .Rprofile, located either in the working directory or in the HOME directory.

There are multiple ways to make sure R finds this profile:

* copy the .Rprofile into the working directory (e.g. cp ~/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile ./.Rprofile)

* place the .Rprofile in the HOME directory

* set the environment variable R_PROFILE so that it points to any file holding the required contents; this can be done within the job file, “.bashrc”, or a module file
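
The third option can be sketched as a small shell function. This is only an illustration: `use_rmpi_profile` is a hypothetical helper (not part of Rmpi), and the path in the usage comment is the one from the list above, which will differ between installations.

```shell
#!/bin/bash
# use_rmpi_profile: hypothetical helper, not part of Rmpi.
# Exports R_PROFILE if the given profile file is readable.
use_rmpi_profile() {
    local profile="$1"
    if [ -r "$profile" ]; then
        export R_PROFILE="$profile"
        echo "R_PROFILE set to $profile"
    else
        echo "profile not readable: $profile" >&2
        return 1
    fi
}

# Possible call in a job file or .bashrc (adjust the path to your installation):
# use_rmpi_profile "$HOME/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile"
```

Compared with copying the file, this keeps a single profile in one place, at the cost of having to set the variable in every environment that starts R.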

doMPI

The doMPI package enables parallelization with foreach loops in R. See the package documentation for details.

To start the jobs properly, there must not be an .Rprofile available within the current environment.
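
A minimal sketch of how this could be checked before starting a job, assuming the profile can only come from the three places listed in the Rmpi section (working directory, HOME, or R_PROFILE). `check_no_rprofile` is a hypothetical helper, not part of doMPI.

```shell
#!/bin/bash
# check_no_rprofile: hypothetical helper, not part of doMPI.
# Returns 0 if no profile would be picked up, 1 otherwise.
# Arguments (optional): working directory, home directory.
check_no_rprofile() {
    local workdir="${1:-$PWD}" home="${2:-$HOME}" found=0 f
    for f in "$workdir/.Rprofile" "$home/.Rprofile"; do
        if [ -e "$f" ]; then
            echo "found profile: $f" >&2
            found=1
        fi
    done
    if [ -n "${R_PROFILE:-}" ]; then
        echo "R_PROFILE is set: $R_PROFILE" >&2
        found=1
    fi
    return "$found"
}

# Possible call at the top of a job script:
# check_no_rprofile || exit 1
```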

Below is an example job script for doMPI, followed by the R script it calls.

start_par_R.sh

	#!/bin/bash
	# show commands being executed, for debugging
	set -x
 
	# Set job parameters
	######################################################################
	## check which MPI implementation is loaded before setting the parameter
	## (e.g. with: module list)
	#BSUB -a openmpi
 
	# Set number of CPUs
	#BSUB -n 16
 
	# name of the output file
	#BSUB -o 16_cpus.out
 
 
	# Start R MPI job
	mpirun.lsf ./dompi_example.R
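
The job script executes ./dompi_example.R directly, which only works if that file is executable and begins with an interpreter line. The following is a minimal pre-submission check under that assumption; `check_r_script` is a hypothetical helper, not part of LSF or doMPI, and the interpreter line mentioned in the message (`#!/usr/bin/env Rscript`) is only one possible choice.

```shell
#!/bin/bash
# check_r_script: hypothetical helper, not part of LSF or doMPI.
# Succeeds only if the given script is executable and starts with "#!".
check_r_script() {
    local script="$1" first_line=""
    if [ ! -x "$script" ]; then
        echo "not executable: $script (try: chmod +x $script)" >&2
        return 1
    fi
    # read only the first line; "|| true" tolerates files without a newline
    IFS= read -r first_line < "$script" || true
    case "$first_line" in
        '#!'*) return 0 ;;
        *)
            echo "no interpreter line in $script (e.g. #!/usr/bin/env Rscript)" >&2
            return 1
            ;;
    esac
}

# Possible use before submitting; bsub reads the #BSUB lines from stdin:
# check_r_script ./dompi_example.R && bsub < start_par_R.sh
```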

The R script that is called (dompi_example.R):

################################################################################################
################################################################################################

################################################################################################

rm(list=ls())

################################################################################################
library(reshape)	# general stuff

library(cluster)	# used for the cluster algorithms
library(fpc)		# used for the cluster validation

library(doMPI)		# MPI parallel backend for foreach, built on Rmpi
library(foreach)	# provides the foreach / %dopar% looping construct

################################################################################################

# create some random data in a matrix, make them reproducible

set.seed(1)
data=matrix(c(runif(300, min=100, max=1000), rnorm(300, mean=-100, sd=2)), ncol=2)

# create distance matrix the cluster algorithm requires
dist_mat=dist(data, method="euclidean")

# compute the cluster dendrogram for the data
# note: from R 3.1.0 on, method "ward" is named "ward.D"
cluster_fit=hclust(dist_mat, method="ward")

#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#

# 	This part runs distributed over the nodes requested in the
# 	job script. It computes cluster validation indices for each
#	number of clusters; we only check 2 to nrow(data)/2 clusters.

cl = startMPIcluster()	# no worker count given, so we take as many workers as we
			# get; this is managed by the job script

registerDoMPI(cl) 	# register cl for the foreach framework

cluster_groups=foreach(i=2:(nrow(data)/2), .combine=cbind, 
			  .packages=c("cluster","fpc"))%dopar%
{
	cluster.stats(dist_mat, cutree(cluster_fit, k=i), G2=TRUE)
}

closeCluster(cl)

#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#
#	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	#

# extract the indices of interest; these are names from the list returned by the
# function "cluster.stats", see its help page for more information

indices=c("cluster.number", "dunn", "dunn2", "g2")

cluster_performance=as.data.frame(t(cluster_groups[row.names(cluster_groups) 
			%in% indices,]))

row.names(cluster_performance)=NULL

## save the R workspace as ".RData" in the working dir (optional of course)
save.image()

## finalize MPI and quit R, as the doMPI documentation recommends after closeCluster()
mpi.quit()
###############################################################################################
################################################################################################


Attention

doMPI provides a simple framework that allows running parallel applications with almost no extra effort. However, this does not always make sense. If a single iteration is fast, a serial implementation is the better choice, because the communication overhead will slow down the entire process.
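
As a rough illustration of this trade-off, consider a simplified cost model. It is an assumption made here, not something taken from the doMPI documentation: n iterations, each taking compute time t, are spread over p workers, and every iteration additionally costs a communication overhead c that is handled one after another by the master.

```latex
T_{\text{serial}} = n\,t,
\qquad
T_{\text{parallel}} \approx \frac{n\,t}{p} + n\,c

T_{\text{parallel}} < T_{\text{serial}}
\iff
c < t\left(1 - \frac{1}{p}\right)
```

Under this model, with p = 4 workers the overhead per iteration must stay below 0.75 t for the parallel version to win. If t is very small, even a modest c exceeds that bound.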

A more flexible framework is provided by the Rmpi implementation, which is not as easy to implement. A popular example can be found at this page.
