The popular statistics software [[http://www.r-project.org|R]] is available on
the HPC cluster and eases:

  * running jobs that take a long time
  * running jobs that require a lot of memory
  * and of course, running jobs in parallel

At the moment, two different libraries are available on the cluster for parallelization:

  - [[http://cran.r-project.org/web/packages/Rmpi/|Rmpi]] - an interface (wrapper) to
    MPI (Message-Passing Interface)
  - [[http://cran.r-project.org/web/packages/doMPI/|doMPI]] - a parallel backend for the
    [[http://cran.r-project.org/web/packages/foreach/|foreach]] package, acting as a wrapper
    for the Rmpi package

NOTES:

  * **Rmpi** (which is required by both libraries) was compiled using **openmpi**, so make
    sure the modules are available before starting a job
  * the R-script must be executable (`` chmod 766 r-script-name ``)
  * add the path to the ''Rscript'' executable at the top line as a //shebang//
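The last two notes can be sketched as follows; ''r-script-name'' is a placeholder for your own script, and ''/usr/bin/env Rscript'' is one common shebang choice (an assumption, not a cluster-specific path):

```shell
# Create a minimal R script with a shebang line (placeholder name).
cat > r-script-name <<'EOF'
#!/usr/bin/env Rscript
cat("hello from R\n")
EOF

# Make it executable for the owner, as required to run it via mpirun.lsf.
chmod 766 r-script-name

# Verify: the owner can now execute the script.
test -x r-script-name && echo "r-script-name is executable"
```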

===== Rmpi =====
Using the package **Rmpi** requires sound scripting in order to distribute the work
and handle the communication. In addition, a script is necessary to make the workers
available to **R** at the beginning. This part is usually shipped as an **.Rprofile**,
located either in the working or in the HOME directory.

There are multiple ways to ensure that:

  * copy the **.Rprofile** shipped with **Rmpi** into the working directory or into
    ~/

  * keep the **.Rprofile** located in HOME

  * set the environment variable ''R_PROFILE_USER'' to the full path of the profile
    file
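A minimal sketch of the last option. The profile location below is an assumption; adjust it to wherever the Rmpi-supplied profile actually lives on your system (''R_PROFILE_USER'' is the standard R environment variable for a user startup profile):

```shell
# Hypothetical location for the Rmpi startup profile (assumption).
mkdir -p "$HOME/rmpi-profile"
printf '# Rmpi startup profile placeholder\n' > "$HOME/rmpi-profile/.Rprofile"

# R sources the file named by R_PROFILE_USER on startup instead of ~/.Rprofile.
export R_PROFILE_USER="$HOME/rmpi-profile/.Rprofile"
echo "R_PROFILE_USER=$R_PROFILE_USER"
```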

===== dompi =====
Using the package **doMPI** enables parallelization with //foreach// loops in **R**. See the
package documentation for details.

**In order to start the jobs properly, there must not be a .Rprofile available within the
current environment.**
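A quick pre-flight check before submitting a **doMPI** job - a sketch that simply moves a stray ''.Rprofile'' out of the way (the backup name is an arbitrary choice):

```shell
# If an .Rprofile from an earlier Rmpi run is present, move it aside so
# the doMPI job does not pick it up on startup.
if [ -f .Rprofile ]; then
    mv .Rprofile .Rprofile.disabled
fi

[ ! -f .Rprofile ] && echo "ok: no .Rprofile in the working directory"
```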

An example for **doMPI** and a corresponding job script:

<file bash start_par_R.sh>

#!/bin/bash
# show commands being executed, for debugging
set -x

# Set job parameters
######################################################################
## check which MPI implementation is loaded before setting the parameter
## (module list)
#BSUB -a openmpi

# Set number of CPUs
#BSUB -n 16

# name of the output file
#BSUB -o 16_cpus.out


# Start R MPI job
mpirun.lsf ./dompi_example.R

</file>

And the R-script that is called:

<file R dompi_example.R>

################################################################################################
################################################################################################

################################################################################################

rm(list=ls())

################################################################################################
library(reshape)        # data reshaping
library(cluster)        # cluster algorithms
library(fpc)            # cluster validation statistics
library(doMPI)          # MPI backend for foreach
library(foreach)        # foreach looping construct

################################################################################################


# create some random data in a matrix, make them reproducible
set.seed(1)
data=matrix(runif(300), ncol=3)


# create the distance matrix the cluster algorithm requires
dist_mat=dist(data, method="euclidean")


# compute the cluster dendrogram for the data
cluster_fit=hclust(dist_mat, method="ward.D")

################################################################################################

# That bit runs distributed over multiple nodes as specified in the
# job script: compute cluster validation indices for each cluster
# class, we only check classes 2 to nrow(data)/10 here

cl = startMPIcluster()  # the number of workers is managed by the startup script

registerDoMPI(cl)       # register cl for the foreach framework


cluster_groups=foreach(i=c(2:(nrow(data)/10)), .combine=cbind,
                       .packages=c("fpc")) %dopar%
{
    cluster.stats(dist_mat, cutree(cluster_fit, k=i))
}

closeCluster(cl)

################################################################################################

# extract indices of interest, that is a list of names as returned by the function
# "cluster.stats" (the two below are examples)

indices=c("avg.silwidth", "dunn")

cluster_performance=as.data.frame(t(cluster_groups[which(row.names(cluster_groups)
                                   %in% indices==TRUE), ]))

row.names(cluster_performance)=NULL


## save the R workspace as ".RData" in the working directory
save.image()
###############################################################################################
################################################################################################

</file>
- | |||
- | ===== Attention ===== | ||
- | |||
- | **doMPI** provides a simple framework that allows running parallel applications almost no | ||
- | extra effort. However, this does not make sense at all time. Is the single iteration | ||
- | pretty fast, a serial implementation makes more sense because the communication overhead | ||
- | will slow down the entire process. | ||
- | |||
- | A more flexible framework is provided by the **Rmpi** implementation, | ||
- | the implement. A popular example can be found at at this | ||
- | [[http:// | ||
- | |||