The popular statistics software [[http://www.r-project.org/|R]] is installed and ready to use on the HPC cluster, which eases:

  * running jobs that take a long time
  * running jobs that require a lot of memory
  * and, of course, running jobs in parallel

At the moment, two different libraries are available on the cluster for parallelisation:

  * [[http://cran.r-project.org/web/packages/Rmpi/index.html|Rmpi]], an interface (wrapper) to MPI (Message-Passing Interface)
  * [[http://cran.r-project.org/web/packages/doMPI/index.html|doMPI]], a [[http://cran.r-project.org/web/packages/foreach/index.html|foreach]] parallel adaptor for the Rmpi package

NOTES:

  * **Rmpi** (which is required by both libraries) was compiled with **openmpi** and **gcc**, so make sure the corresponding modules are loaded (''module list'', ''module avail'')
  * the R-script must be executable (''chmod u+x r-script-name'')
  * add the path to the ''Rscript'' interpreter as a //shebang// in the first line of the R-script (see the example below), or add the directory containing ''Rscript'' to the ''PATH'' in your ''~/.bashrc''
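
A quick check of these points might look like the following sketch (the module names are only examples; use the names reported by ''module avail'' on the cluster):

<code bash>
# check which modules are loaded and which are available
module list
module avail

# load the MPI and compiler modules if they are missing
# (example module names -- replace them with the ones listed by "module avail")
module load openmpi
module load gcc

# make the R-script executable and check its shebang line
chmod u+x dompi_example.R
head -n 1 dompi_example.R    # should print the path to Rscript, e.g. "#!/usr/local/bin/Rscript"
</code>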

===== Rmpi =====
Using the package **Rmpi** requires careful scripting in order to distribute the work and handle the communication. In addition, a start-up script is needed that makes the workers available to **R** at the beginning. This part usually ships as an **.Rprofile**, located either in the working directory or in the HOME directory.

There are multiple ways to ensure that (see the sketch below):

  * copy the **.Rprofile** into the working directory (e.g. ''cp ~/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile ./.Rprofile'')
  * keep the **.Rprofile** in your HOME directory
  * set the environment variable **R_PROFILE** so that it points to any file with the required contents; this can be done in the job file, in ''.bashrc'' or in a module file
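
For the last option, a minimal sketch (using the Rmpi ''Rprofile'' from the copy example above; adjust the path to your R version and library location):

<code bash>
# point R to the Rprofile shipped with Rmpi, e.g. in the job file or in ~/.bashrc
export R_PROFILE=$HOME/R/x86_64-unknown-linux-gnu-library/2.15/Rmpi/Rprofile
</code>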

===== doMPI =====
Using the package **doMPI** enables parallelisation with //foreach// loops in **R**. See the package documentation for details.

**In order to start the jobs properly, there must not be an .Rprofile available within the current environment.**
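
One way to check for a stray ''.Rprofile'' before submitting (a minimal sketch, assuming a standard shell):

<code bash>
# list any .Rprofile that R could pick up from the working or the HOME directory
ls -la ./.Rprofile ~/.Rprofile 2>/dev/null

# also make sure R_PROFILE does not point to such a file
echo $R_PROFILE

# if a file shows up, move it out of the way before submitting the doMPI job, e.g.
# mv ./.Rprofile ./.Rprofile.bak
</code>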

An example for **doMPI** and a corresponding job script:

<file bash start_par_R.sh>
#!/bin/bash

# Set job parameters
######################################################################
## check which MPI implementation is loaded (module list) before setting this parameter
#BSUB -a openmpi

# Set number of CPUs
#BSUB -n 16

# name of the output file
#BSUB -o 16_cpus.out

# show commands being executed, for debugging
set -x

# Start the R MPI job
mpirun.lsf ./dompi_example.R
</file>
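
The job script above uses ''#BSUB'' directives and ''mpirun.lsf'', i.e. the LSF batch system; in that case it is submitted by redirecting it into ''bsub'', for example ''bsub < start_par_R.sh''.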

And the R-script that is called:

<file rsplus dompi_example.R>
#!/usr/local/bin/Rscript

# This is the path to Rscript on the cluster; it might be different on your local machine.

################################################################################################

#       Filename: dompi_example.R

#       Author: Stefan Luedtke

#       Created: Saturday 18 August 2012 14:16:36 CEST

#       Last modified: Tuesday 04 March 2014 15:52:00 CET

################################################################################################
#       PURPOSE         ########################################################################

# This script illustrates the use of the "doMPI" package on the HP-Cluster of GFZ.
#
# Of course, a basic knowledge of R is required.
#
# First, a hierarchical clustering is performed on some random data. Second, cluster
# validation indices are computed for each level of the resulting tree. Although the first
# part runs quite fast, the computation of the validation indices might take a long time;
# this part is therefore done in parallel.

########################

# The "doMPI" package, together with the "foreach" package, is an easy-to-apply framework
# for parallel processing, but it must be handled with care; we will come back to this
# point later.

########################

# Note: Since the word "cluster" is used in two different contexts here, we refer to the
# machine as the "hpcluster" and use "cluster" only when talking about the algorithms,
# validation and so on.

################################################################################################
################################################################################################

rm(list=ls())

################################################################################################

library(reshape)   # general data handling
library(cluster)   # cluster algorithms
library(fpc)       # cluster validation (cluster.stats)
library(doMPI)     # foreach parallel backend built on Rmpi
library(foreach)   # the foreach looping construct

################################################################################################

# create some random data in a matrix and make it reproducible
set.seed(1)
data=matrix(c(runif(300, min=100, max=1000), rnorm(300, mean=-100, sd=2)), ncol=2)

# create the distance matrix the cluster algorithm requires
dist_mat=dist(data, method="euclidean")

# compute the cluster dendrogram for the data
cluster_fit=hclust(dist_mat, method="ward")

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #

# This part runs distributed over the nodes specified in the job script: cluster validation
# indices are computed for each number of cluster classes; only classes 2 to nrow(data)/2
# are checked.

# start the MPI cluster; no worker count is given, so we take as many workers as the
# startup script provides
cl = startMPIcluster()

# register cl as the parallel backend for the foreach framework
registerDoMPI(cl)

cluster_groups=foreach(i=c(2:(nrow(data)/2)), .combine=cbind,
                       .packages=c("cluster", "fpc")) %dopar%
{
    cluster.stats(dist_mat, cutree(cluster_fit, k=i), G2=TRUE)
}

closeCluster(cl)

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #

# extract the indices of interest, i.e. a subset of the names returned by the function
# "cluster.stats"; see its help page for more information
indices=c("cluster.number", "dunn", "dunn2", "g2")

cluster_performance=as.data.frame(t(cluster_groups[row.names(cluster_groups) %in% indices, ]))

row.names(cluster_performance)=NULL

## save the R workspace as ".RData" in the working directory (optional, of course)
save.image()

# shut down MPI cleanly; recommended at the end of doMPI scripts
mpi.quit()

################################################################################################
################################################################################################

</file>

===== Attention =====

**doMPI** provides a simple framework that allows running parallel applications with almost no
extra effort. However, this does not always pay off: if a single iteration is fast, a serial
implementation often makes more sense, because the communication overhead will slow down the
entire process.

A more flexible framework is provided by the **Rmpi** implementation, which is not as easy to
implement. A popular example can be found on this
[[http://math.acadiau.ca/ACMMaC/Rmpi/|page]].
