Texas Tech University

Running R Jobs on HPCC Resources

Introduction

R is a language and software environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. R is especially well suited to data analysis and graphical representation. Functions and results of analysis are all stored as objects, allowing easy function modification and model building.

 

Table of Contents

  1. Setting up the environment
  2. Running multithreaded R jobs
    1. Enable Intel MKL threads
    2. Re-write your code to make use of threading
    3. Use parallel packages
  3. Submitting R jobs to the Quanah Cluster
  4. Submitting R jobs to the Hrothgar Cluster

Setting up the environment

Modules are used to set up a user's environment. Typically, users initialize their environment when they log in by setting environment information for every application they will reference during the session. The HPCC uses the Environment Modules package to simplify shell initialization and to let users easily modify their environment during the session with module files.

For more information about how to load and maintain your software environment using modules, please refer to the user guide "Software Environment Setup".

To set the environment variables for R, first check which versions of R are installed using the "module spider R" command. This command returns a description of the software and the versions currently installed. To load the most recent version of R, run either the "module load intel R" or the "module load gnu R" command, then run the "module list" command to verify that the R module has been loaded successfully. This can be done using one of the following sets of commands:

#To load the Intel compiled version of R run the following:
module load intel R
module list

#To load the GNU compiled version of R run the following:
module load gnu R
module list
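
If you want to confirm which R build the module provided, you can start R interactively and inspect the version string; on recent R releases, sessionInfo() also reports the BLAS/LAPACK libraries in use, which helps verify that the MKL-linked (Intel) build is active:

#Run the following inside an interactive R session.
R.version.string
#On recent R releases, sessionInfo() lists the BLAS/LAPACK libraries in use.
sessionInfo()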


Running multithreaded R jobs

The HPCC strongly recommends that you have a good grasp of what your code is doing and what resources it needs to function properly before running it on the cluster. By default, R runs serially, using only a single core. If you wish to run across multiple cores, you will need to do one of the following:

Enable Intel MKL threads

R has been compiled on HPCC resources against the Intel Math Kernel Library (MKL), which provides threaded BLAS and LAPACK routines. This allows R to automatically parallelize the portions of your code that rely on matrix computations.

Enabling MKL threads simply requires exporting the environment variable "MKL_NUM_THREADS", set to the maximum number of threads you want MKL to use. We strongly suggest setting this value in your job submission script as follows:

export MKL_NUM_THREADS=$NSLOTS

 

Examples of this variable being exported and used can be found in the Quanah and Hrothgar tutorials below.
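
MKL threading requires no changes to your R code: once MKL_NUM_THREADS is set, dense linear-algebra operations are threaded automatically. As a rough illustration (the matrix size below is only an example and is not part of the HPCC example scripts), a large matrix product is the kind of operation that benefits:

#Dense matrix products call into MKL's threaded BLAS, so they run in
#parallel when MKL_NUM_THREADS is greater than 1.
n <- 4000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
system.time(A %*% B)   #elapsed time drops as the MKL thread count increases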

 

Re-write your code to make use of threading

Writing your code to run in parallel will grant you the greatest speedup of any of these options; however, it can also be the most time consuming. R contains a number of packages for parallelizing your code explicitly, one of the best known being 'parallel'. A full discussion of how to re-write your code for the cluster is beyond the scope of this document. If you wish to take this path, we strongly suggest you look at the following links; a minimal sketch using the 'parallel' package follows the list.

  1. Parallel Documentation
  2. R-Bloggers discussion on Parallel R
  3. How-to go parallel in R basics + tips
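
As a starting point, here is a minimal sketch (not one of the HPCC example scripts) that uses the 'parallel' package and reads the worker count from the NSLOTS variable the scheduler sets for your job:

library(parallel)

#Use the number of slots granted by the scheduler; fall back to 1 if unset.
n_workers <- as.integer(Sys.getenv("NSLOTS", unset = "1"))

#Create a cluster of worker processes and apply a function in parallel.
cl <- makeCluster(n_workers)
results <- parLapply(cl, 1:100, function(i) {
  sum(rnorm(1e5))   #placeholder for the real per-task work
})
stopCluster(cl)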

 

Use parallel packages

When writing your R code, pay close attention to the packages you use. There often exist numerous packages that attempt to solve the same problem, and in many cases some are serial while others are parallel. Whenever possible, prefer parallel packages over serial ones. It is also worth checking the documentation for the packages you use to see whether they can run in parallel. Often, simply setting a certain variable or passing an additional parameter into a function will change the behavior from serial to parallel, significantly increasing the speed of your application (see the sketch after the list below).

Example Parallel Packages:

  • caret - R package containing functions that attempt to streamline the process for creating predictive models.
  • xgboost - R package for "extreme gradient boosting".
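
Many packages expose their parallelism through a single argument. The sketch below (assuming the xgboost package is installed and using its classic xgboost() interface) ties the package's nthread parameter to the slots requested for the job; the data are synthetic and purely illustrative:

library(xgboost)

#Tie the number of training threads to the slots granted by the scheduler.
n_threads <- as.integer(Sys.getenv("NSLOTS", unset = "1"))

#Small synthetic data set: 500 observations, 10 features, binary labels.
x <- matrix(rnorm(500 * 10), ncol = 10)
y <- rbinom(500, 1, 0.5)

#Passing nthread switches training from one thread to n_threads threads.
model <- xgboost(data = x, label = y, nrounds = 10,
                 objective = "binary:logistic", nthread = n_threads,
                 verbose = 0)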


Submitting R jobs to the Quanah Cluster

For this tutorial we will rely on an example script that performs an R benchmarking test. To pull down a copy of these scripts, run the following commands:

Step 1. From Quanah, copy the R examples directory.

cp -r /lustre/work/examples/quanah/R ~/R_tutorial

 

Step 2. Enter the newly created directory.

cd ~/R_tutorial/

 

Step 3. View the Quanah submission script for R.

cat R_quanah.sh

------------OUTPUT-------------
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N RJOB
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q omni
#$ -P quanah
#$ -pe sm 1
#Load the latest version of the R language - compiled using the Intel compilers.
module load intel R
#Allow R to perform some automatic parallelization.
# MKL_NUM_THREADS - The maximum number of threads you want R to spawn on your behalf.
# $NSLOTS - This will be replaced by the number of slots you request in your parallel environment.
# Example: -pe sm 36 -> $NSLOTS=36.
export MKL_NUM_THREADS=$NSLOTS
#Run the example R script using the Rscript application.
Rscript example.R
-------------------------------

Step 4. Submit the R example script to the cluster.

qsub R_quanah.sh

 

Your R script is now queued to run on the Quanah cluster.  You can use the "qstat" command to check the status of the job.  Please read the Job Submission Guide for more information about running jobs and checking their status.
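
The submission script runs example.R with Rscript. The benchmark script shipped in the examples directory may differ; the sketch below only illustrates the kind of linear-algebra benchmark such a script might contain, which the MKL-linked R build threads according to MKL_NUM_THREADS:

#example.R - illustrative sketch only; the file in the examples directory may differ.
n <- 3000
A <- crossprod(matrix(rnorm(n * n), n, n))   #symmetric positive-definite matrix
timing <- system.time({
  ch <- chol(A)          #Cholesky factorization (threaded LAPACK)
  inv <- chol2inv(ch)    #inverse recovered from the factorization
})
cat("Elapsed seconds:", timing["elapsed"], "\n")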


Submitting R jobs to the Hrothgar Cluster

For this tutorial we will rely on an example script that performs an R benchmarking test. To pull down a copy of these scripts, run the following commands:

Step 1. From Hrothgar, copy the R examples directory.

cp -r /lustre/work/examples/hrothgar/R ~/R_tutorial

 

Step 2. Enter the newly created directory.

cd ~/R_tutorial/

 

Step 3. View the Hrothgar submission script for R.

cat R_hrothgar.sh

------------OUTPUT-------------
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N RJOB
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q west
#$ -P hrothgar
#$ -pe west 12
#If you would prefer to run this on the Ivy nodes instead, update the following options above:
# -q ivy
# -pe ivy 20
#Load the latest version of the R language - compiled using the Intel compilers.
module load intel R
#Allow R to perform some automatic parallelization.
# MKL_NUM_THREADS - The maximum number of threads you want R to spawn on your behalf.
# $NSLOTS - This will be replaced by the number of slots you request in your parallel environment.
# Example: -pe west 12 -> $NSLOTS=12.
export MKL_NUM_THREADS=$NSLOTS
#Run the example R script using the Rscript application.
Rscript example.R
-------------------------------

 

Step 4. Submit the R example script to the cluster.

qsub R_hrothgar.sh

 

Your R script is now queued to run on the Hrothgar cluster. You can use the "qstat" command to check the status of the job. Please read the Job Submission Guide for more information about running jobs and checking their status.

High Performance Computing Center