Availability

The snow library provides an overlay on the parallel library, adding functionality that the base parallel library does not itself provide. The feature we rely on here is the ability to create a cluster of type MPI.

Accessing snow

To use snow, first load the Rmpi module. From within your R program, you then load three libraries to obtain the full functionality.

library(Rmpi)
library(parallel)
library(snow)

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

The message printed lists the functions from the parallel library that are masked by snow's versions; when both packages are attached, calling one of these names runs the snow implementation.
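If you want to confirm which package a given function will come from, base R's find() function (in the utils package) lists every attached package that defines that name, in the order R will search them. A quick check, assuming snow was attached after parallel as above:

```r
library(parallel)
library(snow)

# find() lists every attached package defining the name,
# in search-path order; snow appears first because it was
# attached after parallel, so its version masks parallel's
find("makeCluster")
```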

It is worth noting that running in parallel does not automatically make your R job faster; it can, in fact, make it slower. At the bottom of the snow Simplified page is a simple timing test that demonstrates this. Often there is a “sweet spot” in the number of processors and nodes that gives optimum performance. You should experiment with your own problem to determine what produces the best speed-up.
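The trade-off can be seen without MPI at all. The sketch below uses a small local socket cluster from the parallel library (the worker count of 2 and the task sizes are arbitrary choices for illustration): for a very cheap per-element task, the cost of shipping data to and from the workers tends to outweigh the computation itself, while a task that does real work per call can benefit from the extra processors.

```r
library(parallel)

cheap.task <- function(i) i^2                       # almost no work per call
slow.task  <- function(i) { Sys.sleep(0.01); i^2 }  # simulated real work

cl <- makeCluster(2)   # small local socket cluster, purely for illustration

# Cheap work: communication overhead dominates, so parallel is often slower
system.time(lapply(1:2000, cheap.task))
system.time(parLapply(cl, 1:2000, cheap.task))

# More substantial work per call: the parallel version pulls ahead
system.time(lapply(1:100, slow.task))
system.time(parLapply(cl, 1:100, slow.task))

stopCluster(cl)
```

Either way, the results are identical; only the elapsed time differs.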

A very good tutorial illustrating several methods of parallelizing R code is MJA Eugster, J Knaus, et al., “Hands-on tutorial for parallel computing with R”, Computational Statistics, Jun 2011, Vol 26, pp 219–239.

Running snow interactively

To use snow interactively, you should submit an interactive batch job, instructions for which can be found on the page on Interactive PBS jobs.

The following commands start a cluster sized to the job: one worker per MPI rank, less one, because one rank is occupied by the master R process. They then create a 1,000 by 1,000 matrix of random values, A, and multiply it by itself, reporting the time taken on a single processor and the time taken using the cluster. Finally, each worker is asked to report its nodename. Here are the commands as they might be run in an interactive session.

library(Rmpi)
library(parallel)
library(snow)
cl <- makeMPIcluster(mpi.universe.size()-1)
A <- matrix(rnorm(1000000), 1000)
system.time(A %*% A)
system.time(parMM(cl, A, A))
clusterCall(cl, function() Sys.info()['nodename'])
stopCluster(cl)
mpi.quit()

Running snow in batch

We can run the same code as above in batch mode, but a more realistic example may be useful. Here is an example that does cross-validation, taken from Eugster, et al. [2]; put these R commands into a file called xvalidate.R to run the example.

library(Rmpi)
library(parallel)
library(snow)
n <- 100
set.seed(123)

# generate some data
x <- rnorm(n)
y <- x + rnorm(n)
rand.data <- data.frame(x, y)

# create samples
K <- 10
# Split a sample of 100 into 10 groups of 10
samples <- split(sample(1:n), rep(1:K, length = n))

#  Cross-validation function
cv.fold.fun <- function(index) {
  fit <- lm(y~x, data = rand.data[-samples[[index]],])
  pred <- predict(fit, newdata = rand.data[samples[[index]],])
  return((pred - rand.data$y[samples[[index]]])^2)
}

#####  Sequential version
res.fun  <- lapply(seq(along = samples), cv.fold.fun)
mean(unlist(res.fun))

#####  Parallel version
# create the cluster object
cl <- makeMPIcluster(mpi.universe.size()-1)
# export the data to the workers
clusterExport(cl, list("rand.data", "samples"))
# run the function on all the workers
snowres.fun <- parLapply(cl, seq(along = samples), cv.fold.fun)
# get the mean of the collected result
mean(unlist(snowres.fun))

# always stop the cluster when done
stopCluster(cl)

#####  Compare results to sequential version
all.equal(res.fun, snowres.fun)
mpi.quit()
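It may help to see what the samples object created by split() above actually contains: a list of K groups of row indices, each holding n/K randomly chosen rows, and together covering every row exactly once. This can be checked in a plain R session:

```r
n <- 100
K <- 10
set.seed(123)

# Same construction as in xvalidate.R: shuffle 1:100, then deal the
# shuffled indices into 10 groups round-robin
samples <- split(sample(1:n), rep(1:K, length = n))

length(samples)                    # 10 groups
unique(lengths(samples))           # every group holds 10 indices
all(sort(unlist(samples)) == 1:n)  # together they cover 1:100 exactly once
```

Each fold of the cross-validation then fits the model on 90 rows and predicts the held-out 10.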

To run this in batch mode from an interactive session, using xvalidate.R as input, you would run

$ mpirun -np 1 Rmpi CMD BATCH --quiet --no-restore --no-save xvalidate.R

The output will be in the default output file, xvalidate.Rout.

Running snow from PBS

Once you have an R script – the xvalidate.R script from above – you can create a PBS script to run it and submit that to the cluster. Here is an example of what that might look like.

####  PBS preamble
#PBS -N Rsnow_test
#PBS -M uniqname@umich.edu
#PBS -m abe

#PBS -l procs=4,pmem=1gb,walltime=24:00:00
#PBS -j oe
#PBS -V

#PBS -A example_flux
#PBS -l qos=flux
#PBS -q flux

####  End PBS preamble

#  Put your job commands after this line
if [ -e "${PBS_NODEFILE}" ] ; then
    uniq -c $PBS_NODEFILE
fi

#  Change to the PBS working directory
if [ -d "$PBS_O_WORKDIR" ]; then
    cd "$PBS_O_WORKDIR"
fi
echo "Working from $(pwd)"

mpirun -np 1 Rmpi CMD BATCH --quiet --no-restore --no-save xvalidate.R

If you save that script with, for example, the name xvalidate.pbs, you can then submit it with

$ qsub xvalidate.pbs

References

[1] Robert Yuen, http://www.stat.lsa.umich.edu/~bobyuen/gpuwiki/

[2] Adapted from MJA Eugster, J Knaus, et al., “Hands-on tutorial for parallel computing with R”, Computational Statistics, Jun 2011, Vol 26, pp 219–239.

Some other useful web sites with information on R and snow are:

http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html

http://rweb.stat.umn.edu/R/library/snow/html/00Index.html

http://www.sfu.ca/~sblay/R/snow.html