## The R parallel package

The `parallel` package supersedes both the `multicore` and `snow` packages, which have been merged and given a new front end in `parallel`. This page documents the `multicore`-equivalent functionality.
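Because the two merged packages contribute different front ends, the same computation can be expressed either way. Here is a minimal sketch contrasting them (the worker counts are arbitrary; `mclapply` relies on forking, so it runs serially on Windows):

```
library(parallel)

## multicore-style front end: forked workers sharing memory (Unix-alikes only)
mclapply(1:4, sqrt, mc.cores = 2)

## snow-style front end: a socket cluster, which also works on Windows
cl <- makeCluster(2)
parLapply(cl, 1:4, sqrt)
stopCluster(cl)
```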

## Multiprocessor parallelism

The `parallel` package enables you to use multiple processors on the same physical machine, the “shared memory” model of parallel computing. What follows are two examples: one that computes distance matrices, modified from Yuen [1], and one that performs cross-validation, modified from Eugster et al. [2].
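Before parallelizing, it can help to check how many cores a machine exposes. A quick sketch (note that on a shared cluster node `detectCores()` reports the whole machine, not your allocation, so prefer the core count your job actually requested):

```
library(parallel)

## Number of cores the operating system reports for this machine
detectCores()
```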

### Distance matrix

The first example creates a 4 by 4 matrix of random numbers, defines a function that calculates the distances between its elements and returns a distance object, and then applies that function to the matrix four times, first on a single processor and then on four processors. We’ll assume these commands are saved in a file called `distance.R` for batch processing.

```
library(parallel)

set.seed(5446)

## Compute the distance matrix of X; the index i is ignored, it only
## serves to run the same computation several times
lfun <- function(i) {
  d <- dist(X)
}

p <- 4
X <- matrix(rnorm(2^p), ncol = 2^(p/2))

## Sequential version on a single processor
res <- lapply(1:4, lfun)

## Parallel version on four processors; mclapply() reads the worker
## count from the mc.cores option
options(mc.cores = 4)
mcres <- mclapply(1:4, lfun)

## The results are identical
all.equal(res, mcres)
```

Note that we are asking for four cores for the worker processes that do the calculations. When you write your PBS script, request one *additional* core so that the master process still has a core of its own.
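Rather than hardcoding the worker count, you can derive it from the job’s environment. A sketch, assuming the TORQUE/PBS variable `PBS_NUM_PPN` is set on your system (variable names vary between PBS versions, so check your site’s documentation):

```
## Derive the worker count from the cores PBS allocated to the job;
## leave one core for the master process
ppn <- as.integer(Sys.getenv("PBS_NUM_PPN", unset = "2"))
options(mc.cores = max(1, ppn - 1))
```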

To run those commands using PBS, you will need to modify the PBS script from above so that the `#PBS -l` line and the line that runs R look like this.

```
#PBS -l nodes=1:ppn=5,mem=1gb
[ . . . . ]
R CMD BATCH --vanilla distance.R
```

For more information about the `#PBS` line, please see the PBS web page.
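For context, a complete script might look like the following sketch; the job name, walltime, and output handling here are illustrative choices, not site defaults:

```
#!/bin/bash
#PBS -N distance
#PBS -l nodes=1:ppn=5,mem=1gb
#PBS -l walltime=00:10:00
#PBS -j oe

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
R CMD BATCH --vanilla distance.R
```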

### Cross validation example

The second example creates a random data set, splits it into 10 equally sized samples, fits a model with each sample held out in turn, calculates the squared differences between the predicted and the observed values for the held-out sample, and collects the results in a list. Finally, the mean squared difference is calculated.

```
library(parallel)

## Create a random data set
n <- 100
set.seed(123)
x <- rnorm(n)
y <- x + rnorm(n)
rand.data <- data.frame(x, y)

## Split the indices into K equally sized samples
K <- 10
samples <- split(sample(1:n), rep(1:K, length = n))

## Fit the model with one sample held out, then return the squared
## differences between predicted and observed values for that sample
cv.fold.fun <- function(index) {
  fit <- lm(y ~ x, data = rand.data[-samples[[index]], ])
  pred <- predict(fit, newdata = rand.data[samples[[index]], ])
  return((pred - rand.data$y[samples[[index]]])^2)
}

## Sequential version
res.fun <- lapply(seq_along(samples), cv.fold.fun)
mean(unlist(res.fun))

## Parallel version
mcres.fun <- mclapply(seq_along(samples), cv.fold.fun)
mean(unlist(mcres.fun))

## All the elements are identical
all.equal(res.fun, mcres.fun)
```
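To get a rough sense of the speedup, you can time the two versions with `system.time()`. A sketch (timings vary by machine and core count; on a problem this small, forking overhead may well dominate, so expect a benefit only with heavier folds):

```
## Elapsed-time comparison of the sequential and parallel versions
system.time(lapply(seq_along(samples), cv.fold.fun))
system.time(mclapply(seq_along(samples), cv.fold.fun, mc.cores = 4))
```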

There is a very good tutorial illustrating several methods of parallelizing R code in MJA Eugster, J Knaus, et al., “Hands-on tutorial for parallel computing with R”, *Computational Statistics*, Jun 2011, Vol 26, pp 219–239 [2]. It is accessible online via the UM library website if you are a UM affiliate; log into the library web site before following the publisher’s link, as access may fail if you do not pre-authenticate.

## References

[1] Robert Yuen, http://www.stat.lsa.umich.edu/~bobyuen/gpuwiki/

[2] MJA Eugster, J Knaus, et al., “Hands-on tutorial for parallel computing with R”, *Computational Statistics*, Jun 2011, Vol 26, pp 219–239.