Using multicore

The multicore package lets R use multiple processors on the same physical machine, the "shared memory" model of parallel computing. What follows are two examples: one that computes distance matrices, adapted from Yuen [1], and one that performs cross-validation, adapted from Eugster et al. [2].
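At its simplest, mclapply() is a drop-in replacement for lapply(). The short sketch below is not part of either example; it just confirms that the two functions produce the same result.

library(multicore)

# mclapply() has the same interface as lapply(); mc.cores sets the
# number of worker processes
identical(lapply(1:4, sqrt), mclapply(1:4, sqrt, mc.cores = 2))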

Distance matrix

The first example creates a 4 by 4 matrix of random numbers and defines a function that computes the distances between the matrix's rows, returning a distance object. That function is then applied four times: first on a single processor with lapply(), then on four processors with mclapply(). We'll assume these commands are saved in a file called distance.R for batch processing.

library(multicore)
set.seed(5446)

p <- 4
X <- matrix(rnorm(2^p), ncol = 2^(p/2))  # a 4 by 4 matrix of random numbers

# Compute the distance matrix of X; the argument i is ignored and only
# lets lapply()/mclapply() call the function repeatedly
lfun <- function(i) {
  dist(X)
}

# Sequential version on a single processor
res <- lapply(1:4, lfun)

# Parallel version on four processors
options(cores = 4)
mcres <- mclapply(1:4, lfun)

# The results should agree
all.equal(res, mcres)

Note that we are asking for four cores for the worker processes that do the calculations. When you write your PBS script, request one additional core so that the master process still has a core of its own.
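If you prefer not to set a global option, the number of workers can also be passed directly to each call through mclapply()'s mc.cores argument; this one-liner is equivalent to the options() line above.

mcres <- mclapply(1:4, lfun, mc.cores = 4)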

To run these commands under PBS, modify the PBS script from above so that the #PBS -l line and the line that runs R look like this:

#PBS -l nodes=1:ppn=5,mem=1gb
  [ . . . . ]
R CMD BATCH --vanilla distance.R

For more information about the #PBS line, please see the PBS web page.
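For reference, a complete submission script might look like the sketch below. The job name, walltime, and module line are placeholders assuming typical site conventions; adjust them for your cluster.

#!/bin/bash
#PBS -N distance
#PBS -l nodes=1:ppn=5,mem=1gb
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR      # run from the directory the job was submitted from
module load R          # site-specific; your cluster may differ

R CMD BATCH --vanilla distance.R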

Cross-validation example

The second example creates a random data set and splits it into 10 equally sized samples (folds). For each fold, a model is fit to the remaining data, the fitted model predicts the held-out observations, and the squared differences between the predicted and observed values are collected in a list. Finally, the mean squared difference across all folds is calculated.

library(multicore)
n <- 100
set.seed(123)
x <- rnorm(n)
y <- x + rnorm(n)
rand.data <- data.frame(x, y)

K <- 10
# Randomly assign the n observations to K equal-sized folds
samples <- split(sample(1:n), rep(1:K, length = n))

# Fit on all folds except 'index', then return the squared prediction
# errors for the held-out fold
cv.fold.fun <- function(index) {
   fit <- lm(y ~ x, data = rand.data[-samples[[index]], ])
   pred <- predict(fit, newdata = rand.data[samples[[index]], ])
   return((pred - rand.data$y[samples[[index]]])^2)
}

#  Sequential version
res.fun   <- lapply(seq(along = samples), cv.fold.fun)
mean(unlist(res.fun))

#  Parallel version
mcres.fun <- mclapply(seq(along = samples), cv.fold.fun)
mean(unlist(mcres.fun))

#  The sequential and parallel results should agree
all.equal(res.fun, mcres.fun)
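Because all of the random numbers here are drawn before mclapply() is called, the sequential and parallel results agree exactly. If the function you pass to mclapply() generates random numbers itself, each worker needs its own seed; multicore controls this through mclapply()'s mc.set.seed argument. A minimal sketch, separate from the example above:

# With mc.set.seed = TRUE (the default) each child process is seeded
# independently, so the two draws below will differ
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = TRUE))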

References

[1] Robert Yuen, http://www.stat.lsa.umich.edu/~bobyuen/gpuwiki/
[2] M. J. A. Eugster, J. Knaus, et al., "Hands-on tutorial for parallel computing with R", Computational Statistics 26 (2011), pp. 219–239.