Matlab parallel computing and problems with shared filesystems

In a cluster environment, such as Flux, your home directory and other data directories live on a shared filesystem that is accessed over the network. Updates to files on such a filesystem can take time to become visible to other machines; this delay is called latency. Matlab uses files to store information about parallel jobs, and there are a couple of error situations that latency can trigger.

Depending on the nature of the parallel session, there are configuration options that can ameliorate the situation; these are detailed below. There are two main classes of Matlab parallel jobs on the cluster: single-machine jobs and multi-machine jobs. You change the preferences with a combination of environment variables set before running Matlab (which Matlab inherits) and commands that modify Matlab settings within your Matlab scripts.
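As a quick illustration of the two mechanisms, the following sketch sets the preference directory through the environment and shows the equivalent in-script Matlab commands. MATLAB_PREFDIR is a real Matlab environment variable; the directory path is only an example.

```shell
#  Mechanism 1: an environment variable set before Matlab starts,
#  which Matlab inherits (the path here is only an example)
export MATLAB_PREFDIR="$HOME/matlabdata/prefs-example"
mkdir -p "$MATLAB_PREFDIR"

#  Mechanism 2: the equivalent change made from inside a Matlab
#  script with settings commands, for example:
#    c = parcluster('local');
#    c.JobStorageLocation = getenv('MATLAB_PREFDIR');
```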

Changing the preferences directory

The first thing you can change is the preferences directory. By default, Matlab will use a directory called .matlab under your home directory to store preferences for the Matlab session. In the cluster environment, however, instances of Matlab might be running in many jobs at the same time. This could lead to circumstances where information from one might overwrite information from another.

To change this behavior, you can create a unique folder for the Matlab preferences for each job and then provide that location to Matlab at startup. This is done in your PBS script with the following commands.

#  Create parent matlab data directory
mkdir -p $HOME/matlabdata

#  If the directory doesn't exist, exit with an error code
#  (shell exit codes must be in the range 0-255)
test -d $HOME/matlabdata || exit 99

#  Check whether we are in a PBS job and set preference directory
#  to either something random or the JobID
if [ "$PBS_JOBID" == "" ] ; then
    export MATLAB_PREFDIR=$(mktemp -d $HOME/matlabdata/matlab-prefs-XXXXXXXX)
else
    mkdir $HOME/matlabdata/$PBS_JOBID
    export MATLAB_PREFDIR=$HOME/matlabdata/$PBS_JOBID
fi

#  Finish by setting the MATLAB_CLUSTER_WORKDIR to the MATLAB_PREFDIR
export MATLAB_CLUSTER_WORKDIR=$MATLAB_PREFDIR
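Each job leaves a directory behind under $HOME/matlabdata. One way to clean up automatically, sketched here as an optional addition rather than part of the standard recipe, is a shell trap that removes the preference directory when the job script exits:

```shell
#  Sketch: remove the preference directory automatically when the
#  job script exits, even if it exits with an error
mkdir -p "$HOME/matlabdata"
MATLAB_PREFDIR=$(mktemp -d "$HOME/matlabdata/matlab-prefs-XXXXXXXX")
export MATLAB_PREFDIR
trap 'rm -rf "$MATLAB_PREFDIR"' EXIT
```

Because the trap fires on any exit, the directory is removed whether Matlab succeeds or fails.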

Changing the job storage location within Matlab

Using network storage

The safest way is to create the shared location in your PBS script before starting Matlab, run Matlab using that location, and then remove it when Matlab has finished. This can be done with

#  Create the job storage directory; -p avoids an error if it
#  already exists
mkdir -p ${HOME}/matlabdata/${PBS_JOBID}

#  Give the shared filesystem a moment to propagate the new directory
sleep 5

matlab -nodisplay -r my_script

#  Clean up the job storage directory
rm -rf ${HOME}/matlabdata/${PBS_JOBID}
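Because of the latency described earlier, the newly created directory may not be visible immediately. If you see startup failures, a short retry loop between the mkdir and the matlab line can help. This is a sketch; the no-jobid fallback is only there so the snippet can be tried outside a PBS job.

```shell
#  Sketch: wait briefly for the job storage directory to become
#  visible on the shared filesystem before starting Matlab
JSL="${HOME}/matlabdata/${PBS_JOBID:-no-jobid}"
mkdir -p "$JSL"
for i in 1 2 3 4 5 ; do
    test -d "$JSL" && break
    sleep 2
done
test -d "$JSL" || exit 1
```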

Then, in your Matlab script, create the pool like this:

% This assumes a multinode job
% Set the value for the job storage location
JSL = fullfile(getenv('HOME'), 'matlabdata', getenv('PBS_JOBID'))

% If not inside a PBS job, use 4 processors; assumes you want all
% the processors assigned to the PBS job
if isempty(getenv('PBS_NP'))
    NP = 4;
else
    NP = str2double(getenv('PBS_NP'));
end

% Create the cluster object, set the job storage location, start pool
myCluster = parcluster('current')
myCluster.JobStorageLocation = JSL

myPool = parpool(myCluster, NP);

% . . . . Put your Matlab code here . . . .

delete(myPool);
exit

Using local disk

If you are using the local profile and staying on one physical node, then you may see a performance increase by using local disk for the shared files. In that case, use

#####  Request free space on /tmp for your job data; using 10gb
#####  as an example

#PBS -l ddisk=10gb

mkdir -p /tmp/${USER}/${PBS_JOBID}

matlab -nodisplay -r my_script

rm -rf /tmp/${USER}/${PBS_JOBID}
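The ddisk request asks the scheduler for space, but it can still be worth checking /tmp before creating the directory. A defensive sketch using GNU df; the 10 GB threshold mirrors the example request above, and you might exit with an error here instead of just warning.

```shell
#  Sketch: warn if /tmp has less than 10 GB free before using it
need_kb=$((10 * 1024 * 1024))
avail_kb=$(df --output=avail -k /tmp | tail -n 1 | tr -d ' ')
if [ "$avail_kb" -lt "$need_kb" ] ; then
    echo "Warning: less than 10 GB free on /tmp" >&2
fi
```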

and use this for your Matlab code:

% This assumes a local job
% Set the value for the job storage location
JSL = fullfile('/tmp', getenv('USER'), getenv('PBS_JOBID'))

% If not inside a PBS job, use 4 processors; assumes you want all
% the processors assigned to the PBS job
if isempty(getenv('PBS_NP'))
    NP = 4;
else
    NP = str2double(getenv('PBS_NP'));
end

% Create the cluster object, set the job storage location, start pool
myCluster = parcluster('local')
myCluster.JobStorageLocation = JSL

myPool = parpool(myCluster, NP);

% . . . . Put your Matlab code here . . . .

delete(myPool);
exit