NFS latency and MATLAB directories

In a cluster environment, your home directory and other data directories are on a shared filesystem that is accessed over the network. This can lead to circumstances in which updates to network files do not propagate quickly enough to all of the nodes that use them; this is known as a latency problem. MATLAB uses files to store information about its configuration and about parallel jobs, and there are a couple of error conditions that can be triggered by this latency.

Depending on the nature of the parallel session, some configuration options can be used to ameliorate the situation; these are detailed below. There are two main classes of MATLAB parallel jobs on the cluster: single-machine jobs and multi-machine jobs. The preferences are changed either with environment variables set before running MATLAB (which MATLAB inherits) or with commands that modify MATLAB settings from within your MATLAB scripts.

In what follows, we assume that your job will use one CPU per task, i.e., you will have

#SBATCH --cpus-per-task=1

in your job script; this is also the default on ARC-TS clusters. If you change that, you will need to adjust the code below that sets the value of NP from the environment.
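If you do raise --cpus-per-task, one way to keep the worker count consistent is to compute it in the job script and export it for MATLAB to read with getenv. A minimal sketch, assuming Slurm has set SLURM_NTASKS and SLURM_CPUS_PER_TASK in the job environment (NP is just an illustrative name):

```shell
#  Compute the worker count as tasks x cpus-per-task,
#  defaulting to 4 processors outside of a Slurm job
NTASKS="${SLURM_NTASKS:-4}"
CPUS="${SLURM_CPUS_PER_TASK:-1}"
NP=$(( NTASKS * CPUS ))
export NP
```

Your MATLAB script could then use str2double(getenv('NP')) instead of reading SLURM_NTASKS directly.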

Changing the preferences directory

The first thing you can change is the preferences directory. By default, MATLAB will use a directory called .matlab under your home directory to store preferences for the MATLAB session.

In the cluster environment, however, instances of MATLAB might be running in many jobs at the same time. This could lead to circumstances where information from one might overwrite information from another.

To change this behavior, you can create a unique folder for the MATLAB preferences for each job, then provide that location to MATLAB at startup. This is done in your Slurm job script with the following commands.

#  Create parent matlab data directory
mkdir -p $HOME/matlabdata

#  If the directory doesn't exist, exit with large error code
test -d $HOME/matlabdata || exit 999

#  Check whether we are in a Slurm job and set preference directory
#  to either something random or the JobID

if [ -z "$SLURM_JOBID" ] ; then
    export MATLAB_PREFDIR=$(mktemp -d "$HOME/matlabdata/matlab-prefs-XXXXXXXX")
else
    mkdir -p "$HOME/matlabdata/$SLURM_JOBID"
    export MATLAB_PREFDIR="$HOME/matlabdata/$SLURM_JOBID"
fi

#  Finish by setting the MATLAB_CLUSTER_WORKDIR to the MATLAB_PREFDIR
export MATLAB_CLUSTER_WORKDIR=$MATLAB_PREFDIR
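These per-job directories accumulate under $HOME/matlabdata over time. One way to avoid that is to remove the directory at the very end of the job script, after MATLAB has exited. A minimal sketch, assuming MATLAB_PREFDIR was set as above:

```shell
#  Remove the per-job preferences directory once MATLAB is done.
#  Guard the rm so it never runs with an empty or missing variable.
if [ -n "$MATLAB_PREFDIR" ] && [ -d "$MATLAB_PREFDIR" ] ; then
    rm -rf "$MATLAB_PREFDIR"
fi
```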

Changing the job storage location within MATLAB

Using network storage

The safest way to do this is to create the shared location in your Slurm job script before starting MATLAB, run MATLAB using it, and then remove it once MATLAB has finished. This can be done with

#  Create the job storage directory, then pause briefly so the new
#  directory has time to propagate to the other nodes in the job
mkdir -p "${HOME}/matlabdata/${SLURM_JOBID}"
sleep 5

matlab -nodisplay -r my_script

#  Clean up the job storage directory when MATLAB is done
rm -rf "${HOME}/matlabdata/${SLURM_JOBID}"

Then, in your MATLAB script, create the pool like this

% This assumes a multinode job
% Set the value for the job storage location
JSL = fullfile(getenv('HOME'), 'matlabdata', getenv('SLURM_JOBID'))

% If not inside a Slurm job, use 4 processors; assumes you want all
% the processors assigned to the Slurm job
if isempty(getenv('SLURM_NTASKS'))
    NP = 4;
else
    NP = str2double(getenv('SLURM_NTASKS'));
end

% Initialize ARCTS cluster 'current' profile
setupUmichClusters

% Create the cluster object, set the job storage location, start pool
myCluster = parcluster('current')
myCluster.JobStorageLocation = JSL

myPool = parpool(myCluster, NP);

[ . . . .  Put your MATLAB code here . . . . ]

delete(myPool);
exit

Using local disk

If you are using the local profile and staying on one physical node, then you may see a performance increase by using local disk for the shared files. In that case, use

#####  Request free space on /tmp for your job data; using 10gb
#####  as an example (the --tmp directive belongs with your other
#####  #SBATCH lines at the top of the job script)
#SBATCH --tmp=10g

mkdir -p /tmp/${USER}/${SLURM_JOBID}

matlab -nodisplay -r my_script

rm -rf /tmp/${USER}/${SLURM_JOBID}

and this for your MATLAB code

% This assumes a local job
% Set the value for the job storage location
JSL = fullfile('/tmp', getenv('USER'), getenv('SLURM_JOBID'))

% If not inside a Slurm job, use 4 processors; assumes you want all
% the processors assigned to the Slurm job
if isempty(getenv('SLURM_NTASKS'))
    NP = 4;
else
    NP = str2double(getenv('SLURM_NTASKS'));
end

% Create the cluster object, set the job storage location, start pool
myCluster = parcluster('local')
myCluster.JobStorageLocation = JSL

myPool = parpool(myCluster, NP);

[ . . . .  Put your MATLAB code here . . . . ]

delete(myPool);
exit
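Local disk is not shared between nodes, and its capacity varies, so it can be worth confirming in the job script that /tmp actually has room before staging data there. A minimal sketch using GNU df, with the 10 GB figure from the example above as an illustrative threshold:

```shell
#  Warn if /tmp has less than 10 GB free (threshold is illustrative)
NEEDED_KB=$(( 10 * 1024 * 1024 ))   # 10 GB expressed in 1K blocks
AVAIL_KB=$(df -k --output=avail /tmp | tail -n 1 | tr -d ' ')
if [ "$AVAIL_KB" -lt "$NEEDED_KB" ] ; then
    echo "Warning: less than 10 GB free on /tmp" >&2
fi
```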