Slurm User Guide
Slurm is a combined batch scheduler and resource manager that allows users to run their jobs on the University of Michigan’s high performance computing (HPC) clusters. This document describes the process for submitting and running jobs under the Slurm Workload Manager.
The Batch Scheduler and Resource Manager
The batch scheduler and resource manager work together to run jobs on an HPC cluster. The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time. When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job’s allocated resources. This is also known as “running the job”.
The user can specify conditions for scheduling the job. One condition is the completion (successful or unsuccessful) of an earlier submitted job. Other conditions include the availability of a specific license or access to a specific hardware accelerator.
An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory and GPUs. The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).
Users interact with an HPC cluster through login nodes. Login nodes are a place where users can log in, edit files, view job results and submit new jobs. Login nodes are a shared resource and should not be used to run application workloads.
Jobs and Job Steps
A job is an allocation of resources assigned to an individual user for a specified amount of time. Job steps are sets of (possibly parallel) tasks within a job. When a job runs, the scheduler selects and allocates resources to the job. The invocation of the application happens within the batch script, or at the command line for interactive jobs.
When an application is launched using srun, it runs within a “job step”. The srun command causes the simultaneous launching of multiple tasks of a single application. Arguments to srun specify the number of tasks to launch as well as the number of nodes (and CPUs and memory) on which to launch the tasks.
Multiple srun commands can be invoked sequentially or in parallel (by backgrounding them). Furthermore, the number of nodes specified by srun (the -N option) can be less than or equal to, but no more than, the number of nodes (and CPUs and memory) that were allocated to the job.
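As a rough sketch of the backgrounding approach described above, the function below launches each command as its own single-node, single-task job step and then waits for all of them. The function name run_parallel_steps and the ./step_a and ./step_b programs in the usage note are hypothetical. Depending on the Slurm version and site configuration, extra srun options may be needed before steps will actually run concurrently, so treat this as an illustration and check the srun man page.

```shell
#!/bin/bash
# Sketch: run several job steps at the same time inside an existing allocation.
# Intended to be called from a batch script or an interactive job.
run_parallel_steps() {
    local cmd
    for cmd in "$@"; do
        # Each srun creates one job step; "&" backgrounds it so the next
        # step can start without waiting for this one to finish.
        srun -N 1 -n 1 "$cmd" &
    done
    # Block until every backgrounded job step has finished.
    wait
}
```

Inside a batch script this might be called as run_parallel_steps ./step_a ./step_b. Dropping the "&" (and the wait) would run the same steps one after another instead.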
srun can also be invoked directly at the command line (outside of a job allocation). Doing so will submit a job to the batch scheduler and srun will block until that job is scheduled to run. When the srun job runs, a single job step will be created. The job will complete when that job step terminates.
The sbatch command is used to submit a batch script to Slurm. It is designed to reject the job at submission time if there are requests or constraints that Slurm cannot fulfill as specified. This gives the user the opportunity to examine the job request and resubmit it with the necessary corrections. To submit a batch script, simply run sbatch <scriptName>:
$ sbatch myJob.sh
Anatomy of a Batch Job
The batch job script is composed of four main components:
- The interpreter used to execute the script
- #SBATCH directives that convey submission options.
- The setting of environment and/or script variables (if necessary)
- The application(s) to execute along with its input arguments and options
An Example Slurm job
#!/bin/bash
# The interpreter used to execute the script

# "#SBATCH" directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=BEGIN,END
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mem-per-cpu=1000m
#SBATCH --gres=gpu:1
#SBATCH -L matlab@slurmdb:5
#SBATCH --time=10:00
#SBATCH -A example_account
#SBATCH -p standard
#SBATCH -e /home/uniqname/pbs_scripts/outputfiles/outputfile
#SBATCH -o /home/uniqname/pbs_scripts/outputfiles/outputfile

# The setting of environment and/or script variables (if necessary):
export EDITOR=/bin/vi

# The application(s) to execute along with its input arguments and options:
/bin/hostname
sleep 60
Common Job Submission Options
|Wall time limit||--time=<hh:mm:ss>|
|Process count per node||--ntasks-per-node=<count>|
|Core count (per process)||--cpus-per-task=<cores>|
|Memory limit||--mem=<limit> (memory per node in megabytes, MB)|
|Minimum memory per processor||--mem-per-cpu=<memory>|
|Request specific nodes||-w, --nodelist=<node>[,node2[,...]] or -F, --nodefile=<node file>|
|Job array||-a <array indices>|
|Standard output file||--output=<file path> (path must exist)|
|Standard error file||--error=<file path> (path must exist)|
|Combine stdout/stderr to stdout||--output=<combined out and err file path>|
|Copy environment||--export=ALL (default); --export=NONE to not export the environment|
|Copy environment variable||--export=<variable[=value][,variable2=value2[,...]]>|
|Request event notification||--mail-type=<events> (multiple mail-type requests may be specified in a comma separated list, e.g. --mail-type=BEGIN,END)|
|Email address||--mail-user=<email address>|
|Defer job until the specified time||--begin=<date/time>|
|Node exclusive job||--exclusive|
An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. Interactive jobs are useful when debugging or interacting with an application. The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.
[danbarke@beta-login ~]$ srun --pty /bin/bash
srun: job 309 queued and waiting for resources
srun: job 309 has been allocated resources
[danbarke@bn1 ~]$ hostname
bn1.dev.arc-ts.umich.edu
[danbarke@bn1 ~]$
Jobs submitted with `srun --pty /bin/bash` will be assigned the cluster default values of 1 CPU and 768MB of memory. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes with 4 CPUs and 4 GB of memory each.
[danbarke@beta-login ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
srun: job 894 queued and waiting for resources
srun: job 894 has been allocated resources
[danbarke@bn1 ~]$ srun hostname
bn1.stage.arc-ts.umich.edu
bn2.stage.arc-ts.umich.edu
bn1.stage.arc-ts.umich.edu
bn1.stage.arc-ts.umich.edu
bn1.stage.arc-ts.umich.edu
bn2.stage.arc-ts.umich.edu
bn2.stage.arc-ts.umich.edu
bn2.stage.arc-ts.umich.edu
[danbarke@bn1 ~]$
In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job. See the srun man page for more details. https://slurm.schedmd.com/srun.html
Jobs can request GPUs with the job submission option --gres=gpu:<count>. GPUs can be requested in both batch and interactive jobs.
An xterm job is a job that launches an xterm window when the job runs. The --x11 option, added to sbatch, srun, and salloc in Slurm 17.11, is used to request an X11 job. Once the xterm window opens, the user can launch their application(s) from it across the computing resources which have been allocated to the job.
You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using Slurm’s job dependency options. For example, if you have two jobs, Job1.sh and Job2.sh, you can utilize job dependencies as in the example below.
[user@beta-login]$ sbatch Job1.sh
123213
[user@beta-login]$ sbatch --dependency=afterany:123213 Job2.sh
123214
The flag --dependency=afterany:123213 tells the batch system to start the second job only after completion of the first job. afterany indicates that Job2 will run regardless of the exit status of Job1, i.e. regardless of whether the batch system thinks Job1 completed successfully or unsuccessfully.
Once job 123213 completes, job 123214 will be released by the batch system and then will run as the appropriate nodes become available.
Exit status: The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of ‘0’ means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.
There are several options for the ‘--dependency’ flag that depend on the status of Job1, e.g.
|--dependency=afterany:Job1||Job2 will start after Job1 completes with any exit status|
|--dependency=after:Job1||Job2 will start any time after Job1 starts|
|--dependency=afterok:Job1||Job2 will run only if Job1 completed with an exit status of 0|
|--dependency=afternotok:Job1||Job2 will run only if Job1 completed with a non-zero exit status|
Making several jobs depend on the completion of a single job is trivial. This is accomplished in the example below:
[user@beta-login]$ sbatch Job1.sh
13205
[user@beta-login]$ sbatch --dependency=afterany:13205 Job2.sh
13206
[user@beta-login]$ sbatch --dependency=afterany:13205 Job3.sh
13207
[user@beta-login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY
13205        Job1.sh         R
13206        Job2.sh         PD   afterany:13205
13207        Job3.sh         PD   afterany:13205
Making a job depend on the completion of several other jobs: example below.
[user@beta-login]$ sbatch Job1.sh
13201
[user@beta-login]$ sbatch Job2.sh
13202
[user@beta-login]$ sbatch --dependency=afterany:13201,13202 Job3.sh
13203
[user@beta-login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY
13201        Job1.sh         R
13202        Job2.sh         R
13203        Job3.sh         PD   afterany:13201,afterany:13202
Chaining jobs is most easily done by submitting the second dependent job from within the first job. Example batch script:
#!/bin/bash
cd /data/mydir
run_some_command
sbatch --dependency=afterany:$SLURM_JOB_ID my_second_job
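An alternative to submitting from inside the first job is to submit the whole chain from the login node and pass each job ID to the next submission. The sketch below assumes sbatch's --parsable option, which prints only the job ID (optionally followed by ";" and a cluster name). The function name submit_chain and the script names in the usage note are hypothetical.

```shell
#!/bin/bash
# Sketch: submit scripts so that each one depends on the previous one.
# Usage: submit_chain <dependency type> <script> [<script> ...]
submit_chain() {
    local dep_type=$1
    shift
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job in the chain has no dependency.
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency="${dep_type}:${prev}" "$script") || return 1
        fi
        jobid=${jobid%%;*}    # strip an optional ";cluster" suffix
        echo "$script -> $jobid"
        prev=$jobid
    done
}
```

For example, submit_chain afterok Job1.sh Job2.sh Job3.sh would make Job2.sh wait for Job1.sh to finish with exit status 0, and Job3.sh wait for Job2.sh in the same way.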
Job dependencies documentation adapted from https://hpc.nih.gov/docs/userguide.html#depend
Job arrays are multiple jobs to be executed with identical parameters. Job arrays are submitted with -a, --array=<indexes>. The indexes specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a “-” separator.
For example, “--array=0-15” or “--array=0,6,16-32”.
A step function can also be specified with a suffix containing a colon and number. For example, “--array=0-15:4” is equivalent to “--array=0,4,8,12”.
A maximum number of simultaneously running tasks from the job array may be specified using a “%” separator. For example “--array=0-15%4” will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is 499999.
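To preview which indices an array specification covers before submitting, a small local helper can expand it. This is only a sketch: expand_array_spec is a hypothetical function that does not talk to Slurm, and it handles just the forms described above (comma lists, ranges, ":" steps, and a trailing "%" limit, which it ignores).

```shell
#!/bin/bash
# Sketch: expand a job array index specification into a list of indices.
expand_array_spec() {
    local spec=${1%%\%*}      # drop an optional "%<limit>" suffix
    local part range step start end i
    local -a parts out=()
    IFS=',' read -r -a parts <<< "$spec"
    for part in "${parts[@]}"; do
        step=1
        range=$part
        if [[ $part == *:* ]]; then
            step=${part##*:}      # number after the colon
            range=${part%%:*}     # range before the colon
        fi
        if [[ $range == *-* ]]; then
            start=${range%%-*}
            end=${range##*-}
        else
            start=$range          # a single index
            end=$range
        fi
        for (( i = start; i <= end; i += step )); do
            out+=("$i")
        done
    done
    echo "${out[*]}"
}
```

Running expand_array_spec 0-15:4 prints 0 4 8 12, matching the step function example above.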
For each job type above, the user has the ability to define the execution environment. This includes environment variable definitions as well as shell limits (bash ulimit or csh limit). sbatch and salloc provide the --export option to convey specific environment variables to the execution environment. sbatch and salloc provide the --propagate option to convey specific shell limits to the execution environment. By default Slurm does not source the files ~/.bashrc or ~/.profile when requesting resources via sbatch (although it does when running srun / salloc). So, if you have a standard environment that you have set in either of these files or your current shell, you can do one of the following:
- Add the command #SBATCH --get-user-env to your job script (i.e. the module environment is propagated).
- Source the configuration file in your job script:
Sourcing your .bashrc file
< #SBATCH statements >
source ~/.bashrc
- You may want to remove the influence of any other current environment variables by adding #SBATCH --export=NONE to the script. This removes all set/exported variables and then acts as if #SBATCH --get-user-env has been added (module environment is propagated).
Slurm recognizes and provides a number of environment variables.
The first category of environment variables are those that Slurm inserts into the job’s execution environment. These convey to the job script and application information such as job ID (SLURM_JOB_ID) and task ID (SLURM_PROCID).
The next category of environment variables are those the user can set in their environment to convey default options for every job they submit. These include options such as the wall clock limit. For the complete list, see the “INPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.
Finally, Slurm allows the user to customize the behavior and output of some commands using environment variables. For example, one can specify certain fields for the squeue command to display by setting the SQUEUE_FORMAT variable in the environment from which you invoke squeue.
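As an illustration of the last two categories, the lines below could go in a shell startup file or be typed at the prompt before submitting. The values are examples only: SBATCH_TIMELIMIT is the sbatch input variable corresponding to --time, and the SQUEUE_FORMAT string reuses the job ID, name, state, and dependency columns from the squeue examples earlier in this guide.

```shell
# Default wall clock limit for sbatch submissions (same effect as --time=30:00).
export SBATCH_TIMELIMIT=30:00

# Default squeue columns: job ID, job name, state, dependency.
export SQUEUE_FORMAT="%12i %15j %4t %30E"
```

Options given on the command line or in #SBATCH directives still take precedence over these environment defaults.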
Commonly Used Environment Variables
|Submit directory||$SLURM_SUBMIT_DIR||Slurm jobs start from the submit directory by default.|
|Node list||$SLURM_JOB_NODELIST||The Slurm variable has a different format from the PBS one. To get a list of nodes use: scontrol show hostnames $SLURM_JOB_NODELIST|
|Job array index||$SLURM_ARRAY_TASK_ID|
|Number of nodes allocated||$SLURM_JOB_NUM_NODES|
|Number of processes||$SLURM_NTASKS|
|Number of processes per node||$SLURM_TASKS_PER_NODE|
|Requested tasks per node||$SLURM_NTASKS_PER_NODE|
|Requested CPUs per task||$SLURM_CPUS_PER_TASK|
|Hostname||$HOSTNAME == $SLURM_SUBMIT_HOST||Unless a shell is invoked on an allocated resource, the HOSTNAME variable is propagated (copied) from the submit machine’s environment and will be the same on all allocated nodes.|
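A job script can read the variables in the table above like any other shell variable. The sketch below prints a one-line summary. describe_allocation is a hypothetical helper, and the "unset" fallback is there because some of these variables are only defined when the corresponding option was requested, or when the script runs inside a job at all.

```shell
#!/bin/bash
# Sketch: summarize the current allocation from Slurm's environment variables.
describe_allocation() {
    echo "job=${SLURM_JOB_ID:-unset} nodes=${SLURM_JOB_NUM_NODES:-unset} tasks=${SLURM_NTASKS:-unset} cpus_per_task=${SLURM_CPUS_PER_TASK:-unset}"
}
```

Calling describe_allocation near the top of a batch script leaves a record of the allocation in the job's output file.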
Slurm merges the job’s error and output by default and saves it to an output file with a name that includes the job ID (slurm-<job_ID>.out). You can specify your own output and error files to the sbatch command using the -o and -e options respectively. Slurm will append the job’s output to the specified file(s). If you want the output to overwrite any existing files, add the --open-mode=truncate option.
Serial vs. Parallel jobs
Parallel jobs launch applications that are comprised of many processes (aka tasks) that communicate with each other, typically over a high speed switch. Serial jobs launch one or more tasks that work independently on separate problems.
Parallel applications must be launched by the srun command. Serial applications can also be launched with srun, but this is not required in single-node allocations.
A cluster is often highly utilized and may not be able to run a job when it is submitted. When this occurs, the job is placed in a partition. Specific compute node resources are defined for every job partition. The Slurm partition is synonymous with the term queue.
Each partition can be configured with a set of limits which specify the requirements for every job that can run in that partition. These limits include job size, wall clock limits, and the users who are allowed to run in that partition.
The Bighouse cluster has the following partitions:
- standard – the production partition for running production jobs.
- debug – the debug partition providing quick turnaround for shorter and smaller jobs.
- priority – the partition for starting jobs before jobs in other partitions. This partition has a higher charge rate.
- long – the partition for jobs over 48 hours
The squeue command lists all the jobs currently in the system, one line per job.
Users must request a charge (aka bank) account for each job they submit or have a valid charge account assigned by default. If the user is not assigned to any charge accounts, they cannot submit a job to the batch system. Computing resources allocated to a job are tracked and charged to the job’s specified charge account. The charge accounts each user is permitted to use can be seen by running the sshare command.
Pending jobs are ordered in the queue based on a number of factors. The scheduler always looks to schedule the job at the top of the queue. The scheduler is also configured to schedule jobs lower in the queue if doing so does not delay the start of any higher priority job. This is known as backfill scheduling.
The active factors that contribute to a job’s priority can be seen by invoking the sprio command. These factors include:
- Fair-share: a number derived from the difference between the shares of the cluster that have been allotted to a user for a specific charge account and the usage accrued to the user and charge account, as well as any parent charge accounts.
For a more detailed description of the algorithms used to calculate the fair-share component of the job priority, see Fair Tree.
- Job size: a number proportional to the quantity of computing resources the job has requested.
- Age: a number proportional to the period of time that has elapsed since the job was submitted to the partition. Note: time that a queued job spends in a held state does not contribute to the age factor.
- Partition: a number set by the partition the job is submitted to.
For a more detailed description of the algorithms for calculating job priority, see Multi-factor Priority.
Most of a job’s specifications can be seen by invoking scontrol show job <jobID>. More details about the job can be seen by adding the -d flag. A second -d flag will show the job script. A user is unable to see the script of the job of another user.
Slurm captures and reports the exit code of the job script (sbatch jobs) and, when a job is terminated by a signal, the signal that caused the termination.
A job’s record remains in Slurm’s memory for 30 minutes after it completes. scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago. At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.
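The two lookups can be combined in a small wrapper that tries scontrol first and falls back to sacct. This is a sketch that assumes scontrol exits with a non-zero status when the job ID is no longer known. show_job is a hypothetical name and the sacct field list is just one reasonable choice.

```shell
#!/bin/bash
# Sketch: show a job's record from Slurm's memory, or from the database
# if the job is no longer held in memory.
show_job() {
    local jobid=$1
    scontrol show job "$jobid" 2>/dev/null ||
        sacct -j "$jobid" --format=JobID,JobName,State,ExitCode,Elapsed
}
```

For example, show_job 123213 prints the scontrol record for a recent job and the sacct summary for an older one.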
Modifying a Batch Job
Many of the batch job specifications can be modified after a batch job is submitted and before it runs. Typical fields that can be modified include the job size (number of nodes), partition (queue), and wall clock limit. Job specifications cannot be modified by the user once the job enters the Running state.
Besides displaying a job’s specifications, the scontrol command is used to modify them. Examples:
Display all of a job’s characteristics:
scontrol -dd show job <jobID>
Retrieve the batch script for a given job:
scontrol write batch_script <jobID>
Change the job’s account to the “science” account:
scontrol update JobId=<jobID> Account=science
Change the job’s partition to the priority partition:
scontrol update JobId=<jobID> Partition=priority
Holding and Releasing a Batch Job
If a user’s job is in the pending state waiting to be scheduled, the user can prevent the job from being scheduled by invoking the scontrol hold <jobID> command to place the job into a Held state. Jobs in the held state do not accrue any job priority based on queue wait time. Once the user is ready for the job to become a candidate for scheduling once again, they can release the job using the scontrol release <jobID> command.
Signalling and Cancelling a Batch Job
Pending jobs can be cancelled (withdrawn from the queue) using the scancel command (scancel <jobID>). The scancel command can also be used to terminate a running job. The default behavior is to issue the job a SIGTERM, wait 30 seconds, and if processes from the job continue to run, issue a SIGKILL.
The -s option of the scancel command (scancel -s <signal> <jobID>) allows the user to issue any signal to a running job.
Common Job Commands
|Submit a job||sbatch <job script>|
|Delete a job||scancel <job ID>|
|Job status (all)||squeue|
|Job status (by job)||squeue -j <job ID>|
|Job status (by user)||squeue -u <user>|
|Job status (detailed)||scontrol show job -dd <job ID>|
|Show expected start time||squeue -j <job ID> --start|
|Queue list / info||scontrol show partition [queue]|
|Node list||scontrol show nodes|
|Node details||scontrol show node <node>|
|Hold a job||scontrol hold <job ID>|
|Release a job||scontrol release <job ID>|
|Start an interactive job||salloc <args> or srun --pty <args>|
|X forwarding||srun --pty <args> --x11|
|Read stdout messages at runtime||No equivalent command / not needed. Use the --output option instead.|
|Monitor or review a job’s resource usage||sacct -j <job_num> --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed (see the sacct man page for all format options)|
|View job batch script||scontrol write batch_script <jobID> [filename]|
The basic job states are these:
- Pending – the job is in the queue, waiting to be scheduled
- Held – the job was submitted, but was put in the held state (ineligible to run)
- Running – the job has been granted an allocation. If it’s a batch job, the batch script has been run
- Complete – the job has completed successfully
- Timeout – the job was terminated for running longer than its wall clock limit
- Preempted – the running job was terminated to reassign its resources to a higher QoS job
- Failed – the job terminated with a non-zero status
- Node Fail – the job terminated after a compute node reported a problem
For the complete list, see the “JOB STATE CODES” section under the squeue man page.
A pending job can remain pending for a number of reasons:
- Dependency – the pending job is waiting for another job to complete
- Priority – the job is not high enough in the queue
- Resources – the job is high in the queue, but there are not enough resources to satisfy the job’s request
- Partition Down – the queue is currently closed to running any new jobs
For the complete list, see the “JOB REASON CODES” section under the squeue man page.
Displaying Computing Resources
As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs. The resources of each compute node can be seen by running the scontrol show node command. The characteristics of each partition can be seen by running the scontrol show partition command. Finally, a load summary report for each partition can be seen by running sinfo.
Job Statistics and Accounting
The sreport command provides aggregated usage reports by user and account over a specified period. Examples:
sreport -T billing cluster AccountUtilizationByUser Start=2017-01-01 End=2017-12-31
sreport -T billing cluster UserUtilizationByAccount Start=2017-01-01 End=2017-12-31
For all of the sreport options see the sreport man page.
Time Remaining in an Allocation
If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, applications have two means for discovering the time remaining in the allocation.
The first means is to use the sbatch --signal=<sig_num>[@<sig_time>] option to request a signal (like USR1 or USR2) at sig_time number of seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler takes the necessary steps to write a checkpoint file and terminate gracefully.
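A minimal sketch of the signal-based approach, written as a bash batch script. It assumes the "B:" prefix of --signal, which directs the signal to the batch shell itself (without it, the signal is delivered to the job steps). The 120 second lead time, the function names, and the sleep and echo stand-ins for real work and real checkpoint code are all illustrative.

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@120   # send USR1 to the batch shell about 120 s before the limit

checkpoint_requested=0

# Signal handler: only record that the signal arrived; the work loop reacts to it.
on_usr1() { checkpoint_requested=1; }
trap on_usr1 USR1

# Run up to max_steps units of work, stopping early once USR1 has been received.
run_work_loop() {
    local max_steps=$1 step=0
    while [ "$step" -lt "$max_steps" ]; do
        if [ "$checkpoint_requested" -eq 1 ]; then
            echo "checkpoint written at step $step"   # stand-in for real checkpoint code
            return 0
        fi
        sleep 1                                       # stand-in for one unit of real work
        step=$((step + 1))
    done
    echo "finished all $max_steps steps"
}
```

A script built this way would end with a call such as run_work_loop 1000. Because bash runs a trap handler only after the current foreground command returns, each unit of work should be short relative to the requested lead time.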
The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.
Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.
Trackable RESources (TRES)
Trackable RESources (TRES) are cluster resources for which usage is tracked. Each resource a job requests contributes to a portion of the job TRES. CPU, memory, and GPU are all examples of TRES. There is also an additional TRES called billing which is calculated from the other TRES.
TRESBillingWeights are multiplicative factors used to weight usage of individual TRES components to calculate the billing TRES. TRESBillingWeights are useful as different resources have different costs. Without TRESBillingWeights 1MB of memory would be equivalent to 1 CPU. TRESBillingWeights are set at the partition level and may be different for each partition.
The billing TRES for jobs run under a partition with TRESBillingWeights is the sum of each TRES component multiplied by its corresponding billing weight:
B = (C*CB) + (M*MB) + (G*GB)
B = billing TRES
C = number of CPUs
M = memory in MB
G = number of GPUs
CB = CPU billing rate
MB = memory billing rate
GB = GPU billing rate
A job with 1 CPU, 8 GB of memory, and 1 GPU run in a partition with TRESBillingWeights="CPU=1.0,Mem=0.25,gres/gpu=2.0" has
B = (1*1.0) + (8*1024*0.25) + (1*2.0) = 2051
Trackable RESources (TRES) Minutes
TRES account for the different amounts of resources used by a job and TRESBillingWeights account for the different costs of those resources. TRES minutes account for the duration those resources are used. The TRES minutes for each TRES are calculated by multiplying the job’s total walltime by the amount of that TRES.
TRES minutes are calculated with the following formulas:
CPU TRES minutes = W*C
Memory TRES minutes = W*M
GPU TRES minutes = W*G
Billing TRES minutes = W*B
W = walltime in minutes
C = number of CPUs
M = memory in MB
G = number of GPUs
B = billing TRES
The TRES minutes for a job that runs for 10 minutes with 1 CPU, 8 GB of memory, and 1 GPU on a partition with TRESBillingWeights="CPU=1.0,Mem=0.25,GRES/gpu=2.0" are
CPU TRES Minutes=10*1=10
Memory TRES Minutes=10*8*1024=81920
GPU TRES Minutes= 10*1=10
Billing TRES Minutes = 10*((1*1.0) + (8*1024*0.25) + (1*2.0)) = 20510
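The arithmetic in the worked example above can be reproduced with a short shell sketch. The function names and argument order are hypothetical. Memory is given in MB and the weights are passed in explicitly, since they differ between partitions.

```shell
#!/bin/bash
# Sketch: billing TRES = CPUs*CPU weight + memory (MB)*memory weight + GPUs*GPU weight.
# Usage: billing_tres <cpus> <mem_mb> <gpus> <cpu_weight> <mem_weight> <gpu_weight>
billing_tres() {
    awk -v c="$1" -v m="$2" -v g="$3" -v cb="$4" -v mb="$5" -v gb="$6" \
        'BEGIN { print c*cb + m*mb + g*gb }'
}

# Billing TRES minutes = walltime in minutes * billing TRES.
# Usage: billing_tres_minutes <walltime_min> <cpus> <mem_mb> <gpus> <cpu_weight> <mem_weight> <gpu_weight>
billing_tres_minutes() {
    local walltime=$1
    shift
    awk -v w="$walltime" -v b="$(billing_tres "$@")" 'BEGIN { print w*b }'
}
```

With the values from the example, billing_tres 1 8192 1 1.0 0.25 2.0 prints 2051 and billing_tres_minutes 10 1 8192 1 1.0 0.25 2.0 prints 20510.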
GrpTRESMins is a limit on the TRES minutes that may be accumulated by all jobs against an account. The limit can be set for any TRES. On the Bighouse cluster we set the GrpTRESMins limit for the billing TRES.
Account hpcstaff has a GrpTRESMins=billing limit of 100. Emily runs a job which uses 50 billing TRES minutes. Then Scott runs a job which uses 30 billing TRES minutes. If Dan tries to run a job which requires 25 billing TRES minutes, his job is unable to run against the account hpcstaff as running Dan’s job would put the account over its 100 billing TRES minute limit as set by GrpTRESMins.
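The check in this example is simple addition, sketched below. fits_under_limit is a hypothetical helper that illustrates the arithmetic only and does not reproduce how Slurm itself tracks or enforces the limit.

```shell
#!/bin/bash
# Sketch: would a new job keep the account at or under its GrpTRESMins limit?
# Usage: fits_under_limit <limit> <minutes already used> <minutes requested>
fits_under_limit() {
    local limit=$1 used=$2 requested=$3
    [ $((used + requested)) -le "$limit" ]
}
```

With the numbers above, fits_under_limit 100 80 25 fails because 80 + 25 = 105 exceeds the limit of 100, whereas a job needing 20 billing TRES minutes would still fit.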
Adapted from the Lawrence Livermore National Laboratory website.