
Slurm User Guide for Lighthouse


Go to Lighthouse Overview     To search this user guide, use the Command + F (Mac) or Ctrl + F (Win) keyboard shortcuts.


Slurm is a combined batch scheduler and resource manager that allows users to run their jobs on the University of Michigan’s high performance computing (HPC) clusters. This document describes the process for submitting and running jobs under the Slurm Workload Manager on the Lighthouse cluster.

The Batch Scheduler and Resource Manager

The batch scheduler and resource manager work together to run jobs on an HPC cluster. The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time.  When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job’s allocated resources.  This is also known as “running the job”.

The user can specify conditions for scheduling the job. One condition is the completion (successful or unsuccessful) of an earlier submitted job.  Other conditions include the availability of a specific license or access to a specific hardware accelerator.

Computing Resources

An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory and GPUs.  The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).

Login Resources

Users interact with an HPC cluster through login nodes. Login nodes are a place where users can login, edit files, view job results and submit new jobs. Login nodes are a shared resource and should not be used to run application workloads.

Jobs and Job Steps

A job is an allocation of resources assigned to an individual user for a specified amount of time. Job steps are sets of (possibly parallel) tasks within a job. When a job runs, the scheduler selects and allocates resources to the job. The invocation of the application happens within the batch script, or at the command line for interactive jobs.

When an application is launched using srun, it runs within a “job step”. The srun command causes the simultaneous launching of multiple tasks of a single application. Arguments to srun specify the number of tasks to launch as well as the number of nodes (and CPUs and memory) on which to launch the tasks.

Multiple srun invocations can run sequentially or in parallel (by backgrounding them). Furthermore, the number of nodes specified by srun (the -N option) can be less than or equal to, but not more than, the number of nodes (and CPUs and memory) that were allocated to the job.

srun can also be invoked directly at the command line (outside of a job allocation). Doing so will submit a job to the batch scheduler and srun will block until that job is scheduled to run. When the srun job runs, a single job step will be created. The job will complete when that job step terminates.
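
For example, a batch script can launch several job steps at once by backgrounding each srun call and waiting for them to finish. A minimal sketch, assuming two placeholder applications ./step_a and ./step_b that each use one of the allocated nodes:

#!/bin/bash
#SBATCH --job-name=two_steps
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=30:00
#SBATCH --account=test
#SBATCH --partition=standard

# Launch two job steps in parallel, each on one node of the allocation.
srun --nodes=1 --ntasks=4 ./step_a &
srun --nodes=1 --ntasks=4 ./step_b &

# Wait for both job steps to finish before the job ends.
wait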

Batch Jobs

Most work will be queued to be run on Lighthouse and is described through a batch script. The sbatch command is used to submit a batch script to Slurm. To submit a batch script, simply run the following from a shared file system; shared file systems include your home directory, /scratch, and any directory under /nfs that you can normally use in a job on Flux. Output will be sent to this working directory (jobName-jobID.log). Do not submit jobs from /tmp or any of its subdirectories. sbatch is designed to reject the job at submission time if there are requests or constraints that Slurm cannot fulfill as specified. This gives the user the opportunity to examine the job request and resubmit it with the necessary corrections.

$ sbatch myJob.sh

The batch job script is composed of three main components:

  • The interpreter used to execute the script
  • #SBATCH directives that convey submission options
  • The application(s) to execute along with its input arguments and options

Example:

#!/bin/bash
# The interpreter used to execute the script

# "#SBATCH" directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-type=BEGIN,END
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m 
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

# The application(s) to execute along with its input arguments and options:

/bin/hostname
sleep 60

How many nodes and processors you request will depend on the capability of your software and what it can do. There are four common scenarios:

Example: One Node, One Processor

This is the simplest case and is shown in the example above. The majority of software cannot use more than this. Some examples of software for which this would be the right configuration are SAS, Stata, R, many Python programs, and most Perl programs.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: One Node, Multiple Processors

This is similar to what a modern desktop or laptop is likely to have. Software that can use more than one processor may be described as multicore, multiprocessor, or multithreaded. Some examples of software that can benefit from this are MATLAB and Stata/MP. You should read the documentation for your software to see if this is one of its capabilities.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, One Process per CPU

This is the classic MPI approach, where multiple machines are requested and one process per processor is started on each node using MPI. This is the way most MPI-enabled software is written to work.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, Multiple CPUs per Process

This is often referred to as the “hybrid mode” MPI approach, where multiple machines are requested and multiple processes are started per node. MPI will start a parent process or processes on each node, and those in turn will be able to use more than one processor for threaded calculations.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Common Job Submission Options

Option Slurm Command (#SBATCH) Lighthouse Usage
Job name --job-name=<name> --job-name=lhjob1
Account --account=<account> --account=test
Queue --partition=<name> --partition=name

The partition name depends on the PI’s requested cluster

Wall time limit --time=<hh:mm:ss> --time=02:00:00
Node count --nodes=<count> --nodes=2
Process count per node --ntasks-per-node=<count> --ntasks-per-node=1
Core count (per process) --cpus-per-task=<cores> --cpus-per-task=1
Memory limit --mem=<limit> (Memory per node in MB) --mem=12000m
Minimum memory per processor --mem-per-cpu=<memory> --mem-per-cpu=1000m
Request GPUs --gres=gpu:<count> --gres=gpu:2
Job array --array=<array indices> --array=0-15
Standard output file --output=<file path> (path must exist) --output=/home/%u/%x-%j.log
%u = username
%x = job name
%j = job ID 
Standard error file --error=<file path> (path must exist) --error=/home/%u/error-%x-%j.log
Combine stdout/stderr to stdout --output=<combined out and err file path> --output=/home/%u/%x-%j.log
Copy environment --export=ALL (default)

--export=NONE (to not export environment)

--export=ALL
Copy environment variable --export=<variable=value,var2=val2> --export=EDITOR=/bin/vim
Job dependency --dependency=after:jobID[:jobID...]

--dependency=afterok:jobID[:jobID...]

--dependency=afternotok:jobID[:jobID...]

--dependency=afterany:jobID[:jobID...]

--dependency=after:1234[:1233]
Request software license(s)

--licenses=<application>@slurmdb:<N>

--licenses=stata@slurmdb:1
requests one license for Stata

Request event notification

--mail-type=<events>

Note: multiple mail-type requests may be specified in a comma separated list:

--mail-type=BEGIN,END,NONE,FAIL,REQUEUE,ARRAY_TASKS

--mail-type=BEGIN,END,FAIL

Email address --mail-user=<email address> --mail-user=uniqname@umich.edu
Defer job until the specified time --begin=<date/time> --begin=2020-12-25T12:30:00

Please note that if your job is set to utilize more than one node, your code must be MPI-enabled in order to run across these nodes, and you must use srun rather than mpirun or mpiexec.

Requesting software licenses

Many of the software packages that are licensed for use on ARC clusters are licensed for a limited number of concurrent uses. If you will use one of those packages, then you must request a license or licenses in your submission script. As an example, to request one Stata license, you would use

#SBATCH --licenses=stata@slurmdb:1

The list of software can be found from Lighthouse by using the command

$ scontrol show licenses

Interactive Jobs

If you need to interact with your job to complete a task, you can submit an interactive job.  When your interactive job runs, you will have command line access on a compute node and can interact with all of the resources you requested. Interactive jobs are useful when debugging or interacting with an application.  The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[user@login ~]$ srun --pty /bin/bash
srun: job 309 queued and waiting for resources
srun: job 309 has been allocated resources
[user@node0001 ~]$ hostname
node0001.lh.arc-ts.umich.edu
[user@node0001 ~]$

Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024MB of memory. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes with 4 CPUs and 4GB of memory each:

[user@login ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
srun: job 894 queued and waiting for resources
srun: job 894 has been allocated resources
[user@node0001 ~]$ srun hostname
node0001.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu

In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job. See the srun man page for more details.

Exit status:

The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of ‘0’ means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.
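
Because only the last command determines the exit status, a trailing housekeeping command can mask the failure of the main application. A minimal sketch, assuming a placeholder application ./my_application, that captures and propagates the application's status:

#!/bin/bash
#SBATCH --job-name=exit_status_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

srun ./my_application        # placeholder for the main work
status=$?                    # capture its exit status
cp output.dat /nfs/mydir/    # housekeeping that would otherwise set the job's exit status
exit $status                 # report the application's status to Slurm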

Job Dependencies

You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using Slurm’s job dependencies options. For example, if you have two jobs, Job1.sh and Job2.sh, you can utilize job dependencies as in the example below.

[user@login]$ sbatch Job1.sh
123213

[user@login]$ sbatch --dependency=afterany:123213 Job2.sh
123214

The flag --dependency=afterany:123213 tells the batch system to start the second job only after completion of the first job. afterany indicates that Job2 will run regardless of the exit status of Job1, i.e. regardless of whether the batch system thinks Job1 completed successfully or unsuccessfully.

Once job 123213 completes, job 123214 will be released by the batch system and then will run as the appropriate nodes become available.

There are several options for the --dependency flag that depend on the status of Job1:

--dependency=afterany:Job1 Job2 will start after Job1 completes with any exit status
--dependency=after:Job1 Job2 will start any time after Job1 starts
--dependency=afterok:Job1 Job2 will run only if Job1 completed with an exit status of 0
--dependency=afternotok:Job1 Job2 will run only if Job1 completed with a non-zero exit status

Making several jobs depend on the completion of a single job is done in the example below:

[user@login]$ sbatch Job1.sh 
13205 
[user@login]$ sbatch --dependency=afterany:13205 Job2.sh 
13206 
[user@login]$ sbatch --dependency=afterany:13205 Job3.sh 
13207 
[user@login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E" 
JOBID        NAME            ST   DEPENDENCY                    
13205        Job1.bat        R                                  
13206        Job2.bat        PD   afterany:13205                
13207        Job3.bat        PD   afterany:13205

Making a job depend on the completion of several other jobs: example below.

[user@login]$ sbatch Job1.sh
13201
[user@login]$ sbatch Job2.sh
13202
[user@login]$ sbatch --dependency=afterany:13201,13202 Job3.sh
13203
[user@login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13201        Job1.sh         R                                  
13202        Job2.sh         R                                  
13203        Job3.sh         PD   afterany:13201,afterany:13202

Chaining jobs is most easily done by submitting the second dependent job from within the first job. Example batch script:

#!/bin/bash

cd /data/mydir
run_some_command
sbatch --dependency=afterany:$SLURM_JOB_ID  my_second_job

Job dependencies documentation adapted from https://hpc.nih.gov/docs/userguide.html#depend

Job Arrays

Job arrays are multiple jobs to be executed with identical parameters. Job arrays are submitted with -a <indices> or --array=<indices>. The indices specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a “-” separator: --array=0-15 or --array=0,6,16-32.

A step function can also be specified with a suffix containing a colon and number. For example, --array=0-15:4 is equivalent to --array=0,4,8,12.
A maximum number of simultaneously running tasks from the job array may be specified using a “%” separator. For example, --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is 499999.

To receive mail alerts for each individual array task, --mail-type=ARRAY_TASKS should be added to the Slurm job script. Unless this option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.
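
As an illustration, the sketch below runs a 16-task array with at most 4 tasks running at a time, each task processing its own numbered input file. The program and input file names are placeholders:

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --array=0-15%4            # tasks 0-15, at most 4 running simultaneously
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=ARRAY_TASKS

# Each array task processes its own input file, e.g. input_0.dat through input_15.dat
srun ./my_program input_${SLURM_ARRAY_TASK_ID}.dat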

Execution Environment

For each job type above, the user has the ability to define the execution environment. This includes environment variable definitions as well as shell limits (bash ulimit or csh limit). sbatch and salloc provide the --export option to convey specific environment variables to the execution environment. sbatch and salloc provide the --propagate option to convey specific shell limits to the execution environment. By default Slurm does not source the files ~/.bashrc or ~/.profile when requesting resources via sbatch (although it does when running srun/salloc). So, if you have a standard environment that you have set in either of these files or your current shell, then you can do one of the following:

  1. Add the command #SBATCH --get-user-env to your job script (i.e. the module environment is propagated).
  2. Source the configuration file in your job script:
< #SBATCH statements >
source ~/.bashrc

Note: You may want to remove the influence of any other current environment variables by adding #SBATCH --export=NONE to the script. This removes all set/exported variables and then acts as if #SBATCH --get-user-env has been added (module environment is propagated).
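
A minimal sketch combining these options; the module and program names are placeholders:

#!/bin/bash
#SBATCH --job-name=clean_env_example
#SBATCH --export=NONE            # discard the submission shell's variables (module environment is propagated)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

source ~/.bashrc                 # option 2 above: re-create personal settings
module load gcc                  # placeholder: load whatever modules the job needs
srun ./my_program                # placeholder application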

Environment Variables

Slurm recognizes and provides a number of environment variables.

The first category of environment variables are those that Slurm inserts into the job’s execution environment. These convey to the job script and application information such as job ID (SLURM_JOB_ID) and task ID (SLURM_PROCID). For the complete list, see the “OUTPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

The next category of environment variables are those the user can set in their environment to convey default options for every job they submit. These include options such as the wall clock limit. For the complete list, see the “INPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

Finally, Slurm allows the user to customize the behavior and output of some commands using environment variables. For example, one can specify certain fields for the squeue command to display by setting the SQUEUE_FORMAT variable in the environment from which you invoke squeue.
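
For example, setting SQUEUE_FORMAT before calling squeue changes the columns displayed; the format string below is just one possibility:

$ export SQUEUE_FORMAT="%.10i %.12j %.8u %.2t %.10M %R"
$ squeue -u $USER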

Commonly Used Environment Variables

Info Slurm Notes
Job name $SLURM_JOB_NAME
Job ID $SLURM_JOB_ID
Submit directory $SLURM_SUBMIT_DIR Slurm jobs start in the submit directory by default.
Submit host $SLURM_SUBMIT_HOST
Node list $SLURM_JOB_NODELIST The Slurm variable has a different format to the PBS one.

To get a list of nodes use:

scontrol show hostnames $SLURM_JOB_NODELIST

Job array index $SLURM_ARRAY_TASK_ID
Queue name $SLURM_JOB_PARTITION
Number of nodes allocated $SLURM_JOB_NUM_NODES

$SLURM_NNODES

Number of processes $SLURM_NTASKS
Number of processes per node $SLURM_TASKS_PER_NODE
Requested tasks per node $SLURM_NTASKS_PER_NODE
Requested CPUs per task $SLURM_CPUS_PER_TASK
Scheduling priority $SLURM_PRIO_PROCESS
Job user $SLURM_JOB_USER
Hostname $HOSTNAME == $SLURM_SUBMIT_HOST Unless a shell is invoked on an allocated resource, the HOSTNAME variable is propagated (copied) from the submit machine and will therefore be the same on all allocated nodes.

Job Output

Slurm merges the job’s standard error and output by default and saves them to an output file with a name that includes the job ID (slurm-<job_ID>.out for normal jobs and slurm-<job_ID>_<index>.out for arrays). You can specify your own output and error files to the sbatch command using the -o /file/to/output and -e /file/to/error options respectively. If both standard out and error should go to the same file, only specify -o /file/to/output. Slurm will append the job’s output to the specified file(s). If you want the output to overwrite any existing files, add the --open-mode=truncate option. The files are written as soon as output is created; output does not spool on the compute node and then get copied to the final location after the job ends. If not specified in the job submission, standard output and error are combined and written into a file in the working directory from which the job was submitted.
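
For example, the following directives send standard output and error to separate files and overwrite them on each run rather than appending (the paths follow the patterns from the table above and must already exist):

#SBATCH --output=/home/%u/%x-%j.log
#SBATCH --error=/home/%u/error-%x-%j.log
#SBATCH --open-mode=truncate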

For example if I submit job 93 from my home directory, the job output and error will be written to my home directory in a file called slurm-93.out. The file appears while the job is still running.

[user@login ~]$ sbatch test.sh
Submitted batch job 93 
[user@login ~]$ ll slurm-93.out
-rw-r--r-- 1 user hpcstaff 122 Jun 7 15:28 slurm-93.out
[user@login ~]$ squeue 
JOBID PARTITION NAME    USER ST TIME NODES NODELIST(REASON) 
93    standard  example user R  0:04 1     node0002

If you submit from a working directory which is not a shared filesystem, your output will only be available locally on the compute node and will need to be copied to another location after the job completes. /home, /scratch, and /nfs are all networked filesystems which are available on the login nodes and all compute nodes.

For example if I submit a job from /tmp on the login node, the output will be in /tmp on the compute node.

[user@login tmp]$ pwd
/tmp
[user@login tmp]$ sbatch /home/user/test.sh
Submitted batch job 98
[user@login tmp]$ squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
98    standard      example  user R   0:03  1     node0002
[user@login tmp]$ ssh node0002
[user@node0002 ~]$ ll /tmp/slurm-98.out
-rw-r--r-- 1 user hpcstaff 78 Jun 7 15:46 /tmp/slurm-98.out

Serial vs. Parallel jobs

Parallel jobs launch applications that are comprised of many processes (aka tasks) that communicate with each other, typically over a high speed switch. Serial jobs launch one or more tasks that work independently on separate problems.

Parallel applications must be launched by the srun command. Serial applications can use srun to launch them, but it is not required in one node allocations.

Job Partitions

A cluster is often highly utilized and may not be able to run a job when it is submitted. When this occurs, the job is placed in a partition. Specific compute node resources are defined for every job partition. The Slurm partition is synonymous with the term queue.

Each partition can be configured with a set of limits which specify the requirements for every job that can run in that partition. These limits include job size, wall clock limits, and the users who are allowed to run in that partition.

The Lighthouse cluster currently has a temporary “test” partition, which can be used for testing the Lighthouse Slurm environment before migrating nodes from the Flux Operating Environment (FOE).  There will be a partition named for the set of nodes for every PI or project.  A user will only have access to partitions that the PI has granted access, along with the “test” partition.

Commands related to partitions include:

sinfo Lists all partitions currently configured
scontrol show partition <name> Provides details about each partition
squeue Lists all jobs currently on the system, one line per job

Job Status

Most of a job’s specifications can be seen by invoking scontrol show job <jobID>.  The job’s batch script can be written to a file by using scontrol write batch_script <jobID> output.txt. If no output file is specified, the script will be written to slurm-<jobID>.sh.

Slurm captures and reports the exit code of the job script (sbatch jobs) as well as the signal that caused the job’s termination, if the job was terminated by a signal.

A job’s record remains in Slurm’s memory for 30 minutes after it completes.  scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago.  At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.
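
For example, to retrieve the record of a completed job from the Slurm database (the job ID is illustrative):

$ sacct -j 93 --format=JobID,JobName,State,ExitCode,Elapsed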

Modifying a Batch Job

Many of the batch job specifications can be modified after a batch job is submitted and before it runs.  Typical fields that can be modified include the job size (number of nodes), partition (queue), and wall clock limit.  Job specifications cannot be modified by the user once the job enters the Running state.

Beside displaying a job’s specifications, the scontrol command is used to modify them.  Examples:

scontrol -dd show job <jobID> Displays all of a job’s characteristics
scontrol write batch_script <jobID> Retrieve the batch script for a given job
scontrol update JobId=<jobID> Account=science Change the job’s account to the “science” account
scontrol update JobId=<jobID> Partition=priority Changes the job’s partition to the priority partition

Holding and Releasing a Batch Job

If a user’s job is in the pending state waiting to be scheduled, the user can prevent the job from being scheduled by invoking the scontrol hold <jobID> command to place the job into a Held state. Jobs in the held state do not accrue any job priority based on queue wait time.  Once the user is ready for the job to become a candidate for scheduling once again, they can release the job using the scontrol release <jobID> command.

Signalling and Cancelling a Batch Job

Pending jobs can be cancelled (withdrawn from the queue) using the scancel command (scancel <jobID>).  The scancel command can also be used to terminate a running job.  The default behavior is to issue the job a SIGTERM, wait 30 seconds, and if processes from the job continue to run, issue a SIGKILL command.

The -s option of the scancel command (scancel -s <signal> <jobID>) allows the user to issue any signal to a running job.
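
For example (the job ID is illustrative):

$ scancel -s USR1 1234    # deliver SIGUSR1 to running job 1234 without cancelling it
$ scancel 1234            # cancel job 1234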

Common Job Commands

Command Slurm
Submit a job sbatch <job script>
Delete a job scancel <job ID>
Job status (all) squeue
Job status (by job) squeue -j <job ID>
Job status (by user) squeue -u <user>
Job status (detailed) scontrol show job -dd <job ID>
Show expected start time squeue -j <job ID> --start
Queue list / info scontrol show partition <name>
Node list scontrol show nodes
Node details scontrol show node <node>
Hold a job scontrol hold <job ID>
Release a job scontrol release <job ID>
Cluster status sinfo
Start an interactive job salloc <args> or srun --pty <args>
X forwarding srun --pty --x11 <args>
Read stdout messages at runtime No equivalent command / not needed. Use the --output option instead.
Monitor or review a job’s resource usage sacct -j <job_num> --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed

(see sacct for all format options)

View job batch script scontrol write batch_script <jobID> [filename]
View accounts you can submit to sacctmgr show assoc user=$USER
View users with access to an account sacctmgr show assoc account=<account>
View default submission account and wckey sacctmgr show user <user>

Job States

The basic job states are these:

  • Pending (PD) – the job is in the queue, waiting to be scheduled
  • Held – the job was submitted, but was put in the held state (ineligible to run)
  • Running (R) – the job has been granted an allocation.  If it’s a batch job, the batch script has been run
  • Complete (CD) – the job has completed successfully
  • Timeout (TO) – the job was terminated for running longer than its wall clock limit
  • Preempted (PR) – the running job was terminated to reassign its resources to a higher QoS job
  • Failed (F) – the job terminated with a non-zero status
  • Node Fail (NF) – the job terminated after a compute node reported a problem

For the complete list, see the “JOB STATE CODES” section under the squeue man page.

Pending Reasons

A pending job can remain pending for a number of reasons:

  • Dependency – the pending job is waiting for another job to complete
  • Priority – the job is not high enough in the queue
  • Resources – the job is high in the queue, but there are not enough resources to satisfy the job’s request
  • Partition Down – the queue is currently closed to running any new jobs

For the complete list, see the “JOB REASON CODES” section under the squeue man page.

Displaying Computing Resources

As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs.  The resources of each compute node can be seen by running the scontrol show node command.  The characteristics of each partition can be seen by running the scontrol show partition command.  Finally, a load summary report for each partition can be seen by running sinfo.

To show a summary of cluster resources on a per partition basis:

[user@login ~]$ sinfo
PARTITION     AVAIL    TIMELIMIT    NODES STATE   NODELIST
example       up       14-00:00:00  20    comp    node[0001-0020]
[user@login ~]$ sstate
———————————————————————————————————————
Node      AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
node0001  0        16       0.00            0.03    0        64170    0.00           IDLE
node0002  0        16       0.00            0.04    0        64170    0.00           IDLE
...
———————————————————————————————————————
Totals:
Node    AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
20      0        320      0.00                    0        1283556  0.00

In this example the user “user” has access to submit workloads to the account “example” on the Lighthouse cluster. To show associations for the current user:

[user@login ~]$ sacctmgr show assoc user=$USER

Cluster      Account   User  Partition  ...
———————————————————————————————————————
lighthouse   example   user  1

Job Statistics and Accounting

The sreport command provides aggregated usage reports by user and account over a specified period. Examples:

By user: sreport -T billing cluster AccountUtilizationByUser Start=2019-01-01 End=2019-12-31

By account: sreport -T billing cluster UserUtilizationByAccount Start=2019-01-01 End=2019-12-31

For all of the sreport options see the sreport man page.

Time Remaining in an Allocation

If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, applications have two means for discovering the time remaining in the application.

The first means is to use the sbatch --signal=<sig_num>[@<sig_time>] option to request a signal (like USR1 or USR2) at sig_time number of seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler takes the necessary steps to write a checkpoint file and terminate gracefully.

The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.

Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.
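
A minimal sketch of the first approach in a batch script: request SIGUSR1 five minutes before the limit and trap it in the shell to trigger a checkpoint. The B: prefix directs the signal to the batch shell so the trap can catch it; ./write_checkpoint and ./my_long_program are placeholders:

#!/bin/bash
#SBATCH --job-name=checkpoint_example
#SBATCH --signal=B:USR1@300      # send SIGUSR1 to the batch shell 300 seconds before the time limit
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=02:00:00
#SBATCH --account=test
#SBATCH --partition=standard

# On SIGUSR1, write a checkpoint and exit cleanly.
trap './write_checkpoint; exit 0' USR1

# Run the application in the background so the shell can receive the signal,
# then wait for it to finish (wait is interrupted when the trap fires).
./my_long_program &
wait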

Slurm User Guide for Great Lakes


Go to Beta Overview     To search this user guide, use the Command + F (Mac) or Ctrl + F (Win) keyboard shortcuts.


Slurm is a combined batch scheduler and resource manager that allows users to run their jobs on the University of Michigan’s high performance computing (HPC) clusters. This document describes the process for submitting and running jobs under the Slurm Workload Manager on the Great Lakes cluster.

The Batch Scheduler and Resource Manager

The batch scheduler and resource manager work together to run jobs on an HPC cluster. The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time.  When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job’s allocated resources.  This is also known as “running the job”.

The user can specify conditions for scheduling the job. One condition is the completion (successful or unsuccessful) of an earlier submitted job.  Other conditions include the availability of a specific license or access to a specific hardware accelerator.

Computing Resources

An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory and GPUs.  The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).

Login Resources

Users interact with an HPC cluster through login nodes. Login nodes are a place where users can login, edit files, view job results and submit new jobs. Login nodes are a shared resource and should not be used to run application workloads.

Jobs and Job Steps

A job is an allocation of resources assigned to an individual user for a specified amount of time. Job steps are sets of (possibly parallel) tasks within a job. When a job runs, the scheduler selects and allocates resources to the job. The invocation of the application happens within the batch script, or at the command line for interactive jobs.

When an application is launched using srun, it runs within a “job step”. The srun command causes the simultaneous launching of multiple tasks of a single application. Arguments to srun specify the number of tasks to launch as well as the number of nodes (and CPUs and memory) on which to launch the tasks.

Multiple srun invocations can run sequentially or in parallel (by backgrounding them). Furthermore, the number of nodes specified by srun (the -N option) can be less than or equal to, but not more than, the number of nodes (and CPUs and memory) that were allocated to the job.

srun can also be invoked directly at the command line (outside of a job allocation). Doing so will submit a job to the batch scheduler and srun will block until that job is scheduled to run. When the srun job runs, a single job step will be created. The job will complete when that job step terminates.

Batch Jobs

The sbatch command is used to submit a batch script to Slurm. It is designed to reject the job at submission time if there are requests or constraints that Slurm cannot fulfill as specified. This gives the user the opportunity to examine the job request and resubmit it with the necessary corrections. To submit a batch script simply run sbatch <scriptName>

$ sbatch myJob.sh

 

Anatomy of a Batch Job

The batch job script is composed of three main components:

  • The interpreter used to execute the script
  • #SBATCH directives that convey submission options
  • The application(s) to execute along with its input arguments and options

An Example Slurm job

#!/bin/bash
# The interpreter used to execute the script

# "#SBATCH" directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-user=uniqname@umich.edu
#SBATCH --mail-type=BEGIN,END
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m 
#SBATCH --time=10:00
#SBATCH -A test
#SBATCH -p standard
#SBATCH --output=/home/%u/%x-%j.log

# The application(s) to execute along with its input arguments and options:

/bin/hostname
sleep 60

Common Job Submission Options

Option Slurm Command (#SBATCH) Great Lakes Usage
Job name --job-name=<name> --job-name=gljob1
Account --account=<account> --account=test
Queue --partition=<name> --partition=partitionname

Available partitions: standard, gpu (GPU jobs only), largemem (RAM-intensive jobs only), debug

Wall time limit --time=<hh:mm:ss> --time=02:00:00
Node count --nodes=<count> --nodes=2
Process count per node --ntasks-per-node=<count> --ntasks-per-node=1
Core count (per process) --cpus-per-task=<cores> --cpus-per-task=1
Memory limit --mem=<limit> (Memory per node in MB) --mem=12000m
Minimum memory per processor --mem-per-cpu=<memory> --mem-per-cpu=1000m
Request GPUs --gres=gpu:<count> --gres=gpu:2
Job array --array=<array indices> --array=0-15
Standard output file --output=<file path> (path must exist) --output=/home/%u/%x-%j.log
%u = username
%x = job name
%j = job ID 
Standard error file --error=<file path> (path must exist) --error=/home/%u/error-%x-%j.log
Combine stdout/stderr to stdout --output=<combined out and err file path> --output=/home/%u/%x-%j.log
Copy environment --export=ALL (default)

--export=NONE (to not export environment)

--export=ALL
Copy environment variable --export=<variable=value,var2=val2> --export=EDITOR=/bin/vim
Job dependency --dependency=after:jobID[:jobID...]

--dependency=afterok:jobID[:jobID...]

--dependency=afternotok:jobID[:jobID...]

--dependency=afterany:jobID[:jobID...]

--dependency=after:1234[:1233]
Request software license(s)

--licenses=<application>@slurmdb:<N>

--licenses=stata@slurmdb:1
requests one license for Stata

Request event notification

--mail-type=<events>

Note: multiple mail-type requests may be specified in a comma separated list:

--mail-type=BEGIN,END,NONE,FAIL,REQUEUE,ARRAY_TASKS

--mail-type=BEGIN,END,FAIL

Email address --mail-user=<email address> --mail-user=uniqname@umich.edu
Defer job until the specified time --begin=<date/time> --begin=2020-12-25T12:30:00

Please note that if your job is set to utilize more than one node, your code must be MPI-enabled in order to run across these nodes, and you must use srun rather than mpirun or mpiexec.

Interactive Jobs

An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. Interactive jobs are useful when debugging or interacting with an application. The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[user@login1 ~]$ srun --pty /bin/bash
srun: job 309 queued and waiting for resources
srun: job 309 has been allocated resources
[user@bn01 ~]$ hostname
bn01.stage.arc-ts.umich.edu
[user@bn01 ~]$

Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024MB of memory. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes with 4 CPUs and 4GB of memory each:

[user@login1 ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
srun: job 894 queued and waiting for resources
srun: job 894 has been allocated resources
[user@bn01 ~]$ srun hostname
bn01.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu
bn01.stage.arc-ts.umich.edu
bn01.stage.arc-ts.umich.edu
bn01.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu

In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job. See the srun man page for more details.

GPU and Large Memory Jobs

Jobs can request GPUs with the job submission options --partition=gpu and --gres=gpu:<count>. GPUs can be requested in both Batch and Interactive jobs.

Similarly, jobs can request nodes with large amounts of RAM with --partition=largemem.
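
For example, a batch script requesting a single GPU on the gpu partition might look like the sketch below; the application name is a placeholder:

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4g
#SBATCH --time=01:00:00
#SBATCH --account=test

srun ./my_gpu_program    # placeholder GPU application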

Job Dependencies

You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using Slurm’s job dependencies options. For example, if you have two jobs, Job1.sh and Job2.sh, you can utilize job dependencies as in the example below.

[user@login1]$ sbatch Job1.sh
123213

[user@login1]$ sbatch --dependency=afterany:123213 Job2.sh
123214

The flag --dependency=afterany:123213 tells the batch system to start the second job only after completion of the first job. afterany indicates that Job2 will run regardless of the exit status of Job1, i.e. regardless of whether the batch system thinks Job1 completed successfully or unsuccessfully.

Once job 123213 completes, job 123214 will be released by the batch system and then will run as the appropriate nodes become available.

Exit status: The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of ‘0’ means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.

There are several options for the --dependency flag that depend on the status of Job1:

--dependency=afterany:Job1 Job2 will start after Job1 completes with any exit status
--dependency=after:Job1 Job2 will start any time after Job1 starts
--dependency=afterok:Job1 Job2 will run only if Job1 completed with an exit status of 0
--dependency=afternotok:Job1 Job2 will run only if Job1 completed with a non-zero exit status

Making several jobs depend on the completion of a single job is done in the example below:

[user@login1]$ sbatch Job1.sh 
13205 
[user@login1]$ sbatch --dependency=afterany:13205 Job2.sh 
13206 
[user@login1]$ sbatch --dependency=afterany:13205 Job3.sh 
13207 
[user@login1]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E" 
JOBID        NAME            ST   DEPENDENCY                    
13205        Job1.bat        R                                  
13206        Job2.bat        PD   afterany:13205                
13207        Job3.bat        PD   afterany:13205

Making a job depend on the completion of several other jobs: example below.

[user@login1]$ sbatch Job1.sh
13201
[user@login1]$ sbatch Job2.sh
13202
[user@login1]$ sbatch --dependency=afterany:13201,13202 Job3.sh
13203
[user@login1]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13201        Job1.sh         R                                  
13202        Job2.sh         R                                  
13203        Job3.sh         PD   afterany:13201,afterany:13202

Chaining jobs is most easily done by submitting the second dependent job from within the first job. Example batch script:

#!/bin/bash

cd /data/mydir
run_some_command
sbatch --dependency=afterany:$SLURM_JOB_ID  my_second_job

Job dependencies documentation adapted from https://hpc.nih.gov/docs/userguide.html#depend

Job Arrays

Job arrays are multiple jobs to be executed with identical parameters. Job arrays are submitted with -a <indices> or --array=<indices>. The indices specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a “-” separator: --array=0-15 or --array=0,6,16-32.

A step function can also be specified with a suffix containing a colon and number. For example, --array=0-15:4 is equivalent to --array=0,4,8,12.
A maximum number of simultaneously running tasks from the job array may be specified using a “%” separator. For example, --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is 499999.

To receive mail alerts for each individual array task, --mail-type=ARRAY_TASKS should be added to the Slurm job script. Unless this option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.

Execution Environment

For each job type above, the user has the ability to define the execution environment. This includes environment variable definitions as well as shell limits (bash ulimit or csh limit). sbatch and salloc provide the --export option to convey specific environment variables to the execution environment. sbatch and salloc provide the --propagate option to convey specific shell limits to the execution environment. By default Slurm does not source the files ~/.bashrc or ~/.profile when requesting resources via sbatch (although it does when running srun/salloc). So, if you have a standard environment that you have set in either of these files or your current shell, then you can do one of the following:

  1. Add the command #SBATCH --get-user-env to your job script (i.e. the module environment is propagated).
  2. Source the configuration file in your job script:
< #SBATCH statements >
source ~/.bashrc

Note: You may want to remove the influence of any other current environment variables by adding #SBATCH --export=NONE to the script. This removes all set/exported variables and then acts as if #SBATCH --get-user-env has been added (module environment is propagated).

Environment Variables

Slurm recognizes and provides a number of environment variables.

The first category of environment variables are those that Slurm inserts into the job’s execution environment. These convey to the job script and application information such as job ID (SLURM_JOB_ID) and task ID (SLURM_PROCID). For the complete list, see the “OUTPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

The next category of environment variables are those the user can set in their environment to convey default options for every job they submit. These include options such as the wall clock limit. For the complete list, see the “INPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

Finally, Slurm allows the user to customize the behavior and output of some commands using environment variables. For example, one can specify certain fields for the squeue command to display by setting the SQUEUE_FORMAT variable in the environment from which you invoke squeue.

Commonly Used Environment Variables

Info Slurm Notes
Job name $SLURM_JOB_NAME
Job ID $SLURM_JOB_ID
Submit directory $SLURM_SUBMIT_DIR Slurm jobs start in the submit directory by default.
Submit host $SLURM_SUBMIT_HOST
Node list $SLURM_JOB_NODELIST The Slurm variable has a different format to the PBS one.

To get a list of nodes use:

scontrol show hostnames $SLURM_JOB_NODELIST

Job array index $SLURM_ARRAY_TASK_ID
Queue name $SLURM_JOB_PARTITION
Number of nodes allocated $SLURM_JOB_NUM_NODES

$SLURM_NNODES

Number of processes $SLURM_NTASKS
Number of processes per node $SLURM_TASKS_PER_NODE
Requested tasks per node $SLURM_NTASKS_PER_NODE
Requested CPUs per task $SLURM_CPUS_PER_TASK
Scheduling priority $SLURM_PRIO_PROCESS
Job user $SLURM_JOB_USER
Hostname $HOSTNAME == $SLURM_SUBMIT_HOST Unless a shell is invoked on an allocated resource, the HOSTNAME variable is propagated (copied) from the submit machine and will therefore be the same on all allocated nodes.

Job Output

Slurm merges the job’s standard error and output by default and saves them to an output file with a name that includes the job ID (slurm-<job_ID>.out for normal jobs and slurm-<job_ID>_<index>.out for arrays). You can specify your own output and error files to the sbatch command using the -o /file/to/output and -e /file/to/error options respectively. If both standard out and error should go to the same file, only specify -o /file/to/output. Slurm will append the job’s output to the specified file(s). If you want the output to overwrite any existing files, add the --open-mode=truncate option. The files are written as soon as output is created; output does not spool on the compute node and then get copied to the final location after the job ends. If not specified in the job submission, standard output and error are combined and written into a file in the working directory from which the job was submitted.

For example if I submit job 93 from my home directory, the job output and error will be written to my home directory in a file called slurm-93.out. The file appears while the job is still running.

[user@beta-login ~]$ sbatch test.sh
Submitted batch job 93 
[user@beta-login ~]$ ll slurm-93.out
-rw-r--r-- 1 user hpcstaff 122 Jun 7 15:28 slurm-93.out
[user@beta-login ~]$ squeue 
JOBID PARTITION NAME    USER ST TIME NODES NODELIST(REASON) 
93    standard  example user R  0:04 1     bn02

If you submit from a working directory which is not a shared filesystem, your output will only be available locally on the compute node and will need to be copied to another location after the job completes. /home, /scratch, and /nfs are all networked filesystems which are available on the login nodes and all compute nodes.

For example if I submit a job from /tmp on the login node, the output will be in /tmp on the compute node.

[user@beta-login tmp]$ pwd
/tmp
[user@beta-login tmp]$ sbatch /home/user/test.sh
Submitted batch job 98
[user@beta-login tmp]$ squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
98    standard      example  user R   0:03  1     bn02
[user@beta-login tmp]$ ssh bn02
[user@bn02 ~]$ ll /tmp/slurm-98.out
-rw-r--r-- 1 user hpcstaff 78 Jun 7 15:46 /tmp/slurm-98.out

Serial vs. Parallel jobs

Parallel jobs launch applications that are comprised of many processes (aka tasks) that communicate with each other, typically over a high speed switch. Serial jobs launch one or more tasks that work independently on separate problems.

Parallel applications must be launched by the srun command. Serial applications can use srun to launch them, but it is not required in one node allocations.

Job Partitions

A cluster is often highly utilized and may not be able to run a job when it is submitted. When this occurs, the job is placed in a partition. Specific compute node resources are defined for every job partition. The Slurm partition is synonymous with the term queue.

Each partition can be configured with a set of limits which specify the requirements for every job that can run in that partition. These limits include job size, wall clock limits, and the users who are allowed to run in that partition.

The Beta cluster currently has the “standard” partition, used for most production jobs.  The “gpu” partition is currently running a single node and should only be used for GPU-intensive tasks.

Commands related to partitions include:

sinfo Lists all partitions currently configured
scontrol show partition <name> Provides details about each partition
squeue Lists all jobs currently on the system, one line per job

Job Status

Most of a job’s specifications can be seen by invoking scontrol show job <jobID>.  The job’s batch script can be written to a file by using scontrol write batch_script <jobID> output.txt. If no output file is specified, the script will be written to slurm-<jobID>.sh.

Slurm captures and reports the exit code of the job script (sbatch jobs) as well as the signal that caused the job’s termination, if the job was terminated by a signal.

A job’s record remains in Slurm’s memory for 30 minutes after it completes.  scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago.  At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.

Modifying a Batch Job

Many of the batch job specifications can be modified after a batch job is submitted and before it runs.  Typical fields that can be modified include the job size (number of nodes), partition (queue), and wall clock limit.  Job specifications cannot be modified by the user once the job enters the Running state.

Beside displaying a job’s specifications, the scontrol command is used to modify them.  Examples:

scontrol -dd show job <jobID> Displays all of a job’s characteristics
scontrol write batch_script <jobID> Retrieve the batch script for a given job
scontrol update JobId=<jobID> Account=science Change the job’s account to the “science” account
scontrol update JobId=<jobID> Partition=priority Changes the job’s partition to the priority partition

Holding and Releasing a Batch Job

If a user’s job is in the pending state waiting to be scheduled, the user can prevent the job from being scheduled by invoking the scontrol hold <jobID> command to place the job into a Held state. Jobs in the held state do not accrue any job priority based on queue wait time.  Once the user is ready for the job to become a candidate for scheduling once again, they can release the job using the scontrol release <jobID> command.

Signalling and Cancelling a Batch Job

Pending jobs can be cancelled (withdrawn from the queue) using the scancel command (scancel <jobID>).  The scancel command can also be used to terminate a running job.  The default behavior is to issue the job a SIGTERM, wait 30 seconds, and if processes from the job continue to run, issue a SIGKILL command.

The -s option of the scancel command (scancel -s <signal> <jobID>) allows the user to issue any signal to a running job.

Common Job Commands

Command Slurm
Submit a job sbatch <job script>
Delete a job scancel <job ID>
Job status (all) squeue
Job status (by job) squeue -j <job ID>
Job status (by user) squeue -u <user>
Job status (detailed) scontrol show job -dd <job ID>
Show expected start time squeue -j <job ID> --start
Queue list / info scontrol show partition <name>
Node list scontrol show nodes
Node details scontrol show node <node>
Hold a job scontrol hold <job ID>
Release a job scontrol release <job ID>
Cluster status sinfo
Start an interactive job salloc <args> or srun --pty <args>
X forwarding srun --pty --x11 <args>
Read stdout messages at runtime No equivalent command / not needed. Use the --output option instead.
Monitor or review a job’s resource usage sacct -j <job_num> --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed

(see sacct for all format options)

View job batch script scontrol write batch_script <jobID> [filename]
View accounts you can submit to sacctmgr show assoc user=$USER
View users with access to an account sacctmgr show assoc account=<account>
View default submission account and wckey sacctmgr show user <user>

Job States

The basic job states are these:

  • Pending – the job is in the queue, waiting to be scheduled
  • Held – the job was submitted, but was put in the held state (ineligible to run)
  • Running – the job has been granted an allocation.  If it’s a batch job, the batch script has been run
  • Complete – the job has completed successfully
  • Timeout – the job was terminated for running longer than its wall clock limit
  • Preempted – the running job was terminated to reassign its resources to a higher QoS job
  • Failed – the job terminated with a non-zero status
  • Node Fail – the job terminated after a compute node reported a problem

For the complete list, see the “JOB STATE CODES” section under the squeue man page.

Pending Reasons

A pending job can remain pending for a number of reasons:

  • Dependency – the pending job is waiting for another job to complete
  • Priority – the job is not high enough in the queue
  • Resources – the job is high in the queue, but there are not enough resources to satisfy the job’s request
  • Partition Down – the queue is currently closed to running any new jobs

For the complete list, see the “JOB REASON CODES” section under the squeue man page.

Displaying Computing Resources

As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs.  The resources of each compute node can be seen by running the scontrol show node command.  The characteristics of each partition can be seen by running the scontrol show partition command.  Finally, a load summary report for each partition can be seen by running sinfo.

To show a summary of cluster resources on a per partition basis:

[user@beta-login ~]$ sinfo
PARTITION     AVAIL    TIMELIMIT    NODES STATE   NODELIST
standard*     up       14-00:00:00  5     comp    bn[16-20]
standard*     up       14-00:00:00  15    idle    bn[01-15]
gpu           up       14-00:00:00  1     idle    bn15
[user@beta-login ~]$ sstate
———————————————————————————————————————
Node    AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
bn01    0        16       0.00            0.03    0        64170    0.00           IDLE
bn02    0        16       0.00            0.04    0        64170    0.00           IDLE
bn03    0        16       0.00            0.05    0        64170    0.00           IDLE
bn04    0        16       0.00            0.01    0        64170    0.00           IDLE
bn05    0        16       0.00            0.04    0        64170    0.00           IDLE
bn06    0        16       0.00            0.05    0        64170    0.00           IDLE
bn07    0        16       0.00            0.03    0        64170    0.00           IDLE
bn08    0        16       0.00            0.04    0        64170    0.00           IDLE
bn09    0        16       0.00            0.08    0        64221    0.00           IDLE
bn10    0        16       0.00            0.05    0        64170    0.00           IDLE
bn11    0        16       0.00            0.02    0        64170    0.00           IDLE
bn12    0        16       0.00            0.07    0        64170    0.00           IDLE
bn13    0        16       0.00            0.01    0        64170    0.00           IDLE
bn14    0        16       0.00            0.03    0        64170    0.00           IDLE
bn15    0        16       0.00            0.02    0        64224    0.00           IDLE
bn16    0        16       0.00            0.06    0        64170    0.00           IDLE
bn17    0        16       0.00            0.03    0        64170    0.00           IDLE
bn18    0        16       0.00            0.03    0        64221    0.00           IDLE
bn19    0        16       0.00            0.02    0        64170    0.00           IDLE
bn20    0        16       0.00            0.07    0        64170    0.00           IDLE
———————————————————————————————————————
Totals:
Node    AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
20      0        320      0.00                    0        1283556  0.00

In this example the user “user” has access to submit workloads to the accounts support and hpcstaff on the Beta cluster. To show associations for the current user:

[user@beta-login ~]$ sacctmgr show assoc user=$USER

Cluster  Account  User  Partition  ...
———————————————————————————————————————
beta     support  user  1    
beta     hpcstaff user  1

Job Statistics and Accounting

The sreport command provides aggregated usage reports by user and account over a specified period. Examples:

By user: sreport -T billing cluster AccountUtilizationByUser Start=2017-01-01 End=2017-12-31

By account: sreport -T billing cluster UserUtilizationByAccount Start=2017-01-01 End=2017-12-31

For all of the sreport options see the sreport man page.

Time Remaining in an Allocation

If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, applications have two means for discovering the time remaining in the allocation.

The first means is to use the sbatch --signal=<sig_num>[@<sig_time>] option to request a signal (like USR1 or USR2) at sig_time number of seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler takes the necessary steps to write a checkpoint file and terminate gracefully.

The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.

Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.
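
A minimal sketch of the first approach is shown below; the 300-second warning, the checkpoint routine, and my_long_running_app are illustrative placeholders (the B: prefix directs the signal to the batch shell so that the trap below can run):

#!/bin/bash
#SBATCH --job-name=checkpoint_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=02:00:00
#SBATCH --account=test
#SBATCH --partition=standard
# Ask Slurm to send SIGUSR1 to the batch shell 300 seconds before the time limit
#SBATCH --signal=B:USR1@300

# Illustrative checkpoint handler: write a checkpoint and end the job cleanly
checkpoint_and_exit() {
    echo "Received USR1; writing checkpoint before the wall clock limit"
    touch checkpoint.done    # placeholder for real checkpoint logic
    exit 0
}
trap checkpoint_and_exit USR1

# Run the long-running work in the background and wait, so the shell
# can handle the signal when it arrives
./my_long_running_app &     # placeholder executable
wait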

Slurm User Guide for Armis2

By | Uncategorized

Go to Armis2 Overview     To search this user guide, use the Command + F (Mac) or Ctrl + F (Win) keyboard shortcuts.

Slurm User Guide for Armis2

Slurm is a combined batch scheduler and resource manager that allows users to run their jobs on the University of Michigan’s high performance computing (HPC) clusters. This document describes the process for submitting and running jobs under the Slurm Workload Manager on the Armis2 cluster.

The Batch Scheduler and Resource Manager

The batch scheduler and resource manager work together to run jobs on an HPC cluster. The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time.  When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job’s allocated resources.  This is also known as “running the job”.

The user can specify conditions for scheduling the job. One condition is the completion (successful or unsuccessful) of an earlier submitted job.  Other conditions include the availability of a specific license or access to a specific hardware accelerator.

Computing Resources

An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory and GPUs.  The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).

Login Resources

Users interact with an HPC cluster through login nodes. Login nodes are a place where users can login, edit files, view job results and submit new jobs. Login nodes are a shared resource and should not be used to run application workloads.

Jobs and Job Steps

A job is an allocation of resources assigned to an individual user for a specified amount of time. Job steps are sets of (possibly parallel) tasks within a job. When a job runs, the scheduler selects and allocates resources to the job. The invocation of the application happens within the batch script, or at the command line for interactive jobs.

When an application is launched using srun, it runs within a “job step”. The srun command causes the simultaneous launching of multiple tasks of a single application. Arguments to srun specify the number of tasks to launch as well as the number of nodes (and CPUs and memory) on which to launch the tasks.

srun can be invoked in parallel or sequentially (by backgrounding them). Furthermore, the number of nodes specified by srun (the -N option) can be less than but no more than the number of nodes (and CPUs and memory) that were allocated to the job.

srun can also be invoked directly at the command line (outside of a job allocation). Doing so will submit a job to the batch scheduler and srun will block until that job is scheduled to run. When the srun job runs, a single job step will be created. The job will complete when that job step terminates.

Batch Jobs

Most work will be queued to be run on Armis2 and is described through a batch script. The sbatch command is used to submit a batch script to Slurm. To submit a batch script simply run the following from a shared file system; those include your home directory, /scratch, and any directory under /nfs that you can normally use in a job on Flux. Output will be sent to this working directory (jobName-jobID.log). Do not submit jobs from /tmp or any of its subdirectories.  sbatch is designed to reject the job at submission time if there are requests or constraints that Slurm cannot fulfill as specified. This gives the user the opportunity to examine the job request and resubmit it with the necessary corrections.

$ sbatch myJob.sh

Anatomy of a Batch Job

The batch job script is composed of three main components:

  • The interpreter used to execute the script
  • #SBATCH directives that convey submission options
  • The application(s) to execute along with its input arguments and options

Example:

#!/bin/bash
# The interpreter used to execute the script

#“#SBATCH” directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-type=BEGIN,END
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m 
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

# The application(s) to execute along with its input arguments and options:

/bin/hostname
sleep 60

How many nodes and processors you request will depend on the capability of your software and what it can do. There are four common scenarios:

Example: One Node, One Processor

This is the simplest case and is shown in the example above. The majority of software cannot use more than this. Some examples of software for which this would be the right configuration are SAS, Stata, R, many Python programs, and most Perl programs.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: One Node, Multiple Processors

This is similar to what a modern desktop or laptop is likely to have. Software that can use more than one processor may be described as multicore, multiprocessor, or multithreaded. Some examples of software that can benefit from this are MATLAB and Stata/MP. You should read the documentation for your software to see if this is one of its capabilities.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, One Process per CPU

This is the classic MPI approach, where multiple machines are requested and one process per processor is started on each node using MPI. This is the way most MPI-enabled software is written to work.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, Multiple CPUs per Process

This is often referred to as the “hybrid mode” MPI approach, where multiple machines are requested and multiple processes are requested. MPI will start a parent process or processes on each node, and those in turn will be able to use more than one processor for threaded calculations.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Common Job Submission Options

Option Slurm Command (#SBATCH) Armis2 Usage
Job name --job-name=<name> --job-name=armis2job1
Account --account=<account> --account=test
Queue --partition=<name> --partition=partitionname

Available partitions: standard, gpu (GPU jobs only), largemem (RAM-intensive jobs only), debug

Wall time limit --time=<hh:mm:ss> --time=02:00:00
Node count --nodes=<count> --nodes=2
Process count per node --ntasks-per-node=<count> --ntasks-per-node=1
Core count (per process) --cpus-per-task=<cores> --cpus-per-task=1
Memory limit --mem=<limit> (Memory per node in MB) --mem=12000m
Minimum memory per processor --mem-per-cpu=<memory> --mem-per-cpu=1000m
Request GPUs --gres=gpu:<count> --gres=gpu:2
Job array --array=<array indices> --array=0-15
Standard output file --output=<file path> (path must exist) --output=/home/%u/%x-%j.log
%u = username
%x = job name
%j = job ID 
Standard error file --error=<file path> (path must exist) --error=/home/%u/error-%x-%j.log
Combine stdout/stderr to stdout --output=<combined out and err file path> --output=/home/%u/%x-%j.log
Copy environment --export=ALL (default)

--export=NONE (to not export environment)

--export=ALL
Copy environment variable --export=<variable=value,var2=val2> --export=EDITOR=/bin/vim
Job dependency --dependency=after:jobID[:jobID...]

--dependency=afterok:jobID[:jobID...]

--dependency=afternotok:jobID[:jobID...]

--dependency=afterany:jobID[:jobID...]

--dependency=after:1234[:1233]
Request software license(s)

--licenses=<application>@slurmdb:<N>

--licenses=stata@slurmdb:1
requests one license for Stata

Request event notification

--mail-type=<events>

Note: multiple mail-type requests may be specified in a comma separated list:

--mail-type=BEGIN,END,NONE,FAIL,REQUEUE,ARRAY_TASKS

--mail-type=BEGIN,END,FAIL

Email address --mail-user=<email address> --mail-user=uniqname@umich.edu
Defer job until the specified time --begin=<date/time> --begin=2020-12-25T12:30:00

Please note that if your job is set to use more than one node, your code must be MPI-enabled to run across those nodes, and you must launch it with srun rather than mpirun or mpiexec.
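
For illustration, a sketch of a two-node MPI batch job might look like this (./my_mpi_app is a placeholder for an MPI-enabled executable; the account, partition, and resource values follow the earlier examples):

#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:30:00
#SBATCH --account=test
#SBATCH --partition=standard

# srun launches one MPI rank per task (8 ranks across the 2 nodes here);
# ./my_mpi_app is a placeholder for an MPI-enabled executable
srun ./my_mpi_app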

Interactive Jobs

An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. Interactive jobs are useful when debugging or interacting with an application. The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[user@login ~]$ srun --pty /bin/bash
srun: job 309 queued and waiting for resources
srun: job 309 has been allocated resources
[user@node0001 ~]$ hostname
node0001.armis2.arc-ts.umich.edu
[user@node0001 ~]$

Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024MB of memory. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes with 4 CPUs and 4GB of memory each:

[user@login ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
srun: job 894 queued and waiting for resources
srun: job 894 has been allocated resources
[user@node0001 ~]$ srun hostname
node0001.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu
node0001.armis2.arc-ts.umich.edu
node0001.armis2.arc-ts.umich.edu
node0001.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu

In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job. See the srun man page for more details.
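
For example, from within the allocation above, a single task could be launched on just one of the allocated nodes (the output shown is illustrative):

[user@node0001 ~]$ srun --nodes=1 --ntasks=1 hostname
node0001.armis2.arc-ts.umich.edu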

GPU and Large Memory Jobs

Jobs can request GPUs with the job submission options --partition=gpu and --gres=gpu:<count>. GPUs can be requested in both Batch and Interactive jobs.

Similarly, jobs can request nodes with large amounts of RAM with --partition=largemem.
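
For illustration, a minimal GPU batch job sketch might look like the following (the account, memory, time, and GPU count are placeholders to adjust for your work):

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4g
#SBATCH --time=01:00:00
#SBATCH --account=test

# nvidia-smi reports the GPU(s) that Slurm has assigned to this job
nvidia-smi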

Job Dependencies

You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using Slurm’s job dependencies options. For example, if you have two jobs, Job1.sh and Job2.sh, you can utilize job dependencies as in the example below.

[user@login]$ sbatch Job1.sh
123213

[user@login]$ sbatch --dependency=afterany:123213 Job2.sh
123214

The flag --dependency=afterany:123213 tells the batch system to start the second job only after completion of the first job. afterany indicates that Job2 will run regardless of the exit status of Job1, i.e. regardless of whether the batch system thinks Job1 completed successfully or unsuccessfully.

Once job 123213 completes, job 123214 will be released by the batch system and then will run as the appropriate nodes become available.

Exit status: The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of ‘0’ means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.
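
Because only the last command determines the reported status, a script can fail explicitly when a critical command fails, so that an afterok dependency will not be satisfied. A minimal sketch (run_analysis and its input file are hypothetical):

#!/bin/bash
#SBATCH --job-name=exit_status_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --account=test
#SBATCH --partition=standard

# If run_analysis fails, exit immediately with a non-zero status so the
# batch system records the job as Failed and afterok dependents do not run
./run_analysis input.dat || exit 1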

There are several options for the --dependency flag that depend on the status of Job1:

--dependency=afterany:Job1 Job2 will start after Job1 completes with any exit status
--dependency=after:Job1 Job2 will start any time after Job1 starts
--dependency=afterok:Job1 Job2 will run only if Job1 completed with an exit status of 0
--dependency=afternotok:Job1 Job2 will run only if Job1 completed with a non-zero exit status

Making several jobs depend on the completion of a single job is done in the example below:

[user@login]$ sbatch Job1.sh 
13205 
[user@login]$ sbatch --dependency=afterany:13205 Job2.sh 
13206 
[user@login]$ sbatch --dependency=afterany:13205 Job3.sh 
13207 
[user@login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E" 
JOBID        NAME            ST   DEPENDENCY                    
13205        Job1.bat        R                                  
13206        Job2.bat        PD   afterany:13205                
13207        Job3.bat        PD   afterany:13205

Making a job depend on the completion of several other jobs: example below.

[user@login]$ sbatch Job1.sh
13201
[user@login]$ sbatch Job2.sh
13202
[user@login]$ sbatch --dependency=afterany:13201,13202 Job3.sh
13203
[user@login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13201        Job1.sh         R                                  
13202        Job2.sh         R                                  
13203        Job3.sh         PD   afterany:13201,afterany:13202

Chaining jobs is most easily done by submitting the second dependent job from within the first job. Example batch script:

#!/bin/bash

cd /data/mydir
run_some_command
sbatch --dependency=afterany:$SLURM_JOB_ID  my_second_job

Job dependencies documentation adapted from https://hpc.nih.gov/docs/userguide.html#depend

Job Arrays

Job arrays are multiple jobs to be executed with identical parameters. Job arrays are submitted with -a <indices> or --array=<indices>. The indices specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a “-” separator: --array=0-15 or --array=0,6,16-32.

A step function can also be specified with a suffix containing a colon and number. For example, --array=0-15:4 is equivalent to --array=0,4,8,12.
A maximum number of simultaneously running tasks from the job array may be specified using a “%” separator. For example, --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is 499999.

To receive mail alerts for each individual array task, --mail-type=ARRAY_TASKS should be added to the Slurm job script. Unless this option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.
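
For illustration, a minimal array job sketch might look like the following (my_program and its input files are placeholders; the %4 suffix limits the array to four concurrently running tasks):

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --array=0-15%4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:30:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=ARRAY_TASKS

# Each array task receives its own SLURM_ARRAY_TASK_ID, commonly used to
# select an input file or parameter set; my_program and its inputs are placeholders
echo "Processing index ${SLURM_ARRAY_TASK_ID}"
./my_program input_${SLURM_ARRAY_TASK_ID}.dat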

Execution Environment

For each job type above, the user has the ability to define the execution environment. This includes environment variable definitions as well as shell limits (bash ulimit or csh limit). sbatch and salloc provide the --export option to convey specific environment variables to the execution environment. sbatch and salloc provide the --propagate option to convey specific shell limits to the execution environment. By default Slurm does not source the files ~/.bashrc or ~/.profile when requesting resources via sbatch (although it does when running srun / salloc).  So, if you have a standard environment that you have set in either of these files or your current shell, then you can do one of the following:

  1. Add the command #SBATCH --get-user-env to your job script (i.e. the module environment is propagated).
  2. Source the configuration file in your job script:
< #SBATCH statements >
source ~/.bashrc

Note: You may want to remove the influence of any other current environment variables by adding #SBATCH --export=NONE to the script. This removes all set/exported variables and then acts as if #SBATCH --get-user-env has been added (module environment is propagated).
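
As a sketch of how these directives can be combined (the resource values are placeholders), a job script header might begin:

#!/bin/bash
#SBATCH --job-name=clean_env_example
# Start from a clean environment, then load the login environment (modules, etc.)
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --account=test
#SBATCH --partition=standard

# Record the environment the job actually sees, for comparison
env | sort > job_environment.txt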

Environment Variables

Slurm recognizes and provides a number of environment variables.

The first category of environment variables are those that Slurm inserts into the job’s execution environment. These convey to the job script and application information such as job ID (SLURM_JOB_ID) and task ID (SLURM_PROCID). For the complete list, see the “OUTPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

The next category of environment variables are those the user can set in their environment to convey default options for every job they submit. These include options such as the wall clock limit. For the complete list, see the “INPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

Finally, Slurm allows the user to customize the behavior and output of some commands using environment variables. For example, one can specify certain fields for the squeue command to display by setting the SQUEUE_FORMAT variable in the environment from which you invoke squeue.
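
For example, a custom squeue layout can be set for the current shell session like this (the format string is only an illustration; see the squeue man page for all field codes):

export SQUEUE_FORMAT="%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
squeue -u $USER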

Commonly Used Environment Variables

Info Slurm Notes
Job name $SLURM_JOB_NAME
Job ID $SLURM_JOB_ID
Submit directory $SLURM_SUBMIT_DIR Slurm jobs start from the submit directory by default.
Submit host $SLURM_SUBMIT_HOST
Node list $SLURM_JOB_NODELIST The Slurm variable has a different format to the PBS one.

To get a list of nodes use:

scontrol show hostnames $SLURM_JOB_NODELIST

Job array index $SLURM_ARRAY_TASK_ID
Queue name $SLURM_JOB_PARTITION
Number of nodes allocated $SLURM_JOB_NUM_NODES

$SLURM_NNODES

Number of processes $SLURM_NTASKS
Number of processes per node $SLURM_TASKS_PER_NODE
Requested tasks per node $SLURM_NTASKS_PER_NODE
Requested CPUs per task $SLURM_CPUS_PER_TASK
Scheduling priority $SLURM_PRIO_PROCESS
Job user $SLURM_JOB_USER
Hostname $HOSTNAME == $SLURM_SUBMIT_HOST Unless a shell is invoked on an allocated resource, the HOSTNAME variable is propagated (copied) from the submit machine, so it will be the same on all allocated nodes.

Job Output

Slurm merges the job’s standard error and output by default and saves it to an output file with a name that includes the job ID (slurm-<job_ID>.out for normal jobs and slurm-<job_ID>_<index>.out for job arrays). You can specify your own output and error files to the sbatch command using the -o /file/to/output and -e /file/to/error options respectively. If both standard out and error should go to the same file, only specify -o /file/to/output. Slurm will append the job’s output to the specified file(s). If you want the output to overwrite any existing files, add the --open-mode=truncate option. The files are written as soon as output is created; output does not spool on the compute node and then get copied to the final location after the job ends. If not specified in the job submission, standard output and error are combined and written into a file in the working directory from which the job was submitted.

For example if I submit job 93 from my home directory, the job output and error will be written to my home directory in a file called slurm-93.out. The file appears while the job is still running.

[user@login ~]$ sbatch test.sh
Submitted batch job 93 
[user@login ~]$ ll slurm-93.out
-rw-r--r-- 1 user hpcstaff 122 Jun 7 15:28 slurm-93.out 
[user@login ~]$ squeue 
JOBID PARTITION NAME    USER ST TIME NODES NODELIST(REASON) 
93    standard  example user R  0:04 1     node0001

If you submit from a working directory which is not a shared filesystem, your output will only be available locally on the compute node and will need to be copied to another location after the job completes. /home, /scratch, and /nfs are all networked filesystems which are available on the login nodes and all compute nodes.

For example if I submit a job from /tmp on the login node, the output will be in /tmp on the compute node.

[user@login tmp]$ pwd
/tmp
[user@login tmp]$ sbatch /home/user/test.sh
Submitted batch job 98
[user@login tmp]$ squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
98    standard      example  user R   0:03  1     node0001 
[user@login tmp]$ ssh node0001 
[user@node0001 ~]$ ll /tmp/slurm-98.out
-rw-r--r-- 1 user hpcstaff 78 Jun 7 15:46 /tmp/slurm-98.out
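
To direct output explicitly rather than relying on the default slurm-<job_ID>.out naming, directives along the following lines can be used (the paths follow the patterns from the submission options table above; account and resource values are placeholders):

#!/bin/bash
#SBATCH --job-name=output_example
# %u = username, %x = job name, %j = job ID; the directories must already exist
#SBATCH --output=/home/%u/%x-%j.log
#SBATCH --error=/home/%u/error-%x-%j.log
# Overwrite any existing file instead of appending to it
#SBATCH --open-mode=truncate
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --account=test
#SBATCH --partition=standard

echo "this line goes to the --output file"
echo "this line goes to the --error file" >&2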

Serial vs. Parallel jobs

Parallel jobs launch applications that are comprised of many processes (aka tasks) that communicate with each other, typically over a high speed switch. Serial jobs launch one or more tasks that work independently on separate problems.

Parallel applications must be launched by the srun command. Serial applications can be launched with srun as well, but it is not required for single-node allocations.

Job Partitions

A cluster is often highly utilized and may not be able to run a job when it is submitted. When this occurs, the job waits in a partition (queue) until the requested resources become available. Specific compute node resources are defined for every job partition. The Slurm partition is synonymous with the term queue.

Each partition can be configured with a set of limits which specify the requirements for every job that can run in that partition. These limits include job size, wall clock limits, and the users who are allowed to run in that partition.

The Armis2 cluster has a “standard” partition for most production jobs, a “gpu” partition for GPU-intensive work, and a “largemem” partition for jobs requiring greater amounts of RAM.

Commands related to partitions include:

sinfo Lists all partitions currently configured
scontrol show partition <name> Provides details about each partition
squeue Lists all jobs currently on the system, one line per job

Job Status

Most of a job’s specifications can be seen by invoking scontrol show job <jobID>.  The job’s batch script can be written to a file by using scontrol write batch_script <jobID> output.txt. If no output file is specified, the script will be written to slurm-<jobID>.sh.

Slurm captures and reports the exit code of the job script (sbatch jobs), as well as the signal that caused the job’s termination when the job was terminated by a signal.

A job’s record remains in Slurm’s memory for 30 minutes after it completes.  scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago.  At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.
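
For example, a completed job’s record can be retrieved from the Slurm database with a command along these lines (the job ID and field list are illustrative):

sacct -j 12345 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS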

Modifying a Batch Job

Many of the batch job specifications can be modified after a batch job is submitted and before it runs.  Typical fields that can be modified include the job size (number of nodes), partition (queue), and wall clock limit.  Job specifications cannot be modified by the user once the job enters the Running state.

Besides displaying a job’s specifications, the scontrol command is used to modify them.  Examples:

scontrol -dd show job <jobID> Displays all of a job’s characteristics
scontrol write batch_script <jobID> Retrieve the batch script for a given job
scontrol update JobId=<jobID> Account=science Change the job’s account to the “science” account
scontrol update JobId=<jobID> Partition=priority Changes the job’s partition to the priority partition

Holding and Releasing a Batch Job

If a user’s job is in the pending state waiting to be scheduled, the user can prevent the job from being scheduled by invoking the scontrol hold <jobID> command to place the job into a Held state. Jobs in the held state do not accrue any job priority based on queue wait time.  Once the user is ready for the job to become a candidate for scheduling once again, they can release the job using the scontrol release <jobID> command.

Signalling and Cancelling a Batch Job

Pending jobs can be cancelled (withdrawn from the queue) using the scancel command (scancel <jobID>).  The scancel command can also be used to terminate a running job.  The default behavior is to issue the job a SIGTERM, wait 30 seconds, and, if processes from the job continue to run, issue a SIGKILL.

The -s option of the scancel command (scancel -s <signal> <jobID>) allows the user to issue any signal to a running job.

Common Job Commands

Command Slurm
Submit a job sbatch <job script>
Delete a job scancel <job ID>
Job status (all) squeue
Job status (by job) squeue -j <job ID>
Job status (by user) squeue -u <user>
Job status (detailed) scontrol show job -dd <job ID>
Show expected start time squeue -j <job ID> --start
Queue list / info scontrol show partition <name>
Node list scontrol show nodes
Node details scontrol show node <node>
Hold a job scontrol hold <job ID>
Release a job scontrol release <job ID>
Cluster status sinfo
Start an interactive job salloc <args> or srun --pty <args>
X forwarding srun --pty --x11 <args>
Read stdout messages at runtime No equivalent command / not needed. Use the --output option instead.
Monitor or review a job’s resource usage sacct -j <job_num> --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed

(see sacct for all format options)

View job batch script scontrol write batch_script <jobID> [filename]
View accounts you can submit to sacctmgr show assoc user=$USER
View users with access to an account sacctmgr show assoc account=<account>
View default submission account and wckey sacctmgr show user <user>

Job States

The basic job states are these:

  • Pending – the job is in the queue, waiting to be scheduled
  • Held – the job was submitted, but was put in the held state (ineligible to run)
  • Running – the job has been granted an allocation.  If it’s a batch job, the batch script has been run
  • Complete – the job has completed successfully
  • Timeout – the job was terminated for running longer than its wall clock limit
  • Preempted – the running job was terminated to reassign its resources to a higher QoS job
  • Failed – the job terminated with a non-zero status
  • Node Fail – the job terminated after a compute node reported a problem

For the complete list, see the “JOB STATE CODES” section under the squeue man page.

Pending Reasons

A pending job can remain pending for a number of reasons:

  • Dependency – the pending job is waiting for another job to complete
  • Priority – the job is not high enough in the queue
  • Resources – the job is high in the queue, but there are not enough resources to satisfy the job’s request
  • Partition Down – the queue is currently closed to running any new jobs

For the complete list, see the “JOB REASON CODES” section under the squeue man page.

Displaying Computing Resources

As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs.  The resources of each compute node can be seen by running the scontrol show node command.  The characteristics of each partition can be seen by running the scontrol show partition command.  Finally, a load summary report for each partition can be seen by running sinfo.

To show a summary of cluster resources on a per partition basis:

[user@login ~]$ sinfo
PARTITION     AVAIL    TIMELIMIT    NODES STATE   NODELIST
standard*     up       14-00:00:00  5     comp    node[0016-0020]
standard*     up       14-00:00:00  15    idle    node[0001-0015]
gpu           up       14-00:00:00  1     idle    node0015
[user@login ~]$ sstate
———————————————————————————————————————
Node        AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
node0001    0        16       0.00            0.03    0        64170    0.00           IDLE
node0002    0        16       0.00            0.04    0        64170    0.00           IDLE
node0003    0        16       0.00            0.05    0        64170    0.00           IDLE
node0004    0        16       0.00            0.01    0        64170    0.00           IDLE
node0005    0        16       0.00            0.04    0        64170    0.00           IDLE
node0006    0        16       0.00            0.05    0        64170    0.00           IDLE
node0007    0        16       0.00            0.03    0        64170    0.00           IDLE
node0008    0        16       0.00            0.04    0        64170    0.00           IDLE
node0009    0        16       0.00            0.08    0        64221    0.00           IDLE
node0010    0        16       0.00            0.05    0        64170    0.00           IDLE
node0011    0        16       0.00            0.02    0        64170    0.00           IDLE
node0012    0        16       0.00            0.07    0        64170    0.00           IDLE
node0013    0        16       0.00            0.01    0        64170    0.00           IDLE
node0014    0        16       0.00            0.03    0        64170    0.00           IDLE
node0015    0        16       0.00            0.02    0        64224    0.00           IDLE
node0016    0        16       0.00            0.06    0        64170    0.00           IDLE
node0017    0        16       0.00            0.03    0        64170    0.00           IDLE
node0018    0        16       0.00            0.03    0        64221    0.00           IDLE
node0019    0        16       0.00            0.02    0        64170    0.00           IDLE
node0020    0        16       0.00            0.07    0        64170    0.00           IDLE
———————————————————————————————————————
Totals:
Node    AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
20      0        320      0.00                    0        1283556  0.00

In this example the user “user” has access to submit workloads to the accounts support and hpcstaff on the Armis2 cluster. To show associations for the current user:

[user@login ~]$ sacctmgr show assoc user=$USER

Cluster  Account  User  Partition  ...
———————————————————————————————————————
armis2   support  user  1    
armis2   hpcstaff user  1

Job Statistics and Accounting

The sreport command provides aggregated usage reports by user and account over a specified period. Examples:

By user: sreport -T billing cluster AccountUtilizationByUser Start=2017-01-01 End=2017-12-31

By account: sreport -T billing cluster UserUtilizationByAccount Start=2017-01-01 End=2017-12-31

For all of the sreport options see the sreport man page.

Time Remaining in an Allocation

If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, applications have two means for discovering the time remaining in the allocation.

The first means is to use the sbatch --signal=<sig_num>[@<sig_time>] option to request a signal (like USR1 or USR2) at sig_time number of seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler takes the necessary steps to write a checkpoint file and terminate gracefully.

The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.

Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.

Great Lakes Timeline

By | Great Lakes, HPC

Great Lakes Timeline

Great Lakes Installation Begins

October 1, 2018

Dell, Mellanox, and DDN will be delivering and installing the hardware to deliver the new Great Lakes service.  These teams will be working alongside the…

Beta HPC cluster available

October 2, 2018

The Beta HPC cluster was introduced to enable HPC users to begin migrating their Torque job scripts to Slurm and test their workflows on a Slurm-based…

Great Lakes Beta created

November 1, 2018

Great Lakes Beta is installed for HPC support staff to build and test software packages on the same hardware in Great Lakes.

Beta Cluster testing continues

January 8, 2019

If you are a current HPC user on Flux or Armis and have not used Slurm before, we highly recommend you login and test your…

HPC OnDemand Available for Beta

January 22, 2019

The replacement for ARC Connect, called HPC OnDemand, will be available for users.  This will allow users to submit jobs via the web rather than…

Great Lakes primary installation complete

February 1, 2019

The Great Lakes installation is primarily complete other than waiting for HDR firmware public release for the new InfiniBand system.  The current InfiniBand system is…

Great Lakes Open for Early User Testing

March 4, 2019

We will be looking for sets of friendly users who will be able to test different aspects of the system to submit their workloads to…

Great Lakes firmware updates complete

March 29, 2019

Great Lakes firmware for the InfiniBand networking system will be updated and verified by Dell and Mellanox, doubling the speed of the early-release system to 100…

Great Lakes Early User Testing Continues

April 1, 2019

After the updates to the InfiniBand firmware, we will ask our early users to continue testing and validate that everything is working properly.

Great Lakes Early User Testing Ends

May 3, 2019

The Great Lakes early user period for testing will end. Great Lakes will transition into production and users will be able to submit work as…

Great Lakes Open for General Availability

May 6, 2019

Assuming all initial testing is successful, we expect that Great Lakes will become available for University users.

Beta Retires

September 2, 2019

Beta will be retired after both Flux and Armis have been retired, as the purpose of Beta is to assist users to transition to the…

If you have questions, please send email to hpc-support@umich.edu.

Order Service

Great Lakes will be available in the first half of 2019. This page will provide updates on the progress of the project.

Please contact hpc-support@umich.edu with any questions.

Beta Timeline

By | Beta, HPC

Beta Timeline

Beta HPC cluster available

October 2, 2018

The Beta HPC cluster was introduced to enable HPC users to begin migrating their Torque job scripts to Slurm and test their workflows on a Slurm-based…

Beta Cluster testing continues

January 8, 2019

If you are a current HPC user on Flux or Armis and have not used Slurm before, we highly recommend you login and test your…

HPC OnDemand Available for Beta

January 22, 2019

The replacement for ARC Connect, called HPC OnDemand, will be available for users.  This will allow users to submit jobs via the web rather than…

Beta Retires

September 2, 2019

Beta will be retired after both Flux and Armis have been retired, as the purpose of Beta is to assist users to transition to the…

If you have questions, please send email to hpc-support@umich.edu.

Getting Access

Beta is intended for small-scale testing to convert Torque/PBS scripts to Slurm. No sensitive data of any type should be used on Beta.

To request:

1. Fill out the ARC-TS HPC account request form.

Because this is a test platform, there is no cost for using Beta.


Great Lakes Configuration

By | Great Lakes, HPC

Great Lakes Configuration

Hardware

Computing

Node Type Standard Large Memory GPU Visualization
Number of Nodes 380 3 20 4
Processors 2x 3.0 GHz Intel Xeon Gold 6154 2x 3.0 GHz Intel Xeon Gold 6154 2x 2.4 GHz Intel Xeon Gold 6148 2x 2.4 GHz Intel Xeon Gold 6148
Cores per Node 36 36 40 40
RAM 192 GB 1.5 TB 192 GB 192 GB
Storage 480 GB SSD + 4 TB HDD 4 TB HDD 4 TB HDD 4 TB HDD
GPU N/A N/A 2x NVidia Tesla V100 1x NVidia Tesla P40

Networking

The compute nodes are all interconnected with InfiniBand networking, capable of 100 Gb/s throughput. In addition to the InfiniBand networking, there is a gigabit Ethernet network that also connects all of the nodes. This is used for node management and NFS file system access.

Storage

The high-speed scratch file system provides 2 petabytes of storage at approximately 80 GB/s performance (compared to 8 GB/s on Flux).

Operation

Computing jobs on Great Lakes are managed completely through Slurm.

Software

There are three layers of software on Great Lakes.

Operating Software

The Great Lakes cluster runs CentOS 7. We update the operating system on Great Lakes as CentOS releases new versions and our library of third-party applications offers support. Due to the need to support several types of drivers (AFS and Lustre file system drivers, InfiniBand network drivers and NVIDIA GPU drivers) and dozens of third party applications, we are cautious in upgrading and can lag CentOS’s releases by months.

Compilers and Parallel and Scientific Libraries

Great Lakes supports the GNU Compiler Collection, the Intel Compilers, and the PGI Compilers for C and Fortran. The Great Lakes cluster’s parallel library is OpenMPI; the default versions are 1.10.7 (i686) and 3.1.2 (x86_64), and a limited set of earlier versions is available.  Great Lakes provides the Intel Math Kernel Library (MKL) set of high-performance mathematical libraries. Other common scientific libraries are compiled from source and include HDF5, NetCDF, FFTW3, Boost, and others.

Please contact us if you have questions about the availability of, or support for, any other compilers or libraries.

Application Software

Great Lakes supports a wide range of application software. We license common engineering simulation software (e.g. Ansys, Abaqus, VASP) and we compile others for use on Great Lakes (e.g. OpenFOAM and Abinit). We also have software for statistics, mathematics, debugging and profiling, etc. Please contact us if you wish to inquire about the current availability of a particular application.

GPUs

Great Lakes has 40 NVidia Tesla V100 GPUs connected to 20 nodes. 4 NVidia Tesla P40 GPUs connected to 4 nodes are also available for visualization work.

GPU Model NVidia Tesla V100 NVidia Tesla P40
Number and Type of GPU one Volta GPU one Pascal GPU
Peak double precision floating point perf. 7 Tflops N/A
Peak single precision floating point perf. 14 Tflops 12 Tflops
Memory bandwidth (ECC off) 900 GB/sec 346 GB/sec
Memory size 32 GB HBM2 24 GB GDDR5
CUDA cores 5120 3840

If you have questions, please send email to hpc-support@umich.edu.


Slurm User Guide for Beta

By | Beta

Go to Beta Overview     To search this user guide, use the Command + F (Mac) or Ctrl + F (Win) keyboard shortcuts.

Slurm User Guide for Beta

Slurm is a combined batch scheduler and resource manager that allows users to run their jobs on the University of Michigan’s high performance computing (HPC) clusters. This document describes the process for submitting and running jobs under the Slurm Workload Manager on the Beta test cluster.

The Batch Scheduler and Resource Manager

The batch scheduler and resource manager work together to run jobs on an HPC cluster. The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time.  When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job’s allocated resources.  This is also known as “running the job”.

The user can specify conditions for scheduling the job. One condition is the completion (successful or unsuccessful) of an earlier submitted job.  Other conditions include the availability of a specific license or access to a specific hardware accelerator.

Computing Resources

An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory and GPUs.  The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).

Login Resources

Users interact with an HPC cluster through login nodes. Login nodes are a place where users can login, edit files, view job results and submit new jobs. Login nodes are a shared resource and should not be used to run application workloads.

Jobs and Job Steps

A job is an allocation of resources assigned to an individual user for a specified amount of time. Job steps are sets of (possibly parallel) tasks within a job. When a job runs, the scheduler selects and allocates resources to the job. The invocation of the application happens within the batch script, or at the command line for interactive jobs.

When an application is launched using srun, it runs within a “job step”. The srun command causes the simultaneous launching of multiple tasks of a single application. Arguments to srun specify the number of tasks to launch as well as the number of nodes (and CPUs and memory) on which to launch the tasks.

srun can be invoked in parallel or sequentially (by backgrounding them). Furthermore, the number of nodes specified by srun (the -N option) can be less than but no more than the number of nodes (and CPUs and memory) that were allocated to the job.

srun can also be invoked directly at the command line (outside of a job allocation). Doing so will submit a job to the batch scheduler and srun will block until that job is scheduled to run. When the srun job runs, a single job step will be created. The job will complete when that job step terminates.

Batch Jobs

The sbatch command is used to submit a batch script to Slurm. It is designed to reject the job at submission time if there are requests or constraints that Slurm cannot fulfill as specified. This gives the user the opportunity to examine the job request and resubmit it with the necessary corrections. To submit a batch script simply run the following from a shared file system; those include your home directory, /scratch, and any directory under /nfs that you can normally use in a job on Flux. Output will be sent to this working directory (jobName-jobID.log). Do not submit jobs from /tmp or any of its subdirectories.

$ sbatch myJob.sh

The batch job script is composed of three main components:

  • The interpreter used to execute the script
  • #SBATCH directives that convey submission options
  • The application(s) to execute along with its input arguments and options

Example:

#!/bin/bash
# The interpreter used to execute the script

#“#SBATCH” directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-type=BEGIN,END
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m 
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

# The application(s) to execute along with its input arguments and options:

/bin/hostname
sleep 60

How many nodes and processors you request will depend on the capability of your software and what it can do. There are four common scenarios:

Example: One Node, One Processor

This is the simplest case and is shown in the example above. The majority of software cannot use more than this. Some examples of software for which this would be the right configuration are SAS, Stata, R, many Python programs, and most Perl programs.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: One Node, Multiple Processors

This is similar to what a modern desktop or laptop is likely to have. Software that can use more than one processor may be described as multicore, multiprocessor, or multithreaded. Some examples of software that can benefit from this are MATLAB and Stata/MP. You should read the documentation for your software to see if this is one of its capabilities.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, One Process per CPU

This is the classic MPI approach, where multiple machines are requested and one process per processor is started on each node using MPI. This is the way most MPI-enabled software is written to work.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, Multiple CPUs per Process

This is often referred to as the “hybrid mode” MPI approach, where multiple machines are requested and multiple processes are requested. MPI will start a parent process or processes on each node, and those in turn will be able to use more than one processor for threaded calculations.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Common Job Submission Options

Option Slurm Command (#SBATCH) Beta Usage
Job name --job-name=<name> --job-name=betajob1
Account --account=<account> --account=test
Queue --partition=<name> --partition=partitionname

Available partitions: standard, gpu (GPU jobs only), largemem (RAM-intensive jobs only), debug

Wall time limit --time=<hh:mm:ss> --time=02:00:00
Node count --nodes=<count> --nodes=2
Process count per node --ntasks-per-node=<count> --ntasks-per-node=1
Core count (per process) --cpus-per-task=<cores> --cpus-per-task=1
Memory limit --mem=<limit> (Memory per node in MB) --mem=12000m
Minimum memory per processor --mem-per-cpu=<memory> --mem-per-cpu=1000m
Request GPUs --gres=gpu:<count> --gres=gpu:2
Job array --array=<array indices> --array=0-15
Standard output file --output=<file path> (path must exist) --output=/home/%u/%x-%j.log
%u = username
%x = job name
%j = job ID 
Standard error file --error=<file path> (path must exist) --error=/home/%u/error-%x-%j.log
Combine stdout/stderr to stdout --output=<combined out and err file path> --output=/home/%u/%x-%j.log
Copy environment --export=ALL (default)

--export=NONE (to not export environment)

--export=ALL
Copy environment variable --export=<variable=value,var2=val2> --export=EDITOR=/bin/vim
Job dependency --dependency=after:jobID[:jobID...]

--dependency=afterok:jobID[:jobID...]

--dependency=afternotok:jobID[:jobID...]

--dependency=afterany:jobID[:jobID...]

--dependency=after:1234[:1233]
Request software license(s)

--licenses=<application>@slurmdb:<N>

--licenses=stata@slurmdb:1
requests one license for Stata

Request event notification

--mail-type=<events>

Note: multiple mail-type requests may be specified in a comma separated list:

--mail-type=BEGIN,END,NONE,FAIL,REQUEUE,ARRAY_TASKS

--mail-type=BEGIN,END,FAIL

Email address --mail-user=<email address> --mail-user=uniqname@umich.edu
Defer job until the specified time --begin=<date/time> --begin=2020-12-25T12:30:00

Please note that if your job is set to use more than one node, your code must be MPI-enabled to run across those nodes, and you must launch it with srun rather than mpirun or mpiexec.

Interactive Jobs

An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. Interactive jobs are useful when debugging or interacting with an application. The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[user@beta-login ~]$ srun --pty /bin/bash
srun: job 309 queued and waiting for resources
srun: job 309 has been allocated resources
[user@bn01 ~]$ hostname
bn01.stage.arc-ts.umich.edu
[user@bn01 ~]$

Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024MB of memory. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes with 4 CPUs and 4GB of memory each:

[user@beta-login ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
srun: job 894 queued and waiting for resources
srun: job 894 has been allocated resources
[user@bn01 ~]$ srun hostname
bn01.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu
bn01.stage.arc-ts.umich.edu
bn01.stage.arc-ts.umich.edu
bn01.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu
bn02.stage.arc-ts.umich.edu

In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job. See the srun man page for more details.

GPU and Large Memory Jobs

Jobs can request GPUs with the job submission options --partition=gpu and --gres=gpu:<count>. GPUs can be requested in both Batch and Interactive jobs.

Similarly, jobs can request nodes with large amounts of RAM with --partition=largemem.  The largemem (and debug) partition’s nodes on Beta have the same configuration as standard nodes, so this partition is just for testing.  Great Lakes will have high-RAM nodes on this partition.

Requesting software licenses

Many of the software packages that are licensed for use on ARC clusters are licensed for a limited number of concurrent uses. If you will use one of those packages, then you must request a license or licenses in your submission script. As an example, to request one Stata license, you would use

#SBATCH --licenses=stata@slurmdb:1

The list of software can be found from Beta by using the command

$ scontrol show licenses

Job Dependencies

You may want to run a set of jobs sequentially, so that the second job runs only after the first one has completed. This can be accomplished using Slurm’s job dependencies options. For example, if you have two jobs, Job1.sh and Job2.sh, you can utilize job dependencies as in the example below.

[user@beta-login]$ sbatch Job1.sh
123213

[user@beta-login]$ sbatch --dependency=afterany:123213 Job2.sh
123214

The flag --dependency=afterany:123213 tells the batch system to start the second job only after completion of the first job. afterany indicates that Job2 will run regardless of the exit status of Job1, i.e. regardless of whether the batch system thinks Job1 completed successfully or unsuccessfully.

Once job 123213 completes, job 123214 will be released by the batch system and then will run as the appropriate nodes become available.

Exit status: The exit status of a job is the exit status of the last command that was run in the batch script. An exit status of ‘0’ means that the batch system thinks the job completed successfully. It does not necessarily mean that all commands in the batch script completed successfully.

There are several options for the --dependency flag that depend on the status of Job1:

--dependency=afterany:Job1 Job2 will start after Job1 completes with any exit status
--dependency=after:Job1 Job2 will start any time after Job1 starts
--dependency=afterok:Job1 Job2 will run only if Job1 completed with an exit status of 0
--dependency=afternotok:Job1 Job2 will run only if Job1 completed with a non-zero exit status

Making several jobs depend on the completion of a single job is done in the example below:

[user@beta-login]$ sbatch Job1.sh 
13205 
[user@beta-login]$ sbatch --dependency=afterany:13205 Job2.sh 
13206 
[user@beta-login]$ sbatch --dependency=afterany:13205 Job3.sh 
13207 
[user@beta-login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E" 
JOBID        NAME            ST   DEPENDENCY                    
13205        Job1.bat        R                                  
13206        Job2.bat        PD   afterany:13205                
13207        Job3.bat        PD   afterany:13205

Making a job depend on the completion of several other jobs: example below.

[user@beta-login]$ sbatch Job1.sh
13201
[user@beta-login]$ sbatch Job2.sh
13202
[user@beta-login]$ sbatch --dependency=afterany:13201,13202 Job3.sh
13203
[user@beta-login]$ squeue -u $USER -S S,i,M -o "%12i %15j %4t %30E"
JOBID        NAME            ST   DEPENDENCY                    
13201        Job1.sh         R                                  
13202        Job2.sh         R                                  
13203        Job3.sh         PD   afterany:13201,afterany:13202

Chaining jobs is most easily done by submitting the second dependent job from within the first job. Example batch script:

#!/bin/bash

cd /data/mydir
run_some_command
sbatch --dependency=afterany:$SLURM_JOB_ID  my_second_job

Job dependencies documentation adapted from https://hpc.nih.gov/docs/userguide.html#depend

Job Arrays

Job arrays are multiple jobs to be executed with identical parameters. Job arrays are submitted with -a <indices> or --array=<indices>. The indices specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a “-” separator: --array=0-15 or --array=0,6,16-32.

A step function can also be specified with a suffix containing a colon and number. For example, --array=0-15:4 is equivalent to --array=0,4,8,12.
A maximum number of simultaneously running tasks from the job array may be specified using a “%” separator. For example, --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is 499999.

To receive mail alerts for each individual array task, --mail-type=ARRAY_TASKS should be added to the Slurm job script. Unless this option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.
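
For example, a minimal sketch of a job array script (the program and input file names are placeholders):

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --array=1-10%4        # 10 array tasks, at most 4 running at once
#SBATCH --time=00:10:00

# Each array task receives its own index in SLURM_ARRAY_TASK_ID and can use it
# to pick its input file (input_1.dat, input_2.dat, ... are assumed names).
./my_program input_${SLURM_ARRAY_TASK_ID}.dat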

Execution Environment

For each job type above, the user has the ability to define the execution environment. This includes environment variable definitions as well as shell limits (bash ulimit or csh limit). sbatch and salloc provide the --export option to convey specific environment variables to the execution environment, and the --propagate option to convey specific shell limits. By default, Slurm does not source the files ~/.bashrc or ~/.profile when requesting resources via sbatch (although it does when running srun/salloc). So, if you have a standard environment set in either of these files or in your current shell, you can do one of the following:

  1. Add the directive #SBATCH --get-user-env to your job script (i.e., the module environment is propagated).
  2. Source the configuration file in your job script:
< #SBATCH statements >
source ~/.bashrc

Note: You may want to remove the influence of any other current environment variables by adding #SBATCH --export=NONE to the script. This removes all set/exported variables and then acts as if #SBATCH --get-user-env has been added (module environment is propagated).
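
For example, a minimal sketch of a job script header that starts from a clean environment and then rebuilds it explicitly (the module name is a placeholder):

#!/bin/bash
#SBATCH --job-name=clean_env
#SBATCH --export=NONE          # do not inherit variables from the submitting shell
#SBATCH --time=00:05:00

# Rebuild the environment inside the job.
source ~/.bashrc
module load gcc                # placeholder module name

./my_program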

Environment Variables

Slurm recognizes and provides a number of environment variables.

The first category of environment variables are those that Slurm inserts into the job’s execution environment. These convey to the job script and application information such as job ID (SLURM_JOB_ID) and task ID (SLURM_PROCID). For the complete list, see the “OUTPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

The next category of environment variables are those the user can set in their environment to convey default options for every job they submit. These include options such as the wall clock limit. For the complete list, see the “INPUT ENVIRONMENT VARIABLES” section under the sbatch, salloc, and srun man pages.

Finally, Slurm allows the user to customize the behavior and output of some commands using environment variables. For example, one can specify certain fields for the squeue command to display by setting the SQUEUE_FORMAT variable in the environment from which you invoke squeue.
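
For example, a sketch of setting SQUEUE_FORMAT in a bash shell so squeue shows a chosen set of fields (this particular field list is just an illustration):

# Add to ~/.bashrc or run in the current shell before invoking squeue.
export SQUEUE_FORMAT="%12i %15j %8u %4t %10M %20R"
squeue -u $USER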

Commonly Used Environment Variables

Info Slurm Notes
Job name $SLURM_JOB_NAME
Job ID $SLURM_JOB_ID
Submit directory $SLURM_SUBMIT_DIR Slurm jobs start from the submit directory by default.
Submit host $SLURM_SUBMIT_HOST
Node list $SLURM_JOB_NODELIST The Slurm variable has a different format to the PBS one. To get a list of nodes, use: scontrol show hostnames $SLURM_JOB_NODELIST
Job array index $SLURM_ARRAY_TASK_ID
Queue name $SLURM_JOB_PARTITION
Number of nodes allocated $SLURM_JOB_NUM_NODES or $SLURM_NNODES
Number of processes $SLURM_NTASKS
Number of processes per node $SLURM_TASKS_PER_NODE
Requested tasks per node $SLURM_NTASKS_PER_NODE
Requested CPUs per task $SLURM_CPUS_PER_TASK
Scheduling priority $SLURM_PRIO_PROCESS
Job user $SLURM_JOB_USER
Hostname $HOSTNAME == $SLURM_SUBMIT_HOST Unless a shell is invoked on an allocated resource, the HOSTNAME variable is propagated (copied) from the submit machine, so it will be the same on all allocated nodes.
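
For example, a minimal sketch of a batch script that prints several of these variables to its output file:

#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:05:00

echo "Job ID:     $SLURM_JOB_ID"
echo "Partition:  $SLURM_JOB_PARTITION"
echo "Nodes:      $SLURM_JOB_NUM_NODES"
echo "Tasks:      $SLURM_NTASKS"
echo "Node list:  $(scontrol show hostnames $SLURM_JOB_NODELIST)"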

Job Output

Slurm merges the job’s standard error and output by default and saves them to an output file whose name includes the job ID (slurm-<job_ID>.out for normal jobs and slurm-<job_ID>_<index>.out for array jobs). You can specify your own output and error files to the sbatch command using the -o /file/to/output and -e /file/to/error options, respectively. If both standard output and error should go to the same file, specify only -o /file/to/output. Slurm appends the job’s output to the specified file(s); if you want the output to overwrite any existing files, add the --open-mode=truncate option. The files are written as soon as output is created; output does not spool on the compute node and then get copied to the final location after the job ends. If not specified in the job submission, standard output and error are combined and written into a file in the working directory from which the job was submitted.
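
For example, a minimal sketch of the corresponding #SBATCH directives (the paths are placeholders; %x expands to the job name and %j to the job ID):

#!/bin/bash
#SBATCH --job-name=output_demo
# Send stdout and stderr to separate files.
#SBATCH --output=/home/user/logs/%x-%j.out
#SBATCH --error=/home/user/logs/%x-%j.err
# Overwrite rather than append if the files already exist.
#SBATCH --open-mode=truncate

./my_program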

For example if I submit job 93 from my home directory, the job output and error will be written to my home directory in a file called slurm-93.out. The file appears while the job is still running.

[user@beta-login ~]$ sbatch test.sh
Submitted batch job 93 
[user@beta-login ~]$ ll slurm-93.out
-rw-r--r-- 1 user hpcstaff 122 Jun 7 15:28 slurm-93.out 
[user@beta-login ~]$ squeue 
JOBID PARTITION NAME    USER ST TIME NODES NODELIST(REASON) 
93    standard  example user R  0:04 1     bn02

If you submit from a working directory which is not a shared filesystem, your output will only be available locally on the compute node and will need to be copied to another location after the job completes. /home, /scratch, and /nfs are all networked filesystems which are available on the login nodes and all compute nodes.

For example if I submit a job from /tmp on the login node, the output will be in /tmp on the compute node.

[user@beta-login tmp]$ pwd
/tmp
[user@beta-login tmp]$ sbatch /home/user/test.sh
Submitted batch job 98
[user@beta-login tmp]$ squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
98    standard      example  user R   0:03  1     bn02
[user@beta-login tmp]$ ssh bn02
[user@bn02 ~]$ ll /tmp/slurm-98.out
-rw-r--r-- 1 user hpcstaff 78 Jun 7 15:46 /tmp/slurm-98.out

Serial vs. Parallel jobs

Parallel jobs launch applications that consist of many processes (also known as tasks) that communicate with each other, typically over a high-speed switch. Serial jobs launch one or more tasks that work independently on separate problems.

Parallel applications must be launched by the srun command. Serial applications can also be launched with srun, but this is not required for single-node allocations.
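
For example, a minimal sketch of a batch script that launches a parallel (MPI) application with srun (the module and program names are placeholders):

#!/bin/bash
#SBATCH --job-name=mpi_demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00

module load openmpi            # placeholder module name

# srun launches one task per allocated task slot (16 tasks in this example).
srun ./my_mpi_program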

Job Partitions

A cluster is often highly utilized and may not be able to run a job when it is submitted. When this occurs, the job waits in a partition until resources become available. Specific compute node resources are defined for every partition. In Slurm, the term partition is synonymous with the term queue.

Each partition can be configured with a set of limits which specify the requirements for every job that can run in that partition. These limits include job size, wall clock limits, and the users who are allowed to run in that partition.

The Beta cluster currently has the “standard” partition, used for most production jobs.  The “gpu” partition is currently running a single node and should only be used for GPU-intensive tasks.

Commands related to partitions include:

sinfo Lists all partitions currently configured
scontrol show partition <name> Provides details about each partition
squeue Lists all jobs currently on the system, one line per job

Job Status

Most of a job’s specifications can be seen by invoking scontrol show job <jobID>.  The job’s batch script can be written to a file using scontrol write batch_script <jobID> output.txt. If no output file is specified, the script will be written to slurm-<jobID>.sh.

Slurm captures and reports the exit code of the job script (for sbatch jobs), as well as the signal that caused the job’s termination when the job was terminated by a signal.

A job’s record remains in Slurm’s memory for 30 minutes after it completes.  scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago.  At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.

Modifying a Batch Job

Many of the batch job specifications can be modified after a batch job is submitted and before it runs.  Typical fields that can be modified include the job size (number of nodes), partition (queue), and wall clock limit.  Job specifications cannot be modified by the user once the job enters the Running state.

Beside displaying a job’s specifications, the scontrol command is used to modify them.  Examples:

scontrol -dd show job <jobID> Displays all of a job’s characteristics
scontrol write batch_script <jobID> Retrieves the batch script for a given job
scontrol update JobId=<jobID> Account=science Changes the job’s account to the “science” account
scontrol update JobId=<jobID> Partition=priority Changes the job’s partition to the priority partition

Holding and Releasing a Batch Job

If a user’s job is in the pending state waiting to be scheduled, the user can prevent the job from being scheduled by invoking the scontrol hold <jobID> command to place the job into a Held state. Jobs in the held state do not accrue any job priority based on queue wait time.  Once the user is ready for the job to become a candidate for scheduling once again, they can release the job using the scontrol release <jobID> command.

Signalling and Cancelling a Batch Job

Pending jobs can be cancelled (withdrawn from the queue) using the scancel command (scancel <jobID>).  The scancel command can also be used to terminate a running job.  The default behavior is to issue the job a SIGTERM, wait 30 seconds, and if processes from the job continue to run, issue a SIGKILL command.

The -s option of the scancel command (scancel -s <signal> <jobID>) allows the user to issue any signal to a running job.

Common Job Commands

Command Slurm
Submit a job sbatch <job script>
Delete a job scancel <job ID>
Job status (all) squeue
Job status (by job) squeue -j <job ID>
Job status (by user) squeue -u <user>
Job status (detailed) scontrol show job -dd <job ID>
Show expected start time squeue -j <job ID> --start
Queue list / info scontrol show partition <name>
Node list scontrol show nodes
Node details scontrol show node <node>
Hold a job scontrol hold <job ID>
Release a job scontrol release <job ID>
Cluster status sinfo
Start an interactive job salloc <args> or srun --pty <args>
X forwarding srun --pty --x11 <args>
Read stdout messages at runtime No equivalent command / not needed. Use the --output option instead.
Monitor or review a job’s resource usage sacct -j <job_num> --format JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,Elapsed

(see sacct for all format options)

View job batch script scontrol write batch_script <jobID> [filename]
View accounts you can submit to sacctmgr show assoc user=$USER
View users with access to an account sacctmgr show assoc account=<account>
View default submission account and wckey sacctmgr show user <user>

Job States

The basic job states are these:

  • Pending – the job is in the queue, waiting to be scheduled
  • Held – the job was submitted, but was put in the held state (ineligible to run)
  • Running – the job has been granted an allocation.  If it’s a batch job, the batch script has been run
  • Complete – the job has completed successfully
  • Timeout – the job was terminated for running longer than its wall clock limit
  • Preempted – the running job was terminated to reassign its resources to a higher QoS job
  • Failed – the job terminated with a non-zero status
  • Node Fail – the job terminated after a compute node reported a problem

For the complete list, see the “JOB STATE CODES” section under the squeue man page.

Pending Reasons

A pending job can remain pending for a number of reasons:

  • Dependency – the pending job is waiting for another job to complete
  • Priority – the job is not high enough in the queue
  • Resources – the job is high in the queue, but there are not enough resources to satisfy the job’s request
  • Partition Down – the queue is currently closed to running any new jobs

For the complete list, see the “JOB REASON CODES” section under the squeue man page.

Displaying Computing Resources

As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs.  The resources of each compute node can be seen by running the scontrol show node command.  The characteristics of each partition can be seen by running the scontrol show partition command.  Finally, a load summary report for each partition can be seen by running sinfo.

To show a summary of cluster resources on a per partition basis:

[user@beta-login ~]$ sinfo
PARTITION     AVAIL    TIMELIMIT    NODES STATE   NODELIST
standard*     up       14-00:00:00  5     comp    bn[16-20]
standard*     up       14-00:00:00  15    idle    bn[01-15]
gpu           up       14-00:00:00  1     idle    bn15
[user@beta-login ~]$ sstate
———————————————————————————————————————
Node    AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
bn01    0        16       0.00            0.03    0        64170    0.00           IDLE
bn02    0        16       0.00            0.04    0        64170    0.00           IDLE
bn03    0        16       0.00            0.05    0        64170    0.00           IDLE
bn04    0        16       0.00            0.01    0        64170    0.00           IDLE
bn05    0        16       0.00            0.04    0        64170    0.00           IDLE
bn06    0        16       0.00            0.05    0        64170    0.00           IDLE
bn07    0        16       0.00            0.03    0        64170    0.00           IDLE
bn08    0        16       0.00            0.04    0        64170    0.00           IDLE
bn09    0        16       0.00            0.08    0        64221    0.00           IDLE
bn10    0        16       0.00            0.05    0        64170    0.00           IDLE
bn11    0        16       0.00            0.02    0        64170    0.00           IDLE
bn12    0        16       0.00            0.07    0        64170    0.00           IDLE
bn13    0        16       0.00            0.01    0        64170    0.00           IDLE
bn14    0        16       0.00            0.03    0        64170    0.00           IDLE
bn15    0        16       0.00            0.02    0        64224    0.00           IDLE
bn16    0        16       0.00            0.06    0        64170    0.00           IDLE
bn17    0        16       0.00            0.03    0        64170    0.00           IDLE
bn18    0        16       0.00            0.03    0        64221    0.00           IDLE
bn19    0        16       0.00            0.02    0        64170    0.00           IDLE
bn20    0        16       0.00            0.07    0        64170    0.00           IDLE
———————————————————————————————————————
Totals:
Node    AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
———————————————————————————————————————
20      0        320      0.00                    0        1283556  0.00

In this example the user “user” has access to submit workloads to the accounts support and hpcstaff on the Beta cluster. To show associations for the current user:

[user@beta-login ~]$ sacctmgr show assoc user=$USER

Cluster  Account  User  Partition  ...
———————————————————————————————————————
beta     support  user  1    
beta     hpcstaff user  1

Job Statistics and Accounting

The sreport command provides aggregated usage reports by user and account over a specified period. Examples:

By user: sreport -T billing cluster AccountUtilizationByUser Start=2017-01-01 End=2017-12-31

By account: sreport -T billing cluster UserUtilizationByAccount Start=2017-01-01 End=2017-12-31

For all of the sreport options see the sreport man page.

Time Remaining in an Allocation

If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, applications have two means for discovering the time remaining in the allocation.

The first means is to use the sbatch --signal=<sig_num>[@<sig_time>] option to request a signal (like USR1 or USR2) at sig_time number of seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler takes the necessary steps to write a checkpoint file and terminate gracefully.
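
For example, a minimal sketch of a batch script that requests SIGUSR1 ten minutes before the time limit and traps it (the checkpoint command is a placeholder):

#!/bin/bash
#SBATCH --job-name=checkpoint_demo
#SBATCH --time=24:00:00
# Send SIGUSR1 to the batch shell 600 seconds before the allocation expires.
#SBATCH --signal=B:USR1@600

# On SIGUSR1, write a checkpoint and exit cleanly.
trap 'echo "Time limit approaching; checkpointing"; ./write_checkpoint.sh; exit 0' USR1

# Run the application in the background so the shell can handle the signal,
# then wait for it to finish.
./my_long_program &
wait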

The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.

Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.

Beta Configuration

By | Beta, HPC

Hardware

Computing

The Beta hardware is a subset of the hardware currently used in Flux.

Networking

The compute nodes are all interconnected with InfiniBand networking. In addition to the InfiniBand networking, there is a gigabit Ethernet network that also connects all of the nodes. This is used for node management and NFS file system access.

Storage

The high-speed scratch file system is based on Lustre v2.5 and is a DDN SFA10000 backed by the hardware described in this table, the same that is used in Flux:

Server Type  Network Connection  Disk Capacity (raw/usable)
Dell R610    40Gbps InfiniBand   520 TB / 379 TB
Dell R610    40Gbps InfiniBand   530 TB / 386 TB
Dell R610    40Gbps InfiniBand   530 TB / 386 TB
Dell R610    40Gbps InfiniBand   520 TB / 379 TB
Totals       160 Gbps            2100 TB / 1530 TB

Operation

Computing jobs on Beta are managed completely through Slurm.  See the Beta User Guide for directions on how to submit and manage jobs.

Software

There are three layers of software on Beta.

Operating Software

The Beta cluster runs CentOS 7. We update the operating system on Beta as CentOS releases new versions and our library of third-party applications offers support. Due to the need to support several types of drivers (AFS and Lustre file system drivers, InfiniBand network drivers and NVIDIA GPU drivers) and dozens of third party applications, we are cautious in upgrading and can lag CentOS’s releases by months.

Compilers and Parallel and Scientific Libraries

Beta supports the GNU Compiler Collection, the Intel Compilers, and the PGI Compilers for C and Fortran. The Beta cluster’s parallel library is OpenMPI; the default versions are 1.10.7 (i686) and 3.1.2 (x86_64), and a limited number of earlier versions are available. Beta provides the Intel Math Kernel Library (MKL) set of high-performance mathematical libraries. Other common scientific libraries are compiled from source and include HDF5, NetCDF, FFTW3, Boost, and others.

Please contact us if you have questions about the availability of, or support for, any other compilers or libraries.

Application Software

Beta supports a wide range of application software. We license common engineering simulation software (for example, Ansys, Abaqus, and VASP), and we compile other applications for use on Beta (for example, OpenFOAM and Abinit). We also have software for statistics, mathematics, debugging and profiling, etc. Please contact us if you wish to inquire about the current availability of a particular application.

GPUs

Beta has eight K20x GPUs on one node for testing GPU workloads under Slurm.

 
GPU Model NVIDIA K20X
Number and Type of GPU one Kepler GK110
Peak double precision floating point perf. 1.31 Tflops
Peak single precision floating point perf. 3.95 Tflops
Memory bandwidth (ECC off) 250 GB/sec
Memory size (GDDR5) 6 GB
CUDA cores 2688

If you have questions, please send email to hpc-support@umich.edu.

Getting Access

Beta is intended for small scale testing to convert Torque/PBS scripts to Slurm. No sensitive data of any type should be used on Beta.

To request access:

1. Fill out the ARC-TS HPC account request form.

Because this is a test platform, there is no cost for using Beta.
