Using Great Lakes

Before you can use the Great Lakes cluster, the Principal Investigator (PI) must establish a Slurm account by contacting HPC Support with a list of users, a list of admins, and a shortcode. U-M Research Computing Package (UMRCP) accounts are also available to eligible researchers. For more information, please visit the UMRCP page.


Each user also needs a Great Lakes cluster login account, which gives you access to run jobs.


See the Great Lakes cluster cheat sheet for a list of common Linux (Bash) and Slurm commands, including Torque and Slurm comparisons.



Cluster Defaults, Partition Limits, and Storage

Great Lakes Cluster Defaults

Cluster Defaults                       Default Value
Default walltime                       60 minutes
Default memory per CPU                 768 MB
Default number of CPUs                 no memory specified: 1 core; memory specified: memory/768 MB = number of cores (rounded down)
/scratch file deletion policy          files deleted after 60 days without being accessed (see SCRATCH STORAGE POLICIES below)
/scratch quota per root account        10 TB storage and 1 million inode limit (see SCRATCH STORAGE POLICIES below)
/home quota per user                   80 GB
Max queued jobs per user per account   5,000
Shell timeout if idle                  2 hours
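
For example, under these defaults a job that requests 4 GB of memory (4,096 MB) without specifying a CPU count would be assigned 4,096 / 768 ≈ 5.3, rounded down to 5 cores.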

Great Lakes Partition Limits

Partition Limit                       standard | gpu | gpu_mig40 | spgpu | largemem | standard-oc | viz* | debug*
Max walltime                          2 weeks (standard, gpu, gpu_mig40, spgpu, largemem, standard-oc); 2 hours (viz); 4 hours (debug)
Max running Mem per root account      3,500 GB / 1.5 TB / 660 GB / 3,500 GB / 40 GB
Max running CPUs per root account     500 cores / 36 cores / 132 cores / 500 cores / 8 cores
Max running GPUs per root account     n/a / n/a / n/a / 5 / n/a

*There is only one GPU in each viz node, and viz nodes are accessible only through Open OnDemand’s Remote Desktop application.

*All debug limits are per user, and only one debug job can run at a time. The largemem and standard-oc limits are per account.

Please see the section on Partition Policies for more details.

Great Lakes Storage

Every user has a /scratch directory for each Slurm account they are a member of. For each account there is also a shared data directory for collaboration with other members of that account. The directory group ownership is set using the Slurm account-based UNIX groups, so all files created in the /scratch directory are accessible by any group member, which facilitates collaboration.

Example:
/scratch/msbritt_root/msbritt
/scratch/msbritt_root/msbritt/bob
/scratch/msbritt_root/msbritt/shared_data

Please see the section on Storage Policies for more details.


Getting Started (Web-based Open OnDemand)

1. Get Duo

You must use Duo authentication to log on to the Great Lakes OnDemand web service.  Get more details on the Safe Computing Two-Factor page and enroll here.

2. Get a Great Lakes user login

You must establish a user login on Great Lakes by filling out this form.

3. Connect to Great Lakes OnDemand

You must be on campus or on the VPN to connect to Great Lakes OnDemand.  If you are trying to log in from off campus or using an unauthenticated wireless network such as MGuest, you should install VPN software on your computer.

Once you are on the University network, follow these instructions to connect:

  1. Open your web browser (Firefox, Edge, or Chrome in incognito recommended) and navigate to:
    greatlakes.arc-ts.umich.edu
    greatlakes-oncampus.arc-ts.umich.edu (for on-campus restricted software)
  2. Log in to cosign using your uniqname and password.
  3. Complete Duo authentication.
  4. You should now be logged in.

If you receive a “Bad Request: Your browser sent a request that this server could not understand. Size of request header field exceeds server limit.” error, please clear your browser’s cookies and try to access Great Lakes OnDemand again.

This error can occur when the request header sent by the browser becomes too large. This is typically caused by accumulated cookies and cached data for a specific site; in some cases, corrupted cookies can also contribute to the problem.

4. Get files

At the top of the page, click “Files” and then “Home Directory”. A new tab will be created that contains the File Explorer.

Here you can navigate your home folder.  The buttons do the following:

  • “Go To…”: Navigate to a specified folder
  • “Open in Terminal”: Opens the active folder in a terminal session (new tab)
  • “New File”: Creates a new file in the active folder
  • “New Dir”: Creates a new folder in the active folder
  • “Upload”: Select files from your local machine to upload to the active folder
  • “Show Dotfiles”: Reveals hidden files (usually do not need to be changed)
  • “Show Owner/Mode”: Shows ownership and permission information
  • “View”: Shows file contents inside the current tab
  • “Edit”: Opens a file editor in a new tab
  • “Rename/Move”: Gives a file a new path and/or name
  • “Download”: Downloads the file or folder to your local machine
  • “Copy”: Copies selected files to the clipboard
  • “Paste”: Pastes files from the clipboard
  • “(Un)Select All”: Select or unselect all files/folders
  • “Delete”: Deletes selected files/folders

5. Submit a job

At the top of the home page, click “Jobs” and then “Job Composer”. A new tab will be created that contains the Job Composer.

Upon your first visit to this page, you’ll go through a helpful tutorial.  The buttons do the following:

  • “New Job”: Creates a new job…
    • “From Default Template”: Uses system defaults for a bare bones “Hello World” job on the Great Lakes cluster.  Please note that you will still need to specify your account.
    • “From Specified Path”: Creates a job from a specified job script.  See the Slurm User Guide for Great Lakes for information on writing this script.  Some attributes (name, account) can be set here if not set in the script.
    • “From Selected Job”: Creates a new job that is a copy of the selected job.
  • “Edit Files”: Opens the project folder in a new File Explorer tab, allowing you to edit the files within (see “Get Files” above for File Explorer instructions).
  • “Job Options”: Allows for editing the Name, Cluster, Job Script, and Account fields.
  • “Open Terminal”: Opens a terminal session in a new tab, starting in the project folder.
  • “Submit”: Submits the selected job to the cluster.
  • “Stop”: Stops the selected job if it has been submitted.
  • “Delete”: Delete the selected job.

To view active job information, click “Jobs” and then “Active Jobs” from the home page.

This is a simple guide to get your jobs up and running. For more advanced Slurm features and job scripting information, see the Slurm User Guide for Great Lakes. If you are familiar with using the resource manager Torque, you may find the migrating from Torque to Slurm guide useful.

Interactive Apps

At the top of the home page, click “Interactive Apps” and then select your desired application.

 

Great Lakes Remote Desktop

Launches an interactive desktop in a new tab. Specify your account (usually your PI’s uniqname), hours, memory, cores, and partition (standard, gpu, largemem).

 

Please note: At this time, only one node may be requested with Remote Desktop.

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can change remote desktop settings. For slower internet connections, you can try a higher compression and lower image quality using the sliders. Conversely, if you have a fast connection you can lower compression and raise image quality. You can also directly access your node’s terminal by clicking on the hostname (blue button). Once you’re ready to use the desktop, click on “Launch Greatlakes Remote Desktop”.

A remote desktop session will then be opened in a new tab for the requested amount of time. If you finish early, return to the “My Interactive Sessions” tab and delete the job.

Great Lakes GPU Accelerated Desktop Applications

Launches an interactive desktop in a new tab. Specify your account (usually your PI’s uniqname), hours, memory, cores, and the viz partition.

 

Please note: At this time, only one node may be requested with Remote Desktop.

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can change remote desktop settings. For slower internet connections, you can try a higher compression and lower image quality using the sliders. Conversely, if you have a fast connection you can lower compression and raise image quality. You can also directly access your node’s terminal by clicking on the hostname (blue button). Once you’re ready to use the desktop, click on “Launch Greatlakes Remote Desktop”.

A remote desktop session will then be opened in a new tab for the requested amount of time. If you finish early, return to the “My Interactive Sessions” tab and delete the job.

Example/Walkthrough of using GPU Hardware Acceleration

Using 40 cores, 40 GB of memory, and 1 GPU, here is an example walkthrough comparing non-GPU software acceleration with GPU hardware acceleration:

  • Launch an Open OnDemand Remote Desktop application using the viz partition
  • Open up the terminal
  • To launch a non-GPU software accelerated application, run:
    • glxspheres64
  • To launch a GPU hardware accelerated application, run:
    • vglrun glxspheres64

Using GPU hardware acceleration, glxspheres64 shows a 10x performance improvement. Your code may perform differently.

MATLAB

Launches an interactive desktop with MATLAB configured and running in a new tab. Specify your desired version, account, hours, partition (standard, gpu, largemem), and memory (4 GB minimum).

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can change remote desktop settings. For slower internet connections, you can try a higher compression and lower image quality using the sliders. Conversely, if you have a fast connection you can lower compression and raise image quality. You can also directly access your node’s terminal by clicking on the hostname (blue button). Once you’re ready to use the application, click on “Launch MATLAB”.

A remote desktop session running MATLAB will then be opened in a new tab for the requested amount of time. You may also use the terminal and other basic applications. If you finish early, return to the “My Interactive Sessions” tab and delete the job.

RStudio

Launches an interactive desktop with RStudio configured and running in a new tab. Specify your desired version, account, hours, cores, partition (standard, gpu, largemem), and memory (2 GB minimum).

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can change remote desktop settings. For slower internet connections, you can try a higher compression and lower image quality using the sliders. Conversely, if you have a fast connection you can lower compression and raise image quality. You can also directly access your node’s terminal by clicking on the hostname (blue button). Once you’re ready to use the application, click on “Launch RStudio”.

A remote desktop session running RStudio will then be opened in a new tab for the requested amount of time. You may also use the terminal and other basic applications. If you finish early, return to the “My Interactive Sessions” tab and delete the job.

Jupyter Notebook Server


Launches a Jupyter Notebook Server in a new tab. Specify your desired Anaconda Python version, account, hours, partition (standard, gpu, largemem), cores, and memory.

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can click on “Connect to Jupyter”.

For instructions on using Jupyter Notebook, see the official documentation.

GPU Settings

If you have selected one of the GPU partitions (e.g. gpu or spgpu), the form for your desktop or interactive app will automatically display two new fields: Number of GPUs and GPU Compute Mode. Enter the number of GPUs that you wish to use for the job in the appropriate field; note that the maximum allowed value differs depending on which partition is selected. ARC’s default setting for Compute Mode is exclusive, where only a single process can run at a time on each GPU. Shared mode allows multiple processes to run simultaneously on a single GPU. See the CUDA Programming Guide for more details. You may query the compute mode from any GPU node by entering the command nvidia-smi -q | grep "Compute Mode"; a result of Default refers to the NVIDIA default of shared mode, as opposed to the ARC default selection of exclusive mode.

For example:

$ nvidia-smi -q |grep "Compute Mode"
Compute Mode : Default


Getting Started (Command Line)

1. Get Duo

Duo is required to access the majority of U-M services and all HPC services. If you need to set up Duo, please visit this page.

2. Get a Great Lakes user login

You must establish a user login on Great Lakes by filling out this form.

3. Get an SSH Client & Connect to Great Lakes

You must be on campus or on the VPN to connect to Great Lakes.  If you are trying to log in from off campus, or using an unauthenticated wireless network such as MGuest, you have a couple of options:

Mac or Linux:

Open Terminal and type:
ssh uniqname@greatlakes.arc-ts.umich.edu
You will be required to enter your Kerberos level-1 password to log in. Please note that as you type your password, nothing you type will appear on the screen; this is completely normal. Press “Enter/Return” key once you are done typing your password.

When you’re connecting for the first time, it’s not uncommon to see a message like this one:

The authenticity of host 'greatlakes.arc-ts.umich.edu (141.211.192.39)' can't be established.
RSA key fingerprint is 6f:8c:67:df:43:4f:e0:fc:80:5b:49:1a:eb:81:cc:54.
Are you sure you want to continue connecting (yes/no)?

 

This is normal. By saying “yes” you’re accepting the public SSH key for the system. This key will be stored in a local known_hosts file on your system so you won’t be prompted in the future. The keys from Great Lakes will NOT change. So, for example, if you get a new computer and SSH to Great Lakes, you’ll be prompted to add the key again.

We encourage you to compare the fingerprint you’re presented with, when connecting for the first time, to one of the fingerprints below. The format of the fingerprint you’re presented could be dictated by the SSH client on your machine.

RSA 6f:8c:67:df:43:4f:e0:fc:80:5b:49:1a:eb:81:cc:54
ECDSA Dae1G3gu0mtro2Rm15U6l8aQg4bGFnDQJhmGH3k+fKs
ED25519 9ho43xHw/aVo4q5AalH0XsKlWLKFSGuuw9lt3tCIYEs

In the example message given above, we are presented with the RSA key fingerprint and its MD5 value, which is the same value as in the above table.

If you’re NOT seeing one of these fingerprints, submit a ticket to arcts-support@umich.edu and do NOT connect to the server via SSH until discussing with an ARC staff member to determine if there is a security issue.

To avoid being prompted to accept the key on a new system, you may choose to pre-populate your SSH known_hosts file with the public keys from Great Lakes. The keys can be found in the FAQ.
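
For example, one way to do this from a Mac or Linux terminal is with ssh-keyscan; verify the retrieved keys against the published fingerprints above before relying on them:

ssh-keyscan greatlakes.arc-ts.umich.edu >> ~/.ssh/known_hosts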

 

Windows (using PuTTY):

Download and install PuTTY here.

Launch PuTTY and enter greatlakes.arc-ts.umich.edu as the host name, then click open.

If you receive a “PuTTY Security Alert” pop-up, this is completely normal, click the “Yes” option. This will tell PuTTY to trust the host the next time you want to connect to it. From there, a terminal window will open; you will be required to enter your UMICH uniqname and then your Kerberos level-1 password in order to log in. Please note that as you type your password, nothing you type will appear on the screen; this is completely normal. Press “Enter/Return” key once you are done typing your password.

All Operating Systems:

At the “Enter a passcode or select one of the following options:” prompt, type the number of your preferred choice for Duo authentication.

4. Get files

You can use SFTP (best for simple transfers of small files) or Globus (best for large files or a commonly used endpoint) to transfer data to your /home directory.

SFTP: Mac or Windows using FileZilla
  1. Open FileZilla and click the “Site Manager” button
  2. Create a New Site, which you can name “Great Lakes” or something similar
  3. Select the “SFTP (SSH File Transfer Protocol)” option
  4. In the Host field, type greatlakes-xfer.arc-ts.umich.edu
  5. Select “Interactive” for Logon Type
  6. In the User field, type your uniqname
  7. Click “Connect”
  8. Enter your Kerberos password
  9. Select your Duo method (1-3) and complete authentication
  10. Drag and drop files between the two systems
  11. Click “Disconnect” when finished

On Windows, you can also use WinSCP with similar settings, available alongside PuTTY here.

SFTP: Mac or Linux using Terminal

To copy a single file, type:

scp localfile uniqname@greatlakes-xfer.arc-ts.umich.edu:./remotefile

To copy an entire directory, type:

scp -r localdir uniqname@greatlakes-xfer.arc-ts.umich.edu:./remotedir

These commands can also be reversed in order to copy files from Great Lakes to your machine:

scp -r uniqname@greatlakes-xfer.arc-ts.umich.edu:./remotedir localdir

You will need to authenticate via Duo to complete the file transfer.
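
You can also supply an absolute remote path to copy data directly into scratch rather than your home directory; for example, using the directory layout shown in the Great Lakes Storage section above:

scp -r localdir uniqname@greatlakes-xfer.arc-ts.umich.edu:/scratch/msbritt_root/msbritt/bob/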

Globus: Windows, Mac, or Linux

Globus is a reliable high performance parallel file transfer service provided by many HPC sites around the world. It enables easy transfer of files from one system to another, as long as they are Globus endpoints.

  • The Globus endpoint for Great Lakes is “umich#greatlakes”.
How to use Globus

Globus Online is a web front end to the Globus transfer service. Globus Online accounts are free and you can create an account with your University identity.

  • Set up your Globus account and learn how to transfer files using the Globus documentation.  Select “University of Michigan” from the dropdown box to get started.
  • Once you are ready to transfer files, enter “umich#greatlakes” as one of your endpoints.
Globus Connect Personal

Globus Online also allows for simple installation of a Globus endpoint for Windows, Mac, and Linux desktops and laptops.

  • Follow the Globus instructions to download the Globus Connect Personal installer and set up an endpoint on your desktop or laptop.
Batch File Copies

A non-standard use of Globus Online is that you can use it to copy files from one location to another on the same cluster. To do this, use the same endpoint (for example, umich#greatlakes) for both the sending and receiving sides. Set up the transfer and Globus will take care of the rest. The service will email you when the copy is finished.

Command Line Globus

There are command-line tools for Globus that are intended for advanced users. If you wish to use these, contact HPC support.

5. Submit a job

This is a simple guide to get your jobs up and running. For more advanced Slurm features, see the Slurm User Guide for Great Lakes. If you are familiar with using the resource manager Torque, you may find the migrating from Torque to Slurm guide useful.

Batch Jobs

Most work will be queued to run on Great Lakes and is described through a batch script. The sbatch command is used to submit a batch script to Slurm. Submit the batch script from a shared file system: your home directory, /scratch, or any directory under /nfs that you can normally use in a job. Output will be written to that working directory (jobName-jobID.log). Do not submit jobs from /tmp or any of its subdirectories.

$ sbatch myJob.sh

The batch job script is composed of three main components:

  • The interpreter used to execute the script
  • #SBATCH directives that convey submission options
  • The application(s) to execute along with its input arguments and options

Example:

#!/bin/bash
# The interpreter used to execute the script

# “#SBATCH” directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-type=BEGIN,END
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m 
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

# The application(s) to execute along with its input arguments and options:

/bin/hostname
sleep 60

How many nodes and processors you request will depend on the capability of your software and what it can do. There are four common scenarios:

Example: One Node, One Processor

This is the simplest case and is shown in the example above. The majority of software cannot use more than this. Some examples of software for which this would be the right configuration are SAS, Stata, R, many Python programs, and most Perl programs.

NOTE: If you will be using licensed software, for example, Stata, SAS, Abaqus, or Ansys, then you may need to request licenses. See the table of common submission options below for the syntax; in the Software section, we show the command that lists which software requires you to request a license.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun --cpu-bind=none hostname -s
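
If the job also needs a software license (Stata in this sketch), you would add a license request to the directives above, using the --licenses syntax from the table of common submission options below:

#SBATCH --licenses=stata@slurmdb:1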

Example: One Node, Multiple Processors

This is similar to what a modern desktop or laptop is likely to have. Software that can use more than one processor may be described as multicore, multiprocessor, or multithreaded. Some examples of software that can benefit from this are MATLAB and Stata/MP. You should read the documentation for your software to see if this is one of its capabilities.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun --cpu-bind=none hostname -s

Example: Multiple Nodes, One Process per CPU

This is the classic MPI approach, where multiple nodes are requested and one process per processor is started on each node using MPI. This is the way most MPI-enabled software is written to work.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun --cpu-bind=none hostname -s

Example: Multiple Nodes, Multiple CPUs per Process

This is often referred to as the “hybrid mode” MPI approach, where multiple nodes are requested and multiple processes are started on each node. MPI will start a parent process (or processes) on each node, and each of those can in turn use more than one processor for threaded calculations.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun --cpu-bind=none hostname -s
Common Job Submission Options
Description                    Slurm directive (#SBATCH option)        Great Lakes Usage
Job name                       --job-name=<name>                       --job-name=gljob1
Account                        --account=<account>                     --account=test
Partition                      --partition=<partition_name>            --partition=standard
                               Available partitions: standard (default), gpu (GPU jobs only), spgpu (single-precision optimized), largemem (large memory jobs only), debug, standard-oc (on-campus software only)
Wall time limit                --time=<dd-hh:mm:ss>                    --time=01-02:00:00
Node count                     --nodes=<count>                         --nodes=2
Process count per node         --ntasks-per-node=<count>               --ntasks-per-node=1
Minimum memory per processor   --mem-per-cpu=<memory>                  --mem-per-cpu=1000m
Request software license(s)    --licenses=<application>@slurmdb:<N>    --licenses=stata@slurmdb:1 (requests one license for Stata)
Request event notification     --mail-type=<events>                    --mail-type=BEGIN,END,FAIL
                               Note: multiple mail-type requests may be specified in a comma-separated list, e.g. --mail-type=BEGIN,END,NONE,FAIL,REQUEUE

Please note that if your job is set to utilize more than one node, make sure your code is MPI enabled in order to run across these nodes. More advanced job submission options can be found in the Slurm User Guide for Great Lakes.

Interactive Jobs

An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. Interactive jobs are useful when debugging or interacting with an application. The salloc command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[user@gl-login1 ~]$ salloc --account=test 
salloc: job 28652756 queued and waiting for resources
salloc: job 28652756 has been allocated resources
salloc: Granted job allocation 28652756
salloc: Waiting for resource configuration
salloc: Nodes gl3057 are ready for job
[user@gl3057 ~]$ hostname 
gl3057.arc-ts.umich.edu 
[user@gl3057 ~]$

Jobs submitted with salloc and no additional specification of resources will be assigned the cluster default values of 1 CPU and 768 MB of memory. The account must be specified; the job will not run otherwise. If additional resources are required, they can be requested as options to the salloc command. The following example would be appropriate for an MPI job that wants two nodes with four MPI processes per node, each process using one CPU and 1 GB of memory. MPI programs run from jobs should be started with srun or one of the other commands that can start MPI programs. Note the --cpu-bind=none option, which is recommended unless you know an efficient processor geometry for your job.

[user@gl-login1 ~]$ salloc --nodes=2 --account=test --ntasks-per-node=4 --mem-per-cpu=1GB
salloc: Pending job allocation 28652831
salloc: job 28652831 queued and waiting for resources
salloc: job 28652831 has been allocated resources
salloc: Granted job allocation 28652831
salloc: Waiting for resource configuration
salloc: Nodes gl[3017-3018] are ready for job

[user@gl3017 ~]$ srun --cpu-bind=none hostname
gl3017.arc-ts.umich.edu
gl3017.arc-ts.umich.edu
gl3017.arc-ts.umich.edu
gl3017.arc-ts.umich.edu
gl3018.arc-ts.umich.edu
gl3018.arc-ts.umich.edu
gl3018.arc-ts.umich.edu
gl3018.arc-ts.umich.edu

In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job, though that is fairly uncommon. See the srun man page for more details.
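
For example, to run a command on only two of the eight tasks in the allocation above, you could run something like:

[user@gl3017 ~]$ srun --ntasks=2 --cpu-bind=none hostname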

GPU and Large Memory Jobs

Jobs can request GPUs with the job submission options --partition=gpu or --partition=spgpu and a count option from the table below. All counts can be given as gputype:number or just a number (the default GPU type for the partition will be used). Available GPU types can be found with the command sinfo -O gres -p <partition>. GPUs can be requested in both batch and interactive jobs. Additionally, a user can select the compute mode of the GPUs for each job as either exclusive (ARC’s default setting) or shared. Exclusive mode limits each GPU to running only one process at a time, while shared mode allows multiple processes to run simultaneously on a single GPU. See the CUDA Programming Guide for more details. You may query the compute mode from any GPU node by entering the command nvidia-smi -q | grep "Compute Mode"; a result of Default refers to the NVIDIA default of shared mode, as opposed to the ARC default selection of exclusive mode. For example:

$ nvidia-smi -q |grep "Compute Mode"
Compute Mode : Default

The gpu partition uses NVIDIA Tesla V100 GPUs (gputype v100) and the spgpu partition uses NVIDIA A40 GPUs (gputype a40).  For more information on these GPUs, please see the Great Lakes configuration page.

Description               Slurm directive (#SBATCH or srun option)   Example
GPUs per node             --gpus-per-node=<gputype:number>           --gpus-per-node=2 or --gpus-per-node=v100:2
GPUs per job              --gpus=<gputype:number>                    --gpus=2 or --gpus=a40:2
GPUs per socket           --gpus-per-socket=<gputype:number>         --gpus-per-socket=2 or --gpus-per-socket=v100:2
GPUs per task             --gpus-per-task=<gputype:number>           --gpus-per-task=2 or --gpus-per-task=a40:2
Compute mode              --gpu_cmode=<shared|exclusive>             --gpu_cmode=shared
CPUs required per GPU     --cpus-per-gpu=<number>                    --cpus-per-gpu=4
Memory per GPU            --mem-per-gpu=<number>                     --mem-per-gpu=1000m

Jobs can request nodes with large amounts of RAM with --partition=largemem.
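
As an illustration, a minimal batch script requesting a single GPU on the gpu partition might look like the following; as in the earlier examples, the account name test is a placeholder, and the resource values should be adjusted to your workload:

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus=1
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=8g
#SBATCH --time=00:30:00
#SBATCH --account=test
#SBATCH --partition=gpu
#SBATCH --mail-type=NONE

# Report the GPU allocated to this job
nvidia-smi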

Submitting a Job in One Line

If you wish to submit a job without needing a separate script, you can use sbatch --wrap=<command string>.  This will wrap the specified command in a simple “sh” shell script, which is then submitted to the Slurm controller.
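
For example, the following runs a single command as a batch job without a separate script (the account is a placeholder):

$ sbatch --job-name=hostname_test --account=test --partition=standard --time=5:00 --mem-per-cpu=1g --wrap="hostname -s"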

Using Local Disk During a Job

During your job, you may write to and read from two temporary locations on the node:

  • /tmp: Two 7200 RPM SATA drives in RAID 0, 3.5 TB per node
  • /tmpssd: Faster solid state drive, 426 GB per node (on standard compute nodes only)

These folders are local, meaning they are only available to the processes running on that specific node and are not shared across the cluster.  If you need shared space, your /scratch folder may be a better temporary work space.

Keep in mind that these are temporary folders and may be used by others during or after your job. Please try not to completely fill the space so that others can use it, and move or delete your /tmp and /tmpssd files after your work is finished.
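
The following sketch shows one way to stage work on local disk inside a job script; my_program and input.dat are hypothetical, and $SLURM_JOB_ID is used to keep the temporary directory unique:

# Stage input data on node-local disk
JOBTMP=/tmp/${SLURM_JOB_ID}
mkdir -p "$JOBTMP"
cp "$HOME/input.dat" "$JOBTMP/"
cd "$JOBTMP"

# Run the (hypothetical) application against the local copy of the data
"$HOME/my_program" input.dat > output.dat

# Copy results back to shared storage and clean up the local disk
cp output.dat "$HOME/"
cd "$HOME"
rm -rf "$JOBTMP"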

Job Status

Most of a job’s specifications can be seen by invoking scontrol show job <jobID>. The job’s submitted batch script can be written to a file using scontrol write batch_script <jobID> output.txt. If no output file is specified, the script will be written to slurm<jobID>.sh.

A job’s record remains in Slurm’s memory for 30 minutes after it completes.  scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago.  At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.
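
For example, a summary of a completed job can be pulled from the database with a command like the following (the field list is just one reasonable choice):

$ sacct -j <jobID> --format=JobID,JobName,Partition,Account,State,Elapsed,MaxRSS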


Citing and Grants

Researchers are urged to acknowledge ARC in any publication, presentation, report, or proposal on research that involved ARC hardware (Great Lakes or other resources) and/or staff expertise.

“This research was supported in part through computational resources and services provided by Advanced Research Computing at the University of Michigan, Ann Arbor.”

Researchers are asked to submit annually, by October 1, a list of materials that reference ARC, and to inform its staff whenever any such research receives professional or press exposure (arc-contact@umich.edu). This information is extremely important in enabling ARC to continue supporting U-M researchers and to obtain funding for future system and service upgrades.


Updates and Notices

This section will be updated when system level changes are made to Great Lakes. There are currently no updates.


Policies

Partition Policies

Slurm partitions represent collections of nodes for a computational purpose, and are equivalent to Torque queues. For more Great Lakes hardware specifications, see the Configuration page.

Partitions:

  • debug: The goal of debug is to allow users to run jobs quickly for debugging purposes.
    • Maximum jobs per user: 1
    • Maximum walltime: 4 hours
    • Maximum processors per job: 8
    • Maximum memory per job: 40 GB
    • Higher scheduling priority
  • standard: Standard compute nodes used for most work.
    • Max walltime: 14 days
    • Default partition if none specified
  • standard-oc: These nodes will be configured with additional software that can only be used on-campus, but are otherwise identical to standard compute nodes.
    • Max walltime: 14 days
  • gpu: Allows use of NVIDIA Tesla V100 GPUs.
    • Max walltime: 14 days
  • spgpu: Allows use of NVIDIA A40 GPUs.
    • Max walltime: 14 days
  • largemem: Allows use of a compute node with 1.5 TB of RAM.
    • Max walltime: 14 days

Account/Association Limits

To facilitate fairness between accounts, we have set resource limits on each Great Lakes root account, which are described earlier in this user guide.

Limits can be set on a Slurm association or on a Slurm account. This allows a PI to limit individual users or the collective set of users in an account as the PI sees fit. The following values can be used to limit either an account or a user association, unless noted otherwise below:

The limits currently available on Great Lakes are:

  • MaxJobs
    • Maximum number of jobs allowed to run at one time
    • Account example: testaccount can have 10 simultaneously running jobs (testuser1 has 8 running jobs and testuser2 has 2 running jobs for a total of 10 running jobs)
    • Association example: testuser can have 2 simultaneously running jobs
  • MaxWall
    • Maximum duration of a job
    • Account example: all users on testaccount can run jobs for up to 3 days
    • Association example: testuser’s jobs can run up to 3 days
  • MaxTRES (CPU, Memory, GPU or billing units)
    • Maximum number of TRES the running jobs can simultaneously use
    • NOTE: CPU, Memory, and GPU can also be limited on a user’s individual job
    • Account example: testaccount’s running jobs can collectively use up to 5 GPUs (testuser1’s jobs are using 3 GPUs and testuser2’s jobs are using 2 GPUs for a total of 5 GPUs)
    • Association example: testuser’s running jobs can collectively use up to 10 cores
    • Job example: testuser can run a single job using up to 10 cores
  • GrpTRESMins (billing units)
    • The total number of TRES minutes that can possibly be used by past, present and future jobs. This is primarily used for setting spending limits
    • Account example: all users on testaccount share a spending limit of $1000
    • Association example: testuser has a spending limit of $1000
  • GrpTRESRunMins
    • The total number of TRES minutes used by all running jobs. This takes into consideration the time limit of running jobs. If the limit is reached no new jobs are started until other jobs finish.
    • Account example: all users on testaccount share a pool of 1000 CPU minutes for running jobs (users have 10 serial jobs each with 100 minutes remaining to completion)
    • Association example: testuser can have up to 100 CPU minutes of running jobs (1 job with 100 CPU minutes remaining, 2 with 50 minutes remaining, etc.)
Periodic Spending Limits

The PI has the ability to set a monthly or yearly (fiscal year) spending limit on a Slurm account. Spending limits will be updated at the beginning of each month. As an example, if the testaccount account has a monthly spending limit of $1000 and this is used up on January 22nd, jobs will be unable to run until February 1st when the limit will reset with another $1000 to spend.

Please contact ARC if you would like to implement any of these limits.

Billing Policies

A job is charged based on the percentage of the node it uses, calculated from its CPU, memory, and (if applicable) GPU usage. The charge is the maximum of the weighted charges for CPU, memory, and GPU. If you use 1 core and all the memory of the machine, or all the cores and minimal memory, you will be charged for the entire machine.
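
For example, on a node with 36 cores and 180 GB of usable memory (illustrative figures), a job requesting 4 cores and 170 GB would be billed at max(4/36, 170/180) ≈ 94% of the node, because its memory footprint dominates.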

Multiple Shortcodes, up to four, can be used. If more than one Shortcode is used, the amount charged to each can be split by percentage.

Great Lakes accounts can be initiated by sending an email to hpc-support@umich.edu.

Refund Policy

ARC operates our HPC clusters to the best of our abilities, but there can be events, both within and outside of our control, which may cause interruptions to your jobs. You are responsible for due diligence around your use of the ARC HPC resources and taking measures to maximize your research.  These actions may include:

  • Backing up data to permanent storage locations
  • Checkpointing your code to minimize impacts from job interruptions
  • Error checking in your scripts
  • Understanding the operation of the system and the user guide for the HPC cluster, including per job charges which may be greater than expected

Refunds, if any, are at the discretion of ARC and will only be issued for system-wide preventable issues. This does not include hardware failures, power failures, job failures, or similar issues.

Terms of Usage and User Responsibility

  1. Data is not backed up. None of the data on Great Lakes is backed up. The data that you keep in your home directory, /tmp or any other filesystem is exposed to immediate and permanent loss at all times. You are responsible for mitigating your own risk. ARC provides more durable storage on Turbo, Locker, and Data Den. See Storage Systems & Services for more information on these.
  2. Your usage is tracked and may be used for reports. We track a lot of job data and store it for a long time. We use this data to generate usage reports and look at patterns and trends. We may report this data, including your individual data, to your adviser, department head, dean, or other administrator or supervisor.
  3. Maintaining the overall stability of the system is paramount to us. While we make every effort to ensure that every job completes with the most efficient and accurate way possible, the stability of the cluster is our primary concern. This may affect you, but mostly we hope it benefits you. System availability is based on our best efforts. We are staffed to provide support during normal business hours. We try very hard to provide support as broadly as possible, but cannot guarantee support on a 24 hours a day basis. Additionally, we perform system maintenance on a periodic basis, driven by the availability of software updates, staffing availability, and input from the user community. We do our best to schedule around your needs, but there will be times when the system is unavailable. For scheduled outages, we will announce them at least one month in advance on the ARC home page; for unscheduled outages we will announce them as quickly as we can with as much detail as we have on that same page. You can also track ARC on Twitter (@umichARC).
  4. Great Lakes is intended only for non-commercial, academic research and instruction. Commercial use of some of the software on Great Lakes is prohibited by software licensing terms. Prohibited uses include product development or validation, any service for which a fee is charged, and, in some cases, research involving proprietary data that will not be made available publicly. Please contact hpc-support@umich.edu if you have any questions about this policy, or about whether your work may violate these terms.
  5. You are responsible for the security of sensitive codes and data. If you will be storing export-controlled or other sensitive or secure software, libraries, or data on the cluster, it is your responsibility to ensure that it is secured to the standards set by the most restrictive governing rules. We cannot reasonably monitor everything that is installed on the cluster and cannot be responsible for it; the responsibility lies with you, the end user.
  6. Data subject to HIPAA regulations may not be stored or processed on the cluster.

USER RESPONSIBILITIES

Users must manage data appropriately in their various locations:

  • /home
    • 80 GB quota, mounted on Turbo
  • /scratch (more information below)
  • /tmp
  • /tmpssd
  • customer-provided NFS

SCRATCH STORAGE POLICIES

File quotas on /scratch are per root account (a PI or project account) and shared between child accounts (individual users):

  • 10 TB storage limit
  • 1 million file (inode) limit

These limits may be increased if needed. If you are in need of more scratch space or a greater file limit for your account please email us at arc-support@umich.edu. Please note that these requests need to come from an administrator on the account and should include an explanation of why the increase is required. 

Users should keep in mind that /scratch has an auto-purge policy on unaccessed files, which means that any unaccessed data will be automatically deleted by the system after 60 days. Scratch file systems are not backed up. Critical files should be backed up to another location.

LOGIN NODE USAGE

Appropriate uses for the Great Lakes login nodes include:

  • Transferring small files to and from the cluster
  • Ordinary data management tasks, such as moving files, creating directories, etc.
  • Creating, modifying, and compiling code and submission scripts
  • Submitting and monitoring the status of jobs
  • Testing executables to ensure they will run on the cluster and its infrastructure.

You should limit your use of the Great Lakes login nodes to programs that use 4 or fewer processors, use less than 16 GB of memory, and run for no longer than 5 minutes. Larger or longer processes may cause problems for other users of the login nodes. We reserve the right to terminate processes that are causing, or may cause, a problem. If your program needs to run for more than 5 minutes or requires more extensive resources, you should use an interactive job.

Any other use of the login nodes may result in the termination of the offending process. Any production processes (including post-processing) should be submitted through the batch system to the cluster. If interactive use is required, then you should submit an interactive job to the cluster.

SECURITY ON GREAT LAKES & USE OF SENSITIVE DATA

Applications and data are protected by secure physical facilities and infrastructure as well as a variety of network and security monitoring systems. These systems provide basic but important security measures including:

  • Secure access – All access to Great Lakes is via SSH or Globus. SSH has a long history of high security.
  • Built-in firewalls – All of the Great Lakes servers have firewalls that restrict access to only what is needed.
  • Unique users – Great Lakes adheres to the University guideline of one person per login ID and one login ID per person.
  • Multi-factor authentication (MFA) – For all interactive sessions, Great Lakes requires both a UM Kerberos password and Duo authentication. File transfer sessions require a Kerberos password.
  • Private subnets – Other than the login and file transfer computers that are part of Great Lakes, all of the computers are on a network that is private within the University network and are unreachable from the Internet.
  • Flexible data storage – Researchers can control the security of their own data storage by securing their storage as they require and having it mounted via NFSv3 or NFSv4 on Great Lakes. Another option is to make use of Great Lakes’ local scratch storage, which is considered secure for many types of data. Note: Great Lakes is not considered secure for data covered by HIPAA.
