Using Armis2

Before you can use Armis2, the Principal Investigator (PI) must establish a Slurm account by contacting HPC Support with lists of users, admins, and a shortcode. Trial accounts are also available for new PIs.


Armis2 is suitable for export controlled and HIPAA regulated data, but users are responsible for security and compliance related to sensitive code and/or data.


See the Armis2 cheat sheet for a list of common Linux (Bash) and Slurm commands, including Torque and Slurm comparisons.



Cluster Defaults and Partition Limits

Armis2 Cluster Defaults

  • Default walltime: 60 minutes
  • Default memory per CPU: 768 MB
  • Default number of CPUs: 1 core if no memory is specified; if memory is specified, memory / 768 MB = number of cores (rounded down)
  • /scratch file deletion policy: files are deleted after 60 days without being accessed (see SCRATCH STORAGE POLICIES below)
  • /scratch quota per root account: 10 TB storage limit
  • /home quota per user: 80 GB
  • Max queued jobs per user per account: 5,000
  • Shell timeout if idle: 2 hours

Armis2 Partition Limits

Partitions: standard, gpu, largemem

  • Max walltime: 2 weeks (all partitions)
  • Max running memory per root account: 5160 GB (standard); 2210 GB (gpu and largemem)
  • Max running CPUs per root account: 1032 cores (standard); 84 cores (gpu and largemem)
  • Max running GPUs per root account: n/a (standard); 10 Tesla K40m (gpu); n/a (largemem)


Getting Started

1. Get DUO

DUO two-factor authentication is required to access the majority of U-M services and all HPC services. If you need to set up DUO, visit the DUO page on the Safe Computing site for instructions to get started.

2. Get an Armis2 user login

You must establish a user login on Armis2 by filling out the login request form. If you had a login on Armis, you should already have one on Armis2.

3. Get an SSH Client & Connect to Armis2 Login Node

The login node (armis2.arc-ts.umich.edu) is the entry point into the cluster. It is accessible only from Ann Arbor, Dearborn, and Flint campus IP addresses and from the U-M VPN network, and it requires a valid user account and Duo authentication to log in. The login node is a shared resource; users are expected not to monopolize it.

If you are trying to log in from off campus, or from an unauthenticated wireless network such as MGuest, you will need to connect through the U-M VPN first, since the login node is reachable only from campus IP addresses and the VPN.

See the policies below governing appropriate use of the login nodes.

Mac or Linux:

Open Terminal and type:

ssh uniqname@armis2.arc-ts.umich.edu

You will be required to enter your UMICH (Level-1) password to log in. Please note that as you type your password, nothing will appear on the screen; this is completely normal. Press the Enter/Return key once you are done typing your password.

When you’re connecting for the first time, it’s not uncommon to see a message like this one:

The authenticity of host 'armis2-login1.arc-ts.umich.edu (141.211.19.11)' can't be established.
RSA key fingerprint is 6f:8c:67:df:43:4f:e0:fc:80:5b:49:1a:eb:81:cc:54.
Are you sure you want to continue connecting (yes/no)?

 

This is normal. By saying “yes” you’re accepting the public SSH key for the system. This key will be stored in a local known_hosts file on your system so you won’t be prompted in the future. The keys from Armis2 will NOT change. So, for example, if you get a new computer and SSH to Armis2, you’ll be prompted to add the key again.

We encourage you to compare the fingerprint you’re presented with, when connecting for the first time, to one of the fingerprints below. The format of the fingerprint you’re presented could be dictated by the SSH client on your machine.

RSA 6f:8c:67:df:43:4f:e0:fc:80:5b:49:1a:eb:81:cc:54
ECDSA Dae1G3gu0mtro2Rm15U6l8aQg4bGFnDQJhmGH3k+fKs
ED25519 9ho43xHw/aVo4q5AalH0XsKlWLKFSGuuw9lt3tCIYEs

In the example message above, we are presented with the MD5 fingerprint of the RSA key, which matches the RSA value in the table above.

If you are NOT seeing one of these fingerprints, submit a ticket to arcts-support@umich.edu and do NOT connect to the server via SSH until you have discussed it with an ARC-TS staff member and determined whether there is a security issue.

To avoid being prompted to accept the key on a new system, you may choose to pre-populate your SSH known_hosts file with the public keys from Armis2. The keys can be found in the FAQ.
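
One way to do this, assuming you are on a trusted network and you compare the results against the fingerprints above, is to fetch the host keys with ssh-keyscan. This is a sketch, not an ARC-TS-provided procedure:

# Append the login node's public host keys to your known_hosts file.
ssh-keyscan armis2.arc-ts.umich.edu >> ~/.ssh/known_hosts

# Display the fingerprints of the stored keys for comparison with the table above.
ssh-keygen -lf ~/.ssh/known_hosts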

 

Windows (using PuTTY):

Download and install PuTTY from the PuTTY download page.

Launch PuTTY and enter armis2.arc-ts.umich.edu as the host name, then click open.

If you receive a “PuTTY Security Alert” pop-up, this is completely normal; click “Yes”. This tells PuTTY to trust the host the next time you connect to it. A terminal window will then open, and you will be required to enter your UMICH uniqname and then your UMICH (Level-1) password to log in. Please note that as you type your password, nothing will appear on the screen; this is normal. Press the Enter/Return key once you are done typing your password.

All Operating Systems:

At the “Enter a passcode or select one of the following options:” prompt, type the number of your preferred choice for Duo authentication.

4. Get files

You can use SFTP to transfer data to your /home directory.

SFTP: Mac or Windows using FileZilla
  1. Open FileZilla and click the “Site Manager” button
  2. Create a New Site, which you can name “Armis2” or something similar
  3. Select the “SFTP (SSH File Transfer Protocol)” option
  4. In the Host field, type armis2-xfer.arc-ts.umich.edu
  5. Select “Interactive” for Logon Type
  6. In the User field, type your uniqname
  7. Click “Connect”
  8. Enter your UMICH (Level-1) password
  9. Select your DUO method (1-3) and complete authentication
  10. Drag and drop files between the two systems
  11. Click “Disconnect” when finished

 

On Windows, you can also use WinSCP with similar settings, available alongside PuTTY.

SFTP: Mac or Linux using Terminal

To copy a single file, type:

scp localfile uniqname@armis2-xfer.arc-ts.umich.edu:~/remotefile

To copy an entire directory, type:

scp -r localdir uniqname@armis2-xfer.arc-ts.umich.edu:~/remotedir

These commands can also be reversed in order to copy files from Armis2 to your machine:

scp -r uniqname@armis2-xfer.arc-ts.umich.edu:~/remotedir localdir

You will need to authenticate via DUO to complete the file transfer.
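
If you prefer an interactive transfer session over individual scp commands, the command-line sftp client can also be used. A minimal sketch, where uniqname and the file names are placeholders:

sftp uniqname@armis2-xfer.arc-ts.umich.edu
sftp> put localfile        # upload a file to your home directory
sftp> get remotefile       # download a file to your current local directory
sftp> exit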

5. Submit a job

This is a simple guide to get your jobs up and running. For more advanced Slurm features, see the Slurm User Guide for Armis2. If you are familiar with using the resource manager Torque, you may find the migrating from Torque to Slurm guide useful.

Batch Jobs

Most work on Armis2 is run as batch jobs, which are described by a batch script. The sbatch command is used to submit a batch script to Slurm. To submit a batch script, simply run the following from a shared file system; these include your home directory, /scratch, and any directory under /nfs that you can normally use in a job. Output will be written to the working directory from which the job was submitted (jobName-jobID.log). Do not submit jobs from /tmp or any of its subdirectories.

$ sbatch myJob.sh

The batch job script is composed of three main components:

  • The interpreter used to execute the script
  • #SBATCH directives that convey submission options
  • The application(s) to execute along with its input arguments and options

Example:

#!/bin/bash
# The interpreter used to execute the script

# "#SBATCH" directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-type=BEGIN,END
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m 
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

# The application(s) to execute along with its input arguments and options:

/bin/hostname
sleep 60
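
Once the script is submitted, sbatch reports the new job's ID, which can then be used to monitor or cancel the job from the login node. A brief sketch, where the job ID and uniqname are placeholders:

$ sbatch myJob.sh
$ squeue -j 12345          # check the state of a specific job
$ squeue -u uniqname       # list all of your queued and running jobs
$ scancel 12345            # cancel the job if necessary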

How many nodes and processors you request will depend on the capability of your software and what it can do. There are four common scenarios.

Example: One Node, One Processor

This is the simplest case and is shown in the example above. The majority of software cannot use more than this. Some examples of software for which this would be the right configuration are SAS, Stata, R, many Python programs, and most Perl programs.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: One Node, Multiple Processors


This is similar to what a modern desktop or laptop is likely to have. Software that can use more than one processor may be described as multicore, multiprocessor, or multithreaded. Some examples of software that can benefit from this are MATLAB and Stata/MP. You should read the documentation for your software to see if this is one of its capabilities.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, One Process per CPU

This is the classic MPI approach, where multiple machines are requested and one process per processor is started on each node using MPI. This is the way most MPI-enabled software is written to work.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, Multiple CPUs per Process


This is often referred to as the “hybrid mode” MPI approach, where multiple nodes and multiple processes per node are requested. MPI starts a parent process (or processes) on each node, and each of those can in turn use more than one processor for threaded calculations.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Common Job Submission Options

  • Job name: --job-name=<name> (example: --job-name=armis2job1)
  • Account: --account=<account> (example: --account=test)
  • Queue: --partition=<partition_name> (example: --partition=standard). Available partitions: standard, gpu (GPU jobs only), largemem (large memory jobs only)
  • Wall time limit: --time=<hh:mm:ss> (example: --time=02:00:00)
  • Node count: --nodes=<count> (example: --nodes=2)
  • Process count per node: --ntasks-per-node=<count> (example: --ntasks-per-node=1)
  • Minimum memory per processor: --mem-per-cpu=<memory> (example: --mem-per-cpu=1000m)
  • Request software license(s): --licenses=<application>@slurmdb:<N> (example: --licenses=stata@slurmdb:1 requests one license for Stata)
  • Request event notification: --mail-type=<events> (example: --mail-type=BEGIN,END,FAIL). Multiple mail-type requests may be specified in a comma-separated list, e.g. --mail-type=BEGIN,END,NONE,FAIL,REQUEUE

Please note that if your job is set to utilize more than one node, your code must be MPI-enabled in order to run across these nodes, and you must use srun rather than mpirun or mpiexec. More advanced job submission options can be found in the Slurm User Guide for Armis2.
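
For the multiple-node examples above, the end of the batch script typically loads an MPI implementation and launches the program with srun. A minimal sketch, in which the module names and program name are only placeholders (use module keyword to find what is actually installed):

# Load a compiler and MPI implementation (module names are assumptions).
module load gcc openmpi

# srun starts one copy of the program per task across all allocated nodes.
srun ./my_mpi_program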

Interactive Jobs

An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. Interactive jobs are useful when debugging or interacting with an application. The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[user@login ~]$ srun --pty /bin/bash 
srun: job 309 queued and waiting for resources 
srun: job 309 has been allocated resources 
[user@node0001 ~]$ hostname 
node0001.armis2.arc-ts.umich.edu 
[user@node0001 ~]$

Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024 MB of memory. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes, each with 4 CPUs and 4 GB of memory:

[user@login ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --pty /bin/bash
srun: job 894 queued and waiting for resources
srun: job 894 has been allocated resources
[user@node0001 ~]$ srun hostname
node0001.armis2.arc-ts.umich.edu
node0001.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu
node0001.armis2.arc-ts.umich.edu
node0001.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu
node0002.armis2.arc-ts.umich.edu

In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job. See the srun man page for more details.
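
For example, to run a command on only part of the allocation above (a sketch; here 4 tasks are placed on a single one of the two allocated nodes):

[user@node0001 ~]$ srun --nodes=1 --ntasks=4 hostname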

GPU and Large Memory Jobs

Jobs can request GPUs with the job submission option --partition=gpu and one of the count options below. All counts can be specified as gputype:number or just a number (the default GPU type will be used). Available GPU types can be found with the command sinfo -O gres -p <partition>. GPUs can be requested in both batch and interactive jobs.

  • GPUs per node: --gpus-per-node=<gputype:number> (example: --gpus-per-node=2 or --gpus-per-node=v100:2)
  • GPUs per job: --gpus=<gputype:number> (example: --gpus=2 or --gpus=v100:2)
  • GPUs per socket: --gpus-per-socket=<gputype:number> (example: --gpus-per-socket=2 or --gpus-per-socket=v100:2)
  • GPUs per task: --gpus-per-task=<gputype:number> (example: --gpus-per-task=2 or --gpus-per-task=v100:2)
  • CPUs required per GPU: --cpus-per-gpu=<number> (example: --cpus-per-gpu=4)
  • Memory per GPU: --mem-per-gpu=<number> (example: --mem-per-gpu=1000m)

Jobs can request nodes with large amounts of RAM with --partition=largemem.
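
As an illustration, a minimal GPU batch job might look like the following sketch; the resource values and account name are placeholders, and the available GPU types should be confirmed with sinfo -O gres -p gpu:

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus=1
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=8g
#SBATCH --time=01:00:00
#SBATCH --account=test
#SBATCH --partition=gpu

# Show the GPU assigned to the job (assumes the NVIDIA tools are present on the node).
nvidia-smi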

Job Status

Most of a job’s specifications can be seen by invoking scontrol show job <jobID>. The job’s batch script can be written to a file using scontrol write batch_script <jobID> output.txt. If no output file is specified, the script is written to slurm<jobID>.sh.

A job’s record remains in Slurm’s memory for 30 minutes after it completes.  scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago.  At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.
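
For example (the job ID is a placeholder):

$ scontrol show job 12345                    # full specification while the record is in memory
$ scontrol write batch_script 12345 job.sh   # save the job's batch script to job.sh
$ sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS   # record from the Slurm database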

To view TRES (Trackable RESource) utilization by user or account, use the following commands, substituting your own dates, account, users, and TRES type for the placeholder values:

Shows TRES usage by all users on an account during a date range:
sreport cluster UserUtilizationByAccount start=mm/dd/yy end=mm/dd/yy account=test --tres <type>

Shows TRES usage by specified user(s) on an account during a date range:
sreport cluster UserUtilizationByAccount start=mm/dd/yy end=mm/dd/yy users=un1,un2 account=test --tres <type>

Lists users alphabetically along with TRES usage and totals during a date range:
sreport cluster AccountUtilizationByUser start=mm/dd/yy end=mm/dd/yy tree account=test --tres <type>
Possible TRES types:

cpu
mem
node
gres/gpu
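
For example, to show CPU usage for each user on the test account over a one-month range (the dates and account name are placeholders):

sreport cluster AccountUtilizationByUser start=01/01/24 end=02/01/24 tree account=test --tres cpu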

For more reporting options, see the Slurm sreport documentation.


Software

The Armis2 cluster uses the Lmod modules system to provide access to centrally installed software. If you have used a cluster at U-M previously, you should review the documentation for the module system, as we have changed the configuration to match that used at most national clusters and other university clusters.

In particular, you should use the command module keyword to look for a module; do not use module available to search for software, as module available only shows software for which all the dependencies (or prerequisites) are already loaded.

So, to search for the software package FFTW, use

$ module keyword fftw

That will show which versions are installed and provide a command to determine what is needed to load it.
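
A typical sequence, using FFTW as the example package (the prerequisite module names shown are only illustrations; module keyword and module spider report what is actually required on Armis2):

$ module keyword fftw       # find FFTW modules and their descriptions
$ module spider fftw        # list versions; run module spider fftw/<version> for load instructions
$ module load gcc fftw      # load the prerequisites and then the package (names are examples)
$ module list               # confirm what is currently loaded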

Please see our page on using the Lmod modules system for more details and examples.

There are two main categories of software available on the system: software that is installed as part of the operating system and software that is installed separately. No special action is needed to use the software installed with the operating system. The separately installed software is accessed through modules: loading a module sets up the environment and makes the software available. We do it this way to allow multiple versions of the same package and to avoid conflicts between software packages that have mutually exclusive system requirements.

Requesting software licenses

Many of the software packages that are licensed for use on ARC clusters are licensed for a limited number of concurrent uses. If you will use one of those packages, then you must request a license or licenses in your submission script. As an example, to request one Stata license, you would use

#SBATCH --licenses=stata@slurmdb:1

The list of software can be found from Armis2 by using the command

$ scontrol show licenses


Policies

Partition Policies

Slurm partitions represent collections of nodes for a computational purpose, and are equivalent to Torque queues. For more Armis2 hardware specifications, see the Configuration page.

Partitions:

  • debug: The goal of debug is to allow users to run jobs quickly for debugging purposes.
    • Max walltime: 4 hours
    • Max jobs per user: 1
    • Higher scheduling priority
  • standard: Standard compute nodes used for most work.
    • Max walltime: 14 days
    • Default partition if none specified
  • gpu: Allows use of NVIDIA Tesla V100 GPUs.
    • Max walltime: 14 days
  • largemem: Allows use of a compute node with 1.5 TB of RAM.
    • Max walltime: 14 days

Account/Association Limits

To facilitate fairness between accounts, we have set resource limits on each Armis2 root account; these are described below.

Limits can be set on a Slurm association or on a Slurm account. This allows a PI to limit individual users or the collective set of users in an account as the PI sees fit. The following values can be used to limit either an account or a user association, unless noted otherwise below:

  • MaxJobs
    • Maximum number of jobs allowed to run at one time
    • Account example: testaccount can have 10 simultaneously running jobs (testuser1 has 8 running jobs and testuser2 has 2 running jobs for a total of 10 running jobs)
    • Association example: testuser can have 2 simultaneously running jobs
  • MaxWall
    • Maximum duration of a job
    • Account example: all users on testaccount can run jobs for up to 3 days
    • Association example: testuser’s jobs can run up to 3 days
  • MaxTRES (CPU, Memory, GPU or billing units)
    • Maximum number of TRES the running jobs can simultaneously use
    • NOTE: CPU, Memory, and GPU can also be limited on a user’s individual job
    • Account example: testaccount’s running jobs can collectively use up to 5 GPUs (testuser1’s jobs are using 3 GPUs and testuser2’s jobs are using 2 GPUs for a total of 5 GPUs)
    • Association example: testuser’s running jobs can collectively use up to 10 cores
    • Job example: testuser can run a single job using up to 10 cores
  • GrpTRESMins (billing units)
    • The total number of TRES minutes that can possibly be used by past, present and future jobs. This is primarily used for setting spending limits
    • Account example: all users on testaccount share a spending limit of $1000
    • Association example: testuser has a spending limit of $1000
  • GrpTRESRunMins
    • The total number of TRES minutes used by all running jobs. This takes into consideration the time limit of running jobs. If the limit is reached no new jobs are started until other jobs finish.
    • Account example: all users on testaccount share a pool of 1000 CPU minutes for running jobs (users have 10 serial jobs each with 100 minutes remaining to completion)
    • Association example: testuser can have up to 100 CPU minutes of running jobs (1 job with 100 CPU minutes remaining, 2 with 50 minutes remaining, etc.)

Periodic Spending Limits

The PI has the ability to set a monthly or yearly (fiscal year) spending limit on a Slurm account. Spending limits will be updated at the beginning of each month. As an example, if the testaccount account has a monthly spending limit of $1000 and this is used up on January 22nd, jobs will be unable to run until February 1st when the limit will reset with another $1000 to spend.

Please contact ARC-TS if you would like to implement any of these limits.

ARMIS2 TERMS OF USAGE

  1. This service is for sensitive data only. Be advised that you should not move sensitive data off of this system, unless it is to another service or machine that has been approved for hosting the same types of sensitive data.
  2. Limited data restoration. The data in your home directory can be restored from snapshots going back 3 days. Anything beyond 3 days cannot be retrieved. Data stored outside your home directory, such as on a group share, is subject to the data-lifetime policies set up at the time the respective Turbo NFS volume was purchased. You are responsible for mitigating your own risk. We suggest you store copies of hard-to-reproduce data in your home directory or on HIPAA-aligned storage you own or have purchased from Turbo.
  3. System usage is tracked and is used for billing reports and capacity planning. Job metadata (for example: walltime, resource utilization, software accessed) is stored and used to generate usage reports and to analyze patterns and trends. ARC-TS may report this metadata, including your individual metadata, to your adviser, department head, dean, or other administrator or supervisor for billing or capacity planning purposes.
  4. Maintaining the overall stability of the system is paramount to us. While we make every effort to ensure that every job completes in the most efficient and accurate way possible, the good of the whole is more important to us than the good of an individual. This may affect you, but mostly we hope it benefits you. System availability is based on our best efforts. We are staffed to provide support during normal business hours. We try very hard to provide support as broadly as possible, but cannot guarantee support on a 24 hour per day basis. Additionally, we perform system maintenance on a periodic basis, driven by the availability of software updates, staffing availability, and input from the user community. We do our best to schedule around your needs, but there will be times when the system is unavailable. For scheduled outages, we will announce them at least one month in advance on the ARC-TS home page; for unscheduled outages we will announce them as quickly as we can with as much detail as we have on that same page. You can also follow ARC-TS on Twitter at @ARCTS_UM.
  5. Armis2 is intended only for non-commercial, academic research and instruction. Commercial use of some of the software on Armis2 is prohibited by software licensing terms. Prohibited uses include product development or validation, software use supporting any service for which a fee is charged, and, in some cases, research involving proprietary data that will not be made available publicly regardless of whether the research is published. Please contact arcts-support@umich.edu if you have any questions about this policy, or about whether your work may violate these terms.
  6. Data subject to export control and HIPAA regulations may be stored or processed on the cluster. The appropriate storage solution for export controlled information or PHI that can be accessed on the Armis2 cluster is the Turbo-NFSv4 with Kerberos offering (see the Sensitive Data Restrictions for Turbo-NFSv4 with Kerberos for further details). It is your responsibility, not ARC’s, to be aware of and comply with all applicable laws, regulations, and university policies (e.g., ITAR, EAR, HIPAA) as part of any research activity that may raise compliance issues under those laws. For assistance with export controlled research, contact the U-M Export Control Officer at exportcontrols@umich.edu. For assistance with HIPAA-related computational research, contact the ARC liaison to the Medical School at msis.help@umich.edu.

USER RESPONSIBILITIES

Users should make requests by email to arcts-support@umich.edu:

  • At least one day in advance, request that users be added to Armis2 accounts you administer. All users need approval to be added to an account on Armis2 before they can have a user login created on the cluster.

Users are responsible for security and compliance related to sensitive code and/or data. Security and compliance are shared responsibilities. If you process or store sensitive university data, software, or libraries on the cluster, you are responsible for understanding and adhering to any relevant legal, regulatory or contractual requirements.

Users are responsible for maintaining MCommunity groups used for MReport authorizations.

Users must manage PHI (protected health information) appropriately and can use the following locations:

  • /home (80 GB quota)
  • /scratch (more information below)
  • /tmp
  • Any appropriate PHI-compliant NFS volume mounted on Armis2

SCRATCH STORAGE POLICIES

Every user has a /scratch directory for every Slurm account they are a member of. Additionally, for each account there is a shared data directory for collaboration with other members of that account. The account directory’s group ownership is set using the Slurm account-based UNIX groups, so all files created in the /scratch directory are accessible by any group member, to facilitate collaboration.

Example:
/scratch/msbritt_root
/scratch/msbritt_root/msbritt
/scratch/msbritt_root/shared_data

There is a 10 TB quota on /scratch per root account (a PI or project account), which is shared between child accounts (individual users).

Users should keep in mind that /scratch has an auto-purge policy: any file that has not been accessed in 60 days is automatically deleted by the system. Scratch file systems are not backed up. Critical files should be copied to another location.
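
To see which of your files are approaching the purge threshold, a standard find command can be used; the path here is the example user directory shown above, so substitute your own:

# List files not accessed in the last 45 days; these will be purged once they reach 60 days without access.
find /scratch/msbritt_root/msbritt -type f -atime +45 -ls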

LOGIN NODE POLICIES

Appropriate uses for the login nodes:

  • Transferring small files to and from the cluster
  • Creating, modifying, and compiling code and submission scripts
  • Submitting and monitoring the status of jobs
  • Testing executables to ensure they will run on the cluster and its infrastructure. Processes are limited to a maximum of 15 minutes of CPU time to prevent runaway processes and overuse.

Any other use of the login nodes may result in termination of the offending process. Any production processes (including post-processing) should be submitted through the batch system to the cluster. If interactive use is required, you should submit an interactive job to the cluster.


Updates and Notices

This section will be updated when system level changes are made to Armis2. There are currently no updates.
