Using Flux

The first step in using Flux is establishing a user account. A Flux allocation is also needed before you can run jobs.

Getting Started

Flux in 10 Easy Steps

1. Get Duo

You must use Duo authentication to log on to Flux. For more details see http://its.umich.edu/two-factor-authentication.

2. Get a Flux user account

You need to have a user account to use Flux. Apply for one using this form.

3. Get access to an allocation

If you don’t have one already, you must purchase an allocation (or get access to an existing one) before you can use Flux. To request access, email hpc-support@umich.edu and one of our support staff will help you get set up. For more, see our Allocation pages. (LSA users have access to a public allocation.)

4. Get an SSH client

You need a terminal emulator with an SSH client to log into Flux. If needed, this video will teach you some basic Linux navigation commands.

If you are trying to log in from off campus, or using an unauthenticated wireless network such as MGuest, you have a couple of options:

    • Install VPN software on your computer
    • First ssh to login.itd.umich.edu, then ssh to flux-login.arc-ts.umich.edu from there.

Here’s what a login looks like using a terminal emulator:

Mac using terminal: Open terminal

Type: ssh -l uniqname flux-login.arc-ts.umich.edu (replacing uniqname with your own uniqname)

Windows using PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/)

Launch PuTTY and enter flux-login.arc-ts.umich.edu as the host name, then click Open.


For both Mac and Windows:

At the “Enter a passcode or select one of the following options:” prompt, type the number of your preferred choice for Duo authentication.


5. Get files

The server you can use to get files onto your home directory on Flux is flux-xfer.arc-ts.umich.edu. Here’s what it looks like using Cyberduck on a Mac (Video: http://tinyurl.com/flux-cyberduck):

  1. Open Cyberduck and click the Open Connection button.

  2. Set the “Server:” to be flux-xfer.arc-ts.umich.edu.

  3. Set your “Username:” to be your uniqname.

  4. Enter your Kerberos password.

  5. Click Connect.

  6. Drag and drop files between the two systems. Click the Disconnect button when completed.

An alternative for Windows is WinSCP, which you can get from the U-M Blue Disc site. Here’s what transferring files or directories looks like from the command line (Linux or Mac):

% scp localfile uniqname@flux-xfer.arc-ts.umich.edu:remotefile (copy a file)

or

% scp -r localdir  uniqname@flux-xfer.arc-ts.umich.edu:remotedir    (copy an entire directory)
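Copying in the other direction works the same way; for example (a sketch with placeholder file names):

% scp uniqname@flux-xfer.arc-ts.umich.edu:remotefile localfile    (copy a file from Flux to your computer)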

Globus is a great alternative especially for large data transfers. Learn more at the Globus web site or this how-to video.

6. Get an editor

nano is an easy-to-use editor available for editing your files on Flux.
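For example, to create or edit the sample batch script used in the next step, you might run:

$ nano sample.pbs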

7. Get a PBS batch script

The cluster requires some information to wrap your code so that your job can be submitted for execution. Wrapping jobs this way allows the Flux scheduler to fit jobs to the available processing on the system. This means you must create a batch file to submit your code.

Video: PBS Basics.

Sample batch file, named sample.pbs:

#!/bin/sh
####  PBS preamble

#PBS -N sample_job

# Change "bjensen" to your uniqname:
#PBS -M bjensen@umich.edu
#PBS -m abe

# Change the number of cores (ppn=1), amount of memory, and walltime:
#PBS -l nodes=1:ppn=1,mem=2000mb,walltime=01:00:00
#PBS -j oe
#PBS -V

# Change "example_flux" to the name of your Flux allocation:
#PBS -A example_flux
#PBS -q flux
#PBS -l qos=flux

####  End PBS preamble

#  Show list of CPUs you ran on, if you're running under PBS
if [ -n "$PBS_NODEFILE" ]; then cat $PBS_NODEFILE; fi

#  Change to the directory you submitted from
if [ -n "$PBS_O_WORKDIR" ]; then cd $PBS_O_WORKDIR; fi

#  Put your job commands here:
echo "Hello, world"

8. Get modules

Flux makes software available by packaging it in modules. You must load the modules that your job needs before you submit it. The commands are:

module avail
module list
module load modulename      (e.g., module load R)
module unload modulename    (e.g., module unload R)
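For example, if your job uses R, you might check what is available, load it, and confirm before submitting (a sketch):

$ module avail R
$ module load R
$ module list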

9. Get a job

You can submit your job and check its status in the queueing system by using the commands below.

Submitting your Job:

qsub filename.pbs    (e.g., qsub sample.pbs; prints the job ID on successful submission)

Checking the status of your Job:

For a single job:  qstat jobid    OR    checkjob jobid    (just the numeric portion)

To see all of your jobs:  qstat -u uniqname    OR    showq -w user=uniqname
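For example, a submission followed by a status check might look like this (the job ID shown is hypothetical):

$ qsub sample.pbs
11800755.nyx.arc-ts.umich.edu
$ qstat 11800755
$ checkjob 11800755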

Deleting your job:

qdel  jobid

If a job doesn’t start within 30 minutes, send an email with a copy of your PBS script and the job number to flux-support@umich.edu.  Do not delete the job.

10. Get output

When your job is completed, you will receive an email similar to this one:

PBS Job Id: 11800755.flux.arc-ts.umich.edu
Job Name:   sample_job
Exec host:  flux5301/0
Exit_status=0
resources_used.cput=00:00:03
resources_used.mem=768124kb
resources_used.vmem=770420kb
resources_used.walltime=00:00:13

Exit_status=0 means “OK.” Anything else indicates an error occurred when running the PBS script — check the PBS output and error files to find out what the problem is.

resources_used.vmem=770420kb means your job used a total of 770 MB of memory. This may be used to revise your memory estimate for future submissions.

resources_used.walltime=00:00:13 means your job took 13 seconds to execute. This may be used to revise your walltime estimate for future submissions.

If you have specified the -j oe option in your PBS script, all script output and error message output will appear in a file named jobname.ojobid in your submission directory after the job completes.  In the above sample, the output generated by your batch script (here “Hello, world”) will be placed into the file sample_job.o11800755 .
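For example, to view that output file from the directory you submitted the job from:

$ cat sample_job.o11800755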

Back To Top

Flux Tips (Dos and Don’ts)

Do:

  • use the U-M VPN in order to log in to Flux from off campus. WHY: Flux login is restricted to campus IP addresses.
  • have your jobs read their input from and write their output to the /scratch filesystem. WHY: /scratch is much faster and more reliable than your Flux home directory.
  • remember to load any modules your job needs each time after logging in to Flux but before submitting any jobs.  Modules that you will always be using can be loaded automatically when you log in by putting the “module load” commands in your ~/privatemodules/default file. WHY: Software will not be available to your job if it is not loaded prior to submitting the job.
  • request 20% more than the maximum memory and maximum walltime you think your jobs might need.  WHY: if a job exceeds the requested memory or walltime, it will be terminated before it can finish.
  • use “#PBS -j oe” in your PBS scripts to combine the PBS output and error messages into a single file.  WHY: It is much easier to figure out what your job did (you won’t have to match up lines between the two files).
  • submit lots of jobs at once rather than submitting one job and waiting for it to complete before submitting another.  WHY: “keeping the queue full” will give you the overall best throughput and utilization for your Flux allocation.
  • perform regular backups of all of your data on Flux yourself, including data in your home directory and in /scratch. WHY: If you lose a file, the Flux staff can’t get it back for you.
  • submit interactive jobs using “qsub -I”.  WHY: Interactive jobs on the login nodes will be terminated after 15 minutes or before if they are disrupting normal service.
  • send any requests for help as a new email (not a reply to a previous email) to hpc-support@umich.edu.  WHY: You’ll get quicker help if you don’t send email to individuals directly and don’t reply to old (unrelated) support tickets that may be closed already.
  • run “qdel $(qselect -u $USER)” to delete all your jobs if you need to terminate all your jobs.

Don’t:

  • run interactive jobs or do significant computation on the Flux login nodes.  WHY: processes on the Flux login nodes will be automatically terminated if they use 15 minutes of CPU time, or before if they are disrupting normal service.
  • use /scratch space for long-term storage; files that you’re not using for two weeks or longer should be moved to your home directory or another system. WHY: /scratch is a limited, shared resource; also, no files anywhere on Flux are backed up, and cannot be recovered if lost.
  • run “qdel all”. WHY: It will lock up the cluster scheduler for a long time trying to delete jobs that do not belong to you and that you do not have permission to delete.

Back To Top

Flux FAQs

If these Frequently Asked Questions don’t address your issue, please email hpc-support@umich.edu. Please include your job ID number for specific job troubleshooting.

How can I start using Flux?

In order to use Flux, you will first need a User Account. You can apply for a User Account by filling out this form.

A Flux User Account is different from a Flux Account. A Flux User Account is used by a single user to log onto the Flux nodes, whereas a Flux Account is a collection of Flux User Accounts that are associated with one or more Flux allocations.
Flux uses two-factor authentication for security purposes, so you will also need to get an MToken to be able to log in to one of our login nodes. Once you have a User Account and an MToken, you will be able to copy data to your Flux home directory and run small programs on the Flux login nodes.
You will need access to a Flux Account with an allocation in order to run jobs. A Flux Account must be paid for, so this is typically provided by a faculty member or your school or college.

How do I log on to Flux?

To connect to Flux you should use secure shell (ssh) from a terminal on your computer to one of the Flux login nodes. If you are using a Mac- or Linux-based operating system, you can use the default terminal for this. If you are using Windows, you will need to download a program. One popular program is PuTTY.

If you are not logged on to a computer using your uniqname, you should specify your username when connecting.
For example:

$ ssh uniqname@flux-login.engin.umich.edu

You will be prompted to input your MToken passcode. (Visit the MToken site for more information on obtaining an MToken.) After entering your MToken passcode successfully, you will be prompted to input your password. This is the Kerberos password that you use to log in to University services.

[Screenshot: connecting to a login node from a Mac]

[Screenshot: connecting to a login node with PuTTY on Windows]

What is the difference between a compute node and a login node?

A login node is a computer that you can connect to directly through ssh. The login nodes can be used to copy files to your home directory and to queue jobs to run on the compute nodes. The compute nodes are where the actual jobs are run. Compute nodes are automatically assigned to a job when a PBS script is submitted to the job scheduler.

Which Flux Accounts can I use?

If you are a member of the College of Engineering or the College of Literature, Science, and the Arts, you have access to a Flux Account with a College funded allocation.
For Engineering, the Flux Account name is engin_flux. For LSA, it is lsa_flux.
To view which Flux Accounts you have access to, use the following command on one of the flux-login nodes:

$ mdiag -u <username>

The ALIST field in the output lists all of the Flux Accounts you are authorized to use. Note that some of these accounts might not have an active allocation. Our example user has access to run jobs on engin_flux and FluxTraining_flux. It is important to note that default_flux cannot be used to run jobs and is simply a placeholder.

How do I access an existing Flux Account?

To be given access to a Flux Account that your group is already using, please have an administrator of that Flux Account send an email to flux-support@umich.edu requesting that you be added. The administrator will usually be the person who pays for the account, or a delegated manager.

How many jobs can I have queued or running at once?

There can be up to 5,000 jobs per user in each queue (like flux or fluxm). There is no built-in limit to the number of jobs a user can have running at one time, but specific Flux Accounts can have per-user job limits. For example, engin_flux allows a maximum of 10 jobs per user at one time.

What is PBS?

Portable Batch System (PBS) is the system that the cluster uses to schedule jobs. Users use PBS scripts to specify important information, such as number of processors and memory that a job requires, to the scheduler. Visit this page for more information about PBS.

What do I change in my PBS script when running on a different Flux Account?

When using standard Flux Accounts, those that end in _flux, you must set the qos parameter and queue parameter to flux in your PBS script.
For example:

#PBS -A example_flux
#PBS -l qos=flux
#PBS -q flux

When using Large Memory Flux Accounts, those that end in _fluxm, you must set the qos to flux and queue to fluxm in your PBS script.

#PBS -A example_fluxm
#PBS -l qos=flux
#PBS -q fluxm

When using GPU Flux Accounts, those that end in _fluxg, you must set the qos to flux and queue to fluxg in your PBS script.
For example:

#PBS -A example_fluxg
#PBS -l qos=flux
#PBS -q fluxg

When using Flux on Demand (FoD) Accounts, those that end in _fluxod, you must set the qos to flux and queue to fluxod in your PBS script.

#PBS -A example_fluxod
#PBS -l qos=flux
#PBS -q fluxod

When using Flux Operating Environment (FoE) Accounts, those that end in _fluxoe, you must set the qos to flux and queue to fluxoe in your PBS script.

#PBS -A example_fluxoe
#PBS -l qos=flux
#PBS -q fluxoe

What security measures are in place for Flux?

The Flux high-performance computing system at the University of Michigan has been built to provide a flexible and secure HPC environment. Flux is an extremely scalable, flexible, and reliable platform that enables researchers to match their computing capability and costs with their needs while maintaining the security of their research.

Built-in Security Features

Applications and data are protected by secure physical facilities and infrastructure as well as a variety of network and security monitoring systems. These systems provide basic but important security measures including:

  • Secure access – All access to Flux is via SSH or Globus. SSH has a long history of strong security. Globus provides basic security and supports additional security if you need it.
  • Built-in firewalls – All of the Flux computers have firewalls that restrict access to only what is needed.
  • Unique users – Flux adheres to the University guideline of one person per login ID and one login ID per person.
  • Multi-factor authentication (MFA) – For all interactive sessions, Flux requires both a UM Kerberos password and an MToken. File transfer sessions require a Kerberos password.
  • Private Subnets – Other than the login and file transfer computers that are part of Flux, all of the computers are on a network that is private within the University network and are unreachable from the Internet.
  • Flexible data storage – Researchers can control the security of their own data storage by securing their storage as they require and having it mounted via NFSv3 or NFSv4 on Flux. Another option is to make use of Flux’s local scratch storage, which is considered secure for many types of data. Note: Flux is not considered secure for data covered by HIPAA.

Flux/Globus & Sensitive Data

To find out what types of data may be processed in Flux or Globus, visit the U-M Sensitive Data Guide to IT Resources.

Additional Security Information

If you require more detailed information on Flux’s security or architecture to support your data management plan or technology control plan, please contact the Flux team at hpc-support@umich.edu.

We know that it’s important for you to understand the protection measures that are used to guard the Flux infrastructure. But since you can’t physically touch the servers or walk through the data centers, how can you be sure that the right security controls are in place?

The answer lies in the third-party certifications and evaluations that Flux has undergone. IIA has evaluated the system, network, and storage practices of Flux and Globus. The evaluation for Flux is published at http://safecomputing.umich.edu/dataguide/?q=node/151 and the evaluation for Globus is published at http://safecomputing.umich.edu/dataguide/?q=node/155.

Shared Security and Compliance Responsibility

Because you’re managing your data in the Flux high-performance computing environment, the security responsibilities will be shared.

Flux operators have secured the underlying infrastructure, and you are obligated to secure anything you put on that infrastructure yourself, as well as to meet any other compliance requirements.  These requirements may be derived from your grant or funding agency, from data owners or stewards other than yourself, or from state or federal laws and regulations.

The Flux support staff is available to help manage user lists for data access, and information on how to manage file system permissions is publicly available; see http://en.wikipedia.org/wiki/File_system_permissions.

Contacting Flux Support

The Flux Support Team encourages communications, including for security-related questions. Please email us at hpc-support@umich.edu.

We have created a PGP key for especially sensitive communications you may need to send.

-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1

mQENBFNEDlUBCACvXwy9tYzuD3BqSXrxcAEcIsmmH52066R//RMaoUbS7AcoaF12
k+Quy/V0mEQGv5C4w2IC8Ls2G0RHMJ2PYjndlEOVVQ/lA8HpaGhrSxhY1bZzmbkr
g0vGzOPN87dJPjgipSCcyupKG6Jnnm4u0woAXufBwjN2wAP2E7sqSZ2vCRyMs4vT
TGiw3Ryr2SFF98IJCzFCQAwEwSXZ2ESe9fH5+WUxJ6OM5rFk7JBkH0zSV/RE4RLW
o2E54gkF6gn+QnLOfp2Y2W0CmhagDWYqf5XHAr0SZlksgDoC14AN6rq/oop1M+/T
C/fgpAKXk1V/p1SlX7xL230re8/zzukA5ETzABEBAAG0UEhQQyBTdXBwb3J0IChV
bml2ZXJzaXR5IG9mIE1pY2hpZ2FuIEhQQyBTdXBwb3J0IEdQRyBrZXkpIDxocGMt
c3VwcG9ydEB1bWljaC5lZHU+iQE+BBMBAgAoBQJTRA5VAhsDBQkJZgGABgsJCAcD
AgYVCAIJCgsEFgIDAQIeAQIXgAAKCRDHwuoUZnHdimrSB/4m6P7aQGnsbYVFspJ8
zquGRZd3fDU/IaCvLyjsUN4Qw1KFUmqQjvvfTxix7KjlNMcGy1boUCWKNNk1sFtb
E9Jr2p6Z/M7pm4XWhZIs1UIfHr3XgLdfbeYgXpt4Md2G6ttaXv44D10xL2LYCHE8
DnSVv+2SIG9PhaV+h+aBUo4yKwTwVBZsguU1Z1fsbiu6z6iDrzU2dlQp0NLmw73G
v5HUdYdu/YJdh5frp/2XorLXynrEyCk1SxViXrHY6dc9Y3bUjwl0MOJypLuRhQmj
kVwHIsNsRg1YJ6iyJzom33C7YdRktBiPpstkYDHJf/PVRAw1G4dkyjfUfG2pIoQd
WjOxuQENBFNEDlUBCADNwZ5edW/e08zYFWSGVsdpY4HM2CdsVqkuQru2puHhJqg4
eWS9RAdJ6fWp3HJCDsDkuQr19B3G5gEWyWOMgPJ9yW2tFVCrVsb9UekXAWh6C6hL
Tj+pgVVpNDTYrErYa2nlll0oSyplluVBRlzDfuf4YkHDy2TFd7Kam2C2NuQzLQX3
THhHkgMV+4SQZ+HrHRSoYPAcPb4+83dyQUo9lEMGcRA2WqappKImGhpccQ6x3Adj
/HFaDrFT7itEtC8/fx4UyaIeMszNDjD1WIGBJocOdO7ClIEGyCshwKn5z1cCUt72
XDjun0f1Czl6FOzkG+CHg5mf1cwgNUNx7TlVBFdTABEBAAGJASUEGAECAA8FAlNE
DlUCGwwFCQlmAYAACgkQx8LqFGZx3YrcqggAlKZhtrMDTHNki1ZTF7c7RLjfN17H
Fb342sED1Y3y3Dm0RVSQ2SuUWbezuDwov6CllgQR8SjBZ+D9G6Bt05WZgaILD7H0
LR9+KtBNYjxoVIdNHcGBf4JSL19nAI4AMWcOOjfasGrn9C60SwiiZYzBtwZa9VCi
+OhZRbmcBejBfIAWC9dGtIcPHBVcObT1WVqAWKlBOGmEsj/fcpHKkDpbdS7ksLip
YLoce2rmyjXhFH4GXZ86cQD1nvOoPmzocIOK5wpIm6YxXtYLP07T30022fOV7YxT
mbiKKL2LmxN1Nb/+mf+wIZ5w2ZdDln1bbdIKRHoyS2HyhYuLd1t/vAOFwg==
=yAEg
-----END PGP PUBLIC KEY BLOCK-----

Why can't I connect to Flux from off campus?

Flux does not allow network connections from non-UM networks. To connect to Flux from off-campus, you can use the UM VPN client. Information about the UM VPN client, including terms of use, can be found at http://www.itcs.umich.edu/vpn/.
You can also use an intermediary machine, for example, login.itd.umich.edu. You first connect to it, then from it you connect to Flux. For example, you might use PuTTY to connect to login.itd.umich.edu, then from that login machine, use

% ssh flux-login.engin.umich.edu

where the % is the prompt on the login.itd machine.

How can I find out my job's status?

To determine your job’s status use either of the following commands on a login node:

$ qstat -u uniqname
$ qstat <jobid>

The status of the job is found in the column labeled with the letter “S”.
R means that the job is currently running
Q means that the job is waiting in the queue
C means that the job has already completed
E means that the job is in the process of exiting
H means that the job is on hold (generally set by the user or an unfulfilled job dependency)

Why isn’t my job running? I queued my job, but it isn’t running and has a status of queued.

Jobs may sit in queued status for a variety of reasons. The scheduler makes batch passes of queued jobs and determines if there are sufficient free resources to run a job, and then runs those that it can. If there are sufficient resources at the time a job is submitted, it may still sit in the queue for up to 15 minutes before the scheduler makes a batch pass and starts the job.
If there are not sufficient resources, the job will sit in the queue until resources open up. The most common limiting resources are processors and memory. For example, if you try to run a job on a Flux Account that has an allocation with a total of 10 processors and nine are in use, a job asking for two processors will have to wait in the queue until another processor becomes available. Once a processor is freed, the scheduler will assign the job to the two free processors and the job will run.

Why does my job have a status of Batchhold? I queued it but it isn't running.

Jobs are assigned a Batchhold by the scheduler when they have bad PBS credentials and will not run. Jobs are often given this status when the Flux Account name, qos, or queue are misspelled. Jobs can also be given this status if you try to run on a Flux Account to which you do not have access. If you cannot determine why your job is on Batchhold, please contact us at flux-support@umich.edu with your job number.

How many processors or how much memory does my Flux Account have?

You can check the resources available to a Flux Account with the command:

$ mdiag -a  <accountname_flux>


MAXPROC indicates the total number of processors available to the Flux Account.

MAXMEM  indicates the total amount of memory available to the Flux Account in megabytes.

If a Flux Account has a limit set for the maximum number of processors a single user can use at once, it will be indicated with MAXPROC[USER].

What jobs are currently running on the Flux Account I use?

You can check the jobs that are queued, blocked and running on a Flux Account with the command:

$ showq -w acct=accountname_flux

What does this e-mail mean: moab job resource violation: ``job ####### exceeded MEM usage soft limit``?

This message is sent when you use more memory than you asked for (default is 768 MB per core).

You can request additional memory by adding “#PBS -l pmem=###MB” to your PBS file, which asks for ###MB of memory per process requested (i.e., if you asked for 2 nodes with ppn=2 and pmem=3000MB, you will have asked for 12000MB of memory total). This is not in addition to the default, but replaces it.
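In a PBS script, the request described in that example would look something like this (a sketch; adjust the node, processor, and memory values to your own job):

#PBS -l nodes=2:ppn=2,pmem=3000mb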

How can I specify processor and computer layout with PBS?

Sometimes a program will want to have the processors it uses arranged across the computers in a particular way. There are several ways to tell PBS how many processors there are and on what machines they can go. We’ll look at three cases here, starting with the least detailed and proceeding to the most detailed:

  1. The processors can be anywhere;
  2. There must be a minimum of N processors on each computer;
  3. There must be exactly N processors on each computer.

1. If it does not matter how the processors are divided among the computers, you should use the procs property. Using the procs property usually allows an eligible job to run with the shortest time queued. Flux is a heterogeneous cluster with computers having varying numbers of processors; because of this, asking for a specific computer/processor combination may delay the start of your jobs.

If you want N processors spread across the first available processors, regardless of which physical computer they are on, you should use the command:

#PBS -l procs=N

2. If you would like a minimum of M processors on each computer, you should use the nodes property in conjunction with the ppn property.

For example:

#PBS -l nodes=N:ppn=M

Here, nodes is used to assign a group of M processors all to the same physical computer, and N gives the number of such groups. Using nodes in conjunction with ppn does not guarantee that the N groups of processors will all be put on separate physical computers. Because of this, you will get computers with at least M processors each, but you may end up with some multiple of M on a computer.

For example:

#PBS -l nodes=3:ppn=4

Here, the three groups of four processors could all end up on one computer, each group of four could end up on a separate computer, or eight processors could end up on one computer while four end up on another.

3. If you want a total of M processors with exactly N processors on each computer, you should use the tpn (tasks per node) property in conjunction with procs.

For example:

#PBS -l procs=M,tpn=N

The procs property specifies the total number of processors to be used across all computers when it is used together with the tpn property.

Assigning “procs=M” says that you want M processors total, and “tpn=N” says that you want exactly N of them to run on each physical computer. This would give you N processors running on each of M/N separate physical computers.
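For example, with hypothetical values:

#PBS -l procs=12,tpn=4

asks for 12 processors total with exactly 4 on each physical computer, i.e., 4 processors on each of 3 computers.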

How do I get my Network File System (NFS) shares, including Value Storage, mounted on Flux?

To get an NFS mounted on Flux you will need to contact the administrator of the NFS share and ask them to export it to the following IP ranges:

  • 141.212.30.0/23
  • 10.164.0.0/21
  • 10.224.0.0/21

For Value Storage shares purchased through ITS, the email address is: vstore-admins@umich.edu

Once the NFS share has been exported, please contact us at flux-support@umich.edu requesting that we mount it. Please be sure to include the name of the NFS share in both your email to the NFS share administrators and your email to flux-support@umich.edu.

May I process sensitive data using Flux?

Yes, but only if you use a secure storage solution like Mainstream Storage and Flux’s scratch storage. Flux’s home directories are provided by Value Storage, which is not an appropriate location to store sensitive institutional data.

One possible workflow is to use sftp or Globus to move data between a secure solution and Flux’s scratch storage, which is secure, bypassing your home directory or any of your own Value Storage directories.

Keep in mind that compliance is a shared responsibility. You must also take any steps required by your role or unit to comply with relevant regulatory requirements.

For more information on specific types of data that can be stored and analyzed on Flux, Value Storage, and other U-M services, please see the “Sensitive Data Guide to IT Services” web page on the Safe Computing website: http://safecomputing.umich.edu/dataguide/

Back To Top

How Flux Works

Flux Components

Cores

A core — the allocatable unit in Flux — is one processing element. Cores compose a central processing unit (CPU). In Flux most of the CPUs have twelve cores; some of the CPUs have sixteen or eight.

A node (or computer) comprises:

  • a number of CPUs,
  • memory (RAM),
  • a hard drive,
  • power supply, and
  • network connections.


In Flux all of the nodes are nearly identical, so there is no distinction made between nodes when scheduling jobs, making allocations, setting rates, or for billing.

Flux User Account

A Flux user account consists of a Linux login ID and password (the same as your U-M uniqname and umich.edu Kerberos password, respectively), and a home directory for your files. A Flux user account allows you to:

  • log in,
  • transfer files,
  • compile software, and
  • create and submit job submission scripts.

A user account alone cannot run any jobs.

Flux Project

A Flux project is a collection of Flux user accounts that are associated with one or more Flux allocations. A PI can have more than one Flux project to correspond with different research projects or funding sources, and a Flux user account can belong to more than one Flux project. The project owner (typically the PI or his/her designee) controls the list of users in a project and can change the users.

Flux Allocation

A Flux allocation:

  • describes the limits of how much of the Flux system members of a Flux project can use.
  • is the maximum number of cores and RAM a project can use at any time, and the number of months over which the project can access those cores and RAM.
  • determines your costs.

Your monthly bill is the number of cores in an allocation multiplied by the current rate. An active project requires at least one allocation, but can have as many as makes sense.

A common configuration is for a Flux project to have a “base” allocation of, for example, 50 cores that lasts for 48 months, and, from time to time, supplementary allocations that add to the base as the project grows.

The total number of cores the members of this example project can use collectively during January is 50 and, assuming 4GB RAM per core, the maximum amount of RAM the members of the project can use is 200GB.  One scenario is a set of jobs that fit within the memory limits of the allocation, but the last job in the set would exceed the number of available cores, and thus would wait.

An allocation is expressed as the number of cores multiplied by the duration (in seconds) of the allocation, represented as core*seconds. However, the number of cores that can be used at once is a hard limit, so it is very rare that you would run out of core*seconds before the end date of the allocation.

Flux Billing

The Flux allocations are billed to a U-M shortcode and a brief summary appears on your monthly Statement of Activity. ITS processes the allocations and generates monthly bills as long as you have an active Flux allocation. You can start an allocation at any time during the month and, because bills are generated monthly, you can end an allocation within 30 days of the last bill. Because Flux allocations are billed monthly, it is not possible to pre-pay for Flux allocations; you must have funding available during each month to pay for your Flux allocation.

Flux Job

A Flux job is a compute job that can use any portion of the allocations your project has available. The job is described by a short text file (the PBS file, or batch submission script) that is submitted to the job scheduler. The job scheduler takes into account the job’s requirements, the number of cores available in the Flux project’s allocations, and any other requirements.

Total resources     Resources Requested by Job    Job State in 50-core, 200GB allocation
9 cores; < 36GB     9 cores; < 4GB per core       job starts
18 cores; < 72GB    9 cores; < 4GB per core       job starts
27 cores; < 108GB   9 cores; < 4GB per core       job starts
36 cores; < 144GB   9 cores; < 4GB per core       job starts
45 cores; < 180GB   9 cores; < 4GB per core       job starts
46 cores; < 184GB   1 core; < 4GB per core        job starts
47 cores; < 188GB   1 core; < 4GB per core        job starts
48 cores; < 192GB   1 core; < 4GB per core        job starts
52 cores; < 208GB   4 cores; < 4GB per core       job waits

Another scenario is a set of jobs with larger memory requirements, where even though there are cores available, the memory associated with the allocation is consumed and jobs are queued.

Total resources     Resources Requested by Job    Job State in 50-core, 200GB allocation
10 cores; 80GB      10 cores; 8GB RAM per core    job starts
20 cores; 160GB     10 cores; 8GB RAM per core    job starts
25 cores; 200GB     5 cores; 8GB RAM per core     job starts
26 cores; 201GB     1 core; 1GB RAM per core      job waits

Depending on the requirements of the job and the state of the system, jobs may wait in a queue while other jobs complete and resources become available. The job is started once all the requirements are met. When the job ends, the Flux allocation(s) are debited the core*seconds actually used by the job.

Integrating the Flux Components

These components are used together to support computational research within the U-M business and technology environment. A Flux user account is used to submit a Flux job to the cluster, where it is authorized by its associated Flux project to debit a Flux allocation and execute on Flux cores. The existing Flux allocations are aggregated and a Flux bill is applied each month to the university account (chartfield combination) specified at the creation of the allocation.

If you have questions, please send email to hpc-support@umich.edu.

Back To Top

Accessing Flux and Data Storage Options

Two-Factor Authentication

Flux and other ARC-TS resources require two-factor authentication with both a UMICH password and Duo (which replaces MTokens starting July 20, 2016) in order to log in.

Duo provides several options, including a smartphone or tablet app, phone calls to landlines or cell phones, and text messages. Instructions on enrolling in Duo are available at http://its.umich.edu/two-factor-authentication.

Using your two-factor key to login

The following is an example of what opening a session on Flux might look like.

$ ssh flux-login
Password:
Duo two-factor login for uniqname

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-1810
 2. Phone call to XXX-XXX-1810
 3. SMS passcodes to XXX-XXX-1810

Passcode or option (1-3): 1
Success. Logging you in...
uniqname@flux-login ~$

If you need help go to the U-M Duo page, contact 4help@umich.edu, or call (734) 764-4357 (764-HELP). Questions on the Flux cluster or other ARC-TS resources can be directed to hpc-support@umich.edu.

Back To Top

Login Nodes and transfer hosts

Login nodes

The login nodes are the front end to the cluster. They are accessible from the Ann Arbor, Dearborn, and Flint campus IP addresses and from the UM VPN network only and require a valid user account and a Duo two-factor authentication account to log in. Login nodes are a shared resource and, as such, it is expected that users do not monopolize them.

Login nodes for flux

The Flux login nodes are accessible via the following hostnames.

  • flux-login.arc-ts.umich.edu
    will connect you to the general Flux login hosts
  • flux-campus-login.arc-ts.umich.edu
    will connect you to the login hosts that can run software that requires you to be on campus.

Policies governing the login nodes

Appropriate uses for the login nodes:

  • Transferring small files to and from the cluster
  • Creating, modifying, and compiling code and submission scripts
  • Submitting and monitoring the status of jobs
  • Testing executables to ensure they will run on the cluster and its infrastructure. Processes are limited to a maximum of 15 minutes of CPU time to prevent runaway processes and overuse.

Any other use of the login nodes may result in the termination of the offending process. Any production processing (including post-processing) should be submitted through the batch system to the cluster. If interactive use is required, you should submit an interactive job to the cluster.
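For example, an interactive job might be requested like this (a sketch using the sample allocation name from earlier; adjust the resource requests to your needs):

$ qsub -I -V -A example_flux -q flux -l qos=flux,nodes=1:ppn=2,walltime=1:00:00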

Transfer hosts

The transfer hosts are available for users to transfer data to and from Flux. Connections are limited to SCP and SFTP and interactive logins are not allowed. Currently the transfer hosts have 10 Gbps connections to the network, which is much faster than connections to the login nodes. Connections to the transfer hosts are allowed from the same networks as are the login nodes.

Transfer hosts for flux
  • flux-xfer.arc-ts.umich.edu

Supported sftp and scp clients

Back To Top

Using software graphical interfaces with VNC

Flux is primarily a batch-oriented system; however, it is possible and sometimes necessary to use the graphical interface of some software. This can be accomplished using an “interactive batch job” and VNC, a program that displays a remote graphical interface locally. The traditional method for doing this involved setting up X forwarding, but that can be very slow, especially over slow or congested networks or when off campus.

VNC creates a virtual desktop to which programs display, and VNC then handles sending changes to that desktop to a viewer that runs on your local computer. This results in much faster updates and much better performance of graphical applications.

There are four steps needed to create and display a VNC session.

  1. Run a batch job that starts the VNC server
  2. Determine the network port that VNC is using
  3. Create a tunnel (network connection) from your local machine to the VNC desktop
  4. Connect to the VNC desktop using the tunnel

Before going through those steps, there is some setup needed. The first step is to set a VNC password. This is completely different from your login password. This password is used only when connecting to the desktop, and it is not secure, so please use a password that is not used for other things. To set the VNC password, run the following from a Flux login node:

$ vncpasswd

To get a nicer desktop environment, we highly recommend that you use our xstartup file. To do so, run:

$ cp /usr/cac/rhel6/vnc/xstartup ~/.vnc/

When you are done working in the VNC session, you will use the

$ stopvnc

command in the provided terminal window to shut down the VNC session and end the job.

Step 1: Submit the batch job

You will need to have a PBS script to run the VNC job. Something like the following, which asks for two processors on one node for two hours.

####  PBS preamble

#PBS -N VNC
#PBS -M uniqname@umich.edu
#PBS -m ab

#PBS -A example_flux
#PBS -l qos=flux
#PBS -q flux

#PBS -l nodes=1:ppn=2,pmem=2gb
#PBS -l walltime=2:00:00
#PBS -j oe
#PBS -V
####  End PBS preamble
if [ -n "$PBS_O_WORKDIR" ] ; then
    cd $PBS_O_WORKDIR
fi
# Run the VNC server in the foreground so PBS does not end the job.
# You may wish to change 1024x768 to 1280x1024 if you have a large screen.
# On smaller laptops, 1024x768 is recommended.
vncserver -depth 24 -geometry 1024x768 -name ${USER} -AlwaysShared -fg

NOTE: Remove any old log and pid files from your ~/.vnc folder before you run qsub:

$ rm ~/.vnc/*.log ~/.vnc/*.pid

If you call that script vnc.pbs, then to complete step one of four, run:

$ qsub vnc.pbs

We recommend not running more than one VNC job at a time. If you do run more than one, you should not delete all the .pid and .log files, but only those from VNC jobs that have finished; it is up to you to keep track of which files belong to active jobs.

Step 2: Determining your port

The line #PBS -m ab instructs PBS to send you an e-mail when the job starts, and it will contain something that looks like

PBS Job Id: 17528732.nyx.arc-ts.umich.edu
Job Name:   VNC
Exec host:  nyx5541/11
Begun execution

You will need the hostname from that message, in this case nyx5541, to set up the tunnel.

To complete step 2, take the hostname and use the command

$ ls $HOME/.vnc/nyx5541*.log
/home/bennet/.vnc/nyx5541.arc-ts.umich.edu:99.log

The number that follows the hostname, in this case 99, is the desktop number for VNC, and the port number is that plus 5900.

Step 3: Setting up the tunnel

You need to create a tunnel to the node on which your VNC server is running. The tunnel is a way to use one machine (in this case flux-xfer.arc-ts.umich.edu) to pass the network connection to another (in this case nyx5541.arc-ts.umich.edu). To do this, we use a special form of the ssh command.

From a Mac or a Linux machine, you would use:

$ ssh -N -L5999:nyx5541.arc-ts.umich.edu:5999 flux-xfer.arc-ts.umich.edu

This will prompt you for your password and then do nothing except forward the VNC connection to the compute host. Note that we are using the Flux file transfer host and not the login host for this tunnel as it does not require two-factor authentication.

From Windows, we recommend that you use PuTTY, which is available as part of the UM Blue Disc. With PuTTY, you need to set the port forwarding from the configuration menu. You will need to reset this each time the VNC port changes. See Additional VNC topics for instructions on configuring a tunnel with PuTTY.

Step 4: Connect to the VNC desktop

Now that you have a port forwarded, you are ready to connect your VNC client to your running VNC server. Choose the link from the Additional VNC topics that corresponds to your VNC client to see what the screens look like.

The VNC client will have a place for you to enter the host, which is usually localhost or the IP number 127.0.0.1, the Display (desktop number), and VNC password.

The VNC session will start with a terminal window open. Run the commands there that you need for your application. When you are done with your VNC session, typing the command

$ stopvnc

in the terminal window will end the VNC session and the PBS job.

Back To Top

Using General Purpose GPUs on Flux

What are GPGPUs?

GPGPUs are general-purpose graphics processing units. Originally, graphics processing units were very special purpose, designed to do the mathematical calculations needed to render high-quality graphics for games and other demanding graphical programs. People realized that those operations occur in other contexts and so started using the graphics card for other calculations. The industry responded by creating the general purpose cards, which generally has meant increasing the memory, numerical precision, speed, and number of processors.

GPGPUs are particularly good at matrix multiplication, random number generation, fast Fourier transforms (FFTs), and other numerically intensive and repetitive mathematical operations. They can deliver 5–10 times speed-up for many codes with careful programming.

Submitting Batch Jobs

The Flux GPGPU allocations are based on a single GPGPU accompanied by two CPUs each with 4 GB of memory for a total CPU memory pool of 8 GB per GPU. To use more memory or more CPUs with a GPU job, you must increase your allocation to make them available. For example, if your GPGPU program required 17 GB of memory, you would need to have an allocation for three GPGPUs to obtain enough CPU memory even though your job only uses one GPGPU and one CPU.

To use a GPU, you must request one in your PBS script. To do so, use the node attribute on your #PBS -l line. Here is an example that requests one GPU.

#PBS -l nodes=1:gpus=1,mem=2gb,walltime=1:00:00,qos=flux

Note that you must use nodes=1 and not procs=1 or the job will not run.

Also note that GPUs are available only with a GPU allocation, and those have names that end with _fluxg instead of _flux. Make sure that the line requesting the queue matches the GPU allocation name; i.e.,

#PBS -q fluxg

See our web page on Torque for more details on PBS scripts.
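Putting these pieces together, a minimal GPU batch script might look like the following sketch, modeled on the sample script in the Getting Started section (the job name gpu_job, the program name my_cuda_program, the email address, and the resource values are placeholders; replace example_fluxg with your own GPU allocation):

#!/bin/sh
####  PBS preamble
#PBS -N gpu_job
#PBS -M uniqname@umich.edu
#PBS -m abe
#PBS -j oe
#PBS -V

#PBS -A example_fluxg
#PBS -q fluxg
#PBS -l nodes=1:gpus=1,mem=2gb,walltime=1:00:00,qos=flux
####  End PBS preamble

#  Change to the directory you submitted from
if [ -n "$PBS_O_WORKDIR" ]; then cd $PBS_O_WORKDIR; fi

#  Put your GPU program here:
./my_cuda_program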

Programming for GPGPUs

The GPGPUs on Flux are NVIDIA graphics processors and use NVIDIA’s CUDA programming language. This is a very C-like language (that can be linked with Fortran codes) that makes programming for GPGPUs straightforward. For more information on CUDA programming, see the documentation at http://www.nvidia.com/object/cuda_develop.html.

For more examples of applications that are well-suited to CUDA, a language that enables use of GPUs, see NVIDIA’s CUDA pages at http://www.nvidia.com/object/cuda_home.html.

NVIDIA also makes special libraries available that make using the GPGPUs even easier. Two of these libraries are cuBLAS and cuFFT.

cuBLAS is a BLAS library that uses the GPGPU for matrix operations. For more information on the BLAS routines implemented by cuBLAS, see the documentation at http://developer.nvidia.com/cublas.

cuFFT is a set of FFT routines that use the GPGPU for their calculations. For more information on the FFT routines implemented by cuFFT, see the documentation at http://developer.nvidia.com/cufft.

To use the CUDA compiler (nvcc) or to link your code against one of the CUDA-enabled libraries, you will need to load a cuda module. There are typically several versions installed, and you can see which are available with

$ module av cuda

You can just load the cuda module to get the default, or you can load a specific version by specifying it, as in

$ module load cuda/6.0

Loading a cuda module will add the path to the nvcc compiler to your PATH, and it will set several other environment variables that can be used to link against cuBLAS, cuFFT, and other CUDA libraries in the library directory. You can use

$ module show cuda

to display which variables are set.

CUDA-based applications can be compiled on the login nodes, but cannot be run there, since they do not have GPGPUs. To run a CUDA application, you must submit a job, as shown above.
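For example, compiling on a login node and then submitting a job might look like this (saxpy.cu and gpu_job.pbs are hypothetical file names):

$ module load cuda
$ nvcc -o saxpy saxpy.cu
$ qsub gpu_job.pbs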

Back To Top

File transfers with Globus – GridFTP

Globus GridFTP is a reliable high performance parallel file transfer service provided by many HPC sites around the world. A GridFTP server is available for Flux.

How to use GridFTP

Globus Online is a web front end to GridFTP, and it is the recommended way to interact with GridFTP on campus. Globus Online is a web-based project hosted off campus. Globus Online accounts are free, and your username need not match your campus uniqname.

Globus Connect

Globus Online also allows for simple installation of a GridFTP endpoint on Windows, Mac (OS X), and Linux. These installations, while simple, are only for a single user on a machine at a time. If you want your cluster or shared data repository to support all users on your systems as a public endpoint, like Flux or Nyx, your admin needs to install Globus Connect Server (formerly known as Globus Connect Multi-User) — see below for details.

Batch File Copies

A non-standard use of Globus Online is that you can use it to copy files from one location to another on the same cluster. To do this, use the same endpoint (umich#flux, for example) for both the sending and receiving machines. Set up the transfer and Globus will make sure the rest happens. The service will email you when the copy is finished.

Command Line GridFTP

There are command line tools for GridFTP. If you wish to use these, contact the Flux support group. Their use is discouraged.

Flux GridFTP Servers

  • gsiftp://gridftp-flux.engin.umich.edu

Globus Connect Server (GCMU)

Globus Connect Server, formerly Globus Connect Multi-User (GCMU), is a full-featured GridFTP endpoint that will enable any user on your system to use GridFTP to transfer files from your system to any other GridFTP endpoint. Installation is more complicated than Globus Connect for single users and only supports Linux at this time.

This package is required to have your system appear as a public endpoint in Globus Online. Authentication is handled by the campus-wide CILogon server against the Michigan Kerberos password database. It also does not require the cluster to procure its own signed certificates.

  1. Install the globus-connect-server package for your platform as instructed from Globus.
  2. Use this globus-connect-server.conf
  3. Follow the remaining instructions from step 1.

Groups on campus who wish to install Globus Connect Server and add their machines to the umich# name space in Globus Online should contact hpc-support@umich.edu.

Back To Top

Flux Storage Options

Several levels of data storage are provided for Flux, varying by capacity, I/O rate, and longevity of storage. Nothing is backed up, except AFS. Please contact hpc-support@umich.edu with any questions.

Storage type / location: /tmp
Description: Local directory unique to each node. Not shared.
Best used for: High-speed reads and writes of small files (less than 10GB).

Storage type / location: /home
Description: Shared across the entire cluster. Only for use with currently running jobs. Quota of 80GB per user.
Best used for: Currently running jobs.

Storage type / location: /scratch
Description: Lustre-based parallel file system shared across all Flux nodes.
Best used for: Large reads and writes of very large data files. Checkpoint/restart files and large data sets that are frequently read from or written to are common examples. Also, code that uses MPI.
Access and policy details: ARC-TS /scratch page

Storage type / location: AFS
Description: AFS is a filesystem maintained and backed up by ITS. It is the only storage option available for Flux that is regularly backed up, and is therefore the most secure choice. It is only available on Flux login nodes and can provide up to 10GB of backed-up storage.
Best used for: Storing important files. NOT available for running jobs on compute nodes.
Access and policy details: ARC-TS AFS page

Storage type / location: Turbo
Description: Turbo Research Storage is a high-speed storage service providing NFSv3 and NFSv4 access. It is available only for research data. Data stored in Turbo can be easily shared with collaborators when used in combination with the Globus file transfer service.
Best used for: Storing research data.
Access and policy details: ARC-TS Turbo page

Storage type / location: Long-term storage
Description: Users who need long-term storage can purchase it from ITS MiStorage. Once established, it can be mounted on the Flux login and compute nodes.
Best used for: Long-term storage.
Access and policy details: ITS MiStorage page

Back To Top

AFS Storage details

The Andrew File System or AFS is a central file storage, sharing and retrieval system operated by Information and Technology Services and accessible from Mac, Windows, and Unix computers.

For Flux users, AFS is good for storing important files – ITS backs this up to tape, making it a relatively secure file system.

On the other hand, AFS isn’t available on the compute nodes because your Kerberos token doesn’t get passed with your job information, so it is not good for running compute jobs.

Using AFS

On the Flux login nodes you can access your AFS space by typing:

[login@flux-login1 ~]$ kinit
Password for login@UMICH.EDU: your password
[login@flux-login1 ~]$ aklog UMICH.EDU
[login@flux-login1 ~]$ cd /afs/umich.edu/user/1/2/login
where login is your login ID, 1 is the first letter of your login ID, and 2 is the second letter of your login ID. So if your login ID is acaird, the path to your AFS space is /afs/umich.edu/user/a/c/acaird.

For more information, please see ITS’s web pages.

Back To Top

Scratch Storage Details

/scratch is a shared high performance storage system on Flux that provides access to large amounts of disk for short periods of time at much higher speed than /home or /home2.

Directories in /scratch

Upon the creation of a new Flux project, a directory will be created in /scratch for each user with access to that project. The paths will be of the form /scratch/projectname_flux/username. These directories are owned by the user and their default group and set up such that their default UNIX file permissions are 0700.

Upon request of someone with access to make changes to the project, we will modify the project’s user directories within /scratch in the following way:

  • Create a UNIX group whose membership matches the users with access to the project
  • Set the group ownership of the user directories under the project’s root within /scratch to this new group
  • Set the permissions on these directories to 2750.

In order to allow /scratch to be controlled systematically, other modifications to the root of a project’s /scratch directory are not permitted.

User Directory Example

If you have access to Flux projects with the names projectA_flux and projectB_flux, your directories in /scratch will be:

/scratch/projectA_flux/YOUR_LOGIN_ID
/scratch/projectB_flux/YOUR_LOGIN_ID

You should be careful to put data related to a particular project only in that project’s area, as it may be deleted when the allocations associated with that project expire. See Policies below for details.
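For example, to copy input data from your home directory into your scratch directory for one of those projects before submitting a job (paths are illustrative):

$ cp -r ~/input_data /scratch/projectA_flux/YOUR_LOGIN_ID/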

Policies

  • Only data for currently running jobs should be left on /scratch. The system is not backed up and is vulnerable to data loss.
  • Data that has not been accessed in the past 90 days will be automatically deleted.
  • 14 days after an allocation expires, the top-level directories will be set to unreadable as long as there are no currently running jobs (or the next day after all of the jobs associated with that project have completed).
  • 60 days after an allocation expires, the directory and its contents will be permanently removed.
  • If an allocation is renewed within 60 days, the permissions on the directory will be restored and the project members will again have access to the data therein.
  • Users should clean up their data often; old data will be removed to maintain working space. Important data should be moved to /home or AFS to avoid loss.

Details

The filesystem uses the Lustre cluster file system and scales to hundreds of GB/s of I/O. Flux’s implementation uses 2 metadata servers (for redundancy) and 4 storage servers (for redundancy, performance, and capacity), each with 4 storage targets, providing a total usable storage space of 1.5PB. Measured single-client performance over InfiniBand is 1GB/s; total filesystem performance is 5GB/s in optimal conditions.

Back To Top

Software

Managing software with Lmod

Why software needs managing

Almost all software requires that you modify your environment in some way. Your environment consists of the running shell, typically bash on Flux, and the set of environment variables that are set. The environment variable most familiar to most people is PATH, which lists all the directories in which the shell will search for a command, but there may be many others, depending on the particular software package.

Beginning in July 2016, Flux uses a program called Lmod to resolve the changes needed to accommodate having many versions of the same software installed. We use Lmod to help manage conflicts among the environment variables across the spectrum of software packages. Lmod can be used to modify your own default environment settings, and it is also useful if you install software for your own use.

Basic Lmod usage

Listing, loading, and unloading modules

Lmod provides the module command, an easy mechanism for changing the environment as needed to add or remove software packages from your environment.

This should be done before submitting a job to the cluster and not from within a PBS submit script.

A module is a collection of environment variable settings that can be loaded or unloaded. When you first log into Flux, a set of modules is loaded by default by a module called StdEnv. To see which modules are currently loaded, you can use the command

$ module list

Currently Loaded Modules:
  1) intel/16.0.3   2) openmpi/1.10.2/intel/16.0.3   3) StdEnv

We try to make the names of the modules as close to the official name of the software as we can, so you can see what is available by using, for example,

$ module av matlab

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   matlab/R2016a

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".

where av stands for avail (available). To make the software found available for use, you use

$ module load matlab

(you can also use add instead of load, if you prefer.) If you need to use software that is incompatible with Matlab, you would remove it using

$ module unload matlab

More ways to find modules

In the output from module av matlab, module suggests a couple of alternate ways to search for software. When you use module av, it will match the search string anywhere in the module name; for example,

$ module av gcc

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   fftw/3.3.4/gcc/4.8.5                          hdf5-par/1.8.16/gcc/4.8.5
   fftw/3.3.4/gcc/4.9.3                   (D)    hdf5-par/1.8.16/gcc/4.9.3 (D)
   gcc/4.8.5                                     hdf5/1.8.16/gcc/4.8.5
   gcc/4.9.3                                     hdf5/1.8.16/gcc/4.9.3     (D)
   gcc/5.4.0                              (D)    openmpi/1.10.2/gcc/4.8.5
   gromacs/5.1.2/openmpi/1.10.2/gcc/4.9.3        openmpi/1.10.2/gcc/4.9.3
   gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0 (D)    openmpi/1.10.2/gcc/5.4.0  (D)

  Where:
   D:  Default Module

However, if you are looking for just gcc, that is more than you really want. So, you can use one of two commands. The first is

$ module spider gcc

----------------------------------------------------------------------------
  gcc:
----------------------------------------------------------------------------
    Description:
      GNU compiler suite

     Versions:
        gcc/4.8.5
        gcc/4.9.3
        gcc/5.4.0

     Other possible modules matches:
        fftw/3.3.4/gcc  gromacs/5.1.2/openmpi/1.10.2/gcc  hdf5-par/1.8.16/gcc  ...

----------------------------------------------------------------------------
  To find other possible module matches do:
      module -r spider '.*gcc.*'

----------------------------------------------------------------------------
  For detailed information about a specific "gcc" module (including how to load
the modules) use the module's full name.
  For example:

     $ module spider gcc/5.4.0
----------------------------------------------------------------------------

That is probably more like what you are looking for if you really are searching just for gcc. That also gives suggestions for alternate searching, but let us return to the first set of suggestions, and see what we get with keyword searching.

At the time of writing, if you were to use module av to look for Python, you would get this result.

[bennet@flux-build-centos7 modulefiles]$ module av python

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   python-dev/3.5.1

However, some of the installed Python distributions do not have python as part of the module name. In this case, module spider will also not help. Instead, you can use

$ module keyword python

----------------------------------------------------------------------------
The following modules match your search criteria: "python"
----------------------------------------------------------------------------

  anaconda2: anaconda2/4.0.0
    Python 2 distribution.

  anaconda3: anaconda3/4.0.0
    Python 3 distribution.

  epd: epd/7.6-1
    Enthought Python Distribution

  python-dev: python-dev/3.5.1
    Python is a general purpose programming language

----------------------------------------------------------------------------
To learn more about a package enter:

   $ module spider Foo

where "Foo" is the name of a module

To find detailed information about a particular package you
must enter the version if there is more than one version:

   $ module spider Foo/11.1
----------------------------------------------------------------------------

That displays all the modules that have been tagged with the python keyword or where python appears in the module name.

More about software versions

Note that Lmod will indicate the default version in the output from module av, which will be loaded if you do not specify the version.

$ module av gromacs

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   gromacs/5.1.2/openmpi/1.10.2/gcc/4.9.3
   gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0 (D)

  Where:
   D:  Default Module

When loading modules with complex names, for example, gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0, you can specify up to the second-from-last element to load the default version. That is,

$ module load gromacs/5.1.2/openmpi/1.10.2/gcc

will load gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0

To load a version other than the default, specify the version as it is displayed by the module av command; for example,

$ module load gromacs/5.1.2/openmpi/1.10.2/gcc/4.9.3

When unloading a module, only the base name need be given; for example, if you loaded either gromacs module,

$ module unload gromacs

Module prerequisites and named sets

Some modules rely on other modules. For example, the gromacs module has many dependencies, some of which conflict with the default modules. To load it, you might first clear all modules with module purge, then load the dependencies, then finally load gromacs.

$ module list
Currently Loaded Modules:
  1) intel/16.0.3   2) openmpi/1.10.2/intel/16.0.3   3) StdEnv

$ module purge
$ module load gcc/5.4.0 openmpi/1.10.2/gcc/5.4.0 boost/1.61.0 mkl/11.3.3
$ module load gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0
$ module list
Currently Loaded Modules:
  1) gcc/5.4.0                  4) mkl/11.3.3
  2) openmpi/1.10.2/gcc/5.4.0   5) gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0
  3) boost/1.61.0

That’s a lot to do each time. Lmod provides a way to store a set of modules and give it a name. So, once you have the above list of modules loaded, you can use

$ module save my_gromacs

to save the whole list under the name my_gromacs. We recommend that you make each set fully self-contained, and that you use the full name/version for each module (to prevent problems if the default version of one of them changes), then use the combination

$ module purge
$ module restore my_gromacs
Restoring modules to user's my_gromacs

To see a list of the named sets you have (which are stored in ${HOME}/.lmod.d), use

$ module savelist
Named collection list:
  1) my_gromacs

and to see which modules are in a set, use

$ module describe my_gromacs
Collection "my_gromacs" contains: 
   1) gcc/5.4.0                   4) mkl/11.3.3
   2) openmpi/1.10.2/gcc/5.4.0    5) gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0
   3) boost/1.61.0

How to get more information about the module and the software

We try to provide some helpful information about the modules. For example,

$ module help openmpi/1.10.2/gcc/5.4.0
------------- Module Specific Help for "openmpi/1.10.2/gcc/5.4.0" --------------

OpenMPI consists of a set of compiler 'wrappers' that include the appropriate
settings for compiling MPI programs on the cluster.  The most commonly used
of these are

    mpicc
    mpic++
    mpif90

Those are used in the same way as the regular compiler program, for example,

    $ mpicc -o hello hello.c

will produce an executable program file, hello, from C source code in hello.c.

In addition to adding the OpenMPI executables to your path, the following
environment variables are set by the openmpi module.

    $MPI_HOME

For some generic information about the program you can use

$ module whatis openmpi/1.10.2/gcc/5.4.0
openmpi/1.10.2/gcc/5.4.0      : Name: openmpi
openmpi/1.10.2/gcc/5.4.0      : Description: OpenMPI implementation of the MPI protocol
openmpi/1.10.2/gcc/5.4.0      : License information: https://www.open-mpi.org/community/license.php
openmpi/1.10.2/gcc/5.4.0      : Category: Utility, Development, Core
openmpi/1.10.2/gcc/5.4.0      : Package documentation: https://www.open-mpi.org/doc/
openmpi/1.10.2/gcc/5.4.0      : ARC examples: /scratch/data/examples/openmpi/
openmpi/1.10.2/gcc/5.4.0      : Version: 1.10.2

and for information about what the module will set in the environment (in addition to the help text), you can use

$ module show openmpi/1.10.2/gcc/5.4.0
[ . . . .  Help text edited for space -- see above . . . . ]
whatis("Name: openmpi")
whatis("Description: OpenMPI implementation of the MPI protocol")
whatis("License information: https://www.open-mpi.org/community/license.php")
whatis("Category: Utility, Development, Core")
whatis("Package documentation: https://www.open-mpi.org/doc/")
whatis("ARC examples: /scratch/data/examples/openmpi/")
whatis("Version: 1.10.2")
prereq("gcc/5.4.0")
prepend_path("PATH","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin")
prepend_path("MANPATH","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/share/man")
prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/lib")
setenv("MPI_HOME","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0")

where the lines to attend to are the prepend_path(), setenv(), and prereq(). There is also an append_path() function that you may see. The prereq() function sets the list of other modules that must be loaded before the one being displayed. The rest set or modify the environment variable listed as the first argument; for example,

prepend_path("PATH", "/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin")

adds /sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin to the beginning of the PATH environment variable.
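
One quick way to confirm the effect (a sketch; the rest of your PATH will differ) is to load the module, together with its gcc prerequisite, and inspect the variable:

$ module load gcc/5.4.0 openmpi/1.10.2/gcc/5.4.0
$ echo $PATH
/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin:...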


Submitting jobs using Torque PBS

Description

Torque PBS (just PBS, hereafter) is the Portable Batch System, and it controls the running of jobs on the Flux cluster. PBS queues, starts, controls, and stops jobs. It also has utilities to report on the status of jobs and nodes. PBS and Moab, the scheduling software, are the core software that keep jobs running on Flux.

Availability

PBS is available in the core software modules and is loaded automatically by the system at login, so the PBS commands should be available whenever you log in. If, for some reason, you remove the module, or otherwise clear your modules, you can reload it with

$ module load torque

PBS overview

PBS is a batch job manager, where a job is some set of computing tasks to be performed. For most people, the primary uses will be to put jobs into the queue to be run, to check on the status of jobs, and to delete jobs. Most of the time, you will write a PBS script, which is just a text file – a shell script with PBS directives that the shell will interpret as comments – that contains information about the job and the commands that do the desired work.

You will find it convenient to name the PBS scripts in a consistent way, and some find that using the .pbs extension clearly identifies, and makes it easy to list, PBS scripts. We will use that convention for our examples. Before we get to the contents of a PBS script, we will show the three primary PBS commands, after which we will look at the PBS directives.

Submitting a PBS script

Suppose you have a PBS script called test.pbs that you wish to run. The command qsub test.pbs will submit it to be run. The output will be a JobID, which is used if you need information about the job or wish to delete it. If you are having trouble with a job, it is always a good idea to include the JobID when contacting support.

$ qsub test.pbs
1234567.nyx.engin.umich.edu

Checking on job status

To get information about a job’s status within the PBS system, you only need to specify the numeric part of the JobID. For example,

$ qstat 1234567

Deleting a job

To delete a job, use

$ qdel 1234567

A PBS script template

There are many PBS directives you can use to specify the characteristics of a job and how you and the PBS system will interact. A PBS directive is on a line beginning with #PBS. We will show an idealized template for a PBS script to illustrate some of the most commonly used PBS directives, then we will explain them. The example PBS script, call it test.pbs, contains these lines.

####  PBS preamble

#PBS -N PBS_test_script
#PBS -M uniqname@umich.edu
#PBS -m abe

#PBS -A example_flux
#PBS -l qos=flux
#PBS -q flux

#PBS -l nodes=4:ppn=2,pmem=2gb
#PBS -l walltime=1:15:00
#PBS -j oe
#PBS -V

####  End PBS preamble

if [ -s "$PBS_NODEFILE" ] ; then
    echo "Running on"
    cat $PBS_NODEFILE
fi

if [ -d "$PBS_O_WORKDIR" ] ; then
    cd $PBS_O_WORKDIR
    echo "Running from $PBS_O_WORKDIR"
fi

#  Put your job commands after this line
echo "Hello, world."

The lines that are not PBS directives but begin with the # character are comments and are ignored by PBS and by the shell. Once the first non-comment line is reached (in the template above, the line that begins with if), PBS stops looking for directives. It is, therefore, important to put all PBS directives before any commands.

You may find that grouping the PBS directives into blocks of related characteristics helps when reviewing a file for completeness and accuracy.

Roughly speaking, the three blocks, in order, are the attributes that control how you interact with the job, how the job gets paid for and routed, and what the resource characteristics of the job are.

All PBS directives start with #PBS, and each directive in a PBS script corresponds to a command line option to the qsub command. The #PBS -N PBS_test_script directive – which sets the job name – corresponds to adding -N PBS_test_script as an option to the qsub command. You can override the directives in the PBS script on the command line, or supplement them. So for example, you could use

$ qsub -N second_test test.pbs

and the name second_test would override the name set in test.pbs. The name should contain only letters, digits, underscores, or dashes.

The next two directives (we will omit the #PBS portion for the remainder of this section), the -M and -m, control how PBS communicates job status to you by e-mail. Specify your own e-mail address after the -M (uniqname@umich.edu is just a placeholder).

The three letters after the -m directive specify under what conditions PBS should send a notification: b is when a job begins, e is when a job ends, and a is when a job aborts. You can also specify n for none to suppress all e-mail from PBS.

The second block contains directives that have, roughly, to do with how your job is paid for and which sets of resources it runs against. The -A directive specifies the account to use. This is set up for the person paying for the use. The -l qos option will always be flux unless you receive specific instructions to the contrary.

The -q specifies which queue to use. In general, the queue will match the account suffix, so if the account is default_flux, the queue would be specified using -q flux; similarly, if this were a large-memory account, default_fluxm, the queue would be specified using -q fluxm; etc.

The exception to this rule is if you will be using software that is restricted to on-campus use, and you submit jobs from the flux-campus-login.engin.umich.edu node, in which case the queue would be specified as -q flux-oncampus. Jobs can be submitted to the flux-oncampus queue only from the flux-campus-login node.

The last block contains the directives that determine the environment in which your job runs and what resources are allocated to it. The most commonly changed are the processor count and layout, the memory, and the maximum amount of time the job can run.

Let us save the most complicated options for last, and review these in reverse order. The -V option instructs PBS to set the environment on all of the compute nodes assigned to a job to be the same as the environment that was in effect when you submitted the job. This is very important to include to make sure that all the paths to programs and libraries, and any important environment variables, are set everywhere. Many obscure error messages are due to this option not being used.

PBS will normally create a separate file for output and for errors. These are named job_name.oXXXXXXXX and job_name.eXXXXXXXX, where job_name is the name you specify with the -N option, XXXXXXXX is the numeric JobID PBS assigned the job, and the o and e represent output and error, respectively. If there are errors, or if your program writes informative information to the error stream, then it can be helpful to combine the output and error streams so that the error messages appear in context with the output. It can otherwise be very difficult to determine exactly where in the course of a job an error occurred. This is what the -j oe option does: it joins the output and error streams into the output stream (specifying it as eo would combine them into the error stream).

There are many options that can be specified with -l (dash ‘ell’). One of the simplest is walltime, which specifies the maximum clock time that your job will be allowed to run. Time is specified as days, hours, minutes, seconds: dd:hh:mm:ss. So, 15:00 requests 15 minutes (this should be the minimum that you request for a single job), 2:30:00 would request two and one-half hours, and 7:00:00:00 would request one week. Longer times are harder to schedule than shorter times, but it is better to ask for too much time than too little, so your job will finish.

Finally, we get to the most complex of the -l options: selecting the number of nodes, processors, and memory. In the example above, the request is for nodes=4:ppn=2,pmem=2gb, which translates as “assign me 4 machines, each with 2 processors, and each processor should have 2 GB of memory”.

It does not specify that exactly; instead, it specifies a sort of “worst-case” scenario. It really says that you would take up to four physical machines, each of which has at least two processors. You could end up with eight processors on one machine with that request.
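
If the layout matters, you can be explicit about it. For example (a sketch using the same directive syntax as the template above), requesting a single node with all of the processors on it guarantees that they are on one physical machine:

#PBS -l nodes=1:ppn=8,pmem=2gb
#PBS -l walltime=1:15:00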


Interactive PBS jobs

You can request an interactive PBS job for any activity for which the environment needs to be the same as for a batch job. Interactive jobs are also what you should use if you have something that needs more resources than are appropriate to use on a login node. When the interactive job starts, you will get a prompt, and you will have access to all the resources assigned to the job. This is a requirement to test or debug, for example, MPI jobs that run across many nodes.

Submitting an interactive job

There are two ways you can submit an interactive job to PBS: By including all of the options on the command line, or by listing the options in a PBS script and submitting the script with the command-line option specifying an interactive job.

Submitting an interactive job from the command line

To submit from the command line, you need to specify all of the PBS options that would normally be specified by directives in a PBS script. The translation from script to command line is simply to take a line, say,

#PBS -A example_flux

remove the #PBS, and the rest is the option you should put on the command line. More options will be needed, but that would lead to

$ qsub -A example_flux

For an interactive job, several options that are appropriate in a PBS script may be left off. Since you will have a prompt, you probably don’t need to use the options to send you mail about job status. The options that must be included are the accounting options, the resource options for number of nodes, processors, memory, and walltime, and the -V option to ensure that all the nodes get the correct environment. The -I flag signals that the job should run as an interactive job. (Note: in the example that follows, the \ character indicates that the following line is a continuation of the one on which it appears.)

$ qsub -I -V -A example_flux -q flux \
   -l nodes=2:ppn=2,pmem=1gb,walltime=4:00:00,qos=flux

The above example requests an interactive job, using the account example_flux and two nodes with two processors, each processor with 1 GB of memory, for four hours. The prompt will change to something that says the job is waiting to start, followed by a prompt on the first of the assigned nodes.

qsub: waiting for job 12345678.nyx.engin.umich.edu to start
[grundoon@nyx5555 ~]$

If at some point before the interactive job has started you decide you do not want to use it, Ctrl-C will cancel it, as in

^CDo you wish to terminate the job and exit (y|[n])? y
Job 12345678.nyx.engin.umich.edu is being deleted

When you have completed the work for which you requested the interactive job, you can just logout of the compute node, either with exit or with logout, and you will return to the login node prompt.

[grundoon@nyx5555 ~]$ exit
[grundoon@flux-login1 ~]$

Submitting an interactive job using a file

To recreate the same interactive job as above, you could create a file, say interactive.pbs, with the following lines in it

#!/bin/bash
#PBS -V
#PBS -A example_flux
#PBS -l qos=flux
#PBS -q flux

#PBS -l nodes=2:ppn=2,pmem=1gb,walltime=4:00:00

then submit the job using

$ qsub -I interactive.pbs


Linking libraries with applications

Using external libraries with compiled programs

Libraries are collections of functions that are already compiled and that can be included in your program without your having to write the functions yourself or compile them separately.

Why you might use libraries

Saving yourself time by not having to write the functions is one obvious reason to use a library. Additionally, many of the libraries focus on high performance and accuracy. Many of the libraries are very well-tested and proven. Others can add parallelism to computationally intensive functions without you having to write your own parallel code. In general, libraries can provide significant performance or accuracy dividends with a relatively low investment of time. They can also be cited in publications to assure readers that the fundamental numerical components of your work are fully tested and stable.

Compiling and linking with libraries

To use libraries you must link them with your own code. When you write your own code, the compiler turns that into object code, which is understandable by the machine. Even though most modern compilers hide it from you, there is a second step where the object code it created for you must be glued together with all the standard functions you include, and any external libraries, and that is called linking. When linking libraries that are not included with your compiler, you must tell the compiler/linker where to find the file that contains the library – typically .so and/or .a files. For libraries that require prototypes (C/C++, etc.) you must also tell the preprocessor/compiler where to find the header (.h) files. Fortran modules are also needed, if you are compiling Fortran code.

Environment variables from the module

When we install libraries on Flux, we usually create modules for them that will set the appropriate environment variables to make it easier for you to provide the right information to the compiler and the linker. The naming scheme is, typically, a prefix indicating the library, for example, FFTW, followed by a suffix to indicate the variable’s function, for example, _INCLUDE for the directory containing the header files. So, for example, the module for FFTW3 includes the variables FFTW_INCLUDE and FFTW_LIB for the include and library directories, respectively. We also, typically, set a variable to the top level of the library path, for example, FFTW_ROOT. Some configuration schemes want that and infer the rest of the directory structure relative to it. Libraries can often be tied to specific versions of a compiler, so you will want to run

$ module av

to see which compilers and versions are supported. One other variable that is often set by the library module is the LD_LIBRARY_PATH variable, which is used when you run the program to tell it where to find the libraries needed at run time. If you compile and link against an external library, you will almost always need to load the library module when you want to run the program so that this variable gets set. To see the variable names that a module provides, you can use the show option to the module command to show what is being set by the module. Here is an edited example of what that would print if you were to run it for FFTW3.

[markmont@flux-login2 ~]$ module show fftw/3.3.4/gcc/4.8.5
-------------------------------------------------------------------------------
   /sw/arcts/centos7/modulefiles/fftw/3.3.4/gcc/4.8.5.lua:
-------------------------------------------------------------------------------
help([[
FFTW consists of libraries for computation of the discrete Fourier transform
in one or more dimensions.  In addition to adding entries to the PATH, MANPATH,
and LD_LIBRARY_PATH, the following environment variables are created.

    FFTW_ROOT       The root of the FFTW installation folder
    FFTW_INCLUDE    The FFTW3 include file folder
    FFTW_LIB        The FFTW3 library folder, which includes single (float),
                    double, and long-double versions of the library, as well
                    as OpenMP and MPI versions.  To use the MPI library, you
                    must load the corresponding OpenMPI module.

An example of usage of those variables on a compilation command is, for gcc and
icc,

    $ gcc -o fftw3_prb fftw3_prb.c -I${FFTW_INCLUDE} -L${FFTW_LIB} -lfftw3 -lm
    $ icc -o fftw3_prb fftw3_prb.c -I${FFTW_INCLUDE} -L${FFTW_LIB} -lfftw3 -lm

]])
whatis("Name: fftw")
whatis("Description: Libraries for computation of discrete Fourier transform.")
whatis("License information: http://www.fftw.org/fftw3_doc/License-and-Copyright.html")
whatis("Category: Library, Development, Core")
whatis("Package documentation: http://www.fftw.org/fftw3_doc/")
whatis("Version: 3.3.4")
prepend_path("PATH","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/bin")
prepend_path("MANPATH","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/share/man")
prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/lib")
prepend_path("FFTW_ROOT","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5")
prepend_path("FFTW_INCLUDE","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/include")
prepend_path("FFTW_LIB","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/lib")
setenv("FFTW_HOME","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5")

[markmont@flux-login2 ~]$

In addition to the environment variables being set, the show option also displays the names of other modules with which FFTW3 conflicts (in this case, just itself), and there may be links to documentation and the vendor web site (not shown above).

Compile and link in one step

Here is an example of compiling and linking a C program with the FFTW3 libraries.

gcc -I$FFTW_INCLUDE -L$FFTW_LIB mysource.c -lfftw3 -o myprogram

Here is a breakdown of the components of that command.

  • -I$FFTW_INCLUDE The -I option to the compiler indicates a location for header files and, in this case, points to a directory that holds the fftw3.h header file.
  • -L$FFTW_LIB The -L compiler option indicates a library location and, in this case, points to a directory that holds the libfftw3.a and libfftw3.so files, which are the library files. Note, you will want to make sure that the -L option precedes the -l option.
  • mysource.c This is the source code that refers to the FFTW3 library functions; that is, your program.
  • -lfftw3 The -l compiler option indicates the name of a library that contains a function referenced in the source code. The compiler will look through the standard library (linker) paths the compiler came with, then the ones added with -L, and it will link the first libfftw3.* file that it finds (that will be libfftw3.so if you are specifying dynamic linking and libfftw3.a if you are statically linking).
  • -o myprogram The -o option is followed by the name of the final, executable file, in this case myprogram.
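
Because myprogram is dynamically linked against libfftw3.so, the FFTW3 module (which sets LD_LIBRARY_PATH, as described above) must also be loaded when you run the program. A minimal sketch:

$ module load gcc/4.8.5 fftw/3.3.4/gcc/4.8.5
$ ./myprogram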

Compile and link in multiple steps

Sometimes you will need or want to compile some files without creating the final executable program, for example, if you have many smaller source files that all combine to make a complete executable. Here is an example.

gcc -c -I$FFTW_INCLUDE source1.c 
gcc -c -I$FFTW_INCLUDE source2.c 
gcc -L$FFTW_LIB source1.o source2.o -o myprogram -lfftw3

The -c compiler option tells the compiler to compile an object file only. Note that only the -I option is needed if you are not linking; the header files are needed to create the object code, which contains references to the functions in the library. The last line does not actually compile anything; rather, it links the components. The -L and -l options are the same as on the one-step compilation and linkage command and specify where the binary library files are located. The -o option specifies the name of the final executable, in this case myprogram. The location of the header files is only needed before linking, so the -I flags can be left off for the final step. The same is true for the -L and -l flags, which are only needed for the final link step and so can be left off the compilation steps. Note that all the object files to be linked need to be named.

You will typically see this method used in large, complex projects, with many functions spread across many files with lots of interdependencies. This method minimizes the amount of time it takes to recompile and relink a program if only a small part of it is changed. This is best managed with make and makefiles.
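
For example, if only source2.c changes, only that file needs to be recompiled before relinking (a sketch reusing the files above); a makefile automates exactly this bookkeeping:

$ gcc -c -I$FFTW_INCLUDE source2.c
$ gcc -L$FFTW_LIB source1.o source2.o -o myprogram -lfftw3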


Flux for LSA

LSA’s public Flux allocation

Overview

Researchers in the College of Literature, Science, and the Arts have four options for using the Flux High Performance Computing cluster:

Public LSA allocations
  • Cost: Free (paid for by LSA)
  • Special usage limits: Yes (cores per user, jobs per user, job size, maximum walltime)
  • Notes: Only for researchers who do not have access to another allocation for the same Flux service (e.g., Standard Flux). Resources are shared by researchers College-wide, so there may frequently be waits for jobs to start.

Department or multiple-group allocation
  • Cost: $ (paid for by department or cooperating research groups)
  • Special usage limits: Optional (up to purchaser)
  • Notes: Best value per dollar in most cases. Resources are shared only between researchers within the department or groups; whether a job waits to start depends on how the allocation has been sized relative to the needs of the researchers.

Private allocation
  • Cost: $$$ (paid for by researcher)
  • Special usage limits: None
  • Notes: Both traditional (monthly) and “on demand” (metered) options are available. Purchased resources are not shared with other researchers, although jobs may have to wait to start if certain specific resource configurations are requested.

Flux Operating Environment
  • Cost: $$$ (paid for by researcher)
  • Special usage limits: None
  • Notes: Typically used with external grants that require the purchase of computing hardware rather than services. Researchers purchase specific hardware for their exclusive use for 4 years. Custom hardware configurations (e.g., amount of memory per node) are possible.

The College of Literature, Science, and the Arts provides three public Flux allocations to LSA researchers at no cost.  A researcher can use one of the public allocations if they do not have access to another allocation for the same Flux service.  For example, an LSA researcher can use lsa_fluxm if they do not have access to another Larger Memory Flux allocation.

lsa_flux
  • Service: Standard Flux
  • Size: 120 cores
  • Usage limits: Size (maximums, per person): normally 24 cores / 96 GB RAM; at non-busy times, 36 cores / 144 GB RAM. Runtime: maximum 4 core*months remaining across all running jobs (per person).

lsa_fluxm
  • Service: Larger Memory Flux
  • Size: 56 cores
  • Usage limits: Only for jobs that need more memory or cores per node than possible under lsa_flux. Size: 56 cores / 1400 GB per job. Walltime: maximum 1 week per job.

lsa_fluxg
  • Service: GPU Flux
  • Size: 2 GPUs
  • Usage limits: Only for jobs that use a GPU. Size: 1 GPU, 2 cores, 8 GB RAM per person. Walltime: maximum 3 days per job.

Flux Hadoop and Flux Xeon Phi services are also both available to everyone in LSA as no-cost technology previews. Descriptions of these services are available on the Systems and Services Page.

Uses of these allocations include but are not limited to:

  • Running jobs for individual researchers who fit within the usage limits for the LSA allocations, particularly for individual researchers or students who do not have access to funding that could be used to purchase their own Flux allocation (for example, graduate students and undergraduates doing their own research).
  • Testing Flux to determine whether to purchase a Flux allocation.  (Note that PIs can also request a one-time two week trial allocation for this purpose by contacting hpc-support@umich.edu; trial allocations are 16 cores but are for the exclusive use of the PI’s research group).
  • Experimentation and exploration on an ad hoc basis of questions not necessarily tied to any particular research project, without needing to obtain funding and purchasing a Flux allocation first.

The LSA public allocations are neither intended to replace nor supplement other Flux allocations.  Research groups who need more computation than is provided under the public allocation usage limits, or who need their jobs to start running faster than under the public allocations, should obtain their own Flux allocation.  Shared allocations can also be obtained for use of multiple research groups across departments, centers, institutes, or other units.  Graduate students in particular may want to use Rackham Graduate Student Research Grants to purchase their own, private Flux allocation.

Usage limits

The LSA public allocations (lsa_flux, lsa_fluxm, lsa_fluxg) are not meant for use by anyone who has their own Flux allocation, nor by those who have access to another shared allocation such as a departmental allocation or an allocation for their center or institute.

LSA has imposed additional usage limits on its public allocations in order to avoid a single user (or a small group of users) monopolizing the allocations for extended periods of time to the detriment of other researchers who want to use the allocations.

LSA Flux support staff will periodically monitor jobs which are running under the LSA public allocations.  Users who have running jobs exceeding the usage limit will receive an email asking them to delete some of their running jobs.  Users who receive four or more such emails within 120 days may be temporarily or permanently removed from the allocations, at the discretion of LSA Flux support staff.

You can check your current usage of the LSA public allocations at any point in time, to determine if you are under the usage limits, by running the following command:

lsa_flux_check

Only running jobs count against the usage limits; jobs which are idle or on hold do not count against the usage limits.

lsa_flux

Users of lsa_flux can use up to 24 cores or up to 96 GB of memory across all of their running jobs at any point in time.  When there are cores and/or memory that are idle (that is, not being used by other users), then these limits are increased to 36 cores and 144 GB of memory.

Additionally, individual users are restricted to having no more than 4 core*months (2,880 core*hours) worth of jobs running at any one time.  This limit is calculated by summing the product of the remaining walltime and number of cores for all of a given user’s running jobs, as shown by the command “showq -r -u $USER”.  4 core*months are sufficient to run a 4 core job for 28 days, an 8 core job for 15 days, a 16 core job for 7 days, and many other combinations.

Important note: if a single job requests more than 24 cores or 96 GB memory when lsa_flux is busy (but less than 36 cores / 144 GB), then this job will not start to run until lsa_flux becomes “not busy”, which could take days, weeks, or even longer.  To avoid this, check to see if lsa_flux is “busy” before submitting a job that is larger than 24 cores / 96 GB by running the following command:

freealloc lsa_flux

If freealloc reports that more cores and more memory are available than you will be requesting in your job, then lsa_flux is not busy and you can request more than 24 cores / 96 GB without having your job start time be delayed.

lsa_fluxm

The requested walltime for each job under lsa_fluxm must be no more than 1 week (168 hours).  This permits a single researcher to use the full lsa_fluxm allocation (all 56 cores / 1400 GB RAM) in a single job, but it can also result in very long waits for jobs to start.  Researchers who need jobs to start more quickly should either purchase their own Larger Memory Flux allocation, use lsa_flux (if they need 96 GB RAM or less), or use XSEDE.

Use of lsa_fluxm is restricted to jobs that require more memory or more cores per node than is possible under lsa_flux.

lsa_fluxg

Each user of lsa_fluxg can run one job at a time, using a single GPU, up to 2 cores, and up to 8 GB RAM for a maximum of three days (72 hours).  Use of lsa_fluxg is restricted to jobs that make use of a GPU.

Frequently Asked Questions

Am I able to use the LSA public allocations?

Run the command “mdiag -u your-uniqname-here” on a Flux login node.  If you see the LSA public allocation names as a part of ALIST, then you are able to run jobs under them.

[bjensen@flux-login1 ~]$ mdiag -u bjensen
evaluating user information
Name                      Priority        Flags         QDef      QOSList*        PartitionList
Target  Limits
bjensen                          0            -         flux         flux                     -   0.00       -
  GDEF=dada
  EMAILADDRESS=bjensen@umich.edu
  ADEF=default_flux  ALIST=default_flux,lsa_flux,lsa_fluxm,lsa_fluxg,bigproject_flux
[bjensen@flux-login1 ~]$

If you are a member of LSA but the public allocation names do not show up in the ALIST for your Flux account, please contact hpc-support@umich.edu and ask to be added to the allocations.

How do I use the LSA public allocations?

To use lsa_flux, specify the following credentials in your PBS script:

#PBS -A lsa_flux
#PBS -q flux
#PBS -l qos=flux

To use lsa_fluxm, specify the following credentials in your PBS script:

#PBS -A lsa_fluxm
#PBS -q fluxm
#PBS -l qos=flux

To use lsa_fluxg, specify the following credentials in your PBS script:

#PBS -A lsa_fluxg
#PBS -q fluxg
#PBS -l qos=flux

For more information about PBS scripts, see the Flux PBS web page.

How can I tell what jobs are using (or waiting to use) one of the allocations?

Run the following command on a Flux login node to see what jobs are using (or waiting to use) lsa_flux:

showq -w acct=lsa_flux

Replace “lsa_flux” above with “lsa_fluxm” or “lsa_fluxg” as desired.

How close am I to the usage limits?

You can check your current usage of the LSA public allocations at any point in time, to determine if you are under the usage limit, by running the following command:

lsa_flux_check

My job waits a long time before starting, what are my options?

Because the LSA allocations are public resources, they can often be in high demand, resulting in jobs taking hours or even days to start, even if each individual user is under the usage limits.  Options for getting jobs to start more quickly include:

  • Purchase a Flux allocation.  LSA provides cost-sharing for Flux allocations for all LSA faculty, students, and staff; regular-memory Flux allocations are thus available to members of LSA for only $6.60/core/month.  Graduate students are encouraged to apply for Rackham Graduate Student Research Grants which may be (relatively) quick and easy to obtain.
  • Ask your department, center, or institute about the possibility of a shared allocation funded by department discretionary funds, individual researcher contributions, or other sources.
  • PIs can apply for a one-time trial allocation on Flux that lasts two weeks; contact hpc-support@umich.edu for more information.
  • Use XSEDE. XSEDE has relatively large (up to 200,000 service unit) startup allocations that are fairly easy to obtain, requiring a CV and 1-page description of the research and how the research will utilize XSEDE.  Research allocation requests are reviewed four times a year and are awarded based on the results shown from the startup and previous research allocations; research allocations can be larger than startup allocations.  For more information, contact hpc-support@umich.edu.

How can I avoid exceeding the usage limits for the LSA public allocations?

The usage limits for the LSA public allocations are automatically enforced wherever possible, but you may run into problems and have to manage your usage yourself in order to stay within the usage limits if you are requesting more than 4 GB per core under lsa_flux.

A variety of options are available to manage your usage:

  • You can submit a large number of jobs at once, but use PBS job dependencies to divide the jobs into smaller groups, so that each group is under the usage limit for the allocation (see the sketch after this list).  More information is available in the “How to Use PBS Job Dependencies” section of the Flux Torque PBS web page.
  • If you are using PBS job arrays, you can specify a “slot limit” to limit how many of the individual jobs in the array can run simultaneously.  Array jobs are particularly useful if you are doing parameter sweeps or otherwise running the same code many times with different input files or parameters.  To use PBS job arrays with slot limits, add a percent sign followed by the slot limit to the end of the job array specification in your PBS script.  For example, “#PBS -t 1-100%4” will submit the job 100 times, but will ensure that only four of them will be running at any point in time.  More information is available in the “How to Use Job Arrays” section of the Flux Torque PBS web page and the Adaptive Computing web page on job arrays.
  • Submit only a few jobs at a time, staying under the usage limit for concurrently running jobs.  Wait for the jobs to complete before submitting additional jobs.
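
As a sketch of the job-dependency approach mentioned in the list above (group1.pbs and group2.pbs are placeholder script names), qsub prints the JobID of each submitted job, and that JobID can be passed to the -W depend option of a later submission:

$ FIRST=$(qsub group1.pbs)
$ qsub -W depend=afterok:$FIRST group2.pbs

The second job will not become eligible to run until the first has completed successfully, so only one group of work counts against the usage limits at a time.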

Can I use the LSA public allocations instead of another allocation?

Yes.  You can send email to hpc-support@umich.edu and ask to be removed from the other allocations in order to become eligible to use one or more of the LSA public allocations.  Note that you only need to do this if the other allocation you are in is of the same service type as the LSA public allocation you want to use.  For example, if you are in a Standard Flux allocation named someproject_flux and you want to use lsa_flux, you will need to be removed from someproject_flux first.  However, you can use lsa_fluxm and lsa_fluxg without being removed from someproject_flux as long as you do not also have access to some other Larger Memory Flux or GPU Flux allocation.

Please send any questions or requests to hpc-support@umich.edu.


LSA funding for instructional use of Flux

An LSA pilot program funds use of Flux by LSA classes (instructional use of Flux). Any LSA faculty member can apply to receive a Flux allocation paid for by LSA to use in the classroom. Each application is for a single term only; if a class will be using Flux for multiple terms, the instructor must apply for the LSA-funded class Flux allocation each term. Since funding is limited and since it can take a while to install new software on Flux that may be needed for the course, faculty are encouraged to apply as early as possible (two months or more before the start of the term is ideal, although we can also accept applications after a term has started).

To apply for an LSA-funded class Flux allocation, the faculty member teaching the class should send the following information to hpc-support@umich.edu:

  1. Course name, course number, and academic term.
  2. Approximate number of students who will be enrolled in the course.
  3. A two to three sentence description of how Flux will be used in the course.
  4. Which Flux service(s) are being requested (Standard Flux, Larger Memory Flux, or GPU Flux).
  5. For each month of the course, the number of Flux cores requested.  The number of cores can vary based on when students will be using Flux and when assignment/project due dates are.  LSA Flux support staff can meet with you to help determine how many cores will be needed each month, based on how many students are in the class and what number/length/type of jobs the students will be running.  (Example response: “0 cores in September, 24 cores in October, and 64 cores in each of November and December.”)
  6. Is there any special software that LSA Flux support staff should install on Flux for the course, or any other special setup or resources the class will need?
  7. Would you like LSA Flux support staff to give a guest lecture on how to use Flux?  If so, approximately when in the term?

Please send any questions about instructional use of Flux to hpc-support@umich.edu.


Flux for College of Engineering

College of Engineering’s shared Flux allocation

The College of Engineering provides a Flux allocation that can be used by anyone in the College.

The allocation is currently for the Standard Flux service. For detailed information, see the College’s Flux Allocation webpage.

Please send any questions or requests to hpc-support@umich.edu.


Policies

Terms of Usage and User Responsibilities

  1. Data is not backed up. None of the data on Flux is backed up. The data that you keep in your home directory, /tmp or any other filesystem is exposed to immediate and permanent loss at all times. You are responsible for mitigating your own risk. We suggest you store copies of hard-to-reproduce data on systems that are backed up, for example, the AFS filesystem maintained by ITS.
  2. Your usage is tracked and may be used for reports. We track a lot of job data and store it for a long time. We use this data to generate usage reports and look at patterns and trends. We may report this data, including your individual data, to your adviser, department head, dean, or other administrator or supervisor.
  3. Maintaining the overall stability of the system is paramount to us. While we make every effort to ensure that every job completes in the most efficient and accurate way possible, the good of the whole is more important to us than the good of an individual. This may affect you, but mostly we hope it benefits you. System availability is based on our best efforts. We are staffed to provide support during normal business hours. We try very hard to provide support as broadly as possible, but cannot guarantee support on a 24 hour per day basis. Additionally, we perform system maintenance on a periodic basis, driven by the availability of software updates, staffing availability, and input from the user community. We do our best to schedule around your needs, but there will be times when the system is unavailable. For scheduled outages, we will announce them at least one month in advance on the ARC-TS home page; for unscheduled outages we will announce them as quickly as we can with as much detail as we have on that same page. You can also track ARC-TS on Twitter under the name ARC-TS.
  4. Flux is intended only for non-commercial, academic research and instruction. Commercial use of some of the software on Flux is prohibited by software licensing terms. Prohibited uses include product development or validation, any service for which a fee is charged, and, in some cases, research involving proprietary data that will not be made available publicly. Please contact hpc-support@umich.edu if you have any questions about this policy, or about whether your work may violate these terms.
  5. You are responsible for the security of sensitive codes and data. If you will be storing export-controlled or other sensitive or secure software, libraries, or data on the cluster, it is your responsibility to ensure that it is secured to the standards set by the most restrictive governing rules.  We cannot reasonably monitor everything that is installed on the cluster, and cannot be responsible for it, leaving the responsibility with you, the end user.
  6. Data subject to HIPAA regulations may not be stored or processed on the cluster. For assistance with HIPAA-related computational research please contact Jeremy Hallum, ARC liaison to the Medical School, at jhallum@med.umich.edu.

User Responsibilities

Users should make requests by emailing hpc-support@umich.edu:

  • Renew allocations at least 2 days before your current allocation expires so that the new allocation is provisioned before the old one expires.
  • Request that users be added to your allocations at least a day in advance.


Users are responsible for maintaining MCommunity groups used for MReport authorizations.

Users must manage data appropriately in their various locations:

  • /home, /home2
  • /scratch
  • /tmp and /var/tmp
  • customer-provided NFS

 


Security on Flux / Use of Sensitive Data

The Flux high-performance computing system at the University of Michigan has been built to provide a flexible and secure HPC environment. Flux is an extremely scalable, flexible, and reliable platform that enables researchers to match their computing capability and costs with their needs while maintaining the security of their research.

Built-in Security Features

Applications and data are protected by secure physical facilities and infrastructure as well as a variety of network and security monitoring systems. These systems provide basic but important security measures including:

  • Secure access – All access to Flux is via ssh or Globus. Ssh has a long history of high security. Globus provides basic security and supports additional security if you need it.
  • Built-in firewalls – All of the Flux computers have firewalls that restrict access to only what is needed.
  • Unique users – Flux adheres to the University guideline of one person per login ID and one login ID per person.
  • Multi-factor authentication (MFA) – For all interactive sessions, Flux requires both a UM Kerberos password and Duo authentication. File transfer sessions require a Kerberos password.
  • Private Subnets – Other than the login and file transfer computers that are part of Flux, all of the computers are on a network that is private within the University network and are unreachable from the Internet.
  • Flexible data storage – Researchers can control the security of their own data storage by securing their storage as they require and having it mounted via NFSv3 or NFSv4 on Flux. Another option is to make use of Flux’s local scratch storage, which is considered secure for many types of data. Note: Flux is not considered secure for data covered by HIPAA.

Flux/Globus & Sensitive Data

To find out what types of data may be processed in Flux or Globus, visit the U-M Sensitive Data Guide to IT Resources.

Additional Security Information

If you require more detailed information on Flux’s security or architecture to support your data management plan or technology control plan, please contact the Flux team at hpc-support@umich.edu.

We know that it’s important for you to understand the protection measures that are used to guard the Flux infrastructure. But since you can’t physically touch the servers or walk through the data centers, how can you be sure that the right security controls are in place?

The answer lies in the third-party certifications and evaluations that Flux has undergone. IIA has evaluated the system, network, and storage practices of Flux and Globus. The evaluation for Flux is published at http://safecomputing.umich.edu/dataguide/?q=node/151 and the evaluation for Globus is published at http://safecomputing.umich.edu/dataguide/?q=node/155.

Shared Security and Compliance Responsibility

Because you’re managing your data in the Flux high-performance computing environment, the security responsibilities will be shared.

Flux operators have secured the underlying infrastructure, and you are obligated to secure anything you put on that infrastructure yourself, as well as meet any other compliance requirements.  These requirements may be derived from your grant or funding agency, from data owners or stewards other than yourself, or from state or federal laws and regulations.

The Flux support staff is available to help manage user lists for data access, and information on how to manage file system permissions is publicly available; please see http://en.wikipedia.org/wiki/File_system_permissions.

Contacting Flux Support

The Flux Support Team encourages communications, including for security-related questions. Please email us at hpc-support@umich.edu.

We have created a PGP key for especially sensitive communications you may need to send.

-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1

mQENBFNEDlUBCACvXwy9tYzuD3BqSXrxcAEcIsmmH52066R//RMaoUbS7AcoaF12
k+Quy/V0mEQGv5C4w2IC8Ls2G0RHMJ2PYjndlEOVVQ/lA8HpaGhrSxhY1bZzmbkr
g0vGzOPN87dJPjgipSCcyupKG6Jnnm4u0woAXufBwjN2wAP2E7sqSZ2vCRyMs4vT
TGiw3Ryr2SFF98IJCzFCQAwEwSXZ2ESe9fH5+WUxJ6OM5rFk7JBkH0zSV/RE4RLW
o2E54gkF6gn+QnLOfp2Y2W0CmhagDWYqf5XHAr0SZlksgDoC14AN6rq/oop1M+/T
C/fgpAKXk1V/p1SlX7xL230re8/zzukA5ETzABEBAAG0UEhQQyBTdXBwb3J0IChV
bml2ZXJzaXR5IG9mIE1pY2hpZ2FuIEhQQyBTdXBwb3J0IEdQRyBrZXkpIDxocGMt
c3VwcG9ydEB1bWljaC5lZHU+iQE+BBMBAgAoBQJTRA5VAhsDBQkJZgGABgsJCAcD
AgYVCAIJCgsEFgIDAQIeAQIXgAAKCRDHwuoUZnHdimrSB/4m6P7aQGnsbYVFspJ8
zquGRZd3fDU/IaCvLyjsUN4Qw1KFUmqQjvvfTxix7KjlNMcGy1boUCWKNNk1sFtb
E9Jr2p6Z/M7pm4XWhZIs1UIfHr3XgLdfbeYgXpt4Md2G6ttaXv44D10xL2LYCHE8
DnSVv+2SIG9PhaV+h+aBUo4yKwTwVBZsguU1Z1fsbiu6z6iDrzU2dlQp0NLmw73G
v5HUdYdu/YJdh5frp/2XorLXynrEyCk1SxViXrHY6dc9Y3bUjwl0MOJypLuRhQmj
kVwHIsNsRg1YJ6iyJzom33C7YdRktBiPpstkYDHJf/PVRAw1G4dkyjfUfG2pIoQd
WjOxuQENBFNEDlUBCADNwZ5edW/e08zYFWSGVsdpY4HM2CdsVqkuQru2puHhJqg4
eWS9RAdJ6fWp3HJCDsDkuQr19B3G5gEWyWOMgPJ9yW2tFVCrVsb9UekXAWh6C6hL
Tj+pgVVpNDTYrErYa2nlll0oSyplluVBRlzDfuf4YkHDy2TFd7Kam2C2NuQzLQX3
THhHkgMV+4SQZ+HrHRSoYPAcPb4+83dyQUo9lEMGcRA2WqappKImGhpccQ6x3Adj
/HFaDrFT7itEtC8/fx4UyaIeMszNDjD1WIGBJocOdO7ClIEGyCshwKn5z1cCUt72
XDjun0f1Czl6FOzkG+CHg5mf1cwgNUNx7TlVBFdTABEBAAGJASUEGAECAA8FAlNE
DlUCGwwFCQlmAYAACgkQx8LqFGZx3YrcqggAlKZhtrMDTHNki1ZTF7c7RLjfN17H
Fb342sED1Y3y3Dm0RVSQ2SuUWbezuDwov6CllgQR8SjBZ+D9G6Bt05WZgaILD7H0
LR9+KtBNYjxoVIdNHcGBf4JSL19nAI4AMWcOOjfasGrn9C60SwiiZYzBtwZa9VCi
+OhZRbmcBejBfIAWC9dGtIcPHBVcObT1WVqAWKlBOGmEsj/fcpHKkDpbdS7ksLip
YLoce2rmyjXhFH4GXZ86cQD1nvOoPmzocIOK5wpIm6YxXtYLP07T30022fOV7YxT
mbiKKL2LmxN1Nb/+mf+wIZ5w2ZdDln1bbdIKRHoyS2HyhYuLd1t/vAOFwg==
=yAEg
-----END PGP PUBLIC KEY BLOCK-----

May I process sensitive data using Flux?

Yes, but only if you use a secure storage solution like Mainstream Storage and Flux’s scratch storage. Flux’s home directories are provided by Value Storage, which is not an appropriate location to store sensitive institutional data. One possible workflow is to use sftp or Globus to move data between a secure solution and Flux’s scratch storage, which is secure, bypassing your home directory or any of your own Value Storage directories. Keep in mind that compliance is a shared responsibility. You must also take any steps required by your role or unit to comply with relevant regulatory requirements.
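
As a rough illustration of that workflow (a sketch only: the scratch path below is a placeholder, so substitute the directory for your own allocation and uniqname), you could copy data directly into scratch over sftp, bypassing your home directory:

$ sftp uniqname@flux-xfer.arc-ts.umich.edu
sftp> cd /scratch/example_flux/uniqname
sftp> put sensitive_data.csv
sftp> exit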

For more information on specific types of data that can be stored and analyzed on Flux, Value Storage, and other U-M services, please see the “Sensitive Data Guide to IT Services” web page on the Safe Computing website: http://safecomputing.umich.edu/dataguide/


Acknowledging Flux in Published Papers

Researchers are urged to acknowledge ARC in any publication, presentation,
report, or proposal on research that involved ARC hardware (Flux) and/or
staff expertise.

“This research was supported in part through computational resources and
services provided by Advanced Research Computing at the University of
Michigan, Ann Arbor.”

Researchers are asked to annually submit, by October 1, a list of materials
that reference ARC, and inform its staff whenever any such research receives
professional or press exposure (arc-contact@umich.edu). This information is
extremely important in enabling ARC  to continue supporting U-M researchers
and obtain funding for future system and service upgrades.


Policy on commercial use of Flux

Flux is intended only for non-commercial, academic research and instruction. Commercial use of some of the software on Flux is prohibited by software licensing terms. Prohibited uses include product development or validation, any service for which a fee is charged, and, in some cases, research involving proprietary data that will not be made available publicly.

Please contact hpc-support@umich.edu if you have any questions about this policy, or about whether your work may violate these terms.


Advanced Topics

Data Science Platform (Hadoop)

The ARC-TS Data Science Platform is an upgraded Hadoop cluster currently available as a technology preview with no associated charges to U-M researchers. The ARC-TS Hadoop cluster is an on-campus resource that provides a different service level than most cloud-based Hadoop offerings, including:

  • high-bandwidth data transfer to and from other campus data storage locations with no data transfer costs
  • very high-speed inter-node connections using 40Gb/s Ethernet

The cluster provides 112TB of total usable disk space, 40GbE inter-node networking, Hadoop version 2.3.0, and several additional data science tools.

Aside from Hadoop and its Distributed File System, the ARC-TS data science service includes:

  • Pig, a high-level language that enables substantial parallelization, allowing the analysis of very large data sets.
  • Hive, data warehouse software that facilitates querying and managing large datasets residing in distributed storage using a SQL-like language called HiveQL.
  • Sqoop, a tool for transferring data between SQL databases and the Hadoop Distributed File System.
  • Rmr, an extension of the R Statistical Language to support distributed processing of large datasets stored in the Hadoop Distributed File System.
  • Spark, a general processing engine compatible with Hadoop data.
  • mrjob, a library that allows MapReduce jobs written in Python to run on Hadoop.

The software versions are as follows:

Title Version
Hadoop 2.5.0
Hive 0.13.1
Sqoop 1.4.5
Pig 0.12.0
R/rhdfs/rmr 3.0.3
Spark 1.2.0
mrjob 0.4.3-dev, commit 226a741548cf125ecfb549b7c50d52cda932d045

If a cloud-based system is more suitable for your research, ARC-TS can support your use of Amazon cloud resources through MCloud, the UM-ITS cloud service.

For more information on the Hadoop cluster, please see this documentation or contact us at data-science-support@umich.edu.

A Flux account is required to access the Hadoop cluster. Visit the Establishing a Flux allocation page for more information.


Connecting Flux and XSEDE

XSEDE is an open scientific discovery infrastructure combining leadership class resources at eleven partner sites to create an integrated, persistent computational resource. It is the successor to TeraGrid.

For general information on XSEDE, visit the XSEDE home page.

This page describes how to connect an XSEDE allocation with a Flux allocation.

The XSEDE Client Toolkit allows Flux users to connect to XSEDE resources using the GSI interface. This eases logins and file transfers between the two sets of resources. The toolkit provides the commands myproxy-logon, gsissh, gsiscp, globus-url-copy, and uberftp. Refer to XSEDE’s Data Transfers page for details on these commands.

Loading the XSEDE Client Toolkit

Load the toolkit module:

module load xsede

Getting your XSEDE User Portal ticket

Before connecting to any XSEDE resource using the GSI interface, you must get an XSEDE User Portal ticket. The ticket proves your identity and lasts for 12 hours by default. Use your XSEDE portal username and password when prompted:

myproxy-logon -l portalusername
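
If the toolkit also provides the standard Globus grid-proxy-info command (an assumption; it is part of the usual Globus client tools), you can check how much lifetime remains on your ticket:

# Show the proxy certificate details, including the time left before it expires
# (assumes grid-proxy-info is available after "module load xsede"):
grid-proxy-info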

Logins to XSEDE Resources

The command gsissh works the same as normal ssh but uses your portal ticket to authenticate.

gsissh Xsedeloginhost
gsissh gordon.sdsc.edu

Connect to the resource to which you have access via your startup or XSEDE TRAC allocation.

File Transfers

The XSEDE Client Toolkit allows file transfers between XSEDE and Flux resources and between XSEDE and XSEDE resources using gsiscp or the more complex and powerful globus-url-copy.

GSISCP

gsiscp uses the same options as normal scp allowing transfer of files between Flux resources and XSEDE resources:

Transfer the file fluxfile to the folder folder1 on the XSEDE resource xsedehost:

gsiscp fluxfile xsedehost:folder1/

Transfer the folder xfolder from xsedehost to Flux:

gsiscp -r xsedehost:xfolder .

GridFTP support: globus-url-copy

GridFTP is a powerful system for moving files between XSEDE sites. You can use the command globus-url-copy to initiate transfers from a Flux resource. Please refer to XSEDE’s Data Transfers page for details on its use.
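
As a sketch only (the GridFTP endpoint hostname and paths are hypothetical, and a valid ticket from myproxy-logon is required first), a transfer from Flux to an XSEDE site might look like:

# Copy a local file to a hypothetical XSEDE GridFTP endpoint, with verbose
# performance output (-vb) and 4 parallel TCP streams (-p 4):
globus-url-copy -vb -p 4 \
    file:///home/uniqname/results.tar.gz \
    gsiftp://gridftp.xsede-site.example.org/scratch/uniqname/results.tar.gz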

XSEDE Available Resources:

To find available XSEDE resources when choosing where to request your allocation, see the Resource Catalog, or contact hpc-support@umich.edu.

XSEDE Training Opportunities:

XSEDE offers training throughout the year listed on the XSEDE Course Calendar. XSEDE also maintains a collection of online training resources.

Back To Top

Accessing the Internet from ARC-TS compute nodes

Normally, compute nodes on ARC-TS clusters cannot directly access the Internet because they have private IP addresses. This increases cluster security while reducing costs (public IPv4 addresses are limited, and ARC-TS clusters do not currently support IPv6). However, it also means that jobs cannot install software, download files, or access databases on servers located outside of University of Michigan networks: the private IP addresses used by the cluster are routable on campus but not off campus.

If your work requires these tasks, there are three ways to allow jobs running on ARC-TS clusters to access the Internet, described below. The best method to use depends to a large extent on the software you are using. If your software supports HTTP proxying, that is the best method. If not, SOCKS proxying or SSH tunneling may be suitable.

HTTP proxying

HTTP proxying, sometimes called “HTTP forward proxying,” is the simplest and most robust way to access the Internet from ARC-TS clusters. However, there are two main limitations:

  • Some software packages do not support HTTP proxying.
  • HTTP proxying only supports HTTP, HTTPS and FTP protocols.

If either of these conditions applies (for example, if your software needs a database protocol such as MySQL), you should explore SOCKS proxying or SSH tunneling, described below.

Many popular software packages support HTTP proxying; pip, used in the example below, is one of them.

HTTP proxying is automatically set up when you log in to ARC-TS clusters, and it is used by any software that supports it without any special action on your part.

Here is an example that shows installing the Python package pyvcf from within an interactive job running on a Flux compute node:


[markmont@flux-login1 ~]$ module load anaconda2/latest
[markmont@flux-login1 ~]$ qsub -I -V -A example_flux -q flux -l nodes=1:ppn=2,pmem=3800mb,walltime=04:00:00,qos=flux
qsub: waiting for job 18927162.nyx.arc-ts.umich.edu to start
qsub: job 18927162.nyx.arc-ts.umich.edu ready

[markmont@nyx5792 ~]$ pip install --user pyvcf
Collecting pyvcf
Downloading PyVCF-0.6.7.tar.gz
Collecting distribute (from pyvcf)
Downloading distribute-0.7.3.zip (145kB)
100% |████████████████████████████████| 147kB 115kB/s
Requirement already satisfied (use --upgrade to upgrade): setuptools>=0.7 in
/usr/cac/rhel6/lsa/anaconda2/latest/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg (from distribute->pyvcf)
Building wheels for collected packages: pyvcf, distribute
Running setup.py bdist_wheel for pyvcf … done
Stored in directory: /home/markmont/.cache/pip/wheels/68/93/6c/fb55ca4381dbf51fb37553cee72c62703fd9b856eee8e7febf
Running setup.py bdist_wheel for distribute … done
Stored in directory: /home/markmont/.cache/pip/wheels/b2/3c/64/772be880a32a0c41e64b56b13c25450ff31cf363670d3bc576
Successfully built pyvcf distribute
Installing collected packages: distribute, pyvcf
Successfully installed distribute pyvcf
[markmont@nyx5792 ~]$

If HTTP proxying were not supported by pip (or were otherwise not working), you would be unable to access the Internet to install the pyvcf package, and you would receive “Connection timed out”, “No route to host”, or “Connection failed” error messages when you tried to install it.

Information for advanced users

HTTP proxying is controlled by the following environment variables which are automatically set on each compute node:

export http_proxy="http://proxy.arc-ts.umich.edu:3128/"
export https_proxy="http://proxy.arc-ts.umich.edu:3128/"
export ftp_proxy="http://proxy.arc-ts.umich.edu:3128/"
export no_proxy="localhost,127.0.0.1,.localdomain,.umich.edu"
export HTTP_PROXY="${http_proxy}"
export HTTPS_PROXY="${https_proxy}"
export FTP_PROXY="${ftp_proxy}"
export NO_PROXY="${no_proxy}"

Once these are set in your environment, you can access the Internet from compute nodes; for example, you can install Python and R libraries from compute nodes. There is no need to start any daemons, as is required by the SOCKS and SSH tunneling solutions described below. The HTTP proxy server proxy.arc-ts.umich.edu supports HTTPS but does not terminate the TLS session at the proxy; traffic is encrypted by the software the user runs and is not decrypted until it reaches the destination server on the Internet.
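
As a quick check that the proxy settings are in effect on a compute node, you can run something like the following (the URL is only an example; curl and wget honor these environment variables automatically):

# Show the proxy server your session will use:
echo "$http_proxy"

# Fetch just the HTTP status line for an example page through the proxy:
curl -sI https://www.python.org/ | head -1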

To prevent software from using HTTP proxying, run the following command:

unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY NO_PROXY

The above command only affects software started from the current shell. If you start a new shell (for example, by opening a new window or logging in again), you will need to re-run the command. To permanently disable HTTP proxying for all software, add the command above to the end of your ~/.bashrc file.

Finally, note that HTTP proxying (which is forward proxying) should not be confused with reverse proxying.  Reverse proxying, which is done by the ARC Connect service, allows researchers to start web applications (including Jupyter notebooks, RStudio sessions, and Bokeh apps) on compute nodes and then access those web applications through the ARC Connect.

SOCKS

A second solution is available for any software that either supports the SOCKS protocol or that can be “made to work” with SOCKS. Most software does not support SOCKS, but here is an example using curl (which does have built-in support for SOCKS) to download a file from the Internet from inside an interactive job running on a Flux compute node. We use “ssh -D” to set up a “quick and dirty” SOCKS proxy server for curl to use:

[markmont@flux-login1 ~]$ qsub -I -V -A example_flux -q flux -l nodes=1:ppn=2,mem=8000mb,walltime=04:00:00,qos=flux
qsub: waiting for job 18927190.nyx.arc-ts.umich.edu to start
qsub: job 18927190.nyx.arc-ts.umich.edu ready

[markmont@nyx5441 ~]$ ssh -f -N -D 1080 flux-xfer.arc-ts.umich.edu
[markmont@nyx5441 ~]$ curl --socks5 localhost -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  272k  100  272k    0     0   368k      0 --:--:-- --:--:-- --:--:-- 1789k
[markmont@nyx5441 ~]$ ls -l bc-1.06.tar.gz
-rw-r--r-- 1 markmont lsa 278926 Feb 10 17:11 bc-1.06.tar.gz
[markmont@nyx5441 ~]$

A limitation of “ssh -D” is that it only handles TCP traffic, not UDP traffic (including DNS lookups, which happen over UDP). However, if you have a real SOCKS proxy accessible to you elsewhere on the U-M network (such as on a server in your lab), you can specify its hostname instead of “localhost” above and omit the ssh command in order to have UDP traffic handled.
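
For example (a sketch; the proxy hostname and port are hypothetical), curl can point directly at such a proxy, and the --socks5-hostname form also sends DNS lookups through the proxy rather than resolving them on the compute node:

# No "ssh -D" is needed when a real SOCKS proxy is reachable on the U-M network:
curl --socks5-hostname socks-proxy.example.umich.edu:1080 -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz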

For software that does not have built-in support for SOCKS, it’s possible to wrap the software with a library that intercepts networking calls and routes the traffic via the “ssh -D” SOCKS proxy (or a real SOCKS proxy, if you have one accessible to you on the U-M network). This will allow most software running on compute nodes to access the Internet. ARC-TS clusters provide one such SOCKS wrapper, socksify, by default:

[markmont@nyx5441 ~]$ telnet towel.blinkenlights.nl 666  # this won't work...
Trying 94.142.241.111...
telnet: connect to address 94.142.241.111: No route to host
Trying 2a02:898:17:8000::42...
[markmont@nyx5441 ~]$ ssh -f -N -D 1080 flux-xfer.arc-ts.umich.edu # if it's not still running from above
[markmont@nyx5441 ~]$ socksify telnet towel.blinkenlights.nl 666

=== The BOFH Excuse Server ===
the real ttys became pseudo ttys and vice-versa.

Connection closed by foreign host.
[markmont@nyx5441 ~]$

You can even surf the web in text mode from a compute node:

[markmont@nyx5441 ~]$ socksify links http://xsede.org/

socksify is the client component of the Dante SOCKS package.

Local SSH tunneling (“ssh -L”)

A final option for accessing the Internet from an ARC-TS compute node is to set up a local SSH tunnel using the “ssh -L” command. This provides a local port on the compute node that processes can connect to in order to reach a single specific remote port on a single specific host on a non-UM network.
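
The general pattern is shown below (a sketch; LOCAL_PORT, REMOTE_HOST, and REMOTE_PORT are placeholders you replace with your own values). The MongoDB and scp examples that follow both use this pattern.

# Forward LOCAL_PORT on the compute node, via flux-xfer.arc-ts.umich.edu,
# to REMOTE_PORT on REMOTE_HOST; -N means no remote command is run, only forwarding:
ssh -N -L LOCAL_PORT:REMOTE_HOST:REMOTE_PORT flux-xfer.arc-ts.umich.edu &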

MongoDB example

Here is an example that shows how to use a local tunnel to access a MongoDB database hosted off-campus from inside a job running on a compute node. First, on a cluster login node, run the following command in order to get the keys for flux-xfer.arc-ts.umich.edu added to your ~/.ssh/known_hosts file. This needs to be done interactively so that you can respond to the prompt that the ssh command gives you:

[markmont@flux-login1 ~]$ ssh flux-xfer.arc-ts.umich.edu
The authenticity of host 'flux-xfer.arc-ts.umich.edu (141.211.22.200)' can't be established.
RSA key fingerprint is 6f:8c:67:df:43:4f:e0:fc:80:5b:49:1a:eb:81:cc:54.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'flux-xfer.arc-ts.umich.edu' (RSA) to the list of known hosts.
---------------------------------------------------------------------
Advanced Research Computing - Technology Services
University of Michigan
hpc-support@umich.edu

This machine is intended for transferring data to and from the cluster
flux using scp and sftp only. Use flux-login.engin.umich.edu for
interactive use.

For usage information, policies, and updates, please see:
arc-ts.umich.edu

Thank you for using U-M information technology resources responsibly.
---------------------------------------------------------------------

^CConnection to flux-xfer.arc-ts.umich.edu closed.
[markmont@flux-login1 ~]$

After you see the login banner, the connection will hang, so press Control-C to terminate it and get your shell prompt back.

You can now run the following commands in a job, either interactively or in a PBS script, in order to access the MongoDB database at db.example.com from a compute node:

# Start the tunnel so that port 27017 on the compute node connects to port 27017 on db.example.com:
ssh -N -L 27017:db.example.com:27017 flux-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# You can now access the MongoDB database at db.example.com by connecting to localhost instead.
# For example, if you have the "mongo" command installed in your current directory, you could run the
# following command to view the collections available in the "admin" database:
./mongo --username MY_USERNAME --password "MY_PASSWORD" localhost/admin --eval 'db.getCollectionNames();'
# When you are all done using it, tear down the tunnel:
kill %1

scp example

Here is an example that shows how to use a local tunnel to copy a file using scp from a remote system (residing on a non-UM network) named “far-away.example.com” onto an ARC-TS cluster from inside a job running on a compute node.

Run the following commands inside an interactive PBS job the first time so that you can respond to the host key prompts and enter your password for far-away.example.com when asked.

# Start the tunnel so that port 2222 on the compute node connects to port 22 on far-away.example.com:
ssh -N -L 2222:far-away.example.com:22 flux-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# Copy the file "my-data-set.csv" from far-away.example.com to the compute node.
# Replace "your-user-name" with the username by which far-away.example.com knows you.
# If you don't have public key authentication set up from the cluster for far-away.example.com,
# you'll be prompted for your far-away.example.com password:
scp -P 2222 your-user-name@localhost:my-data-set.csv .
# When you are all done using it, tear down the tunnel:
kill %1

Once you have run these commands interactively from a compute node, they can also be used in non-interactive PBS batch jobs, provided you have set up public key authentication for far-away.example.com.
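
For reference, here is a minimal sketch of how those commands might be embedded in a batch script; the job name, allocation, and resource limits are placeholders to adjust, and the tunnel's process ID is captured with $! because job-control references such as %1 are not reliable in non-interactive shells:

#!/bin/sh
#PBS -N fetch-remote-data
#PBS -A example_flux
#PBS -q flux
#PBS -l nodes=1:ppn=1,mem=2000mb,walltime=00:30:00,qos=flux
#PBS -V

cd "$PBS_O_WORKDIR"

# Bring up the tunnel in the background and remember its process ID:
ssh -N -L 2222:far-away.example.com:22 flux-xfer.arc-ts.umich.edu &
TUNNEL_PID=$!
sleep 5

# Copy the file through the tunnel (public key authentication assumed):
scp -P 2222 your-user-name@localhost:my-data-set.csv .

# Tear down the tunnel:
kill "$TUNNEL_PID"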

Back To Top