Scratch Storage Details

/scratch is a shared, high-performance storage system on Flux that provides large amounts of disk space for short periods of time, at much higher speeds than /home or /home2.

Directories in /scratch

Upon the creation of a new Flux project, a directory will be created in /scratch for each user with access to that project. The paths will be of the form /scratch/projectname_flux/username. These directories are owned by the user and their default group, and are created with UNIX file permissions of 0700.

At the request of someone authorized to make changes to the project, we will modify the project's user directories within /scratch in the following way:

  • Create a UNIX group whose membership matches the users with access to the project
  • Set the group ownership of the user directories under the project’s root within /scratch to this new group
  • Set the permissions on these directories to 2750.

In order to allow /scratch to be controlled systematically, other modifications to the root of a project's /scratch directory are not permitted.
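
As a concrete sketch, the modification described above amounts to something like the following, performed by ARC-TS staff (the group name projectname_flux_users is hypothetical, and the path follows the form shown above):

# Group whose membership matches the users with access to the project (hypothetical name).
chgrp projectname_flux_users /scratch/projectname_flux/username
# setgid bit plus group read/execute; no access for others.
chmod 2750 /scratch/projectname_flux/username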

User Directory Example

If you have access to Flux projects with the names projectA_flux and projectB_flux, your directories in /scratch will be:
/scratch/projectA_flux/YOUR_LOGIN_ID
/scratch/projectB_flux/YOUR_LOGIN_ID
You should be careful to put data related to a particular project only in that project's area, as it may be deleted when the allocations associated with that project expire. See Policies below for details.

Policies

  • Only data for currently running jobs should be left on /scratch. The system is not backed up and is vulnerable to data loss.
  • Data that has not been accessed in the past 90 days will be automatically deleted.
  • 14 days after an allocation expires, the top-level directories will be made unreadable, provided there are no currently running jobs (or on the day after all of the jobs associated with that project have completed).
  • 60 days after an allocation expires, the directory and its contents will be permanently removed.
  • If an allocation is renewed within 60 days, the permissions on the directory will be restored and the project members will again have access to the data therein.
  • Users should clean up their data often; old data will be removed to maintain working space. Important data should be moved to /home or AFS to avoid loss (see the example after this list).
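
For instance, a minimal sketch of copying results from a project's /scratch area back to /home before an allocation expires (the project, login ID, and directory names are placeholders for your own):

# Copy a results directory from /scratch to /home before it is removed.
rsync -av /scratch/projectA_flux/YOUR_LOGIN_ID/results/ ~/projectA_results/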

Details

The filesystem uses the Lustre cluster file system and scales to hundreds of GB/s of I/O. Flux's implementation has 2 metadata servers (for redundancy) and 4 storage servers (for redundancy, performance, and capacity), each with 4 storage targets, providing a total usable storage space of 1.5 PB. Measured single-client performance over InfiniBand is 1 GB/s; total filesystem performance is 5 GB/s under optimal conditions.
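
To see the Lustre storage targets behind /scratch and how full they are, one option (a sketch, assuming the standard Lustre client tools are available on the login nodes) is:

# List the metadata and object storage targets serving /scratch and their usage.
$ lfs df -h /scratch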

AFS Storage details

The Andrew File System or AFS is a central file storage, sharing and retrieval system operated by Information and Technology Services and accessible from Mac, Windows, and Unix computers.

For Flux users, AFS is good for storing important files: ITS backs it up to tape, making it a relatively secure file system.

On the other hand, AFS isn’t available on the compute nodes because your Kerberos token doesn’t get passed with your job information, so it is not good for running compute jobs.

Using AFS

On the Flux login nodes you can access your AFS space by typing:

[login@flux-login1 ~]$ kinit
Password for login@UMICH.EDU: your password
[login@flux-login1 ~]$ aklog UMICH.EDU
[login@flux-login1 ~]$ cd /afs/umich.edu/user/1/2/login
Where login is your login ID, 1 is the first letter of your login ID, and 2 is the second letter of your login ID. So if your login ID is acaird, the path to your AFS space is /afs/umich.edu/user/a/c/acaird.
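
Once you have run kinit and aklog as shown above, you can copy important files into AFS with ordinary commands; for example (the file name is a placeholder, and the AFS path follows the same pattern as above):

# Copy a results file from your Flux home directory into your AFS space.
$ cp ~/results.tar.gz /afs/umich.edu/user/1/2/login/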

For more information, please see ITS’s web pages.

Flux Storage Options

Several levels of data storage are provided for Flux, varying by capacity, I/O rate, and longevity of storage. Nothing is backed up, except AFS. Please contact hpc-support@umich.edu with any questions.

/tmp
  • Description: Local directory unique to each node. Not shared.
  • Best used for: High-speed reads and writes of small files (less than 10 GB).

/home
  • Description: Shared across the entire cluster. Quota of 80 GB per user. Only for use with currently running jobs.
  • Best used for: Currently running jobs.

/scratch
  • Description: Lustre-based parallel file system shared across all Flux nodes.
  • Best used for: Large reads and writes of very large data files. Checkpoint/restart files and large data sets that are frequently read from or written to are common examples, as is code that uses MPI.
  • Access and policy details: ARC-TS /scratch page

AFS
  • Description: A filesystem maintained and backed up by ITS. It is the only storage option available for Flux that is regularly backed up, and is therefore the most secure choice. It is available only on Flux login nodes and can provide up to 10 GB of backed-up storage.
  • Best used for: Storing important files. NOT available to jobs running on compute nodes.
  • Access and policy details: ARC-TS AFS page

Turbo
  • Description: Turbo Research Storage is a high-speed storage service providing NFSv3 and NFSv4 access. It is available only for research data. Data stored in Turbo can be easily shared with collaborators when used in combination with the Globus file transfer service.
  • Best used for: Storing research data.
  • Access and policy details: ARC-TS Turbo page

Long-term storage
  • Description: Users who need long-term storage can purchase it from ITS MiStorage. Once established, it can be mounted on the Flux login and compute nodes.
  • Best used for: Long-term storage.
  • Access and policy details: ITS MiStorage page

File transfers with Globus – GridFTP

Globus GridFTP is a reliable, high-performance, parallel file transfer service provided by many HPC sites around the world. A GridFTP server is available for Flux.

How to use GridFTP

Globus Online is a web front end to GridFTP and is the recommended way to interact with GridFTP on campus. Globus Online is a web-based project hosted off campus. Globus Online accounts are free, and your username need not match your campus uniqname.

Globus Connect

Globus Online also allows for simple installation of a GridFTP endpoint on Windows, Mac (OS X), and Linux. These installations, while simple, serve only a single user on a machine at a time. If you want your cluster or shared data repository to support all users on your systems as a public endpoint, like Flux or Nyx, your admin needs to install Globus Connect Server (formerly known as Globus Connect Multi-User); see below for details.

Batch File Copies

A non-standard use of Globus Online is that you can use it to copy files from one location to another on the same cluster. To do this, use the same endpoint (umich#flux, for example) for both the sending and receiving sides. Set up the transfer and Globus will make sure the rest happens; the service will email you when the copy is finished.

Command Line GridFTP

There are command-line tools for GridFTP. If you wish to use these, contact the Flux support group; their use is discouraged.

Flux GridFTP Servers

  • gsiftp://gridftp-flux.engin.umich.edu
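
For reference only, since command-line use is discouraged, a transfer with the standard globus-url-copy client against this server might look roughly like the following; this is a sketch, the local and remote paths are placeholders, and it assumes you already have valid GridFTP credentials:

# Copy a local file into /scratch on Flux via the Flux GridFTP server.
# -vb reports transfer performance; -p 4 uses four parallel streams.
$ globus-url-copy -vb -p 4 file:///home/uniqname/data.tar gsiftp://gridftp-flux.engin.umich.edu/scratch/example_flux/uniqname/data.tar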

Globus Connect Server (GCMU)

Globus Connect Server, formerly Globus Connect Multi-User (GCMU), is a full-featured GridFTP endpoint that will enable any user on your system to use GridFTP to transfer files from your system to any other GridFTP endpoint. Installation is more complicated than Globus Connect for single users, and it only supports Linux at this time.

This package is required to have your system appear as a public endpoint in Globus Online. Authentication is handled by the campus-wide CILogon service against the Michigan Kerberos password database, so the cluster does not need to procure its own signed certificates.

  1. Install the globus-connect-server package for your platform as instructed by Globus.
  2. Use this globus-connect-server.conf
  3. Follow the remaining instructions from step 1 (a sketch of these steps appears below).
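
On a Red Hat-style system, the steps above might look roughly like the following; this is only a sketch, assuming the version 4 globus-connect-server packages, yum, and the default configuration file location:

# 1. Install the package (after adding the Globus repository for your platform).
$ sudo yum install globus-connect-server
# 2. Put the provided configuration file in place.
$ sudo cp globus-connect-server.conf /etc/globus-connect-server.conf
# 3. Run the setup tool, which registers the endpoint with Globus Online.
$ sudo globus-connect-server-setup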

Groups on campus who wish to install Globus Connect Server and add their machines to the umich# namespace in Globus Online should contact hpc-support@umich.edu.

Using General Purpose GPUs on Flux

What are GPGPUs?

GPGPUs are general-purpose graphics processing units. Originally, graphics processing units were very special-purpose, designed to do the mathematical calculations needed to render high-quality graphics for games and other demanding graphical programs. People realized that those operations occur in other contexts, and so they started using the graphics card for other calculations. The industry responded by creating general-purpose cards, which has generally meant increasing the memory, numerical precision, speed, and number of processors.

GPGPUs are particularly good at matrix multiplication, random number generation, fast Fourier transforms (FFTs), and other numerically intensive and repetitive mathematical operations. They can deliver 5–10 times speed-up for many codes with careful programming.

Submitting Batch Jobs

The Flux GPGPU allocations are based on a single GPGPU accompanied by two CPUs, each with 4 GB of memory, for a total CPU memory pool of 8 GB per GPU. To use more memory or more CPUs with a GPU job, you must increase your allocation to make them available. For example, if your GPGPU program required 17 GB of memory, you would need an allocation for three GPGPUs to obtain enough CPU memory (two GPGPUs provide only 2 × 8 GB = 16 GB, while three provide 24 GB), even though your job only uses one GPGPU and one CPU.

To use a GPU, you must request one in your PBS script. To do so, use the node attribute on your #PBS -l line. Here is an example that requests one GPU.

#PBS -l nodes=1:gpus=1,mem=2gb,walltime=1:00:00,qos=flux

Note that you must use nodes=1 and not procs=1 or the job will not run.

Also note that GPUs are available only with a GPU allocation, and those have names that end with _fluxg instead of _flux. Make sure that the line requesting the queue matches the GPU allocation name; i.e.,

#PBS -q fluxg

See our web page on Torque for more details on PBS scripts.
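
Putting those pieces together, a minimal sketch of a complete single-GPU batch script follows; the job name, email address, program name, and account name example_fluxg are placeholders:

####  PBS preamble
#PBS -N gpu_example
#PBS -M uniqname@umich.edu
#PBS -m abe
#PBS -A example_fluxg
#PBS -q fluxg
#PBS -l nodes=1:gpus=1,mem=2gb,walltime=1:00:00,qos=flux
#PBS -j oe
#PBS -V
####  End PBS preamble

cd $PBS_O_WORKDIR
# Run a CUDA-enabled program built with nvcc (see Programming for GPGPUs below).
./my_gpu_program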

Programming for GPGPUs

The GPGPUs on Flux are NVIDIA graphics processors and use NVIDIA's CUDA programming language. This is a very C-like language (that can be linked with Fortran codes) that makes programming for GPGPUs straightforward. For more information on CUDA programming, see the documentation at http://www.nvidia.com/object/cuda_develop.html.

For more examples of applications that are well suited to CUDA, see NVIDIA's CUDA pages at http://www.nvidia.com/object/cuda_home.html.

NVIDIA also makes special libraries available that make using the GPGPUs even easier. Two of these libraries are cuBLAS and cuFFT.

cuBLAS is a BLAS library that uses the GPGPU for matrix operations. For more information on the BLAS routines implemented by cuBLAS, see the documentation at http://developer.nvidia.com/cublas.

cuFFT is a set of FFT routines that use the GPGPU for their calculations. For more information on the FFT routines implemented by cuFFT, see the documentation at http://developer.nvidia.com/cufft.

To use the CUDA compiler (nvcc) or to link your code against one of the CUDA-enabled libraries, you will need to load a cuda module. There are typically several versions installed, and you can see which are available with

$ module av cuda

You can just load the cuda module to get the default, or you can load a specific version by specifying it, as in

$ module load cuda/6.0

Loading a cuda module will add the path to the nvcc compiler to your PATH, and it will set several other environment variables that can be used to link against cuBLAS, cuFFT, and other CUDA libraries in the library directory. You can use

$ module show cuda

to display which variables are set.

CUDA-based applications can be compiled on the login nodes, but cannot be run there, since they do not have GPGPUs. To run a CUDA application, you must submit a job, as shown above.
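
As a brief sketch of that workflow (the file names my_kernel.cu, my_gpu_program, and gpu_example.pbs are placeholders), compiling on a login node and then submitting the work as a batch job could look like:

# Load the default CUDA module, which puts nvcc on your PATH.
$ module load cuda
# Compile, linking against cuBLAS if the code uses it.
$ nvcc -o my_gpu_program my_kernel.cu -lcublas
# GPU code cannot run on the login nodes; submit it through the batch system.
$ qsub gpu_example.pbs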

Using software graphical interfaces with VNC

Flux is primarily a batch-oriented system; however, it is possible, and sometimes necessary, to use the graphical interface to some software. This can be accomplished using an “interactive batch job” and VNC, a program that will display a remote graphical interface locally. The traditional method for doing so involved setting up X forwarding, but that can be very slow, especially over slow or congested networks or when off campus.

VNC creates a virtual desktop to which programs display, and VNC then handles sending changes to that desktop to a viewer that runs on your local computer. This results in much faster updates and much better performance of graphical applications.

There are four steps needed to create and display a VNC session.

  1. Run a batch job that starts the VNC server
  2. Determine the network port that VNC is using
  3. Create a tunnel (network connection) from your local machine to the VNC desktop
  4. Connect to the VNC desktop using the tunnel

Before going through those steps, there is some setup needed. The first step is to set a VNC password. This is completely different from your login password: it is used only when connecting to the desktop, and it is not secure, so please use a password that you do not use for anything else. To set the VNC password, run the following from a Flux login node:

$ vncpasswd

To get a nicer desktop environment, we highly recommend that you use our xstartup file. To do so, copy it into place with:

$ cp /usr/cac/rhel6/vnc/xstartup ~/.vnc/

When you are done working in the VNC session, you will use the

$ stopvnc

command in the provided terminal window to shut down the VNC session and end the job.

Step 1: Submit the batch job

You will need a PBS script to run the VNC job, something like the following, which asks for two processors on one node for two hours.

####  PBS preamble

#PBS -N VNC
#PBS -M uniqname@umich.edu
#PBS -m ab

#PBS -A example_flux
#PBS -l qos=flux
#PBS -q flux

#PBS -l nodes=1:ppn=2,pmem=2gb
#PBS -l walltime=2:00:00
#PBS -j oe
#PBS -V
####  End PBS preamble
if [ -e $PBS_O_WORKDIR ] ; then
    cd $PBS_O_WORKDIR
fi
# Run the VNC server in the foreground so PBS does not end the job.
# You may wish to change 1024x768 to 1280x1024 if you have a large screen.
# On smaller laptops, 1024x768 is recommended.
vncserver -depth 24 -geometry 1024x768 -name ${USER} -AlwaysShared -fg

NOTE: Remove any old log and pid files from your ~/.vnc folder before you run qsub:

$ rm ~/.vnc/*.log ~/.vnc/*.pid

If you call that script vnc.pbs, then to complete step one of four, run it with:

$ qsub vnc.pbs

We recommend not running more than one VNC job at a time. If you do run more than one, do not delete all of the .pid and .log files; delete only those from VNC jobs that have finished, and keep track of which files belong to active jobs.

Step 2: Determining your port

The line #PBS -m ab instructs PBS to send you an e-mail when the job begins (and if it aborts), and that e-mail will contain something that looks like

PBS Job Id: 17528732.nyx.arc-ts.umich.edu
Job Name:   VNC
Exec host:  nyx5541/11
Begun execution

You will need the hostname from that message, in this case nyx5541, to set up the tunnel.

To complete step 2, take the hostname and use the command

$ ls $HOME/.vnc/nyx5541*.log
/home/bennet/.vnc/nyx5541.arc-ts.umich.edu:99.log

The number that follows the hostname, in this case 99, is the VNC desktop number, and the port number is that plus 5900; for desktop 99, the port is 5900 + 99 = 5999, which is the port used in the tunnel below.

Step 3: Setting up the tunnel

You need to create a tunnel to the node on which your VNC server is running. The tunnel uses one machine, in this case flux-xfer.arc-ts.umich.edu, to pass the network connection to another, in this case nyx5541.arc-ts.umich.edu. To do this, we use a special form of the ssh command.

From a Mac or a Linux machine, you would use:

$ ssh -N -L5999:nyx5541.arc-ts.umich.edu:5999 flux-xfer.arc-ts.umich.edu

This will prompt you for your password and then do nothing except forward the VNC connection to the compute host. Note that we are using the Flux file transfer host and not the login host for this tunnel as it does not require two-factor authentication.

From Windows, we recommend that you use PuTTY, which is available as part of the UM Blue Disc. With PuTTY, you need to set the port forwarding from the configuration menu. You will need to reset this each time the VNC port changes. See Additional VNC topics for instructions on configuring a tunnel with PuTTY.

Step 4: Connect to the VNC desktop

Now that you have a port forwarded, you are ready to connect your VNC client to your running VNC server. Choose the link from the Additional VNC topics that corresponds to your VNC client to see what the screens look like.

The VNC client will have places for you to enter the host, which is usually localhost or the IP address 127.0.0.1, the display (desktop) number, and the VNC password.

The VNC session will start with a terminal window open. Run the commands there that you need for your application. When you are done with your VNC session, typing the command

$ stopvnc

in the terminal window will end the VNC session and the PBS job.

Login Nodes and transfer hosts

Login nodes

The login nodes are the front end to the cluster. They are accessible only from Ann Arbor, Dearborn, and Flint campus IP addresses and from the UM VPN network, and they require a valid user account and a Duo two-factor authentication account to log in. Login nodes are a shared resource and, as such, it is expected that users do not monopolize them.

Login nodes for flux

The Flux login nodes are accessible via the following hostnames.

  • flux-login.arc-ts.umich.edu
    will connect you to the general Flux login hosts
  • flux-campus-login.arc-ts.umich.edu
    will connect you to the login hosts that can run software that requires you to be on campus.

Policies governing the login nodes

Appropriate uses for the login nodes:

  • Transferring small files to and from the cluster
  • Creating, modifying, and compiling code and submission scripts
  • Submitting and monitoring the status of jobs
  • Testing executables to ensure they will run on the cluster and its infrastructure. Processes are limited to a maximum of 15 minutes of CPU time to prevent runaway processes and overuse.

Any other use of the login nodes may result in the termination of the offending process. Any production processes (including post-processing) should be submitted through the batch system to the cluster. If interactive use is required, then you should submit an interactive job to the cluster, as shown below.
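
For example, a minimal sketch of requesting an interactive job rather than working on a login node (the account name example_flux is a placeholder):

# Request an interactive two-hour session with 2 processors on one node.
$ qsub -I -V -A example_flux -q flux -l nodes=1:ppn=2,pmem=2gb,walltime=2:00:00,qos=flux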

Transfer hosts

The transfer hosts are available for users to transfer data to and from Flux. Connections are limited to SCP and SFTP; interactive logins are not allowed. Currently the transfer hosts have 10 Gbps connections to the network, which is much faster than the connections to the login nodes. Connections to the transfer hosts are allowed from the same networks as the login nodes.

Transfer hosts for flux
  • flux-xfer.arc-ts.umich.edu
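
As a quick sketch, copying a file to your /scratch area through the transfer host with scp might look like this (the file name, project name, and login ID are placeholders):

$ scp data.tar.gz uniqname@flux-xfer.arc-ts.umich.edu:/scratch/example_flux/uniqname/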

Supported sftp and scp clients

Two-Factor Authentication

Flux and other ARC-TS resources require two-factor authentication with both a UMICH password and Duo (which replaces MTokens starting July 20, 2016) in order to log in.

Duo provides several options, including a smartphone or tablet app, phone calls to landlines or cell phones, and text messages. Instructions on enrolling in Duo are available at http://its.umich.edu/two-factor-authentication.

Using your two-factor key to login

The following is an example of what opening a session on Flux might look like.

$ ssh flux-login
Password:
Duo two-factor login for uniqname

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-1810
 2. Phone call to XXX-XXX-1810
 3. SMS passcodes to XXX-XXX-1810

Passcode or option (1-3): 1
Success. Logging you in...
uniqname@flux-login ~$

If you need help go to the U-M Duo page, contact 4help@umich.edu, or call (734) 764-4357 (764-HELP). Questions on the Flux cluster or other ARC-TS resources can be directed to hpc-support@umich.edu.