Lighthouse User Guide

Before you can use Lighthouse, the Principal Investigator (PI) must establish a Slurm account by contacting HPC Support.

Email Support

Lighthouse users must be authorized by the PI or a user authorized to make changes to the account. Contact HPC support to request user changes.


See the Lighthouse cheat sheet for a list of common Linux (Bash) and Slurm commands, including Torque and Slurm comparisons.

CHEAT SHEET (PDF)


Scheduler Defaults

Max walltime: 2 weeks
Default walltime (if not specified in job script): 60 minutes
Default memory per CPU (if not specified in job script): 768M
Default number of CPUs (if not specified in job script): This is tied to the default memory per CPU. If you request no memory and no CPU in your job script you will receive 1 CPU with 768M of memory.
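
As a rough illustration of how these defaults combine, the sketch below multiplies a CPU count by the 768M per-CPU default. It assumes that a job requesting CPUs but no memory receives the per-CPU default for each CPU, which is how Slurm's default memory per CPU normally behaves but is not spelled out above. The helper name default_mem_mb is made up for this example.

```shell
# Hypothetical helper: memory (in MB) a job would receive if it requests
# only CPUs and relies on the 768M default memory per CPU.
DEFAULT_MEM_PER_CPU_MB=768

default_mem_mb() {
  cpus="${1:-1}"   # with no argument, fall back to the 1-CPU default
  echo $(( cpus * DEFAULT_MEM_PER_CPU_MB ))
}

default_mem_mb      # no CPU request: 1 CPU
default_mem_mb 4    # 4 CPUs requested, no memory requested
```

If your program needs more than this, request memory explicitly with --mem-per-cpu rather than relying on the default.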


Getting Started (Command Line)

1. Get Duo

You must use Duo authentication to log on to Lighthouse.  Get more details on the Safe Computing Two-Factor page and enroll here.

2. Get a Lighthouse user login

You must establish a user login on Lighthouse by filling out this form.

3. Get an SSH Client & Connect to Lighthouse

You must be on campus or on the VPN to connect to Lighthouse.  If you are trying to log in from off campus, or using an unauthenticated wireless network such as MGuest, install VPN software on your computer and connect to the VPN first. Then connect with the SSH client for your operating system:

Mac or Linux:

Open Terminal and type:

ssh uniqname@lighthouse.arc-ts.umich.edu

Windows (using PuTTY):

Download and install PuTTY here.

Launch PuTTY and enter lighthouse.arc-ts.umich.edu as the host name, then click open.

All Operating Systems:

At the “Enter a passcode or select one of the following options:” prompt, type the number of your preferred choice for Duo authentication.

4. Get files

You can use SFTP (best for simple transfers of small files) or Globus (best for large files or a commonly used endpoint) to transfer data to your /home directory.

SFTP: Mac or Windows using FileZilla
  1. Open FileZilla and click the “Site Manager” button
  2. Create a New Site, which you can name “Lighthouse” or something similar
  3. Select the “SFTP (SSH File Transfer Protocol)” option
  4. In the Host field, type lighthouse-xfer.arc-ts.umich.edu
  5. Select “Interactive” for Logon Type
  6. In the User field, type your uniqname
  7. Click “Connect”
  8. Enter your Kerberos password
  9. Select your Duo method (1-3) and complete authentication
  10. Drag and drop files between the two systems
  11. Click “Disconnect” when finished

On Windows, you can also use WinSCP with similar settings, available alongside PuTTY here.

SFTP: Mac or Linux using Terminal

To copy a single file, type:

scp localfile uniqname@lighthouse-xfer.arc-ts.umich.edu:~/remotefile

To copy an entire directory, type:

scp -r localdir uniqname@lighthouse-xfer.arc-ts.umich.edu:~/remotedir

These commands can also be reversed in order to copy files from Lighthouse to your machine:

scp -r uniqname@lighthouse-xfer.arc-ts.umich.edu:~/remotedir localdir

You will need to authenticate via Duo to complete the file transfer.

Globus: Windows, Mac, or Linux

Globus is a reliable high performance parallel file transfer service provided by many HPC sites around the world. It enables easy transfer of files from one system to another, as long as they are Globus endpoints.

  • The Globus endpoint for Lighthouse is “umich#lighthouse”.

How to use Globus

Globus Online is a web front end to the Globus transfer service. Globus Online accounts are free and you can create an account with your University identity.

  • Set up your Globus account and learn how to transfer files using the Globus documentation.  Select “University of Michigan” from the dropdown box to get started.
  • Once you are ready to transfer files, enter “umich#lighthouse” as one of your endpoints.

Globus Connect Personal

Globus Online also allows for simple installation of a Globus endpoint for Windows, Mac, and Linux desktops and laptops.

  • Follow the Globus instructions to download the Globus Connect Personal installer and set up an endpoint on your desktop or laptop.

Batch File Copies

A non-standard use of Globus Online is copying files from one location to another on the same cluster. To do this, use the same endpoint (for example, umich#lighthouse) for both the sending and receiving sides. Set up the transfer and Globus will handle the rest. The service will email you when the copy is finished.

Command Line Globus

There are command line tools for Globus that are intended for advanced users. If you wish to use these, contact HPC support.

5. Submit a job

This is a simple guide to get your jobs up and running. For more advanced Slurm features, see the Slurm User Guide for Lighthouse. If you are familiar with using the resource manager Torque, you may find the migrating from Torque to Slurm guide useful.

Resources

On Lighthouse, as in the Flux Operating Environment (FOE), you can run on hosts owned by a researcher who has granted you access. Unless the researcher has placed additional limits, the only default limit is a 2-week maximum runtime (wallclock). You will submit your jobs to a partition (queue) named after the PI’s uniqname.

To help in the transition, we have included 2 test nodes in your partition. These 2 test nodes are shared with all FOE/Lighthouse researchers, so availability will depend on current demand. This is similar to the environment you previously tested on the Beta cluster, including a short limit on job run times. You can submit jobs to that partition to validate your workflows. Once the PI has moved their nodes from FOE to Lighthouse, the test nodes will be removed from the PI’s partition and the maximum runtime will be set to 2 weeks.

Batch Jobs

Most work on Lighthouse is queued and is described through a batch script. The sbatch command is used to submit a batch script to Slurm. Submit batch scripts from a shared file system: your home directory, /scratch, or any directory under /nfs that you can normally use in a job on Flux. Output will be sent to this working directory (jobName-jobID.log). Do not submit jobs from /tmp or any of its subdirectories. To submit a batch script, run:

$ sbatch myJob.sh
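
The rule about /tmp can be checked before submitting. The sketch below is a minimal illustration, not part of Lighthouse itself: the function name can_submit_from is hypothetical, and it only tests the /tmp rule, not whether a directory is on a shared file system.

```shell
# Hypothetical pre-submission check: refuse to submit from /tmp or below.
can_submit_from() {
  case "$1" in
    /tmp|/tmp/*) return 1 ;;   # excluded by the guide: do not submit from here
    *)           return 0 ;;
  esac
}

if can_submit_from "$PWD"; then
  echo "OK to run sbatch from $PWD"
else
  echo "Change to your home directory or /scratch before running sbatch" >&2
fi
```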

The batch job script is composed of three main components:

  • The interpreter used to execute the script
  • #SBATCH directives that convey submission options
  • The application(s) to execute along with its input arguments and options

Example:

#!/bin/bash
# The interpreter used to execute the script

#“#SBATCH” directives that convey submission options:

#SBATCH --job-name=example_job
#SBATCH --mail-type=BEGIN,END
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m 
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

# The application(s) to execute along with its input arguments and options:

/bin/hostname
sleep 60
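
One way to try the example is to write it to a file, check its syntax, and submit it while capturing the job ID for later status checks. This is a sketch under assumptions: the file name example_job.sh is arbitrary, the account and partition values are the placeholders from the example above, and submission is skipped when sbatch is not installed. The --parsable option of sbatch prints only the job ID (followed by the cluster name after a semicolon on some setups).

```shell
# Write the example job script to a file in the current directory.
job_script="example_job.sh"
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --mail-type=BEGIN,END
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1000m
#SBATCH --time=10:00
#SBATCH --account=test
#SBATCH --partition=standard

/bin/hostname
sleep 60
EOF

# Catch shell syntax errors before the job ever reaches the queue.
bash -n "$job_script" && echo "syntax OK: $job_script"

# Submit only where Slurm is available, and keep the job ID for later use.
if command -v sbatch >/dev/null 2>&1; then
  job_id=$(sbatch --parsable "$job_script")
  job_id=${job_id%%;*}   # drop a trailing ";cluster" if present
  echo "submitted job $job_id"
else
  echo "sbatch not found; skipping submission"
fi
```

The captured job ID can then be passed to squeue -j or scontrol show job.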

How many nodes and processors you request will depend on what your software is capable of using. There are four common scenarios:

Example: One Node, One Processor

This is the simplest case and is shown in the example above. The majority of software cannot use more than this. Some examples of software for which this would be the right configuration are SAS, Stata, R, many Python programs, and most Perl programs.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: One Node, Multiple Processors

This is similar to what a modern desktop or laptop is likely to have. Software that can use more than one processor may be described as multicore, multiprocessor, or multithreaded. Some examples of software that can benefit from this are MATLAB and Stata/MP. You should read the documentation for your software to see if this is one of its capabilities.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s
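
Requesting several CPUs does not by itself make a program use them. For OpenMP-based programs, a common pattern is to set the thread count from the allocation inside the job script, as sketched below. This is an assumption about your software: MATLAB, Stata/MP, and other packages have their own settings, so check their documentation. SLURM_CPUS_PER_TASK is an environment variable that Slurm sets when --cpus-per-task is specified.

```shell
# Inside a job script: take the thread count from the Slurm allocation,
# falling back to 1 when the variable is unset (for example, outside a job).
threads="${SLURM_CPUS_PER_TASK:-1}"
export OMP_NUM_THREADS="$threads"   # read by OpenMP-based programs
echo "Running with $OMP_NUM_THREADS thread(s)"
```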

Example: Multiple Nodes, One Process per CPU

This is the classic MPI approach: multiple machines are requested, and MPI starts one process per processor on each node. This is the way most MPI-enabled software is written to work.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s

Example: Multiple Nodes, Multiple CPUs per Process

This is often referred to as the “hybrid mode” MPI approach, where multiple machines are requested and multiple CPUs are requested for each process. MPI will start a parent process or processes on each node, and those in turn will be able to use more than one processor for threaded calculations.

#!/bin/bash
#SBATCH --job-name JOBNAME
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1g
#SBATCH --time=00:15:00
#SBATCH --account=test
#SBATCH --partition=standard
#SBATCH --mail-type=NONE

srun hostname -s
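
To compare the four scenarios, it can help to work out the totals each request implies. The sketch below multiplies out nodes, tasks per node, CPUs per task, and memory per CPU for each example. The function name job_totals is hypothetical, and the one-node examples are treated as a single task because they do not set --ntasks-per-node.

```shell
# Hypothetical helper: total tasks, CPUs, and memory (GB) implied by a request.
job_totals() {
  nodes="$1"; tasks_per_node="$2"; cpus_per_task="$3"; mem_per_cpu_gb="$4"
  total_tasks=$(( nodes * tasks_per_node ))
  total_cpus=$(( total_tasks * cpus_per_task ))
  total_mem_gb=$(( total_cpus * mem_per_cpu_gb ))
  echo "tasks=$total_tasks cpus=$total_cpus mem_gb=$total_mem_gb"
}

job_totals 1 1 1 1   # one node, one processor
job_totals 1 1 4 1   # one node, multiple processors
job_totals 2 4 1 1   # multiple nodes, one process per CPU
job_totals 2 4 4 1   # multiple nodes, multiple CPUs per process
```

The hybrid example is the largest of the four: 8 tasks with 4 CPUs each, for 32 CPUs and 32 GB in total.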

Common Job Submission Options

Description | Slurm directive (#SBATCH option) | Lighthouse Usage
Job name | --job-name=<name> | --job-name=lhjob1
Account | --account=<account> | --account=test
Queue | --partition=<partition_name> | --partition=name (the partition name depends on the PI’s requested cluster)
Wall time limit | --time=<hh:mm:ss> | --time=02:00:00
Node count | --nodes=<count> | --nodes=2
Process count per node | --ntasks-per-node=<count> | --ntasks-per-node=1
Minimum memory per processor | --mem-per-cpu=<memory> | --mem-per-cpu=1000m
Request software license(s) | --licenses=<application>@slurmdb:<N> | --licenses=stata@slurmdb:1 (requests one license for Stata)
Request event notification | --mail-type=<events> | --mail-type=BEGIN,END,FAIL

Note: multiple mail-type requests may be specified in a comma-separated list: --mail-type=BEGIN,END,NONE,FAIL,REQUEUE
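
The --time values in this guide appear in two notations: 10:00 is minutes and seconds, while 02:00:00 is hours, minutes, and seconds. Slurm also accepts a days form, d-hh:mm:ss, which is convenient for long limits such as the 2-week maximum. The sketch below converts these three forms to seconds so you can check what a value means. The function name is hypothetical, and other forms that Slurm accepts (such as plain minutes or days-hours) are deliberately not handled.

```shell
# Hypothetical helper: convert mm:ss, hh:mm:ss, or d-hh:mm:ss to seconds.
walltime_to_seconds() {
  echo "$1" | awk -F'[-:]' '
    NF == 2 { print $1 * 60 + $2 }                           # mm:ss
    NF == 3 { print $1 * 3600 + $2 * 60 + $3 }               # hh:mm:ss
    NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }  # d-hh:mm:ss
  '
}

walltime_to_seconds 10:00         # ten minutes, not ten hours
walltime_to_seconds 02:00:00      # two hours
walltime_to_seconds 14-00:00:00   # two weeks
```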

If your job is set to use more than one node, make sure your code is MPI-enabled so that it can run across those nodes, and use srun rather than mpirun or mpiexec. More advanced job submission options can be found in the Slurm User Guide for Lighthouse.

Interactive Jobs

If you need to actively interact to complete a task, you can submit an interactive job.  When your interactive job runs, you will have command line access (instead of running a script) on a compute node and can interact with all of the resources you requested. Interactive jobs are useful when debugging or interacting with an application. The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[user@login ~]$ srun --pty /bin/bash 
srun: job 309 queued and waiting for resources 
srun: job 309 has been allocated resources 
[user@node0001 ~]$ hostname 
node0001.lh.arc-ts.umich.edu 
[user@node0001 ~]$

Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024MB of memory. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes with 4 CPUs and 4GB of memory each:

[user@login ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --pty /bin/bash
srun: job 894 queued and waiting for resources
srun: job 894 has been allocated resources
[user@node0001 ~]$ srun hostname
node0001.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu

In the above example srun is used within the job from the first compute node to run a command once for every task in the job on the assigned resources. srun can be used to run on a subset of the resources assigned to the job. See the srun man page for more details.
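
A quick way to confirm how tasks were distributed is to count the host names that srun prints. The sketch below runs the counting pipeline on the output copied from the session above, so it works without Slurm; inside a real job you would pipe srun hostname into the same commands.

```shell
# Output of "srun hostname" copied from the session above. Inside a real job
# you would pipe the command directly: srun hostname | sort | uniq -c
sample_output='node0001.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0001.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu
node0002.lh.arc-ts.umich.edu'

# Count how many tasks ran on each node and print "hostname=count".
task_counts=$(printf '%s\n' "$sample_output" | sort | uniq -c | awk '{print $2 "=" $1}')
printf '%s\n' "$task_counts"
```

Each node should show 4 tasks, matching --ntasks-per-node=4 on 2 nodes.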

Job Status

You can see the status of a job with squeue -j <jobID>. Most of a job’s specifications can be seen by invoking scontrol show job <jobID>. The job’s batch script can be written to a file by using scontrol write batch_script <jobID> output.txt. If no output file is specified, the script will be written to slurm-<jobID>.sh.

A job’s record remains in Slurm’s memory for 30 minutes after it completes.  scontrol show job will return “Invalid job id specified” for a job that completed more than 30 minutes ago.  At that point, one must invoke the sacct command to retrieve the job’s record from the Slurm database.
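
The two lookups can be combined so that you do not need to remember whether a job is still in Slurm’s memory. The sketch below is one possible wrapper, assuming that scontrol exits with a non-zero status for a job it no longer knows about. The function name job_record is hypothetical, and the sacct field list is only a suggestion.

```shell
# Hypothetical wrapper: try the in-memory record first, then the database.
job_record() {
  job_id="$1"
  if ! command -v scontrol >/dev/null 2>&1; then
    echo "Slurm commands not found; run this on a Lighthouse login node" >&2
    return 127
  fi
  # scontrol only knows recent jobs; fall back to sacct for older ones.
  scontrol show job "$job_id" 2>/dev/null ||
    sacct -j "$job_id" --format=JobID,JobName,State,Elapsed,ExitCode
}

# Example call (replace 309 with your own job ID):
# job_record 309
```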

To view TRES (Trackable RESource) utilization by user or account, use the following commands, substituting your own values for the dates, user names, account, and TRES type:

Shows TRES usage by all users on account during date range:

sreport cluster UserUtilizationByAccount start=mm/dd/yy end=mm/dd/yy account=test --tres type

Shows TRES usage by specified user(s) on account during date range:

sreport cluster UserUtilizationByAccount start=mm/dd/yy end=mm/dd/yy users=un1,un2 account=test --tres type

Lists users alphabetically along with TRES usage and total during date range:

sreport cluster AccountUtilizationByUser start=mm/dd/yy end=mm/dd/yy tree account=test --tres type

Possible TRES types:

cpu
mem
node
gres/gpu

For more reporting options, see the Slurm sreport documentation.
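
As a concrete illustration, the sketch below fills in the placeholders of the first report with example values. The dates are arbitrary, the account name test is the placeholder used throughout this guide, and the function name usage_report_cmd is hypothetical. The command is only executed where sreport is installed.

```shell
# Hypothetical wrapper that fills in the placeholders of the first report.
usage_report_cmd() {
  start="$1"; end="$2"; account="$3"; tres="$4"
  echo "sreport cluster UserUtilizationByAccount start=$start end=$end account=$account --tres $tres"
}

# CPU usage on the account "test" for January 2020 (example dates).
cmd=$(usage_report_cmd 01/01/20 01/31/20 test cpu)
echo "$cmd"

# Run it only where Slurm is installed, for example on a Lighthouse login node.
if command -v sreport >/dev/null 2>&1; then
  $cmd
fi
```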


Getting Started (Web-based)

1. Get Duo

You must use Duo authentication to log on to the Lighthouse OnDemand web service.  Get more details on the Safe Computing Two-Factor page and enroll here.

2. Get a Lighthouse user login

You must establish a user login on Lighthouse by filling out this form.

3. Connect to Lighthouse OnDemand

You must be on campus or on the VPN to connect to Lighthouse OnDemand.  If you are trying to log in from off campus or using an unauthenticated wireless network such as MGuest, you should install VPN software on your computer.

Once you are on the University network, follow these instructions to connect:

  1. Open your web browser (Firefox, Edge, or Chrome in incognito recommended) and navigate to lighthouse.arc-ts.umich.edu
  2. Log into cosign using your uniqname and password.
  3. Complete Duo authentication.
  4. You should now be logged in.

4. Get files

At the top of the page, click “Files” and then “Home Directory”.  A new tab will be created that contains the File Explorer.

Here you can navigate your home folder.  The buttons do the following:

  • “Go To…”: Navigate to a specified folder
  • “Open in Terminal”: Opens the active folder in a terminal session (new tab)
  • “New File”: Creates a new file in the active folder
  • “New Dir”: Creates a new folder in the active folder
  • “Upload”: Select files from your local machine to upload to the active folder
  • “Show Dotfiles”: Reveals hidden files (usually do not need to be changed)
  • “Show Owner/Mode”: Shows ownership and permission information
  • “View”: Shows file contents inside the current tab
  • “Edit”: Opens a file editor in a new tab
  • “Rename/Move”: Gives a file a new path and/or name
  • “Download”: Downloads the file or folder to your local machine
  • “Copy”: Copies selected files to the clipboard
  • “Paste”: Pastes files from the clipboard
  • “(Un)Select All”: Select or unselect all files/folders
  • “Delete”: Deletes selected files/folders

5. Submit a job

At the top of the home page, click “Jobs” and then “Job Composer”.  A new tab will be created that contains the Job Composer.

Upon your first visit to this page, you’ll go through a helpful tutorial.  The buttons do the following:

  • “New Job”: Creates a new job…
    • “From Default Template”: Uses system defaults for a bare bones “Hello World” job on the Lighthouse cluster.  Please note that you will still need to specify your account.
    • “From Specified Path”: Creates a job from a specified job script.  See the Slurm User Guide for Lighthouse for information on writing this script.  Some attributes (name, account) can be set here if not set in the script.
    • “From Selected Job”: Creates a new job that is a copy of the selected job.
  • “Edit Files”: Opens the project folder in a new File Explorer tab, allowing you to edit the files within (see “Get Files” above for File Explorer instructions).
  • “Job Options”: Allows for editing the Name, Cluster, Job Script, and Account fields.
  • “Open Terminal”: Opens a terminal session in a new tab, starting in the project folder.
  • “Submit”: Submits the selected job to the cluster.
  • “Stop”: Stops the selected job if it has been submitted.
  • “Delete”: Deletes the selected job.

To view active job information, click “Jobs” and then “Active Jobs” from the home page.

This is a simple guide to get your jobs up and running. For more advanced Slurm features and job scripting information, see the Slurm User Guide for Lighthouse. If you are familiar with using the resource manager Torque, you may find the migrating from Torque to Slurm guide useful.

Interactive Apps

At the top of the home page, click “Interactive Apps” and then select your desired application.

Lighthouse Remote Desktop

Launches an interactive desktop in a new tab (uses noVNC).  Specify your account, hours, memory, cores, and partition (usually your PI’s uniqname).

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can click on “Launch noVNC in New Tab”.

A remote desktop session will then be opened in a new tab for the requested amount of time.  If you finish early, return to the “My Interactive Sessions” tab and delete the job.

MATLAB

Launches an interactive desktop with MATLAB configured and running in a new tab (uses noVNC).  Specify your desired version, account, hours, and memory (4GB minimum).

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can click on “Launch noVNC in New Tab”.

A remote desktop session running MATLAB will then be opened in a new tab for the requested amount of time. You may also use the terminal and other basic applications. If you finish early, return to the “My Interactive Sessions” tab and delete the job.

Jupyter Notebook Server

Launches a Jupyter Notebook Server in a new tab. Specify your desired version, project name, hours, memory, cores, runtime directory, and partition (usually your PI’s uniqname).

Upon selecting “Launch”, your job will be queued on one of your nodes and shown on the “My Interactive Sessions” screen. As soon as the job’s status is “Running”, you can click on “Connect to Jupyter”.

For instructions on using Jupyter Notebook, see the official documentation.



Software

Software modules

The Lighthouse cluster uses the Lmod modules system to provide access to centrally installed software. If you used a cluster at UM previously, then you should review the documentation for the module system as we have changed the configuration to match that used at most national clusters and most other university clusters.

In particular, use the command module keyword to look for a module. Do not use module available to search for software, because module available only shows software for which all the dependencies (or prerequisites) are already loaded.

So, to search for the software package FFTW, use

$ module keyword fftw

That will show which versions are installed and provide a command to determine what is needed to load it.

Please see our page on using the Lmod modules system for more details and examples.

There are two main categories of software available on the system: software that is installed as part of the installation of the operating system and software that is installed separately. No special action is needed to use the software installed with the operating system. The separately installed software is made available through modules: loading a module sets up the environment for that software. We do it this way so that multiple versions of the same package can be installed and to avoid conflicts between software packages that have mutually exclusive system requirements.

Requesting software licenses

Many of the software packages that are licensed for use on ARC clusters are licensed for a limited number of concurrent uses. If you will use one of those packages, then you must request a license or licenses in your submission script. As an example, to request one Stata license, you would use

#SBATCH --licenses=stata@slurmdb:1

The list of software can be found from Lighthouse by using the command

$ scontrol show licenses


Policies

Partition Policies

Slurm partitions represent collections of nodes, and are equivalent to Torque queues. Each PI’s standard compute nodes are in a partition named after the PI’s uniqname, with a maximum job walltime of 14 days (can be increased up to 4 weeks at the PI’s request). During the transition from FOE to Lighthouse, each partition will include two test nodes for testing until the PI’s nodes are migrated into Lighthouse.

Account/Association Limits

Slurm associations are a combination of cluster, account, user names and optionally a partition. An association can have limits (e.g. account ‘testaccount’ using partition ‘msbritt’ on cluster ‘lighthouse’ has a running job limit of X). TRES (Trackable Resources) are resources which can be tracked for usage or used to enforce limits. Common examples include CPU, memory, and GRES for GPUs.

Limits can be set on the user association as well as the account association. This allows a PI to limit individual users or the collective set of users in an account as the PI sees fit. Please contact ARC-TS if you would like to implement any of these limits.

Terms of Usage and User Responsibility

  1. Data is not backed up. None of the data on Lighthouse is backed up. The data that you keep in your home directory, /tmp or any other filesystem is exposed to immediate and permanent loss at all times. You are responsible for mitigating your own risk. ARC-TS provides more durable storage on Turbo, Locker, and Data Den.  For more information on these, look here.
  2. Your usage is tracked and may be used for reports. We track a lot of job data and store it for a long time. We use this data to generate usage reports and look at patterns and trends. We may report this data, including your individual data, to your adviser, department head, dean, or other administrator or supervisor.
  3. Maintaining the overall stability of the system is paramount to us. While we make every effort to ensure that every job completes in the most efficient and accurate way possible, the stability of the cluster is our primary concern. This may affect you, but mostly we hope it benefits you. System availability is based on our best efforts. We are staffed to provide support during normal business hours. We try very hard to provide support as broadly as possible, but cannot guarantee support 24 hours a day. Additionally, we perform system maintenance on a periodic basis, driven by the availability of software updates, staffing availability, and input from the user community. We do our best to schedule around your needs, but there will be times when the system is unavailable. We will announce scheduled outages at least one month in advance on the ARC-TS home page; we will announce unscheduled outages as quickly as we can, with as much detail as we have, on that same page. You can also track ARC-TS on Twitter (@ARC-TS).
  4. Lighthouse is intended only for non-commercial, academic research and instruction. Commercial use of some of the software on Lighthouse is prohibited by software licensing terms. Prohibited uses include product development or validation, any service for which a fee is charged, and, in some cases, research involving proprietary data that will not be made available publicly. Please contact hpc-support@umich.edu if you have any questions about this policy, or about whether your work may violate these terms.
  5. You are responsible for the security of sensitive codes and data. If you will be storing export-controlled or other sensitive or secure software, libraries, or data on the cluster, it is your responsibility to ensure that it is secured to the standards set by the most restrictive governing rules.  We cannot reasonably monitor everything that is installed on the cluster, and cannot be responsible for it, leaving the responsibility with you, the end user.
  6. Data subject to HIPAA regulations may not be stored or processed on the cluster.

USER RESPONSIBILITIES

Users must manage data appropriately in their various locations:

  • /home
  • /scratch (more information below)
  • /tmp
  • customer-provided NFS

SCRATCH STORAGE POLICIES

Every user has a /scratch directory for every Slurm account they are a member of.  Additionally for that account, there is a shared data directory for collaboration with other members of that account.  The account directory group ownership is set using the Slurm account-based UNIX groups, so all files created in the /scratch directory are accessible by any group member, to facilitate collaboration.

Example:
/scratch/msbritt_root/msbritt
/scratch/msbritt_root/msbritt/bob
/scratch/msbritt_root/msbritt/shared_data

Users are able to use /scratch with size- and inode-based quotas (sizes TBD) and with an auto-purge policy on unaccessed files, which means that any unaccessed data will be automatically deleted by the system after 60 days. Scratch file systems are not backed up. Critical files should be backed up to another location.
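
To see which files are at risk before the purge runs, you can list files by last access time. The sketch below is a minimal example, assuming that the purge is based on a file’s access time (atime); the text above says only “unaccessed”, so treat that as an assumption. The function name purge_candidates is hypothetical, and the path in the example call follows the layout shown above.

```shell
# Hypothetical helper: list regular files under a directory whose last
# access time is more than DAYS days old (default: 60).
purge_candidates() {
  dir="$1"
  days="${2:-60}"
  find "$dir" -type f -atime +"$days"
}

# Example call, using a lower threshold to get an early warning:
# purge_candidates /scratch/msbritt_root/msbritt/bob 45
```

Copy anything you still need to more durable storage rather than relying on access times to keep it alive.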

SECURITY ON LIGHTHOUSE/ USE OF SENSITIVE DATA

The Lighthouse high-performance computing system at the University of Michigan has the same security stance as the Great Lakes cluster.

Applications and data are protected by secure physical facilities and infrastructure as well as a variety of network and security monitoring systems. These systems provide basic but important security measures including:

  • Secure access – All access to Lighthouse is via SSH or Globus. SSH has a long history of high security.
  • Built-in firewalls – All of the Lighthouse computers have firewalls that restrict access to only what is needed.
  • Unique users – Lighthouse adheres to the University guideline of one person per login ID and one login ID per person.
  • Multi-factor authentication (MFA) – For all interactive sessions, Lighthouse requires both a UM Kerberos password and Duo authentication. File transfer sessions require a Kerberos password.
  • Private Subnets – Other than the login and file transfer computers that are part of Lighthouse, all of the computers are on a network that is private within the University network and are unreachable from the Internet.
  • Flexible data storage – Researchers can control the security of their own data storage by securing their storage as they require and having it mounted via NFSv3 or NFSv4 on Lighthouse. Another option is to make use of Lighthouse’s local scratch storage, which is considered secure for many types of data. Note: Lighthouse is not considered secure for data covered by HIPAA.


Updates and Notices

This section will be updated when system level changes are made to Lighthouse. There are currently no updates.
