Flux Tips (Dos and Don’ts)

By February 23, 2016

Do:

  • use the U-M VPN in order to log in to Flux from off campus. WHY: Flux login is restricted to campus IP addresses.
  • have your jobs read their input from, and write their output to, the /scratch filesystem. WHY: /scratch is much faster and more reliable than your Flux home directory.
  • remember to load any modules your job needs each time you log in to Flux, before submitting any jobs.  Modules that you will always be using can be loaded automatically when you log in by putting the “module load” commands in your ~/privatemodules/default file. WHY: Software will not be available to your job if it is not loaded prior to submitting the job.
  • request 20% more than the maximum memory and maximum walltime you think your jobs might need.  WHY: If a job exceeds the requested memory or walltime, it will be terminated before it can finish.
  • use “#PBS -j oe” in your PBS scripts to combine the PBS output and error messages into a single file.  WHY: It is much easier to figure out what your job did (you won’t have to match up lines between the two files).
  • submit lots of jobs at once rather than submitting one job and waiting for it to complete before submitting another.  WHY: “Keeping the queue full” will give you the best overall throughput and utilization for your Flux allocation.
  • perform regular backups of all of your data on Flux yourself, including data in your home directory and in /scratch. WHY: If you lose a file, the Flux staff can’t get it back for you.
  • submit interactive jobs using “qsub -I”.  WHY: Interactive jobs run on the login nodes will be terminated after 15 minutes of CPU time, or sooner if they are disrupting normal service.
  • send any requests for help as a new email (not a reply to a previous email) to hpc-support@umich.edu.  WHY: You’ll get quicker help if you don’t send email to individuals directly and don’t reply to old (unrelated) support tickets that may be closed already.
  • run “qdel $(qselect -u $USER)” if you need to delete all of your jobs.  WHY: This deletes only the jobs that belong to you, which “qdel all” does not.
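
Several of the tips above (the 20% padding, “#PBS -j oe”, and working in /scratch) can be combined when writing a PBS script. The following is a minimal sketch, not an official Flux template: the function names pad20, to_walltime, and make_job, the job name, the /scratch path, and the program name are all hypothetical placeholders, and site-specific directives such as the account and queue are left out.

```shell
#!/bin/bash
# Sketch only: job name, paths, and program name are placeholders.

# pad20 N: print N plus 20%, rounded up to a whole number.
pad20() {
    echo $(( ($1 * 120 + 99) / 100 ))
}

# to_walltime MINUTES: print MINUTES as HH:MM:SS for "#PBS -l walltime=".
to_walltime() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# make_job MEM_MB MINUTES: print a PBS script whose memory and walltime
# requests are 20% larger than the estimates given as arguments.
make_job() {
    local mem walltime
    mem=$(pad20 "$1")
    walltime=$(to_walltime "$(pad20 "$2")")
    cat <<EOF
#!/bin/bash
#PBS -N example_job
#PBS -l mem=${mem}mb,walltime=${walltime}
#PBS -j oe
# Read input from and write output to /scratch (placeholder path).
cd /scratch/example_dir
./my_program input.dat > output.dat
EOF
}
```

For example, “make_job 4000 120 > job.pbs” would write a script that requests 4800 MB and 02:24:00 of walltime, which could then be submitted with “qsub job.pbs”.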

Don’t:

  • run interactive jobs or do significant computation on the Flux login nodes.  WHY: Processes on the Flux login nodes will be automatically terminated if they use 15 minutes of CPU time, or sooner if they are disrupting normal service.
  • use /scratch space for long-term storage; files that you have not used for two weeks or longer should be moved to your home directory or another system. WHY: /scratch is a limited, shared resource; also, files on Flux are not backed up anywhere and cannot be recovered if lost.
  • run “qdel all”. WHY: It will lock up the cluster scheduler for a long time while it tries to delete jobs that do not belong to you and that you do not have permission to delete.
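
To find candidates for moving out of /scratch, one option is to list files that have not been accessed for two weeks. The sketch below uses the standard find command; the function name stale_files is hypothetical, and it assumes the filesystem records access times, which may not hold if it is mounted with access-time updates disabled.

```shell
#!/bin/bash
# stale_files DIR [DAYS]: list regular files under DIR that were last
# accessed at least DAYS days ago (default: 14).
stale_files() {
    local dir=$1 days=${2:-14}
    # find counts whole days, so "+N" means more than N full days;
    # use DAYS - 1 to include files that are exactly DAYS days old.
    find "$dir" -type f -atime +"$(( days - 1 ))" -print
}
```

For example, “stale_files /scratch/example_dir” would print the files in that (placeholder) directory that are due to be moved or deleted.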