- use the U-M VPN to log in to Flux from off campus. WHY: Flux login is restricted to campus IP addresses.
- have your jobs read their input from and write their output to the /scratch filesystem. WHY: /scratch is much faster and more reliable than your Flux home directory.
- remember to load any modules your job needs each time after logging in to Flux but before submitting any jobs. Modules that you will always be using can be loaded automatically when you log in by putting the modules into your default module set. WHY: Software will not be available to your job if it is not loaded prior to submitting the job.
- request 20% more than the maximum memory and maximum walltime you think your jobs might need. WHY: If a job exceeds its requested memory or walltime, it will be terminated before it can finish.
- use “#PBS -j oe” in your PBS scripts to combine the PBS output and error messages into a single file. WHY: It is much easier to figure out what your job did (you won’t have to match up lines between the two files).
- submit lots of jobs at once rather than submitting one job and waiting for it to complete before submitting another. WHY: “keeping the queue full” will give you the overall best throughput and utilization for your Flux allocation.
- perform regular backups of all of your data on Flux yourself, including data in your home directory and in /scratch. WHY: If you lose a file, the Flux staff can’t get it back for you.
- submit interactive jobs using “qsub -I”. WHY: Interactive jobs submitted this way run on compute nodes, whereas login nodes have a per-user resource limit of 8GB of memory and split CPU cycles evenly among users.
- send any requests for help as a new email (not a reply to a previous email) to firstname.lastname@example.org. WHY: You’ll get quicker help if you don’t send email to individuals directly and don’t reply to old (unrelated) support tickets that may be closed already.
- run “qdel $(qselect -u $USER)” if you need to terminate all of your jobs. WHY: “qselect -u $USER” lists only your own jobs, so qdel is not asked to delete anyone else’s.
- do not run interactive jobs or do significant computation on the Flux login nodes. WHY: Login nodes have a per-user resource limit of 8GB of memory and split CPU cycles evenly among users.
- do not use /scratch space for long-term storage; files that you won’t be using for two weeks or longer should be moved to your home directory or another system. WHY: /scratch is a limited, shared resource; also, files on Flux are not backed up and cannot be recovered if lost.
- do not run “qdel all”. WHY: It will lock up the cluster scheduler for a long time while it tries to delete jobs that do not belong to you and that you do not have permission to delete.
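
The advice above to request 20% more memory and walltime than you expect to need can be sketched as a small shell helper. This is only an illustration: the function names pad_by_20 and seconds_to_walltime, the example estimates, and the choice of the "mem" resource are assumptions, not part of the Flux documentation; check the resource names your jobs actually use before copying the directives.

```shell
#!/bin/bash
# Hypothetical helper: print N increased by 20%, rounded up to a whole number.
pad_by_20() {
    echo $(( ($1 * 12 + 9) / 10 ))
}

# Hypothetical helper: format a number of seconds as HH:MM:SS for "-l walltime=".
seconds_to_walltime() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 )) $(( $1 % 60 ))
}

# Example estimates (assumed values): 4000 MB of memory, 2 hours of walltime.
est_mem_mb=4000
est_wall_s=7200

mem_mb=$(pad_by_20 "$est_mem_mb")
walltime=$(seconds_to_walltime "$(pad_by_20 "$est_wall_s")")

# Print directives that could be pasted into a PBS script.
echo "#PBS -j oe"                     # combine output and error into one file
echo "#PBS -l mem=${mem_mb}mb"        # assumed resource name; padded memory request
echo "#PBS -l walltime=${walltime}"   # padded walltime request
```

With the example estimates, the padded requests come out to 4800mb of memory and a walltime of 02:24:00.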