Overview

Nitro is designed to schedule thousands to millions of tasks very quickly. It works in conjunction with Torque, so you will create a file containing a list of tasks to perform that Nitro will execute, then you will submit a single PBS job that executes Nitro, which in turn will run through the list of tasks.

From Adaptive Computing’s website: “Nitro facilitates the execution of small compute tasks on a very large scale and without the overhead of individual scheduler jobs. Instead of creating individual jobs, Nitro combines all of the compute tasks into a single file. The file is then sent to Nitro as part of a job, and Nitro distributes the compute tasks across the allocated nodes. Tasks are executed on multiple threads on each compute node. Since the overhead of managing these tasks is small, most of the allocated compute resources can be spent executing the desired tasks.”

How to Use Nitro

To run jobs using Nitro, you need a PBS script and a Nitro Task File.  In your PBS script, you will define job resource requirements and specify the location of your Nitro Task File. Then you will execute Nitro.  

Example Files

To get started, you can download an example PBS script and Nitro Task File with:

git clone https://bitbucket.org/umarcts/nitro-examples.git

PBS Script

The PBS script for Nitro jobs is almost identical to a normal PBS script, with the following differences. You should request processors in groups of four (ppn=4). This enables Nitro to be efficient while allowing you to submit to a known number of processors in our heterogeneous environment. You should also request the generic resource “nitro” once for each processor. Lastly, you should specify your memory request using pmem (per-processor memory) rather than mem. This ensures that each processor has enough memory.

#PBS -l nodes=X:ppn=4  
#PBS -l gres=nitro:Y    (where Y=4X)
#PBS -l pmem=300mb

Note that the value you request with gres=nitro:Y should be the total number of processors that your job will be running on (the number of nodes multiplied by four).
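The relationship between the node count and the nitro resource count is simple arithmetic, and it can be checked in the shell before you edit your script. The following is a minimal sketch; the function name nitro_gres is made up for illustration and is not part of Nitro or Torque.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nitro or Torque): given the number of
# nodes X requested with ppn=4, print the matching resource request lines.
nitro_gres() {
    local nodes=$1
    local ppn=4                      # processors are requested in groups of four
    local total=$(( nodes * ppn ))   # Y = 4X, the total number of processors
    echo "#PBS -l nodes=${nodes}:ppn=${ppn}"
    echo "#PBS -l gres=nitro:${total}"
}

# Print the two request lines for a two-node job.
nitro_gres 2
```

For two nodes this yields a nitro count of 8, which matches the full example script in this section.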

You must also export the environment variables for the Nitro Coordinator Options and the Nitro Task File. Finally, you must launch Nitro.

export NITRO_COORD_OPTIONS="--run-local-worker"
export NITRO_TASK_FILE=/home/uniqname/nitro/my_nitro_job.nbatch
/opt/nitro/scripts/torque/launch_nitro.sh

A full Nitro PBS script (e.g., nitro.pbs) might look like this:

# Set gres to nodes*4
#PBS -l nodes=2:ppn=4
#PBS -l gres=nitro:8

# Set memory for each processor
#PBS -l pmem=3000mb
#PBS -l walltime=15:00

# Credentials
# Set your Moab account 
#PBS -A example_flux
#PBS -l qos=flux
#PBS -q flux

#PBS -N nitrotest
#PBS -j oe
# Mail options - insert your uniqname here
#PBS -M uniqname@umich.edu

#PBS -m n
#PBS -V

if [ -d "$PBS_O_WORKDIR" ] ; then
 cd "$PBS_O_WORKDIR"
 echo "Running from $PBS_O_WORKDIR"
fi

# Set Nitro environment variables
export NITRO_COORD_OPTIONS="--run-local-worker"
export NITRO_TASK_FILE="${PBS_O_WORKDIR}/hostname.nbatch"

# Launch nitro
/opt/nitro/scripts/torque/launch_nitro.sh

Nitro Task File

The Nitro Task File contains a set of command lines, one per line of the file. As an example Nitro Task File, consider the contents of my_nitro_job.nbatch:

/bin/hostname; sleep 1
/bin/hostname; sleep 1
/bin/hostname; sleep 1
/bin/hostname; sleep 1
/bin/hostname; sleep 1
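Since Nitro is meant for thousands of tasks or more, you will usually generate the task file rather than type it by hand. The following is a minimal sketch that writes the example command line a given number of times; the function name make_task_file and the count of 1000 are illustrative assumptions, and in practice each line would normally be a different command.

```shell
#!/bin/bash
# Hypothetical helper: write COUNT copies of the example command line,
# one task per line, to the task file OUTFILE (overwriting it if it exists).
make_task_file() {
    local count=$1
    local outfile=$2
    local i=1
    while [ "$i" -le "$count" ]; do
        echo "/bin/hostname; sleep 1"
        i=$(( i + 1 ))
    done > "$outfile"
}

# Generate a 1000-task file in the current directory and count its lines.
make_task_file 1000 my_nitro_job.nbatch
wc -l < my_nitro_job.nbatch
```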

When you submit your PBS job ($ qsub nitro.pbs), you’ll have a single PBS job for all of your Nitro tasks.

Monitoring Your Nitro Job

nitrostat

You can monitor the status of your Nitro job using the command nitrostat <jobid>, where <jobid> is your PBS job id. You must use the full name of the job (1234.nyx.arc-ts.umich.edu, not just the numeric part of the job id). The output gives statistics for the run, including the number of tasks, completion percentage, number of successes and failures, and the load average of the nodes running the job. Note that the example job below was not computationally intensive; you should expect to see a high load average if your jobs are fully utilizing the compute nodes.

$ /opt/nitro/bin/nitrostat 17435106.nyx.arc-ts.umich.edu
Nitro Job Progress Report

Start Time  : 2015-10-24 19:04:35-0400
Current Time: 2015-10-24 19:07:06-0400
Elapsed Time: 151 seconds (00:02:31)

Job Id      : 17435106.nyx.arc-ts.umich.edu
Coordinator : nyx5996
Task Log    : /home/msbritt/nitro/17435106.nyx.arc-ts.umich.edu/nitro_17435106.nyx.arc-ts.umich.edu.tasklog.txt
Task File   : /home/msbritt/nitro/hostname.nbatch
  File Size : 1150000
  Est Tasks : 50000
  Processed : 65%

Tasks
------------
Pending     : 3500
In Progress : 0
Completed   : 24750
  Success   : 24750
  Failure   : 0
  Timeout   : 0
  Invalid   : 0
  Tasks/sec : 163.9
Total Tasks : 33000

Workers
-------
Host          Pid    Thrds Status  Assigned InPrgrs Completed  Success  Failure  Timeout Tasks/sec AsgmtDur Load
nyx5996:47004 9507      1  running      250     250         0        0        0        0       1.0      0.0  0.8
nyx5998       18814    20  running     3250     500      2750     2750        0        0      19.7     14.8  0.0
nyx6030       10815    20  running     3250     500      2750     2750        0        0      19.7     14.8  0.5
nyx6045       29998    20  running     3250     500      2750     2750        0        0      19.7     14.8  0.3
nyx6052       9893     20  running     3250     500      2750     2750        0        0      19.7     14.8  0.3
nyx6061       3033     20  running     3250     500      2750     2750        0        0      19.7     14.8  0.0
nyx6073       22433    20  running     3250     500      2750     2750        0        0      19.7     14.8  0.2
nyx6074       6987     20  running     3250     500      2750     2750        0        0      19.7     14.8  1.0
nyx6075       29997    20  running     3250     500      2750     2750        0        0      19.7     14.8  0.9
nyx6076       21514    20  running     3250     500      2750     2750        0        0      19.7     14.8  0.1
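Because nitrostat needs the full job name, it can help to check the id before calling it. The following is a minimal sketch; is_full_jobid is a made-up helper, and its pattern (digits, a dot, then a server name) is an assumption based on the job ids shown on this page.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the argument looks like a full PBS
# job id, i.e. a numeric part, a dot, and then the server name.
is_full_jobid() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\..+$'
}

jobid="17435106.nyx.arc-ts.umich.edu"
if is_full_jobid "$jobid"; then
    # Print the command to run rather than running it, so the sketch
    # also works on machines where Nitro is not installed.
    echo "/opt/nitro/bin/nitrostat $jobid"
else
    echo "Use the full job id (for example 1234.nyx.arc-ts.umich.edu)" >&2
fi
```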

Joblog and Tasklog

Unless you specify otherwise, Nitro writes its tracking and log files to $HOME/nitro/full_pbs_jobid/ (e.g., /home/uniqname/nitro/17435106.nyx.arc-ts.umich.edu). There you will find the joblog, which looks exactly like the output of nitrostat; the tasklog, which shows the stats of each task; and a directory of logs.

$ pwd
/home/uniqname/nitro/17435106.nyx.arc-ts.umich.edu
$ ls -l
total 7544
drwxr-xr-x 2 msbritt hpcstaff    4096 Oct 24 19:04 logs
-rw-r--r-- 1 msbritt hpcstaff    1902 Oct 24 19:12 nitro_17435106.nyx.arc-ts.umich.edu.joblog.txt
-rw-r--r-- 1 msbritt hpcstaff 7676559 Oct 24 19:12 nitro_17435106.nyx.arc-ts.umich.edu.tasklog.txt
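If you want to locate these files from a script, the paths can be built from the full job id. The following is a minimal sketch that assumes the default log location and the file naming seen in the listing above; nitro_log_paths is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the default joblog and tasklog paths for a
# full PBS job id, following the naming shown in the directory listing.
# Assumes the default log location ($HOME/nitro/<full job id>).
nitro_log_paths() {
    local jobid=$1
    local dir="$HOME/nitro/$jobid"
    echo "$dir/nitro_${jobid}.joblog.txt"
    echo "$dir/nitro_${jobid}.tasklog.txt"
}

nitro_log_paths 17435106.nyx.arc-ts.umich.edu
```

The second path is the same file that nitrostat reports on its "Task Log" line.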

Advanced Options

Job Recovery

If your job has exited before it completed all of its tasks, you can restart the job from the last task completed. You simply need to set the NITROJOBID environment variable to the full JOBID of your job and resubmit.

$ export NITROJOBID=16725656.nyx.arc-ts.umich.edu
$ qsub nitro.pbs

Setting Options in Your Nitro Task File

You can set additional options in your task file, if your tasks need more cores, memory, etc.

The options include the following.

cores=<count>
env=<name=value>[,<name=value>,...]
labels=<label>[,<label>,...]
name=<task name>
maxtime=<time limit in seconds>
memory=<amount>
shell=[default | none | <shell path>]
cmd="<command line>"

An example task file might look like:

name=S21T00 cmd="/opt/framemaker/bin/framegen -i /shared/scene21.def -tindex 0"
name=S21T01 cmd="/opt/framemaker/bin/framegen -i /shared/scene21.def -tindex 1"
name=S22T00 labels=green maxtime=30 cmd="/opt/framemaker/bin/framegen -i /shared/scene22.def -tindex 0"
name=S22T01 labels=green maxtime=30 cmd="/opt/framemaker/bin/framegen -i /shared/scene22.def -tindex 1"
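Task lines like these follow a regular pattern, so they are easy to generate in a loop. The following is a minimal sketch that reproduces the name and cmd fields of the example above; scene_tasks is a made-up helper, and the framegen command and paths are simply copied from the example.

```shell
#!/bin/bash
# Hypothetical helper: print one task line per frame index for a scene,
# following the name=... cmd="..." pattern of the example task file.
scene_tasks() {
    local scene=$1      # scene number, e.g. 21
    local frames=$2     # number of frames; indices run from 0 to frames-1
    local i=0
    while [ "$i" -lt "$frames" ]; do
        printf 'name=S%sT%02d cmd="/opt/framemaker/bin/framegen -i /shared/scene%s.def -tindex %d"\n' \
            "$scene" "$i" "$scene" "$i"
        i=$(( i + 1 ))
    done
}

# Print the two task lines for scene 21.
scene_tasks 21 2
```

Redirecting the output to a file (for example, scene_tasks 21 2 > scenes.nbatch, where scenes.nbatch is an arbitrary name) gives a task file you can point NITRO_TASK_FILE at.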