TensorFlow is an end-to-end open source platform for machine learning (ML). It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications.
To use TensorFlow, you may either (a) load the module files for the TensorFlow versions that are installed on the cluster, or (b) install the TensorFlow version of your choice into your local Python library collection.
Using the TensorFlow Modules
On the Great Lakes cluster, there are two TensorFlow modules available for use currently:
In the module names, ?.? stands for the version numbers of the given release. To determine the exact versions of the TensorFlow modules available, use
$ module spider tensorflow
and the available modules with version numbers will be returned to you. You may load one of the tensorflow modules via:
$ module load tensorflow/<version>
As an alternative to the TensorFlow modules, you may wish to install a specific version of TensorFlow into your personal Python library collection. You only need to install Python packages once for each cluster on which you wish to use the library and, separately, for each version of Python that you use.
The most recent Anaconda module that is compatible with TensorFlow 1 is the one that provides Python 3.6. To install TensorFlow 1, first load the python3.6-anaconda module:
$ module load python3.6-anaconda
With the python3.6-anaconda module loaded, you can install Python packages into your personal library using the pip command with the --user option, which by default places packages in your per-user site-packages directory (on Linux, typically ~/.local/lib/python?.?/site-packages, where ?.? are the numbers in the Python version). The library will then be available to you for this and future sessions.
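To see where --user installs will land for the Python that is currently loaded, the standard-library site module can report the per-user site-packages directory. This is a small sketch that uses only the standard library; the printed path depends on your account and Python version.

```python
import site
import sys

# Directory that `pip install --user` targets for this interpreter
user_site = site.getusersitepackages()

# Major.minor version, i.e. the ?.? part of the directory name
version = "{}.{}".format(sys.version_info.major, sys.version_info.minor)

print("Python version:", version)
print("User site-packages:", user_site)
# True when the per-user directory is added to sys.path at startup
print("User site enabled:", site.ENABLE_USER_SITE)
```

Run it once with each Anaconda module loaded to confirm that each Python version has its own personal library directory.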
To install the TensorFlow 1 package (version 1.15 or greater), the pip install command is
$ pip install --user "tensorflow>=1.15,<2.0"
Earlier versions of TensorFlow (<1.15) required installation of separate packages for use with and without a GPU device. Separate installations of TensorFlow packages are no longer required, regardless of whether you will be using a GPU device or not.
The most recent Anaconda module that is compatible with TensorFlow 2 is the one that provides Python 3.7. To install TensorFlow 2, first load the python3.7-anaconda module:
$ module load python3.7-anaconda
With the python3.7-anaconda module loaded, you can install Python packages into your personal library using the pip command with the --user option described above.
To install the TensorFlow 2 package, the pip install command is
$ pip install --user "tensorflow>=2.1"
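After either install, it can help to confirm which TensorFlow the loaded Python actually imports, for example to tell a --user install apart from one provided by a module. The following is a minimal sketch that assumes only that the package exposes __version__ and __file__; the helper name installed_tensorflow is hypothetical and not part of TensorFlow.

```python
def installed_tensorflow():
    """Return (version, location) of the importable TensorFlow, or None if absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.__version__, tf.__file__


info = installed_tensorflow()
if info is None:
    print("TensorFlow is not importable from this Python environment.")
else:
    version, location = info
    # The leading number distinguishes TensorFlow 1 from TensorFlow 2
    major = int(version.split(".")[0])
    print("TensorFlow", version, "loaded from", location)
    print("Major version:", major)
```

A location under your home directory suggests the --user install is the one being picked up.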
To confirm that your TensorFlow package is working properly, run the short test script from a GPU node: tf.py for TensorFlow 1, or tf-v2.py for TensorFlow 2. Both scripts are located in the examples directory. The following modules must be loaded to use TensorFlow with a GPU device: Anaconda3, CUDA, and cuDNN.
- Anaconda provides a Python environment with over 200 packages pre-installed
- CUDA is a parallel computing platform and programming model for computing on GPUs
- cuDNN is a GPU-accelerated library of primitives for deep neural networks
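The contents of tf.py and tf-v2.py are not reproduced in this guide. As an illustration only, a comparable check for TensorFlow 2.1 or later might look like the sketch below. It assumes eager execution (the TensorFlow 2 default); the function name run_check and the matrix values are hypothetical, chosen so the sum matches the sample result shown in this guide.

```python
def run_check():
    """Report visible GPUs and add two small matrices; returns None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment.")
        return None

    # tf.config.list_physical_devices is available in TensorFlow 2.1 and later
    gpus = tf.config.list_physical_devices("GPU")
    print("GPUs visible to TensorFlow:", len(gpus))

    # Illustrative inputs (an assumption, not taken from the cluster's test script)
    a = tf.constant([[1, 2, 3], [1, 2, 3]])
    b = tf.constant([[3, 4, 5], [3, 4, 5]])
    total = tf.add(a, b).numpy()
    print(total)
    return total.tolist()


result = run_check()
```

On a node without a GPU, or without the CUDA and cuDNN modules loaded, the GPU count would be expected to be zero while the addition still runs on the CPU.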
The following Slurm script starts a job on a GPU node and runs the test script.
#!/bin/bash
#SBATCH --job-name=tf_test
#SBATCH --account=<your-account>
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5gb
#SBATCH --mail-type=FAIL

# Load modules
module load python3.6-anaconda
module load cuda/10.0.130 cudnn/10.0-v7.6
module list

# Run the test
python3 /sw/examples/tensorflow/tf.py
Copy and paste the text above into a new Slurm batch script file such as tf-test.sbat, put your Slurm account name in place of <your-account>, and run the Slurm script with
$ sbatch tf-test.sbat
The last few lines of output produced from running the Slurm script on a GPU node, excluding possible warning messages, should include content similar to the following:
$ tail slurm-<jobID>.out | grep -v deprecated
2019-10-24 11:04:55.073023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15022 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
[[4 6 8]
 [4 6 8]]
Specifically, the output should identify a GPU device and show the calculation result. Standard output is written to a file with the default name slurm-<jobID>.out, or to the command line for an interactive bash job. If the example runs without errors, your TensorFlow installation is working.
If you are using TensorFlow without a GPU, the output of the example test will not include a line with the GPU specs. Instead, the last couple of output lines will be as follows:
2019-10-24 17:18:08.036518: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[[4 6 8]
 [4 6 8]]
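Checking the job output for the GPU line can also be scripted. The sketch below scans log lines for the "Created TensorFlow device" message from the sample GPU output above and extracts the device index and name. The regular expression and the function name find_gpus are assumptions based only on those sample lines; the log format may differ between TensorFlow versions.

```python
import re

# Sample lines copied from the GPU and CPU outputs shown above
GPU_LINE = (
    "2019-10-24 11:04:55.073023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15022 MB memory) "
    "-> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, "
    "compute capability: 7.0)"
)
CPU_LINE = (
    "2019-10-24 17:18:08.036518: I tensorflow/compiler/xla/service/service.cc:176] "
    "StreamExecutor device (0): Host, Default Version"
)

# Captures the GPU index and the device name from a device-creation message
DEVICE_RE = re.compile(r"Created TensorFlow device \(.*device:GPU:(\d+).*name: ([^,]+),")


def find_gpus(lines):
    """Return a list of (index, name) for each GPU device line in the job output."""
    found = []
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2).strip()))
    return found


print(find_gpus([GPU_LINE]))
print(find_gpus([CPU_LINE]))
```

To apply it to a real job, pass the lines of your slurm-<jobID>.out file to find_gpus; an empty list means no GPU device line was found.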
Previously Installed Versions
The python3.6-anaconda module on Great Lakes now provides NumPy version 1.16.3, which is compatible with TensorFlow 1. Earlier instructions from ARC-TS for installing TensorFlow required you to install NumPy version 1.16.3 into your personal library; this is no longer necessary because the python3.6-anaconda module has been updated to include the correct version of NumPy. If you installed TensorFlow following the earlier procedure and would like to upgrade to the most recent version of TensorFlow 1, please follow these instructions:
$ module load python3.6-anaconda
$ pip uninstall -y numpy
$ pip uninstall -y tensorflow
$ pip uninstall -y tensorflow-gpu
$ pip install --user "tensorflow>=1.15,<2.0"
Official TensorFlow documentation can be found here: TensorFlow Guide.