TensorFlow is an end-to-end open source platform for machine learning (ML). It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications. Official TensorFlow documentation can be found in the TensorFlow Guide.
To use TensorFlow, you may either (a) load the module files for the TensorFlow versions that are installed on the cluster, or (b) install the TensorFlow version of your choice into your local Python library collection.
Using the TensorFlow Modules
On the Great Lakes cluster, three TensorFlow modules are currently available. The versions are represented by tensorflow/?.?, where ?.? indicates the version of the given release. To determine the exact versions of TensorFlow modules available, use
$ module spider tensorflow
and the available modules with version numbers will be returned to you. You may load one of the TensorFlow modules with the module load command. As an example, to load the TensorFlow module for version 2.3.1, you would enter the following command:
$ module load tensorflow/2.3.1
When a TensorFlow module is loaded, compatible versions of Python, CUDA, and cuDNN are loaded at the same time as dependencies. After loading the TensorFlow module, you can list the modules currently loaded in your environment with the module list command. For this particular version of TensorFlow, the list shows all of the dependent modules that have been loaded in addition to the TensorFlow module:
1) python3.8-anaconda/2020.07 2) cuda/10.1.243 3) cudnn/10.1-v7.6.4 4) tensorflow/2.3.1
Since TensorFlow is a Python library, the TensorFlow module requires, as a dependency, the specific version of Python that was used during installation. If you switch from a module for one version of TensorFlow to a module for a different version, the underlying version of Python may, and likely will, also change. This matters if you have installed any additional libraries or packages while using a TensorFlow module, because a new Python version will not have the packages that you installed for a different version. When switching to a different TensorFlow module, you will have to re-install any packages you need so that they are available in the library of the new version of Python. This can easily be done using pip install --user <package_name>. You only need to install Python packages once for each cluster on which you wish to use the library and, separately, for each version of Python that you use.
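Because installed packages are tied to a Python version, it can help to check which interpreter is active after switching modules and whether the packages you rely on are visible to it. The following is a minimal sketch using only the Python standard library. The function name missing_packages and the package names in the example are illustrative assumptions, not part of the cluster's tooling.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Report the interpreter provided by the currently loaded modules.
    print("Python %d.%d at %s" % (sys.version_info[0], sys.version_info[1], sys.executable))
    # Hypothetical package list; replace it with the packages you actually use.
    for name in missing_packages(["tensorflow", "numpy"]):
        print("not installed for this Python version:", name)
```

Any name that is reported as missing would need to be installed again with pip install --user for the Python version that is now active.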
As an alternative to the TensorFlow modules, you may wish to install a specific version of TensorFlow into your personal Python library collection. As explained above, you will need to install Python packages once for each cluster on which you wish to use the library and, separately, for each version of Python that you use.
At the time of this writing, the most recent version of Anaconda that is compatible with TensorFlow 2 is the one that provides Python 3.8. To install TensorFlow 2, you must first load the python3.8-anaconda module as follows:
$ module load python3.8-anaconda
With the python3.8-anaconda module loaded, you can install Python packages into your personal library using the pip command with the --user option, which by default places packages in a per-user directory whose path includes the Python version (3.8 for this example). When a different version of Python is used, the path reflects that version number in place of 3.8. The library will then be available to you for this and future sessions.
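To see where the personal library for the active Python version lives, you can ask Python itself. The sketch below uses the standard library's site module. The helper name user_library_info is an illustrative assumption, and the reported directory is the one that pip install --user normally targets.

```python
import site
import sys


def user_library_info():
    """Return the active Python version and its per-user package directory."""
    version = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    return version, site.getusersitepackages()


if __name__ == "__main__":
    version, path = user_library_info()
    print("Python", version, "user packages:", path)
```

Running this after loading a different Anaconda module should report a different version and, correspondingly, a different directory.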
To install the TensorFlow 2 package, the pip install command is
$ pip install --user "tensorflow > 2"
The most recent version of Anaconda that is compatible with TensorFlow 1 is the one that provides Python 3.7. To install TensorFlow 1, you must first load the python3.7-anaconda module as follows:
$ module load python3.7-anaconda
With the python3.7-anaconda module loaded, you can install Python packages into your personal library using the pip command with the --user option, as described above.
The most recent TensorFlow 1 release (version 1.15.5 at the time of this writing) requires a separate package for GPU use. For both CPU and GPU capability, the pip install commands are
$ pip install --user "tensorflow < 2"
$ pip install --user "tensorflow-gpu < 2"
Beginning with TensorFlow 2, installation of a separate package for use with a GPU device is no longer necessary. TensorFlow installation for any version >=2 can be completed in one step, as described above.
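The difference between the two major versions can be summarized in a few lines of Python. This is only a sketch of the rule described above: the function pip_requirements is a hypothetical helper, and the requirement strings are the ones used in the commands in this section.

```python
def pip_requirements(major_version, gpu=True):
    """Return the pip requirement strings for TensorFlow 1 or 2."""
    if major_version == 1:
        requirements = ["tensorflow < 2"]
        if gpu:
            # TensorFlow 1 ships GPU support as a separate package.
            requirements.append("tensorflow-gpu < 2")
        return requirements
    if major_version == 2:
        # TensorFlow 2 covers CPU and GPU use in a single package.
        return ["tensorflow > 2"]
    raise ValueError("only TensorFlow 1 and 2 are covered here")


if __name__ == "__main__":
    for major in (1, 2):
        for requirement in pip_requirements(major):
            print('pip install --user "%s"' % requirement)
```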
To ensure that your TensorFlow package is working properly, run the short test script from a GPU node: tf-2.py for TensorFlow 2, or tf-1.py for TensorFlow 1. Both scripts are located in the examples directory. The following modules must be loaded to use TensorFlow with a GPU device: Anaconda3, CUDA, and cuDNN.
- Anaconda provides a Python environment with over 200 packages pre-installed
- CUDA is a parallel computing platform and programming model for computing on GPUs
- cuDNN is a GPU-accelerated library of primitives for deep neural networks
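Separately from the tf-2.py and tf-1.py test scripts, a quick check from Python can confirm that TensorFlow imports and whether it sees a GPU once these modules are loaded. The sketch below is not the contents of either test script. It assumes a recent TensorFlow 2 release, where tf.config.list_physical_devices is available, and the function name tensorflow_gpu_report is illustrative.

```python
import importlib.util


def tensorflow_gpu_report():
    """Return (version, gpu_names), or None when TensorFlow is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily so the check above can fail softly

    # Assumes a TensorFlow 2 release that provides tf.config.list_physical_devices.
    gpus = tf.config.list_physical_devices("GPU")
    return tf.__version__, [gpu.name for gpu in gpus]


if __name__ == "__main__":
    report = tensorflow_gpu_report()
    if report is None:
        print("TensorFlow is not installed for this Python version")
    else:
        version, gpu_names = report
        print("TensorFlow", version, "GPUs:", gpu_names if gpu_names else "none visible")
```

On a login node or a CPU-only node the GPU list would be expected to be empty, so the check is most informative when run inside a GPU job.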
The Slurm script below starts a job on a GPU node and runs the tf-2.py test script.
#!/bin/bash
#SBATCH --job-name=tf_test
#SBATCH --account=<your-account>
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5gb
#SBATCH --mail-type=FAIL

# Load modules
module load python3.8-anaconda
module load cuda/10.1.243 cudnn/10.1-v7.6.4
module list

# Run the test
python3 /sw/examples/tensorflow/tf-2.py
Copy and paste the text above into a new Slurm batch script file such as tf-test.sbat, put your Slurm account name in place of <your-account>, and run the Slurm script with
$ sbatch tf-test.sbat
The last few lines of output produced from running the Slurm script on a GPU node, excluding possible warning messages, should include content similar to the following:
$ tail slurm-<jobID>.out | grep -v deprecated
2019-10-24 11:04:55.073023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15022 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
[[4 6 8]
 [4 6 8]]
Specifically, the output should identify a GPU device and show the calculation result. Standard output is written to a file named slurm-<jobID>.out by default, or printed to the terminal for an interactive bash job. If the example runs without errors, everything is good!
If you are using TensorFlow without a GPU, the output of the example test will not include a line with the GPU specs. Instead, the last couple of output lines will be as follows:
2019-10-24 17:18:08.036518: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[[4 6 8]
 [4 6 8]]
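If you prefer to check the job output programmatically rather than by eye, the GPU device line can be picked out with a regular expression. The sketch below is based only on the sample log lines shown in this section. The pattern and the function name gpus_in_log are assumptions, and the exact log wording may differ between TensorFlow versions.

```python
import re

# Matches the line TensorFlow logs when it creates a GPU device (format assumed
# from the sample output in this section).
GPU_LINE = re.compile(
    r"Created TensorFlow device \(.*?\) -> physical GPU \(device: (\d+), name: ([^,]+),"
)


def gpus_in_log(text):
    """Return (device_index, name) pairs for each GPU device reported in `text`."""
    return [(int(index), name) for index, name in GPU_LINE.findall(text)]


if __name__ == "__main__":
    sample = (
        "2019-10-24 11:04:55.073023: I tensorflow/core/common_runtime/gpu/"
        "gpu_device.cc:1326] Created TensorFlow device "
        "(/job:localhost/replica:0/task:0/device:GPU:0 with 15022 MB memory) "
        "-> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, "
        "pci bus id: 0000:d8:00.0, compute capability: 7.0)"
    )
    print(gpus_in_log(sample))
```

An empty list corresponds to the CPU-only case, where the output contains no GPU device line.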