Setting up a tensorflow-gpu conda environment with CUDA 11.2 and cuDNN 8.1-8.2 (NVIDIA 460 driver)

Solution 1:

UPDATE (08/30/21): The Esri conda channel has a tensorflow-gpu package that seems to work correctly out of the box. It can be installed using:

conda install -c esri tensorflow-gpu

If that doesn't work, see if what follows will help.

I have a Linux Mint 20.1 system (based on Ubuntu 20.04 LTS) with a GeForce RTX 3080 (driver version 460.80) and had a lot of issues trying to run Tensorflow in a conda environment.

The problem seems to be that there is currently no conda package set that correctly bundles Tensorflow 2.4+, CUDA 11+ and cuDNN 8+, which are required to run on this newer GPU architecture (more info here). If you use conda install -c anaconda tensorflow-gpu, it will install TF v2.2, cudatoolkit 10.x, and cudnn 7.x by default. If you try to force a newer version using tensorflow-gpu=2.4, it will either install the older, incompatible cudatoolkit 10.x/cudnn 7.x libraries anyway or not install them at all.

There are probably a number of different ways to do it but here is what worked for me after a lot of trial and error:

Step 1: Create a conda environment and install cudatoolkit and cudnn into it.

conda create -n tf_gpu_env -c conda-forge cudatoolkit cudnn python=3.8

As of this writing, this will install cudatoolkit 11.2, cudnn 8.2 and python 3.8.10 into this new environment. I used the conda-forge channel but imagine the anaconda and nvidia channels would work too.
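To confirm which versions conda actually resolved (the 11.2/8.2/3.8.10 numbers above are just what conda-forge happened to serve at the time), you can list the relevant packages in the new environment:

```shell
# Show the resolved cudatoolkit, cudnn, and python versions in tf_gpu_env.
conda list -n tf_gpu_env cudatoolkit
conda list -n tf_gpu_env cudnn
conda list -n tf_gpu_env python
```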

Step 2: Activate the environment and install tensorflow-gpu using pip, not conda. Installing from conda will either take a very long time as conda tries to resolve conflicts before it errors out, or will forcefully downgrade cudatoolkit and cudnn to older versions.

conda activate tf_gpu_env
pip install tensorflow-gpu

As of this writing, this installs tensorflow-gpu 2.5.0.

Step 3: Check that Tensorflow is working and using the GPU. Make sure the new environment is activated and start a python session in the terminal. I use the following statements for my check.

>>> import tensorflow as tf 

should return a message saying it successfully opened libcudart

>>> tf.config.list_physical_devices('GPU')

should return a long message that it successfully opened a bunch of cuda libraries and more importantly, a list at the end with a named tuple indicating that it found the GPU (e.g. [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]). If this returns an empty list, then Tensorflow is not using the GPU.

As a last check, create some random tensor with tf.constant or tf.random. This check is very important: Tensorflow can still recognize your GPU even when the CUDA libraries are incompatible, and the two commands above can return messages suggesting all is well. If everything is working correctly, however, the following command (or similar) should execute and return a tensor almost instantly:

>>> tf.random.uniform([4, 4, 4, 4])

If things are out of whack, there will be a very long lag before you get the answer (although subsequent calls may be quick). It is even worse when you try to run actual models: lags lasting many minutes, or close to an hour, before the first epoch runs, plus unpredictable behavior such as nan values from certain networks like CNNs.
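The lag check above can be wrapped in a small helper. This is just a sketch: timed_first_op and the 30-second threshold are my own choices, not anything from Tensorflow, and it assumes you run it inside the activated environment.

```python
import time

def timed_first_op(fn, threshold_s=30.0):
    """Run fn once and report how long the first call took.

    A multi-minute first call is the symptom described above: Tensorflow
    found CUDA libraries, but they are incompatible with the driver.
    The 30-second threshold is an arbitrary cutoff, not an official value.
    """
    start = time.perf_counter()
    fn()  # trigger the first (lazy) CUDA kernel compilation/launch
    elapsed = time.perf_counter() - start
    return elapsed, elapsed > threshold_s

# Inside the activated environment you might call it like this:
#   import tensorflow as tf
#   elapsed, lagged = timed_first_op(lambda: tf.random.uniform([4, 4, 4, 4]))
#   print(f"first call: {elapsed:.1f}s, suspicious lag: {lagged}")
```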

Finally, a couple of notes:

  • Be careful when running conda install or conda update in this environment, and check the package plan carefully before hitting enter. For example, if you use conda to install tensorflow-probability, it might also install tensorflow-base as a dependency, which can override the pip-installed tensorflow-gpu.
  • You can also install other versions of Tensorflow and the CUDA libraries. For example, to use TF 2.4.1 with cudatoolkit 11.0 and cudnn 8.0, substitute cudatoolkit=11.0, cudnn=8.0, and tensorflow-gpu==2.4.1 (double equals for pip) in the installation commands above.
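Put together, the pinned variant of the steps above would look like this (versions taken from the note above; the environment name is arbitrary):

```shell
# Pinned versions: TF 2.4.1 + cudatoolkit 11.0 + cudnn 8.0
conda create -n tf_gpu_env -c conda-forge cudatoolkit=11.0 cudnn=8.0 python=3.8
conda activate tf_gpu_env
pip install tensorflow-gpu==2.4.1   # double equals for pip version pins
```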