CUDA_ERROR_OUT_OF_MEMORY in tensorflow

Solution 1:

In case it's still relevant for someone: I encountered this issue when trying to run Keras/TensorFlow a second time, after a first run had been aborted. It seems the GPU memory was still allocated by the aborted run and therefore could not be allocated again. It was solved by manually ending all Python processes that were using the GPU, or alternatively by closing the existing terminal and running again in a new terminal window.
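
If you prefer to do this from Python instead of hunting PIDs by hand, here is a minimal sketch along those lines (my addition, not part of the original answer): it asks nvidia-smi for the processes currently holding GPU memory and kills the stray Python ones.

import os
import signal
import subprocess

# List every compute process currently holding GPU memory.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,process_name",
     "--format=csv,noheader"]).decode()

for line in out.strip().splitlines():
    pid, name = [field.strip() for field in line.split(",")]
    # Kill other Python processes so their GPU memory is released.
    if "python" in name.lower() and int(pid) != os.getpid():
        os.kill(int(pid), signal.SIGKILL)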

Solution 2:

By default, TensorFlow tries to allocate a fraction per_process_gpu_memory_fraction of the GPU memory for its process, to avoid costly memory management (see the GPUOptions comments).
This allocation can fail and raise the CUDA_OUT_OF_MEMORY warnings. I do not know what the fallback is in this case (either using CPU ops or allow_growth=True).
This can happen if another process is using the GPU at that moment (for instance, if you launch two processes running TensorFlow). The default behavior takes ~95% of the memory (see this answer).

When you use allow_growth = True, the GPU memory is not preallocated and can grow as you need it. This leads to smaller memory usage (the default option is to use the whole memory), but it decreases performance if not used properly, since it requires more complex memory handling (which is not the most efficient part of CPU/GPU interactions).
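
For reference, a minimal TF 1.x sketch of both options (the 0.4 fraction is just an example value):

import tensorflow as tf

# Option A: cap the fraction of GPU memory this process may preallocate.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)

# Option B: allocate on demand instead of grabbing everything up front.
# gpu_options = tf.GPUOptions(allow_growth=True)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))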

Solution 3:

I faced this issue when training models back to back. I figured that the GPU memory wasn't available because of the previous training run, so the easiest fix was to manually free the GPU memory before each new training run.

Use nvidia-smi to check the GPU memory usage:

nvidia-smi

Then try to reset the GPU:

nvidia-smi --gpu-reset

The reset requires root privileges and may not work if other processes are actively using the GPU.

Alternatively, you can use the following command to list all the processes that are using the GPU:

sudo fuser -v /dev/nvidia*

And the output should look like this:

USER        PID ACCESS COMMAND
/dev/nvidia0:        root       2216 F...m Xorg
                     sid        6114 F...m krunner
                     sid        6116 F...m plasmashell
                     sid        7227 F...m akonadi_archive
                     sid        7239 F...m akonadi_mailfil
                     sid        7249 F...m akonadi_sendlat
                     sid       18120 F...m chrome
                     sid       18163 F...m chrome
                     sid       24154 F...m code
/dev/nvidiactl:      root       2216 F...m Xorg
                     sid        6114 F...m krunner
                     sid        6116 F...m plasmashell
                     sid        7227 F...m akonadi_archive
                     sid        7239 F...m akonadi_mailfil
                     sid        7249 F...m akonadi_sendlat
                     sid       18120 F...m chrome
                     sid       18163 F...m chrome
                     sid       24154 F...m code
/dev/nvidia-modeset: root       2216 F.... Xorg
                     sid        6114 F.... krunner
                     sid        6116 F.... plasmashell
                     sid        7227 F.... akonadi_archive
                     sid        7239 F.... akonadi_mailfil
                     sid        7249 F.... akonadi_sendlat
                     sid       18120 F.... chrome
                     sid       18163 F.... chrome
                     sid       24154 F.... code

From here, I got the PID of the process that was holding the GPU memory, which in my case was 24154.

Use the following command to kill the process by its PID:

sudo kill -9 MY_PID

Replace MY_PID with the relevant PID.

Solution 4:

Tensorflow 2.0 alpha

The problem is that TensorFlow is greedy in allocating all available VRAM, which causes issues for some people.

For TensorFlow 2.0 alpha / nightly, use this:

import tensorflow as tf
# Allow this process to use at most 40% of the GPU memory.
tf.config.gpu.set_per_process_memory_fraction(0.4)

Source: https://www.tensorflow.org/alpha/guide/using_gpu
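
Note that the tf.config.gpu API above was specific to the 2.0 alpha. In stable TensorFlow 2.x releases, the equivalent knob is per-GPU memory growth; a minimal sketch, assuming a current TF 2.x install:

import tensorflow as tf

# Grow GPU memory on demand instead of preallocating the whole card.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)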