Understanding the ResourceExhaustedError: OOM when allocating tensor with shape

Solution 1:

Let's divide the issues one by one:

About tensorflow to allocate all memory in advance, you can use following code snippet to let tensorflow allocate memory whenever it is needed. So that you can understand how the things are going.

gpu_options = tf.GPUOptions(allow_growth=True)
session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

This works equally with tf.Session() instead of tf.InteractiveSession() if you prefer.

Second thing about the sizes, As there is no information about your network size, we cannot estimate what is going wrong. However, you can alternatively debug step by step all the network. For example, create a network only with one layer, get its output, create session and feed values once and visualize how much memory you consume. Iterate this debugging session until you see the point where you are going out of memory.

Please be aware that 3840 x 155229 output is really, REALLY a big output. It means ~600M neurons, and ~2.22GB per one layer only. If you have any similar size layers, all of them will add up to fill your GPU memory pretty fast.

Also, this is only for forward direction, if you are using this layer for training, the back propagation and layers added by optimizer will multiply this size by 2. So, for training you consume ~5 GB just for output layer.

I suggest you to revise your network and try to reduce batch size / parameter counts to fit your model to GPU

Solution 2:

This may not make sense technically but after experimenting for some time, this is what I have found out.

ENVIRONMENT: Ubuntu 16.04


When you run the command

nvidia-smi

You will get the total memory consumption of the installed Nvidia graphic card. An example is as shown in this image enter image description here

When you have run your neural network your consumption may change to look like enter image description here

The memory consumption is typically given to python. For some strange reason, if this process fails to terminate successfully, the memory is never freed. If you try running another instance of the neural network application, you are bout to receive a memory allocation error. The hard way is to try to figure out a way to terminate this process using the process ID. Example, with process ID 2794, you can do

sudo kill -9 2794

The simple way, is to just restart your computer and try again. If it is a code related bug however, this will not work.

If the above mentioned process does not work,it is likely you are using a data batch size that can not fit into GPU or CPU memory.

What you can do is to reduce the batch size or the spatial dimensions (length, width and depth) of your input data. This may work but you may run out of RAM.

The surest way to conserve RAM is to use a function generator and that is subject matter on its own.