Tensorflow NaN bug?

Actually, it turned out to be something stupid. I'm posting this in case anyone else would run into a similar error.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved all my problems.


Actually, clipping is not a good idea as it will stop the gradient from propagating backwards when the threshold is reached. Instead we can add a little bit of constant to the softmax output.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))

A bias free alternative.

Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we need only handle the point of discontinuity--not the region near it.

Specific Answer

def cross_entropy(x, y, axis=-1):
  safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
  return -tf.reduce_sum(x * tf.log(safe_y), axis)

def entropy(x, axis=-1):
  return cross_entropy(x, x, axis)

But did it work?

x = tf.constant([0.1, 0.2, 0., 0.7])
e = entropy(x)
# ==> 0.80181855
g = tf.gradients(e, x)[0]
# ==> array([1.30258512,  0.60943794, 0., -0.64332503], dtype=float32)  Yay! No NaN.

(Note: deleted dup cross-post.)

General Recipe

Use an inner tf.where to ensure the function has no asymptote. That is, alter the input to the inf generating function such that no inf can be created. Then use a second tf.where to always select the valid code-path. That is, implement the mathematical condition as you would "normally", i.e., the "naive" implementation.

In Python code, the recipe is:

Instead of this:

tf.where(x_ok, f(x), safe_f(x))

Do this:

safe_x = tf.where(x_ok, x, safe_x)
tf.where(x_ok, f(safe_x), safe_f(x))

Example

Suppose you wish to compute:

f(x) = { 1/x, x!=0
       { 0,   x=0

A naive implementation results in NaNs in the gradient, i.e.,

def f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  return tf.where(x_ok, f(x), safe_f(x))

Does it work?

x = tf.constant([-1., 0, 1])
tf.gradients(f(x), x)[0].eval()
# ==> array([ -1.,  nan,  -1.], dtype=float32)
#  ...bah! We have a NaN at the asymptote despite not having
# an asymptote in the non-differentiated result.

The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that the result f(x) is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:

def safe_f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  safe_x = tf.where(x_ok, x, tf.ones_like(x))
  return tf.where(x_ok, f(safe_x), safe_f(x))

But did it work?

x = tf.constant([-1., 0, 1])
tf.gradients(safe_f(x), x)[0].eval()
# ==> array([-1.,  0., -1.], dtype=float32)
# ...yay! double-where trick worked. Notice that the gradient
# is now a constant at the asymptote (as opposed to being NaN).

If y_conv is the result of a softmax, say, y_conv = tf.nn.softmax(x), then an even better solution is to replace it with log_softmax:

y = tf.nn.log_softmax(x)
cross_entropy = -tf.reduce_sum(y_*y)

Sometimes you use tf.sqrt() function without adding a small constant 1e-10 in it, inducing this nan problem.