Why the 6 in relu6?

I've hacked together a deep feed-forward NN from scratch in R, and it seems more stable with "hard sigmoid" activations - max(0, min(1, x)) - than with ReLU. I'm trying to port it to TensorFlow and noticed that this activation function isn't built in, only relu6, which uses an upper cutoff at 6. Is there a reason for this? (I realize that you could do relu6(x*6)/6, but if the TF guys put the 6 there for a good reason, I'd like to know.) I'd also like to know whether others have explosion problems with ReLU in feed-forward nets (I'm aware of the RNN issues).
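For what it's worth, here is a rough sketch of the workaround I mean (assuming TensorFlow 2.x in eager mode; hard_sigmoid is just my own helper name, not a built-in):

    import tensorflow as tf

    def hard_sigmoid(x):
        # max(0, min(1, x)) built on top of relu6:
        # relu6(6*x) / 6 == min(max(6*x, 0), 6) / 6 == min(max(x, 0), 1)
        return tf.nn.relu6(x * 6.0) / 6.0

    x = tf.constant([-1.0, 0.25, 0.5, 2.0])
    print(hard_sigmoid(x).numpy())  # [0.   0.25 0.5  1.  ]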


From this reddit thread:

This is useful in making the networks ready for fixed-point inference. If you leave the upper limit unbounded, you lose too many bits to the Q part of a Q.f number. Keeping the ReLUs bounded by 6 lets them take at most 3 bits (values up to 8), leaving 4-5 bits for the .f part.

It seems, then, that 6 is a fairly arbitrary value, chosen according to the number of bits you want the network's activations to fit into for fixed-point inference. As for why only the version with the value 6 is implemented, I assume it's because that's the value that fits best in 8 bits, which is probably the most common use case.
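To make that concrete, here is a toy illustration of the fixed-point idea (not TensorFlow's actual quantization code): with activations capped at 6, the integer part never needs more than 3 bits, so an unsigned 8-bit value can spend the remaining 5 bits on the fraction, i.e. a Q3.5 format.

    import numpy as np

    # Toy unsigned Q3.5 fixed-point encoding: 3 integer bits (enough for
    # values up to 6 < 2**3) and 5 fractional bits, so each activation
    # fits in one byte. Not TensorFlow's real quantization scheme.
    FRAC_BITS = 5
    SCALE = 1 << FRAC_BITS  # 32 quantization steps per unit

    def quantize_relu6(x):
        x = np.clip(x, 0.0, 6.0)                     # relu6 guarantees this range
        return np.round(x * SCALE).astype(np.uint8)  # 0..192, fits in 8 bits

    def dequantize(q):
        return q.astype(np.float32) / SCALE

    a = np.array([0.1, 2.5, 5.97])
    q = quantize_relu6(a)
    print(q, dequantize(q))  # worst-case rounding error is 1/64 ~= 0.016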


TensorFlow's documentation (https://www.tensorflow.org/api_docs/python/tf/nn/relu6) points to the following paper:

... First, we cap the units at 6, so our ReLU activation function is y = min(max(x, 0), 6). In our tests, this encourages the model to learn sparse features earlier. In the formulation of [8], this is equivalent to imagining that each ReLU unit consists of only 6 replicated bias-shifted Bernoulli units, rather than an infinite amount. We will refer to ReLU units capped at n as ReLU-n units.

http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf

Since the cap originates from that paper, I suspect they tested different values of n and got the best results on their test set with n = 6.
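If you want to experiment with other caps yourself, a generic ReLU-n is easy to write (a sketch; relu_n is just an illustrative name, not a TensorFlow built-in):

    import tensorflow as tf

    def relu_n(x, n=6.0):
        # ReLU capped at n: y = min(max(x, 0), n); relu_n(x, 6.0) matches tf.nn.relu6
        return tf.minimum(tf.maximum(x, 0.0), n)

    x = tf.constant([-2.0, 1.0, 4.0, 9.0])
    print(relu_n(x, 6.0).numpy())  # [0. 1. 4. 6.]
    print(relu_n(x, 4.0).numpy())  # [0. 1. 4. 4.]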