Neural network backpropagation with ReLU
If x <= 0, the output is 0; if x > 0, the output is 1.
The ReLU function is defined as f(x) = max(0, x), i.e. for x > 0 the output is x.
So the derivative f'(x) is actually:
if x < 0, the output is 0; if x > 0, the output is 1.
The derivative f'(0) is not defined, so in practice it is usually just set to 0, or you modify the activation function to a leaky variant, f(x) = max(εx, x) for a small ε.
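To make that concrete, here is a minimal sketch of the ReLU forward pass and its derivative, assuming NumPy and the usual convention of assigning 0 to the undefined point x = 0:

```python
import numpy as np

def relu(x):
    # Forward pass: f(x) = max(0, x), applied elementwise.
    return np.maximum(0, x)

def relu_derivative(x):
    # f'(x) = 1 for x > 0 and 0 for x < 0; the undefined point x = 0
    # is assigned 0 here by convention.
    return (x > 0).astype(float)
```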
Generally: a ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden unit, except that instead of tanh(x), sigmoid(x), or whatever activation you would otherwise use, you use f(x) = max(0, x).
If you have written code for a working multilayer network with sigmoid activation, it is literally a one-line change. Nothing about forward or backward propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
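To illustrate the "one-line change", here is a minimal sketch of a single hidden layer's forward and backward pass; the layer sizes, variable names, and use of NumPy are assumptions for the example, not the asker's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3)) * 0.1   # hypothetical weights: 3 inputs -> 4 hidden units
x = rng.standard_normal(3)              # hypothetical input vector

# Forward pass through the hidden layer.
z = W @ x
# a = 1.0 / (1.0 + np.exp(-z))          # sigmoid activation
a = np.maximum(0, z)                    # ReLU activation: the one line that changes

# Backward pass: only the local derivative of the activation differs.
grad_a = rng.standard_normal(4)         # hypothetical gradient flowing back into this layer
# grad_z = grad_a * a * (1 - a)         # sigmoid derivative
grad_z = grad_a * (z > 0)               # ReLU derivative (0 at z <= 0 by convention)
grad_W = np.outer(grad_z, x)            # gradient with respect to the weights
```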
If you have a layer made out of a single ReLU, like your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will either return 0 or 0.5 if you're using logistic units, and the softmax will squash those. So a value of 0 under your current architecture doesn't make much sense for the forward propagation part either.
See for example this. What you can do is use a "leaky ReLU", which replaces the flat zero output for negative inputs with a small slope, such as 0.01.
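A minimal sketch of a leaky ReLU and its derivative, using the 0.01 slope mentioned above (the function names and the NumPy-based implementation are assumptions):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by a small slope alpha instead of being
    # clamped to 0, so some gradient always flows through the unit.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # Derivative is 1 for x > 0 and alpha for x < 0; x = 0 gets alpha here by convention.
    return np.where(x > 0, 1.0, alpha)
```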
However, I would reconsider this architecture; it doesn't make much sense to me to feed a single ReLU into a bunch of other units and then apply a softmax.