Keras: Difference between Kernel and Activity regularizers
I have noticed that weight_regularizer is no longer available in Keras and that, in its place, there are activity and kernel regularizers. I would like to know:
- What are the main differences between kernel and activity regularizers?
- Could I use activity_regularizer in place of weight_regularizer?
The activity regularizer works as a function of the output of the net, and is mostly used to regularize hidden units, while weight_regularizer, as the name says, works on the weights (e.g. making them decay). Basically you can express the regularization loss as a function of the output (activity_regularizer) or of the weights (weight_regularizer).
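For instance, here is a minimal sketch of that distinction using custom regularizers (the function names are illustrative, not part of the Keras API; any callable that takes a tensor and returns a scalar loss contribution works):

```python
from keras import backend as K
from keras.layers import Dense

# Both regularizers are just callables that return a scalar penalty added to
# the total loss; the difference is which tensor Keras passes them.

def penalize_weights(weight_matrix):   # receives the layer's kernel (weights)
    return 0.01 * K.sum(K.square(weight_matrix))

def penalize_outputs(output_tensor):   # receives the layer's output (activity)
    return 0.01 * K.sum(K.square(output_tensor))

layer = Dense(64,
              kernel_regularizer=penalize_weights,     # weight decay
              activity_regularizer=penalize_outputs)   # output penalty
```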
The new kernel_regularizer replaces weight_regularizer, although it's not very clear from the documentation.
From the definition of kernel_regularizer:

kernel_regularizer: Regularizer function applied to the kernel weights matrix (see regularizer).
And activity_regularizer:
activity_regularizer: Regularizer function applied to the output of the layer (its "activation"). (see regularizer).
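Putting the two side by side, here is a minimal sketch with the built-in regularizers (Keras with the TensorFlow backend; the l1/l2 factors are arbitrary):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers

model = Sequential([
    Dense(64,
          input_shape=(100,),
          activation="relu",
          kernel_regularizer=regularizers.l2(1e-4),     # penalty on the weights matrix
          activity_regularizer=regularizers.l1(1e-5)),  # penalty on the layer's output
    Dense(1),
])
```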
Important Edit: Note that there is a bug in the activity_regularizer that was only fixed in version 2.1.4 of Keras (at least with the TensorFlow backend). In the older versions, the activity regularizer function is applied to the input of the layer, instead of being applied to the output (the actual activations of the layer, as intended). So beware: if you are using an older version of Keras (before 2.1.4), activity regularization may not work as intended.
You can see the commit on GitHub: five months ago François Chollet provided a fix to the activity regularizer, which was then included in Keras 2.1.4.
This answer is a bit late, but may be useful for future readers.
As they say, necessity is the mother of invention: I only understood this difference when I needed it.
The above answer doesn't really state the difference, because both of them end up affecting the weights. So what's the difference between penalizing the weights themselves and penalizing the output of the layer?
Here is the answer: I encountered a case where the weights of the net were small and well-behaved, ranging from -0.3 to +0.3. So I really couldn't punish them; there was nothing wrong with them, and a kernel regularizer was useless. However, the output of the layer was HUGE, in the hundreds.
Keep in mind that the input to the layer was also small, always less than one. But those small values interacted with the weights in such a way that they produced those massive outputs. Here I realized that what I needed was an activity regularizer rather than a kernel regularizer. With this, I'm punishing the layer for those large outputs; I don't care if the weights themselves are small, I just want to deter the layer from reaching such a state, because it saturates my sigmoid activation and causes plenty of other trouble, like vanishing gradients and stagnation.
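To make the scenario concrete, here is a hedged sketch (the layer sizes and regularization factor are made up): the weights stay small, so there is no kernel regularizer, but an l2 activity regularizer penalizes the layer for producing the huge outputs that would saturate the sigmoid that follows.

```python
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras import regularizers

model = Sequential([
    # The weights here are small and fine, so no kernel_regularizer;
    # the activity regularizer punishes the large outputs that would
    # otherwise saturate the sigmoid below.
    Dense(128, input_shape=(50,),
          activity_regularizer=regularizers.l2(1e-3)),
    Activation("sigmoid"),
    Dense(1),
])
```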