Why is the softmax function called that?

I understand that the function "squashes" a real-valued vector into values between 0 and 1.

However, I don't see what this has to do with the "max" function, or why that makes it a "softer" version of the max function.


The largest element in the input vector remains the largest element after the softmax function is applied to the vector, hence the "max" part. The "soft" signifies that the function keeps information about the other, non-maximal elements in a reversible way (as opposed to a "hardmax", which is just the standard maximum function).
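For concreteness, with the usual definition $\operatorname{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$ applied to a small example vector:

$$\operatorname{softmax}\big((1, 2, 3)\big) \approx (0.09,\ 0.24,\ 0.67),$$

whereas the plain maximum of the same vector is just $3$. The third entry is still the largest after softmax, but the relative sizes of the other entries are kept rather than thrown away.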

The function produces a probability distribution from any real-valued vector, which is why it is used in machine learning when inputs need to be classified: the raw outputs of a neural network are normalised by this function so that they can be interpreted as probabilities over the classes.
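A minimal NumPy sketch of that normalisation step (the logits here are made up for illustration; subtracting the maximum before exponentiating is just the usual numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(logits):
    # Shifting by the max leaves the output unchanged (softmax is invariant
    # to adding a constant to every input) but avoids overflow in exp.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores ("logits") from a network's last layer, 4 classes.
logits = np.array([2.0, 1.0, 0.1, -1.5])
probs = softmax(logits)

print(probs)                                  # roughly [0.65, 0.24, 0.10, 0.02]
print(probs.sum())                            # 1.0 -- a valid probability distribution
print(np.argmax(probs) == np.argmax(logits))  # True: the largest element stays largest
```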


I always thought it was called softmax because it is differentiable ("soft") at every point and with respect to every element of the input vector. This explanation would be analogous to what makes the softplus function, $f(x) = \ln(1 + e^x)$, the "soft" version of $f(x) = \max(0, x)$.
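One way to make that analogy concrete is to compare derivatives: softplus is smooth everywhere, with the logistic sigmoid as its derivative, while $\max(0, x)$ has a kink at the origin:

$$\frac{d}{dx}\ln(1 + e^x) = \frac{e^x}{1 + e^x} = \sigma(x), \qquad \frac{d}{dx}\max(0, x) = \begin{cases} 0, & x < 0,\\ 1, & x > 0,\\ \text{undefined}, & x = 0. \end{cases}$$

In the same way, softmax is differentiable everywhere, whereas the hard maximum is not differentiable wherever two of its inputs tie for the largest value.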