Machine Learning: Is the softmax function Lipschitz with Lipschitz constant $1$?
Let the softmax function (https://en.wikipedia.org/wiki/Softmax_function) be defined as
$l: \mathbb{R}^K \to \text{int}(S^{K-1}), \quad l = (l_j)_{j \in \{1,\ldots, K\}}$, where $S^{K-1}$ is the $(K-1)$-simplex, such that for a vector $\mathbf{z} \in \mathbb{R}^K$
$l_j(\mathbf{z}) = \dfrac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$ for $j = 1, \ldots, K$, where each $l_j: \mathbb{R}^K \to (0,1)$.
Question:
Is the softmax function Lipschitz in the 2-norm? If so, is it Lipschitz with Lipschitz constant $1$?
I am asking because I have reason to believe that this is the case (through numerical simulation), but I have no idea how to verify it. If anyone has any reference to this result or a path going forward I would greatly appreciate it.
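For concreteness, a numerical check of this kind might look as follows (a minimal NumPy sketch; the function names and the sampling scheme are just one possible choice, not the exact simulation referred to above):

    import numpy as np

    def softmax(z):
        # Shift by the max for numerical stability; the output is unchanged.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def max_observed_ratio(K=10, trials=50_000, seed=0):
        # Largest observed value of ||softmax(x) - softmax(y)||_2 / ||x - y||_2
        # over random pairs of nearby points (a crude empirical Lipschitz estimate).
        rng = np.random.default_rng(seed)
        best = 0.0
        for _ in range(trials):
            x = rng.normal(scale=5.0, size=K)
            y = x + rng.normal(scale=1e-3, size=K)
            best = max(best, np.linalg.norm(softmax(x) - softmax(y)) / np.linalg.norm(x - y))
        return best

    print(max_observed_ratio())  # empirically stays below 1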
Solution 1:
It is easy to show that for any differentiable function $f:\mathbb{R}^n\rightarrow\mathbb{R}^m$,
$\left \|f(x)-f(y)\right \|_2 \leq \left \|J\right \|^*_F \left \|x-y\right\|_2 \, \, \forall x,y\in\mathbb{R}^n$
where
$\left \|J\right \|^*_F = \max\limits_{x} \left \|J(x)\right \|_F$
and
$J(x)$ is the Jacobian matrix of $f$ at $x$.
In fact, it is possible to obtain a tighter upper bound. But for your question, this upper bound will suffice.
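For completeness, here is the standard argument behind the inequality above: since $f$ is differentiable (softmax is in fact smooth), for any $x, y$
$f(x)-f(y) = \int_0^1 J\big(y + t(x-y)\big)\,(x-y)\,\mathrm{d}t,$
and since the spectral norm is dominated by the Frobenius norm, i.e. $\left\|Av\right\|_2 \le \left\|A\right\|_2\left\|v\right\|_2 \le \left\|A\right\|_F\left\|v\right\|_2$, taking norms gives
$\left\|f(x)-f(y)\right\|_2 \le \max\limits_{t \in [0,1]}\big\|J\big(y+t(x-y)\big)\big\|_F \,\left\|x-y\right\|_2 \le \left\|J\right\|^*_F\,\left\|x-y\right\|_2.$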
So, by the above inequality and the definition of Lipschitz continuity, we obtain that
$L \leq \left \|J\right \|^*_F$
Now, we must find the value of $\left \|J\right \|^*_F$.
In your case, $f:\mathbb{R}^K \rightarrow \mathbb{R}^{K}$ is defined as,
$f(x)_i = \frac{\exp(x_i)}{\sum\limits_{j=1}^{K}\exp(x_j)} = p_i(x), \quad i = 1, 2, \ldots, K$
For brevity, I'll write $p_i(x)$ as $p_i$. The Jacobian matrix at a point $x$ is
$J = \begin{bmatrix} p_1(1-p_1) & -p_1p_2 & -p_1p_3 & \cdots & -p_1p_{K-1} & -p_1p_K \\ -p_2p_1 & p_2(1-p_2) & -p_2p_3 & \cdots & -p_2p_{K-1} & -p_2p_K \\ -p_3p_1 & -p_3p_2 & p_3(1-p_3) & \cdots & -p_3p_{K-1} & -p_3p_K \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ -p_{K-1}p_{1} & -p_{K-1}p_{2} & -p_{K-1}p_{3} & \cdots & p_{K-1}(1-p_{K-1}) & -p_{K-1}p_{K} \\ -p_{K}p_{1} & -p_{K}p_{2} & -p_{K}p_{3} & \cdots & -p_{K}p_{K-1} & p_{K}(1-p_{K}) \end{bmatrix}$
i.e. $J_{ij} = p_i(\delta_{ij} - p_j)$ (the minus signs on the off-diagonal entries do not affect the Frobenius norm below).
Now, the Frobenius norm of the above matrix is
$\left \| J \right \|_F = \sqrt{\sum\limits_{i=1}^{K}\sum\limits_{\substack{j=1 \\ j\neq i}}^{K}p_{i}^{2}p_{j}^{2} + \sum\limits_{i=1}^{K} p_i^2(1-p_i)^{2}}$
Note that above, $p_k \equiv p_k(x)$.
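One can sanity-check both the Jacobian and the Frobenius-norm expression numerically, e.g. against central finite differences (a small NumPy sketch; the helper names are just illustrative):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def softmax_jacobian(z):
        # Analytic Jacobian: J_ij = p_i (delta_ij - p_j) = diag(p) - p p^T.
        p = softmax(z)
        return np.diag(p) - np.outer(p, p)

    def numerical_jacobian(z, h=1e-6):
        # Central finite differences, one coordinate direction at a time.
        K = len(z)
        J = np.zeros((K, K))
        for j in range(K):
            e_j = np.zeros(K)
            e_j[j] = 1.0
            J[:, j] = (softmax(z + h * e_j) - softmax(z - h * e_j)) / (2 * h)
        return J

    z = np.random.default_rng(0).normal(size=5)
    J = softmax_jacobian(z)
    print(np.max(np.abs(J - numerical_jacobian(z))))  # tiny discretization error
    p = softmax(z)
    frob_formula = np.sqrt(np.sum(np.outer(p, p)**2) - np.sum(p**4) + np.sum(p**2 * (1 - p)**2))
    print(np.linalg.norm(J, 'fro'), frob_formula)     # the two values agree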
Finally, $\left \| J \right \|^{*}_F = \max\limits_{x}\left \| J(x) \right \|_F $
Now, to compute $\left \| J \right \|^{*}_F$, one option is to take the partial derivatives of $\left \| J \right \|_F$ with respect to each $x_i$, solve the resulting equations for the critical points, and check which of them yields a maximum. Another way is to argue that the maximum is achieved when $p_{i}(x) = \frac{1}{K}$ for all $i \in \{1,\ldots,K\}$, and that perturbing one of the $p_{i}(x)$ towards either end, i.e. towards $0$ or $1$, decreases the value of $\left \| J \right \|_F$. This can be visualized for $K=2$ (check the plot of $f(x,y) = \frac{2e^{x+y}}{(e^x+e^y)^2}$ in WolframAlpha).
Putting $p_i = \frac{1}{K}$ in the above expression for $\left \| J \right \|_F$, we get $\left \| J \right \|^{*}_F = \frac{\sqrt{K-1}}{K}$, which is less than $1$ for $K\geq2$.
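Spelling out this substitution: at $p_i = \frac{1}{K}$ the off-diagonal part contributes $K(K-1)\cdot\frac{1}{K^4}$ and the diagonal part contributes $K\cdot\frac{1}{K^2}\left(\frac{K-1}{K}\right)^2$, so
$\left\|J\right\|_F^2 = \frac{K-1}{K^3} + \frac{(K-1)^2}{K^3} = \frac{K-1}{K^2}, \qquad \left\|J\right\|_F = \frac{\sqrt{K-1}}{K}$ at this point.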
Therefore, the Lipschitz constant is at most $\frac{\sqrt{K-1}}{K} < 1$ for all $K \geq 2$.
Please let me know if you don't understand some step or find some ambiguity in the above proof.
Solution 2:
The conclusion above is not correct: the maximum of the Frobenius norm is not attained at $p_i = \frac{1}{K}$. Note that already one entry $p_i=\frac{1}{2}$ gives $p_i(1-p_i) = \frac{1}{4}$ on the diagonal, so the Frobenius norm is at least $\frac{1}{4}$ regardless of $K$.
In particular, it does not decrease with $K$ as $\frac{\sqrt{K-1}}{K}$ would suggest.
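To make this concrete, here is a small NumPy sketch (the "two-spike" input is my own construction for illustration) comparing $\left\|J\right\|_F$ at the uniform point with a point where two coordinates share almost all of the mass:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def jacobian_frobenius(z):
        # ||J||_F for the softmax Jacobian J = diag(p) - p p^T at the point z.
        p = softmax(z)
        return np.linalg.norm(np.diag(p) - np.outer(p, p), 'fro')

    for K in (2, 5, 10, 100):
        uniform = np.zeros(K)                            # p_i = 1/K for all i
        two_spike = np.r_[20.0, 20.0, np.zeros(K - 2)]   # p_1 = p_2 ~ 1/2, the rest ~ 0
        print(K, np.sqrt(K - 1) / K, jacobian_frobenius(uniform), jacobian_frobenius(two_spike))

The two-spike value stays near $\frac{1}{2}$ for every $K$, while $\frac{\sqrt{K-1}}{K}$ shrinks, so $\left\|J\right\|^{*}_F$ cannot equal $\frac{\sqrt{K-1}}{K}$ for $K \ge 3$.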
The calculations above do prove that the Lipschitz constant is $O(1)$. The constant has been discussed in machine-learning papers; they often use the result that gives the bound $1$ in the 2-norm. See https://arxiv.org/pdf/1704.00805.pdf
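As a rough check of that bound: since softmax is smooth, its 2-norm Lipschitz constant is the supremum of the spectral norm of its Jacobian, which one can probe at random points (a sketch; the sampling is arbitrary):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    rng = np.random.default_rng(1)
    worst = 0.0
    for _ in range(10_000):
        p = softmax(rng.normal(scale=5.0, size=rng.integers(2, 20)))
        J = np.diag(p) - np.outer(p, p)
        worst = max(worst, np.linalg.norm(J, 2))  # spectral norm (largest singular value)
    print(worst)  # empirically stays below 1, consistent with the cited bound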