Multinomial logistic loss gradient and Hessian
With the multinomial logistic loss defined as
$$ L(z; y=j) = -\log[\operatorname{softmax}(z)]_j $$ where $$[\operatorname{softmax}(z)]_j = \frac{\exp(z_j)}{\sum^K_{k=1} \exp(z_k)},$$
how can I compute the gradient and the Hessian of $L$ with respect to $z$?
Right now I have the following for the gradient of $L$ with respect to $z$:
$$ \frac{\partial L}{\partial z_i} = \begin{cases} s_i & i \neq j \\ s_j - 1 & i = j \end{cases} $$ where $s_i = [\operatorname{softmax}(z)]_i$.
And so, from my understanding, computing the Hessian will give me the following:
$$ \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases} $$
Leaving me with the identity matrix.
What I need help with is understanding whether these values are correct or not.
They don't look correct, to say the least.
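For reference, a finite-difference check along these lines (a minimal sketch, assuming NumPy, with $j$ the observed class) should settle whether the formulas hold:

```python
# Minimal finite-difference check (sketch, assuming NumPy; j is the observed class)
import numpy as np

def loss(z, j):
    # L(z; y=j) = -log softmax(z)_j, written via log-sum-exp
    return np.log(np.sum(np.exp(z))) - z[j]

z = np.array([0.5, -1.0, 2.0, 0.3])
j, eps, d = 1, 1e-4, 4
E = np.eye(d)

# central differences for the gradient and Hessian of L at z
grad_fd = np.array([(loss(z + eps*E[i], j) - loss(z - eps*E[i], j)) / (2*eps)
                    for i in range(d)])
hess_fd = np.array([[(loss(z + eps*(E[a] + E[b]), j) - loss(z + eps*(E[a] - E[b]), j)
                      - loss(z - eps*(E[a] - E[b]), j) + loss(z - eps*(E[a] + E[b]), j))
                     / (4*eps**2) for b in range(d)] for a in range(d)])

print(grad_fd)   # compare entrywise against the gradient formula above
print(hess_fd)   # the result is clearly not the identity matrix
```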
Thank you.
$ \def\c#1{\color{red}{#1}} \def\o{{\tt1}}\def\p{\partial} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\Big(#1\Big)} \def\bR#1{\big(#1\big)} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} $Applying the softmax function to the $z$-vector yields the $s$-vector (or softmax vector) $$s = \frac{e^z}{\o:e^z} \qiq S=\Diag{s}\qiq \c{ds = \LR{S-ss^T}dz}$$

Calculate the gradient of the loss function (for an unspecified $y$-vector normalized so that $\o:y=1$) $$\eqalign{ L &= -y:\log(s) \\ dL &= -y:S^{-1}ds \\ &= S^{-1}y:\c{\LR{-ds}} \\ &= S^{-1}y:\c{\LR{ss^T-S}dz} \\ &= \LR{ss^T-S}S^{-1}y:dz \\ &= \LR{s\o^T-I}y:dz \\ &= \LR{s-y}:dz \\ \grad{L}{z} &= \LR{s-y} \;\doteq\; g \qquad\bR{{\rm the\;gradient}} \\ }$$

Then calculate the Hessian as the gradient of the gradient. $$\eqalign{ dg &= \c{ds = \LR{S-ss^T}dz} \\ \grad{g}{z} &= \LR{S-ss^T} \;\doteq\; H \qquad\bR{{\rm the\;Hessian}} \\ }$$

Now you can set $y$ to whichever one-hot vector you're interested in. Note that the Hessian is independent of $y$, so it doesn't matter which one you choose.

Also note that the term one-hot vector is not used outside of the discipline of Machine Learning. Instead they're referred to as the standard (or Cartesian or canonical) basis vectors and denoted as $\{e_j\}$.
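A quick numerical confirmation of the two results above (a minimal sketch, assuming NumPy; the particular one-hot $y$ chosen below is arbitrary):

```python
# Verify g = s - y and H = Diag(s) - s s^T against central finite differences
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def loss(z, y):
    return -y @ np.log(softmax(z))   # L = -y : log(s)

rng = np.random.default_rng(0)
d = 5
z = rng.normal(size=d)
y = np.eye(d)[2]                     # any one-hot vector
s = softmax(z)

g = s - y                            # claimed gradient  (s - y)
H = np.diag(s) - np.outer(s, s)      # claimed Hessian   (S - s s^T), independent of y

eps, E = 1e-4, np.eye(d)
g_fd = np.array([(loss(z + eps*E[i], y) - loss(z - eps*E[i], y)) / (2*eps)
                 for i in range(d)])
H_fd = np.array([[(loss(z + eps*(E[a] + E[b]), y) - loss(z + eps*(E[a] - E[b]), y)
                   - loss(z - eps*(E[a] - E[b]), y) + loss(z - eps*(E[a] + E[b]), y))
                  / (4*eps**2) for b in range(d)] for a in range(d)])

print(np.allclose(g, g_fd, atol=1e-6), np.allclose(H, H_fd, atol=1e-4))   # True True
```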
NB: $\;$In many of the steps above, the matrix inner product is used $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \big\|A\big\|^2_F \\ }$$ When applied to vectors $\bR{n=\o}$, this corresponds to the ordinary dot product, i.e. $$a:b = a^Tb \;=\; \sum_{i=1}^m a_{i}b_{i}$$
The properties of the underlying trace function allow the terms in an inner product to be rearranged in many different but equivalent ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:AB &= CB^T:A = A^TC:B \\ }$$
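For readers who prefer to see these identities numerically, here is a minimal sketch (assuming NumPy) that checks the trace form of the ":" product and the rearrangement rules listed above:

```python
import numpy as np

def frob(A, B):
    return np.sum(A * B)                  # A : B  =  sum_ij A_ij B_ij

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 5)), rng.normal(size=(5, 4))
C = rng.normal(size=(3, 4))
M, N = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

print(np.isclose(frob(M, N), np.trace(M.T @ N)))        # M:N = Tr(M^T N)
print(np.isclose(frob(M, N), frob(N, M)))               # M:N = N:M
print(np.isclose(frob(M, N), frob(M.T, N.T)))           # M:N = M^T:N^T
print(np.isclose(frob(C, A @ B), frob(C @ B.T, A)))     # C:AB = CB^T:A
print(np.isclose(frob(C, A @ B), frob(B, A.T @ C)))     # C:AB = A^T C:B
```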
Your notation (probably derived from a statistics book) is horrible (as is the notation in most statistics books).
In mathematical terms: let $\pi_k:\mathbf{R}^d \to \mathbf{R}$ denote the $k$th projection $\pi_k(z) = z_k.$ You want to find the first and second derivative of $L(z) = - \log\left( \dfrac{\exp(\pi_j(z))}{\sum\limits_{k = 1}^d \exp(\pi_k(z))} \right).$

Let me first rewrite $L(z)$ as follows: $$ L(z) = \log \left( \sum\limits_{k = 1}^d \exp(\pi_k(z)) \right) - \pi_j(z). $$ Next, notice that each $\pi_k$ is a linear function, which implies that its derivative at any point is just itself: $\pi_k'(z) = \pi_k.$ Then, $$ \begin{align} L'(z) &= \log'\left( \sum\limits_{k = 1}^d \exp(\pi_k(z)) \right) \sum\limits_{k = 1}^d \exp'(\pi_k(z)) \pi_k - \pi_j \\ &=\dfrac{\sum\limits_{k = 1}^d \exp(\pi_k(z)) \pi_k}{\sum\limits_{k = 1}^d \exp(\pi_k(z))} - \pi_j. \end{align} $$

It is convenient to write this in matrix terms. Note that for any $k,$ $\pi_k(z) = z_k = c_k^\intercal z,$ where $c_k$ is the $k$th canonical vector in $\mathbf{R}^d,$ so that $\sum\limits_{k = 1}^d \exp(\pi_k(z)) = \sum\limits_{k = 1}^d e^{c_k^\intercal z}.$ Thus, the matrix representation of $L'(z)$ (relative to the canonical bases of $\mathbf{R}^d$ and $\mathbf{R}$) is $$ L'(z) = \dfrac{\sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k}{\sum\limits_{k = 1}^d e^{c_k^\intercal z}} - c_j. $$

The second derivative of $L$ is now obvious (just differentiate the previous expression with respect to $z$): $$ L''(z) = \dfrac{ \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k c_k^\intercal \right) \sum\limits_{k = 1}^d e^{c_k^\intercal z} - \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k \right) \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k^\intercal \right) }{ \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} \right)^2 } $$

For a given $z,$ write $s_k = s_k(z) = e^{c_k^\intercal z} = e^{z_k},$ and denote by $s$ the sum of the $s_k.$ Then $$ L''(z) = \dfrac{\sum\limits_{k = 1}^d ss_k c_k c_k^\intercal - \sum\limits_{1 \leq a, b \leq d} s_as_b c_ac_b^\intercal}{ s^2 } $$ Note that $c_ac_b^\intercal$ is a square matrix of order $d$ filled with zeroes except entry $(a, b),$ where you have a one. Thus, $$ [L'(z)]_i = \dfrac{s_i}{s} - \delta_{i,j} $$ and $$ [L''(z)]_{a,b} = \delta_{a,b} \dfrac{s_a}{s} - \dfrac{s_a s_b}{s^2}. $$
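A short numerical cross-check of these componentwise formulas against the matrix form $\operatorname{Diag}(p) - pp^\intercal$ from the other answer, where $p = (s_1/s, \dots, s_d/s)$ is the softmax vector (a minimal sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
d, j = 4, 2
z = rng.normal(size=d)

sk = np.exp(z)        # s_k = e^{z_k}
s = sk.sum()          # s = sum_k s_k
p = sk / s            # softmax probabilities

grad = sk / s - np.eye(d)[j]                       # [L'(z)]_i   = s_i/s - delta_{ij}
hess = np.diag(sk / s) - np.outer(sk, sk) / s**2   # [L''(z)]_ab = delta_{ab} s_a/s - s_a s_b/s^2

print(np.allclose(hess, np.diag(p) - np.outer(p, p)))   # agrees with Diag(p) - p p^T
print(np.all(np.linalg.eigvalsh(hess) >= -1e-12))       # and is positive semidefinite
# gradient sums to 0; Hessian maps the all-ones vector to 0
print(np.isclose(grad.sum(), 0.0), np.allclose(hess @ np.ones(d), 0.0))
```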