Derivative of Softmax loss function

I am trying to wrap my head around back-propagation in a neural network with a Softmax classifier, which uses the Softmax function:

\begin{equation} p_j = \frac{e^{o_j}}{\sum_k e^{o_k}} \end{equation}

This is used in a loss function of the form

\begin{equation}L = -\sum_j y_j \log p_j,\end{equation}

where $o$ is a vector. I need the derivative of $L$ with respect to $o$. Now if my derivatives are right,

\begin{equation} \frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j \end{equation}

and

\begin{equation} \frac{\partial p_j}{\partial o_i} = -p_i p_j,\quad i \neq j. \end{equation}
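As a quick sanity check of these two cases, here is a small NumPy sketch (the logit vector below is just an arbitrary example) that compares the claimed Jacobian against central finite differences of the softmax:

```python
import numpy as np

def softmax(o):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(o - np.max(o))
    return e / e.sum()

# An arbitrary logit vector, purely for illustration.
o = np.array([0.5, -1.2, 2.0, 0.3])
p = softmax(o)

# Analytic Jacobian: dp_j/do_i = p_i (1 - p_i) if i == j, and -p_i p_j otherwise.
jac_analytic = np.diag(p) - np.outer(p, p)

# Independent check via central finite differences.
eps = 1e-6
jac_numeric = np.zeros((o.size, o.size))
for i in range(o.size):
    d = np.zeros_like(o)
    d[i] = eps
    jac_numeric[:, i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)

print(np.allclose(jac_analytic, jac_numeric, atol=1e-8))  # prints True
```

So the Jacobian itself seems fine.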

Using these results we obtain

\begin{eqnarray} \frac{\partial L}{\partial o_i} &=& - \left (y_i (1 - p_i) + \sum_{k\neq i}-p_k y_k \right )\\ &=&p_i y_i - y_i + \sum_{k\neq i} p_k y_k\\ &=& \left (\sum_i p_i y_i \right ) - y_i \end{eqnarray}

According to the slides I'm using, however, the result should be

\begin{equation} \frac{\partial L}{\partial o_i} = p_i - y_i. \end{equation}

Can someone please tell me where I'm going wrong?


Your derivatives $\large \frac{\partial p_j}{\partial o_i}$ are indeed correct; however, there is an error in the step where you differentiate the loss function $L$ with respect to $o_i$.

We have the following (where I have highlighted in $\color{red}{\text{red}}$ where you have gone wrong):

\begin{eqnarray} \frac{\partial L}{\partial o_i} &=& -\sum_k y_k \frac{\partial \log p_k}{\partial o_i} = -\sum_k y_k \frac{1}{p_k}\frac{\partial p_k}{\partial o_i}\\ &=& -y_i(1 - p_i) - \sum_{k\neq i} y_k \frac{1}{p_k}\left({\color{red}{-p_k p_i}}\right)\\ &=& -y_i(1 - p_i) + \sum_{k\neq i} y_k \left({\color{red}{p_i}}\right)\\ &=& -y_i + \color{blue}{y_i p_i + \sum_{k\neq i} y_k p_i}\\ &=& \color{blue}{p_i \left(\sum_k y_k\right)} - y_i = p_i - y_i, \end{eqnarray}

given that $\sum_k y_k = 1$ from the slides (as $y$ is a vector with only one non-zero element, which is $1$).
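If it helps, here is a small NumPy sketch that checks the final result numerically; the logits and the one-hot target below are arbitrary examples, and the analytic gradient $p - y$ is compared against central finite differences of $L$:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def loss(o, y):
    # Cross-entropy: L = -sum_j y_j log p_j.
    return -np.sum(y * np.log(softmax(o)))

# Arbitrary logits and a one-hot target, for illustration only.
o = np.array([0.5, -1.2, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])

# Analytic gradient from the derivation above: dL/do = p - y.
grad_analytic = softmax(o) - y

# Independent check via central finite differences.
eps = 1e-6
grad_numeric = np.array([
    (loss(o + eps * e_i, y) - loss(o - eps * e_i, y)) / (2 * eps)
    for e_i in np.eye(o.size)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-8))  # prints True
```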