Derivative of Binary Cross Entropy - why are my signs not right?
I'm trying to derive formulas used in backpropagation for a neural network that uses a binary cross entropy loss function. When I perform the differentiation, however, my signs do not come out right:
Binary cross entropy loss function: $$J(\hat y) = \frac{-1}{m}\sum_{i=1}^m \left[y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\right]$$
where
$m = $ number of training examples
$y_i = $ true label for training example $i$
$\hat y_i = $ predicted value for training example $i$
When I attempt to differentiate this for one training example, I do the following process:
Constant multiple rule: $$ \frac{dJ}{d\hat y_i} = -1\left(\frac{d}{d\hat y_i}\big(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big)\right) $$
Sum rule: $$ = -1\left(\frac{d}{d\hat y_i}y_i\log(\hat y_i)+\frac{d}{d\hat y_i}(1-y_i)\log(1-\hat y_i)\right) $$
Product rule, derivative of a constant (treating $y_i$ as a constant), and derivative of the natural log: $$ = -1(\frac{y_i}{\hat y_i} + \frac{1-y_i}{1 - \hat y_i})$$
However, this is different from the expected result: $$ \frac{dJ}{d\hat y_i} = -1(\frac{y_i}{\hat y_i} - \frac{1-y_i}{1 - \hat y_i}) $$
Not sure what's going wrong. I'm sure I'm doing something incorrectly, but I can't figure out what it is. Any help is appreciated!
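For reference, a quick finite-difference check on a single example (a minimal NumPy sketch; the values of $y$ and $\hat y$ below are arbitrary, with $y=0$ so the second term matters) agrees with the expected formula and not with mine:

```python
import numpy as np

# Finite-difference check of dJ/dy_hat for one training example
# (arbitrary illustrative values).
y, y_hat, eps = 0.0, 0.7, 1e-6

def J(y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

numeric  = (J(y_hat + eps) - J(y_hat - eps)) / (2 * eps)
expected = -(y / y_hat - (1 - y) / (1 - y_hat))   # expected formula: about +3.333
mine     = -(y / y_hat + (1 - y) / (1 - y_hat))   # my formula: about -3.333

print(numeric, expected, mine)
```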
Solution 1:
$$\mathbf{h} = \mathbf{w}^T \mathbf{X} $$
$$\mbox{Logistic regression: }\mathbf{z} = \sigma(\mathbf{h}) = \frac{1}{1 + e^{-\mathbf{h}}}$$
$$\mbox{Cross-entropy loss: } J(\mathbf{w}) = -(\mathbf{y} \log(\mathbf{z}) + (1 - \mathbf{y})\log(1 - \mathbf{z})) $$ $$ \mbox{Use chain rule: } \frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{w}}} = \frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{z}}} \frac{\partial{\mathbf{z}}}{\partial{\mathbf{h}}} \frac{\partial{\mathbf{h}}}{\partial{\mathbf{w}}}$$
$$\frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{z}}} = -(\frac{\mathbf{y}}{\mathbf{z}} - \frac{1-\mathbf{y}}{1-\mathbf{z}}) = \frac{\mathbf{z} - \mathbf{y}}{\mathbf{z}(1-\mathbf{z})}$$
$$\frac{\partial{\mathbf{z}}}{\partial{\mathbf{h}}} = \mathbf{z}(1-\mathbf{z}) $$
$$\frac{\partial{\mathbf{h}}}{\partial{\mathbf{w}}} = \mathbf{X} $$
Multiplying the three factors, the $\mathbf{z}(1-\mathbf{z})$ terms cancel: $$\frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{w}}} = \mathbf{X}^T (\mathbf{z}-\mathbf{y})$$
$$\mbox{Gradient descent: } \mathbf{w} = \mathbf{w} - \alpha \frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{w}}} $$
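A minimal NumPy sketch of this update (the data, learning rate, and the added $1/m$ averaging are my own assumptions for illustration; $\mathbf{X}$ is taken with one example per row, so the gradient is $\mathbf{X}^T(\mathbf{z}-\mathbf{y})$):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def gradient_step(w, X, y, alpha=0.1):
    """One gradient-descent step for logistic regression with BCE loss.

    Assumes X has shape (m, n) with one example per row, y has shape (m,),
    and w has shape (n,). The 1/m factor averages the gradient over examples.
    """
    z = sigmoid(X @ w)                 # predictions
    grad = X.T @ (z - y) / len(y)      # dJ/dw = X^T (z - y), averaged
    return w - alpha * grad

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = np.zeros(3)
for _ in range(500):
    w = gradient_step(w, X, y)
print(w)
```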
Solution 2:
Let's denote the inner/Frobenius product by $a:b= a^Tb$
and the elementwise/Hadamard product by $a\odot b$
and elementwise/Hadamard division by $\frac{a}{b}$
and note that the $\log$ function is to be applied elementwise.
For convenience, let's use a modified loss function $$L=-mJ$$ Then the differential and gradient of $L$ can be calculated as $$\eqalign{ L &= y:\log({\hat y}) + (1-y):\log(1-{\hat y}) \cr \cr dL &= y:d\log({\hat y}) + (1-y):d\log(1-{\hat y}) \cr &= \frac{y}{{\hat y}}:d{\hat y} + \frac{1-y}{1-{\hat y}}:d(1-{\hat y}) \cr &= \Big(\frac{y}{{\hat y}} - \frac{1-y}{1-{\hat y}}\Big):d{\hat y} \cr &= \Big(\frac{y-{\hat y}}{{\hat y}-{\hat y}\odot{\hat y}}\Big):d{\hat y} \cr \cr \frac{\partial L}{\partial{\hat y}} &= \frac{y-{\hat y}}{{\hat y}-{\hat y}\odot{\hat y}} \cr \cr }$$ And the gradient of the original cost function is $$\eqalign{ \frac{\partial J}{\partial{\hat y}} &= -\frac{1}{m}\frac{\partial L}{\partial{\hat y}} = \frac{{\hat y}-y}{m\,({\hat y}-{\hat y}\odot{\hat y})} \cr }$$
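A quick numerical check of that last formula (a NumPy sketch; the values of $m$, $y$, and $\hat y$ are arbitrary):

```python
import numpy as np

# Check dJ/dy_hat = (y_hat - y) / (m * (y_hat - y_hat * y_hat)) against
# a finite-difference gradient, using arbitrary values.
rng = np.random.default_rng(1)
m = 5
y = rng.integers(0, 2, size=m).astype(float)
y_hat = rng.uniform(0.05, 0.95, size=m)

def J(y_hat):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

analytic = (y_hat - y) / (m * (y_hat - y_hat * y_hat))

eps = 1e-6
numeric = np.empty(m)
for i in range(m):
    d = np.zeros(m)
    d[i] = eps
    numeric[i] = (J(y_hat + d) - J(y_hat - d)) / (2 * eps)

print(np.allclose(analytic, numeric))  # True
```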
Solution 3:
Your answer is almost correct except for the second term. When taking the derivative of $(1-y_i)\log(1-\hat y_i)$ with respect to $\hat y_i$, the chain rule gives $$\frac{d}{d\hat y_i}(1-y_i)\log(1-\hat y_i) = (1-y_i)\left(\frac {1} {1-\hat y_i}\right) \frac {d(1-\hat y_i)}{d\hat y_i} = (1-y_i)\left(\frac {1} {1-\hat y_i}\right)(-1) = - \frac {1-y_i} {1-\hat y_i}$$
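If it helps, a quick symbolic check of that step (a SymPy sketch; the symbol names are mine):

```python
import sympy as sp

# Symbolic check: d/dy_hat [(1 - y) * log(1 - y_hat)] should be -(1 - y)/(1 - y_hat)
y, y_hat = sp.symbols('y y_hat')
term = (1 - y) * sp.log(1 - y_hat)
print(sp.diff(term, y_hat))   # prints an expression equivalent to -(1 - y)/(1 - y_hat)
```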