Derivative of Binary Cross Entropy - why are my signs not right?

I'm trying to derive formulas used in backpropagation for a neural network that uses a binary cross entropy loss function. When I perform the differentiation, however, my signs do not come out right:

Binary cross entropy loss function: $$J(\hat y) = \frac{-1}{m}\sum_{i=1}^m \big(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big)$$

where

$m = $ number of training examples
$y = $ true y value
$\hat y = $ predicted y value
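
For reference, this is how I compute the loss in code (a minimal NumPy sketch of the formula above; the function name and toy arrays are just for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """J(y_hat): binary cross entropy averaged over the m training examples."""
    m = y.shape[0]
    return -(1.0 / m) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# toy example
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, y_hat))
```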

When I attempt to differentiate this for one training example, I do the following process:

Constant multiple rule (for a single training example $i$): $$ \frac{dJ}{d\hat y_i} = -1\left(\frac{d}{d\hat y_i}\big(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big)\right) $$

Sum rule: $$ = -1\left(\frac{d}{d\hat y_i}\big(y_i\log(\hat y_i)\big)+\frac{d}{d\hat y_i}\big((1-y_i)\log(1-\hat y_i)\big)\right) $$

Constant multiple rule (treating $y_i$ as a constant) and derivative of the natural log: $$ = -1\left(\frac{y_i}{\hat y_i} + \frac{1-y_i}{1 - \hat y_i}\right)$$

However, this is different from the expected result: $$ \frac{dJ}{d\hat y_i} = -1(\frac{y_i}{\hat y_i} - \frac{1-y_i}{1 - \hat y_i}) $$
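
For what it's worth, a quick central-difference check (sketch below, dropping the sum and the $1/m$ factor for a single example) agrees with the expected formula, so the problem has to be in my algebra:

```python
import numpy as np

# Per-example loss: drop the sum and the 1/m factor, keep the leading -1.
def loss_i(y_i, y_hat_i):
    return -(y_i * np.log(y_hat_i) + (1 - y_i) * np.log(1 - y_hat_i))

# The expected derivative.
def dloss_i(y_i, y_hat_i):
    return -(y_i / y_hat_i - (1 - y_i) / (1 - y_hat_i))

y_i, y_hat_i, eps = 1.0, 0.3, 1e-6
numeric = (loss_i(y_i, y_hat_i + eps) - loss_i(y_i, y_hat_i - eps)) / (2 * eps)
print(numeric, dloss_i(y_i, y_hat_i))  # both about -3.3333
```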

Not sure what's going wrong. I'm sure I'm doing something incorrectly, but I can't figure out what it is. Any help is appreciated!


Solution 1:

$$\mathbf{h} = \mathbf{X}\mathbf{w}, \qquad \mathbf{X} \in \mathbb{R}^{m \times n} \mbox{ with one training example per row}$$

$$\mbox{Logistic regression: }\mathbf{z} = \sigma(\mathbf{h}) = \frac{1}{1 + e^{-\mathbf{h}}}$$

$$\mbox{Cross-entropy loss: } J(\mathbf{w}) = -(\mathbf{y} \log(\mathbf{z}) + (1 - \mathbf{y})\log(1 - \mathbf{z})) $$

$$ \mbox{Use chain rule: } \frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{w}}} = \frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{z}}} \frac{\partial{\mathbf{z}}}{\partial{\mathbf{h}}} \frac{\partial{\mathbf{h}}}{\partial{\mathbf{w}}}$$

$$\frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{z}}} = -(\frac{\mathbf{y}}{\mathbf{z}} - \frac{1-\mathbf{y}}{1-\mathbf{z}}) = \frac{\mathbf{z} - \mathbf{y}}{\mathbf{z}(1-\mathbf{z})}$$

$$\frac{\partial{\mathbf{z}}}{\partial{\mathbf{h}}} = \mathbf{z}(1-\mathbf{z}) $$

$$\frac{\partial{\mathbf{h}}}{\partial{\mathbf{w}}} = \mathbf{X} $$

$$\frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{w}}} = \mathbf{X}^T (\mathbf{z}-\mathbf{y})$$

$$\mbox{Gradient descent: } \mathbf{w} = \mathbf{w} - \alpha \frac{\partial{J(\mathbf{w})}}{\partial{\mathbf{w}}} $$
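
A minimal NumPy sketch of these equations, assuming $\mathbf{X}$ is an $m \times n$ matrix with one training example per row (the variable names, learning rate, and toy data are illustrative):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def gradient_descent_step(w, X, y, alpha=0.01):
    """One update of w using the gradient X^T (z - y)."""
    z = sigmoid(X @ w)       # predictions z = sigma(Xw), shape (m,)
    grad = X.T @ (z - y)     # dJ/dw, shape (n,)
    return w - alpha * grad

# toy usage: recover a separating direction on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = np.zeros(3)
for _ in range(500):
    w = gradient_descent_step(w, X, y)
print(w)
```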

Solution 2:


Let's denote the inner/Frobenius product by $a:b= a^Tb$
and the elementwise/Hadamard product by $a\odot b$
and elementwise/Hadamard division by $\frac{a}{b}$
and note that the $\log$ function is to be applied elementwise.

For convenience, let's use a modified loss function $$L=-mJ$$ Then the differential and gradient of $L$ can be calculated as $$\eqalign{ L &= y:\log({\hat y}) + (1-y):\log(1-{\hat y}) \cr \cr dL &= y:d\log({\hat y}) + (1-y):d\log(1-{\hat y}) \cr &= \frac{y}{{\hat y}}:d{\hat y} + \frac{1-y}{1-{\hat y}}:d(1-{\hat y}) \cr &= \Big(\frac{y}{{\hat y}} - \frac{1-y}{1-{\hat y}}\Big):d{\hat y} \cr &= \Big(\frac{y-{\hat y}}{{\hat y}-{\hat y}\odot{\hat y}}\Big):d{\hat y} \cr \cr \frac{\partial L}{\partial{\hat y}} &= \frac{y-{\hat y}}{{\hat y}-{\hat y}\odot{\hat y}} \cr \cr }$$ And the gradient of the original cost function is $$\eqalign{ \frac{\partial J}{\partial{\hat y}} &= -\frac{1}{m}\frac{\partial L}{\partial{\hat y}} = \frac{{\hat y}-y}{m\,({\hat y}-{\hat y}\odot{\hat y})} \cr }$$ (The minus sign appears when $d(1-{\hat y})=-d{\hat y}$ is substituted between the second and third lines of the differential; that is exactly the chain-rule factor missing in the question.)
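
A quick elementwise sanity check of this gradient against central differences (a sketch with arbitrary toy values, assuming every $\hat y_i$ lies strictly between $0$ and $1$):

```python
import numpy as np

def J(y, y_hat):
    m = y.size
    return -(1.0 / m) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def grad_J(y, y_hat):
    m = y.size
    return (y_hat - y) / (m * (y_hat - y_hat * y_hat))  # (yhat - y) / (m (yhat - yhat*yhat))

y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.8, 0.3, 0.6, 0.1])
eps = 1e-6
numeric = np.array([
    (J(y, y_hat + eps * np.eye(4)[i]) - J(y, y_hat - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
print(np.allclose(numeric, grad_J(y, y_hat)))  # True
```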

Solution 3:

Your answer is almost correct except for the second term. When taking the derivative of $(1-y_i)\log(1-\hat y_i)$ with respect to $\hat y_i$, the chain rule contributes an extra factor of $-1$ from the inner function $1-\hat y_i$: $$\frac{d}{d\hat y_i}\Big[(1-y_i)\log(1-\hat y_i)\Big] = (1-y_i)\cdot\frac {1} {1-\hat y_i}\cdot \frac {d(1-\hat y_i)}{d\hat y_i} = (1-y_i)\cdot\frac {1} {1-\hat y_i}\cdot(-1) = - \frac {1-y_i} {1-\hat y_i}$$
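
For completeness, the full per-example derivative can also be checked symbolically (a SymPy sketch; the symbol names are mine):

```python
import sympy as sp

y, y_hat = sp.symbols('y yhat')

# Per-example binary cross entropy (the 1/m factor does not affect the sign).
J_i = -(y * sp.log(y_hat) + (1 - y) * sp.log(1 - y_hat))

dJ = sp.diff(J_i, y_hat)
expected = -(y / y_hat - (1 - y) / (1 - y_hat))
print(sp.simplify(dJ - expected))  # 0, confirming the corrected sign
```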