Where in tensorflow gradients is the sum over the elements of y made?

Solution 1:

I've answered it here, but I'm guessing it's not very useful because you can't use this knowledge to differentiate with respect to non-scalar y. The scalar-y assumption is central to the design of the reverse AD algorithm, and there isn't a single place you can modify to support non-scalar ys. Since this confusion keeps coming up, let me go into a bit more detail as to why it's non-trivial:

First of all, here's how reverse AD works -- suppose we have a function f that's a composition of component functions f_i, each of which takes a vector of length n and produces a vector of length n.

Its derivative can then be expressed as a sequence of matrix multiplications: when differentiating, function composition becomes matrix multiplication of the corresponding component-function Jacobians.
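
As a sketch of that statement in LaTeX (the number of component functions k and the Jacobian notation J_{f_i} are assumptions made here for concreteness, not notation from the original answer):

% composition of component functions
f = f_k \circ f_{k-1} \circ \cdots \circ f_1
% differentiating: the chain rule turns composition into a product of Jacobians
\frac{\partial f}{\partial x} = J_{f_k} \, J_{f_{k-1}} \cdots J_{f_1}, \qquad J_{f_i} \in \mathbb{R}^{n \times n}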

Note that this involves matrix-matrix products, which proves too expensive for neural networks. For instance, AlexNet has about 8k activations at its convnet->fc transition layer, so doing matrix multiplications where each matrix is 8k x 8k would take too long. The trick that makes it efficient is to assume that the last function in the chain produces a scalar. Then its Jacobian is a vector, and the whole thing can be rewritten in terms of vector-matrix multiplies instead of matrix-matrix multiplies.

This product can be computed efficiently by multiplying left to right, so every step is a vector-matrix multiply (1 x n times n x n) instead of an n x n matrix-matrix multiply.
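
For example, here is a minimal NumPy sketch of that ordering (the sizes and the random Jacobians are assumed purely for illustration):

import numpy as np

n = 5
J_last = np.random.randn(1, n)                  # Jacobian of the final, scalar-valued function: a 1 x n row vector
Js = [np.random.randn(n, n) for _ in range(3)]  # Jacobians of the earlier component functions, last to first

v = J_last
for J in Js:
    v = v @ J   # each step is a (1 x n) @ (n x n) vector-matrix product
grad = v        # the gradient of f, computed without a single matrix-matrix product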

You can make it even more efficient by never forming those n x n derivative matrices in the first place, and instead associating each component function with an op that computes the vector x Jacobian product implicitly. That's what TensorFlow's tf.RegisterGradient does. Here's an illustration of the "grad" associated with a component function.

[figure: the grad op associated with a component function, computing the vector-Jacobian product implicitly]
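
In TensorFlow 2 the same vector-times-Jacobian idea is also exposed through tf.custom_gradient; here is a minimal sketch (the function my_square and its values are an assumed example, not part of the original answer):

import tensorflow as tf

@tf.custom_gradient
def my_square(x):
    y = x * x
    def grad(upstream):
        # return the vector-Jacobian product upstream * dy/dx
        # without ever materializing the n x n Jacobian diag(2x)
        return upstream * 2.0 * x
    return y, grad

x = tf.constant([1.0, 2.0, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    loss = tf.reduce_sum(my_square(x))
print(tape.gradient(loss, x))   # [2. 4. 6.]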

Now, this is done for vector-valued functions; what if your functions are matrix-valued? This is a typical situation in neural networks. For instance, in a layer that does a matrix multiply, the matrix you multiply by is an unknown, and it is matrix-valued. In that case, the last derivative has rank 2 and the remaining derivatives have rank 3.

Now, to apply the chain rule you'd have to deal with extra notation, because the "x" in the chain rule now means matrix multiplication generalized to rank-3 tensors (a tensor contraction).

However, note that we never have to do this multiplication explicitly, since we are using a grad operator. In practice, this operator takes rank-2 values and produces rank-2 values.

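For the concrete case of a matrix-multiply layer Y = XW with scalar loss L, the grad op amounts to the standard vector-Jacobian products (stated here as a sketch):

% both the input dL/dY and the outputs dL/dW, dL/dX are rank-2
\frac{\partial L}{\partial W} = X^{\top} \frac{\partial L}{\partial Y},
\qquad
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \, W^{\top}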

So in all of this, there's an assumption that the final target is a scalar, which is what allows fully connected layers to be differentiated by passing matrices around.
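
A small TensorFlow sketch of that point (the shapes are assumed for illustration): because the target is a scalar, the gradient that comes back for a fully connected layer is rank-2, with the same shape as W, not a rank-3 Jacobian.

import tensorflow as tf

x = tf.random.normal([32, 100])               # batch of activations
W = tf.Variable(tf.random.normal([100, 10]))  # matrix-valued unknown
with tf.GradientTape() as tape:
    y = x @ W                                 # rank-2 output
    loss = tf.reduce_sum(y ** 2)              # scalar target
dW = tape.gradient(loss, W)
print(dW.shape)                               # (100, 10), rank-2 like W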

If you wanted to extend this to support a non-scalar y, you would need to modify the reverse AD algorithm to propagate more information. For instance, for fully connected feed-forward nets you would propagate rank-3 tensors around instead of matrices.
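
In practice -- and this is where the sum from the question's title happens -- tape.gradient treats a non-scalar y as if you had asked for the gradient of sum(y), while tape.jacobian propagates the extra rank instead. A small sketch with assumed values:

import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2                 # non-scalar y
    s = tf.reduce_sum(y)
print(tape.gradient(y, x))     # [2. 4. 6.] -- implicitly the gradient of sum(y)
print(tape.gradient(s, x))     # [2. 4. 6.] -- identical
print(tape.jacobian(y, x))     # the full 3 x 3 Jacobian, diag(2, 4, 6)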

Solution 2:

With the jacobian method of tf.GradientTape in TensorFlow 2, this is a straightforward task.

import tensorflow as tf

x = tf.random.normal([8, 4])             # example input
layer = tf.keras.layers.Dense(3)
with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = layer(x)
        loss = tf.reduce_mean(y ** 2)
    # first-order gradient w.r.t. the kernel, recorded by tape1 as well
    first_order_gradient = tape2.gradient(loss, layer.kernel)
# Hessian of the loss w.r.t. the kernel
hessian = tape1.jacobian(first_order_gradient, layer.kernel)

https://www.tensorflow.org/guide/advanced_autodiff#hessian