What is the difference between the Jacobian, Hessian and the Gradient?
I know there are a lot of topics on this on the internet, and trust me, I've googled it. But things are only getting more confusing for me.
From my understanding, the gradient points in the direction of steepest ascent, so descending along the negative gradient will most rapidly decrease your cost function (the typical goal).
Could anyone explain in simple words (and maybe with an example) what the difference is between the Jacobian, the Hessian, and the gradient?
A good resource on this is any introductory vector calculus text. I'll try to be as consistent as I can with Stewart's Calculus, perhaps the most popular calculus textbook in North America.
The Gradient
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a scalar field. The gradient, $\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n$, is a vector such that $(\nabla f)_j = \partial f/ \partial x_j$. Because every point in $\text{dom}(f)$ is mapped to a vector, $\nabla f$ is itself a vector field.
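If you want to see this concretely, here is a minimal SymPy sketch (the scalar field here is my own illustrative choice, not anything special):

```python
import sympy as sp

# an illustrative scalar field f: R^2 -> R
x1, x2 = sp.symbols('x1 x2')
f = x1**2 * x2 + sp.sin(x2)

# the gradient is the vector of first-order partials, one per input variable
grad_f = [sp.diff(f, v) for v in (x1, x2)]
print(grad_f)  # [2*x1*x2, x1**2 + cos(x2)]
```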
The Jacobian
Let $\operatorname{F}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ be a vector field. The Jacobian can be considered the derivative of a vector field. Treating each component of $\operatorname{F}$ as a scalar field (like $f$ above), the Jacobian is the matrix whose $i^{\text{th}}$ row is the gradient of the $i^{\text{th}}$ component of $\operatorname{F}$. If $\mathbf{J}$ is the Jacobian, then
$$\mathbf{J}_{i,j} = \dfrac{\partial \operatorname{F}_i}{\partial x_j}$$
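As a quick sanity check, a sketch with a made-up vector field, using SymPy's `Matrix.jacobian`:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# an illustrative vector field F: R^2 -> R^2;
# row i of the Jacobian is the gradient of component i of F
F = sp.Matrix([x1 + x2**2, x1 * x2])
J = F.jacobian([x1, x2])
print(J)  # Matrix([[1, 2*x2], [x2, x1]])
```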
The Hessian
Simply, the Hessian is the matrix of second-order partial derivatives of a scalar field.
$$\mathbf{H}_{i, j}=\frac{\partial^{2} f}{\partial x_{i} \partial x_{j}}$$
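SymPy also has a `hessian` helper that builds exactly this matrix; here it is on the same illustrative scalar field as above:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**2 * x2 + sp.sin(x2)  # same illustrative scalar field as before

# matrix of second-order partials of f
H = sp.hessian(f, (x1, x2))
print(H)  # Matrix([[2*x2, 2*x1], [2*x1, -sin(x2)]])
```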
In summary:
Gradient: Vector of first order derivatives of a scalar field
Jacobian: Matrix of gradients for components of a vector field
Hessian: Matrix of second-order partial derivatives of a scalar field.
Example
The squared error loss $f(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1x_i)^2$ is a scalar field: it maps every pair of coefficients to a loss value.
The gradient of this scalar field is $$\nabla f = \left< -2 \sum_i( y_i - \beta_0 - \beta_1x_i), -2\sum_i x_i(y_i - \beta_0 - \beta_1x_i) \right>$$
Now, each component of $\nabla f$ is itself a scalar field. Take the gradient of each, stack them as the rows of a matrix, and you've got yourself the Jacobian of $\nabla f$:
$$ \left[\begin{array}{cc} \sum_{i=1}^{n} 2 & \sum_{i=1}^{n} 2 x_{i} \\ \sum_{i=1}^{n} 2 x_{i} & \sum_{i=1}^{n} 2 x_{i}^{2} \end{array}\right]$$
The Hessian of $f$ is the same as the Jacobian of $\nabla f$, so the matrix above is also the Hessian of $f$. It would behoove you to prove this to yourself.
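If you'd rather check the algebra than prove it, here is a small SymPy sketch on a toy dataset (the data values are made up purely for illustration):

```python
import sympy as sp

# toy data, purely illustrative
xs = [1, 2, 3]
ys = [2, 3, 5]

b0, b1 = sp.symbols('beta0 beta1')
loss = sum((y - b0 - b1 * x)**2 for x, y in zip(xs, ys))

grad = sp.Matrix([sp.diff(loss, b0), sp.diff(loss, b1)])
hess = sp.hessian(loss, (b0, b1))

# the Jacobian of the gradient equals the Hessian of the loss
print(sp.simplify(grad.jacobian([b0, b1]) - hess))  # zero matrix
print(hess)  # [[2n, 2*sum(x)], [2*sum(x), 2*sum(x^2)]] = [[6, 12], [12, 28]]
```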
Resources: Calculus: Early Transcendentals by James Stewart (earlier editions work too), as well as Wikipedia, which is surprisingly good on these topics.
If you have a function that maps a 1D number to a 1D number, then you can take the derivative of it:
$f(x) = x^2, f'(x) = 2x$
If you have a function that maps an ND vector to a 1D number, then you take the gradient of it:
$f(x) = x^Tx, \nabla f(x) = 2x, x = (x_1, x_2, \ldots, x_N)$
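A quick numeric check of that gradient (a sketch using NumPy and central finite differences, nothing from the answer itself):

```python
import numpy as np

# f: R^N -> R, f(x) = x^T x; the claim is that its gradient is 2x
f = lambda x: x @ x

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))  # approx [ 2. -4.  6.], i.e. 2*x
```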
If you have a function that maps an ND vector to an ND vector, then you take the Jacobian of it:
$f(x_1, x_2) = \begin{bmatrix} x_1x_2^2 \\ x_1^2x_2\end{bmatrix}, \quad J_f(x_1, x_2) = \begin{bmatrix} x_2^2 & 2x_1x_2 \\ 2x_1x_2 & x_1^2\end{bmatrix}$
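You can verify that Jacobian with SymPy, for instance:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
F = sp.Matrix([x1 * x2**2, x1**2 * x2])  # the same vector field as above

# row i = gradient of component i
print(F.jacobian([x1, x2]))  # Matrix([[x2**2, 2*x1*x2], [2*x1*x2, x1**2]])
```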
The Hessian is the Jacobian of the gradient of a function that maps from ND to 1D.
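For example (a SymPy sketch with an illustrative scalar field of my own choosing):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**2 * x2  # illustrative scalar field, R^2 -> R

grad = sp.Matrix([sp.diff(f, x1), sp.diff(f, x2)])  # gradient: R^2 -> R^2
print(grad.jacobian([x1, x2]))  # Jacobian of the gradient...
print(sp.hessian(f, (x1, x2)))  # ...matches the Hessian
# both print Matrix([[2*x2, 2*x1], [2*x1, 0]])
```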
So the gradient, Jacobian, and Hessian are different operations for different functions. You literally cannot take the gradient of an ND $\to$ ND function. That's the difference.