Why is the approximation Hessian $= J^TJ$ reasonable?

I see this equation frequently in Gauss-Newton optimization, but I don't understand why the left and right sides of the equation can be equal.

Let's say the Jacobian is $2$ by $2$

and the Hessian is $$\begin{bmatrix}\frac{\partial^2f_1}{\partial x_1^2 } & \frac{\partial^2f_1}{\partial x_2^2 } \\ \frac{\partial^2f_2}{\partial x_1^2 } & \frac{\partial^2f_2}{\partial x_2^2 }\end{bmatrix}$$

But the right-hand side of the equation is $$J^TJ =\begin{bmatrix} \left( \frac{\partial f_1}{\partial x_1} \right)^2 + \left( \frac{\partial f_2}{\partial x_1} \right)^2 & \frac{\partial f_1}{\partial x_1}\frac{\partial f_1}{\partial x_2} +\frac{\partial f_2}{\partial x_1}\frac{\partial f_2}{\partial x_2}\\ \frac{\partial f_1}{\partial x_1}\frac{\partial f_1}{\partial x_2} +\frac{\partial f_2}{\partial x_1}\frac{\partial f_2}{\partial x_2} &\left( \frac{\partial f_1}{\partial x_2} \right)^2 + \left( \frac{\partial f_2}{\partial x_2} \right)^2\end{bmatrix} $$

Why can these two be equal as presented in papers?


The quadratic model based on the true Hessian comes from truncating a Taylor series of the objective function as a whole, whereas the quadratic model based on the Gauss-Newton Hessian comes from truncating a Taylor series of the residual.

Start with the optimization problem $$\min_x \frac{1}{2} \|y-f(x)\|^2,$$ and take a Taylor series of $f$ about $x_0$: $$f(x)=f(x_0)+J(x-x_0)+\text{higher order terms},$$ where $J$ is the Jacobian of $f$ evaluated at $x_0$. The approximate optimization problem formed by truncating the Taylor series, $$\min_x \frac{1}{2} \|y-f(x_0)-J(x-x_0)\|^2,$$ has Hessian $J^TJ$.
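To see where $J^TJ$ comes from, one can differentiate the truncated objective twice. The shorthand $r_0 = y - f(x_0)$, $d = x - x_0$ and the name $\phi$ are introduced here only for this derivation:

$$\phi(d) = \tfrac{1}{2}\|r_0 - Jd\|^2 = \tfrac{1}{2} r_0^T r_0 - r_0^T J d + \tfrac{1}{2} d^T J^T J d, \qquad \nabla_d \phi = -J^T r_0 + J^T J d, \qquad \nabla_d^2 \phi = J^T J.$$

Setting the gradient to zero gives the normal equations $J^TJ\,d = J^T r_0$, which define the Gauss-Newton step.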

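In general this is not exactly equal to the true Hessian, because the true Hessian also contains the second derivatives of $f$ weighted by the residual: $$\nabla^2\left(\tfrac{1}{2}\|y-f(x)\|^2\right)\Big|_{x_0} = J^TJ - \sum_i \bigl(y_i - f_i(x_0)\bigr)\,\nabla^2 f_i(x_0).$$ The two are equal when $y=f(x_0)$, since the residual then vanishes, and they are close whenever the residual is small or $f$ is nearly linear.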
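As a rough numerical check, the sketch below compares $J^TJ$ with the exact Hessian of $\frac{1}{2}\|y-f(x)\|^2$, using the standard identity that the exact Hessian is $J^TJ - \sum_i (y_i - f_i)\nabla^2 f_i$. The model `f`, the point `x0`, and all function names are made up for illustration; they do not come from any particular paper or library.

```python
import math

def f(x):
    """Hypothetical model f: R^2 -> R^2, chosen only for illustration."""
    x1, x2 = x
    return [x1 * x1 + x2, math.sin(x1) * x2]

def jacobian(x):
    """Analytic Jacobian of f; row k holds the partials of f_k."""
    x1, x2 = x
    return [[2.0 * x1, 1.0],
            [math.cos(x1) * x2, math.sin(x1)]]

def component_hessians(x):
    """Analytic 2x2 Hessians of f_1 and f_2."""
    x1, x2 = x
    h1 = [[2.0, 0.0], [0.0, 0.0]]
    h2 = [[-math.sin(x1) * x2, math.cos(x1)],
          [math.cos(x1), 0.0]]
    return [h1, h2]

def gauss_newton_hessian(x):
    """J^T J, the Hessian of the linearized least-squares problem."""
    J = jacobian(x)
    return [[sum(J[k][i] * J[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def true_hessian(x, y):
    """Exact Hessian of 0.5*||y - f(x)||^2: J^T J - sum_k r_k * Hess(f_k)."""
    r = [yk - fk for yk, fk in zip(y, f(x))]
    G = gauss_newton_hessian(x)
    Hs = component_hessians(x)
    return [[G[i][j] - sum(r[k] * Hs[k][i][j] for k in range(2))
             for j in range(2)] for i in range(2)]

def max_abs_diff(A, B):
    """Largest entrywise difference between two 2x2 matrices."""
    return max(abs(A[i][j] - B[i][j]) for i in range(2) for j in range(2))

x0 = [0.7, -1.3]                              # arbitrary expansion point
y_exact = f(x0)                               # zero residual at x0
y_off = [y_exact[0] + 0.5, y_exact[1] - 0.5]  # nonzero residual at x0

gap_zero_residual = max_abs_diff(true_hessian(x0, y_exact),
                                 gauss_newton_hessian(x0))
gap_nonzero_residual = max_abs_diff(true_hessian(x0, y_off),
                                    gauss_newton_hessian(x0))
print("gap with y = f(x0): ", gap_zero_residual)
print("gap with y != f(x0):", gap_nonzero_residual)
```

With the data placed exactly on the model the two matrices coincide; once a residual is introduced, the gap grows in proportion to it.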


Two simple examples where we can compare (and plot) $H = f''(x)$ and $J^2 = f'(x)^2$ are lines and quadratics in 1d:

Lines: $f(x) = a x, \quad J(x) = a, \quad J^2(x) = a^2, \quad H(x) = 0$

Quadratics: $f(x) = a x^2 /2, \quad J(x) = ax, \quad J^2(x) = a^2 x^2, \quad H(x) = a$

So for lines, $J^2$ is close to $H$ iff $a$ is small; for quadratics, $J^2 \approx H$ iff $a x^2 \approx 1$, i.e. $f(x) \approx 1/2$.
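The 1d comparison can be tabulated with a few lines of Python. This is only a sketch: the helper names `line_terms` and `quadratic_terms` and the sample values of `a` and `x` are arbitrary choices made here.

```python
import math

def line_terms(a):
    """For f(x) = a*x, return (J^2, H) = (a^2, 0)."""
    return a * a, 0.0

def quadratic_terms(a, x):
    """For f(x) = a*x^2/2, return (J^2, H) = (a^2 * x^2, a)."""
    return (a * x) ** 2, a

a = 2.0
x_match = 1.0 / math.sqrt(a)  # chosen so that a * x^2 = 1

# Lines: J^2 and H differ by a^2 regardless of x.
j2_line, h_line = line_terms(a)
print(f"line:      J^2={j2_line:.3f}  H={h_line:.3f}")

# Quadratics: J^2 matches H only near a * x^2 = 1.
for x in (0.1, x_match, 3.0):
    j2, h = quadratic_terms(a, x)
    print(f"quadratic: x={x:.3f}  J^2={j2:.3f}  H={h:.3f}  f(x)={a * x * x / 2:.3f}")
```

Away from `x_match`, $J^2$ either underestimates (small $|x|$) or overestimates (large $|x|$) the curvature $H = a$.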

In $n$ dimensions, a quadratic can be rotated (by diagonalizing $H$) to a sum of independent components $a_i (x_i - b_i)^2$. Some of these may have $H_i \approx J_i^2$, some not.