Why solving $Ax = b$ is equivalent to minimizing $\frac{1}{2}x^TAx - b^Tx$ over $x$

Solution 1:

We can write the function $f(x) = \frac 12 x^TAx-b^Tx$ as follows: $$f(x) = \frac 12 \sum_{i=1}^n \sum_{j=1}^n a_{ij}x_ix_j - \sum_{k=1}^n b_kx_k$$ which leads to $$\frac{\partial}{\partial x_l}f = \frac 12 \bigg ( \sum_{j=1}^n a_{lj}x_j + \sum_{i=1}^n a_{il}x_i \bigg ) - b_l$$ and therefore the gradient is $$\nabla f= \frac 12 (Ax + A^Tx) - b$$ So if $A=A^T$ (i.e. $A$ is symmetric), we have $\nabla f = Ax-b$, and at any minimum of $f$ the gradient must vanish, $\nabla f=0$. So when $Ax=b$, we obtain $\nabla f=0$. We still need an argument for why this critical point is truly a minimum and not a maximum or a saddle point. For that we can look at the second derivatives:

$$\frac{\partial^2 f}{\partial x_m \partial x_l} = \frac 12 (a_{lm} + a_{ml})$$ which leads to $H_f = \frac 12 (A+A^T)$. If $A$ is symmetric, this means $H_f=A$. Now, the critical point is a minimum if the Hessian $H_f$ is positive definite; and since $f$ is quadratic, a positive definite Hessian makes $f$ strictly convex, so the critical point is in fact the unique global minimum.

In conclusion: ONLY if $A$ is symmetric and positive definite is finding a minimum of $f(x) = \frac 12 x^TAx-b^Tx$ equivalent to solving $Ax=b$. In that case the Hessian is $H_f=A$, so the unique critical point is the global minimum, and the gradient $\nabla f$ vanishes iff $Ax=b$.
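
As a quick numerical sanity check, here is a minimal sketch in NumPy/SciPy (the symmetric positive definite matrix below is just a randomly generated example): a general-purpose minimizer applied to $f$ recovers the same $x$ as solving $Ax = b$ directly.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 5

# Build a random symmetric positive definite matrix: M^T M + I is SPD.
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)
b = rng.standard_normal(n)

def f(x):
    return 0.5 * x @ A @ x - b @ x

x_solve = np.linalg.solve(A, b)        # direct solution of Ax = b
x_min = minimize(f, x0=np.zeros(n)).x  # numerical minimizer of f

print(np.allclose(x_solve, x_min, atol=1e-5))  # True, up to solver tolerance
```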

Solution 2:

For the gradient of $\frac{1}{2}x^TAx-b^Tx$, the latter part $-b^Tx$ is trivial to deal with (its gradient is $-b$), and for the former part $\frac{1}{2}x^TAx$, differentiating it as a product of two functions, it is not difficult to see that the result is $$\frac{1}{2}[Ax+(x^TA)^T]=\frac{1}{2}(A+A^T)x.$$

So, is there an additional assumption that $A^T=A$? If so, the gradient reduces to $Ax-b$, and we derive what we want.
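
To see that the factor $\frac{1}{2}(A+A^T)$ really does appear when $A$ is not symmetric, here is a small finite-difference check (a sketch in NumPy; the matrix and vectors are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))  # deliberately NOT symmetric
b = rng.standard_normal(n)
x = rng.standard_normal(n)

def f(y):
    return 0.5 * y @ A @ y - b @ y

# Central finite-difference approximation of the gradient at x.
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

grad_formula = 0.5 * (A + A.T) @ x - b
print(np.allclose(grad_fd, grad_formula, atol=1e-6))  # True
```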

Solution 3:

Here's a proof that doesn't use calculus. I'll assume $A \in \mathbb R^{n \times n}$ is symmetric positive definite. And I'll make use of the "$A$-norm" $\| \cdot \|_A$ on $\mathbb R^n$ defined by $\| y \|_A = \sqrt{y^T A y}$.

Suppose that $x^*$ is the unique solution to $Ax = b$. Then \begin{align} \|x - x^*\|_A^2 &= (x - x^*)^T A (x - x^*) \\ &= x^T A x - 2x^T A x^* + {x^*}^T A x^* \\ &= x^T A x - 2x^T b + \underbrace{{x^*}^T A x^*}_{\substack{\text{Does not}\\ \text{depend on $x$}}}. \end{align} The quantity $\| x - x^* \|_A^2$ is minimized when $x = x^*$, of course. So we can see that the function $$ f(x) = x^T Ax - 2 x^T b $$ is also minimized when $x = x^*$. (Note that this $f$ is just twice the original objective $\frac 12 x^TAx - b^Tx$, so both functions have the same minimizer.)
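
The algebraic identity above is easy to confirm numerically; the following sketch (again with a randomly generated symmetric positive definite matrix) checks it at an arbitrary point:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)         # symmetric positive definite
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)  # the unique solution of Ax = b
x = rng.standard_normal(n)      # an arbitrary test point

lhs = (x - x_star) @ A @ (x - x_star)  # ||x - x*||_A^2
rhs = x @ A @ x - 2 * x @ b + x_star @ A @ x_star
print(np.isclose(lhs, rhs))  # True
```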


Alternatively, if we want to use calculus, here's a way to compute the gradient of the function $f(x) = x^T A x - 2 x^T b$. First let's discover a product rule for the derivative of the function $$ F(x) = \langle g(x), h(x) \rangle $$ where $g: \mathbb R^n \to \mathbb R^m$ and $h: \mathbb R^n \to \mathbb R^m$ are differentiable functions. Notice that \begin{align} F(x + \Delta x) &= \langle g(x + \Delta x), h(x + \Delta x) \rangle \\ &\approx \langle g(x) + g'(x) \Delta x, h(x) + h'(x) \Delta x \rangle \\ &= \langle g(x),h(x) \rangle + \langle h(x), g'(x) \Delta x \rangle + \langle g(x), h'(x) \Delta x \rangle + \underbrace{\langle g'(x) \Delta x, h'(x) \Delta x \rangle}_{\text{negligible}} \\ &\approx \underbrace{\langle g(x), h(x) \rangle}_{F(x)} + \left(h(x)^T g'(x) + g(x)^T h'(x)\right)\Delta x. \end{align} Comparing this with the approximation $$ F(x + \Delta x) \approx F(x) + F'(x) \Delta x $$ reveals that $$ \tag{1} F'(x) = h(x)^T g'(x) + g(x)^T h'(x). $$ This is our product rule.
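
If it helps, the product rule (1) can be checked numerically for a concrete pair of maps; below is a sketch that takes $g(x) = Bx$ and $h(x) = Cx$ (so $g'(x) = B$ and $h'(x) = C$), with $B$ and $C$ random matrices chosen purely as examples:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 3
B = rng.standard_normal((m, n))  # g(x) = Bx, so g'(x) = B
C = rng.standard_normal((m, n))  # h(x) = Cx, so h'(x) = C
x = rng.standard_normal(n)

def F(y):
    return (B @ y) @ (C @ y)     # F(y) = <g(y), h(y)>

# Central finite-difference approximation of the row vector F'(x).
eps = 1e-6
F_prime_fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
                       for e in np.eye(n)])

# Product rule (1): F'(x) = h(x)^T g'(x) + g(x)^T h'(x).
F_prime = (C @ x) @ B + (B @ x) @ C
print(np.allclose(F_prime_fd, F_prime, atol=1e-6))  # True
```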

Now let's take $g(x) = x$ and $h(x) = Ax$, so that $F(x) = x^T A x$. Note that $g'(x) = I$ (the identity matrix) and $h'(x) = A$. Our product rule (1) tells us that $$ F'(x) = (Ax)^T I + x^T A = x^T (A^T + A). $$ The gradient of $F$ is $$ \nabla F(x) = F'(x)^T = (A + A^T)x. $$ If we assume that $A^T = A$, then $\nabla F(x) = 2Ax$.

It follows that $$ \nabla f(x) = 2Ax - 2b. $$ So, setting $\nabla f(x)$ equal to $0$, we obtain $$ 2Ax - 2b = 0 \quad \text{or equivalently} \quad Ax = b. $$
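
This equivalence is exactly what lets iterative minimization methods solve linear systems. As a closing illustration, here is a minimal steepest-descent sketch (with a random symmetric positive definite example matrix, and the standard exact line-search step for a quadratic) that drives $\nabla f(x) = 2(Ax - b)$ to zero and thereby solves $Ax = b$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)  # symmetric positive definite
b = rng.standard_normal(n)

# Steepest descent on f(x) = x^T A x - 2 x^T b.
# The gradient is 2(Ax - b), so the descent direction is the residual
# r = b - Ax, and minimizing f along that direction gives the exact
# line-search step alpha = (r^T r) / (r^T A r).
x = np.zeros(n)
for _ in range(1000):
    r = b - A @ x
    if np.linalg.norm(r) < 1e-10:
        break
    alpha = (r @ r) / (r @ A @ r)
    x = x + alpha * r

print(np.allclose(x, np.linalg.solve(A, b)))  # True
```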