Pseudo Inverse Solution for Linear Equation System Using the SVD
I read about SVD theory and its use for solving Linear Equation Systems. I saw many papers mentioning a property of the solution, yet no proof of it.
The Property: The solution given by the Pseudo Inverse ( $ V {\Sigma}^{-1} {U}^{H} $ ) minimizes both the error norm and the solution norm.
The minimization of the error norm is easily proved using the Normal Equations ( $ \hat{x} $ is the Least Squares solution iff $ {A}^{H} A \hat{x} = {A}^{H} b $ ).
Yet beyond the intuition that $ \hat{x} $ must lie in the Row Space of $A$, and hence its norm is minimized, I couldn't find a formal proof of that.
Moreover, let's define $ A = U \Sigma {V}^{H} $; then when we calculate its pseudo inverse we handle $ \Sigma $ with extra care, inverting only its non-zero entries. What is the formal reasoning for that?
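A quick numerical check (not a proof) of the claimed property, using NumPy; the rank-deficient matrix $A$ and the vector $b$ below are arbitrary random examples chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # rank 2, so the minimizer is not unique
b = rng.standard_normal(5)

x_hat = np.linalg.pinv(A) @ b            # the pseudo inverse solution V Sigma^+ U^H b

# Error norm: x_hat satisfies the normal equations A^H A x = A^H b.
print(np.allclose(A.conj().T @ A @ x_hat, A.conj().T @ b))

# Solution norm: adding a null-space direction keeps the error but lengthens the solution.
_, s, Vh = np.linalg.svd(A)
z = Vh[-1]                               # a unit vector in the null space of A (sigma_4 ~ 0)
print(np.isclose(np.linalg.norm(A @ (x_hat + z) - b), np.linalg.norm(A @ x_hat - b)))
print(np.linalg.norm(x_hat) < np.linalg.norm(x_hat + z))
```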
Thanks!
As Jack pointed out in a comment, the question is somewhat unclear. I interpret it as follows:
You have a system $Ax=b$ of linear equations. You understand why $\hat x=A^+b$, where $A^+$ is the (Moore–Penrose) pseudo-inverse of $A$, minimizes the error norm $|Ax-b|$, but you want to know why, if $Ax=b$ has solutions, $\hat x$ is the unique solution that minimizes the solution norm $|x|$.
You state that $\hat x$ must lie in the row space of $A$. That's true in the real case; more generally, in the complex case, $\hat x$ must lie in the column space of $A^*$, or equivalently in the orthogonal complement of the null space of $A$. This can be seen in two different ways:
Any solution $x$ of $Ax=b$ can be written as $x=u+v$, where $u$ is in the null space of $A$ and $v$ in its orthogonal complement. Then $|x|^2=|u|^2+|v|^2$. Since $u$ is in the null space of $A$, $v$ also solves $Ax=b$. Thus, the solution with minimal norm must have $u=0$, and must therefore lie in the orthogonal complement of the null space.
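As a numerical sketch of this decomposition (not part of the proof): with NumPy, the orthogonal projector onto the null space of $A$ can be taken as $I-A^+A$, and splitting an exact solution through it reproduces the norm identity above. The consistent random system below is an assumed example.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 6))              # wide matrix, nontrivial null space
x = rng.standard_normal(6)
b = A @ x                                    # b chosen so that A x = b is consistent

P_null = np.eye(6) - np.linalg.pinv(A) @ A   # orthogonal projector onto the null space of A
u = P_null @ x                               # component in the null space
v = x - u                                    # component in its orthogonal complement

print(np.allclose(A @ v, b))                 # v also solves A x = b
print(np.isclose(np.linalg.norm(x) ** 2,
                 np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2))  # |x|^2 = |u|^2 + |v|^2
```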
Alternatively, this can be derived using constrained minimization. To minimize $|x|^2$ under the system of constraints $Ax=b$, we introduce a vector $\lambda$ of Lagrange multipliers and consider the function $f(x)=x^*x-(Ax-b)^*\lambda$. Its gradient is $2x-A^*\lambda$, and setting this to zero yields $\hat x=A^*\lambda/2$. Thus $\hat x$ is in the column space of $A^*$.
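Again only as a numerical sketch (with a random consistent system assumed as the example): the stationarity condition $2\hat x = A^*\lambda$ is solvable for $\lambda$, confirming that $\hat x$ lies in the column space of $A^*$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))   # rank 2
b = A @ rng.standard_normal(6)                                   # consistent system

x_hat = np.linalg.pinv(A) @ b
# Solve A^* lambda = 2 x_hat in the least squares sense and check the equation holds exactly.
lam, *_ = np.linalg.lstsq(A.conj().T, 2 * x_hat, rcond=None)
print(np.allclose(A.conj().T @ lam, 2 * x_hat))                  # x_hat is in the column space of A^*
```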
Now there can be only one solution of $Ax=b$ in the orthogonal complement of the null space of $A$. If $x_1$ and $x_2$ are two solutions, then $A(x_1-x_2)=Ax_1-Ax_2=(Ax_1-b)-(Ax_2-b)=0$, so their difference is in the null space of $A$. If the solutions are both in the orthogonal complement of the null space, then so is their difference; since the difference is both in the null space and in its orthogonal complement, it must be zero.
Thus it suffices to show that $A^+b$ solves $Ax=b$ and lies in the column space of $A^*$. We know that if $Ax=b$ has solutions then $A^+b$ is one of them, because $A^+b$ minimizes the error norm. We also know that $A^+b$ lies in the column space of $A^*$: by the Penrose conditions, $A^+=A^+AA^+=(A^+A)^*A^+=A^*A^{+*}A^+$, so $A^+b=(A^*A^{+*}A^+)b=A^*(A^{+*}A^+b)$.
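A numerical sanity check of that identity, a sketch only, with a random rank-deficient matrix assumed as the example:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # rank-deficient example
A_pinv = np.linalg.pinv(A)

# Penrose conditions give A^+ = A^+ A A^+ = (A^+ A)^* A^+ = A^* A^{+*} A^+.
print(np.allclose(A_pinv, A.conj().T @ A_pinv.conj().T @ A_pinv))
```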
@joriki provides a great answer. This post has a different perspective and should be viewed as a complement to his work.
Definitions
The linear system is $$ \mathbf{A} x = b \tag{1} $$ where the system matrix $ \mathbf{A} \in \mathbb{C}^{m\times n}_{\rho} $ has rank $\rho\le\min\left( m,n \right)$ and the data vector is $b\in \mathbb{C}^{m}$. To exclude the trivial case, we require that the data vector not lie entirely in the null space of the adjoint: $b\notin\color{red}{\mathcal{N}\left( \mathbf{A}^{*}\right)}.$
The least squares solution to $(1)$ is defined as $$ x_{LS} = \left\{ x\in\mathbb{C}^{n} \colon \lVert \mathbf{A} x - b \rVert_{2}^{2} \text{ is minimized} \right\} $$ The least squares minimizers are the affine set given by $$ x_{LS} = \color{blue}{\mathbf{A}^{+} b} + \color{red}{ \left( \mathbf{I}_{n} - \mathbf{A}^{+} \mathbf{A} \right) y}, \quad y \in \mathbb{C}^{n} $$ Coloring indicates membership in the $\color{blue}{range}$ space $\color{blue}{\mathcal{R}\left(\mathbf{A}^{*}\right)}$ or the $\color{red}{null}$ space $\color{red}{\mathcal{N}\left(\mathbf{A}\right)}$. Geometrically, the minimizers form an affine set: the translate of the null space through the point $\color{blue}{\mathbf{A}^{+} b}$.
Solution of minimum norm
Every vector in this affine set minimizes the sum of the squares of the residual errors. What are the lengths of these vectors? $$ \lVert x_{LS} \rVert^{2}_{2} = \lVert \color{blue}{\mathbf{A}^{+} b} + \color{red}{ \left( \mathbf{I}_{n} - \mathbf{A}^{+} \mathbf{A} \right) y} \rVert^{2}_{2} = \lVert \color{blue}{\mathbf{A}^{+} b} \rVert^{2}_{2} + \lVert \color{red}{\left( \mathbf{I}_{n} - \mathbf{A}^{+} \mathbf{A} \right) y} \rVert^{2}_{2} $$ The cross terms vanish because the first term lies in $\color{blue}{\mathcal{R}\left(\mathbf{A}^{*}\right)}$ and the second in $\color{red}{\mathcal{N}\left(\mathbf{A}\right)}$, which are orthogonal. Which of these vectors has the shortest length? The one with $y=\mathbf{0}$. The least squares minimizer of minimum length is the vector lying in the range space $\color{blue}{\mathcal{R} \left( \mathbf{A}^{*} \right)}$: $$ \color{blue}{x_{LS}} = \color{blue}{\mathbf{A}^{+} b} $$
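A short numerical sketch of both facts, assuming a random rank-deficient system as the example (NumPy, real data):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))       # rank 3, so N(A) is nontrivial
b = rng.standard_normal(6)

A_pinv = np.linalg.pinv(A)
x_min = A_pinv @ b                                                  # the blue term, in R(A*)
red = (np.eye(5) - A_pinv @ A) @ rng.standard_normal(5)             # the red term, in N(A)
x_ls = x_min + red

# Same residual norm for every member of the affine set ...
print(np.isclose(np.linalg.norm(A @ x_ls - b), np.linalg.norm(A @ x_min - b)))
# ... but the lengths obey the Pythagorean split, so y = 0 gives the shortest minimizer.
print(np.isclose(np.linalg.norm(x_ls) ** 2,
                 np.linalg.norm(x_min) ** 2 + np.linalg.norm(red) ** 2))
```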
The question about the minimum norm solution reveals a misunderstanding. The pseudoinverse solution $\color{blue}{\mathbf{A}^{+} b}$ does not select the solution of minimum length. It is the solution of minimum length.
To see how the SVD naturally produces the pseudoinverse solution, read How does the SVD solve the least squares problem?
Inverting the SVD
The answer to the special handling of the $\Sigma$ matrix boils down to this: a convention. The singular value decomposition is defined as
$$
\begin{align}
\mathbf{A} &= \mathbf{U} \, \Sigma \, \mathbf{V}^{*} \\
&=
\left[ \begin{array}{cc}
\color{blue}{\mathbf{U}_{\mathcal{R}}} & \color{red}{\mathbf{U}_{\mathcal{N}}}
\end{array} \right]
\left[ \begin{array}{cccc|c}
\sigma_{1} & & & & \\
& \sigma_{2} & & & \\
& & \ddots & & \\
& & & \sigma_{\rho} & \\ \hline
& & & & \mathbf{0}
\end{array} \right]
\left[ \begin{array}{c}
\color{blue}{\mathbf{V}_{\mathcal{R}}^{*}} \\
\color{red}{\mathbf{V}_{\mathcal{N}}^{*}}
\end{array} \right] \\
&=
\left[ \begin{array}{cccccc}
\color{blue}{u_{1}} & \dots & \color{blue}{u_{\rho}} & \color{red}{u_{\rho+1}} & \dots & \color{red}{u_{m}}
\end{array} \right]
\left[ \begin{array}{cc}
\mathbf{S}_{\rho\times \rho} & \mathbf{0} \\
\mathbf{0} & \mathbf{0}
\end{array} \right]
\left[ \begin{array}{c}
\color{blue}{v_{1}^{*}} \\ \vdots \\ \color{blue}{v_{\rho}^{*}} \\ \color{red}{v_{\rho+1}^{*}} \\ \vdots \\ \color{red}{v_{n}^{*}}
\end{array} \right]
\end{align}
$$
Think of the $\Sigma$ matrix as a sabot matrix which ensures conformability between the domain matrices $\mathbf{U}$ and $\mathbf{V}$ of different dimension. The heart of the sabot matrix is the diagonal matrix of singular values $\mathbf{S}\in\mathbb{R}^{\rho\times\rho}$. As you can see from the structure of the SVD, the $\color{red}{red}$ vectors are stenciled out from being able to contribute to the $\color{blue}{range}$ space terms.
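A small sketch of this block structure with NumPy; the example matrix and its rank are assumptions for illustration. With `full_matrices=True` the square $\mathbf{U}$ and $\mathbf{V}$ are returned, and the $m \times n$ sabot matrix must be rebuilt around the $\rho$ nonzero singular values:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, rho = 5, 4, 2
A = rng.standard_normal((m, rho)) @ rng.standard_normal((rho, n))   # rank rho

U, s, Vh = np.linalg.svd(A, full_matrices=True)   # U: m x m, Vh: n x n, s: min(m, n) values

# Rebuild the m x n sabot matrix: S on the leading diagonal block, zeros elsewhere,
# so that the m x m matrix U and the n x n matrix V^* stay conformable.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vh))             # A = U Sigma V^*
print(int(np.sum(s > 1e-12 * s[0])))              # only rho = 2 singular values are nonzero
```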
The Moore-Penrose pseudoinverse is
$$
\mathbf{A}^{+} = \mathbf{V} \, \Sigma^{+} \, \mathbf{U}^{*}
=
\left[ \begin{array}{cc}
\color{blue}{\mathbf{V}_{\mathcal{R}}} & \color{red}{\mathbf{V}_{\mathcal{N}}}
\end{array} \right]
\left[ \begin{array}{cc}
\mathbf{S}^{-1} & \mathbf{0} \\
\mathbf{0} & \mathbf{0}
\end{array} \right]
\left[ \begin{array}{c}
\color{blue}{\mathbf{U}_{\mathcal{R}}^{*}} \\
\color{red}{\mathbf{U}_{\mathcal{N}}^{*}}
\end{array} \right]
$$
Only the nonzero singular values in $\mathbf{S}$ are inverted; the zero blocks of the sabot matrix stay zero, which is exactly the "extra care" the question asks about. The different guises of $\mathbf{A}^{+}$ are cataloged in *generalized inverse of a matrix and convergence for singular matrix*.
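A short sketch of this construction (again with a random rank-deficient example assumed), inverting only the nonzero singular values and comparing against `np.linalg.pinv`:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))    # rank 2

U, s, Vh = np.linalg.svd(A, full_matrices=True)

# Sigma^+ is n x m: invert only the (numerically) nonzero singular values, keep the zeros.
tol = 1e-12 * s[0]
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))
Sigma_plus[:len(s), :len(s)] = np.diag([1.0 / x if x > tol else 0.0 for x in s])

A_plus = Vh.conj().T @ Sigma_plus @ U.conj().T                   # V Sigma^+ U^*
print(np.allclose(A_plus, np.linalg.pinv(A)))                    # matches NumPy's pseudoinverse
```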