How does the SVD solve the least squares problem?
How do I prove that the least-squares solution for $$\text{minimize} \quad \|Ax-b\|_2$$ is $A^{+} b$, where $A^{+}$ is the pseudoinverse of $A$?
Solution 1:
The Moore-Penrose pseudoinverse is a natural consequence of applying the singular value decomposition to the least squares problem. The SVD resolves the least squares problem into two components: (1) a range space term, which can be driven to zero, and (2) a null space term, which cannot be removed, the residual error. Minimizing the first term naturally produces the pseudoinverse solution.
Define SVD
Start with a nonzero matrix $\mathbf{A}\in\mathbb{C}^{m\times n}_{\rho}$, where the matrix rank satisfies $1\le\rho<m$ and $\rho<n$. The singular value decomposition, guaranteed to exist, is $$ \mathbf{A} = \mathbf{U} \, \Sigma \, \mathbf{V}^{*} = \left[ \begin{array}{cc} \color{blue}{\mathbf{U}_{\mathcal{R}}} & \color{red}{\mathbf{U}_{\mathcal{N}}} \end{array} \right] \left[ \begin{array}{cc} \mathbf{S} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{array} \right] \left[ \begin{array}{c} \color{blue}{\mathbf{V}_{\mathcal{R}}}^{*} \\ \color{red}{\mathbf{V}_{\mathcal{N}}}^{*} \end{array} \right]. $$ The codomain matrix $\mathbf{U}\in\mathbb{C}^{m\times m}$ and the domain matrix $\mathbf{V}\in\mathbb{C}^{n\times n}$ are unitary: $$ \mathbf{U}^{*}\mathbf{U} = \mathbf{U}\mathbf{U}^{*} = \mathbf{I}_{m}, \qquad \mathbf{V}^{*}\mathbf{V} = \mathbf{V}\mathbf{V}^{*} = \mathbf{I}_{n}. $$ The column vectors of the codomain and domain matrices provide orthonormal bases for the four fundamental subspaces: $$ \begin{array}{ll} \text{matrix} & \text{subspace} \\ \hline \color{blue}{\mathbf{U}_{\mathcal{R}}}\in\mathbb{C}^{m\times\rho} & \color{blue}{\mathcal{R}\left(\mathbf{A}\right)} = \text{span}\left\{\color{blue}{u_{1}},\dots,\color{blue}{u_{\rho}}\right\}\\ \color{blue}{\mathbf{V}_{\mathcal{R}}}\in\mathbb{C}^{n\times\rho} & \color{blue}{\mathcal{R}\left(\mathbf{A}^{*}\right)} = \text{span}\left\{\color{blue}{v_{1}},\dots,\color{blue}{v_{\rho}}\right\}\\ \color{red}{\mathbf{U}_{\mathcal{N}}}\in\mathbb{C}^{m\times\left(m-\rho\right)} & \color{red}{\mathcal{N}\left(\mathbf{A}^{*}\right)} = \text{span}\left\{\color{red}{u_{\rho+1}},\dots,\color{red}{u_{m}}\right\}\\ \color{red}{\mathbf{V}_{\mathcal{N}}}\in\mathbb{C}^{n\times\left(n-\rho\right)} & \color{red}{\mathcal{N}\left(\mathbf{A}\right)} = \text{span}\left\{\color{red}{v_{\rho+1}},\dots,\color{red}{v_{n}}\right\} \end{array} $$ There are $\rho$ singular values, which are real and ordered: $$ \sigma_{1} \ge \sigma_{2} \ge \dots \ge \sigma_{\rho} > 0; $$ they are the square roots of the nonzero eigenvalues of the product matrices $\mathbf{A}^{*}\mathbf{A}$ and $\mathbf{A}\mathbf{A}^{*}$. These singular values form the diagonal matrix $$ \mathbf{S} = \text{diagonal}\left(\sigma_{1},\sigma_{2},\dots,\sigma_{\rho}\right) = \left[ \begin{array}{ccc} \sigma_{1} \\ & \ddots \\ && \sigma_{\rho} \end{array} \right] \in\mathbb{R}^{\rho\times\rho}. $$ The $\mathbf{S}$ matrix is embedded in the sabot matrix $\Sigma\in\mathbb{R}^{m\times n}$, whose shape ensures conformability.
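As a concrete illustration of this partition (not part of the proof), here is a minimal NumPy sketch on a small rank-deficient matrix. The names `U_R`, `U_N`, `V_R`, `V_N`, `S` simply mirror the notation above, and the example is real, so $*$ reduces to a transpose:

```python
import numpy as np

# Rank-deficient example: m = 4, n = 3, rank rho = 2 (real case for simplicity).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))

U, s, Vh = np.linalg.svd(A)              # full SVD: U is 4x4, Vh is 3x3
rho = int(np.sum(s > 1e-12 * s[0]))      # numerical rank
S = np.diag(s[:rho])                     # rho x rho block of singular values

U_R, U_N = U[:, :rho], U[:, rho:]        # bases for R(A) and N(A*)
V_R, V_N = Vh[:rho, :].T, Vh[rho:, :].T  # bases for R(A*) and N(A)

# A is reconstructed entirely from the range space blocks: A = U_R S V_R*.
assert np.allclose(A, U_R @ S @ V_R.T)
```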
Define least squares solution
Consider that the linear system $$ \mathbf{A} x = b $$ does not have an exact solution, so we generalize the question and ask for the best solution vector $x$. Using the $2$-norm, we ask for the least squares solution which minimizes $r^{2}(x) = \lVert \mathbf{A} x - b \rVert_{2}^{2}$, the sum of the squares of the residual errors: $$ x_{LS} = \left\{ x \in \mathbb{C}^{n} \colon \big\lVert \mathbf{A} x - b \big\rVert_{2}^{2} \text{ is minimized} \right\} $$
Exploit SVD - resolve range and null space components
A useful property of unitary transformations is that the $2$-norm is invariant under them; for example, $$ \lVert \mathbf{V} x \rVert_{2} = \lVert x \rVert_{2}. $$ This provides the freedom to transform a problem into a form that is easier to manipulate. In this case, $$ \begin{align} r^{2}(x) &= \lVert \mathbf{A}x - b \rVert_{2}^{2} = \lVert \mathbf{U}^{*}\left(\mathbf{A}x - b \right) \rVert_{2}^{2} = \lVert \mathbf{U}^{*}\left(\mathbf{U} \, \Sigma \, \mathbf{V}^{*}x - b \right) \rVert_{2}^{2} \\ &= \lVert \Sigma \mathbf{V}^{*} x - \mathbf{U}^{*} b \rVert_{2}^{2}. \end{align} $$ Switching to the block form separates the range and null space terms: $$ \begin{align} r^{2}(x) &= \big\lVert \Sigma \mathbf{V}^{*} x - \mathbf{U}^{*} b \big\rVert_{2}^{2} = \Bigg\lVert \left[ \begin{array}{cc} \mathbf{S} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{array} \right] \left[ \begin{array}{c} \color{blue}{\mathbf{V}_{\mathcal{R}}}^{*} \\ \color{red}{\mathbf{V}_{\mathcal{N}}}^{*} \end{array} \right] x - \left[ \begin{array}{c} \color{blue}{\mathbf{U}_{\mathcal{R}}}^{*} \\ \color{red}{\mathbf{U}_{\mathcal{N}}}^{*} \end{array} \right] b \Bigg\rVert_{2}^{2} \\ &= \big\lVert \mathbf{S} \color{blue}{\mathbf{V}_{\mathcal{R}}}^{*} x - \color{blue}{\mathbf{U}_{\mathcal{R}}}^{*} b \big\rVert_{2}^{2} + \big\lVert \color{red}{\mathbf{U}_{\mathcal{N}}}^{*} b \big\rVert_{2}^{2}. \end{align} $$
The separation between the range space and null space contributions to the total error is also a separation between what we can control and what we cannot. The vector we control is the solution vector $x$, which appears only in the (blue) range space term. What remains is the (red) null space term, a residual error.
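A minimal numerical check of this split, again assuming NumPy and the real case: for an arbitrary trial vector $x$, the squared residual should equal the sum of the two terms.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))  # rank 2
b = rng.standard_normal(4)
x = rng.standard_normal(3)               # an arbitrary trial vector

U, s, Vh = np.linalg.svd(A)
rho = int(np.sum(s > 1e-12 * s[0]))
S, U_R, U_N, V_R = np.diag(s[:rho]), U[:, :rho], U[:, rho:], Vh[:rho, :].T

range_term = np.linalg.norm(S @ V_R.T @ x - U_R.T @ b) ** 2  # controllable
null_term = np.linalg.norm(U_N.T @ b) ** 2                   # irreducible
assert np.isclose(np.linalg.norm(A @ x - b) ** 2, range_term + null_term)
```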
Solve and recover Moore-Penrose pseudoinverse
Select the vector $x$ which forces the range space term to $0$: $$ \mathbf{S} \color{blue}{\mathbf{V}_{\mathcal{R}}}^{*} x - \color{blue}{\mathbf{U}_{\mathcal{R}}}^{*} b = 0 \qquad \Rightarrow \qquad x = \color{blue}{\mathbf{V}_{\mathcal{R}}} \mathbf{S}^{-1} \color{blue}{\mathbf{U}_{\mathcal{R}}}^{*}b. $$ This is the least squares solution $$ \color{blue}{x_{LS}} = \color{blue}{\mathbf{A}^{\dagger} b} = \color{blue}{\mathbf{V}_{\mathcal{R}}} \mathbf{S}^{-1} \color{blue}{\mathbf{U}_{\mathcal{R}}}^{*}b $$ using the thin SVD. (Adding any null space component $\color{red}{\mathbf{V}_{\mathcal{N}}}\,y$ to this $x$ leaves the residual unchanged, so the minimizer is unique only when $\rho=n$; the pseudoinverse solution $\mathbf{A}^{\dagger}b$ is the minimizer of least $2$-norm.)
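A short sketch of this formula, assuming NumPy; it checks the thin-SVD solution against `np.linalg.pinv` and `np.linalg.lstsq`, both of which return the minimum-norm least squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))  # rank 2
b = rng.standard_normal(4)

U, s, Vh = np.linalg.svd(A)
rho = int(np.sum(s > 1e-12 * s[0]))
U_R, V_R = U[:, :rho], Vh[:rho, :].T
S_inv = np.diag(1.0 / s[:rho])

x_ls = V_R @ S_inv @ U_R.T @ b                   # x_LS = V_R S^{-1} U_R* b
assert np.allclose(x_ls, np.linalg.pinv(A) @ b)  # agrees with A^+ b
assert np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=None)[0])
```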
We are left with an error term which we cannot remove, a residual error, given by $$ r^{2}\left(x_{LS}\right) = \big\lVert \color{red}{\mathbf{U}_{\mathcal{N}}^{*}} b \big\rVert_{2}^{2}, $$ which quantifies the portion of the data vector $b$ residing in the null space $\color{red}{\mathcal{N}\left(\mathbf{A}^{*}\right)}$.
Solution 2:
First order condition:
$$\frac{d}{dx}\|Ax-b\|_2^2=\frac{d}{dx}(Ax-b)'(Ax-b)=2A'(Ax-b)=0$$
Thus, provided $A$ has full column rank so that $A'A$ is invertible,
$$x=(A'A)^{-1}A'b=A^+b,$$
since in that case $A^+=(A'A)^{-1}A'$.
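A quick numerical sanity check of the normal equations route, assuming NumPy and a randomly drawn $A$ that has full column rank almost surely:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # full column rank (almost surely)
b = rng.standard_normal(5)

# Normal equations solution x = (A'A)^{-1} A' b.
x = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x, np.linalg.pinv(A) @ b)     # equals A^+ b here

# First order condition: the residual is orthogonal to the columns of A.
assert np.allclose(A.T @ (A @ x - b), 0)
```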
Solution 3:
The first thing to realise is that the possible values of $Ax$ cover all and only the range of $A$: $\{Ax: x\in V\}=\mathrm{Range}(A)$, where $V$ is the domain of $A$. This means that the problem $Ax=b$ is solvable if and only if $b\in\mathrm{Range}(A)$.
Now consider the more general case in which $b\in \mathrm{Range}(A)$ is not necessarily true, and thus the problem cannot be solved exactly. Because $A$ allows us to generate everything in $\mathrm{Range}(A)$ by an appropriate choice of $x$, the $x$ that gives the best solution (in the $\|\cdot\|_2$ norm) is the one such that $Ax$ equals the orthogonal projection of $b$ onto $\mathrm{Range}(A)$, which is the point of $\mathrm{Range}(A)$ closest to $b$. In other words, denoting with $\mathbb P_R$ the projector onto $\mathrm{Range}(A)$, we want $x$ such that $$Ax=\mathbb P_R b.\tag A$$ With this choice of $x$, the distance between $Ax$ and $b$ is then $$\|b-\mathbb P_R b\|_2=\|(I-\mathbb P_R)b\|_2.$$
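A minimal sketch of this geometric picture, assuming NumPy and a full-column-rank $A$ so that a QR factorization supplies an orthonormal basis of $\mathrm{Range}(A)$; `P_R` is simply the name used here for $\mathbb P_R$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))          # full column rank (almost surely)
b = rng.standard_normal(5)

Q, _ = np.linalg.qr(A)                   # orthonormal basis for Range(A)
P_R = Q @ Q.T                            # orthogonal projector onto Range(A)

x = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(A @ x, P_R @ b)       # the minimizer satisfies (A)
assert np.isclose(np.linalg.norm(A @ x - b),
                  np.linalg.norm((np.eye(5) - P_R) @ b))
```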
Now, why does the SVD enter the discussion? Mostly because it provides an easy way to find the $x$ satisfying (A).
To see this, write the SVD in dyadic notation as $$A=\sum_k s_k u_k v_k^*,\tag B$$ where $s_k>0$ are the singular values, and $\{u_k\}$ and $\{v_k\}$ are orthonormal bases for $\mathrm{Range}(A)$ and $\mathrm{Ker}(A)^\perp=\mathrm{Range}(A^*)$, respectively. The expression (B) is equivalent to the perhaps more common way of writing the SVD, $A=UDV^\dagger$, with the $v_k$ being the columns of $V$ and the $u_k$ the columns of $U$.
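A small check of the dyadic expression (B), assuming NumPy and a real matrix (so $v_k^*$ is just a transpose):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))

U, s, Vh = np.linalg.svd(A, full_matrices=False)   # thin SVD
# Rebuild A as the sum of rank-one dyads s_k u_k v_k^*.
A_dyadic = sum(s[k] * np.outer(U[:, k], Vh[k, :]) for k in range(len(s)))
assert np.allclose(A, A_dyadic)
```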
The pseudo-inverse $A^+$ has a nice expression in terms of the SVD of $A$: $$A^+=\sum_k s_k^{-1} v_k u_k^*. \tag C$$ With these expressions, you might notice how $$A(A^+ y)=\sum_k u_k u_k^* y=\sum_k u_k\langle u_k,y\rangle=\mathbb P_R y.$$ In other words, $AA^+=\mathbb P_R$. Now, remember we want to get as close to $b$ as possible, and therefore are looking for some $y$ such that $$A(A^+ y)=\mathbb P_R y=\mathbb P_R b.$$ This tells us that we want $y=b$, and thus $x=A^+ b$.
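Finally, a sketch that wires expression (C) together with the projector identity $AA^+=\mathbb P_R$, again assuming NumPy and the real case:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
b = rng.standard_normal(4)

U, s, Vh = np.linalg.svd(A, full_matrices=False)
# A^+ = sum_k s_k^{-1} v_k u_k^*   (expression (C), real case).
A_plus = sum((1.0 / s[k]) * np.outer(Vh[k, :], U[:, k]) for k in range(len(s)))
assert np.allclose(A_plus, np.linalg.pinv(A))

P_R = A @ A_plus                         # A A^+ is the projector onto Range(A)
assert np.allclose(P_R, P_R.T) and np.allclose(P_R @ P_R, P_R)
assert np.allclose(A @ (A_plus @ b), P_R @ b)   # so x = A^+ b satisfies (A)
```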