Does gradient descent converge to a minimum-norm solution in least-squares problems?
Consider running gradient descent (GD) on the following optimization problem:
$$\arg\min_{\mathbf x \in \mathbb R^n} \| A\mathbf x-\mathbf b \|_2^2$$
where $\mathbf b$ lies in the column space of $A$, and the columns of $A$ are not linearly independent. Is it true that GD finds a solution of minimum norm? I have seen articles (e.g., arXiv:1705.09280) that indicate so, but despite searching the internet for a while I couldn't find a proof.
Can someone confirm or refute it? And if it's true, a proof or a reference to the proof would be much appreciated!
EDIT 2019/11/27:
Thanks to littleO's answer (Solution 2 below), the answer to this question is apparently no in general. However, I'm still curious about the following:
Follow-up Question: Are there some constraints under which the answer is yes? Is it true that, as Clement C. suggested, if we initialize $\mathbf x$ in the range of $A^\top$, then GD finds the minimum-norm solution? Is this a sufficient condition or is it also necessary?
It appears to me that the answer is yes, if and only if we initialize $\mathbf x$ in the range of $A^\top$.
I'll list my arguments below and would appreciate it if someone would confirm it or point out where I'm mistaken.
My arguments: Let $f(\mathbf x)= \| A\mathbf x-\mathbf b \|_2^2$. Then $\nabla_{\mathbf x}f(\mathbf x) = 2A^\top(A\mathbf x - \mathbf b),$ and GD iterates as follows: $\mathbf x^{(t+1)}=\mathbf x^{(t)}-\eta \nabla_{\mathbf x}f(\mathbf x^{(t)})$. Note that all GD updates are in the range of $A^\top$. Hence we may write $\mathbf x^{(t)}=\mathbf x^{(0)}+A^\top \mathbf u$ for some vector $\mathbf u$.
Sufficiency: Suppose $\mathbf x^{(0)}$ is also in the range of $A^\top$, i.e. $\mathbf x^{(0)}=A^\top \mathbf v$. Then $\mathbf x^{(t)}=A^\top (\mathbf v+\mathbf u).$ Since $f(\mathbf x)$ is convex, GD converges to a global minimum (with value $0$, because $\mathbf b$ lies in the column space of $A$) provided the step size is small enough. Denote the limit by $\mathbf x^{(t)} \to \mathbf x^* = A^\top \mathbf u^*$. Then $A\mathbf x^*-\mathbf b=AA^\top \mathbf u^*-\mathbf b=\mathbf 0$, so $\mathbf u^*=(AA^\top)^{-1}\mathbf b$ (assuming $A$ has full row rank; if not, we can delete redundant rows), and $\mathbf x^*=A^\top (AA^\top)^{-1}\mathbf b$, which is the well-known minimum-norm solution.
Necessity: Now suppose $\mathbf x^{(0)} \notin \mathrm{range}(A^\top)$, and $\mathbf x^{(t)} \to \mathbf x^*$. As above, $\mathbf x^* = \mathbf x^{(0)} + A^\top \mathbf u^*$ for some $\mathbf u^*$. Since $A^\top \mathbf u^* \in \mathrm{range}(A^\top)$ while $\mathbf x^{(0)}$ has a nonzero component in the orthogonal complement $\mathrm{null}(A)$, that component survives in $\mathbf x^*$. Hence $\mathbf x^*\notin \mathrm{range}(A^\top)$, so it cannot be the (unique) minimum-norm solution $A^\top (AA^\top)^{-1}\mathbf b$.
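Here is a minimal numerical sketch of this claim (the random matrix, step size, and iteration count are my own arbitrary choices, purely for illustration): starting GD in $\mathrm{range}(A^\top)$ recovers the minimum-norm solution, while a generic initialization does not.

```python
# Sketch: GD on ||Ax - b||^2 with a random 2x3 matrix A and b in its column space.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))                 # fat matrix: columns are linearly dependent
b = A @ rng.standard_normal(3)                  # b lies in the column space of A
x_min_norm = A.T @ np.linalg.solve(A @ A.T, b)  # A^T (A A^T)^{-1} b

def gd(x0, iters=200_000):
    eta = 0.5 / np.linalg.norm(A.T @ A, 2)      # step size small enough for convergence
    x = x0.copy()
    for _ in range(iters):
        x = x - eta * 2 * A.T @ (A @ x - b)     # gradient of ||Ax - b||^2
    return x

x0_in_range = A.T @ rng.standard_normal(2)      # x^(0) in range(A^T)
x0_generic = rng.standard_normal(3)             # x^(0) with a component in null(A)

print(np.linalg.norm(gd(x0_in_range) - x_min_norm))  # ~ 0: the minimum-norm solution
print(np.linalg.norm(gd(x0_generic) - x_min_norm))   # > 0: a different global minimizer
```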
Solution 1:
From the paper [0] in question:
When optimizing underdetermined problems with multiple global minima, the choice of optimization algorithm can play a crucial role in biasing us toward a specific global minima, even though this bias is not explicitly specified in the objective or problem formulation. For example, using gradient descent to optimize an unregularized, underdetermined least squares problem would yield the minimum Euclidean norm solution, while using coordinate descent or preconditioned gradient descent might yield a different solution. Such implicit bias, which can also be viewed as a form of regularization, can play an important role in learning.
Given fat matrix $\mathrm A \in \mathbb R^{m \times n}$ ($m < n$) and vector $\mathrm b \in \mathbb R^m$, consider the following linear system in $\mathrm x \in \mathbb R^n$
$$\rm A x = b$$
where $\rm A$ has full row rank. Let the singular value decomposition (SVD) of $\rm A$ be as follows
$$\mathrm A = \mathrm U \Sigma \mathrm V^\top = \mathrm U \begin{bmatrix} \Sigma_1 & \mathrm O \end{bmatrix} \begin{bmatrix} \mathrm V_1^\top \\ \mathrm V_2^\top \end{bmatrix} = \mathrm U \Sigma_1 \mathrm V_1^\top$$
The least-norm solution of $\rm A x = b$ is given by
$$\mathrm x_{\text{LN}} := \mathrm A^\top \left( \mathrm A \mathrm A^\top \right)^{-1} \mathrm b = \mathrm V_1 \Sigma_1 \mathrm U^\top \left( \mathrm U \Sigma_1^2 \mathrm U^\top \right)^{-1} \mathrm b = \mathrm V_1 \Sigma_1^{-1} \mathrm U^\top \mathrm b$$
where the inverse of $\mathrm A \mathrm A^\top$ exists because $\rm A$ has full row rank.
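As a quick sanity check of this formula (a sketch with an arbitrary random $\mathrm A$, chosen only for illustration), the normal-equation expression, the SVD expression, and NumPy's pseudoinverse all agree:

```python
# Sketch: A^T (A A^T)^{-1} b, V_1 Sigma_1^{-1} U^T b, and pinv(A) b coincide
# for a (generically) full-row-rank fat matrix A.
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_normal_eq = A.T @ np.linalg.solve(A @ A.T, b)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: Vt.T plays the role of V_1
x_svd = Vt.T @ ((U.T @ b) / s)                    # V_1 Sigma_1^{-1} U^T b

print(np.allclose(x_normal_eq, x_svd))            # True
print(np.allclose(x_svd, np.linalg.pinv(A) @ b))  # True
```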
Gradient descent
Let cost function $f : \mathbb R^n \to \mathbb R$ be defined by
$$f (\mathrm x) := \frac12 \left\| \rm{A x - b} \right\|_2^2$$
whose gradient is
$$\nabla f (\mathrm x) = \rm A^\top \left( A x - b \right)$$
Using gradient descent with step $\mu > 0$,
$$\begin{aligned} {\rm x}_{k+1} &= {\rm x}_k - \mu \nabla f ({\rm x}_k)\\ &= \left( {\rm I} - \mu {\rm A^\top A} \right) {\rm x}_k + \mu {\rm A^\top b}\end{aligned}$$
Hence,
$${\rm x}_k = \left( {\rm I} - \mu {\rm A^\top A} \right)^k {\rm x}_0 + \mu \sum_{\ell = 0}^{k-1} \left( {\rm I} - \mu {\rm A^\top A} \right)^{\ell} {\rm A^\top b}$$
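(A small numerical sketch of this unrolled expression, with an arbitrary instance and step size of my choosing; it only verifies the algebra, not the convergence argument that follows.)

```python
# Sketch: the closed-form expression for x_k matches running the recursion directly.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 4))
b = rng.standard_normal(2)
x0 = rng.standard_normal(4)
mu = 0.1 / np.linalg.norm(A.T @ A, 2)   # well below 2 / ||A^T A||_2
k = 25

# recursion: x_{k+1} = (I - mu A^T A) x_k + mu A^T b
x = x0.copy()
for _ in range(k):
    x = x - mu * A.T @ (A @ x - b)

# unrolled form: (I - mu A^T A)^k x_0 + mu * sum_{l=0}^{k-1} (I - mu A^T A)^l A^T b
M = np.eye(4) - mu * A.T @ A
x_closed = np.linalg.matrix_power(M, k) @ x0 + mu * sum(
    np.linalg.matrix_power(M, l) @ (A.T @ b) for l in range(k)
)

print(np.allclose(x, x_closed))  # True
```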
Letting $\rm y := V^\top x$, we rewrite
$$\begin{aligned} {\rm y}_k &= \left( {\rm I} - \mu \Sigma^\top \Sigma \right)^k {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \left( {\rm I} - \mu \Sigma^\top \Sigma \right)^{\ell} \Sigma^\top {\rm U^\top b}\\ &= \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^k & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} \begin{bmatrix} \Sigma_1\\ \mathrm O \end{bmatrix} {\rm U^\top b}\\ &= \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^k & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1\\ \mathrm O\end{bmatrix} {\rm U^\top b} \end{aligned}$$
If $\mu > 0$ is chosen so that all eigenvalues of ${\rm I} - \mu \Sigma_1^2$ lie strictly inside the unit circle, i.e. $0 < \mu < 2 / \sigma_{\max}^2(\mathrm A)$, then ${\rm y}_k \to {\rm y}_{\infty}$, where
$${\rm y}_{\infty} = \begin{bmatrix} \mathrm O & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{\infty} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1\\ \mathrm O\end{bmatrix} {\rm U^\top b}$$
where
$$\mu \sum_{\ell = 0}^{\infty} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1 = \mu \left( {\rm I} - {\rm I} + \mu \Sigma_1^2 \right)^{-1} \Sigma_1 = \Sigma_1^{-1}$$
and, thus,
$${\rm y}_{\infty} = \begin{bmatrix} \mathrm O & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \begin{bmatrix} \Sigma_1^{-1} \\ \mathrm O\end{bmatrix} {\rm U^\top b}$$
Since $\rm x := V y$,
$$\boxed{ \,\\\quad {\rm x}_{\infty} = {\rm V}_2 {\rm V}_2^\top {\rm x}_0 + \underbrace{{\rm V}_1 \Sigma_1^{-1}{\rm U^\top b}}_{= \mathrm x_{\text{LN}}} \quad\\}$$
Therefore, we conclude that if ${\rm x}_0$ is orthogonal to the null space of $\rm A$ (equivalently, ${\rm x}_0 \in \operatorname{range}(\mathrm A^\top)$, e.g. ${\rm x}_0 = \mathrm 0$), then gradient descent converges to the least-norm solution; otherwise it converges to the least-norm solution plus the untouched component ${\rm V}_2 {\rm V}_2^\top {\rm x}_0$ of the initialization in the null space of $\rm A$.
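A numerical sketch of the boxed formula (with an arbitrary random instance and step size of my choosing):

```python
# Sketch: check x_inf = V_2 V_2^T x_0 + x_LN for a fat, full-row-rank A.
import numpy as np

rng = np.random.default_rng(3)
m, n = 2, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = rng.standard_normal(n)
mu = 1.0 / np.linalg.norm(A.T @ A, 2)   # keeps I - mu*Sigma_1^2 inside the unit circle

x = x0.copy()
for _ in range(200_000):
    x = x - mu * A.T @ (A @ x - b)      # gradient descent on (1/2)||Ax - b||^2

U, s, Vt = np.linalg.svd(A)             # full SVD: Vt is n x n
V1, V2 = Vt[:m].T, Vt[m:].T             # V_1 spans range(A^T), V_2 spans null(A)
x_LN = V1 @ ((U.T @ b) / s)             # least-norm solution V_1 Sigma_1^{-1} U^T b
x_pred = V2 @ (V2.T @ x0) + x_LN

print(np.allclose(x, x_pred))           # True
```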
[0] Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro, "Implicit Regularization in Matrix Factorization," arXiv:1705.09280, May 2017.
Solution 2:
If you initialize gradient descent at a point $x_0$ that minimizes the objective function but is not a least-norm minimizer, then the gradient at $x_0$ is zero, so the iteration gives $x_k = x_0$ for all $k \geq 0$: we never move. Hence gradient descent does not necessarily converge to a least-norm solution.
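A concrete instance (my own illustration of this observation): take $A = \begin{bmatrix} 1 & 0 \end{bmatrix}$ and $b = 1$. Then $x_0 = (1, 1)$ is a minimizer with zero residual, while the least-norm solution is $(1, 0)$.

```python
# Sketch: GD started at a non-least-norm minimizer of ||Ax - b||^2 never moves.
import numpy as np

A = np.array([[1.0, 0.0]])
b = np.array([1.0])
x = np.array([1.0, 1.0])             # a global minimizer, but not the least-norm one (1, 0)

for _ in range(1000):
    x = x - 0.1 * A.T @ (A @ x - b)  # the gradient is identically zero at this point

print(x)                             # [1. 1.] -- unchanged, not the least-norm solution
```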