Proof of convergence of stochastic gradient descent to a local optimum for nonconvex functions

Solution 1:

Let us consider the problem $\underset{x \in X}{\min} F(x)$, where $X \subset \mathbb{R}^n$ is a nonempty closed convex set (you have $X = \mathbb{R}^n$ in your setting).

I'm going to assume that for any given $x \in X$, you can compute a stochastic gradient $g(x,Y)$ of $F$ at $x$ that is unbiased, i.e., satisfies $\mathbb{E}_Y\left[g(x,Y)\right] = \nabla F(x)$. If you can only show $\mathbb{E}_Y\left[g(x,Y)\right] \propto \nabla F(x)$ with the proportionality constant independent of $x$, then you can extend (most of) the results below by overestimating the proportionality constant (say by $M > 0$) and scaling the step lengths by $M^{-1}$.
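To make the unbiasedness condition concrete, here is a minimal Python sketch; the quadratic objective and the Gaussian noise model are illustrative assumptions of mine, not taken from the references:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: F(x) = E_Y[ 0.5 * ||x - Y||^2 ] with Y ~ N(mu, I),
# so grad F(x) = x - mu, and g(x, Y) = x - Y is an unbiased stochastic gradient.
mu = np.array([1.0, -2.0])

def stochastic_grad(x, y):
    return x - y  # g(x, Y); averaging over Y recovers grad F(x) = x - mu

x = np.array([0.5, 0.5])
samples = rng.normal(loc=mu, scale=1.0, size=(100_000, 2))
estimate = np.mean([stochastic_grad(x, y) for y in samples], axis=0)
print(estimate, x - mu)  # the sample mean approximates the true gradient
```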

Let $\{\alpha_t\}$ denote a sequence of step lengths. The traditional projected stochastic gradient update rule reads $x_{t+1} = \text{proj}_X\left(x_t - \alpha_t g(x_t,Y_t)\right)$, where $g(x_t,Y_t)$ is a stochastic gradient of $F$ at $x_t$, $\text{proj}_X$ denotes projection onto the set $X$, and $Y_t$ is a sample of the random variable $Y$ at iteration $t$. Below (after a short illustrative sketch) are several results in the literature for this setup.
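As a concrete illustration of the update rule, here is a minimal sketch assuming a box constraint $X = [-1,1]^2$ (so the projection is a coordinatewise clip), diminishing step lengths, and the hypothetical quadratic estimator from the previous sketch:

```python
import numpy as np

def projected_sgd(grad_estimate, project, x0, step_sizes, rng):
    """Projected SGD: x_{t+1} = proj_X(x_t - alpha_t * g(x_t, Y_t))."""
    x = np.asarray(x0, dtype=float)
    for alpha in step_sizes:
        g = grad_estimate(x, rng)   # stochastic gradient g(x_t, Y_t)
        x = project(x - alpha * g)  # projection onto X
    return x

# Illustrative assumptions: X = [-1, 1]^2 and F(x) = E_Y[0.5 * ||x - Y||^2]
# with Y ~ N(mu, I), as in the previous sketch.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
grad_estimate = lambda x, rng: x - rng.normal(loc=mu, scale=1.0)
project = lambda z: np.clip(z, -1.0, 1.0)
steps = [0.5 / np.sqrt(t + 1) for t in range(5000)]  # diminishing step lengths
print(projected_sgd(grad_estimate, project, np.zeros(2), steps, rng))
# approaches the constrained minimizer proj_X(mu) = (1, -1)
```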

  • The Ghadimi-Lan paper referenced by @eepperly16 considers the case when $X = \mathbb{R}^n$. They assume that $\nabla F$ is Lipschitz continuous, i.e., $$\left\lVert \nabla F(x) - \nabla F(y)\right\rVert \leq L \left\lVert x - y \right\rVert, \quad \forall x, y \in \mathbb{R}^n,$$ and that the variance of the stochastic gradients is bounded, i.e., $$\mathbb{E}_Y\left[ \left\lVert g(x,Y) - \nabla F(x) \right\rVert^2 \right] \leq \sigma^2, \quad \forall x \in \mathbb{R}^n,$$ which in a certain sense requires that the stochastic gradients provide "reasonable approximations" of the true gradient [see A1 of the paper]. They set the step lengths $\alpha_t \in (0,\frac{2}{L})$ and obtain guarantees such as $\mathbb{E}\left[ \left\lVert \nabla F(x_R) \right\rVert^2 \right] = O\left(\frac{1}{\sqrt{T}}\right)$, where $T$ is the number of iterations and $R$ is chosen uniformly at random from $\{1,\cdots,T\}$ [see Corollary 2 of the paper; a sketch of this randomized stopping rule appears after this list]. If $F$ is convex, then they get a bound on the suboptimality of $F(x_R)$ itself. To get a bound on the probability that $x_R$ is approximately stationary, they assume that the stochastic gradients satisfy large-deviation properties, run the stochastic gradient method multiple times to get multiple candidate solutions, and take the solution with the smallest estimated gradient norm in a post-optimization phase; see Corollary 2.5 of the paper. Note that this paper also considers the case when you only have access to stochastic function values.

  • A follow-up paper considers the case when $X$ is a general nonempty closed convex subset of $\mathbb{R}^n$; the rest of the assumptions are similar to the above setting. A main result of this paper is $\mathbb{E}\left[ \left\lVert \tilde{\nabla} F(x_R) \right\rVert^2 \right] = O\left(\frac{1}{\sqrt{T}}\right)$, where $\tilde{\nabla} F(x_R)$ is a projected gradient (see Equation (2.4) of the paper for a definition, and the sketch after this list). In order to achieve this, the authors need to consider an increasing mini-batch size while computing stochastic gradients, which is cumbersome [the paper referenced in the next bullet point says this is not necessary]. To understand what this implies for the original problem, see Section 4 of this paper.

  • A more recent paper with a more general setup is here, which refines the analysis of the above paper. In particular, Section 3.1 of this paper considers the above setup for the case when $F$ is a weakly convex function and $X$ is a nonempty closed convex set. Note that the class of weakly convex functions includes smooth functions with Lipschitz continuous gradients, as well as pointwise maxima of finitely many such functions; see Section 2.1 of the paper for examples. It is assumed that $\mathbb{E}_Y\left[g(x,Y)\right] \in \partial F(x)$ and $\mathbb{E}_Y\left[ \left\lVert g(x,Y) \right\rVert^2 \right] \leq \sigma^2, \quad \forall x \in \mathbb{R}^n$. The authors establish guarantees similar to the above (for our setup) without the need for mini-batching of the stochastic subgradients.

  • This paper explores the limits of stochastic gradient-based methods, and shows that stochastic subgradient methods converge on "tame functions".
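As promised in the first bullet, here is a minimal sketch of the randomized stopping rule. I assume constant step lengths, under which drawing $R$ uniformly from $\{1,\cdots,T\}$ matches the paper's sampling scheme, and an unbiased estimator like the hypothetical one above:

```python
import numpy as np

def randomized_sgd(grad_estimate, x0, alpha, T, rng):
    """Run T SGD steps with a constant step length and return x_R with R drawn
    uniformly from the T iterates, as in the Ghadimi-Lan style guarantee
    E[||grad F(x_R)||^2] = O(1/sqrt(T))."""
    x = np.asarray(x0, dtype=float)
    iterates = []
    for _ in range(T):
        x = x - alpha * grad_estimate(x, rng)  # unconstrained case X = R^n
        iterates.append(x.copy())
    R = rng.integers(T)  # R uniform over the T iterates (0-indexed here)
    return iterates[R]
```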
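For the second bullet, the projected gradient $\tilde{\nabla} F$ can be written, if I recall the standard definition correctly (compare Equation (2.4) of that paper), as $\tilde{\nabla} F(x) = \frac{1}{\alpha}\left(x - \text{proj}_X\left(x - \alpha \nabla F(x)\right)\right)$; it reduces to $\nabla F(x)$ when $X = \mathbb{R}^n$, and it vanishes exactly at stationary points of the constrained problem. A one-line sketch:

```python
def projected_gradient_mapping(x, g, alpha, project):
    """tilde_grad F(x) = (x - proj_X(x - alpha * g)) / alpha, where g is the
    (stochastic) gradient at x; equals g when the projection is the identity."""
    return (x - project(x - alpha * g)) / alpha
```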

Note that stochastic gradient methods can typically only establish convergence to "stationary points". Guaranteeing convergence to a local minimum is NP-hard even for deterministic methods. However, it is noteworthy that (perturbed versions of) stochastic gradient methods are usually shown to escape poor saddle points; see here and here for examples. A toy illustration of this escape phenomenon follows.
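The sketch below is a toy illustration of my own, not the algorithm from the cited papers: the function $f(x,y) = \frac{x^4}{4} - \frac{x^2}{2} + \frac{y^2}{2}$ has a strict saddle at the origin and local minima at $(\pm 1, 0)$; plain gradient descent started exactly at the saddle never moves, while a small isotropic perturbation lets the iterates escape toward a minimum.

```python
import numpy as np

def grad_f(z):
    x, y = z
    # gradient of f(x, y) = x^4/4 - x^2/2 + y^2/2; strict saddle at the origin
    return np.array([x**3 - x, y])

rng = np.random.default_rng(0)
z = np.zeros(2)  # start exactly at the saddle point, where grad_f(z) = 0
for _ in range(500):
    perturbation = 1e-3 * rng.normal(size=2)  # small isotropic noise
    z = z - 0.1 * (grad_f(z) + perturbation)
print(z)  # lands near one of the local minima (+1, 0) or (-1, 0)
```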