What happens if a convex objective is optimized by stochastic gradient descent? Is a global solution achieved?


Solution 1:

Eventually :). A convex objective function has no suboptimal local minima: every local minimum is a global minimum. So, from every point in the domain, the integral curve of the (negative) gradient vector field leads to a global minimum, and SGD approximately follows that curve as long as the learning rate is small enough. You do not even need to track the integral curve very closely; since a gradient integral curve leads to a global minimum from wherever you land, you just need to keep moving downhill.

The real issue is the rate of progress, which can easily stall in regions where the gradient is small. Avoiding this also depends on the learning rate, which makes the learning rate critical to the practical use of SGD. I believe there are theoretical guarantees of convergence if you decay the learning rate inversely with time; roughly, the step sizes must sum to infinity while their squares sum to something finite, which a 1/t schedule satisfies. The best places to look for specific results are Léon Bottou's papers and the Pegasos paper.

The current focus of research seems to be on adapting the learning rate during training based on the learning progress so far. An early version of this was the passive-aggressive perceptron. More recent methods like "natural gradient" and the AROW algorithm adaptively maintain a separate learning rate for each component of the gradient.
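To make the 1/t schedule concrete, here is a minimal NumPy sketch (mine, not from any of the papers above) of SGD on a convex least-squares objective. The problem sizes, the base rate eta0, and the decay constant are illustrative choices, not prescribed values; any schedule meeting the conditions above should converge.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic convex problem: f(w) = (1/2n) * ||Xw - y||^2 has no
    # suboptimal local minima, so SGD only has to keep moving downhill.
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    eta0, decay = 0.5, 0.01              # illustrative values; tune per problem
    for t in range(1, 50_001):
        i = rng.integers(n)              # sample one example at random
        grad = (X[i] @ w - y[i]) * X[i]  # stochastic gradient of f at w
        eta = eta0 / (1.0 + decay * t)   # O(1/t) decay: sum(eta) diverges,
        w -= eta * grad                  # sum(eta^2) converges

    # Compare against the exact least-squares solution (the global minimum).
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]
    print("distance to global minimum:", np.linalg.norm(w - w_star))

With the decaying schedule the final iterate typically lands close to w_star; a fixed, large learning rate would instead keep bouncing around the minimum without settling.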