Motivation for Mirror-Descent

Solution 1:

The advantage of using mirror descent over gradient descent is that it takes into account the geometry of the problem through the potential function $\Phi$.

We can see mirror descent as a generalization of the projected gradient descent, which ordinarily is based on an assumed euclidean geometry. Projected gradient descent is a special case of mirror descent using the potential $\Phi = \|\cdot\|^2/2$. But sometimes there are other choices of $\Phi$ that can better model the geometry of our problem. Examples include $x\log x$ or $-\log x$, both of which are useful for problems over the positive orthant, simplex, etc.

To answer 1), it depends on the geometry of the problem. In high dimensions (large scale optimization), it can be advantageous to abandon the euclidean geometry to improve the constants in our convergence rates. As an example, consider the differentiable convex function $f$ on the Euclidean ball $B_{2,n}$ with $\|\nabla f(x)\|_{\infty}\leq 1$ for all $x\in B_{2,n}$. Then, $|\nabla f(x)|_2\leq \sqrt{n}$ and the projected gradient descent converges at a rate of $\sqrt{\frac{n}{k}}$. However, using mirror descent with an appropraitely chosen $\Phi$ we can get a rate of $\sqrt{\frac{\log(n)}{k}}$. When $n$ is quite large, this is a huge improvement. The key step is that our choice of $\Phi$ (potential) determines the new geometry.

To answer 2), many times when considering probability distributions the euclidean distance is lackluster and we prefer to use the so called Kullback Liebler divergence to measure the 'distance' between two probability distributions. In such scenarios, mirror descent with $\Phi$ set to be the entropy provides better constants in our convergence rates.

To answer 3), we map our primal point $x$ to the dual space (through the mapping $\nabla \Phi$) and take a step in the direction given by the gradient of our function, then we map back to the primal space (through $\nabla\Phi^{-1}$) after that step. For spaces which are reflexive (i.e. $X^{**} = X$) this is not a big deal, for example in Hilbert spaces we do not worry so much since the norm arises naturally from an inner product and our space is isomorphic to its dual already. However, in a more general context (any banach space), we have to be careful about handling this duality.

I know this is a vague answer that doesn't quite answer your questions, but I hope it helps you find the right way.

Motivation for Mirror-Descent

Solution 1:

Related

Recent Posts