Are gradient flows the quickest way to minimize a function for a short time?

Let $F:\mathbb{R}^n \to \mathbb{R}$ be a smooth function, and let $p \in \mathbb{R}^n$. Let $\alpha(t)$ be the solution to the negative gradient flow of $F$, i.e.

$$ \alpha(0)=p, \, \, \dot \alpha(t)=-\nabla F(\alpha(t)).$$

Let $\beta(t)$ be a smooth path starting at $p$ (i.e. $\beta(0)=p$) and suppose that $\|\dot \beta(t)\|=\|\dot \alpha(t)\|$.

Is it true that $F(\alpha(t)) \le F(\beta(t))$ for sufficiently small $t$?

We can assume that $\nabla F(p) \neq 0$, since otherwise $\alpha$ is constant, and then $\|\dot \beta \|=\|\dot \alpha\|=0$ implies $\beta$ is also constant.

It is easy to see that the answer is positive if $\dot \beta(0) \neq -\nabla F(p)$ (see details below). I am not sure what happens when $\dot \beta(0) = -\nabla F(p)$.

Details:

Write $G(t)=F(\beta(t))-F(\alpha(t)) $. Then,$G(0)=0$, and $$G'(0)=\langle \nabla F(p),\dot \beta(0) \rangle-\langle \nabla F(p),\dot \alpha(0) \rangle=\langle \nabla F(p),\dot \beta(0) \rangle+\|\nabla F(p)\|^2 \ge 0,$$

Since by the C-S inequality, we have

$\langle \nabla F(p),\dot \beta(0) \rangle \ge - \|\nabla F(p)\| \cdot \|\dot \beta(0)\|=-\|\nabla F(p)\| \cdot \|\dot \alpha(0)\|=-\|\nabla F(p)\|^2$, with equality if and only if $\dot \beta(0)=-\lambda \nabla F(p)$ for some positive scalar $\lambda$.

So, we showed that if $\dot \beta(0) \neq -\nabla F(p)$, then $G(0)=0,G'(0) >0$, hence $G(t)>G(0)$.


Analysis of the case $\dot \beta(0) = -\nabla F(p)$: Writing $$G'(t)=\langle \nabla F(\beta(t)),\dot \beta(t) \rangle-\langle \nabla F(\alpha(t)),\dot \alpha(t) \rangle$$

we get $$G''(t)= d^2F_{\beta(t)}(\dot \beta(t),\dot \beta(t))+\langle \nabla F(\beta(t)),\ddot \beta(t) \rangle -d^2F_{\alpha(t)}(\dot \alpha(t),\dot \alpha(t))-\langle \nabla F(\alpha(t)),\ddot \alpha(t) \rangle,$$

so $$ G''(0)= d^2F_{p}(-\nabla F(p),-\nabla F(p))+\langle \nabla F(p),\ddot \beta(0) \rangle -d^2F_{p}(-\nabla F(p),-\nabla F(p))-\langle \nabla F(p),\ddot \alpha(0) \rangle=\langle \nabla F(p),\ddot \beta(0) -\ddot \alpha(0) \rangle. \tag{1}$$

Now, we use our assumption that $\langle \dot \beta(t),\dot \beta(t) \rangle=\langle \dot \alpha(t),\dot \alpha(t) \rangle$. Differentiating this, we obtain

$$ \langle \dot \beta(0),\ddot \beta(0) \rangle=\langle \dot \alpha(0),\ddot \alpha(0) \rangle \tag{2},$$

which really means $$ \langle \nabla F(p),\ddot \beta(0) \rangle=\langle \nabla F(p),\ddot \alpha(0) \rangle. \tag{3}$$

Combining $(1)$ and $(3)$ implies that $G''(0)=0$, so this does not seem to help us.

Do we need to proceed to third derivatives? It seems interesting to see if using $\|\dot \beta \|=\|\dot \alpha\|$ we can express neatly $G'''(0)$.

Edit:

As commented by Anthony Carapetis, this "differential analysis" approach is doomed to fail: Indeed, if we want to show $G(t)\ge 0$ by examining derivatives of $G$ at zero, we will have to show that the first non-zero derivative is strictly positive. However, $\alpha$ and $\beta$ may have arbitrarily many derivatives agreeing at zero. (they can even agree on the derivatives of all orders).


Consider the function $F=-\theta$ aka the negative polar angle (taken between $-\pi$ and $\pi$) near $p$ with $x=1, y=0$ (happens to also be $r=1, \theta=0$). Then the negative gradient of $F$ is $\frac{1}{r} \hat{\theta}$ and the gradient flowline through $p$ is the unit speed parametrized arc of unit circle. Now consider $\beta(t)$ with $\theta(t)=t$, $r(t)=1-t^3$. The arclength from $t=0$ to $t=T$ is $\int_0^T \sqrt{(1-t^3)^2+9t^4}dt$ which for small positive $T$ is less than $T$. So $\beta$ gets to same levels of $F$ as $\alpha$ using less length, which means if we reparametrize it by arclength it will descend through levels of $F$ faster than $\alpha$.

I guess this means one should proceed to third derivatives, just to get the opposite conclusion from what you expected.

Some extra explanations: This was against my initial intuition also. I actually started trying to prove grad flow was locally optimal (or, more precisely, figure out what will happen); I changed coordinates to make $F$ one of coordinate functions, say $x_0$, and to make gradient flowline through p a coordinate line; then I pulled back the Euclidean metric so see what kind of length integrals I would get for curves parametrized by $x_0$. Then I noticed, roughy, that the Euclidean formula "$ds^2=\sqrt{x_0^2+ |\vec{x}|^2}$" means that sideways motion only “costs quadratically” in length, but can affect the metric tensor linearly; this means that if metric has smaller $g_{00}$ component in some sideways direction, going there is a net win. In terms of the original coordinates this means that the level sets of $F$ should be more closely spaced on some side of the flow line through p — i.e. the flow line should have curvature. What I have in the answer is the simplest example of this. In fact, this more or less says that this behavior is universal, I think.

Aside: This "quadratic-linear tradeoff" is one of my favorite motifs -- this is the reason horizontal garlands will hang fairly low when loosened by very small amount from being perfectly tightly strung; it is also "why" when you write the gradient in Cartesian coordinates you don't just go in the coordinate direction with the highest partial derivative, but trade off going sideways for quadratically small cost to get a linear gain -- the trade off balancing out precisely so that you end up going in the gradient direction.