Tricky proof of a result of Michael Nielsen's book "Neural Networks and Deep Learning".
In his free online book, "Neural Networks and Deep Learning", Michael Nielsen asks the reader to prove the following result:
If $C$ is a cost function which depends on $v_{1}, v_{2}, \ldots, v_{n}$, he states that we move in a direction $\Delta v$ chosen to decrease $C$ as much as possible, which amounts to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. So if $\lVert\Delta v\rVert = \epsilon$ for some small $\epsilon > 0$, it can be proved that the choice of $\Delta v$ that minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \lVert \nabla C \rVert$. He suggests using the Cauchy–Schwarz inequality to prove this.
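To make the claim concrete, here is a minimal numerical sketch of the step rule; the quadratic cost, the point $v$, and the value of $\epsilon$ below are made up purely for illustration and are not from the book:

```python
import numpy as np

# Made-up quadratic cost C(v) = v1^2 + 2*v2^2, purely for illustration.
def grad_C(v):
    return np.array([2.0 * v[0], 4.0 * v[1]])

v = np.array([1.0, 1.0])
eps = 0.01                          # fixed step length, ||Δv|| = ϵ

g = grad_C(v)
eta = eps / np.linalg.norm(g)       # η = ϵ / ||∇C||
delta_v = -eta * g                  # claimed optimal step Δv = -η∇C

print(np.linalg.norm(delta_v))      # ≈ ϵ, as required
print(g @ delta_v)                  # ≈ -ϵ||∇C||, i.e. ΔC is as negative as possible
```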
Ok, so what I've done is to minimize with respect to $\Delta v$ the equivalent quantity $0 = \min_{\Delta v} \lvert \nabla C \cdot \Delta v \rvert^{2} \leq \min_{\Delta v} \lVert \nabla C \rVert^{2}\lVert \Delta v\rVert^{2}$ (using the Cauchy–Schwarz inequality). I believe this is the right path to prove the result, but I'm stuck and can't arrive at the same answer.
Thanks.
Goal: We want to minimize ΔC ≈ ∇C ⋅ Δv by finding some value for Δv that does the trick.
Given: ||Δv|| = ϵ for some small fixed ϵ ∈ ℝ > 0 (this is our fixed “step size” by which we’ll move down the error surface of C).
How should we move v (what should Δv be?) to decrease C as much as possible?
Claim: The optimal value is Δv = -η∇C where η = ϵ / ||∇C||, or equivalently, Δv = -ϵ∇C / ||∇C||
Proof:
1) What is the minimum of ∇C ⋅ Δv? By Cauchy-Schwarz inequality we know that:
|∇C ⋅ Δv| ≤ ||∇C|| * ||Δv||
∴ the smallest value that ∇C ⋅ Δv could possibly take is -||∇C|| * ||Δv|| (recall: if |x| ≤ 2, then x ∈ [-2, 2])
2) By substitution, we want some value for Δv such that:
∇C ⋅ Δv = -||∇C|| * ||Δv||
i.e. ∇C ⋅ Δv = -||∇C||ϵ (since ||Δv|| = ϵ)
3) Consider the following:
∇C ⋅ ∇C = ||∇C||² (because ||∇C|| = sqrt(∇C ⋅ ∇C))
∴ (∇C ⋅ ∇C) / ||∇C|| = ||∇C||
4) Now multiply both sides by -ϵ: (-ϵ∇C ⋅ ∇C) / ||∇C|| = -ϵ||∇C||
Notice that the right hand side of this equality is the same as in (2).
5) Rewrite the left hand side of (4) to separate one of the ∇C’s. The other term will be our Δv such that ∇C ⋅ Δv = -||∇C||ϵ.
((-ϵ∇C) / ||∇C||) ⋅ ∇C = -ϵ||∇C||
∴ because the Cauchy–Schwarz inequality told us the smallest that ∇C ⋅ Δv could possibly be (by considering |∇C ⋅ Δv|), and Δv = (-ϵ∇C) / ||∇C|| attains exactly that value, this choice of Δv is the minimizer ∎
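If it helps to see the inequality at work numerically, here is a quick sanity check; the random gradient, the dimension, and ϵ are arbitrary, and this only illustrates the proof, it does not replace it:

```python
import numpy as np

rng = np.random.default_rng(0)

grad = rng.normal(size=5)        # stand-in for ∇C at the current point
eps = 0.1                        # step length ϵ

# Claimed minimizer: Δv = -ϵ∇C / ||∇C||
best = -eps * grad / np.linalg.norm(grad)
best_val = grad @ best           # equals -ϵ||∇C||

# No random direction of the same length ϵ does better
for _ in range(10_000):
    d = rng.normal(size=5)
    d = eps * d / np.linalg.norm(d)
    assert grad @ d >= best_val - 1e-12

print(best_val, -eps * np.linalg.norm(grad))   # the two values agree
```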
Suppose, for the sake of contradiction, that
∇C ⋅ Δv < ∇C ⋅ ( -ϵ∇C / ||∇C|| )
Note that the right-hand side is ≤ 0, so the left-hand side is too.
Changing signs, we get
-∇C ⋅ Δv > ∇C ⋅ ( ϵ∇C / ||∇C|| )
Note that both sides are now ≥ 0.
Taking absolute values, we get
| ∇C ⋅ Δv | > ∇C ⋅ ( ϵ∇C / ||∇C|| )
Given that ||Δv|| = ϵ (by hypothesis) and ||a|| = (a ⋅ a) / ||a||, the right-hand side equals ϵ||∇C|| = ||∇C|| * ||Δv||, so we get
| ∇C ⋅ Δv | > || ∇C || * || Δv ||
This is absurd, because by the Cauchy–Schwarz inequality |∇C ⋅ Δv| ≤ ||∇C|| * ||Δv||. Hence no such Δv exists, and Δv = -ϵ∇C / ||∇C|| is indeed the minimizer. ∎
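For completeness, the same conclusion also falls out of the equality condition in Cauchy–Schwarz; here is a sketch in the same notation as above (assuming $\nabla C \neq 0$):

$$\nabla C \cdot \Delta v \;\geq\; -\lvert \nabla C \cdot \Delta v \rvert \;\geq\; -\lVert \nabla C \rVert \, \lVert \Delta v \rVert \;=\; -\epsilon \lVert \nabla C \rVert,$$

with equality throughout iff $\Delta v = -\lambda \nabla C$ for some $\lambda \geq 0$. The constraint $\lVert \Delta v \rVert = \epsilon$ then forces $\lambda = \epsilon / \lVert \nabla C \rVert$, i.e. $\Delta v = -\epsilon \nabla C / \lVert \nabla C \rVert$.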