Tricky proof of a result of Michael Nielsen's book "Neural Networks and Deep Learning".
In his free online book, "Neural Networks and Deep Learning", Michael Nielsen asks the reader to prove the following result:
If $C$ is a cost function which depends on $v_{1}, v_{2}, \ldots, v_{n}$, he states that we move in a direction $\Delta v$ chosen to decrease $C$ as much as possible, which amounts to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. So if $\lVert\Delta v\rVert = \epsilon$ for some small $\epsilon > 0$, it can be proved that the choice of $\Delta v$ that minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \lVert \nabla C \rVert$. He suggests using the Cauchy–Schwarz inequality to prove this.
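To make the claim concrete, here is a minimal numerical sketch of the step rule; the quadratic cost, the point $v$, and the value of $\epsilon$ below are made up purely for illustration and are not from the book:

```python
import numpy as np

# Made-up quadratic cost C(v) = v1^2 + 2*v2^2, purely for illustration.
def grad_C(v):
    return np.array([2.0 * v[0], 4.0 * v[1]])

v = np.array([1.0, 1.0])
eps = 0.01                          # fixed step length, ||Δv|| = ϵ

g = grad_C(v)
eta = eps / np.linalg.norm(g)       # η = ϵ / ||∇C||
delta_v = -eta * g                  # claimed optimal step Δv = -η∇C

print(np.linalg.norm(delta_v))      # ≈ ϵ, as required
print(g @ delta_v)                  # ≈ -ϵ||∇C||, i.e. ΔC is as negative as possible
```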
Ok, so what I've done is to minimize with respect to $\Delta v$ the equivalent quantity $0 = \min_{\Delta v} \lvert \nabla C \cdot \Delta v \rvert^{2} \leq \min_{\Delta v} \lVert \nabla C \rVert^{2}\lVert \Delta v\rVert^{2}$ (using the Cauchy–Schwarz inequality). I believe this is the right path to prove the result, but I'm stuck and can't arrive at the same answer.
Thanks.
Goal: We want to minimize ΔC ≈ ∇C ⋅ Δv by finding some value for Δv that does the trick.
Given: ||Δv|| = ϵ for some small fixed ϵ ∈ ℝ > 0 (this is our fixed “step size” by which we’ll move down the error surface of C).
How should we move v (what should Δv be?) to decrease C as much as possible?
Claim: The optimal value is Δv = -η∇C where η = ϵ / ||∇C||, or equivalently, Δv = -ϵ∇C / ||∇C||
Proof:
1) What is the minimum of ∇C ⋅ Δv? By Cauchy-Schwarz inequality we know that:
|∇C ⋅ Δv| ≤ ||∇C|| * ||Δv||
∴ the smallest value that ∇C ⋅ Δv could possibly take is -||∇C|| * ||Δv|| (recall: if |x| ≤ 2, then x ∈ [-2, 2])
2) By substitution, we want some value for Δv such that:
∇C ⋅ Δv = -||∇C|| * ||Δv||
i.e. ∇C ⋅ Δv = -||∇C||ϵ (since ||Δv|| = ϵ)
3) Consider the following:
∇C ⋅ ∇C = ||∇C||² (because ||∇C|| = sqrt(∇C ⋅ ∇C))
∴ (∇C ⋅ ∇C) / ||∇C|| = ||∇C||
4) Now multiply both sides by -ϵ: (-ϵ∇C ⋅ ∇C) / ||∇C|| = -ϵ||∇C||
Notice that the right hand side of this equality is the same as in (2).
5) Rewrite the left hand side of (4) to separate one of the ∇C’s. The other term will be our Δv such that ∇C ⋅ Δv = -||∇C||ϵ.
((-ϵ∇C) / ||∇C||) ⋅ ∇C = -ϵ||∇C||
∴ because the Cauchy–Schwarz inequality told us the smallest that ∇C ⋅ Δv could possibly be (by considering |∇C ⋅ Δv|), and Δv = (-ϵ∇C) / ||∇C|| attains exactly that value, this choice of Δv is the minimizer ∎
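If it helps to see the inequality at work numerically, here is a quick sanity check; the random gradient, the dimension, and ϵ are arbitrary, and this only illustrates the proof, it does not replace it:

```python
import numpy as np

rng = np.random.default_rng(0)

grad = rng.normal(size=5)        # stand-in for ∇C at the current point
eps = 0.1                        # step length ϵ

# Claimed minimizer: Δv = -ϵ∇C / ||∇C||
best = -eps * grad / np.linalg.norm(grad)
best_val = grad @ best           # equals -ϵ||∇C||

# No random direction of the same length ϵ does better
for _ in range(10_000):
    d = rng.normal(size=5)
    d = eps * d / np.linalg.norm(d)
    assert grad @ d >= best_val - 1e-12

print(best_val, -eps * np.linalg.norm(grad))   # the two values agree
```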
Suppose, for the sake of contradiction, that
∇C ⋅ Δv < ∇C ⋅ ( -ϵ∇C / ||∇C|| )
Note that the right-hand side is ≤ 0, so the left-hand side is too.
Changing signs, we get
-∇C ⋅ Δv > ∇C ⋅ ( ϵ∇C / ||∇C|| )
Note that both sides are now ≥ 0.
Taking absolute values, we get
| ∇C ⋅ Δv | > ∇C ⋅ ( ϵ∇C / ||∇C|| )
Given that ||Δv|| = ϵ (by hypothesis) and ||a|| = (a ⋅ a) / ||a||, the right-hand side equals ϵ||∇C|| = ||∇C|| * ||Δv||, so we get
| ∇C ⋅ Δv | > || ∇C || * || Δv ||
This is absurd, because by the Cauchy–Schwarz inequality |∇C ⋅ Δv| ≤ ||∇C|| * ||Δv||. Hence no such Δv exists, and Δv = -ϵ∇C / ||∇C|| is indeed the minimizer. ∎
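For completeness, the same conclusion also falls out of the equality condition in Cauchy–Schwarz; here is a sketch in the same notation as above (assuming $\nabla C \neq 0$):

$$\nabla C \cdot \Delta v \;\geq\; -\lvert \nabla C \cdot \Delta v \rvert \;\geq\; -\lVert \nabla C \rVert \, \lVert \Delta v \rVert \;=\; -\epsilon \lVert \nabla C \rVert,$$

with equality throughout iff $\Delta v = -\lambda \nabla C$ for some $\lambda \geq 0$. The constraint $\lVert \Delta v \rVert = \epsilon$ then forces $\lambda = \epsilon / \lVert \nabla C \rVert$, i.e. $\Delta v = -\epsilon \nabla C / \lVert \nabla C \rVert$.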