Expectation of gradient in stochastic gradient descent algorithm

Solution 1:

Let's assume we are talking about stochastic gradient descent where we update the weights based on a single example (not a minibatch), drawn from a data set of size $N$.

The total error over the whole data set is given by: $$ L(w) = \frac{1}{N}\sum\limits_{n=1}^{N} L_n(w) $$ Then, at every step, a random sample point $n\sim U$ is chosen, where $U$ denotes the uniform distribution over $\{1,\dots,N\}$, and the weights are updated via: $$ w \leftarrow w - \gamma\nabla L_n(w) $$ with step size $\gamma > 0$. We want to know whether $\mathbb{E}_{n\sim U}[\nabla L_n(w)]=\nabla L(w)$, i.e. whether the stochastic gradient is an unbiased estimate of the full gradient.
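For concreteness, here is a minimal sketch of this sampling-and-update loop in Python; the least-squares losses, random data, and learning rate `gamma` are illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares losses: L_n(w) = 0.5 * (x_n . w - y_n)^2
N, d = 100, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

def grad_Ln(w, n):
    """Gradient of the per-example loss L_n at w."""
    return (X[n] @ w - y[n]) * X[n]

w = np.zeros(d)
gamma = 0.01  # step size

for step in range(1000):
    n = rng.integers(N)            # n ~ Uniform over the data set
    w = w - gamma * grad_Ln(w, n)  # w <- w - gamma * grad L_n(w)
```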

We show this as follows: \begin{align} \mathbb{E}_{n\sim U}[\nabla L_n(w)] &= \nabla\; \mathbb{E}_{n\sim U}[ L_n(w)] \\ &= \nabla \sum\limits_{i=1}^N P(n=i) L_i(w)\\ &= \nabla \frac{1}{N}\sum\limits_{i=1}^{N} L_i(w)\\ &= \nabla L(w) \end{align}
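To make the identity concrete, here is a small numerical check (again with an assumed least-squares setup): averaging the per-example gradients under the uniform distribution reproduces the gradient of the averaged loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares losses: L_n(w) = 0.5 * (x_n . w - y_n)^2
N, d = 100, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)  # an arbitrary point at which to compare gradients

def grad_Ln(n):
    """Gradient of the per-example loss L_n at w."""
    return (X[n] @ w - y[n]) * X[n]

def grad_L():
    """Gradient of the averaged loss L(w) = (1/N) * sum_n L_n(w) at w."""
    return X.T @ (X @ w - y) / N

# E_{n~U}[grad L_n(w)] is the uniform average of the per-example gradients.
expected_grad = np.mean([grad_Ln(n) for n in range(N)], axis=0)

print(np.allclose(expected_grad, grad_L()))  # True
```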

The first step is the only one that needs real justification, and in the discrete case it is immediate: the expectation is a finite sum, and the gradient is linear. More generally, the gradient and the expectation can be interchanged provided $L$ is sufficiently smooth and bounded (which it typically is in practice).

The remaining steps are just the definition of expectation for a discrete (here, uniform) distribution; the same argument goes through for continuous sampling distributions, with the sum replaced by an integral.