Expectation of gradient in stochastic gradient descent algorithm

Solution 1:

Let's assume we are talking about stochastic gradient descent where we update the weights based on a single example (not a minibatch), drawn from a data set of size $N$.

The total error over the whole data set is given by: $$ L(w) = \frac{1}{N}\sum\limits_{n=1}^{N} L_n(w) $$ Then, at every step, a random sample point $n\sim U$ is chosen, where $U$ denotes the uniform distribution over $\{1,\dots,N\}$, and the weights are updated via: $$ w \leftarrow w - \gamma\nabla L_n(w) $$ with step size $\gamma > 0$. We want to know whether $\mathbb{E}_{n\sim U}[\nabla L_n(w)]=\nabla L(w)$, i.e. whether the stochastic gradient is an unbiased estimate of the full gradient.
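For concreteness, here is a minimal sketch of this sampling-and-update loop in Python; the least-squares losses, random data, and learning rate `gamma` are illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares losses: L_n(w) = 0.5 * (x_n . w - y_n)^2
N, d = 100, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

def grad_Ln(w, n):
    """Gradient of the per-example loss L_n at w."""
    return (X[n] @ w - y[n]) * X[n]

w = np.zeros(d)
gamma = 0.01  # step size

for step in range(1000):
    n = rng.integers(N)            # n ~ Uniform over the data set
    w = w - gamma * grad_Ln(w, n)  # w <- w - gamma * grad L_n(w)
```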

We show this as follows: \begin{align} \mathbb{E}_{n\sim U}[\nabla L_n(w)] &= \nabla\; \mathbb{E}_{n\sim U}[ L_n(w)] \\ &= \nabla \sum\limits_{i=1}^N P(n=i) L_i(w)\\ &= \nabla \frac{1}{N}\sum\limits_{i=1}^{N} L_i(w)\\ &= \nabla L(w) \end{align}
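To make the identity concrete, here is a small numerical check (again with an assumed least-squares setup): averaging the per-example gradients under the uniform distribution reproduces the gradient of the averaged loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares losses: L_n(w) = 0.5 * (x_n . w - y_n)^2
N, d = 100, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)  # an arbitrary point at which to compare gradients

def grad_Ln(n):
    """Gradient of the per-example loss L_n at w."""
    return (X[n] @ w - y[n]) * X[n]

def grad_L():
    """Gradient of the averaged loss L(w) = (1/N) * sum_n L_n(w) at w."""
    return X.T @ (X @ w - y) / N

# E_{n~U}[grad L_n(w)] is the uniform average of the per-example gradients.
expected_grad = np.mean([grad_Ln(n) for n in range(N)], axis=0)

print(np.allclose(expected_grad, grad_L()))  # True
```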

The first step is the only one that needs real justification, and in the discrete case it is immediate: the expectation is a finite sum, and the gradient is linear. More generally, the gradient and the expectation can be interchanged provided $L$ is sufficiently smooth and bounded (which it typically is in practice).

The remaining steps are just the definition of expectation for a discrete (here, uniform) distribution; the same argument goes through for continuous sampling distributions, with the sum replaced by an integral.