How are parameters updated with SGD-Optimizer?
Solution 1:
That formula applies to both gradient descent and stochastic gradient descent (SGD). The difference between the two is that in SGD the loss is computed over a random subset of the training data (i.e. a mini-batch) rather than over all of the training data as in traditional gradient descent. So in SGD, x and y correspond to a subset of the training data and labels, whereas in gradient descent they correspond to all of the training data and labels.
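To make the batch distinction concrete, here's a minimal NumPy sketch. The data, shapes, and the mean-squared-error loss are all illustrative assumptions, not something from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 100 samples, 3 features (assumed, for demonstration only)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_mse(theta, x, y):
    """Gradient of mean-squared-error loss w.r.t. theta."""
    return 2.0 / len(y) * x.T @ (x @ theta - y)

theta = np.zeros(3)

# Gradient descent: gradient over ALL the training data
g_full = grad_mse(theta, X, y)

# SGD: gradient over a random mini-batch of the training data
idx = rng.choice(len(y), size=16, replace=False)
g_mini = grad_mse(theta, X[idx], y[idx])

# Both gradients have the same shape as theta; the mini-batch one
# is a noisy estimate of the full-batch one.
print(g_full.shape, g_mini.shape)  # (3,) (3,)
```

The only difference between the two updates is which rows of the data the gradient is averaged over.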
θ represents the parameters of the model. Mathematically this is usually modeled as a single vector containing all the parameters of the model (all the weights, biases, etc.). When you compute the gradient of the loss (a scalar) w.r.t. θ, you get a vector containing the partial derivative of the loss w.r.t. each element of θ. So ∇L(θ; x, y) is just a vector, the same size as θ. If we were to assume that the loss were a linear function of θ, then this gradient would point in the direction in parameter space that results in the maximal increase in loss, with a magnitude corresponding to the expected increase in loss if we took a step of size 1 in that direction. Since the loss isn't actually a linear function, and we actually want to decrease it, we instead take a smaller step in the opposite direction, hence the η and the minus sign.
It's also worth pointing out that, mathematically, the form you've given is a bit problematic. We wouldn't usually write it like this, since assignment and equality aren't the same thing. As written, the equation would seem to imply that the θ on the left-hand and right-hand sides are the same. They are not. The θ on the left side of the equals sign represents the value of the parameters after taking a step, and the θs on the right side correspond to the parameters before taking the step. We could be more clear by writing it with subscripts:

θ_{t+1} = θ_t − η ∇L(θ_t; x, y)

where θ_{t} is the parameter vector at step t and θ_{t+1} is the parameter vector one step later.
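In code the assignment form is natural, since `theta = theta - ...` really is an update from θ_t to θ_{t+1}. A small sketch of the full SGD loop, with assumed data and learning rate (all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])  # noiseless labels with assumed true weights

def grad(theta, x, y):
    """Gradient of mean-squared-error loss over the given (mini-)batch."""
    return 2.0 / len(y) * x.T @ (x @ theta - y)

eta = 0.05
theta = np.zeros(2)  # theta_0
for t in range(500):
    # Draw a fresh random mini-batch at every step
    idx = rng.choice(len(y), size=32, replace=False)
    # theta_{t+1} = theta_t - eta * grad L(theta_t; x_batch, y_batch)
    theta = theta - eta * grad(theta, X[idx], y[idx])

print(np.round(theta, 2))  # converges to roughly [ 2. -1.]
```

Each pass through the loop body is exactly one application of θ_{t+1} = θ_t − η∇L(θ_t; x, y); the variable `theta` simply holds the latest θ_t.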