How are parameters updated with SGD-Optimizer?
Solution 1:
That formula applies to both gradient descent and stochastic gradient descent (SGD). The difference between the two is that in SGD the loss is computed over a random subset of the training data (i.e. a mini-batch) rather than over all of the training data as in traditional gradient descent. So in SGD, x and y correspond to a subset of the training data and labels, whereas in gradient descent they correspond to all of the training data and labels.
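To make the batch distinction concrete, here's a minimal NumPy sketch. The data, shapes, and the mean-squared-error loss are all illustrative assumptions, not something from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 100 samples, 3 features (assumed, for demonstration only)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_mse(theta, x, y):
    """Gradient of mean-squared-error loss w.r.t. theta."""
    return 2.0 / len(y) * x.T @ (x @ theta - y)

theta = np.zeros(3)

# Gradient descent: gradient over ALL the training data
g_full = grad_mse(theta, X, y)

# SGD: gradient over a random mini-batch of the training data
idx = rng.choice(len(y), size=16, replace=False)
g_mini = grad_mse(theta, X[idx], y[idx])

# Both gradients have the same shape as theta; the mini-batch one
# is a noisy estimate of the full-batch one.
print(g_full.shape, g_mini.shape)  # (3,) (3,)
```

The only difference between the two updates is which rows of the data the gradient is averaged over.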
θ represents the parameters of the model. Mathematically this is usually modeled as a single vector containing all the parameters of the model (all the weights, biases, etc.). When you compute the gradient of the loss (a scalar) w.r.t. θ, you get a vector containing the partial derivative of the loss w.r.t. each element of θ. So ∇L(θ; x, y) is just a vector, the same size as θ. If we were to assume that the loss were a linear function of θ, then this gradient would point in the direction in parameter space that results in the maximal increase in loss, with a magnitude corresponding to the expected increase in loss if we took a step of size 1 in that direction. Since the loss isn't actually a linear function, and we actually want to decrease it, we instead take a smaller step in the opposite direction, hence the η and the minus sign.
It's also worth pointing out that, mathematically, the form you've given is a bit problematic. We wouldn't usually write it like this, since assignment and equality aren't the same thing. As written, the equation would seem to imply that the θ on the left-hand and right-hand sides are the same. They are not. The θ on the left side of the equals sign represents the value of the parameters after taking a step, and the θs on the right side correspond to the parameters before taking the step. We could be more clear by writing it with subscripts:

θ_{t+1} = θ_t − η ∇L(θ_t; x, y)

where θ_{t} is the parameter vector at step t and θ_{t+1} is the parameter vector one step later.
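In code the assignment form is natural, since `theta = theta - ...` really is an update from θ_t to θ_{t+1}. A small sketch of the full SGD loop, with assumed data and learning rate (all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])  # noiseless labels with assumed true weights

def grad(theta, x, y):
    """Gradient of mean-squared-error loss over the given (mini-)batch."""
    return 2.0 / len(y) * x.T @ (x @ theta - y)

eta = 0.05
theta = np.zeros(2)  # theta_0
for t in range(500):
    # Draw a fresh random mini-batch at every step
    idx = rng.choice(len(y), size=32, replace=False)
    # theta_{t+1} = theta_t - eta * grad L(theta_t; x_batch, y_batch)
    theta = theta - eta * grad(theta, X[idx], y[idx])

print(np.round(theta, 2))  # converges to roughly [ 2. -1.]
```

Each pass through the loop body is exactly one application of θ_{t+1} = θ_t − η∇L(θ_t; x, y); the variable `theta` simply holds the latest θ_t.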