Taking derivatives of the $L_0$-norm, $L_1$-norm, $L_2$-norm

I am a little confused about taking derivatives of these norms.

$L_0$-norm: $\|x\|_0$ is the number of non-zero elements of a vector. Say I am interested in a particular $x_i$.

$$\displaystyle\min_{x_i}(y_i-x_i)^2+c\|x_i \|_{0}$$ Does the answer depend on whether $x_i=0$ or not?
My work: $\|x_i\|_0$ is a constant, so its derivative is 0?

$L_1$-norm: Manhattan distance. What should I do? $$\displaystyle\min_{x_i}(y_i-x_i)^2+c\|x_i \|_{1}$$

$L_2$-norm: Euclidean distance. What should I do?
$$\displaystyle\min_{x_i}(y_i-x_i)^2+c\|x_i \|_{2}$$


Solution 1:

I'm going to assume that you're talking about partial derivatives and gradients. All of the norm functions that you stated are non-differentiable somewhere:

  • [$L_0$] This is zero (as you pointed out), but only in places where it isn't interesting. Where it is interesting it's not differentiable (it has jump discontinuities).
  • [$L_1$] This norm is not differentiable with respect to a coordinate at any point where that coordinate is zero. Elsewhere, the partial derivatives are just constants, $\pm 1$ depending on the sign of that coordinate.
  • [$L_2$] Usually people use the 2-norm squared so that it's differentiable even at zero. The gradient of $\|x\|_2^2$ is $2x$, but without the square it's $x/\|x\|_2$ (i.e. it just points away from zero). The problem is that it's not differentiable at zero. (A quick numerical check of these gradient formulas follows this list.)
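
These gradient formulas are easy to sanity-check numerically. Here is a minimal sketch (assuming NumPy, an arbitrary nonzero point `x`, and helper names of my own) that compares the analytic gradients of $\|x\|_2^2$ and $\|x\|_2$ against central finite differences:

```python
import numpy as np

def grad_sq_l2(x):
    """Analytic gradient of ||x||_2^2, which is 2x."""
    return 2 * x

def grad_l2(x):
    """Analytic gradient of ||x||_2 away from zero, which is x / ||x||_2."""
    return x / np.linalg.norm(x)

def finite_diff(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0, 3.0])  # any nonzero point

print(np.allclose(grad_sq_l2(x), finite_diff(lambda v: np.dot(v, v), x)))  # True
print(np.allclose(grad_l2(x), finite_diff(np.linalg.norm, x)))             # True
```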

If you're trying to do gradient descent with these formulations, the only one that will really work is the squared $L_2$ norm, because that's the only one that's differentiable everywhere.
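
To make that concrete, here is a small sketch (the values of `y`, `c`, the step size, and the iteration count are arbitrary illustrative choices) of gradient descent on the scalar objective $(y - x)^2 + c\,x^2$, i.e. the squared-$L_2$ version of the problem; the iterates converge to the closed-form minimizer $y/(1+c)$:

```python
y, c = 3.0, 0.5   # illustrative data point and penalty weight
step = 0.1        # step size small enough for this smooth objective

x = 0.0           # arbitrary starting point
for _ in range(200):
    grad = -2 * (y - x) + 2 * c * x   # gradient of (y - x)^2 + c * x^2
    x -= step * grad

print(x, y / (1 + c))   # both are approximately 2.0
```

Trying the same loop with the $L_1$ penalty instead would run into trouble exactly at the point you care about ($x = 0$), since the gradient of $c\,|x|$ does not exist there.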

Solution 2:

For the (squared) $L_2$ and the $L_1$ penalties you can still run gradient-type methods (for $L_1$ you need subgradient or proximal steps, since it is not differentiable at zero); the resulting regression problems are called Ridge Regression and LASSO, respectively.
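
In the scalar setting of the question, all three penalized problems even have closed-form minimizers, so no iterative method is needed. The sketch below (helper names are my own; the ridge case uses the squared $L_2$ penalty, as in Solution 1) works out the resulting shrinkage, soft-thresholding, and hard-thresholding rules:

```python
import numpy as np

def ridge_scalar(y, c):
    """Minimizer of (y - x)^2 + c * x^2: shrink y toward zero."""
    return y / (1 + c)

def lasso_scalar(y, c):
    """Minimizer of (y - x)^2 + c * |x|: soft-threshold y at c / 2."""
    return np.sign(y) * max(abs(y) - c / 2, 0.0)

def l0_scalar(y, c):
    """Minimizer of (y - x)^2 + c * 1[x != 0]: hard-threshold y at sqrt(c)."""
    return y if y**2 > c else 0.0

y, c = 1.0, 0.8
print(ridge_scalar(y, c))   # 0.555...: shrunk, but never exactly zero
print(lasso_scalar(y, c))   # 0.6: shrunk, and exactly zero once |y| <= c/2
print(l0_scalar(y, c))      # 1.0: kept as-is, and zero once y^2 <= c
```

The soft-threshold is the reason the $L_1$ penalty produces exact zeros, while the squared $L_2$ penalty only shrinks the coefficient.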