L1 norm and L2 norm

Solution 1:

Let me highlight the parts of the sentence that should be grouped together:

The amplitude distribution of the optimal residual for the l1-norm approximation problem will tend to have more (zero and very small residuals), compared to the l2-norm approximation solution. In contrast, the l2-norm solution will tend to have relatively fewer (large residuals) (since large residuals incur a much larger penalty in l2-norm approximation than in l1-norm approximation).
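To make the "much larger penalty" remark concrete, here is a small sketch of what each norm charges for a single residual. The function names `l1_penalty` and `l2_penalty` and the two sample residuals are my own choices for illustration, not something from the quoted text.

```python
# Penalty each approach assigns to one residual r:
# l1-norm approximation charges |r|, l2-norm (least squares) charges r**2.
def l1_penalty(r):
    return abs(r)

def l2_penalty(r):
    return r * r

# One small residual and one large residual (illustrative values).
for r in (0.5, 4.0):
    print(r, l1_penalty(r), l2_penalty(r))
# prints:
# 0.5 0.5 0.25
# 4.0 4.0 16.0
```

For the small residual, squaring makes the cost smaller than the absolute value, so l2 has little incentive to push it all the way to zero. For the large residual, squaring makes the cost much bigger, so l2 works hard to shrink it.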

This doesn't mean that you won't see large residuals in l1-norm problems (you have to kind of read between the lines). It means that minimizing the l1 error will tend to produce solutions that have:

  • a few residuals that are larger and
  • lots of very insignificant residuals.

In other words, the distribution of residuals will be very "spiky." (This is good, for example, when you want to be robust to outliers -- this method "lets" you have a few large residuals (i.e., large errors) while keeping most of the errors small.)

Minimizing the l2 error, on the other hand, will tend to produce:

  • very few big residuals, because they're penalized a lot more,
  • but at the cost of having lots more small residuals that are still significant.

In other words, the distribution of residuals will be far less "spiky" and more "even." (This is good when you have no outliers and you want to keep the overall squared error small -- it will produce a better "fit" in the least-squares sense.)
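Here is a minimal sketch of the contrast, using the simplest approximation problem I can think of: fitting a single constant to some numbers. In that special case the l1 solution is the median and the l2 solution is the mean. The data, the 0.25 threshold for "small", and the helper names are made up for illustration.

```python
import statistics

# Hypothetical data: four values near 1 and one outlier at 10.
data = [1.0, 1.25, 0.75, 1.0, 10.0]

# Fitting one constant c to the data:
#   minimizing sum(|x - c|)     is solved by the median (l1 fit),
#   minimizing sum((x - c)**2)  is solved by the mean   (l2 fit).
c_l1 = statistics.median(data)
c_l2 = statistics.mean(data)

res_l1 = [x - c_l1 for x in data]
res_l2 = [x - c_l2 for x in data]

def count_small(residuals, tol=0.25):
    """Number of residuals with magnitude at most tol."""
    return sum(1 for r in residuals if abs(r) <= tol)

def largest(residuals):
    """Magnitude of the largest residual."""
    return max(abs(r) for r in residuals)

print("l1 fit:", c_l1, "small:", count_small(res_l1), "largest:", largest(res_l1))
print("l2 fit:", c_l2, "small:", count_small(res_l2), "largest:", largest(res_l2))
```

With this data the l1 fit sits at 1.0, leaving four residuals at or below 0.25 and one residual of 9 at the outlier. The l2 fit sits near 2.8, which brings the largest residual down to about 7.2 but leaves every residual larger than 1.5 in magnitude.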

Solution 2:

In many situations, the data behave like remainders mod 9 or mod 90. In those cases, it is better, and correct, to use the L1 norm. The remainders must expand to their full potential, and they must do so without power functions [the L2 norm involves squaring].