Why are the additional constraint and the penalty term equivalent in ridge regression?

Solution 1:

Let us first define the two problems:

  • Problem 1: \begin{equation} \min_{\beta} ~f_\alpha(\beta):=\frac{1}{2}\Vert y-X\beta\Vert^2 +\alpha\Vert \beta\Vert^2\end{equation}
  • Problem 2: \begin{align} \min_{\beta} ~&\frac{1}{2}\Vert y-X\beta\Vert^2\\ \text{s.t.}~&\Vert \beta\Vert^2-c\leq 0\end{align}

The Lagrangian for Problem 2 reads: \begin{equation} \mathcal{L}(\beta,\lambda)=\frac{1}{2}\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2-c) \end{equation} and you can probably already see the resemblance to Problem 1: it is identical except for the term $-\lambda c$, which is constant in $\beta$.
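In other words, the Lagrangian is just the Problem 1 objective with $\lambda$ in place of $\alpha$, shifted by a term that does not depend on $\beta$: \begin{equation} \mathcal{L}(\beta,\lambda)=f_\lambda(\beta)-\lambda c. \end{equation}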

Now let us look at the necessary conditions for optimality. For Problem 1, these read: \begin{equation} \nabla_\beta f_\alpha(\beta^*(\alpha))=0 \end{equation} where we deliberately write $\beta^*(\alpha)$ to emphasize that this is the optimal solution for a given $\alpha$.
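Written out for the ridge objective defined above, this stationarity condition becomes \begin{equation} \nabla_\beta f_\alpha(\beta^*(\alpha))=-X^\top\big(y-X\beta^*(\alpha)\big)+2\alpha\,\beta^*(\alpha)=0, \end{equation} so, for this parametrization of the penalty, the minimizer has the closed form $\beta^*(\alpha)=\left(X^\top X+2\alpha I\right)^{-1}X^\top y$, which exists for every $\alpha>0$.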

For Problem 2, the KKT conditions imply that we have: \begin{align*} \nabla_\beta \mathcal{L}(\beta^*,\lambda^*)&=\nabla_\beta f_\lambda(\beta^*)=0\\ \lambda^* (\Vert \beta^*\Vert^2-c)&=0 \end{align*} The first line says that the gradient of the Lagrangian with respect to $\beta$ must vanish, and the second is the complementary slackness condition. (We also need dual feasibility, $\lambda^*\geq 0$, but this is less important for our discussion.) Also observe that the gradient of the Lagrangian equals the gradient of $f_\lambda$, the objective of Problem 1 with $\lambda$ in place of $\alpha$, because the constant term $-\lambda c$ drops out when differentiating with respect to $\beta$.

Now suppose we solve Problem 1 for a given $\alpha$ and obtain its solution $\beta^*(\alpha)$. Let $c=\Vert \beta^*(\alpha)\Vert^2$, the squared norm of the solution to Problem 1. Then $\lambda^*=\alpha$ and $\beta^*=\beta^*(\alpha)$ satisfy the KKT conditions for Problem 2: stationarity holds because $\nabla_\beta f_{\lambda^*}(\beta^*)=\nabla_\beta f_\alpha(\beta^*(\alpha))=0$, and complementary slackness holds because the constraint is active, $\Vert \beta^*\Vert^2-c=0$. Hence both problems have the same solution. Conversely, if you solved Problem 2 and obtained $\lambda^*$ and $\beta^*$, you could set $\alpha=\lambda^*$ and retrieve the same solution by solving Problem 1.

To sum it up, both problems are equivalent when $c=\Vert \beta^*(\alpha)\Vert^2$.
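For readers who want to see this numerically, here is a minimal sketch (not from the original answers; it assumes NumPy and SciPy are available, and the data, $\alpha$, and variable names are purely illustrative): solve Problem 1 in closed form, set $c=\Vert \beta^*(\alpha)\Vert^2$, and check that a constrained solver for Problem 2 returns essentially the same $\beta$.

```python
# Sketch only: verify the constrained/penalized equivalence on random data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
alpha = 1.0

# Problem 1: closed-form minimizer of (1/2)||y - X b||^2 + alpha * ||b||^2,
# obtained from the stationarity condition above.
beta_pen = np.linalg.solve(X.T @ X + 2 * alpha * np.eye(X.shape[1]), X.T @ y)

# Set c to the squared norm of the penalized solution.
c = float(beta_pen @ beta_pen)

# Problem 2: minimize (1/2)||y - X b||^2 subject to ||b||^2 <= c
# (SLSQP expects inequality constraints written as fun(b) >= 0).
objective = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
constraint = {"type": "ineq", "fun": lambda b: c - b @ b}
res = minimize(objective, x0=np.zeros(X.shape[1]), method="SLSQP",
               constraints=[constraint])

# The two solutions should agree up to solver tolerance.
print(np.max(np.abs(res.x - beta_pen)))
```

Any constrained solver would do here; SLSQP is used only because it handles the single quadratic inequality constraint out of the box.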

Solution 2:

Solution 1 looks good, but if you're also looking for a citation, this paper covers it as well in Theorem 1: http://papers.nips.cc/paper/3675-efficient-and-accurate-lp-norm-multiple-kernel-learning (note: the meat of the proof is actually in the supplemental materials).

Kloft et al., "Efficient and Accurate Lp-Norm Multiple Kernel Learning", NIPS 2009.