Any specific advantages/insights for kernel Ridge-regression in RKHS?

It is fast because we are effectively solving a linear problem, and yet we obtain a high-performing nonlinear function. This matters in practice (think of an SVM classifying millions of data points). Any minimizer of the functional $$\mathcal F (f) = \frac{1}{n} \sum\limits_{i=1}^{n} l(f(x_i), y_i) + G(\|f\|^2_{\mathcal H_K}),$$ where $l$ is a loss function and $G$ is a strictly increasing function of the squared norm of $f$ in the RKHS with kernel $K$, has the form $$f^*(.) = \sum\limits_{i=1}^{n} \alpha_i K (x_i, . )$$
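To make these objects concrete, here is a minimal numerical sketch; the Gaussian/RBF kernel, the synthetic sample, and the variable names are illustrative choices of mine, not anything prescribed by the theorem.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel K(x, x') = exp(-gamma * ||x - x'||^2), evaluated pairwise."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

# Illustrative 1-D sample (any kernel / data would do).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))

# Gram matrix K_ij = K(x_i, x_j); this n x n object is all the
# finite-dimensional computation below ever needs.
K = rbf_kernel(X, X)

# A function of the form f(.) = sum_i alpha_i K(x_i, .) evaluated at the
# training points is just the matrix-vector product K @ alpha.
alpha = rng.normal(size=50)
f_at_train = K @ alpha
```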

This is called the representer theorem. In the proof below, we see that what we are actually doing is minimizing over a subspace of ${\mathcal H_K}$, namely $S= \operatorname{span}\{ K(x_i, .) | 1 \leq i \leq n\}$, and the representer theorem guarantees that the minimizer we obtain over $S$ is also the minimizer over the whole space ${\mathcal H_K}$. If you proceed like this with another subspace, say the span of the Legendre polynomials or some other Fourier basis, it does not hold in general that the minimizer over that subspace is the global minimizer.

The representer theorem has huge implications in practice. Let us look at an example. Let $\mathcal H_K$ be an RKHS with kernel $K$. Suppose we are interested in minimizing the regularized empirical squared-error risk (this corresponds to the squared loss $l(f(x), y) = (f(x)-y)^2$ and $G(t) = \lambda t$ in the functional above)

$$\hat f_{n,\lambda} = \operatorname{argmin}_{f \in \mathcal H_K} \frac{1}{n} \sum\limits_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal H_K}^2$$

Then we know that any minimizer $f$ has the form

$$ f(.) = \sum\limits_{j=1}^{n} \alpha_j K(x_j,.)$$

Observe that

$$f(x_i) = \sum\limits_{j=1}^{n} \alpha_j K(x_j,x_i) = \sum\limits_{j=1}^{n} \alpha_j K(x_i,x_j)$$

and that

\begin{align} \|f\| ^2_{\mathcal H_K} &= \langle { f, f }\rangle_{\mathcal H_K} \\ &= \Big\langle { \sum\limits_{i=1}^{n} \alpha_i K(x_i,.), \sum\limits_{j=1}^{n} \alpha_j K(x_j,.) }\Big\rangle_{\mathcal H_K} \\ &= \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} \alpha_i \alpha_j \big\langle {K(x_i,.), K(x_j,.) }\big\rangle_{\mathcal H_K} \end{align}

Using matrix notation with the Gram matrix $\mathbb K_{ij}:=\big\langle K(x_i,.), K(x_j,.) \big\rangle_{\mathcal H_K} = K(x_i,x_j)$, this means $f(x_i) = (\mathbb K \alpha)_i$ and $\|f\|^2_{\mathcal H_K} = \alpha^\top \mathbb K \alpha$, so we get

$$\hat f_{n,\lambda} = \sum\limits_{j=1}^{n} \alpha_j^* K(x_j,.)$$

where

$$\alpha^* = \operatorname{argmin}_{\alpha \in \mathbb R^n} \frac{1}{n} \|{\mathbb Y-\mathbb K \alpha}\|^2 + \lambda \alpha^\top \mathbb K \alpha $$

This is a convex, differentiable cost function, so we can find its minimum by setting its gradient to zero:

$$ - \mathbb K ( \mathbb Y - \mathbb K \alpha^*) + \lambda n \mathbb K \alpha^* = 0 $$

which implies, when the matrix $\mathbb K$ is strictly positive definite (hence invertible), that

$$\alpha^* = ( \mathbb K + \lambda n \mathbb I )^{-1} \mathbb Y$$

where $\mathbb I$ is the identity matrix. This is an $n \times n$ linear system, and since $\mathbb K + \lambda n \mathbb I$ is symmetric positive definite it can be solved numerically by means of a Cholesky factorization, at a cost of $O(n^3)$.
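A minimal numerical sketch of this step (the RBF kernel, the synthetic data and the value of $\lambda$ below are assumptions made purely for illustration); `scipy.linalg.cho_factor`/`cho_solve` carry out the Cholesky factorization and the two triangular solves:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel, evaluated pairwise between the rows of a and b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

# Illustrative regression data: y = sin(x) + noise.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

lam = 1e-2                      # regularization parameter lambda (assumed value)
K = rbf_kernel(X, X)            # n x n Gram matrix

# Solve (K + lambda * n * I) alpha = y.  For lambda > 0 the matrix is
# symmetric positive definite, so a Cholesky factorization applies: O(n^3).
A = K + lam * n * np.eye(n)
chol = cho_factor(A)
alpha = cho_solve(chol, y)
```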

Once $\alpha^*$ has been computed, we can evaluate $\hat f_{n,\lambda}$ on new data:

$$\hat f_{n,\lambda} (x_{\text{new}}) = \sum\limits_{j=1}^{n} \alpha_j^* K(x_j,x_{\text{new}}) = \mathbb K_{x_{\text{new}}} \alpha^* = \mathbb K_{x_{\text{new}}}( \mathbb K + \lambda n \mathbb I )^{-1} \mathbb Y$$

where $\mathbb K_{x_{\text{new}}}$ is the row vector with entries $K(x_j, x_{\text{new}})$, $1 \leq j \leq n$. With $\alpha^*$ precomputed, this costs $O(n)$ kernel evaluations per new example (if instead one back-substitutes through the stored Cholesky factor for each new point rather than keeping $\alpha^*$, the cost grows to $O(n^2)$ per example).
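Continuing the sketch above (it reuses the illustrative `rbf_kernel`, `X` and `alpha` from the previous block), prediction amounts to one row of kernel evaluations and a dot product per new point:

```python
def predict(X_new, X_train, alpha, gamma=1.0):
    """Evaluate f_hat(x_new) = sum_j alpha_j K(x_j, x_new) for every row of X_new."""
    K_new = rbf_kernel(X_new, X_train, gamma)  # m x n kernel evaluations
    return K_new @ alpha                       # O(n) work per new example

# Predictions on a grid of new inputs.
X_grid = np.linspace(-3, 3, 100).reshape(-1, 1)
y_grid = predict(X_grid, X, alpha)
```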

Here is the proof of the representer theorem where we explicitly use the reproducing property.

We project $f$ onto the subspace

$$S= \operatorname{span} \{ K(x_i, .) | 1 \leq i \leq n\}$$

By the orthogonal decomposition theorem in Hilbert spaces, we have $f= f_1 +f_2$, where $f_1$ is the component in $S$ and $f_2$ is the component orthogonal to $S$. By the Pythagorean theorem, we have

$$ \|f\|^2 = \|f_1\|^2 + \|f_2\|^2 \geq \|f_1\|^2$$

Since $G$ is non-decreasing, we have

$$G(\|f\|^2_{\mathcal H_K}) \geq G(\|f_1\|^2_{\mathcal H_K})$$

Notice that, using the reproducing property and the fact that $f_2$ is orthogonal to $S$ (hence to each $K(x_i,.)$),

$$f(x_i)= \langle f, K(x_i,.) \rangle = \langle f_1, K(x_i,.) \rangle + \langle f_2, K(x_i,.) \rangle = \langle f_1, K(x_i,.) \rangle = f_1(x_i)$$

This implies that

$$l(f(x_i), y_i) = l(f_1(x_i), y_i)$$

So the loss term depends only on the component of $f$ in $S$, and replacing $f$ by $f_1$ can only decrease the penalty $G(\|f\|^2_{\mathcal H_K})$. Hence we can find a minimizer in $S$

$$f^*(.) = \sum\limits_{i=1}^{n} \alpha_i K(x_i,.)$$

As $G$ is strictly increasing, $\|f_2\|$ must be zero for $f$ to be a minimizer of $\mathcal F$: otherwise the loss terms would be unchanged while $G(\|f\|^2_{\mathcal H_K}) > G(\|f_1\|^2_{\mathcal H_K})$, so $f_1$ would achieve a strictly smaller value. In conclusion, any minimizer has the above form.