How to Prove the Semi-parametric Representer Theorem
This question concerns the generalized Representer Theorem due to Schölkopf, Herbrich, and Smola. In this magnificent work, the authors provide two versions of the Representer Theorem: a non-parametric one and a semi-parametric one. However, they give a proof only for the non-parametric version, remarking that the proof of the semi-parametric version is slightly more technical but straightforward.
I fully understand the proof of the non-parametric version, but unfortunately I cannot find a way to start the proof of the semi-parametric one.
Below, I give the statements of both versions of the Representer Theorem, as well as the proof of the non-parametric case, and I would like to discuss the proof of the semi-parametric version.
#Theorem 1 (Non-parametric Representer Theorem)
Suppose we are given a nonempty set $\mathcal{X}$, a positive definite real-valued kernel $k$ on $\mathcal{X}\times\mathcal{X}$, samples $(\mathbf{x}_1,y_1),\cdots,(\mathbf{x}_m,y_m)\in\mathcal{X}\times\mathbb{R}$, a strictly monotonically increasing real-valued function $g$ on $[0,\infty)$, an arbitrary cost function $c\colon(\mathcal{X}\times\mathbb{R}^2)^m\to\mathbb{R}\cup\{\infty\}$, and a class of functions $$ \mathcal{F}=\bigg\{ f\in\mathbb{R}^{\mathcal{X}} \mid f(\cdot)=\sum_{i=1}^{\infty}\beta_i k(\cdot, \mathbf{z}_i),\ \beta_i\in\mathbb{R},\ \mathbf{z}_i\in\mathcal{X},\ \lVert f \rVert < \infty \bigg\}. $$ Here, $\lVert\cdot\rVert$ denotes the norm in the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ associated with $k$. Then any $f\in\mathcal{F}$ minimizing the regularized risk functional $$ c \Big( (\mathbf{x}_1,y_1,f(\mathbf{x}_1)), \cdots, (\mathbf{x}_m,y_m,f(\mathbf{x}_m)) \Big) + g\big(\lVert f \rVert\big) $$ admits a representation of the form $$ f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot,\mathbf{x}_i). $$
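For concreteness, kernel ridge regression is a standard instance of this theorem: taking $c$ to be the squared loss $\sum_i (y_i - f(\mathbf{x}_i))^2$ and $g(\lVert f\rVert)=\lambda\lVert f\rVert^2$, the minimizer is the kernel expansion with coefficients $\alpha=(K+\lambda I)^{-1}y$, where $K_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)$. The following minimal sketch illustrates this numerically; the Gaussian kernel and the synthetic data are my own choices, not part of the paper.

```python
import numpy as np

# Minimal sketch: kernel ridge regression as an instance of Theorem 1.
# Assumptions (not from the paper): Gaussian kernel, squared loss,
# g(||f||) = lam * ||f||^2, synthetic 1-D data.

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)            # sample points x_1, ..., x_m
y = np.sin(X) + 0.1 * rng.normal(size=20)  # noisy targets y_1, ..., y_m
lam = 0.1                                  # regularization weight

def kernel(a, b):
    """Gaussian kernel k(a, b) = exp(-|a - b|^2 / 2)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

K = kernel(X, X)                           # Gram matrix K_ij = k(x_i, x_j)

# Closed-form minimizer of sum_i (y_i - f(x_i))^2 + lam * ||f||^2:
# f(.) = sum_i alpha_i k(., x_i) with alpha = (K + lam I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(t):
    """The fitted function: a kernel expansion over the m training points."""
    return kernel(np.atleast_1d(t), X) @ alpha

print(f(0.5))  # prediction at a new point
```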
Proof: Let $\phi: \mathcal{X}\to\mathbb{R}^{\mathcal{X}}$, $$ \mathbf{x}\mapsto k(\cdot,\mathbf{x}). $$ Since $k$ is a reproducing kernel, evaluation of the function $\phi(\mathbf{x})$ at the point $\mathbf{x}'$ yields $$ (\phi(\mathbf{x}))(\mathbf{x}')=k(\mathbf{x}',\mathbf{x})=\langle \phi(\mathbf{x}'), \phi(\mathbf{x}) \rangle, $$ for all $\mathbf{x},\mathbf{x}'\in\mathcal{X}$. Here $\langle \cdot,\cdot \rangle$ denotes the dot product in $\mathcal{H}$. Given $\mathbf{x}_1,\cdots,\mathbf{x}_m$, any $f\in\mathcal{F}$ can be decomposed into a part that lives in the span of the $\phi(\mathbf{x}_i)$, and a part which is orthogonal to it, i.e. $$ f = \sum_{i=1}^{m} \alpha_i\phi(\mathbf{x}_i) + u, $$ for some $\alpha\in\mathbb{R}^m$ and $u\in\mathcal{F}$ satisfying, for all $j$, $$ \langle u,\phi(\mathbf{x}_j) \rangle = 0. $$ Using the latter and the reproducing property mentioned above, application of $f$ to an arbitrary point $\mathbf{x}_j$ yields $$ f(\mathbf{x}_j) = \Big\langle \sum_{i=1}^{m} \alpha_i\phi(\mathbf{x}_i) + u, \phi(\mathbf{x}_j) \Big\rangle = \sum_{i=1}^{m} \alpha_i \Big\langle \phi(\mathbf{x}_i),\phi(\mathbf{x}_j) \Big\rangle, $$ which is independent of $u$. Consequently, the first term of the regularized risk functional is independent of $u$. As for the second term, since $u$ is orthogonal to $\sum_{i=1}^{m}\alpha_i\phi(\mathbf{x}_i)$, and $g$ is strictly monotonic, we get $$ g\big(\lVert f \rVert\big) = g\bigg(\bigg\lVert \sum_{i=1}^{m} \alpha_i\phi(\mathbf{x}_i) + u \bigg\rVert\bigg) = g\bigg( \sqrt{ \bigg\lVert \sum_{i=1}^{m} \alpha_i\phi(\mathbf{x}_i) \bigg\rVert^2 + \big\lVert u \big\rVert^2 } \bigg) \geq g\bigg( \bigg\lVert \sum_{i=1}^{m} \alpha_i\phi(\mathbf{x}_i) \bigg\rVert \bigg), $$ with equality iff $u=0$. Setting $u=0$ thus does not affect the first term of the regularized risk functional, while strictly reducing the second term whenever $u\neq 0$; hence, any minimizer must have $u=0$. Consequently, any solution takes the form $f=\sum_{i=1}^{m}\alpha_i\phi(\mathbf{x}_i)$, i.e., using the reproducing property, $$ f(\cdot) = \sum_{i=1}^{m}\alpha_i k(\cdot,\mathbf{x}_i). $$ Q.E.D.
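The two key steps of this proof, that evaluations at the $\mathbf{x}_i$ are independent of $u$ while any $u\neq 0$ strictly increases the norm, can be checked numerically. The sketch below (my own illustration, again assuming a Gaussian kernel) builds $u$ as a kernel section $k(\cdot,\mathbf{z})$ minus its projection onto $\operatorname{span}\{\phi(\mathbf{x}_i)\}$, and verifies that $u(\mathbf{x}_j)=0$ for all $j$ and that the Pythagorean identity holds.

```python
import numpy as np

# Sketch (not from the paper): verify that u orthogonal to span{phi(x_i)}
# vanishes at every sample point and contributes only to the norm,
# assuming a Gaussian kernel and arbitrary points.

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=5)      # the sample points x_1, ..., x_m
z = np.array([0.7])                 # an extra expansion point z

def kernel(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

K = kernel(X, X)                    # Gram matrix on the samples
k_z = kernel(X, z)[:, 0]            # vector (k(x_i, z))_i

# u = k(., z) minus its orthogonal projection onto span{phi(x_i)}:
gamma = np.linalg.solve(K, k_z)

# u(x_j) = k(x_j, z) - sum_i gamma_i k(x_j, x_i) = 0 for every j:
print(np.allclose(k_z - K @ gamma, 0.0))          # True

# Pythagorean identity ||f||^2 = ||sum_i alpha_i phi(x_i)||^2 + ||u||^2:
alpha = rng.normal(size=len(X))
norm_span_sq = alpha @ K @ alpha                  # ||sum_i alpha_i phi(x_i)||^2
norm_u_sq = kernel(z, z)[0, 0] - gamma @ k_z      # <u, u> in the RKHS
cross = alpha @ (k_z - K @ gamma)                 # <sum_i alpha_i phi(x_i), u>
print(np.isclose(cross, 0.0))                     # True: u is orthogonal

# Compute ||f||^2 directly from the full expansion f = sum_i c_i k(., p_i):
P = np.concatenate([X, z])
c = np.concatenate([alpha - gamma, [1.0]])
norm_f_sq = c @ kernel(P, P) @ c
print(np.isclose(norm_f_sq, norm_span_sq + norm_u_sq))  # True: Pythagoras
```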
Now, the statement of the semi-parametric version extends the non-parametric one as follows:
#Theorem 2 (Semi-parametric Representer Theorem)
Suppose that, in addition to the assumptions of the previous theorem, we are given a set of $M$ real-valued functions $\{\psi_p\}_{p=1}^{M}$ defined on $\mathcal{X}$, with the property that the $m\times M$ matrix $\big( \psi_p(\mathbf{x}_i) \big)_{ip}$ has rank $M$. Then, any $\tilde{f}:=f+h$, with $f\in\mathcal{F}$ and $h\in\operatorname{span}\{\psi_p\}$, minimizing the regularized risk functional $$ c \Big( (\mathbf{x}_1,y_1,\tilde{f}(\mathbf{x}_1)), \cdots, (\mathbf{x}_m,y_m,\tilde{f}(\mathbf{x}_m)) \Big) + g\big(\lVert f \rVert\big), $$ admits a representation of the form $$ \tilde{f}(\cdot) = \sum_{i=1}^{m}\alpha_i k(\cdot,\mathbf{x}_i) + \sum_{p=1}^{M}\beta_p \psi_p(\cdot), $$ with unique coefficients $\beta_p\in\mathbb{R}$, for all $p=1,\cdots,M$.
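A standard instance of Theorem 2 is a kernel fit with an unpenalized parametric part; e.g., $\psi_1\equiv 1$ recovers the bias term of an SVM. With squared loss and $g(\lVert f\rVert)=\lambda\lVert f\rVert^2$, setting the gradients with respect to $\alpha$ and $\beta$ to zero yields the block linear system $(K+\lambda I)\alpha + \Psi\beta = y$, $\Psi^{\top}\alpha = 0$, where $\Psi=(\psi_p(\mathbf{x}_i))_{ip}$ (the well-known smoothing-spline equations). The sketch below solves it; the Gaussian kernel and the choice $\psi_1(x)=1$, $\psi_2(x)=x$ are my own assumptions for illustration.

```python
import numpy as np

# Sketch (assumptions: Gaussian kernel, squared loss, g = lam * ||f||^2,
# parametric part spanned by psi_1(x) = 1 and psi_2(x) = x, so M = 2).

rng = np.random.default_rng(2)
m, M, lam = 30, 2, 0.1
X = rng.uniform(-3, 3, size=m)
y = 0.5 * X + np.sin(2 * X) + 0.1 * rng.normal(size=m)

def kernel(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

K = kernel(X, X)
Psi = np.column_stack([np.ones(m), X])   # (psi_p(x_i))_{ip}, rank M

# Block system: (K + lam I) alpha + Psi beta = y,  Psi^T alpha = 0.
A = np.block([[K + lam * np.eye(m), Psi],
              [Psi.T, np.zeros((M, M))]])
sol = np.linalg.solve(A, np.concatenate([y, np.zeros(M)]))
alpha, beta = sol[:m], sol[m:]

def f_tilde(t):
    """The fitted function: kernel expansion plus parametric part."""
    t = np.atleast_1d(t)
    return kernel(t, X) @ alpha + np.column_stack([np.ones(len(t)), t]) @ beta

print(beta)          # the unique parametric coefficients beta_1, beta_2
print(f_tilde(0.0))
```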
It would be nice if you could provide a meaningful sketch of the proof. I think it may help other people as well, who study this theory but do not (yet!) have the appropriate background to prove such theorems by themselves. Thanks a lot!
Suppose that we have $\tilde f = f + h$ as described, and let $V = \operatorname{span}\{ k(\cdot, \mathbf{x}_i) \}_{i=1}^{m}$. We can write any $f \in \mathcal{F}$ as $f = f_s + f_p$, where $f_s \in V$ is the orthogonal projection of $f$ onto $V$ and $f_p \in V^{\perp}$.
This means that $$\tilde f(\mathbf{x}_i) = f(\mathbf{x}_i) + h(\mathbf{x}_i) = f_s(\mathbf{x}_i) + f_p(\mathbf{x}_i) + h(\mathbf{x}_i),$$ but since $f_p \in V^{\perp}$, the reproducing property gives $f_p(\mathbf{x}_i) = \langle f_p, k(\cdot,\mathbf{x}_i) \rangle = 0$. This yields $$\tilde f(\mathbf{x}_i) = f(\mathbf{x}_i) + h(\mathbf{x}_i) = f_s(\mathbf{x}_i) + h(\mathbf{x}_i).$$
Thus $$c\Big( \big(\mathbf{x}_i, y_i, \tilde f(\mathbf{x}_i)\big)_{i=1,\dots,m}\Big) = c\Big( \big(\mathbf{x}_i, y_i, f_s(\mathbf{x}_i) + h(\mathbf{x}_i)\big)_{i=1,\dots,m}\Big),$$ which means that $f_p$ has no effect on the cost term. Moreover, since $$\lVert f\rVert^2 = \lVert f_s\rVert^2 + \lVert f_p\rVert^2$$ by the Pythagorean theorem, we have $\lVert f_s\rVert \le \lVert f\rVert$, with equality iff $f_p = 0$. Since $g$ is strictly increasing, $g(\lVert f_s\rVert) \le g(\lVert f\rVert)$, again with equality iff $f_p = 0$.
Thus, if a function $\tilde f = f + h$ minimizes the regularized risk functional, then $f_p$ must be zero: otherwise, replacing $f$ by $f_s$ would leave the cost term unchanged while strictly decreasing $g(\lVert f\rVert)$. Hence $\tilde f = f_s + h$, and since $f_s \in V$ we have $$\tilde f = \sum_{i=1}^m \alpha_i k(\cdot, \mathbf{x}_i) + h.$$
Finally, evaluating $\tilde f(\mathbf{x}_j) = \sum_{i=1}^m \alpha_i k(\mathbf{x}_j, \mathbf{x}_i) + h(\mathbf{x}_j)$ gives $m$ equations $$h(\mathbf{x}_j) = \sum_{p=1}^M \beta_p \psi_p(\mathbf{x}_j) = \tilde f(\mathbf{x}_j) - \sum_{i=1}^m \alpha_i k(\mathbf{x}_j, \mathbf{x}_i).$$ By the rank assumption on the matrix $\big(\psi_p(\mathbf{x}_j)\big)_{jp}$, its columns $(\psi_p(\mathbf{x}_1),\dots,\psi_p(\mathbf{x}_m))^{\top}$, $p=1,\dots,M$, are linearly independent, and thus the coefficients $\beta_p$ are uniquely determined by these equations.
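The uniqueness claim in this last step is just linear algebra: a rank-$M$ matrix $\Psi=(\psi_p(\mathbf{x}_j))_{jp}$ has full column rank, so $\Psi\beta = t$ has at most one solution $\beta$. A small sketch of this fact, with an arbitrary full-rank $\Psi$ of my own construction:

```python
import numpy as np

# Sketch: full column rank of Psi = (psi_p(x_j))_{jp} makes beta unique.
rng = np.random.default_rng(3)
m, M = 10, 3
Psi = rng.normal(size=(m, M))                 # rank M with probability 1
assert np.linalg.matrix_rank(Psi) == M

beta_true = rng.normal(size=M)
t = Psi @ beta_true                           # the values h(x_j)

# Least squares recovers beta exactly: full column rank => unique solution.
beta_rec, *_ = np.linalg.lstsq(Psi, t, rcond=None)
print(np.allclose(beta_rec, beta_true))       # True
```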