Why use the kernel trick in an SVM as opposed to just transforming the data?

The kernel trick says that given your data $x_i \in \mathbb{R}^n, i \in \{1 \ldots m\}$, and a kernel $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ satisfying a certain property (*), there is a non-linear transformation $\phi : \mathbb{R}^n \to \mathbb{R}^{m}$ such that $k(x_i,x_j) = \langle \phi(x_i),\phi(x_j)\rangle$.

Let $K_{ij} = k(x_i,x_j)$ be the $m \times m$ dot-product (Gram) matrix. The property (*) of $k(\cdot,\cdot)$ is that $K$ is positive semi-definite, so we can diagonalize it as $K = P D P^T$ with $D \ge 0$. Letting $Y = P D^{1/2}$ we have $Y Y^T = P D^{1/2} (D^{1/2} P^T) = P D P^T = K$, i.e. we can take $\phi(x_i) = Y_{i\cdot}$ (the $i$th row of $Y$).
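For a small data set you can check this construction numerically. Here is a minimal sketch in Python/NumPy, assuming an RBF kernel as a concrete choice of $k$ (the kernel, the bandwidth `gamma` and the variable names are only illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))            # m = 20 points in R^n with n = 3

    # Concrete kernel choice: RBF, k(x, x') = exp(-gamma * ||x - x'||^2)
    gamma = 0.5
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)           # m x m dot-product (Gram) matrix

    # K is symmetric positive semi-definite: diagonalize K = P D P^T
    eigvals, P = np.linalg.eigh(K)
    D = np.clip(eigvals, 0.0, None)         # clip tiny negative round-off values

    # Explicit feature map: phi(x_i) is the i-th row of Y = P D^{1/2}
    Y = P * np.sqrt(D)                      # same as P @ np.diag(np.sqrt(D))

    # Sanity check: <phi(x_i), phi(x_j)> reproduces k(x_i, x_j)
    print(np.allclose(Y @ Y.T, K))          # True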

That is to say, you can do what you suggested when $m$ is small. But usually $n$ is small while $m$ is very large, so it is not practical to actually compute $P$, $D$ and $\phi$ (instead, one computes only the first few largest eigenvalues/eigenvectors of $K$, as in kernel PCA and spectral clustering).
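To illustrate the "first few eigenvalues" route, here is a hedged sketch (again with an illustrative RBF kernel; SciPy's `eigsh` extracts only the top-$k$ eigenpairs, and the centering of $K$ that kernel PCA normally applies is omitted):

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.sparse.linalg import eigsh

    rng = np.random.default_rng(0)
    m, n = 2000, 3                          # many points, low input dimension
    X = rng.normal(size=(m, n))

    gamma = 0.5
    K = np.exp(-gamma * cdist(X, X, 'sqeuclidean'))   # m x m Gram matrix

    # Full diagonalization is O(m^3); instead take only the top-k eigenpairs
    k = 10
    eigvals, P_k = eigsh(K, k=k, which='LM')          # largest eigenvalues

    # Truncated feature map: each point gets k coordinates instead of m
    Y_k = P_k * np.sqrt(np.clip(eigvals, 0.0, None))
    print(Y_k.shape)                        # (2000, 10)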


You can use infinite-dimensional spaces with the kernel trick.
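One classic example (using the one-dimensional Gaussian kernel, which is of course not the only choice): expanding $e^{xy}$ as a power series gives

$$k(x,y) = e^{-(x-y)^2/2} = e^{-x^2/2} e^{-y^2/2} \sum_{r=0}^{\infty} \frac{x^r y^r}{r!} = \langle \phi(x), \phi(y) \rangle, \qquad \phi(x) = e^{-x^2/2}\left(1,\; x,\; \tfrac{x^2}{\sqrt{2!}},\; \tfrac{x^3}{\sqrt{3!}},\; \ldots\right),$$

so $\phi$ maps into an infinite-dimensional space, yet $k(x,y)$ is a single exponential that you can evaluate directly.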

You might want to read my SVM summary, and especially "What is an example of an SVM kernel, where one implicitly uses an infinite-dimensional space?"