Gradients of marginal likelihood of Gaussian Process with squared exponential covariance, for learning hyper-parameters

The derivation of the gradient of the marginal likelihood is given in http://www.gaussianprocess.org/gpml/chapters/RW5.pdf

But the gradient for the most commonly used covariance function, the squared exponential covariance, is not given explicitly.

I am implementing the Rprop algorithm from http://ml.informatik.uni-freiburg.de/_media/publications/blumesann2013.pdf for learning the hyper-parameters sigma (signal variance) and h (length scale). Unfortunately, my implementation is not working well. I have derived the gradients myself, but I am not sure whether they are correct.
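For context, the kind of sign-based update I mean is roughly the following (a generic iRprop⁻-style sketch, not necessarily the exact variant in the paper; the constants and names are illustrative):

```python
import numpy as np

def rprop_maximise(grad_fn, theta0, n_iter=200,
                   eta_plus=1.2, eta_minus=0.5,
                   step0=0.1, step_min=1e-6, step_max=1.0):
    """Generic sign-based (iRprop- style) ascent on an objective.

    grad_fn(theta) returns the gradient of the objective (e.g. the log
    marginal likelihood) with respect to the hyper-parameters theta.
    Per-parameter step sizes grow while the gradient sign is stable and
    shrink when it flips.
    """
    theta = np.array(theta0, dtype=float)
    step = np.full_like(theta, step0)
    prev_grad = np.zeros_like(theta)
    for _ in range(n_iter):
        grad = grad_fn(theta)
        sign_change = np.sign(grad) * np.sign(prev_grad)
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        grad = np.where(sign_change < 0, 0.0, grad)  # iRprop-: skip flipped components
        theta += np.sign(grad) * step                # move along the gradient sign
        prev_grad = grad
    return theta
```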

Can someone point me to a good tutorial or article that explicitly gives the expressions for the hyper-parameter gradients?


Solution 1:

We are looking to maximise the log marginal likelihood $\ln P(y|x, \theta)$:

$$\ln P(y|x, \theta) = -\frac{1}{2}\ln|K| - \frac{1}{2}y^TK^{-1}y - \frac{N}{2}\ln{2\pi}$$

The three terms can be seen as balancing the complexity of the GP (to avoid overfitting) against the data fit, with a normalising constant at the end. So the gradient is

$$\frac{\partial}{\partial\theta_i} \log P(y|x, \theta) = \frac{1}{2}y^TK^{-1}\frac{\partial K}{\partial\theta_i}K^{-1}y -\frac{1}{2}\mathrm{tr}\left(K^{-1}\frac{\partial K}{\partial\theta_i}\right)$$

So all we need to know is $\frac{\partial K}{\partial\theta_i}$ to be able to solve it. I think you got this far but I wasn't sure so I thought I would recap.
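To make this concrete, here is a minimal numpy/scipy sketch (function and variable names are my own, not from any particular library) that evaluates the log marginal likelihood and its gradient, given $K$, the targets $y$, and the matrices $\partial K/\partial\theta_i$:

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

def log_marginal_likelihood_and_grad(K, dK_list, y, jitter=1e-8):
    """ln P(y|x,theta) and its gradient, given K and a list of dK/dtheta_i.

    lml    = -0.5 ln|K| - 0.5 y^T K^{-1} y - (N/2) ln(2 pi)
    grad_i =  0.5 y^T K^{-1} dK_i K^{-1} y - 0.5 tr(K^{-1} dK_i)
    """
    N = len(y)
    L = cholesky(K + jitter * np.eye(N), lower=True)    # K = L L^T
    alpha = cho_solve((L, True), y)                     # alpha = K^{-1} y
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))        # ln|K| via Cholesky
    lml = -0.5 * log_det_K - 0.5 * y @ alpha - 0.5 * N * np.log(2 * np.pi)

    K_inv = cho_solve((L, True), np.eye(N))             # K^{-1}
    grads = [0.5 * alpha @ dK @ alpha - 0.5 * np.trace(K_inv @ dK)
             for dK in dK_list]
    return lml, np.array(grads)
```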

For the case of the RBF/exponentiated quadratic kernel (some prefer not to call it squared exponential, since it is the argument of the exponential, not the exponential itself, that is squared), under the following formulation:

$$K(x,x') = \sigma^2\exp\left(\frac{-(x-x')^T(x-x')}{2l^2}\right)$$

The derivatives with respect to the hyperparameters are as follows:

$$\frac{\partial K}{\partial\sigma} = 2\sigma\exp\left(\frac{-(x-x')^T(x-x')}{2l^2}\right)$$

$$\frac{\partial K}{\partial l} = \sigma^2\exp\left(\frac{-(x-x')^T(x-x')}{2l^2}\right) \frac{(x-x')^T(x-x')}{l^3}$$
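A small numpy sketch of these two derivatives (the helper names are my own; it assumes the rows of `X` are the input points):

```python
import numpy as np

def sq_dists(X):
    """Pairwise squared Euclidean distances (x - x')^T (x - x')."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def rbf_kernel_and_grads(X, sigma, l):
    """K = sigma^2 exp(-r / (2 l^2)), with dK/dsigma and dK/dl (r = sq. distance)."""
    r = sq_dists(X)
    E = np.exp(-r / (2.0 * l ** 2))
    K = sigma ** 2 * E
    dK_dsigma = 2.0 * sigma * E           # 2 sigma exp(-r / (2 l^2))
    dK_dl = sigma ** 2 * E * r / l ** 3   # sigma^2 exp(-r / (2 l^2)) r / l^3
    return K, dK_dsigma, dK_dl
```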

However, GP libraries often optimise over the logarithms of the hyper-parameters, since $\sigma$ and $l$ are confined to the positive real numbers. Let $l=\exp(\theta_1)$ and $\sigma=\exp(\theta_2)$; then any real values of $\theta_1$ and $\theta_2$ map back to valid hyper-parameters. Applying the chain rule to the kernel above, the derivatives in this parameterisation are:

$$\frac{\partial K}{\partial\theta_1} = \frac{\partial K}{\partial l}\frac{\partial l}{\partial\theta_1} = \sigma^2\exp\left(\frac{-(x-x')^T(x-x')}{2l^2}\right)\frac{(x-x')^T(x-x')}{l^2}$$

$$\frac{\partial K}{\partial \theta_2} = \frac{\partial K}{\partial\sigma}\frac{\partial\sigma}{\partial\theta_2} = 2\sigma^2\exp\left(\frac{-(x-x')^T(x-x')}{2l^2}\right) = 2K$$
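If you optimise in log space like this, the chain rule carries through mechanically; here is a brief sketch reusing the `rbf_kernel_and_grads` helper from the earlier snippet (again, my own naming):

```python
import numpy as np

def rbf_log_space_grads(X, theta1, theta2):
    """Derivatives of K with respect to theta1 = ln(l) and theta2 = ln(sigma)."""
    l, sigma = np.exp(theta1), np.exp(theta2)
    K, dK_dsigma, dK_dl = rbf_kernel_and_grads(X, sigma, l)
    dK_dtheta1 = dK_dl * l           # chain rule: dl/dtheta1 = l
    dK_dtheta2 = dK_dsigma * sigma   # chain rule: dsigma/dtheta2 = sigma (gives 2K)
    return K, dK_dtheta1, dK_dtheta2
```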

There is interesting work carried out by the likes of Mike Osborne on marginalising out the hyper-parameters. However, as far as I am aware, it is only practical for small numbers of hyper-parameters and has not been incorporated into standard libraries yet. Worth a look all the same.

Another note is that the optimisation surface is multimodal the majority of the time, so if you are using a local gradient-based optimiser, be sure to use a fair few initialisations.
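A common way to deal with the multimodality is simply a handful of random restarts; a minimal sketch (the bounds, optimiser, and restart count here are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_restarts(neg_lml_and_grad, n_restarts=10, seed=0):
    """Run several local optimisations from random log-space initialisations
    and keep the best result. neg_lml_and_grad(theta) should return
    (negative log marginal likelihood, its gradient)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta0 = rng.uniform(-3.0, 3.0, size=2)   # random (ln l, ln sigma) start
        res = minimize(neg_lml_and_grad, theta0, jac=True, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best
```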

Edit from comments: Optimizing multidimensional kernel functions

To extend the answer based on feedback from the comments, I thought I would also add a note on multidimensional kernel functions. From the perspective of the maths, multidimensional kernels are treated like any other: you take the partial derivative of the kernel $K$ with respect to the particular hyper-parameter of interest.

A nicety of ARD kernels (kernels formed by multiplying together an independent kernel for each input dimension; see here for different types of kernels) is that the derivatives come out very easily:

$$K(x,x') = K(x_1,x_1')K(x_2,x_2')K(x_3,x_3')$$

$$\frac{\partial K}{\partial l_1} = \frac{\partial K(x_1,x_1')}{\partial l_1}K(x_2,x_2')K(x_3,x_3')$$

In fact, many kernels (not only those for multidimensional input) can be created via the addition and multiplication of simpler kernel functions; read more in the Kernel Cookbook. This makes the maths for optimisation quite nice, as you can simply apply the sum and product rules of differentiation to work out the required gradients.
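As a small illustration of the product rule in code (one-dimensional RBF factors, my own helper names), the derivative of a product kernel with respect to one length-scale just replaces that dimension's factor by its derivative:

```python
import numpy as np

def rbf_1d(x, l):
    """1-D factor exp(-(x - x')^2 / (2 l^2)) and its derivative w.r.t. l."""
    r = (x[:, None] - x[None, :]) ** 2
    E = np.exp(-r / (2.0 * l ** 2))
    return E, E * r / l ** 3

def ard_product_kernel_grads(X, lengthscales):
    """Product of per-dimension factors; dK/dl_d swaps factor d for its derivative."""
    factors, dfactors = zip(*(rbf_1d(X[:, d], l) for d, l in enumerate(lengthscales)))
    K = np.prod(factors, axis=0)
    grads = []
    for d in range(len(lengthscales)):
        terms = list(factors)
        terms[d] = dfactors[d]            # product rule: differentiate one factor only
        grads.append(np.prod(terms, axis=0))
    return K, grads
```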

Solution 2:

I believe that the derivative $\frac{\partial K}{\partial l}$, as it was originally given by j_f, is not correct. I think that the correct one is the following (I present the derivation step by step):

$K(x,x') = \sigma^2\exp\big(\frac{-(x-x')^T(x-x')}{2l^2}\big)$
I now call $g(l)=\frac{-(x-x')^T(x-x')}{2l^2}$. So $K=\sigma^2\exp\big( g(l) \big)$.

$\frac{\partial K}{\partial l} = \frac{\partial K}{\partial g} \frac{\partial g}{\partial l} = \sigma^2\exp\big(\frac{-(x-x')^T(x-x')}{2l^2}\big) \frac{\partial g}{\partial l}$.

With simple calculations, I finally get:

$\frac{\partial K}{\partial l} = \sigma^2\exp\big(\frac{-(x-x')^T(x-x')}{2l^2}\big) \frac{(x-x')^T(x-x')}{l^3}$.
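One can quickly sanity-check this expression against a central finite difference (a scalar-input sketch of my own; the test values are arbitrary):

```python
import numpy as np

def k(x, xp, sigma, l):
    """K(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2)) for scalar inputs."""
    return sigma ** 2 * np.exp(-(x - xp) ** 2 / (2.0 * l ** 2))

def dk_dl(x, xp, sigma, l):
    """Analytic derivative: K * (x - x')^2 / l^3."""
    return k(x, xp, sigma, l) * (x - xp) ** 2 / l ** 3

x, xp, sigma, l, eps = 0.3, 1.1, 1.5, 0.7, 1e-6
numeric = (k(x, xp, sigma, l + eps) - k(x, xp, sigma, l - eps)) / (2 * eps)
print(dk_dl(x, xp, sigma, l), numeric)   # the two values should agree closely
```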

Solution 3:

Maybe there is also an error in the gradient formulation, because in Rasmussen & Williams, Gaussian Processes for Machine Learning, p. 114, eq. 5.9, it is expressed as:

$$\frac{\partial}{\partial\theta_j}\log p(y|X,\theta) = \frac{1}{2}y^TK^{-1}\frac{\partial K}{\partial\theta_j}K^{-1}y - \frac{1}{2}\mathrm{tr}\left(K^{-1}\frac{\partial K}{\partial\theta_j}\right)$$

where the two terms have different signs and the target vector $y$ is transposed only the first time it appears.