Passing through RDMA network devices to docker containers

I want to pass InfiniBand through to a Docker container so that I can run some high-performance apps over IPoIB and use RDMA.

Currently, I'm doing this with xen virtual machines. Now I'm looking into using CoreOS and docker as a much lighter weight and easier to manage alternative.

I have an IPoIB device, ib0, with the static IP 10.10.10.10 assigned to it. I've managed to expose it inside a Docker container with the following:

docker run --net=host --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/rdma_cm -t -i ubuntu:14.04 /bin/bash

Great, that works. ib0 is available inside the docker container.

Now let's suppose I have a dual-port HCA. On the host the ports appear as ib0 and ib1, each with its own IP: 10.10.10.10 on ib0 and 10.10.10.11 on ib1.

And now I want to pass ib0 to the first container and ib1 to the second. Using the method above both will appear in both containers because of the --net=host option. However, not specifying it means the devices do not appear at all.

Another scenario: I have a lot of machines that use SR-IOV to pass InfiniBand devices through to Xen virtual machines. How could I instead pass a virtual function InfiniBand device to a Docker container and have it appear there?

Note: pipework doesn't work in this situation but if I understand it better it might be able to be hacked to do what I want. I just don't quite understand what it's doing... yet.


Solution 1:

And now I can answer my own question on how to do this.

Use pipework, which I have just patched to work with InfiniBand (RDMA) IPoIB devices.

You run it like this.

~ $ docker run --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/rdma_cm -d container 
~ $ pipework ib0 container-id ip/netmask

Because IPoIB devices do not support bridging, the whole ib0 device is hidden from the host after the command is issued, i.e. it is moved into the network namespace of the container.
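
For anyone wondering what "moved into the network namespace" means in practice, here is a rough sketch of doing the same thing by hand with iproute2 and nsenter. This is not pipework's actual code. The function name move_if_to_netns, the RUN dry-run variable and the /24 prefix in the usage line are my own assumptions for illustration. By default it only prints the commands; set RUN to an empty string and run as root to execute them.

```shell
#!/bin/sh
# Sketch only: move a host interface into a container's network namespace.
# RUN defaults to "echo" (dry run). Export RUN="" and run as root to execute.
RUN="${RUN-echo}"

# move_if_to_netns IFNAME PID ADDR/PREFIX   (hypothetical helper name)
#   PID is the container's init process on the host, for example from:
#   docker inspect -f '{{.State.Pid}}' <container-id>
move_if_to_netns() {
    ifname=$1 pid=$2 addr=$3
    # Hand the interface to the namespace owned by PID; it disappears from the host.
    $RUN ip link set dev "$ifname" netns "$pid"
    # Addresses are dropped when an interface changes namespace, so set it up again inside.
    $RUN nsenter -t "$pid" -n ip addr add "$addr" dev "$ifname"
    $RUN nsenter -t "$pid" -n ip link set dev "$ifname" up
}
```

For the dual-port case in the question, calling move_if_to_netns ib1 "$(docker inspect -f '{{.State.Pid}}' container-id)" 10.10.10.11/24 would print the three commands that give ib1 to the second container only (the /24 is a guess, use your real netmask).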

To get bridge-like functionality without bridging, use SR-IOV and pass the virtual function through via pipework.

The latest version of pipework uses virtual IPoIB, which is similar to macvlan, so the real ib0 remains visible on the host. It works very similarly to the Ethernet version.
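
A minimal sketch of the virtual IPoIB approach done manually, assuming an iproute2 and kernel recent enough to create IPoIB child interfaces with "type ipoib". Again, this is an illustration and not pipework's code. The helper name add_ipoib_child, the child name ib0.c1, the RUN dry-run variable and the prefix length are made up for the example. Whether a child sharing the parent's P_Key is accepted may depend on your kernel.

```shell
#!/bin/sh
# Sketch only: create a child IPoIB interface and give it to a container.
# RUN defaults to "echo" (dry run). Export RUN="" and run as root to execute.
RUN="${RUN-echo}"

# add_ipoib_child PARENT CHILD PID ADDR/PREFIX   (hypothetical helper name)
add_ipoib_child() {
    parent=$1 child=$2 pid=$3 addr=$4
    # Create a virtual IPoIB interface on top of the parent; the parent stays on the host.
    $RUN ip link add link "$parent" name "$child" type ipoib
    # Move only the child into the container's network namespace.
    $RUN ip link set dev "$child" netns "$pid"
    # Configure the child from inside that namespace.
    $RUN nsenter -t "$pid" -n ip addr add "$addr" dev "$child"
    $RUN nsenter -t "$pid" -n ip link set dev "$child" up
}
```

Something like add_ipoib_child ib0 ib0.c1 "$pid" 10.10.10.12/24 would print the four commands. Several containers can each get their own child this way while ib0 itself stays usable on the host.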