InfiniBand Connection Model

I cannot figure out what the model of a "connection" is with InfiniBand.

Specifically, I'm looking to do RDMA transfers. The eventual goal is RDMA write with Immediate, but I'm starting with just an RDMA transfer.

If it's to be likened to an IP connection, you initiate a connection, issue commands on that connection, then end the connection.

If it's to be likened to an HTTP query/response, you perform a self-contained request and that's that.

Each of these has a fairly distinct API, but I can't find either pattern in the InfiniBand APIs.

For example, when I construct the structures necessary for an RDMA transfer, I give it my address information and the vaddr/rkey of the remote memory... but nowhere can I find where to give it the address information of the target host interface.

Nearly every example I've seen has an awful collection of C calls and complicated structure (OO, people?) and, furthermore, they either use the IBConnectionManager or use sockets to pass the other information, further clouding the base of the API. Nobody seems to have a clear description of what is actually necessary to perform RDMA-Write or RDMA-Write-With-Immediate.

So: How do I do this?


Infiniband, Mellanox and Open Fabrics Enterprise Distribution, OFED, support connection models similar to TCP/IP and UDP/IP.

There are the connected protocols like TCP/IP:

  • Reliable Connection, RC
  • Unreliable Connection, UC, connected, but not guaranteed to be reliable

There is an unconnected protocol like UDP:

  • Unreliable Datagram, UD

With UD you can do many things like UDP, including broadcast, multicast, and writing each buffer to a different host.

To do Remote Direct Memory Access, RDMA, writes, you use one of the connected protocols. RDMA writes are different from what TCP/IP provides, although many people run RDMA over Ethernet via RDMA over Converged Ethernet, RoCE (pronounced Rocky), iWARP, Soft RoCE and others. An RDMA write appears to write directly into the remote computer's memory.

RDMA can write into remote hosts, GPUs and storage devices at high speed, 100 or 200 Gbit/s per connection, and you can combine connections. It does this without CPU action on the receiving side.

The APIs, the Verbs in OFED terminology, are tedious. An RDMA "Hello World" program is about 600 lines of code. Part of this complexity is because you are establishing security to write directly into the RAM on another computer and this security and memory management has to be handled in conjunction with the Operating System.
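To give a feel for what that security amounts to, here is a toy model in plain C. It is not the Verbs API: the names toy_mr and toy_rdma_write are made up for illustration, and the checks shown (matching rkey, remote-write permission, bounds) are my simplified reading of what the adapter enforces. In real code the region would come from ibv_reg_mr with IBV_ACCESS_REMOTE_WRITE set, and the adapter does the checking, not your program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a registered memory region (hypothetical, not ibv_mr). */
struct toy_mr {
    uint8_t *base;       /* start of the registered buffer */
    size_t   length;     /* size of the registered buffer in bytes */
    uint32_t rkey;       /* key the remote side must present */
    bool     remote_write; /* was remote write access granted? */
};

/* Simulate an incoming RDMA write: the writer names a virtual address and
 * an rkey; the data lands only if the key matches, remote write is allowed,
 * and the whole range lies inside the registered region. */
bool toy_rdma_write(const struct toy_mr *mr, uint64_t vaddr, uint32_t rkey,
                    const void *src, size_t len)
{
    assert(mr != NULL && src != NULL);

    uint64_t start = (uint64_t)(uintptr_t)mr->base;

    if (!mr->remote_write || rkey != mr->rkey)
        return false;                     /* permission or key mismatch */
    if (vaddr < start || len > mr->length)
        return false;                     /* starts before region or too long */
    if (vaddr - start > mr->length - len)
        return false;                     /* would run past the end */

    memcpy(mr->base + (vaddr - start), src, len);
    return true;
}
```

Note that nothing on the "receiving" side runs any code of its own here, which is the point: the writer needs only the vaddr and rkey, and those have to be handed over somehow before the first write.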

The overall sketch for each side is:

  • You create a Protection Domain, PD, that you will put resources under.
  • You create Memory Regions, MRs, for buffers, queues and resources.
  • You create Completion Queues, CQs (pronounced Cookies), to be selectively notified when things happen, such as "I have received a buffer" or "the buffer has been sent" or "the RDMA operation is complete".
  • You create Queue Pairs, QPs. In a connected protocol, RC or UC, this is your tunnel to the other side.
  • You transition the QPs through Initialization, Ready to Receive and finally Ready to Send.
  • You create Work Queue Entries, WQEs (pronounced Wookies), that specify the buffers to transfer or places to put received data for non-RDMA transfers.
  • Now you can send and receive data.
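The ordering of the QP transitions is also where the question about the target address gets answered. With libibverbs each transition is an ibv_modify_qp call with attr.qp_state set to IBV_QPS_INIT, IBV_QPS_RTR and IBV_QPS_RTS in turn, and the peer's LID and QP number (ah_attr.dlid and dest_qp_num) are supplied at the transition to Ready to Receive. The sketch below is only a toy model of that ordering in plain C so it can run without an adapter; toy_qp and its functions are hypothetical names, and it ignores the error and reset paths a real QP has.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The forward path of the QP state machine, in order. */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS };

/* Toy queue pair (hypothetical, not ibv_qp). */
struct toy_qp {
    enum qp_state state;
    uint16_t remote_lid;  /* peer port identifier, known only from RTR on */
    uint32_t remote_qpn;  /* peer queue pair number, known only from RTR on */
};

/* Move exactly one step forward; skipping or repeating a state fails. */
bool toy_qp_advance(struct toy_qp *qp, enum qp_state next)
{
    assert(qp != NULL);
    if ((int)next != (int)qp->state + 1)
        return false;
    qp->state = next;
    return true;
}

/* INIT -> RTR is the step that binds this QP to one specific peer. */
bool toy_qp_to_rtr(struct toy_qp *qp, uint16_t lid, uint32_t qpn)
{
    if (!toy_qp_advance(qp, QP_RTR))
        return false;
    qp->remote_lid = lid;
    qp->remote_qpn = qpn;
    return true;
}

/* Send-side work (including RDMA writes) is only accepted in RTS. */
bool toy_qp_can_post_send(const struct toy_qp *qp)
{
    assert(qp != NULL);
    return qp->state == QP_RTS;
}
```

So for RC or UC the work request itself carries only the remote vaddr and rkey, because the destination was already fixed when the QP was connected.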

At each of these steps there are contexts and structures and flags to fill out.

When I was beginning to learn how to write Infiniband RDMA code, I used this site to write my first program: Infiniband: An Introduction...

There is also a fantastic blog by Dotan Barak called RDMAmojo. You will find Dotan's name in many of the manual pages for the Infiniband Verbs on Linux.

There have been a number of attempts to simplify the Verbs APIs, both Infiniband Verbs, IBV, and the RDMA Verbs. So far none have really taken hold.

As to where to put the address information of the target host interface: somehow you have to exchange information between the two end points. There is the Infiniband Verbs Connection manager, ibv_cm, which it seems very few use. There is the RDMA Connection manager, RDMA_CM, which again takes a chunk of work to support. Finally, you can just open a TCP/IP socket, often over IP over Infiniband, IPoIB, write a message with your connection data and read a message with the connection data from the other side. Many people use this.
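For the socket approach, the message itself can be very small. Below is a minimal sketch of packing and unpacking such a message as a fixed-width hex string, in plain C with no verbs dependency. The struct conn_info, the function names and the exact wire layout are my own assumptions for illustration; the set of fields (LID, QP number, starting packet sequence number, rkey, buffer address) is what an RC RDMA write typically needs from the other side, and a RoCE setup would also need a GID, which is left out here.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* What one side tells the other (hypothetical layout, not a library type). */
struct conn_info {
    uint16_t lid;    /* local identifier of the port */
    uint32_t qpn;    /* queue pair number, 24 bits used */
    uint32_t psn;    /* initial packet sequence number, 24 bits used */
    uint32_t rkey;   /* remote key of the registered memory region */
    uint64_t vaddr;  /* address of the buffer the peer may write into */
};

/* "llll:qqqqqq:pppppp:rrrrrrrr:vvvvvvvvvvvvvvvv" = 44 characters. */
#define CONN_INFO_WIRE_LEN 44

/* Format info into buf as a NUL-terminated string. Returns 0 on success,
 * -1 if buf is too small. */
int conn_info_pack(const struct conn_info *info, char *buf, size_t len)
{
    assert(info != NULL && buf != NULL);
    int n = snprintf(buf, len,
                     "%04" PRIx16 ":%06" PRIx32 ":%06" PRIx32
                     ":%08" PRIx32 ":%016" PRIx64,
                     info->lid, info->qpn & 0xffffffu, info->psn & 0xffffffu,
                     info->rkey, info->vaddr);
    return (n == CONN_INFO_WIRE_LEN && (size_t)n < len) ? 0 : -1;
}

/* Parse a string produced by conn_info_pack. Returns 0 on success,
 * -1 if the string has the wrong length or does not parse. */
int conn_info_unpack(const char *buf, struct conn_info *info)
{
    assert(buf != NULL && info != NULL);
    if (strlen(buf) != CONN_INFO_WIRE_LEN)
        return -1;
    int n = sscanf(buf,
                   "%4" SCNx16 ":%6" SCNx32 ":%6" SCNx32
                   ":%8" SCNx32 ":%16" SCNx64,
                   &info->lid, &info->qpn, &info->psn,
                   &info->rkey, &info->vaddr);
    return n == 5 ? 0 : -1;
}
```

Each side would write its own packed string to the socket and read the peer's, then feed the peer's LID, QP number and PSN into the QP setup and keep the rkey and vaddr for the RDMA write work requests. A text format is used here only because it is easy to eyeball while debugging; a packed binary struct in network byte order would do the same job.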