Set up simple Infiniband Block Storage (SRP or iSER)

Well, to be frank, I went the simple route and happily used iSCSI over IP over InfiniBand (IPoIB); it was easy to set up and performed well:

Ultra-quick InfiniBand IP (IPoIB) setup primer.

first...

  • install opensm, infiniband-diags, rds-tools, sdpnetstat, srptools, perftest (for benchmarks)
  • load the IB driver modules: your HCA driver, ib_umad, and ib_ipoib (see the sketch below)
  • now you have a new network interface to configure.
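
On a Debian-style system this boils down to something like the following (the mlx4_ib HCA driver and the 192.168.1.x addressing are assumptions, adjust for your hardware and distribution):

# install the userspace tools (Debian/Ubuntu-style package names)
apt-get install opensm infiniband-diags rds-tools sdpnetstat srptools perftest

# load the HCA driver (mlx4_ib here is an assumption -- use your card's driver),
# plus the userspace MAD and IPoIB modules
modprobe mlx4_ib
modprobe ib_umad
modprobe ib_ipoib

# run opensm on at least one node unless a switch provides the subnet manager

# the new IPoIB interface (usually ib0) is configured like any other NIC
ip addr add 192.168.1.2/24 dev ib0
ip link set ib0 up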

performance settings:

  • connected mode: set the MTU to 65520 (see the commands below)
  • datagram mode: set the MTU to 2044
  • datagram mode performance: ~ 5 Gb/s
  • connected mode performance: ~ 6.3 Gb/s
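
For example, assuming the interface is ib0, switching mode and MTU looks like this:

# connected mode with the large MTU
echo connected > /sys/class/net/ib0/mode
ip link set dev ib0 mtu 65520

# or datagram mode with the smaller MTU
echo datagram > /sys/class/net/ib0/mode
ip link set dev ib0 mtu 2044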

YMMV with the IB controller model, driver, etc.

IP settings:

net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.core.netdev_max_backlog=250000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=16777216
net.ipv4.tcp_mem="16777216 16777216 16777216"
net.ipv4.tcp_rmem="4096 87380 16777216"
net.ipv4.tcp_wmem="4096 65536 16777216"
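
To make the settings persistent, put them in a sysctl configuration file and load it (the file name below is just an example):

# save the settings above to e.g. /etc/sysctl.d/90-ipoib.conf, then load them
sysctl -p /etc/sysctl.d/90-ipoib.conf

# or set individual values on the fly
sysctl -w net.ipv4.tcp_sack=0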

Some documentation:

http://support.systemfabricworks.com/lxr/#ofed+OFED-1.5/ofed-docs-1.5/ipoib_release_notes.txt

http://www.mjmwired.net/kernel/Documentation/infiniband/ipoib.txt

iperf, 4 threads:

[  3] local 192.168.1.2 port 50585 connected with 192.168.1.3 port 5003
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.75 GBytes  2.36 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.79 GBytes  2.40 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  3.31 GBytes  2.84 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  3.51 GBytes  3.02 Gbits/sec

Total aggregate bandwidth is about 1.3 GB/s (~10.6 Gb/s), definitely better than 10 GigE.
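
For reference, a run like the one above can be reproduced with iperf (iperf2 syntax; the port comes from the output above, the rest is an assumption):

# on the receiving host
iperf -s -p 5003

# on the sending host: 4 parallel TCP streams over the IPoIB addresses
iperf -c 192.168.1.3 -p 5003 -P 4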


I recently configured an SRP target/initiator pair on Linux and saw a ~100% performance increase (580 MB/s on 10 Gb/s SDR) compared to a traditional iSCSI-over-IPoIB configuration (300 MB/s on SDR).

Setup:

  • Distribution: Debian sid
  • Linux kernel: 3.4.0-rc1 (3.3 or above is required for in-kernel SRP)
  • Infiniband stack: OFED-1.4 (which comes with Debian)
  • SRP/iSCSI target: Linux-iSCSI with in-kernel ib_srpt.ko
  • SRP initiator: in-kernel ib_srp.ko

NOTE: AFAIK, SCST is now obsolete, as the Linux kernel is going with Linux-iSCSI (LIO), which obsoletes STGT (the previous in-kernel implementation) as well. The plan is to merge SCST features into LIO.
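
Before configuring anything, the kernel modules listed above need to be loaded on each side; a minimal sketch:

# target host (LIO side)
modprobe ib_srpt

# initiator host (ib_umad is also needed for ibsrpdm further down)
modprobe ib_srp
modprobe ib_umad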

InfiniBand Configuration:

  • set IB card to "connected" mode (echo connected > /sys/class/net/ib0/mode)
  • configure sysctl parameters (same as above post)
  • set MTU to maximum (ip link set dev ib0 mtu 65520)
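
The GUIDs/GIDs used in the worklog below can be checked beforehand on each host (the device name mlx4_0 and port 1 are assumptions; mlx4_0 matches the initiator path used later):

# look for the "default gid" line of the port in use
ibstatus

# the same information is available through sysfs
cat /sys/class/infiniband/mlx4_0/ports/1/gids/0
cat /sys/class/infiniband/mlx4_0/node_guid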

SRP Configuration: This one is somewhat confusing to figure out, so I'll just paste from my worklog.

=== SRP target configuration ===
// NOTE: This is the GUID of your IB interface on the target side. You can check it with ibstatus(1)
# targetcli
/> cd /ib_srpt
/ib_srpt> create 0xfe800000000000000008f1040399d85a
Created target 0xfe800000000000000008f1040399d85a.
Entering new node /ib_srpt/0xfe800000000000000008f1040399d85a
/ib_srpt/0xfe...8f1040399d85a> cd luns
// This is just a dm-zero mapped "/dev/zero"-like block device
/ib_srpt/0xfe...0399d85a/luns> create /backstores/iblock/zero
/ib_srpt/0xfe...85a/luns/lun0> cd ../../acls
// This is the GUID of your IB interface on the initiator side
/ib_srpt/0xfe...0399d85a/acls> create 0x00000000000000000008f1040399d832

In the above (actual) example, the GUID varies between the 0xfe80... style and the 0x0000... style, but I think both can be used interchangeably. You can configure the canonicalization rule by editing /var/target/fabric/ib_srpt.spec (or wherever the Python rtslib library, which the Linux-iSCSI tool uses, is installed).
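
If the spec file is not under /var/target/fabric, the rtslib install location can be found like this (assuming rtslib is importable by the system Python):

# print where the rtslib package is installed
python -c 'import rtslib, os; print(os.path.dirname(rtslib.__file__))'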

=== SRP initiator configuration ===
// The uMAD device must match the IB interface being used
# ibsrpdm -c -d /dev/infiniband/umad1
id_ext=0008f1040399d858,ioc_guid=0008f1040399d858,dgid=fe800000000000000008f1040399d85a,pkey=ffff,service_id=0008f1040399d858
// Supply the above string to ib_srp.ko in order to set up the SRP connection
# for i in $(ibsrpdm -c -d /dev/infiniband/umad1); \
do echo $i > /sys/class/infiniband_srp/srp-mlx4_0-2/add_target; done

If everything went successfully, you will see messages similar to the ones below in your dmesg:

[10713.616495] scsi host9: ib_srp: new target: id_ext 0008f1040399d858 ioc_guid 0008f1040399d858 pkey ffff service_id 0008f1040399d858 dgid fe80:0000:0000:0000:0008:f104:0399:d85a
[10713.815843] scsi9 : SRP.T10:0008F1040399D858
[10713.891557] scsi 9:0:0:0: Direct-Access     LIO-ORG  IBLOCK 4.0 PQ: 0 ANSI: 5
[10713.988846] sd 9:0:0:0: [sde] 2147483648 512-byte logical blocks: (1.09 TB/1.00 TiB)
...
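
At this point the SRP LUN is an ordinary SCSI disk and can be inspected like any other (lsscsi must be installed; the sde name is taken from the dmesg output above):

# list SCSI devices -- the LIO-ORG IBLOCK disk from host9 should show up
lsscsi

# and the block device itself
grep sde /proc/partitions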

As a final note, both ib_srp.ko and ib_srpt.ko are still somewhat immature. They both work fine, but features like disconnection seem unimplemented, so once a SCSI block device is attached, there is no way to detach it. However, their performance is excellent.


Stability makes the difference. Mellanox primarily cares about performance, as they sell hardware. Since they bought Voltaire, they have been pushing iSER because of their IB-to-Ethernet gateways.

We at ProfitBricks used iSER with Solaris 11 as the target for our IaaS 2.0 cloud. But after hitting major ZFS performance issues as well as IPoIB and open-iscsi stability problems, we switched to a Linux storage based on SCST and SRP. We help improve this technology on the linux-rdma mailing list and with our own ib_srp patches. For us, stability requires simplicity, so we are going with SRP since we have InfiniBand: RDMA is native to InfiniBand, and SRP is RDMA-only.

I gave a presentation at LinuxTag this year on this topic: InfiniBand/RDMA for Storage - SRP vs. iSER: http://www.slideshare.net/SebastianRiemer/infini-band-rdmaforstoragesrpvsiser-21791250

It also shows how to establish an SRP connection.