Very slow write speed (NVME drive) on 10G network

While setting up an experimental lab cluster, I found that the write speed for data received over a 10G fiber connection is about 10% of the local write speed.

Testing transfer speed between two identical machines: iperf3 shows a good memory-to-memory speed of 9.43 Gbits/s, and disk(read)-to-memory transfers reach 9.35 Gbits/s:

test@rbox1:~$ iperf3 -s -B 10.0.0.21

test@rbox3:~$ iperf3 -c 10.0.0.21 -F /mnt/k8s/test.3g 
Connecting to host 10.0.0.21, port 5201
        Sent 9.00 GByte / 9.00 GByte (100%) of /mnt/k8s/test.3g
[  5]   0.00-8.26   sec  9.00 GBytes  9.35 Gbits/sec

But sending data over 10G and writing to disk on the other machine is an order of magnitude slower:

test@rbox1:~$ iperf3 -s -B 10.0.0.21 -F /tmp/foo

test@rbox3:~$ iperf3 -c 10.0.0.21
Connecting to host 10.0.0.21, port 5201
[  5] local 10.0.0.23 port 39970 connected to 10.0.0.21 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   103 MBytes   864 Mbits/sec    0    428 KBytes       
[  5]   1.00-2.00   sec   100 MBytes   842 Mbits/sec    0    428 KBytes       
[  5]   2.00-3.00   sec  98.6 MBytes   827 Mbits/sec    0    428 KBytes       
[  5]   3.00-4.00   sec  99.3 MBytes   833 Mbits/sec    0    428 KBytes       
[  5]   4.00-5.00   sec  91.5 MBytes   768 Mbits/sec    0    428 KBytes       
[  5]   5.00-6.00   sec  94.4 MBytes   792 Mbits/sec    0    428 KBytes       
[  5]   6.00-7.00   sec  98.1 MBytes   823 Mbits/sec    0    428 KBytes       
[  5]   7.00-8.00   sec  91.2 MBytes   765 Mbits/sec    0    428 KBytes       
[  5]   8.00-9.00   sec  91.0 MBytes   764 Mbits/sec    0    428 KBytes       
[  5]   9.00-10.00  sec  91.5 MBytes   767 Mbits/sec    0    428 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   959 MBytes   804 Mbits/sec    0             sender
        Sent  959 MByte / 9.00 GByte (10%) of /mnt/k8s/test.3g
[  5]   0.00-10.00  sec   953 MBytes   799 Mbits/sec                  receiver

The NVMe drive is capable of much faster local writes (detailed dd and fio measurements are below): for a single process with 4k/8k/10M blocks, fio random-write speeds are 330/500/1300 MB/s.

I am trying to achieve write speeds close to the actual local write speed of the NVMe drive (to spell the assumption out: I expect to reach very similar speeds writing to a single NVMe drive over the network, but I can't even get 20% of it).

At this point I'm completely stumped and not seeing what else to try, other than a different kernel/OS. Any pointers, corrections and help would be much appreciated.


And here are some measurements/info/results:

What I tried so far:

  • jumbo frames (MTU 9000) on both machines, verified working with ping -M do -s 8972 (see the sketch after this list)

  • eliminated possible interference from the network switch: I connected the two machines directly via a 2m duplex OM3 multimode fiber cable (the cables and transceivers are identical on all machines) and bound the iperf3 server/client to these interfaces. Results are the same (slow).

  • disconnected all other ethernet/fiber cables for the duration of the tests (to eliminate routing problems) - no change.

  • updated the firmware of the motherboard and the fiber NIC (again, no change). I have not updated the NVMe firmware - it seems to be the latest already.

  • even tried moving the 10G card from PCIe slot 1 to the next available slot, wondering if the NVMe and the 10G NIC were sharing and maxing out physical hub/PCIe lane bandwidth (again, no measurable change).
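
For reference, a minimal sketch of the jumbo-frame setup/verification (interface name and peer IP are from this setup; the exact ip invocation is an assumption -- the MTU may equally be set via netplan or the distro's network config):

sudo ip link set dev enp35s0f0np0 mtu 9000    # on both machines
ping -M do -s 8972 10.0.0.21                  # 8972 bytes payload + 28 bytes headers = 9000; must not fragment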

Discovered some 'interesting' behaviour:

  1. increasing the number of parallel clients increases bandwidth utilization; with 1 client, the target machine writes at 900 Mbits/sec, with 4 clients at 1.26 Gbits/sec (more than 4 parallel clients has a detrimental impact) -- see the iperf3 sketch after this list
  2. testing writes on a more powerful machine with an AMD Ryzen 5 3600X and 64GB RAM (identical NVMe drive + 10G NIC) -- 1 client can reach up to 1.53 Gbits/sec, 4 clients 2.15 Gbits/sec (and 8 clients 2.13 Gbits/sec). The traffic in this case flows through a Mikrotik CRS309 switch and the MTU is 1500; the more powerful machine seems to receive/write faster
  3. there is no noticeable CPU load during the tests -- this applies to both the smaller and the larger machine; roughly 26% on two cores at most
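
A rough sketch of the single- vs multi-client runs (the multi-client numbers above came from separate client invocations; -P 4 with parallel streams from one client is shown only as an approximation):

iperf3 -c 10.0.0.21          # 1 client; the server side runs e.g. iperf3 -s -B 10.0.0.21 -F /tmp/foo as above
iperf3 -c 10.0.0.21 -P 4     # 4 parallel streams, approximating the 4-client case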

Edit 06/07:

Following @shodanshok's comments, I mounted the remote machine over NFS; here are the results:

nfs exports: /mnt/nfs *(rw,no_subtree_check,async,insecure,no_root_squash,fsid=0)

cat /etc/mtab | grep nfs

10.0.0.21:/mnt/nfs /mnt/nfs1 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.21,mountvers=3,mountport=52335,mountproto=udp,local_lock=none,addr=10.0.0.21 0 0
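
For reference, the mount was along these lines (export path, server IP and vers=3 are taken from the mtab entry above; everything else here is an assumed default):

# on rbox1 (NFS server), after adding the export line above to /etc/exports
sudo exportfs -ra
# on rbox3 (client)
sudo mount -t nfs -o vers=3 10.0.0.21:/mnt/nfs /mnt/nfs1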

fio --name=random-write --ioengine=libaio --rw=randwrite --bs=$SIZE --numjobs=1 --iodepth=1 --runtime=30 --end_fsync=1 --size=3g

dd if=/dev/zero of=/mnt/nfs1/test bs=$SIZE count=$(3*1024/$SIZE)

|            | fio (bs=4k) | fio (bs=8k) | fio (bs=1M) | dd (bs=4k) | dd (bs=1M) |
|------------|-------------|-------------|-------------|------------|------------|
| nfs (udp)  | 153         | 210         | 984         | 907        | 962        |
| nfs (tcp)  | 157         | 205         | 947         | 946        | 916        |

All those measurements are in MB/s. I'm satisfied that 1M blocks get very close to the theoretical speed limit of the 10G connection (10 Gbit/s is 1.25 GB/s raw, so ~950-980 MB/s is close once protocol overhead is taken into account).

Looks like iperf3 -F ... is not the way to test network write speeds, but I'll try to get the iperf3 devs' take on it as well.


Details of the setup:

Each machine has an AMD Ryzen 3 3200G with 8GB RAM and an MPG X570 GAMING PLUS (MS-7C37) motherboard, a 1TB NVMe drive (consumer-grade WD Blue SN550, WDS100T2B0C) in the M.2 slot closest to the CPU, and one Solarflare S7120 10G fiber card in a PCIe slot.

NVME disk info:

test@rbox1:~$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     21062Y803544         WDC WDS100T2B0C-00PXH0                   1           1.00  TB /   1.00  TB    512   B +  0 B   211210WD

NVME disk write speed (4k/8k/10M)

test@rbox1:~$ dd if=/dev/zero of=/tmp/temp.bin bs=4k count=1000
1000+0 records in
1000+0 records out
4096000 bytes (4.1 MB, 3.9 MiB) copied, 0.00599554 s, 683 MB/s


test@rbox1:~$ dd if=/dev/zero of=/tmp/temp.bin bs=8k count=1000
1000+0 records in
1000+0 records out
8192000 bytes (8.2 MB, 7.8 MiB) copied, 0.00616639 s, 1.3 GB/s


test@rbox1:~$ dd if=/dev/zero of=/tmp/temp.bin bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 7.00594 s, 1.5 GB/s
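
Note that the 4k/8k dd runs above write only a few MB, so they largely measure the page cache rather than the drive; variants that force data to the device (same target path, the extra flags are my addition for comparison) would look like:

dd if=/dev/zero of=/tmp/temp.bin bs=10M count=1000 oflag=direct      # bypass the page cache
dd if=/dev/zero of=/tmp/temp.bin bs=10M count=1000 conv=fdatasync    # flush to disk before reporting the rate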

Testing random write speed with fio-3.16:

test@rbox1:~$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=30 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1

Run status group 0 (all jobs):
  WRITE: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=9447MiB (9906MB), run=30174-30174msec

Disk stats (read/write):
    dm-0: ios=2/969519, merge=0/0, ticks=0/797424, in_queue=797424, util=21.42%, aggrios=2/973290, aggrmerge=0/557, aggrticks=0/803892, aggrin_queue=803987, aggrutil=21.54%
  nvme0n1: ios=2/973290, merge=0/557, ticks=0/803892, in_queue=803987, util=21.54%



test@rbox1:~$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=8k --numjobs=1 --iodepth=1 --runtime=30 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=posixaio, iodepth=1

Run status group 0 (all jobs):
  WRITE: bw=491MiB/s (515MB/s), 491MiB/s-491MiB/s (515MB/s-515MB/s), io=14.5GiB (15.6GB), run=30213-30213msec

Disk stats (read/write):
    dm-0: ios=1/662888, merge=0/0, ticks=0/1523644, in_queue=1523644, util=29.93%, aggrios=1/669483, aggrmerge=0/600, aggrticks=0/1556439, aggrin_queue=1556482, aggrutil=30.10%
  nvme0n1: ios=1/669483, merge=0/600, ticks=0/1556439, in_queue=1556482, util=30.10%



test@rbox1:~$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=10m --numjobs=1 --iodepth=1 --runtime=30 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 10.0MiB-10.0MiB, (W) 10.0MiB-10.0MiB, (T) 10.0MiB-10.0MiB, ioengine=posixaio, iodepth=1

Run status group 0 (all jobs):
  WRITE: bw=1250MiB/s (1310MB/s), 1250MiB/s-1250MiB/s (1310MB/s-1310MB/s), io=36.9GiB (39.6GB), run=30207-30207msec

Disk stats (read/write):
    dm-0: ios=9/14503, merge=0/0, ticks=0/540252, in_queue=540252, util=68.96%, aggrios=9/81551, aggrmerge=0/610, aggrticks=5/3420226, aggrin_queue=3420279, aggrutil=69.16%
  nvme0n1: ios=9/81551, merge=0/610, ticks=5/3420226, in_queue=3420279, util=69.16%

Kernel:

test@rbox1:~$ uname -a
Linux rbox1 5.8.0-55-generic #62-Ubuntu SMP Tue Jun 1 08:21:18 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Fiber NIC:

test@rbox1:~$ sudo sfupdate 
Solarflare firmware update utility [v8.2.2]
Copyright 2002-2020 Xilinx, Inc. 
Loading firmware images from /usr/share/sfutils/sfupdate_images

enp35s0f0np0 - MAC: 00-0F-53-3B-7D-D0
    Firmware version:   v8.0.1
    Controller type:    Solarflare SFC9100 family
    Controller version: v6.2.7.1001
    Boot ROM version:   v5.2.2.1006

The Boot ROM firmware is up to date
The controller firmware is up to date

Fiber NICs initialized and MTU set:

test@rbox1:~$ sudo dmesg | grep sf
[    0.210521] ACPI: 10 ACPI AML tables successfully acquired and loaded
[    1.822946] sfc 0000:23:00.0 (unnamed net_device) (uninitialized): Solarflare NIC detected
[    1.824954] sfc 0000:23:00.0 (unnamed net_device) (uninitialized): Part Number : SFN7x22F
[    1.825434] sfc 0000:23:00.0 (unnamed net_device) (uninitialized): no PTP support
[    1.958282] sfc 0000:23:00.1 (unnamed net_device) (uninitialized): Solarflare NIC detected
[    2.015966] sfc 0000:23:00.1 (unnamed net_device) (uninitialized): Part Number : SFN7x22F
[    2.031379] sfc 0000:23:00.1 (unnamed net_device) (uninitialized): no PTP support
[    2.112729] sfc 0000:23:00.0 enp35s0f0np0: renamed from eth0
[    2.220517] sfc 0000:23:00.1 enp35s0f1np1: renamed from eth1
[    3.494367] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 1748.247082] sfc 0000:23:00.0 enp35s0f0np0: link up at 10000Mbps full-duplex (MTU 1500)
[ 1809.625958] sfc 0000:23:00.1 enp35s0f1np1: link up at 10000Mbps full-duplex (MTU 9000)

Motherboard ID:

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer: Micro-Star International Co., Ltd.
    Product Name: MS-7C37
    Version: 2.0

Additional HW info (mostly to list physical connections - bridges)

test@rbox1:~$ hwinfo --short
cpu:                                                            
                       AMD Ryzen 3 3200G with Radeon Vega Graphics, 1500 MHz
                       AMD Ryzen 3 3200G with Radeon Vega Graphics, 1775 MHz
                       AMD Ryzen 3 3200G with Radeon Vega Graphics, 1266 MHz
                       AMD Ryzen 3 3200G with Radeon Vega Graphics, 2505 MHz
storage:
                       ASMedia ASM1062 Serial ATA Controller
                       Sandisk Non-Volatile memory controller
                       AMD FCH SATA Controller [AHCI mode]
                       AMD FCH SATA Controller [AHCI mode]
network:
  enp35s0f1np1         Solarflare SFN7x22F-R3 Flareon Ultra 7000 Series 10G Adapter
  enp35s0f0np0         Solarflare SFN7x22F-R3 Flareon Ultra 7000 Series 10G Adapter
  enp39s0              Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
network interface:
  br-0d1e233aeb68      Ethernet network interface
  docker0              Ethernet network interface
  vxlan.calico         Ethernet network interface
  veth0ef4ac4          Ethernet network interface
  enp35s0f0np0         Ethernet network interface
  enp35s0f1np1         Ethernet network interface
  lo                   Loopback network interface
  enp39s0              Ethernet network interface
disk:
  /dev/nvme0n1         Sandisk Disk
  /dev/sda             WDC WD5000AAKS-4
partition:
  /dev/nvme0n1p1       Partition
  /dev/nvme0n1p2       Partition
  /dev/nvme0n1p3       Partition
  /dev/sda1            Partition
bridge:
                       AMD Matisse Switch Upstream
                       AMD Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
                       AMD Raven/Raven2 Device 24: Function 3
                       AMD Raven/Raven2 PCIe GPP Bridge [6:0]
                       AMD Matisse PCIe GPP Bridge
                       AMD Raven/Raven2 Device 24: Function 1
                       AMD Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
                       AMD FCH LPC Bridge
                       AMD Matisse PCIe GPP Bridge
                       AMD Matisse PCIe GPP Bridge
                       AMD Raven/Raven2 Device 24: Function 6
                       AMD Matisse PCIe GPP Bridge
                       AMD Raven/Raven2 Root Complex
                       AMD Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus A
                       AMD Raven/Raven2 Device 24: Function 4
                       AMD Matisse PCIe GPP Bridge
                       AMD Raven/Raven2 Device 24: Function 2
                       AMD Matisse PCIe GPP Bridge
                       AMD Raven/Raven2 Device 24: Function 0
                       AMD Raven/Raven2 Device 24: Function 7
                       AMD Raven/Raven2 PCIe GPP Bridge [6:0]
                       AMD Raven/Raven2 Device 24: Function 5

This answer was inspired by a comment from @shodanshok; I can't upvote his contribution as a comment, so I'm posting an answer instead.

Edit 2021/06/09 - iperf3 developers identified a possible problem; newer releases of the package may have a different behaviour, YMMV. See: https://github.com/esnet/iperf/issues/1159

Originally, I was using iperf3 -F ... to measure write speed over the network (to verify the 10G fiber connection). However, it produced much slower results than writing data over NFS (generated with the fio benchmark).

This was very puzzling because rsync was also well below 100 MB/s, and even taking encryption/decryption into account, it shouldn't be that slow over 10G fiber. So I kept digging in the wrong direction.

The measurements below show that the 10G network with a (single) NVMe disk is capable of going over 900 MB/s, with spare CPU capacity.

In my setup I'm using logical volumes (LVM), and it is curious that the LVM stats are not aligned with the stats of the NVMe partition; this is the only partition on the system -- so it would be interesting to see what happens without LVM.
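
To compare the LVM device with the underlying NVMe device directly, something along these lines can be run while the benchmark is in progress (device names are from my setup):

iostat -x nvme0n1 dm-0 1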

nfs exports: /mnt/nfs *(rw,no_subtree_check,async,insecure,no_root_squash,fsid=0)

cat /etc/mtab | grep nfs

10.0.0.21:/mnt/nfs /mnt/nfs1 nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.21,mountvers=3,mountport=52335,mountproto=udp,local_lock=none,addr=10.0.0.21 0 0

Commands used to produce the measurements below:

`fio --name=random-write --ioengine=libaio --rw=randwrite --bs=$SIZE --numjobs=1 --iodepth=1 --runtime=30 --end_fsync=1 --size=3g`

`dd if=/dev/zero of=/mnt/nfs1/test bs=$SIZE count=$(3*1024/$SIZE)`


|    MB/s    | fio (bs=4k)    | fio (bs=8k)    | fio (bs=1M)   | dd (bs=4k)    | dd (bs=1M) |
|------------|----------------|----------------|---------------|---------------|------------|
|nfs (udp)   |  153           |    210         |    984        |    907        |    962     |
|nfs (tcp)   |  157           |    205         |    947        |    946        |    916     |

iostat plots

From:

`fio --name=random-write --ioengine=libaio --rw=randwrite --bs=1m --numjobs=1 --iodepth=1 --runtime=30 --end_fsync=1 --size=20g`

[Two sets of iostat plots side by side: local write speed vs NFS write speed over the 10G fiber link, each with CPU, nvme and lvm utilization panels.]

Using my 2 servers connected over 10Gbit Ethernet and your way of testing, the NVMe drives I use also appear slower than I expected when using iperf3. I think you might see the disks are actually 100% busy in iostat or atop.

Running:

dd if=/dev/urandom of=/home/randomfile bs=1M count=10240
iperf3 -s -F /home/randomfile

# in a different session:
iostat -x 1

On the iperf3 server side, iostat shows for the disks:

Device  %util
dm-0    0.00
dm-1    100.00
md0     0.00
nvme0n1 100.00
nvme1n1 100.00

And iperf:

[ ID] Interval           Transfer     Bitrate
        Sent 4.04 GByte / 4.04 GByte (100%) of /home/randomfile
[  5]   0.00-10.00  sec  4.04 GBytes  3.47 Gbits/sec                  receiver

Running it in reverse with the -R flag (reading from the file instead of writing to it):

iperf3 -c server1 -R

Disk on iperf3 server side:

Device  %util
dm-0    0.00
dm-1    0.00
md0     0.00
nvme0n1 0.40
nvme1n1 0.40

And iperf:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.9 GBytes  9.39 Gbits/sec   56             sender
[  5]   0.00-10.00  sec  10.9 GBytes  9.38 Gbits/sec                  receiver

Try your iperf again in reverse. It's probably not network related.

Added:

When moving the file to a ramdisk, you can see that the network functions correctly:

mount -t ramfs -o size=11G ramfs /mnt
mv /home/randomfile /mnt/
iperf3 -s -F /mnt/randomfile
iperf3 -c server1
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-2.51   sec  2.74 GBytes  9.39 Gbits/sec   35             sender
[  5]   0.00-2.51   sec  0.00 Bytes  0.00 bits/sec                  receiver

iperf3 -c server1 -R
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.59   sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-1.59   sec  1.73 GBytes  9.38 Gbits/sec                  receiver