ZFS Poor Write Performance When Adding More Spindles

I am using ZFS on Linux and am seeing a rather strange symptom: when I add more disks to the system, the write speed of each drive drops, effectively negating the additional spindles for sequential write performance.

The disks are connected to the host via an HBA (LSI 9300-8e) in SAS disk shelves.

For the tests below I ran the following IOzone command:

iozone -i 0 -s 10000000 -r 1024 -t 10
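
For reference, the same invocation annotated with my understanding of what each flag does (paraphrased from the IOzone documentation):

# -i 0          run test 0 only (write / re-write)
# -s 10000000   file size per thread in KB, roughly 10 GB
# -r 1024       record size in KB, i.e. 1 MB sequential writes
# -t 10         throughput mode with 10 parallel threads
iozone -i 0 -s 10000000 -r 1024 -t 10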

Here are the results of my tests:

In my first test I created a pool of 12 disks (six 2-way mirror vdevs), which shows the expected write performance of around 100 MB/s to each disk.

zpool create -o ashift=12 -f PoolA mirror S1_D0 S2_D0 mirror S1_D1 S2_D1 mirror 
S1_D2 S2_D2 mirror S1_D3 S2_D3 mirror S1_D4 S2_D4 mirror S1_D5 S2_D5
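
The per-vdev tables below are zpool iostat -v output sampled while the test was running, with something along the lines of:

zpool iostat -v PoolA 5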

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
PoolA       3.60G  10.9T      0  5.06K      0   638M
  mirror     612M  1.81T      0    863      0   106M
    S1_D0       -      -      0    862      0   106M
    S2_D0       -      -      0    943      0   116M
  mirror     617M  1.81T      0    865      0   107M
    S1_D1       -      -      0    865      0   107M
    S2_D1       -      -      0    939      0   116M
  mirror     613M  1.81T      0    860      0   106M
    S1_D2       -      -      0    860      0   106M
    S2_D2       -      -      0    948      0   117M
  mirror     611M  1.81T      0    868      0   107M
    S1_D3       -      -      0    868      0   107M
    S2_D3       -      -      0  1.02K      0   129M
  mirror     617M  1.81T      0    868      0   107M
    S1_D4       -      -      0    868      0   107M
    S2_D4       -      -      0    939      0   116M
  mirror     616M  1.81T      0    856      0   106M
    S1_D5       -      -      0    856      0   106M
    S2_D5       -      -      0    939      0   116M
----------  -----  -----  -----  -----  -----  -----

In the next test I add 12 more disks, for a total of 24, and the bandwidth to each disk is effectively cut in half, while the total pool write bandwidth barely moves (638 MB/s vs. 588 MB/s).

zpool create -o ashift=12 -f PoolA mirror S1_D0 S2_D0 mirror S1_D1 S2_D1 
mirror S1_D2 S2_D2 mirror S1_D3 S2_D3 mirror S1_D4 S2_D4 
mirror S1_D5 S2_D5 mirror S1_D6 S2_D6 mirror S1_D7 S2_D7 
mirror S1_D8 S2_D8 mirror S1_D9 S2_D9 mirror S1_D10 S2_D10 
mirror S1_D11 S2_D11

                capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
PoolA        65.2M  21.7T      0  4.67K      0   588M
  mirror     6.56M  1.81T      0    399      0  49.0M
    S1_D0        -      -      0    399      0  49.0M
    S2_D0        -      -      0    513      0  63.1M
  mirror     5.71M  1.81T      0    400      0  48.7M
    S1_D1        -      -      0    400      0  48.7M
    S2_D1        -      -      0    515      0  62.6M
  mirror     6.03M  1.81T      0    396      0  49.1M
    S1_D2        -      -      0    396      0  49.1M
    S2_D2        -      -      0    509      0  62.9M
  mirror     5.89M  1.81T      0    394      0  49.0M
    S1_D3        -      -      0    394      0  49.0M
    S2_D3        -      -      0    412      0  51.3M
  mirror     5.60M  1.81T      0    400      0  49.0M
    S1_D4        -      -      0    400      0  49.0M
    S2_D4        -      -      0    511      0  62.9M
  mirror     4.65M  1.81T      0    401      0  48.9M
    S1_D5        -      -      0    401      0  48.9M
    S2_D5        -      -      0    511      0  62.3M
  mirror     5.36M  1.81T      0    397      0  49.2M
    S1_D6        -      -      0    397      0  49.2M
    S2_D6        -      -      0    506      0  62.5M
  mirror     4.88M  1.81T      0    395      0  49.2M
    S1_D7        -      -      0    395      0  49.2M
    S2_D7        -      -      0    509      0  63.3M
  mirror     5.01M  1.81T      0    393      0  48.2M
    S1_D8        -      -      0    393      0  48.2M
    S2_D8        -      -      0    513      0  63.0M
  mirror     5.00M  1.81T      0    399      0  48.7M
    S1_D9        -      -      0    399      0  48.7M
    S2_D9        -      -      0    513      0  62.5M
  mirror     5.00M  1.81T      0    398      0  49.2M
    S1_D10       -      -      0    398      0  49.2M
    S2_D10       -      -      0    509      0  62.8M
  mirror     5.55M  1.81T      0    401      0  50.0M
    S1_D11       -      -      0    401      0  50.0M
    S2_D11       -      -      0    506      0  63.1M
-----------  -----  -----  -----  -----  -----  -----

I'm hoping someone can shed some light on why adding more disks effectively cuts the per-disk write performance.

ADDITIONAL REQUESTED INFORMATION

Hardware Summary

Server

Lenovo ThinkServer RD550, single 10-core Xeon, 256 GB of RAM, OS on RAID 1 on a 720ix controller.

Server HBA

LSI 9300-8e mpt3sas_cm0: LSISAS3008: FWVersion(12.00.00.00), ChipRevision(0x02), BiosVersion(06.00.00.00)

Disk Shelves

The disk shelves are Lenovo ThinkServer SA120s with dual SAS controllers and dual power supplies, cabled in a redundant fashion with two paths to each disk.

Disk Shelf Connectivity

The disk shelves are connected via 0.5 m SAS cables and daisy-chained through the shelves, with a loop back to the controller at the end.
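
One thing worth checking from the host side (a sketch, assuming mpt3sas exposes the standard SAS transport class attributes in sysfs, which it normally does) is the negotiated link rate of every phy, to confirm all lanes actually trained at 12 Gb/s:

# print the negotiated speed of each SAS phy known to the kernel
for p in /sys/class/sas_phy/phy-*; do
    printf '%s: %s\n' "${p##*/}" "$(cat "$p/negotiated_linkrate")"
done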

Drive Information

48 x 2 TB SAS drives, Seagate model ST2000NM0023. The drives are configured through multipath, and each drive has redundant paths.
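
The path layout can be confirmed with the standard device-mapper multipath tooling; what I look at is simply:

# list every multipath device with its underlying sd paths;
# each drive should show two paths and the path-group policy in use
multipath -ll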

Software Summary

Operating System / Kernel

CentOS 7.3. Output from uname -a:

Linux 4.9.9-1.el7.elrepo.x86_64 #1 SMP Thu Feb 9 11:43:40 EST 2017 x86_64 x86_64 x86_64 GNU/Linux

ZFS Tuning

/etc/modprobe.d/zfs.conf is currently a blank file. I haven't tried much here; the sequential write performance seems like it should increase with more disks.
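
If tuning does turn out to be needed, this is the kind of thing I would try in that file (a sketch only; the values below are guesses for a sequential-write workload, not settings I have validated):

# /etc/modprobe.d/zfs.conf
# deepen the per-vdev async write queue (module default is 10)
options zfs zfs_vdev_async_write_max_active=20
# allow more dirty data in flight before the write throttle kicks in (bytes)
options zfs zfs_dirty_data_max=17179869184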


The specification for an LSI 9300-8e HBA quotes 12 Gb/s (gigabit) throughput per lane for connected storage (https://docs.broadcom.com/docs/12353459). Multiplying that per-lane figure across the card's eight lanes gives roughly 9600 MB/s of aggregate throughput.

Is there an overall I/O queue depth setting for the HBA (driver) in the OS that is throttling your I/O? This still wouldn't explain how the bandwidth is halved so accurately.

Your figures would make a lot of sense if only a single path or link were working for the SAS connection. Is there any (bizarre) way only one link out of eight could be working? I am not aware of how wide or narrow SAS 'ports' (which are virtual rather than physical objects) are assembled from 'phys', and if the HBA isn't negotiating properly with your disk shelf, is there a fallback configuration that might allow this?
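
One way to answer that question on Linux, sketched on the assumption that mpt3sas exposes the usual SAS transport class objects in sysfs, is to look at how many phys each SAS port was assembled from and how the shelves appear in the topology:

# a x4 wide port should report num_phys = 4; a narrow port reports 1
grep . /sys/class/sas_port/port-*/num_phys

# show the SAS topology (expanders and end devices) as the kernel sees it
lsscsi -t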