ZFS Poor Write Performance When Adding More Spindles
I am using ZFS on Linux and am experiencing a rather strange symptom: when I add more disks to the system, the speed at which each drive writes drops, effectively negating the additional spindles' contribution to sequential write performance.
The disks are connected to the host via an HBA (LSI 9300-8e) on SAS disk shelves.
For the tests below I ran the following IOzone command: `iozone -i 0 -s 10000000 -r 1024 -t 10`
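For reference, those IOzone flags amount to a pure sequential-write test (flag meanings per the IOzone documentation; `-s` is given in KB), so the total working set is roughly:

```shell
# iozone -i 0 -s 10000000 -r 1024 -t 10
#   -i 0        : write/rewrite test only
#   -s 10000000 : file size per thread, in KB (~9.5 GiB)
#   -r 1024     : record size in KB
#   -t 10       : throughput mode with ten parallel threads
threads=10
size_kb=10000000
total_gib=$((threads * size_kb / 1024 / 1024))
echo "total working set: ~${total_gib} GiB"   # ~95 GiB across all threads
```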
Here are the results of my tests:
In my first test I created a pool of 12 disks in mirrored pairs, which shows the expected write performance of around 100 MB/s to each disk.
```
zpool create -o ashift=12 -f PoolA \
    mirror S1_D0 S2_D0 mirror S1_D1 S2_D1 mirror S1_D2 S2_D2 \
    mirror S1_D3 S2_D3 mirror S1_D4 S2_D4 mirror S1_D5 S2_D5
```

```
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
PoolA       3.60G  10.9T      0  5.06K      0   638M
  mirror     612M  1.81T      0    863      0   106M
    S1_D0       -      -      0    862      0   106M
    S2_D0       -      -      0    943      0   116M
  mirror     617M  1.81T      0    865      0   107M
    S1_D1       -      -      0    865      0   107M
    S2_D1       -      -      0    939      0   116M
  mirror     613M  1.81T      0    860      0   106M
    S1_D2       -      -      0    860      0   106M
    S2_D2       -      -      0    948      0   117M
  mirror     611M  1.81T      0    868      0   107M
    S1_D3       -      -      0    868      0   107M
    S2_D3       -      -      0  1.02K      0   129M
  mirror     617M  1.81T      0    868      0   107M
    S1_D4       -      -      0    868      0   107M
    S2_D4       -      -      0    939      0   116M
  mirror     616M  1.81T      0    856      0   106M
    S1_D5       -      -      0    856      0   106M
    S2_D5       -      -      0    939      0   116M
----------  -----  -----  -----  -----  -----  -----
```
In the next test I add 12 more disks, for a total of 24, and the bandwidth to each disk is effectively cut in half.
```
zpool create -o ashift=12 -f PoolA \
    mirror S1_D0 S2_D0 mirror S1_D1 S2_D1 mirror S1_D2 S2_D2 \
    mirror S1_D3 S2_D3 mirror S1_D4 S2_D4 mirror S1_D5 S2_D5 \
    mirror S1_D6 S2_D6 mirror S1_D7 S2_D7 mirror S1_D8 S2_D8 \
    mirror S1_D9 S2_D9 mirror S1_D10 S2_D10 mirror S1_D11 S2_D11
```

```
               capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
PoolA        65.2M  21.7T      0  4.67K      0   588M
  mirror     6.56M  1.81T      0    399      0  49.0M
    S1_D0        -      -      0    399      0  49.0M
    S2_D0        -      -      0    513      0  63.1M
  mirror     5.71M  1.81T      0    400      0  48.7M
    S1_D1        -      -      0    400      0  48.7M
    S2_D1        -      -      0    515      0  62.6M
  mirror     6.03M  1.81T      0    396      0  49.1M
    S1_D2        -      -      0    396      0  49.1M
    S2_D2        -      -      0    509      0  62.9M
  mirror     5.89M  1.81T      0    394      0  49.0M
    S1_D3        -      -      0    394      0  49.0M
    S2_D3        -      -      0    412      0  51.3M
  mirror     5.60M  1.81T      0    400      0  49.0M
    S1_D4        -      -      0    400      0  49.0M
    S2_D4        -      -      0    511      0  62.9M
  mirror     4.65M  1.81T      0    401      0  48.9M
    S1_D5        -      -      0    401      0  48.9M
    S2_D5        -      -      0    511      0  62.3M
  mirror     5.36M  1.81T      0    397      0  49.2M
    S1_D6        -      -      0    397      0  49.2M
    S2_D6        -      -      0    506      0  62.5M
  mirror     4.88M  1.81T      0    395      0  49.2M
    S1_D7        -      -      0    395      0  49.2M
    S2_D7        -      -      0    509      0  63.3M
  mirror     5.01M  1.81T      0    393      0  48.2M
    S1_D8        -      -      0    393      0  48.2M
    S2_D8        -      -      0    513      0  63.0M
  mirror     5.00M  1.81T      0    399      0  48.7M
    S1_D9        -      -      0    399      0  48.7M
    S2_D9        -      -      0    513      0  62.5M
  mirror     5.00M  1.81T      0    398      0  49.2M
    S1_D10       -      -      0    398      0  49.2M
    S2_D10       -      -      0    509      0  62.8M
  mirror     5.55M  1.81T      0    401      0  50.0M
    S1_D11       -      -      0    401      0  50.0M
    S2_D11       -      -      0    506      0  63.1M
-----------  -----  -----  -----  -----  -----  -----
```
I am hoping someone can shed some light on why adding more disks would cut the per-disk performance like this.
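One back-of-the-envelope observation (using the per-mirror write numbers from the iostat output above): the pool-wide aggregate barely changes between the two layouts, which suggests the bottleneck sits upstream of the disks rather than in the disks themselves:

```shell
# Per-mirror write bandwidth (MB/s, rounded) and mirror counts are taken
# from the two zpool iostat outputs above.
per_vdev_12=106; vdevs_12=6
per_vdev_24=49;  vdevs_24=12
echo "12-disk pool aggregate: $((per_vdev_12 * vdevs_12)) MB/s"   # 636 MB/s
echo "24-disk pool aggregate: $((per_vdev_24 * vdevs_24)) MB/s"   # 588 MB/s
```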
ADDITIONAL REQUESTED INFORMATION
Hardware Summary
Server
Lenovo ThinkServer RD550, single 10-core Xeon, 256 GB of RAM, OS on RAID 1 on the 720ix controller.
Server HBA
LSI 9300-8e: `mpt3sas_cm0: LSISAS3008: FWVersion(12.00.00.00), ChipRevision(0x02), BiosVersion(06.00.00.00)`
Disk Shelves
Disk shelves are Lenovo ThinkServer SA120 with dual SAS controllers and dual power supplies, cabled in a redundant fashion with two paths to each disk.
Disk Shelf Connectivity
The disk shelves are connected via 0.5-meter SAS cables, daisy-chained through the shelves with a loop back to the controller at the end.
Drive Information
48 x 2 TB Seagate SAS drives, model ST2000NM0023. The drives are configured through multipath, and each drive has redundant pathways.
Software Summary
Operating System / Kernel
CentOS 7.3. Output from `uname -a`:

```
Linux 4.9.9-1.el7.elrepo.x86_64 #1 SMP Thu Feb 9 11:43:40 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
```
ZFS Tuning
/etc/modprobe.d/zfs.conf is currently a blank file; I haven't tried much tuning here, since sequential write performance seems like it should simply increase with more disks.
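For reference, module options would go in that file. A hypothetical sketch of the syntax (the parameter names are real ZFS-on-Linux module parameters, but the values below are untested placeholders, not recommendations):

```
# /etc/modprobe.d/zfs.conf -- placeholder values shown only to illustrate
# the syntax; the shipped defaults may well be adequate.
options zfs zfs_vdev_async_write_min_active=2
options zfs zfs_vdev_async_write_max_active=10
options zfs zfs_dirty_data_max=4294967296
```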
The specification for an LSI 9300-8e HBA quotes 12 Gbit/s per SAS lane for connected storage (https://docs.broadcom.com/docs/12353459). Multiplying that figure across the HBA's eight external lanes gives roughly 9600 MB/s of overall throughput.
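Spelling that arithmetic out (SAS-3 uses 8b/10b encoding, so each 12 Gbit/s lane carries about 1200 MB/s of payload; the lane count assumes both x4 external ports are cabled):

```shell
lane_gbit=12                               # SAS-3 line rate per lane
mb_per_lane=$((lane_gbit * 1000 / 10))     # 8b/10b: 10 line bits per payload byte
lanes=8                                    # 9300-8e: two x4 external ports
echo "theoretical aggregate: $((mb_per_lane * lanes)) MB/s"   # 9600 MB/s
```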
Is there an overall I/O queue-depth setting for the HBA driver in the OS that is throttling your I/O? Even so, that wouldn't explain why the bandwidth is halved so precisely.
Your figures would make a lot of sense if only a single path or link were active on the SAS connection. Is there any (bizarre) way that only one link out of eight could be working? I am not aware of how wide or narrow SAS 'ports' (which are virtual rather than physical objects) are configured from 'phys', but if the HBA isn't talking to your disk shelf properly, is there a fallback configuration option that might allow this?
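One way to check for that on the host (a sketch; the sysfs paths come from the Linux SAS transport class, and the exact layout can vary by kernel) is to look at the negotiated link rate of each phy behind the HBA:

```shell
# Print the negotiated link rate of every visible SAS phy; a lane that is
# down typically shows "unknown" or a lower rate than its siblings.
if [ -d /sys/class/sas_phy ]; then
    rates=$(grep -H . /sys/class/sas_phy/*/negotiated_linkrate)
else
    rates="no SAS phys visible on this machine"
fi
echo "$rates"
```

If only one or two phys report `12.0 Gbit` and the rest are down, a narrow port is exactly what you would see.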