Poor write performance of software RAID10 array of 8 SSD drives

I have a server with a Supermicro X10DRW-i motherboard and a RAID10 array of 8 KINGSTON SKC400S SSDs; the OS is CentOS 6.

  # cat /proc/mdstat 
Personalities : [raid10] [raid1] 

md2 : active raid10 sdj3[9](S) sde3[4] sdi3[8] sdd3[3] sdg3[6] sdf3[5] sdh3[7] sdb3[1] sda3[0]
      3978989568 blocks super 1.1 512K chunks 2 near-copies [8/8] [UUUUUUUU]
      bitmap: 9/30 pages [36KB], 65536KB chunk

  # mdadm --detail /dev/md2                
    /dev/md2:
            Version : 1.1
      Creation Time : Wed Feb  8 18:35:14 2017
         Raid Level : raid10
         Array Size : 3978989568 (3794.66 GiB 4074.49 GB)
      Used Dev Size : 994747392 (948.67 GiB 1018.62 GB)
       Raid Devices : 8
      Total Devices : 9
        Persistence : Superblock is persistent

      Intent Bitmap : Internal

        Update Time : Fri Sep 14 15:19:51 2018
              State : active 
     Active Devices : 8
    Working Devices : 9
     Failed Devices : 0
      Spare Devices : 1

             Layout : near=2
         Chunk Size : 512K

               Name : ---------:2  (local to host -------)
               UUID : 8a945a7a:1d43dfb2:cdcf8665:ff607a1b
             Events : 601432

        Number   Major   Minor   RaidDevice State
           0       8        3        0      active sync set-A   /dev/sda3
           1       8       19        1      active sync set-B   /dev/sdb3
           8       8      131        2      active sync set-A   /dev/sdi3
           3       8       51        3      active sync set-B   /dev/sdd3
           4       8       67        4      active sync set-A   /dev/sde3
           5       8       83        5      active sync set-B   /dev/sdf3
           6       8       99        6      active sync set-A   /dev/sdg3
           7       8      115        7      active sync set-B   /dev/sdh3

           9       8      147        -      spare   /dev/sdj3

I've noticed that write speed is just terrible, not even close to SSD performance.

# dd if=/dev/zero of=/tmp/testfile bs=1G count=1 oflag=dsync      
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 16.511 s, 65.0 MB/s

Read speed is fine, though:

# hdparm -tT /dev/md2

/dev/md2:
 Timing cached reads:   20240 MB in  1.99 seconds = 10154.24 MB/sec
 Timing buffered disk reads: 3478 MB in  3.00 seconds = 1158.61 MB/sec

After some troubleshooting, I found that I probably messed up the storage configuration initially: the X10DRW-i has an Intel C610 chipset, which has two separate SATA controllers, a 6-port SATA and a 4-port sSATA. The disks in the array are therefore connected to different controllers, and I believe this is the root cause of the poor performance. I have only one idea for fixing this: installing a PCIe SAS controller (probably an AOC-S3008L-L8E) and connecting the SSDs to it.

So I would like to confirm the following:

Am I right about the root cause, or should I double-check something?

Will my solution work?

If I reconnect the drives to a new controller, will my RAID and data survive? My research suggests yes, as the UUIDs of the partitions will remain the same, but I just want to be sure.

Thanks to everyone in advance.

UPD: iostat -x 1 while performing dd test: https://pastebin.com/aTfRYriU

# hdparm /dev/sda                                    

/dev/sda:
 multcount     = 16 (on)
 IO_support    =  1 (32-bit)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 124519/255/63, sectors = 2000409264, start = 0

# cat /sys/block/md2/queue/scheduler                 
none

Though AFAIK the scheduler is set on the physical drives:

# cat /sys/block/sda/queue/scheduler 
noop anticipatory [deadline] cfq 

smartctl -a (on devices, not partitions): https://pastebin.com/HcBp7gUH

UPD2:

# dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 14.389 s, 74.6 MB/s

UPD3:

I have just run fstrim on the / partition and it had some effect, but write speed is still too low: 227 MB/s, 162 MB/s, 112 MB/s, 341 MB/s, 202 MB/s in five consecutive tests.


Solution 1:

The low measured performance is the result of several factors:

  • after creation, the array is fully synced, which allocates most (if not all) flash data pages on half the SSDs. This puts the SSDs in a low-performance state until a secure erase / trim "frees" all/most/some pages. This explains the increased performance after an fstrim;
  • the (default) 512 KB chunk size is too large for maximum sequential/streaming performance (as benchmarked with dd). With an all-SSD array I would select a 64 KB chunk size and, probably (but this should be confirmed with real-world tests), the "far" layout. Please note that decreasing the chunk size, while beneficial for streaming access, can penalize random reads/writes. This is mainly a concern with HDDs, but even SSDs can be somewhat affected;
  • by default, the Linux kernel issues I/O requests of at most 512 KB. This means that, even when asking dd to use 1 GB blocks (as per your first command), these will be split into a myriad of 512 KB requests. Coupled with your 512 KB chunk, this will engage a single SSD (and its mirror) per write request, basically capping streaming write performance at single-SSD level and denying any potential speed increase due to RAID. While you can raise this limit with the max_sectors_kb tunable (found in /sys/block/sdX/queue/max_sectors_kb), values bigger than 512 KB can (in some configurations/kernel versions) be ignored;
  • finally, while interesting and an obligatory first stop, dd itself is a poor benchmark: it only tests streaming performance at low (1) queue depth. Even with your current array config, a more comprehensive test such as fio would show a significant performance increase relative to a single-disk scenario, at least in random I/O.
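
To make the chunk-size argument concrete, here is a rough back-of-the-envelope sketch in shell. It assumes chunk-aligned requests, the 8-device near=2 layout from your mdadm output (so 4 mirror pairs), and the 512 KB request limit mentioned above; the function names are made up for illustration.

```shell
#!/bin/sh
# Illustrative arithmetic only, not a measurement.
# Assumptions: chunk-aligned requests, 8 devices, near=2 (4 mirror pairs),
# kernel request size capped at 512 KB.

# requests_for WRITE_KB MAX_IO_KB -> number of requests the write is split into
requests_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# pairs_per_request MAX_IO_KB CHUNK_KB PAIRS -> mirror pairs touched by one
# request, capped at the number of pairs in the array
pairs_per_request() {
    chunks=$(( ($1 + $2 - 1) / $2 ))
    if [ "$chunks" -gt "$3" ]; then
        chunks=$3
    fi
    echo "$chunks"
}

devices=8
copies=2
pairs=$(( devices / copies ))

echo "1 GiB write  -> $(requests_for 1048576 512) requests of 512 KB"
echo "512 KB chunk -> $(pairs_per_request 512 512 "$pairs") mirror pair(s) per request"
echo "64 KB chunk  -> $(pairs_per_request 512 64 "$pairs") mirror pair(s) per request"
```

With 512 KB chunks each request lands on one mirror pair, while with 64 KB chunks a single request can spread over all four pairs. The current request limit for a member disk can be read with cat /sys/block/sda/queue/max_sectors_kb.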

What can you do to correct the current situation? First of all, you must accept wiping the disks/array; obviously, you need to take backups as the first step. Then:

  • stop and delete the array (mdadm -S /dev/md2);
  • trim all data blocks on every disk (blkdiscard /dev/sdX3);
  • recreate the array with 64 KB chunks and with the clean flag (mdadm --create /dev/md2 --level=10 --raid-devices=8 --chunk=64 --assume-clean /dev/sdX3, where /dev/sdX3 stands for the list of all member partitions);
  • re-bench with dd and fio;
  • if all looks good, restore your backup.
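
The steps above could be strung together roughly as follows. This is only a sketch: the member list is copied from your mdadm --detail output (the sdj3 spare is left out), the fio parameters are an arbitrary starting point rather than a tuned benchmark, and the script only prints the commands unless you set DRY_RUN=0. Everything here destroys data, so double-check the device names first.

```shell
#!/bin/sh
# Sketch of the rebuild sequence. DESTRUCTIVE when DRY_RUN=0.
# Assumption: member partitions as listed in the question's mdadm --detail
# output; the spare (sdj3) is intentionally not included here.
DRY_RUN="${DRY_RUN:-1}"
ARRAY=/dev/md2
MEMBERS="/dev/sda3 /dev/sdb3 /dev/sdi3 /dev/sdd3 /dev/sde3 /dev/sdf3 /dev/sdg3 /dev/sdh3"

# run CMD ARGS... -> print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_plan() {
    # 1. stop the array
    run mdadm -S "$ARRAY"

    # 2. discard every block on each member partition
    for part in $MEMBERS; do
        run blkdiscard "$part"
    done

    # 3. recreate with 64 KB chunks, skipping the initial resync
    #    ($MEMBERS is deliberately unquoted so it splits into 8 arguments)
    run mdadm --create "$ARRAY" --level=10 --raid-devices=8 --chunk=64 \
        --assume-clean $MEMBERS

    # 4. re-bench the empty array: sequential writes at queue depth 32
    run fio --name=seqwrite --filename="$ARRAY" --rw=write --bs=1M \
        --iodepth=32 --ioengine=libaio --direct=1 --runtime=30 --time_based
}

rebuild_plan
```

Run it once as-is to review the printed commands, and only then with DRY_RUN=0. Changing --rw=write to --rw=randwrite and --bs=1M to --bs=4k gives a random-write test to compare against.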

A last note about your SATA setup: splitting the disks in this manner should clearly be avoided to get maximum performance. That said, your write speed is so low that I would not blame your SATA controller. I would really recreate the array per the above instructions before buying anything new.
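
If you want to see how the disks are actually split across the two controllers, sysfs can tell you. A minimal sketch, assuming a Linux sysfs layout where the resolved /sys/block/sdX path contains the PCI address of the controller; the helper names are made up for illustration.

```shell
#!/bin/sh
# pci_addr_from_path PATH -> last PCI address (dddd:bb:dd.f) found in PATH,
# i.e. the PCI device closest to the disk, which should be its controller.
pci_addr_from_path() {
    printf '%s\n' "$1" |
        grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' |
        tail -n 1
}

# list_controllers -> one "disk controller-address" line per sdX disk
list_controllers() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue        # the glob matched nothing
        path=$(readlink -f "$dev")
        echo "${dev##*/} $(pci_addr_from_path "$path")"
    done
}

list_controllers
```

Disks that print different addresses sit on different controllers, which makes it easy to check whether both halves of each mirror pair (set-A / set-B) share one.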