MegaRAID: performance difference between volumes

I have a MegaRAID SAS 9361-8i with 2x 240 GB SATA 6 Gbps SSDs in RAID1, 4x 10 TB SAS 12 Gbps HDDs in RAID6 and 4x 480 GB SATA 6 Gbps SSDs in RAID5:

-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace TR 
-----------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  223.062 GB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1 Optl  N  223.062 GB dflt N  N   dflt N      N  
 0 0   0   8:2      13  DRIVE Onln  N  223.062 GB dflt N  N   dflt -      N  
 0 0   1   8:5      16  DRIVE Onln  N  223.062 GB dflt N  N   dflt -      N  
 1 -   -   -        -   RAID6 Optl  N   18.190 TB enbl N  N   dflt N      N  
 1 0   -   -        -   RAID6 Optl  N   18.190 TB enbl N  N   dflt N      N  
 1 0   0   8:0      9   DRIVE Onln  N    9.094 TB enbl N  N   dflt -      N  
 1 0   1   8:1      11  DRIVE Onln  N    9.094 TB enbl N  N   dflt -      N  
 1 0   2   8:3      10  DRIVE Onln  N    9.094 TB enbl N  N   dflt -      N  
 1 0   3   8:4      12  DRIVE Onln  N    9.094 TB enbl N  N   dflt -      N  
 2 -   -   -        -   RAID5 Optl  N    1.307 TB dflt N  N   dflt N      N  
 2 0   -   -        -   RAID5 Optl  N    1.307 TB dflt N  N   dflt N      N  
 2 0   0   8:6      14  DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 2 0   1   8:7      17  DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 2 0   2   8:9      15  DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 2 0   3   8:10     18  DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
-----------------------------------------------------------------------------

---------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name 
---------------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     NRWBD -   ON  223.062 GB VD0  
1/1   RAID6 Optl  RW     Yes     RWBD  -   ON   18.190 TB VD1  
2/2   RAID5 Optl  RW     Yes     NRWBD -   ON    1.307 TB VD2  
---------------------------------------------------------------

---------------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model                     Sp Type 
---------------------------------------------------------------------------------------
8:0       9 Onln   1   9.094 TB SAS  HDD N   N  512B HUH721010AL5200           U  -    
8:1      11 Onln   1   9.094 TB SAS  HDD N   N  512B HUH721010AL5200           U  -    
8:2      13 Onln   0 223.062 GB SATA SSD N   N  512B Micron_5100_MTFDDAK240TCC U  -    
8:3      10 Onln   1   9.094 TB SAS  HDD N   N  512B HUH721010AL5200           U  -    
8:4      12 Onln   1   9.094 TB SAS  HDD N   N  512B HUH721010AL5200           U  -    
8:5      16 Onln   0 223.062 GB SATA SSD N   N  512B Micron_5100_MTFDDAK240TCC U  -    
8:6      14 Onln   2 446.625 GB SATA SSD N   N  512B Micron_5100_MTFDDAK480TCC U  -    
8:7      17 Onln   2 446.625 GB SATA SSD N   N  512B Micron_5100_MTFDDAK480TCC U  -    
8:9      15 Onln   2 446.625 GB SATA SSD N   N  512B Micron_5100_MTFDDAK480TCC U  -    
8:10     18 Onln   2 446.625 GB SATA SSD N   N  512B Micron_5100_MTFDDAK480TCC U  -    
---------------------------------------------------------------------------------------

Testing write and read speed on these VDs:

# lvcreate -ntest1 -L32G vg /dev/sda
# lvcreate -ntest2 -L32G vg /dev/sdb
# lvcreate -ntest3 -L32G vg /dev/sdc
# for i in 1 2 3; do sleep 10; dd if=/dev/zero of=/dev/vg/test$i bs=128M count=256 oflag=direct; done
34359738368 bytes (34 GB, 32 GiB) copied, 120.433 s, 285 MB/s  (test1/VD 0)
34359738368 bytes (34 GB, 32 GiB) copied, 141.989 s, 242 MB/s  (test2/VD 1)
34359738368 bytes (34 GB, 32 GiB) copied, 26.4339 s, 1.3 GB/s  (test3/VD 2)

# for i in 1 2 3; do sleep 10; dd if=/dev/vg/test$i of=/dev/zero bs=128M count=256 iflag=direct; done
34359738368 bytes (34 GB, 32 GiB) copied, 35.7277 s, 962 MB/s  (test1/VD 0)
34359738368 bytes (34 GB, 32 GiB) copied, 147.361 s, 233 MB/s  (test2/VD 1)
34359738368 bytes (34 GB, 32 GiB) copied, 16.7518 s, 2.1 GB/s  (test3/VD 2)

Running dd in parallel:

# sleep 10; for i in 1 2 3; do dd if=/dev/zero of=/dev/vg/test$i bs=128M count=256 oflag=direct & done
34359738368 bytes (34 GB, 32 GiB) copied, 28.1198 s, 1.2 GB/s  (test3/VD 2)
34359738368 bytes (34 GB, 32 GiB) copied, 115.826 s, 297 MB/s  (test1/VD 0)
34359738368 bytes (34 GB, 32 GiB) copied, 143.737 s, 239 MB/s  (test2/VD 1)

# sleep 10; for i in 1 2 3; do dd if=/dev/vg/test$i of=/dev/zero bs=128M count=256 iflag=direct & done
34359738368 bytes (34 GB, 32 GiB) copied, 16.8986 s, 2.0 GB/s  (test3/VD 2)
34359738368 bytes (34 GB, 32 GiB) copied, 35.7328 s, 962 MB/s  (test1/VD 0)
34359738368 bytes (34 GB, 32 GiB) copied, 153.147 s, 224 MB/s  (test2/VD 1)

The values for VD 0 and VD 1 are abysmal, and, remarkably, VD 2 had similarly poor values until I deleted and recreated it -- which I can't do with the others, as they contain data.

The only limit I can readily explain is the read speed of VD 2, which is roughly three times the per-drive SATA link speed -- that makes sense for a RAID5 across four disks. The read speed of VD 0 is a bit below twice the SATA link speed; that could be either a limitation of the media or non-optimal interleaving of requests in the RAID1, but either would still be acceptable.

The other numbers make no sense to me. The controller is obviously able to handle data faster, and the fact that parallel performance is not significantly different from testing each volume in isolation also suggests that the controller is not choosing a bottlenecked data path.

My interpretation is that creating the volumes from the BIOS instead of from StorCLI somehow left them with a sub-optimal configuration. Comparing the output of storcli /c0/v0 show all and storcli /c0/v2 show all shows no unexplained differences, so my fear is that the problem lies somewhere deeper in the stack.
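
For reference, the kind of comparison I mean is something along these lines (illustrative only -- the size, name and drive-count lines obviously differ between the two):

# diff <(storcli /c0/v0 show all) <(storcli /c0/v2 show all)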

  • Is there a known configuration gotcha or bug that would explain this behaviour?
  • Is there a tool to analyze configurations for bottlenecks, or, failing that,
  • Can I somehow export internal configuration values to allow me to compare them between volumes?

Solution 1:

First, dd is not really a good disk performance testing tool. All you are doing is testing streaming, sequential reads and writes. And because dd alternates between synchronous read and write operations, there are "dead periods" on the device you're testing between one operation and the next.
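
If you want a benchmark that keeps the device busy for the whole run, something like fio with a queue depth greater than one is a better fit. A rough sketch only -- the job parameters are illustrative, fio has to be installed, and it overwrites the scratch LV it points at:

# fio --name=seqwrite --filename=/dev/vg/test1 --rw=write --bs=1M \
      --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

Swap --rw=write for --rw=read to get the read side, and point --filename at each test LV in turn.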

Yes, the multiple drives in the RAID5 and RAID6 arrays will allow even spinning disks to keep up with the effective performance of an SSD RAID1 mirror -- or, in the case of multiple SSDs in a RAID5 array, actually beat it.

As long as you're streaming large amounts of data sequentially.

But try doing random small-block writes to those RAID5 and RAID6 arrays and watch performance plummet (especially the VD on spinning disks...) while the RAID1 on SSDs won't see anywhere near as much of a performance drop.

If you were to try doing random 512-byte writes to that RAID6 array on spinning disks, your performance would probably be a few tens of KB/s, if that.
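
You can see this for yourself with a random small-block write test. Again only a sketch -- the parameters are illustrative and it overwrites the scratch LV:

# fio --name=randwrite --filename=/dev/vg/test2 --rw=randwrite --bs=4k \
      --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 --time_based

Run it against all three test LVs and compare the IOPS; the RAID6 on spinning disks should fall off a cliff long before the SSD volumes do.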

What happens on RAID5 and RAID6 when you write small blocks to random locations is "read-modify-write".

Remember when you picked a "stripe size" or "segment size" when you created the RAID5 or RAID6 array? Depending on the controller, that's either the total amount of data in a block used to compute parity, or the amount of data per data disk that's used to compute parity. What did you pick? 64K? 128K? 2 MB because "bigger must be faster"? (Nope, it's not...)

The total amount of data used to compute parity on a RAID5 or RAID6 array is commonly called the "stripe size". So if you picked a 2 MB segment size on a RAID5 array of 4 disks (3 data, one parity), and your controller treats that 2 MB as a per-disk value, the stripe size is 6 MB. (No, there isn't really a dedicated "parity disk" - parity data is spread over all the disks. But it's a convenient way to think about the larger amount of space required to hold the data and parity.)

And 6 MB effectively becomes the smallest block of data on that RAID array that you can operate on.
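
If you don't remember what you picked, the controller will tell you -- something along these lines (the exact property label varies a bit between storcli/firmware versions) shows the configured strip size for each VD:

# storcli /c0/v1 show all | grep -i strip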

Well, what do you think happens if you write 512 bytes to the middle of one of those 6 megabyte stripes?

The controller has to read that entire stripe from all the drives, update the correct portion of that chunk with the new 512 bytes, recompute the parity of the entire stripe, then write the stripe and parity back out to all the disks.

Now, there are a LOT of optimizations that good RAID controllers can and do make to that "read-modify-write" operation so you rarely see the full impact, but logically that's what has to happen. And eventually you can throw enough small random writes at any controller that its ability to optimize the IO and hide the underlying abysmal performance will be overwhelmed.

And on spinning hard drives, the controller has to wait for the drive heads to seek to the proper track - something SSDs don't have to do.

So sequential, streaming large-block write performance to RAID5 and RAID6 arrays can be very good. But random small-block write performance is horrible, and that horribleness is often exacerbated by using large segment/stripe sizes. And it's even worse on spinning disks.

And alignment of the IO operations matters, too. If your filesystem blocks don't align with your RAID5/RAID6 stripes, you'll wind up doing excessive read-modify-write operations. How many filesystems do you know that have 6 MB block sizes? That would be "none". And that's another reason why large segment/stripe sizes on RAID5/6 arrays are bad, along with the reason you often see RAID5/6 arrays with a power-of-two number of data disks - so the RAID stripe size can be matched to (or be smaller than) the filesystem block size.
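
If you do know the strip size and the number of data disks, you can at least tell the filesystem about the geometry. As a sketch only -- assuming, purely for illustration, a 64 KB strip on the 4-disk RAID5 (3 data disks) and XFS on one of the scratch LVs:

# mkfs.xfs -d su=64k,sw=3 /dev/vg/test3

Here su is the per-disk strip size and sw is the number of data disks; substitute whatever your VDs actually use.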

Read performance? You have read-ahead disabled on VD 0 and VD 2:

---------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name 
---------------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     NRWBD -   ON  223.062 GB VD0  
1/1   RAID6 Optl  RW     Yes     RWBD  -   ON   18.190 TB VD1  
2/2   RAID5 Optl  RW     Yes     NRWBD -   ON    1.307 TB VD2  
---------------------------------------------------------------

That's what NR means in the Cache column - no read-ahead. Read-ahead lets the controller prefetch data and keep it cached, waiting for the next read request (remember what I said earlier about dd leaving "dead time" while it's doing the other half of its data moving). VD 1 is the only volume that gets that help; on VD 0 and VD 2 every sequential read has to be satisfied on demand, straight from the drives.
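
If you decide read-ahead should be on for VD 0 and VD 2, storcli can change it without recreating the volumes - something along these lines (double-check against your storcli version's help, and confirm the change with storcli /c0/v0 show all afterwards):

# storcli /c0/v0 set rdcache=RA
# storcli /c0/v2 set rdcache=RA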