Improving RAID performance
I just installed an LSI 9260-i8, using two virtual drives, the first composed of 4 SSDs, the second of 4 HDDs. Obviously the idea is to get better performance while maintaining some security and plenty of storage capacity.
The SSDs are great and that array is ridiculously fast when dealing with small to relatively large files. The HDDs host mostly huge files (500MB-30GB). It's intended as the main long-term storage facility, while the SSD array is for operating files and short-term storage only. This means files will very often be moved from the SSD array to the HDD array.
Problem is that performance declines very quickly after the first gig or so of a large operation is written. It starts at around 250MB/s, which isn't half bad write performance for a RAID 5 array of only 4 HDDs, but the copy I just did, consisting of 4 files totalling 12GB, gradually declined to a 35MB/s low.
Now I guess any advice would depend on a lot of system details, so here goes:
- The LSI card does not have a BBU (yet) so write-back is disabled.
- The HDDs are WD15EARS 2TB drives. Obviously these aren't the fastest HDDs out there, but a consistent 200MB/s isn't too much to ask I think.
- The SSDs are OCZ Vertex 2 60GB drives.
- I don't think it's relevant, but the HDDs have their spin-down time raised to 5 minutes instead of the default 8 seconds
- Drives show healthy in Storage Manager, no errors of note in logs
- Like I said, the SSDs are really fast, sporting up to 1100MB/s read speed, so they don't seem to be the bottleneck.
- The copy seems to stall intermittently: it runs fast for about 500MB, stops, runs fast again, and so on, resulting in a lower average speed overall.
- When creating the HDD array, I used a stripe size of 512KB. That's huge, but I'm expecting only large to huge files on that array. I'd rather not change it now either, as that would destroy the existing data and I don't have a backup (yet)
- Operating system is Ubuntu 10.04 (64bit)
- Motherboard Asus WS Revolution (it's a workstation), 24GB of ECC RAM, Xeon W3570 at stock 3.2GHz
- The LSI card is inserted in the first PCIe slot (to avoid latency introduced by NF200)
- System is otherwise perfectly stable
- HDD array was formatted using "mkfs.ext4 -b 4096 -E stride=128,stripe-width=384 -L DATA /dev/sdb"
- fstab does not include data=writeback or noatime, though I don't think those should influence large-file performance
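For what it's worth, the stride/stripe-width values in that mkfs.ext4 line can be sanity-checked with simple arithmetic (a sketch; the data-disk count assumes a 4-drive RAID 5, which leaves 3 data disks per stripe):

```shell
# stride       = controller stripe size / filesystem block size
# stripe-width = stride * number of data disks
stripe_kb=512        # controller stripe size
block_kb=4           # ext4 block size (-b 4096)
data_disks=3         # 4-drive RAID 5 -> 3 data disks per stripe
stride=$(( stripe_kb / block_kb ))
stripe_width=$(( stride * data_disks ))
echo "stride=$stride stripe-width=$stripe_width"   # stride=128 stripe-width=384
```

That matches the values used at format time, so the filesystem geometry at least agrees with the array geometry.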
Any and all advice is appreciated.
I think that "The LSI card does not have a BBU (yet) so write-back is disabled" is the bottleneck.
If you have a UPS - enable Write-Back.
If not - try to get the BBU.
If you can't - you can either enable Write-Back and risk the data consistency of the virtual drive by losing the cached data on a power failure, or stick to these speeds with write-through caching.
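If you do decide to enable Write-Back, the cache policy can be inspected and flipped with LSI's MegaCli tool. A sketch (the adapter and logical-drive numbers here are assumptions; check the output of the first command for your actual numbering before setting anything):

```shell
# Show the current cache policy for all logical drives on all adapters
MegaCli64 -LDGetProp -Cache -LAll -aALL

# Switch logical drive 0 on adapter 0 to write-back (WB); WT = write-through
MegaCli64 -LDSetProp WB -L0 -a0
```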
Even if you align the partition to the logical volume (which most modern OSes do automatically) and format the volume with a cluster/block size big enough (I think it should be 2MB in your case) that a single IO request spans all the drives, I don't think you will see a very big write performance difference.
That's because RAID 5 writes carry a lot of overhead, and since the cache is write-through, the XOR processor doesn't have the whole stripe in cache to perform the parity calculations in real time, I think.
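The parity math the XOR processor does can be illustrated with a toy example (the values below are arbitrary, just to show the arithmetic): parity is the XOR of the data blocks, and a partial-stripe update folds the old and new block into the old parity.

```shell
# Toy RAID 5 parity math on small integers standing in for blocks.
d1=170; d2=204; d3=85            # three data blocks in one stripe
p=$(( d1 ^ d2 ^ d3 ))            # parity block = XOR of the data blocks

# Partial-stripe update of d2: new parity = old parity ^ old block ^ new block
new_d2=60
new_p=$(( p ^ d2 ^ new_d2 ))

# Recomputing parity over the full updated stripe gives the same result
full_p=$(( d1 ^ new_d2 ^ d3 ))
[ "$new_p" -eq "$full_p" ] && echo "parity update consistent"
```

The point is that the shortcut update still requires reading the old data and old parity first, which is where the extra I/O comes from.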
With write-back cache enabled on a RAID 5 of 4x320GB HDDs with a 512KB stripe size, I get 250-350MB/s average write speed on big sequential files, or around 150MB/s average when copying big files within the virtual volume. (I still don't have a BBU, but I have an old APC 700VA Smart-UPS, so I think that's enough to greatly reduce the risk of power loss and cache loss.)
Are we discussing 100% random, 100% sequential, or some mixed pattern? I mostly see high speeds when I fully read, write, or copy big files on/from/to my array. On the other hand, as already said, random writes (and reads) are much lower, varying from less than 1MB/s up to 190MB/s average depending on the file sizes and/or request sizes - mostly under 20MB/s in everyday small-file use. So in real life it depends a lot on the application's transfer pattern. As I'm using a Windows OS my volumes are kept pretty much defragmented, and big operations like copying large files to/from the array are pretty fast.
And one suggestion for the slow random read/write speeds of normal HDDs: if you get to the point of reconfiguring the whole controller anyway, why not consider CacheCade, using 1 or 2 of the SSDs as a power-loss-safe RAID cache (something like Adaptec's hybrid RAID) and the rest as your OS/app drive, as you use them now? That way you should be able to boost the RAID 5 volume's speed even with write-through, I think, because the actual writes to the physical HDDs happen in the background; and since the cache lives on the SSDs rather than in the controller's volatile on-board memory, you should be safe from system resets. But for actual and concrete information on how CacheCade works, please read LSI's documentation or ask LSI's technical support, as I haven't had the chance to use it yet.
TomTom already has essentially answered it, but a little more context to the answer might be useful.
You're using RAID 5. RAID 5 has well-known performance issues when writing data.
For each RAID 5 stripe there is a parity data block, and the parity data blocks are spread out over all disks in a round-robin fashion. For each write to a RAID 5 array, the controller needs to recompute the parity information, and then write the new parity block to disk. A quote from here illustrates this (regarding a partial stripe update, but the same principle applies):
If you [...] modify the data block it recalculates the parity by subtracting the old block, and adding in the new version. Then in two separate operations it writes the data block followed by the new parity block. To do this it must first read the parity block from whichever drive contains the parity for that stripe block and reread the unmodified data for the updated block from the original drive. This read-read-write-write is known as the RAID5 write penalty since these two writes are sequential and synchronous the write system call cannot return until the reread and both writes complete, [...]
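That 4-I/O penalty makes it easy to sketch why random writes land near single-disk speed. Back-of-envelope (the per-disk IOPS figure below is a rough assumption for a 5400rpm drive, not a measurement):

```shell
# RAID 5 small-write throughput estimate:
# each small write costs 4 I/Os (read data, read parity, write data, write parity)
disk_iops=80          # rough random IOPS for one WD15EARS (assumption)
disks=4               # drives in the array
penalty=4             # read-read-write-write per small write
effective_iops=$(( disk_iops * disks / penalty ))
echo "effective random-write IOPS ~ $effective_iops"   # about one disk's worth
```

The array-wide gain from extra spindles is eaten almost exactly by the write penalty, which is why "~1 disk performance" is the usual rule of thumb for small RAID 5 arrays.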
Around 35 MB/s sounds about right for a single SATA HDD doing a good bit of more-or-less random I/O due to the RAID 5 striping, and real-world RAID 5 write speeds for smaller arrays are generally around the performance of a single disk. So it's more-or-less expected performance; that the copy runs faster at the beginning is probably OS caching at play.
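To take the OS page cache out of the measurement, you can run a sequential write that forces the data to disk before dd reports its speed. A sketch (in practice you'd point the output file at the array's mount point rather than the current directory):

```shell
# Write 64 MB sequentially; conv=fdatasync flushes to disk before dd
# prints its rate, so the number reflects the array, not the page cache.
dd if=/dev/zero of=./ddtest.bin bs=1M count=64 conv=fdatasync
rm ./ddtest.bin
```

If this settles near 35 MB/s too, the initial 250 MB/s burst was just RAM absorbing the write.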
Getting a Battery Backup Unit and enabling write caching is not a cure-all. You write that you often copy large files (>1GB). BBU + write caching helps tremendously with small random writes, but less so with large sequential writes (because the on-controller buffer eventually fills up).
If you want to have good write performance, the answer is generally RAID 10.
And lastly, when you create your partitions, you should take care to ensure that the partition boundaries align with the array stripe boundaries.
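Alignment can be checked with simple arithmetic on the partition's start sector (the sector value below is just an example; read the real one from `fdisk -l` or `parted` output):

```shell
# A partition is stripe-aligned when its start sector is a multiple of the
# stripe size expressed in 512-byte sectors (512 KB -> 1024 sectors).
start_sector=2048                        # example value from fdisk -l (assumption)
stripe_sectors=$(( 512 * 1024 / 512 ))   # 512 KB stripe in sectors
if [ $(( start_sector % stripe_sectors )) -eq 0 ]; then
  echo "aligned"
else
  echo "misaligned"
fi
```

A misaligned partition turns some full-stripe writes into two partial-stripe writes, which doubles the parity overhead for no benefit.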