How much does HDD cache matter with Linux softraid?

Solution 1:

Cache and RAID have an interesting relationship.

Expensive RAID controllers have built-in cache, and they typically turn the drive cache off. The reason is that RAID is designed both to keep your data safe and to increase performance. Drive cache helps performance at the expense of reliability: if the power dies, whatever was in the cache is lost, even though the software believed it had been written. That causes serious problems for software that really needs to know its data is on the disk, such as databases.

The battery exists to get that cached data somewhere safe: into NVRAM in the case of a controller, or onto the physical disks in the case of a battery-backed array.

Software RAID doesn't really have that sort of option. If the drives have said "ok, we've got the data" and the power then dies while the data is still in cache, there's a problem. There's no NVRAM to hold the data, and the disks don't keep spinning thanks to a battery backup (not on their own, anyway; additional software may be available to do this).
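To make the "software thought it was safe" part concrete, here is a minimal Python sketch of how an application asks for its data to be pushed to the device. The function name `durable_write` and the scratch file name are made up for illustration; `os.fsync` is the standard call.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())   # ask the kernel to flush its own cache to the device

# Usage: write a record to a scratch file (hypothetical file name).
scratch_dir = tempfile.mkdtemp()
record_path = os.path.join(scratch_dir, "record.bin")
durable_write(record_path, b"committed transaction\n")
```

Even with this, the guarantee is only as good as the layers underneath. If the drive acknowledges the write while the data is still sitting in its own volatile cache, the application has no way to tell.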

I would read Question 9 under "Setup Considerations" in the Software RAID HOWTO: http://www.linuxjunkies.org/html/Software-RAID-0.4x-HOWTO.html#s3

These questions have some interesting reading:
SATA Disks that handle write caching properly?
LVM mirroring VS RAID1

Anyway, in response to your question: more drive cache gives the drive more space to "play" with. Actually putting things on disk is expensive in terms of time, while storing things in memory is really cheap.

The performance will really depend on the load that you're putting on the disks and where the bottleneck lies. Each of your disks (spindles) has a statistic called IOPS (I/O Operations Per Second - http://adamstechblog.com/2009/02/10/how-to-calculate-iops-ios-per-second/) that describes how many operations per second it can complete against the spinning platters. If you feed the hard drive more data than it can put on the platters, it uses cache. If you keep hammering it, the data keeps going into cache. Once the cache fills up, your computer waits on the disk to clear the "dirty" data (data that still needs to be written).
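As a rough illustration of where a per-disk IOPS figure comes from, the usual back-of-the-envelope estimate is one operation per (average seek time + average rotational latency). The sketch below uses that estimate; the function name and the seek times are illustrative assumptions, not figures for any particular drive.

```python
def estimate_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random-I/O IOPS estimate for a single spinning disk."""
    # On average the platter has to turn half a revolution before
    # the requested sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000.0 / service_time_ms

# Assumed (illustrative) average seek times for each spindle speed.
for rpm, seek_ms in [(7200, 8.5), (10000, 4.5), (15000, 3.5)]:
    print(f"{rpm:>6} RPM: ~{estimate_iops(rpm, seek_ms):.0f} IOPS")
```

Note that cache size does not appear anywhere in this estimate. Cache only absorbs bursts; the sustained rate is set by the mechanics.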

So with a RAID system, you have several disks being fed at once, which increases the total IOPS. After you add enough spindles, the disks stop being the bottleneck and the link to the array becomes the limit (you're not there yet, don't worry).
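A simplified way to picture that crossover, assuming throughput scales linearly with the number of disks (real arrays fall short of this, and RAID levels with parity or mirroring cost extra writes). The function names and the MB/s figures are hypothetical.

```python
import math

def array_throughput(n_disks: int, per_disk_mb_s: float, link_mb_s: float) -> float:
    """Idealized array throughput: the disks add up until the link caps them."""
    return min(n_disks * per_disk_mb_s, link_mb_s)

def disks_to_saturate_link(link_mb_s: float, per_disk_mb_s: float) -> int:
    """Smallest disk count whose combined throughput reaches the link speed."""
    return math.ceil(link_mb_s / per_disk_mb_s)

# Hypothetical numbers: 80 MB/s per disk behind a 300 MB/s link.
for n in (1, 2, 4, 6):
    print(n, "disks ->", array_throughput(n, 80.0, 300.0), "MB/s")
print("link saturates at", disks_to_saturate_link(300.0, 80.0), "disks")
```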

Essentially, more cache gives you more wiggle room when it comes to dumping a lot of data on the disk(s). If your workload is particularly I/O-bound, you'll see an improvement.
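To put a number on that wiggle room, here is a small sketch of how long a drive can absorb a burst before its cache fills. It assumes constant incoming and drain rates, which real workloads won't have, and the function name and figures are made up for illustration.

```python
from typing import Optional

def seconds_until_cache_full(cache_mb: float,
                             incoming_mb_s: float,
                             drain_mb_s: float) -> Optional[float]:
    """How long a burst can be absorbed; None if the disk keeps up indefinitely."""
    surplus = incoming_mb_s - drain_mb_s   # data piling up in cache per second
    if surplus <= 0:
        return None                        # the platters drain as fast as data arrives
    return cache_mb / surplus

# Hypothetical burst: 120 MB/s arriving, 100 MB/s reaching the platters.
for cache_mb in (8, 16, 32, 64):
    print(cache_mb, "MB cache ->", seconds_until_cache_full(cache_mb, 120.0, 100.0), "s")
```

Doubling the cache doubles the time before the computer has to wait, but it does nothing for the rate at which data actually reaches the platters.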

On the other hand, if you, your software, or your (non-existent at the moment) RAID array disables the drive cache, you've paid extra for cache that never gets used.

In the end, if you have a choice, pick a higher spin rate over a larger cache.