Linux - real-world hardware RAID controller tuning (scsi and cciss)

Most of the Linux systems I manage feature hardware RAID controllers (mostly HP Smart Array). They're all running RHEL or CentOS.

I'm looking for real-world tunables to help optimize performance for setups that incorporate hardware RAID controllers with SAS disks (Smart Array, Perc, LSI, etc.) and battery-backed or flash-backed cache. Assume RAID 1+0 and multiple spindles (4+ disks).

I spend a considerable amount of time tuning Linux network settings for low-latency and financial trading applications. But many of those options are well-documented (changing send/receive buffers, modifying TCP window settings, etc.). What are engineers doing on the storage side?

Historically, I've made changes to the I/O scheduling elevator, recently opting for the deadline and noop schedulers to improve performance within my applications. As RHEL versions have progressed, I've also noticed that the compiled-in defaults for SCSI and CCISS block devices have changed. This has affected the recommended storage subsystem settings over time. However, it's been a while since I've seen any clear recommendations, and I know that the OS defaults aren't optimal. For example, the default read-ahead buffer of 128 KB seems extremely small for a deployment on server-class hardware.

The following articles explore the performance impact of changing read-ahead cache and nr_requests values on the block queues.

http://zackreed.me/articles/54-hp-smart-array-p410-controller-tuning
http://www.overclock.net/t/515068/tuning-a-hp-smart-array-p400-with-linux-why-tuning-really-matters
http://yoshinorimatsunobu.blogspot.com/2009/04/linux-io-scheduler-queue-size-and.html

For example, these are suggested changes for an HP Smart Array RAID controller:

echo "noop" > /sys/block/cciss\!c0d0/queue/scheduler 
blockdev --setra 65536 /dev/cciss/c0d0
echo 512 > /sys/block/cciss\!c0d0/queue/nr_requests
echo 2048 > /sys/block/cciss\!c0d0/queue/read_ahead_kb

What else can be reliably tuned to improve storage performance?
I'm specifically looking for sysctl and sysfs options in production scenarios.


I've found that when I've had to tune for lower latency rather than throughput, I've tuned nr_requests down from its default (to as low as 32). The idea is that smaller batches mean lower latency.

As for read_ahead_kb, I've found that increasing this value offers better throughput for sequential reads/writes, but it really depends on your workload and I/O pattern. For example, on a database system that I recently tuned, I changed this value to match a single DB page size, which helped reduce read latency. Increasing or decreasing beyond this value hurt performance in my case.
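To illustrate the page-size approach, here is a minimal sketch. The helper names page_bytes_to_ra_kb and set_read_ahead_kb are hypothetical, and the 16 KB page size in the usage note is an assumption for illustration only, since no specific database or page size is named here. The queue directory is passed as an argument so the function can be tried against a scratch directory before pointing it at /sys.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any tool): size read-ahead to one DB page.

# page_bytes_to_ra_kb PAGE_BYTES
# Convert a page size in bytes to kilobytes, rounding up so that a page
# smaller than 1 KB still yields 1.
page_bytes_to_ra_kb() {
    echo $(( ($1 + 1023) / 1024 ))
}

# set_read_ahead_kb QUEUE_DIR PAGE_BYTES
# Write the converted value into QUEUE_DIR/read_ahead_kb.
set_read_ahead_kb() {
    ra_queue_dir=$1
    page_bytes_to_ra_kb "$2" > "$ra_queue_dir/read_ahead_kb"
}
```

As root, something like set_read_ahead_kb '/sys/block/cciss!c0d0/queue' 16384 would then request 16 KB of read-ahead. Whether that beats a larger value is exactly the workload-dependent part, so measure before and after.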

As for other options or settings for block device queues:

max_sectors_kb = I've set this value to match what the hardware allows for a single transfer (check the value of the max_hw_sectors_kb (RO) file in sysfs to see what's allowed)

nomerges = this lets you disable or adjust the lookup logic for merging I/O requests. Turning merging off can save you some CPU cycles, but I haven't seen any benefit when changing this on my systems, so I left it at the default.

rq_affinity = I haven't tried this yet, but here is the explanation behind it from the kernel docs:

If this option is '1', the block layer will migrate request completions to the cpu "group" that originally submitted the request. For some workloads this provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion processing setting this option to '2' forces the completion to run on the requesting cpu (bypassing the "group" aggregation logic).

scheduler = you said that you tried deadline and noop. I've tested both, and have found that deadline wins out in the testing I've done most recently for a database server.

NOOP performed well, but for our database server I was still able to achieve better performance by adjusting the deadline scheduler.
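A small sketch covering two of the queue settings above: reading which scheduler is active, and raising max_sectors_kb to the hardware limit. The function names are hypothetical, and the queue directory is an argument so the code can be exercised on a scratch directory first. It assumes the scheduler file uses the usual bracketed format, for example "noop [deadline] cfq".

```shell
#!/bin/sh
# Hypothetical helpers for a block device queue directory
# (for example /sys/block/sda/queue).

# active_scheduler QUEUE_DIR
# Print the scheduler name shown in square brackets in the scheduler file.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1/scheduler"
}

# raise_max_sectors QUEUE_DIR
# Copy the read-only hardware limit into the tunable max_sectors_kb.
raise_max_sectors() {
    ms_queue_dir=$1
    ms_hw_limit=$(cat "$ms_queue_dir/max_hw_sectors_kb") || return 1
    echo "$ms_hw_limit" > "$ms_queue_dir/max_sectors_kb"
}
```

For a Smart Array device the directory would be quoted like '/sys/block/cciss!c0d0/queue', and the write needs root.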

Options for the deadline scheduler, located under /sys/block/{sd,cciss,dm-}*/queue/iosched/:

fifo_batch = kind of like nr_requests, but specific to the scheduler. It controls the batch size of read and write requests. Rule of thumb: tune this down for lower latency or up for throughput.

write_expire = sets the expire time for write batches; the default is 5000 ms. Once again, decreasing this value decreases your write latency, while increasing it increases throughput.

read_expire = sets the expire time for read batches; the default is 500 ms. The same rules apply here.

front_merges = I tend to turn this off, and it's on by default. I don't see the need for the scheduler to waste cpu cycles trying to front merge IO requests.

writes_starved = since deadline is geared toward reads, the default here is to process 2 read batches before a write batch is processed. I found the default of 2 to be good for my workload.
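The deadline options above can be applied with a small loop. This is a sketch only: set_iosched is a hypothetical helper, and the values in the usage note are illustrative, not recommendations. It skips any name that does not exist as a writable file, which is what happens when a different scheduler is active on the device.

```shell
#!/bin/sh
# set_iosched QUEUE_DIR name=value [name=value ...]
# Hypothetical helper: write each value into QUEUE_DIR/iosched/name,
# skipping tunables that are not present or not writable.
set_iosched() {
    iosched_dir=$1/iosched
    shift
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first '='
        value=${pair#*=}     # text after the first '='
        if [ ! -w "$iosched_dir/$name" ]; then
            echo "skipping $name: not writable under $iosched_dir" >&2
            continue
        fi
        echo "$value" > "$iosched_dir/$name"
    done
}
```

A latency-leaning call might look like set_iosched /sys/block/sda/queue fifo_batch=8 read_expire=250 front_merges=0, run as root with deadline already selected. Treat those numbers as a starting point for testing.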


More than anything, everything depends on your workload.

read_ahead_kb can help if reading lots of data from a file ahead of time is genuinely useful, as when streaming video. Sometimes it can hurt you badly. Yes, the default 128 KB can sound small, but with enough concurrency it starts to sound big! On the other hand, on a server such as a video encoding server, which only converts videos from one format to another, it might be a very good idea to tune it.

nr_requests, when overtuned, can easily flood your RAID controller, which again hurts performance.

In the real world, you need to watch the latencies. If you are connected to a SAN, take a look with iostat, sar, or whatever you like to use, and see if I/O request service times are through the roof. Of course this helps with local disks, too: if latencies are very high, consider tuning down your I/O elevator settings by lowering nr_requests and other settings.
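If iostat is not at hand, a rough average wait per I/O can be derived from two snapshots of /proc/diskstats. The sketch below assumes the standard field layout (field 3 is the device name, fields 4 and 8 are reads and writes completed, fields 7 and 11 are milliseconds spent reading and writing). The function name avg_wait_ms is hypothetical.

```shell
#!/bin/sh
# avg_wait_ms DEVICE BEFORE_FILE AFTER_FILE
# Hypothetical helper: average milliseconds per completed I/O for DEVICE
# between two saved copies of /proc/diskstats. Prints "n/a" if the device
# is missing or no I/O completed in the interval.
avg_wait_ms() {
    awk -v dev="$1" '
        $3 == dev {
            ios = $4 + $8       # reads + writes completed
            ms  = $7 + $11      # ms spent reading + ms spent writing
            if (seen++) { d_ios = ios - p_ios; d_ms = ms - p_ms }
            p_ios = ios; p_ms = ms
        }
        END {
            if (d_ios > 0) printf "%.2f\n", d_ms / d_ios
            else print "n/a"
        }
    ' "$2" "$3"
}
```

Usage would be along the lines of: cp /proc/diskstats /tmp/before; sleep 5; cp /proc/diskstats /tmp/after; avg_wait_ms sda /tmp/before /tmp/after. A number that climbs sharply after raising nr_requests is a hint that the controller is being flooded.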


FYI, read_ahead_kb and blockdev --setra are just different ways to set the same setting using different units (kB vs. 512-byte sectors):

foo:~# blockdev --setra 65536 /dev/cciss/c0d0
foo:~# blockdev --getra /dev/cciss/c0d0
65536
foo:~# cat /sys/block/cciss\!c0d0/queue/read_ahead_kb
32768
foo:~# echo 2048 > /sys/block/cciss\!c0d0/queue/read_ahead_kb
foo:~# cat /sys/block/cciss\!c0d0/queue/read_ahead_kb
2048
foo:~# blockdev --getra /dev/cciss/c0d0
4096

So the

blockdev --setra 65536 /dev/cciss/c0d0

in your example has no effect: the later write of 2048 to read_ahead_kb overrides it.
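The conversion between the two units is a factor of two, since blockdev counts 512-byte sectors and sysfs counts kilobytes. A tiny sketch with hypothetical helper names:

```shell
#!/bin/sh
# Hypothetical helpers: convert between blockdev --setra units
# (512-byte sectors) and read_ahead_kb units (kilobytes).

ra_sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

ra_kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}
```

ra_sectors_to_kb 65536 gives 32768 and ra_kb_to_sectors 2048 gives 4096, matching the values in the session above.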