md RAID1 reading from one disk only

I have two identical HDDs in a Linux software RAID 1. I observed a mostly read-heavy load on this RAID device, with the process spending most of its time in iowait. iotop shows about 75% disk utilisation overall.

If I look at the utilisation of the individual physical disks, one shows about 1 MB/s of reads while the other shows only about 100 KB/s. /proc/mdstat shows the array is in good health. Why aren't both disks being used equally?

Regarding the comment: I tried both reading with two threads and with one. It doesn't change anything.


Solution 1:

For sequential reads, there is no performance benefit in reading from both disks. Since both disks hold the same data, each would have to seek over the data read by the other, and a short forward seek is not much faster than simply reading the intervening data.

However, if you have multiple processes reading different data from the array in parallel, you should see a major performance improvement compared to a single disk.

Two processes reading from the same disk will typically cause an expensive seek each time they alternate. With RAID1 the two processes can be served by different disks, reducing the number of seeks significantly.
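As a rough illustration, here is a toy Python model (an idealisation, not a benchmark of real hardware) that counts head movements when two far-apart sequential streams share one disk versus being split across a mirror:

```python
def count_seeks(stream_starts, n_disks, requests_per_stream=100, io_size=8):
    """Count head movements when interleaving sequential streams.

    Idealised model: stream i is pinned to disk i % n_disks, and a
    'seek' is any request that doesn't start where the head left off.
    """
    heads = [None] * n_disks
    seeks = 0
    for step in range(requests_per_stream):
        for i, base in enumerate(stream_starts):
            disk = i % n_disks
            pos = base + step * io_size
            if heads[disk] is not None and heads[disk] != pos:
                seeks += 1
            heads[disk] = pos + io_size  # head ends just past the request
    return seeks

# Two far-apart sequential streams:
print(count_seeks([0, 10_000_000], n_disks=1))  # 199: almost every request seeks
print(count_seeks([0, 10_000_000], n_disks=2))  # 0: each disk stays sequential
```

With a single disk, nearly every request lands far from where the head stopped; with the streams split across the mirror, each disk sees a purely sequential workload.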

Solution 2:

If I read sequentially with one thread, everything goes to one drive. If I read with two threads from two different files, the first thread goes to one drive, while the second shifts unpredictably between the drives, sometimes balancing, and sometimes leaving the second drive near-idle with both threads reading from the first drive.

Reading the kernel source, this is to be expected. It is optimised for latency, not throughput. For spinning drives, read_balance() in raid1.c will almost always choose the drive whose head is closest to the requested position. If the second drive's head moves away from where the action is, it's hard to see how it could ever return.
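A toy model of that heuristic (a simplification of what read_balance() does for rotational drives, not the actual kernel code) shows why a single sequential stream never leaves the first drive:

```python
def closest_head(heads, pos):
    # Pick the drive whose head is nearest the request; ties go to the
    # lowest index, mimicking the distance-based choice for spinning disks.
    return min(range(len(heads)), key=lambda d: abs(heads[d] - pos))

heads = [0, 0]   # both heads parked at sector 0
counts = [0, 0]  # reads served per drive
for step in range(1000):
    pos = step * 8               # one sequential read stream
    d = closest_head(heads, pos)
    counts[d] += 1
    heads[d] = pos + 8           # head ends just past the request

print(counts)  # [1000, 0]: drive 0 is always closest, drive 1 never catches up
```

Drive 0 serves the first request, so its head is always adjacent to the next one; drive 1's head stays parked and is never the closer choice again.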

If at least one underlying drive is an SSD, the drive with the fewest pending requests is chosen instead, so throughput is balanced. So that's one solution, I guess: use SSDs.
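The non-rotational path can be sketched the same way. Picking the drive with the fewest in-flight requests naturally alternates between the mirrors (again a toy model, with request completion omitted since only the selection logic matters here):

```python
pending = [0, 0]  # in-flight request count per drive
counts = [0, 0]   # reads assigned per drive
for step in range(1000):
    # Choose the least-loaded drive (ties go to the lowest index).
    d = min(range(2), key=lambda i: pending[i])
    counts[d] += 1
    pending[d] += 1  # completions omitted; the alternation is the point

print(counts)  # [500, 500]: reads split evenly regardless of head position
```

Because the tie-break and the increment alternate the minimum between the two drives, the load splits evenly no matter where the requests land on disk.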

There's a comment indicating this code was written in 2000. There is no way to tune this behaviour short of editing the source and recompiling the kernel.