Linux I/O bottleneck with data-movers

First of all, if your CPUs (and damn, 24 is a lot!) consume data faster than the storage can provide it, you get iowait. That's when the kernel pauses a process during blocking I/O (a read that arrives too slowly, or a synchronous write).
So check that the storage can provide enough throughput for 24 cores.
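
For example, a quick way to sanity-check raw sequential read speed from the Linux side is a dumb read loop. This is a minimal Python sketch, not a real benchmark: the path is just a placeholder, and the file should be larger than RAM (or drop the page cache first) so you measure the disks rather than the cache.

```python
#!/usr/bin/env python3
"""Rough sequential-read throughput check (a sketch, not a benchmark)."""
import time

PATH = "/mnt/storage/big_test_file"   # hypothetical path -- point it at your storage
CHUNK = 4 * 1024 * 1024               # read in 4 MiB chunks

total = 0
start = time.monotonic()
with open(PATH, "rb", buffering=0) as f:
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        total += len(buf)
elapsed = time.monotonic() - start
print(f"read {total / 1e6:.0f} MB in {elapsed:.1f} s -> {total / 1e6 / elapsed:.0f} MB/s")
```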

Example: let's assume your storage can provide 500 MB/s of throughput and that you are connected through two bonded Gigabit Ethernet links; the network already limits the maximum throughput to somewhere around 100-180 MB/s. If each process consumes data at 50 MB/s and you run 4 threads on your 4-core machine, that's 4 x 50 MB/s = 200 MB/s consumed. If the network can sustain 180 MB/s, you will not see much latency and your CPUs will stay loaded; the network here is only a small bottleneck.
Now scale this up to 24 cores and 24 threads: you would need 1200 MB/s. Even if you change the wiring to allow that throughput, your storage system does not provide more than 500 MB/s, so it becomes the bottleneck.
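If you want to play with the numbers, here is the same back-of-the-envelope calculation as a tiny Python sketch; the figures are the ones from the example above, not measurements from your setup.

```python
# Back-of-the-envelope numbers from the example above (all in MB/s).
PER_THREAD = 50    # how fast one thread eats data
STORAGE = 500      # what the storage array can deliver

# Second case: pretend the wiring is no longer the limit.
for threads, link in ((4, 180), (24, 10_000)):
    demand = threads * PER_THREAD
    supply = min(link, STORAGE)          # the slowest link in the chain wins
    verdict = "roughly OK" if demand <= supply else "bottlenecked"
    print(f"{threads:2d} threads: need {demand} MB/s, "
          f"chain delivers {supply} MB/s -> {verdict}")
```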

When it comes to iowait, bottlenecks can be everywhere: not only in the physical layers, but also in software and kernel-space buffers. It really depends on the usage patterns. But since software bottlenecks are much harder to identify, it is usually preferable to check the theoretical throughput of the hardware before investigating the software stacks.

As said, iowait occurs when a process makes a read and the data takes time to arrive, or when it makes a sync write and the acknowledgment of the data modification takes its time. During a sync write, the process enters uninterruptible sleep so the data doesn't get corrupted. There is one handy tool to see which call makes a process hang: latencytop. It is not the only one of its kind, but you can give it a try.
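If you want to see this for yourself, here is a minimal Python sketch of a synchronous writer. The target path is only a placeholder; run it against a file on the array in question and watch the process with latencytop, `ps -o stat,wchan -p <pid>` or iostat in another terminal while it blocks in each write.

```python
#!/usr/bin/env python3
"""Sketch: every O_SYNC write must be acknowledged by the storage before the
process continues, so while it waits it sits in uninterruptible sleep
(state 'D') and is counted as iowait."""
import os
import time

PATH = "/mnt/storage/sync_write_test"   # hypothetical path -- adjust
CHUNK = b"\0" * (1024 * 1024)           # 1 MiB per write

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
try:
    for i in range(256):                # 256 MiB total
        t0 = time.monotonic()
        os.write(fd, CHUNK)             # blocks until the device acknowledges
        dt = time.monotonic() - t0
        if dt > 0.1:                    # flag writes that stall noticeably
            print(f"write {i}: {dt * 1000:.0f} ms")
finally:
    os.close(fd)
    os.unlink(PATH)
```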

Note: for your information, dm stands for device mapper, not data mover.


First of all, holy inferno that's a lot of iron! :)

Unfortunately, since your setup sounds very complex, I don't think anyone's going to be able to provide a straight-away "There's your problem!" answer unless they've worked with an extremely similar or identical setup and hit the same problem. So, while this text is labeled by SU as an "Answer", you should probably consider it more like a "Suggestion". And I can't put it in the comments because it's too many words. :S

Without knowing how your hardware is mapped to the devices, it's hard to say why the I/O is going one place and not another. How do you have the devices mounted? Are your programs accessing the sd* devices directly, or are all of your filesystems mounted on the dm devices, with all file accesses going through those?
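If it helps answer that, here is a small Python sketch that prints which underlying sd* devices each dm device sits on. It assumes a reasonably modern kernel that exposes /sys/block/dm-N/dm/name and /sys/block/dm-N/slaves/.

```python
#!/usr/bin/env python3
"""Sketch: list each dm-* device, its device-mapper name, and the block
devices it is stacked on, using the /sys/block hierarchy."""
import os

for dev in sorted(os.listdir("/sys/block")):
    if not dev.startswith("dm-"):
        continue
    # Human-readable device-mapper name (e.g. the LVM or multipath name).
    name_path = os.path.join("/sys/block", dev, "dm", "name")
    try:
        with open(name_path) as f:
            dm_name = f.read().strip()
    except OSError:
        dm_name = "?"
    # The devices this dm device is built on top of.
    slaves_dir = os.path.join("/sys/block", dev, "slaves")
    slaves = sorted(os.listdir(slaves_dir)) if os.path.isdir(slaves_dir) else []
    print(f"{dev} ({dm_name}) -> {', '.join(slaves) or 'no slaves found'}")
```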

Other things I have to ask about:

  • What kind of RAID is it? If you're calculating parity with RAID5 or RAID6, that is hopefully taken care of by the RAID server hardware... if not, the processing servers are doing it, which is suboptimal and can cause I/O latency when done in software.

  • You isolated one of the main differences between the two servers in your message: one is using Fibre Channel and one is using Ethernet. Fibre Channel should provide better latency and bandwidth, but maybe that's also part of the problem: if it's delivering a lot of throughput, it could be making the RAID server itself very busy... and congestion leads to buffers/caches filling up, which increases latency, which causes higher I/O waits.

It's almost as if you have a bufferbloat problem with your disk arrays -- you know? Hardware RAID controllers normally have a great deal of on-board cache, don't they? So as I/O to the media gets queued up and the caches fill with dirty pages, eventually the whole thing saturates (if the mechanical storage can't keep up with the load) and latency sails through the roof... and you can surely produce more load with 24 cores + FC than with 4 cores + GbE :) Check the RAID server and see how busy the disks are... a lot of the "I/O" may just be control packets, etc. I'm not sure how FC works, but if it's anything like TCP you're going to see retransmissions when the latencies get too high.
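One cheap way to watch the host-side part of that picture is to poll the Dirty and Writeback counters in /proc/meminfo (the controller's own cache you'd have to inspect with the vendor tools). A rough Python sketch, just to illustrate the idea:

```python
#!/usr/bin/env python3
"""Sketch: print, once a second, how much dirty data is sitting in the Linux
page cache and how much is currently being written back.  Ctrl-C to stop."""
import time

def meminfo_kb(*keys):
    """Return the requested /proc/meminfo counters (values are in kB)."""
    vals = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in keys:
                vals[key] = int(rest.strip().split()[0])
    return vals

while True:
    v = meminfo_kb("Dirty", "Writeback")
    print(f"Dirty: {v.get('Dirty', 0) / 1024:8.1f} MB   "
          f"Writeback: {v.get('Writeback', 0) / 1024:8.1f} MB")
    time.sleep(1)
```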

It's like asking someone a question over the phone: if they don't answer for a few seconds, you say "Hello?" -- networking protocols (and FC is just a networking protocol) do the same thing, just on a shorter timescale. But of course that extra "Hello?" is expensive in a networking context, because it adds even more data to an already congested pipe.

In closing, a general tip:

When debugging latency/IO waits/throughput issues, always measure. Measure everywhere. Measure at the wire, measure what the programs themselves are doing, measure at the processing end, measure on the RAID server, etc. Don't just look at it from one perspective -- try to consider each individual component of the system that is responsible for processing, reading or writing any of the data in the pipeline. Take apart one transaction or one discrete work unit and dissect exactly the path it takes through your hardware, and measure at each distinct component to see if there are bottlenecks or places where there is undue latency, etc. A friend of mine called this "peeling back the onion", and I've used the phrase ever since to refer to the task of debugging a data flow.
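
On the Linux side, one easy place to start measuring is /proc/diskstats, sampled for both the sd* paths and the dm-* devices stacked on them. Something along these lines (a rough sketch; field layout per the kernel's iostats documentation, with 512-byte sectors):

```python
#!/usr/bin/env python3
"""Sketch: sample /proc/diskstats twice and report per-device throughput and
how busy each device was during the interval."""
import time

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            if name.startswith(("sd", "dm-")):
                stats[name] = {
                    "sectors_read": int(fields[5]),
                    "sectors_written": int(fields[9]),
                    "io_ms": int(fields[12]),      # time spent doing I/O (ms)
                }
    return stats

INTERVAL = 5   # seconds between samples
before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

for name in sorted(after):
    if name not in before:
        continue
    rd = (after[name]["sectors_read"] - before[name]["sectors_read"]) * 512 / 1e6
    wr = (after[name]["sectors_written"] - before[name]["sectors_written"]) * 512 / 1e6
    busy = (after[name]["io_ms"] - before[name]["io_ms"]) / (INTERVAL * 10)  # % of interval
    print(f"{name:8s} read {rd / INTERVAL:7.1f} MB/s  "
          f"write {wr / INTERVAL:7.1f} MB/s  busy {busy:5.1f}%")
```

If the dm devices show far more activity (or far higher busy time) than the sd devices beneath them, that by itself tells you where in the stack to keep peeling.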