What network loads require NIC polling vs interrupts?
Does anyone have some data or basic calculations that can answer when frame coalescing (NAPI) is required and when a single interrupt per frame is sufficient?
My hardware: IBM BladeServer HS22, Broadcom 5709 Gigabit NIC hardware (MSI-X), with dual Xeon E5530 quad-core processors. Main purpose is Squid proxy server. Switch is a nice Cisco 6500 series.
Our basic problem is that during peak times (100 Mbps of traffic, only 10,000 pps) latency and packet loss increase. I have done a lot of tuning and a kernel upgrade to 2.6.38, which improved the packet loss, but latency is still poor. Pings are sporadic, jumping to 200 ms even on the local Gbps LAN. Squid's average response time jumps from 30 ms to 500+ ms even though CPU and memory load are fine.
The interrupts climb to about 15,000/second during the peak. Ksoftirqd isn't using much CPU. I have installed irqbalance to spread the IRQs (8 each for eth0 and eth1) across all the cores, but that hasn't helped much.
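For reference, here is roughly how that rate can be measured, a quick sketch that sums each queue's counters in /proc/interrupts across CPUs (the eth0/eth1 names match this box; adjust as needed):

# Take two snapshots of the per-queue interrupt counters, one second
# apart, and print the per-queue delta (interrupts per second).
snap() { awk '/eth0|eth1/ { s = 0; for (i = 2; i < NF; i++) if ($i ~ /^[0-9]+$/) s += $i; print $NF, s }' /proc/interrupts; }
snap > /tmp/irq.a; sleep 1; snap > /tmp/irq.b
paste /tmp/irq.a /tmp/irq.b | awk '{ printf "%-10s %6d ints/s\n", $1, $4 - $2 }'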
Intel NICs never seem to have these kinds of problems, but given the blade system and its fixed hardware configuration, we are pretty much stuck with the Broadcoms.
Everything points at the NIC as the main culprit. The best idea I have right now is to try to decrease the interrupt rate while keeping latency low and throughput high.
The bnx2 driver unfortunately doesn't support adaptive-rx or adaptive-tx.
The answer in the NAPI vs. Adaptive Interrupts thread provides a great overview of interrupt moderation, but no concrete information on how to calculate optimal ethtool coalescing settings for a given workload. Is there a better approach than just trial and error?
Does the above-mentioned workload and hardware configuration even need NAPI? Or should it be able to live with a single interrupt per packet?
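For concreteness, the knobs in question are the ethtool coalescing settings. Something like the following is what I mean; the values are illustrative guesses rather than tested settings, and I haven't verified which parameters bnx2 actually honors:

# Show the current coalescing settings for eth0.
ethtool -c eth0

# Illustrative only: fire at most one interrupt per ~200 us or per
# 32 frames, whichever comes first.
ethtool -C eth0 rx-usecs 200 rx-frames 32 tx-usecs 200 tx-frames 32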
Solution 1:
Great question that had me doing some reading to try and figure it out. Wish I could say I have an answer... but maybe some hints.
I can at least answer your question, "should it be able to live on single interrupt per packet". I think the answer is yes, based on a very busy firewall that I have access to:
Sar output:
03:04:53 PM     IFACE    rxpck/s    txpck/s     rxkB/s     txkB/s    rxcmp/s    txcmp/s   rxmcst/s
03:04:54 PM        lo      93.00      93.00       6.12       6.12       0.00       0.00       0.00
03:04:54 PM      eth0  115263.00  134750.00   13280.63   41633.46       0.00       0.00       5.00
03:04:54 PM      eth8   70329.00   55480.00   20132.62    6314.51       0.00       0.00       0.00
03:04:54 PM      eth9   53907.00   66669.00    5820.42   21123.55       0.00       0.00       0.00
03:04:54 PM     eth10       0.00       0.00       0.00       0.00       0.00       0.00       0.00
03:04:54 PM     eth11       0.00       0.00       0.00       0.00       0.00       0.00       0.00
03:04:54 PM      eth1       0.00       0.00       0.00       0.00       0.00       0.00       0.00
03:04:54 PM      eth2  146520.00  111904.00   45228.32   12251.48       0.00       0.00      10.00
03:04:54 PM      eth3     252.00   23446.00      21.34    4667.20       0.00       0.00       0.00
03:04:54 PM      eth4       8.00      10.00       0.68       0.76       0.00       0.00       0.00
03:04:54 PM      eth5       0.00       0.00       0.00       0.00       0.00       0.00       0.00
03:04:54 PM      eth6    3929.00    2088.00    1368.01     183.79       0.00       0.00       1.00
03:04:54 PM      eth7      13.00      17.00       1.42       1.19       0.00       0.00       0.00
03:04:54 PM     bond0  169170.00  201419.00   19101.04   62757.00       0.00       0.00       5.00
03:04:54 PM     bond1  216849.00  167384.00   65360.94   18565.99       0.00       0.00      10.00
As you can see, some very high packets-per-second counts, and no special ethtool tweaking was done on this machine. Oh... Intel chipset, though. :\
The only thing that was done was some manual irq balancing with /proc/irq/XXX/smp_affinity, on a per-interface basis. I'm not sure why they chose to go that way instead of with irqbalance, but it seems to work.
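In case it helps, the pinning itself looks something like this; the IRQ numbers and CPU masks below are made-up examples, so look up the real numbers in /proc/interrupts first:

# Stop irqbalance so it doesn't rewrite the affinity masks.
service irqbalance stop
# Find the IRQ number of each NIC queue.
grep eth /proc/interrupts
# Pin queue IRQs to specific cores via a hex CPU bitmask.
echo 01 > /proc/irq/45/smp_affinity   # hypothetical IRQ 45 -> CPU0
echo 02 > /proc/irq/46/smp_affinity   # hypothetical IRQ 46 -> CPU1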
I also thought about the math required to answer your question, but I think there are way too many variables. So... to summarise, in my opinion, the answer is no, I don't think you can predict the outcomes here, but with enough data capture you should be able to tweak it to a better level.
Having said all that, my gut feel is that you're somehow hardware-bound here... as in a firmware or interop bug of some kind.
Solution 2:
Given the CPU, chipset, and bus capabilities compared to the low amount of traffic you have, there's certainly no reason whatsoever for you to NEED any form of interrupt management. We have multiple RHEL 5.3 64-bit machines with 10Gbps NICs and their interrupt load isn't bad at all; your traffic is a hundred times less.
Obviously you have a fixed configuration (I use HP's blades, which are pretty similar), so swapping out NICs for Intels is not an easy option. What I would say is that I'm starting to spot a number of similar problems around this forum and elsewhere with that particular Broadcom NIC. Even the SE sites themselves had some problems with this kind of inconsistency, and swapping to Intel NICs absolutely helped.
What I'd recommend is picking a single blade and adding an Intel-based adapter to that one machine. You'll obviously have to add an interconnect, or whatever IBM call them, to get the signal out, but try the same software setup with this other NIC (and probably disable the Broadcom if you can). Test this and see how you get on. I know what I've described needs a couple of bits of extra hardware, but I imagine your IBM rep will happily loan you them; it's the only way to know for sure. Please let us know what you find out; I'm genuinely interested if there's a problem with these NICs, even if it's an odd edge case. As an aside, I'm meeting with Intel and Broadcom next week to discuss something entirely unrelated, but I'll certainly discuss it with them and let you know if I find anything of interest.
Solution 3:
The question about interrupts is how they impact overall system performance. Interrupts can preempt user- and kernel-land processing, and while you may not see much CPU use, a lot of context switching occurs, and that is a big performance hit. You can use vmstat and check the system columns, the in and cs headers, for the interrupts and context switches per second (interrupts include the clock tick, so you must weigh that in); it's worth a check too.
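For example (the numbers below are made up, just in the ballpark described in the question); in is interrupts per second and cs is context switches per second:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
 1  0      0 813240  34512 421776    0    0     0     8 15214  9860  3  7 89  1  0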
Solution 4:
The short direct answer:
If you enable polling you will reduce the context switches (normally due to interrupts) from whatever they are now (about 15k/s in your case) to a predetermined number (usually 1k to 2k).
If your current traffic generates more interrupts than that predetermined number, then you should see better response times by enabling polling. The converse is also true. I would not say this is "necessary" unless the context switches are impacting performance.
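The arithmetic is straightforward if the driver honors rx-usecs: the interrupt rate is capped at roughly 1,000,000 / rx-usecs per queue. A sketch, assuming a target of about 2,000 interrupts per second (whether bnx2 accepts this setting is something to verify):

# Cap: interrupts/s ~ 1,000,000 / rx-usecs (per RX queue)
# Target 2,000 ints/s: rx-usecs = 1,000,000 / 2,000 = 500
ethtool -C eth0 rx-usecs 500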
Solution 5:
To follow up: with the NAT and conntrack modules unloaded and a minimized iptables ruleset, we get terrific performance. The IPVS load balancer has done over 900 Mbps / 150 kpps, still using the same Broadcom bnx2 chipsets.
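For anyone else hitting this: instead of unloading the modules outright, a related approach is to bypass conntrack per packet with the raw table's NOTRACK target. A minimal sketch, not exactly our ruleset:

# The raw table is traversed before connection tracking, so these
# rules skip conntrack for all traffic.
iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT     -j NOTRACK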
So to conclude: the interrupt handling seems fine, and the defaults for Debian with a 2.6.38/3.0.x kernel perform acceptably.
I would definitely prefer to use Intel NICs so that we can use standard Debian packages; fighting the non-free bnx2 firmware has been a huge waste of time.