Packet drop on HP ProLiant DL360 G9 running RHEL 6.10

We have an HP ProLiant DL360 G9 running RHEL 6.10 with 2 x Intel 82599ES 10-Gigabit SFI/SFP+ controllers. The HP product name is HP Ethernet 10Gb 2-port 560SFP+ Adapter.

eth5 and eth6 are showing a lot of packet drops (rx_missed_errors). After I disabled flow control at the NIC level, rx_missed_errors stopped increasing, but rx_no_dma_resources started increasing daily.
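
For reference, flow control was disabled per interface with something along these lines (the exact options may have differed):

# turn off RX/TX pause frames (repeated for eth6)
ethtool -A eth5 rx off tx off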

  • They are both standalone interfaces, not part of a bond.
  • eth5 and eth6 are on different cards.
  • Both cards are installed in PCIe 3.0 x16 slots.
  • irqbalance is running on the server

Update 1

Ring parameters for eth5 and eth6 are the same and already at max.

Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096

I noticed the following for eth6 in /proc/interrupts:

 Sun Jun  2 19:39:42 EDT 2019

            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15      CPU16      CPU17      CPU18      CPU19
 165:          0          0          0          0          0     484430     111744     333783     458868     577617          0          0          0          0          0      17978     402211      84832     183482   10567190   PCI-MSI-edge      eth6-TxRx-0
 166:          0          0          0          0          0      92569    2522312      36248      19459       1970          0          0          0          0          0      10140      33710      10180    1071214     651534   PCI-MSI-edge      eth6-TxRx-1
 167:          0          0          0          0          0      41060    2532170      37345      10970      92570          0          0          0          0          0       3055      22158      12485    1203344     494179   PCI-MSI-edge      eth6-TxRx-2
 168:          0          0          0          0          0     218925       8555    2312817     115650     126113          0          0          0          0          0      14575       3965     114145     995924     538667   PCI-MSI-edge      eth6-TxRx-3
 169:          0          0          0          0          0       7354       7781     199591    2262057      45221          0          0          0          0          0      34813     176350     105008     649389     962393   PCI-MSI-edge      eth6-TxRx-4
 170:          0          0          0          0          0      27982      23890      44703     162340    2597754          0          0          0          0          0      25991      22873      11846     885511     943057   PCI-MSI-edge      eth6-TxRx-5
 171:          0          0          0          0          0      16710        370        155   17725587    7504781          0          0          0          0          0 1054801625    1644839      14655  583745291  266971465   PCI-MSI-edge      eth6-TxRx-6
 172:          0          0          0          0          0       9823       6688     407394      11207      44103          0          0          0          0          0      88057    2496075       9284      56799    1391075   PCI-MSI-edge      eth6-TxRx-7
 173:          0          0          0          0          0      21175       1995     125490     151465      27120          0          0          0          0          0      19960     177195    2288457     787724     848755   PCI-MSI-edge      eth6-TxRx-8
 174:          0          0          0          0          0       7835       2210       3990      56075     106870          0          0          0          0          0     109740      24135      27720    2599827    1510934   PCI-MSI-edge      eth6-TxRx-9
 175:          0          0          0          0          0      42450       2605      39545      54520     162830          0          0          0          0          0      56035      11380      33815      52905    3993251   PCI-MSI-edge      eth6-TxRx-10
 176:          0          0          0          0          0      92335      33470    2290862       7545     227035          0          0          0          0          0       7550      25460      17225      65205    1682649   PCI-MSI-edge      eth6-TxRx-11
 177:          0          0          0          0          0      81685      56468    2273033     264820     195585          0          0          0          0          0     120640      36250      29450     244895    1146510   PCI-MSI-edge      eth6-TxRx-12
 178:          0          0          0          0          0      39655      24693     703993    1680384      22325          0          0          0          0          0     147980      27170      41585      72085    1689466   PCI-MSI-edge      eth6-TxRx-13
 179:          0          0          0          0          0     108905       1335      48265    2415832      19985          0          0          0          0          0       3545      23360      12590      35185    1780334   PCI-MSI-edge      eth6-TxRx-14
 180:          0          0          0          0          0     134826     291569      98014       9159    2262093          0          0          0          0          0     128867      18499      20078      39858    1463678   PCI-MSI-edge      eth6-TxRx-15
 181:          0          0          0          0          0       3220      37430      39030     129550      11070          0          0          0          0          0    2382452      24840      10860     146795    1664089   PCI-MSI-edge      eth6-TxRx-16
 182:          0          0          0          0          0      23120      28700     134025      96455      31545          0          0          0          0          0      30340    2262857      24485     144620    1673189   PCI-MSI-edge      eth6-TxRx-17
 183:          0          0          0          0          0       8900      29070      22490     112785     186240          0          0          0          0          0      40690      31665    2274862      37160    1705474   PCI-MSI-edge      eth6-TxRx-18
 184:          0          0          0          0          0      77090      18270      68465      53235     142648          0          0          0          0          0      16295      33770      29175    2367462    1642926   PCI-MSI-edge      eth6-TxRx-19
 185:          0          0          0          0          0         11          0          0          0          0          0          0          0          0          0          0          0          0          0          4   PCI-MSI-edge      eth6

So it looks like CPU cores 15, 18 and 19 are under stress processing traffic on eth6.

Basically I have no idea where to look next. I am guessing this may have something to do with IRQ affinity, but I am not sure. I am also thinking of disabling irqbalance, but I am not sure whether that will make any difference.

Any suggestions?

Update 2

NIC driver info is below. I don't think we are hitting that bug, as it dates from 2009.

driver: ixgbe
version: 4.2.1-k
firmware-version: 0x800008ea
bus-info: 0000:08:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

The data arriving on both eth5 and eth6 is multicast. Is that enough information? Setting up port mirroring requires a ticket to the network engineering team and will take time, and I am also not sure what to tell them to look for.

If I understand your comments correctly, there is a way to balance an eth6-TxRx queue across more than one CPU core. I did some searching myself and collected the following information; hopefully it is useful to you.

ethtool -x output for eth5 and eth6 (eth5 shown):

RX flow hash indirection table for eth5 with 20 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      8     9    10    11    12    13    14    15
   16:      0     1     2     3     4     5     6     7
   24:      8     9    10    11    12    13    14    15
   32:      0     1     2     3     4     5     6     7
   40:      8     9    10    11    12    13    14    15
   48:      0     1     2     3     4     5     6     7
   56:      8     9    10    11    12    13    14    15
   64:      0     1     2     3     4     5     6     7
   72:      8     9    10    11    12    13    14    15
   80:      0     1     2     3     4     5     6     7
   88:      8     9    10    11    12    13    14    15
   96:      0     1     2     3     4     5     6     7
  104:      8     9    10    11    12    13    14    15
  112:      0     1     2     3     4     5     6     7
  120:      8     9    10    11    12    13    14    15
RSS hash key:
3c:f9:4a:0e:fc:7e:cb:83:c2:2a:a4:1c:cf:59:38:1c:ca:54:38:b9:6b:e8:2b:63:6e:d2:9f:eb:fc:04:c2:86:6d:e3:54:f2:73:30:6a:65

ethtool -n <iface> rx-flow-hash udp4 for eth5 and eth6 (eth5 shown):

UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA

I also ran the set_irq_affinity script on both eth5 and eth6:

sudo ./set_irq_affinity local eth5
IFACE CORE MASK -> FILE
=======================
eth5 0 1 -> /proc/irq/144/smp_affinity
eth5 1 2 -> /proc/irq/145/smp_affinity
eth5 2 4 -> /proc/irq/146/smp_affinity
eth5 3 8 -> /proc/irq/147/smp_affinity
eth5 4 10 -> /proc/irq/148/smp_affinity
eth5 10 400 -> /proc/irq/149/smp_affinity
eth5 11 800 -> /proc/irq/150/smp_affinity
eth5 12 1000 -> /proc/irq/151/smp_affinity
eth5 13 2000 -> /proc/irq/152/smp_affinity
eth5 14 4000 -> /proc/irq/153/smp_affinity
eth5 0 1 -> /proc/irq/154/smp_affinity
eth5 1 2 -> /proc/irq/155/smp_affinity
eth5 2 4 -> /proc/irq/156/smp_affinity
eth5 3 8 -> /proc/irq/157/smp_affinity
eth5 4 10 -> /proc/irq/158/smp_affinity
eth5 10 400 -> /proc/irq/159/smp_affinity
eth5 11 800 -> /proc/irq/160/smp_affinity
eth5 12 1000 -> /proc/irq/161/smp_affinity
eth5 13 2000 -> /proc/irq/162/smp_affinity
eth5 14 4000 -> /proc/irq/163/smp_affinity
sudo ./set_irq_affinity local eth6
IFACE CORE MASK -> FILE
=======================
eth6 5 20 -> /proc/irq/165/smp_affinity
eth6 6 40 -> /proc/irq/166/smp_affinity
eth6 7 80 -> /proc/irq/167/smp_affinity
eth6 8 100 -> /proc/irq/168/smp_affinity
eth6 9 200 -> /proc/irq/169/smp_affinity
eth6 15 8000 -> /proc/irq/170/smp_affinity
eth6 16 10000 -> /proc/irq/171/smp_affinity
eth6 17 20000 -> /proc/irq/172/smp_affinity
eth6 18 40000 -> /proc/irq/173/smp_affinity
eth6 19 80000 -> /proc/irq/174/smp_affinity
eth6 5 20 -> /proc/irq/175/smp_affinity
eth6 6 40 -> /proc/irq/176/smp_affinity
eth6 7 80 -> /proc/irq/177/smp_affinity
eth6 8 100 -> /proc/irq/178/smp_affinity
eth6 9 200 -> /proc/irq/179/smp_affinity
eth6 15 8000 -> /proc/irq/180/smp_affinity
eth6 16 10000 -> /proc/irq/181/smp_affinity
eth6 17 20000 -> /proc/irq/182/smp_affinity
eth6 18 40000 -> /proc/irq/183/smp_affinity
eth6 19 80000 -> /proc/irq/184/smp_affinity

Update 3

I modified the udp4 rx-flow-hash to include the source and destination ports, but it did not make any difference.
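
For reference, the hash fields were changed with a command roughly like this (sdfn = source IP, destination IP, source port, destination port; applied to both interfaces):

# include the L4 ports in the UDP/IPv4 RSS hash (run for eth5 and eth6)
ethtool -N eth5 rx-flow-hash udp4 sdfn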

ethtool -n eth5 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]

I disabled irqbalance and manually updated /proc/irq/171/smp_affinity_list to include all 10 'local' CPU cores.

cat /proc/irq/171/smp_affinity_list
5-9,15-19
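
For reference, the change was applied roughly as follows (RHEL 6 style service commands assumed):

# stop irqbalance so it cannot overwrite the manual affinity
sudo service irqbalance stop
sudo chkconfig irqbalance off

# let IRQ 171 (eth6-TxRx-6) be serviced by all ten NUMA-local cores
echo 5-9,15-19 | sudo tee /proc/irq/171/smp_affinity_list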

Here is grep 171: /proc/interrupts taken right after I made the above changes (added src and dst ports to the udp4 rx-flow-hash and set /proc/irq/171/smp_affinity_list to 5-9,15-19); let's call it 'before'.

Here is grep 171: from /proc/interrupts this morning; let's call it 'after'.

Before 171:          0          0          0          0          0      16840        390        155   17725587    7505131          0          0          0          0          0 1282081848  184961789      21430  583751571  266997575   PCI-MSI-edge      eth6-TxRx-6
After  171:          0          0          0          0          0      16840        390        155   17725587    7505131          0          0          0          0          0 1282085923  184961789      21430  583751571  267026844   PCI-MSI-edge      eth6-TxRx-6

As you can see from the above, IRQ 171 is only handled by CPU 19. When irqbalance is running, a different CPU will handle IRQ 171, but it seems that, for some reason, IRQ 171 cannot be balanced across more than one CPU.

Here are the packet drop updates:

Wed Jun 5 01:39:41 EDT 2019
ethtool -S eth6 | grep -E "rx_missed|no_buff|no_dma"
rx_no_buffer_count: 0
rx_missed_errors: 2578857
rx_no_dma_resources: 3456533

Thu Jun 6 05:43:34 EDT 2019
njia@c4z-ut-rttp-b19 $ sudo ethtool -S eth6 | grep -E "rx_missed|no_buff|no_dma"
rx_no_buffer_count: 0
rx_missed_errors: 2578857
rx_no_dma_resources: 3950904

Time does not matter here, as the multicast data stops after 16:00 each day.

I found this article on the Red Hat site: Packet loss when multiple processes subscribe to the same multicast group.

Our developer also mentioned that if we run only one instance of our application, the number of drops is reduced significantly. Usually there are 8 instances.

Increased net.core.rmem_default from 4 MB to 16 MB:

sysctl -w net.core.rmem_default=16777216
net.core.rmem_default = 16777216
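
To make the setting survive a reboot it can also be added to /etc/sysctl.conf (standard RHEL 6 location assumed):

# persist the new default receive buffer size across reboots
echo 'net.core.rmem_default = 16777216' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p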

Here is the current UDP stack status; I will check again tomorrow.

Fri Jun  7 00:40:10 EDT 2019
netstat -s | grep -A 4 Udp:

Udp:
    90579753493 packets received
    1052 packets to unknown port received.
    1264898431 packet receive errors
    1295021855 packets sent

  1. Check the driver version. There was a bug in the accounting of rx_no_dma_resources when the RX buffer is full. Also check the length of the ring buffers (ethtool -g <iface>) and increase them if possible (ethtool -G <iface> rx <size> tx <size>; note that this causes a short interruption in packet processing).

Note: After the update of the question we know there is no such bug, but I think the issues should be solved in order of importance. So let's solve the issue with the missed errors first and only then try to solve the rx_no_dma_resources errors.

  2. rx_missed_errors means the system doesn't have enough CPU resources to process the incoming packets. In most cases this happens when the CPU core that should execute the IRQ handler is under load. Check the output of cat /proc/interrupts and investigate how the NIC IRQ counters are distributed between the CPU cores. Disable irqbalance and use the set_irq_affinity script to bind the IRQ handlers to cores. If your system has multiple NUMA domains, you should use the local or remote options of this script.

  3. Check the output of perf top to investigate what causes the CPU load during network packet processing (see the sketch after this list).
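
A minimal example of the profiling step (standard perf options; assumes the perf package is installed):

# live view of the hottest kernel and user symbols while traffic is flowing
perf top

# or record system-wide with call graphs for 30 seconds and inspect afterwards
perf record -a -g -- sleep 30
perf report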

Update 1

As you can see in /proc/interrupts, some CPU cores (15, 18, 19) handle many more interrupts (hundreds of times more) from the eth6-TxRx-6 queue IRQ handler than the other cores. Check the load of these CPU cores; most likely they are overloaded quite often.

So, besides the incorrect CPU affinity and irqbalance, you have another issue. You should investigate the predominant traffic type that passes through queue 6 of the eth6 NIC. Use switch port mirroring and Wireshark (start with Statistics - Protocol Hierarchy). After that you can tune the RSS hashing with ethtool to spread this traffic across several NIC queues and avoid overloading particular cores.
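
Before the port mirror is in place, a quick sanity check is possible from the host itself (assuming the ixgbe driver exposes per-queue counters, as it normally does):

# per-queue receive counters for eth6; queue 6 should stand out if it carries the bulk of the traffic
ethtool -S eth6 | grep -E "rx_queue_[0-9]+_packets"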

Some notes about NUMA

You've asked for details about the local and remote options of the set_irq_affinity script. To answer this question I've drawn a simplified diagram of a dual-socket system.

A dual socket system diagram

Modern CPUs have an integrated memory controller and PCI Express controller. In multi-socket systems there is an interprocessor link that provides data exchange between the processors. Every processor can access all of the memory, but if a processor works with data in a memory area that is managed by the memory controller of another processor, this incurs the overhead of a request to that remote memory controller and a penalty for the data transfer over the interprocessor link.

Data transfer between a PCI Express device and the system is implemented with DMA (Direct Memory Access), which lets a peripheral device read and write data in RAM without explicit requests to the CPU. Obviously, it is very implementation specific, but it inherits the same memory access limitations.

So, how is IRQ affinity involved in all of this? Roughly, when the PCI Express NIC receives data from outside, it stores that data in system RAM using DMA and generates an interrupt to notify the system. What happens if the interrupt handler is executed on another CPU, not the local one? Naturally, the interrupt handler needs the received data to process it, so all of the overheads and penalties of remote memory access apply. In the worst case this can overload the interprocessor link.

So, as you can see, the correct setup of IRQ affinity on NUMA systems is very important. The set_irq_affinity script automates the binding of NIC queue IRQ handlers to CPU cores. In the best case you will see a "staircase" of non-zero counters in /proc/interrupts. Obviously, irqbalance plays its own game and completely kills the benefits of this IRQ affinity.
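
As a quick check (standard sysfs paths assumed), you can see which NUMA node each NIC sits on and which CPU cores are "local" to it:

# NUMA node of the PCIe slot the NIC is attached to (-1 means no NUMA information exposed)
cat /sys/class/net/eth6/device/numa_node

# CPU cores local to that NUMA node
cat /sys/class/net/eth6/device/local_cpulist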

Update 2

So, here is the information we have at the moment:

  • There is a lot of multicast traffic that is processed by the eth6-TxRx-6 interrupt handler.
  • The RSS hash for udp4 uses only the IP source address and IP destination address.
  • After running set_irq_affinity, the handler of this queue is bound to core 16.

What you can do now:

  • Monitor the statistics and the core load, especially on core 16. Are there still overloads and missed errors?

  • Is this multicast traffic a single flow or several? If there are several flows, you can tune the udp4 hashing with ethtool. If the NIC uses not only the IP addresses but also the port numbers for the hash, it will likely be able to spread the processing between several receive queues and, consequently, between several CPU cores. If it is a single flow, then you can try to bind more CPU cores to the corresponding IRQ handler.

Update 3

So, you have several issues simultaneously.

  1. In the netstat output you have:

1264898431 packet receive errors

But these errors are not related to the missed errors. When the system doesn't have enough CPU resources to handle the IRQ, the packet is lost before any protocol handler is executed. If there is not enough memory for the UDP socket buffers, you will see corresponding errors in the output of nstat -az UdpRcvbufErrors. Monitor it and increase the memory limits with the sysctl variables. You can also monitor the receive queues of the sockets with the ss tool; it can be helpful too (see the monitoring sketch after this list).

  2. Investigate which processes consume the CPU time. After this you can profile the workload with perf record or perf top. Is it really softirq that overloads a single core? This kernel work covers many things, so perf top will be helpful to investigate what exactly consumes most of the CPU time.

  3. If you have only a single multicast group, only a single IRQ will be triggered for this flow, because the n-tuple hash will always be the same. I don't know a workaround for this case; the only way is to use a faster processor. You can also check the output of the i7z tool to monitor the sleep states of the CPUs.

  4. I don't know your application architecture specifics, but maybe you also have an issue with loss of multicast UDP datagrams when several instances are run. Maybe it is also related to incorrect binding of the application instances to CPU cores. Try to bind the application processes to specific CPU cores (see the sketch after this list).
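
A minimal sketch of the monitoring and pinning steps above (the buffer sizes are example values and mcast_app is a placeholder name, not something from your system):

# watch for UDP receive buffer overruns (should stay flat if the buffers are large enough)
nstat -az UdpRcvbufErrors

# inspect the receive queues and socket memory of the UDP receivers
ss -uamp

# raise the socket buffer limits if UdpRcvbufErrors keeps growing (example values)
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.rmem_default=16777216

# pin one application instance to a fixed set of cores ("mcast_app" is a placeholder)
taskset -c 0-4 ./mcast_app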

P.S. I will extend the answer when you provide information about the results of the steps above.