Packet drop on HP ProLiant DL360 G9 running RHEL 6.10
We have an HP ProLiant DL360 G9 running RHEL 6.10 with two Intel 82599ES 10-Gigabit SFI/SFP+ NICs (HP product name: HP Ethernet 10Gb 2-port 560SFP+ Adapter).
eth5 and eth6 are showing a lot of packet drops (rx_missed_errors).
I disabled flow control at the NIC level; after that, rx_missed_errors stopped increasing, but rx_no_dma_resources started increasing daily.
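For reference, flow control was turned off with something along these lines:
# disable pause frames on both ports, then verify
ethtool -A eth5 rx off tx off
ethtool -A eth6 rx off tx off
ethtool -a eth5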
- Both are standalone interfaces, not part of a bond.
- eth5 and eth6 are on different cards.
- Both cards are installed in PCIe 3.0 x16 slots.
- irqbalance is running on the server.
Update 1
Ring parameters for eth5 and eth6 are the same and already at the maximum:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
I noticed the following for eth6 in /proc/interrupts:
Sun Jun 2 19:39:42 EDT 2019
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15 CPU16 CPU17 CPU18 CPU19
165: 0 0 0 0 0 484430 111744 333783 458868 577617 0 0 0 0 0 17978 402211 84832 183482 10567190 PCI-MSI-edge eth6-TxRx-0
166: 0 0 0 0 0 92569 2522312 36248 19459 1970 0 0 0 0 0 10140 33710 10180 1071214 651534 PCI-MSI-edge eth6-TxRx-1
167: 0 0 0 0 0 41060 2532170 37345 10970 92570 0 0 0 0 0 3055 22158 12485 1203344 494179 PCI-MSI-edge eth6-TxRx-2
168: 0 0 0 0 0 218925 8555 2312817 115650 126113 0 0 0 0 0 14575 3965 114145 995924 538667 PCI-MSI-edge eth6-TxRx-3
169: 0 0 0 0 0 7354 7781 199591 2262057 45221 0 0 0 0 0 34813 176350 105008 649389 962393 PCI-MSI-edge eth6-TxRx-4
170: 0 0 0 0 0 27982 23890 44703 162340 2597754 0 0 0 0 0 25991 22873 11846 885511 943057 PCI-MSI-edge eth6-TxRx-5
171: 0 0 0 0 0 16710 370 155 17725587 7504781 0 0 0 0 0 1054801625 1644839 14655 583745291 266971465 PCI-MSI-edge eth6-TxRx-6
172: 0 0 0 0 0 9823 6688 407394 11207 44103 0 0 0 0 0 88057 2496075 9284 56799 1391075 PCI-MSI-edge eth6-TxRx-7
173: 0 0 0 0 0 21175 1995 125490 151465 27120 0 0 0 0 0 19960 177195 2288457 787724 848755 PCI-MSI-edge eth6-TxRx-8
174: 0 0 0 0 0 7835 2210 3990 56075 106870 0 0 0 0 0 109740 24135 27720 2599827 1510934 PCI-MSI-edge eth6-TxRx-9
175: 0 0 0 0 0 42450 2605 39545 54520 162830 0 0 0 0 0 56035 11380 33815 52905 3993251 PCI-MSI-edge eth6-TxRx-10
176: 0 0 0 0 0 92335 33470 2290862 7545 227035 0 0 0 0 0 7550 25460 17225 65205 1682649 PCI-MSI-edge eth6-TxRx-11
177: 0 0 0 0 0 81685 56468 2273033 264820 195585 0 0 0 0 0 120640 36250 29450 244895 1146510 PCI-MSI-edge eth6-TxRx-12
178: 0 0 0 0 0 39655 24693 703993 1680384 22325 0 0 0 0 0 147980 27170 41585 72085 1689466 PCI-MSI-edge eth6-TxRx-13
179: 0 0 0 0 0 108905 1335 48265 2415832 19985 0 0 0 0 0 3545 23360 12590 35185 1780334 PCI-MSI-edge eth6-TxRx-14
180: 0 0 0 0 0 134826 291569 98014 9159 2262093 0 0 0 0 0 128867 18499 20078 39858 1463678 PCI-MSI-edge eth6-TxRx-15
181: 0 0 0 0 0 3220 37430 39030 129550 11070 0 0 0 0 0 2382452 24840 10860 146795 1664089 PCI-MSI-edge eth6-TxRx-16
182: 0 0 0 0 0 23120 28700 134025 96455 31545 0 0 0 0 0 30340 2262857 24485 144620 1673189 PCI-MSI-edge eth6-TxRx-17
183: 0 0 0 0 0 8900 29070 22490 112785 186240 0 0 0 0 0 40690 31665 2274862 37160 1705474 PCI-MSI-edge eth6-TxRx-18
184: 0 0 0 0 0 77090 18270 68465 53235 142648 0 0 0 0 0 16295 33770 29175 2367462 1642926 PCI-MSI-edge eth6-TxRx-19
185: 0 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 4 PCI-MSI-edge eth6
So it looks like CPU cores 15, 18 and 19 are under stress processing traffic on eth6.
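The snapshot above can be summarized per queue with a quick one-liner (it assumes the 20 CPU columns shown above):
# find the busiest CPU for each eth6 queue
grep 'eth6-TxRx' /proc/interrupts | \
  awk '{max=0; cpu=0; for (i=2; i<=21; i++) if ($i > max) {max=$i; cpu=i-2};
        printf "%-14s busiest: CPU%d (%d interrupts)\n", $NF, cpu, max}'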
Basically I have no idea where to look next. I am guessing this may have something to do with IRQ affinity, but I'm not sure. I am also thinking of disabling irqbalance, but I'm not sure whether that would make any difference.
Any suggestions?
Update 2
NIC driver info below. I don't think we have that bug, as that was back in 2009.
driver: ixgbe
version: 4.2.1-k
firmware-version: 0x800008ea
bus-info: 0000:08:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
The data arriving on both eth5 and eth6 is multicast. Is that enough information? Setting up port mirroring needs a ticket to the network engineering team and will take time, and I am also not sure what to tell them to look for.
If I understand your comments correctly, there is a way to balance an eth6 TxRx queue across more than one CPU core. I did some searching myself and collected the following information; hopefully it is useful to you.
ethtool -x for eth5 and eth6:
RX flow hash indirection table for eth5 with 20 RX ring(s):
0: 0 1 2 3 4 5 6 7
8: 8 9 10 11 12 13 14 15
16: 0 1 2 3 4 5 6 7
24: 8 9 10 11 12 13 14 15
32: 0 1 2 3 4 5 6 7
40: 8 9 10 11 12 13 14 15
48: 0 1 2 3 4 5 6 7
56: 8 9 10 11 12 13 14 15
64: 0 1 2 3 4 5 6 7
72: 8 9 10 11 12 13 14 15
80: 0 1 2 3 4 5 6 7
88: 8 9 10 11 12 13 14 15
96: 0 1 2 3 4 5 6 7
104: 8 9 10 11 12 13 14 15
112: 0 1 2 3 4 5 6 7
120: 8 9 10 11 12 13 14 15
RSS hash key:
3c:f9:4a:0e:fc:7e:cb:83:c2:2a:a4:1c:cf:59:38:1c:ca:54:38:b9:6b:e8:2b:63:6e:d2:9f:eb:fc:04:c2:86:6d:e3:54:f2:73:30:6a:65
ethtool -n eth5 rx-flow-hash udp4 (likewise for eth6):
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
I also ran set_irq_affinity on both eth5 and eth6:
sudo ./set_irq_affinity local eth5
IFACE CORE MASK -> FILE
=======================
eth5 0 1 -> /proc/irq/144/smp_affinity
eth5 1 2 -> /proc/irq/145/smp_affinity
eth5 2 4 -> /proc/irq/146/smp_affinity
eth5 3 8 -> /proc/irq/147/smp_affinity
eth5 4 10 -> /proc/irq/148/smp_affinity
eth5 10 400 -> /proc/irq/149/smp_affinity
eth5 11 800 -> /proc/irq/150/smp_affinity
eth5 12 1000 -> /proc/irq/151/smp_affinity
eth5 13 2000 -> /proc/irq/152/smp_affinity
eth5 14 4000 -> /proc/irq/153/smp_affinity
eth5 0 1 -> /proc/irq/154/smp_affinity
eth5 1 2 -> /proc/irq/155/smp_affinity
eth5 2 4 -> /proc/irq/156/smp_affinity
eth5 3 8 -> /proc/irq/157/smp_affinity
eth5 4 10 -> /proc/irq/158/smp_affinity
eth5 10 400 -> /proc/irq/159/smp_affinity
eth5 11 800 -> /proc/irq/160/smp_affinity
eth5 12 1000 -> /proc/irq/161/smp_affinity
eth5 13 2000 -> /proc/irq/162/smp_affinity
eth5 14 4000 -> /proc/irq/163/smp_affinity
sudo ./set_irq_affinity local eth6
IFACE CORE MASK -> FILE
=======================
eth6 5 20 -> /proc/irq/165/smp_affinity
eth6 6 40 -> /proc/irq/166/smp_affinity
eth6 7 80 -> /proc/irq/167/smp_affinity
eth6 8 100 -> /proc/irq/168/smp_affinity
eth6 9 200 -> /proc/irq/169/smp_affinity
eth6 15 8000 -> /proc/irq/170/smp_affinity
eth6 16 10000 -> /proc/irq/171/smp_affinity
eth6 17 20000 -> /proc/irq/172/smp_affinity
eth6 18 40000 -> /proc/irq/173/smp_affinity
eth6 19 80000 -> /proc/irq/174/smp_affinity
eth6 5 20 -> /proc/irq/175/smp_affinity
eth6 6 40 -> /proc/irq/176/smp_affinity
eth6 7 80 -> /proc/irq/177/smp_affinity
eth6 8 100 -> /proc/irq/178/smp_affinity
eth6 9 200 -> /proc/irq/179/smp_affinity
eth6 15 8000 -> /proc/irq/180/smp_affinity
eth6 16 10000 -> /proc/irq/181/smp_affinity
eth6 17 20000 -> /proc/irq/182/smp_affinity
eth6 18 40000 -> /proc/irq/183/smp_affinity
eth6 19 80000 -> /proc/irq/184/smp_affinity
Update 3
I modified the udp4 rx-flow-hash to include the source and destination ports, but it did not make any difference.
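The change was made with an ethtool command along these lines (sdfn = source/destination IP plus source/destination ports):
ethtool -N eth5 rx-flow-hash udp4 sdfn
ethtool -N eth6 rx-flow-hash udp4 sdfn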
ethtool -n eth5 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]
I disabled irqbalance and manually updated /proc/irq/171/smp_affinity_list to include all 10 'local' CPU cores.
cat /proc/irq/171/smp_affinity_list
5-9,15-19
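The list itself was written the usual way (as root); irqbalance has to stay disabled, or it will overwrite this setting:
echo 5-9,15-19 > /proc/irq/171/smp_affinity_list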
Here is grep 171: /proc/interrupts right after I made the above changes (added src and dst ports to the udp4 rx-flow-hash and added 5-9,15-19 to /proc/irq/171/smp_affinity_list); let's call it "before".
Here is grep 171: /proc/interrupts from this morning; let's call it "after".
Before 171: 0 0 0 0 0 16840 390 155 17725587 7505131 0 0 0 0 0 1282081848 184961789 21430 583751571 266997575 PCI-MSI-edge eth6-TxRx-6
After 171: 0 0 0 0 0 16840 390 155 17725587 7505131 0 0 0 0 0 1282085923 184961789 21430 583751571 267026844 PCI-MSI-edge eth6-TxRx-6
As you can see from the above, IRQ 171 is still handled almost exclusively by CPU 19. If irqbalance is running, a different CPU will handle IRQ 171; it seems that for some reason IRQ 171 cannot be spread across more than one CPU.
Here are the packet drop updates:
Wed Jun 5 01:39:41 EDT 2019
ethtool -S eth6 | grep -E "rx_missed|no_buff|no_dma"
rx_no_buffer_count: 0
rx_missed_errors: 2578857
rx_no_dma_resources: 3456533
Thu Jun 6 05:43:34 EDT 2019
njia@c4z-ut-rttp-b19 $ sudo ethtool -S eth6 | grep -E "rx_missed|no_buff|no_dma"
rx_no_buffer_count: 0
rx_missed_errors: 2578857
rx_no_dma_resources: 3950904
The exact timing does not matter here, as the multicast data stops after 16:00 each day.
I found this article on the Red Hat site: Packet loss when multiple processes subscribe to the same multicast group.
Our developer also mentioned that if we run only one instance of our application, the number of drops is reduced significantly; usually there are 8 instances.
Increased net.core.rmem_default from 4 MB to 16 MB:
sysctl -w net.core.rmem_default=16777216
net.core.rmem_default = 16777216
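To make this persistent across reboots the same value can also go into /etc/sysctl.conf; note that applications requesting a larger buffer via SO_RCVBUF are capped by net.core.rmem_max, so that limit may need raising as well (the value below is only an example):
# /etc/sysctl.conf
net.core.rmem_default = 16777216
net.core.rmem_max = 33554432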
Here is the current UDP stack status; I will check again tomorrow.
Fri Jun 7 00:40:10 EDT 2019
netstat -s | grep -A 4 Udp:
Udp:
90579753493 packets received
1052 packets to unknown port received.
1264898431 packet receive errors
1295021855 packets sent
- Check the driver version. There was a bug in the accounting of rx_no_dma_resources when the rx buffer is full. So check the length of the ring buffers (ethtool -g <iface>) and increase them (ethtool -G <iface> rx <size> tx <size>; note that this causes a brief interruption in packet processing).
Note: after the update of the question we know there isn't such a bug, but I think the issues should be solved in order of importance. So let's solve the issue with the missed errors first, and only then try to solve the rx_no_dma_resources errors.
- rx_missed_errors means the system doesn't have enough CPU resources to process the incoming packets. In most cases this happens when the CPU core that should execute the IRQ handler is under load. Check the output of the cat /proc/interrupts command and investigate how the NIC IRQ counters are distributed between the CPU cores. Disable irqbalance and use the set_irq_affinity script to bind the IRQ handlers to the cores. If your system has multiple NUMA domains, you should use the local or remote options of this script.
- Check the output of the perf top command to investigate what causes the CPU load during network packet processing (a combined sketch of disabling irqbalance, running set_irq_affinity and checking with perf top follows this list).
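A minimal sketch of those steps on RHEL 6, assuming the set_irq_affinity script from the Intel ixgbe driver source package and using CPU 19 only as an example of a busy core:
# stop irqbalance so it does not override the manual affinity
service irqbalance stop
chkconfig irqbalance off
# bind each queue's IRQ to a core on the NIC's local NUMA node
./set_irq_affinity local eth5
./set_irq_affinity local eth6
# see what is burning a particular core (CPU 19 here is just an example)
perf top -C 19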
Update 1
As you can see in /proc/interrupts, some CPU cores (15, 18, 19) handle many more interrupts (hundreds of times more) from the eth6-TxRx-6 queue IRQ handler than the other cores. Check the load of these CPU cores; they are likely overloaded quite often.
So, apart from the incorrect CPU affinity and irqbalance, you have another issue. You should investigate the predominant traffic type passing through queue 6 of the eth6 NIC. Use switch port mirroring and Wireshark (start with Statistics -> Protocol Hierarchy). After that you can tune the RSS hashing with ethtool to spread this traffic across several NIC queues and avoid overloading individual cores.
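If arranging port mirroring takes a while, a short capture on the host itself can already give a first impression of the traffic mix (the sample size and file name here are arbitrary):
# grab a small sample from eth6 and analyse it offline
tcpdump -ni eth6 -c 20000 -w /tmp/eth6-sample.pcap
# open it in Wireshark (Statistics -> Protocol Hierarchy), or get a rough
# breakdown of source/destination pairs on the command line
tcpdump -nr /tmp/eth6-sample.pcap | awk '{print $3, $5}' | sort | uniq -c | sort -rn | head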
Some notes about NUMA
You've asked for details about the local and remote options of the set_irq_affinity script. To answer this question I've drawn a simplified diagram of a dual-socket system.
Modern CPUs have an integrated memory controller and PCI Express controller. In multi-socket systems there is an interprocessor link that provides data exchange between processors. Every processor can access all memory, but if a processor works with data in a memory area managed by the memory controller of another processor, this entails the overhead of a request to that remote memory controller and a penalty for the data transfer over the interprocessor link.
Data transfer between a PCI Express device and the system is implemented with DMA (Direct Memory Access), which allows a peripheral device to read and write data in RAM without explicit requests to the CPU. Obviously this is very implementation-specific, but it inherits the same memory access constraints.
So, how is IRQ affinity involved in all of this? Roughly, when the PCI Express NIC receives data from outside, it stores that data in system RAM via DMA and generates an interrupt to notify the system. What happens if the interrupt handler is executed on another, non-local CPU? Naturally, the interrupt handler needs the received data in order to process it, and all the overheads and penalties of remote memory access are incurred. In the worst case this can overload the interprocessor link.
So, as you can see, correct IRQ affinity setup is very important on NUMA systems. The set_irq_affinity script automates binding the NIC queue IRQ handlers to CPU cores. In the best case you will see a "staircase" of non-zero counters in /proc/interrupts. Obviously, irqbalance tries to play its own game and completely kills the benefit of this IRQ affinity.
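A quick way to check which NUMA node and CPU cores are local to a NIC (standard sysfs attributes):
cat /sys/class/net/eth6/device/numa_node      # -1 means no NUMA information is exposed
cat /sys/class/net/eth6/device/local_cpulist  # CPU cores local to this NIC
lscpu | grep -i numa                          # which cores belong to which NUMA node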
Update 2
So, here is the information we have at the moment:
- There is a lot of multicast traffic, and it is processed by the eth6-TxRx-6 interrupt handler.
- The RSS hash for UDP4 uses only the IP source address and IP destination address.
- After running set_irq_affinity, the handler of this queue is bound to CPU core 16.
What you can do now:
- Monitor the statistics and the load of the cores, especially core 16. Is there still overload, and are the missed errors still increasing?
- Is this multicast traffic a single flow or several? If there are several flows, you can tune the udp4 hashing with ethtool. If the NIC uses not only the IP addresses but also the port numbers for the hash, it will likely be able to share the processing between several receive queues and, consequently, between several CPU cores. If it is a single flow, you can try to bind more CPU cores to the corresponding IRQ handler. One way to check how many flows there are is sketched below.
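For the single-flow question, a sample capture like the one suggested above can be reused (the file name is only an example):
# each distinct source/destination address+port pair is a separate flow from the RSS point of view
tcpdump -nr /tmp/eth6-sample.pcap | awk '{print $3, $5}' | sort -u | wc -l
# list the multicast groups joined on eth6
ip maddr show dev eth6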
Update 3
So, you have several issues simultaneously.
- In the netstat output you have:
1264898431 packet receive errors
But these errors are not related to the missed errors. When the system doesn't have enough CPU resources to handle the IRQ, the packet is lost before any protocol handler is executed. If the memory for UDP socket buffers isn't enough, you will see corresponding errors in the output of the nstat -az UdpRcvbufErrors command. Monitor it and increase the memory limits with the sysctl variables. You can also monitor the receive queues of the sockets with the ss tool; that can be helpful too (a short monitoring sketch follows this list).
- Investigate which processes consume the CPU time. After that you can profile the workload with perf record or perf top. Is it really softirq that overloads a single core? The softirq context handles many different things, so perf top will be helpful to investigate what exactly consumes most of the CPU time.
- If you have only a single multicast group, only a single IRQ will be triggered for this flow, because the n-tuple hash will always be the same. I don't know any workaround for this case; the only option is a faster processor. You can also check the output of the i7z tool to monitor the sleep states of the CPU.
- I don't know the specifics of your application architecture, but you may also have an issue with loss of multicast UDP datagrams when several instances are running. Maybe it is also related to incorrect binding of the application instances to CPU cores. Try binding the application processes to specific CPU cores (see the sketch below).
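A minimal monitoring and pinning sketch; the core list and the application name are placeholders, not your actual setup:
# watch UDP receive-buffer drops and per-socket receive queues
nstat -az UdpRcvbufErrors
ss -u -a -n | head
# pin one application instance to the cores local to eth6 (binary name is hypothetical)
taskset -c 5-9,15-19 ./your_multicast_app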
P.S. I will extend the answer when you provide information about the results of the steps above.