Unexpected ARP Probe and ARP Announcement on Windows 10

In our system, there are three hosts, all connected to the same Ethernet switch, as illustrated below:

A (192.168.0.21, WIN10_1809) <-> Switch <-> B (192.168.0.100, Debian Linux 9)
                                  ^
                                  |
                       C (192.168.0.201, WIN10_1809)

Between any two of these hosts there is periodic network communication, both low-level ping operations and higher-level business messages (over TCP or UDP).

Occasionally (roughly once every day or two), host B and host C find that host A is unreachable by ping for around 7 seconds, while host A has no problem pinging host B and host C. During the same window, the TCP and UDP communications involving host A also fail, while communication between host B and host C remains completely normal.

The problem happens on multiple systems in our company, and neither the networking hardware (we have replaced the switch and the connecting cables) nor the network load (the problem still occurs when the system is idle, at less than 1% bandwidth usage) appears to contribute to it.
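For reference, the outage window on hosts B and C can be timestamped with a small monitor such as the sketch below, so the failures can be correlated with packet captures. This is a hypothetical script, not part of our software; the target address and the Linux ping options (-c/-W) are assumptions:

# Hypothetical outage monitor: pings host A once per second and logs when it
# becomes unreachable or reachable again, so the ~7-second window can be
# matched against a packet capture. Assumes a Linux host (e.g. host B).
import subprocess
import time
from datetime import datetime

TARGET = "192.168.0.21"  # host A

def reachable(host: str) -> bool:
    # Send one echo request and wait at most one second for a reply.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

was_up = True
while True:
    is_up = reachable(TARGET)
    if is_up != was_up:
        state = "reachable" if is_up else "UNREACHABLE"
        print(f"{datetime.now().isoformat()}  {TARGET} is now {state}")
        was_up = is_up
    time.sleep(1)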

Then, by checking the network traffic with Wireshark (captured through the Ethernet switch; the capture file is available for download), we found that the ping requests were sent out but no responses were received:

No.     Time        Source          Destination     Protocol Length Info
1455    1.509228    192.168.0.100   192.168.0.21    ICMP    98  Echo (ping) request  id=0x6812, seq=1/256, ttl=64 (no response found!)
1848    2.250592    192.168.0.201   192.168.0.21    ICMP    66  Echo (ping) request  id=0x30f0, seq=30977/377, ttl=128 (no response found!)
2413    3.512684    192.168.0.100   192.168.0.21    ICMP    98  Echo (ping) request  id=0x6818, seq=1/256, ttl=64 (no response found!)
3269    5.516020    192.168.0.100   192.168.0.21    ICMP    98  Echo (ping) request  id=0x681c, seq=1/256, ttl=64 (no response found!)

At the same time, the ping requests from host A were answered normally:

1130    1.130713    192.168.0.21    192.168.0.100   ICMP    60  Echo (ping) request  id=0x0008, seq=2313/2313, ttl=255 (reply in 1133)
1131    1.130713    192.168.0.21    192.168.0.201   ICMP    60  Echo (ping) request  id=0x0008, seq=2312/2057, ttl=255 (reply in 1132)
1795    2.131109    192.168.0.21    192.168.0.100   ICMP    60  Echo (ping) request  id=0x0008, seq=2314/2569, ttl=255 (reply in 1798)
1796    2.131110    192.168.0.21    192.168.0.201   ICMP    60  Echo (ping) request  id=0x0008, seq=2315/2825, ttl=255 (reply in 1797)
2249    3.131295    192.168.0.21    192.168.0.100   ICMP    60  Echo (ping) request  id=0x0008, seq=2316/3081, ttl=255 (reply in 2252)
2250    3.131296    192.168.0.21    192.168.0.201   ICMP    60  Echo (ping) request  id=0x0008, seq=2317/3337, ttl=255 (reply in 2251)
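The "(no response found!)" annotations in the first capture above are Wireshark's own analysis. The same check can also be done offline with a short scapy script -- a rough sketch, assuming the capture is saved as capture.pcapng (the file name and the library choice are ours, not part of the original setup):

# Hypothetical check: list ICMP echo requests in the capture that never got
# a matching echo reply (what Wireshark marks as "no response found!").
from scapy.all import rdpcap, IP, ICMP

packets = rdpcap("capture.pcapng")

# Remember (src, dst, id, seq) of every echo reply (ICMP type 0).
replies = {
    (p[IP].src, p[IP].dst, p[ICMP].id, p[ICMP].seq)
    for p in packets
    if ICMP in p and p[ICMP].type == 0
}

# An echo request (ICMP type 8) is unanswered if no reply exists with the
# addresses swapped and the same id/seq pair.
for p in packets:
    if ICMP in p and p[ICMP].type == 8:
        key = (p[IP].dst, p[IP].src, p[ICMP].id, p[ICMP].seq)
        if key not in replies:
            print(f"{float(p.time):.6f}  {p[IP].src} -> {p[IP].dst}  "
                  f"id=0x{p[ICMP].id:04x}  seq={p[ICMP].seq}")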

We also found that host A initiates an ARP probe and ARP announcement sequence when the error happens:

2838    1.501535    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.100? Tell 192.168.0.21
2841    1.501831    JUMPINDU_64:8b:23   SuperMic_78:e0:f1   ARP 60  192.168.0.100 is at 00:e0:4b:64:8b:23
2876    1.516569    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.201? Tell 192.168.0.21
2879    1.516654    SuperMic_8d:2f:67   SuperMic_78:e0:f1   ARP 60  192.168.0.201 is at ac:1f:6b:8d:2f:67
3234    1.817465    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.21? (ARP Probe)
4179    2.817637    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.21? (ARP Probe)
5043    3.817780    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.21? (ARP Probe)
5897    4.817833    SuperMic_78:e0:f1   Broadcast   ARP 60  ARP Announcement for 192.168.0.21

Here, SuperMic_78:e0:f1 is host A, JUMPINDU_64:8b:23 is host B, and SuperMic_8d:2f:67 is host C.

According to RFC 5227:

Before beginning to use an IPv4 address (whether received from manual configuration, DHCP, or some other means), a host implementing this specification MUST test to see if the address is already in use, by broadcasting ARP Probe packets. This also applies when a network interface transitions from an inactive to an active state, when a computer awakes from sleep, when a link-state change signals that an Ethernet cable has been connected, when an 802.11 wireless interface associates with a new base station, or when any other change in connectivity occurs where a host becomes actively connected to a logical link.
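The probes in frames 3234-5043 and the announcement in frame 5897 match the RFC 5227 format exactly: a probe carries a sender protocol address of 0.0.0.0, and an announcement carries the claimed address in both the sender and target fields. A minimal scapy sketch of those two frame types (the MAC address below is a placeholder, not host A's real address) looks like this:

# Hypothetical sketch of the two RFC 5227 frame types host A is sending.
from scapy.all import Ether, ARP

HOST_A_MAC = "02:00:00:78:e0:f1"   # placeholder for SuperMic_78:e0:f1
HOST_A_IP = "192.168.0.21"

# ARP Probe: sender IP is 0.0.0.0, so asking "who has 192.168.0.21?" cannot
# pollute anyone's ARP cache while the address is still being verified.
probe = Ether(src=HOST_A_MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op="who-has", hwsrc=HOST_A_MAC, psrc="0.0.0.0",
    hwdst="00:00:00:00:00:00", pdst=HOST_A_IP)

# ARP Announcement: sender and target IP are both the claimed address,
# which is what frame 5897 above shows.
announcement = Ether(src=HOST_A_MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op="who-has", hwsrc=HOST_A_MAC, psrc=HOST_A_IP,
    hwdst="00:00:00:00:00:00", pdst=HOST_A_IP)

print(probe.summary())
print(announcement.summary())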

But the Windows event log on host A shows no evidence of any of the connectivity-change events listed above -- just the three entries below, and we are not sure whether they are the cause or an effect of the problem:

ID   Source                   Description
7040 Service Control Manager  The start type of the Windows Modules Installer service was changed from auto start to demand start
16   Kernel-General           The access history in hive \??\C:\ProgramData\Microsoft\Provisioning\Microsoft-Desktop-Provisioning-Sequence.dat was cleared updating 0 keys and creating 0 modified pages
7040 Service Control Manager  The start type of the Windows Modules Installer service was changed from demand start to auto start

We have also checked the log files from the systems deployed in the field and have seen no evidence of the problem happening there -- the field systems run WIN7 and old releases of our software, while WIN10 and the new software are used in-house.

We have been investigating for nearly two months without finding a root cause. Any advice or suggestion would be greatly appreciated. Also, please let me know if there is a better place for this kind of question.


It turns out that the problem is caused by a scheduled task shipped with Windows 10 itself, located under Microsoft/Windows/Management/Provisioning/Logon. The task initiates a network stack restart the first time it is executed after the OS starts up (since the 1803 or 1809 release):

\windows\system32\provtool.exe /turn 5 /source LogonIdleTask

When we manually run the task after the OS starts up, the problem can be reproduced. After disabling the task, the problem has not happened again on five systems that we have been watching for nearly one week.
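For anyone who wants to try the same workaround: the task can be triggered (to reproduce the issue) or disabled either in Task Scheduler or from the command line. A minimal sketch, assuming the task path shown above and an elevated prompt; the Python wrapper is only for illustration, plain schtasks works just as well:

# Hypothetical helper around schtasks for the Provisioning "Logon" task.
# Must be run with administrator rights; the task path is the one shown above.
import subprocess

TASK = r"\Microsoft\Windows\Management\Provisioning\Logon"

def run_task() -> None:
    # Trigger the task manually -- in our case this reproduced the outage.
    subprocess.run(["schtasks", "/Run", "/TN", TASK], check=True)

def disable_task() -> None:
    # Disable the task so it no longer fires after logon.
    subprocess.run(["schtasks", "/Change", "/TN", TASK, "/Disable"], check=True)

if __name__ == "__main__":
    disable_task()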

Also, we got here mostly thanks to this post on OSR. We still don't know what the task actually does or why the network stack restart is needed, though.

P.S. Just leaving this here in case anyone runs into the same problem; hope it helps.