Finding cause of TCP retransmission within a LAN
Hello denizens of Server Fault
I have an irritating problem with a LAN of about 100 computers, 2 Windows domain servers, and 12 VoIP phones. Since their installation around a year ago, every week or so, we notice a VoIP phone resetting itself - occasionally in the middle of a call. Simultaneously there are often signs of temporary loss of connection on computers: freezes in explorer while accessing network shares, errors in our administration software due to loss of connection to the database server.
I have been doing some Wireshark monitoring on the connection between the VoIP PBX and the rest of the network. Wireshark picks up a clump of retransmitted TCP packets at the times when we record phone restarts. The Wireshark log shows about 2 clusters of retransmissions a day ranging from 5 packets to hundreds. Those in each cluster are mainly between the PBX and some set of the VoIP phones, but not always the same set. Often retransmissions at the same time are to phones connected to the same switch, but sometimes retransmissions occur together to phones at opposite ends of the network. There are usually some coincident retransmissions in passing TCP traffic, for example between client machines and the file servers.
The spikes in retransmissions and phone resets do not correlate well with when the network is heavily loaded. They seem to occur slightly more during the day, but most in the evening, when traffic should be decreasing. They occur reasonably often late at night when most computers are turned off and traffic should be lowest.
Do you have any ideas that might help diagnose the cause of problems like this? One thing I have not yet tried, but should have, is updating the firmware of all the switches.
TCP retransmissions are usually due to network congestion. Look for a large number of broadcast packets at the time the issue occurs. As a rule of thumb, if broadcasts make up more than about 3% of the total traffic captured, that is a strong sign of congestion. Look for both link-layer (ARP) and higher-layer (name resolution) broadcasts on the network. If you find a high volume of broadcast traffic, you can trace it to the source from the capture data.
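As a rough sketch of that check, assuming you have exported the source and destination MAC address of each frame from the capture (the export step, the function name and the sample addresses below are all made up for illustration), something like this gives the broadcast share by frame count and the top broadcast senders:

```python
from collections import Counter

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

def broadcast_summary(frames):
    """frames: iterable of (src_mac, dst_mac) pairs taken from a capture.

    Returns (share, by_source): the fraction of frames sent to the
    Ethernet broadcast address, and a Counter of those frames per source.
    """
    total = 0
    by_source = Counter()
    for src, dst in frames:
        total += 1
        if dst.lower() == BROADCAST_MAC:
            by_source[src.lower()] += 1
    share = sum(by_source.values()) / total if total else 0.0
    return share, by_source

# Hypothetical sample data, not taken from a real capture.
frames = [
    ("00:11:22:33:44:01", "00:11:22:33:44:02"),
    ("00:11:22:33:44:03", "ff:ff:ff:ff:ff:ff"),
    ("00:11:22:33:44:03", "FF:FF:FF:FF:FF:FF"),
    ("00:11:22:33:44:02", "00:11:22:33:44:01"),
]
share, by_source = broadcast_summary(frames)
print(f"broadcast share: {share:.1%}")
print(by_source.most_common(3))
```

This counts frames rather than bytes, so treat the 3% figure as a rough guide. Running it over a window around each retransmission cluster and again over a quiet period should show whether broadcasts actually spike when the problem occurs.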
Gathering traffic statistics for your switches may show you have periods where you are running at or near capacity. This can lead to retries when responses don't come back within the initial timeout (often 3 seconds). This increases congestion momentarily until congestion mitigation mechanisms kick in.
Look for people using streaming media, as that can soak up bandwidth quickly.
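If your switches expose per-port octet counters (for example over SNMP), utilization can be estimated from two readings. This is only a sketch: the function name and the sample numbers are made up, and it assumes the counter wraps at most once between readings.

```python
def utilization(octets_start, octets_end, interval_s, link_bps, counter_bits=32):
    """Fraction of link capacity used between two octet counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Octet counters wrap at 2**counter_bits; the modulo handles a single wrap.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * link_bps)

# Hypothetical readings: 600,000,000 octets in 60 s on a 100 Mbit/s port.
print(utilization(1_000_000, 601_000_000, 60, 100_000_000))
```

A 32-bit octet counter on a busy 1 Gbit/s port can wrap in roughly 34 seconds, so either poll more often than that or use 64-bit counters where the switch offers them. Short polling intervals also matter here because a burst lasting a few seconds disappears in a five-minute average.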
You may be able to mitigate the problem for the phones by traffic shaping. This will just move the problem to other users.
Sounds like a spanning tree loop or a broadcast storm to me, especially if the retransmissions and the issues are localized to one switch at a time (even though which switch varies). When it happens, what are the port states on your L2 devices? It could be a bad switch or badly chosen root bridge priorities. Interesting problem.
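To illustrate why the priorities matter: spanning tree elects as root the bridge with the lowest bridge ID, comparing priority first and MAC address second. If every switch is left at the default priority of 32768, the root is simply whichever has the lowest MAC, which is often the oldest box in a closet. A small sketch, with made-up switch names and addresses:

```python
def elect_root(bridges):
    """bridges: dict of name -> (priority, mac). Returns the root's name.

    The lowest (priority, MAC) pair wins, as in spanning tree root election.
    """
    def bridge_id(item):
        name, (priority, mac) = item
        return (priority, int(mac.replace(":", ""), 16))
    return min(bridges.items(), key=bridge_id)[0]

# Hypothetical switches, all left at the default priority.
bridges = {
    "core": (32768, "00:1a:2b:00:00:10"),
    "closet-old": (32768, "00:0a:0b:00:00:01"),
    "closet-new": (32768, "00:1a:2b:00:00:20"),
}
print(elect_root(bridges))

# Lowering the priority on the intended root makes the election deterministic.
bridges["core"] = (4096, "00:1a:2b:00:00:10")
print(elect_root(bridges))
```

If the elected root turns out to be an edge switch, or it changes whenever a switch reboots, the resulting topology changes could plausibly explain brief outages that hit different parts of the network at once.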