How do you diagnose packet loss?
Solution 1:
I am a network engineer, so I'll describe this from my perspective.
For me, diagnosing packet loss usually starts with a report of "it's not working very well". From there, I try to find kit as close to both ends of the communication as possible (typically a workstation in an office and a server somewhere) and ping as close to the other end as I can (ideally the remote end-point itself, but sometimes there are firewalls I can't send pings through, so I have to settle for a LAN interface on a router), then see if I can observe any loss.
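As a baseline, something like this (addresses are placeholders; this is Linux-style ping, Windows uses -n instead of -c):

    # From the workstation towards the server (or the nearest pingable hop)
    ping -c 100 192.0.2.10

    # From the server (or a host next to it) back towards the workstation side
    ping -c 100 198.51.100.20

Watch the "packet loss" percentage in the summary line of each run.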
If I can see loss, it's usually a case of "not enough bandwidth" or "a link with issues" somewhere in between, so find the route through the network and start testing from the middle; that usually pins the problem to one half or the other.
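One way to walk the path (target address is a placeholder) is mtr, which combines traceroute with repeated probes and reports loss per hop:

    traceroute 192.0.2.10                          # list the hops in the path
    mtr --report --report-cycles 100 192.0.2.10    # per-hop loss and latency over 100 cycles

Bear in mind that loss shown at an intermediate hop only matters if it continues through to the final hop; routers often de-prioritise replies to probes aimed at themselves.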
If I cannot see loss, the next two steps tend to be "send more pings" or "send larger pings". If that doesn't give an indication of what the problem is, it's time to start looking at QoS policies and interface statistics along the whole path between the end-points.
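For example (iputils ping flags; the address and sizes are placeholders):

    ping -c 1000 -i 0.2 192.0.2.10          # more pings: 1000 probes at 5 per second
    ping -c 100 -s 1472 -M do 192.0.2.10    # larger pings: full 1500-byte frames, don't-fragment set

The second form also shows up MTU problems, which often masquerade as loss of large packets only.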
If that doesn't find anything, it's time to start questioning your assumptions: are you actually suffering from packet loss? The only sure way of finding out is to do simultaneous captures at both ends, either by running Wireshark (or equivalent) on the hosts or by hooking up sniffer machines (probably running Wireshark or similar) via network taps. Then comes the fun of comparing the two packet captures...
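A minimal sketch of a two-ended capture (interface names and filter addresses are assumptions):

    # On or next to the workstation
    tcpdump -i eth0 -w client-side.pcap host 192.0.2.10

    # On or next to the server
    tcpdump -i eth0 -w server-side.pcap host 198.51.100.20

Opening both files in Wireshark and comparing TCP sequence numbers and retransmissions shows which direction, if any, is actually dropping packets.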
Sometimes, what is attributed to "packet loss" is simply something on the server side being noticeably slower (like, say, moving the database from "on the same LAN" to "20 ms away" and using queries that require an awful lot of back-and-forth between the front-end and the database).
Solution 2:
From the perspective of a Linux system, I'll first look for packet loss on the network interface with ethtool -S ethX. Most of the time, increasing the ring buffer with ethtool -G ethX rx VALUE solves this.
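Roughly like this (eth0 and the ring size are placeholders; check the maximums your NIC reports before setting anything):

    ethtool -S eth0 | grep -i -E 'drop|discard|fifo|error'   # per-NIC drop counters
    ethtool -g eth0                                          # current and maximum ring sizes
    ethtool -G eth0 rx 4096                                  # example: grow the RX ring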
Sometimes interrupts are not balancing because the system is missing the irqbalance service, so look in chkconfig (EL) or update-rc.d (Debian/Ubuntu) to see if this service is running. You can tell that interrupts are not balancing because /proc/interrupts will show only Core 0 servicing all IRQ channels.
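For instance:

    grep eth0 /proc/interrupts       # are the NIC queues spread over several CPUs, or all on CPU0?
    chkconfig --list irqbalance      # EL: is the service enabled?
    systemctl status irqbalance      # on systemd-based systems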
Failing this, you might need to increase net.core.netdev_max_backlog if the system is passing more than a few gigabits of traffic, and maybe net.core.netdev_budget. If that doesn't work, you could tweak the interrupt coalescing values with ethtool -C.
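A sketch of what that might look like (the numbers are illustrative, not recommendations):

    sysctl net.core.netdev_max_backlog net.core.netdev_budget   # current values
    sysctl -w net.core.netdev_max_backlog=30000                 # example increase
    sysctl -w net.core.netdev_budget=600
    ethtool -c eth0                                             # current coalescing settings
    ethtool -C eth0 rx-usecs 100                                # example: coalesce RX interrupts more aggressively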
If there are no packet drops on the network interface, look in netstat -s and see if there are drops in the socket buffers; these will be reported with statistics like "pruned from receive queue" and "dropped from out-of-order queue". You can try increasing the default and maximum socket buffers for the appropriate protocol (e.g. net.ipv4.tcp_rmem for TCP).
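For example (the tcp_rmem values are placeholders in bytes: min, default, max):

    netstat -s | grep -i -E 'prune|out-of-order|overflow'       # socket-buffer pressure indicators
    sysctl net.ipv4.tcp_rmem                                    # current min / default / max
    sysctl -w net.ipv4.tcp_rmem='4096 262144 16777216'          # example increase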
If the application sets its own socket buffer size, then the application may need configuration changes. If your application has hard-coded socket buffer sizes, complain to your application vendor.
Personally, I dislike protocol offloading onto NICs (checksumming, segmentation offload, large receive offload), as it seems to cause more trouble than it's worth. Playing around with these settings using ethtool -K may be worth a shot.
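For example (feature names as reported by ethtool; eth0 is a placeholder):

    ethtool -k eth0                                    # lowercase -k lists current offload settings
    ethtool -K eth0 gro off lro off tso off gso off    # example: disable the common offloads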
Look at the module options for your NIC (modinfo <drivername>) as you may need to alter some features. To give one example I have encountered, using Intel's Flow Director on a system which handles one big TCP stream will probably harm the efficiency of that stream, so turn FDir off.
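A hedged sketch, using ixgbe purely as an example driver; the parameter names for features like Flow Director vary by driver and version, so trust the modinfo output rather than this placeholder:

    modinfo ixgbe | sed -n '/^parm:/p'                                          # list the driver's module parameters
    echo 'options ixgbe example_param=0' > /etc/modprobe.d/ixgbe-tuning.conf    # placeholder parameter name and value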
Beyond that you are getting into hand-tuning this specific system for its specific workload, which I guess is beyond the scope of your question.
Solution 3:
I will start by using a packet-capture tool such as Wireshark (on Windows) or tcpdump (in a Linux terminal).
I will also check the firewall configuration (host firewall as well as network firewall).
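On a Linux host, a quick way to see whether the firewall itself is eating packets (rule sets vary, so treat this as a sketch) is to watch the drop counters:

    iptables -L -n -v      # per-rule packet/byte counters; look for DROP/REJECT rules counting upwards
    nft list ruleset       # the nftables equivalent, where that is in use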