Diagnosing packet loss / high latency in Ubuntu

We have a Linux box (Ubuntu 12.04) running Nginx (1.5.2), which acts as a reverse proxy/load balancer in front of some Tornado and Apache hosts. The upstream servers are physically and logically close (same DC, sometimes the same rack) and show sub-millisecond latency between them:

 PING appserver (10.xx.xx.112) 56(84) bytes of data.
 64 bytes from appserver (10.xx.xx.112): icmp_req=1 ttl=64 time=0.180 ms
 64 bytes from appserver (10.xx.xx.112): icmp_req=2 ttl=64 time=0.165 ms
 64 bytes from appserver (10.xx.xx.112): icmp_req=3 ttl=64 time=0.153 ms

We receive a sustained load of about 500 requests per second, and are currently seeing regular packet loss / latency spikes from the Internet, even from basic pings:

sam@AM-KEEN ~> ping -c 1000 loadbalancer
PING 50.xx.xx.16 (50.xx.xx.16): 56 data bytes
64 bytes from loadbalancer: icmp_seq=0 ttl=56 time=11.624 ms
64 bytes from loadbalancer: icmp_seq=1 ttl=56 time=10.494 ms

... many packets later ...

Request timeout for icmp_seq 2
64 bytes from loadbalancer: icmp_seq=2 ttl=56 time=1536.516 ms
64 bytes from loadbalancer: icmp_seq=3 ttl=56 time=536.907 ms
64 bytes from loadbalancer: icmp_seq=4 ttl=56 time=9.389 ms

... many packets later ...

Request timeout for icmp_seq 919
64 bytes from loadbalancer: icmp_seq=918 ttl=56 time=2932.571 ms
64 bytes from loadbalancer: icmp_seq=919 ttl=56 time=1932.174 ms
64 bytes from loadbalancer: icmp_seq=920 ttl=56 time=932.018 ms
64 bytes from loadbalancer: icmp_seq=921 ttl=56 time=6.157 ms

--- 50.xx.xx.16 ping statistics ---
1000 packets transmitted, 997 packets received, 0.3% packet loss
round-trip min/avg/max/stddev = 5.119/52.712/2932.571/224.629 ms

The pattern is always the same: things run fine for a while (<20 ms), then a ping drops completely, then three or four pings come back with very high latency (>1000 ms), and then it settles down again.
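
To make the spikes easier to line up with the kernel log entries further down, a timestamped ping from a Linux client helps (a minimal sketch, assuming iputils ping; -D prints a Unix timestamp on every reply):

    # long-running ping with a Unix timestamp on each reply (-D) and a faster interval (-i)
    ping -D -i 0.2 -c 5000 loadbalancer | tee ping.log
    # afterwards, list only the replies that took 100 ms or more
    grep -E 'time=[0-9]{3,}' ping.log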

Traffic comes in through a bonded public interface (we will call it bond0), configured as follows:

 bond0     Link encap:Ethernet  HWaddr 00:xx:xx:xx:xx:5d
           inet addr:50.xx.xx.16  Bcast:50.xx.xx.31  Mask:255.255.255.224
           inet6 addr: <ipv6 address> Scope:Global
           inet6 addr: <ipv6 address> Scope:Link
           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
           RX packets:527181270 errors:1 dropped:4 overruns:0 frame:1
           TX packets:413335045 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0 
           RX bytes:240016223540 (240.0 GB)  TX bytes:104301759647 (104.3 GB)

Requests are then proxied via HTTP to the upstream servers over the private network interface (we will call it bond1), which is configured as follows:

  bond1     Link encap:Ethernet  HWaddr 00:xx:xx:xx:xx:5c  
            inet addr:10.xx.xx.70  Bcast:10.xx.xx.127  Mask:255.255.255.192
            inet6 addr: <ipv6 address> Scope:Link
            UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
            RX packets:430293342 errors:1 dropped:2 overruns:0 frame:1
            TX packets:466983986 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0 
            RX bytes:77714410892 (77.7 GB)  TX bytes:227349392334 (227.3 GB)
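
For completeness, the state of the bonds themselves can be inspected like this (a sketch, assuming the standard Linux bonding driver; the slave names eth0/eth1 are placeholders):

    # bonding mode, currently active slave, and per-slave link failure counts
    cat /proc/net/bonding/bond0
    cat /proc/net/bonding/bond1
    # per-NIC error and drop counters straight from the driver (slave names are examples)
    ethtool -S eth0 | egrep -i 'err|drop|miss'
    ethtool -S eth1 | egrep -i 'err|drop|miss'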

Output of uname -a:

Linux <hostname> 3.5.0-42-generic #65~precise1-Ubuntu SMP Wed Oct 2 20:57:18 UTC 2013 x86_64 GNU/Linux

We have customized sysctl.conf in an attempt to fix the problem, with no success. Contents of /etc/sysctl.conf (with irrelevant settings omitted):

# net: core
net.core.netdev_max_backlog = 10000

# net: ipv4 stack
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_sack = 1
net.ipv4.tcp_fack = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_max_syn_backlog = 10000
net.ipv4.tcp_congestion_control = cubic
net.ipv4.ip_local_port_range = 8000 65535

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 2

net.ipv4.tcp_thin_dupack = 1
net.ipv4.tcp_thin_linear_timeouts = 1

net.netfilter.nf_conntrack_max = 99999999
net.netfilter.nf_conntrack_tcp_timeout_established = 300
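
The values actually in effect can be double-checked with sysctl, and since conntrack is enabled it is also worth comparing the live entry count against the configured maximum, because a full conntrack table drops packets for new connections (a sketch using the keys from the file above):

    # confirm the running kernel picked up the tuned values
    sysctl net.core.netdev_max_backlog net.ipv4.tcp_congestion_control
    # compare current conntrack usage against the configured ceiling
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max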

Output of dmesg -d, with non-ICMP UFW messages suppressed:

[508315.349295 <   19.852453>] [UFW BLOCK] IN=bond1 OUT= MAC=<mac addresses> SRC=118.xx.xx.143 DST=50.xx.xx.16 LEN=68 TOS=0x00 PREC=0x00 TTL=51 ID=43221 PROTO=ICMP TYPE=3 CODE=1 [SRC=50.xx.xx.16 DST=118.xx.xx.143 LEN=40 TOS=0x00 PREC=0x00 TTL=249 ID=10220 DF PROTO=TCP SPT=80 DPT=53817 WINDOW=8190 RES=0x00 ACK FIN URGP=0 ] 

[517787.732242 <    0.443127>] Peer 190.xx.xx.131:59705/80 unexpectedly shrunk window 1155488866:1155489425 (repaired)
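
Those kernel log entries carry timestamps, so they can be matched against the moments the pings spike (a sketch, assuming Ubuntu's default syslog layout):

    # kernel messages with human-readable timestamps; line these up with the slow pings
    grep 'UFW BLOCK' /var/log/kern.log | tail -n 20
    grep 'shrunk window' /var/log/kern.log | tail -n 20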

How can I go about diagnosing the cause of this problem on a Debian-family Linux box?


If any part of the network path between your clients and your server is saturated, or if any link along the way is throwing errors, you can experience packet loss. It won't show up in the interface error counters unless the problem happens to be right at the switch-to-server cable connection; if the problem is anywhere else in the path, it simply shows up as lost packets.
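
One way to narrow down where along the path that is happening is a long mtr run from an affected client toward the load balancer, and another from the load balancer back out toward an external host (a sketch, assuming mtr is installed; the external address is just an example):

    # 200 probe cycles in report mode; the Loss% column shows which hop starts dropping
    mtr --report --report-cycles 200 loadbalancer
    # from the load balancer outward, to check the reverse path
    mtr --report --report-cycles 200 8.8.8.8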

You can surface this problem if you've got TCP traffic, since the kernel keeps counters that track the recovery steps TCP takes to compensate for packets lost in the stream. Have a look at the -s (statistics) option of netstat. The values it presents are cumulative counters, so you'll need to watch them for a while to get a feel for what's normal versus anomalous, but the data is there. The retransmit and data-loss counters are particularly useful.

[sadadmin@busted ~]$ netstat -s | egrep -i 'loss|retran'
2058 segments retransmited
526 times recovered from packet loss due to SACK data
193 TCP data loss events
TCPLostRetransmit: 7
2 timeouts after reno fast retransmit
1 timeouts in loss state
731 fast retransmits
18 forward retransmits
97 retransmits in slow start
4 sack retransmits failed
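
Since these are cumulative counters, a quick way to see whether they climb during an incident is to sample them on a loop and watch the deltas (a rough sketch):

    # print the loss/retransmit counters every 10 seconds; numbers that jump
    # during a latency spike point at the host itself losing TCP segments
    while true; do
        date
        netstat -s | egrep -i 'loss|retran'
        sleep 10
    done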

Some tools will sample these values and trend them for you so you can easily see when something has gone amiss. I use munin.