Lower TCP throughput from 1Gbps server than 100Mbps server over large RTT

We've got infrastructure distributed in a few major locations around the world - Singapore, London and Los Angeles. The RTT between any two locations is over 150ms.

We've recently upgraded all of the servers to use 1Gbps links (from 100Mbps). We've been running some TCP-based tests between servers at the different locations and have seen some surprising results. These results are completely repeatable.

  1. Los Angeles (100Mbps) to London (100Mbps): ~96Mbps throughput
  2. Los Angeles (100Mbps) to London (1Gbps): ~96Mbps throughput
  3. Los Angeles (1Gbps) to London (100Mbps): 10-40Mbps throughput (volatile)
  4. Los Angeles (1Gbps) to London (1Gbps): 10-40Mbps throughput (volatile)
  5. Los Angeles (1Gbps) to Los Angeles (1Gbps): >900Mbps throughput

It appears that whenever the sender is running at 1Gbps, our throughput over long links suffers significantly.

The testing approach is extremely simple - I'm just using cURL to download a 1GB binary from the target server (so in the above cases, the cURL client runs on the London server and downloads from LA, so that LA is the sender). This uses a single TCP connection, of course.
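
For anyone wanting to reproduce this, the test is just a plain single-connection HTTP download; the hostname and file name below are placeholders:

    # run on the receiving end (e.g. London), pulling from the LA server
    curl -o /dev/null -w 'average: %{speed_download} bytes/sec\n' http://la-server.example.com/testfile-1G.bin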

Repeating the same tests over UDP using iperf, the problem disappears!

  1. Los Angeles (100Mbps) to London (100Mbps): ~96Mbps throughput
  2. Los Angeles (100Mbps) to London (1Gbps): ~96Mbps throughput
  3. Los Angeles (1Gbps) to London (100Mbps): ~96Mbps throughput
  4. Los Angeles (1Gbps) to London (1Gbps): >250Mbps throughput
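
For reference, the UDP numbers above came from iperf; the exact commands below are only illustrative (hostname and offered bandwidth are placeholders), since UDP iperf only transmits at whatever rate you request with -b:

    # on the receiver (London)
    iperf -s -u

    # on the sender (LA), offering up to 1Gbps for 30 seconds, reporting every second
    iperf -c london-server.example.com -u -b 1000M -t 30 -i 1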

This points squarely at some TCP or NIC/port configuration issue in my eyes.

Both servers are running CentOS 6.x, with TCP cubic. Both have 8MB maximum TCP send & receive windows, and have TCP timestamps and selective acknowledgements enabled. The same TCP configuration is used in all test cases. The full TCP config is below:

net.core.somaxconn = 128
net.core.xfrm_aevent_etime = 10
net.core.xfrm_aevent_rseqth = 2
net.core.xfrm_larval_drop = 1
net.core.xfrm_acq_expires = 30
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.core.wmem_default = 131072
net.core.rmem_default = 131072
net.core.dev_weight = 64
net.core.netdev_max_backlog = 1000
net.core.message_cost = 5
net.core.message_burst = 10
net.core.optmem_max = 20480
net.core.rps_sock_flow_entries = 0
net.core.netdev_budget = 300
net.core.warnings = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_mem = 1528512      2038016 3057024
net.ipv4.tcp_wmem = 4096        131072  8388608
net.ipv4.tcp_rmem = 4096        131072  8388608
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_frto = 2
net.ipv4.tcp_frto_response = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_abc = 0
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_available_congestion_control = cubic reno
net.ipv4.tcp_allowed_congestion_control = cubic reno
net.ipv4.tcp_max_ssthresh = 0
net.ipv4.tcp_thin_linear_timeouts = 0
net.ipv4.tcp_thin_dupack = 0
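
As a sanity check on the figures above (just arithmetic, not part of the tuning): an 8MB window over a ~150ms RTT supports at most roughly 8 MB / 0.15 s ≈ 53 MB/s ≈ 427 Mbps per flow, so the configured maximums are more than enough for the 100Mbps cases, though a single flow could never reach 1Gbps at this RTT with these buffers. The relevant settings can be confirmed on both ends with something like:

    sysctl net.core.rmem_max net.core.wmem_max \
           net.ipv4.tcp_rmem net.ipv4.tcp_wmem \
           net.ipv4.tcp_window_scaling net.ipv4.tcp_congestion_control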

Attached are a couple of images of wireshark IO graphs of some test cases (sorry, I can't post images directly yet):

Test case 1 (100Mbps -> 100Mbps) - nice smooth transfer. No losses in capture. - http://103.imagebam.com/download/dyNftIGh-1iCFbjfMFvBQw/25498/254976014/100m.png

Test case 3 (1Gbps -> 100Mbps) - volatile transfer, takes a long time to get to any speed - never approaches 100Mbps. Yet no losses/retransmits in the capture! - http://101.imagebam.com/download/KMYXHrLmN6l0Z4KbUYEZnA/25498/254976007/1g.png
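
For anyone who wants to reproduce the captures behind these graphs, something along these lines should do (interface name and host are placeholders); Wireshark's IO Graph can then be run over the resulting file:

    # on the sender, capture headers only to keep the file small
    tcpdump -i eth0 -s 96 -w transfer.pcap host london-server.example.com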

So in summary, when a long link is used with a 1Gbps connection, we get a much lower TCP throughput than when we use a 100Mbps connection.

I'd very much appreciate some pointers from any TCP experts out there!

Thanks!

UPDATE (2013-05-29):

We've solved the issue with test case #4 above (1Gbps sender, 1Gbps receiver, over a large RTT). We can now hit ~970Mbps within a couple of seconds of the transfer starting. The issue appears to have been a switch used by the hosting provider. Moving to a different one solved that.

However, test case #3 mostly remains problematic. If we have a receiver running at 100Mbps and the sender at 1Gbps, then we see approximately a 2-3 minute wait for the receiver to reach 100Mbps (but it does now reach the full rate, unlike before). As soon as we drop the sender down to 100Mbps or increase the receiver to 1Gbps, then the problem vanishes and we can ramp up to full speed in a second or two.

The immediate cause is, of course, packet loss that appears very soon after the transfer starts. However, this doesn't tally with my understanding of how slow-start works; the sender's interface speed shouldn't have any bearing on it, because the growth of the congestion window should be governed by the ACKs coming back from the receiver.
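
One way to watch this live on the sender during a transfer (both tools ship with CentOS 6; the destination address is a placeholder) is to poll the per-connection congestion window alongside the global retransmission counters:

    # congestion window, ssthresh and RTT of the active connection
    watch -n 1 "ss -tin dst 203.0.113.10"

    # cumulative retransmission counters
    netstat -s | grep -i retrans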

Suggestions gratefully received! If I could offer a bounty here, I would!


Solution 1:

The main issue is the large WAN delay. It gets much worse if there is also random packet loss.

  1. tcp_mem also needs to be set larger to allocate more memory. For example, set it to net.ipv4.tcp_mem = 4643328 6191104 9286656

  2. You can capture packets with wireshark/tcpdump for several minutes and then analyse whether there is random packet loss. You can also upload the capture file if you like.

  3. You can also try tuning other TCP parameters, e.g. trying the westwood or bic congestion control algorithms instead of cubic.
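
One way to apply points 1 and 3 on CentOS 6 is shown below; treat it as a sketch (westwood is used just as an example here, bic works the same way):

    # point 1: enlarge tcp_mem
    sysctl -w net.ipv4.tcp_mem="4643328 6191104 9286656"

    # point 3: try an alternative congestion control algorithm
    modprobe tcp_westwood
    sysctl -w net.ipv4.tcp_congestion_control=westwood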

Solution 2:

Solved! For full details see http://comments.gmane.org/gmane.linux.drivers.e1000.devel/11813

In short, it appears the 1Gbps-connected server was sending bursts of traffic during TCP's exponential growth (slow-start) phase that flooded buffers in some intermediate device (who knows what). This leaves two options:

  1. Contact each intermediate network operator and get them to configure buffers appropriate to my desired bandwidth and RTT. Pretty unlikely!
  2. Limit the bursts.

I chose to limit each TCP flow to operate at 100Mbps at most. The number here is fairly arbitrary - I chose 100Mbps purely because I knew the previous path could handle 100Mbps and I didn't need any more for an individual flow.
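
Purely as an illustration (not necessarily the exact mechanism I used, which is covered in the linked thread), a crude way to cap a Linux sender's egress at 100Mbps is a token bucket filter qdisc. Note this caps the whole interface rather than each individual flow, and the interface name, burst and latency values are just examples:

    # cap egress on eth0 to 100Mbit with a modest burst allowance
    tc qdisc add dev eth0 root tbf rate 100mbit burst 100kb latency 50ms

    # remove the cap again
    tc qdisc del dev eth0 root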

Hope this helps someone in the future.