Clarification about Linux TCP window size and delays

Solution 1:

After doing a little more digging into my traffic, I was able to see that my data was nothing but a sequence of small bursts with small idle periods between them.

With the useful tool ss, I was able to retrieve the current congestion window size of my connection (see the cwnd value in the output):

[user@localhost ~]$ /usr/sbin/ss -i -t -e | grep -A 1 56001

ESTAB 0 0 192.168.1.1:56001
192.168.2.1:45614 uid:1001 ino:6873875 sk:17cd4200ffff8804 ts sackscalable wscale:8,9 rto:277 rtt:74/1 ato:40 cwnd:36 send 5.6Mbps rcv_space:5792
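If you want to track cwnd over time rather than eyeball it, a small parser is enough. This is only a sketch: `parse_cwnd` is a name I made up, and it assumes the `cwnd:<number>` field looks the way it does in the output above.

```python
import re

def parse_cwnd(ss_info_line):
    """Return the cwnd value (in segments) found in one line of `ss -i` output.

    Returns None when the line has no cwnd field.
    """
    match = re.search(r"\bcwnd:(\d+)", ss_info_line)
    return int(match.group(1)) if match else None

# The info line from the output above, split only for readability.
sample = ("192.168.2.1:45614 uid:1001 ino:6873875 sk:17cd4200ffff8804 ts "
          "sackscalable wscale:8,9 rto:277 rtt:74/1 ato:40 cwnd:36 "
          "send 5.6Mbps rcv_space:5792")

if __name__ == "__main__":
    print(parse_cwnd(sample))  # prints 36
```

Feeding it the output of repeated ss runs makes the drops back to the initial window easy to spot.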

I ran the tool several times and discovered that the congestion window size was regularly reset to its initial value (10 segments, on my Linux box). The connection was constantly falling back to the slow start phase. During slow start, any burst containing more segments than the window allows was delayed, because the tail of the burst had to wait for the ACKs of its first packets.
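To get a feel for the cost, here is a rough back-of-the-envelope model. It is an idealization, not how the kernel computes anything: it assumes an initial window of 10 segments, a window that doubles every round trip, and no loss or delayed ACKs. `slow_start_rounds` is a hypothetical helper name.

```python
def slow_start_rounds(burst_segments, initial_cwnd=10):
    """Number of flights needed to send a burst during idealized slow start.

    Each flight carries up to cwnd segments, and cwnd doubles after each
    flight. The last flight leaves after (flights - 1) round trips of waiting.
    """
    cwnd = initial_cwnd
    sent = 0
    flights = 0
    while sent < burst_segments:
        sent += cwnd
        cwnd *= 2
        flights += 1
    return flights

def extra_delay_ms(burst_segments, rtt_ms, initial_cwnd=10):
    """Time spent waiting for ACKs before the last flight can leave."""
    flights = slow_start_rounds(burst_segments, initial_cwnd)
    return max(flights - 1, 0) * rtt_ms

if __name__ == "__main__":
    # A 36-segment burst with the 74 ms RTT reported by ss above.
    print(slow_start_rounds(36))          # prints 3
    print(extra_delay_ms(36, rtt_ms=74))  # prints 148
```

With the window still open at 36 segments, the same burst goes out in a single flight, which matches the delays disappearing once the window stopped being reset.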

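The bursty traffic pattern likely explains the resets: by default, Linux shrinks the congestion window back toward its initial value once a connection has been idle for longer than the retransmission timeout (RTO), so the idle gaps between bursts can each trigger a new slow start.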

By disabling slow start after idle (setting the net.ipv4.tcp_slow_start_after_idle sysctl to 0), I was able to get rid of the delays.

[user@host ~]$ cat /proc/sys/net/ipv4/tcp_slow_start_after_idle
0
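If you would rather check the setting from a script (for example in a startup sanity check), something like the following should do. It is a sketch: the function name is made up, and it assumes a Linux box where the /proc path above exists.

```python
from pathlib import Path

# Same file that the cat command above reads.
SYSCTL_PATH = "/proc/sys/net/ipv4/tcp_slow_start_after_idle"

def slow_start_after_idle_enabled(path=SYSCTL_PATH):
    """Return True if the kernel restarts slow start after an idle period.

    The file contains "0" when the behaviour is disabled; any other value
    is treated here as enabled.
    """
    return Path(path).read_text().strip() != "0"

if __name__ == "__main__":
    print(slow_start_after_idle_enabled())
```

The path is a parameter only so the helper can be pointed at a test file.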

Solution 2:

This isn't going to be some subtle thing like a setting somewhere. This is going to be a problem with the protocol layered on top of TCP or a code bug. There's no magic "go faster" switch for TCP except for unusual cases like networks with very high latency or packet loss caused by noise.

The most obvious explanation would be if the code calls write or send with very small chunks. You need to accumulate at least 2KB per send, ideally 16KB. You say you batch the messages, but it's not clear what that means. Do you pass them in one call to write or send? Do you bundle them into a single protocol data unit for the protocol layered on top of TCP? Doing both of these things helps a lot with latency.
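To make "pass them in one call" concrete, here is a minimal sketch. The 4-byte length prefix and the function names are my own assumptions for illustration, not anything from your protocol.

```python
import socket
import struct

def frame(message: bytes) -> bytes:
    """Prefix a message with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(message)) + message

def send_batch(sock, messages):
    """Join all pending messages into one buffer and hand it to the
    kernel with a single sendall() call. Returns the number of bytes sent."""
    payload = b"".join(frame(m) for m in messages)
    sock.sendall(payload)
    return len(payload)

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)

def recv_message(sock):
    """Read one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

if __name__ == "__main__":
    # socketpair() stands in for a connected TCP socket in this demo.
    left, right = socket.socketpair()
    send_batch(left, [b"tick 1", b"tick 22"])
    print(recv_message(right), recv_message(right))
    left.close()
    right.close()
```

The point is the single sendall(): calling send once per message is the "very small chunks" pattern described above.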

Also, get rid of TCP_NODELAY. It can reduce throughput. It's only for applications that weren't designed to work with TCP or for applications that cannot predict which side will need to transmit next.

Unless, of course, you are in fact layering a protocol on top of TCP where you don't know which side is going to transmit next (like telnet, for example). Then it can make sense to set TCP_NODELAY. Significant expertise is required to make that kind of protocol work with low latency. If that's your situation, post more details about the protocol you're layering on top of TCP, what its protocol data unit sizes look like, and what determines which side transmits when.
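For reference, toggling the option looks roughly like this. `set_nodelay` is a hypothetical wrapper; the `setsockopt` and `getsockopt` calls are the standard socket API.

```python
import socket

def set_nodelay(sock, enabled):
    """Turn TCP_NODELAY on or off and return the resulting state.

    Passing enabled=False clears the option, which re-enables Nagle's
    algorithm on the socket.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(set_nodelay(s, False))  # prints False
    s.close()
```

If your code never sets the option, there is nothing to remove, since it is off by default.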

If you do in fact batch the messages available at one time and pass them in a single call to write or send, then most likely the problem is that the other side is not sending an application-layer acknowledgement for each batch. These acknowledgements improve latency by giving TCP ACKs a data packet to piggyback on, so the ACK goes out immediately instead of waiting for the delayed-ACK timer. Your protocol should include them to ensure the two sides alternate, which helps keep latency down.
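A minimal sketch of such an acknowledgement, assuming a length-prefixed batch and a one-byte reply. The ACK byte, the framing, and all function names are invented for illustration, and socketpair() stands in for a real TCP connection.

```python
import socket
import struct
import threading

ACK = b"\x06"  # hypothetical one-byte application-level acknowledgement

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)

def send_batch_and_wait(sock, payload):
    """Send one length-prefixed batch, then block until the peer replies.

    Returns True if the reply is the expected ACK byte.
    """
    sock.sendall(struct.pack("!I", len(payload)) + payload)
    return recv_exact(sock, 1) == ACK

def receive_batch_and_ack(sock):
    """Read one batch and immediately answer with the ACK byte, which
    gives the TCP-level ACK a data segment to ride on."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    payload = recv_exact(sock, length)
    sock.sendall(ACK)
    return payload

def demo():
    """Run sender and receiver over a socket pair; return (acked, batches)."""
    left, right = socket.socketpair()
    received = []
    worker = threading.Thread(
        target=lambda: received.append(receive_batch_and_ack(right)))
    worker.start()
    acked = send_batch_and_wait(left, b"msg1msg2msg3")
    worker.join()
    left.close()
    right.close()
    return acked, received

if __name__ == "__main__":
    print(demo())
```

Whether the sender should actually block on each acknowledgement, as it does here, depends on your protocol; the sketch only shows where the reply fits.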