Latency in TCP/IP-over-Ethernet networks
What resources (books, Web pages etc) would you recommend that:
- explain the causes of latency in TCP/IP-over-Ethernet networks;
- mention tools for looking out for things that cause latency (e.g. certain entries in netstat -s);
- suggest ways to tweak the Linux TCP stack to reduce TCP latency (Nagle, socket buffers etc.).
The closest I am aware of is this document, but it's rather brief.
Alternatively, you're welcome to answer the above questions directly.
Edit: To be clear, the question isn't just about "abnormal" latency, but about latency in general. Additionally, it is specifically about TCP/IP-over-Ethernet and not about other protocols (even if they have better latency characteristics).
Solution 1:
With regard to kernel tunables for latency, one sticks out in my mind:
echo 1 > /proc/sys/net/ipv4/tcp_low_latency
From the documentation:
If set, the TCP stack makes decisions that prefer lower latency as opposed to higher throughput. By default, this option is not set meaning that higher throughput is preferred. An example of an application where this default should be changed would be a Beowulf compute cluster. Default: 0
You can also disable Nagle's algorithm (which buffers TCP output until a full maximum segment size is available) in your application with something like:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static void errmsg(const char *msg)
{
    perror(msg);
    exit(1);
}

int main(void)
{
    int optval = 1;
    int mysock;

    if ((mysock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
        errmsg("socket failed");
    }

    /* TCP_NODELAY lives at the IPPROTO_TCP level, not SOL_SOCKET */
    if (setsockopt(mysock, IPPROTO_TCP, TCP_NODELAY, &optval, sizeof(optval)) < 0) {
        errmsg("setsockopt failed");
    }

    /* Some more code here ... */

    close(mysock);
    return 0;
}
The "opposite" of this option is TCP_CORK
, which will "re-Nagle" packets. Beware, however, as TCP_NODELAY
might not always do what you expect, and in some cases can hurt performance. For example, if you are sending bulk data, you will want to maximize throughput per-packet, so set TCP_CORK
. If you have an application that requires immediate interactivity (or where the response is much larger than the request, negating the overhead), use TCP _NODELAY
. On another note, this behavior is Linux-specific and BSD is likely different, so caveat administrator.
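For the bulk-transfer case, here is a minimal sketch of how TCP_CORK is typically used. The socket descriptor fd and the send_response() helper are hypothetical; the point is simply to cork before queuing several partial writes and uncork once everything has been handed to the kernel:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: hold partial segments while queuing a header and a body, then
 * uncork so the kernel can send them as full-sized segments.
 * Assumes 'fd' is an already-connected TCP socket; error checking omitted. */
void send_response(int fd, const char *hdr, size_t hdrlen,
                   const char *body, size_t bodylen)
{
    int on = 1, off = 0;

    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    write(fd, hdr, hdrlen);
    write(fd, body, bodylen);
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* flush queued data */
}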
Make sure you do thorough testing with your application and infrastructure.
Solution 2:
In my experience, the biggest cause of abnormal latency on otherwise healthy high-speed networks is TCP windowing (RFC 1323, section 2) faults, with a closely related second being faults surrounding TCP delayed ACKs (RFC 1122, section 4.2.3.2). Both of these mechanisms are enhancements to TCP for better handling of high-speed networks. When they break, speeds drop to very slow levels. Faults in these cases affect large transfers (think backup streams), whereas extremely transactional small traffic (average data transfer under the MTU size, with a LOT of back-and-forth) is less affected by them.
Again, I've seen the biggest problems with these two issues when two different TCP/IP stacks are talking, such as Windows/Linux, 2.4-Linux/2.6-Linux, Windows/NetWare, or Linux/BSD. Like-to-like works very, very well. Microsoft rewrote the Windows TCP/IP stack in Server 2008, which introduced Linux interoperability problems that didn't exist with Server 2003 (I believe these have since been fixed, but I'm not 100% sure of that).
Disagreements on the exact method of Delayed or Selective Acknowledgments can lead to cases like this:
192.168.128.5 -> 192.168.128.20: 1500b payload, SEQ 1562
192.168.128.5 -> 192.168.128.20: 1500b payload, SEQ 9524
[200ms pass]
192.168.128.20 -> 192.168.128.5: ACK 1562
192.168.128.5 -> 192.168.128.20: 1500b payload, SEQ 12025
192.168.128.5 -> 192.168.128.20: 1500b payload, SEQ 13824
[200ms pass]
192.168.128.20 -> 192.168.128.5: ACK 12025
Throughput goes through the floor because of all of the 200ms timeouts (Windows defaults its delayed-ACK timer to 200ms). In this case, both sides of the conversation failed to handle TCP delayed ACKs.
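On Linux you can also nudge the stack away from delaying ACKs on a given socket with TCP_QUICKACK. A minimal sketch, with the caveat that the flag is not sticky (the kernel can fall back to delayed ACKs, so latency-sensitive code often re-applies it after each receive); the enable_quickack() helper name is just for illustration:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch: ask the Linux TCP stack to ACK immediately instead of delaying.
 * TCP_QUICKACK is Linux-specific and not permanent; the kernel may revert
 * to delayed ACKs, so it is commonly re-set after each read. */
void enable_quickack(int fd)
{
    int on = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
}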
TCP windowing faults are harder to notice because their impact can be less obvious. In extreme cases windowing fails completely and you get packet->ack->packet->ack->packet->ack, which is really slow when transferring anything significantly larger than about 10KB and magnifies any fundamental latency on the link. The harder-to-detect mode is when both sides are continually renegotiating their window size and one side (the sender) fails to respect the negotiation, which takes a few extra packets to sort out before data can continue to flow. This kind of fault shows up as red blinking lights in Wireshark traces, but manifests as lower-than-expected throughput.
As I mentioned, the above tend to plague large transfers. Traffic like streaming video or backup streams can be really nailed by them, as can simple downloads of very large files (like Linux distro ISO files). As it happens, TCP windowing was designed as a way to work around fundamental latency problems, since it allows pipelining of data; you don't have to wait one round-trip time for each packet sent, you can just send a big block and wait for a single ACK before sending more.
That said, certain network patterns don't benefit from these work-arounds. Highly transactional, small transfers, such as those generated by databases, suffer the most from normal latency on the line. If the RTT is high these workloads will suffer greatly, whereas large streaming workloads will suffer a lot less.
Solution 3:
There are many answers to this question.
Remember how TCP works: the client sends SYN, the server answers SYN/ACK, and the client answers ACK. Once the server has received that ACK, it can send data. This means the client waits roughly two round-trip times (RTT) before it sees the first byte of meaningful data from the server. If you have 500ms of RTT, you get a 1 second delay right there from the start. If the sessions are short-lived but numerous, this will create a lot of latency.
Once the session is established, the server sends data units that have to be acknowledged by the client. The server can only have so much data in flight before it requires an acknowledgment of the first data unit, and this window limit can create latency as well. If a data unit gets dropped, the transmission has to be picked up from that point, which adds extra latency.
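On Linux the amount of data that can be in flight is bounded by the socket buffers, so one common tweak (the "socket buffers" the question mentions) is to raise SO_SNDBUF/SO_RCVBUF for high-RTT links. A rough sketch; the set_socket_buffers() helper is hypothetical, and note that the kernel doubles the requested value and caps it at net.core.wmem_max / net.core.rmem_max:

#include <sys/socket.h>

/* Sketch: larger socket buffers allow more unacknowledged data in flight,
 * which helps throughput on high-RTT paths. Set them before connect() or
 * listen() so they can influence the window scaling that is negotiated. */
void set_socket_buffers(int fd, int bytes)
{
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}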
On the IP level, you have fragmentation (even though it is quite rare today). If you send 1501-byte packets and the other side only supports an MTU of 1500, you will be sending an extra IP packet just for that last byte of data. This can be overcome by using jumbo frames.
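If you suspect fragmentation, one way to see what MTU the kernel thinks applies to a connection is the Linux-specific IP_MTU socket option (only valid on a connected socket). A small sketch, with the helper name chosen here purely for illustration:

#include <netinet/in.h>
#include <sys/socket.h>

/* Sketch: query the path MTU the kernel currently knows for a connected
 * socket. IP_MTU is Linux-specific; returns -1 on failure. */
int query_path_mtu(int fd)
{
    int mtu = 0;
    socklen_t len = sizeof(mtu);
    if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
        return -1;
    return mtu;
}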
The best way to increase TCP/IP throughput is to reduce latency as much as possible and avoid transmission errors as much as possible. I do not know of any kernel tweaks but I'm sure someone will.
Solution 4:
In the case of a WAN, a primary factor introducing latency is the speed of light. It takes a theoretical minimum of ~36.2ms for data to make the round trip across North America.
One way trip along fiber optic cables in seconds:
- $_DISTANCE_IN_MILES * ( Cable_Refraction / SPEED_OF_LIGHT )
Multiply by 1000 to convert from seconds to milliseconds, and double it for the round trip:
- $_DISTANCE_IN_MILES * ( Cable_Refraction / SPEED_OF_LIGHT ) * 1000 * 2
Here's latency from Washington, DC to Los Angeles, CA:
- 2308 * (1.46 / 186000) * 1000 * 2 = 36.23311ms
- speed of light (in miles per second) = 186000
- refraction index of fiber optic cable = 1.46
- distance (from DC to LA in Miles) = 2308
More about the formula
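For convenience, here is the same formula as a small C program (the function name is made up here; the constants mirror the numbers above, i.e. 186000 miles/s and a refraction index of 1.46):

#include <stdio.h>

/* One-way latency in milliseconds along fiber, per the formula above:
 * distance_miles * (refraction_index / speed_of_light_in_miles_per_sec) * 1000 */
static double fiber_latency_ms(double distance_miles)
{
    const double speed_of_light = 186000.0;  /* miles per second */
    const double refraction     = 1.46;      /* typical fiber optic cable */
    return distance_miles * (refraction / speed_of_light) * 1000.0;
}

int main(void)
{
    double one_way = fiber_latency_ms(2308.0);  /* DC to LA, approx. miles */
    printf("one-way: %.2f ms, round trip: %.2f ms\n", one_way, one_way * 2.0);
    return 0;
}

This prints roughly 18.12 ms one-way and 36.23 ms round trip, matching the figure above.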