Need help troubleshooting intermittent TCP timeouts in HAProxy
I have an application where the client connects to a server via a simple TCP-based protocol over TLS/SSL. In development this worked great for many months while we were building the application out. Recently, as we prepare for launch, I added HAProxy into the mix to provide some form of load distribution. Everything works, technically, but the client now sees seemingly random timeouts: they typically occur at roughly 60 seconds, but sometimes as early as 25 seconds. The server that HAProxy forwards the TCP connection to notices and disconnects cleanly; the problem is that we can't have a bunch of simultaneous connections getting disrupted and reconnected over and over for no reason. Among other things, this has implications for our publish/subscribe infrastructure. The client is smart enough to reconnect right away, but that is not the behavior we want.
The server responsible for accepting these TCP connections over SSL does not require a keep-alive. My working assumption is that there is some implicit value in my HAProxy config causing these random timeouts, or something that expects a TCP keep-alive. The inconsistency of the timeouts makes me wonder, though: if it were 60 seconds on the dot every single time, I'd be convinced it's a configuration issue, but in this case it is not always 60 seconds. Here's what my configuration looks like right now:
global
    stats socket /home/haproxy/status user haproxy group haproxy
    log 127.0.0.1 local1 info
    # log 127.0.0.1 local5 info
    maxconn 4096
    ulimit-n 8250
    # typically: /home/haproxy
    chroot /home/haproxy
    user haproxy
    group haproxy
    daemon
    quiet
    pidfile /home/haproxy/haproxy.pid

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 60000
    srvtimeout 60000

# Configuration for one application:
# Example: listen myapp 0.0.0.0:80
listen www 0.0.0.0:443
    mode tcp
    balance leastconn
    # Example server line (with optional cookie and check included)
    # server srv3.0 10.253.43.224:8000 srv03.0 check inter 2000 rise 2 fall 3
    # Status port (by default, localhost only...for debugging purposes)
    server ANID3 10.0.1.2:8888 check inter 3000 rise 2 fall 3 maxconn 500
    server ANID1 10.0.1.3:8888 check inter 3000 rise 2 fall 3 maxconn 500
    server ANID2 10.0.1.4:8888 check inter 3000 rise 2 fall 3 maxconn 500

listen health 0.0.0.0:9999
    mode http
    balance roundrobin
    stats uri /haproxy-status
I verified that HAProxy is the issue by having our client bypass it and connect directly to a single app server: there, no timeouts occur and everything is nice and dandy. As soon as I route the client through either of our two HAProxy servers, the random disconnects come back, anywhere between 25 and 60 seconds in.
Thanks for taking a look at this. It's quite frustrating, but I'm sure it comes down to a gap in my understanding of what exactly HAProxy expects from my client.
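For what it's worth, if the culprit does turn out to be something in the path wanting TCP keep-alives, my understanding is that HAProxy can be told to send them itself. A minimal sketch of what I'd try in the tcp-mode section above (the clitcpka/srvtcpka options are my reading of the docs, not something I've verified fixes this):

listen www 0.0.0.0:443
    mode tcp
    balance leastconn
    option clitcpka   # enable OS-level TCP keep-alives on the client-facing side
    option srvtcpka   # same, on the connections toward the backend servers
    server ANID3 10.0.1.2:8888 check inter 3000 rise 2 fall 3 maxconn 500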
Solution 1:
There should be no reason for an early close of the connection; I don't even see how that can happen. Your timeouts are set to 60s, so a timeout-triggered close should occur at 60s, not earlier.
Hmmm, wait a minute: aren't you running HAProxy inside a VM with a fast-running clock? This is a known problem in some VMs, where the clock sometimes runs far too fast (more than twice the correct speed), or instead too slow with large jumps once a minute. HAProxy knows how to defend against overly long pauses and against the time jumps it can detect, but it obviously cannot defend against a clock that runs too fast without the system reporting it.
If you're in a VM, you can try this:
$ while sleep 1; do date; done
Let this run for a minute or two and check for yourself, against a clock you trust, whether the output advances at the correct speed. It's been a while since I last observed this nasty issue, but that doesn't mean it won't happen again.
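If eyeballing it isn't conclusive, here is a rougher but more objective check (a sketch; it assumes SSH access from a machine whose clock you trust, and haproxy-vm is a placeholder hostname) that watches the offset between the two clocks:

# Run from a machine with a trustworthy clock.
# A constant offset is fine; an offset that grows by several
# seconds per minute means the VM's clock is running fast or slow.
while true; do
    vm=$(ssh haproxy-vm date +%s)
    here=$(date +%s)
    echo "offset: $((vm - here))s"
    sleep 10
done

The ssh round-trip adds a second or so of noise, but a clock running at twice the correct speed drifts by about 10s every 10s, which stands out immediately.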
BTW, you should set "option tcplog" in your TCP section and check the logs. From them you will see whether, from HAProxy's point of view, the session ended on a timeout, a client abort, or a server abort, and after how long.