Intermittent 504 errors with HAProxy
I've been struggling with this issue for weeks and I'm running out of ideas. I run HAProxy to proxy requests to one of three backends based on the request path and headers.
My backends are:
- An Amazon S3 bucket
- A Node.js app (2 servers)
- A service called prerender.io
The last backend (prerender.io) seems to have zero issues (although it gets very little traffic). The other two randomly return 504 errors to the client (roughly one per minute according to the logs, with no clear pattern).
Here is my (sanitized) config:
defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5s
    timeout client 120s
    timeout server 120s

frontend foobar
    mode http
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/foobar.com.pem
    redirect scheme https code 301 if !{ ssl_fc }
    default_backend s3
    acl api path_beg -i /api/
    use_backend node if api
    acl user-agent-bot hdr_sub(User-Agent) -i baiduspider twitterbot facebookexternalhit
    use_backend prerender if user-agent-bot

backend s3
    mode http
    http-request set-path /index.html
    reqirep ^Host: Host:\ my-bucket.s3-website-us-east-1.amazonaws.com
    reqidel ^Authorization:.*
    rspidel ^x-amz-id-2:.*
    rspidel ^x-amz-request-id:.*
    server s3 my-bucket.s3-website-us-east-1.amazonaws.com:80 check inter 5000

backend node
    mode http
    balance roundrobin
    option forwardfor
    server api01 1.2.3.4:3333 check
    server api02 5.6.7.8:3333 check

backend prerender
    mode http
    server prerender service.prerender.io:443 check inter 5000 ssl verify none
    http-request set-header X-Prerender-Token my-secret-token
    reqrep ^([^\ ]*)\ /(.*)$ \1\ /https://app.wwoof.fr/\2
I have experienced those 504s myself while visiting the website. All I have to do is refresh the page and it works again immediately. I do not have to wait 120s (the server timeout) before getting a 504; it appears immediately upon request.
Sample (sanitized) errors from the log:
Sep 28 14:27:13 node/api01 0/0/1/-1/1 504 195 - - sR-- 38/38/30/14/0 0/0 "GET /api/hosts/2266 HTTP/1.1"
Sep 28 14:34:15 node/api02 0/0/0/-1/0 504 195 - - sR-- 55/55/41/25/0 0/0 "GET /api/hosts/4719 HTTP/1.1"
Sep 28 14:34:15 node/api01 0/0/1/-1/1 504 195 - - sR-- 54/54/41/16/0 0/0 "GET /api/hosts/2989 HTTP/1.1"
Sep 28 14:38:41 node/api01 0/0/1/-1/1 504 195 - - sR-- 50/50/47/25/0 0/0 "POST /api/users HTTP/1.1"
Sep 28 14:42:13 node/api02 0/0/1/-1/1 504 195 - - sR-- 134/134/102/49/0 0/0 "POST /api/users HTTP/1.1"
Sep 28 14:42:29 node/api02 0/0/1/-1/1 504 195 - - sR-- 130/130/105/51/0 0/0 "GET /api/hosts/1634 HTTP/1.1"
I have similar logs for the s3 backend. I looked into the docs to understand what sR means:

The first character is a code reporting the first event which caused the session to terminate:
    s : the server-side timeout expired while waiting for the server to send or receive data.
The second character indicates the TCP or HTTP session state when it was closed:
    R : the proxy was waiting for a complete, valid REQUEST from the client (HTTP mode only). Nothing was sent to any server.

This combination sR doesn't make sense to me. How could the server-side timeout expire when it is set to 120s? And why does the second letter refer to the client? Those two letters seem contradictory.
The 0/0/1/-1/1 part represents times. Long story short, it indicates that we do not wait 120 seconds; the request fails right away.
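For reference, if I read those fields as Tq/Tw/Tc/Tr/Tt per the 1.7 log-format documentation (this is just my interpretation; times are in milliseconds), the first sample line breaks down roughly as:

    Tq =  0   time to receive the full request from the client
    Tw =  0   time spent waiting in queues
    Tc =  1   time to establish the TCP connection to the server
    Tr = -1   time waiting for the server response (-1 = no response ever received)
    Tt =  1   total session duration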
Both the s3 and Node.js backends have this exact same issue. I used to front the whole thing with Nginx and it was working fine, so I am confident the problem lies with HAProxy rather than with the backends themselves. Any advice or suggestions for debugging this?
Solution 1:
I think I finally figured it out. The solution was to increase the timeout values:

    timeout connect 20s
    timeout client 10m
    timeout server 10m
I'm not sure why increasing the client/server timeouts from 2 minutes to 10 minutes solved the issue. I believe it has something to do with keep-alive and the fact that HAProxy maintains open connections to S3 and the Node servers.
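If that keep-alive theory is correct, another thing that might help (I have not tested this, so treat it as a sketch) is telling HAProxy not to reuse idle server-side connections at all:

    defaults
        option http-server-close  # keep client-side keep-alive, but close the server-side connection after each response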
Hope this helps!
Solution 2:
I also hit this issue and it turned out to be a bug in v1.7.10:
https://discourse.haproxy.org/t/intermittent-504-errors-and-sr-after-upgrade-to-1-7-10/2029
Upgrading to v1.7.11+ fixes the issue.
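You can check which version you are currently running with:

    haproxy -v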