Intermittent 504 errors with HAProxy

I've been struggling with this issue for weeks and I'm running out of ideas. I run HAProxy to proxy requests among 3 backends based on the requests' path/headers.

My backends are:

  • An Amazon S3 bucket
  • A Node.js app (2 servers)
  • A service called prerender.io

The last backend (prerender.io) seems to have zero issues (although it has very little traffic). The two others return 504 errors to the client randomly (about every minute according to the logs, but no clear pattern).

Here is my (sanitized) config:

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5s
    timeout client  120s
    timeout server  120s

frontend foobar
    mode http
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/foobar.com.pem
    redirect scheme https code 301 if !{ ssl_fc }

    default_backend s3

    acl api path_beg -i /api/
    use_backend node if api

    acl user-agent-bot hdr_sub(User-Agent) -i baiduspider twitterbot facebookexternalhit 
    use_backend prerender if user-agent-bot

backend s3
    mode http
    http-request set-path /index.html
    reqirep ^Host:   Host:\ my-bucket.s3-website-us-east-1.amazonaws.com
    reqidel ^Authorization:.*
    rspidel ^x-amz-id-2:.*
    rspidel ^x-amz-request-id:.*
    server s3 my-bucket.s3-website-us-east-1.amazonaws.com:80 check inter 5000

backend node
    mode http
    balance roundrobin
    option forwardfor
    server api01 1.2.3.4:3333 check
    server api02 5.6.7.8:3333 check

backend prerender
    mode http
    server prerender service.prerender.io:443 check inter 5000 ssl verify none
    http-request set-header X-Prerender-Token my-secret-token
    reqrep ^([^\ ]*)\ /(.*)$ \1\ /https://app.wwoof.fr/\2

I have myself experienced those 504 visiting the website. All I have to do is refresh the page and it works again immediately. I do not have to wait 120s (server timeout) before getting those 504, they appear immediately upon request.

Sample (sanitized) errors from the log:

Sep 28 14:27:13 node/api01 0/0/1/-1/1 504 195 - - sR-- 38/38/30/14/0 0/0 "GET /api/hosts/2266 HTTP/1.1"
Sep 28 14:34:15 node/api02 0/0/0/-1/0 504 195 - - sR-- 55/55/41/25/0 0/0 "GET /api/hosts/4719 HTTP/1.1"
Sep 28 14:34:15 node/api01 0/0/1/-1/1 504 195 - - sR-- 54/54/41/16/0 0/0 "GET /api/hosts/2989 HTTP/1.1"
Sep 28 14:38:41 node/api01 0/0/1/-1/1 504 195 - - sR-- 50/50/47/25/0 0/0 "POST /api/users HTTP/1.1"
Sep 28 14:42:13 node/api02 0/0/1/-1/1 504 195 - - sR-- 134/134/102/49/0 0/0 "POST /api/users HTTP/1.1"
Sep 28 14:42:29 node/api02 0/0/1/-1/1 504 195 - - sR-- 130/130/105/51/0 0/0 "GET /api/hosts/1634 HTTP/1.1"

I have similar logs for the s3 backend. I looked into the docs to understand what sR means. The first character is a code reporting the first event which caused the session to terminate :

s : the server-side timeout expired while waiting for the server to send or receive data.

The second character indicates the TCP or HTTP session state when it was closed :

R : the proxy was waiting for a complete, valid REQUEST from the client (HTTP mode only). Nothing was sent to any server.

This combination sR doesn't make sense to me. How could the server timeout expire since it is set to 120s? And why is the second letter referring to the client? Those letters seem contradictory.

The 0/0/1/-1/1 part represent times. Long story short it indicates that we do not wait 120 seconds, it fails right away.

Both s3 and Node.js backends have this exact same issue. I used to front the whole thing with Nginx and it was working fine so I am confident this issue has nothing to do with my config. Any advice or suggestion for debugging this?


Solution 1:

I think I finally figured it out. The solution consisted in increasing the timeout values:

timeout connect 20s
timeout client  10m
timeout server  10m

I'm not sure why increasing client/server timeouts from 2 min to 10 min solved the issue. I believe it has something to do with keep-alive and the fact that HAProxy maintains open connections with S3/Node.

Hope this helps!

Solution 2:

I also hit this issue and it turned out to be a bug in v1.7.10:

https://discourse.haproxy.org/t/intermittent-504-errors-and-sr-after-upgrade-to-1-7-10/2029

Upgrading to v1.7.11+ fixes the issue.