HAProxy: Display a "BADREQ" | BADREQ's by the thousands

My HAProxy Configuration.

#HA-Proxy version 1.3.22 2009/10/14  Copyright 2000-2009 Willy Tarreau <[email protected]>
global
    maxconn 10000
    spread-checks 50
    user haproxy
    group haproxy
    daemon
    stats socket /tmp/haproxy
    log localhost   local0
    log localhost   local1 notice

defaults
    mode    http
    maxconn 50000
    timeout client 10000
    option forwardfor except 127.0.0.1
    option httpclose
    option httplog

listen dcaustin 0.0.0.0:80
    mode http
    timeout connect 12000
    timeout server 60000
    timeout queue 120000
    balance roundrobin
    option httpchk GET /index.html
    log global
    option httplog
    option dontlog-normal
    server web1 10.10.10.101:80 maxconn 300 check fall 1
    server web2 10.10.10.102:80 maxconn 300 check fall 1
    server web3 10.10.10.103:80 maxconn 300 check fall 1
    server web4 10.10.10.104:80 maxconn 300 check fall 1

listen stats 0.0.0.0:9000
    mode http
    balance
    log global
    timeout client 5000
    timeout connect 4000
    timeout server 30000
    stats uri /haproxy

HAProxy is running, and the socket is working...

adam@dcaustin:/etc/haproxy# echo "show info" | socat stdio /tmp/haproxy 
Name: HAProxy
Version: 1.3.22
Release_date: 2009/10/14
Nbproc: 1
Process_num: 1
Pid: 6320
Uptime: 0d 0h14m58s
Uptime_sec: 898
Memmax_MB: 0
Ulimit-n: 20017
Maxsock: 20017
Maxconn: 10000
Maxpipes: 0
CurrConns: 47
PipesUsed: 0
PipesFree: 0
Tasks: 51
Run_queue: 1
node: dcaustin
desiption: 

Errors show nothing from socket...

adam@dcaustin:/etc/haproxy# echo "show errors" | socat stdio /tmp/haproxy 
adam@dcaustin:/etc/haproxy# 

However...

My Error log is exploding with "badrequests" with the Error code cR. cR (according to 1.3 documentation) is The "timeout http-request" stroke before the client sent a full HTTP request. This is sometimes caused by too large TCP MSS values on the client side for PPPoE networks which cannot transport full-sized packets, or by clients sending requests by hand and not typing fast enough, or forgetting to enter the empty line at the end of the request. The HTTP status code is likely a 408 here.

Correct on the 408, but we're getting literally thousands of these requests every hour. (This log snippet is an clip for about 10 seconds of time...)

Jun 30 11:08:52 localhost haproxy[6320]: 92.22.213.32:26448 [30/Jun/2011:11:08:42.384] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10002 408 212 - - cR-- 35/35/18/0/0 0/0 "<BADREQ>"
Jun 30 11:08:54 localhost haproxy[6320]: 71.62.130.24:62818 [30/Jun/2011:11:08:44.457] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10001 408 212 - - cR-- 39/39/16/0/0 0/0 "<BADREQ>"
Jun 30 11:08:55 localhost haproxy[6320]: 84.73.75.236:3589 [30/Jun/2011:11:08:45.021] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10008 408 212 - - cR-- 35/35/15/0/0 0/0 "<BADREQ>"
Jun 30 11:08:55 localhost haproxy[6320]: 69.39.20.190:49969 [30/Jun/2011:11:08:45.709] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 37/37/16/0/0 0/0 "<BADREQ>"
Jun 30 11:08:56 localhost haproxy[6320]: 2.29.0.9:58772 [30/Jun/2011:11:08:46.846] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10001 408 212 - - cR-- 43/43/22/0/0 0/0 "<BADREQ>"
Jun 30 11:08:57 localhost haproxy[6320]: 212.139.250.242:57537 [30/Jun/2011:11:08:47.568] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 42/42/21/0/0 0/0 "<BADREQ>"
Jun 30 11:08:58 localhost haproxy[6320]: 74.79.195.75:55046 [30/Jun/2011:11:08:48.559] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 46/46/24/0/0 0/0 "<BADREQ>"
Jun 30 11:08:58 localhost haproxy[6320]: 74.79.195.75:55044 [30/Jun/2011:11:08:48.554] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10004 408 212 - - cR-- 45/45/24/0/0 0/0 "<BADREQ>"
Jun 30 11:08:58 localhost haproxy[6320]: 74.79.195.75:55045 [30/Jun/2011:11:08:48.554] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10005 408 212 - - cR-- 44/44/24/0/0 0/0 "<BADREQ>"
Jun 30 11:09:00 localhost haproxy[6320]: 68.197.56.2:52781 [30/Jun/2011:11:08:50.975] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 49/49/28/0/0 0/0 "<BADREQ>"

From what I read on google, if i wanted to see what the bad requests are, I can show errors to the socket and it will spit them out. We do run a pretty heavily trafficed website and the percentage of "BADREQS" to normal requests is quite low, but I'd like to be able to get ahold of what that request WAS so I can debug it.

stats

# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,
dcaustin,FRONTEND,,,64,120,50000,88433,105889100,2553809875,0,0,4641,,,,,OPEN,,,,,,,,,1,1,0,,,,0,45,0,128,
dcaustin,web1,0,0,10,28,300,20941,25402112,633143416,,0,,0,3,0,0,UP,1,1,0,0,0,2208,0,,1,1,1,,20941,,2,11,,30,
dcaustin,web2,0,0,9,30,300,20941,25026691,641475169,,0,,0,3,0,0,UP,1,1,0,0,0,2208,0,,1,1,2,,20941,,2,11,,30,
dcaustin,web3,0,0,10,27,300,20940,30116527,635015040,,0,,0,9,0,0,UP,1,1,0,0,0,2208,0,,1,1,3,,20940,,2,10,,31,
dcaustin,web4,0,0,5,28,300,20940,25343770,643209546,,0,,0,8,0,0,UP,1,1,0,0,0,2208,0,,1,1,4,,20940,,2,11,,31,
dcaustin,BACKEND,0,0,34,95,50000,83762,105889100,2553809875,0,0,,0,34,0,0,UP,4,4,0,,0,2208,0,,1,1,0,,83762,,1,43,,122,

88500 "Sessions" and 4500 errors. in the last 20 minutes.


Your timeouts are too low. Increase them.

timeout connect 30s
timeout client  30s

The absolute minimum is 5 seconds for traffic between two servers in the same rack. A TCP connection takes 3 seconds to open if there is any packet loss, which invariably happens from time to time.

The minimum timeout is 15 seconds to support international traffic, like a client from Australia connecting to a server in North America. There is quite a high latency and low bandwidth in some locations in the world, much worse than one would expect. Being reasonable on timeouts is a prerequisite to do business worldwide.

The minimum timeout is 30 seconds to support mobile connections and poor reception WiFi. It's unreliable connectivity that can and do experience short periods of blackout.

Keep in mind. Timeouts are meant to handle the worst case scenario of connectivity and they should only catch truly failed connections. They could be set somewhat shorter but this has no benefit except generating errors on clients and servers, which is not a benefit.

Consider that a periodic request made every 5 seconds, something as simple as a healthcheck or a polling API, is actually as much as 17280 requests per day. Thus a good timeout setting should cause less than 0.01% of false positives or it's creating errors every day for no reason.

88500 Sessions and 4500 errors in the last 20 minutes.

That's 5% of errors. It's a very high error rate.

Considering that the average webpage takes more than 20 sub requests to load, it means that every single page on your site is failing to load partially.