HAProxy: Display a "BADREQ" | BADREQs by the thousands
My HAProxy Configuration.
#HA-Proxy version 1.3.22 2009/10/14 Copyright 2000-2009 Willy Tarreau <[email protected]>
global
    maxconn 10000
    spread-checks 50
    user haproxy
    group haproxy
    daemon
    stats socket /tmp/haproxy
    log localhost local0
    log localhost local1 notice

defaults
    mode http
    maxconn 50000
    timeout client 10000
    option forwardfor except 127.0.0.1
    option httpclose
    option httplog

listen dcaustin 0.0.0.0:80
    mode http
    timeout connect 12000
    timeout server 60000
    timeout queue 120000
    balance roundrobin
    option httpchk GET /index.html
    log global
    option httplog
    option dontlog-normal
    server web1 10.10.10.101:80 maxconn 300 check fall 1
    server web2 10.10.10.102:80 maxconn 300 check fall 1
    server web3 10.10.10.103:80 maxconn 300 check fall 1
    server web4 10.10.10.104:80 maxconn 300 check fall 1

listen stats 0.0.0.0:9000
    mode http
    balance
    log global
    timeout client 5000
    timeout connect 4000
    timeout server 30000
    stats uri /haproxy
HAProxy is running, and the socket is working...
adam@dcaustin:/etc/haproxy# echo "show info" | socat stdio /tmp/haproxy
Name: HAProxy
Version: 1.3.22
Release_date: 2009/10/14
Nbproc: 1
Process_num: 1
Pid: 6320
Uptime: 0d 0h14m58s
Uptime_sec: 898
Memmax_MB: 0
Ulimit-n: 20017
Maxsock: 20017
Maxconn: 10000
Maxpipes: 0
CurrConns: 47
PipesUsed: 0
PipesFree: 0
Tasks: 51
Run_queue: 1
node: dcaustin
description:
"show errors" returns nothing from the socket...
adam@dcaustin:/etc/haproxy# echo "show errors" | socat stdio /tmp/haproxy
adam@dcaustin:/etc/haproxy#
However...
My error log is exploding with "<BADREQ>" entries carrying the termination code cR. According to the 1.3 documentation, cR means: the "timeout http-request" struck before the client sent a full HTTP request. This is sometimes caused by too-large TCP MSS values on the client side of PPPoE networks which cannot transport full-sized packets, by clients sending requests by hand and not typing fast enough, or by forgetting to enter the empty line at the end of the request. The HTTP status code is likely a 408 here.
Correct on the 408, but we're getting literally thousands of these requests every hour. (This log snippet is a clip covering roughly 10 seconds.)
Jun 30 11:08:52 localhost haproxy[6320]: 92.22.213.32:26448 [30/Jun/2011:11:08:42.384] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10002 408 212 - - cR-- 35/35/18/0/0 0/0 "<BADREQ>"
Jun 30 11:08:54 localhost haproxy[6320]: 71.62.130.24:62818 [30/Jun/2011:11:08:44.457] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10001 408 212 - - cR-- 39/39/16/0/0 0/0 "<BADREQ>"
Jun 30 11:08:55 localhost haproxy[6320]: 84.73.75.236:3589 [30/Jun/2011:11:08:45.021] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10008 408 212 - - cR-- 35/35/15/0/0 0/0 "<BADREQ>"
Jun 30 11:08:55 localhost haproxy[6320]: 69.39.20.190:49969 [30/Jun/2011:11:08:45.709] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 37/37/16/0/0 0/0 "<BADREQ>"
Jun 30 11:08:56 localhost haproxy[6320]: 2.29.0.9:58772 [30/Jun/2011:11:08:46.846] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10001 408 212 - - cR-- 43/43/22/0/0 0/0 "<BADREQ>"
Jun 30 11:08:57 localhost haproxy[6320]: 212.139.250.242:57537 [30/Jun/2011:11:08:47.568] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 42/42/21/0/0 0/0 "<BADREQ>"
Jun 30 11:08:58 localhost haproxy[6320]: 74.79.195.75:55046 [30/Jun/2011:11:08:48.559] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 46/46/24/0/0 0/0 "<BADREQ>"
Jun 30 11:08:58 localhost haproxy[6320]: 74.79.195.75:55044 [30/Jun/2011:11:08:48.554] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10004 408 212 - - cR-- 45/45/24/0/0 0/0 "<BADREQ>"
Jun 30 11:08:58 localhost haproxy[6320]: 74.79.195.75:55045 [30/Jun/2011:11:08:48.554] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10005 408 212 - - cR-- 44/44/24/0/0 0/0 "<BADREQ>"
Jun 30 11:09:00 localhost haproxy[6320]: 68.197.56.2:52781 [30/Jun/2011:11:08:50.975] dcaustin dcaustin/<NOSRV> -1/-1/-1/-1/10000 408 212 - - cR-- 49/49/28/0/0 0/0 "<BADREQ>"
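One thing worth noting: the config above never sets "timeout http-request", and when it is missing the 10-second "timeout client" appears to be what expires during the request phase, which would explain the ~10000 ms totals in these log lines. A minimal sketch of setting it explicitly in the defaults section, assuming the keyword behaves as the quoted 1.3 documentation describes (the 5s value is purely illustrative, not taken from the original config):

    defaults
        mode http
        option httplog
        # illustrative: give up on idle clients sooner than the general client timeout
        timeout http-request 5s
        timeout client 10000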
From what I read on Google, if I want to see what the bad requests are, I can send "show errors" to the socket and it will spit them out. We do run a pretty heavily trafficked website, and the percentage of BADREQs to normal requests is quite low, but I'd like to be able to get hold of what that request WAS so I can debug it.
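Since anything "show errors" captures could easily be missed between manual checks, one crude option is to poll the socket, reusing the same socat call as above (the output path is just an example):

    # append whatever the stats socket reports, every 5 seconds
    while sleep 5; do
        echo "show errors" | socat stdio /tmp/haproxy >> /tmp/haproxy-show-errors.log
    done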
stats
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,
dcaustin,FRONTEND,,,64,120,50000,88433,105889100,2553809875,0,0,4641,,,,,OPEN,,,,,,,,,1,1,0,,,,0,45,0,128,
dcaustin,web1,0,0,10,28,300,20941,25402112,633143416,,0,,0,3,0,0,UP,1,1,0,0,0,2208,0,,1,1,1,,20941,,2,11,,30,
dcaustin,web2,0,0,9,30,300,20941,25026691,641475169,,0,,0,3,0,0,UP,1,1,0,0,0,2208,0,,1,1,2,,20941,,2,11,,30,
dcaustin,web3,0,0,10,27,300,20940,30116527,635015040,,0,,0,9,0,0,UP,1,1,0,0,0,2208,0,,1,1,3,,20940,,2,10,,31,
dcaustin,web4,0,0,5,28,300,20940,25343770,643209546,,0,,0,8,0,0,UP,1,1,0,0,0,2208,0,,1,1,4,,20940,,2,11,,31,
dcaustin,BACKEND,0,0,34,95,50000,83762,105889100,2553809875,0,0,,0,34,0,0,UP,4,4,0,,0,2208,0,,1,1,0,,83762,,1,43,,122,
88500 "Sessions" and 4500 errors. in the last 20 minutes.
Your timeouts are too low. Increase them.
timeout connect 30s
timeout client 30s
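Mapped onto the posted config, only the changed lines would look roughly like this (a sketch, not a complete drop-in):

    defaults
        # was: timeout client 10000 (10s)
        timeout client 30s

    listen dcaustin 0.0.0.0:80
        # was: timeout connect 12000 (12s)
        timeout connect 30s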
The absolute minimum is 5 seconds, even for traffic between two servers in the same rack: a TCP connection takes 3 seconds to open if there is any packet loss (the classic initial SYN retransmission timeout), and some packet loss invariably happens from time to time.
The minimum timeout is 15 seconds to support international traffic, like a client in Australia connecting to a server in North America. Latency is high and bandwidth is low in some parts of the world, much worse than one would expect. Being reasonable with timeouts is a prerequisite for doing business worldwide.
The minimum timeout is 30 seconds to support mobile connections and poor-reception Wi-Fi. These are unreliable links that can and do experience short periods of blackout.
Keep in mind that timeouts are meant to handle the worst-case scenario of connectivity; they should only catch truly failed connections. They could be set somewhat shorter, but that achieves nothing except generating errors for clients and servers, which is not a benefit.
Consider that a periodic request made every 5 seconds, something as simple as a health check or a polling API, adds up to 17280 requests per day. A good timeout setting should therefore cause fewer than 0.01% false positives, or it is creating errors every day for no reason.
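Spelling that arithmetic out:

    86400 seconds per day / 5 seconds per request = 17280 requests per day
    17280 requests * 0.01% false positives ≈ 1.7 spurious errors per day from that one client alone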
88500 Sessions and 4500 errors in the last 20 minutes.
That's a 5% error rate, which is very high. Considering that the average web page takes more than 20 sub-requests to load, it means that just about every page on your site is partially failing to load.
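For a rough sense of scale from the frontend counters above, assuming purely for illustration that the errors are spread independently across requests:

    4641 ereq / 88433 stot ≈ 5.2% of frontend requests ending in error
    0.95^20 ≈ 36% chance that a 20-sub-request page load avoids them all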