HAProxy not balancing server load fairly

HAProxy doesn't seem to keep the connections to the servers balanced.

Keep this in mind:

  • using HAProxy v1.3.26
  • 5 servers with identical specs
  • algorithm is round robin, with no weights applied to any server
  • global max connections in HAProxy is set to 80,000

As seen in the picture, servers C and D seem to be getting way more connections than the other ones. Due to this extra load, they keep going down and get rebooted automatically.

[Image: statconfig]

I tried reading the official HAProxy docs and did some Googling but didn't find anything useful. Hopefully someone here can help.

A couple of questions:

  1. Why does this happen when the config says to use roundrobin, the server specs are the same, and no weights are applied?

  2. What determines the "max" sub-column in the "Sessions" column (the one that says 1970, 1444, etc.)? Servers C, D, and E are in the 3K range and the other two are a little under 2K. Why the difference?

  3. How can I keep it all balanced?

  4. Can someone explain each column? I'm surprised that the official HAProxy docs don't really explain them.


Solution 1:

What happens if you take C and D out? How does the behavior change?

What does your configuration look like?

Disclaimer: What follows is based on my observations of HAProxy's behavior, not on knowledge of how it actually works internally.

From what I understand, HAProxy always uses weights. Your screenshot shows all of the servers at weight 1. In our own setup, also running roundrobin, we have four servers of weight 50 and one of weight 1. The four servers of weight 50 are nearly identical in number of sessions (within 2-3 of each other), and the server of weight 1 has a proportionately correct number of sessions.

Try explicitly setting all of the weights to the same, higher value. That should make the weight calculation more fine-grained, which should improve accuracy. If the weight is 1 on all five servers, one unit of weight is 20% of the total, which isn't very precise. If you set all five to weight 20, one unit of weight is 1% of the total.
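
To make the arithmetic above concrete, here is a small Python sketch. It only computes shares from the weights; it does not model how HAProxy actually schedules requests, and the function names and server labels are made up for illustration. Note that the expected share per server is the same (one fifth) in both cases; only the size of one weight unit changes, so whether this improves real-world balance is still my guess.

```python
from fractions import Fraction


def traffic_shares(weights):
    """Expected fraction of traffic per server, proportional to its weight."""
    total = sum(weights.values())
    return {name: Fraction(w, total) for name, w in weights.items()}


def weight_step(weights):
    """Fraction of total traffic represented by a single unit of weight."""
    return Fraction(1, sum(weights.values()))


# Hypothetical labels matching the five servers in the question.
servers = ["A", "B", "C", "D", "E"]
all_ones = {name: 1 for name in servers}
all_twenties = {name: 20 for name in servers}

print(weight_step(all_ones))        # one weight unit = 1/5 of traffic (20%)
print(weight_step(all_twenties))    # one weight unit = 1/100 of traffic (1%)
print(traffic_shares(all_ones) == traffic_shares(all_twenties))
```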

  1. Not sure. I need to see the configuration first.
  2. I am pretty certain the "max" column is the highest number of connections the server has had at any given point, rather than a maximum allowed.
  3. We use round robin and it works great so again, we need to see the config and try some things.
  4. Most of the columns I think are pretty clear. What helps me is to look at the extended column above (e.g. Queue, Session Rate, etc.)
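
To illustrate what I mean in point 2, here is a rough Python sketch of a "current" counter next to a "max" (high-water mark) counter. This is only my mental model of the "Cur" and "Max" sub-columns, not HAProxy's code, and the class and method names are invented for the example.

```python
class SessionCounter:
    """Tracks open sessions and the highest number seen at once."""

    def __init__(self):
        self.cur = 0   # sessions open right now
        self.max = 0   # highest value cur has ever reached

    def open(self):
        self.cur += 1
        # The peak only ever moves up; it is a record, not a limit.
        self.max = max(self.max, self.cur)

    def close(self):
        self.cur -= 1


counter = SessionCounter()
for _ in range(3):
    counter.open()      # three sessions open at the same time
counter.close()
counter.close()         # two of them finish

print(counter.cur, counter.max)
```

So a server that was briefly swamped keeps a high "max" even after its current session count drops back down.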

Hopefully that gives you some things to look at.

Solution 2:

You should use the "leastconn" method instead of round robin. It uses slightly more CPU, but does better load balancing if your sessions aren't super-short.

If you look, your A, B, and E servers have ~250 current sessions open (Sessions Cur), while C and D have many times that. But because you specified "round robin", those overloaded servers still get an equal share of all new traffic.

"leastconn" allows servers to recover if they get overwhelmed. "roundrobin" keeps sending everybody traffic equally (piling on more connections to a slow server) until they fall over.

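Here is a toy Python simulation of the difference. It is a sketch under made-up assumptions, not HAProxy's implementation: five servers, five new connections per tick, and servers C and D holding each connection for 8 ticks while the others take 2. The function name and all the numbers are hypothetical.

```python
def simulate(strategy, service_ticks, arrivals_per_tick=5, ticks=200):
    """Return the number of open connections per server after the run."""
    n = len(service_ticks)
    # open_conns[i] holds the remaining ticks of each connection on server i.
    open_conns = [[] for _ in range(n)]
    rr_next = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if strategy == "roundrobin":
                # Take turns, regardless of how busy each server is.
                target = rr_next
                rr_next = (rr_next + 1) % n
            elif strategy == "leastconn":
                # Pick the server with the fewest open connections.
                target = min(range(n), key=lambda i: len(open_conns[i]))
            else:
                raise ValueError("unknown strategy: " + strategy)
            open_conns[target].append(service_ticks[target])
        # One tick passes: finished connections are dropped, the rest age.
        for i in range(n):
            open_conns[i] = [t - 1 for t in open_conns[i] if t > 1]
    return [len(conns) for conns in open_conns]


# Servers A, B, C, D, E; C and D are the slow ones in this made-up setup.
service_ticks = [2, 2, 8, 8, 2]
rr = simulate("roundrobin", service_ticks)
lc = simulate("leastconn", service_ticks)
print("roundrobin:", rr)
print("leastconn: ", lc)
```

With round robin, the slow servers end up holding several times as many open connections as the fast ones, because new connections keep arriving at the same rate no matter how backed up they are. With least connections, new traffic is steered away from whichever server is currently the most loaded, so the counts stay much closer together.
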
The meanings of all the session variables are documented towards the end of the very comprehensive documentation. (Search for "statistics and monitoring")