Per-packet round-robin load balancing for UDP
I need to load-balance UDP traffic across a number of "realservers" and do it in a truly round-robin fashion. I started with keepalived, but unexpectedly discovered that LVS treats UDP traffic as a "connection" (whatever that means in terms of UDP). In practice, this means that all traffic from a particular client goes to the very same realserver all the time. This is a big deal, because some clients may generate so much traffic that a single backend would be overwhelmed.
Apparently this is expected behaviour; however, more recent LVS versions have an "--ops" flag that makes LVS bypass this behaviour, so that each UDP datagram is treated independently (this is what I want!). But (there's always a but) this functionality is not exposed in keepalived.conf.
Is there any solution out there that will let me
- do a round-robin distribution between backends for UDP
- detect "dead" backends and remove them from the round-robin (adding them back when they become "alive" again would also be useful)
Should be Linux-based, obviously. DNS round-robin in any form will not really work here, because the clients are not DNS-aware.
P.S. I am going to try pulse/piranha, but from reading the documentation I've gathered that it does not expose the "--ops" flag either. I am also going to give mon a try (make mon check the backends and add/remove realservers by invoking ipvsadm directly).
Solution 1:
The requirement was satisfied as follows:
I've installed a more recent version of ipvsadm (and its kernel modules), one that supports the --ops flag (1.26). Since keepalived does not expose this flag in its configuration file, you have to apply it manually. Luckily, you can do that after the "virtual service" is created (in terms of plain ipvsadm, you can first ipvsadm -A a virtual service without --ops, and then ipvsadm -E it to add one-packet scheduling).
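To make the create-then-edit sequence concrete, here is a minimal sketch with plain ipvsadm. The VIP and port are placeholder values (assumptions, not taken from my setup), and the `run` helper is just a hypothetical dry-run wrapper, so the script only prints the commands unless you set DRY_RUN=0 and run it as root.

```shell
#!/bin/sh
# Placeholder virtual service address (assumption for illustration only).
VIP=192.0.2.10
VPORT=5000

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_ops() {
    # Step 1: create the UDP virtual service with round-robin, without --ops.
    run ipvsadm -A -u "$VIP:$VPORT" -s rr
    # Step 2: edit the existing service in place to add one-packet scheduling.
    run ipvsadm -E -u "$VIP:$VPORT" -s rr --ops
}

enable_ops
```

With keepalived in the picture, only the second command is yours to run, since keepalived performs the equivalent of the first one itself.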
Since keepalived creates the virtual service for you, all you have to do is edit it after it is created, which happens when quorum is gained for this virtual server (basically, when there is a sufficient number of working realservers). Here's how it looks in the keepalived.conf file:
    virtual_server <VIP> <VPORT> {
        lb_algo rr
        lb_kind NAT
        protocol UDP
        ...
        # Enable one-packet scheduling when quorum is gained
        quorum_up "ipvsadm -E -u <VIP>:<VPORT> --ops -s rr"
        ... realserver definitions, etc ...
    }
This works, but I've encountered a number of problems (kind of) with this setup:
- There is a small time gap (less than a second, more like 1/10 of a second) between quorum going up and the script in quorum_up getting executed. Any datagrams that manage to go through the director during that time will create an entry in the IPVS connection table, and further datagrams from that source host / port will be stuck on the same realserver even after the --ops flag is added. You can minimize the chance of this happening by making sure that the virtual service is never deleted once it is created. You do that by specifying the inhibit_on_failure flag in your realserver definitions, so that they are not deleted when the corresponding realserver is down (when all realservers are deleted, the virtual service is also deleted); instead, their weight is set to zero (they stop receiving traffic then). As a result, the only time datagrams can slip by is during keepalived startup (assuming you have at least one realserver up at that time, so that quorum will be gained immediately).
- When --ops is active, the director does not rewrite the source host / port of the datagrams that the realservers send to the clients, so the source host / port are those of the realserver that sent that particular datagram. This might be a problem (it was for my clients). You can amend that by SNAT'ing those datagrams with iptables.
- I've noticed significant system CPU load when the director is under load. It turns out the CPU is hogged by ksoftirqd. This does not happen if you turn off --ops. Presumably, the problem is that the packet dispatching algorithm is fired on every datagram instead of just the first datagram in the "connection" (if that even applies to UDP). I haven't actually found a way to "fix" that, but maybe I haven't tried hard enough. The system has some specific load requirements, and under that load the processor usage does not max out, nor are there any lost datagrams, so this problem is not considered a show-stopper. It is still rather alarming though.
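The SNAT workaround for the reply datagrams could look roughly like the sketch below, run on the director. All addresses and ports are placeholders (assumptions), it presumes the replies are routed back through the director as in NAT mode, and `run` is again a hypothetical dry-run wrapper that prints the commands unless DRY_RUN=0 is set. Treat it as a starting point rather than the exact rules I used.

```shell
#!/bin/sh
# Placeholder addresses (assumptions for illustration only).
VIP=192.0.2.10
VPORT=5000

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Rewrite the source of UDP replies coming from one realserver ($1 = its IP,
# $2 = its port) so that clients see them as coming from VIP:VPORT.
snat_replies() {
    run iptables -t nat -A POSTROUTING -p udp -s "$1" --sport "$2" \
        -j SNAT --to-source "$VIP:$VPORT"
}

# One rule per realserver (hypothetical realserver addresses).
snat_replies 10.0.0.2 5000
snat_replies 10.0.0.3 5000
```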
Summary: the setup definitely works (also under load), but the hoops one has to jump through and the problems I've encountered (especially the third one; maybe someone knows the solution?) mean that, given time, I would have used a userspace program (written in C, probably) that listens on a UDP socket and distributes the received datagrams between realservers, in conjunction with something that would check the health of the realservers for me, SNAT in iptables to rewrite the source host / port, and keepalived in VRRP mode for HA.
Solution 2:
There must be a way of doing this with multipath routing...
The load balancer and the realservers share IPs in one subnet (10.0.0.0/24). On both realservers you add the same IP, from another subnet, as a secondary address on the loopback interface (172.16.1.1/32). It is on this address that your service will listen.
                              +-------------------------------------+
                         +----|A: eth0:10.0.0.2/24 lo:172.16.1.1/32 |
+--------------------+   |    +-------------------------------------+
|LB eth0:10.0.0.1/24 |---|
+--------------------+   |    +-------------------------------------+
                         +----|B: eth0:10.0.0.3/24 lo:172.16.1.1/32 |
                              +-------------------------------------+
and then you can use:
    ip route add 172.16.1.1/32 nexthop via 10.0.0.2 nexthop via 10.0.0.3
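For completeness, the realserver side of this setup (the shared loopback address) might be configured as sketched below on both A and B. The address comes from the diagram; the `run` helper is a hypothetical dry-run wrapper, so nothing is changed unless you set DRY_RUN=0 and run it as root.

```shell
#!/bin/sh
# Shared service address from the diagram, added on every realserver.
SERVICE_IP=172.16.1.1

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_service_ip() {
    # Add the service address as a secondary /32 on the loopback interface.
    run ip addr add "$SERVICE_IP/32" dev lo
}

add_service_ip
```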
But that's it for the good news: apparently recent Linux kernels cache routes, so packets from the same source will still end up at the same destination. There are some patches to disable this behaviour, but they all seem to be for older kernels (such as the multipath equalize patch for the 2.4 kernel, or mpath in 2.6). A more thorough search might turn up a working patch for a recent kernel.
Failover can be implemented easily by running CARP for both 10.0.0.2 and 10.0.0.3. That way, B takes over 10.0.0.2 when A goes down.