Solution 1:

I also use haproxy for our load balancing because, at the time of design, Amazon's Elastic Load Balancing (ELB) did not support servers within a VPC. It has that feature now (I believe; I have not used it, since haproxy is working great for us).

We didn't try keepalived at all for two reasons:

  1. We doubted we could change the private IP on a server without going through AWS (console or API). Also, AWS does not allow two servers within the same VPC to have the same internal IP address.
  2. We required a multi-AZ setup for high availability. A server's internal IP within a VPC comes from its subnet's address range, and each subnet can belong to only one AZ. Therefore, we can't have two hosts in two different AZs within the same subnet, so a keepalived-style floating private IP could not move between them.

Therefore, the solution we implemented was:

  • Set up haproxy on two servers (one in each AZ)
  • Also divide our backend (e.g. web) servers across two or more AZs
  • Assign the Elastic IP to one of the haproxy servers (in the "primary" AZ of our choice). That is the VIP that web clients will access.
  • Monitor the VIP from an external source (outside that AWS region). In the event of a failure (instance or the whole AZ), remap the Elastic IP to the secondary haproxy server (assuming tests are passing on that host).
    • EIP Ref: http://aws.amazon.com/articles/1346
    • Note: We do this manually for now (it is very rarely needed, once or twice a year for AZ outages), but it can easily be scripted with the AWS APIs so that the monitoring server triggers the switchover when the failure condition occurs.
    • Also note there is a cost to EIP remaps ($0.10 per remap over the 100 free remaps per month). Since AZ outages are relatively rare, I don't think this will be an issue.
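
A minimal sketch of how the monitor-and-remap step could be scripted, assuming Python and a boto3 EC2 client passed in as `ec2`. This is not what we run (we still do it manually). The function names, the `cfg` keys, the failure threshold, and the idea of probing the secondary haproxy through its own reachable URL are all assumptions made for illustration.

```python
import http.client
import urllib.request


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP 2xx/3xx status in time."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def choose_action(primary_ok, secondary_ok, failures, threshold=3):
    """Decide what to do after one probe round.

    Returns (action, new_failure_count), where action is one of
    "none", "wait", "alert", or "failover".
    """
    if primary_ok:
        return "none", 0
    failures += 1
    if failures < threshold:
        return "wait", failures      # tolerate short blips
    if not secondary_ok:
        return "alert", failures     # nowhere safe to move the VIP
    return "failover", failures


def remap_eip(ec2, allocation_id, instance_id):
    """Point the Elastic IP at another instance.

    `ec2` is assumed to be a boto3 EC2 client, for example
    boto3.client("ec2", region_name=...).
    """
    return ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        AllowReassociation=True,     # the EIP is still attached to the primary
    )


def check_and_failover(ec2, cfg, failures, probe=is_healthy):
    """Run one probe round and remap the EIP if the rules call for it."""
    primary_ok = probe(cfg["vip_url"])
    secondary_ok = probe(cfg["secondary_url"])
    action, failures = choose_action(
        primary_ok, secondary_ok, failures, cfg["threshold"]
    )
    if action == "failover":
        remap_eip(ec2, cfg["allocation_id"], cfg["secondary_instance_id"])
        failures = 0
    return action, failures
```

The monitoring host would call `check_and_failover` on a timer (cron or a sleep loop), carrying the returned failure count into the next call. Requiring several consecutive failures before remapping keeps a single dropped probe from triggering a switchover and a billable remap.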

One potential risk is that during major AWS outages, we have sometimes seen the AWS console and API start to fail (completely, or more frequently than normal). This may impact attempts to remap the Elastic IP.
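
One way a scripted switchover might cope with a flaky API is to retry the remap call with exponential backoff. The sketch below is a generic helper and is not something we have deployed. The name `retry_with_backoff` and its default attempt count and delays are assumptions, and `operation` stands for whatever callable performs the remap.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Waits base_delay, then 2x, 4x, ... (capped at max_delay) between
    attempts and re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                     # out of attempts: surface the error
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

If the API stays down for the whole retry window, the helper raises, and the monitoring side should page a human rather than keep looping silently.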