Systemd and Disaster-Recovery Stand-By systems

As your use case is fairly custom and your needs might change in the future, why not do something like the following...

Create a new systemd timer (e.g. failover-manager) on both machines that fires once per minute. The timer will start an associated one-shot systemd service at each interval.

That one-shot systemd service can just run a bash script that contains your logic:

  • run your amIthePrimary check
  • if primary, start your systemd service and check that it started without error.
  • if not primary, stop your systemd service if it is running.
  • if your script is unable to start, stop, or verify the service, it should exit non-zero (so monitoring can catch it); otherwise it should exit successfully.
  • the timer script doesn't need to output anything (to keep the noise down) except when it makes a change, e.g.:
    • "I am primary, but service not running. Starting... Waiting a few seconds... verified service is running without error." or
    • "I am not the primary but the service IS running. Stopping."

This way you can always monitor that your regular timer check is running without errors. If the timer's service has issues (a non-zero exit), your monitoring can catch it. You can monitor for failures of your main application service separately.
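
As a rough illustration, a monitoring probe could look something like this (the unit names match the sample further down; the exit codes follow the usual Nagios-style convention, so adjust to whatever your monitoring stack expects):

#!/bin/bash
# Hypothetical monitoring probe for the failover-manager timer/service pair.
# "systemctl is-failed" exits 0 only when the unit IS in a failed state.
if systemctl is-failed --quiet failover-manager.service; then
    echo "CRITICAL: last failover-manager run failed"
    exit 2
fi
# the timer itself should always be active
if ! systemctl is-active --quiet failover-manager.timer; then
    echo "CRITICAL: failover-manager.timer is not active"
    exit 2
fi
echo "OK: failover-manager timer healthy"
exit 0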

If your needs change you can easily adapt your timer script or the frequency it runs at.

There are probably cleaner ways to do this, but they would likely depend on an event being generated from whatever is behind your amIthePrimary check...and you didn't provide any details on that. i.e. event driven failover rather than polling.

You could also put your amIthePrimary check into ExecStartPre=, but when that check fails and prevents the service from starting, the service ends up in a failed state. That may confuse your monitoring, because it isn't a "something broke" failure but an intentional one. So you might prefer the timer approach, because then you can monitor the timer unit and your main service separately: the timer should always be active and never failing, and your service (when it is supposed to run) should never be in a failed state, or monitoring should go off. There is a separate question of how monitoring knows whether the service should be running at all, but that is beyond the scope of this question.
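
For reference, the ExecStartPre= variant would look roughly like this (a sketch only; the application binary path is a placeholder):

# excerpt from my-application.service
# if the check exits non-zero, systemd aborts the start and marks the unit failed
[Service]
ExecStartPre=/opt/failover-manager/amIthePrimary
ExecStart=/usr/local/bin/my-application

Prefixing the check with "-" (ExecStartPre=-/opt/failover-manager/amIthePrimary) would make systemd ignore its failure, but then ExecStart= runs anyway, which is exactly what you don't want on the standby.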

Update - sample implementation

Untested, but just to make my suggestion clearer.

failover-manager.sh

Let's say this script is deployed to /opt/failover-manager/failover-manager.sh

#!/bin/bash

# Expected environment, provided by the service that starts this script:
#
#   APP_SERVICE       (your main application, e.g. my-application.service)
#   SECONDS_TO_START  (e.g. some java apps start very slowly)
#
# amIthePrimary is assumed to be on the PATH -- it is whatever check you
# already have that says whether this machine is currently the primary.

if [ -z "$APP_SERVICE" ] || [ -z "$SECONDS_TO_START" ]; then
    echo "Missing environment: APP_SERVICE and SECONDS_TO_START are required"
    exit 1
fi

function is_running {
    systemctl is-active --quiet "$1"
}

if amIthePrimary; then
    if is_running "$APP_SERVICE"; then   # no change, no log
        exit 0
    else
        echo "I AM primary, but service NOT running.  STARTING..."
        systemctl start "$APP_SERVICE"
        sleep "$SECONDS_TO_START"
        if is_running "$APP_SERVICE"; then
            echo "Verified service is STARTED without error: $APP_SERVICE."
            exit 0
        else
            echo "Service $APP_SERVICE has not STARTED within $SECONDS_TO_START seconds."
            exit 1
        fi
    fi
else
    if is_running "$APP_SERVICE"; then
        echo "I am NOT primary, but service IS running.  Stopping..."
        systemctl stop "$APP_SERVICE"
        sleep "$SECONDS_TO_START"
        if is_running "$APP_SERVICE"; then
            echo "Service $APP_SERVICE has not STOPPED within $SECONDS_TO_START seconds."
            exit 1
        else
            echo "Verified service is STOPPED: $APP_SERVICE."
            exit 0
        fi
    else   # no change, no log
        exit 0
    fi
fi

failover-manager.timer

[Unit]
Description=Timer that starts failover-manager.service
Requires=failover-manager.service

[Timer]
Unit=failover-manager.service
# every 1 minute
OnCalendar=*:0/1
AccuracySec=1s
Persistent=true

[Install]
WantedBy=timers.target

failover-manager.service

This guy is run by the timer above.

[Unit]
Description=Checks if we need to start or stop our application.

[Service]
Type=oneshot
Environment="APP_SERVICE=my-application.service"
Environment="SECONDS_TO_START=5"
WorkingDirectory=/opt/failover-manager/
ExecStart=/opt/failover-manager/failover-manager.sh

User=root
Group=root
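
To wire it all up (paths and names as above), something along these lines should do on both machines:

# deploy the script and the two units, then enable the timer
chmod +x /opt/failover-manager/failover-manager.sh
systemctl daemon-reload
systemctl enable --now failover-manager.timer

# verify the timer is scheduled and see what the last run did
systemctl list-timers failover-manager.timer
journalctl -u failover-manager.service -n 20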

pure systemd options?

If you are looking for a pure systemd mechanism to accomplish this cleanly, it may not be possible.

Your use-case is custom and IMO beyond the scope of systemd.

So you can "hack" it in using ExecStartPre= or requires/wants-type dependency mechanisms, but all of those approaches depend on either a unit sitting in a failed/stopped state by design (which muddies monitoring: is it an intentional failure or a real one?), or on a process being started/stopped by "something" that is aware of state outside the systemd world. The latter doesn't break monitoring, but it does require something beyond systemd, and what I proposed above is one way to do that.

alternatives

As @anx suggested... perhaps re-engineer how your DR failover works.

This is also the approach we take: if we have a standby box/cloud/rack/etc., we like to make sure everything on it is already running (services, etc.).

Then the question is just... how to make the switch-over.

There are two common ways fail-over to a standby endpoint can be accomplished...

1 - DNS failover

Set a low DNS TTL (cache time) for your critical endpoints and, when a failure is detected, update your DNS records (CNAME, A, or AAAA) to point at the standby endpoint.

Many managed DNS providers (e.g. dnsmadeeasy, dynect) offer this as part of their service (detection and fail-over). But of course you can implement it with your own DNS, or with any DNS provider that lets you set a low TTL and update your records easily, either manually or automatically (monitoring + DNS API).
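
For the self-hosted case, a minimal sketch using nsupdate (assuming a DNS server that accepts TSIG-signed dynamic updates; all names and the key path are placeholders):

# swap the failover CNAME to the standby endpoint via an RFC 2136 dynamic update
nsupdate -k /etc/failover/tsig.key <<'EOF'
server ns1.example.com
zone example.com
update delete app.example.com. CNAME
update add app.example.com. 60 CNAME standby.example.com.
send
EOF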

One potential issue here is that you may worry about bots making requests to the "non-active" endpoint. That will definitely happen, but if your application is well designed, a few requests coming into the standby DR endpoint won't break anything.

The good thing is this forces you to think about how to make your application architecture more robust in terms of multiple concurrent endpoints receiving traffic (sharing databases, replication, etc).

If it is a big deal, you can potentially add iptables rules on the standby to manage this... but then you have the same problem as before: how to trigger the change (because now both DNS and iptables need to change for the failover to happen).
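
If you do go that route, a rough sketch (assuming the application listens on TCP 443):

# on the standby, refuse application traffic until this node is promoted
iptables -A INPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset
# during failover, delete the rule again
iptables -D INPUT -p tcp --dport 443 -j REJECT --reject-with tcp-reset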

2 - load balancer failover

It is quite common to have standby servers that are not active in a load balancer and can be quickly added/swapped into the pool of active servers behind the load balancer.

In this case the load balancer (or a third component) can run the health checks and update the load balancer config to swap unhealthy servers out for healthy ones.
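
As a rough illustration (assuming HAProxy; the backend name, addresses and health-check path are placeholders), the standby can simply be marked as a backup server:

# HAProxy backend sketch: 'standby' only receives traffic
# when 'primary' fails its health checks
backend app
    option httpchk GET /healthz
    server primary 10.0.1.10:8080 check
    server standby 10.0.2.10:8080 check backup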

This doesn't work as well for a DR case, as load balancers are usually rack- or datacenter-local. So for DR you are probably better off building on a DNS-based fail-over to a different datacenter/region.