Using two Debian servers, I need to set up a strong failover environment for cron jobs that may only be run on one server at a time.

Moving a file into /etc/cron.d should do the trick, but is there a simple HA solution to orchestrate such an action? And if possible not with heartbeat ;)


Solution 1:

I think heartbeat / pacemaker would be the best solution, since they can take care of a lot of race conditions, fencing, etc. for you in order to ensure the job only runs on one host at a time. It's possible to design something yourself, but it likely won't account for all the scenarios those packages do, and you'll eventually end up re-inventing most of, if not all of, the wheel.
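If you do go that route, one pattern worth knowing (sketched below with made-up resource and file names, and assuming the symlink resource agent from the resource-agents package is available) is to let the cluster manage the cron file itself, so the job definition only appears in /etc/cron.d on the node where the resource is currently active:

    # Run once on any cluster node; p_cron_myjob and the paths are hypothetical.
    # /etc/cron.d.avail/myjob must exist on both nodes; the cluster then creates
    # the /etc/cron.d/myjob symlink only on the node where the resource runs.
    crm configure primitive p_cron_myjob ocf:heartbeat:symlink \
        params link="/etc/cron.d/myjob" target="/etc/cron.d.avail/myjob" \
        op monitor interval="30s"

You would normally also colocate this resource with whatever service the job depends on.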

If you don't really care about such things and you want a simpler setup, I suggest staggering the cron jobs on the servers by a few minutes. Then, when the job starts on the primary, it can leave a marker on whatever shared resource the jobs operate on (you don't specify this, so I'm being intentionally vague). If it's a database, the job can update a field in a table; if it's a shared filesystem, it can lock or touch a file.

When the job runs on the second server, it can check for the presence of the marker and abort if it is there.
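As a minimal sketch of the shared-filesystem variant (the wrapper name, marker path and ten-minute freshness window are all assumptions, not anything from the question):

    #!/bin/sh
    # run-job.sh -- called from cron on both hosts, staggered by a few minutes:
    #   primary:    0 * * * *  root  /usr/local/bin/run-job.sh primary
    #   secondary:  5 * * * *  root  /usr/local/bin/run-job.sh secondary
    MARKER=/shared/myjob.marker     # lives on storage both hosts can see
    NOW=$(date +%s)

    if [ "$1" != "primary" ]; then
        # The secondary runs later; if the primary left a fresh marker this
        # cycle, assume it already did the work and abort quietly.
        LAST=$(cat "$MARKER" 2>/dev/null || echo 0)
        [ $((NOW - LAST)) -lt 600 ] && exit 0
    fi

    echo "$NOW" > "$MARKER"         # leave the marker for the other host
    exec /usr/local/bin/real-job.sh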

Solution 2:

We use two approaches, depending on the requirements. Both involve having the cron jobs present and running on all machines, but with a bit of sanity checking:

  1. If the machines are in a primary/secondary relationship (there may be more than one secondary), then the scripts are modified to check whether the machine they are running on is in the primary state. If not, they simply exit quietly. I don't have an HB setup to hand at the moment, but I believe you can query HB for this information.

  2. If all machines are eligible primaries (such as in a cluster), then some locking is used, by way of either a shared database or a PID file. Only one machine ever obtains the lock; those that don't exit quietly. A rough sketch covering both cases follows this list.
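Something along these lines, assuming a cluster-managed floating IP as the "am I primary" test for case 1 and a lock file on shared storage for case 2 (the address, paths and script names are all made up, and flock(1) needs a filesystem whose locks actually work across both hosts):

    #!/bin/sh
    # Case 1: treat "this host currently holds the cluster's floating service
    # IP" as "this host is the primary"; 192.0.2.10 is a placeholder address.
    if ! ip addr show | grep -q '192\.0\.2\.10'; then
        exit 0                      # not the primary right now, exit quietly
    fi

    # Case 2: take an exclusive, non-blocking lock before doing any work, so
    # that even if two nodes think they are eligible, only one proceeds.
    exec flock -n /shared/myjob.lock /usr/local/bin/real-job.sh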

Solution 3:

To make a long story short, you have to turn your cron scripts into some kind of cluster-able application. The implementation can be as lightweight or as heavyweight as you need, but it still needs one thing: to be able to properly resume/restart its action (or recover its state) after a primary node failover. The trivial case is that they are stateless programs (or "stateless enough" programs) that can simply be restarted at any time and will do just fine. This is probably not your case. Note that for stateless programs you don't need failover at all, because you could simply run them in parallel on all the nodes.

In the normal, more complicated case, your scripts should live on the cluster's shared storage, should store their state in files there, should change the state stored on disk only atomically, and should be able to continue their action from whatever transient state they detect on startup.
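A minimal sketch of that idea, assuming a shared filesystem and made-up step scripts (the point is only the atomic write-then-rename and the resume-from-state-on-startup pattern):

    #!/bin/sh
    # Hypothetical state-driven job kept on the cluster's shared storage.
    STATEFILE=/shared/myjob/state

    set_state() {
        # Atomic update: write a temp file, then rename it into place, so a
        # crash or failover mid-write never leaves a half-written state file.
        echo "$1" > "$STATEFILE.tmp" && mv "$STATEFILE.tmp" "$STATEFILE"
    }

    # Resume from whatever state the previous (possibly failed-over) run reached.
    state=$(cat "$STATEFILE" 2>/dev/null || echo "start")

    case "$state" in
        start)
            /usr/local/bin/step-one.sh && set_state "step1-done"
            ;;
        step1-done)
            /usr/local/bin/step-two.sh && set_state "start"   # cycle complete
            ;;
    esac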