How to control the rate of automatic restarts of a runit service?

I have a runit service whose run and log/run scripts are working properly.

As it happens, the service itself can crash for external reasons and might not be able to start for many minutes. The default way that runit handles this situation is by restarting the service every couple of seconds. How do I change this behaviour?

My latest idea was to add a check script and do some magic there, but that seems much more complicated than it should be. Is there a better, simpler way?


Solution 1:

I'm not familiar with this facility. However, if it were my task to solve this problem, and a quick read of the man page did not offer a simple knob to tune this behaviour, I'd do the following:

Either extend the existing service start script, or if that is cumbersome, insert a new start script into the chain (which in turn starts the original start script). Instead of starting the service right away, the new start script should check if the last start happened recently enough. This can be done by checking a signaling file created by the previous start. If the file does not exist, the script can go on and touch the file and start the service. If the file exists, the script should check if the file is old enough. If it is not old enough, it should wait (sleep) in a loop until the file gets old enough.

Something like this might work (waits at least 1 minute between restarts):

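#!/bin/bash

# Directory and name of the marker file that records the last start.
SIGNALDIR=/tmp
SIGNALFILE=service.started

# Keep waiting while the marker file was modified less than 1 minute ago.
while true; do
        found=$(find "${SIGNALDIR}" -maxdepth 1 -name "${SIGNALFILE}" -mmin -1 | wc -l)
        [ "${found}" -eq 0 ] && break
        echo "Waiting"
        sleep 10
done

# Record this start, then hand over to the original start script.
touch "${SIGNALDIR}/${SIGNALFILE}"
# original service start... (use exec so that runit supervises the service itself)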

Solution 2:

You should be rate-limiting your restarts in the ./finish file for that service, which runsv runs whenever ./run exits. The ./finish script receives the exit code of ./run as its first argument (-1 if ./run did not exit normally), and from there you can decide what to do. For that matter, your ./finish script should also be screaming loudly about the failures, sending notifications, and jumping all around on fire...
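As a rough illustration of this approach, here is a minimal ./finish sketch. It relies on the fact that runsv waits for ./finish to exit before it starts ./run again, so sleeping in ./finish delays the restart. The `restart_delay` function, the `RESTART_DELAY` variable, the 60-second default, and the "no delay after a clean exit" policy are all assumptions made for this example, not part of runit.

```shell
#!/bin/sh
# ./finish sketch: delay the next restart after a failed run.
# runsv passes two arguments:
#   $1  exit code of ./run, or -1 if ./run did not exit normally
#   $2  least significant byte of the waitpid status (signal number, if any)

# Hypothetical policy: no delay after a clean exit, otherwise wait
# RESTART_DELAY seconds (default 60). Adjust to taste.
restart_delay() {
    case "$1" in
        0) echo 0 ;;
        *) echo "${RESTART_DELAY:-60}" ;;
    esac
}

# If no argument is given (e.g. when run by hand), treat it as a clean exit.
delay=$(restart_delay "${1:-0}")

# Report to stderr; this is also the place to hook in notifications.
echo "run exited with code ${1:-unknown}; delaying restart by ${delay}s" >&2

# runsv will not start ./run again until this script exits.
sleep "$delay"
```

The script has to be executable (chmod +x finish) and live next to ./run in the service directory. A fixed delay is the simplest policy; a growing back-off would need some state kept between runs, for example in a file under the service directory.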

Solution 3:

I'm really not a fan of init-based process management (and runit is basically an init substitute). As you are discovering, simple-minded restarting of failed processes as soon as they die is not a particularly good strategy. I've used init to restart monit, but that's as far as it goes (the OOM killer could potentially kill monit).

So, I'd encourage you to look for a replacement rather than patch things up.

Monit is pretty old, but it does the job well, and I'm not aware of anything better having come along. It has the nice feature of not needing to malloc more memory after start-up, so it beats the hell out of anything written in a scripting language. The last thing you want is your process monitor dying because it can't get memory.