How to safeguard a shell script against running out of control?

Solution 1:

Computers do exactly what they are told. The only way to ensure that a script "behaves properly" is to write it so that it will behave properly, under all scenarios.

Some basic advice:

  1. Implement some kind of monitoring system.
    The fact that your system blew up without you knowing it was coming tells me you either do not have a monitoring system, or your current system isn't good enough.
    Invest some time in making sure that your servers tell you that there's a problem before they fall over.
  2. Include appropriate safeguards in scripts run from cron.
    Your script stepped on its own tail. That shouldn't happen.
    You've learned the hard way that you need to guard against this sort of thing (and have the system notify you if it happens).
  3. Design and test more thoroughly.
    Carefully evaluate every script you are going to deploy to make sure it won't produce undesirable side effects. If you can imagine a failure scenario, test for it (and handle it properly!).
    Take the time to simulate failures (either by hard-coding the condition to true in your script, or by generating the circumstances) to test your detection logic.
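
As an illustration of point 2, one common way for a cron script to step on its own tail is for a new run to start while the previous one is still going. The sketch below guards against that with a lock directory and complains on stderr when it has to skip a run. It is a minimal example, not a complete solution: the lock path, the `nightly-job` name, and the `notify`, `acquire_lock`, `release_lock` and `run_locked` helpers are all made up for illustration, and `notify` is only a placeholder for whatever alerting you actually use.

```shell
#!/bin/sh
# Hypothetical lock location; adjust to taste. Must be on a local filesystem.
LOCKDIR="${LOCKDIR:-${TMPDIR:-/tmp}/nightly-job.lock}"

# Placeholder notifier: writes to stderr. Cron normally mails a job's
# output to its owner (or MAILTO), so this is enough to get noticed.
notify() {
    echo "nightly-job: $*" >&2
}

# mkdir either creates the directory or fails, atomically, so only one
# instance can ever hold the lock.
acquire_lock() {
    if mkdir "$LOCKDIR" 2>/dev/null; then
        echo "$$" > "$LOCKDIR/pid"
        return 0
    fi
    notify "lock $LOCKDIR is held (pid $(cat "$LOCKDIR/pid" 2>/dev/null)); skipping this run"
    return 1
}

release_lock() {
    rm -rf "$LOCKDIR"
}

# Run the given command under the lock and return its exit status.
# Returns 1 without running anything if the lock is already held.
run_locked() {
    acquire_lock || return 1
    trap release_lock EXIT      # release the lock if the shell exits early
    trap 'exit 1' INT TERM      # turn signals into a normal exit
    status=0
    "$@" || status=$?
    release_lock
    trap - EXIT INT TERM
    return "$status"
}

# When invoked with arguments, treat them as the command to protect.
if [ "$#" -gt 0 ]; then
    run_locked "$@"
fi
```

This also gives you an easy way to follow point 3: create the lock directory by hand (`mkdir` on whatever `LOCKDIR` points to), start the script, and confirm that it refuses to run and that the warning actually reaches you. Note that a lock left behind by `kill -9` or a crash will block every later run until someone removes it, which is exactly the kind of condition your monitoring should be telling you about.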