Ansible: How to properly handle errors that break handler notification?

A problem I keep running into with Ansible: one deployment step should run whenever any of several preparation steps reports changed, but that changed status is lost when a fatal error occurs later in the play.

When Ansible cannot continue after one successful preparation step, I still want the machine to eventually reach the state the playbook was meant to achieve. But Ansible forgets the pending notification, e.g.:

- name: "(a) some task is changed"
  git:
    update: yes
    ...
  notify:
   # (b) ansible knows about having to call handler later!
   - apply

- name: "(c) connection lost here"
  command: ...
  notify:
   - apply

- name: apply
  # (d) handler never runs: on the next invocation git-fetch is a no-op
  command: /bin/never

Since the preparation step (a) is a no-op on the next run, running the playbook again does not recover this information. For some tasks, just running ALL handlers is good enough. For others, one can rewrite the handlers into regular tasks with a when: condition that detects whether they still need to run. But some tasks and checks are expensive and/or unreliable, so this is not always good enough.

Partial solutions:

  1. Write out a file and check for its existence later instead of relying on the Ansible handler. This feels like an antipattern. After all, Ansible knows what's left to do; I just do not know how to get it to remember that across multiple attempts.
  2. Stay in a loop until it works or a manual fix is applied, however long that may be. This seems like a bad trade, because now I might not be able to use Ansible against the same group of targets, or I have to safeguard against undesirable side effects of multiple concurrent runs.
  3. Just require higher reliability from the targets, so that failures are rare enough to justify always resolving them manually, using --start-at-task= and checking which handlers are still needed. Experience says things do occasionally break, and right now I am adding more things that can.
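
For reference, option 1 might look roughly like the task-list fragment below. This is only a sketch: the repository URL, the commands, and the marker path /var/lib/myapp/apply.pending are hypothetical, and the marker's parent directory is assumed to exist already.

```yaml
# Sketch of the marker-file workaround (option 1); all paths and URLs are made up.
- name: "(a) update the checkout"
  git:
    repo: https://example.com/app.git   # hypothetical repository
    dest: /opt/app                      # hypothetical checkout path
    update: yes
  register: checkout

- name: remember on the target that apply is still pending
  file:
    path: /var/lib/myapp/apply.pending  # hypothetical marker; parent dir must exist
    state: touch
  when: checkout is changed

- name: "(c) a later step that may fail"
  command: /usr/local/bin/prepare       # hypothetical command

- name: check whether apply is still pending
  stat:
    path: /var/lib/myapp/apply.pending
  register: pending

- name: apply
  command: /usr/local/bin/apply         # hypothetical command
  when: pending.stat.exists

- name: clear the marker only after apply has succeeded
  file:
    path: /var/lib/myapp/apply.pending
    state: absent
  when: pending.stat.exists
```

Because the marker lives on the target, a rerun after a failure at (c) still finds it and runs apply even though (a) is now a no-op. A small window remains between (a) and the marker being written, so this narrows the problem rather than removing it.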

Is there a pattern, feature or trick to properly handle such errors?

  • Ansible Tips and Tricks: Dealing with Unreliable Connections and Services
  • Ansible Docs: Error handling in playbooks
  • Ansible issues #9323: Do not lose handler notifications on failure

The Ansible docs you linked to suggest a way to deal with this:

Ansible runs handlers at the end of each play. If a task notifies a handler but another task fails later in the play, by default the handler does not run on that host, which may leave the host in an unexpected state. For example, a task could update a configuration file and notify a handler to restart some service. If a task later in the same play fails, the configuration file might be changed but the service will not be restarted.

You can change this behavior with the --force-handlers command-line option, by including force_handlers: True in a play, or by adding force_handlers = True to ansible.cfg. When handlers are forced, Ansible will run all notified handlers on all hosts, even hosts with failed tasks. (Note that certain errors could still prevent the handler from running, such as a host becoming unreachable.)

Placing it in ansible.cfg will ensure that it is the default behavior for every playbook and role you run.
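
As a rough illustration of the per-play variant, the sketch below sets force_handlers on a play shaped like the one in the question. The host group, repository URL, and commands are placeholders, not taken from the question.

```yaml
# Sketch: play-level force_handlers; host group, URL, and commands are hypothetical.
- hosts: webservers
  force_handlers: true        # run notified handlers even if a later task fails
  tasks:
    - name: "(a) update the checkout"
      git:
        repo: https://example.com/app.git
        dest: /opt/app
        update: yes
      notify:
        - apply

    - name: "(c) a later step that may fail"
      command: /usr/local/bin/prepare
      notify:
        - apply

  handlers:
    - name: apply
      command: /usr/local/bin/apply
```

With this setting, a failure of the second task no longer discards the notification from the first. As the quoted note says, it does not cover a host that becomes unreachable, which is the connection-loss case (c) in the question.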

Very little can save you if the host dies during a playbook run.