Systemd becomes unresponsive

Within the space of two months, systemd has suddenly become unresponsive on two of my Ubuntu 16.04 LTS servers. Symptoms:

  • All systemctl commands for controlling services or accessing logs fail with error messages:
Failed to retrieve unit state: Connection timed out
Failed to get properties: Connection timed out
  • systemd does not heed the signal from logrotate to reopen its log and keeps writing to the renamed log file /var/log/syslog.1, while the newly created /var/log/syslog remains empty.
  • Lots of zombie processes accumulate from cron jobs and system management tasks (a quick way to confirm these symptoms is sketched right after this list).
  • Running services continue to run normally, but starting or stopping services is no longer possible, as even the legacy scripts in /etc/init.d redirect to the non-functional systemctl.
  • Nothing unusual in the logs except the Connection timed out messages from attempted interactions with systemd.
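
For reference, the symptoms above can be confirmed with standard tools only (ps, lsof, timeout, systemctl); the 10-second timeout below is an arbitrary choice:

ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'       # list zombie processes
timeout 10 systemctl is-system-running; echo "exit: $?"   # hits the timeout instead of printing a state
lsof /var/log/syslog /var/log/syslog.1            # as root: shows which daemon still holds the rotated log open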

The commonly proposed corrective measures:

  • systemctl daemon-reexec
  • kill -TERM 1
  • removing /run/systemd/system/session-*.scope.d

do not fix the problem. The only remedy is to reboot the entire system, which is of course both disruptive and problematic for a server on the other side of the globe.
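
For completeness, those measures as shell commands (run as root; the glob matches the transient session scope drop-ins mentioned above):

systemctl daemon-reexec
kill -TERM 1
rm -rf /run/systemd/system/session-*.scope.d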

Questions:

  • What are possible causes for that sort of systemd malfunction?
  • How can I diagnose this further?
  • Is there a less disruptive way to recover from an unresponsive systemd than to reboot?

Solution 1:

This is a very old question, but I hope it can save someone else some time.

I had an identical problem: zombie processes piling up and systemctl answering every request with a timeout. As expected, the hard part was getting rid of the stuck daemons without a reboot. At least in our case the solution was:

telinit u
systemctl daemon-reexec
systemctl daemon-reload
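
If the re-execution works, systemctl should start answering again; a quick sanity check (cron is just an example unit here):

systemctl is-system-running
systemctl status cron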