What could cause a systemd service stop to end with job being canceled?

systemd

Every unit in systemd has an internal job slot, and only one job can be installed for a unit at a time. Jobs encapsulate state change requests for units in general, but their effects vary by unit type. For services, a job initiates a state change, but the action it started may keep running even after you cancel the installed job. If you cancel it by replacing it with a job of another type, the new job stays in the waiting state until that operation completes, because the unit_start/unit_stop functions can themselves decide when a job is runnable.

As an illustration, suppose a stop action takes a long time. Calling start while the stop job is running will, with the default job mode (replace), cancel the installed stop job and install a start job in the unit's job slot. Because unit_stop has already initiated a transition to deactivating (and to whichever internal service sub-state applies, such as stop, stop-sigterm, stop-sigkill, stop-post, final-sigterm, or final-sigkill), unit_start now returns -EAGAIN, which causes systemd to put the start job in the JOB_WAITING state. On the next state change, unit_notify adds the job to the run queue again, where it is checked for runnability and, depending on the result, either run or put back in waiting. Every time a job is run, it is removed from the run queue. This is why systemctl start simply waits all that time (unless you use --no-block).
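The sequence above can be sketched as a small toy model. This is not systemd's code: the Job and Unit classes, the method names, and the log strings are all made up for illustration. It assumes a unit with a single job slot, "replace" semantics on enqueue, and a start that completes instantly once the unit is inactive.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    type: str                      # "start" or "stop"
    state: str = "waiting"         # "waiting" or "running"
    result: Optional[str] = None   # "done" or "canceled" once finished


@dataclass
class Unit:
    active_state: str = "active"   # "active", "deactivating" or "inactive"
    job: Optional[Job] = None      # the single job slot
    log: List[str] = field(default_factory=list)

    def enqueue(self, job_type: str) -> Job:
        """Install a job with 'replace' semantics, then try to run it."""
        if self.job is not None:
            # Only the job is canceled; an operation it started keeps going.
            self.job.result = "canceled"
            self.log.append(f"{self.job.type} job canceled")
        new_job = Job(job_type)
        self.job = new_job
        self._try_run()
        return new_job

    def notify(self, new_state: str) -> None:
        """Toy counterpart of a state change: re-check the installed job."""
        self.active_state = new_state
        self._try_run()

    def _try_run(self) -> None:
        job = self.job
        if job is None:
            return
        if job.type == "stop":
            if self.active_state == "active":
                self.active_state = "deactivating"   # the slow stop begins
                job.state = "running"
                self.log.append("stop job running")
            elif self.active_state == "inactive":
                self._finish("done")
        elif job.type == "start":
            if self.active_state == "deactivating":
                job.state = "waiting"                # stands in for -EAGAIN
                self.log.append("start job waiting")
            else:
                self.active_state = "active"         # start treated as instant
                self._finish("done")

    def _finish(self, result: str) -> None:
        self.job.result = result
        self.log.append(f"{self.job.type} job {result}")
        self.job = None


def simulate():
    unit = Unit()
    stop_job = unit.enqueue("stop")     # unit begins deactivating
    start_job = unit.enqueue("start")   # replaces (cancels) the stop job
    unit.notify("inactive")             # the slow stop finally completes
    return unit, stop_job, start_job


if __name__ == "__main__":
    unit, stop_job, start_job = simulate()
    print("\n".join(unit.log))
```

In this model the stop job ends with the result "canceled" even though the unit still goes through deactivation, and the start job only completes after the state change to inactive, which mirrors why a blocking systemctl start waits.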

That is an overview of some of the moving parts. There are three things to keep in mind about jobs: they have a type (start, stop, restart, reload, etc.), a result (timeout, done, canceled, dependency, skipped, etc.) and a mode (replace, isolate, flush, etc.). Modes apply to an entire transaction, that is, the requested job together with its requirement and propagated dependency jobs, applied in a consistent manner. The documentation describes what each mode does.

In your specific case, it appears that when you run systemctl stop, another job comes in and replaces your stop job, and the systemctl client disconnects because the job it enqueued was canceled. This could be due to a dependency or to something else. For example, ExecStop= might end up calling systemctl start on the unit (which only works the first time), or a unit that Wants/Requires/BindsTo the same unit might be starting up, triggering a start job that replaces the stop job you triggered. The service could be socket activated and, because of a busy connection, be retriggered, which enqueues a start job through the Triggers= dependency in the socket unit and cancels your stop job. It could also be a timer or something else. In short, the stop job is canceled because some other job comes in and replaces it.
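To narrow down which of these is responsible, it can help to look at the queued jobs and at the units that point at yours. The following is only a sketch of one way to collect that information. The helper name diagnostic_commands and the choice of commands are my own, and nothing here is guaranteed to show the actual cause of a cancellation.

```python
import subprocess
import sys


def diagnostic_commands(unit):
    """Return argv lists for commands that may hint at what replaced a job.

    Hypothetical helper: the selection of commands is a suggestion only.
    """
    return [
        # Jobs currently queued or running in the manager.
        ["systemctl", "list-jobs", "--no-pager"],
        # Units that depend on this unit (reverse dependency tree).
        ["systemctl", "list-dependencies", "--reverse", "--no-pager", unit],
        # Reverse dependency properties, including socket/timer triggers.
        ["systemctl", "show", "-p",
         "TriggeredBy,WantedBy,RequiredBy,BoundBy", unit],
        # Recent log lines for the unit from the current boot.
        ["journalctl", "--no-pager", "-b", "-n", "50", "-u", unit],
    ]


if __name__ == "__main__" and len(sys.argv) == 2:
    # Only runs the commands when a unit name is given on the command line.
    for cmd in diagnostic_commands(sys.argv[1]):
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=False)
```

Running it with a unit name (for example nginx.service) while the stop is in progress gives the best chance of catching a competing start job in the list-jobs output.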

Of course, as you note, this is all prone to races: it may or may not happen on a given attempt, which is why you only see it occasionally. It would be a good idea to review your setup to avoid these issues.

In my case I got

[root@server:~]# systemctl start nginx
Job for nginx.service canceled.

and the reason was that I had configured nginx with BindsTo= on another service, so that nginx runs exactly when that other service is running.

Due to a bug, the other service one day started exiting immediately, which caused systemd to cancel the start job of nginx.

Unfortunately, systemd seems to give no further indication of the reason for the cancellation. I feel it would be much better if it did (and I have filed a feature request for it).
