Systemd Service Restart When Program in Tmux Window Fails
I have a dotnet program running inside of bash in tmux which occasionally fails with a non-zero error code. I am attempting to use a systemd service file to programmatically start my dotnet program inside of tmux.
Here is the service file:
[Unit]
Description=dotnet application
[Service]
Type=forking
ExecStart=/home/alpine_sour/scripts/rofdl
Restart=always
User=root
[Install]
WantedBy=multi-user.target
Here is the rofdl shell script:
#!/bin/bash
/usr/bin/tmux kill-session -t "rof" 2> /dev/null || true
/usr/bin/tmux new -s "rof" -d "cd /home/alpine_sour/rofdl && dotnet run"
Now, when I start the service, systemd chooses the tmux server as the main PID, which I assume is because it was the first command executed. Therefore, when my program in the tmux window exits with ANY exit code AND there are no more windows, the tmux server exits with a success exit code, so systemd does not restart the service. Even with Restart=always, the service would only restart if my program fails AND there are no other windows.
Process: 24980 ExecStart=/home/alpine_sour/scripts/rofdl (code=exited, status=0/SUCCESS)
Main PID: 24984 (tmux: server)
├─24984 /usr/bin/tmux new -s rofdl -d cd /home/alpine_sour/rofdl && dotnet run -- start
├─24985 sh -c cd /home/alpine_sour/rofdl && dotnet run -- start
├─24987 dotnet run -- start
└─25026 dotnet exec /home/alpine_sour/rofdl/bin/Debug/netcoreapp2.1/rofdl.dll start
So I'm wondering how I would get systemd to track the lowest level of the process fork rather than the higher level tmux server. I need a way to tell systemd to track the child process of the tmux server rather than the server itself and restart accordingly.
Solution 1:
Preliminary notes
- This answer is based on experiments in Debian 9.
- I assume your service is a system service (in /etc/systemd/system).
- What you posted near the end of the question body looks like an excerpt from systemctl status …. It says nothing about cgroups. This answer assumes Control Groups are involved. I think systemd requires them, so they must be.
- The command itself may run in a loop, until it succeeds:
cd /home/alpine_sour/rofdl && while ! dotnet run; do :; done
but I understand you want a systemd solution.
Problems
First please read how tmux works. Understanding which process is whose child will be very helpful.
Which processes belong to the service
In your original case the service will be considered inactive (and ready to restart, if applicable) after all processes from its cgroup exit.
Your script tries to kill the old tmux session, not the old tmux server. Then tmux new (equivalent to tmux new-session) either starts a server or uses the old one.
- If it uses the old one then neither the server nor your command (dotnet …) will be a descendant of the script. These processes will not belong to the cgroup associated with the service. After the script exits, systemd will consider the service inactive.
- If it starts a new tmux server then the server and the command will be assigned to the cgroup associated with the service. Then our command may terminate but if there are other sessions/windows (created later) within the server, the server may remain and systemd will consider the service active.
If there is one main process, the whole cgroup gets killed after the main process exits. With Type=simple the main process is the one specified by ExecStart=. With Type=forking you need to use PIDFile= and pass a PID this way to specify the main process. And when you stop a service, systemd kills all processes that belong to the service. Therefore it's important to include only processes specific to the service in the cgroup. In your case you may want to exclude the tmux server, even if it's started from within the service.
There are tools/ways to move processes between cgroups. Or you can run a separate tmux server specific to the service.
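Before moving anything between cgroups it helps to see where a process currently sits. A minimal sketch, assuming Linux with /proc mounted; cgroups_of is a hypothetical helper name, not part of systemd or tmux:

```shell
#!/bin/sh
# Print the cgroup membership of a PID (default: this shell).
# Assumption: Linux with /proc mounted; prints "unknown" otherwise.
cgroups_of() {
    pid="${1:-$$}"
    if [ -r "/proc/$pid/cgroup" ]; then
        cat "/proc/$pid/cgroup"
    else
        echo "unknown"
    fi
}

# Show the cgroup(s) of the current shell.
cgroups_of
```

Running it with the PID of the tmux server and then with the PID of dotnet should show whether both ended up in the cgroup of the service.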
How systemd knows which exit status to use
Restart=on-failure makes restarting depend on the exit status of the main process. With Type=forking it's advised to use PIDFile= so systemd knows which process's exit status to use.
systemd may or may not be able to retrieve the exit status though.
Who retrieves exit status
After a child exits, its parent can retrieve the exit status (compare zombie process).
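A minimal sketch of that parent/child relationship in plain sh; child_status is a hypothetical helper name used only for illustration:

```shell
#!/bin/sh
# Start a child in the background, then reap it from its parent (this shell).
child_status() {
    sh -c "exit $1" &
    # wait blocks until that child exits and returns the child's exit status.
    wait "$!"
    echo "$?"
}

child_status 3   # prints 3
```

Only the shell that started the child can wait for it like this. An unrelated process, which is what systemd is for anything started by the tmux server, has no such call available.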
Regardless of whether the tmux server is old or new, your command won't be a child of systemd unless it gets orphaned, the kernel sets its parent to PID 1 (or some other) and the new parent is the right systemd.
The command you provide to tmux new makes the tmux server run a shell, then the shell either runs dotnet and waits for it to exit, or execs to dotnet while keeping the tmux server as a parent. In any case dotnet has a parent which is not systemd.
You could orphan dotnet like this: nohup dotnet … &, then let the said shell exit. You would also need to store the PID and use PIDFile= in the unit configuration file, so the service knows which process to monitor. Then it might kinda work.
To be clear: in my tests nohup sleep 300 & was successfully adopted by systemd, which could then retrieve its exit status (after I took care of cgroups).
But since you want to use tmux in the first place, I guess your command interacts with the terminal. So nohup is not the right tool here. Orphaning a process while keeping it connected to the terminal may be tricky. You want to orphan it but you cannot let the shell within tmux simply exit, because this will kill its pane (or leave it in a dead state).
Note Type=forking relies on adoption by systemd. The main service process is supposed to fork and exit. Then systemd adopts its child. Such a daemon should not interact with any terminal though.
Another approach is to let the shell within the tmux server exec to dotnet. After it exits, the tmux server (as a parent) knows its exit status. In some circumstances we can query the server from another script and retrieve the exit status.
Or the shell triggered by tmux new may store the status in a file, so it can be retrieved by another script.
Because what you run with ExecStart= is a child of systemd for sure, this is the best candidate for "another script". It should wait until it can retrieve the exit status, then use it as its own exit status, so systemd gets it. Note the service should be Type=simple in this case.
Alternatively you can start dotnet … outside of tmux, then reptyr from the inside of the tmux server. This way dotnet can be a child of systemd from the very beginning, but problems may appear when you try to steal its tty.
Solutions and examples
reptyr to tmux
This example runs the script in tty2. The script prepares tmux and execs to dotnet. Finally a shell within tmux tries to steal the tty of what is now dotnet.
The service file:
[Unit]
Description=dotnet application
[email protected]
[Service]
Type=simple
ExecStart=/home/alpine_sour/scripts/rofdl
Restart=on-failure
User=root
StandardInput=tty
TTYPath=/dev/tty2
TTYReset=yes
TTYVHangup=yes
[Install]
WantedBy=multi-user.target
/home/alpine_sour/scripts/rofdl:
#!/bin/sh
tmux="/usr/bin/tmux"
"$tmux" kill-session -t "rof" 2> /dev/null
"$tmux" new-session -s "rof" -d "sleep 5; exec /usr/bin/reptyr $$" || exit 1
cd /home/alpine_sour/rofdl && exec dotnet run
Notes:
- My tests with htop instead of dotnet run revealed a race condition (htop changes settings of its terminal, reptyr can interfere; hence sleep 5 as a poor workaround) and problems with mouse support.
- It's possible to remove the tmux server from the cgroup associated with the service. You probably want to do this. See further below, where there is /sys/fs/cgroup/systemd/ in the code.
Without tmux?
The above solution uses /dev/tty2 anyway. If you need tmux only to provide a controlling terminal, consider cd /home/alpine_sour/rofdl && exec dotnet run without reptyr, without tmux. Even without the script:
ExecStart=/bin/sh -c 'cd /home/alpine_sour/rofdl && exec dotnet run' rofdl
This is the simplest.
Separate tmux server
tmux allows you to run more than one server per user. You need -L or -S (see man 1 tmux) to specify a socket, then stick to it. This way your service can run an exclusive tmux server. Advantages:
- The server and everything you run within this tmux belongs to the cgroup of the service by default.
- The service can destroy the tmux server without a risk that anyone (or anything) else loses their sessions. Nobody else should use this server, unless they want to monitor/interact with the service. If anyone uses it for anything else, it's their problem.
The ability to kill the tmux server freely allows you to orphan processes that run in tmux. Consider the following example.
The service file:
[Unit]
Description=dotnet application
[Service]
Type=forking
ExecStart=/home/alpine_sour/scripts/rofdl
Restart=on-failure
User=root
PIDFile=/var/run/rofdl.service.pid
[Install]
WantedBy=multi-user.target
/home/alpine_sour/scripts/rofdl:
#!/bin/sh
tmux="/usr/bin/tmux"
service="rofdl.service"
"$tmux" -L "$service" kill-server 2> /dev/null
"$tmux" -L "$service" new-session -s "rof" -d '
    trap "" HUP
    ppid="$PPID"
    echo "$$" > '" '/var/run/$service.pid' "'
    cd /home/alpine_sour/rofdl && dotnet run
    status="$?"
    '" '$tmux' -L '$service' kill-server 2> /dev/null "'
    while [ "$ppid" -eq "$(ps -o ppid= -p "$$")" ]; do sleep 2; done
    exit "$status"
' || exit 1
Explanation:
- The main script kills the exclusive tmux server (if any) and starts it anew. After the server is started, the script exits. The service remains because there is at least one process left in the cgroup, the said server.
- The server spawns a shell to process the "inner" script. The script begins at ' after -d and ends at ' before ||. It's all quoted, but quoting changes from single- to double-quotes and back a few times. It's because $tmux and $service need to be expanded by the shell processing the main script, other variables (e.g. $status) must not be expanded until in the "inner" shell, inside tmux. The following resource may be helpful: Parameter expansion (variable expansion) and quotes within quotes.
- The shell inside tmux prepares to ignore the HUP signal.
- The shell registers its PID in the pidfile the service expects.
- Then it runs dotnet and stores its exit status (strictly, if cd fails then it will be the exit status of cd).
- The shell kills the tmux server. We could do this with kill "$PPID" as well (see this), but if somebody had killed the server and another process got its PID, we would kill a wrong process. Addressing tmux is safer. Because of the trap the shell survives.
- Then the shell loops until its PPID is different than what it was before. We cannot rely on comparing $ppid to $PPID because the latter is not dynamic; we retrieve the current PPID from ps.
- Now the shell knows it has a new parent, it should be systemd. Only now systemd is able to retrieve the exit status from the shell. The shell exits with the exact exit status retrieved from dotnet earlier. This way systemd gets the exit status despite the fact dotnet was never its child.
Retrieving exit status from common tmux server
Your original approach uses a common (default) tmux server, it only manipulates a session named rof. In general other sessions may exist or arise, so the service should never kill the whole server. There are a few aspects. We should:
- prevent systemd from killing the tmux server, even if the server was started from within the service;
- make systemd consider the dotnet process a part of the service, even if it was started from a tmux server not started from within the service;
- retrieve the exit status from dotnet somehow.
The service file:
[Unit]
Description=dotnet application
[Service]
Type=simple
ExecStart=/home/alpine_sour/scripts/rofdl
Restart=on-failure
User=root
[Install]
WantedBy=multi-user.target
Note it's Type=simple now, because the main script is the only assured child we can retrieve the exit status from. The script needs to find out the exit status of dotnet … and report it as its own.
/home/alpine_sour/scripts/rofdl:
#!/bin/sh
tmux="/usr/bin/tmux"
service="rofdl.service"
slice="/sys/fs/cgroup/systemd/system.slice"
"$tmux" kill-session -t "rof" 2> /dev/null
( sh -c 'echo "$PPID"' > "$slice/tasks"
  exec "$tmux" new-session -s "rof" -d "
      '$tmux' set-option -t 'rof' remain-on-exit on "'
      echo "$$" > '" '$slice/$service/tasks' "'
      cd /home/alpine_sour/rofdl && dotnet run
      exit "$?"
    ' || exit 1
)
pane="$("$tmux" display-message -p -t "rof" "#{pane_id}")"
while sleep 2; do
  [ "$("$tmux" display-message -p -t "$pane" "#{pane_dead}")" -eq 0 ] || {
    status="$("$tmux" display-message -p -t "$pane" "#{pane_dead_status}")"
    status="${status:-255}"
    exit "$status"
  }
done
Explanation:
- If tmux new-session creates a server (because there was none), we want it in another cgroup from the very beginning to prevent a race condition when something else starts using the server and we haven't changed its cgroup yet and systemd decides to kill the service for whatever reason. I tried to run tmux new-session with cgexec and failed; therefore another approach: a subshell which changes its own cgroup (by writing to /sys/fs/cgroup/systemd/system.slice/tasks) and then execs to tmux new-session.
- The shell inside tmux starts by enabling the remain-on-exit option for the session. After it exits, the pane remains and another process (the main script in our case) can retrieve its exit status from the tmux server.
- In the meantime the main script retrieves the unique ID of the pane the other shell runs in. If someone attaches to the session or creates new panes and plays with them, the main script will still be able to find the right pane.
- The shell inside tmux registers its PID in the cgroup associated with the service by writing it to /sys/fs/cgroup/systemd/system.slice/rofdl.service/tasks.
- The shell inside tmux runs dotnet …. After dotnet terminates, the shell exits. The exit status retrieved from dotnet is reported by the shell to the tmux server.
- Because of remain-on-exit on, the pane remains in a dead state after the "inner" shell exits.
- In the meantime the main shell loops until the pane is dead. Then it queries the tmux server for the relevant exit status and reports it as its own. This way systemd gets the exit status from dotnet.
Notes:
- Again there are quotes within quotes.
- Instead of dotnet run it could be exec dotnet run. The last form is nice: dotnet replaces the inner shell, so there is one process instead of two. The problem is when dotnet is killed by a signal it cannot handle. It turns out #{pane_dead_status} will report an empty string if the process in the pane is forcefully killed by a signal. Maintaining a shell between dotnet and tmux prevents this: the shell transforms information (see this question) and returns a number.
  Some shells (implementations?) run the very last command with implicit exec, something we don't want. That's why I used exit "$?" after dotnet ….
  But if the shell itself is forcefully killed, the problem with empty #{pane_dead_status} reappears. As the last resort status="${status:-255}" converts empty status to 255 (although I'm not sure 255 is the best value in such case).
- There's a race condition: when the main script queries tmux for #{pane_id}, it may not be the right pane. If somebody attached and played inside the session after tmux new-session and before tmux display-message, we might get a wrong pane. The time window is small, still this is not as elegant as I wanted.
  If tmux new-session could print #{pane_id} to the console like tmux display-message -p can, there should be no problem. With -PF it can show it within the session. There is no support for -p.
- You may want some logic in case the tmux server gets killed.
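The "shell transforms information" remark in the notes above can be observed directly. A minimal sketch with a hypothetical helper status_seen_by_shell; the 128 + signal number convention is what dash and bash do, other shells may encode it differently:

```shell
#!/bin/sh
# Run a command in a child shell and print the status the parent shell sees.
status_seen_by_shell() {
    sh -c "$1"
    echo "$?"
}

status_seen_by_shell 'exit 7'          # ordinary exit: prints 7
status_seen_by_shell 'kill -KILL $$'   # child killed by SIGKILL: typically prints 137 (128 + 9)
```

The parent shell always ends up with a plain number in $?, which is why keeping a shell between dotnet and tmux yields a usable status even when dotnet dies from a signal.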
Retrieving exit status via file
The above example can be modified, so remain-on-exit on is not needed and #{pane_id} is not needed (race condition avoided, at least the described one).
The service file from the previous example remains.
/home/alpine_sour/scripts/rofdl:
#!/bin/sh
tmux="/usr/bin/tmux"
service="rofdl.service"
slice="/sys/fs/cgroup/systemd/system.slice"
statf="/var/run/$service.status"
rm "$statf" 2>/dev/null
"$tmux" kill-session -t "rof" 2> /dev/null
( sh -c 'echo "$PPID"' > "$slice/tasks"
  exec "$tmux" new-session -s "rof" -d '
      echo "$$" > '" '$slice/$service/tasks' "'
      cd /home/alpine_sour/rofdl && dotnet run
      echo "$?" > '" '$statf.tmp'
      mv '$statf.tmp' '$statf'
    " || exit 1
)
while sleep 2; do
  status="$(cat "$statf" 2>/dev/null)" && exit "$status"
done
The mechanism is pretty straightforward: the main shell removes the old status file (if any), triggers tmux and loops until the file reappears. The "inner" shell writes the exit status of dotnet to the file, when ready.
Notes:
- What if the inner shell is killed? What if the file cannot be created? It's relatively easy to get to a situation where the main script cannot exit the loop.
- Writing to a temporary file and then renaming is a good practice. If we did echo "$?" > "$statf", the file would be created empty, then written to. This might lead to a situation where the main script reads an empty string as status. In general the receiver might get incomplete data: reading until EOF while the sender is mid-write and the file is still about to grow. Renaming makes the right file with the right content appear instantly.
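Both notes can be addressed in a few lines. A minimal sketch, not a drop-in replacement for the script above: publish_status and wait_for_status are hypothetical helper names, and the attempt limit and delay defaults are arbitrary assumptions:

```shell
#!/bin/sh
# Writer side: publish a status atomically (write to a temp name, then rename).
publish_status() {
    # $1 = status value, $2 = final path
    printf '%s\n' "$1" > "$2.tmp" && mv "$2.tmp" "$2"
}

# Reader side: poll for a non-empty file, but give up after a number of attempts.
# Prints the status and returns 0, or returns 1 if the file never appeared.
wait_for_status() {
    # $1 = path, $2 = max attempts (assumed default 30), $3 = delay in seconds (assumed default 2)
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        if [ -s "$1" ]; then
            cat "$1"
            return 0
        fi
        tries=$((tries - 1))
        sleep "${3:-2}"
    done
    return 1
}
```

In the main script one could then write something like status="$(wait_for_status "$statf")" || exit 255, so a missing status file ends the service with a failure instead of looping forever. A fixed attempt limit only makes sense if the program has a known maximum runtime; for a long-running program the loop would rather check whether the inner shell is still alive.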
Final notes
- If you cannot go without tmux, the solution with a separate tmux server seems most robust.
- This is what the documentation says about Restart=:
  In this context, a clean exit means an exit code of 0, or one of the signals SIGHUP, SIGINT, SIGTERM or SIGPIPE, and […]
  Note $? in a shell is just a number. Again: this link. If your dotnet exits because of a signal and restarting depends on (un-)clean exit, the solutions where systemd retrieves the exit code directly from dotnet may behave differently than solutions where systemd retrieves the exit status from an intermediary shell. Research SuccessExitStatus=, it may be useful.