Why doesn't MongoDB automatically restart?

It seems like MongoDB 3.6 isn't automatically configured to restart if it crashes. Looking at the systemd service bundled with the latest .deb package for Ubuntu 16.04 LTS, it doesn't seem to have restarts configured:

$ sudo systemctl cat mongod
# /lib/systemd/system/mongod.service
[Unit]
Description=High-performance, schema-free document-oriented database
After=network.target
Documentation=https://docs.mongodb.org/manual

[Service]
User=mongodb
Group=mongodb
ExecStart=/usr/bin/mongod --config /etc/mongod.conf
PIDFile=/var/run/mongodb/mongod.pid
# file size
LimitFSIZE=infinity
# cpu time
LimitCPU=infinity
# virtual memory size
LimitAS=infinity
# open files
LimitNOFILE=64000
# processes/threads
LimitNPROC=64000
# locked memory
LimitMEMLOCK=infinity
# total threads (user+kernel)
TasksMax=infinity
TasksAccounting=false

# Recommended limits for for mongod as specified in
# http://docs.mongodb.org/manual/reference/ulimit/#recommended-settings

[Install]
WantedBy=multi-user.target

Sending SIGKILL and SIGSEGV both kill the process, and it isn't restarted. I'm not sure whether systemd "catches" those signals and deliberately skips the restart, though.
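For reference, this is roughly how I tested it (with mongod running under the packaged unit shown above); the service simply ends up in a failed state and is not started again:

$ sudo pkill -KILL mongod     # or -SEGV to simulate a crash
$ systemctl status mongod     # reports the unit as failed; no new mongod process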

So, a few questions: isn't automatic restart crucial for a high-availability service like a database? It sure seems like it. Is there any reason MongoDB wouldn't have this configured out of the box?


Solution 1:

Unexpected shutdown is definitely a case where admin intervention would be strongly recommended, although you can always change the service default for your deployments.

If the reason for a mongod process shutting down is a condition that cannot be fixed without manual intervention (e.g. lack of disk space or data file corruption), automatic restarts won't help and could potentially make the situation worse. In general, mongod should not shut down on recoverable errors. The MongoDB Server Exception Architecture distinguishes between errors that are fatal to an individual operation and those that are fatal to the entire process. Process-fatal errors are situations where continuing may lead to dire outcomes such as data loss or corrupt data on disk. A user- or O/S-initiated signal to terminate the process (for example, from the Linux Out-of-Memory (OOM) killer) will also cause mongod to shut down.

An example error mentioned in the comments was an index build that segfaulted on some secondaries running an older version of MongoDB. With automatic service restarts, this scenario could lead to an endless loop: a secondary crashes, restarts, resumes the index build, hits the same condition, and crashes again, only to resume the doomed index build on the next restart. While this restart loop is in progress, the intermittent availability of the secondary could affect clients using secondary read preferences, as well as other members of the replica set (for example, by repeatedly seeking on an upstream oplog to resume syncing).
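If you do opt into automatic restarts despite this, systemd's start rate limiting can at least bound that kind of loop. A minimal sketch of an override (the values are purely illustrative, not a MongoDB recommendation):

[Service]
Restart=on-failure
RestartSec=10
# On systemd < 230 (e.g. Ubuntu 16.04) these two belong in [Service];
# newer versions expect them in [Unit] as StartLimitIntervalSec=/StartLimitBurst=
StartLimitInterval=300
StartLimitBurst=3

With those values systemd gives up after three failed starts within five minutes and leaves the unit in a failed state for an administrator to investigate.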

As a system administrator I would prefer to review the MongoDB logs and try to understand why the process shut down so the root cause can be addressed. Ideally a deployment will have sufficient fault tolerance to be able to cope with members being unavailable so there is time to investigate and remedy the situation.
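For example, on an Ubuntu install using the packaged defaults (the log path below comes from the stock /etc/mongod.conf and may differ if systemLog.path was changed), the first things to check would be something like:

$ journalctl -u mongod --since "1 hour ago"    # systemd's view: exit code or terminating signal
$ tail -n 200 /var/log/mongodb/mongod.log      # mongod's own log, including any fatal assertion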

Depending on the nature of the issue and the deployment (standalone, replica set, or sharded cluster), I may also want to take a backup of the data files before attempting any automatic or manual recovery. For example, when restarted after an unclean shutdown, mongod has an initial recovery stage that applies outstanding journal entries and runs storage engine checks such as data file integrity checks on the dbPath. For a standalone server it would be prudent to take a copy of the unmodified data files before any recovery/repair attempt. With a replica set, the data is already duplicated on another member, so if the standard recovery is unsuccessful I would re-sync this member rather than attempting any repair.
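As a rough sketch for a standalone server, assuming the default dbPath of /var/lib/mongodb from the packaged /etc/mongod.conf, that might look like:

$ sudo systemctl stop mongod                                     # make sure mongod is fully stopped
$ sudo cp -a /var/lib/mongodb /var/lib/mongodb.bak-$(date +%F)   # preserve the unmodified data files
$ sudo systemctl start mongod                                    # normal startup applies journal recovery
# only if normal startup fails, and only on a standalone (stop mongod again first):
$ sudo -u mongodb mongod --dbpath /var/lib/mongodb --repair

For a replica set member I would skip the repair step entirely and re-sync from another member instead, as noted above.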

Solution 2:

If you are using systemd, then adding Restart=always under the [Service] section should allow the service to restart after a crash.
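For example, rather than editing the packaged unit file, you could put this in a drop-in override (RestartSec is just an illustrative value):

$ sudo systemctl edit mongod
# add the following to the override file it opens
# (saved as /etc/systemd/system/mongod.service.d/override.conf):
[Service]
Restart=always
# optional: wait 5 seconds between restart attempts
RestartSec=5

Note that Restart=on-failure would restart only after unclean exits (a non-zero exit code or a signal), while Restart=always also restarts after a clean exit; neither restarts the service after an explicit systemctl stop.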

Solution 3:

If you're really concerned about high availability, you'd be running a replica set and could tolerate one or more nodes failing.

Having personally managed a large, sharded MongoDB deployment in production for 5 years, I'd prefer that instances NOT auto-restart, as I'd want to investigate any issue before the node goes back into rotation in the replica set.

https://docs.mongodb.com/manual/core/replica-set-high-availability/