Nagios (Return code of 141 is out of bounds) on random services
We had a similar problem where one service, checked via NRPE in a container, returned the expected WARNING; a few minutes later the same service returned CRITICAL with the 141/SIGPIPE error. On the next check it returned WARNING, then CRITICAL, then WARNING, and so on.
I captured the traffic while the error occurred and found that Nagios issue #305 describes quite precisely what I had observed. It seems to be caused by an unclean connection close on the NRPE server side when SSL is used (SSL_shutdown()): the server sends a TCP RST to the client, which aborts the read and thus causes the SIGPIPE.
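To make the "141" in the error message less opaque, here is a minimal Python sketch that decodes a plugin return code. It assumes the common shell convention that a process killed by signal N is reported as 128 + N, and that Nagios accepts only 0 to 3 as plugin return codes; the helper name describe_return_code is made up for this illustration.

```python
import signal

# Return codes that Nagios treats as valid plugin states.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_return_code(code: int) -> str:
    """Explain a plugin return code (hypothetical helper, for illustration only)."""
    if code in NAGIOS_STATES:
        return NAGIOS_STATES[code]
    if code > 128:
        # Assumption: 128 + N means "terminated by signal N" (shell convention).
        try:
            name = signal.Signals(code - 128).name
            return f"out of bounds: terminated by {name}"
        except ValueError:
            pass  # no signal with that number on this platform
    return "out of bounds: unknown cause"


print(describe_return_code(141))
```

On Linux, 141 - 128 = 13, which is the signal number of SIGPIPE, so the message matches the aborted read described above.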
Applying the patch nrpe-ssl_shutdown-2.patch attached to the issue report to the NRPE source, then rebuilding, reinstalling and restarting NRPE, seems to have stopped the problem from recurring; warnings are now reported normally, without the spurious critical errors.
We had this problem on several occasions; it seems to be caused by the plugin dying unexpectedly.
The actions we took:
- Increased the plugin timeout in Nagios to 120
- On some complex Perl plugins, disabled the EPN (add #nagios:-epn as the 2nd line of the script)
- Where a check used NRPE, ensured that NRPE was using /dev/urandom so that it would never block for lack of entropy
- Set a reasonable command_timeout (30 sec) in nrpe.cfg
- Ensured that the Nagios server has sufficient memory/CPU to run all the checks that need to run concurrently
Between them, these seemed to solve the issue.
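To illustrate why the timeouts matter, here is a rough Python sketch of a wrapper that runs a check command with a hard timeout and reports UNKNOWN instead of passing an out-of-bounds return code on to Nagios. This is not part of Nagios or NRPE: the function name run_check and the demo command are hypothetical, and only the 30-second value is taken from the list above.

```python
import subprocess
import sys

UNKNOWN = 3  # Nagios state used here for "could not determine a result"


def run_check(command, timeout_seconds):
    """Run a plugin command; return (state, message). Hypothetical helper."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this exception is raised.
        return UNKNOWN, f"check timed out after {timeout_seconds}s"

    if 0 <= result.returncode <= 3:
        return result.returncode, result.stdout.strip()

    # Negative values mean "killed by a signal" on POSIX; anything above 3
    # (such as 141) is outside the range Nagios accepts.
    return UNKNOWN, f"plugin exited abnormally (return code {result.returncode})"


if __name__ == "__main__":
    # Demo command that stands in for a real plugin.
    demo = [sys.executable, "-c", "print('WARNING - demo'); raise SystemExit(1)"]
    print(run_check(demo, timeout_seconds=30))
```

A wrapper like this only hides the symptom; the steps listed above are what addressed the plugins dying in the first place.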