Nagios (Return code of 141 is out of bounds) on random services

We had a similar problem: a service checked via NRPE in a container returned the expected WARNING, then a few minutes later the same service returned CRITICAL with the 141/SIGPIPE error. On subsequent checks it kept alternating between WARNING and CRITICAL.

I captured the traffic while the error occurred and found that Nagios issue #305 describes quite precisely what I had observed. The cause seems to be an unclean connection close on the NRPE server side when SSL is in use (SSL_shutdown()): the server sends a TCP RST to the client, which aborts the client's read and thus produces the SIGPIPE.
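
For anyone wondering where the number 141 comes from: shells conventionally report a process killed by a signal as 128 plus the signal number, and SIGPIPE is signal 13 on Linux, while Nagios only accepts plugin return codes 0 to 3. A minimal sketch of that decoding follows; the function name describe_return_code is made up for illustration and is not part of Nagios or NRPE.

```python
import signal

# Standard Nagios plugin return codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_return_code(code):
    """Explain a plugin return code (hypothetical helper, not a Nagios API)."""
    if code in NAGIOS_STATES:
        return NAGIOS_STATES[code]
    if code > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            pass  # not a signal number known on this platform
    return "out of bounds"


print(describe_return_code(141))
```

On Linux this prints "killed by SIGPIPE", which matches the aborted read described above.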

Applying the patch nrpe-ssl_shutdown-2.patch attached to the issue report to the NRPE source, then rebuilding, reinstalling and restarting NRPE, seems to have stopped the problem from recurring. Warnings are now reported normally, without the spurious CRITICAL errors.


We had this problem on several occasions; it seems to be caused by the plugin dying unexpectedly.

The actions we took:

  1. Increased the plugin timeout in Nagios to 120 seconds.
  2. Disabled the EPN (embedded Perl interpreter) for some complex Perl plugins (add # nagios: -epn as the 2nd line of the script).
  3. Where a check used NRPE, ensured that NRPE was using /dev/urandom so that it would never block for lack of entropy.
  4. Set a reasonable command_timeout (30 seconds) in nrpe.cfg.
  5. Ensured the Nagios server has sufficient memory and CPU to run all the checks that need to run concurrently.

Between them, these seemed to solve the issue.
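
If you want to see which plugin is dying, one option is to run it through a small wrapper that enforces a timeout and always hands back a status in the 0 to 3 range. This is only a sketch of the idea, not something we ran in production: run_check and its 30 second default are assumptions made for illustration, and the child commands below are stand-ins for real plugins.

```python
import subprocess
import sys

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(argv, timeout=30):
    """Run a plugin and return (status, message), with status always 0-3.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return UNKNOWN, f"plugin timed out after {timeout}s"
    except OSError as exc:
        return UNKNOWN, f"could not start plugin: {exc}"

    rc = proc.returncode
    if rc < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return UNKNOWN, f"plugin killed by signal {-rc}"
    if rc not in (OK, WARNING, CRITICAL, UNKNOWN):
        return UNKNOWN, f"unexpected exit code {rc}"
    return rc, proc.stdout.strip()


if __name__ == "__main__":
    # Stand-in "plugin" that reports a WARNING and exits with code 1.
    fake_plugin = [sys.executable, "-c",
                   "import sys; print('disk WARNING'); sys.exit(1)"]
    print(run_check(fake_plugin))
```

A plugin that dies from a signal or hangs then shows up as UNKNOWN with a readable reason instead of an out-of-bounds return code, which makes it easier to tell which of the actions above is relevant.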