Nagios (Return code of 141 is out of bounds) on random services

We had a similar problem where one service checked via NRPE in a container returned an expected WARNING, and a few minutes later the same service returned CRITICAL with the 141/SIGPIPE error. On subsequent checks it kept alternating: WARNING, then CRITICAL, then WARNING, and so on.

I captured the traffic while the error occurred and found that Nagios issue #305 describes quite precisely what I had observed. The cause appears to be an unclean connection close on the NRPE server side when SSL is in use (SSL_shutdown()). The server then sends a TCP RST to the client, which aborts the client's read and results in the SIGPIPE.
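For anyone wondering how 141 maps to SIGPIPE: assuming the usual shell convention that a process killed by signal N is reported with exit status 128 + N, 141 corresponds to signal 13. The following is only a minimal sketch of that arithmetic; the function name `describe_return_code` is made up for illustration and is not part of Nagios or NRPE.

```python
import signal

# The four return codes Nagios accepts from a plugin.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_return_code(code: int) -> str:
    """Describe a plugin return code (hypothetical helper, for illustration)."""
    if code in NAGIOS_STATES:
        return NAGIOS_STATES[code]
    # Assumption: codes above 128 follow the shell convention 128 + signal number.
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            pass  # not a signal number known on this platform
    return "out of bounds"


print(describe_return_code(141))
```

On Linux, where SIGPIPE is signal 13, this reports that 141 means the process was killed by SIGPIPE, which matches what Nagios shows as "out of bounds".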

Applying the patch nrpe-ssl_shutdown-2.patch (attached to the issue report) to the NRPE source, then rebuilding, reinstalling and restarting NRPE seemed to stop the problem from recurring. Warnings are now reported normally, without the spurious CRITICAL errors.

We had this problem on several occasions; it seems to be caused by the plugin dying unexpectedly.

The actions we took:

Increased the plugin timeout in Nagios to 120 seconds
On some complex Perl plugins, disabled the embedded Perl interpreter (EPN) by adding #nagios:-epn as the 2nd line of the script
Where the check used NRPE, ensured that NRPE was using /dev/urandom so that it would never block for lack of entropy
Set a reasonable command_timeout (30 seconds) in nrpe.cfg
Ensured the Nagios server has sufficient memory/CPU to run all the checks that need to run concurrently

Between them, these seemed to solve the issue.
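The two timeout settings above can be sanity-checked with a few lines of Python. This is a rough sketch under stated assumptions: nrpe.cfg is treated as plain key=value lines with # comments, the helper names `parse_directives` and `timeout_warnings` are hypothetical, and the rule that command_timeout should be shorter than the Nagios-side plugin timeout is my reading of the 30/120 values, not something the steps above state.

```python
def parse_directives(text: str) -> dict[str, str]:
    """Collect key=value directives, skipping blank lines and # comments."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        directives[key.strip()] = value.strip()
    return directives


def timeout_warnings(nrpe_cfg: str, nagios_plugin_timeout: int) -> list[str]:
    """Return human-readable warnings about the NRPE command_timeout."""
    warnings = []
    value = parse_directives(nrpe_cfg).get("command_timeout")
    if value is None:
        warnings.append("command_timeout is not set explicitly")
    elif int(value) >= nagios_plugin_timeout:
        # Assumption: NRPE should give up before Nagios kills the plugin.
        warnings.append(
            f"command_timeout={value} is not below the Nagios timeout "
            f"of {nagios_plugin_timeout}"
        )
    return warnings


# Hypothetical excerpt, not a real configuration file.
sample_cfg = """
# NRPE settings
command_timeout=30
"""
print(timeout_warnings(sample_cfg, 120))
```

With the values from the list (30 on the NRPE side, 120 on the Nagios side) the sketch returns no warnings.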
