Nagios wrongly reports packet loss
Lately, my Nagios 3.2.3 install (CentOS 5, monitoring ~300 hosts and 1150 services) has started to occasionally report high packet loss on 50-60 hosts at a time. The problem is that it's bogus: manual runs of ping (or Nagios's own check_ping binary) find no fault with any of the affected hosts. The only possible cures I have found so far are:
- run all the checks manually (they will succeed but it may act up again on next check)
- acknowledge and wait for the problem to go away (may take several hours)
I suspect (with no particular evidence other than single rescheduled checks succeeding) that the problem may lie with all the checks being mass-scheduled together, in which case introducing some jitter in the scheduling (how?) might help. Or it may be something completely different.
Ideas, anyone?
Edit:
For people interested in constructive debate (rather than point scoring): I am not trying to measure packet loss. Network performance is not my concern in this instance, and if it were, it would be investigated with the proper tools for the job. Nagios (for the unwary) is mostly used to check upness of host services and to generate alerts. It is therefore highly annoying when it starts generating large numbers of fishy alerts. I am 99.9% positive that the problem is due to either:
- some Nagios/Nagios-Plugin snag
- some system (memory, CPU, I/O, network stack) problem
possibly caused by the burst of requests sent by the Nagios scheduler. The packet losses are all above 50%; if they were real, our phones would be melting. So far I have no evidence for (2), so I am looking for "prior art" in (1). I may well be mistaken in my belief, but if I have to reach for Wireshark or similar, a suggestion on what to look for would be greatly appreciated.
Solution 1:
After you have checked for packet loss with other tools, first find out which plugin is actually reporting it. Usually that plugin is check_ping. Locate the plugin and run it manually at the same interval that is defined in Nagios, then look at its output for clues. The problem does not seem to be real packet loss but a fault in the plugin. Once you have the plugin output, compare it with the output of the other tools to see whether the plugin reports packet loss when the others do not.
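The comparison described above could be scripted roughly as follows. This is only a sketch under some assumptions: the plugin path in CHECK_PING, the warning/critical thresholds, and the helper names (parse_loss, run, compare) are made up for illustration, so adjust them to match your own check_ping command definition. It relies on check_ping printing "Packet loss = N%" and on ping printing "N% packet loss" in its summary line.

```python
import re
import subprocess
import sys

# Assumed plugin location; adjust to wherever your plugins are installed.
CHECK_PING = "/usr/lib64/nagios/plugins/check_ping"

# check_ping prints e.g. "PING OK - Packet loss = 0%, RTA = 0.50 ms"
PLUGIN_LOSS = re.compile(r"Packet loss = (\d+)%")
# ping prints e.g. "5 packets transmitted, 5 received, 0% packet loss, ..."
PING_LOSS = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(pattern, text):
    """Return the packet loss percentage found in text, or None if absent."""
    match = pattern.search(text)
    return float(match.group(1)) if match else None


def run(cmd):
    """Run a command and return its combined output, ignoring exit status."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def compare(host, packets=5):
    """Return (loss seen by check_ping, loss seen by ping) for one host."""
    # Thresholds below are illustrative, not taken from any real config.
    plugin_out = run([CHECK_PING, "-H", host,
                      "-w", "100.0,20%", "-c", "500.0,60%",
                      "-p", str(packets)])
    ping_out = run(["ping", "-c", str(packets), host])
    return parse_loss(PLUGIN_LOSS, plugin_out), parse_loss(PING_LOSS, ping_out)


if __name__ == "__main__":
    # Usage: python compare_loss.py host1 host2 ...
    for host in sys.argv[1:]:
        plugin_loss, ping_loss = compare(host)
        print(f"{host}: check_ping={plugin_loss}% ping={ping_loss}%")
```

Running it as the nagios user against the affected hosts while the bogus alerts are active should show whether the plugin and plain ping disagree at that moment.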