Why is this bash script triggering so many false positives for monitoring memory usage?
I am monitoring hundreds of servers, both dedicated and virtual, using the following script:
#!/bin/bash
PATH=/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin
threshold=90
serverip=$($(which ifconfig) | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1' | head -1)
memused=$(free | awk '/Mem/{printf("RAM Usage: %.2f%\n"), $3/$2*100}' | awk '{print $3}' | cut -d"." -f1)
if [ "$memused" -gt "$threshold" ]
then
CTIME=$(date +%Y-%m-%d-%H%M%S)
ps aux > /root/.example/logs/lowmem-"${CTIME}"-ps.log
top -n 1 -o %MEM -c > /root/.example/logs/lowmem-"${CTIME}"-top.log
free -m > /root/.example/logs/lowmem-"${CTIME}"-free.log
mysqladmin proc -v status > /root/.example/logs/lowmem-"${CTIME}"-mysqlproc.log
bash /example/general/slack.sh "#server-alerts" ":warning: $(hostname) - ${serverip} - Memory Usage has reached 90% - Check logs /root/.example/logs/lowmem-${CTIME} \n \`\`\`$(head -1 /root/.example/logs/lowmem-"${CTIME}"-free.log) \n $(head -2 /root/.example/logs/lowmem-"${CTIME}"-free.log | tail -1) \n $(tail -1 /root/.example/logs/lowmem-"${CTIME}"-free.log)\`\`\`"
crontab -l | grep -v '/example/mon_mem.sh' | crontab -
sleep 900
crontab -l | { cat; echo "* * * * * bash /example/mon_mem.sh"; } | crontab -
fi
While it works in most cases, we are randomly getting false positives. They come from completely random servers and are not consistent on any given server: one server might trigger falsely once and then never do so again.
Example of a false positive:
total used free shared buff/cache available
Mem: 2048 345 1580 27 122 1674
Swap: 2048 0 2048
An alert came in from this server but you can see only 345 MB was in use.
3 problems:
- You are calling free twice: once to trigger the warning and once to build the report. The numbers will have changed in between. Store the output in a variable and read the same data both times.
- "Used" memory should always approach the total amount of memory, and "free" should approach zero. Unused memory is a wasted resource: memory that is not allocated should at least serve as cache. I recommend changing the memused line, which currently divides the third field by the second ($3/$2, i.e. used over total), so that it instead compares the first number (total) against the last (available).
- Your method of message delivery seems to lose formatting. You might want to check your delivery script (slack.sh) so that it renders your input in monospace, or replace tabs and spaces with appropriate spacers.
This is what the table should look like:
              total        used        free      shared  buff/cache   available
Mem:           2048         345        1580          27         122        1674
Swap:          2048           0        2048
The six numbers in the Mem row start with the "total" memory, and if anything, the last number is the one you should care about.
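As a rough illustration of the first two points, the sketch below takes a single snapshot and uses it both for the threshold check and for the report, and it bases the percentage on the "available" column rather than "used". The helper names mem_used_pct and check_memory are made up for this example and are not part of the original script. It also assumes the newer free layout shown above, where "available" is the seventh field of the Mem row; older versions of free do not print that column. To make the sketch runnable anywhere, the snapshot is filled with the figures from the question; on a real server it would be snapshot=$(free -m).

```shell
#!/bin/bash
# Hypothetical helper: given the text of `free` output, print the integer
# percentage of memory that is NOT available. Assumes "available" is field 7
# of the Mem: row (newer procps-ng layout).
mem_used_pct() {
    printf '%s\n' "$1" | awk '/^Mem:/ && $2 > 0 { printf "%d\n", ($2 - $7) / $2 * 100 }'
}

# Hypothetical helper: decide AND report from the same snapshot, so the
# numbers in the alert are the numbers that triggered it.
# Returns 0 (and prints the report) when usage is above the threshold.
check_memory() {
    snap=$1
    limit=$2
    pct=$(mem_used_pct "$snap")
    if [ "${pct:-0}" -gt "$limit" ]; then
        printf 'Memory usage at %s%%\n%s\n' "$pct" "$snap"
        return 0
    fi
    return 1
}

# One snapshot only. On a server this would be: snapshot=$(free -m)
snapshot='              total        used        free      shared  buff/cache   available
Mem:           2048         345        1580          27         122        1674
Swap:          2048           0        2048'

if check_memory "$snapshot" 90; then
    echo "alert would be sent"
else
    echo "below threshold: $(mem_used_pct "$snapshot")% in use"
fi
```

With the figures from the question this prints "below threshold: 18% in use", since (2048 - 1674) / 2048 is about 18%. The same snapshot variable can then be written to the log file and passed to slack.sh, instead of re-reading a log produced by a second call to free.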