Why is this bash script triggering so many false positives for monitoring memory usage?

I am monitoring hundreds of servers both dedicated and virtual using the following script:

#!/bin/bash

PATH=/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin

threshold=90

serverip=$($(which ifconfig) | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1' | head -1)
memused=$(free | awk '/Mem/{printf("RAM Usage: %.2f%\n"), $3/$2*100}' |  awk '{print $3}' | cut -d"." -f1)

if [ "$memused" -gt "$threshold" ]
then
    CTIME=$(date +%Y-%m-%d-%H%M%S)
    ps aux > /root/.example/logs/lowmem-"${CTIME}"-ps.log
    top -n 1 -o %MEM -c > /root/.example/logs/lowmem-"${CTIME}"-top.log
    free -m > /root/.example/logs/lowmem-"${CTIME}"-free.log
    mysqladmin proc -v status > /root/.example/logs/lowmem-"${CTIME}"-mysqlproc.log
    bash /example/general/slack.sh "#server-alerts" ":warning: $(hostname) -  ${serverip} - Memory Usage has reached 90% - Check logs /root/.example/logs/lowmem-${CTIME} \n \`\`\`$(head -1 /root/.example/logs/lowmem-"${CTIME}"-free.log) \n $(head -2 /root/.example/logs/lowmem-"${CTIME}"-free.log | tail -1) \n $(tail -1 /root/.example/logs/lowmem-"${CTIME}"-free.log)\`\`\`"
    crontab -l | grep -v '/example/mon_mem.sh' | crontab -
    sleep 900
    crontab -l | { cat; echo "* * * * * bash /example/mon_mem.sh"; } | crontab -
fi

While it works in most cases, we randomly get false positives. They come from completely random servers and are not consistent per server: one server might falsely trigger once and then never trigger again.

Example of a false positive:

total used free shared buff/cache available 
Mem: 2048 345 1580 27 122 1674 
Swap: 2048 0 2048

An alert came in from this server but you can see only 345 MB was in use.


3 problems:

  1. You are calling free twice: once for triggering the warning, once for sending the report. The numbers will have changed in between. Store the output in a variable and use that same data for both the check and the report.

  2. "Used" memory should approach the total amount of memory, and "free" should approach zero, always. If you have unused memory, that means you have wasted resources that should, while not allocated, at least serve as caches.

    I recommend you change the memused line, which currently divides the "used" column by the "total" column ($3/$2), to instead compare "total" (the first number, $2 in awk) against "available" (the last number, $7 in the output you posted).

  3. Your method of message delivery seems to lose formatting. You might want to check that your delivery method (slack.sh) renders its input in monospace, or replace the tabs and spaces with appropriate spacers.

    This is what the table should look like:

                  total        used        free      shared  buff/cache   available
    Mem:           2048         345        1580          27         122        1674
    Swap:          2048           0        2048

    The Mem row has six numbers, starting with the "total" memory; if anything, the last one ("available") is the one you should care about.