Which metric should I use to determine when a server is low on memory?
There are numerous (hundreds?) ways to measure memory usage on a Linux machine, but what is a good heuristic/metric to help determine whether a server needs more memory?
Some ideas:
- Looking at MemTotal - Active - Inactive from /proc/meminfo as a measure of "wired" memory
- Looking at the sum of RSS values from all processes in ps
- Looking at Committed_AS in /proc/meminfo
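For reference, here is a small Python sketch (my own illustration, not an established tool) that gathers each of these candidate numbers from /proc/meminfo and ps:

#!/usr/bin/env python3
# Illustrative sketch: collect the candidate metrics listed above.
import subprocess

def meminfo_kb():
    # Parse /proc/meminfo into a dict of {field: value in kB}.
    fields = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, _, rest = line.partition(':')
            fields[key] = int(rest.split()[0])
    return fields

mi = meminfo_kb()
wired_kb = mi['MemTotal'] - mi['Active'] - mi['Inactive']

# Sum the RSS column (kB) over all processes reported by ps.
ps = subprocess.run(['ps', '-eo', 'rss='], capture_output=True, text=True)
total_rss_kb = sum(int(v) for v in ps.stdout.split())

print("'wired' (MemTotal - Active - Inactive):", wired_kb, "kB")
print("sum of RSS over all processes:", total_rss_kb, "kB")
print("Committed_AS:", mi['Committed_AS'], "kB")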
Linux kernel 4.20 added PSI, which stands for "pressure stall information". It gives you more insight into why a machine is overloaded and which resource is the bottleneck.
There are three new files under /proc/pressure:
/proc/pressure/cpu
/proc/pressure/memory
/proc/pressure/io
To quote from Tracking pressure-stall information concerning /proc/pressure/memory:
Its output looks like:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The some line is similar to the CPU information: it tracks the percentage of the time that at least one process could be running if it weren't waiting for memory resources. In particular, the time spent for swapping in, refaulting pages from the page cache, and performing direct reclaim is tracked in this way. It is, thus, a good indicator of when the system is thrashing due to a lack of memory.

The full line is a little different: it tracks the time that nobody is able to use the CPU for actual work due to memory pressure. If all processes are waiting for paging I/O, the CPU may look idle, but that's not because of a lack of work to do. If those processes are performing memory reclaim, the end result is nearly the same; the CPU is busy, but it's not doing the work that the computer is there to do. If the full numbers are much above zero, it's clear that the system lacks the memory it needs to support the current workload.
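Parsing these files in a monitoring script is straightforward; here is a minimal sketch of my own, assuming the two-line some/full format shown above:

#!/usr/bin/env python3
# Minimal sketch: parse /proc/pressure/memory into a nested dict.
def read_memory_pressure(path='/proc/pressure/memory'):
    pressure = {}
    with open(path) as f:
        for line in f:
            kind, *pairs = line.split()              # 'some' or 'full'
            stats = dict(pair.split('=') for pair in pairs)
            pressure[kind] = {k: float(v) for k, v in stats.items()}
    return pressure

print(read_memory_pressure())
# e.g. {'some': {'avg10': 70.24, ...}, 'full': {'avg10': 57.59, ...}}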
I don't have access to a production server running Linux 4.20 yet, but here is a small experiment on my desktop (which has no swap configured). Initially, there is no memory pressure at all (all counters are 0):
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Then I increased the memory usage until I eventually ran out of memory, which froze the machine until the OOM killer killed some processes. Before it froze, the memory pressure increased:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some avg10=0.00 avg60=0.00 avg300=0.00 total=47047
full avg10=0.00 avg60=0.00 avg300=0.00 total=32839
some avg10=0.00 avg60=0.00 avg300=0.00 total=116425
full avg10=0.00 avg60=0.00 avg300=0.00 total=81497
some avg10=1.26 avg60=0.22 avg300=0.04 total=183863
full avg10=0.72 avg60=0.13 avg300=0.02 total=127684
Now, after the system has recovered, the memory pressure is again 0, and the total counters no longer increase:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.07 total=53910568
full avg10=0.00 avg60=0.00 avg300=0.02 total=27766222
...
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.05 total=53910568
full avg10=0.00 avg60=0.00 avg300=0.00 total=27766222
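Once PSI is available, a simple watchdog could poll the file and warn when pressure builds up. A rough sketch - the threshold here is an arbitrary illustrative value, not a recommendation:

#!/usr/bin/env python3
# Sketch: warn when the 10-second 'full' average exceeds a made-up threshold.
import time

FULL_AVG10_THRESHOLD = 5.0   # percent; arbitrary illustrative value

while True:
    with open('/proc/pressure/memory') as f:
        for line in f:
            fields = line.split()
            if fields[0] == 'full':
                avg10 = float(fields[1].split('=')[1])
                if avg10 > FULL_AVG10_THRESHOLD:
                    print("memory pressure high: full avg10=%.2f" % avg10)
    time.sleep(10)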
There's no right answer to this.
Peter is correct in saying that the values you need to be looking at are reported by top and free (you can read the source of the procps package to see how these values are obtained from C, but for scripts it's simpler to just run free).

If the system has unused memory (shown in the first line of output from free), then it's unlikely to get much faster by adding more memory - but it might get faster by reducing the VFS cache pressure (i.e. keeping stuff in the cache longer).

Although there's no right answer, there are lots of wrong ones - from userspace you can't tell which pages are shared but accessed via different mappings, so looking at per-process memory usage to determine how much memory is free just doesn't work.

As a starting point, then, you should be looking at the two values for free memory reported by free.
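To make that concrete, here is a small sketch (mine, and it assumes a modern procps-ng free whose Mem: line ends with an available column) that pulls both values out:

#!/usr/bin/env python3
# Sketch: read the 'free' and 'available' columns from free(1).
import subprocess

out = subprocess.run(['free', '-k'], capture_output=True, text=True).stdout
for line in out.splitlines():
    if line.startswith('Mem:'):
        # Columns: total used free shared buff/cache available
        fields = line.split()
        print("free:", fields[3], "kB")
        print("available:", fields[6], "kB")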
As I have said before, the best measure of real-time memory requirements is to observe the Committed_AS field in /proc/meminfo and compare it over time to see how much memory you need.

Theoretically, if your Committed_AS always stays below (MemFree + SwapFree), you are fine. But if it exceeds that and your workload keeps accumulating over time, you are approaching an OOM situation. The Committed_AS value estimates how much memory the system would need if all the memory requests it has granted were used at this very instant.

Monitoring it over time is a good way to see whether you need to increase RAM or decrease the workload.
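A minimal sketch of such monitoring (the field names come from /proc/meminfo; the one-minute interval is an arbitrary choice):

#!/usr/bin/env python3
# Sketch: log Committed_AS against MemFree + SwapFree once a minute.
import time

def meminfo_kb():
    fields = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, _, rest = line.partition(':')
            fields[key] = int(rest.split()[0])
    return fields

while True:
    mi = meminfo_kb()
    headroom = mi['MemFree'] + mi['SwapFree'] - mi['Committed_AS']
    print("Committed_AS=%d kB, MemFree+SwapFree=%d kB, headroom=%d kB"
          % (mi['Committed_AS'], mi['MemFree'] + mi['SwapFree'], headroom))
    time.sleep(60)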
Really, it all depends on the application(s). However, you can use the method the kernel itself employs to determine memory pressure, which should give you a general overview of the host's capability to manage its memory.

Memory pressure is ideal because it lets you avoid worrying about the page cache, swappiness, or even how much memory you actually have.
Memory pressure is effectively a count of how many pages want to be marked active as per /proc/meminfo. The kernel measures memory pressure by keeping track of how many pages go from 'inactive' to 'active' in the page table. A lot of shifting between these two statuses indicates you probably do not have a lot of spare memory available to make more pages active.
Low memory pressure is indicated by having very few promotions from inactive to active (because the kernel clearly has enough space to make active pages stay active).
The script below measures pressure every PERIODIC seconds. The more data you can collect, the better. The idea is to graph the data with 0 at the centre of the Y axis. In ideal circumstances the graph should be a horizontal line hugging 0. If the lines regularly spike away from 0 (particularly 'Active' going positive, or spiking quite high regularly), memory pressure on the host is high and more memory would be beneficial.
#!/usr/bin/env python3
import re
import time

PERIODIC = 1  # sampling interval in seconds

# Match the adjacent Active:/Inactive: lines of /proc/meminfo (values in kB).
pgs = re.compile(r'Active:\s+([0-9]+) kB\nInactive:\s+([0-9]+) kB')

meminfo = open('/proc/meminfo')

def read_meminfo():
    content = meminfo.read(4096)
    m = pgs.search(content)
    active, inactive = int(m.group(1)), int(m.group(2))
    # Convert kB to 4 kB pages.
    active //= 4
    inactive //= 4
    meminfo.seek(0, 0)
    return active, inactive

if __name__ == "__main__":
    oldac, oldin = read_meminfo()
    while True:
        time.sleep(PERIODIC)
        active, inactive = read_meminfo()
        print("Inactive Pressure:\t%d" % (inactive - oldin))
        print("Active Pressure:\t%d" % (active - oldac))
        oldac = active
        oldin = inactive