How to monitor and log the memory/cpu usage of processes over time? [closed]

Solution 1:

If you want just the top offenders, consider running top in batch mode with a relatively long interval (60 seconds or more). You may need more than one top running to capture the top offenders on multiple resources. I have configured systems to run top for a few cycles whenever a resource was being overused.
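As a rough sketch of the batch-mode idea, the Python below builds a top command line and pulls the process rows out of its output. It assumes the procps-ng top found on most Linux systems (`-b`, `-d`, `-n`, and `-o` for the sort field); the function names, the log path, and the default of 3 cycles are my own illustrative choices, not anything from a particular setup.

```python
import subprocess


def top_batch_command(interval_s=60, cycles=3, sort_field=None):
    """Build a `top` command line for batch mode (assumes procps-ng top)."""
    # -b: batch mode, -d: seconds between updates, -n: number of updates
    cmd = ["top", "-b", "-d", str(interval_s), "-n", str(cycles)]
    if sort_field:
        # -o picks the sort column, e.g. "%MEM" instead of the default %CPU
        cmd += ["-o", sort_field]
    return cmd


def parse_top_rows(text, limit=5):
    """Return up to `limit` process rows from top batch output as dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "PID":
            header = fields  # column names of the process table
            continue
        if header is None or not fields[0].isdigit():
            continue  # summary lines (load average, Tasks, %Cpu, ...)
        # COMMAND is the last column and may contain spaces
        values = line.strip().split(None, len(header) - 1)
        if len(values) < len(header):
            continue
        rows.append(dict(zip(header, values)))
        if len(rows) == limit:
            break
    return rows


def log_top(path, interval_s=60, cycles=3, sort_field=None):
    """Append the raw output of a few top cycles to a log file."""
    with open(path, "a") as log:
        subprocess.run(
            top_batch_command(interval_s, cycles, sort_field),
            stdout=log,
            check=True,
        )
```

Calling `log_top("/var/log/top-cpu.log")` and `log_top("/var/log/top-mem.log", sort_field="%MEM")` from two separate jobs would be one way to cover more than one resource; both paths are placeholders.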

Consider running sar in batch mode to capture resource utilization. I realize this is server-based, but it is useful for determining the times when problems are occurring.

Run munin and enable notifications. This may give you a chance to get in and watch the server going down. You may be able to correct the problem before it goes down.

For memory leaks, a steady increase in swap usage indicates a problem. I once watched a server slowly die over a period of days. The problem service was a program monitoring other processes for memory leaks. The system admin kept insisting the increasing swap usage was not a problem, right up until the server stopped responding.
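To make the "steady increase in swap usage" check concrete, here is a minimal sketch that computes swap in use from the text of Linux's /proc/meminfo and tests whether a series of samples only ever grows. The function names and the `min_growth_kb` threshold are hypothetical; how often to sample and how much growth counts as a problem are left to you.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) given the contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values["SwapTotal"] - values["SwapFree"]


def steadily_increasing(samples, min_growth_kb=0):
    """True if the samples never drop and grow by more than min_growth_kb."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) > min_growth_kb
```

A periodic job could read /proc/meminfo, append `swap_used_kb(...)` to a list or file, and raise a flag once `steadily_increasing` holds over, say, a day's worth of samples.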

You may find that cfengine's anomaly detection can be used to trigger a script that captures the system state when things go wrong. You may want a lot of information besides just the processes using the most resources. For a sudden influx of usage, you may want a list of network connections (by address, not name). Memory usage is also useful.
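A capture script of that kind might look roughly like the sketch below. It is not tied to cfengine in any way; it just runs a few commands and saves their output under a timestamped directory. The command set (`ps aux`, `ss -tan` for numeric TCP connections, `free -m`) is an assumption about what is installed on a typical Linux box, and the function name and directory layout are made up for illustration.

```python
import subprocess
import time
from pathlib import Path

# Assumed command set; adjust to whatever tools the server actually has.
SNAPSHOT_COMMANDS = {
    "processes": ["ps", "aux"],
    "connections": ["ss", "-tan"],  # -n keeps addresses numeric (no DNS)
    "memory": ["free", "-m"],
}


def capture_state(out_dir, commands=SNAPSHOT_COMMANDS):
    """Run each command and save its output under out_dir/<timestamp>/."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(out_dir) / stamp
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in commands.items():
        path = target / f"{name}.txt"
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=30
            )
            path.write_text(result.stdout + result.stderr)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole snapshot
            path.write_text(f"failed to run {argv}: {exc}\n")
        written.append(path)
    return written
```

Each command is given a timeout and failures are written to the output file, so one missing or hung tool does not stop the rest of the snapshot from being taken on an already struggling server.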

Solution 2:

sysstat is made pretty much exactly for your kind of purpose.

Solution 3:

I've used atop before:

http://freshmeat.net/projects/atop/

"Atop is an ASCII full-screen performance monitor that is capable of reporting the activity of all processes (even if processes have finished during the interval), daily logging of system and process activity for long-term analysis, highlighting overloaded system resources by using colors, etc. At regular intervals, it shows system-level activity related to the CPU, memory, swap, disks, and network layers, and for every active process it shows the CPU utilization, the memory growth, priority, username, state, and exit code."

Solution 4:

Have you tried collectd?
It's very powerful and customizable.
It has a lot of plugins and can be integrated with Nagios.

http://collectd.org/features.shtml

Solution 5:

nmon is a great tool that does what you're looking for. It was developed for AIX and Linux, produces a ton of detailed output, and makes it easy to put that output into reports. If you search for it, there is an IBM wiki with a bunch of documentation and additional utilities for parsing the data.