How to find the process(es) which are hogging the machine

Scenario: All of a sudden, my computer feels sluggish. The mouse moves, but windows take ages to open, etc. uptime says the load is 7.69 and rising.

What is the fastest way to find out which process(es) are the cause of the load?

Now, "top" and similar tools aren't the answer because they show either CPU or memory usage, but not both at the same time. What I need is a single command that I can type as it happens, something that will figure out any of:

System is trying to swap 8GB of RAM to disk because process X ...

or

process X seeks all over the disk

or

process X uses 400% CPU

So what I'm looking for is iostat, htop/atop and similar tools rolled into one, with an output like this:

 1235 cp - Disk thrashing
   87 chrome - Uses 2 GB of RAM
  137 nfs_bench - Uses 95% of the network bandwidth

I don't want a tool that gives me some numbers which I can analyze, but a tool that tells me exactly which process causes the current load. Assume that the user in front of the keyboard barely knows how to spell "process" and is quickly overwhelmed when it comes to "resident size", "virtual memory" or "process life cycle".

My argument goes like this: A user notices a problem. There can be thousands of reasons ... well, almost :-) The user wants to know the source of the problem.

The current solutions give me lots of numbers, and I need to know what these numbers mean. What I'm looking for is a meta tool. 99% of the data is irrelevant to the problem. So what the tool should do is look for processes which hog some resource and list only those along with "this process needs a lot of CPU, this produces many IRQs, this process allocates a lot of RAM (and it's still growing)".

This will be a relatively short list. It will be much simpler for someone new to this to locate the culprit from this list than from the output of, say, htop, which gives me about 5000 numbers but requires me to fold multi-threaded processes myself. (I have 50 lines which say VIRT 2750M but only 16 GB of RAM. Taken at face value, the machine ought to swap itself to death, but of course this is a misinterpretation of the data, and one that is easy to make.)
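To make the idea of such a meta tool concrete, here is a minimal sketch in Python. It is not an existing tool: the ProcSample record, the find_hogs and format_report functions, and all three thresholds are hypothetical names and values chosen for illustration. It assumes the per-process numbers have already been collected by some other means, and only shows the filtering step: drop everything that is not hogging a resource and describe the rest in plain words.

```python
from dataclasses import dataclass


@dataclass
class ProcSample:
    """One measurement of one process (hypothetical record for this sketch)."""
    pid: int
    name: str
    cpu_percent: float       # may exceed 100 on multi-core machines
    rss_bytes: int           # memory actually resident in RAM
    io_bytes_per_sec: float  # disk reads plus writes


# Assumed thresholds; sensible values depend on the machine.
CPU_HOG_PERCENT = 90.0
RAM_HOG_FRACTION = 0.10
IO_HOG_BYTES_PER_SEC = 50 * 2**20


def find_hogs(samples, total_ram_bytes):
    """Return (pid, name, reason) only for processes that hog a resource."""
    findings = []
    for s in samples:
        if s.cpu_percent >= CPU_HOG_PERCENT:
            findings.append((s.pid, s.name, f"uses {s.cpu_percent:.0f}% CPU"))
        if s.rss_bytes >= RAM_HOG_FRACTION * total_ram_bytes:
            gib = s.rss_bytes / 2**30
            findings.append((s.pid, s.name, f"uses {gib:.1f} GB of RAM"))
        if s.io_bytes_per_sec >= IO_HOG_BYTES_PER_SEC:
            mib = s.io_bytes_per_sec / 2**20
            findings.append((s.pid, s.name, f"heavy disk I/O ({mib:.0f} MB/s)"))
    return findings


def format_report(findings):
    """Render the findings in the 'PID name - reason' layout asked for above."""
    return "\n".join(f"{pid:>5} {name} - {reason}" for pid, name, reason in findings)


if __name__ == "__main__":
    demo = [
        ProcSample(1235, "cp", 5.0, 10 * 2**20, 120 * 2**20),
        ProcSample(87, "chrome", 12.0, 2 * 2**30, 0.0),
        ProcSample(900, "bash", 0.1, 5 * 2**20, 0.0),
    ]
    print(format_report(find_hogs(demo, 16 * 2**30)))
```

The sketch deliberately uses resident size rather than VIRT, which sidesteps the misreading described above.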


Solution 1:

I do have to smile at the responses because each told you to run tool X. The only problem is if what you're seeing is intermittent there will be no way to correlate anything. A tool like sar can help if you run it at a high enough frequency, but I'd claim collectl is even better.

Like sar, you run it as a daemon by installing the RPM and doing /etc/init.d/collectl start.

Now when you see something sluggish, collectl -p /var/log/collectl/filename --top will play back the data and show you the top processes. You could also just run collectl --top and see them in real time. BTW, anything you can do in real time you can play back as well.

As for CPU load, what if you ARE getting overloaded with interrupts? collectl -sC will not only show the loads on individual CPUs (or use -sc for average load), it will show how they're spending their time. Include -j (-scj) and you'll see the number of interrupts/CPU. Use uppercase -J and you'll see the TYPES of each interrupt/CPU.

Of course, if you really like vmstat, you can always play back collectl data with --vmstat and it will show historical data in vmstat format.

There are far more switches than I have time to list, but you can check it out at SourceForge or just google it.

Solution 2:

"top" works reasonably well, as long as you look at the right numbers. Let's see:

top - 13:11:45 up 13 days,  1:13, 21 users,  load average: 0.06, 0.11, 0.26
Tasks: 271 total,   2 running, 267 sleeping,   0 stopped,   2 zombie
Cpu(s): 19.0%us,  6.3%sy,  0.0%ni, 74.0%id,  0.5%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   8183668k total,  8002712k used,   180956k free,    12476k buffers
Swap: 11847900k total,   723480k used, 11124420k free,   767016k cached

Now, if the system is slow because the CPU is fully used, the "us" and "sy" fields on the "Cpu(s):" row together add up to nearly 100%.

If it's slow due to swapping, "Mem:" "free" shows very low values and "Swap:" "used" high values.

If it's slow due to I/O in general, then "Cpu(s):" "wa" tells that time is spent on I/O wait.

Now, if you know I/O waits are the problem, you can use a program such as "iotop" to find out which processes create the most I/O.
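The three rules above can be sketched as a small Python script that reads top's summary lines and names the likely bottleneck. This is only an illustration: the function names and all four thresholds are assumptions, and the parser targets the header layout shown in this answer (newer versions of top print these lines differently).

```python
import re

# Header copied from the output shown above.
SAMPLE_HEADER = """\
top - 13:11:45 up 13 days,  1:13, 21 users,  load average: 0.06, 0.11, 0.26
Tasks: 271 total,   2 running, 267 sleeping,   0 stopped,   2 zombie
Cpu(s): 19.0%us,  6.3%sy,  0.0%ni, 74.0%id,  0.5%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   8183668k total,  8002712k used,   180956k free,    12476k buffers
Swap: 11847900k total,   723480k used, 11124420k free,   767016k cached
"""

# Assumed thresholds; tune them for your own machine.
CPU_BUSY_PERCENT = 90.0
IO_WAIT_PERCENT = 20.0
LOW_FREE_FRACTION = 0.05
HIGH_SWAP_FRACTION = 0.25


def parse_top_header(text):
    """Extract the Cpu(s), Mem and Swap fields into a flat dict."""
    stats = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "Cpu(s):" in line:
            pattern = r"(\d+(?:\.\d+)?)\s*%?\s*(us|sy|ni|id|wa|hi|si|st)\b"
            for value, key in re.findall(pattern, line):
                stats["cpu_" + key] = float(value)
        elif line.startswith(("Mem:", "Swap:")):
            prefix = line.split(":")[0].lower()
            for value, key in re.findall(r"(\d+)k\s+(\w+)", line):
                stats[f"{prefix}_{key}_kb"] = int(value)
    return stats


def diagnose(stats):
    """Apply the three rules: CPU saturation, I/O wait, swapping."""
    causes = []
    if stats.get("cpu_us", 0.0) + stats.get("cpu_sy", 0.0) >= CPU_BUSY_PERCENT:
        causes.append("CPU is saturated")
    if stats.get("cpu_wa", 0.0) >= IO_WAIT_PERCENT:
        causes.append("time is going to I/O wait")
    mem_total = stats.get("mem_total_kb", 0)
    swap_total = stats.get("swap_total_kb", 0)
    if mem_total and swap_total:
        free_fraction = stats.get("mem_free_kb", 0) / mem_total
        swap_fraction = stats.get("swap_used_kb", 0) / swap_total
        if free_fraction <= LOW_FREE_FRACTION and swap_fraction >= HIGH_SWAP_FRACTION:
            causes.append("likely swapping")
    return causes


if __name__ == "__main__":
    print(diagnose(parse_top_header(SAMPLE_HEADER)) or "no obvious bottleneck")
```

For the sample header the list of causes is empty, which matches its load average of 0.06: free memory is low, but little swap is in use and the CPU is mostly idle.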

Solution 3:

Based on the 400% usage, I'll assume that you have a quad-core processor. Your load average is almost double that capacity, so roughly half of the processes that want to run are waiting for a CPU.
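The arithmetic behind that estimate, as a short Python sketch. The function names are made up for illustration, and the four cores are the assumption stated above, not a measured value. Note also that on Linux the load average counts tasks blocked in uninterruptible I/O as well as runnable ones, so a high ratio does not by itself prove the CPU is the bottleneck.

```python
import os


def load_per_core(load_average, cores):
    """Load average divided by core count; above 1.0 means tasks are queuing."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def describe_load(load_average, cores):
    """Turn the ratio into a one-line, plain-language summary."""
    ratio = load_per_core(load_average, cores)
    if ratio <= 1.0:
        return f"{ratio:.2f}x capacity: within what {cores} cores can handle"
    waiting = load_average - cores
    return f"{ratio:.2f}x capacity: about {waiting:.2f} tasks waiting on average"


if __name__ == "__main__":
    # The question's figure, with the quad-core assumption from this answer.
    print(describe_load(7.69, 4))
    # The same check against the live system (Unix only).
    one_minute, _, _ = os.getloadavg()
    print(describe_load(one_minute, os.cpu_count() or 1))
```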

First, renice your shell to 0 or -10 to get a more responsive system, and then use htop to find the offending process(es) and follow that with strace on a given process. Other tools that could be useful are:

  • vmstat
  • sar
  • iostat
  • pmap

Solution 4:

A sluggish mouse could also be due to an excessive interrupt load, or to USB controllers being very busy (I assume it's a USB mouse).
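One way to check the interrupt theory on Linux is to compare two snapshots of /proc/interrupts taken a few seconds apart and see which sources grew the most. The Python sketch below does that on text you pass in; the function names are hypothetical, and the parser assumes the usual layout (a header row of CPU names, then one row per interrupt with one counter per CPU followed by a description).

```python
def parse_interrupts(text):
    """Map each interrupt label to (total count over all CPUs, description)."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        parts = line.split()
        if not parts or not parts[0].endswith(":"):
            continue
        label = parts[0][:-1]
        counts = []
        # Some rows (ERR, MIS) carry fewer counters than there are CPUs.
        for token in parts[1:1 + ncpu]:
            if not token.isdigit():
                break
            counts.append(int(token))
        description = " ".join(parts[1 + len(counts):])
        totals[label] = (sum(counts), description)
    return totals


def busiest_irqs(before_text, after_text, top=3):
    """Return (delta, label, description) for the fastest-growing interrupts."""
    before = parse_interrupts(before_text)
    after = parse_interrupts(after_text)
    deltas = []
    for label, (count, description) in after.items():
        delta = count - before.get(label, (0, ""))[0]
        if delta > 0:
            deltas.append((delta, label, description))
    deltas.sort(reverse=True)
    return deltas[:top]


if __name__ == "__main__":
    import time

    with open("/proc/interrupts") as f:
        first = f.read()
    time.sleep(2)
    with open("/proc/interrupts") as f:
        second = f.read()
    for delta, label, description in busiest_irqs(first, second):
        print(f"{label:>4} +{delta} {description}")
```

If a USB host controller shows up at the top of that list while the mouse stutters, that supports the suspicion above; the sketch only ranks interrupt sources and does not attribute them to a process.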