I have a couple processes that are eating up a lot of system CPU time (as determined by looking at vmstat). Is there an easy way to find out what kind of system calls are being made?

I know there is strace, but is there a quicker and easier way? Does there exist something like a "top" for system calls?


I think strace with the -c flag is probably the closest I know of. If you haven't used the -c flag, try this:

$  sudo strace -c -p 12345

Where 12345 is the Process ID (PID) of the process in question. Note that stracing a process does add additional overhead, so while you're tracing it, the process will run slower.

After running that for however long you want to collect data, press Ctrl-C to stop your data collection and output the results. It'll produce something like this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 31.88    0.001738         145        12           futex
 16.79    0.000915          11        80           tgkill
 12.36    0.000674          34        20           read
  9.76    0.000532         266         2           statfs
  8.42    0.000459          13        35           time
  4.38    0.000239           6        40           gettimeofday
  3.65    0.000199           4        48           sigprocmask
  2.94    0.000160          18         9           open
  2.88    0.000157          12        13           stat64
  1.32    0.000072           9         8           munmap
  0.90    0.000049           6         8           mmap2
  0.88    0.000048           3        14         7 sigreturn
  0.79    0.000043           5         9           close
  0.77    0.000042           4        10           rt_sigprocmask
  0.64    0.000035           3        12           setitimer
  0.55    0.000030           5         6         6 rt_sigsuspend
  0.53    0.000029           4         8           fstat64
  0.29    0.000016           8         2           setresuid32
  0.13    0.000007           4         2           _llseek
  0.09    0.000005           3         2           prctl
  0.04    0.000002           2         1           geteuid32
------ ----------- ----------- --------- --------- ----------------
100.00    0.005451                   341        13 total

As you can see, this is a breakdown of all system calls made by the application, sorted by total time, and including the average time per call and number of calls for each syscall. If you want to sort them differently, see the man page for strace, as there's a couple of options.


The type of strace switches that I tend to use is this.

strace -ffttT -p pid -o /tmp/strace.out

An example of this would look like,

19:35:57.485493 mprotect(0x7f35e7472000, 16384, PROT_READ) = 0 <0.000037>
19:35:57.485599 mprotect(0x7f35e7692000, 4096, PROT_READ) = 0 <0.000030>
19:35:57.485697 mprotect(0x7f35e78b7000, 4096, PROT_READ) = 0 <0.000030>
19:35:57.485782 munmap(0x7f35e7896000, 129588) = 0 <0.000037>
19:35:57.485875 set_tid_address(0x7f35e78949d0) = 10730 <0.000029>
19:35:57.485960 set_robust_list(0x7f35e78949e0, 0x18) = 0 <0.000024>
19:35:57.486048 futex(0x7fff8f58628c, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000025>
19:35:57.486131 futex(0x7fff8f58628c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1,       NULL, 7f35e7894700) = -1 EAGAIN (Resource temporarily unavailable) <0.000024>

You see the time difference at the right hand side of the system call showing how much time it has taken to go from one system call to another.

It will catch you the time difference between the system calls. So, when you see that a system call has quite a few seconds gap with the next system call, then there it is making some noise.

Another method is to coredump it with gcore. However, that requires a little experience navigating through gdb.

But, if the thread is a kernel thread, then you can't strace or coredump it. In that case, we have to use something more complex. In RHEL5 kernel, we use oprofile. In RHEL6, we use perf. I prefer perf over oprofile. Perf data can be collected with graph like format showing the system call where maximum percentage of CPU is being used.

With a test perf, I see this.

38.06%  swapper  [kernel.kallsyms]  [k] mwait_idle_with_hints                                                                                                               ↑

29.45%  swapper  [kernel.kallsyms]  [k] read_hpet 
4.90%  swapper  [kernel.kallsyms]  [k] acpi_os_read_port                                                                                                                   ▒
4.74%  swapper  [kernel.kallsyms]  [k] hpet_next_event   

It shows the kernel function where 38% CPU time is being spent. Now, we can check the function and see what it is doing and what it is supposed to do.

With a few examples, it is not that much hard.


Maybe try out one of the sampling profilers, such as oprofile, or for newer kernels, perf. If you're lucky, "perf top" might tell you precisely what you want. See here for some examples