Why do user and system cpu in cpuacct.stat not add up to cpuacct.usage?

Solution 1:

I'm not a kernel developer, but from digging through the kernel source, cpuacct.usage (updated via cgroup_account_cputime) and cpuacct.stat (updated via cgroup_account_cputime_field) seem to be calculated by different kernel components.

From what I understand, the output of cpuacct.stat seems to depend heavily on kernel configuration, in particular CONFIG_VIRT_CPU_ACCOUNTING, CONFIG_VIRT_CPU_ACCOUNTING_GEN and CONFIG_VIRT_CPU_ACCOUNTING_NATIVE. Judging by their descriptions, these options seem to make the user/system accounting more precise. A relevant file is kernel/sched/cputime.c, where timing updates seem to be triggered by kernel events (IRQs etc.).

The output of cpuacct.usage seems to be calculated by the scheduler when it switches between tasks. For example, update_curr, which calls cgroup_account_cputime, is called from enqueue_entity and dequeue_entity, which seem to be part of task scheduling. This path does not seem to be as affected by configuration.
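To illustrate why two accounting paths can disagree, here is a toy model in Python. It is not kernel code and makes strong assumptions: a 10 ms timer tick (HZ=100), tick-based accounting that charges a whole tick to whatever is running when the tick fires, and a scheduler path that records exact run time. All function names are made up for this sketch.

```python
TICK_NS = 10_000_000  # assumed timer tick of 10 ms (HZ=100)


def make_bursts(offset_ns, length_ns, period_ns, total_ns):
    """Return (start, end) run intervals for a task that wakes up periodically."""
    return [(s, s + length_ns)
            for s in range(offset_ns, total_ns - length_ns + 1, period_ns)]


def precise_usage_ns(bursts):
    """Scheduler-style accounting: sum the exact run time of every burst."""
    return sum(end - start for start, end in bursts)


def tick_sampled_usage_ns(bursts, total_ns, tick_ns=TICK_NS):
    """Tick-style accounting: charge a full tick if the task runs when it fires."""
    charged = 0
    for t in range(tick_ns, total_ns + 1, tick_ns):
        if any(start <= t < end for start, end in bursts):
            charged += tick_ns
    return charged


if __name__ == "__main__":
    one_second = 1_000_000_000
    for offset_ms in (1, 8):
        bursts = make_bursts(offset_ms * 1_000_000, 3_000_000, TICK_NS, one_second)
        print(offset_ms,
              precise_usage_ns(bursts),
              tick_sampled_usage_ns(bursts, one_second))
```

In this model a task that runs for 3 ms every 10 ms is either missed entirely by the tick sampler or charged a full tick each time, depending only on where its bursts fall relative to the tick. Real kernels are more careful than this, but it shows the kind of drift a sampled counter can have against an exact one.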

Solution 2:

cpuacct.stat contains the CPU usage accumulated by the process(es) in the cgroup, expressed in USER_HZ ticks (also called "user jiffies"), typically 1/100th of a second each. It may not be as precise as the CPU times accounted in nanoseconds.

You can obtain USER_HZ from the shell (it is typically 100):

$ getconf CLK_TCK
100

This value should correspond to the number of scheduler ticks per second, unless you are on a real-time or tickless kernel.

cpuacct.usage gives the overall CPU time in nanoseconds, measured as precisely as the kernel can report usage times.
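To compare the two files directly, the tick counts have to be converted to nanoseconds first. A minimal Python sketch, assuming cpuacct.stat has the usual "user N" / "system N" lines and cpuacct.usage holds a single integer; the function names and the sample numbers are made up for illustration.

```python
import os

NS_PER_SEC = 1_000_000_000


def parse_cpuacct_stat(text):
    """Parse the 'user N' / 'system N' lines of cpuacct.stat into a dict of ticks."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split()
        stats[name] = int(value)
    return stats


def ticks_to_ns(ticks, user_hz=None):
    """Convert USER_HZ ticks to nanoseconds; defaults to this machine's CLK_TCK."""
    if user_hz is None:
        user_hz = os.sysconf("SC_CLK_TCK")
    return ticks * NS_PER_SEC // user_hz


def stat_vs_usage(stat_text, usage_text, user_hz=None):
    """Return (stat_ns, usage_ns, usage_ns - stat_ns) for one snapshot."""
    stats = parse_cpuacct_stat(stat_text)
    stat_ns = ticks_to_ns(stats["user"] + stats["system"], user_hz)
    usage_ns = int(usage_text.strip())
    return stat_ns, usage_ns, usage_ns - stat_ns


if __name__ == "__main__":
    # Hypothetical sample contents, not taken from a real system.
    print(stat_vs_usage("user 250\nsystem 50\n", "3012345678\n", user_hz=100))
```

On a real system you would pass in the contents of the two files read from the cgroup's directory under the cpuacct hierarchy. Since the files are read at slightly different moments and one of them only has 10 ms resolution, a non-zero difference is expected.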

cpuacct.usage_all and cpuacct.usage_percpu report usage per CPU core (thread), again measured in nanoseconds.
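The per-CPU files can be parsed in a similar way. This sketch assumes cpuacct.usage_percpu is one line of space-separated nanosecond values (one per CPU) and cpuacct.usage_all is a "cpu user system" header followed by one row per CPU. The function names and sample values are made up.

```python
def parse_usage_percpu(text):
    """Return a list of per-CPU usage values in nanoseconds."""
    return [int(value) for value in text.split()]


def parse_usage_all(text):
    """Return {cpu: {'user': ns, 'system': ns}} from cpuacct.usage_all."""
    lines = text.strip().splitlines()
    columns = lines[0].split()[1:]  # header is assumed to be "cpu user system"
    per_cpu = {}
    for line in lines[1:]:
        cpu, *values = line.split()
        per_cpu[int(cpu)] = dict(zip(columns, (int(v) for v in values)))
    return per_cpu


if __name__ == "__main__":
    # Hypothetical sample contents, not taken from a real system.
    percpu = parse_usage_percpu("1500000000 1000000000 512345678 0\n")
    print(sum(percpu))
    print(parse_usage_all("cpu user system\n0 100 20\n1 300 40\n"))
```

Summing the per-CPU values should land close to cpuacct.usage, give or take whatever was accounted between the two reads.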

Note that the cpuacct subsystem was originally written as a demonstration of cgroups capabilities. It wasn't meant for precise reporting.