What are stalled-cycles-frontend and stalled-cycles-backend in 'perf stat' result?

Does anybody know what is the meaning of stalled-cycles-frontend and stalled-cycles-backend in perf stat result ? I searched on the internet but did not find the answer. Thanks

$ sudo perf stat ls                     

Performance counter stats for 'ls':

      0.602144 task-clock                #    0.762 CPUs utilized          
             0 context-switches          #    0.000 K/sec                  
             0 CPU-migrations            #    0.000 K/sec                  
           236 page-faults               #    0.392 M/sec                  
        768956 cycles                    #    1.277 GHz                    
        962999 stalled-cycles-frontend   #  125.23% frontend cycles idle   
        634360 stalled-cycles-backend    #   82.50% backend  cycles idle
        890060 instructions              #    1.16  insns per cycle        
                                         #    1.08  stalled cycles per insn
        179378 branches                  #  297.899 M/sec                  
          9362 branch-misses             #    5.22% of all branches         [48.33%]

   0.000790562 seconds time elapsed

The theory:

Let's start from this: nowaday's CPU's are superscalar, which means that they can execute more than one instruction per cycle (IPC). Latest Intel architectures can go up to 4 IPC (4 x86 instruction decoders). Let's not bring macro / micro fusion into discussion to complicate things more :).

Typically, workloads do not reach IPC=4 due to various resource contentions. This means that the CPU is wasting cycles (number of instructions is given by the software and the CPU has to execute them in as few cycles as possible).

We can divide the total cycles being spent by the CPU in 3 categories:

  1. Cycles where instructions get retired (useful work)
  2. Cycles being spent in the Back-End (wasted)
  3. Cycles spent in the Front-End (wasted).

To get an IPC of 4, the number of cycles retiring has to be close to the total number of cycles. Keep in mind that in this stage, all the micro-operations (uOps) retire from the pipeline and commit their results into registers / caches. At this stage you can have even more than 4 uOps retiring, because this number is given by the number of execution ports. If you have only 25% of the cycles retiring 4 uOps then you will have an overall IPC of 1.

The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or to finish long latency instructions (e.g. transcedentals - sqrt, reciprocals, divisions, etc.).

The cycles stalled in the front-end are a waste because that means that the Front-End does not feed the Back End with micro-operations. This can mean that you have misses in the Instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually expresses this behavior.

Another stall reason is branch prediction miss. That is called bad speculation. In that case uOps are issued but they are discarded because the BP predicted wrong.

The implementation in profilers:

How do you interpret the BE and FE stalled cycles?

Different profilers have different approaches on these metrics. In vTune, categories 1 to 3 add up to give 100% of the cycles. That seams reasonable because either you have your CPU stalled (no uOps are retiring) either it performs usefull work (uOps) retiring. See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm

In perf this usually does not happen. That's a problem because when you see 125% cycles stalled in the front end, you don't know how to really interpret this. You could link the >1 metric with the fact that there are 4 decoders but if you continue the reasoning, then the IPC won't match.

Even better, you don't know how big the problem is. 125% out of what? What do the #cycles mean then?

I personally look a bit suspicious on perf's BE and FE stalled cycles and hope this will get fixed.

Probably we will get the final answer by debugging the code from here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c


To convert generic events exported by perf into your CPU documentation raw events you can run:

more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend 

It will show you something like

event=0x0e,umask=0x01,inv,cmask=0x01

According to the Intel documentation SDM volume 3B (I have a core i5-2520):

UOPS_ISSUED.ANY:

  • Increments each cycle the # of Uops issued by the RAT to RS.
  • Set Cmask = 1, Inv = 1, Any= 1 to count stalled cycles of this core.

For the stalled-cycles-backend event translating to event=0xb1,umask=0x01 on my system the same documentation says:

UOPS_DISPATCHED.THREAD:

  • Counts total number of uops to be dispatched per- thread each cycle
  • Set Cmask = 1, INV =1 to count stall cycles.

Usually, stalled cycles are cycles where the processor is waiting for something (memory to be feed after executing a load operation for example) and doesn't have any other stuff to do. Moreover, the frontend part of the CPU is the piece of hardware responsible to fetch and decode instructions (convert them to UOPs) where as the backend part is responsible to effectively execute the UOPs.