RDTSCP in NASM always returns the same value (timing a single instruction)

I am using RDTSC and RDTSCP in NASM to measure machine cycles for various assembly language instructions to help in optimization.

I read "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures" by Gabriele Paoloni at Intel (September 2010) and other web resources (most of which were examples in C).

Using the code below (translated from C), I test various instructions, but RDTSCP always returns zero in RDX and 7 in RAX. At first I thought 7 was the cycle count, but obviously not all instructions take 7 cycles.

rdtsc
cpuid
addsd xmm14,xmm1 ; Instruction to time
rdtscp
cpuid

This returns 7, which did not seem surprising at first, because addsd takes 7 cycles on some architectures, latency included. The first two instructions can (according to some) be reversed, cpuid first then rdtsc, but that makes no difference here.

When I change the instruction to a 2-cycle instruction:

rdtsc
cpuid
add rcx,rdx ; Instruction to time
rdtscp
cpuid

This also returns 7 in rax and zero in rdx.

So my questions are:

  1. How do I access and interpret the values returned in RDX:RAX?

  2. Why does RDX always return zero, and what is it supposed to return?

UPDATE:

If I change the code to this:

cpuid
rdtsc
mov [start_time],rax
addsd xmm14,xmm1 ; INSTRUCTION
rdtscp
mov [end_time],rax
cpuid
mov rax,[end_time]
mov rdx,[start_time]
sub rax,rdx

I get 64 in rax, but that sounds like too many cycles.


Your first code (leading to the title question) is buggy because each cpuid comes after rdtsc / rdtscp and overwrites their results in EAX, EBX, ECX, and EDX with its own outputs.

Use lfence instead of cpuid; on Intel since forever, and on AMD with Spectre mitigation enabled, lfence serializes the instruction stream and thus does what you want with rdtsc.
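
Put together, a fixed version of your first sequence might look like this (an untested sketch; `start_time` / `end_time` are assumed memory locations, as in your update). It also answers both of your questions: the TSC is one 64-bit value split across EDX:EAX with the high half in EDX, and your RDX only looked like zero because cpuid had already overwritten it:

```nasm
; Untested sketch: time one instruction using lfence for serialization.
lfence                       ; wait for earlier instructions to complete
rdtsc                        ; start: high 32 bits in EDX, low 32 bits in EAX
shl     rdx, 32
or      rax, rdx             ; combine EDX:EAX into one 64-bit count in RAX
mov     [start_time], rax

addsd   xmm14, xmm1          ; instruction to time

rdtscp                       ; end: waits for earlier instructions (not later ones);
                             ;  also clobbers ECX with IA32_TSC_AUX
shl     rdx, 32
or      rax, rdx
mov     [end_time], rax
lfence                       ; keep later instructions out of the timed interval

mov     rax, [end_time]
sub     rax, [start_time]    ; elapsed reference cycles, incl. rdtscp overhead
```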


Remember that RDTSC counts reference cycles, not core clock cycles. See Get CPU cycle count? for that and more about RDTSC.

You don't have cpuid or lfence inside your measurement interval, but you do have rdtscp itself in there. Back-to-back rdtscp is not fast; 64 reference cycles sounds totally reasonable if you ran without warming up the CPU. Idle clock speed is usually a lot slower than a reference cycle; on Intel CPUs the reference clock is equal or close to the "sticker" frequency, i.e. the max non-turbo sustained frequency, e.g. 4008 MHz on a "4GHz" Skylake CPU.


This is not how you time a single instruction

What matters is latency before another instruction can use the result, not latency until it fully retires from the out-of-order back-end. RDTSC can be useful for timing relative variations in how long one load or one store instruction takes, but the overhead means you won't get a good absolute time.

You can try to subtract measurement overhead, though; see for example clflush to invalidate cache line via C function, and also the followups: Using time stamp counter and clock_gettime for cache miss and Memory latency measurement with time stamp counter.
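
For example, the harness overhead could be estimated by timing an empty interval with the identical instruction sequence (an untested sketch; `overhead` is an assumed memory location):

```nasm
; Untested sketch: measure the cost of the measurement harness itself.
lfence
rdtsc
shl     rdx, 32
or      rax, rdx
mov     rbx, rax             ; start count

;; empty timed region: nothing here

rdtscp
shl     rdx, 32
or      rax, rdx
lfence
sub     rax, rbx             ; ~cost of rdtsc + rdtscp + fences alone
mov     [overhead], rax      ; subtract this from later measurements
```

Even after subtracting this baseline, expect noise; one-shot measurements of a single instruction don't cleanly separate latency from throughput effects.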


This is what I usually use to profile latency or throughput (and uops, fused and unfused domain) of an instruction or short block. Adjust how you use it to bottleneck on latency like here, or not, if you just want to test throughput: e.g. with a %rep block using enough different registers to hide latency, or by breaking dependency chains with a pxor xmm4, xmm4 after a short block and letting out-of-order exec work its magic. (As long as you don't bottleneck on the front-end.)

You might want to use NASM's smartalign package, or use YASM, to avoid a wall of single-byte NOP instructions for the ALIGN directive. NASM defaults to really stupid NOPs even in 64-bit mode where long-NOP is always supported.
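
With the smartalign package that's just two lines at the top of the file (syntax as I recall it from the NASM manual; check your version):

```nasm
%use smartalign              ; NASM standard macro package
alignmode p6, 32             ; pad ALIGN with long NOPs; fall back to a jmp
                             ;  if more than 32 bytes of padding are needed
```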

global _start
_start:
    mov   ecx, 1000000000
; linux static executables start with XMM0..15 already zeroed
align 32                     ; just for good measure to avoid uop-cache effects
.loop:
    ;; LOOP BODY, put whatever you want to time in here
    times 4   addsd  xmm4, xmm3

    dec   ecx
    jnz   .loop

    mov  eax, 231
    xor  edi, edi
    syscall          ; x86-64 Linux sys_exit_group(0)
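
To test throughput instead of latency, the loop body could be changed to break the loop-carried dependency chain, e.g. like this (a sketch; the bottleneck should move to addsd throughput, or to the front-end):

```nasm
.loop:
    times 4  addsd  xmm4, xmm3
    pxor     xmm4, xmm4      ; zero the accumulator: breaks the dep chain so
                             ;  out-of-order exec can overlap iterations
    dec      ecx
    jnz      .loop
```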

Run this with something like this one-liner that links it into a static executable and profiles it with perf stat, which you can up-arrow and re-run every time you change the source:

(I actually put the nasm+ld + optional disassemble into a shell script called asm-link, to save typing when I'm not profiling. Disassembling makes sure that what's in your loop is what you meant to profile, especially if you have some %if stuff in your code. And also so it's on your terminal right before the profile, if you want to scroll back while testing theories in your head.)

t=testloop; nasm -felf64 -g "$t.asm" && ld "$t.o" -o "$t" &&  objdump -drwC -Mintel "$t" &&
 taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r4 ./"$t"

Result from i7-6700k at 3.9GHz (current perf has a unit-scaling display bug for the secondary column. It's fixed upstream but Arch Linux hasn't updated yet.):

 Performance counter stats for './testloop' (4 runs):

          4,106.09 msec task-clock                #    1.000 CPUs utilized            ( +-  0.01% )
                17      context-switches          #    4.080 M/sec                    ( +-  5.65% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.487 M/sec                  
    16,012,778,144      cycles                    # 3900323.504 GHz                   ( +-  0.01% )
     1,001,537,894      branches                  # 243950284.862 M/sec               ( +-  0.00% )
     6,008,071,198      instructions              #    0.38  insn per cycle           ( +-  0.00% )
     5,013,366,769      uops_issued.any           # 1221134275.667 M/sec              ( +-  0.01% )
     5,013,217,655      uops_executed.thread      # 1221097955.182 M/sec              ( +-  0.01% )

          4.106283 +- 0.000536 seconds time elapsed  ( +-  0.01% )

On my i7-6700k (Skylake), addsd has 4 cycle latency, 0.5c throughput. (i.e. 2 per clock, if latency wasn't the bottleneck). See https://agner.org/optimize/, https://uops.info/, and http://instlatx64.atw.hu/.

16 cycles per branch = 16 cycles per chain of 4 addsd = 4 cycle latency for addsd, reproducing Agner Fog's measurement of 4 cycles to better than 1 part in 100 even for this test that includes a tiny bit of startup overhead, and interrupt overhead.

Take your pick of different counters to record. Adding a :u suffix, like instructions:u, to a perf event will count only user-space instructions, excluding any that ran during interrupt handlers. I usually don't do that, so I can see that overhead as part of the explanation for wall-clock time. But if you do, cycles:u can match very closely with instructions:u.

-r4 runs it 4 times and averages, which can be useful to see if there's a lot of run-to-run variation instead of just getting one average from a higher value in ECX.

Adjust your initial ECX value to make the total time about 0.1 to 1 second; that's usually plenty, especially if your CPU ramps up to max turbo very quickly (e.g. Skylake with hardware P-states and a fairly aggressive energy_performance_preference). Or max non-turbo with turbo disabled.

Note that perf's cycles event counts core clock cycles, not reference cycles, so it gives the same result regardless of CPU frequency changes (+- some noise from stopping the clock during the transition).