Negative clock cycle measurements with back-to-back rdtsc?

I am writing C code to measure the number of clock cycles needed to acquire a semaphore. I am using rdtsc, and before doing the measurement on the semaphore, I call rdtsc twice in a row to measure the overhead. I repeat this many times in a for loop, and then I use the average value as the rdtsc overhead.

First of all, is it correct to use the average value here?

Nonetheless, the big problem is that I sometimes get negative values for the overhead (not necessarily the averaged one, but at least some of the individual values inside the for loop).

This also affects the subsequent calculation of the number of CPU cycles needed for the sem_wait() operation, which sometimes also turns out to be negative. If what I wrote is not clear, here is part of the code I am working on.

Why am I getting such negative values?


(editor's note: see Get CPU cycle count? for a correct and portable way of getting the full 64-bit timestamp. An "=A" asm constraint will only get the low or high 32 bits when compiled for x86-64, depending on whether register allocation happens to pick RAX or RDX for the uint64_t output. It won't pick edx:eax.)

(editor's 2nd note: oops, that's the answer to why we're getting negative results. Still worth leaving a note here as a warning not to copy this rdtsc implementation.)


#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

static inline uint64_t get_cycles()
{
  uint64_t t;
           // editor's note: "=A" is unsafe for this in x86-64
  __asm volatile ("rdtsc" : "=A"(t));
  return t;
}

int num_measures = 10;

int main ()
{
   int i, value, res1, res2;
   uint64_t c1, c2;
   int tsccost, tot, a;

   tot=0;    

   for(i=0; i<num_measures; i++)
   {    
      c1 = get_cycles();
      c2 = get_cycles();

      tsccost=(int)(c2-c1);


      if(tsccost<0)
      {
         printf("####  ERROR!!!   ");
         printf("rdtsc took %d clock cycles\n", tsccost);
         return 1;
      }   
      tot = tot+tsccost;
   }

   tsccost=tot/num_measures;
   printf("rdtsc takes on average: %d clock cycles\n", tsccost);      

   return EXIT_SUCCESS;
}
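
(editor's 3rd note: for reference, a minimal 64-bit-safe sketch of the wrapper above, reading the TSC into two 32-bit outputs and combining them. In real code you would normally just use the __rdtsc() intrinsic from <x86intrin.h> or the approach in the question linked in the first note.)

#include <stdint.h>

static inline uint64_t get_cycles_fixed(void)
{
  uint32_t lo, hi;
  /* RDTSC leaves the low 32 bits of the TSC in EAX and the high 32 bits in EDX */
  __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}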

Solution 1:

When Intel first invented the TSC it measured CPU cycles. Due to various power management features "cycles per second" is not constant; so TSC was originally good for measuring the performance of code (and bad for measuring time passed).

For better or worse; back then CPUs didn't really have too much power management, often CPUs ran at a fixed "cycles per second" anyway. Some programmers got the wrong idea and misused the TSC for measuring time and not cycles. Later (when the use of power management features became more common) these people misusing TSC to measure time whined about all the problems that their misuse caused. CPU manufacturers (starting with AMD) changed TSC so it measures time and not cycles (making it broken for measuring the performance of code, but correct for measuring time passed). This caused confusion (it was hard for software to determine what TSC actually measured), so a little later on AMD added the "TSC Invariant" flag to CPUID, so that if this flag is set programmers know that the TSC is broken (for measuring cycles) or fixed (for measuring time).

Intel followed AMD and changed the behaviour of their TSC to also measure time, and also adopted AMD's "TSC Invariant" flag.

This gives 4 different cases:

  • TSC measures both time and performance (cycles per second is constant)

  • TSC measures performance not time

  • TSC measures time and not performance but doesn't use the "TSC Invariant" flag to say so

  • TSC measures time and not performance and does use the "TSC Invariant" flag to say so (most modern CPUs)
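
For example (not part of the original answer), with GCC or Clang you can check which case you're in from C via <cpuid.h>: the "TSC Invariant" flag is bit 8 of EDX in extended CPUID leaf 0x80000007.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
  unsigned int eax, ebx, ecx, edx;

  /* Extended leaf 0x80000007: EDX bit 8 is the "Invariant TSC" flag.
     __get_cpuid() returns 0 if the leaf isn't supported on this CPU. */
  if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
    printf("Invariant TSC: TSC ticks at a constant rate (measures time, not cycles)\n");
  else
    printf("No invariant TSC flag: TSC rate may vary with the core clock\n");

  return 0;
}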

For cases where TSC measures time, to measure performance/cycles properly you have to use performance monitoring counters. Sadly, performance monitoring counters are different for different CPUs (model specific) and require access to MSRs (privileged code). This makes it considerably impractical for applications to measure "cycles".
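
On Linux specifically, the practical route is to let the kernel program those counters for you through the perf_event_open(2) syscall, rather than touching MSRs yourself. A rough sketch (not from the original answer; it assumes /proc/sys/kernel/perf_event_paranoid allows unprivileged counting, and glibc has no wrapper so it goes through syscall()):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* count actual core cycles, not TSC ticks */
  attr.disabled = 1;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;

  /* pid = 0, cpu = -1: count for the calling thread on whatever CPU it runs on */
  int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
  if (fd == -1) { perror("perf_event_open"); return 1; }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

  /* ... the code to be measured goes here ... */

  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  uint64_t cycles = 0;
  if (read(fd, &cycles, sizeof(cycles)) == sizeof(cycles))
    printf("cycles: %llu\n", (unsigned long long)cycles);
  close(fd);
  return 0;
}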

Also note that if the TSC does measure time, you can't know what time scale it returns (how many nanoseconds in a "pretend cycle") without using some other time source to determine a scaling factor.
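
For example, a crude way to get that scaling factor is to compare the TSC against CLOCK_MONOTONIC over a known interval. A sketch, assuming an invariant TSC (so the ratio is stable) and reusing a 64-bit-safe rdtsc wrapper like the one sketched under the question:

#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t get_cycles_fixed(void)
{
  uint32_t lo, hi;
  __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
  struct timespec t0, t1, delay = { 0, 200 * 1000 * 1000 };  /* ~200 ms */

  clock_gettime(CLOCK_MONOTONIC, &t0);
  uint64_t c0 = get_cycles_fixed();

  nanosleep(&delay, NULL);

  clock_gettime(CLOCK_MONOTONIC, &t1);
  uint64_t c1 = get_cycles_fixed();

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  double ticks_per_ns = (double)(c1 - c0) / ns;
  printf("TSC: ~%.3f ticks/ns, so one tick is ~%.3f ns\n",
         ticks_per_ns, 1.0 / ticks_per_ns);
  return 0;
}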

The second problem is that for multi-CPU systems most operating systems suck. The correct way for an OS to handle the TSC is to prevent applications from using it directly (by setting the TSD flag in CR4; so that the RDTSC instruction causes an exception). This prevents various security vulnerabilities (timing side-channels). It also allows the OS to emulate the TSC and ensure it returns a correct result. For example, when an application uses the RDTSC instruction and causes an exception, the OS's exception handler can figure out a correct "global time stamp" to return.

Of course different CPUs have their own TSC. This means that if an application uses TSC directly they get different values on different CPUs. To help people work around the OS's failure to fix the problem (by emulating RDTSC like they should); AMD added the RDTSCP instruction, which returns the TSC and a "processor ID" (Intel ended up adopting the RDTSCP instruction too). An application running on a broken OS can use the "processor ID" to detect when they're running on a different CPU from last time; and in this way (using the RDTSCP instruction) they can know when "elapsed = TSC - previous_TSC" gives an invalid result. However; the "processor ID" returned by this instruction is just a value in an MSR, and the OS has to set this value on each CPU to something different - otherwise RDTSCP will say that the "processor ID" is zero on all CPUs.

Basically; if the CPU supports the RDTSCP instruction, and if the OS has correctly set the "processor ID" (using the MSR); then the RDTSCP instruction can help applications know when they've got a bad "elapsed time" result (but it doesn't provide any way of fixing or avoiding the bad result).
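
With GCC or Clang, the __rdtscp() intrinsic from <x86intrin.h> exposes both the TSC value and that "processor ID", so the detect-and-discard approach looks roughly like this (a sketch only; Linux, for example, fills IA32_TSC_AUX with the CPU number, but as noted above another OS might leave it at zero everywhere):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
  unsigned int id_start, id_end;

  uint64_t start = __rdtscp(&id_start);   /* TSC plus the "processor ID" from IA32_TSC_AUX */
  /* ... the code being measured goes here ... */
  uint64_t end = __rdtscp(&id_end);

  if (id_start != id_end) {
    /* The thread migrated, so end - start compares two different CPUs' TSCs. */
    printf("migrated between CPUs; discarding this sample\n");
  } else {
    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(end - start));
  }
  return 0;
}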

So; to cut a long story short, if you want an accurate performance measurement you're mostly screwed. The best you can realistically hope for is an accurate time measurement; but only in some cases (e.g. when running on a single-CPU machine or "pinned" to a specific CPU; or when using RDTSCP on OSs that set it up properly as long as you detect and discard invalid values).

Of course even then you'll get dodgy measurements because of things like IRQs. For this reason; it's best to run your code many times in a loop and discard any results that are too much higher than other results.

Finally, if you really want to do it properly you should measure the overhead of measuring. To do this you'd measure how long it takes to do nothing (just the RDTSC/RDTSCP instruction alone, while discarding dodgy measurements); then subtract the overhead of measuring from the "measuring something" results. This gives you a better estimate of the time "something" actually takes.
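
For example (not the original poster's code), a sketch that estimates the RDTSC overhead as the minimum over many back-to-back samples, so readings inflated by IRQs or task switches are discarded, and then subtracts that overhead from the real measurement; it assumes a 64-bit-safe rdtsc wrapper like the one sketched under the question:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t get_cycles_fixed(void)
{
  uint32_t lo, hi;
  __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
  uint64_t overhead = UINT64_MAX;

  /* Overhead of measuring: take the minimum of many back-to-back samples,
     so samples inflated by IRQs or task switches are ignored automatically. */
  for (int i = 0; i < 100000; i++) {
    uint64_t c1 = get_cycles_fixed();
    uint64_t c2 = get_cycles_fixed();
    if (c2 - c1 < overhead)
      overhead = c2 - c1;
  }

  uint64_t t1 = get_cycles_fixed();
  /* ... the "something" being measured goes here ... */
  uint64_t t2 = get_cycles_fixed();

  uint64_t elapsed = (t2 - t1 > overhead) ? (t2 - t1 - overhead) : 0;
  printf("measurement overhead ~%llu ticks, \"something\" took ~%llu ticks\n",
         (unsigned long long)overhead, (unsigned long long)elapsed);
  return 0;
}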

Note: If you can dig up a copy of Intel's System Programming Guide from when Pentium was first released (mid 1990s - not sure if it's available online anymore - I have archived copies since the 1980s) you'll find that Intel documented the time stamp counter as something that "can be used to monitor and identify the relative time of occurrence of processor events". They guaranteed that (excluding 64-bit wrap-around) it would monotonically increase (but not that it would increase at a fixed rate) and that it'd take a minimum of 10 years before it wrapped around. The latest revision of the manual documents the time stamp counter with more detail, stating that for older CPUs (P6, Pentium M, older Pentium 4) the time stamp counter "increments with every internal processor clock cycle" and that "Intel® SpeedStep® technology transitions may impact the processor clock"; and that for newer CPUs (newer Pentium 4, Core Solo, Core Duo, Core 2, Atom) the TSC increments at a constant rate (and that this is the "architectural behaviour moving forward"). Essentially, from the very beginning it was a (variable) "internal cycle counter" to be used for a time-stamp (and not a time counter to be used to track "wall clock" time), and this behaviour changed soon after the year 2000 (based on the Pentium 4 release date).

Solution 2:

  1. do not use the avg value

    Use the smallest one, or an average of the smaller values instead (an average rather than just the minimum helps smooth out cache effects), because the bigger ones have been interrupted by OS multitasking.

    You could also record all the values, find the OS scheduling granularity boundary, and filter out all values beyond that boundary (usually > 1 ms, which is easily detectable).


  2. no need to measure the overhead of RDTSC

    Your measurement is just offset by some amount, and because the same offset is present in both timestamps, it disappears after the subtraction.

  3. for a variable RDTSC clock source (like on laptops)

    You should drive the CPU up to its maximum speed with some steady, intensive computation loop; usually a few seconds are enough. You should measure the CPU frequency continuously and only start measuring your code once it is stable enough (a rough sketch of this follows below).
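
A rough sketch of point 3 (not from the original answer): busy-spin while repeatedly estimating the TSC rate against CLOCK_MONOTONIC, and declare the clock stable once two consecutive estimates agree. This only makes sense on the older variable-TSC hardware the point is about, and the 0.5% tolerance and 50 ms sampling window are arbitrary choices:

#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t rdtsc64(void)
{
  uint32_t lo, hi;
  __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

/* Estimate the TSC rate over a ~50 ms busy-spin (the spin itself keeps the core loaded). */
static double tsc_hz_sample(void)
{
  struct timespec t0, t1;
  double ns;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  uint64_t c0 = rdtsc64();
  do {
    clock_gettime(CLOCK_MONOTONIC, &t1);
    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  } while (ns < 50e6);
  uint64_t c1 = rdtsc64();
  return (double)(c1 - c0) / ns * 1e9;
}

int main(void)
{
  double prev = tsc_hz_sample();
  for (;;) {
    double cur = tsc_hz_sample();
    double diff = cur > prev ? cur - prev : prev - cur;
    if (diff < 0.005 * cur) {        /* two consecutive estimates within 0.5% */
      printf("clock looks stable at ~%.0f MHz; start measuring now\n", cur / 1e6);
      break;
    }
    prev = cur;
  }
  return 0;
}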

Solution 3:

If your code starts off on one processor and then swaps to another, the timestamp difference may be negative due to processors sleeping, etc.

Try setting the processor affinity before you start measuring.

I can't tell from the question whether you are running under Windows or Linux, so I'll answer for both.

Windows:

#include <windows.h>

DWORD_PTR affinityMask = 0x00000001;                       /* run on CPU 0 only */
SetProcessAffinityMask(GetCurrentProcess(), affinityMask);

Linux:

#define _GNU_SOURCE            /* for CPU_ZERO/CPU_SET and sched_setaffinity() */
#include <sched.h>
#include <unistd.h>            /* getpid() */

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
sched_setaffinity(getpid(), sizeof(cpuset), &cpuset);