faster equivalent of gettimeofday

In trying to build a very latency sensitive application, that needs to send 100s of messages a seconds, each message having the time field, we wanted to consider optimizing gettimeofday. Out first thought was rdtsc based optimization. Any thoughts ? Any other pointers ? Required accurancy of the time value returned is in milliseconds, but it isn't a big deal if the value is occasionally out of sync with the receiver for 1-2 milliseconds. Trying to do better than the 62 nanoseconds gettimeofday takes


Solution 1:

POSIX Clocks

I wrote a benchmark for POSIX clock sources:

  • time (s) => 3 cycles
  • ftime (ms) => 54 cycles
  • gettimeofday (us) => 42 cycles
  • clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
  • clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
  • clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
  • clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
  • clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
  • clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
  • clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
  • clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
  • rdtsc (cycles) => 24 cycles

These numbers are from an Intel Core i7-4771 CPU @ 3.50GHz on Linux 4.0. These measurements were taken using the TSC register and running each clock method thousands of times and taking the minimum cost value.

You'll want to test on the machines you intend to run on though as how these are implemented varies from hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting, which is in the same repo (tsc.h).

TSC

Access the TSC (processor time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel is using itself. It's also quite straight-forward on modern Intel chips as the TSC is synchronized across cores and unaffected by frequency scaling. So it provides a simple, global time source. You can see an example of using it here with a walkthrough of the assembly code here.

The main issue with this (other than portability) is that there doesn't seem to be a good way to go from cycles to nanoseconds. The Intel docs as far as I can find state that the TSC runs at a fixed frequency, but that this frequency may differ from the processors stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by testing how many TSC cycles occur between two hardware timers (see here).

Memcached

Memcached bothers to do the cache method. It may simply be to make sure the performance is more predictable across platforms, or scale better with multiple cores. It may also no be a worthwhile optimization.

Solution 2:

Have you actually benchmarked, and found gettimeofday to be unacceptably slow?

At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.

While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based gettimeofday - where possible, the vdso will pull in an implementation of gettimeofday that applies an offset (read from a shared kernel-user memory segment) to rdtsc's value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.

That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with clock_gettime. This is also implemented with the vdso as well, and guaranteed not to call into the kernel (for recent kernels and glibc).

Solution 3:

Like bdonian says, if you're only sending a few hundred messages per second, gettimeofday is going to be fast enough.

However, if you were sending millions of messages per second, it might be different (but you should still measure that it is a bottleneck). In that case, you might want to consider something like this:

  • have a global variable, giving the current timestamp in your desired accuracy
  • have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time, then have the thread sleep some fraction of T and then update the timestamp; use real-time features if you need to)
  • all other threads (or the main process, if you don't use threads otherwise) just reads the global variable

The C language does not guarantee that you can read the timestamp value if it is larger than sig_atomic_t. You could use locking to deal with that, but locking is heavy. Instead, you could use a volatile sig_atomic_t typed variable to index an array of timestamps: the background thread updates the next element in the array, and then updates the index. The other threads read the index, and then read the array: they might get a tiny bit out-of-date timestamp (but they get the right one next time), but they do not run into the problem where they read the timestamp at the same time it is being updated, and get some bytes of the old value and some of the new value.

But all this is much overkill for just hundreds of messages per second.

Solution 4:

Below is a benchmark. I see about 30ns. printTime() from rashad How to get current time and date in C++?

#include <string>
#include <iostream>
#include <sys/time.h>
using namespace std;

void printTime(time_t now)
{
    struct tm  tstruct;
    char       buf[80];
    tstruct = *localtime(&now);
    strftime(buf, sizeof(buf), "%Y-%m-%d.%X", &tstruct);
    cout << buf << endl;
}

int main()
{
   timeval tv;
   time_t tm;

   gettimeofday(&tv,NULL);
   printTime((time_t)tv.tv_sec);
   for(int i=0; i<100000000; i++)
        gettimeofday(&tv,NULL);
   gettimeofday(&tv,NULL);
   printTime((time_t)tv.tv_sec);

   printTime(time(NULL));
   for(int i=0; i<100000000; i++)
        tm=time(NULL);
   printTime(time(NULL));

   return 0;
}

3 sec for 100,000,000 calls or 30ns;

2014-03-20.09:23:35
2014-03-20.09:23:38
2014-03-20.09:23:38
2014-03-20.09:23:41