What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

Two different threads within a single process can share a common memory location by reading and/or writing to it.

Usually, such (intentional) sharing is implemented using atomic operations using the lock prefix on x86, which has fairly well-known costs both for the lock prefix itself (i.e., the uncontended cost) and also additional coherence costs when the cache line is actually shared (true or false sharing).

Here I'm interested in produced-consumer costs where a single thread P writes to a memory location, and another thread `C reads from the memory location, both using plain reads and writes.

What is the latency and throughput of such an operation when performed on separate cores on the same socket, and in comparison when performed on sibling hyperthreads on the same physical core, on recent x86 cores.

In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.


Okay, I couldn't find any authoritative source, so I figured I'd give it a go myself.

#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cstdint>
#include <iostream>


alignas(128) static uint64_t data[SIZE];
alignas(128) static std::atomic<unsigned> shared;
#ifdef EMPTY_PRODUCER
alignas(128) std::atomic<unsigned> unshared;
#endif
alignas(128) static std::atomic<bool> stop_producer;
alignas(128) static std::atomic<uint64_t> elapsed;

static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ (
        "rdtsc"
        : "=a" (l), "=d" (h)
    );
    return ((uint64_t)h << 32) | l;
}

static void * consume(void *)
{
    uint64_t    value = 0;
    uint64_t    start = rdtsc();

    for (unsigned n = 0; n < LOOPS; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
        }
    }

    elapsed = rdtsc() - start;
    return reinterpret_cast<void*>(value);
}

static void * produce(void *)
{
    do {
#ifdef EMPTY_PRODUCER
        unshared.store(0, std::memory_order_relaxed);
#else
        shared.store(0, std::memory_order_relaxed);
#enfid
    } while (!stop_producer);
    return nullptr;
}



int main()
{
    pthread_t consumerId, producerId;
    pthread_attr_t consumerAttrs, producerAttrs;
    cpu_set_t cpuset;

    for (unsigned idx = 0; idx < SIZE; ++idx) { data[idx] = 1; }
    shared = 0;
    stop_producer = false;

    pthread_attr_init(&consumerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(CONSUMER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&consumerAttrs, sizeof(cpuset), &cpuset);

    pthread_attr_init(&producerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(PRODUCER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&producerAttrs, sizeof(cpuset), &cpuset);

    pthread_create(&consumerId, &consumerAttrs, consume, NULL);
    pthread_create(&producerId, &producerAttrs, produce, NULL);

    pthread_attr_destroy(&consumerAttrs);
    pthread_attr_destroy(&producerAttrs);

    pthread_join(consumerId, NULL);
    stop_producer = true;
    pthread_join(producerId, NULL);

    std::cout <<"Elapsed cycles: " <<elapsed <<std::endl;
    return 0;
}

Compile with the following command, replacing defines:

gcc -std=c++11 -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing

Where:

  • CONSUMER_CPU is the number of the cpu to run consumer thread on.
  • PRODUCER_CPU is the number of the cpu to run producer thread on.
  • SIZE is the size of the inner loop (matters for cache)
  • LOOPS is, well...

Here are the generated loops:

Consumer thread

  400cc8:       ba 80 24 60 00          mov    $0x602480,%edx
  400ccd:       0f 1f 00                nopl   (%rax)
  400cd0:       8b 05 2a 17 20 00       mov    0x20172a(%rip),%eax        # 602400 <shared>
  400cd6:       48 83 c2 08             add    $0x8,%rdx
  400cda:       48 03 42 f8             add    -0x8(%rdx),%rax
  400cde:       48 01 c1                add    %rax,%rcx
  400ce1:       48 81 fa 80 24 70 00    cmp    $0x702480,%rdx
  400ce8:       75 e6                   jne    400cd0 <_ZL7consumePv+0x20>
  400cea:       83 ee 01                sub    $0x1,%esi
  400ced:       75 d9                   jne    400cc8 <_ZL7consumePv+0x18>

Producer thread, with empty loop (no writing to shared):

  400c90:       c7 05 e6 16 20 00 00    movl   $0x0,0x2016e6(%rip)        # 602380 <unshared>
  400c97:       00 00 00 
  400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax        # 602300 <stop_producer>
  400ca1:       84 c0                   test   %al,%al
  400ca3:       74 eb                   je     400c90 <_ZL7producePv>

Producer thread, writing to shared:

  400c90:       c7 05 66 17 20 00 00    movl   $0x0,0x201766(%rip)        # 602400 <shared>
  400c97:       00 00 00 
  400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax        # 602300 <stop_producer>
  400ca1:       84 c0                   test   %al,%al
  400ca3:       74 eb                   je     400c90 <_ZL7producePv>

The program counts the number of CPU cycles consumed, on consumer's core, to complete the whole loop. We compare the first producer, which does nothing but burn CPU cycles, to the second producer, which disrupts the consumer by repetitively writing to shared.

My system has a i5-4210U. That is, 2 cores, 2 threads per core. They are exposed by the kernel as Core#1 → cpu0, cpu2 Core#2 → cpu1, cpu3.

Result without starting the producer at all:

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3          n/a           2.11G              1.80G

Results with empty producer. For 1G operations (either 1000*1M or 8000*128k).

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3           3            3.20G              3.26G       # mono
    3           2            2.10G              1.80G       # other core
    3           1            4.18G              3.24G       # same core, HT

As expected, since both threads are cpu hogs and both get a fair share, the producer burning cycles slows down consumer by about half. That's just cpu contention.

With producer on cpu#2, as there is no interaction, consumer runs with no impact from the producer running on another cpu.

With producer on cpu#1, we see hyperthreading at work.

Results with disruptive producer:

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3           3            4.26G              3.24G       # mono
    3           2           22.1 G             19.2 G       # other core
    3           1           36.9 G             37.1 G       # same core, HT
  • When we schedule both thread on the same thread of the same core, there is no impact. Expected again, as the producer writes remain local, incurring no synchronization cost.

  • I cannot really explain why I get much worse performance for hyperthreading than for two cores. Advice welcome.


The killer problem is that the cores makes speculative reads, which means that each time a write to the the speculative read address (or more correctly to the same cache line) before it is "fulfilled" means the CPU must undo the read (at least if your an x86), which effectively means it cancels all speculative instructions from that instruction and later.

At some point before the read is retired it gets "fulfilled", ie. no instruction before can fail and there is no longer any reason to reissue, and the CPU can act as-if it had executed all instructions before.

Other core example

These are playing cache ping pong in addition to cancelling instructions so this should be worse than the HT version.

Lets start at some point in the process where the cache line with the shared data has just been marked shared because the Consumer has ask to read it.

  1. The producer now wants to write to the shared data and sends out a request for exclusive ownership of the cache line.
  2. The Consumer receives his cache line still in shared state and happily reads the value.
  3. The consumer continues to read the shared value until the exclusive request arrives.
  4. At which point the Consumer sends a shared request for the cache line.
  5. At this point the Consumer clears its instructions from the first unfulfilled load instruction of the shared value.
  6. While the Consumer waits for the data it runs ahead speculatively.

So the Consumer can advance in the period between it gets it shared cache line until its invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely 2 as the CPU has 2 read ports. And it properbly doesn't need to rerun them as soon as the internal state of the CPU is satisfied they can't they can't fail between each.

Same core HT

Here the two HT shares the core and must share its resources.

The cache line should stay in the exclusive state all the time as they share the cache and therefore don't need the cache protocol.

Now why does it take so many cycles on the HT core? Lets start with the Consumer just having read the shared value.

  1. Next cycle a write from the Produces occures.
  2. The Consumer thread detects the write and cancels all its instructions from the first unfulfilled read.
  3. The Consumer re-issues its instructions taking ~5-14 cycles to run again.
  4. Finally the first instruction, which is a read, is issued and executed as it did not read a speculative value but a correct one as its in front of the queue.

So for every read of the shared value the Consumer is reset.

Conclusion

The different core apparently advance so much each time between each cache ping pong that it performs better than the HT one.

What would have happened if the CPU waited to see if the value had actually changed?

For the test code the HT version would have run much faster, maybe even as fast as the private write version. The different core would not have run faster as the cache miss was covering the reissue latency.

But if the data had been different the same problem would arise, except it would be worse for the different core version as it would then also have to wait for the cache line, and then reissue.

So if the OP can change some of roles letting the time stamp producer read from the shared and take the performance hit it would be better.

Read more here