Does the Intel Memory Model make SFENCE and LFENCE redundant?

Right, LFENCE and SFENCE are not useful in normal code because x86's acquire / release semantics for regular loads and stores make them redundant unless you're using other special instructions or memory types.

The only fence that matters for normal lockless code is the full barrier (including StoreLoad) from a locked instruction, or a slow MFENCE. Prefer xchg over mov+mfence for sequential-consistency stores because it's faster; see Are loads and stores the only instructions that gets reordered?

Does `xchg` encompass `mfence` assuming no non-temporal instructions? (yes, even with NT instructions, as long as there's no WC memory.)

Jeff Preshing's Memory Reordering Caught in the Act article is an easier-to-read description of the same case Bartosz's post talks about, where you need a StoreLoad barrier like MFENCE. Only MFENCE will do; you can't construct MFENCE out of SFENCE + LFENCE. (Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)
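As a rough sketch of that StoreLoad case in C++ (the names `X`, `Y`, `run_once`, and `count_both_zero` are made up for illustration, not taken from either article): each thread stores to one variable and then loads the other. With `seq_cst` stores, which compilers typically implement on x86 as `xchg` or `mov` + `mfence`, the outcome where both threads read 0 is forbidden. With only release stores and acquire loads (plain `mov` on x86) it would be allowed.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper type for this sketch.
struct Result { int r1; int r2; };

// One round of the store-then-load litmus test.
Result run_once() {
    std::atomic<int> X{0}, Y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        X.store(1, std::memory_order_seq_cst);   // full barrier needed after this store
        r1 = Y.load(std::memory_order_seq_cst);  // must not appear to run before the store
    });
    std::thread t2([&] {
        Y.store(1, std::memory_order_seq_cst);
        r2 = X.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}

// Counts rounds where both threads read 0; seq_cst forbids that outcome.
int count_both_zero(int trials) {
    int n = 0;
    for (int i = 0; i < trials; ++i) {
        Result r = run_once();
        if (r.r1 == 0 && r.r2 == 0) ++n;
    }
    return n;
}
```

Weakening the stores to `memory_order_release` and the loads to `memory_order_acquire` makes the both-zero result legal, although actually observing it needs a tighter loop than this thread-per-round harness.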

If you had questions after reading the link you posted, read Jeff Preshing's other blog posts. They gave me a good understanding of the subject. :) Although I think I found the tidbit about SFENCE/LFENCE normally being a no-op in Doug Lea's page. Jeff's posts didn't consider NT loads/stores.

Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (my answer and @BeeOnRope's answer are good. I wrote this answer long before that one, so parts of this answer show my inexperience at the time. My answer there considers the C++ intrinsics and C++ compile-time memory order, which is not at all the same thing as x86 asm runtime memory ordering. But you still don't want _mm_lfence().)

SFENCE is only relevant when using movnt (Non-Temporal) streaming stores, or working with memory regions with a type set to something other than the normal Write-Back. Or with clflushopt, which is kind of like a weakly-ordered store. NT stores bypass the cache as well as being weakly ordered. (x86's normal memory model is strongly ordered, other than NT stores, WC (write-combining) memory, and ERMSB string ops (see below).)

LFENCE is only useful for memory ordering with weakly-ordered loads, which are very rare. (Or possibly for LoadStore ordering with regular loads before NT stores?)

NT loads (movntdqa) from WB memory are still strongly ordered, even on a hypothetical future CPU that doesn't ignore the NT hint; the only way to do weakly-ordered loads on x86 is when reading from weakly-ordered memory (WC), and then I think only with movntdqa. This doesn't happen by accident in "normal" programs, so you only have to worry about this if you mmap video RAM or something.

(The primary use-case for lfence is not memory ordering at all, it's for serializing instruction execution, e.g. for Spectre mitigation, or with RDTSC. See Is LFENCE serializing on AMD processors? and the "linked questions" sidebar for that question.)

Memory ordering in C++, and how it maps to x86 asm

I got curious about this a couple weeks ago, and posted a fairly detailed answer to a recent question: Atomic operations, std::atomic<> and ordering of writes. I included lots of links to stuff about the memory model of C++ vs. hardware memory models.

If you're writing in C++, using std::atomic<> is an excellent way to tell the compiler what ordering requirements you have, so it doesn't reorder your memory operations at compile time. You can and should use weaker release or acquire semantics where appropriate, instead of the default sequential consistency, so the compiler doesn't have to emit any barrier instructions at all on x86. It just has to keep the ops in source order.

On a weakly ordered architecture like ARM or PPC, or x86 with movnt, you need a StoreStore barrier instruction between writing a buffer and setting a flag to indicate the data is ready. Also, the reader needs a LoadLoad barrier instruction between checking the flag and reading the buffer.
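A minimal sketch of that buffer-plus-flag pattern with std::atomic (the `Channel` struct and function names are hypothetical, just for illustration). The release store and acquire load tell the compiler what ordering is needed; on x86 both should compile to plain `mov` with no barrier instruction, while on ARM or PPC the compiler emits whatever barriers or special load/store instructions the target requires.

```cpp
#include <atomic>
#include <thread>

// Hypothetical one-shot channel: plain data guarded by an atomic flag.
struct Channel {
    int payload = 0;                 // ordinary non-atomic data ("the buffer")
    std::atomic<bool> ready{false};  // "data is ready" flag
};

void writer(Channel& ch, int value) {
    ch.payload = value;                               // write the data first
    ch.ready.store(true, std::memory_order_release);  // then publish it
}

int reader(Channel& ch) {
    // The acquire load keeps the payload read from moving ahead of the flag check.
    while (!ch.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return ch.payload;
}

// Runs one writer and one reader; returns what the reader saw.
int run_message_passing(int value) {
    Channel ch;
    int seen = -1;
    std::thread r([&] { seen = reader(ch); });
    std::thread w([&] { writer(ch, value); });
    w.join();
    r.join();
    return seen;
}
```

Using `memory_order_relaxed` for the flag would drop the ordering guarantee in the C++ model, even though the x86 hardware would still order plain `mov` stores and loads, because the compiler would be free to reorder at compile time.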

Not counting movnt, x86 already has LoadLoad barriers between every load, and StoreStore barriers between every store. (LoadStore ordering is also guaranteed). MFENCE is all 4 kinds of barriers, including StoreLoad, which is the only barrier x86 doesn't do by default. MFENCE makes sure loads don't use old prefetched values from before other threads saw your stores and potentially did stores of their own. (As well as being a barrier for NT store ordering and load ordering.)

Fun fact: x86 lock-prefixed instructions are also full memory barriers. They can be used as a substitute for MFENCE in old 32bit code that might run on CPUs not supporting it. lock add [esp], 0 is otherwise a no-op, and does the read/modify/write cycle on memory that's very likely hot in L1 cache and already in the M state of the MESI coherency protocol.

SFENCE is a StoreStore barrier. It's useful after NT stores to create release semantics for a following store.
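Here's a hedged sketch of the one case where SFENCE earns its keep: filling a buffer with NT stores and then publishing a flag. The function names are invented for this example, and the non-x86 fallback exists only so the sketch compiles elsewhere. `_mm_stream_si32` and `_mm_sfence` are the standard SSE2 / SSE intrinsics.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_HAVE_X86_NT 1
#else
#define SKETCH_HAVE_X86_NT 0
#endif

// Fills dst[0..n) with value using NT stores where available (hypothetical helper).
void fill_nt(int* dst, std::size_t n, int value) {
#if SKETCH_HAVE_X86_NT
    for (std::size_t i = 0; i < n; ++i) {
        _mm_stream_si32(dst + i, value);  // movnti: weakly-ordered, cache-bypassing store
    }
    _mm_sfence();  // order the NT stores before any later store (e.g. the flag)
#else
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = value;  // fallback: ordinary stores, no fence needed here
    }
#endif
}

// Writer fills the buffer and sets a flag; reader checks every element after seeing the flag.
bool run_nt_publish(std::size_t n, int value) {
    std::vector<int> buf(n, 0);
    std::atomic<bool> ready{false};
    bool ok = false;
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        ok = true;
        for (int v : buf) {
            if (v != value) ok = false;
        }
    });
    fill_nt(buf.data(), buf.size(), value);
    ready.store(true, std::memory_order_release);  // plain mov on x86
    reader.join();
    return ok;
}
```

The release store alone doesn't order the NT stores on x86, because it compiles to a plain `mov`; that's why the `_mm_sfence()` sits between the streaming stores and the flag.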

LFENCE is almost always irrelevant as a memory barrier because the only weakly-ordered loads are movntdqa loads from WC memory. It is a LoadLoad and also a LoadStore barrier. (loadNT / LFENCE / storeNT prevents the store from becoming globally visible before the load. I think this could happen in practice if the load address was the result of a long dependency chain, or the result of another load that missed in cache.)

ERMSB string operations

Fun fact #2 (thanks @EOF): The stores from ERMSB (Enhanced rep movsb/rep stosb on IvyBridge and later) are weakly-ordered (but not cache-bypassing). ERMSB builds on regular Fast-String Ops (wide stores from the microcoded implementation of rep stos/movsb that's been around since PPro).

Intel documents the fact that ERMSB stores "may appear to execute out of order" in section 7.3.9.3 of their Software Developers Manual, vol1. They also say:

"Order-dependent code should write to a discrete semaphore variable after any string operations to allow correctly ordered data to be seen by all processors"

They don't mention any barrier instructions being necessary between the rep movsb and the store to a data_ready flag.

The way I read it, there's an implicit SFENCE after rep stosb / rep movsb (at least a fence for the string data, probably not other in-flight weakly ordered NT stores). Anyway, the wording implies that a write to the flag / semaphore becomes globally visible after all the string-move writes, so no SFENCE / LFENCE is needed in code that fills a buffer with a fast-string op and then writes a flag, or in code that reads it.

(LoadLoad ordering always happens, so you always see data in the order that other CPUs made it globally visible. i.e. using weakly-ordered stores to write a buffer doesn't change the fact that loads in other threads are still strongly ordered.)

summary: use a normal store to write a flag indicating that a buffer is ready. Don't have readers just check the last byte of the block written with memset/memcpy.

I also think ERMSB stores prevent any later stores from passing them, so you still only need SFENCE if you're using movNT. i.e. the rep stosb as a whole has release semantics wrt. earlier instructions.

There's an MSR bit that can be cleared to disable ERMSB for the benefit of new servers that need to run old binaries that write a "data ready" flag as part of a rep stosb or rep movsb or something. (In that case I guess you get the old fast-string microcode that may use an efficient cache protocol, but does make all the stores appear to other cores in order).
