When are x86 LFENCE, SFENCE and MFENCE instructions required?
The simplest answer: you use one of the 3 fences (LFENCE, SFENCE, MFENCE) to provide one of the 6 data consistency models:
- Relaxed
- Consume
- Acquire
- Release
- Acquire-Release
- Sequential
C++11:
First, consider this problem in terms of memory access ordering, which is well documented and standardized in C++11. Start by reading: http://en.cppreference.com/w/cpp/atomic/memory_order
x86/x86_64:
1. Acquire-Release Consistency: On x86, access to conventional RAM (marked by default as WB, Write Back; the same holds for WT (Write Through) and UC (Uncacheable)) using a plain asm MOV, without any additional instructions, automatically provides the memory ordering required for Acquire-Release Consistency: std::memory_order_acq_rel.
That is, for this memory it makes sense to use std::memory_order_seq_cst only when you need Sequential Consistency. When you use std::memory_order_relaxed, std::memory_order_acquire or std::memory_order_release, the compiled assembly for std::atomic::store() (or std::atomic::load()) is the same: just a MOV without any L/S/MFENCE. (std::memory_order_acq_rel itself is valid only for read-modify-write operations, not for a plain store() or load().)
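As a minimal sketch of the paragraph above, here is the usual release/acquire message-passing pattern. The names payload, ready, producer, consumer and run_message_passing are hypothetical and chosen only for illustration. On x86 with WB memory, mainstream compilers are expected to emit a plain MOV for both the release store and the acquire load, but you should check the generated assembly for your own toolchain.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // plain, non-atomic data (hypothetical name)
std::atomic<bool> ready{false};     // flag that publishes the data

// Writer: the release store keeps the payload write ordered before the flag.
// On x86 it is expected to compile to a plain MOV, with no fence.
void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);
}

// Reader: the acquire load keeps the payload read ordered after the flag.
// On x86 it is also expected to compile to a plain MOV.
int consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one reader and one writer thread and returns what the reader saw.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    assert(seen == 42);             // release/acquire guarantees this
    return seen;
}
```

Calling run_message_passing() from main() returns 42. Replacing release/acquire with std::memory_order_relaxed would most likely produce the same x86 instructions, but the C++ compiler would then be free to reorder the payload write and the flag store.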
Note: Not only the CPU but also the C++ compiler can reorder memory operations, and all 6 memory orderings always constrain the C++ compiler regardless of CPU architecture.
Next, you need to know how this is compiled from C++ to ASM (native machine code), or how to write it in assembler. To provide any Consistency except Sequential, you can simply write MOV, for example MOV reg, [addr] and MOV [addr], reg.
2. Sequential Consistency: To provide Sequential Consistency you must use implicit (LOCK) or explicit (L/S/MFENCE) fences, as described here: Why GCC does not use LOAD(without fence) and STORE+SFENCE for Sequential Consistency? The possible mappings are:
1. LOAD (without fence) and STORE + MFENCE
2. LOAD (without fence) and LOCK XCHG
3. MFENCE + LOAD and STORE (without fence)
4. LOCK XADD ( 0 ) and STORE (without fence)
For example, GCC uses 1, but MSVC uses 2. (Note that MSVS2012 has a bug: Does the semantics of `std::memory_order_acquire` requires processor instructions on x86/x86_64? )
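To see why Sequential Consistency needs more than a plain MOV, consider the classic store-buffering litmus test, sketched below. The names x, y and count_both_zero are hypothetical. With std::memory_order_seq_cst on every access, the C++ standard rules out the outcome where both threads read 0, and on x86 the compiler has to pick one of the mappings listed above to enforce that.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> y{0};

// Runs the store-buffering test `rounds` times and counts how often both
// threads read 0. With seq_cst on every access the count must stay at 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        x.store(0, std::memory_order_seq_cst);
        y.store(0, std::memory_order_seq_cst);
        int r1 = -1;
        int r2 = -1;
        std::thread t1([&r1] {
            x.store(1, std::memory_order_seq_cst);   // e.g. MOV + MFENCE or XCHG
            r1 = y.load(std::memory_order_seq_cst);  // e.g. plain MOV
        });
        std::thread t2([&r2] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;                             // forbidden under seq_cst
        }
    }
    return both_zero;
}
```

If the stores are weakened to std::memory_order_release and the loads to std::memory_order_acquire, each store may still be sitting in its core's Store Buffer when the other core loads, so the both-zero outcome becomes permitted (although a short test run may not observe it).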
Then you can read Herb Sutter (your link): https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c
Exception to the rule:
This rule holds for access using MOV to conventional RAM marked by default as WB (Write Back). The memory type is set in the Page Table, in each PTE (Page Table Entry) for each Page (4 KB of contiguous memory).
But there are some exceptions:
1. If we mark memory in the Page Table as Write Combined (ioremap_wc() in the Linux kernel), then only Acquire Consistency is provided automatically, and we must act as described in the paragraph below.
2. See the answer to my question: https://stackoverflow.com/a/27302931/1558037
- Writes to memory are not reordered with other writes, with the following exceptions:
- writes executed with the CLFLUSH instruction;
- streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
- string operations (see Section 8.2.4.1).
In both cases 1 & 2 you must use an additional SFENCE between two writes to the same address, even if you only want Acquire-Release Consistency, because here only Acquire Consistency is provided automatically and you must do the Release (SFENCE) yourself.
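A hedged sketch of the non-temporal case: the writer fills a buffer with streaming stores (MOVNTI via the _mm_stream_si32 intrinsic), executes SFENCE via _mm_sfence, and only then publishes a flag. The names buffer, published, publish_with_nt_stores, read_sum_when_published and run_nt_publish are hypothetical. The intrinsics are used only when compiling for x86_64; on other targets the sketch falls back to plain stores so that it still builds.

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>              // _mm_stream_si32, _mm_sfence
#define HAVE_X86_NT_STORES 1
#endif

int buffer[4];                      // data written with streaming stores
std::atomic<bool> published{false};

void publish_with_nt_stores() {
    for (int i = 0; i < 4; ++i) {
#ifdef HAVE_X86_NT_STORES
        _mm_stream_si32(&buffer[i], i + 1);   // MOVNTI: weakly ordered store
#else
        buffer[i] = i + 1;                    // fallback on non-x86 targets
#endif
    }
#ifdef HAVE_X86_NT_STORES
    _mm_sfence();   // the "Release" we must do ourselves for streaming stores
#endif
    published.store(true, std::memory_order_release);
}

int read_sum_when_published() {
    while (!published.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int sum = 0;
    for (int v : buffer) {
        sum += v;
    }
    return sum;
}

// Runs one writer and one reader and returns the sum the reader computed.
int run_nt_publish() {
    published.store(false, std::memory_order_relaxed);
    for (int& v : buffer) {
        v = 0;
    }
    int sum = -1;
    std::thread reader([&sum] { sum = read_sum_when_published(); });
    std::thread writer(publish_with_nt_stores);
    writer.join();
    reader.join();
    return sum;
}
```

With the SFENCE in place the reader sees all four values, so run_nt_publish() returns 10. Without it, the flag store (an ordinary MOV) could become visible before the streaming stores.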
Answer to your two questions:
Sometimes when doing a store a CPU will write to its store buffer instead of the L1 cache. I do not however understand the terms on which a CPU will do this?
From the user's point of view, the L1 cache and the Store Buffer behave differently. L1 is fast, but the Store Buffer is faster.
Store Buffer: a simple queue that holds only writes, which cannot be reordered. It exists to increase performance and hide the latency of cache access (L1 - 1ns, L2 - 3ns, L3 - 10ns). The CPU core behaves as if the write has already reached the cache and executes the next instruction, while the write has only been placed in the Store Buffer and will be committed to the L1/2/3 cache later. That is, the CPU core does not need to wait until writes have been stored to the cache.
Cache L1/2/3: looks like a transparent associative array (address - value). It is fast but not the fastest, because x86 automatically provides Acquire-Release Consistency by using the cache coherence protocols MESIF/MOESI. This makes multithreaded programming simpler, but decreases performance. (Indeed, we can use write-contention-free algorithms and data structures without cache coherence, i.e. without MESIF/MOESI, for example over PCI Express.) The MESIF/MOESI protocols work over QPI, which connects the cores within a CPU and the cores of different CPUs in multiprocessor systems (ccNUMA).
CPU2 may wish to load a value which has been written in to CPU1's store buffer. As I understand it, the problem is CPU2 cannot see the new value in CPU1's store buffer.
Yes.
Why can't the MESI protocol just include flushing store buffers as part of its protocol??
The MESI protocol can't just include flushing store buffers as part of its protocol, because:
- The MESI/MOESI/MESIF protocols are not related to the Store Buffer and do not know about it.
- Automatically flushing the Store Buffer on every write would decrease performance and would make the Store Buffer useless.
- Manually flushing the Store Buffer on all remote CPU cores with some instruction (we don't know which core's Store Buffer contains the required write) would decrease performance (with 8 CPUs x 15 cores = 120 cores flushing their Store Buffers at the same time, this is terrible).
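But manually flushing the Store Buffer on the current CPU core is possible: you can do it by executing a fence instruction. You need a fence in two cases:
- To provide Sequential Consistency on RAM with Write Back caching. Here SFENCE alone is not sufficient, because later loads may still pass it; use MFENCE or a LOCK-prefixed instruction, as in the four mappings listed earlier.
- To provide Acquire-Release Consistency in the exceptions to the rule: RAM with Write Combined caching, writes executed with the CLFLUSH instruction, and non-temporal SSE/AVX stores. Here SFENCE is sufficient.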
Note:
Do we need LFENCE in any case on x86/x86_64? The answer is not always clear: Does it make any sense instruction LFENCE in processors x86/x86_64?
Other platform:
Then you can read how this works in theory (for a spherical processor in a vacuum) with a Store Buffer and an Invalidate Queue, your link: http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
And how Sequential Consistency can be provided on other platforms, not only with L/S/MFENCE and LOCK but also with LL/SC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html