Why does this `std::atomic_thread_fence` work
Question
Firstly I want to list some of my understandings regarding this; please correct me if I'm wrong.
- An `MFENCE` in x86 can ensure a full barrier. Sequential Consistency prevents reordering of STORE-STORE, STORE-LOAD, LOAD-STORE and LOAD-LOAD.
This is according to Wikipedia.
- `std::memory_order_seq_cst` makes no guarantee to prevent STORE-LOAD reordering.
This is according to Alex's answer, "Loads May Be Reordered with Earlier Stores to Different Locations" (for x86), and `mfence` will not always be added.
- Does `std::memory_order_seq_cst` indicate Sequential Consistency? According to points 2/3, it seems not correct to me. `std::memory_order_seq_cst` indicates Sequential Consistency only when
- at least one explicit `MFENCE` is added to either `LOAD` or `STORE`
- `LOAD` (without fence) and `LOCK XCHG`
- `LOCK XADD(0)` and `STORE` (without fence)
otherwise there will still be possible reorderings.
According to @LWimsey's comment, I made a mistake here: if both the `LOAD` and `STORE` are `memory_order_seq_cst`, there's no reordering. Alex may have indicated situations where non-atomic or non-SC operations are used.
- `std::atomic_thread_fence(memory_order_seq_cst)` always generates a full barrier.
This is according to Alex's answer. So I can always replace `asm volatile("mfence" ::: "memory")` with `std::atomic_thread_fence(memory_order_seq_cst)`.
This is quite strange to me, because `memory_order_seq_cst` seems to have quite a different usage between atomic functions and fence functions.
Now I come to this code in a header file of MSVC 2015's standard library, which implements `std::atomic_thread_fence`:
inline void _Atomic_thread_fence(memory_order _Order)
{ /* force memory visibility and inhibit compiler reordering */
#if defined(_M_ARM) || defined(_M_ARM64)
if (_Order != memory_order_relaxed)
{
_Memory_barrier();
}
#else
_Compiler_barrier();
if (_Order == memory_order_seq_cst)
{ /* force visibility */
static _Uint4_t _Guard;
_Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst);
_Compiler_barrier();
}
#endif
}
So my major question is: how can `_Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst);` create a full barrier like `MFENCE`, or what is actually done to enable an `MFENCE`-equivalent mechanism? A `_Compiler_barrier()` is obviously not enough here for a full memory barrier. Or does this statement work somewhat like point 3?
Answer
So my major question is how can `_Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst);` create a full barrier `MFENCE`
This compiles to an `xchg` instruction with a memory destination. This is a full memory barrier (draining the store buffer) exactly¹ like `mfence`.
With compiler barriers before and after it, compile-time reordering around it is also prevented. Therefore all reordering in either direction is prevented (of operations on atomic and non-atomic C++ objects), making it more than strong enough to do everything that ISO C++ `atomic_thread_fence(mo_seq_cst)` promises.
For orders weaker than seq_cst, only a compiler barrier is needed. x86's hardware memory-ordering model is program order plus a store buffer with store forwarding. That's strong enough for `acq_rel` without the compiler emitting any special asm instructions, just blocking compile-time reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/
Footnote 1: exactly enough for the purposes of `std::atomic`. Weakly-ordered MOVNTDQA loads from WC memory may not be as strictly ordered by `lock`ed instructions as by MFENCE.
- Which is a better write barrier on x86: lock+addl or xchgl?
- Does lock xchg have the same behavior as mfence? Equivalent for `std::atomic` purposes, but with some minor differences that might matter for a device driver using WC memory regions, and perf differences, notably on Skylake where `mfence` blocks OoO exec like `lfence`.
- Why is LOCK a full barrier on x86?
Atomic read-modify-write (RMW) operations on x86 are only possible with a `lock` prefix, or `xchg` with memory, which behaves that way even without a lock prefix in the machine code. A lock-prefixed instruction (or `xchg` with mem) is always a full memory barrier.
Using an instruction like `lock add dword [esp], 0` as a substitute for `mfence` is a well-known technique. (And it performs better on some CPUs.) This MSVC code is the same idea, but instead of a no-op on whatever the stack pointer is pointing to, it does an `xchg` on a dummy variable. It doesn't actually matter where it is, but a cache line that's only ever accessed by the current core and is already hot in cache is the best choice for performance.
Using a `static` shared variable that all cores will contend for access to is the worst possible choice; this code is terrible! Interacting with the same cache line as other cores is not necessary to control the order of this core's operations on its own L1d cache. This is completely bonkers. MSVC apparently still uses this horrible code in its implementation of `std::atomic_thread_fence()`, even for x86-64 where `mfence` is guaranteed available. (Godbolt with MSVC 19.14)
If you're doing a seq_cst store, your options are `mov` + `mfence` (gcc does this) or doing the store and the barrier with a single `xchg` (clang and MSVC do this, so the codegen is fine, with no shared dummy var).
Most of the early part of this question (the stated "facts") seems wrong, and contains some misinterpretations or statements misleading to the point of not even being wrong.
`std::memory_order_seq_cst` makes no guarantee to prevent STORE-LOAD reorder.
C++ guarantees order using a totally different model, where an acquire load that sees a value from a release store "synchronizes with" it, and later operations in the C++ source are guaranteed to see all the stores from code before the release store.
It also guarantees that there's a total order of all seq_cst operations, even across different objects. (Weaker orders allow a thread to reload its own stores before they become globally visible, i.e. store forwarding. That's why only seq_cst has to drain the store buffer. They also allow IRIW reordering: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
Concepts like StoreLoad reordering are based on a model where:
- all inter-core communication happens by committing stores to cache-coherent shared memory
- reordering happens inside one core, between its own accesses to cache, e.g. the store buffer delaying store visibility until after later loads, as x86 allows. (Except that a core can see its own stores early via store forwarding.)
In terms of this model, seq_cst does require draining the store buffer at some point between a seq_cst store and a later seq_cst load. The efficient way to implement this is to put a full barrier after seq_cst stores (instead of before every seq_cst load; cheap loads are more important than cheap stores).
On an ISA like AArch64, there are load-acquire and store-release instructions which actually have sequential-release semantics, unlike x86 loads/stores which are "only" regular release. (So AArch64 seq_cst doesn't need a separate barrier; a microarchitecture could delay draining the store buffer unless/until a load-acquire executes while there's still a store-release not yet committed to L1d cache.) Other ISAs generally need a full barrier instruction to drain the store buffer after a seq_cst store.
Of course, even AArch64 needs a full barrier instruction for a `seq_cst` fence, unlike a `seq_cst` load or store operation.
`std::atomic_thread_fence(memory_order_seq_cst)` always generates a full barrier. So I can always replace `asm volatile("mfence" ::: "memory")` with `std::atomic_thread_fence(memory_order_seq_cst)`.

In practice yes, but in theory an implementation could maybe allow some reordering of non-atomic operations around `std::atomic_thread_fence` and still be standards-compliant. Always is a very strong word.
ISO C++ only guarantees anything when there are `std::atomic` load or store operations involved. GNU C++ would let you roll your own atomic operations out of `asm("" ::: "memory")` compiler barriers (acq_rel) and `asm("mfence" ::: "memory")` full barriers. Converting that to ISO C++ `signal_fence` and `thread_fence` would leave a "portable" ISO C++ program that has data-race UB and thus no guarantee of anything.
(Although note that rolling your own atomics should use at least `volatile`, not just barriers, to make sure the compiler doesn't invent multiple loads, even if you avoid the obvious problem of having loads hoisted out of a loop: Who's afraid of a big bad optimizing compiler?)
Always remember that what an implementation does has to be at least as strong as what ISO C++ guarantees. That often ends up being stronger.