Why does this `std::atomic_thread_fence` work

Question

Firstly I want to list some of my understandings regarding this; please correct me if I'm wrong.

  1. An MFENCE on x86 ensures a full barrier
  2. Sequential consistency prevents reordering of STORE-STORE, STORE-LOAD, LOAD-STORE and LOAD-LOAD

This is according to Wikipedia.

std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reordering.

This is according to Alex's answer, "Loads May Be Reordered with Earlier Stores to Different Locations" (for x86), and mfence will not always be added.

Does std::memory_order_seq_cst indicate sequential consistency? According to points 2/3, it seems not correct to me. std::memory_order_seq_cst indicates sequential consistency only when:

  1. at least one explicit MFENCE is added to either the LOAD or the STORE
  2. LOAD (without fence) and LOCK XCHG
  3. LOCK XADD(0) and STORE (without fence)

Otherwise there will still be possible reorderings.

According to @LWimsey's comment, I made a mistake here: if both the LOAD and the STORE are memory_order_seq_cst, there is no reordering. Alex may have been indicating situations where non-atomic or non-SC operations are used.

std::atomic_thread_fence(memory_order_seq_cst) always generates a full barrier

This is according to Alex's answer. So I can always replace asm volatile("mfence" ::: "memory") with std::atomic_thread_fence(memory_order_seq_cst).

This is quite strange to me, because memory_order_seq_cst seems to be used quite differently between atomic functions and fence functions.

Now I come to this code in a header file of MSVC 2015's standard library, which implements std::atomic_thread_fence:

inline void _Atomic_thread_fence(memory_order _Order)
    {   /* force memory visibility and inhibit compiler reordering */
 #if defined(_M_ARM) || defined(_M_ARM64)
    if (_Order != memory_order_relaxed)
        {
        _Memory_barrier();
        }

 #else
    _Compiler_barrier();
    if (_Order == memory_order_seq_cst)
        {   /* force visibility */
        static _Uint4_t _Guard;
        _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst);
        _Compiler_barrier();
        }
 #endif
    }

So my major question is: how can _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst); create a full barrier like MFENCE? What has actually been done to enable a mechanism equivalent to MFENCE, given that a _Compiler_barrier() is obviously not enough here for a full memory barrier? Or does this statement work somewhat like point 3?

Answer

So my major question is how can _Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst); create a full barrier MFENCE

This compiles to an xchg instruction with a memory destination. This is a full memory barrier (draining the store buffer), exactly[1] like mfence.

With compiler barriers before and after that, compile-time reordering around it is also prevented. Therefore all reordering in either direction is prevented (of operations on atomic and non-atomic C++ objects), making it more than strong enough to do everything that ISO C++ atomic_thread_fence(mo_seq_cst) promises.

For orders weaker than seq_cst, only a compiler barrier is needed. x86's hardware memory-ordering model is program-order + a store buffer with store forwarding. That's strong enough for acq_rel without the compiler emitting any special asm instructions, just blocking compile-time reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/
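
To make that concrete, here is a sketch of the codegen one would expect on x86 (the function name is made up; worth verifying on Godbolt for any given compiler):

#include <atomic>

void fences()
{
    // On x86 these three are compiler barriers only: no instruction is
    // emitted; they just block compile-time reordering.
    std::atomic_thread_fence(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_acq_rel);

    // Only the seq_cst fence costs a real instruction on x86:
    // mfence, or a dummy locked operation as in the MSVC code above.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}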

Footnote 1: exactly enough for the purposes of std::atomic. Weakly ordered MOVNTDQA loads from WC memory may not be as strictly ordered by locked instructions as by MFENCE.

  • Which is a better write barrier on x86: lock+addl or xchgl?
  • Does lock xchg have the same behavior as mfence? - equivalent for std::atomic purposes, but some minor differences that might matter for a device driver using WC memory regions. And perf differences. Notably on Skylake where mfence blocks OoO exec like lfence
  • Why is LOCK a full barrier on x86?

Atomic read-modify-write (RMW) operations on x86 are only possible with a lock prefix, or with xchg with memory, which behaves that way even without a lock prefix in the machine code. A lock-prefixed instruction (or xchg with mem) is always a full memory barrier.

Using an instruction like lock add dword [esp], 0 as a substitute for mfence is a well-known technique. (And it performs better on some CPUs.) This MSVC code is the same idea, but instead of a no-op on whatever the stack pointer is pointing to, it does an xchg on a dummy variable. It doesn't actually matter where it is, but a cache line that's only ever accessed by the current core and is already hot in cache is the best choice for performance.
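
As a minimal sketch of that technique (GNU C++ inline asm for x86-64, not MSVC's actual code; the function name is hypothetical):

static inline void full_barrier_via_locked_op()
{
    // "lock or" of 0 into the word at the top of the stack: the store is a
    // no-op, but the lock prefix makes it a full memory barrier like mfence.
    // The cache line at (%rsp) is private to this thread and hot in L1d.
    asm volatile("lock orl $0, (%%rsp)" ::: "memory", "cc");
}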

Using a static shared variable that all cores will contend for access to is the worst possible choice; this code is terrible! Interacting with the same cache line as other cores is not necessary to control the order of this core's operations on its own L1d cache. This is completely bonkers. MSVC still apparently uses this horrible code in its implementation of std::atomic_thread_fence(), even for x86-64 where mfence is guaranteed available. (Godbolt with MSVC 19.14)

If you're doing a seq_cst store, your options are mov + mfence (gcc does this) or doing the store and the barrier with a single xchg (clang and MSVC do this, so their codegen is fine, with no shared dummy variable).
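
For example, a plain seq_cst store typically compiles as follows (a sketch; exact sequences vary by compiler version and are easy to check on Godbolt):

#include <atomic>

std::atomic<int> x{0};

void seq_cst_store(int v)
{
    x.store(v, std::memory_order_seq_cst);
    // Typical x86-64 codegen:
    //   gcc:         mov DWORD PTR x[rip], edi
    //                mfence
    //   clang/MSVC:  xchg DWORD PTR x[rip], edi   ; store + full barrier in one
}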

Most of the early part of the question (the stated "facts") seems wrong, and contains some misconceptions or things misleading enough to be not even wrong.

std::memory_order_seq_cst makes no guarantee to prevent STORE-LOAD reorder.

C++ guarantees order using a totally different model, where acquire loads that see a value from a release store "synchronize with" it, and later operations in the C++ source are guaranteed to see all the stores from code before the release store.
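
A minimal sketch of that synchronizes-with model (standard C++, runnable as written; the names are made up for illustration):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // happens before the release store
    ready.store(true, std::memory_order_release);  // release store
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) {}  // spin until the acquire load sees true
    assert(payload == 42);  // guaranteed: the acquire load synchronized with the release store
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}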

It also guarantees that there's a total order of all seq_cst operations even across different objects. (Weaker orders allow a thread to reload its own stores before they become globally visible, i.e. store forwarding. That's why only seq_cst has to drain the store buffer. They also allow IRIW reordering. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
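
The IRIW case can be written as a litmus test (a sketch with hypothetical variable names; run each function in its own thread):

#include <atomic>

std::atomic<int> A{0}, B{0};
int r1, r2, r3, r4;

void writer1() { A.store(1, std::memory_order_seq_cst); }
void writer2() { B.store(1, std::memory_order_seq_cst); }

void reader1() { r1 = A.load(std::memory_order_seq_cst); r2 = B.load(std::memory_order_seq_cst); }
void reader2() { r3 = B.load(std::memory_order_seq_cst); r4 = A.load(std::memory_order_seq_cst); }

// With seq_cst the two independent stores have a single total order, so the
// readers must agree on it: r1==1 && r2==0 && r3==1 && r4==0 is forbidden.
// With acquire/release the readers may disagree (IRIW reordering).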

Concepts like StoreLoad reordering are based on a model where:

  • all inter-core communication happens via committing stores to cache-coherent shared memory
  • reordering happens inside one core, between its own accesses to cache, e.g. the store buffer delaying store visibility until after later loads, as x86 allows. (Except that a core can see its own stores early via store forwarding.)

In terms of this model, seq_cst does require draining the store buffer at some point between a seq_cst store and a later seq_cst load. The efficient way to implement this is to put a full barrier after seq_cst stores. (Instead of before every seq_cst load. Cheap loads are more important than cheap stores.)
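
The classic StoreLoad litmus test (Dekker-style) shows why (a sketch; run each function in its own thread):

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1()
{
    X.store(1, std::memory_order_seq_cst);   // on x86: xchg, i.e. store + full barrier
    r1 = Y.load(std::memory_order_seq_cst);  // on x86: plain mov
}

void thread2()
{
    Y.store(1, std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_seq_cst);
}

// With seq_cst, r1==0 && r2==0 is impossible. With weaker orders it is
// allowed on x86: each store can sit in the store buffer while the other
// thread's load executes, i.e. StoreLoad reordering.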

On an ISA like AArch64, there are load-acquire and store-release instructions which actually have sequential-release semantics, unlike x86 loads/stores which are "only" regular release. (So AArch64 seq_cst doesn't need a separate barrier; a microarchitecture could delay draining the store buffer unless / until a load-acquire executes while there's still a store-release not committed to L1d cache yet.) Other ISAs generally need a full barrier instruction to drain the store buffer after a seq_cst store.

Of course even AArch64 needs a full barrier instruction for a seq_cst fence, unlike a seq_cst load or store operation.
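
A sketch of the typical codegen difference (easy to confirm on Godbolt; exact sequences vary by compiler, and the function names are made up):

#include <atomic>

std::atomic<int> g{0};

// seq_cst operations: AArch64 has sequentially consistent load/store
// instructions, so no separate barrier is emitted.
void op_store(int v) { g.store(v, std::memory_order_seq_cst); }    // AArch64: stlr / x86-64: xchg
int  op_load()       { return g.load(std::memory_order_seq_cst); } // AArch64: ldar / x86-64: mov

// A seq_cst fence needs a real full-barrier instruction even on AArch64.
void fence() { std::atomic_thread_fence(std::memory_order_seq_cst); } // AArch64: dmb ish / x86-64: mfence or a locked op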

std::atomic_thread_fence(memory_order_seq_cst) always generates a full barrier

In practice, yes.

So I can always replace asm volatile("mfence" ::: "memory") with std::atomic_thread_fence(memory_order_seq_cst)

In practice yes, but in theory an implementation could maybe allow some reordering of non-atomic operations around std::atomic_thread_fence and still be standards-compliant. Always is a very strong word.

ISO C++ only guarantees anything when there are std::atomic load or store operations involved. GNU C++ would let you roll your own atomic operations out of asm("" ::: "memory") compiler barriers (acq_rel) and asm("mfence" ::: "memory") full barriers. Converting that to ISO C++ signal_fence and thread_fence would leave a "portable" ISO C++ program that has data-race UB and thus no guarantee of anything.

(Although note that rolling your own atomics should use at least volatile, not just barriers, to make sure the compiler doesn't invent multiple loads, even if you avoid the obvious problem of having loads hoisted out of a loop. Who's afraid of a big bad optimizing compiler?).
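
A hedged sketch of what that roll-your-own style looks like in GNU C++ on x86 (this has data-race UB in ISO C++ terms and is shown only to illustrate the point; the helper names are invented):

static inline int my_load_acquire(volatile int *p)
{
    int v = *p;            // volatile: the compiler must do exactly one real load
    asm("" ::: "memory");  // compiler barrier; x86 loads are already acquire
    return v;
}

static inline void my_store_release(volatile int *p, int v)
{
    asm("" ::: "memory");  // compiler barrier; x86 stores are already release
    *p = v;                // volatile store
}

static inline void my_full_barrier()
{
    asm volatile("mfence" ::: "memory");  // real full barrier (StoreLoad included)
}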

Always remember that what an implementation does has to be at least as strong as what ISO C++ guarantees. That often ends up being stronger.
