Is the transformation of fetch_add(0, memory_order_relaxed/release) to mfence + mov legal?

Question

The paper N4455 No Sane Compiler Would Optimize Atomics talks about various optimizations compilers can apply to atomics. Under the section Optimization Around Atomics, for the seqlock example, it mentions a transformation implemented in LLVM, where a fetch_add(0, std::memory_order_release) is turned into an mfence followed by a plain load, rather than the usual lock add or xadd. The idea is that this avoids taking exclusive access of the cache line, and is relatively cheaper. The mfence is still required, regardless of the ordering constraint supplied, to prevent StoreLoad reordering with respect to the mov instruction generated.

This transformation is performed for such read-don't-modify-write operations regardless of the ordering, and equivalent assembly is produced for fetch_add(0, memory_order_relaxed).
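For concreteness, here is a minimal sketch of such a read-don't-modify-write (the variable and function names are illustrative, not taken from N4455 or LLVM):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> seq{0};

std::uint64_t read_dont_modify_write() {
    // An RMW that adds 0: used only for its ordering and "latest value"
    // properties, e.g. re-checking a seqlock's sequence counter.
    return seq.fetch_add(0, std::memory_order_release);
    // Usual x86-64 lowering:       lock xadd (or lock add if the result is unused)
    // LLVM's alternative lowering: mfence ; mov  -- a plain load preceded by a
    // full fence, avoiding exclusive ownership of the cache line.
}
```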

However, I am wondering if this is legal. The C++ standard explicitly notes under [atomic.order] that:

Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

This fact about RMW operations seeing the 'latest' value has also been noted previously by Anthony Williams.

My question is: is there a difference in the values the thread could observe, relative to the modification order of the atomic variable, depending on whether the compiler emits a lock add or an mfence followed by a plain load? Is it possible for this transformation to cause the thread performing the RMW operation to instead load values older than the latest one? Does this violate the guarantees of the C++ memory model?

Solution

(I started writing this a while ago but got stalled; I'm not sure it adds up to a full answer, but thought some of this might be worth posting. I think @LWimsey's comments do a better job of getting to the heart of an answer than what I wrote.)

Yes, it's safe.

Keep in mind that the way the as-if rule applies is that execution on the real machine has to always produce a result that matches one possible execution on the C++ abstract machine. It's legal for optimizations to make some executions that the C++ abstract machine allows impossible on the target. Even compiling for x86 at all makes all IRIW reordering impossible, for example, whether the compiler likes it or not. (See below; some PowerPC hardware is the only mainstream hardware that can do it in practice.)


I think the reason that wording is there for RMWs specifically is that it ties the load to the "modification order" which ISO C++ requires to exist for each atomic object separately. (Maybe.)

Remember that the way C++ formally defines its ordering model is in terms of synchronizes-with, and the existence of a modification order for each object (one that all threads can agree on). This is unlike hardware, where the notion of coherent caches¹ creates a single coherent view of memory that every core accesses. The existence of coherent shared memory (typically kept coherent at all times with MESI) makes a bunch of things implicit, like the impossibility of reading "stale" values. (Although HW memory models do typically document this explicitly, like C++ does.)

Thus the transformation is safe.

ISO C++ does mention the concept of coherency in a note in another section: http://eel.is/c++draft/intro.races#14

The value of an atomic object M, as determined by evaluation B, shall be the value stored by some side effect A that modifies M, where B does not happen before A.
[Note 14: The set of such side effects is also restricted by the rest of the rules described here, and in particular, by the coherence requirements below. — end note]

...

[Note 19: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note]

[Note 20: The value observed by a load of an atomic depends on the "happens before" relation, which depends on the values observed by loads of atomics. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the "happens before" relation derived as described above, satisfy the resulting constraints as imposed here. — end note]

So ISO C++ itself notes that cache coherence gives some ordering, and x86 has coherent caches. (I'm not making a complete argument that this is enough ordering, sorry. LWimsey's comments about what it even means to be the latest in a modification order are relevant.)
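As a concrete illustration of the coherence requirement in Note 19, consider two relaxed loads of the same atomic object (a hypothetical litmus test, not taken from the standard):

```cpp
#include <atomic>

std::atomic<int> x{0};

void writer() {
    x.store(1, std::memory_order_relaxed);   // x's modification order: 0, then 1
}

void reader() {
    int r1 = x.load(std::memory_order_relaxed);
    int r2 = x.load(std::memory_order_relaxed);
    // Read-read coherence: the later load may not observe an earlier value in
    // x's modification order than the first load did, even though both are
    // relaxed. Allowed results: (0,0), (0,1), (1,1). Forbidden: (1,0).
    (void)r1; (void)r2;
}
```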

(On many ISAs (but not all), the hardware memory model also rules out IRIW reordering, where stores to 2 separate objects are seen in different orders by different readers (whereas, e.g., on PowerPC, 2 reader threads can disagree about the order of 2 stores to 2 separate objects). Very few implementations can create such reordering: if shared cache is the only way data can get between cores, as on most CPUs, that creates an order for stores.)
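The IRIW (Independent Reads of Independent Writes) litmus test mentioned above looks like this in C++ (a hypothetical sketch, not code from the answer; thread creation omitted):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;   // each written by exactly one reader thread

void writer_x() { x.store(1, std::memory_order_seq_cst); }
void writer_y() { y.store(1, std::memory_order_seq_cst); }

void reader_1() { r1 = x.load(std::memory_order_seq_cst);
                  r2 = y.load(std::memory_order_seq_cst); }
void reader_2() { r3 = y.load(std::memory_order_seq_cst);
                  r4 = x.load(std::memory_order_seq_cst); }

// IRIW reordering is the outcome r1==1, r2==0, r3==1, r4==0: the two readers
// disagree about which independent store became visible first. seq_cst forbids
// it; with acquire loads it is allowed by C++ and observable on some PowerPC
// hardware, but compiling for x86 already makes it impossible.
```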

Is it possible for this transformation to cause the thread performing the RMW operation to instead load values older than the latest one?

On x86 specifically, it's very easy to reason about. x86 has a strongly-ordered memory model (TSO = Total Store Order = program order + a store buffer with store-forwarding).
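A classic store-buffer litmus test (not from the question or the answer) shows the StoreLoad reordering that the mfence in the mfence + mov lowering exists to prevent:

```cpp
#include <atomic>

std::atomic<int> a{0}, b{0};
int r1, r2;

void thread_1() {
    a.store(1, std::memory_order_seq_cst);
    r1 = b.load(std::memory_order_seq_cst);
}
void thread_2() {
    b.store(1, std::memory_order_seq_cst);
    r2 = a.load(std::memory_order_seq_cst);
}

// With only plain mov stores and loads, x86-TSO would allow r1 == 0 && r2 == 0:
// each core's store can still be sitting in its store buffer when the other
// core's load executes (StoreLoad reordering). Draining the store buffer with
// mfence between the store and the load, or using a lock-prefixed RMW, forbids
// that outcome; that is the ordering strength the mfence + mov lowering has to
// preserve to match lock add / xadd.
```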

Footnote 1: All cores that std::thread can run across have coherent caches. True on all real-world C++ implementations across all ISAs, not just x86-64. There are some heterogeneous boards with separate CPUs sharing memory without cache coherency, but ordinary C++ threads of the same process won't be running across those different cores. See this answer for more details about that.
