Are memory orderings: consume, acq_rel and seq_cst ever needed on Intel x86?

Question

C++11 specifies six memory orderings:

typedef enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
} memory_order;

https://en.cppreference.com/w/cpp/atomic/memory_order

where the default is seq_cst.

Performance gains can be found by relaxing the memory ordering of operations. However, this depends on what protections the architecture provides. For example, Intel x86 has a strong memory model and guarantees that various load/store combinations will not be re-ordered.

As such, relaxed, acquire and release seem to be the only orderings required when seeking additional performance on x86.

Is this correct? If not, is there ever a need to use consume, acq_rel and seq_cst on x86?

Answer

If you care about portable performance, you should ideally write your C++ source with the minimum necessary ordering for each operation. The only thing that really costs "extra" on x86 is mo_seq_cst for a pure store, so make a point of avoiding that even for x86.
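
For illustration, here is a minimal sketch of that cost difference (not code from the question; the codegen comments describe the usual GCC/Clang mapping for x86, not a guarantee):

#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

void publish_release() {
    data.store(42, std::memory_order_relaxed);
    // Release store: on x86 this is just a plain mov, because ordinary
    // x86 stores already have release semantics in hardware.
    ready.store(true, std::memory_order_release);
}

void publish_seq_cst() {
    data.store(42, std::memory_order_relaxed);
    // seq_cst store: compilers typically emit xchg (or mov + mfence) to get
    // the required StoreLoad ordering -- this is the "extra" cost on x86.
    ready.store(true, std::memory_order_seq_cst);
}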

(relaxed ops can also allow more compile-time optimization of the surrounding non-atomic operations, e.g. CSE and dead store elimination, because relaxed ops avoid a compiler barrier. If you don't need any order wrt. surrounding code, tell the compiler that fact so it can optimize.)
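
For example, a sketch of a relaxed statistics counter (hypothetical code): the increment requests no ordering, so it is not a compiler barrier for the non-atomic store next to it. Whether a given compiler actually exploits that freedom varies.

#include <atomic>

int results[64];                    // plain data, published later by some other sync
std::atomic<long> dirty_count{0};   // statistics counter

void touch(int i, int value) {
    results[i] = value;             // non-atomic store
    // Relaxed: not a release, so the compiler may in principle sink, combine,
    // or dead-store-eliminate the non-atomic store above across this
    // increment. With release/seq_cst ordering, results[i] would have to be
    // written out before the counter is bumped. (On x86 the lock'ed add is
    // still a full hardware barrier either way; the gain is compile-time
    // freedom.)
    dirty_count.fetch_add(1, std::memory_order_relaxed);
}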

Keep in mind that you can't fully test weaker orders if you only have x86 hardware, especially atomic RMWs with only acquire or release, so in practice it's safer to leave your RMWs as seq_cst if you're doing anything that's already complicated and hard to reason about correctness.

There are very few use-cases where seq_cst is required (draining the store buffer before later loads can happen). Almost always a weaker order like acquire or release would also be safe.
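
The classic pattern that does need it is a Dekker / Peterson style "store my flag, then check the other thread's flag", where StoreLoad reordering through the store buffer must be prevented; it's the same effect the Preshing article linked below demonstrates. A minimal sketch:

#include <atomic>

std::atomic<bool> flag0{false}, flag1{false};

// Each thread announces itself, then checks the other. With only
// release/acquire, the store could be reordered after the load through the
// store buffer, and both threads could read `false` and "win" at the same
// time. seq_cst forbids that StoreLoad reordering.
bool thread0_wants_in() {
    flag0.store(true, std::memory_order_seq_cst);
    return !flag1.load(std::memory_order_seq_cst);
}

bool thread1_wants_in() {
    flag1.store(true, std::memory_order_seq_cst);
    return !flag0.load(std::memory_order_seq_cst);
}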

There are artificial cases like https://preshing.com/20120515/memory-reordering-caught-in-the-act/, but even implementing locking generally only requires acquire and release ordering. (Of course taking a lock does require an atomic RMW, so on x86 that might as well be seq_cst.) One practical use-case I came up with was to have multiple threads set bits in an array. Avoid atomic RMWs and detect when one thread stepped on another by re-checking values that were recently stored. You have to wait until your stores are globally visible before you can safely reload them to check.
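
A hypothetical sketch of that bit-setting idea (illustrative only, not the exact code from that use-case): plain loads/stores instead of a lock'ed RMW, with a seq_cst fence so our store is globally visible before we re-check for a conflicting writer.

#include <atomic>
#include <cstdint>

std::atomic<uint8_t> bitmap[1024];   // hypothetical shared bit array

// Set a bit with plain loads/stores (no lock'ed RMW), then verify it stuck.
bool set_bit_and_check(int i, uint8_t my_bit) {
    uint8_t old = bitmap[i].load(std::memory_order_relaxed);
    bitmap[i].store(old | my_bit, std::memory_order_relaxed);   // not atomic as a whole
    // StoreLoad ordering: make sure our store is globally visible before
    // reloading, otherwise we could just read back our own store from the
    // store buffer and miss a conflicting writer.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return (bitmap[i].load(std::memory_order_relaxed) & my_bit) != 0;
}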

"As such relaxed, acquire and release seem to be the only orderings required on x86."

From one POV, in C++ source you don't require any ordering weaker than seq_cst (except for performance); that's why it's the default for all std::atomic functions. Remember you're writing C++, not x86 asm.

Or if you mean to describe the full range of what x86 asm can do, then it's acq for loads, rel for pure stores, and seq_cst for atomic RMWs. (The lock prefix is a full barrier; fetch_add(1, relaxed) compiles to the same asm as seq_cst). x86 asm can't do a relaxed load or store (see footnote 1).
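
That conventional mapping, sketched below as comments (typical GCC/Clang output for x86-64; the exact instructions are a compiler choice, not a guarantee):

#include <atomic>

std::atomic<int> x{0};

int  load_relaxed()  { return x.load(std::memory_order_relaxed); }   // mov
int  load_acquire()  { return x.load(std::memory_order_acquire); }   // mov (same asm)
void store_relaxed() { x.store(1, std::memory_order_relaxed); }      // mov
void store_release() { x.store(1, std::memory_order_release); }      // mov (same asm)
void store_seq_cst() { x.store(1, std::memory_order_seq_cst); }      // xchg (or mov + mfence)
int  rmw_relaxed()   { return x.fetch_add(1, std::memory_order_relaxed); }  // lock xadd: full barrier anyway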

The only benefit to using relaxed in C++ (when compiling for x86) is to allow more optimization of surrounding non-atomic operations by reordering at compile time, e.g. to allow optimizations like store coalescing and dead-store elimination. Always remember that you're not writing x86 asm; the C++ memory model applies for compile-time ordering / optimization decisions.

acq_rel and seq_cst are nearly identical for atomic RMW operations in ISO C++; I think there is no difference when compiling for ISAs like x86 and ARMv8 that are multi-copy-atomic. (No IRIW reordering, which e.g. POWER can do by store-forwarding between SMT threads before a store commits to L1d.) See: How do memory_order_seq_cst and memory_order_acq_rel differ?

For barriers, atomic_thread_fence(mo_acq_rel) compiles to zero instructions on x86, while fence(seq_cst) compiles to mfence or a faster equivalent (e.g. a dummy locked instruction on some stack memory). See: When is a memory_order_seq_cst fence useful?
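
A small sketch of the two fences side by side (the comments describe typical x86 codegen):

#include <atomic>

std::atomic<int> a{0}, b{0};

void fence_demo() {
    a.store(1, std::memory_order_relaxed);
    // acq_rel fence: forbids every reordering except StoreLoad, which x86
    // hardware already forbids, so this compiles to zero instructions (it
    // still acts as a compiler barrier).
    std::atomic_thread_fence(std::memory_order_acq_rel);
    b.store(1, std::memory_order_relaxed);

    // seq_cst fence: additionally blocks StoreLoad reordering, so on x86 it
    // costs mfence or a dummy lock'ed instruction.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    int r = a.load(std::memory_order_relaxed);
    (void)r;
}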

You could say acq_rel and consume are truly useless if you're only compiling for x86. consume was intended to expose the dependency ordering that most weakly-ordered ISAs provide (notably not DEC Alpha). But unfortunately it was designed in a way that compilers couldn't implement safely, so they currently just give up and promote it to acquire, which costs a barrier on some weakly-ordered ISAs. But on x86, acquire is "free" so it's fine.

If you actually do need efficient consume, e.g. for RCU, your only real option is to use relaxed and not give the compiler enough information to optimize away the data dependency from the asm it generates. See: C++11: the difference between memory_order_relaxed and memory_order_consume.
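
A hypothetical RCU-style reader along those lines: it leans on the hardware dependency ordering that mo_consume was meant to expose, and is only as safe as the compiler's inability to break the data dependency, which is exactly the fragile part.

#include <atomic>

struct Node { int payload; };

std::atomic<Node*> published{nullptr};   // hypothetical RCU-protected pointer

int reader() {
    // Relaxed instead of consume/acquire: we rely on the CPU ordering the
    // dependent load p->payload after the pointer load (true on everything
    // except DEC Alpha), and on the compiler not finding a way to break the
    // data dependency. That second part is not guaranteed by ISO C++, which
    // is exactly why this is fragile and why mo_consume was invented.
    Node* p = published.load(std::memory_order_relaxed);
    if (!p) return -1;
    return p->payload;
}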

Footnote 1: I'm not counting movnt as a relaxed atomic store because the usual C++ -> asm mapping for release operations uses just a mov store, not sfence, and thus would not order an NT store. i.e. std::atomic leaves it up to you to use _mm_sfence() if you'd been messing around with _mm_stream_ps() stores.
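
For example, a sketch (names and sizes invented for illustration) of doing that fencing by hand around NT-store intrinsics:

#include <atomic>
#include <immintrin.h>

alignas(16) float buf[4096];
std::atomic<bool> done{false};

void producer() {
    const __m128 v = _mm_set1_ps(1.0f);
    for (int i = 0; i < 4096; i += 4)
        _mm_stream_ps(&buf[i], v);   // NT stores: weakly ordered, bypass the cache
    // A plain release store (just mov) would NOT order the NT stores above,
    // so drain the write-combining buffers explicitly first:
    _mm_sfence();
    done.store(true, std::memory_order_release);
}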

PS: this entire answer is assuming normal WB (write-back) cacheable memory regions. If you just use C++ normally under a mainstream OS, all your memory allocations will be WB, not weakly-ordered WC or strongly-ordered uncacheable UC or anything else. In fact even if you wanted a WC mapping of a page, most OSes don't have an API for that. And std::atomic release stores would be broken on WC memory, weakly-ordered like NT stores.
