Is there any compiler barrier which is equal to asm("" ::: "memory") in C++11?


Problem description

My test code is below, and I found that only memory_order_seq_cst forbade the compiler's reordering.

#include <atomic>

using namespace std;

int A, B = 1;

void func(void) {
    A = B + 1;
    atomic_thread_fence(memory_order_seq_cst);
    B = 0;
}

Other choices, such as memory_order_release and memory_order_acq_rel, did not generate any compiler barrier at all.

I think they must be used together with an atomic variable, as below.

#include <atomic>

using namespace std;

atomic<int> A(0);
int B = 1;

void func(void) {
    A.store(B+1, memory_order_release);
    B = 0;
}

But I do not want to use an atomic variable. At the same time, I think asm("" ::: "memory") is too low-level.
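For reference, the barrier the question mentions would look like this in the test function (a GCC/Clang-specific sketch; the empty asm with a "memory" clobber emits no instructions, it only tells the compiler that memory may have been read or written, so loads and stores cannot be moved across it at compile time):

```cpp
#include <atomic>

int A, B = 1;

void func(void) {
    A = B + 1;
    asm volatile("" ::: "memory");  // GCC/Clang compiler barrier, compile-time only
    B = 0;
}
```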

Is there any better choice?

Answer

re: your "But I do not want to use atomic variable."

Why not? If it's for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering without any runtime overhead other than the compiler barrier potentially blocking some compile-time optimizations, depending on the surrounding code.
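A sketch of that suggestion, applied to the question's function (an illustration, not the only way to write it): A becomes atomic so concurrent access is defined, memory_order_relaxed keeps the store as cheap as a plain store, and atomic_signal_fence acts purely as a compiler barrier, emitting no instructions.

```cpp
#include <atomic>

std::atomic<int> A{0};
int B = 1;

void func(void) {
    // relaxed store: no run-time ordering cost beyond a plain store
    A.store(B + 1, std::memory_order_relaxed);
    // compile-time barrier only: no fence instruction is emitted
    std::atomic_signal_fence(std::memory_order_release);
    B = 0;
}
```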

If it's for some other reason, then maybe atomic_signal_fence will give you code that happens to work on your target platform. I suspect that most implementations of it do order non-atomic<> loads and stores in practice, at least as an implementation detail, and probably effectively required if there are accesses to atomic<> variables. So it might help in practice to avoid some actual consequences of any data-race Undefined Behaviour which would still exist. (e.g. as part of a SeqLock implementation where for efficiency you want to use non-atomic reads / writes of the shared data so the compiler can use SIMD vector copies, for example.)

See Who's afraid of a big bad optimizing compiler? on LWN for some details about the badness you can run into (like invented loads) if you only use compiler barriers to force reloads of non-atomic variables, instead of using something with read-exactly-once semantics. (In that article, they're talking about Linux kernel code so they're using volatile for hand-rolled load/store atomics. But in general don't do that: When to use volatile with multi threading? - pretty much never)

Regardless of any barriers, if two threads run this function at the same time, your program has Undefined Behaviour because of concurrent access to non-atomic<> variables. So the only way this code can be useful is if you're talking about synchronizing with a signal handler that runs in the same thread.

That would also be consistent with asking for a "compiler barrier", to only prevent reordering at compile time, because out-of-order execution and memory reordering always preserve the behaviour of a single thread. So you never need extra barrier instructions to make sure you see your own operations in program order, you just need to stop the compiler reordering stuff at compile time. See Jeff Preshing's post: Memory Ordering at Compile Time
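The signal-handler case might be sketched like this (hypothetical names: data, ready, observed). Since the handler runs on the thread it interrupts, program order is all that must be preserved, so a compile-time barrier suffices. Note the standard formally only permits a handler to access volatile std::sig_atomic_t objects and lock-free atomics; reading the plain int here relies on the in-practice behaviour discussed above.

```cpp
#include <atomic>
#include <csignal>

int data = 0;
volatile std::sig_atomic_t ready = 0;
volatile std::sig_atomic_t observed = -1;

void handler(int) {
    if (ready) {
        // compiler barrier: don't read data before checking ready
        std::atomic_signal_fence(std::memory_order_acquire);
        observed = data;  // sees the value written before ready was set
    }
}

void publish(void) {
    data = 42;
    // compiler barrier: the store to data can't sink below the store to ready
    std::atomic_signal_fence(std::memory_order_release);
    ready = 1;
}
```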

This is what atomic_signal_fence is for. You can use it with any std::memory_order, just like thread_fence, to get different strengths of barrier and only prevent the optimizations you need to prevent.

... atomic_thread_fence(memory_order_acq_rel) did not generate any compiler barrier at all!

Totally wrong, in several ways.

atomic_thread_fence is a compiler barrier plus whatever run-time barriers are necessary to restrict reordering in the order our loads/stores become visible to other threads.

I'm guessing you mean it didn't emit any barrier instructions when you looked at the asm output for x86. Instructions like x86's MFENCE are not "compiler barriers", they're run-time memory barriers and prevent even StoreLoad reordering at run-time. (That's the only reordering that x86 allows. SFENCE and LFENCE are only needed when using weakly-ordered (NT) stores, like MOVNTPS (_mm_stream_ps).)

On a weakly-ordered ISA like ARM, thread_fence(mo_acq_rel) isn't free, and compiles to an instruction. gcc5.4 uses dmb ish. (See it on the Godbolt compiler explorer).

A compiler barrier just prevents reordering at compile time, without necessarily preventing run-time reordering. So even on ARM, atomic_signal_fence(mo_seq_cst) compiles to no instructions.

A weak enough barrier allows the compiler to do the store to B ahead of the store to A if it wants, but gcc happens to decide to still do them in source order even with thread_fence(mo_acquire) (which shouldn't order stores with other stores).

So this example doesn't really test whether something is a compiler barrier or not.

Strange compiler behaviour from gcc, with an example that's different with a compiler barrier:

See this source + asm on Godbolt.

#include <atomic>
using namespace std;
int A,B;

void foo() {
  A = 0;
  atomic_thread_fence(memory_order_release);
  B = 1;
  //asm volatile(""::: "memory");
  //atomic_signal_fence(memory_order_release);
  atomic_thread_fence(memory_order_release);
  A = 2;
}

This compiles with clang the way you'd expect: the thread_fence is a StoreStore barrier, so the A=0 has to happen before B=1, and can't be merged with the A=2.

    # clang3.9 -O3
    mov     dword ptr [rip + A], 0
    mov     dword ptr [rip + B], 1
    mov     dword ptr [rip + A], 2
    ret

But with gcc, the barrier has no effect, and only the final store to A is present in the asm output.

    # gcc6.2 -O3
    mov     DWORD PTR B[rip], 1
    mov     DWORD PTR A[rip], 2
    ret

But with atomic_signal_fence(memory_order_release), gcc's output matches clang. So atomic_signal_fence(mo_release) is having the barrier effect we expect, but atomic_thread_fence with anything weaker than seq_cst isn't acting as a compiler barrier at all.

One theory here is that gcc knows that it's officially Undefined Behaviour for multiple threads to write to non-atomic<> variables. This doesn't hold much water, because atomic_thread_fence should still work if used to synchronize with a signal handler, it's just stronger than necessary.

BTW, with atomic_thread_fence(memory_order_seq_cst), we get the expected

    # gcc6.2 -O3, with a mo_seq_cst barrier
    mov     DWORD PTR A[rip], 0
    mov     DWORD PTR B[rip], 1
    mfence
    mov     DWORD PTR A[rip], 2
    ret

We get this even with only one barrier, which would still allow the A=0 and A=2 stores to happen one after the other, so the compiler is allowed to merge them across a barrier. (Observers failing to see separate A=0 and A=2 values is a possible ordering, so the compiler can decide that's what always happens). Current compilers don't usually do this kind of optimization, though. See discussion at the end of my answer on Can num++ be atomic for 'int num'?.

