Is there any compiler barrier which is equal to asm("" ::: "memory") in C++11?


Problem Description


My test code is below, and I found that only memory_order_seq_cst prevented the compiler from reordering.

#include <atomic>

using namespace std;

int A, B = 1;

void func(void) {
    A = B + 1;
    atomic_thread_fence(memory_order_seq_cst);
    B = 0;
}

Other choices, such as memory_order_release and memory_order_acq_rel, did not generate any compiler barrier at all.

I think they must be used with an atomic variable, just as below.

#include <atomic>

using namespace std;

atomic<int> A(0);
int B = 1;

void func(void) {
    A.store(B+1, memory_order_release);
    B = 0;
}

But I do not want to use an atomic variable. At the same time, I think asm("" ::: "memory") is too low-level.

Is there any better choice?

Solution

re: your edit:

But I do not want to use atomic variable.

Why not? If it's for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering without any runtime overhead other than the compiler barrier potentially blocking some compile-time optimizations, depending on the surrounding code.
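As a minimal sketch of that approach (reusing the question's own variables, so the names are just illustrative), a relaxed store plus atomic_signal_fence preserves source order at compile time without emitting any fence instruction:

```cpp
#include <atomic>

// Sketch of the relaxed-atomic + signal-fence approach. The signal fence
// is a pure compiler barrier: it blocks compile-time reordering but
// compiles to zero instructions.
std::atomic<int> A{0};
int B = 1;

void func() {
    A.store(B + 1, std::memory_order_relaxed);  // no fence instruction
    // Release fence: the store to A can't sink below B = 0 at compile time.
    std::atomic_signal_fence(std::memory_order_release);
    B = 0;
}
```

The only cost is that the barrier may block some compile-time optimizations around it, as noted above.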

If it's for some other reason, then maybe atomic_signal_fence will give you code that happens to work on your target platform. I suspect that it does order non-atomic<> loads and/or stores, so it might even help avoid data-race Undefined Behaviour in C++.


Sufficient for what?

Regardless of any barriers, if two threads run this function at the same time, your program has Undefined Behaviour because of concurrent access to non-atomic<> variables. So the only way this code can be useful is if you're talking about synchronizing with a signal handler that runs in the same thread.

That would also be consistent with asking for a "compiler barrier", to only prevent reordering at compile time, because out-of-order execution and memory reordering always preserve the behaviour of a single thread. So you never need extra barrier instructions to make sure you see your own operations in program order, you just need to stop the compiler reordering stuff at compile time. See Jeff Preshing's post: Memory Ordering at Compile Time

This is what atomic_signal_fence is for. You can use it with any std::memory_order, just like thread_fence, to get different strengths of barrier and only prevent the optimizations you need to prevent.
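A sketch of that same-thread signal-handler use case (all names here are illustrative, not from the question): because the handler runs in the same thread, compile-time ordering is all that is needed, so atomic_signal_fence is sufficient on both sides.

```cpp
#include <atomic>
#include <csignal>

int payload = 0;                       // data the handler reads
volatile std::sig_atomic_t ready = 0;  // flag the handler checks
int seen = -1;                         // what the handler observed

void producer_step() {
    payload = 42;
    // Stop the compiler from sinking the payload store below the flag store.
    std::atomic_signal_fence(std::memory_order_release);
    ready = 1;
}

extern "C" void on_signal(int) {
    if (ready) {
        // Acquire side of the compile-time-only ordering; safe because the
        // handler runs in the same thread as producer_step().
        std::atomic_signal_fence(std::memory_order_acquire);
        seen = payload;
    }
}
```

No barrier instructions are emitted for either fence; both only constrain the compiler.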


... atomic_thread_fence(memory_order_acq_rel) did not generate any compiler barrier at all!

Totally wrong, in several ways.

atomic_thread_fence is a compiler barrier plus whatever run-time barriers are necessary to restrict the order in which our loads/stores become visible to other threads.

I'm guessing you mean it didn't emit any barrier instructions when you looked at the asm output for x86. Instructions like x86's MFENCE are not "compiler barriers", they're run-time memory barriers and prevent even StoreLoad reordering at run-time. (That's the only reordering that x86 allows. SFENCE and LFENCE are only needed when using weakly-ordered (NT) stores, like MOVNTPS (_mm_stream_ps).)

On a weakly-ordered ISA like ARM, thread_fence(mo_acq_rel) isn't free, and compiles to an instruction. gcc5.4 uses dmb ish. (See it on the Godbolt compiler explorer).

A compiler barrier just prevents reordering at compile time, without necessarily preventing run-time reordering. So even on ARM, atomic_signal_fence(mo_seq_cst) compiles to no instructions.
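To make that contrast concrete, here is a sketch of the two spellings of a pure compiler barrier side by side; the inline-asm form is a GCC/Clang extension, while atomic_signal_fence is the portable C++11 equivalent (both compile to zero instructions on x86 and ARM):

```cpp
#include <atomic>

int X, Y;

// GNU-style inline asm: compile-time-only barrier (GCC/Clang extension).
void with_asm_barrier() {
    X = 1;
    asm volatile("" ::: "memory");
    Y = 2;
}

// Portable C++11 spelling of the same thing: even at seq_cst,
// atomic_signal_fence only constrains the compiler, not the hardware.
void with_signal_fence() {
    X = 1;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    Y = 2;
}
```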

A weak enough barrier allows the compiler to do the store to B ahead of the store to A if it wants, but gcc happens to decide to still do them in source order even with thread_fence(mo_acquire) (which shouldn't order stores with other stores).

So this example doesn't really test whether something is a compiler barrier or not.


Strange compiler behaviour from gcc, for an example whose output differs with vs. without a compiler barrier:

See this source+asm on Godbolt.

#include <atomic>
using namespace std;
int A,B;

void foo() {
  A = 0;
  atomic_thread_fence(memory_order_release);
  B = 1;
  //asm volatile(""::: "memory");
  //atomic_signal_fence(memory_order_release);
  atomic_thread_fence(memory_order_release);
  A = 2;
}

This compiles with clang the way you'd expect: the thread_fence is a StoreStore barrier, so the A=0 has to happen before B=1, and can't be merged with the A=2.

    # clang3.9 -O3
    mov     dword ptr [rip + A], 0
    mov     dword ptr [rip + B], 1
    mov     dword ptr [rip + A], 2
    ret

But with gcc, the barrier has no effect, and only the final store to A is present in the asm output.

    # gcc6.2 -O3
    mov     DWORD PTR B[rip], 1
    mov     DWORD PTR A[rip], 2
    ret

But with atomic_signal_fence(memory_order_release), gcc's output matches clang. So atomic_signal_fence(mo_release) is having the barrier effect we expect, but atomic_thread_fence with anything weaker than seq_cst isn't acting as a compiler barrier at all.

One theory here is that gcc knows that it's officially Undefined Behaviour for multiple threads to write to non-atomic<> variables. This doesn't hold much water, because atomic_thread_fence should still work if used to synchronize with a signal handler, it's just stronger than necessary.

BTW, with atomic_thread_fence(memory_order_seq_cst), we get the expected

    # gcc6.2 -O3, with a mo_seq_cst barrier
    mov     DWORD PTR A[rip], 0
    mov     DWORD PTR B[rip], 1
    mfence
    mov     DWORD PTR A[rip], 2
    ret

We get this even with only one barrier, which would still allow the A=0 and A=2 stores to happen one after the other, so the compiler is allowed to merge them across a barrier. (Observers failing to see separate A=0 and A=2 values is a possible ordering, so the compiler can decide that's what always happens). Current compilers don't usually do this kind of optimization, though. See discussion at the end of my answer on Can num++ be atomic for 'int num'?.
