gcc memory barrier __sync_synchronize vs asm volatile("": : :"memory")
asm volatile("": : :"memory")
is often used as a memory barrier (e.g. as seen in the Linux kernel barrier() macro).
This sounds similar to what the gcc builtin __sync_synchronize does.
Are these two similar? If not, what are the differences, and when would one be used over the other?
Solution

There's a significant difference - the first option (inline asm) actually does nothing at runtime; there's no command performed there and the CPU doesn't know about it. It only serves at compile time, to tell the compiler not to move loads or stores beyond this point (in any direction) as part of its optimizations. It's called a SW barrier.
The second barrier (builtin sync) would simply translate into a HW barrier, probably a fence (mfence/sfence) operation if you're on x86, or its equivalent in other architectures. The CPU may also do various optimizations at runtime, the most important one being actually performing operations out-of-order - this instruction tells it to make sure that loads or stores can't pass this point and must be observed on the correct side of the sync point.
Here's another good explanation:
Types of Memory Barriers
As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a
memory barrier. A memory barrier that affects both the compiler and
the processor is a hardware memory barrier, and a memory barrier that
only affects the compiler is a software memory barrier.
In addition to hardware and software memory barriers, a memory barrier
can be restricted to memory reads, memory writes, or both. A memory
barrier that affects both reads and writes is a full memory barrier.
There is also a class of memory barrier that is specific to
multi-processor environments. The name of these memory barriers are
prefixed with "smp". On a multi-processor system, these barriers are
hardware memory barriers and on uni-processor systems, they are
software memory barriers.
The barrier() macro is the only software memory barrier, and it is a
full memory barrier. All other memory barriers in the Linux kernel are
hardware barriers. A hardware memory barrier is an implied software
barrier.
An example of when a SW barrier is useful: consider the following code -
for (i = 0; i < N; ++i) {
a[i]++;
}
This simple loop, compiled with optimizations, would most likely be unrolled and vectorized.
Here's the assembly code gcc 4.8.0 -O3 generated, using packed (vector) operations:
400420: 66 0f 6f 00 movdqa (%rax),%xmm0
400424: 48 83 c0 10 add $0x10,%rax
400428: 66 0f fe c1 paddd %xmm1,%xmm0
40042c: 66 0f 7f 40 f0 movdqa %xmm0,0xfffffffffffffff0(%rax)
400431: 48 39 d0 cmp %rdx,%rax
400434: 75 ea jne 400420 <main+0x30>
However, when adding your inline assembly on each iteration, gcc is not permitted to change the order of the operations past the barrier, so it can't group them, and the assembly becomes the scalar version of the loop:
400418: 83 00 01 addl $0x1,(%rax)
40041b: 48 83 c0 04 add $0x4,%rax
40041f: 48 39 d0 cmp %rdx,%rax
400422: 75 f4 jne 400418 <main+0x28>
However, when the CPU performs this code, it's permitted to reorder the operations "under the hood", as long as it does not break the memory-ordering model. This means the operations can be executed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.