gcc memory barrier __sync_synchronize vs asm volatile("": : :"memory")



asm volatile("": : :"memory") is often used as a memory barrier (e.g. as seen in the linux kernel barrier() macro).

This sounds similar to what the gcc builtin __sync_synchronize does.

Are these two similar? If not, what are the differences, and when would one be used over the other?

Solution

There's a significant difference - the first option (inline asm) actually does nothing at runtime; no instruction is executed there and the CPU doesn't know about it. It only serves at compile time, to tell the compiler not to move loads or stores beyond this point (in any direction) as part of its optimizations. This is called a SW barrier.

The second barrier (the builtin sync) would translate into a HW barrier, probably a fence (mfence/sfence) operation if you're on x86, or its equivalent on other architectures. The CPU may also do various optimizations at runtime, the most important being actually performing operations out of order - this instruction tells it to make sure that loads or stores can't pass this point and must be observed on the correct side of the sync point.

Here's another good explanation:

Types of Memory Barriers

As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.

In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.

There is also a class of memory barrier that is specific to multi-processor environments. The names of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers and on uni-processor systems, they are software memory barriers.

The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier is an implied software barrier.

An example for when SW barrier is useful: consider the following code -

for (i = 0; i < N; ++i) {
    a[i]++;
}

This simple loop, compiled with optimizations, would most likely be unrolled and vectorized. Here's the assembly gcc 4.8.0 -O3 generated, using packed (vector) operations:

400420:       66 0f 6f 00             movdqa (%rax),%xmm0
400424:       48 83 c0 10             add    $0x10,%rax
400428:       66 0f fe c1             paddd  %xmm1,%xmm0
40042c:       66 0f 7f 40 f0          movdqa %xmm0,0xfffffffffffffff0(%rax)
400431:       48 39 d0                cmp    %rdx,%rax
400434:       75 ea                   jne    400420 <main+0x30>

However, when adding your inline assembly on each iteration, gcc is not permitted to change the order of the operations past the barrier, so it can't group them, and the assembly becomes the scalar version of the loop:

400418:       83 00 01                addl   $0x1,(%rax)
40041b:       48 83 c0 04             add    $0x4,%rax
40041f:       48 39 d0                cmp    %rdx,%rax
400422:       75 f4                   jne    400418 <main+0x28>

However, when the CPU performs this code, it's permitted to reorder the operations "under the hood", as long as it does not break the memory ordering model. This means that the operations can be performed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.
