Making sense of Memory Barriers

Question

I am attempting to understand memory barriers at a level useful for Java lock-free programmers. This level, I feel, is somewhere between learning just about volatiles and learning how Store/Load buffers work from an x86 manual.

I spent some time reading a bunch of blogs/cookbooks and have come up with the summary below. Could someone more knowledgeable look at the summary and see whether I have missed anything or listed something incorrectly?

LFENCE

Name             : LFENCE/Load Barrier/Acquire Fence
Barriers         : LoadLoad + LoadStore
Details          : Given sequence {Load1, LFENCE, Load2, Store1}, the
                   barrier ensures that Load1 can't be moved south and
                   Load2 and Store1 can't be moved north of the
                   barrier. 
                   Note that Load2 and Store1 can still be reordered.

Buffer Effect    : Causes the contents of the LoadBuffer
                   (pending loads) to be processed for that CPU. This
                   makes program state exposed by other CPUs visible
                   to this CPU before Load2 and Store1 are executed.

Cost on x86      : Either very cheap or a no-op.
Java instructions: Reading a volatile variable, Unsafe.loadFence()

SFENCE

Name             : SFENCE/Store Barrier/Release Fence
Barriers         : StoreStore + LoadStore
Details          : Given sequence {Load1, Store1, SFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 can't be moved north of the 
                   barrier.
                   Note that Load1 and Store1 can still be reordered AND 
                   Load2 can be moved north of the barrier.
Buffer Effect    : Causes the contents of the StoreBuffer to be flushed
                   to cache for the CPU on which it is issued.
                   This will make program state visible to other CPUs
                   before Store2 and Load2 are executed.
Cost on x86      : Either very cheap or a no-op.
Java instructions: lazySet(), Unsafe.storeFence(), Unsafe.putOrdered*()

MFENCE

Name             : MFENCE/Full Barrier/Fence
Barriers         : StoreLoad
Details          : Obtains the effects of the other three barriers.
                   Given sequence {Load1, Store1, MFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 and Load2 can't be moved north
                   of the barrier.
                   Note that Load1 and Store1 can still be reordered AND
                   Store2 and Load2 can still be reordered.
Buffer Effect    : Causes the contents of the LoadBuffer (pending loads)
                   to be processed for that CPU.
                   AND
                   Causes the contents of the StoreBuffer to be flushed
                   to cache for the CPU on which it is issued.
Cost on x86      : The most expensive kind.
Java instructions: Writing to a volatile, Unsafe.fullFence(), Locks
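
The Java-instruction mappings in the tables above can be seen at work in a safe-publication pattern. The sketch below is my own illustration (not from the original post): the volatile write of `ready` supplies the release ordering (plus the full StoreLoad barrier on x86), and the volatile read supplies the acquire ordering, so `data` is guaranteed visible once `ready` is observed as true.

```java
// Sketch of the volatile mappings above (an illustration, not code from
// the original post): the volatile write of `ready` acts as a release
// store (and a full barrier on x86), the volatile read as an acquire load.
public class SafePublication {
    static int data;                 // plain field, published via `ready`
    static volatile boolean ready;   // volatile flag

    static int publishAndRead() throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;       // plain store: can't be reordered below...
            ready = true;    // ...this volatile (release) store
        });
        writer.start();
        while (!ready) {          // volatile (acquire) load: later reads
            Thread.onSpinWait();  // can't be reordered above it
        }
        int seen = data;     // guaranteed to be 42 once `ready` is seen true
        writer.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(publishAndRead()); // prints 42
    }
}
```

Without the volatile modifier on `ready`, neither the visibility of `data` nor the termination of the spin loop would be guaranteed by the Java memory model.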

Finally, if both SFENCE and MFENCE drain the store buffer (invalidating cache lines and waiting for acks from other CPUs), why is one a no-op and the other a very expensive op?

Thanks

(Cross-posted from Google's Mechanical Sympathy forum)

Answer

You're using Java, so all that really matters is the Java memory model. Compile-time (including JIT) optimizations will re-order your memory accesses within the limitations of the Java memory model, not the stronger x86 memory model that the JVM happens to be JIT-compiling for. (See my answer to How does memory reordering help processors and compilers?)

Still, learning about x86 can give your understanding a concrete foundation, but don't fall into the trap of thinking that Java on x86 works like assembly on x86. (Or that the whole world is x86. Many other architectures are weakly ordered, like the Java memory model.)

x86 LFENCE and SFENCE are no-ops as far as memory ordering goes, unless you use movnt weakly-ordered cache-bypassing stores. Normal loads are implicitly acquire-loads, and normal stores are implicitly release-stores.

You have an error in your table: SFENCE is "not ordered with respect to load instructions", according to Intel's instruction set reference manual. It is only a StoreStore barrier, not a LoadStore barrier.

(That link is an HTML conversion of Intel's PDFs. See the x86 tag wiki for links to the official version.)

lfence is a LoadLoad and LoadStore barrier, so your table is correct.

But CPUs don't really "buffer" loads ahead of time. They execute them and start using the results for out-of-order execution as soon as the results are available. (Usually, instructions that use the result of a load have already been decoded and issued before that result is ready, even on an L1 cache hit.) This is the fundamental difference between loads and stores.

SFENCE is cheap because it doesn't actually have to drain the store buffer. That's one way to implement it which keeps the hardware simple, at the cost of performance.

MFENCE is expensive because it's the only barrier that prevents StoreLoad reordering. See Jeff Preshing's Memory Reordering Caught in the Act for an explanation, and a test program that actually demonstrates StoreLoad reordering on real hardware.

Jeff Preshing's blog posts are gold for understanding lock-free programming and memory-ordering semantics. I usually link his blog in my SO answers to memory-ordering questions; you can probably search for those answers if you're interested in reading more of what I've written (mostly C++ / asm, not Java, though).

Fun fact: any atomic read-modify-write operation on x86 is also a full memory barrier. The lock prefix, which is implicit on xchg [mem], reg, is also a full barrier. lock add [esp], 0 was a common idiom for a memory barrier that's otherwise a no-op, before mfence was available. (Stack memory is almost always hot in L1, and not shared.)

So on x86, incrementing an atomic counter has the same performance regardless of the memory-ordering semantics you request (e.g. C++11 memory_order_relaxed vs. memory_order_seq_cst, sequential consistency). Use whatever memory-ordering semantics are appropriate, though, because other architectures can do atomic ops without full memory barriers, and forcing the compiler / JVM to use a memory barrier when you don't need one is a waste.
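
To illustrate the atomic-counter point (my own sketch, not code from the answer): on x86 the increment below JIT-compiles to a lock xadd, which is a full barrier no matter what weaker ordering you might have requested, yet the Java code stays portable and correct under contention.

```java
// Contended atomic counter (an illustration, not code from the answer).
// On x86 each getAndIncrement() JIT-compiles to `lock xadd`, an atomic
// RMW that is also a full memory barrier.
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounter {
    static int count(int threads, int perThread) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.getAndIncrement();  // atomic RMW, full barrier on x86
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return counter.get();       // no increments lost despite contention
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(count(4, 100_000)); // prints 400000
    }
}
```

A plain `int` field incremented the same way would lose updates; the atomic RMW is what makes the final count exact.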
