How does memory reordering help processors and compilers?

Question

I studied the Java memory model and saw re-ordering problems. A simple example:

boolean first = false;
boolean second = false;

void setValues() {
    first = true;
    second = true;
}

void checkValues() {
    while(!second);
    assert first;
}

Reordering is very unpredictable and weird. Also, it ruins abstractions. I suppose that processor architectures must have a good reason to do something that's so inconvenient for programmers. What are those reasons?

There is a lot of information about how to handle reordering, but I can't find anything about why it is needed. Everywhere people just say something like "it is because of some performance benefit". What are the performance benefits in storing second before first, for example?

Can you recommend some article, paper or book about this, or explain it by yourself?

Answer

TL;DR: It gives the compiler and hardware more room to take advantage of the as-if rule by not requiring it to preserve all behaviour of the original source, only the result of the single thread itself.

Taking the externally-observable (from other threads) ordering of loads/stores out of the picture as something that optimizations must preserve gives the compiler a lot of room to merge things into fewer operations. For the hardware, delaying stores is the big one, but for compilers all kinds of reordering can help.

(See part-way down for a section on why it helps the compiler)

Hardware reordering of earlier stores with later loads (StoreLoad reordering) inside the CPU is essential for out-of-order execution. (See below.)

Other kinds of reordering (e.g. StoreStore reordering, which is the subject of your question) aren't essential, and high-performance CPUs can be built with only StoreLoad reordering, not the other three kinds. (The prime example is x86, where every store is a release-store and every load is an acquire-load. See the x86 tag wiki for more details.)

Some people, like Linus Torvalds, argue that reordering stores with other stores doesn't help the hardware much, because hardware already has to track store-ordering to support out-of-order execution of a single thread. (A single thread always runs as if all of its own stores/loads happen in program order.) See other posts in that thread on realworldtech if you're curious. And/or if you find Linus's mix of insults and sensible technical arguments entertaining :P

For Java, the issue is that architectures exist where the hardware doesn't provide these ordering guarantees. Weak memory ordering is a common feature of RISC ISAs like ARM, PowerPC, and MIPS (but not SPARC-TSO). The reasons behind that design decision are the same ones being argued over in the realworldtech thread I linked: make the hardware simpler, and let software request ordering when needed.

So Java's architect(s) didn't have much of a choice: Implementing a JVM for an architecture with a weaker memory model than the Java standard would require a store-barrier instruction after every single store, and a load-barrier before every load. (Except when the JVM's JIT-compiler can prove that no other thread can have a reference to that variable.) Running barrier instructions all the time is slow.
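In Java source, requesting that ordering means marking the flag volatile. A minimal sketch of the questioner's example with that fix (the class name `Values` is mine, not from the question); on a weakly-ordered ISA the JIT must emit the appropriate barrier instructions around the volatile accesses:

```java
// Sketch: with `second` volatile, the JIT must emit a release barrier
// before the volatile store and an acquire barrier after the volatile
// load on weakly-ordered hardware, so the assert can no longer fail.
class Values {
    boolean first = false;
    volatile boolean second = false;   // volatile forces the ordering

    void setValues() {
        first = true;    // plain store
        second = true;   // volatile store: can't be reordered before `first = true`
    }

    void checkValues() {
        while (!second) { }  // volatile load: acquire semantics
        assert first;        // guaranteed via happens-before on `second`
    }
}
```

With plain (non-volatile) fields, the JVM is allowed to omit those barriers, which is exactly why the original example can fail.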

A strong memory model for Java would make efficient JVMs on ARM (and other ISAs) impossible. Proving that barriers aren't needed is near-impossible, requiring AI levels of global program-understanding. (This goes WAY beyond what normal optimizers do).

(see also Jeff Preshing's excellent blog post on C++ compile-time reordering. This basically applies to Java when you include JIT-compiling to native code as part of the process.)

Another reason for keeping the Java and C/C++ memory models weak is to allow more optimizations. Since other threads are allowed (by the weak memory model) to observe our stores and loads in any order, aggressive transformations are allowed even when the code involves stores to memory.

e.g. in a case like Davide's example:

c.a = 1;
c.b = 1;
c.a++;
c.b++;

// same observable effects as the much simpler
c.a = 2;
c.b = 2;

There's no requirement that other threads be able to observe the intermediate states. So a compiler can just compile that to c.a = 2; c.b = 2;, either at Java-compile time or when the bytecode is JIT-compiled to machine code.

It's common for a method that increments something to be called multiple times from another method. Without this rule, turning it into c.a += 4 could only happen if the compiler could prove that no other thread could observe the difference.
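A minimal sketch of that situation (the `Counter` class and method names are hypothetical, chosen to mirror the `c.a` example above):

```java
// Hypothetical sketch: two back-to-back calls to incBy2() may be
// compiled as if they were a single `a += 4`, because no other thread
// is entitled to observe the intermediate value of `a`.
class Counter {
    int a = 0;

    void incBy2() { a += 2; }

    int doubleInc() {
        incBy2();
        incBy2();   // JIT may merge: one load, one add of 4, one store
        return a;
    }
}
```

The single-threaded result is identical either way; the weak memory model is what makes the merged version legal even though other threads might be watching `a`.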

C++ programmers sometimes make the mistake of thinking that since they're compiling for x86, they don't need std::atomic<int> to get some ordering guarantees for a shared variable. This is wrong, because optimizations happen based on the as-if rule for the language memory model, not the target hardware.
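The same trap exists in Java. A classic sketch (the `Worker` class is mine): on any hardware, including strongly-ordered x86, the JIT may hoist the read of a non-volatile flag out of a loop:

```java
// Sketch: `stop` is NOT volatile, so the as-if rule at the language
// level lets the JIT read it once and cache it in a register, turning
// the loop into an infinite loop for a cross-thread writer -- no matter
// how strongly ordered the target hardware is.
class Worker {
    boolean stop = false;   // should be volatile for cross-thread use

    void run() {
        while (!stop) {
            // do work; another thread setting `stop = true` may never
            // be noticed here once the read has been hoisted
        }
    }
}
```

Declaring `stop` volatile forbids that hoisting at the language level, which is the point: ordering and visibility come from the language memory model, not from the target CPU.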

Once a store is committed to cache, it becomes globally visible to threads running on other cores (via the cache-coherency protocol). At that point, it's too late to roll it back (another core might already have gotten a copy of the value). So a store can't commit until it's known for certain that it won't fault, that no instruction before it will fault, that the store's data is ready, and that there wasn't a branch mispredict at some point earlier, etc. I.e. we need to rule out all cases of mis-speculation before we can retire a store instruction.

Without StoreLoad reordering, every load would have to wait for all preceding stores to retire (i.e. be totally finished executing, having committed the data to cache) before they could read a value from cache for use by later instructions that depend on the value loaded. (The moment when a load copies a value from cache into a register is when it's globally visible to other threads.)
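The classic store-buffer litmus test makes this concrete. A sketch (the class and field names are the conventional litmus-test ones, not from the answer):

```java
// Dekker-style litmus test. With plain fields, each core's store can
// sit in its store buffer while the following load executes, so
// r1 == 0 && r2 == 0 is a real outcome: StoreLoad reordering.
class StoreBufferLitmus {
    int x = 0, y = 0;
    int r1, r2;

    void thread1() { x = 1; r1 = y; }  // store x, then load y
    void thread2() { y = 1; r2 = x; }  // store y, then load x

    // Sequential consistency forbids r1 == 0 && r2 == 0, but x86 (and
    // everything weaker) allows it unless a full StoreLoad barrier sits
    // between each thread's store and load.
}
```

This is the one reordering that even x86 permits, precisely because forbidding it would mean every load waits for the store buffer to drain.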

Since you can't know what's happening on other cores, I don't think hardware could hide this delay in starting loads by speculating that it isn't a problem, and then detecting mis-speculation after the fact. (And treat it like a branch mispredict: throw away all work done that depended on that load, and re-issue it.) A core might be able to allow speculative early loads from cache lines that were in Exclusive or Modified state, since they can't be present in other cores. (Detecting mis-speculation if a cache-coherency request for that cache line came in from another CPU before retiring the last store before the speculative load.) Anyway, this is obviously a large amount of complexity that isn't needed for anything else.

Note that I haven't even mentioned cache-misses for stores. That increases the latency of a store from a few cycles to hundreds of cycles.

I included some links as part of a brief intro to computer architecture in the early part of my answer on Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. That might be helpful, or more confusing, if you're finding this hard to follow.

CPUs avoid WAR and WAW pipeline hazards for stores by buffering them in a store queue until store instructions are ready to retire. Loads from the same core have to check the store queue (to preserve the appearance of in-order execution for a single thread, otherwise you'd need memory-barrier instructions before loading anything that might have been stored recently!). The store queue is invisible to other threads; stores only become globally visible when the store instruction retires, but loads become globally visible as soon as they execute. (And can use values prefetched into the cache well ahead of that).

See also Wikipedia's article on the classic RISC pipeline.

So out-of-order execution is possible for stores, but they're only reordered inside the store queue. Since instructions have to retire in order (to support precise exceptions), there doesn't appear to be much benefit at all to StoreStore reordering in hardware.

Since loads become globally visible when they execute, enforcing LoadLoad ordering may require delaying loads after a load that misses in cache. Of course, in reality the CPU would speculatively execute the following loads, and detect a memory-order mis-speculation if it occurs. This is nearly essential for good performance: A large part of the benefit of out-of-order execution is to keep doing useful work, hiding the latency of cache misses.

One of Linus' arguments is that weakly-ordered CPUs require multi-threaded code to use a lot of memory barrier instructions, so they'll need to be cheap for multi-threaded code to not suck. That's only possible if you have hardware tracking the dependency ordering of loads and stores.

But if you have that hardware tracking of dependencies, you can just have the hardware enforce ordering all the time, so software doesn't have to run as many barrier instructions. If you have hardware support to make barriers cheap, why not just make them implicit on every load / store, like x86 does.

His other major argument is that memory ordering is HARD, and a major source of bugs. Getting it right once in hardware is better than every software project having to get it right. (This argument only works because it's possible in hardware without huge performance overhead.)
