How is load->store reordering possible with in-order commit?


Problem Description


ARM allows reordering loads with subsequent stores, so that the following pseudocode:

    // CPU 0    | // CPU 1
    temp0 = x;  | temp1 = y;
    y = 1;      | x = 1;

can result in temp0 == temp1 == 1 (and, this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (which, it was my understanding, is present in pretty much all OOO processors). My reasoning goes "the load must have its value before it commits, it commits before the store, and the store's value can't become visible to other processors until it commits."
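
As a concrete sketch of that test (assuming C11 relaxed atomics and pthreads; the variable and function names just mirror the pseudocode), it can be written as below. A single run only produces one outcome, so in practice you would re-initialize and repeat it many times to catch temp0 == temp1 == 1:

    /* Litmus-test sketch with C11 relaxed atomics (illustrative only).
     * Relaxed ordering permits the LoadStore reordering being discussed. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int x, y;          /* zero-initialized */
    static int temp0, temp1;

    static void *cpu0(void *arg) {   /* CPU 0: load x, then store y */
        (void)arg;
        temp0 = atomic_load_explicit(&x, memory_order_relaxed);
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        return NULL;
    }

    static void *cpu1(void *arg) {   /* CPU 1: load y, then store x */
        (void)arg;
        temp1 = atomic_load_explicit(&y, memory_order_relaxed);
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("temp0=%d temp1=%d\n", temp0, temp1); /* 1/1 is an allowed outcome */
        return 0;
    }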

I'm guessing that one of my assumptions must be wrong, and something like one of the following must hold:

  • Instructions don't need to commit all the way in-order. A later store could safely commit and become visible before an earlier load, so long as at the time the store commits the core can guarantee that the previous load (and all intermediate instructions) won't trigger an exception, and that the load's address is guaranteed to be distinct from the store's.

  • The load can commit before its value is known. I don't have a guess as to how this would be implemented.

  • Stores can become visible before they are committed. Maybe a memory buffer somewhere is allowed to forward stores to loads to a different thread, even if the load was enqueued earlier?

  • Something else entirely?

There are a lot of hypothetical microarchitectural features that would explain this behavior, but I'm most curious about the ones that are actually present in modern weakly ordered CPUs.

Solution

Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking permissions (TLB) on a load to make sure it can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).

I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.

You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an SMT core (multiple logical threads on one physical core), it would link speculation on two logical threads together, requiring them both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT of having one logical thread take advantage of stalls in others.

(Related: Making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
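
The test behind that linked question is the IRIW (independent reads of independent writes) litmus test. A rough sketch, assuming C11 acquire/release ordering (which is weak enough to allow the two readers to disagree; only seq_cst forbids it), with the thread-creation harness omitted for brevity:

    /* IRIW sketch (illustrative): two independent writers, two readers. */
    #include <stdatomic.h>

    atomic_int x, y;                 /* zero-initialized */
    int r0a, r0b, r1a, r1b;

    void writer_x(void) { atomic_store_explicit(&x, 1, memory_order_release); }
    void writer_y(void) { atomic_store_explicit(&y, 1, memory_order_release); }

    void reader0(void) {             /* reads x then y */
        r0a = atomic_load_explicit(&x, memory_order_acquire);
        r0b = atomic_load_explicit(&y, memory_order_acquire);
    }

    void reader1(void) {             /* reads y then x */
        r1a = atomic_load_explicit(&y, memory_order_acquire);
        r1b = atomic_load_explicit(&x, memory_order_acquire);
    }

    /* The outcome r0a==1 && r0b==0 && r1a==1 && r1b==0 means the two readers
     * observed the two independent stores in opposite orders. */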


CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like Cortex-A53 are still widely used in modern smartphones, and scheduling loads well ahead of when the result register is used is a well-known important optimization for looping over an array. (Unrolling or even software pipelining.)
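
As a sketch of what that scheduling looks like at the source level (illustrative; the function name is made up, and a compiler targeting an in-order core will often do this transformation itself), hoisting the next iteration's load above the use of the current value keeps the load latency off the critical path:

    /* Hand-scheduled array loop for an in-order core (sketch).
     * The load of element i is issued before the value of element i-1 is used,
     * so a cache miss on the load overlaps with the add instead of stalling it. */
    long sum_array(const int *a, long n) {
        if (n <= 0) return 0;
        long sum = 0;
        int cur = a[0];              /* first load issued early */
        for (long i = 1; i < n; i++) {
            int next = a[i];         /* start the next load now ...            */
            sum += cur;              /* ... while using the value loaded earlier */
            cur = next;
        }
        return sum + cur;
    }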

So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. (Jeff Preshing's intro to memory reordering uses that example for LoadStore, but doesn't get into uarch details at all.)

A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.

So the sequence on an in-order pipeline is (a rough code sketch of these steps follows the list):

  • lw r0, [r1] TLB hit, but misses in L1d cache. Load execution unit writes the address (r1) into a load buffer. Any later instruction that tries to read r0 will stall, but we know for sure that the load didn't fault.

    With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later instructions.

  • any number of other instructions that don't read r0 can run. (An instruction that did read r0 would stall an in-order pipeline.)

  • sw r2, [r3] store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.

    Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering)

    Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
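
Putting the steps above together, here is a hypothetical sketch (structures and field names are invented for illustration, not any real core's design) of the load-buffer/store-buffer bookkeeping those bullets describe, including the overlap probe that gates store commit:

    /* Hypothetical load/store buffer bookkeeping for the sequence above (sketch). */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t addr;       /* address being loaded */
        int      dest_reg;   /* register written when the cache line arrives */
        bool     pending;    /* still waiting for data; the load has already retired */
    } load_buf_entry;

    typedef struct {
        uint64_t addr;
        uint32_t data;
        bool     retired;    /* past retirement: it must eventually commit */
    } store_buf_entry;

    /* A retired store may commit to L1d (become globally visible) even while an
     * older load is still pending, as long as the addresses don't overlap.
     * That window is exactly where LoadStore reordering becomes observable.
     * (A real core would check byte/cache-line overlap, not exact equality.) */
    bool store_may_commit(const store_buf_entry *st,
                          const load_buf_entry *loads, int n_loads)
    {
        if (!st->retired)
            return false;
        for (int i = 0; i < n_loads; i++)
            if (loads[i].pending && loads[i].addr == st->addr)
                return false;   /* overlap: wait for the RFO / forward instead */
        return true;            /* separate line: commit now, ahead of the older load */
    }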


For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)

That could be a good design idea for uarches with a small OoO window that's too small to come close to hiding a cache miss. (Which to be fair, is every high-performance OoO exec CPU: memory latency is usually too high to fully hide.)


We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on Tegra 2, which is based on the out-of-order Cortex-A9 uarch. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.
