How is load->store reordering possible with in-order commit?

Question

ARM allows reordering loads with subsequent stores, so that the following pseudocode:

    // CPU 0     | // CPU 1
    temp0 = x;   | temp1 = y;
    y = 1;       | x = 1;

can result in temp0 == temp1 == 1 (and this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (and it was my understanding that in-order commit is present in pretty much all OOO processors). My reasoning goes: "the load must have its value before it commits, it commits before the store, and the store's value can't become visible to other processors until it commits."
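To make the puzzle concrete, here is a minimal sketch that enumerates every interleaving of the two threads against a single shared memory. It is a toy model, not a description of any real CPU: the names `run`, `interleavings`, and `outcomes` are made up for illustration, and "reordering" is modeled simply by swapping the load and store in each thread's program.

```python
# Each operation is (kind, variable, destination register or None).
CPU0 = [("load", "x", "temp0"), ("store", "y", None)]
CPU1 = [("load", "y", "temp1"), ("store", "x", None)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, var, dest in schedule:
        if kind == "load":
            regs[dest] = mem[var]
        else:
            mem[var] = 1
    return regs["temp0"], regs["temp1"]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(cpu0, cpu1):
    """Collect the set of (temp0, temp1) results over all interleavings."""
    return {run(s) for s in interleavings(cpu0, cpu1)}

# Program order preserved: each load takes effect before its thread's store.
in_order = outcomes(CPU0, CPU1)
# Toy model of LoadStore reordering: each store takes effect before the load.
reordered = outcomes(CPU0[::-1], CPU1[::-1])

print("program order:", sorted(in_order))
print("store first:  ", sorted(reordered))
```

With program order preserved, (1, 1) never appears in the result set; it only shows up once a store is allowed to take effect ahead of the earlier load, which is exactly the behavior the question asks about.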

I'm guessing that one of my assumptions must be wrong, and something like one of the following must hold:

  • Instructions don't need to commit all the way in-order. A later store could safely commit and become visible before an earlier load, so long as at the time the store commits the core can guarantee that the previous load (and all intermediate instructions) won't trigger an exception, and that the load's address is guaranteed to be distinct from the store's.

  • The load can commit before its value is known. I don't have a guess as to how this would be implemented.

  • Stores can become visible before they are committed. Maybe a memory buffer somewhere is allowed to forward stores to loads on a different thread, even if the load was enqueued earlier?

  • Something else entirely?

There are a lot of hypothetical microarchitectural features that would explain this behavior, but I'm most curious about the ones that are actually present in modern weakly ordered CPUs.

Answer

Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking permissions (TLB) on a load to make sure it can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).

I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.

You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an SMT core (multiple logical threads on one physical core), it would link speculation on two logical threads together, requiring them both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT of having one logical thread take advantage of stalls in others.

(Related: making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. See the Q&A "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?")

CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like Cortex-A53 are still widely used in modern smartphones.
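As a rough illustration of "only stall if an instruction tries to use the result", the sketch below tracks when each pending load's register becomes ready and delays an instruction only if it reads such a register. Everything here is an assumption made for the example: the function name `run_in_order`, the one-instruction-per-cycle issue rate, the 10-cycle miss latency, and the three-instruction program (the `add` is a hypothetical consumer of the loaded value).

```python
def run_in_order(program, load_latency):
    """Return the cycle at which each instruction issues on a toy in-order core."""
    ready_at = {}        # register -> cycle when its pending load data arrives
    cycle = 0
    issue_cycles = []
    for op, dest, sources in program:
        # Stall only while a source register is still waiting on a load buffer.
        for src in sources:
            cycle = max(cycle, ready_at.get(src, 0))
        issue_cycles.append(cycle)
        if op == "load":
            # The load leaves the pipeline now; its data shows up later.
            ready_at[dest] = cycle + load_latency
        cycle += 1       # at most one instruction issues per cycle
    return issue_cycles

program = [
    ("load", "r0", ["r1"]),          # load into r0 from [r1], misses in cache
    ("store", None, ["r2", "r3"]),   # store r2 to [r3], independent of r0
    ("add", "r4", ["r0", "r2"]),     # first instruction that reads r0
]
issue = run_in_order(program, load_latency=10)
print(issue)
```

In this model the store issues on the cycle right after the load, and only the instruction that actually reads r0 waits for the miss to resolve.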

So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. (Jeff Preshing's intro to memory reordering uses that example for LoadStore, but doesn't get into uarch details at all.)

A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.

So the sequence on an in-order pipeline is:

  • lw r0, [r1]: TLB hit, but misses in L1d cache. The load execution unit writes the address (r1) into a load buffer. Any later instruction that tries to read r0 will stall, but we know for sure that the load didn't fault.

    With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later instructions.

  • Any number of other instructions that don't read r0. (Reading r0 would stall an in-order pipeline.)

  • sw r2, [r3]: the store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.

    Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering.)

    Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
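The sequence above can be sketched as a toy timeline for the two threads from the question. This is only an illustration under stated assumptions: the function name `simulate`, the latencies, and the idea that a load "samples" global memory at the moment its cache line arrives are all simplifications, not a model of any particular core. It assumes each core already owns the line it stores to (so the store hits) and misses on the line it loads.

```python
import heapq

def simulate(load_latency, store_latency):
    """Toy timeline of two in-order cores; returns (temp0, temp1)."""
    memory = {"x": 0, "y": 0}   # globally visible (coherent cache) state
    regs = {}
    events = []                 # min-heap of (cycle, seq, action, var, dest)
    seq = 0                     # tie-breaker so heap entries stay comparable
    # (destination register, load address, store address) for each core.
    programs = [("temp0", "x", "y"), ("temp1", "y", "x")]
    for dest, load_var, store_var in programs:
        # Cycle 0: the load issues, passes its TLB check, takes a load buffer
        # entry, and retires. Its data arrives load_latency cycles later.
        heapq.heappush(events, (0 + load_latency, seq, "load_data", load_var, dest))
        seq += 1
        # Cycle 1: the store issues and retires, then commits to L1d
        # (becomes globally visible) store_latency cycles later.
        heapq.heappush(events, (1 + store_latency, seq, "store_commit", store_var, None))
        seq += 1
    # Replay the events in time order against the shared memory.
    while events:
        _, _, action, var, dest = heapq.heappop(events)
        if action == "load_data":
            regs[dest] = memory[var]
        else:
            memory[var] = 1
    return regs["temp0"], regs["temp1"]

miss_then_hit = simulate(load_latency=10, store_latency=1)
all_hits = simulate(load_latency=1, store_latency=1)
print("load misses, store hits:", miss_then_hit)
print("everything hits:        ", all_hits)
```

When the loads miss and the stores hit, both stores become visible before either load gets its data, so both loads observe 1. When the loads are as fast as the stores, the loads read first and the outcome stays within what program order allows.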

For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)

That could be a good design idea for uarches with a small OoO window that's too small to come close to hiding a cache miss.

We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on Tegra 2, which is based on the out-of-order Cortex-A9 uarch. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.
