Regarding instruction ordering in executions of cache-miss loads before cache-hit stores on x86

Problem description

Given the small program shown below (handcrafted to look the same from a sequential consistency / TSO perspective), and assuming it's being run by a superscalar out-of-order x86 CPU:

Load A       <-- A is in main memory
Load B       <-- B is in L2
Store C, 123 <-- C is in L1
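
For concreteness, here is a hedged C++ analogue of that access pattern (my sketch, not part of the original question; volatile just keeps the compiler from optimizing the accesses away, since the question is about hardware reordering, not compiler reordering):

#include <cstdint>

volatile uint64_t A;   // assume: line resides only in main memory (miss)
volatile uint64_t B;   // assume: line currently in L2
volatile uint64_t C;   // assume: line hot in L1d

void demo() {
    uint64_t a = A;    // Load A  (DRAM-latency miss)
    uint64_t b = B;    // Load B  (L2 hit)
    C = 123;           // Store C (L1d hit)
    (void)a; (void)b;  // silence unused-variable warnings
}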

I have a few questions:

  1. Assuming a big enough instruction window, will the three instructions be fetched, decoded, and executed at the same time? I assume not, since that would break execution in program order.

  2. The load fetching A from main memory will take much longer to execute fully than the one fetching B. Does the load of B only start after Load A has fully executed? If not, until what point does it have to wait?

  3. Why would the store have to wait for the loads? If it does, will the instruction just wait to be committed in the store buffer until the loads finish, or will it, after decoding, have to sit and wait for the loads?

Thanks

Recommended answer

Terminology: "instruction-window" normally means out-of-order execution window, over which the CPU can find ILP. i.e. ROB or RS size. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

The term for how many instructions can go through the pipeline in a single cycle is pipeline width. e.g. Skylake is 4-wide superscalar out-of-order. (Parts of its pipeline, like decode, uop-cache fetch, and retirement, are wider than 4 uops, but issue/rename is the narrowest point.)

Terminology: "wait to be committed in the store buffer": store data + address gets written into the store buffer when a store executes. It commits from the store buffer to L1d at any point after retirement, when it's known to be non-speculative.

(In program order, to maintain the TSO memory model of no store reordering. A store buffer allows stores to execute inside this core out of order but still commit to L1d (and become globally visible) in-order. Executing a store = writing address + data to the store buffer.)
See also: what is a store buffer? and Size of store buffers on Intel hardware? What exactly is a store buffer?
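
To make the store buffer's effect concrete, here is a hedged litmus-test sketch (my example, the classic store-buffering pattern, not code from the answer). Each thread's store sits in its private store buffer while the following load executes, so on x86 both threads can read 0: StoreLoad is the one reordering TSO permits.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);  // enters t1's store buffer
        r1 = y.load(std::memory_order_relaxed); // can execute before the store commits to L1d
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    // r1 == 0 && r2 == 0 is observable on x86 (run many iterations to catch it):
    // each load executed before the other thread's store left its store buffer.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}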

The front-end is irrelevant. 3 consecutive instructions might well be fetched in the same 16-byte fetch block, and might go through pre-decode and decode in the same cycle as a group. And (also or instead) issue into the out-of-order back-end as part of a group of 3 or 4 uops. IDK why you think any of that would cause any potential problem.

The front end (from fetch to issue/rename) processes instructions in program order. Processing simultaneously doesn't put later instructions before earlier ones, it puts them at the same time. And more importantly, it preserves the information of what program order is; that's not lost or discarded, because it matters for instructions that depend on the previous one![1]

There are queues between most pipeline stages, so (for example on Intel Sandybridge) instructions that pre-decode as part of a group of up-to-6 instructions might not hit the decoders as part of the same group of up-to-4 (or more with macro-fusion). See https://www.realworldtech.com/sandy-bridge/3/ for fetch, and the next page for decode. (And the uop cache.)

Executing (dispatching uops to execution ports from the out-of-order scheduler) is where ordering matters. The out-of-order scheduler has to avoid breaking single-threaded code.[2]

Usually issue/rename is far ahead of execution, unless you're bottlenecked on the front-end. So there's normally no reason to expect that uops that issued together will execute together. (For the sake of argument, let's assume that the 2 loads you show do get dispatched for execution in the same cycle, regardless of how they got there via the front-end.)

But anyway, there's no problem here starting both loads and the store at the same time. The uop scheduler doesn't know whether a load will hit or miss in L1d. It just sends 2 load uops to the load execution units in a cycle, and store-address + store-data uops to those ports.

2) [load ordering]

This is the tricky part.

As I explained in an answer + comments on your last question, modern x86 CPUs will speculatively use the L2 hit result from Load B for later instructions, even though the memory model requires that this load happens after Load A.

But if no other cores write to cache line B before Load A completes, then nothing can tell the difference. The Memory-Order Buffer takes care of detecting invalidations of cache lines that were loaded from before earlier loads complete, and doing a memory-order mis-speculation pipeline flush (rollback to retirement state) in the rare case that allowing load re-ordering could change the result.
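
A hedged illustration of why this guarantee matters (my example, not from the answer): the classic message-passing pattern. On x86, the acquire load and release store below compile to plain mov instructions precisely because the hardware, backed by the mis-speculation machinery just described, already keeps loads (and stores) in order:

#include <atomic>

std::atomic<int> payload{0};
std::atomic<bool> ready{false};

void producer() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);     // plain mov on x86
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // plain mov on x86
    // The core may have speculatively loaded payload early (like Load B
    // completing before Load A); if that line was invalidated in the
    // meantime, the memory-order buffer flushes and replays, so a stale
    // 0 can never be returned here.
    return payload.load(std::memory_order_relaxed);
}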

3) Why would the store have to wait for the loads?

It won't, unless the store-address depends on a load value. The uop scheduler will dispatch the store-address and store-data uops to execution units when their inputs are ready.

The store is after the loads in program order, and the store buffer will make it even farther after the loads as far as global memory order is concerned. The store buffer won't commit the store data to L1d (making it globally visible) until after the store has retired. Since the store is after the loads in program order, by the time it retires the loads will have retired, too.
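
For the case where the store-address does depend on a load, a hedged sketch (hypothetical names, mine): the store-address uop simply waits in the scheduler for the load result, while the store-data uop (a constant here) is ready immediately:

#include <cstdint>

uint64_t table[256];
volatile uint64_t idx_src; // assume: the index must be loaded first, and is < 256

void dependent_store() {
    uint64_t idx = idx_src; // load uop produces the index
    table[idx] = 123;       // store-address uop waits on the load result;
                            // store-data uop (the constant 123) is ready at once
}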

(Retirement is in-order to allow precise exceptions, and to make sure no previous instructions took an exception or were a mispredicted branch. In-order retirement allows us to say for sure that an instruction is non-speculative after it retires.)

So yes, this mechanism does ensure that the store can't commit to L1d until after both loads have taken data from memory (via L1d cache which provides a coherent view of memory to all cores). So this prevents LoadStore reordering (of earlier loads with later stores).
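
In litmus-test form (a hedged sketch of the guarantee, not code from the answer): even with relaxed atomics, the generated x86 store below cannot become globally visible before the earlier load takes its value. (The C++ compiler is still free to reorder relaxed operations; the no-LoadStore guarantee is about the x86 asm.)

#include <atomic>

std::atomic<int> x{0}, y{0};

int observer_thread() {
    int r = x.load(std::memory_order_relaxed); // may miss all the way to DRAM
    y.store(1, std::memory_order_relaxed);     // L1d hit, but it can't commit from
                                               // the store buffer until after it
                                               // retires, i.e. after the load completes
    return r;
}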

I'm not sure if any weakly-ordered OoO CPUs do LoadStore reordering. It is possible on in-order CPUs, when a cache-miss load comes before a cache-hit store and the CPU uses scoreboarding to avoid stalling until the load result register is actually read, if it still isn't ready. (LoadStore is a weird one: see also Jeff Preshing's Memory Barriers Are Like Source Control Operations.) Maybe some OoO exec CPUs can also track cache-miss loads post-retirement, when they're known to be definitely happening but the data just still hasn't arrived yet. x86 doesn't do this because it would violate the TSO memory model.

Footnote 1: There are some architectures (typically VLIW) where bundles of simultaneous instructions are part of the architecture in a way that's visible to software. So if software can't fill all 3 slots with instructions that can execute simultaneously, it has to fill them with NOPs. It might even be allowed to swap 2 registers with a bundle that contained mov r0, r1 and mov r1, r0, depending on whether the ISA allows instructions in the same bundle to read and write the same registers.
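
As a rough analogy in C++ (my illustration, not from the answer): a bundle in which both slots read the pre-bundle register values behaves like a parallel assignment, which is what makes the two-mov swap legal on such an ISA:

#include <tuple>

// Both "slots" read the old values of r0 and r1, then both write --
// like a VLIW bundle containing mov r0,r1 and mov r1,r0 executing together.
void bundle_style_swap(int& r0, int& r1) {
    std::tie(r0, r1) = std::make_tuple(r1, r0);
}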

But x86 is not like that: superscalar out-of-order execution must always preserve the illusion of running instructions one at a time in program order. The cardinal rule of OoO exec is: don't break single-threaded code.

Anything that would violate this can only be done by checking for hazards (https://en.wikipedia.org/wiki/Hazard_(computer_architecture)), or speculatively with rollback upon detection of mistakes.

Footnote 2: (continuing from footnote 1)

You can fetch / decode / issue two back-to-back inc eax instructions, but they can't execute in the same cycle because register renaming + the OoO scheduler has to detect that the 2nd one reads the output of the first.
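
A hedged C++ analogue (my example): the first function forms a latency chain like inc eax; inc eax, while the second has independent operations like inc eax; inc ecx that the scheduler can dispatch in the same cycle:

#include <cstdint>

// The second increment reads the first one's result: a 2-cycle dependency
// chain at best, no matter how wide the machine is.
uint64_t dependent(uint64_t a) {
    a = a + 1;
    a = a + 1;
    return a;
}

// No dependency between the two increments: they can execute in parallel.
void independent(uint64_t& a, uint64_t& b) {
    a = a + 1;
    b = b + 1;
}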
