What is instruction fusion in contemporary x86 processors?


Question



What I understand is, there are two types of instruction fusions:

  1. Micro-operation fusion
  2. Macro-operation fusion

Micro-operations are those operations that can be executed in 1 clock cycle. If several micro-operations are fused, we obtain an "instruction".

If several instructions are fused, we obtain a Macro-operation.

If several macro-operations are fused, we obtain Macro-operation fusing.

Am I correct?

Solution

No, fusion is totally separate from how one complex instruction (like cpuid or lock add [mem], eax) can decode to multiple uops.

The way the retirement stage figures out that all the uops for a single instruction have retired, and thus the instruction has retired, has nothing to do with fusion.


Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop (Intel and AMD CPUs). The rest of the pipeline sees it purely as a single uop¹ (except performance counters still count it as 2 instructions). This saves uop-cache space, and bandwidth everywhere including decode. In some code, compare-and-branch makes up a significant fraction of the total instruction mix, maybe around 25%, so choosing to look for this fusion rather than other possible fusions like mov dst,src1 / or dst,src2 makes sense.
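As an illustrative sketch (NASM syntax), a typical loop bottom where the decoders can apply this fusion:

```nasm
.loop:
    add   eax, [rsi]      ; loop body
    add   rsi, 4
    cmp   rsi, rdi        ; cmp + jb decode together...
    jb    .loop           ; ...into a single compare-and-branch uop

; by contrast, a pair like this is not a fusion candidate on current CPUs:
    mov   edx, ecx
    or    edx, ebx
```

The fused pair costs one uop-cache entry and one issue slot, the same as a lone ALU or branch instruction would.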

Sandybridge-family can also macro-fuse some other ALU instructions with conditional branches, like add/sub or inc/dec + JCC with some conditions. (x86_64 - Assembly - loop conditions and out of order)
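For example (illustrative NASM), a counted loop where the decrement itself supplies the branch condition:

```nasm
    mov   ecx, 1000
.top:
    ; ... loop body ...
    dec   ecx             ; dec + jnz macro-fuse on Sandybridge-family
    jnz   .top            ; (earlier Core 2 / Nehalem fused only cmp/test + jcc)
```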


Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline. But they still have to dispatch separately to separate execution units. And in Intel Sandybridge-family, the RS (Reservation Station aka scheduler) is in the unfused domain, so they're even stored separately in the scheduler. (See Footnote 2 in my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.)

P6 family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window there. But SnB-family reportedly simplified the uop format making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.

And Sandybridge family will "un-laminate" indexed addressing modes under some conditions, splitting them back into 2 separate uops in their own slots before issue/rename into the ROB in the out-of-order back end, so you lose the front-end issue/rename throughput benefit of micro-fusion. See Micro fusion and addressing modes
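A sketch of the distinction (NASM syntax; the exact un-lamination rules are detailed in the linked Q&A):

```nasm
    add   eax, [rdi]           ; one-register addressing mode: stays micro-fused
                               ; all the way through the pipeline (1 fused uop)
    or    eax, [rdi + rcx*4]   ; indexed mode: micro-fuses in the decoders but
                               ; un-laminates before issue on SnB/IvB; Haswell+
                               ; keep it fused because or reads+writes eax
```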


Both can happen at the same time

    cmp   [rdi], eax
    jnz   .target

The cmp/jcc can macro-fuse into a single cmp-and-branch ALU uop, and the load from [rdi] can micro-fuse with that uop.

Failure to micro-fuse the cmp does not prevent macro-fusion.

The limitations here are: RIP-relative + immediate can never micro-fuse, so cmp dword [static_data], 1 / jnz can macro-fuse but not micro-fuse.
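For instance (NASM syntax):

```nasm
default rel
    cmp   dword [static_data], 1  ; RIP-relative address plus an immediate:
    jnz   .mismatch               ; the pair macro-fuses, but the load can't
                                  ; micro-fuse (a single uop reportedly can't
                                  ; hold both a rel32 address and an imm32)
```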

A cmp/jcc on SnB-family (like cmp [rdi+rax], edx / jnz) will macro and micro-fuse in the decoders, but the micro-fusion will un-laminate before the issue stage. (So it's 2 total uops in both the fused-domain and unfused-domain: load with an indexed addressing mode, and ALU cmp/jnz). You can verify this with perf counters by putting a mov ecx, 1 in between the CMP and JCC vs. after, and note that uops_issued.any:u and uops_executed.thread both go up by 1 per loop iteration because we defeated macro-fusion. And micro-fusion behaved the same.

On Skylake, cmp dword [rdi], 0/jnz can't macro-fuse. (Only micro-fuse). I tested with a loop that contained some dummy mov ecx,1 instructions. Reordering so one of those mov instructions split up the cmp/jcc didn't change perf counters for fused-domain or unfused-domain uops.

But cmp [rdi],eax/jnz does macro- and micro-fuse. Reordering so a mov ecx,1 instruction separates CMP from JNZ does change perf counters (proving macro-fusion), and uops_executed is higher than uops_issued by 1 per iteration (proving micro-fusion).

cmp [rdi+rax], eax/jne only macro-fuses; not micro. (Well actually micro-fuses in decode but un-laminates before issue because of the indexed addressing mode, and it's not an RMW-register destination like sub eax, [rdi+rax] that can keep indexed addressing modes micro-fused. That sub with an indexed addressing mode does macro- and micro-fuse on SKL, and presumably Haswell).

(The cmp dword [rdi],0 does micro-fuse, though: uops_issued.any:u is 1 lower than uops_executed.thread, and the loop contains no nop or other "eliminated" instructions, or any other memory instructions that could micro-fuse).

Some compilers (including GCC IIRC) prefer to use a separate load instruction and then compare+branch on a register. TODO: check whether gcc and clang's choices are optimal with immediate vs. register.
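The two code shapes look like this (illustrative NASM):

```nasm
; memory-operand form: macro- and micro-fuses as discussed above
    cmp   [rdi], eax
    jne   .target

; separate-load form some compilers emit instead
    mov   ecx, [rdi]
    cmp   ecx, eax        ; register-register cmp + jcc reliably macro-fuse
    jne   .target
```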


Micro-operations are those operations that can be executed in 1 clock cycle.

Not exactly. They take 1 "slot" in the pipeline, or in the ROB and RS that track them in the out-of-order back-end.

And yes, dispatching a uop to an execution port happens in 1 clock cycle and simple uops (e.g., integer addition) can complete execution in the same cycle. This can happen for up to 8 uops simultaneously since Haswell, but increased to 10 on Sunny Cove. The actual execution might take more than 1 clock cycle (occupying the execution unit for longer, e.g. FP division).

The divider is, I think, the only execution unit on modern mainstream Intel that's not fully pipelined, but Knight's Landing has some not-fully-pipelined SIMD shuffles that are single uop but have a (reciprocal) throughput of 2 cycles.


Footnote 1:

If cmp [rdi], eax / jne faults on the memory operand, i.e. a #PF exception, it's taken with the exception return address pointing to before the cmp. So I think even exception handling can still treat it as a single thing.

Or if the branch target address is bogus, a #PF exception will happen after the branch has already executed, from code fetch with an updated RIP. So again, I don't think there's a way for cmp to execute successfully and the jcc to fault, requiring an exception to be taken with RIP pointing to the JCC.

But even if that case is a possibility the CPU needs to be designed to handle, sorting that out can be deferred until the exception is actually detected. Maybe with a microcode assist, or some special-case hardware.

As far as how the cmp/jcc uop goes through the pipeline in the normal case, it works exactly like one long single-uop instruction that both sets flags and conditionally branches.

Surprisingly, the loop instruction (like dec rcx/jnz but without setting flags) is not a single uop on Intel CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?.
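The idiomatic replacement compilers use instead of `loop` (illustrative NASM):

```nasm
; slow on Intel: loop decodes to multiple uops
    loop  .top

; fast equivalent, when clobbering FLAGS is acceptable:
    dec   rcx
    jnz   .top            ; macro-fuses into a single dec-and-branch uop
```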
