Why is POP slow when using register R12?

Views: 101 | Published: 2021/5/16 19:19:16 | Tags: performance x86 intel cpu-architecture micro-optimization

Question

On recent Intel CPUs, the POP instruction usually has a throughput of 2 instructions per cycle. However, when using register R12 (or RSP, which has the same encoding except for the prefix), the throughput drops to 1 per cycle if the instructions go through the legacy decoders (the throughput stays at around 2 per cycle if the µops come from the DSB).

This can be reproduced using nanoBench as follows:

  sudo ./nanoBench.sh -asm "pop R12"

Further experiments on a Haswell machine show the following: when adding between 1 and 4 nops,

  sudo ./nanoBench.sh -asm "pop R12; nop;"
  sudo ./nanoBench.sh -asm "pop R12; nop; nop;"
  sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop;"
  sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop;"

the execution time increases to 2 cycles. When adding a 5th nop,

  sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop; nop;"

the execution time increases to 3 cycles. This suggests that no other instruction can be decoded in the same cycle as a pop R12 instruction. (When using a different register, e.g., R11, the last example needs 1.5 cycles.)

On Skylake, the execution time stays at 1 cycle when adding between 1 and 3 nops, and increases to 2 cycles for between 4 and 7 nops. This suggests that pop R12 is an instruction that requires the complex decoder, even though it has just one µop (see also the related question "Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?").
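The two interpretations above can be checked against the reported numbers with a toy model of the legacy decoders. This is only a sketch under stated assumptions, not a description of the real hardware: the function name `decode_cycles_per_iteration`, the rule names, and the decode width of 4 are all made up for illustration, and the model ignores every back-end limit (so it says nothing about the 2-per-cycle POP throughput itself).

```python
def decode_cycles_per_iteration(n_nops, pop_rule, width=4, iterations=1000):
    """Toy model: average decode cycles for one 'pop; nop * n_nops' block.

    pop_rule (hypothetical names, chosen for this sketch):
      'any'   - pop can be decoded in any decoder slot
      'first' - pop must be in slot 0 (complex decoder), others may follow
      'alone' - pop must be in slot 0 and nothing else decodes that cycle
    width=4 is an assumption picked to match the measurements in the question.
    """
    stream = (["pop"] + ["nop"] * n_nops) * iterations
    cycles = 0
    i = 0
    while i < len(stream):
        cycles += 1          # start a new decode cycle
        slot = 0
        while i < len(stream) and slot < width:
            is_pop = stream[i] == "pop"
            if is_pop and pop_rule != "any" and slot != 0:
                break        # pop has to wait for slot 0 of the next cycle
            i += 1
            slot += 1
            if is_pop and pop_rule == "alone":
                break        # nothing else is decoded alongside this pop
    return cycles / iterations


# Haswell-like behaviour for pop R12, per the question: 1-4 nops -> 2, 5 nops -> 3
print([decode_cycles_per_iteration(n, "alone") for n in range(1, 6)])
# pop R11 with 5 nops: six instructions packed four per cycle
print(decode_cycles_per_iteration(5, "any"))
# Skylake-like behaviour for pop R12: 1-3 nops -> 1, 4-7 nops -> 2
print([decode_cycles_per_iteration(n, "first") for n in range(1, 8)])
```

With the 'alone' rule the model gives the Haswell pattern, and with the 'first' rule it gives the Skylake pattern, which is consistent with (but does not prove) the explanation that pop R12 is restricted to the complex decoder.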

Why is the POP instruction decoded differently when using register R12? Are there any other instructions for which this is also the case?

Solution

Workaround: the pop r/m64 encoding of pop r12 doesn't have this decode penalty. (Thanks @Andreas for testing my guess.)

  db 0x41, 0x8f, 0xc4    ; REX.B=1  8F /0  pop r/m64 = pop r12

The standard encoding of pop r12 has the same opcode byte as pop rsp, differing only by a REX prefix. (The short-form encoding puts the register number in the low 3 bits of that 1 opcode byte.)
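To make the two encodings concrete, here is a small Python sketch that builds the bytes by hand. The helper names `pop_short` and `pop_modrm` are made up for this illustration. It only covers 64-bit register operands, using the documented forms 58+rd (short form) and 8F /0 (pop r/m64).

```python
# Register numbers 0..15 in x86-64 encoding order.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def pop_short(reg):
    """Short form: opcode 58+rd, register number in the low 3 opcode bits."""
    n = REG64.index(reg)
    opcode = 0x58 + (n & 7)
    # r8-r15 need a REX prefix with REX.B set (0x41) for the 4th register bit.
    return bytes([0x41, opcode]) if n >= 8 else bytes([opcode])


def pop_modrm(reg):
    """Long form: 8F /0 with a register-direct ModRM byte (mod=11, reg=000)."""
    n = REG64.index(reg)
    modrm = 0xC0 | (n & 7)
    return bytes([0x41, 0x8F, modrm]) if n >= 8 else bytes([0x8F, modrm])


print(pop_short("rsp").hex())   # 5c
print(pop_short("r12").hex())   # 415c   (same opcode byte as pop rsp)
print(pop_modrm("r12").hex())   # 418fc4 (the workaround encoding)
```

The short forms of pop rsp and pop r12 share the opcode byte 0x5C, while the 8F /0 form uses a different opcode byte and costs one extra byte of code size.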

pop rsp is special cased even in the decoders; on Haswell it's 3 uops¹ so clearly only the complex decoder can decode it. pop r12 also getting penalized makes sense if the primary filtering of which decoder can decode which instruction is by the opcode byte (not accounting for prefixes), at least for this group of opcodes. Whether this really reflects the exact internals, it's at least a useful mental model to understand why pop modrm doesn't have this effect. (Although normally you'd only use pop r/m64 with a memory destination, which would mean multi-uop and thus complex-decoder only.)

push rsp is 2 total uops on Haswell, unlike most push reg instructions being 1 uop. But likely that extra uop is just a stack-sync inserted during issue/rename (because of reading RSP), not during decode. @Andreas reports that push rsp and push r12 both show no special effects in the decoder (and I assume uop cache). Just 1 micro-fused uop, with/without a stack-sync uop when it executes.

Opcodes like FF /0 inc r/m32, where the same leading byte is shared between different instructions (overloading the modrm /r field as extra opcode bits), might be interesting to check on, if there are some single-uop instructions that share a leading byte with multi-uop instructions. Like maybe C0 /4 SHL r/m8, imm8 vs. C0 /2 RCL r/m8, imm8 (see http://ref.x86asm.net/coder64.html). But SHL with a memory destination can already be multiple uops, so it might get optimistically attempted by the simple decoders anyway, and succeed if it turns out to be single-uop? While perhaps pop r12 bails out early in the simple decoders instead of detecting the REX prefix.
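As an illustration of how the /r field selects the instruction within such a group, here is a minimal sketch that splits a ModRM byte into its fields and looks up the C0-group mnemonic. The names `modrm_fields`, `GROUP2`, and `c0_mnemonic` are hypothetical; the table follows the usual /0../7 ordering of the shift/rotate group.

```python
def modrm_fields(modrm):
    """Split a ModRM byte into (mod, reg, rm); reg is the '/digit' field."""
    return (modrm >> 6) & 3, (modrm >> 3) & 7, modrm & 7


# Shift/rotate group selected by the /digit field (/6 is an alias of SHL).
GROUP2 = ["rol", "ror", "rcl", "rcr", "shl", "shr", "sal", "sar"]


def c0_mnemonic(insn_bytes):
    """Return the mnemonic of a 'C0 /digit ib' instruction (r/m8, imm8)."""
    assert insn_bytes[0] == 0xC0, "this sketch only handles opcode C0"
    _mod, reg, _rm = modrm_fields(insn_bytes[1])
    return GROUP2[reg]


print(c0_mnemonic(bytes([0xC0, 0xE0, 0x01])))  # shl  (shl al, 1)
print(c0_mnemonic(bytes([0xC0, 0xD0, 0x01])))  # rcl  (rcl al, 1)
```

Both instructions start with the same byte 0xC0, so a decoder that filtered only on the leading byte could not tell the cheap shift apart from the more expensive rotate-through-carry.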

It would make sense for Intel to spend the transistors to make sure common instructions like immediate shifts can decode efficiently, more so than for less-common instructions like pop r12, which you'll normally only find in function epilogues and thus usually not in inner loops, only in larger loops that include function calls.

Footnote 1: pop rsp is special because it's just mov rsp, [rsp]. (Or as the manual puts it, "The POP ESP instruction increments the stack pointer (ESP) before data at the old top of stack is written into the destination.") Haswell's 3-uop implementation seems unnecessary vs. literally the same 1 uop as mov rsp, [rsp] (I think the fault conditions are identical), but this might have saved transistors in the decoders by adding a uop to the normal way pop reg decodes (possibly implicitly requiring a stack-sync uop for a total of 3), instead of treating it as a whole separate instruction? pop rsp is very rarely used so its performance doesn't matter.

Perhaps the 16-bit pop sp case was a problem for decoding that byte as 1 pure-load uop? There is no [sp] addressing mode in x86 machine code, and it's possible that limitation extends to internal uops for 16-bit AGU. Other than that, I think the possible fault reasons are the same for pop and mov.

pop r12 (short form) does eventually decode to the normal 1 uop, with no more stack-sync uops than for repeated pop of other registers, as per @Andreas's testing. It gets penalized by not being decodable in the simple decoders, but not by any extra uops that pop rsp specifically decodes to.

Perhaps GAS, NASM, and other assemblers should get a patch to make it possible to encode pop r12 with the modrm encoding, although probably not defaulting to that. Decoder throughput is often not a problem so spending an extra byte of code-size by default would be undesirable. Especially if there's no effect on other uarches, like AMD or Silvermont-family.

And/or GCC should use R12 as its last choice of call-preserved reg to save/restore? (R12 always needs a SIB byte when used as the base in an addressing mode, too, so that's another reason to avoid it, if compilers aren't going to try to avoid keeping pointers in it.) And maybe schedule the push/pop of r12 for efficient decoding, with 3 other pops (or other single-uop insns) after it before the multi-uop ret.
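The SIB-byte point can be seen by hand-encoding a simple load. The following is a sketch only: `mov_rax_from_base` is a hypothetical helper that encodes mov rax, [base] (REX.W 8B /r) with no index register, and it handles just the two special cases that matter here (the rsp/r12 SIB requirement and the rbp/r13 disp8 requirement).

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def mov_rax_from_base(base):
    """Encode 'mov rax, [base]' for a 64-bit base register, no index."""
    n = REG64.index(base)
    low3 = n & 7
    rex = 0x48 | (1 if n >= 8 else 0)      # REX.W, plus REX.B for r8-r15
    mod = 0b01 if low3 == 5 else 0b00      # rbp/r13 as base need a disp8
    out = [rex, 0x8B, (mod << 6) | low3]   # ModRM reg field = 000 (rax)
    if low3 == 4:
        out.append(0x24)                   # rm=100 means "SIB follows": no index, base=100
    if low3 == 5:
        out.append(0x00)                   # disp8 = 0
    return bytes(out)


for reg in ("rbx", "r11", "rsp", "r12"):
    print(reg, mov_rax_from_base(reg).hex())
# rbx 488b03
# r11 498b03
# rsp 488b0424
# r12 498b0424
```

As with pop, the low 3 register bits of r12 match rsp, so r12 inherits the extra SIB byte even though it is an ordinary general-purpose register.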
