The advantages of using 32bit registers/instructions in x86-64

Question

Sometimes gcc uses a 32-bit register where I would expect it to use a 64-bit register. For example, the following C code:

unsigned long long 
div(unsigned long long a, unsigned long long b){
    return a/b;
}

is compiled with the -O2 option to (leaving out some boilerplate):

div:
    movq    %rdi, %rax
    xorl    %edx, %edx
    divq    %rsi
    ret

For the unsigned division, the register %rdx needs to be 0. This can be achieved with xorq %rdx, %rdx, but xorl %edx, %edx seems to have the same effect.
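The role of %rdx can be sketched with GCC-style inline assembly. This is a minimal illustration, not the compiler's own output: the function name div_u64 is made up for the example, and the inline asm is only used when building for x86-64 with a GNU-compatible compiler (other targets fall back to plain C division).

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 64-bit division written out the way the listing above does it.
 * divq divides the 128-bit value rdx:rax by its operand, so rdx must be
 * zero first. xorl %edx, %edx is enough, because a 32-bit register write
 * also clears bits 63:32 of the full register. */
uint64_t div_u64(uint64_t a, uint64_t b)
{
    assert(b != 0); /* divq faults on a zero divisor */
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t quot, rem;
    __asm__("xorl %%edx, %%edx\n\t"
            "divq %[divisor]"
            : "=a"(quot), "=&d"(rem)     /* rdx is written before the divisor is read */
            : "0"(a), [divisor] "r"(b)
            : "cc");
    (void)rem;                           /* remainder is not needed here */
    return quot;
#else
    return a / b;                        /* portable fallback (assumption: non-x86 build) */
#endif
}
```

Calling div_u64(100, 7) should give the same quotient as the plain C expression 100 / 7.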

At least on my machine there was no performance gain (i.e. speed-up) for xorl over xorq.

I actually have more than one question:

  1. Why does gcc prefer the 32bit version?
  2. Why does gcc stop at xorl and not use xorw?
  3. Are there machines for which xorl is faster than xorq?
  4. Should one always prefer 32bit registers/operations, if possible, over 64bit registers/operations?

Recommended answer

Why does gcc prefer the 32bit version?

Mainly code size: no REX prefix needed in the machine-code encoding.

Why does gcc stop at xorl and not use xorw?

Writing an 8 or 16-bit partial register doesn't zero-extend to the rest of the register. (Only writing a 32-bit register implicitly zero-extends to 64)
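The difference between a 32-bit and a 16-bit register write can be observed directly. The sketch below assumes GCC or Clang on x86-64 (it uses the %k and %w operand modifiers to name the 32-bit and 16-bit views of a register); the helper names are invented for the example, and on other targets the functions only model the x86-64 rule in plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Write only the low 32 bits of a 64-bit register that starts as `init`. */
uint64_t write_low32(uint64_t init, uint32_t value)
{
    uint64_t reg = init;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("movl %1, %k0" : "+r"(reg) : "r"(value)); /* %k0 = 32-bit view */
#else
    reg = value; /* model: the upper 32 bits become zero */
#endif
    return reg;
}

/* Write only the low 16 bits of a 64-bit register that starts as `init`. */
uint64_t write_low16(uint64_t init, uint16_t value)
{
    uint64_t reg = init;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("movw %1, %w0" : "+r"(reg) : "r"(value)); /* %w0 = 16-bit view */
#else
    reg = (reg & ~UINT64_C(0xFFFF)) | value; /* model: upper 48 bits are kept */
#endif
    return reg;
}

/* Start from all-ones so that any surviving upper bits are visible. */
void check_partial_writes(void)
{
    assert(write_low32(UINT64_MAX, 0x11223344u) == UINT64_C(0x0000000011223344));
    assert(write_low16(UINT64_MAX, 0x5566u) == UINT64_C(0xFFFFFFFFFFFF5566));
}
```

The 16-bit write has to merge with the old upper bits, which is where the dependency on the previous register value comes from.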

Besides, xorw requires an operand-size prefix to encode, so it's the same size as xorq, larger than xorl. 32-bit operand-size is the default in x86-64 machine code, no prefixes required. (For most instructions; a few like push/pop and call/jmp default to 64-bit, including memory-indirect call [rdi] = ff 17 with a pointer in memory.) 8-bit operand size uses separate opcodes, not prefixes, but still potentially has partial-register penalties.
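To make the size comparison concrete, here is a small table of hand-assembled encodings for the xor-zeroing variants discussed above. The bytes assume the 31 /r form of xor (an assembler could equally pick the 33 /r form, with the same lengths), and the struct and function names are only illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* One instruction (AT&T syntax) with its machine-code bytes. */
struct encoding {
    const char *insn;
    unsigned char bytes[4];
    size_t len;
};

const struct encoding xor_zeroing[] = {
    { "xorl %edx, %edx",   { 0x31, 0xD2 },       2 }, /* default operand size, no prefix */
    { "xorq %rdx, %rdx",   { 0x48, 0x31, 0xD2 }, 3 }, /* REX.W prefix */
    { "xorw %dx, %dx",     { 0x66, 0x31, 0xD2 }, 3 }, /* operand-size prefix */
    { "xorl %r10d, %r10d", { 0x45, 0x31, 0xD2 }, 3 }, /* REX.R + REX.B for r10 */
    { "xorq %r10, %r10",   { 0x4D, 0x31, 0xD2 }, 3 }, /* REX.W + REX.R + REX.B */
};

/* In 64-bit mode, 0x40..0x4F directly before the opcode is a REX prefix. */
int starts_with_rex(const struct encoding *e)
{
    assert(e != NULL && e->len > 0);
    return (e->bytes[0] & 0xF0) == 0x40;
}

/* 0x66 is the operand-size override prefix (16-bit operands here). */
int starts_with_opsize_prefix(const struct encoding *e)
{
    assert(e != NULL && e->len > 0);
    return e->bytes[0] == 0x66;
}
```

Only the first entry is prefix-free, which is why it is the shortest way to zero one of the eight legacy registers.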

See also Why doesn't GCC use partial registers? 32-bit registers are not considered partial registers, because writing them always writes the whole 64-bit register. (And it's writing partial regs that's the main problem, not reading them after a full-width write.)

Are there machines for which xorl is faster than xorq?

Yes, Silvermont / KNL only recognize xor-zeroing as a zeroing idiom (dependency breaking, and other good stuff) with 32-bit operand size. Thus, even though code-size is the same, xor %r10d, %r10d is much better than xor %r10, %r10. (xor needs a REX prefix for r10 regardless of operand-size).

On all CPUs, code size always potentially matters for decode and I-cache footprint (except when a later .p2align directive would just add more padding if the preceding code is smaller; see footnote 1). There's no downside to using 32-bit operand size for xor-zeroing (or to implicit zero-extension in general instead of explicit; see footnote 2), including using AVX vpxor xmm0,xmm0,xmm0 to zero AVX512 zmm0.

Most instructions run at the same speed for all operand sizes, because modern x86 CPUs can afford the transistor budget for wide ALUs. Exceptions: imul r64,r64 is slower than imul r32,r32 on AMD CPUs before Ryzen and on Intel Atom, and 64-bit div is significantly slower on all CPUs. AMD pre-Ryzen has slower popcnt r64. Atom/Silvermont have slow shld/shrd r64 vs. r32. Mainstream Intel (Skylake etc.) has slower bswap r64.

Should one always prefer 32bit register/operations if possible rather than 64bit register/operations?

Yes, prefer 32-bit ops for code-size reasons at least, but note that using r8..r15 anywhere in an instruction (including an addressing mode) will also require a REX prefix. So if you have some data you can use 32-bit operand-size with (or pointers to 8/16/32-bit data), prefer to keep it in the low 8 named registers (e/rax..) rather than high 8 numbered registers.
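The rule for when a REX prefix appears can be written down as a small function. This is a simplified sketch, not an encoder: rex_prefix is a hypothetical helper, registers are given by their hardware numbers (rax=0 ... rdi=7, r8..r15 = 8..15), and it ignores special cases such as the byte registers spl/bpl/sil/dil (which need a bare REX) and instructions whose operand size defaults to 64-bit.

```c
#include <assert.h>
#include <stdint.h>

/* REX prefix layout: 0100WRXB.
 *   W = 64-bit operand size
 *   R = bit 3 of the ModRM.reg register number
 *   X = bit 3 of the SIB.index register number
 *   B = bit 3 of the ModRM.rm / SIB.base register number
 * Returns the prefix byte, or 0 when no prefix is needed.
 * Pass 0 for `index` when the instruction has no index register. */
uint8_t rex_prefix(int operand64, unsigned reg, unsigned index, unsigned base)
{
    assert(reg < 16 && index < 16 && base < 16);
    unsigned w = operand64 ? 1u : 0u;
    unsigned r = (reg >> 3) & 1u;
    unsigned x = (index >> 3) & 1u;
    unsigned b = (base >> 3) & 1u;
    unsigned bits = (w << 3) | (r << 2) | (x << 1) | b;
    return bits ? (uint8_t)(0x40u | bits) : 0;
}
```

For example, a 32-bit load through rdi (base = 7) needs no prefix, while the same load through r8 (base = 8) yields 0x41 even though the operand size is still 32-bit.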

But don't spend extra instructions to make this happen; saving a few bytes of code size is usually the least important consideration. For example, if you need an extra register that doesn't have to be call-preserved, just use r8d instead of saving/restoring rbx so you can use ebx. Using 32-bit r8d instead of 64-bit r8 won't help with code size, but it can be faster for some operations on some CPUs (see above).

This also applies to cases where you only care about the low 16 bits of a register, but it can still be more efficient to use a 32-bit add instead of 16-bit.
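The reason a 32-bit add is safe here is that carries only propagate upward, so whatever sits in the upper bits of the inputs cannot change the low 16 bits of the sum. A minimal sketch in C (the function names are made up for the example):

```c
#include <assert.h>
#include <stdint.h>

/* Add with full 32-bit width, then keep only the low 16 bits. */
uint16_t add_low16(uint32_t a, uint32_t b)
{
    return (uint16_t)(a + b); /* unsigned wraparound is well defined */
}

/* Same low halves, different garbage in the upper halves: same result. */
void check_add_low16(void)
{
    assert(add_low16(0x0000FFFFu, 0x00000001u) == 0x0000);
    assert(add_low16(0xDEADFFFFu, 0xBEEF0001u) == 0x0000);
    assert(add_low16(0x00001234u, 0x00000101u) == 0x1335);
    assert(add_low16(0x12341234u, 0xABCD0101u) == 0x1335);
}
```

So when only the low 16 bits are consumed later, the wider add gives the same answer without the operand-size prefix or a partial-register write.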

See also http://agner.org/optimize/ and the x86 tag wiki.

Footnote 1: There are rare use-cases for making instructions longer than necessary (What methods can be used to efficiently extend instruction length on modern x86?)

  • To align a later branch target without needing a NOP.

  • Tuning for the front-end of a specific microarchitecture (i.e. optimizing decode by controlling where instruction boundaries are). Inserting NOPs would cost extra front-end bandwidth and completely defeat the purpose.

Assemblers won't do this for you, and doing it by hand is time consuming to re-do every time you change anything (and you may have to use .byte directives to manually encode the instruction).

Footnote 2: I've found one exception to the rule that implicit zero-extension is at least as cheap as a wider operation: Haswell/Skylake AVX 128-bit loads being read by a 256-bit instruction have an extra 1c of store-forwarding latency vs. being consumed by a 128-bit instruction. (Details in a thread on Agner Fog's blog forum.)
