The advantages of using 32bit registers/instructions in x86-64
Problem description
Sometimes gcc uses a 32-bit register when I would expect it to use a 64-bit register. For example, the following C code:

    unsigned long long
    div(unsigned long long a, unsigned long long b) {
        return a / b;
    }

is compiled with the -O2 option to (leaving out some boilerplate stuff):

    div:
        movq    %rdi, %rax
        xorl    %edx, %edx
        divq    %rsi
        ret
For the unsigned division, the register %rdx needs to be 0. This can be achieved by means of xorq %rdx, %rdx, but xorl %edx, %edx seems to have the same effect.

At least on my machine there was no performance gain (i.e. speed-up) for xorl over xorq.

I actually have more than one question:
- Why does gcc prefer the 32bit version?
- Why does gcc stop at xorl and not use xorw?
- Are there machines for which xorl is faster than xorq?
- Should one always prefer 32bit registers/operations if possible rather than 64bit registers/operations?
Solution

Why does gcc prefer the 32bit version?
Code size: no REX prefix needed.
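To make the code-size argument concrete, here is a small C table of hand-assembled encodings for the xor-zeroing variants discussed in this answer. It is only a sketch: the struct and function names are made up for illustration, and an assembler may choose the equivalent 33 /r opcode form instead of 31 /r, which has the same length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One hand-assembled encoding of a register-zeroing xor (illustrative type). */
struct encoding {
    const char *insn;        /* AT&T syntax, as used in the question */
    unsigned char bytes[4];  /* machine-code bytes, unused slots are 0 */
    size_t len;              /* number of bytes actually used */
};

static const struct encoding xor_encodings[] = {
    { "xorl %edx, %edx",   { 0x31, 0xD2 },       2 }, /* no prefix */
    { "xorq %rdx, %rdx",   { 0x48, 0x31, 0xD2 }, 3 }, /* REX.W prefix */
    { "xorw %dx, %dx",     { 0x66, 0x31, 0xD2 }, 3 }, /* operand-size prefix */
    { "xorl %r10d, %r10d", { 0x45, 0x31, 0xD2 }, 3 }, /* REX.R+B to reach r10 */
    { "xorq %r10, %r10",   { 0x4D, 0x31, 0xD2 }, 3 }, /* REX.W+R+B */
};

/* Print each instruction with its bytes and length. */
static void print_encodings(void) {
    size_t n = sizeof xor_encodings / sizeof xor_encodings[0];
    for (size_t i = 0; i < n; i++) {
        const struct encoding *e = &xor_encodings[i];
        assert(e->len <= sizeof e->bytes);
        printf("%-20s", e->insn);
        for (size_t j = 0; j < e->len; j++)
            printf(" %02X", e->bytes[j]);
        printf("  (%zu bytes)\n", e->len);
    }
}
```

Calling print_encodings() from a main function lists the five encodings. Only the plain 32-bit form on a legacy register fits in two bytes; for r10 a REX prefix is needed either way, so the 32-bit and 64-bit forms are the same size.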
Why does gcc stop at xorl and not use xorw?
Writing a 16bit partial register doesn't zero-extend to the rest of the register. Besides, xorw requires an operand-size prefix to encode, so it's larger than xorl. (See also "Why do x64 instructions zero the upper part of a 32 bit register" for historical background.)

Are there machines for which xorl is faster than xorq?
Yes, Silvermont / KNL only recognize xor-zeroing as a zeroing idiom (dependency breaking, and other good stuff) with 32-bit operand size. Thus, even though code size is the same, xor %r10d, %r10d is much better than xor %r10, %r10. (The xorl needs a REX prefix for r10.)

On other CPUs, code size always potentially matters (except when a later .p2align directive would just make more padding if the preceding code is smaller), and there's no downside to using 32-bit operand size for xor-zeroing.

imul r64,r64 is slower than imul r32,r32 on AMD CPUs and Intel Atom, and 64bit div is significantly slower on all CPUs. Other than that, most instructions are the same speed for all operand sizes, because modern x86 CPUs can afford the transistor budget for wide ALUs.

Using larger instructions instead of padding with a NOP is typically more efficient, but assemblers won't do this for you, and doing it by hand is time consuming (and you may have to use .byte directives to manually encode the instruction).

Should one always prefer 32bit registers/operations if possible rather than 64bit registers/operations?
Yes, for code-size reasons at least. This also applies when you only care about the low 16 bits of a register: it can still be more efficient to use a 32-bit add than a 16-bit one.
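The partial-register point and the low-16-bits remark can be modeled in plain C. The sketch below does not execute the instructions themselves; it only mimics their architectural effect on a 64-bit register value, and the function names are hypothetical helpers invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model: writing a 32-bit register name (e.g. %edx) zero-extends into the
   full 64-bit register, so the old upper half is discarded. */
static uint64_t write_reg32(uint64_t old, uint32_t value) {
    (void)old;               /* the previous contents do not matter */
    return (uint64_t)value;
}

/* Model: writing a 16-bit register name (e.g. %dx) merges into the old
   value and leaves the upper 48 bits untouched. */
static uint64_t write_reg16(uint64_t old, uint16_t value) {
    return (old & ~UINT64_C(0xFFFF)) | value;
}

/* The low 16 bits of a 32-bit add equal a 16-bit add of the low halves,
   so a 32-bit add is enough when only the low 16 bits are needed. */
static uint16_t add16_via_add32(uint32_t a, uint32_t b) {
    return (uint16_t)(a + b);
}

/* Small self-check tying the three models together. */
static void check_models(void) {
    uint64_t rdx = UINT64_C(0xDEADBEEFCAFEBABE);
    assert(write_reg32(rdx, 0) == 0);                            /* like xorl */
    assert(write_reg16(rdx, 0) == UINT64_C(0xDEADBEEFCAFE0000)); /* like xorw */
    assert(add16_via_add32(0x1234FFFFu, 1u) == (uint16_t)(0xFFFFu + 1u));
}
```

In this model, zeroing through the 32-bit name clears the whole register, while zeroing through the 16-bit name leaves stale upper bits behind. That is why xorw would not be a valid replacement before divq.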
See also http://agner.org/optimize/ and the x86 tag wiki.