The advantages of using 32-bit registers/instructions in x86-64


Problem description



    Sometimes gcc uses a 32-bit register when I would expect it to use a 64-bit register. For example, the following C code:

    unsigned long long 
    div(unsigned long long a, unsigned long long b){
        return a/b;
    }
    

    compiles with the -O2 option to the following (boilerplate omitted):

    div:
        movq    %rdi, %rax
        xorl    %edx, %edx
        divq    %rsi
        ret
    

    For unsigned division, the register %rdx needs to be 0, because divq divides the 128-bit value held in %rdx:%rax by its operand. This can be achieved with xorq %rdx, %rdx, but xorl %edx, %edx seems to have the same effect.
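To make the role of %rdx concrete, here is a rough C model of divq. It assumes a compiler that provides the `unsigned __int128` extension (GCC and Clang on 64-bit targets). The name `model_divq` is hypothetical, and the model truncates the quotient where real hardware would raise a divide error if it did not fit in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `divq`: the dividend is the 128-bit value rdx:rax.
   Real hardware faults if the quotient does not fit in 64 bits; this model
   simply truncates it instead. */
static uint64_t model_divq(uint64_t rdx, uint64_t rax, uint64_t divisor)
{
    assert(divisor != 0);
    unsigned __int128 dividend = ((unsigned __int128)rdx << 64) | rax;
    return (uint64_t)(dividend / divisor);
}
```

With `rdx` set to 0, `model_divq(0, a, b)` equals `a / b`. Any stale bits left in `rdx` become part of the dividend and change the quotient, which is why the compiler clears the register first.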

    At least on my machine, xorl showed no performance gain (i.e. speed-up) over xorq.

    I actually have more than one question:

    1. Why does gcc prefer the 32-bit version?
    2. Why does gcc stop at xorl and not use xorw?
    3. Are there machines for which xorl is faster than xorq?
    4. Should one always prefer 32-bit registers/operations over 64-bit registers/operations where possible?

    Solution

    Why does gcc prefer the 32-bit version?

    Code size: no REX prefix is needed, so xorl %edx, %edx encodes in 2 bytes while xorq %rdx, %rdx takes 3.

    Why does gcc stop at xorl and not use xorw?

    Writing a 16-bit partial register doesn't zero-extend into the rest of the register, so xorw would leave bits 16 to 63 of %rdx unchanged. Besides, xorw requires an operand-size prefix to encode, so it's larger than xorl. (See also "Why do x64 instructions zero the upper part of a 32 bit register" for historical background.)
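The difference between the two write widths can be modeled in a few lines of C. This is only an illustrative sketch of the architectural rule, not of any real API; the helper names `write32` and `write16` are hypothetical.

```c
#include <stdint.h>

/* Model of writing a 32-bit result into a 64-bit register (e.g. xorl, movl):
   the upper 32 bits are cleared, so the old contents do not survive. */
static uint64_t write32(uint64_t old_reg, uint32_t value)
{
    (void)old_reg;
    return (uint64_t)value;
}

/* Model of writing a 16-bit result (e.g. xorw): only bits 0..15 change,
   bits 16..63 keep their old value. */
static uint64_t write16(uint64_t old_reg, uint16_t value)
{
    return (old_reg & ~UINT64_C(0xFFFF)) | value;
}
```

Starting from a register holding `0xDEADBEEFCAFEF00D`, `write32(reg, 0)` gives 0, while `write16(reg, 0)` gives `0xDEADBEEFCAFE0000`, which would not be a valid high half for the division above.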

    Are there machines for which xorl is faster than xorq?

    Yes, Silvermont / KNL only recognize xor-zeroing as a zeroing idiom (dependency breaking, and other good stuff) with 32-bit operand size. Thus, even though the code size is the same, xor %r10d, %r10d is much better than xor %r10, %r10. (Even the 32-bit form needs a REX prefix to encode r10d.)
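The size claims can be checked with a small encoder sketch. It handles only the register-to-itself form of xor with opcode 0x31 and 16-, 32- or 64-bit operand sizes; the function name `encode_xor_self` is hypothetical, and an assembler may legitimately pick the equivalent 0x33 opcode instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `xor reg, reg` (opcode 0x31 /r, register-direct ModRM) for a
   register number 0..15 and an operand size of 16, 32 or 64 bits.
   Writes at most 4 bytes to out and returns the encoded length. */
static size_t encode_xor_self(unsigned reg, unsigned bits, uint8_t out[4])
{
    size_t n = 0;
    uint8_t rex = 0x40;
    if (bits == 16) out[n++] = 0x66;     /* operand-size prefix */
    if (bits == 64) rex |= 0x08;         /* REX.W selects 64-bit operands */
    if (reg >= 8)   rex |= 0x05;         /* REX.R and REX.B for r8..r15 */
    if (rex != 0x40) out[n++] = rex;     /* a bare 0x40 adds nothing here */
    out[n++] = 0x31;
    out[n++] = (uint8_t)(0xC0 | ((reg & 7) << 3) | (reg & 7));
    return n;
}
```

For register 2 (edx/rdx/dx) this yields 2 bytes at 32 bits and 3 bytes at 64 or 16 bits. For register 10 it yields 3 bytes at both 32 and 64 bits, since a REX prefix is required either way.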

    On other CPUs, code size always potentially matters (except when a later .p2align directive would just make more padding if the preceding code is smaller), and there's no downside to using 32-bit operand size for xor-zeroing.

    imul r64,r64 is slower than imul r32,r32 on AMD CPUs and Intel Atom, and 64-bit div is significantly slower on many CPUs (notably Intel before Ice Lake). Other than that, most instructions run at the same speed for all operand sizes, because modern x86 CPUs can afford the transistor budget for wide ALUs.
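One way to act on the division cost in C, sketched here as an assumption rather than a recommendation: check whether both operands fit in 32 bits and divide at that width when they do. The name `div_narrow_if_possible` is hypothetical, and whether the extra branch pays off depends on the CPU and on how predictable the operand sizes are.

```c
#include <assert.h>
#include <stdint.h>

/* Divide with 32-bit operands when both values fit in 32 bits, otherwise
   fall back to the 64-bit division. The quotient is the same either way. */
static uint64_t div_narrow_if_possible(uint64_t a, uint64_t b)
{
    assert(b != 0);
    if (((a | b) >> 32) == 0)
        return (uint32_t)a / (uint32_t)b;  /* compiler may emit a 32-bit div */
    return a / b;
}
```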

    Using larger instructions instead of padding with a NOP is typically more efficient, but assemblers won't do this for you, and doing it by hand is time consuming (and you may have to use .byte directives to manually encode the instruction).

    Should one always prefer 32-bit registers/operations over 64-bit registers/operations where possible?

    Yes, for code-size reasons at least. The same reasoning applies when you only care about the low 16 bits of a register: a 32-bit add can still be more efficient than a 16-bit one.
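The 16-bit case works because carries only propagate upward, so the low 16 bits of a wider sum match a true 16-bit sum. A minimal sketch, with the hypothetical helper name `add16_via32`:

```c
#include <stdint.h>

/* Add two 16-bit values using 32-bit arithmetic and keep only the low
   16 bits. The result wraps modulo 65536 exactly like a 16-bit add. */
static uint16_t add16_via32(uint16_t x, uint16_t y)
{
    uint32_t wide = (uint32_t)x + (uint32_t)y;  /* 32-bit operand size */
    return (uint16_t)wide;                      /* discard bits 16..31 */
}
```

C's integer promotions already widen 16-bit operands before adding, so the explicit casts here only spell out what the language does anyway.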

    See also http://agner.org/optimize/ and the x86 tag wiki.
