Why does division by 3 require a rightshift (and other oddities) on x86?

Question

I have the following C/C++ function:

unsigned div3(unsigned x) {
    return x / 3;
}

When compiled using clang 10 at -O3, this results in:

div3(unsigned int):
        mov     ecx, edi         # tmp = x
        mov     eax, 2863311531  # result = 3^-1
        imul    rax, rcx         # result *= tmp
        shr     rax, 33          # result >>= 33
        ret

What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3^-1 mod 2^32, which is 2863311531.

There are some things that I don't understand though:

  1. Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
  2. Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
  3. Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
  4. What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

Edit 1

For those who don't understand what I mean by 3^-1 mod 2^32, I am talking about the multiplicative inverse here. For example:

// multiplying with inverse of 3:
15 * 2863311531      = 42949672965
42949672965 mod 2^32 = 5

// using fixed-point multiplication
15 * 2863311531      = 42949672965
42949672965 >> 33    = 5

// simply dividing by 3
15 / 3               = 5

So multiplying with 2863311531 is actually equivalent to dividing by 3. I assumed clang's optimization was based on modular arithmetic, when it's really based on fixed-point arithmetic.

I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, 1 times 3^-1 is equal to 3^-1, not zero. Only fixed-point arithmetic has correct rounding.
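
The difference between the two interpretations can be sketched in C. This is only an illustration of the arithmetic described above; the function names div3_modular and div3_fixed_point are made up for this example and do not come from clang.

```c
#include <stdint.h>

/* Modular interpretation: keep the low 32 bits of x * 3^-1 (mod 2^32).
   Only gives x / 3 when x is an exact multiple of 3. */
uint32_t div3_modular(uint32_t x) {
    return x * UINT32_C(2863311531);
}

/* Fixed-point interpretation: 2863311531 == ceil(2^33 / 3), so the bits
   above bit 32 of the 64-bit product hold floor(x / 3) for every 32-bit x. */
uint32_t div3_fixed_point(uint32_t x) {
    return (uint32_t)(((uint64_t)x * UINT32_C(2863311531)) >> 33);
}
```

For x = 15 both functions return 5. For x = 16 the fixed-point version still returns 5, while the modular version returns 0xaaaaaab0.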

Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.

unsigned div3(unsigned x) {
    __builtin_assume(x % 3 == 0);
    return x / 3;
}


(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)

Recommended answer

  1. Can't we multiply rax with edi directly?

We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.

(If you can't avoid the zero-extension, doing it into a different register is better because of limitations on mov-elimination.)

Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
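
As a rough C model of the point about garbage in the high bits (a sketch only; both function names are hypothetical, and the uint64_t parameter of the second one just stands in for a full RDI whose upper half the caller did not clear):

```c
#include <stdint.h>

/* Mirrors the clang output instruction by instruction. */
uint32_t div3_like_clang(uint32_t x) {
    uint64_t tmp = x;               /* mov ecx, edi: zero-extends into RCX */
    uint64_t result = 2863311531u;  /* mov eax, imm32: zero-extends into RAX */
    result *= tmp;                  /* imul rax, rcx */
    result >>= 33;                  /* shr rax, 33 */
    return (uint32_t)result;
}

/* What "imul rax, rdi" would compute if the upper 32 bits of RDI
   held leftover garbage instead of zeros. */
uint32_t div3_without_zero_extend(uint64_t rdi_with_garbage) {
    return (uint32_t)((rdi_with_garbage * 2863311531ULL) >> 33);
}
```

With the low half equal to 15 and a nonzero upper half such as 0xDEADBEEF, the second function no longer returns 5, because the garbage bits leak into the product bits that the shift keeps.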

  2. Why do we multiply in 64-bit mode?

(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)

You could mul edi to multiply EAX with EDI and get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, whereas most modern x86-64 CPUs have a fast 64-bit imul. (imul r64, r64 is slower on AMD Bulldozer-family and on some low-power CPUs, though.) See https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF). Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps that has something to do with not needing extra splitting on the output of the integer multiply unit: mul edi has to split the 64-bit low-half multiplier output into EDX and EAX halves, while that split happens naturally for a 64x64 => 128-bit mul.

Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
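
In C terms, the widening 32-bit mul route looks roughly like the following sketch (div3_mul32 is a hypothetical name; the 64-bit variable stands in for the EDX:EAX register pair):

```c
#include <stdint.h>

uint32_t div3_mul32(uint32_t x) {
    uint64_t product = (uint64_t)x * 0xAAAAAAABu;  /* mul: EDX:EAX = EAX * operand */
    uint32_t high = (uint32_t)(product >> 32);     /* the half that lands in EDX */
    return high >> 1;                              /* remaining 1 of the 33 shift bits */
}
```

Taking the high half accounts for 32 of the 33 bits of shift, so only a shift by 1 remains afterwards.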

GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/

# gcc8.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                 # 1 uop, stupid wasted instruction
        mov     edx, -1431655765         # 1 uop  (same 32-bit constant, just printed differently)
        mul     edx                      # 3 uops on Sandybridge-family
        mov     eax, edx                 # 1 uop
        shr     eax                      # 1 uop
        ret
                                  # total of 7 uops on SnB-family

It would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but that's still worse than:

# gcc9.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                # 1 uop
        mov     edi, 2863311531         # 1 uop
        imul    rax, rdi                # 1 uop
        shr     rax, 33                 # 1 uop
        ret
                      # total 4 uops, not counting ret

Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.
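
The encoding limitation can be modelled in C as below. This is a sketch with hypothetical helper names, and it assumes the usual two's-complement wraparound when converting an out-of-range uint32_t to int32_t (implementation-defined before C23, but what mainstream compilers do).

```c
#include <stdint.h>

/* The 64-bit value an instruction actually uses for a 32-bit immediate:
   the immediate is sign-extended, never zero-extended. */
uint64_t sign_extend_imm32(uint32_t imm) {
    return (uint64_t)(int64_t)(int32_t)imm;
}

/* Nonzero if a 64-bit constant survives the round trip through imm32. */
int fits_in_imm32(uint64_t value) {
    return value == sign_extend_imm32((uint32_t)value);
}
```

Here fits_in_imm32(0xAAAAAAAB) is 0 because bit 31 is set, which is why the constant has to be loaded into a register with a separate mov.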

  3. Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.

It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.

Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.

Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
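
A small C sketch of that claim: treating the same 32-bit patterns as signed or unsigned changes only the high half of the full product. The four helper names are hypothetical, and the uint32_t to int32_t conversions assume two's-complement wraparound.

```c
#include <stdint.h>

uint32_t low_half_unsigned(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b);
}

uint32_t high_half_unsigned(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

uint32_t low_half_signed(uint32_t a, uint32_t b) {
    /* The product of two int32_t values always fits in int64_t. */
    int64_t product = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)(uint64_t)product;
}

uint32_t high_half_signed(uint32_t a, uint32_t b) {
    int64_t product = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)((uint64_t)product >> 32);
}
```

For a = 0xAAAAAAAB and b = 16, both low halves are 0xaaaaaab0, while the high halves are 10 (unsigned) and 0xFFFFFFFA (signed).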

  4. What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):

See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
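
To make the need for the extra bit concrete for division by 3, here is a sketch that tries the two 32-bit constants nearest to 2^32 / 3 with a plain 32-bit shift. The helper name mulhi32 is hypothetical.

```c
#include <stdint.h>

/* Multiply by m and keep only the high 32 bits, i.e. shift right by exactly 32. */
uint32_t mulhi32(uint32_t x, uint32_t m) {
    return (uint32_t)(((uint64_t)x * m) >> 32);
}
```

With the constant rounded down, mulhi32(3, 1431655765) is 0 instead of 1. With the constant rounded up, mulhi32(2147483648, 1431655766) is 715827883 instead of 715827882. Shifting by 33 lets the constant carry one more bit of precision (2863311531 is the rounded-up value of 2^33 / 3), which is enough to be exact over the whole unsigned range.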

For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed

// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}

Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
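
One way to check a claim like this is a brute-force comparison against the real division. The sketch below repeats the approximation under the name div10_approx and adds a hypothetical search helper; it is meant for non-negative inputs only.

```c
#include <stdint.h>

/* Same inexact approximation as above, renamed for this example. */
int32_t div10_approx(int32_t dividend) {
    int64_t invDivisor = 0x1999999A;
    return (int32_t)((invDivisor * dividend) >> 32);
}

/* Returns the first x in [lo, hi] where the approximation disagrees
   with x / 10, or -1 if there is none. Assumes 0 <= lo <= hi. */
int64_t first_div10_mismatch(int32_t lo, int32_t hi) {
    for (int64_t x = lo; x <= hi; ++x) {
        if (div10_approx((int32_t)x) != (int32_t)x / 10) {
            return x;
        }
    }
    return -1;
}
```

Scanning a small window such as first_div10_mismatch(1073741800, 1073741900) returns 1073741829. A full scan from 0 takes about a billion iterations.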

From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.

As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.

unsigned div3_exact_only(unsigned x) {
    __builtin_assume(x % 3 == 0);  // or an equivalent with if() __builtin_unreachable()
    return x / 3;
}

Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.

div3_exact_only:
    imul  eax, edi, 0xAAAAAAAB        # 1 uop, 3c latency
    ret

However, it is something you can do yourself in source code, at least for known type widths like uint32_t:

uint32_t div3_exact_only(uint32_t x) {
    return x * 0xaaaaaaabU;
}
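
The trick works because 3 * 0xaaaaaaab = 0x200000001, which is 1 mod 2^32, so multiplying 3 * q by the constant gives back q in the low 32 bits. A sampled sanity check could look like this sketch (count_exact_mismatches is a hypothetical test helper, and step must be nonzero):

```c
#include <stdint.h>

uint32_t div3_exact_only(uint32_t x) {
    return x * 0xaaaaaaabU;
}

/* Checks every step-th multiple of 3 in the uint32_t range and
   counts how often the multiply disagrees with the true quotient. */
uint32_t count_exact_mismatches(uint32_t step) {
    uint32_t mismatches = 0;
    for (uint64_t q = 0; q <= UINT32_MAX / 3; q += step) {
        uint32_t x = (uint32_t)(q * 3);
        if (div3_exact_only(x) != q) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling count_exact_mismatches(1) checks every multiple of 3, and a larger step gives a quicker spot check.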
