Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?




AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies^1, but nothing with 64-bit sources.

Let's say I need an unsigned multiply with inputs larger than 32 bits, but less than or equal to 52 bits - can I simply use the floating-point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2^52-1])?

How about the more general case where I want all 104 bits of the product? Or the case where the integer product takes more than 52 bits (i.e., the product has non-zero values at bit indexes >= 52) - but I want only the low 52 bits? In this latter case, the MUL is going to give me the higher bits and round away some of the lower bits (perhaps that's what IFMA helps with?).

Edit: in fact, perhaps it could do anything up to 2^53, based on this answer - I had forgotten that the implied leading 1 before the mantissa effectively gives you another bit.


^1 Interestingly, the 64-bit product PMULDQ operation has half the latency and twice the throughput of the 32-bit PMULLD version, as Mysticial explains in the comments.

Solution

Yes it's possible. But as of AVX2, it's unlikely to be better than the scalar approaches with MULX/ADCX/ADOX.

There's virtually an unlimited number of variations of this approach for different input/output domains. I'll only cover 3 of them, but they are easy to generalize once you know how they work.

Disclaimers:

  • All solutions here assume the rounding mode is round-to-even.
  • Use of fast-math optimization flags is not recommended, as these solutions rely on strict IEEE semantics.

Signed doubles in the range: [-2^51, 2^51]

//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256d& L, __m256d& H, __m256d A, __m256d B){
    const __m256d ROUND = _mm256_set1_pd(30423614405477505635920876929024.);    //  3 * 2^103
    const __m256d SCALE = _mm256_set1_pd(1. / 4503599627370496);                //  1 / 2^52

    //  Multiply and add normalization constant. This forces the multiply
    //  to be rounded to the correct number of bits.
    H = _mm256_fmadd_pd(A, B, ROUND);

    //  Undo the normalization.
    H = _mm256_sub_pd(H, ROUND);

    //  Recover the bottom half of the product.
    L = _mm256_fmsub_pd(A, B, H);

    //  Correct the scaling of H.
    H = _mm256_mul_pd(H, SCALE);
}

This is the simplest one and the only one which is competitive with the scalar approaches. The final scaling is optional depending on what you want to do with the outputs. So this can be considered only 3 instructions. But it's also the least useful since both the inputs and outputs are floating-point values.

It is absolutely critical that both the FMAs stay fused. And this is where fast-math optimizations can break things. If the first FMA is broken up, then L is no longer guaranteed to be in the range [-2^51, 2^51]. If the second FMA is broken up, L will be completely wrong.


Signed integers in the range: [-2^51, 2^51]

//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(6755399441055744);     //  3*2^51
    const __m256d CONVERT_D = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double
    A = _mm256_add_epi64(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_add_epi64(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_sub_epi64(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_D);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_D));
}

Building off of the first example, we combine it with a generalized version of the fast double <-> int64 conversion trick.

This one is more useful since you're working with integers. But even with the fast conversion trick, most of the time will be spent doing conversions. Fortunately, you can eliminate some of the input conversions if you are multiplying by the same operand multiple times.


Unsigned integers in the range: [0, 2^52)

//  A*B = L + H*2^52
//  Input:  A and B are in the range [0, 2^52)
//  Output: L and H are in the range [0, 2^52)
void mul52_unsigned(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(4503599627370496);     //  2^52
    const __m256d CONVERT_D = _mm256_set1_pd(1);
    const __m256d CONVERT_S = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double
    A = _mm256_or_si256(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_or_si256(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_xor_si256(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_S);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_S));

    //  Make Correction
    H = _mm256_sub_epi64(H, _mm256_srli_epi64(L, 63));
    L = _mm256_and_si256(L, _mm256_set1_epi64x(0x000fffffffffffff));
}

Finally we get the answer to the original question. This builds off of the signed integer solution by adjusting the conversions and adding a correction step.

But at this point, we're at 13 instructions - half of which are high-latency instructions, not counting the numerous FP <-> int bypass delays. So it's unlikely this will be winning any benchmarks. By comparison, a 64 x 64 -> 128-bit SIMD multiply can be done in 16 instructions (14 if you pre-process the inputs).

The correction step can be omitted if the rounding mode is round-down or round-to-zero. The only instruction where this matters is h = _mm256_fmadd_pd(a, b, CONVERT_U);. So on AVX512, you can override the rounding for that instruction and leave the rounding mode alone.


Final Thoughts:

It's worth noting that the 2^52 range of operation can be reduced by adjusting the magic constants. This may be useful for the first solution (the floating-point one) since it gives you extra mantissa bits to use for accumulation in floating-point. This lets you bypass the need to constantly convert back and forth between int64 and double as in the last 2 solutions.

While the 3 examples here are unlikely to be better than scalar methods, AVX512 will almost certainly tip the balance. Knights Landing in particular has poor throughput for ADCX and ADOX.

And of course all of this is moot when AVX512-IFMA comes out. That reduces a full 52 x 52 -> 104-bit product to 2 instructions and gives the accumulation for free.

