Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

Question

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies[1], but nothing with 64-bit sources.

Let's say I need an unsigned multiply with inputs larger than 32 bits but less than or equal to 52 bits - can I simply use the floating-point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2^52-1])?

How about the more general case where I want all 104 bits of the product? Or the case where the integer product takes more than 52 bits (i.e., the product has non-zero values in bit indexes > 52) - but I want only the low 52 bits? In this latter case, the MUL is going to give me the higher bits and round away some of the lower bits (perhaps that's what IFMA helps with?).

In fact, perhaps it could do anything up to 2^53, based on this answer - I had forgotten that the implied leading 1 before the mantissa effectively gives you another bit.
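
To make that concrete, a quick check (a minimal sketch) that a double's 52 stored mantissa bits plus the implicit leading 1 cover every integer up to 2^53 exactly:

#include <cstdio>

int main(){
    double x = 9007199254740992.0;            //  2^53, exactly representable
    std::printf("%.0f %.0f\n", x - 1, x);     //  2^53 - 1 is exact as well
    //  2^53 + 1 is the first integer a double cannot represent.
}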

[1] Interestingly, the 64-bit product PMULDQ operation has half the latency and twice the throughput of the 32-bit PMULLD version, as Mysticial explains in the comments.

Answer

Yes it's possible. But as of AVX2, it's unlikely to be better than the scalar approaches with MULX/ADCX/ADOX.
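
For reference, the 64 x 64 -> 128-bit building block of those scalar approaches looks roughly like this sketch (using the BMI2 _mulx_u64 intrinsic; ADCX/ADOX only come into play when chaining carries across multiple partial products):

#include <immintrin.h>
#include <stdint.h>

//  One 64 x 64 -> 128-bit partial product. With BMI2 enabled the compiler can
//  emit MULX, which does not clobber the flags used by an ADCX/ADOX carry chain.
static inline uint64_t mul64x64(uint64_t a, uint64_t b, uint64_t *hi){
    unsigned long long h;
    uint64_t lo = _mulx_u64(a, b, &h);
    *hi = h;
    return lo;
}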

There's virtually an unlimited number of variations of this approach for different input/output domains. I'll only cover 3 of them, but they are easy to generalize once you know how they work.

Disclaimers:

  • All solutions here assume the rounding mode is round-to-even.
  • Fast-math optimization flags are not recommended, as these solutions rely on strict IEEE semantics.

Signed doubles in the range: [-2^51, 2^51]

//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256d& L, __m256d& H, __m256d A, __m256d B){
    const __m256d ROUND = _mm256_set1_pd(30423614405477505635920876929024.);    //  3 * 2^103
    const __m256d SCALE = _mm256_set1_pd(1. / 4503599627370496);                //  1 / 2^52

    //  Multiply and add normalization constant. This forces the multiply
    //  to be rounded to the correct number of bits.
    H = _mm256_fmadd_pd(A, B, ROUND);

    //  Undo the normalization.
    H = _mm256_sub_pd(H, ROUND);

    //  Recover the bottom half of the product.
    L = _mm256_fmsub_pd(A, B, H);

    //  Correct the scaling of H.
    H = _mm256_mul_pd(H, SCALE);
}

This is the simplest one and the only one which is competitive with the scalar approaches. The final scaling is optional, depending on what you want to do with the outputs. So this can be considered only 3 instructions. But it's also the least useful since both the inputs and outputs are floating-point values.

It is absolutely critical that both FMAs stay fused, and this is where fast-math optimizations can break things. If the first FMA is broken up, L is no longer guaranteed to be in the range [-2^51, 2^51]. If the second FMA is broken up, L will be completely wrong.
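
As a quick sanity check (a minimal sketch with arbitrary test values, assuming the floating-point mul52_signed above is in scope and the code is built with FMA enabled, e.g. -mfma), the identity A*B = L + H*2^52 can be verified elementwise:

#include <immintrin.h>
#include <cstdio>

int main(){
    //  Integer-valued inputs whose product needs more than 52 bits.
    long long a = 1234567890, b = 987654321;
    __m256d L, H;
    mul52_signed(L, H, _mm256_set1_pd((double)a), _mm256_set1_pd((double)b));

    double l[4], h[4];
    _mm256_storeu_pd(l, L);
    _mm256_storeu_pd(h, H);

    long long ref = a * b;
    long long got = (long long)l[0] + ((long long)h[0] << 52);
    std::printf("%lld %s %lld\n", got, got == ref ? "==" : "!=", ref);
}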

Signed integers in the range: [-2^51, 2^51]

//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(6755399441055744);     //  3*2^51
    const __m256d CONVERT_D = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double
    A = _mm256_add_epi64(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_add_epi64(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_sub_epi64(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_D);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_D));
}

Building off of the first example, we combine it with a generalized version of the fast double <-> int64 conversion trick.
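
In isolation, that conversion trick looks something like the following sketch (helper names are mine; it is valid for the same [-2^51, 2^51] range and round-to-even assumption as the code above):

#include <immintrin.h>

//  int64 -> double: adding the bit pattern of 3*2^51 places the integer in
//  the mantissa of a double, then subtracting 3*2^51 as a double recovers it.
__m256d int64_to_double(__m256i x){
    const __m256d MAGIC = _mm256_set1_pd(6755399441055744.);    //  3*2^51
    x = _mm256_add_epi64(x, _mm256_castpd_si256(MAGIC));
    return _mm256_sub_pd(_mm256_castsi256_pd(x), MAGIC);
}

//  double -> int64: adding 3*2^51 forces the value into a fixed binade where
//  the mantissa bits are the (rounded-to-nearest) integer.
__m256i double_to_int64(__m256d x){
    const __m256d MAGIC = _mm256_set1_pd(6755399441055744.);    //  3*2^51
    x = _mm256_add_pd(x, MAGIC);
    return _mm256_sub_epi64(_mm256_castpd_si256(x), _mm256_castpd_si256(MAGIC));
}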

This one is more useful since you're working with integers. But even with the fast conversion trick, most of the time will be spent doing conversions. Fortunately, you can eliminate some of the input conversions if you are multiplying by the same operand multiple times.

Unsigned integers in the range: [0, 2^52)

//  A*B = L + H*2^52
//  Input:  A and B are in the range [0, 2^52)
//  Output: L and H are in the range [0, 2^52)
void mul52_unsigned(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(4503599627370496);     //  2^52
    const __m256d CONVERT_D = _mm256_set1_pd(1);
    const __m256d CONVERT_S = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double
    A = _mm256_or_si256(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_or_si256(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_xor_si256(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_S);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_S));

    //  Make Correction
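    //  Round-to-nearest in the "get top half" fmadd may have rounded H up
    //  by one, leaving a negative value in L. Borrow that unit back from H
    //  and wrap L back into [0, 2^52).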
    H = _mm256_sub_epi64(H, _mm256_srli_epi64(L, 63));
    L = _mm256_and_si256(L, _mm256_set1_epi64x(0x000fffffffffffff));
}

Finally we get the answer to the original question. This builds off of the signed integer solution by adjusting the conversions and adding a correction step.
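
A usage sketch (test values are mine; the reference check assumes a compiler with unsigned __int128, such as GCC or Clang, and that mul52_unsigned above is in scope, built with AVX2 and FMA enabled):

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main(){
    //  Two 52-bit operands whose product needs roughly 104 bits.
    uint64_t a = 0x000ff123456789abull;
    uint64_t b = 0x000fedcba9876543ull;

    __m256i L, H;
    mul52_unsigned(L, H, _mm256_set1_epi64x((long long)a),
                         _mm256_set1_epi64x((long long)b));

    uint64_t l[4], h[4];
    _mm256_storeu_si256((__m256i*)l, L);
    _mm256_storeu_si256((__m256i*)h, H);

    unsigned __int128 ref = (unsigned __int128)a * b;
    unsigned __int128 got = ((unsigned __int128)h[0] << 52) | l[0];
    std::printf("%s\n", got == ref ? "exact" : "MISMATCH");
}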

But at this point, we're at 13 instructions - half of which are high-latency instructions, not counting the numerous FP <-> int bypass delays. So it's unlikely this will be winning any benchmarks. By comparison, a 64 x 64 -> 128-bit SIMD multiply can be done in 16 instructions (14 if you pre-process the inputs).

The correction step can be omitted if the rounding mode is round-down or round-to-zero. The only instruction where this matters is h = _mm256_fmadd_pd(a, b, CONVERT_U);. So on AVX512, you can override the rounding on just that instruction and leave the global rounding mode alone.
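
With 512-bit vectors that override looks something like this sketch (AVX512F's embedded rounding via _mm512_fmadd_round_pd; the rest of the routine would be the 512-bit analogue of the function above):

#include <immintrin.h>

//  The "get top half" step with per-instruction round-toward-zero, so the
//  global MXCSR rounding mode stays untouched and the final correction step
//  becomes unnecessary.
__m512d mul52_top_half(__m512d a, __m512d b, __m512d convert_u){
    return _mm512_fmadd_round_pd(a, b, convert_u,
                                 _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}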

Final Thoughts:

It's worth noting that the 2^52 range of operation can be reduced by adjusting the magic constants. This may be useful for the first solution (the floating-point one) since it gives you extra mantissa to use for accumulation in floating-point. This lets you bypass the need to constantly convert back and forth between int64 and double like in the last 2 solutions.

While the 3 examples here are unlikely to be better than scalar methods, AVX512 will almost certainly tip the balance. Knights Landing in particular has poor throughput for ADCX and ADOX.

And of course all of this is moot when AVX512-IFMA comes out. That reduces a full 52 x 52 -> 104-bit product to 2 instructions and gives the accumulation for free.
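
For reference, a sketch of what that looks like with the AVX512-IFMA intrinsics (_mm512_madd52lo_epu64 / _mm512_madd52hi_epu64, i.e. VPMADD52LUQ/VPMADD52HUQ; the first argument is an accumulator, so passing a running sum instead of zero gives the accumulation for free):

#include <immintrin.h>

//  A*B = L + H*2^52 for 52-bit unsigned A and B, in two instructions.
//  Requires AVX512IFMA (plus AVX512F).
void mul52_ifma(__m512i &L, __m512i &H, __m512i A, __m512i B){
    const __m512i zero = _mm512_setzero_si512();
    L = _mm512_madd52lo_epu64(zero, A, B);    //  low  52 bits of the product
    H = _mm512_madd52hi_epu64(zero, A, B);    //  high 52 bits of the product
}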
