Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?
Question
AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies¹, but nothing with 64-bit sources.
Let's say I need an unsigned multiply with inputs larger than 32-bit, but less than or equal to 52 bits - can I simply use the floating point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2^52-1])?
How about the more general case where I want all 104 bits of the product? Or the case where the integer product takes more than 52 bits (i.e., the product has non-zero values in bit indexes > 52) - but I want only the low 52 bits? In this latter case, the MUL is going to give me higher bits and round away some of the lower bits (perhaps that's what IFMA helps with?).

Edit: in fact, perhaps it could do anything up to 2^53, based on this answer - I had forgotten that the implied leading 1 before the mantissa effectively gives you another bit.
¹ Interestingly, the 64-bit product PMULDQ operation has half the latency and twice the throughput of the 32-bit PMULLD version, as Mysticial explains in the comments.

Solution

Yes it's possible. But as of AVX2, it's unlikely to be better than the scalar approaches with MULX/ADCX/ADOX.
There's virtually an unlimited number of variations of this approach for different input/output domains. I'll only cover 3 of them, but they are easy to generalize once you know how they work.
Disclaimers:
- All solutions here assume the rounding mode is round-to-even.
- Use of fast-math optimization flags is not recommended as these solutions rely on strict IEEE.
Signed doubles in the range: [-2^51, 2^51]
```cpp
//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256d& L, __m256d& H, __m256d A, __m256d B){
    const __m256d ROUND = _mm256_set1_pd(30423614405477505635920876929024.);  //  3 * 2^103
    const __m256d SCALE = _mm256_set1_pd(1. / 4503599627370496);              //  1 / 2^52

    //  Multiply and add normalization constant. This forces the multiply
    //  to be rounded to the correct number of bits.
    H = _mm256_fmadd_pd(A, B, ROUND);

    //  Undo the normalization.
    H = _mm256_sub_pd(H, ROUND);

    //  Recover the bottom half of the product.
    L = _mm256_fmsub_pd(A, B, H);

    //  Correct the scaling of H.
    H = _mm256_mul_pd(H, SCALE);
}
```
This is the simplest one and the only one which is competitive with the scalar approaches. The final scaling is optional depending on what you want to do with the outputs. So this can be considered only 3 instructions. But it's also the least useful since both the inputs and outputs are floating-point values.
It is absolutely critical that both the FMAs stay fused. And this is where fast-math optimizations can break things. If the first FMA is broken up, then L is no longer guaranteed to be in the range [-2^51, 2^51]. If the second FMA is broken up, L will be completely wrong.
Signed integers in the range: [-2^51, 2^51]
```cpp
//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(6755399441055744);  //  3 * 2^51
    const __m256d CONVERT_D = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double
    A = _mm256_add_epi64(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_add_epi64(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_sub_epi64(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_D);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_D));
}
```
Building off of the first example, we combine it with a generalized version of the fast double <-> int64 conversion trick.

This one is more useful since you're working with integers. But even with the fast conversion trick, most of the time will be spent doing conversions. Fortunately, you can eliminate some of the input conversions if you are multiplying by the same operand multiple times.
Unsigned integers in the range: [0, 2^52)
```cpp
//  A*B = L + H*2^52
//  Input:  A and B are in the range [0, 2^52)
//  Output: L and H are in the range [0, 2^52)
void mul52_unsigned(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(4503599627370496);  //  2^52
    const __m256d CONVERT_D = _mm256_set1_pd(1);
    const __m256d CONVERT_S = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double
    A = _mm256_or_si256(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_or_si256(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_xor_si256(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_S);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_S));

    //  Make Correction
    H = _mm256_sub_epi64(H, _mm256_srli_epi64(L, 63));
    L = _mm256_and_si256(L, _mm256_set1_epi64x(0x000fffffffffffff));
}
```
Finally we get the answer to the original question. This builds off of the signed integer solution by adjusting the conversions and adding a correction step.
But at this point, we're at 13 instructions - half of which are high-latency instructions, not counting the numerous FP <-> int bypass delays. So it's unlikely this will be winning any benchmarks. By comparison, a 64 x 64 -> 128-bit SIMD multiply can be done in 16 instructions (14 if you pre-process the inputs).

The correction step can be omitted if the rounding mode is round-down or round-to-zero. The only instruction where this matters is h = _mm256_fmadd_pd(a, b, CONVERT_U);. So on AVX512, you can override the rounding for that instruction and leave the rounding mode alone.
Final Thoughts:
It's worth noting that the 2^52 range of operation can be reduced by adjusting the magic constants. This may be useful for the first solution (the floating-point one) since it gives you extra mantissa bits to use for accumulation in floating-point. This lets you bypass the need to constantly convert back-and-forth between int64 and double like in the last 2 solutions.
While the 3 examples here are unlikely to be better than scalar methods, AVX512 will almost certainly tip the balance. Knights Landing in particular has poor throughput for ADCX and ADOX.
And of course all of this is moot when AVX512-IFMA comes out. That reduces a full 52 x 52 -> 104-bit product to 2 instructions and gives the accumulation for free.