获取完整整数乘法的高半部分和低半部分 [英] Getting the high half and low half of a full integer multiply

查看:53
本文介绍了获取完整整数乘法的高半部分和低半部分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我从三个值 A、B、C(无符号 32 位整数)开始.而且我必须获得两个值 D、E(也是无符号 32 位整数).哪里

I start with three values A,B,C (unsigned 32bit integer). And i have to obtain two values D,E (unsigned 32 bit integer also). Where

D = high(A*C);
E = low(A*C) + high(B*C);

我希望两个 32 位 uint 相乘会产生 64 位结果.高"和低"只是我对 64 位乘法结果中前 32 位和后 32 位标记的约定.

I expect that multiply of two 32bit uint produce 64bit result. "high" and "low" is just my covnention for mark the first 32 bits and the last 32 bits in 64bit result of multiply.

我尝试获得一些已经可用的优化代码.我在巨大的循环中有一小部分代码,只有几个命令行,但是它消耗了几乎所有的计算时间(几个小时计算的物理模拟).这就是我尝试优化这一小部分的原因,其余代码可以保持更用户良好安排".

有一些 SSE 指令适合计算提到的例程.gcc 编译器可能会做优化工作.但是,如果有必要,我不拒绝直接在 SSE 说明中编写一些代码的选项.

There is some SSE instructions that are fit for compute mentioned routine. The gcc compiler probably do optimized work. However i do not reject an option to write some piece of code in SSE intructions directly, if it will be necessary.

请耐心等待我对 SSE 的低经验.我将尝试仅象征性地为 SSE 编写一个算法.订购掩码或理解结构可能会出现一些错误.

Be patient with my low experience with SSE please. I will try to write an algorithm for SSE just symbolically. There will be probably some mistakes with ordering masks or understanding the structure.

  1. 将四个 32 位整数按顺序存储到一个 128 位寄存器中:A、B、C、C.
  2. 将指令(可能是 pmuludq)应用到提到的 128 位寄存器中,该寄存器将 32 位整数对相乘并作为结果返回 64 位整数对.所以它应该同时计算A*C的乘法和B*C的乘法,并返回两个64位值.
  3. 我希望我有新的 128 位寄存器值 P、Q、R、S(四个 32 位块),其中 P、Q 是 A*C 和 R,S 的 64 位结果是 B*C 的 64 位结果.然后我继续将寄存器中的值重新排列为 P,Q,0,R
  4. 取前 64 位 P、Q 并添加第二个 64 位 0、R.结果是一个新的 64 位值.
  5. 将结果的前 32 位读取为 D,将结果的后 32 位读取为 E.
  1. Store four 32-bit integers into one 128-bit register in order: A,B,C,C.
  2. Apply instruction (probably pmuludq) into mentioned 128-bit register which multiply pairs of 32-bit integeres and return pairs of 64-bit integers as result. So it shoudld calculate multiply of A*C and multiply of B*C simultaneously and return two 64-bit values.
  3. I expect that i have new 128bit register values P,Q,R,S (four 32-bit blocs) where P,Q is 64-bit result of A*C and R,S is 64-bit result of B*C. Then i continue with rearrange values at register into order P,Q,0,R
  4. Take first 64 bits P,Q and add second 64 bits 0,R. The result is a new 64 bits value.
  5. Read first 32 bits of the result as D and last 32 bits of the result as E.

该算法应该返回正确的 E 和 D 值.

This algorithm should return correct values for E and D.

我的问题:

c++ 中是否有静态代码生成与 1-5 SSE 算法类似的 SSE 例程?我更喜欢具有更高性能的解决方案.如果算法对于标准 c++ 命令有问题,有没有办法在 SSE 中编写算法?

Is there a static code in c++ which generate similar SSE routine as mentioned 1-5 SSE algorithm? I preffer solutions with higher performance. If the algorithm is problematic for standart c++ commands, is there a way how to write an algorithm in SSE?

我使用 TDM-GCC 4.9.2 64 位编译器.

I use TDM-GCC 4.9.2 64-bit compiler.

(注意:问题在建议后被修改)

(note: Question was modified after advice)

(注2:我在这个http://sci.tuomastonteri.fi/programming/sse 中有一个灵感 使用 SSE 以获得更好的性能)

(note2: I have an inspiration in this http://sci.tuomastonteri.fi/programming/sse for using SSE for obtain better performance)

推荐答案

除非您有多个输入要并行处理,否则您不需要为此使用向量.clang 和 gcc 已经很好地优化了编写代码的正常"方式:转换为两倍大小,相乘,然后移动以获得高一半.编译器识别这种模式.

You don't need vectors for this unless you have multiple inputs to process in parallel. clang and gcc already do a good job of optimizing the "normal" way to write your code: cast to twice the size, multiply, then shift to get the high half. Compilers recognize this pattern.

他们注意到操作数以 32 位开始,因此在转换为 64b 后,上半部分都为零.因此,他们可以使用 x86 的 mul insn 进行 32b*32b->64b 乘法,而不是进行完整的扩展精度 64b 乘法.在 64 位模式下,它们对您的代码的 __uint128_t 版本执行相同的操作.

They notice that the operands started out as 32bit, so the upper halves are all zero after casting to 64b. Thus, they can use x86's mul insn to do a 32b*32b->64b multiply, instead of doing a full extended-precision 64b multiply. In 64bit mode, they do the same thing with a __uint128_t version of your code.

这两个函数都编译成相当好的代码(一个 mulimul每乘)..gcc -m32 不支持 128b 类型,但我不会深入讨论,因为 1. 您只询问了 32 位值的完整乘法,以及 2. 您应该始终在需要时使用 64 位代码跑得快的东西.如果您正在执行结果不适合寄存器的全乘法,clang 将避免大量额外的 mov 指令,因为 gcc 对此很傻.这个小测试函数为 gcc 错误报告做了一个很好的测试用例.

Both of these functions compile to fairly good code (one mul or imul per multiply).. gcc -m32 doesn't support 128b types, but I won't get into that because 1. you only asked about full multiplies of 32bit values, and 2. you should always use 64bit code when you want something to run fast. If you are doing full-multiplies where the result doesn't fit in a register, clang will avoid a lot of extra mov instructions, because gcc is silly about this. This little test function made a good test-case for that gcc bug report.

godbolt 链接包含一个函数,该函数在循环中调用 this,将结果存储在数组中.它通过一堆改组自动矢量化,但如果您有多个输入要并行处理,它仍然看起来像加速.不同的输出格式在乘法后可能需要较少的改组,例如可能为 DE 存储单独的数组.

That godbolt link includes a function that calls this in a loop, storing the result in an array. It auto-vectorizes with a bunch of shuffling, but still looks like a speedup if you have multiple inputs to process in parallel. A different output format might take less shuffling after the multiply, like maybe storing separate arrays for D and E.

我包含 128b 版本以表明编译器可以处理这个问题,即使它不是微不足道的(例如,只需执行 64 位 imul 指令在 32 位上执行 64*64->64b 乘法输入,在将函数入口时可能位于输入寄存器中的任何高位归零之后.)

I'm including the 128b version to show that compilers can handle this even when it's not trivial (e.g. just do a 64bit imul instruction to do a 64*64->64b multiply on the 32bit inputs, after zeroing any upper bits that might be sitting in the input registers on function entry.)

针对 Haswell CPU 和更新版本时,gcc 和 clang 可以使用 mulx BMI2 指令.(我在 Godbolt 链接中使用了 -mno-bmi2 -mno-avx2 以保持 asm 更简单.如果你一个 Haswell CPU,只需使用 -O3-march=haswell.) mulx dest1, dest2, src1dest1:dest2 = rdx * src1mul src1 做 <代码>rdx:rax = rax * src1.所以 mulx 有两个只读输入(一个隐式:edx/rdx)和两个只写输出.这让编译器可以使用更少的 mov 指令进行全乘法,以将数据传入和传出 mul 的隐式寄存器.这只是一个小的加速,尤其是.由于 64 位 mulx 在 Haswell 上有 4 个周期的延迟而不是 3 个.(奇怪的是,64bit mulmulx 比 32bit mulmulx.)

When targeting Haswell CPUs and newer, gcc and clang can use the mulx BMI2 instruction. (I used -mno-bmi2 -mno-avx2 in the godbolt link to keep the asm simpler. If you do have a Haswell CPU, just use -O3 -march=haswell.) mulx dest1, dest2, src1 does dest1:dest2 = rdx * src1 while mul src1 does rdx:rax = rax * src1. So mulx has two read-only inputs (one implicit: edx/rdx), and two write-only outputs. This lets compilers do full-multiplies with fewer mov instructions to get data into and out of the implicit registers for mul. This is only a small speedup, esp. since 64bit mulx has 4 cycle latency instead of 3, on Haswell. (Strangely, 64bit mul and mulx are slightly cheaper than 32bit mul and mulx.)

// compiles to good code: you can and should do this sort of thing:
#include <stdint.h>

struct DE { uint32_t D,E; };

struct DE f_structret(uint32_t A, uint32_t B, uint32_t C) {
  uint64_t AC = A * (uint64_t)C;
  uint64_t BC = B * (uint64_t)C;
  uint32_t D = AC >> 32;         // high half
  uint32_t E = AC + (BC >> 32);  // We could cast to uint32_t before adding, but don't need to
  struct DE retval = { D, E };
  return retval;
}



#ifdef __SIZEOF_INT128__  // IDK the "correct" way to detect __int128_t support
struct DE64 { uint64_t D,E; };

struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
  __uint128_t AC = A * (__uint128_t)C;
  __uint128_t BC = B * (__uint128_t)C;
  uint64_t D = AC >> 64;         // high half
  uint64_t E = AC + (BC >> 64);
  struct DE64 retval = { D, E };
  return retval;
}
#endif

这篇关于获取完整整数乘法的高半部分和低半部分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆