Multi-word addition using the carry flag


Question

GCC has 128-bit integers. Using these I can get the compiler to use the mul (or imul with only one operand) instructions. For example:

uint64_t x,y;
unsigned __int128 z = (unsigned __int128)x*y;

produces mul. I have used this to create a 128x128 to 256 function (see the end of this question, before the update, for the code if you're interested).
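The widening multiply described above can be wrapped in a small helper. This is only a sketch: the name mul64x64 and its out-parameter are hypothetical (not from the question), and it assumes a 64-bit target where GCC or Clang provide unsigned __int128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the question): full 64x64 -> 128-bit product.
   Assumes the GCC/Clang extension `unsigned __int128` on a 64-bit target.
   Returns the low 64 bits and stores the high 64 bits in *hi. */
static uint64_t mul64x64(uint64_t x, uint64_t y, uint64_t *hi) {
    assert(hi != NULL);
    unsigned __int128 z = (unsigned __int128)x * y;  /* widen before multiplying */
    *hi = (uint64_t)(z >> 64);
    return (uint64_t)z;
}
```

Casting one operand before the multiplication is what matters here: without the cast the product would be computed in 64 bits and the high half would be lost.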

Now I want to do 256-bit addition, and I have not found a way to get the compiler to use ADC except by using assembly. I could use an assembler, but I want inline functions for efficiency. The compiler already produces an efficient 128x128 to 256 function (for the reason I explained at the start of this question), so I don't see why I should rewrite that in assembly as well (or any other function which the compiler already implements efficiently).

Here is the inline assembly function I have come up with:

#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
 __asm__ __volatile__ ( \
 "addq %[v1], %[u1] \n" \
 "adcq %[v2], %[u2] \n" \
 "adcq %[v3], %[u3] \n" \
 "adcq %[v4], %[u4] \n" \
 : [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
 : [v1]  "r" (Y1), [v2]  "r" (Y2), [v3]  "r" (Y3), [v4]  "r" (Y4)) 

(Probably not every output needs an early-clobber modifier (see https://stackoverflow.com/questions/26567746/unexpected-gcc-inline-asm-behaviour-clobbered-variable-overwritten), but I get the wrong result without at least the last two.)

And here is a function which does the same thing in C:

void add256(int256 *x, int256 *y) {
    uint64_t t1, t2;
    t1 = x->x1; x->x1 += y->x1;
    t2 = x->x2; x->x2 += y->x2 + ((x->x1) < t1);
    t1 = x->x3; x->x3 += y->x3 + ((x->x2) < t2);
                x->x4 += y->x4 + ((x->x3) < t1);
}

Why is assembly necessary for this? Why can't the compiler compile the add256 function to use the carry flag? Is there a way to coerce the compiler to do this (e.g. can I change add256 so that it does this)? What is someone supposed to do for a compiler which does not support inline assembly (write all the functions in assembly)? Why is there no intrinsic for this?
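One detail of the C add256 above is worth noting: y->x2 + carry can itself wrap to zero (when y->x2 is all ones and the carry is 1), in which case the single comparison misses the carry. A minimal sketch that tests both partial sums is shown here. The u256 type and the name add256_portable are hypothetical, chosen for illustration only, and nothing guarantees that a given compiler turns this into an adc chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration: four 64-bit limbs, least significant first. */
typedef struct { uint64_t limb[4]; } u256;

/* x += y in plain C; returns the carry out of the top limb. */
static unsigned add256_portable(u256 *x, const u256 *y) {
    assert(x != NULL && y != NULL);
    unsigned carry = 0;
    for (size_t i = 0; i < 4; i++) {
        uint64_t s = x->limb[i] + y->limb[i];  /* may wrap around */
        unsigned c1 = s < y->limb[i];          /* 1 if the limb addition wrapped */
        uint64_t r = s + carry;                /* add the incoming carry */
        unsigned c2 = r < s;                   /* 1 if adding the carry wrapped */
        x->limb[i] = r;
        carry = c1 | c2;                       /* both cannot be set at once */
    }
    return carry;
}
```

With x = {UINT64_MAX, 5, 0, 0} and y = {1, UINT64_MAX, 0, 0} this version carries into the third limb, which a single comparison per limb would not detect.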

Here is the 128x128 to 256 function:

void muldwu128(int256 *w, uint128 u, uint128 v) {
   uint128 t;
   uint64_t u0, u1, v0, v1, k, w1, w2, w3;

   u0 = u >> 64L;
   u1 = u;
   v0 = v >> 64L;
   v1 = v;

   t = (uint128)u1*v1;
   w3 = t;
   k = t >> 64L;

   t = (uint128)u0*v1 + k;
   w2 = t;
   w1 = t >> 64L;
   t = (uint128)u1*v0 + w2;
   k = t >> 64L;

   w->hi = (uint128)u0*v0 + w1 + k;
   w->lo = (t << 64L) + w3;

}

Some type definitions:

typedef          __int128  int128;
typedef unsigned __int128 uint128;

typedef union {
    struct {
        uint64_t x1;
        uint64_t x2;
         int64_t x3;
         int64_t x4;
    };
    struct {
        uint128 lo;
         int128 hi;
    };
} int256;

Update:

My question is largely a duplicate of these questions:


  1. get-gcc-to-use-carry-logic-for-arbitrary-precision-arithmetic-without-inline-assembly
  2. efficient-128-bit-addition-using-carry-flag
  3. multiword-addition-in-c.

Intel has a good article, New Instructions Support Large Integer Arithmetic (http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html), which discusses large integer arithmetic and the three new instructions MULX, ADCX, and ADOX. They write:

  intrinsic definitions of mulx, adcx and adox will also be integrated into compilers. This is the first example of an "add with carry" type instruction being implemented with intrinsics. The intrinsic support will enable users to implement large integer arithmetic using higher level programming languages such as C/C++.

The intrinsics are:

unsigned __int64 umul128(unsigned __int64 a, unsigned __int64 b, unsigned __int64 * hi);
unsigned char _addcarry_u64(unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 *out);
unsigned char _addcarryx_u64(unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 *out);

Incidentally, MSVC already has a _umul128 intrinsic (https://msdn.microsoft.com/en-us/library/vstudio/3dayytw9%28v=vs.100%29.aspx). So even though MSVC does not have __int128, the _umul128 intrinsic can be used to generate mul and therefore 128-bit multiplication.

The MULX instruction has been available since BMI2 in Haswell. The ADCX and ADOX instructions are available on Broadwell processors. It's too bad there is no intrinsic for ADC, which has been available since the 8086 in 1979. That would solve the inline assembly problem.

Edit: actually __int128 will use mulx if BMI2 is enabled (e.g. using -mbmi2 or -march=haswell).

Edit:

I tried Clang's add-with-carry builtins, as suggested by Lưu Vĩnh Phúc:

void add256(int256 *x, int256 *y) {
    unsigned long long carryin=0, carryout;
    x->x1 = __builtin_addcll(x->x1, y->x1, carryin, &carryout); carryin = carryout;
    x->x2 = __builtin_addcll(x->x2, y->x2, carryin, &carryout); carryin = carryout;
    x->x3 = __builtin_addcll(x->x3, y->x3, carryin, &carryout); carryin = carryout;
    x->x4 = __builtin_addcll(x->x4, y->x4, carryin, &carryout);  
}

but this does not generate ADC, and the output is more complicated than I expected.
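For comparison, GCC (5 and later) and Clang also provide the generic __builtin_add_overflow, which can express the same carry chain. The sketch below is not from the original post: the u256 type and the name add256_overflow are hypothetical, and whether the result compiles to a clean add/adc sequence depends on the compiler and version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration: four 64-bit limbs, least significant first. */
typedef struct { uint64_t limb[4]; } u256;

/* x += y using the GCC/Clang overflow builtin; returns the final carry. */
static unsigned add256_overflow(u256 *x, const u256 *y) {
    assert(x != NULL && y != NULL);
    unsigned carry = 0;
    for (size_t i = 0; i < 4; i++) {
        uint64_t s;
        /* The builtin returns true when the exact sum does not fit in 64 bits. */
        unsigned c1 = __builtin_add_overflow(x->limb[i], y->limb[i], &s);
        unsigned c2 = __builtin_add_overflow(s, (uint64_t)carry, &x->limb[i]);
        carry = c1 | c2;  /* at most one of the two additions can overflow */
    }
    return carry;
}
```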

Recommended answer

I found a solution with ICC 13.0.01 using the _addcarry_u64 intrinsic:

void add256(uint256 *x, uint256 *y) {
    unsigned char c = 0;
    c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
    c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
    c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
        _addcarry_u64(c, x->x4, y->x4, &x->x4);
}

which produces

L__routine_start_add256_0:
add256:
        xorl      %r9d, %r9d                                    #25.9
        movq      (%rsi), %rax                                  #22.9
        addq      %rax, (%rdi)                                  #22.9
        movq      8(%rsi), %rdx                                 #23.9
        adcq      %rdx, 8(%rdi)                                 #23.9
        movq      16(%rsi), %rcx                                #24.9
        adcq      %rcx, 16(%rdi)                                #24.9
        movq      24(%rsi), %r8                                 #25.9
        adcq      %r8, 24(%rdi)                                 #25.9
        setb      %r9b                                          #25.9
        ret                                                     #26.1

I compiled with -O3. I don't know how to enable ADX with ICC. Maybe I need ICC 14?

That's exactly one addq and three adcq, as I expected.

With Clang, the result using -O3 -madx is a mess:

add256(uint256*, uint256*):                  # @add256(uint256*, uint256*)
movq    (%rsi), %rax
xorl    %ecx, %ecx
xorl    %edx, %edx
addb    $-1, %dl
adcq    %rax, (%rdi)
addb    $-1, %cl
movq    (%rdi), %rcx
adcxq   %rax, %rcx
setb    %al
movq    8(%rsi), %rcx
movb    %al, %dl
addb    $-1, %dl
adcq    %rcx, 8(%rdi)
addb    $-1, %al
movq    8(%rdi), %rax
adcxq   %rcx, %rax
setb    %al
movq    16(%rsi), %rcx
movb    %al, %dl
addb    $-1, %dl
adcq    %rcx, 16(%rdi)
addb    $-1, %al
movq    16(%rdi), %rax
adcxq   %rcx, %rax
setb    %al
movq    24(%rsi), %rcx
addb    $-1, %al
adcq    %rcx, 24(%rdi)
retq

Without enabling -madx in Clang, the result is not much better.

Edit:
Apparently MSVC already has _addcarry_u64. I tried it, and it is as good as ICC (1x add and 3x adc).
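The add256 in this answer relies on a uint256 type that is not shown. A self-contained variant might look like the following sketch. It assumes an x86-64 target and a compiler whose <immintrin.h> declares _addcarry_u64 (recent GCC and Clang do); the u256 type and the name add256_intrin are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>  /* assumed to declare _addcarry_u64 on x86-64 */

/* Hypothetical type for illustration: four 64-bit limbs, least significant first.
   unsigned long long matches the pointer type the intrinsic expects. */
typedef struct { unsigned long long limb[4]; } u256;

/* x += y; each call consumes the previous carry and returns the next one. */
static unsigned char add256_intrin(u256 *x, const u256 *y) {
    assert(x != NULL && y != NULL);
    unsigned char c = 0;
    c = _addcarry_u64(c, x->limb[0], y->limb[0], &x->limb[0]);
    c = _addcarry_u64(c, x->limb[1], y->limb[1], &x->limb[1]);
    c = _addcarry_u64(c, x->limb[2], y->limb[2], &x->limb[2]);
    c = _addcarry_u64(c, x->limb[3], y->limb[3], &x->limb[3]);
    return c;  /* carry out of the most significant limb */
}
```

Returning the final carry instead of discarding it lets a caller chain further limbs or detect overflow of the 256-bit sum.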
