使用进位标志有效的128位加法 [英] Efficient 128-bit addition using carry flag
问题描述
我用我的C ++ code的非常内环128位的整数计数器。 (无关背景:实际应用在规则网格上,这涉及重复递增的大整数评估有限差分方程,因为小四舍五入积累足以影响甚至答案64位是不够的precision)
I'm using a 128 bit integer counter in the very inner loops of my C++ code. (Irrelevant background: The actual application is evaluating finite difference equations on a regular grid, which involves repetitively incrementing large integers, and even 64 bits isn't enough precision because small rounding accumulates enough to affect the answers.)
我已经重新presented整数作为两个64位无符号多头。我现在需要一个128位的常数来增加这些值。这并不难,但你必须手动赶上从低位字进位到高位字。
I've represented the integer as two 64 bit unsigned longs. I now need to increment those values by a 128 bit constant. This isn't hard, but you have to manually catch the carry from the low word to the high word.
我有工作code是这样的:
I have working code something like this:
inline void increment128(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
if (loWord < loAdd) ++hiWord; // test_and_add_carry
hiWord += hiAdd;
}
这是紧张和简单的code。它的工作原理。
This is tight and simple code. It works.
不幸的是,这是我的运行时间的20%左右。凶手线是LOWORD测试。如果我删除了,我明明得到错误的答案,但在运行时开销从20%下降到4%!所以携带的测试是特别贵!
Unfortunately this is about 20% of my runtime. The killer line is that loWord test. If I remove it, I obviously get the wrong answers but the runtime overhead drops from 20% to 4%! So that carry test is especially expensive!
我的问题:请问C ++暴露随身携带的硬件标志,甚至延伸到GCC?
这似乎是补充可以不高于若实际编译指令用于通过最后的套利指令的HIWORD另外一个附加的测试和添加进线来完成。
有没有办法改写的测试和附加随身行,让编译器使用内部运算code?
My question: Does C++ expose the hardware carry flags, even as an extension to GCC? It seems like the additions could be done without the test-and-add-carry line above if the actual compiled instructions used an add using last carry instruction for the hiWord addition. Is there a way to rewrite the test-and-add-carry line to get the compiler to use the intrinsic opcode?
推荐答案
其实如果你仔细写你的code gcc会自动使用进...
Actually gcc will use the carry automatically if you write your code carefully...
我编这个code与的gcc -O2 -Wall -Werror -S
:
void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
if (loWord < loAdd) ++hiWord; // test_and_add_carry
hiWord += hiAdd;
}
void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
hiWord += hiAdd;
hiWord += (loWord < loAdd); // test_and_add_carry
}
这是increment128_1集:
This is the assembly for increment128_1:
.cfi_startproc
movabsq $-8801131483544218438, %rax
addq (%rsi), %rax
movabsq $-8801131483544218439, %rdx
cmpq %rdx, %rax
movq %rax, (%rsi)
ja .L5
movq (%rdi), %rax
addq $1, %rax
.L3:
movabsq $6794178679361, %rdx
addq %rdx, %rax
movq %rax, (%rdi)
ret
...这是increment128_2集:
...and this is the assembly for increment128_2:
movabsq $-8801131483544218438, %rax
addq %rax, (%rsi)
movabsq $6794178679361, %rax
addq (%rdi), %rax
movabsq $-8801131483544218439, %rdx
movq %rax, (%rdi)
cmpq %rdx, (%rsi)
setbe %dl
movzbl %dl, %edx
leaq (%rdx,%rax), %rax
movq %rax, (%rdi)
ret
请注意在第二个版本缺乏条件分支。
Note the lack of conditional branches in the second version.
此外,引用往往不好的表现,因为GCC有担心走样......这往往是最好只按值传递的东西。试想一下:
Also, references are often bad for performance, because GCC has to worry about aliasing... It is often better to just pass things by value. Consider:
struct my_uint128_t {
unsigned long hi;
unsigned long lo;
};
my_uint128_t increment128_3(my_uint128_t x)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
x.lo += loAdd;
x.hi += hiAdd + (x.lo < loAdd);
return x;
}
大会:
.cfi_startproc
movabsq $-8801131483544218438, %rdx
movabsq $-8801131483544218439, %rax
movabsq $6794178679362, %rcx
addq %rsi, %rdx
cmpq %rdx, %rax
sbbq %rax, %rax
addq %rcx, %rax
addq %rdi, %rax
ret
这实际上是三个最严格的code。
This is actually the tightest code of the three.
... OK所以没有人实际使用的随身携带自动:-)。但他们避免条件转移,我敢打赌是缓慢的部分(因为分支prediction逻辑将弄错了一半的时间)。
...OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).
和一多,这我碰到做一个小的搜索偶然发现。你知道吗GCC还内置了支持128位的整数?
And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?
typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));
my_uint128_t increment128_4(my_uint128_t x)
{
const my_uint128_t hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
return x + (hiAdd << 64) + loAdd;
}
对于这一个大会即将尽善尽美:
The assembly for this one is about as good as it gets:
.cfi_startproc
movabsq $-8801131483544218438, %rax
movabsq $6794178679361, %rdx
pushq %rbx
.cfi_def_cfa_offset 16
addq %rdi, %rax
adcq %rsi, %rdx
popq %rbx
.cfi_offset 3, -16
.cfi_def_cfa_offset 8
ret
(不知道在哪里 EBX
的推/流行是从哪里来的,但是这还算不错。)
(Not sure where the push/pop of ebx
came from, but this is still not bad.)
所有这些都是与GCC 4.5.2,顺便说一句。
All of these are with GCC 4.5.2, by the way.
这篇关于使用进位标志有效的128位加法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!