使用inline内联汇编器对带有进位的bigint add进行编码 [英] Using intel inline assembler to code bigint add with carry

查看:126
本文介绍了使用inline内联汇编器对带有进位的bigint add进行编码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想做一个快速代码,以大整数形式添加64位数字:

I would like to do a fast code for adding 64 bit numbers in big ints:

uint64_t ans[n];
uint64_t a[n], b[n]; // assume initialized values....
for (int i = 0; i < n; i++)
  ans[i] = a[i] + b[i];

但以上内容不适用于进位.

but the above does not work with carry.

我在这里看到了另一个问题,建议使用if语句检查哪一个优雅:

I saw another question here that suggested using an if statement to check which is elegant:

ans[0] = a[0] + b[0];
int c = ans[0] < a[0];
for (int i = 0; i < n; i++) {
  ans[i] = a[i] + b[i] + c;
  c = ans[i] < a[i];
}

但是,我想学习如何嵌入内联(intel)程序集并更快地完成它. 我确定有64位操作码,等同于:

However, I would like to learn how to embed inline (intel) assembly and do it faster. I am sure there are 64 bit opcodes, the equivalent of:

add eax, ebx
adc ...

但是我不知道如何从其余的c ++代码中将参数传递给汇编器.

but I don't know how to pass parameters to the assembler from the rest of the c++ code.

推荐答案

但以上内容不适用于进位.

but the above does not work with carry.

如果您的意思是GCC不会生成使用ADC指令的代码,那是因为其优化程序已经确定了实现加法的最佳方法.

If you mean that GCC does not generate code that uses the ADC instruction, that's because its optimizer has determined that there is a more optimal way to implement the addition.

这是我的代码测试版本.我已经将数组提取出来作为传递给函数的参数,这样就不会遗漏代码,并且我们可以将研究范围限制在相关部分.

Here is my test version of your code. I have extracted the arrays out as parameters passed to the function so that the code cannot be elided and we can limit our study to the relevant portions.

void Test(uint64_t* a, uint64_t* b, uint64_t* ans, int n)
{
    for (int i = 0; i < n; ++i)
    {
        ans[i] = a[i] + b[i];
    }
}

现在,的确,当您使用现代版本的GCC和编译时,对此进行编译 ,您会看到一堆看上去很疯狂的代码.

Now, indeed, when you compile this with a modern version of GCC and look at the disassembly, you'll see a bunch of crazy-looking code.

Godbolt编译器浏览器非常有用,它可以对C源代码行及其对应的汇编代码进行颜色编码(或者至少是尽其所能;这在优化代码中并不完美,但是它在这里效果很好).紫色代码是在循环内部实现64位加法的原因. GCC发出SSE2指令进行加法运算.具体来说,您可以选择 MOVDQU (这会导致未对齐的double从内存到XMM寄存器中的四字), PADDQ (在打包的整数quadwords)和 MOVQ (这会将四字从XMM寄存器中移出)进入内存).粗略地说,对于非汇编专家来说,MOVDQU是如何加载64位整数值的,PADDQ进行加法,然后MOVQ存储结果.

The Godbolt compiler explorer is helpful enough that it color-codes lines of C source and their corresponding assembly code (or at least, it does so to the best of its ability; this isn't perfect in optimized code, but it works well enough here). The purple code is what implements the 64-bit addition in the inner body of the loop. GCC is emitting SSE2 instructions to do the addition. Specifically, you can pick out MOVDQU (which does an unaligned move of a double quadword from memory into an XMM register), PADDQ (which does an addition on packed integer quadwords), and MOVQ (which moves a quadword from an XMM register into memory). Roughly speaking, for a non-assembly expert, MOVDQU is how it loads the 64-bit integer values, PADDQ does the addition, and then MOVQ stores the result.

使此输出特别嘈杂和令人困惑的部分原因是,GCC正在展开 for循环.如果禁用循环展开(-fno-tree-vectorize),则会得到更易于阅读的输出,尽管它是仍然使用相同的指令执行相同的操作. (好吧,主要是.现在,它在各处都使用MOVQ进行加载和存储,而不是使用MOVDQU进行加载.)

Part of what makes this output especially noisy and confusing is that GCC is unrolling the for loop. If you disable loop unrolling (-fno-tree-vectorize), you get output that is easier to read, although it's still doing the same thing using the same instructions. (Well, mostly. Now it's using MOVQ everywhere, for both loads and stores, instead of loading with MOVDQU.)

另一方面,如果您明确禁止编译器使用SSE2指令(-mno-sse2),则您会看到输出明显不同.现在,由于它不能使用SSE2指令,因此它发出基本的x86指令来执行64位加法操作,而唯一的方法是ADD + ADC.

On the other hand, if you specifically forbid the compiler from using SSE2 instructions (-mno-sse2), you see output that is significantly different. Now, because it can't use SSE2 instructions, it's emitting basic x86 instructions to do the 64-bit addition—and the only way to do it is ADD + ADC.

我怀疑这是您期望看到的代码.显然,GCC相信对向量进行矢量化可以加快代码的速度,因此,当您使用-O2-O3标志进行编译时,它就是这样做的.在-O1,它始终使用ADD + ADC.这是其中更少的指令并不意味着更快的代码的情况之一. (或者至少,GCC并不这么认为.实际代码的基准可能会讲一个不同的故事.在某些人为设计的场景中,开销可能会很大,而在现实世界中则无关紧要.)

I suspect that this is the code you were expecting to see. Clearly, GCC believes that vectorizing the operation results in faster code, so this is what it does when you compile with the -O2 or -O3 flags. At -O1, it always uses ADD + ADC. This is one of those cases where fewer instructions does not imply faster code. (Or at least, GCC doesn't think so. Benchmarks on actual code might tell a different story. The overhead may be significant in certain contrived scenarios but irrelevant in the real world.)

就其价值而言,Clang的行为与GCC的行为非常相似.

For what it's worth, Clang behaves in very similar ways as GCC does here.

如果您的意思是这段代码没有将上一个加法的结果转移到下一个加法的结果,那么您是对的.您显示的第二段代码实现了该算法,并且 GCC会使用ADC指令对此进行编译.

If you meant that this code doesn't carry the result of the previous addition over to the next addition, then you're right. The second snippet of code that you've shown implements that algorithm, and GCC does compile this using the ADC instruction.

至少在针对x86-32时有效.在面向x86-64的情况下,如果您具有可用的本机64位整数寄存器,则甚至不需要携带". 简单的ADD指令就足够,而所需的代码则少得多.实际上,这只是32位体系结构上的"bigint"算法,这就是为什么我在所有上述分析和编译器输出中都假设使用x86-32.

At least, it does when targeting x86-32. When targeting x86-64, where you have native 64-bit integer registers available, no "carrying" is even necessary; simple ADD instructions are sufficient, requiring significantly less code. In fact, this is only "bigint" arithmetic on 32-bit architectures, which is why I have assumed x86-32 in all of the foregoing analysis and compiler output.

在评论中,Ped7g想知道为什么编译器似乎没有ADD + ADC链习惯用法的想法.我不确定他指的是什么,因为他没有分享他尝试过的任何输入代码示例,但是正如我所展示的,编译器 do 在这里使用ADC指令.但是,编译器不会在循环迭代中进行链式携带.实际上,这太难实现了,因为有太多指令清除了这些标志.手写汇编代码的人也许可以做到,但编译器无法做到.

In a comment, Ped7g wonders why compilers don't seem to have the idea of the ADD+ADC chain idiom. I'm not entirely sure what he's referring to here, since he didn't share any examples of the input code that he tried, but as I've shown, compilers do use ADC instructions here. However, compilers don't chain carries across loop iterations. This is too difficult to implement in practice, because so many instructions clear the flags. Someone hand-writing the assembly code might be able to do it, but compilers won't.

(请注意,c可能应为 unsigned 整数,以鼓励某些优化.在这种情况下,它仅确保GCC在准备执行64位操作时使用XOR指令代替CDQ.虽然速度稍快,但不是巨大的改进,但是里程可能因实际代码而异.)

(Note that c should probably be an unsigned integer to encourage certain optimizations. In this case, it just ensures that GCC uses an XOR instruction when preparing to do a 64-bit addition instead of a CDQ. Although slightly faster, not a huge improvement, but mileage may vary with real code.)

(另外,令人失望的是,GCC无法发出用于在循环内设置c的无分支代码.如果输入值足够随机,则分支预测将失败,并且最终会导致效率相对较低的代码.几乎可以肯定,编写C源代码以说服GCC发出无分支代码的方法,但这是完全不同的答案.)

(Also, it's disappointing that GCC is unable to emit branchless code for setting c inside of the loop. With sufficiently random input values, branch prediction will fail, and you'll end up with relatively inefficient code. There are almost certainly ways of writing the C source to persuade GCC to emit branchless code, but that's an entirely different answer.)

我想学习如何嵌入内联(intel)程序集并更快地完成它.

I would like to learn how to embed inline (intel) assembly and do it faster.

好吧,我们已经看到,如果您天真的使一堆ADC指令发出,它可能不一定会更快.除非您确信自己对性能的假设是正确的,否则请不要进行手动优化!

Well, we've already seen that it might not necessarily be faster if you naïvely caused a bunch of ADC instructions to be emitted. Don't hand-optimize unless you are confident that your assumptions about performance are correct!

此外,内联汇编不仅难以编写,调试和维护,而且甚至可能使您的代码变慢,因为它禁止了某些本来可以由编译器完成的优化.您需要能够证明您的手动编码程序集在性能上远胜于编译器生成的内容,而这些考虑因素的相关性越来越小.您还应该通过更改标志或巧妙地编写C源代码,确认没有办法让编译器生成接近理想输出的代码.

Also, not only is inline assembly difficult to write, debug, and maintain, but it may even make your code slower because it inhibits certain optimizations that could otherwise have been done by the compiler. You need to be able to prove that your hand-coded assembly is a significant enough performance win over what the compiler would generate that these considerations become less relevant. You also should confirm that there is no way to get the compiler to generate code that is close to your ideal output, either by altering flags or cleverly writing the C source.

但是如果您真的想,则可以阅读各种在线教程教您如何使用GCC的内联汇编器. 这是一个相当不错的;还有很多其他的.当然,还有手册 .所有人都会解释扩展asm" 指定输入操作数和输出操作数,这将回答您的问题如何将其余c ++代码中的参数传递给汇编器".

But if you really wanted to, you could read any of a variety of online tutorials that teach you how to use GCC's inline assembler. This is a pretty good one; there are plenty of others. And of course, there is the manual. All will explain how "extended asm" allows you to specify input operands and output operands, which will answer your question of "how to pass parameters to the assembler from the rest of the c++ code".

正如paddy和Christopher Oicles所建议的那样,您应该更喜欢内部函数而不是内联汇编.不幸的是,没有导致ADC指令发出的内在函数.内联汇编是您唯一的选择–或者,我已经建议编写C源代码,以便编译器自己执行Right Thing™.

As paddy and Christopher Oicles suggested, you should prefer intrinsics to inline assembly. Unfortunately, there are no intrinsics that cause ADC instructions to be emitted. Inline assembly is your only recourse there—that, or what I already suggested of writing the C source so that the compiler will do the Right Thing™ on its own.

尽管有 _addcarry_u32_addcarry_u64内在函数.这些导致发出ADCXADOX指令. 这些是扩展"版本ADC可以产生更有效的代码.它们是随Broadwell微体系结构引入的Intel ADX指令集的一部分.在我看来,Broadwell没有足够高的市场渗透率,您只能发出ADCXADOX指令并将其命名为"day". 很多用户仍然拥有较旧的计算机,因此,尽可能为他们提供支持符合您的利益.如果您正在准备针对特定体系结构进行了调整的构建,那么它们很棒,但是我不建议将其用于一般用途.

There are _addcarry_u32 and _addcarry_u64 intrinsics, though. These cause ADCX or ADOX instructions to be emitted. These are "extended" versions of ADC that can produce more efficient code. They are part of the Intel ADX instruction set, introduced with the Broadwell microarchitecture. In my opinion, Broadwell does not have sufficiently high market penetration that you can simply emit ADCX or ADOX instructions and call it a day. Lots of users still have older machines, and it's in your interest to support them to the extent possible. They're great if you're preparing builds tuned for specific architectures, but I would not recommend it for general use.

我确定有64位操作码,等效于:add + adc

以64位体系结构为目标时,有ADDADC(以及ADCXADOX)的64位版本.然后,您可以使用相同的模式来实现128位的"bigint"算法.

There are 64-bit versions of ADD and ADC (and ADCX and ADOX) when you're targeting a 64-bit architecture. This would then allow you to implement 128-bit "bigint" arithmetic using the same pattern.

在x86-32上,基本指令集中没有这些指令的64位版本.您必须转向SSE2,就像我们看到的GCC和Clang一样.

On x86-32, there are no 64-bit versions of these instructions in the base instruction set. You must turn to SSE2, like we saw GCC and Clang do.

这篇关于使用inline内联汇编器对带有进位的bigint add进行编码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆