关于ADC -1(0xFFFFFFFF)有什么特别之处吗? [英] Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

查看:98
本文介绍了关于ADC -1(0xFFFFFFFF)有什么特别之处吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我的一个研究项目中,我正在编写C ++代码.但是,生成的程序集是项目的关键点之一.C ++不提供对标志操作指令的直接访问,尤其是对 ADC 的访问,但是只要编译器足够聪明地使用它,这就不成问题.考虑:

  constexpr无符号X = 0;无符号的f1(无符号的a,无符号的b){b + = a;无符号c = b<一种;返回c + b + X;} 

可变的 c 是一种解决方法,可让我掌握进位标记并将其添加到 b X 中.看起来我很幸运,并且( g ++ -O3 ,版本9.1)生成的代码是这样的:

  f1(无符号整数,无符号整数):添加%edi,%esimov%esi,%eaxadc $ 0x0,%eaxretq 

对于我测试过的所有 X 值,上述代码均与上面相同(当然,对于随即改变的立即值 $ 0x0 除外).我发现了一个例外:当 X == -1 (或 0xFFFFFFFFu 〜0u )时,...的确无关紧要您将其拼写),生成的代码为:

  f1(无符号整数,无符号整数):xor%eax,%eax添加%edi,%esisetb%allea -0x1(%rsi,%rax,1),%eaxretq 

这似乎不如间接测量所建议的初始代码有效(虽然不是很科学).我是对的吗?如果是这样,这是否是缺少优化机会"的错误值得举报吗?

出于价值考虑, clang -O3 版本8.8.0始终使用 ADC (如我所愿)和 icc -O3 ,版本19.0.1永远不会.

我尝试使用内在的 _addcarry_u32 ,但没有帮助.

  unsigned f2(unsigned a,unsigned b){b + = a;无符号字符c = b<一种;_addcarry_u32(c,b,X,& b);返回b;} 

我认为我可能没有正确使用 _addcarry_u32 (我找不到很多信息).既然要由我来提供进位标志,使用它有什么意义?(再次,介绍 c 并祈求编译器了解情况.)

实际上,我可能会正确使用它.对于 X == 0 ,我很高兴:

  f2(无符号整数,无符号整数):添加%esi,%edimov%edi,%eaxadc $ 0x0,%eaxretq 

对于 X == -1 我很不高兴:-(

  f2(无符号整数,无符号整数):添加%esi,%edimov $ 0xffffffff,%eaxsetb%dl加$ 0xff,%dlADC%edi,%eaxretq 

我确实得到了 ADC ,但这显然不是最有效的代码.( dl 在那里做什么?两条指令来读取进位标志并将其恢复?真的吗?我希望我做错了!)

解决方案

mov + adc $ -1,%eax 更有效xor -零+ setc + 3组件 lea 在大多数CPU上的延迟和uop计数,在任何仍然相关的CPU上都没有恶化.> 1


这似乎是gcc错过的优化:它可能看到了一种特殊情况并锁定了这种情况,将自己开枪射击并阻止了 adc 模式识别的发生.

我不知道它到底在寻找什么/正在寻找什么,所以是的,您应该将此报告为未优化优化错误.或者,如果您想更深入地研究自己,可以在优化通过后查看GIMPLE或RTL输出,然后看看会发生什么.如果您对GCC的内部代表一无所知.Godbolt有一个GIMPLE树转储窗口,您可以从与克隆编译器"相同的下拉列表中添加.


clang用 adc 对其进行编译的事实证明这是合法的,即您想要的asm确实与C ++源代码匹配,并且您不会错过某些阻止编译器执行操作的特殊情况优化.(假设clang是没有错误的,这里就是这种情况.)

如果您不小心,例如,可能会发生该问题.在C语言中,尝试编写一个一般情况下的 adc 函数来带进位并从3输入加法中带出进位是困难的,因为这两个加法中的任何一个都可以带进,所以您不能仅仅使用 sum<将进位加到输入之一后的a + b 习惯用法.我不确定是否有可能让gcc或clang发出 add/adc/adc ,其中中间的 adc 必须带进并产生带出./p>

例如 0xff ... ff + 1 绕回至0,因此 sum = a + b + carry_in / carry_out = sum<无法优化为 adc ,因为在 a = -1 carry_in = 1 .

所以另一个猜测是,也许gcc考虑过更早地进行 + X ,并且由于这种特殊情况而将自己开枪了.不过,这没有什么意义.


使用它有什么意义,因为要由我提供进位标志吗?

您正在正确使用 _addcarry_u32 .

其存在的要点是让您用进位 in 和进位 out 表示加法,这在纯C语言中很难实现.不能很好地优化它,通常不只是将进位结果保存在CF中.

如果只需要结转,可以提供一个 0 作为结转,它将优化为 add 而不是 adc ,但仍然可以将C变量带给您.

例如要在32位块中添加两个128位整数,您可以执行此操作

 //在x86-64上不正确,因为它无法优化与2x _addcary_u64相同的效果//即使__restrict保证不重叠.无效adc_128bit(unsigned * __ restrict dst,const unsigned * __ restrict src){未签名的字符进位;进位= _addcarry_u32(0,dst [0],src [0],& dst [0]);进位= _addcarry_u32(carry,dst [1],src [1],& dst [1]);进位= _addcarry_u32(carry,dst [2],src [2],& dst [2]);进位= _addcarry_u32(carry,dst [3],src [3],& dst [3]);} 

(的

与无符号__int128 相比,效率非常低,在无符号__int128 中,编译器仅使用64位add/adc,但确实使clang和ICC发出了 add / adc / adc / adc .GCC弄得一团糟,使用 setcc 将CF存储到整数以执行某些步骤,然后 add dl,-1 将其放回CF中以生成adc .

不幸的是,GCC很讨厌用纯C语言编写的扩展精度/biginteger.Clang有时会稍好一些,但是大多数编译器都不好.这就是为什么对于大多数体系结构,最低级别的gmplib函数都是在asm中手写的原因.


脚注1 :或用于uop计数:在Intel Haswell和更早版本中, adc 为2 oups,但零值立即数(在Sandybridge系列解码器的特殊情况下,如1 uop.

但是具有 base + index + disp 的3分量LEA使其成为Intel CPU上的3周期延迟指令,因此肯定更糟.

在Intel Broadwell及更高版本上, adc 甚至是立即数非零的1uop指令,它利用了Haswell为FMA引入的3输入微指令的支持.

总的uop数量相等,但是延迟更短,这意味着 adc 仍然是更好的选择.

https://agner.org/optimize/

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:

constexpr unsigned X = 0;

unsigned f1(unsigned a, unsigned b) {
    b += a;
    unsigned c = b < a;
    return c + b + X;
}

Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks I got luck and the (g++ -O3, version 9.1) generated code is this:

f1(unsigned int, unsigned int):
 add %edi,%esi
 mov %esi,%eax
 adc $0x0,%eax
 retq 

For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:

f1(unsigned int, unsigned int):
 xor %eax,%eax
 add %edi,%esi
 setb %al
 lea -0x1(%rsi,%rax,1),%eax
 retq 

This seems less efficient than the initial code as suggested by indirect measurements (not very scientific though) Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?

For what is worth, clang -O3, version 8.8.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1 never does.

I've tried using the intrinsic _addcarry_u32 but it didn't help.

unsigned f2(unsigned a, unsigned b) {
    b += a;
    unsigned char c = b < a;
    _addcarry_u32(c, b, X, &b);
    return b;
}

I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)

I might, actually, be using it correctly. For X == 0 I'm happy:

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov %edi,%eax
 adc $0x0,%eax
 retq 

For X == -1 I'm unhappy :-(

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov $0xffffffff,%eax
 setb %dl
 add $0xff,%dl
 adc %edi,%eax
 retq 

I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)

解决方案

mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.1


This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.

I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".


The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)

That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.

e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.

So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.


What's the point of using it since it's up to me to provide the carry flag?

You're using _addcarry_u32 correctly.

The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.

If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.

e.g. to add two 128-bit integers in 32-bit chunks, you can do this

// bad on x86-64 because it doesn't optimize the same as 2x _addcary_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
    carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
    carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
    carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}

(On Godbolt with GCC/clang/ICC)

That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.

GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.


Footnote 1: or for uop count: equal on Intel Haswell and earlier where adc is 2 uops, except with a zero immediate where Sandybridge-family's decoders special case that as 1 uop.

But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.

On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.

So equal total uop count but worse latency means that adc would still be a better choice.

https://agner.org/optimize/

这篇关于关于ADC -1(0xFFFFFFFF)有什么特别之处吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆