除非Haswell指定,否则GCC编译的前导零计数很差 [英] GCC compiles leading zero count poorly unless Haswell specified
问题描述
GCC支持 < builtin,它计算参数中前导零的数量(连续的最高有效零)数目。
其中, 0 对于高效地实现 lg(unsigned int x)
函数,它取以 x
的二进制对数,向下舍入1 \\ / sup>:
/ **返回x的base-2日志,其中x> 0 * /
unsigned lg(unsigned x){
return 31U - (unsigned)__ builtin_clz(x);
}
这可以直接使用 - 特别考虑 x == 1
和 clz(x)== 31
- 然后 x == 2 ^ 0
so
lg(x)== 0
和 31 - 31 == 0
正确的结果。更高的值 x
的工作方式相似。
假设内置函数有效地实现, 纯C解决方案。
<现在碰巧,计数前导零操作本质上是双重的,它是
bsr
指令。这将返回参数中最重要的1位 2 的索引。所以如果有10个前导零,则第一个1位位于参数的第21位。一般来说,我们有 31 - clz(x)== bsr(x)
,所以 bsr
其实直接实现我们期望的 lg()
函数,而不需要多余的 31U - ...
部分。 事实上,你可以在这行之间阅读,看到 __ builtin_clz
函数是用 bsr
记住:如果参数为零,它被定义为未定义的行为,当然前导零操作完全定义为32无论 int
的位大小是否为零)。所以 __ builtin_clz
当然是在x86上有效地映射到 bsr
指令的思想下实现的。
然而,看看GCC实际做了什么,在 -O3
中有其他默认选项:it 增加了一大堆额外的垃圾:
lg(unsigned int):
bsr edi,edi
mov eax,31
xor edi,31
sub eax,edi
ret
xor edi,31
行实际上是一个 not edi
对于实际上很重要的底部4位来说,这是从 neg edi $ c $ off-by-one 3 c>将
bsr
的结果转换为 clz
。然后执行 31 - clz(x)
。
然而在 -mtune = haswell
,代码简化到预期的单一 bsr
指令:
lg(unsigned int):
bsr eax,edi
ret
为什么这种情况对我来说很不明确。在Haswell之前, bsr
指令已经存在了几十年,并且AFAIK的行为不变。它不仅仅是一个特定拱门的调优问题,因为 bsr
+一堆额外的指令不会比平原 bsr
并进一步使用 -mtune = haswell
仍然会在慢代码中产生结果。
64位输入和输出的情况是甚至更糟:关键路径中有一个额外的 movsx
,它似乎什么都不做,因为 clz
永远不会是负面的。同样, -march = haswell
变体对于单个 bsr
指令是最优的。
最后,让我们检查非Windows编译器空间中的大玩家, icc
和 clang
。 icc
只是做了一件坏事,并添加了诸如 neg eax之类的冗余内容;加上eax,31; neg eax;添加eax,31;
- wtf?无论 -march
。
0 例如扫描位图作为第一个位元。
1 0的对数是未定义的,所以用0参数调用我们的函数是未定义的行为。
这里,LSB是第0位,MSB是第31位。
3 回想一下,二进制补码中的 -x ==〜x + 1
。
这看起来像一个使用gcc的已知问题: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168
GCC supports the __builtin_clz(int x)
builtin, which counts the number of number of leading zeros (consecutive most-significant zeros) in the argument.
Among other things0, this is great for efficiently implementing the lg(unsigned int x)
function, which takes the base-2 logarithm of x
, rounding down1:
/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
return 31U - (unsigned)__builtin_clz(x);
}
This works in the straightforward way - in particular consider the case x == 1
and clz(x) == 31
- then x == 2^0
so lg(x) == 0
and 31 - 31 == 0
and we get the correct result. Higher values of x
work similarly.
Assuming the builtin is efficiently implemented, this ends much better than the alternate pure C solutions.
Now as it happens, the count leading zeros operation is essentially the dual of the bsr
instruction in x86. That returns the index of the most-significant 1-bit2 in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x)
and in so bsr
in fact directly implements our desired lg()
function, without the superfluous 31U - ...
part.
In fact, you can read between the line and see that the __builtin_clz
function was implemented with bsr
in mind: it is defined as undefined behavior if the argument is zero, when of course the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int
is) with a zero argument. So __builtin_clz
was certainly implemented with the idea of being efficiently mapped to a bsr
instruction on x86.
However, looking at what GCC actually does, at -O3
with otherwise default options: it adds a ton of extra junk:
lg(unsigned int):
bsr edi, edi
mov eax, 31
xor edi, 31
sub eax, edi
ret
The xor edi,31
line is effectively a not edi
for the bottom 4 bits that actually matter, that's off-by-one3 from neg edi
which turns the result of bsr
into clz
. Then the 31 - clz(x)
is carried out.
However with -mtune=haswell
, the code simplifies into exactly the expected single bsr
instruction:
lg(unsigned int):
bsr eax, edi
ret
Why that is the case is very unclear to me. The bsr
instruction has been around for a couple decades before Haswell, and the behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr
+ a bunch of extra instructions isn't going to be faster than a plain bsr
and furthermore using -mtune=haswell
still results in the slower code.
The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx
in the critical path which seems to do nothing since the result from clz
will never be negative. Again, the -march=haswell
variant is optimal with a single bsr
instruction.
Finally, let's check the big players in the non-Windows compiler space, icc
and clang
. icc
just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg eax; add eax, 31;
- wtf? clang
does a good job regardless of -march
.
0 Such as scanning a bitmap for the first set bit.
1 The logarithm of 0 is undefinited, and so calling our function with a 0 argument is undefined behavior.
2 Here, the LSB is the 0th bit and the MSB is the 31st.
3 Recall that -x == ~x + 1
in twos-complement.
This looks like a known issue with gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168
这篇关于除非Haswell指定,否则GCC编译的前导零计数很差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!