除非Haswell指定,否则GCC编译的前导零计数很差 [英] GCC compiles leading zero count poorly unless Haswell specified

查看:140
本文介绍了除非Haswell指定,否则GCC编译的前导零计数很差的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

GCC支持 < builtin,它计算参数中前导零的数量(连续的最高有效零)数目。



其中, 0 对于高效地实现 lg(unsigned int x)函数,它取以 x 的二进制对数,向下舍入1 \\ / sup>:

  / **返回x的base-2日志,其中x> 0 * / 
unsigned lg(unsigned x){
return 31U - (unsigned)__ builtin_clz(x);
}

这可以直接使用 - 特别考虑 x == 1 clz(x)== 31 - 然后 x == 2 ^ 0 so lg(x)== 0 31 - 31 == 0 正确的结果。更高的值 x 的工作方式相似。



假设内置函数有效地实现, 纯C解决方案



<现在碰巧,计数前导零操作本质上是双重的,它是
在x86中> bsr 指令。这将返回参数中最重要的1位 2 的索引。所以如果有10个前导零,则第一个1位位于参数的第21位。一般来说,我们有 31 - clz(x)== bsr(x),所以 bsr 其实直接实现我们期望的 lg()函数,而不需要多余的 31U - ... 部分。

事实上,你可以在这行之间阅读,看到 __ builtin_clz 函数是用 bsr 记住:如果参数为零,它被定义为未定义的行为,当然前导零操作完全定义为32无论 int 的位大小是否为零)。所以 __ builtin_clz 当然是在x86上有效地映射到 bsr 指令的思想下实现的。



然而,看看GCC实际做了什么,在 -O3 中有其他默认选项:it 增加了一大堆额外的垃圾

  lg(unsigned int):
bsr edi,edi
mov eax,31
xor edi,31
sub eax,edi
ret

xor edi,31 行实际上是一个 not edi 对于实际上很重要的底部4位来说,这是从 neg edi 3 c>将 bsr 的结果转换为 clz 。然后执行 31 - clz(x)



然而在 -mtune = haswell 代码简化到预期的单一 bsr 指令:

  lg(unsigned int):
bsr eax,edi
ret

为什么这种情况对我来说很不明确。在Haswell之前, bsr 指令已经存在了几十年,并且AFAIK的行为不变。它不仅仅是一个特定拱门的调优问题,因为 bsr +一堆额外的指令不会比平原 bsr 并进一步使用 -mtune = haswell 仍然会在慢代码中产生结果

64位输入和输出的情况是甚至更糟:关键路径中有一个额外的 movsx ,它似乎什么都不做,因为 clz 永远不会是负面的。同样, -march = haswell 变体对于单个 bsr 指令是最优的。



最后,让我们检查非Windows编译器空间中的大玩家, icc clang icc 只是做了一件坏事,并添加了诸如 neg eax之类的冗余内容;加上eax,31; neg eax;添加eax,31; - wtf?无论 -march




clang
>

0 例如扫描位图作为第一个位元。



1 0的对数是未定义的,所以用0参数调用我们的函数是未定义的行为

这里,LSB是第0位,MSB是第31位。



3 回想一下,二进制补码中的 -x ==〜x + 1

解决方案

这看起来像一个使用gcc的已知问题: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168


GCC supports the __builtin_clz(int x) builtin, which counts the number of number of leading zeros (consecutive most-significant zeros) in the argument.

Among other things0, this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down1:

/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
  return 31U - (unsigned)__builtin_clz(x);
}

This works in the straightforward way - in particular consider the case x == 1 and clz(x) == 31 - then x == 2^0 so lg(x) == 0 and 31 - 31 == 0 and we get the correct result. Higher values of x work similarly.

Assuming the builtin is efficiently implemented, this ends much better than the alternate pure C solutions.

Now as it happens, the count leading zeros operation is essentially the dual of the bsr instruction in x86. That returns the index of the most-significant 1-bit2 in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x) and in so bsr in fact directly implements our desired lg() function, without the superfluous 31U - ... part.

In fact, you can read between the line and see that the __builtin_clz function was implemented with bsr in mind: it is defined as undefined behavior if the argument is zero, when of course the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int is) with a zero argument. So __builtin_clz was certainly implemented with the idea of being efficiently mapped to a bsr instruction on x86.

However, looking at what GCC actually does, at -O3 with otherwise default options: it adds a ton of extra junk:

lg(unsigned int):
        bsr     edi, edi
        mov     eax, 31
        xor     edi, 31
        sub     eax, edi
        ret 

The xor edi,31 line is effectively a not edi for the bottom 4 bits that actually matter, that's off-by-one3 from neg edi which turns the result of bsr into clz. Then the 31 - clz(x) is carried out.

However with -mtune=haswell, the code simplifies into exactly the expected single bsr instruction:

lg(unsigned int):
        bsr     eax, edi
        ret

Why that is the case is very unclear to me. The bsr instruction has been around for a couple decades before Haswell, and the behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr + a bunch of extra instructions isn't going to be faster than a plain bsr and furthermore using -mtune=haswell still results in the slower code.

The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx in the critical path which seems to do nothing since the result from clz will never be negative. Again, the -march=haswell variant is optimal with a single bsr instruction.

Finally, let's check the big players in the non-Windows compiler space, icc and clang. icc just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg eax; add eax, 31; - wtf? clang does a good job regardless of -march.


0 Such as scanning a bitmap for the first set bit.

1 The logarithm of 0 is undefinited, and so calling our function with a 0 argument is undefined behavior.

2 Here, the LSB is the 0th bit and the MSB is the 31st.

3 Recall that -x == ~x + 1 in twos-complement.

解决方案

This looks like a known issue with gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

这篇关于除非Haswell指定,否则GCC编译的前导零计数很差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆