除非Haswell指定，否则GCC编译的前导零计数很差 [英] GCC compiles leading zero count poorly unless Haswell specified

查看：140 发布时间：2018/4/20 16:12:19 c performance gcc x86 intel

本文介绍了除非Haswell指定，否则GCC编译的前导零计数很差的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

GCC支持 < builtin，它计算参数中前导零的数量（连续的最高有效零）数目。

其中，⁰对于高效地实现 lg（unsigned int x）函数，它取以 x 的二进制对数，向下舍入1 \\ / sup>：

/ **返回x的base-2日志，其中x> 0 * / unsigned lg（unsigned x）{ return 31U - （unsigned）__ builtin_clz（x）; }
这可以直接使用 - 特别考虑 x == 1 和 clz（x）== 31 - 然后 x == 2 ^ 0 so lg（x）== 0 和 31 - 31 == 0 正确的结果。更高的值 x 的工作方式相似。
假设内置函数有效地实现，纯C解决方案。 <现在碰巧，计数前导零操作本质上是双重的，它是

 在x86中>  bsr  指令。这将返回参数中最重要的1位²的索引。所以如果有10个前导零，则第一个1位位于参数的第21位。一般来说，我们有 31  -  clz（x）== bsr（x），所以 bsr 其实直接实现我们期望的 lg（）函数，而不需要多余的 31U  -  ... 部分。  
 
 事实上，你可以在这行之间阅读，看到 __ builtin_clz 函数是用 bsr 记住：如果参数为零，它被定义为未定义的行为，当然前导零操作完全定义为32无论 int 的位大小是否为零）。所以 __ builtin_clz 当然是在x86上有效地映射到 bsr 指令的思想下实现的。 
 
 
 然而，看看GCC实际做了什么，在 -O3 中有其他默认选项：it 增加了一大堆额外的垃圾： 
 
 
  lg（unsigned int）：
 bsr edi，edi 
 mov eax，31 
 xor edi，31 
 sub eax，edi 
 ret 
  
  xor edi，31 行实际上是一个 not edi 对于实际上很重要的底部4位来说，这是从 neg edi  3  c>将 bsr 的结果转换为 clz 。然后执行 31  -  clz（x）。
 
 
 然而在 -mtune = haswell ，代码简化到预期的单一 bsr 指令： 
 
 
  lg（unsigned int）：
 bsr eax，edi 
 ret 
  
为什么这种情况对我来说很不明确。在Haswell之前， bsr 指令已经存在了几十年，并且AFAIK的行为不变。它不仅仅是一个特定拱门的调优问题，因为 bsr  +一堆额外的指令不会比平原 bsr 并进一步使用 -mtune = haswell  仍然会在慢代码中产生结果。  
 
 
 64位输入和输出的情况是甚至更糟：关键路径中有一个额外的 movsx ，它似乎什么都不做，因为 clz 永远不会是负面的。同样， -march = haswell 变体对于单个 bsr 指令是最优的。
 
 
 最后，让我们检查非Windows编译器空间中的大玩家，  icc 和 clang  。  icc 只是做了一件坏事，并添加了诸如 neg eax之类的冗余内容;加上eax，31; neg eax;添加eax，31;   -  wtf？无论 -march 。
 
 
 
  clang  > 
 
  ⁰例如扫描位图作为第一个位元。
 
 
  ¹ 0的对数是未定义的，所以用0参数调用我们的函数是未定义的行为。
 
这里，LSB是第0位，MSB是第31位。 
 
 
  ³回想一下，二进制补码中的 -x ==〜x + 1  。
解决方案
这看起来像一个使用gcc的已知问题： https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168  
 
GCC supports the __builtin_clz(int x) builtin, which counts the number of number of leading zeros (consecutive most-significant zeros) in the argument.

Among other things⁰, this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down¹:
/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
  return 31U - (unsigned)__builtin_clz(x);
}
This works in the straightforward way - in particular consider the case x == 1 and clz(x) == 31 - then x == 2^0 so lg(x) == 0 and 31 - 31 == 0 and we get the correct result. Higher values of x work similarly.

Assuming the builtin is efficiently implemented, this ends much better than the alternate pure C solutions.

Now as it happens, the count leading zeros operation is essentially the dual of the bsr instruction in x86. That returns the index of the most-significant 1-bit² in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x) and in so bsr in fact directly implements our desired lg() function, without the superfluous 31U - ... part. 

In fact, you can read between the line and see that the __builtin_clz function was implemented with bsr in mind: it is defined as undefined behavior if the argument is zero, when of course the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int is) with a zero argument. So __builtin_clz was certainly implemented with the idea of being efficiently mapped to a bsr instruction on x86. 

However, looking at what GCC actually does, at -O3 with otherwise default options: it adds a ton of extra junk:
lg(unsigned int):
        bsr     edi, edi
        mov     eax, 31
        xor     edi, 31
        sub     eax, edi
        ret 
The xor edi,31 line is effectively a not edi for the bottom 4 bits that actually matter, that's off-by-one³ from neg edi which turns the result of bsr into clz. Then the 31 - clz(x) is carried out.

However with -mtune=haswell, the code simplifies into exactly the expected single bsr instruction:
lg(unsigned int):
        bsr     eax, edi
        ret
Why that is the case is very unclear to me. The bsr instruction has been around for a couple decades before Haswell, and the behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr + a bunch of extra instructions isn't going to be faster than a plain bsr and furthermore using -mtune=haswell still results in the slower code. 

The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx in the critical path which seems to do nothing since the result from clz will never be negative. Again, the -march=haswell variant is optimal with a single bsr instruction.

Finally, let's check the big players in the non-Windows compiler space, icc and clang. icc just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg  eax; add eax, 31; - wtf? clang does a good job regardless of -march.



⁰ Such as scanning a bitmap for the first set bit.

¹ The logarithm of 0 is undefinited, and so calling our function with a 0 argument is undefined behavior.

² Here, the LSB is the 0th bit and the MSB is the 31st. 

³ Recall that -x == ~x + 1 in twos-complement.
 解决方案 
This looks like a known issue with gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

                        这篇关于除非Haswell指定，否则GCC编译的前导零计数很差的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

除非Haswell指定，否则GCC编译的前导零计数很差 [英] GCC compiles leading zero count poorly unless Haswell specified

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

除非Haswell指定，否则GCC编译的前导零计数很差 [英] GCC compiles leading zero count poorly unless Haswell specified

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭