How can a literal 0 and 0 as a variable yield different behavior with the function __builtin_clz?

Question

There's only 1 circumstance where __builtin_clz gives the wrong answer. I'm curious what's causing that behavior.

When I use the literal value 0 I always get 32 as expected. But 0 as a variable yields 31. Why does the method of storing the value 0 matter?

I've taken an architecture class but don't understand the diffed assembly. It looks like when given the literal value 0, the assembly somehow always has the correct answer of 32 hard coded even without optimizations. And the method for counting leading zeros is different when using -march=native.

This post about emulating __builtin_clz with _BitScanReverse and the line bsrl %eax, %eax seem to imply that bit scan reverse doesn't work for 0.

+-------------------+-------------+--------------+
|      Compile      | literal.cpp | variable.cpp |
+-------------------+-------------+--------------+
| g++               |          32 |           31 |
| g++ -O            |          32 |           32 |
| g++ -march=native |          32 |           32 |
+-------------------+-------------+--------------+

literal.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(0) << std::endl;
}

variable.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(i) << std::endl;
}

Diff for g++ -S [name] -o [name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"
23c23,26
<       movl    $32, %esi
---
>       movl    -4(%rbp), %eax
>       bsrl    %eax, %eax
>       xorl    $31, %eax
>       movl    %eax, %esi

Diff for g++ -march=native -S [in name] -o [out name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"
23c23,25
<       movl    $32, %esi
---
>       movl    -4(%rbp), %eax
>       lzcntl  %eax, %eax
>       movl    %eax, %esi

Diff for g++ -O -S [name] -o [name]

1c1
<       .file   "literal.cpp"
---
>       .file   "variable.cpp"

Recommended answer

When you compile with optimization disabled, the compiler doesn't do constant-propagation across statements. That part is a duplicate of "Why does integer division by -1 (negative one) result in FPE?" (read my answer there), and/or "Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?"

This is why a literal zero can be different from a variable with value = 0. Only the variable with optimization disabled results in runtime bsr+xor $31, %reg.

As documented in the GCC manual for __builtin_clz:

Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

This allows clz / ctz to compile to 31-bsr or bsf instructions respectively, on x86. 31-bsr is implemented with bsr+xor $31,%reg thanks to the magic of 2's complement. (BSR produces the index of the highest set bit, not the leading-zero count).

Note that it only says result, not behaviour. It's not C++ UB (whole program could do absolutely anything), it's limited to that result, just like in x86 asm. But anyway, it seems when the input is a compile-time constant 0, GCC produces the type-width like x86 lzcnt, and like clz instructions on other ISAs. (This probably happens in target-independent GIMPLE tree optimization where constant-propagation through operations including builtins is done.)

Intel documents bsf/bsr as: "If the content of the source operand is 0, the content of the destination operand is undefined." In real life, Intel hardware implements the same behaviour that AMD documents: the destination is left unmodified in that case.

But since Intel refuses to document it, compilers won't let you write code which takes advantage of it. GCC does not know or care about that behaviour, and provides no way to take advantage of it. Neither does MSVC, even though its intrinsic takes an output pointer arg and so could easily have worked that way. See "VS: unexpected optimization behavior with _BitScanReverse64 intrinsic".

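With -march=native on a CPU that supports it, GCC can use lzcnt directly, which is well-defined for every possible input bit pattern including 0. (lzcnt has its own LZCNT/ABM feature bit; it is often grouped with BMI1, which is where tzcnt comes from.) It directly produces a leading-zero count instead of the index of the highest set bit.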

(This is why BSR/BSF don't make sense for input=0; there is no index for them to find. Fun fact: bsr %eax, %eax does "work" for eax=0. In asm, the instructions also set ZF according to whether the input was zero, so you can detect when the output is "undefined" instead of needing a separate test+branch before bsr. Or, on AMD and everything else in real life, you can rely on the destination being left unmodified.)

On Intel before Skylake, lzcnt / tzcnt have a false dependency on the output register, even though the result never depends on it. IIRC, Coffee Lake also fixed the false dependency for popcnt. (All of these run on the same execution unit as BSR/BSF.)

  • Why does breaking the "output dependency" of LZCNT matter?
  • VS: unexpected optimization behavior with _BitScanReverse64 intrinsic
  • How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?
