VS: unexpected optimization behavior with _BitScanReverse64 intrinsic

Question

The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no Bit is set. Citing MSDN: (The return value is) "Nonzero if Index was set, or 0 if no set bits were found."

If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails.

#include <iostream>
#include <cassert>
#include <intrin.h>   // declares _BitScanReverse64 on MSVC

using namespace std;

int main()
{
  unsigned long index = 0;
  _BitScanReverse64(&index, 0x0ull);

  cout << index << endl;

  assert(index == 0);

  return 0;
}

Is this the intended behaviour? I am using Visual Studio Community 2015, Version 14.0.25431.01 Update 3. (I left the cout in so that the variable index is not removed during optimization.) Also, is there an efficient workaround, or should I just not use this compiler intrinsic directly?

Recommended answer

AFAICT, the intrinsic leaves garbage in index when the input is zero, weaker than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

Despite the index arg being taken by reference, the compiler treats it as output-only.

unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)

Intel's intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an "undefined value" when the input is 0, but setting the ZF in that case. But AMD documents it as leaving the destination unchanged:

AMD's BSF entry in AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions

... If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. ...

On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic's return value is non-zero).

On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32-bit. e.g. mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not the 0x00000000ffffffff you'd get from xor eax, 0. But with non-zero ECX, bsf eax, ecx has the usual effect of zero-extending into RAX, leaving for example RAX=3.
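To make the register example concrete, here is a small software model of that observed behaviour. It is only a sketch, not real asm: the function name `model_bsf32` is made up for illustration, and it encodes what the paragraph above reports for Intel CPUs rather than anything Intel's manuals guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of `bsf r32, r32` as described above for Intel hardware.
// dst64 is the full 64-bit destination register before the instruction,
// src is the 32-bit source operand. Returns the 64-bit register afterwards.
inline std::uint64_t model_bsf32(std::uint64_t dst64, std::uint32_t src)
{
    if (src == 0)
        return dst64;              // ZF=1, all 64 bits left untouched (no truncation)

    std::uint32_t index = 0;
    while (((src >> index) & 1u) == 0)
        ++index;                   // find the lowest set bit
    assert(index < 32);
    return index;                  // a normal 32-bit write zero-extends into the 64-bit register
}
```

With this model, `model_bsf32(~0ull, 0)` keeps all 64 bits set, while `model_bsf32(~0ull, 8)` yields 3, matching the RAX=-1 and RAX=3 cases in the text.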

IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order to not break existing widely-used code (e.g. Windows), which might be how this started.

At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.

Skylake dropped the false dependency for lzcnt and tzcnt (and a later uarch dropped the false dep for popcnt) while still preserving the dependency for bsr/bsf. (Why does breaking the "output dependency" of LZCNT matter?)

Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour even though it's guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren't careful.

Unfortunately, on Intel hardware (before the fixes mentioned earlier), the popcnt / lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple years old, since it was only recently discovered).

Unless you know the input was non-zero, you need to check the intrinsic's return value to make sure index is valid, e.g.:

if(_BitScanReverse64(&idx, input)) {
    // idx is valid.
    // (MS docs say "Index was set")
} else {
    // input was zero, idx holds garbage.
    // (MS docs don't say Index was even set)
    idx = -1;     // might make sense, one lower than the result for bsr(1)
}
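The check above can be folded into a small helper so callers never read an indeterminate index. This is a minimal sketch: the name `highest_set_bit` and the -1 sentinel are illustrative choices (the sentinel follows the suggestion in the snippet above), and the non-MSVC branch assumes a GCC/Clang-style `__builtin_clzll`, which is itself undefined for a zero argument.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>   // _BitScanReverse64 (x64 targets only)
#endif

// Hypothetical helper: index of the highest set bit, or -1 if mask == 0.
inline int highest_set_bit(std::uint64_t mask)
{
#if defined(_MSC_VER)
    unsigned long idx;                        // output-only, so no initializer needed
    if (!_BitScanReverse64(&idx, mask))
        return -1;                            // idx is garbage here; do not read it
#else
    if (mask == 0)
        return -1;                            // __builtin_clzll(0) is undefined
    unsigned long idx = 63u - static_cast<unsigned>(__builtin_clzll(mask));
#endif
    assert(idx < 64);
    return static_cast<int>(idx);
}
```

In the question's program, `highest_set_bit(0x0ull)` would return -1 regardless of optimization level, because the result for a zero input no longer depends on what the intrinsic left in the index variable.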


If you want to avoid this extra check branch, you can use the lzcnt instruction via different intrinsics if you're targeting new enough hardware (e.g. Intel Haswell or AMD Bulldozer IIRC). It "works" even when the input is all-zero, and actually counts leading zeros instead of returning the index of the highest set bit.
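The relationship between a leading-zero count and the bsr-style index can be sketched as follows. The loop is a portable stand-in that returns the same value lzcnt produces for a 64-bit operand (including 64 for a zero input); it is not the hardware instruction itself, and both function names are hypothetical. On MSVC the real instruction should be reachable through the `__lzcnt64` intrinsic, but that only gives correct results on CPUs that actually support lzcnt.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a 64-bit lzcnt: number of leading zero bits, 64 for x == 0.
inline unsigned count_leading_zeros64(std::uint64_t x)
{
    unsigned n = 0;
    for (std::uint64_t bit = 1ull << 63; bit != 0 && (x & bit) == 0; bit >>= 1)
        ++n;                       // stop at the first set bit, or after all 64 bits
    assert(n <= 64);
    return n;
}

// Convert the count into a bsr-style index: 63 - lzcnt, which is -1 for x == 0.
inline int highest_set_bit_from_lzcnt(std::uint64_t x)
{
    return 63 - static_cast<int>(count_leading_zeros64(x));
}
```

Because the count is well defined for zero, the conversion needs no branch on the input: a zero input simply maps to -1.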
