在 32 位数字中查找第一个(最低)设置位的位置 [英] Find position of first (lowest) set bit in 32-bit number
问题描述
我需要在一个 32 位数字中得到一个 1 位数字,其中只有一个 1 位(总是).C++ 或 asm 中最快的方法.
I need to get a 1-bit number in a 32-bit number, in which there is only one 1-bit (always). The fastest way in C ++ or asm.
例如
input: 0x00000001, 0x10000000
output: 0, 28
推荐答案
#ifdef __GNUC__
,使用 __builtin_ctz(unsigned)
来计算尾随零(GCC 手册).GCC、clang 和 ICC 在所有目标 ISA 上都支持它.(在没有本地指令的 ISA 上,它将调用 GCC 辅助函数.)
#ifdef __GNUC__
, use __builtin_ctz(unsigned)
to Count Trailing Zeros (GCC manual). GCC, clang, and ICC all support it on all target ISAs. (On ISAs where there's no native instruction, it will call a GCC helper function.)
Leading vs. Trailing 是按打印顺序写入时,MSB 优先,就像 8 位二进制 00000010
有 6 个前导零和一个尾随零.(当转换为 32 位二进制时,将有 24+6 = 30 个前导零.)
Leading vs. Trailing is when written in printing order, MSB-first, like 8-bit binary 00000010
has 6 leading zeros and one trailing zero. (And when cast to 32-bit binary, will have 24+6 = 30 leading zeros.)
对于 64 位整数,使用 __builtin_ctzll(unsigned long long)
.不幸的是,GNU C 位扫描内置函数不采用固定宽度类型(尤其是前导 零版本),但 unsigned
在 x86 的 GNU C 上始终是 32 位的(虽然不适用于 AVR 或 MSP430).unsigned long long
在我知道的所有 GNU C 目标上总是 uint64_t
.
For 64-bit integers, use __builtin_ctzll(unsigned long long)
. It's unfortunate that GNU C bitscan builtins don't take fixed-width types (especially the leading zeros versions), but unsigned
is always 32-bit on GNU C for x86 (although not for AVR or MSP430). unsigned long long
is always uint64_t
on all GNU C targets I'm aware of.
在 x86 上,它将编译为 bsf
或 tzcnt
取决于调整 + 目标选项. tzcnt
在现代 Intel 上是一个具有 3 个周期延迟的单个 uop,而在 AMD 上只有 2 个具有 2 个周期延迟的 uop(也许有点反转以提供 lzcnt uop?)https://agner.org/optimize//https://uops.info/.无论哪种方式,它都由快速硬件直接支持,并且比您在纯 C++ 中所做的任何事情都要快.与 x * 1234567
的成本大致相同(在 Intel CPU 上,bsf
/tzcnt
与 imul r, r,imm
,在前端 uops、后端端口和延迟中.)
On x86, it will compile to bsf
or tzcnt
depending on tuning + target options. tzcnt
is a single uop with 3 cycle latency on modern Intel, and only 2 uops with 2 cycle latency on AMD (perhaps a bit-reverse to feed an lzcnt uop?) https://agner.org/optimize/ / https://uops.info/. Either way it's directly supported by fast hardware, and is much faster than anything you can do in pure C++. About the same cost as x * 1234567
(on Intel CPUs, bsf
/tzcnt
has the same cost as imul r, r, imm
, in front-end uops, back-end port, and latency.)
对于没有设置位的输入,内置函数有未定义的行为,如果它可以作为 bsf
运行,则允许它避免任何额外的检查.
The builtin has undefined behaviour for inputs with no bits set, allowing it to avoid any extra checks if it might run as bsf
.
在其他编译器(特别是 MSVC)中,您可能需要 TZCNT 的内在函数,例如 immintrin.h<中的
_mm_tzcnt_32
/代码>.(英特尔内在函数指南).或者,您可能需要为非 SIMD 内部函数包含 intrin.h
(MSVC) 或 x86intrin.h
.
In other compilers (specifically MSVC), you might want an intrinsic for TZCNT, like _mm_tzcnt_32
from immintrin.h
. (Intel intrinsics guide). Or you might need to include intrin.h
(MSVC) or x86intrin.h
for non-SIMD intrinsics.
与 GCC/clang 不同,MSVC 不会阻止您将内部函数用于 ISA 扩展,而您尚未启用编译器自行使用.
Unlike GCC/clang, MSVC doesn't stop you from using intrinsics for ISA extensions you haven't enabled for the compiler to use on its own.
MSVC 也有用于实际 BSF/BSR 的 _BitScanForward
/_BitScanReverse
,但 AMD 保证(和 Intel 也实现)的离开目的地未修改行为仍然没有暴露通过这些内在函数,尽管它们有指针输出 API.
MSVC also has _BitScanForward
/ _BitScanReverse
for actual BSF/BSR, but the leave-destination-unmodified behaviour that AMD guarantees (and Intel also implements) is still not exposed by these intrinsics, despite their pointer-output API.
- VS:_BitScanReverse64 内在的意外优化行为 - 假定始终写入指针输出:/
- _BitScanForward _BitScanForward64 缺失 (VS2017) Snappy - 正确的标头
- 如何使用 MSVC 内在函数来获得与此 GCC 代码等效的代码?
- VS: unexpected optimization behavior with _BitScanReverse64 intrinsic - pointer-output is assumed to always be written :/
- _BitScanForward _BitScanForward64 missing (VS2017) Snappy - correct headers
- How to use MSVC intrinsics to get the equivalent of this GCC code?
TZCNT 在没有 BMI1 的 CPU 上将 解码为 BSF,因为它的机器代码编码是 rep bsf
.它们对非零输入给出相同的结果,因此编译器可以并且总是只使用 tzcnt
因为这在 AMD 上要快得多.(它们在 Intel 上的速度相同,因此没有缺点.在 Skylake 和更高版本上,tzcnt 没有错误的输出依赖性.BSF 这样做是因为它在 input=0 时保持其输出未修改.
TZCNT decode as BSF on CPUs without BMI1 because its machine-code encoding is rep bsf
. They give identical results for non-zero inputs, so compilers can and do always just use tzcnt
because that's much faster on AMD. (They're the same speed on Intel so no downside. And on Skylake and later, tzcnt has no false output dependency. BSF does because it leaves its output unmodified for input=0).
(bsr
与 lzcnt
的情况不太方便:bsr 返回位索引,lzcnt 返回前导零计数.因此为了在 AMD 上获得最佳性能,你需要知道你的代码只会在支持 BMI1/TBM 的 CPU 上运行,所以编译器可以使用 lzcnt
)
(The situation is less convenient for bsr
vs. lzcnt
: bsr returns the bit-index, lzcnt returns the leading-zero count. So for best performance on AMD, you need to know that your code will only run on CPUs supporting BMI1 / TBM so the compiler can use lzcnt
)
请注意,如果设置了 1 位,则从任一方向扫描都会找到相同的位.所以 31 - lzcnt = bsr
在这种情况下与 bsf = tzcnt
相同.如果移植到另一个只有前导零计数且没有位反转指令的 ISA,可能很有用.
Note that with exactly 1 bit set, scanning from either direction will find the same bit. So 31 - lzcnt = bsr
is the same in this case as bsf = tzcnt
. Possibly useful if porting to another ISA that only has leading-zero count and no bit-reverse instruction.
相关:
- 为什么打破输出依赖" 现代编译器通常知道打破 lzcnt/tzcnt/popcnt 的错误依赖.bsf/bsr 也有,我认为 GCC 在这方面也很聪明,但具有讽刺意味的是可能不是.
- x86 bsr/bsf 如何具有固定延迟,而不依赖于数据?它不是像伪代码显示的那样循环遍历位吗? - 伪代码不是硬件实现.
- https://en.wikipedia.org/wiki/Find_first_set 有更多关于比特扫描的信息跨 ISA 的功能.包括 POSIX
ffs()
它返回一个基于 1 的索引,并且必须做额外的工作来解释输入为 0 的可能性.
- Why does breaking the "output dependency" of LZCNT matter? modern compilers generally know to break the false dependency for lzcnt/tzcnt/popcnt. bsf/bsr have one, too, and I think GCC is also smart about that, but ironically might not be.
- How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows? - the pseudocode is not the hardware implementation.
- https://en.wikipedia.org/wiki/Find_first_set has more about bitscan functions across ISAs. Including POSIX
ffs()
which returns a 1-based index and has to do extra work to account for the possibility of the input being 0.
编译器确实识别 ffs()
并像内置函数一样内联它(就像他们对 memcpy 或 sqrt 所做的那样),但并不总是设法优化掉他们的固定序列为实现而做的所有工作当你真正想要一个基于 0 的索引时.告诉编译器只设置了 1 位尤其困难.
Compilers do recognize ffs()
and inline it like a builtin (like they do for memcpy or sqrt), but don't always manage to optimize away all the work their canned sequence does to implement it when you actually want a 0-based index. It's especially hard to tell the compiler there's only 1 bit set.
这篇关于在 32 位数字中查找第一个(最低)设置位的位置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!