将低于最高有效位的所有位归零的最有效方法是什么? [英] What is the most efficient way to zero all bits below the most significant set bit?
问题描述
因此对于以下序列: 0001000111000
So for the following sequence: 0001000111000
所需的结果将是: 0001000000000
The desired result would be: 0001000000000
我完全意识到,这可以通过以下方法来实现:找到带有程序集BSRL的MSB索引(或类似的比特扭曲hack),然后>>将数字按(索引-1)进行位移位,然后<<向后移动(索引-1), 但是我想知道,具体来说,是否存在汇编指令或具有更好性能的指令序列,而不是可以做到这一点的令人费解的黑客程序.
I am fully aware that this is achievable by finding the MSB's index with assembly BSRL (or similar bit-twiddling hack) then >> bit shifting the number by (index - 1), then << shifting back by (index - 1), but I want to know whether there is, specifically, an assembly instruction or a sequence of instructions with better performance, not a bit-twiddling hack that can do this.
推荐答案
没有单个指令可以做到这一点. BMI1 blsi dst,src
可以隔离最低设置位,而不是最高的.即x & -x
.如果x86具有blsi
的位反转版本,我们可以使用它,但不能.
There's no single instruction that can do this. BMI1 blsi dst,src
can isolate the lowest set bit, not the highest. i.e. x & -x
. If x86 had a bit-reversed version of blsi
, we could use that, but it doesn't.
但是您可以做得比您建议的要好.对于位扫描和移位,全零输入始终是一种特殊情况.否则,我们的输出将设置为1位. 1 << bsr(input)
.
But you can do much better than what you were suggesting. An all-zero input is always going to be a special case for bit-scan and shift. Otherwise our output has exactly 1 bit set. It's 1 << bsr(input)
.
;; input: x in RDI
;; output: result in RAX
isolate_msb:
xor eax, eax ; tmp = 0
bsr rdi, rdi ; edi = bit index of MSB in input
jz .input_was_zero
bts rax, rdi ; rax |= 1<<edi
.input_was_zero: ; return 0 for input=0
ret
很明显,对于32位输入,仅使用32位寄存器.如果不可能为零,请省略JZ.使用BSR代替LZCNT给我们一个位索引,而不是31-bitidx,因此我们可以直接使用它.但是LZCNT在AMD上明显更快.
Obviously for 32-bit inputs, use only 32-bit registers. And if zero is not possible, omit the JZ. Using BSR instead of LZCNT gives us a bit-index, not 31-bitidx, so we can use it directly. But LZCNT is significantly faster on AMD.
异或归零不在关键路径上,以便为BTS准备输入. xor-zero + BTS是在Intel CPU上实现1<<n
的最有效方法.在AMD上这是2微秒,延迟为2c,所以mov rax,1
/shl rax,cl
在那里会更好.但是对Intel来说更糟,因为除非使用BMI2 shlx
,否则可变计数移位为3微秒.
The xor-zeroing is off the critical path, to prepare an input for BTS. xor-zero + BTS is the most efficient way to implement 1<<n
on Intel CPUs. It's 2 uops with 2c latency on AMD, so mov rax,1
/ shl rax,cl
would be better there. But worse on Intel because variable-count shifts are 3 uops, unless you use BMI2 shlx
.
无论如何,这里的实际工作是BSR + BTS,因此Intel SnB系列产品的延迟为3个周期+ 1个周期. ( https://agner.org/optimize/)
Anyway, the real work here is BSR + BTS, so that's 3 cycle + 1 cycle latency on Intel SnB-family. (https://agner.org/optimize/)
unsigned isolate_msb32(unsigned x) {
unsigned bitidx = BSR32(x);
//return 1ULL << bitidx; // if x is definitely non-zero
return x ? 1U << bitidx : x;
}
unsigned isolate_msb64(uint64_t x) {
unsigned bitidx = BSR64(x);
return x ? 1ULL << bitidx : x;
}
其中BSR32
是根据编译器支持的内在函数定义的.这是棘手的地方,特别是如果您想要64位版本.没有任何可移植的内在函数. GNU C提供了count-leading-zeros内在函数,但是GCC和ICC在将63-__builtin_clzll(x)
优化回BSR方面很烂.相反,他们两次否定.有专门用于BSR的 内置程序,但是与MSVC和支持GNU扩展(gcc/clang/ICC)的编译器相比,它们甚至更具有编译器特定性.
Where BSR32
is defined in terms of an intrinsic your compiler supports. This is where things get tricky, especially if you want a 64-bit version. There's no single portable intrinsic. GNU C provides count-leading-zeros intrinsics, but GCC and ICC suck at optimizing 63-__builtin_clzll(x)
back into just BSR. Instead they negate twice. There are builtins for BSR specifically, but those are even more compiler-specific than just MSVC vs. compilers that support GNU extensions (gcc/clang/ICC).
#include <stdint.h>
// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
#ifdef __INTEL_COMPILER
typedef unsigned int bsr_idx_t;
#else
#include <intrin.h> // MSVC
typedef unsigned long bsr_idx_t;
#endif
static inline
unsigned BSR32(unsigned long x){
bsr_idx_t idx;
_BitScanReverse(&idx, x); // ignore bool retval
return idx;
}
static inline
unsigned BSR64(uint64_t x) {
bsr_idx_t idx;
_BitScanReverse64(&idx, x); // ignore bool retval
return idx;
}
#elif defined(__GNUC__)
#ifdef __clang__
static inline unsigned BSR64(uint64_t x) {
return 63-__builtin_clzll(x);
// gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
}
#else
#define BSR64 __builtin_ia32_bsrdi
#endif
#include <x86intrin.h>
#define BSR32(x) _bit_scan_reverse(x)
#endif
在Godbolt编译探险,铛和ICC编译此branchlessly ,即使他们不知道x
不为零.
On the Godbolt compiler explorer, clang and ICC compile this branchlessly, even when they don't know that x
is non-zero.
所有4个编译器均未使用bts
来实现1<<bit
. :(在Intel上非常便宜.
All 4 compilers fail to use bts
to implement 1<<bit
. :( It's very cheap on Intel.
# clang7.0 -O3 -march=ivybridge (for x86-64 System V)
# with -march=haswell and later it uses lzcnt and has to negate. /sigh.
isolate_msb32(unsigned int):
bsr ecx, edi
mov eax, 1
shl rax, cl
test edi, edi
cmove eax, edi # return 1<<bsr(x) or x (0) if x was zero
ret
GCC和MSVC生成分支代码.例如
GCC and MSVC make branchy code. e.g.
# gcc8.2 -O3 -march=haswell
mov eax, edi
test edi, edi
je .L6
bsr eax, edi
mov edi, 1
shlx rax, rdi, rax # BMI2: 1 uop instead of 3 for shl rax,cl
.L6:
ret
这篇关于将低于最高有效位的所有位归零的最有效方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!