将低于最高有效位的所有位归零的最有效方法是什么? [英] What is the most efficient way to zero all bits below the most significant set bit?

查看:234
本文介绍了将低于最高有效位的所有位归零的最有效方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

因此对于以下序列: 0001000111000

So for the following sequence: 0001000111000

所需的结果将是: 0001000000000

The desired result would be: 0001000000000

我完全意识到,这可以通过以下方法来实现:找到带有程序集BSRL的MSB索引(或类似的比特扭曲hack),然后>>将数字按(索引-1)进行位移位,然后<<向后移动(索引-1), 但是我想知道,具体来说,是否存在汇编指令或具有更好性能的指令序列,而不是可以做到这一点的令人费解的黑客程序.

I am fully aware that this is achievable by finding the MSB's index with assembly BSRL (or similar bit-twiddling hack) then >> bit shifting the number by (index - 1), then << shifting back by (index - 1), but I want to know whether there is, specifically, an assembly instruction or a sequence of instructions with better performance, not a bit-twiddling hack that can do this.

推荐答案

没有单个指令可以做到这一点. BMI1 blsi dst,src 可以隔离最低设置位,而不是最高的.即x & -x.如果x86具有blsi的位反转版本,我们可以使用它,但不能.

There's no single instruction that can do this. BMI1 blsi dst,src can isolate the lowest set bit, not the highest. i.e. x & -x. If x86 had a bit-reversed version of blsi, we could use that, but it doesn't.

但是您可以做得比您建议的要好.对于位扫描和移位,全零输入始终是一种特殊情况.否则,我们的输出将设置为1位. 1 << bsr(input).

But you can do much better than what you were suggesting. An all-zero input is always going to be a special case for bit-scan and shift. Otherwise our output has exactly 1 bit set. It's 1 << bsr(input).

;; input: x in RDI
;; output: result in RAX
isolate_msb:
    xor   eax, eax           ; tmp = 0
    bsr   rdi, rdi           ; edi = bit index of MSB in input
    jz    .input_was_zero
    bts   rax, rdi           ; rax |= 1<<edi

.input_was_zero:             ; return 0 for input=0
    ret

很明显,对于32位输入,仅使用32位寄存器.如果不可能为零,请省略JZ.使用BSR代替LZCNT给我们一个位索引,而不是31-bitidx,因此我们可以直接使用它.但是LZCNT在AMD上明显更快.

Obviously for 32-bit inputs, use only 32-bit registers. And if zero is not possible, omit the JZ. Using BSR instead of LZCNT gives us a bit-index, not 31-bitidx, so we can use it directly. But LZCNT is significantly faster on AMD.

异或归零不在关键路径上,以便为BTS准备输入. xor-zero + BTS是在Intel CPU上实现1<<n的最有效方法.在AMD上这是2微秒,延迟为2c,所以mov rax,1/shl rax,cl在那里会更好.但是对Intel来说更糟,因为除非使用BMI2 shlx,否则可变计数移位为3微秒.

The xor-zeroing is off the critical path, to prepare an input for BTS. xor-zero + BTS is the most efficient way to implement 1<<n on Intel CPUs. It's 2 uops with 2c latency on AMD, so mov rax,1 / shl rax,cl would be better there. But worse on Intel because variable-count shifts are 3 uops, unless you use BMI2 shlx.

无论如何,这里的实际工作是BSR + BTS,因此Intel SnB系列产品的延迟为3个周期+ 1个周期. ( https://agner.org/optimize/)

Anyway, the real work here is BSR + BTS, so that's 3 cycle + 1 cycle latency on Intel SnB-family. (https://agner.org/optimize/)

unsigned isolate_msb32(unsigned x) {
    unsigned bitidx = BSR32(x);
    //return 1ULL << bitidx;           // if x is definitely non-zero
    return x ? 1U << bitidx : x;
}

unsigned isolate_msb64(uint64_t x) {
    unsigned bitidx = BSR64(x);
    return x ? 1ULL << bitidx : x;
}

其中BSR32是根据编译器支持的内在函数定义的.这是棘手的地方,特别是如果您想要64位版本.没有任何可移植的内在函数. GNU C提供了count-leading-zeros内在函数,但是GCC和ICC在将63-__builtin_clzll(x)优化回BSR方面很烂.相反,他们两次否定.有专门用于BSR的 内置程序,但是与MSVC和支持GNU扩展(gcc/clang/ICC)的编译器相比,它们甚至更具有编译器特定性.

Where BSR32 is defined in terms of an intrinsic your compiler supports. This is where things get tricky, especially if you want a 64-bit version. There's no single portable intrinsic. GNU C provides count-leading-zeros intrinsics, but GCC and ICC suck at optimizing 63-__builtin_clzll(x) back into just BSR. Instead they negate twice. There are builtins for BSR specifically, but those are even more compiler-specific than just MSVC vs. compilers that support GNU extensions (gcc/clang/ICC).

#include <stdint.h>

// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
    #ifdef __INTEL_COMPILER
        typedef unsigned int bsr_idx_t;
    #else
        #include <intrin.h>   // MSVC
        typedef unsigned long bsr_idx_t;
    #endif

    static inline
    unsigned BSR32(unsigned long x){
        bsr_idx_t idx;
        _BitScanReverse(&idx, x); // ignore bool retval
        return idx;
    }
    static inline
    unsigned BSR64(uint64_t x) {
        bsr_idx_t idx;
        _BitScanReverse64(&idx, x); // ignore bool retval
        return idx;
    }
#elif defined(__GNUC__)

  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
      // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

    #include <x86intrin.h>
    #define BSR32(x) _bit_scan_reverse(x)

#endif

在Godbolt编译探险,铛和ICC编译此branchlessly ,即使他们不知道x不为零.

On the Godbolt compiler explorer, clang and ICC compile this branchlessly, even when they don't know that x is non-zero.

所有4个编译器均未使用bts来实现1<<bit. :(在Intel上非常便宜.

All 4 compilers fail to use bts to implement 1<<bit. :( It's very cheap on Intel.

# clang7.0 -O3 -march=ivybridge   (for x86-64 System V)
# with -march=haswell and later it uses lzcnt and has to negate.  /sigh.
isolate_msb32(unsigned int):
        bsr     ecx, edi
        mov     eax, 1
        shl     rax, cl
        test    edi, edi
        cmove   eax, edi       # return 1<<bsr(x)  or  x (0) if x was zero
        ret

GCC和MSVC生成分支代码.例如

GCC and MSVC make branchy code. e.g.

# gcc8.2 -O3 -march=haswell
    mov     eax, edi
    test    edi, edi
    je      .L6
    bsr     eax, edi
    mov     edi, 1
    shlx    rax, rdi, rax    # BMI2:  1 uop instead of 3 for shl rax,cl
.L6:
    ret

这篇关于将低于最高有效位的所有位归零的最有效方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆