如何取消设置N个最右边的设置位 [英] How to unset N right-most set bits

查看：99 发布时间：2021/5/16 19:51:31 bit-manipulation intrinsics integer-arithmetic

本文介绍了如何取消设置N个最右边的设置位的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

有一个相对知名的技巧可以取消设置一个最右边的位:

There is a relatively well-known trick for unsetting a single right-most bit:

y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)

我发现自己有一个死循环来清除n个最右边的位，但是有没有更简单的代数技巧?

I'm finding myself with a tight loop to clear n right-most bits, but is there a simpler algebraic trick?

假定相对较大的n(对于64位整数，n必须小于64，但通常约为20-30).

Assume relatively large n (n has to be <64 for 64bit integers, but it's often on the order of 20-30).

// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000

我几次尝试过TAOCP Vol4，但找不到任何灵感.

I've thumbed my TAOCP Vol4a few times, but can't find any inspiration.

也许有一些硬件支持?

推荐答案

对于具有BMI2的Intel x86 CPU， pext 和 pdep 很快. AMD的PEXT/PDEP微代码非常慢( https://uops.info/)所以要小心在AMD上，其他选项可能更快，甚至可能在循环中甚至 blsi ，或者在popcount上进行更好的二进制搜索(请参见下文).
只有Intel具有用于pext/pdep的受掩码控制的打包/解压缩的专用硬件执行单元，从而使其保持恒定时间:1 uop，3个周期的延迟，只能在端口1上运行.

For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD has very slow microcoded PEXT/PDEP (https://uops.info/) so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better a binary-search on popcount (see below).
Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making it constant-time: 1 uop, 3 cycle latency, can only run on port 1.

我不知道其他ISA具有类似的位打包硬件操作.

I'm not aware of other ISAs having a similar bit-packing hardware operation.

pdep 基础: pdep(-1ULL，a)== a .从第一个操作数中取低popcnt(a)位，并将其存放在 a 设置了位的位置，将再次给您 a .

pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.

但是，如果您的位源清除了低N位，而不是全1，则 a 中的前N个置位位将获取0而不是1.这正是您所需要的想要.

But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.

uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
    return _pdep_u64(-1ULL << n, a);
}

-1ULL<<n 适用于C中的n = 0..63.x86 asm标量移位指令掩盖了它们的计数(有效地& 63 )，所以可能对于较大的 n 的C undefined-行为会发生.如果需要的话，请在源代码中使用 n& 63 ，以便在C语言中明确定义该行为，并且仍然可以将其编译为直接使用计数的移位指令.

-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.

在Godbolt上，它具有简单的循环引用实现，表明它们产生的结果与示例相同输入 a 和 n .

On Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.

GCC和clang都按照明显的方式编译它，如所写:

GCC and clang both compile it the obvious way, as written:

# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
        mov     rax, -1
        shlx    rax, rax, rsi
        pdep    rax, rax, rdi
        ret

(SHLX是单uop，有1个周期的延迟，与更新FLAGS的传统变量计数移位不同...除非CL = 0，否则

(SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0)

因此，这从 a ->输出(只是pdep)具有3个周期的延迟.
和来自 n ->输出(shlx，pdep)的4个周期延迟.

So this has 3 cycle latency from a->output (just pdep)
and 4 cycle latency from n->output (shlx, pdep).

而且前端只有3块.

一个半相关的BMI2技巧:

A semi-related BMI2 trick:

pext(a，a)将在底部打包这些位，例如(1ULL<< popcnt(a))-1 ，但如果所有位都已设置，则不会溢出.

pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.

使用AND掩码清除其低N位，然后使用 pdep 进行扩展即可.但这是一种过于复杂的昂贵方法，用于创建具有大于N个零的足够位数的位源，而这对于pdep而言实际上是最重要的.感谢@harold在此答案的第一个版本中发现了这一点.

Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.

@Nate建议对要清除的低位数进行二进制搜索，这可能是pdep的一个很好的选择.

@Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.

在 popcount(x>> c)== popcount(x)-N 时停止，以找出要清除的低位，最好使用 c .(例如 c = foo?a:b 通常会编译为cmov).


Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c.  (e.g. c = foo ? a : b often compiles to cmov).
完成搜索后， x&(-1ULL<< c)使用该计数，或者只是 tmp<<c 移回您已经拥有的 x>> c 结果.直接使用右移比生成新的遮罩并在每次迭代中都使用它要便宜.
Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have.  Using right-shift directly is cheaper than generating a new mask and using it every iteration.
高性能popcount在现代CPU上相对广泛地可用.(尽管x86-64的 not 基线；您仍然需要使用 -mpopcnt 或 -march = native 进行编译.)
High-performance popcount is relatively widely available on modern CPUs.  (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).
调整此参数可能涉及选择一个可能的起点，并且可能使用最大的初始步长而不是纯二进制搜索.通过尝试一些初步的猜测而获得一些指令级的并行性，可能有助于缩短延迟瓶颈.
Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search.  Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.

                        这篇关于如何取消设置N个最右边的设置位的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

如何取消设置N个最右边的设置位 [英] How to unset N right-most set bits

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何取消设置N个最右边的设置位 [英] How to unset N right-most set bits

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭