如何取消设置N个最右边的设置位 [英] How to unset N right-most set bits

查看:99
本文介绍了如何取消设置N个最右边的设置位的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有一个相对知名的技巧可以取消设置一个最右边的位:

There is a relatively well-known trick for unsetting a single right-most bit:

y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)

我发现自己有一个死循环来清除n个最右边的位,但是有没有更简单的代数技巧?

I'm finding myself with a tight loop to clear n right-most bits, but is there a simpler algebraic trick?

假定相对较大的n(对于64位整数,n必须小于64,但通常约为20-30).

Assume relatively large n (n has to be <64 for 64bit integers, but it's often on the order of 20-30).

// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000

我几次尝试过TAOCP Vol4,但找不到任何灵感.

I've thumbed my TAOCP Vol4a few times, but can't find any inspiration.

也许有一些硬件支持?

推荐答案

对于具有BMI2的Intel x86 CPU, pext pdep 很快. AMD的PEXT/PDEP微代码非常慢( https://uops.info/)所以要小心在AMD上,其他选项可能更快,甚至可能在循环中甚至 blsi ,或者在popcount上进行更好的二进制搜索(请参见下文).
只有Intel具有用于pext/pdep的受掩码控制的打包/解压缩的专用硬件执行单元,从而使其保持恒定时间:1 uop,3个周期的延迟,只能在端口1上运行.

For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD has very slow microcoded PEXT/PDEP (https://uops.info/) so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better a binary-search on popcount (see below).
Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making it constant-time: 1 uop, 3 cycle latency, can only run on port 1.

我不知道其他ISA具有类似的位打包硬件操作.

I'm not aware of other ISAs having a similar bit-packing hardware operation.

pdep 基础: pdep(-1ULL,a)== a .从第一个操作数中取低popcnt(a)位,并将其存放在 a 设置了位的位置,将再次给您 a .

pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.

但是,如果您的位源清除了低N位,而不是全1,则 a 中的前N个置位位将获取0而不是1.这正是您所需要的想要.

But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.

uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
    return _pdep_u64(-1ULL << n, a);
}

-1ULL<<n 适用于C中的n = 0..63.x86 asm标量移位指令掩盖了它们的计数(有效地& 63 ),所以可能对于较大的 n 的C undefined-行为会发生.如果需要的话,请在源代码中使用 n& 63 ,以便在C语言中明确定义该行为,并且仍然可以将其编译为直接使用计数的移位指令.

-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.

在Godbolt上,它具有简单的循环引用实现,表明它们产生的结果与示例相同输入 a n .

On Godbolt with a simple looping reference implementation, showing that they produce the same result for a sample input a and n.

GCC和clang都按照明显的方式编译它,如所写:

GCC and clang both compile it the obvious way, as written:

# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
        mov     rax, -1
        shlx    rax, rax, rsi
        pdep    rax, rax, rdi
        ret

(SHLX是单uop,有1个周期的延迟,与更新FLAGS的传统变量计数移位不同...除非CL = 0,否则

(SHLX is single-uop, 1 cycle latency, unlike legacy variable-count shifts that update FLAGS... except if CL=0)

因此,这从 a ->输出(只是pdep)具有3个周期的延迟.
和来自 n ->输出(shlx,pdep)的4个周期延迟.

So this has 3 cycle latency from a->output (just pdep)
and 4 cycle latency from n->output (shlx, pdep).

而且前端只有3块.

一个半相关的BMI2技巧:

A semi-related BMI2 trick:

pext(a,a)将在底部打包这些位,例如(1ULL<< popcnt(a))-1 ,但如果所有位都已设置,则不会溢出.

pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.

使用AND掩码清除其低N位,然后使用 pdep 进行扩展即可.但这是一种过于复杂的昂贵方法,用于创建具有大于N个零的足够位数的位源,而这对于pdep而言实际上是最重要的.感谢@harold在此答案的第一个版本中发现了这一点.

Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.

@Nate建议对要清除的低位数进行二进制搜索,这可能是pdep的一个很好的选择.

@Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.

popcount(x>> c)== popcount(x)-N 时停止,以找出要清除的低位,最好使用 c .(例如 c = foo?a:b 通常会编译为cmov).

Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).

完成搜索后, x&(-1ULL<< c)使用该计数,或者只是 tmp<<c 移回您已经拥有的 x>> c 结果.直接使用右移比生成新的遮罩并在每次迭代中都使用它要便宜.

Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.

高性能popcount在现代CPU上相对广泛地可用.(尽管x86-64的 not 基线;您仍然需要使用 -mpopcnt -march = native 进行编译.)

High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).

调整此参数可能涉及选择一个可能的起点,并且可能使用最大的初始步长而不是纯二进制搜索.通过尝试一些初步的猜测而获得一些指令级的并行性,可能有助于缩短延迟瓶颈.

Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.

这篇关于如何取消设置N个最右边的设置位的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆