我想基于任意掩码打包位 [英] I want to pack the bits based on arbitrary mask

查看:69
本文介绍了我想基于任意掩码打包位的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设数据是1011 1001,掩码是0111 0110,那么您有:

input data:       1011 1001
input mask:       0111 0110
apply mask:       0011 0000 (based on `input mask`)
bits selected:    -011 -00- (based on `input mask`)
right packed:     ---0 1100
expected result:  0000 1100 (set left `8 - popcount('input mask')` bits to zero) 

因此最终输出为0000 1100(请注意,未指定的左侧3个位置为零填充).

您可以看到,input mask中的位为1的任何地方,中选择了中的相应值(在上面的bits selected中),然后所有中都选择了位从结果的最低有效位开始连续打包(如上面的right packed所示).最后,将打包后剩下的所有最左边的位都设置为0(会有8 - popcount(mask)这样的位).

显而易见的选择是旋转并选择,但是由于mask具有5位,因此将消耗5次操作.我可以一步一步做到吗?

注意:

  1. 掩码可以是具有任意nON的任何内容(在上面 例如n=5).您所知道的是ON中的位数 面具和面具本身.掩码会随着nON不断变化.

  2. 在上面的示例中,我使用了8位数据和掩码,但实际上 用法可以是8位,16位,32位,64位和128位.

解决方案

如果您要定位x86,则大多数编译器都会对pdep具有内在的含义(在这里-您想要的特定操作是expand_right-这另一部分可能会引起极大的兴趣,因为它专门涵盖了您要处理字大小的元素的简单情况.

实际上,如果您实际上是在处理8位数据和掩码值,则可以只使用一个预先计算的查找表-一个8位大x 8位= 65k的表,它涵盖了所有{data, mask}组合,并且您可以直接找到答案,也可以是一个256个条目的答案,其中包含所有mask值,并为您提供一些系数,以便进行简单的移位计算或基于乘法的代码.

FWIW,我不确定5条旋转指令如何轻松实现,因为天真的解决方案需要为每个位设置1条旋转指令(无论是否设置)(因此,对于8位字长, 7或8旋转 2 指令).


1 当然,性能原则上取决于硬件,但是取决于实现它的所有主流Intel CPU,它的吞吐量为1周期,延迟为3周期(不确定AMD)./p>

2 只有7个旋转,因为显然可以忽略最低位的"0旋转"操作.

Let's say that data is 1011 1001 and the mask is 0111 0110, then you have:

input data:       1011 1001
input mask:       0111 0110
apply mask:       0011 0000 (based on `input mask`)
bits selected:    -011 -00- (based on `input mask`)
right packed:     ---0 1100
expected result:  0000 1100 (set left `8 - popcount('input mask')` bits to zero) 

So the final output is 0000 1100 (note that the 3 positions on the left which are unspecified are zero-filled).

You can see that wherever the bits in input mask is 1 the corresponding value in input data is selected (in bits selected above) and then all selected bits are packed contiguously started in the least significant bits of the result (as shown in right packed above). Finally, any leftmost bits which are left over after the packing are set to 0 (there will be 8 - popcount(mask) such bits).

Obvious choice is rotate and select but that will consume 5 operations as mask has 5 bits. Can I do this in one step?

Note:

  1. The mask can be anything with arbitrary n bits ON (In above example n=5). All you know is the number of bits that are ON in the mask and the mask itself. Mask will keep on changing with n bits ON.

  2. In above example I have used data and mask of 8-bits but in real usage it can be 8, 16, 32, 64 and 128 bits.

解决方案

If you're targeting x86 most compilers will have an instrinsic for the pdep (parallel bit deposit) instruction which directly performs the operation you want, in hardware, at a rate of 1 per cycle (3 cycles latency)1, on Intel hardware that supports it. For example, gcc offers it as the _pdep_u32 and _pdep_u64 intrinsic functions.

Unfortunately, on AMD Ryzen (the only AMD hardware that supports BMI2) this operation is very slow: one per 18 cycles. You might want to have a separate code-path to support non-Intel platforms if they are important to you.

If you aren't on x86, you can find general purpose implementations of these options here - the specific operation you want is expand_right - and this other section will probably be of great interest in that it specifically covers the simple case where you are dealing with word-sized elements.

In practice, if you are really dealing with 8-bit data and mask values, you might just use a precomputed lookup table - either a big 8 bit x 8 bit = 65k one which covers all {data, mask} combinations and which gives you the answer directly, or a 256-entry one which covers all mask values and gives you some coefficients for a simple bit-shifting calculation or a multiplication-based code.

FWIW, I'm not sure how you can do it easily with 5 rotate instructions, because it seems that the naive solution needs 1 rotate instruction for each bit, whether set or not (so for a word size of 8 bits, 7 or 8 rotate2 instructions).


1 Of course, the performance in principle depends on the hardware, but on all the mainstream Intel CPUs that implement it, it's 1 cycle throughput, 3 cycles latency (not sure about AMD).

2 Only 7 rotates because the "rotate of 0" operation for the lowest order bit can evidently be omitted.

这篇关于我想基于任意掩码打包位的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆