Portable efficient alternative to PDEP without using BMI2?
Question
The documentation for the parallel deposit instruction (PDEP) in Intel's Bit Manipulation Instruction Set 2 (BMI2) describes the following serial implementation for the instruction (C-like pseudocode):
U64 _pdep_u64(U64 val, U64 mask) {
    U64 res = 0;
    for (U64 bb = 1; mask; bb += bb) {
        if (val & bb)
            res |= mask & -mask;
        mask &= mask - 1;
    }
    return res;
}
See also the pdep entry in Intel's instruction set reference manual.
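For experimenting with the loop above, here is a minimal runnable C sketch of the same serial algorithm. The name pdep_serial and the use of uint64_t in place of U64 are assumptions made for this illustration; they are not part of Intel's documentation.

```c
#include <stdint.h>

/* Serial parallel deposit: scatter the low-order bits of val to the
   positions of the 1-bits in mask, from least to most significant.
   pdep_serial is a hypothetical name used for this sketch only. */
uint64_t pdep_serial (uint64_t val, uint64_t mask)
{
    uint64_t res = 0;
    for (uint64_t bb = 1; mask != 0; bb += bb) {
        if (val & bb)
            res |= mask & (0 - mask); /* isolate the lowest 1-bit of mask */
        mask &= mask - 1;             /* clear the lowest 1-bit of mask */
    }
    return res;
}
```

For example, with mask = 0x1a (1-bits at positions 1, 3 and 4) and val = 0x5 (bits 0 and 2 set), bits 0 and 2 of val land on the first and third 1-bits of mask, so the result is 0x12.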
This algorithm is O(n), where n is the number of set bits in mask, which obviously has a worst case of O(k) where k is the total number of bits in mask.
Is a more efficient worst case algorithm possible?
Is it possible to make a faster version that assumes that val has at most one bit set, i.e. either equals 0 or equals 1<<r for some value of r from 0 to 63?
Recommended answer
The second part of the question, about the special case of a 1-bit deposit, requires two steps. In the first step, we need to determine the bit index r of the single 1-bit in val, with a suitable response in case val is zero. This can easily be accomplished via the POSIX function ffs, or skipped entirely if r is already known by other means, as alluded to by the asker in comments. In the second step we need to identify bit index i of the r-th 1-bit in mask, if it exists. We can then deposit the r-th bit of val at bit i.
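As a baseline for the two steps just described, the r-th 1-bit of mask can also be located with a plain loop that clears the lowest 1-bit r times. This is only a minimal sketch for comparison, not the method developed in the rest of this answer; the names nth_set_bit_loop and deposit_single_bit_loop are hypothetical, and r is passed in directly on the assumption that it is already known and lies between 0 and 63.

```c
#include <stdint.h>

/* Return the index of the r-th 1-bit of mask (r >= 0, LSB is index 0),
   or -1 if mask has fewer than r + 1 bits set.
   Hypothetical helper for this sketch. */
int nth_set_bit_loop (uint64_t mask, int r)
{
    while (r-- > 0 && mask != 0)
        mask &= mask - 1;                /* drop the lowest 1-bit */
    if (mask == 0) return -1;
    int i = 0;
    while (((mask >> i) & 1) == 0) i++;  /* locate the lowest remaining 1-bit */
    return i;
}

/* Deposit bit r of val (assumed to be 0 or 1 << r, 0 <= r <= 63)
   at the position of the r-th 1-bit of mask. */
uint64_t deposit_single_bit_loop (uint64_t val, uint64_t mask, int r)
{
    int i = nth_set_bit_loop (mask, r);
    if (i < 0 || ((val >> r) & 1) == 0) return 0;
    return (uint64_t)1 << i;
}
```

With mask = 0x1a, the 1-bits with r = 0, 1, 2 sit at indices 1, 3 and 4, so depositing val = 1<<2 yields 0x10, while r = 3 finds no matching bit and yields 0.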
One way of finding the index of the r-th 1-bit in mask is to tally the 1-bits using a classical population count algorithm based on binary partitioning, and record all of the intermediate group-wise bit counts. We then perform a binary search on the recorded bit-count data to identify the position of the desired bit.
The following C code demonstrates this using 64-bit data. Whether this is actually faster than the iterative method will very much depend on typical values of mask and val.
#define _GNU_SOURCE    /* may be needed to expose ffsll() on glibc */
#include <stdint.h>
#include <string.h>    /* ffsll() is declared here or in <strings.h>, */
#include <strings.h>   /* depending on the C library                  */

/* Find the index of the n-th 1-bit in mask, n >= 0
   The index of the least significant bit is 0
   Return -1 if there is no such bit
*/
int find_nth_set_bit (uint64_t mask, int n)
{
    int t, i = n, r = 0;
    const uint64_t m1 = 0x5555555555555555ULL; // even bits
    const uint64_t m2 = 0x3333333333333333ULL; // even 2-bit groups
    const uint64_t m4 = 0x0f0f0f0f0f0f0f0fULL; // even nibbles
    const uint64_t m8 = 0x00ff00ff00ff00ffULL; // even bytes
    // record the group-wise bit counts at every level of the population count
    uint64_t c1 = mask;
    uint64_t c2 = c1 - ((c1 >> 1) & m1);
    uint64_t c4 = ((c2 >> 2) & m2) + (c2 & m2);
    uint64_t c8 = ((c4 >> 4) + c4) & m4;
    uint64_t c16 = ((c8 >> 8) + c8) & m8;
    uint64_t c32 = (c16 >> 16) + c16;
    int c64 = (int)(((c32 >> 32) + c32) & 0x7f);
    // binary search: skip the lower half of a group if it holds <= i 1-bits
    t = (c32    ) & 0x3f; if (i >= t) { r += 32; i -= t; }
    t = (c16>> r) & 0x1f; if (i >= t) { r += 16; i -= t; }
    t = (c8 >> r) & 0x0f; if (i >= t) { r +=  8; i -= t; }
    t = (c4 >> r) & 0x07; if (i >= t) { r +=  4; i -= t; }
    t = (c2 >> r) & 0x03; if (i >= t) { r +=  2; i -= t; }
    t = (c1 >> r) & 0x01; if (i >= t) { r +=  1;         }
    if (n >= c64) r = -1;
    return r;
}

/* val is either zero or has a single 1-bit.
   Return -1 if val is zero, otherwise the index of the 1-bit
   The index of the least significant bit is 0
*/
int find_bit_index (uint64_t val)
{
    return ffsll (val) - 1;
}

uint64_t deposit_single_bit (uint64_t val, uint64_t mask)
{
    uint64_t res = (uint64_t)0;
    int r = find_bit_index (val);
    if (r >= 0) {
        int i = find_nth_set_bit (mask, r);
        if (i >= 0) res = (uint64_t)1 << i;
    }
    return res;
}