位旋转帮助:扩展位以遵循给定的位掩码 [英] Bit twiddle help: Expanding bits to follow a given bitmask

查看:41
本文介绍了位旋转帮助:扩展位以遵循给定的位掩码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对扩展位"的快速方法感兴趣,该方法可以定义如下:

I'm interested in a fast method for "expanding bits," which can be defined as the following:

  1. B 为具有 n 位的二进制数,即 B \ in {0,1} ^ n
  2. P B 中所有1/true位的位置,即 1<<p [i]&B == 1 和| P | = k
  3. 对于另一个给定数字 A \ in {0,1} ^ k ,令 Ap A 给定 B ,这样 Ap [j] == A [j]<<p [j] .
  4. 位扩展"的结果是 Ap .
  1. Let B be a binary number with n bits, i.e. B \in {0,1}^n
  2. Let P be the position of all 1/true bits in B, i.e. 1 << p[i] & B == 1, and |P|=k
  3. For another given number, A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that Ap[j] == A[j] << p[j].
  4. The result of the "bit expansion" is Ap.

几个例子:

  • 给出 B :00 1 0 111 0, A :0110,然后 Ap 应为00 0 0 110 0
  • 给出 B : 1 00 1 1 00 1 A :1101,则 Ap 应该为 1 00 1 0 00 1
  • Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
  • Given B: 1001 1001, A: 1101, then Ap should be 1001 0001

以下是简单明了的算法,但我不禁感到有一种更快/更轻松的方法可以实现的感觉.

Following is a straightforward algorithm, but I can't help shake the feeling that there's a faster/easier way to do this.

unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
  int k = popcount(B); // cuda function, but there are good methods for this
  unsigned int Ap = 0;
  int j = k-1;
  // Starting at the most significant bit,
  for (int i = n - 1; i >= 0; --i) {
    Ap <<= 1;
    // if B is 1, add the value at A[j] to Ap, decrement j. 
    if (B & (1 << i)) {
      Ap += (A >> j--) & 1;
    }
  }
  return Ap;
}

推荐答案

问题似乎是要对BMI2指令 PDEP 进行CUDA仿真,该指令采用源操作数 a ,并根据掩码 b 的1位的位置存放其位.当前提供的GPU上没有相同或相似操作的硬件支持;也就是说,直到并包括Maxwell架构.

The question appears to be asking for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a, and deposits its bits based on the positions of the 1-bits of a mask b. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.

根据给出的两个示例,我假设掩码 b 通常是稀疏的,并且我们可以通过仅迭代 b .这可能会在GPU上引起分支分歧,但是在不了解特定用例的情况下,性能的确切折衷方案尚不清楚.就目前而言,我假设与掩码发散的负面影响相比,掩码 b 中稀疏性的利用对性能具有更强的积极影响.

I am assuming, based on the two examples given, that the mask b in general is sparse, and that we can minimize work by only iterating over the 1-bits of b. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask b has a stronger positive influence on performance compared to the negative impact of divergence.

在下面的仿真代码中,我减少了对潜在的昂贵"移位操作的使用,而主要依靠简单的ALU指令.在各种GPU上,执行移位指令的吞吐量都比简单的整数算法低.我在代码中保留了一个关键步骤,即偏离关键路径,以避免执行受到算术单元的限制.如果需要,表达式 1U<<i 可以用加法代替:引入变量 m ,该变量在循环之前被初始化为 1 ,并且每次循环都加倍.

In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression 1U << i can be replaced by addition: introduce a variable m that is initialized to 1 before the loop and doubled each time through the loop.

基本思想是依次隔离掩码 b 的每个1位(从最低有效端开始),并将其与 i 的值相加- a 的第一位,并将结果合并到扩展目标中.使用了 b 中的1位之后,我们将其从掩码中删除,并进行迭代,直到掩码变为零为止.

The basic idea is to isolate each 1-bit of mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and incorporate the result into the expanded destination. After a 1-bit from b has been used, we remove it from the mask, and iterate until the mask becomes zero.

为了避免将 a 的第 i 位移到适当位置,我们简单地将其隔离,然后通过简单的取反将其值复制到所有更高有效的位,利用整数的二进制补码表示.

In order to avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0;
    int i;
    for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
    }
    return r;
}

上面没有提到任何移位操作的变体如下所示:

The variant without any shift operations alluded to above looks as follows:

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0, m = 1;
    while (b) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
        m = m + m; // mask for next bit of 'a'
    }
    return r;
}

在下面的评论中,@ Evgeny Kluev在

In comments below, @Evgeny Kluev pointed to a shift-free PDEP emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.

这篇关于位旋转帮助:扩展位以遵循给定的位掩码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆