位旋转帮助:扩展位以遵循给定的位掩码 [英] Bit twiddle help: Expanding bits to follow a given bitmask

查看：41 发布时间：2021/4/15 19:56:20 algorithm cuda bit-manipulation

本文介绍了位旋转帮助:扩展位以遵循给定的位掩码的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我对扩展位"的快速方法感兴趣，该方法可以定义如下:

I'm interested in a fast method for "expanding bits," which can be defined as the following:

让 B 为具有 n 位的二进制数，即 B \ in {0,1} ^ n
让 P 为 B 中所有1/true位的位置，即 1<<p [i]&B == 1 和| P | = k
对于另一个给定数字 A \ in {0,1} ^ k ，令 Ap 为 A 给定 B ，这样 Ap [j] == A [j]<<p [j] .
位扩展"的结果是 Ap .

Let B be a binary number with n bits, i.e. B \in {0,1}^n
Let P be the position of all 1/true bits in B, i.e. 1 << p[i] & B == 1, and |P|=k
For another given number, A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that Ap[j] == A[j] << p[j].
The result of the "bit expansion" is Ap.

几个例子:

给出 B :00 1 0 111 0， A :0110，然后 Ap 应为00 0 0 110 0
给出 B : 1 00 1 1 00 1 ， A :1101，则 Ap 应该为 1 00 1 0 00 1

Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
Given B: 1001 1001, A: 1101, then Ap should be 1001 0001

以下是简单明了的算法，但我不禁感到有一种更快/更轻松的方法可以实现的感觉.

Following is a straightforward algorithm, but I can't help shake the feeling that there's a faster/easier way to do this.

unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
  int k = popcount(B); // cuda function, but there are good methods for this
  unsigned int Ap = 0;
  int j = k-1;
  // Starting at the most significant bit,
  for (int i = n - 1; i >= 0; --i) {
    Ap <<= 1;
    // if B is 1, add the value at A[j] to Ap, decrement j. 
    if (B & (1 << i)) {
      Ap += (A >> j--) & 1;
    }
  }
  return Ap;
}

推荐答案

问题似乎是要对BMI2指令 PDEP 进行CUDA仿真，该指令采用源操作数 a ，并根据掩码 b 的1位的位置存放其位.当前提供的GPU上没有相同或相似操作的硬件支持；也就是说，直到并包括Maxwell架构.

The question appears to be asking for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a, and deposits its bits based on the positions of the 1-bits of a mask b. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.

根据给出的两个示例，我假设掩码 b 通常是稀疏的，并且我们可以通过仅迭代 b .这可能会在GPU上引起分支分歧，但是在不了解特定用例的情况下，性能的确切折衷方案尚不清楚.就目前而言，我假设与掩码发散的负面影响相比，掩码 b 中稀疏性的利用对性能具有更强的积极影响.

I am assuming, based on the two examples given, that the mask b in general is sparse, and that we can minimize work by only iterating over the 1-bits of b. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask b has a stronger positive influence on performance compared to the negative impact of divergence.

在下面的仿真代码中，我减少了对潜在的昂贵"移位操作的使用，而主要依靠简单的ALU指令.在各种GPU上，执行移位指令的吞吐量都比简单的整数算法低.我在代码中保留了一个关键步骤，即偏离关键路径，以避免执行受到算术单元的限制.如果需要，表达式 1U<<i 可以用加法代替:引入变量 m ，该变量在循环之前被初始化为 1 ，并且每次循环都加倍.

In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression 1U << i can be replaced by addition: introduce a variable m that is initialized to 1 before the loop and doubled each time through the loop.

基本思想是依次隔离掩码 b 的每个1位(从最低有效端开始)，并将其与 i 的值相加- a 的第一位，并将结果合并到扩展目标中.使用了 b 中的1位之后，我们将其从掩码中删除，并进行迭代，直到掩码变为零为止.

The basic idea is to isolate each 1-bit of mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and incorporate the result into the expanded destination. After a 1-bit from b has been used, we remove it from the mask, and iterate until the mask becomes zero.

为了避免将 a 的第 i 位移到适当位置，我们简单地将其隔离，然后通过简单的取反将其值复制到所有更高有效的位，利用整数的二进制补码表示.

In order to avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0;
    int i;
    for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
    }
    return r;
}

上面没有提到任何移位操作的变体如下所示:

The variant without any shift operations alluded to above looks as follows:

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0, m = 1;
    while (b) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
        m = m + m; // mask for next bit of 'a'
    }
    return r;
}

在下面的评论中，@ Evgeny Kluev在

In comments below, @Evgeny Kluev pointed to a shift-free PDEP emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.

这篇关于位旋转帮助:扩展位以遵循给定的位掩码的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

位旋转帮助:扩展位以遵循给定的位掩码 [英] Bit twiddle help: Expanding bits to follow a given bitmask

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

位旋转帮助:扩展位以遵循给定的位掩码 [英] Bit twiddle help: Expanding bits to follow a given bitmask

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭