How to pack bits (efficiently) in CUDA?

Views: 353 Published: 2017/3/5 15:18:12 Tags: c++ parallel-processing cuda bit-packing

Question

I have an array of bytes where each byte is either 0 or 1. Now I want to pack these values into bits, so that 8 original bytes occupy 1 target byte, with original byte 0 going into bit 0, byte 1 into bit 1, and so on.

So far I have the following in the kernel:

const uint16_t tid = threadIdx.x;
__shared__ uint8_t packing[cBlockSize];

// ... Computation of the original bytes in packing[tid]
__syncthreads();

if ((tid & 4) == 0)
{
    packing[tid] |= packing[tid | 4] << 4;
}
if ((tid & 6) == 0)
{
    packing[tid] |= packing[tid | 2] << 2;
}
if ((tid & 7) == 0)
{
    pOutput[(tid + blockDim.x*blockIdx.x)>>3] = packing[tid] | (packing[tid | 1] << 1);
}

Is this correct and efficient?
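One way to check the bit layout is to compare against a plain CPU reference. The sketch below is host-side C++, not part of the original post: `pack_bits_reference` and `pack_bits_tree` are hypothetical helper names, and the block offset `blockDim.x*blockIdx.x` is left out. `pack_bits_tree` replays the kernel's three steps under the assumption that every thread finishes a step before any thread starts the next. It therefore checks only the arithmetic, not whether the GPU code synchronizes sufficiently between steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference packer: flag i (0 or 1) goes into bit (i % 8) of output byte i / 8.
std::vector<std::uint8_t> pack_bits_reference(const std::vector<std::uint8_t>& flags) {
    assert(flags.size() % 8 == 0);
    std::vector<std::uint8_t> out(flags.size() / 8, 0);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        out[i / 8] |= static_cast<std::uint8_t>((flags[i] & 1u) << (i % 8));
    }
    return out;
}

// Host emulation of the kernel's three steps. Each loop plays the role of all
// threads executing one step; ASSUMPTION: steps are fully ordered.
std::vector<std::uint8_t> pack_bits_tree(std::vector<std::uint8_t> packing) {
    assert(packing.size() % 8 == 0);
    const std::size_t n = packing.size();
    std::vector<std::uint8_t> out(n / 8, 0);

    // Step 1: fold bytes 4..7 of each group into bit 4 of bytes 0..3.
    for (std::size_t tid = 0; tid < n; ++tid) {
        if ((tid & 4) == 0) {
            packing[tid] = static_cast<std::uint8_t>(packing[tid] | (packing[tid | 4] << 4));
        }
    }
    // Step 2: fold bytes 2..3 into bits 2 and 6 of bytes 0..1.
    for (std::size_t tid = 0; tid < n; ++tid) {
        if ((tid & 6) == 0) {
            packing[tid] = static_cast<std::uint8_t>(packing[tid] | (packing[tid | 2] << 2));
        }
    }
    // Step 3: combine bytes 0 and 1 and write one output byte per group of 8.
    for (std::size_t tid = 0; tid < n; ++tid) {
        if ((tid & 7) == 0) {
            out[tid >> 3] = static_cast<std::uint8_t>(packing[tid] | (packing[tid | 1] << 1));
        }
    }
    return out;
}
```

Under that ordering assumption the two functions agree, which suggests the shifts and masks in the kernel produce the intended layout. On the device, each step reads shared memory written by other threads in the previous step, so ordering between steps would still have to be guaranteed separately.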

Recommended answer

The __ballot() warp-voting function comes in quite handy for this. Assuming that you can redefine pOutput to be of uint32_t type, and that your block size is a multiple of the warp size (32):

unsigned int target = __ballot(packing[tid]);
if (tid % warpSize == 0) {
    pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
}

Strictly speaking, the if conditional isn't even necessary, as all threads of the warp will write the same data to the same address. So a highly optimized version would just be:

pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = __ballot(packing[tid]);
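To see why a ballot yields the requested layout, here is a small host-side C++ emulation. It is a sketch rather than device code: `pack_bits_ballot`, `byte_of_word` and `kWarpSize` are hypothetical names introduced for illustration. A ballot sets bit N of its result when the predicate of lane N is non-zero, so 32 consecutive flags land in one 32-bit word. Note that on CUDA 9 and later the vote intrinsic is spelled `__ballot_sync(mask, predicate)`, for example `__ballot_sync(0xffffffff, packing[tid])` for a full warp, and the unsuffixed `__ballot()` is deprecated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors CUDA's warpSize of 32 (assumption: one flag per thread, 1D blocks).
constexpr std::size_t kWarpSize = 32;

// Host emulation of a per-warp ballot: bit `lane` of each output word is set
// when the flag held by that lane is non-zero.
std::vector<std::uint32_t> pack_bits_ballot(const std::vector<std::uint8_t>& flags) {
    assert(flags.size() % kWarpSize == 0);
    std::vector<std::uint32_t> out(flags.size() / kWarpSize, 0);
    for (std::size_t tid = 0; tid < flags.size(); ++tid) {
        if (flags[tid] != 0) {
            out[tid / kWarpSize] |= std::uint32_t{1} << (tid % kWarpSize);
        }
    }
    return out;
}

// Extracts byte k (0..3) of a packed word, independent of host endianness.
std::uint8_t byte_of_word(std::uint32_t word, unsigned k) {
    assert(k < 4);
    return static_cast<std::uint8_t>(word >> (8 * k));
}
```

Byte k of each word holds flags 8k to 8k+7, with flag 8k in bit 0. When the words are stored little-endian, as on NVIDIA GPUs, reading the output buffer byte by byte gives the same layout as the byte-wise packing asked for in the question.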
