Block reduction in CUDA

Posted: 2020/6/3 19:55:15 | Tags: algorithm, cuda, reduction, cub


Question

I am trying to implement a reduction in CUDA, and I am a complete beginner. I am currently studying sample code from NVIDIA.

I am not sure how to choose the block size and grid size, especially when my input array (512 x 512) is larger than a single block.

Here is the code:

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    while (i < n) 
    { 
        sdata[tid] += g_idata[i] + g_idata[i+blockSize]; 
        i += gridSize; 
    }

    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

    if (tid < 32) 
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

However, it seems to me that g_odata[blockIdx.x] stores one partial sum per block, so to get the final result I still need to sum all the entries of the g_odata array.

Is there a kernel that does the whole summation? Or am I misunderstanding something here? I would really appreciate it if anyone could explain this to me. Thanks very much.
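For illustration, one common way to turn the per-block partial sums into a single total is to launch a reduction kernel a second time, with one block, over the array of partial sums. The sketch below is a minimal example of that idea and is not the NVIDIA sample itself: the names blockSum and sumWithTwoPasses, the block size of 256, and the grid size of 64 are all assumptions chosen for this example. It also differs from the code above in two ways: it guards the second load so it cannot read past the end of the input, and it uses __syncthreads() at every step instead of the unrolled last-warp section. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Each block accumulates a grid-strided slice of `in` and writes one
// partial sum to out[blockIdx.x]. BLOCK must be a power of two and must
// equal the number of threads per block used at launch.
template <unsigned int BLOCK>
__global__ void blockSum(const int *in, int *out, unsigned int n)
{
    __shared__ int sdata[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (BLOCK * 2) + tid;
    unsigned int stride = BLOCK * 2 * gridDim.x;

    int acc = 0;
    while (i < n) {
        acc += in[i];
        if (i + BLOCK < n) acc += in[i + BLOCK];  // guarded second load
        i += stride;
    }
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory, halving the active threads each step.
    for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper (not part of any library): sums n ints that
// already live in device memory at d_in, using two kernel launches.
int sumWithTwoPasses(const int *d_in, unsigned int n)
{
    constexpr unsigned int kBlock = 256;  // threads per block (assumed)
    constexpr unsigned int kGrid = 64;    // blocks in pass 1 (assumed)

    int *d_partial = nullptr;
    int *d_total = nullptr;
    cudaMalloc(&d_partial, kGrid * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));

    // Pass 1: kGrid blocks produce kGrid partial sums.
    blockSum<kBlock><<<kGrid, kBlock>>>(d_in, d_partial, n);
    // Pass 2: a single block reduces the partial sums to one value.
    blockSum<kBlock><<<1, kBlock>>>(d_partial, d_total, kGrid);

    int total = 0;
    // This blocking copy on the default stream waits for both kernels.
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

Because the kernel uses a grid-stride loop, the grid size does not have to match the input size: with 512 x 512 = 262144 elements, 64 blocks of 256 threads each run the loop 8 times. Two other options are also worth knowing about: having thread 0 of each block call atomicAdd on a single output value, or using a library routine such as cub::DeviceReduce::Sum.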
