Block reduction in CUDA
Question
I am trying to implement a reduction in CUDA, and I am new to it. I am currently studying a sample code from NVIDIA.
I am not sure how to set the block size and grid size, especially when my input array (512 x 512) is larger than a single block.
Here is the code.
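template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    // Dynamic shared memory: one int per thread, sized at launch.
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    // Grid-stride loop: each thread adds two elements per iteration.
    // Note: the second load is unguarded, so this assumes blockDim.x == blockSize
    // and n is a multiple of 2*blockSize; otherwise it can read past the end of g_idata.
    while (i < n)
    {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();
    // Tree reduction in shared memory, unrolled at compile time via blockSize.
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
    // Last warp: there is no barrier between these steps, so this relies on the
    // 32 threads of a warp executing in lockstep. On GPUs with independent thread
    // scheduling (Volta and later) that is not guaranteed without __syncwarp().
    if (tid < 32)
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }
    // Thread 0 writes this block's partial sum.
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}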
However, it seems to me that g_odata[blockIdx.x] holds only the partial sum of each block, so to get the final result I still need to sum all the entries of the g_odata array.
I am wondering: is there a kernel that does the whole summation, or am I misunderstanding something here? I would really appreciate it if anyone could explain this to me. Thanks very much.
Recommended answer
To get a better understanding of this topic, you can have a look at this PDF from NVIDIA, which explains graphically all the strategies used in your code.
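On the question itself: your reading of g_odata is right. Each block writes only its own partial sum, and one common way to finish is to launch a reduction kernel again on those partial sums until a single value is left. The following is a minimal sketch of that idea, not NVIDIA's sample. The names blockSum and sumOnDevice are hypothetical, the kernel is a deliberately simple one with a barrier after every step instead of the unrolled reduce6, it assumes a power-of-two block size, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Hypothetical kernel: each block writes one partial sum to out[blockIdx.x].
// Assumes blockDim.x is a power of two and that the launch provides
// blockDim.x * sizeof(int) bytes of dynamic shared memory.
__global__ void blockSum(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;

    // Each thread loads up to two elements; both loads are bounds-checked.
    int v = 0;
    if (i < n) v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory with a barrier after every step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: relaunches blockSum on its own output
// until only one value remains. CUDA error checking is omitted.
int sumOnDevice(const std::vector<int> &h_in, unsigned int threads = 256)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) return 0;
    unsigned int perBlock = threads * 2;  // elements consumed by one block
    unsigned int maxBlocks = (n + perBlock - 1) / perBlock;

    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, maxBlocks * sizeof(int));
    cudaMemcpy(d_a, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int *src = d_a, *dst = d_b;
    while (n > 1) {
        // Grid size: enough blocks to cover all n remaining elements.
        unsigned int blocks = (n + perBlock - 1) / perBlock;
        blockSum<<<blocks, threads, threads * sizeof(int)>>>(src, dst, n);
        n = blocks;            // the partial sums become the next input
        std::swap(src, dst);   // ping-pong between the two buffers
    }

    int result = 0;
    cudaMemcpy(&result, src, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

For a 512 x 512 input with 256 threads per block, each block consumes 512 elements, so the first launch uses 512 blocks and the second launch reduces the 512 partial sums with a single block. Another option, if you prefer a single launch, is to have thread 0 of every block add its partial sum to one global counter with atomicAdd; for small grids you can also simply copy g_odata back and finish the sum on the host.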