Sum reduction with CUDA: What is N?


Question

According to NVIDIA, this is the fastest sum reduction kernel:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }
    // blockSize cannot be deduced from the arguments, so it is passed explicitly
    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

However, I don't understand the "n" parameter. Any clues? I don't think it is the size of the array to reduce, because then the read of g_idata[i+blockSize] in the while loop could run past the end of the buffer.

Recommended answer

I believe you've discovered a typo in the slides (it should probably be something like while(i + blockDim.x < n)).
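To make the overflow concrete, here is a small host-only sketch that replays the index arithmetic of the slide's loop and reports the largest g_idata index it would touch. The function name maxIndexRead and the launch shape used below (4 blocks of 128 threads) are my own assumptions for illustration, not something from the slides.

```cuda
#include <algorithm>

// Largest g_idata index read by the slide's loop
//   while (i < n) { ... g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
// over every thread of a hypothetical launch. Plain host code, no GPU needed.
unsigned int maxIndexRead(unsigned int n, unsigned int blockSize, unsigned int gridDimX)
{
    unsigned int gridSize = blockSize * 2 * gridDimX;
    unsigned int maxIdx = 0;
    for (unsigned int block = 0; block < gridDimX; ++block)
        for (unsigned int tid = 0; tid < blockSize; ++tid)
            for (unsigned int i = block * (blockSize * 2) + tid; i < n; i += gridSize)
                maxIdx = std::max(maxIdx, i + blockSize);  // the unguarded second read
    return maxIdx;
}
```

With n = 1000, blockSize = 128 and 4 blocks, this returns 1023, so the loop reads 24 elements past the end of a 1000-element array. With n = 1024 it also returns 1023, which is in bounds: the unguarded read is only safe when n is a multiple of 2*blockSize.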

If you take a look at the source code in the CUDA SDK sample "reduction", the body of the most recent reduce6 looks like this:

template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void
reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    ...

    T mySum = 0;

    // we reduce multiple elements per thread.  The number is determined by the 
    // number of active thread blocks (via gridDim).  More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {         
        mySum += g_idata[i];
        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n) 
            mySum += g_idata[i+blockSize];  
        i += gridSize;
    } 

Note the explicit check within the while loop, which prevents out-of-bounds access to g_idata. Your initial suspicion is correct; n is simply the size of the g_idata array.
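For completeness, here is a minimal self-contained sketch of the guarded pattern, treating n as the element count of g_idata. It is not the SDK code: the names reduceGuarded and sumOnDevice, the fixed launch shape (4 blocks of 128 threads), the statically sized shared array, and the plain tree reduction in place of the unrolled warpReduce are all simplifications I am assuming, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block writes one partial sum to g_odata[blockIdx.x].
// n is the number of elements in g_idata; it need not be a power of two.
template <unsigned int blockSize>
__global__ void reduceGuarded(const int *g_idata, int *g_odata, unsigned int n)
{
    __shared__ int sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    int mySum = 0;
    while (i < n) {
        mySum += g_idata[i];
        if (i + blockSize < n)              // guard only the second read
            mySum += g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = mySum;
    __syncthreads();

    // Plain shared-memory tree reduction (assumes blockSize is a power of two).
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: reduces h_in on the GPU and adds up the
// per-block partial sums on the CPU. CUDA error checks are omitted.
int sumOnDevice(const std::vector<int> &h_in)
{
    constexpr unsigned int kBlock = 128;
    const unsigned int kGrid = 4;
    unsigned int n = static_cast<unsigned int>(h_in.size());

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, (n ? n : 1) * sizeof(int));
    cudaMalloc(&d_out, kGrid * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reduceGuarded<kBlock><<<kGrid, kBlock>>>(d_in, d_out, n);

    std::vector<int> partial(kGrid);
    cudaMemcpy(partial.data(), d_out, kGrid * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int total = 0;
    for (int p : partial) total += p;
    return total;
}
```

The guard sits on the second read only, as in the SDK excerpt. Tightening the loop condition itself to i + blockSize < n would avoid the overflow too, but a thread whose i is in range while i + blockSize is not would then skip g_idata[i], so tail elements would be dropped from the sum.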
