CUDA Optimization Not Working


Problem Description

I found this PDF (http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf) that walks you through several ways to optimize a reduce operation in CUDA and I'm trying to follow along. For reduction #5 it suggests unrolling the last 6 iterations of the loop with the following code:

if (tid < 32)
{
  sdata[tid] += sdata[tid + 32];
  sdata[tid] += sdata[tid + 16];
  sdata[tid] += sdata[tid + 8];
  sdata[tid] += sdata[tid + 4];
  sdata[tid] += sdata[tid + 2];
  sdata[tid] += sdata[tid + 1];
}

The previous slide even says:

  • As reduction proceeds, # "active" threads decreases
    • When s <= 32, we have only one warp left
  • Instructions are SIMD synchronous within a warp
  • That means when s <= 32:
    • We don't need __syncthreads()
    • We don't need "if (tid < s)" because it doesn't save any work

However, when I tried this approach I got a MUCH smaller sum from the reduction than from the previous approach. If I add __syncthreads() after each write to shared memory, then I get the correct result.

Are the comments about "Instructions are SIMD synchronous within a warp" and "We don't need to __syncthreads()" not true? Or is this an old document and the technology has changed?

Answer

You need to use the volatile keyword, as njuffa pointed out in the comments.
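
For example, here's a minimal sketch of that fix applied directly to the snippet from the question; the volatile alias smem is my naming, and sdata/tid are assumed to come from the question's reduction #5 kernel (with sdata declared as extern __shared__ int sdata[] as in the PDF):

// Assumption: tid and sdata come from the question's reduction #5 kernel.
// The volatile qualifier forces every read/write below to actually hit
// shared memory, so each thread in the warp observes its neighbours'
// partial sums without needing __syncthreads().
volatile int *smem = sdata;

if (tid < 32)
{
  smem[tid] += smem[tid + 32];
  smem[tid] += smem[tid + 16];
  smem[tid] += smem[tid + 8];
  smem[tid] += smem[tid + 4];
  smem[tid] += smem[tid + 2];
  smem[tid] += smem[tid + 1];
}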

There's a much more recent version of the same document here: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Here's the equivalent example #6 for reference.

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    // volatile forces each read/write to go through shared memory, so the
    // warp-synchronous steps below stay correct without __syncthreads().
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    // Grid-stride loop: each thread sums pairs of elements across the whole input.
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }

    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
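
For context, here's a rough sketch of how reduce6 might be launched from the host; the block count, input size, and pointer names (d_in, d_out) are illustrative assumptions, not part of the document:

#include <cuda_runtime.h>

int main() {
    const unsigned int threads = 256;                 // must match the template argument below
    const unsigned int n = 1 << 20;                   // example input size (multiple of threads*2)
    const unsigned int blocks = n / (threads * 2);    // each block first sums 2 elements per thread

    int *d_in, *d_out;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_out, blocks * sizeof(int));
    // ... copy or generate input data in d_in here ...

    // Third launch parameter = dynamic shared memory size: one int per thread,
    // backing the kernel's "extern __shared__ int sdata[]".
    reduce6<threads><<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);

    // d_out now holds one partial sum per block; sum those 'blocks' values
    // on the host (or with a second kernel pass) to get the final result.

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}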
