CUDA Array Reduction


Question

I'm aware that several similar questions have already been answered, but I haven't been able to piece together anything useful from them, other than that I'm probably indexing something incorrectly.

I'm trying to perform a sequential addressing reduction on input vector A into output vector B.

The full code is available at http://pastebin.com/7UGadgjX, but this is the kernel:

__global__ void vectorSum(int *A, int *B, int numElements) {
  extern __shared__ int S[];
  // Each thread loads one element from global to shared memory
  int tid = threadIdx.x;
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < numElements) {
    S[tid] = A[i];
    __syncthreads();
    // Reduce in shared memory
    for (int t = blockDim.x/2; t > 0; t>>=1) {
      if (tid < t) {
        S[tid] += S[tid + t];
      }
      __syncthreads();
    }
    if (tid == 0) B[blockIdx.x] = S[0];
  }
}

These are the kernel launch statements:

// Launch the Vector Summation CUDA Kernel
  int threadsPerBlock = 256;
  int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
  vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);

I'm getting an unspecified launch error, which I've read is similar to a segfault. I've been following the NVIDIA reduction documentation closely and have tried to keep my kernel within the bounds of numElements, but I seem to be missing something key, considering how simple the code is.

Recommended answer

Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is an out-of-bounds (illegal) shared memory access, which aborts the kernel.
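A failure like this one is reported asynchronously, so it can help to check the launch and the kernel's execution separately. The following is a minimal sketch, not part of the original code: `checkKernel` is a hypothetical helper name, while `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are standard CUDA runtime API calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports launch-time and execution-time errors separately.
// Returns true if the most recent kernel launched and completed without error.
bool checkKernel(const char *label) {
  // Errors detected at launch time (for example, an invalid configuration).
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "%s: launch failed: %s\n", label, cudaGetErrorString(err));
    return false;
  }
  // Errors raised while the kernel was running (for example, an illegal access).
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "%s: execution failed: %s\n", label, cudaGetErrorString(err));
    return false;
  }
  return true;
}
```

Calling `checkKernel("vectorSum")` immediately after the `vectorSum<<<...>>>` launch would show whether the error comes from the launch itself or from the kernel's execution.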

In CUDA runtime API syntax, the kernel launch statement takes four arguments. The first two are the grid and block dimensions for the launch. The last two are optional and default to zero; they specify the dynamically allocated shared memory size in bytes and the stream.
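As an illustration of the four launch arguments, here is a small sketch that is separate from the question's code. The kernel `reverseInBlock` and the helper `launchReverse` are hypothetical names introduced only for this example, and the kernel assumes the array length is an exact multiple of the block size.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reverses each block's slice of `data` in place, using
// dynamically sized shared memory as scratch space.
// Assumes the array length is a multiple of blockDim.x.
__global__ void reverseInBlock(int *data) {
  extern __shared__ int scratch[];  // sized by the third launch argument
  int tid = threadIdx.x;
  int base = blockDim.x * blockIdx.x;
  scratch[tid] = data[base + tid];
  __syncthreads();
  data[base + tid] = scratch[blockDim.x - 1 - tid];
}

// Hypothetical host helper that spells out all four launch arguments:
// grid size, block size, dynamic shared memory in bytes, and stream.
cudaError_t launchReverse(int *d_data, int numBlocks, int threadsPerBlock,
                          cudaStream_t stream) {
  size_t shmemBytes = (size_t)threadsPerBlock * sizeof(int);
  reverseInBlock<<<numBlocks, threadsPerBlock, shmemBytes, stream>>>(d_data);
  return cudaGetLastError();
}
```

Passing `0` for `stream` selects the default stream, which is also what the two-argument and three-argument launch forms use.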

To fix this, change the launch code as follows:

// Launch the Vector Summation CUDA Kernel
  int threadsPerBlock = 256;
  int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
  size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
  vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);

[disclaimer: code written in browser, not compiled or tested, use at own risk]

This should at least fix the most obvious problem with your code.
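One less obvious issue, which the answer does not cover, is that the question's kernel calls `__syncthreads()` inside `if (i < numElements)`. When numElements is not a multiple of the block size, some threads of the last block skip the barrier, which CUDA leaves undefined, and the reduction also reads shared memory slots that were never written. The sketch below is one possible variant under stated assumptions, not the original author's code: `vectorSumPadded` and `sumOnDevice` are hypothetical names, the block size is assumed to be a power of two, and the per-block partial sums in B are added up on the host.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant of the question's kernel: the bounds check guards only the load, so
// every thread in the block reaches each __syncthreads() call.
// Assumes blockDim.x is a power of two.
__global__ void vectorSumPadded(const int *A, int *B, int numElements) {
  extern __shared__ int S[];
  int tid = threadIdx.x;
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  S[tid] = (i < numElements) ? A[i] : 0;  // out-of-range threads contribute zero
  __syncthreads();
  // Sequential addressing reduction in shared memory
  for (int t = blockDim.x / 2; t > 0; t >>= 1) {
    if (tid < t) S[tid] += S[tid + t];
    __syncthreads();
  }
  if (tid == 0) B[blockIdx.x] = S[0];  // one partial sum per block
}

// Hypothetical host helper: returns the sum of h_A by adding the per-block
// partial sums on the host. Error checking is omitted for brevity.
int sumOnDevice(const std::vector<int> &h_A) {
  int numElements = (int)h_A.size();
  if (numElements == 0) return 0;
  int threadsPerBlock = 256;
  int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
  size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
  int *d_A = nullptr, *d_B = nullptr;
  cudaMalloc(&d_A, numElements * sizeof(int));
  cudaMalloc(&d_B, blocksPerGrid * sizeof(int));
  cudaMemcpy(d_A, h_A.data(), numElements * sizeof(int), cudaMemcpyHostToDevice);
  vectorSumPadded<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
  std::vector<int> h_B(blocksPerGrid);
  cudaMemcpy(h_B.data(), d_B, blocksPerGrid * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_A);
  cudaFree(d_B);
  int total = 0;
  for (int partial : h_B) total += partial;
  return total;
}
```

For example, `sumOnDevice(std::vector<int>(1000, 1))` launches four blocks of 256 threads, and the last block pads its 24 out-of-range slots with zeros before reducing.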
