CUDA large input arrays


Problem description

I am new to CUDA and have been working on a reduce algorithm.

The algorithm works for any array size less than 1<<24.

When I use an array of size 1<<25, the program returns 0 as the total sum, which is wrong. The sum should be 2^25.

EDIT: cuda-memcheck output from the compiled code:

========= CUDA-MEMCHECK
@@STARTING@@ 
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so.1 [0x2f2d83]
=========     Host Frame:test [0x3b37e]
=========     Host Frame:test [0x2b71]
=========     Host Frame:test [0x2a18]
=========     Host Frame:test [0x2a4c]
=========     Host Frame:test [0x2600]
=========     Host Frame:test [0x2904]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ed5d]
=========     Host Frame:test [0x23e9]
=========

My setup is:


  • Nvidia Tesla K40
  • CUDA 6.5
  • Scientific Linux release 6.4 (Carbon)

The program consists of a kernel, a kernel wrapper, and a main that executes the kernel wrapper.

/* -------- KERNEL -------- */
__global__ void reduce_kernel(int * d_out, int * d_in, int size)
{
  // position and threadId
  int pos = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;

  // do reduction in global memory
  for (unsigned int s = blockDim.x / 2; s>0; s>>=1)
  {
    if (tid < s)
    {
      if (pos+s < size) // Handling out of bounds
      {
        d_in[pos] = d_in[pos] + d_in[pos+s];
      }
    }
    __syncthreads();
  }

  // only thread 0 writes the result for this block
  if ((tid==0) && (pos < size))
  {
    d_out[blockIdx.x] = d_in[pos];
  }
}

Here is the kernel wrapper:

/* -------- KERNEL WRAPPER -------- */
void reduce(int * d_out, int * d_in, int size, int num_threads)
{
  // setting up blocks and intermediate result holder

  int num_blocks;
  if(((size) % num_threads))
    {
      num_blocks = ((size) / num_threads) + 1;    
    }
    else
    {
      num_blocks = (size) / num_threads;
    }
  int * d_intermediate;
  cudaMalloc(&d_intermediate, sizeof(int)*num_blocks);
  cudaMemset(d_intermediate, 0, sizeof(int)*num_blocks);
  int prev_num_blocks;
  int i = 1;
  int size_rest = 0;
  // recursively solving, will run approximately log base num_threads times.
  do
  {
    printf("Round:%.d\n", i);
    printf("NumBlocks:%.d\n", num_blocks);
    printf("NumThreads:%.d\n", num_threads);
    printf("size of array:%.d\n", size);
    i++;
    reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);
    size_rest = size % num_threads;
    size = size / num_threads + size_rest;

    // updating input to intermediate
    cudaMemcpy(d_in, d_intermediate, sizeof(int)*num_blocks, cudaMemcpyDeviceToDevice);

    // Updating num_blocks to reflect how many blocks we now want to compute on
    prev_num_blocks = num_blocks;
    if(size % num_threads)
    {
      num_blocks = size / num_threads + 1;      
    }
    else
    {
      num_blocks = size / num_threads;
    }
    // updating intermediate
    cudaFree(d_intermediate);
    cudaMalloc(&d_intermediate, sizeof(int)*num_blocks);
  }
  while(size > num_threads); // if it is too small, compute rest.

  // computing rest
  reduce_kernel<<<1, size>>>(d_out, d_in, prev_num_blocks);
}
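
As a side note on the block count: the if/else in the wrapper is doing a ceiling division. A more compact, equivalent form (a sketch, not part of the original code) would be:

  // round size / num_threads up to the next whole block
  int num_blocks = (size + num_threads - 1) / num_threads;

Either form launches enough blocks to cover every element of the input.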

And here is the main:

/* -------- MAIN -------- */
int main(int argc, char **argv)
{
  printf("@@STARTING@@ \n");
  // Setting num_threads
  int num_threads = 512;
  // Making non-bogus data and setting it on the GPU
  const int size = 1<<24;
  const int size_out = 1;
  int * d_in;
  int * d_out;
  cudaMalloc(&d_in, sizeof(int)*size);
  cudaMalloc(&d_out, sizeof(int)*size_out);

  int * h_in = (int *)malloc(size*sizeof(int));
  for (int i = 0; i <  size; i++) h_in[i] = 1;
  cudaMemcpy(d_in, h_in, sizeof(int)*size, cudaMemcpyHostToDevice);

  // Running kernel wrapper
  reduce(d_out, d_in, size, num_threads);
  int result;
  cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  printf("\nFINAL SUM IS: %d\n", result);
}


Answer

This method of compiling the code:

nvcc -o my_reduce my_reduce.cu

builds for a compute architecture of cc2.0 (the default) on CUDA 6.5.

That architecture is limited to 65535 blocks (in the x-dimension, which is the only dimension you are using) in the grid.
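
If you want to see what the hardware itself allows, a minimal sketch along these lines (not part of the original code) queries the device properties. Note that maxGridSize reports the hardware limit (2^31-1 in x on a K40); the 65535 cap here comes from the cc2.0 compile target, so it only surfaces as an error at kernel-launch time:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // device 0; adjust if you have several
  // The hardware limit; code built for cc2.0 is still capped at 65535 blocks in x.
  printf("Device: %s (cc %d.%d)\n", prop.name, prop.major, prop.minor);
  printf("Max grid x-dimension: %d\n", prop.maxGridSize[0]);
  return 0;
}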

At a size of 1<<24, with num_threads=512, the number of blocks launched is:

  num_blocks = (size) / num_threads;

which is 1<<24/512 = 32768 blocks, still under that limit.

At a size of 1<<25 you need 65536 blocks, which exceeds the 65535-block limit of a cc2.0 build; that is the invalid-argument error cuda-memcheck reports on the kernel launch.

To fix this, compile with:

nvcc -arch=sm_35 -o my_reduce my_reduce.cu

which is the correct compile architecture (i.e. compute capability) for your K40, and which raises the block limit to 2^31-1.
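
If the same binary also needs to run on other GPUs, nvcc accepts explicit -gencode clauses; a hedged example (embedding sm_35 machine code plus PTX for forward compatibility) would be:

nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -o my_reduce my_reduce.cu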

And please use proper CUDA error checking any time you are having trouble with CUDA code, before asking for help here. Even if you don't understand the error output, it will likely help those who are trying to help you.
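
As a minimal sketch of what that looks like (the macro name is my own, not from the original code), something like this after each API call and kernel launch would have surfaced the cudaErrorInvalidValue immediately:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: check the last CUDA error and abort with a message.
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
              msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)

// Usage, e.g. inside the wrapper:
//   reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);
//   cudaCheckErrors("reduce_kernel launch failed");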
