CUDA/C - Using malloc in kernel functions gives strange results

Problem description

I'm new to CUDA/C and new to Stack Overflow. This is my first question.

I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected. I read that using malloc() in a kernel can lower performance a lot, but I need it anyway, so I first tried with a simple int ** array just to test the possibility; later I'll actually need to allocate more complex structs.

In my main I used cudaMalloc() to allocate the space for the array of int *, and then I used malloc() for every thread in the kernel function to allocate the array for every index of the outer array. I then used another thread to check the result, but it doesn't always work.

Here's the main code:

#define N_CELLE 1024*2
#define L_CELLE 512

extern "C" {

int main(int argc, char **argv) {
  int *result = (int *)malloc(sizeof(int));
  int *d_result;
  int size_numbers = N_CELLE * sizeof(int *);
  int **d_numbers;

  cudaMalloc((void **)&d_numbers, size_numbers);
  cudaMalloc((void **)&d_result, sizeof(int *));

  kernel_one<<<2, 1024>>>(d_numbers);
  cudaDeviceSynchronize();
  kernel_two<<<1, 1>>>(d_numbers, d_result);

  cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

  printf("%d\n", *result);

  cudaFree(d_numbers);
  cudaFree(d_result);
  free(result);
}

}

I used extern "C" because I couldn't compile while importing my header, which is not used in this example code. I pasted it since I don't know whether it's relevant or not.

This is kernel_one code:

__global__ void kernel_one(int **d_numbers) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
  for(int j=0; j<L_CELLE;j++)
    d_numbers[i][j] = 1;
}

And this is kernel_two code:

__global__ void kernel_two(int **d_numbers, int *d_result) {
  int temp = 0;
  for(int i=0; i<N_CELLE; i++) {
    for(int j=0; j<L_CELLE;j++)
      temp += d_numbers[i][j];     
  }
  *d_result = temp;
}

Everything works fine (i.e. the count is correct) as long as the total allocation stays at or below 1024*2*512 ints of device memory. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers. Any idea what the problem could be? Thanks anyone!

Answer

In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.

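For reference, host-side API error checking of the kind mentioned above might look like the sketch below. It assumes the code from the question; CUDA_CHECK is only an illustrative macro name, not part of the CUDA API.

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
              __FILE__, __LINE__, cudaGetErrorString(err_));          \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

  /* ... */
  kernel_one<<<2, 1024>>>(d_numbers);
  CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
  CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel runs
  kernel_two<<<1, 1>>>(d_numbers, d_result);
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost));

With checks like these (or under cuda-memcheck), the heap exhaustion shows up as a reported error instead of silently wrong numbers.
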
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:

 size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
 cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);

to your host code before any other API calls should solve the problem.
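
Putting it together, here is one possible sketch of the adjusted host code. It assumes the kernels and #defines stay as posted; the grid for kernel_one is written as N_CELLE / 1024 blocks (assuming N_CELLE is a multiple of 1024) so that every outer index still gets a thread if N_CELLE changes, since the original fixed <<<2, 1024>>> launch only covers 2048 entries.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int *result = (int *)malloc(sizeof(int));
  int *d_result;
  int size_numbers = N_CELLE * sizeof(int *);
  int **d_numbers;

  // Grow the device-side malloc heap before any other CUDA call.
  // With N_CELLE = 1024*4 the payload alone is 1024*4*512*4 bytes = 8 MiB,
  // which no longer fits in the default 8 MB heap once per-allocation
  // overhead is added; the factor of 2 leaves headroom for that overhead.
  size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2 * L_CELLE);
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);

  cudaMalloc((void **)&d_numbers, size_numbers);
  cudaMalloc((void **)&d_result, sizeof(int));      // one int is enough here (the question used sizeof(int *), which also works)

  kernel_one<<<N_CELLE / 1024, 1024>>>(d_numbers);  // one thread per outer index
  cudaDeviceSynchronize();                          // error checks as sketched above omitted for brevity
  kernel_two<<<1, 1>>>(d_numbers, d_result);

  cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  printf("%d\n", *result);

  cudaFree(d_numbers);
  cudaFree(d_result);
  free(result);
}

Checking the pointer returned by malloc() inside kernel_one against NULL would also surface heap exhaustion directly, since device-side malloc() returns NULL when the heap is full.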
