"unknown error" while using dynamic allocation inside __device__ function in CUDA

Problem Description

I'm trying to implement a linked list in a CUDA application to model a growing network. In order to do so I'm using malloc inside a __device__ function, aiming to allocate memory in global memory. The code is:

void __device__ insereviz(Vizinhos **lista, Nodo *novizinho, int *Gteste)
{
   Vizinhos *vizinho;

   vizinho=(Vizinhos *)malloc(sizeof(Vizinhos));

   vizinho->viz=novizinho;

   vizinho->proxviz=*lista;

   *lista=vizinho;

   novizinho->k=novizinho->k+1;
}

After a certain number of allocated elements (around 90000) my program returns "unknown error". At first I thought it was a memory constraint, but I checked nvidia-smi and I've got

+------------------------------------------------------+                       
| NVIDIA-SMI 331.38     Driver Version: 331.38         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 770     Off  | 0000:01:00.0     N/A |                  N/A |
| 41%   38C  N/A     N/A /  N/A |    159MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

So it doesn't seem to be a memory problem, unless malloc is allocating inside shared memory. To test this I tried to run two networks in separate blocks, and still hit a limit on the number of structures I'm able to allocate. But when I run two instances of the same program, each with a smaller number of structures, they both finish without error.

I've also tried cuda-memcheck and I've got

========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x000001b0 in     /work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:164:insereviz(neighbor**, node*, int*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x00000000 is out of bounds
=========     Device Frame:/work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:142:insereno(int, int, node**, node**, int*) (insereno(int, int, node**, node**, int*) : 0x648)
=========     Device Frame:/work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:111:fazrede(node**, int, int, int, int*) (fazrede(node**, int, int, int, int*) : 0x4b8)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138281]
=========     Host Frame:gpu_testamalloc5 [0x1bd48]
=========     Host Frame:gpu_testamalloc5 [0x3b213]
=========     Host Frame:gpu_testamalloc5 [0x2fe3]
=========     Host Frame:gpu_testamalloc5 [0x2e39]
=========     Host Frame:gpu_testamalloc5 [0x2e7f]
=========     Host Frame:gpu_testamalloc5 [0x2c2f]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xfd) [0x1eead]
=========     Host Frame:gpu_testamalloc5 [0x2829]

Is there any restriction in the kernel launch or something I'm missing? How can I check it?

Thanks,

Ricardo

Recommended Answer

The most likely reason is that you are running out of space on the "device heap". This initially defaults to 8MB, but you can change it.

Referring to the documentation, we see that device malloc allocates out of the device heap.

If an error occurs, a NULL pointer will be returned by malloc. It's good practice to test for this NULL pointer in device code (and in host code -- it's no different from host malloc in this respect). If you get a NULL pointer, you have run out of device heap space.
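
Below is a minimal sketch of such a check, based on the insereviz function from the question. Reporting the failure through the Gteste flag is only an assumption for illustration; any scheme that lets the host detect the failure will do.

void __device__ insereviz(Vizinhos **lista, Nodo *novizinho, int *Gteste)
{
   Vizinhos *vizinho;

   vizinho=(Vizinhos *)malloc(sizeof(Vizinhos));

   /* device malloc returns NULL when the device heap is exhausted */
   if (vizinho == NULL)
   {
      *Gteste = 1;   /* hypothetical error flag -- pick your own reporting scheme */
      return;
   }

   vizinho->viz=novizinho;
   vizinho->proxviz=*lista;
   *lista=vizinho;
   novizinho->k=novizinho->k+1;
}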

As indicated in the documentation, the size of the device heap can be adjusted before your kernel call by using the:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)

runtime API function.
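
For example, a host-side call along these lines before the kernel launch raises the heap limit; the 128MB figure is only an assumed value, not one taken from the question:

/* assumed value: raise the device heap to 128MB instead of the default 8MB */
size_t heapSize = 128 * 1024 * 1024;
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapSize);
if (err != cudaSuccess)
{
   printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
}
/* ...then launch the kernel that calls malloc in device code... */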

If you ignore all this and attempt to use the NULL pointer returned anyway, you'll get invalid accesses in device code, like this:

=========     Address 0x00000000 is out of bounds
