CUDA unspecified launch failure error


Problem description

I have the following code http://pastebin.com/vLeD1GJm which works just fine, but if I increase:

#define GPU_MAX_PW 100000000

to:

#define GPU_MAX_PW 1000000000

Then I receive:

frederico@zeus:~/Dropbox/coisas/projetos/delta_cuda$ optirun ./a
block size = 97657 grid 48828 grid 13951

unspecified launch failure in a.cu at line 447.. err number 4

I'm running this on a GTX 675M, which has 2 GB of memory. The second definition of GPU_MAX_PW needs around 1000000000×2÷1024÷1024 = 1907 MB, so I'm not out of memory. What can the problem be, since I'm only allocating more memory? Maybe the grid and block configuration becomes impossible?

Note that the error is pointing to this line:

HANDLE_ERROR(cudaMemcpy(gwords, gpuHashes, sizeof(unsigned short) * GPU_MAX_PW, cudaMemcpyDeviceToHost));

Solution

First of all you have your sizes listed incorrectly. The program works for 10,000,000 and not 100,000,000 (whereas you said it works for 100,000,000 and not 1,000,000,000). So memory size is not the issue, and your calculations there are based on the wrong numbers.
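To put numbers on the memory argument, here is a minimal sketch of the buffer sizes involved. It assumes one 2-byte unsigned short per element, as in the cudaMemcpy call quoted in the question; the name buffer_mib is made up for this illustration and does not appear in the original code.

```cpp
// Size in MiB of a buffer holding `count` elements of `elem_size` bytes each.
// Hypothetical helper, not part of the original program.
constexpr double buffer_mib(unsigned long long count, unsigned long long elem_size) {
    return static_cast<double>(count * elem_size) / (1024.0 * 1024.0);
}

// Assumption: sizeof(unsigned short) == 2, as the question's own arithmetic implies.
constexpr unsigned long long kElemSize = 2;

constexpr double kMib10M  = buffer_mib(10000000ULL, kElemSize);    // about 19 MiB
constexpr double kMib100M = buffer_mib(100000000ULL, kElemSize);   // about 191 MiB
constexpr double kMib1G   = buffer_mib(1000000000ULL, kElemSize);  // about 1907 MiB
```

The size that actually fails, 100,000,000 elements, needs only about 191 MiB, far below the card's 2 GB.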

calculate_grid_parameters is messed up. The objective of this function is to figure out how many blocks are needed, and therefore the grid size, based on GPU_MAX_PW specifying the total number of threads needed and 1024 threads per block (hard coded). The line that prints out block size = ... grid ... grid ... actually has the clue to the problem.

For GPU_MAX_PW of 100,000,000, this function correctly computes that 100,000,000 / 1024, rounded up, = 97657 blocks are needed. However, the grid dimensions are computed incorrectly. The product grid.x * grid.y should (approximately) equal the total number of blocks desired, but this function has decided that it wants grid.x of 48828 and grid.y of 13951. If I multiply those two, I get 681,199,428, which is much larger than the desired total block count of 97657.

If I then launch a kernel with requested grid dimensions of 48828 (x) and 13951 (y), and also request 1024 threads per block, I have requested 697,548,214,272 total threads in that kernel launch. First of all, this is not your intent. Secondly, while at the moment I can't say exactly why, this is apparently too many threads. Suffice it to say that this overall grid request exceeds some resource limitation of the machine. Because kernel launches are asynchronous, a failure during kernel execution is reported by a later runtime call, which is why the error shows up at the cudaMemcpy line rather than at the launch itself.
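The figures above can be checked mechanically, and the same arithmetic suggests one way to split the block count into a 2D grid whose product stays close to what is needed. This is only a sketch under stated assumptions, not the author's calculate_grid_parameters: the names Grid2D, ceil_div and make_grid are hypothetical, and the per-dimension limit of 65535 is assumed (it is the documented limit for grid.y on CUDA devices and for grid.x on compute capability 2.x hardware).

```cpp
// Hypothetical helper names; none of these come from the original pastebin code.
struct Grid2D {
    unsigned long long x;
    unsigned long long y;
};

// Integer division rounded up; b must be nonzero.
constexpr unsigned long long ceil_div(unsigned long long a, unsigned long long b) {
    return (a + b - 1) / b;
}

// Split `blocks` (must be > 0) into a 2D grid. Both dimensions stay within
// `max_dim` as long as blocks <= max_dim * max_dim, and x * y exceeds
// `blocks` by fewer than y.
constexpr Grid2D make_grid(unsigned long long blocks, unsigned long long max_dim) {
    return Grid2D{ceil_div(blocks, ceil_div(blocks, max_dim)),
                  ceil_div(blocks, max_dim)};
}

constexpr unsigned long long kThreadsPerBlock = 1024;  // hard coded in the post
constexpr unsigned long long kMaxGridDim = 65535;      // assumed per-dimension limit

// Figures quoted in the answer.
constexpr unsigned long long kBlocks = ceil_div(100000000ULL, kThreadsPerBlock);  // 97657
constexpr unsigned long long kBadBlocks = 48828ULL * 13951ULL;                    // 681,199,428
constexpr unsigned long long kBadThreads = kBadBlocks * kThreadsPerBlock;         // 697,548,214,272

// A split that stays close to the required block count.
constexpr Grid2D kGrid = make_grid(kBlocks, kMaxGridDim);  // x = 48829, y = 2
```

With this split the launch would request 48829 * 2 = 97658 blocks, one more than the 97657 required, instead of roughly 681 million.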

Note that if you drop GPU_MAX_PW from 100,000,000 to 10,000,000, the grid calculation becomes "sensible". I get:

block size = 9766 grid 9766 grid 1

and no launch failure.
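Even a sensible grid is rounded up, so it launches slightly more threads than there are elements: 9766 blocks of 1024 threads give 10,000,384 threads for 10,000,000 elements. A common precaution is for each thread to compute a flat index and skip work when that index is past the end of the array. The sketch below mirrors that index arithmetic on the host. The helper names are hypothetical, and the layout (1D blocks, blocks numbered row by row) is an assumption; whether the original kernel indexes this way or has such a guard is not known from the post.

```cpp
// Hypothetical helpers that mirror, on the host, the index arithmetic a kernel
// would do with blockIdx, gridDim, blockDim and threadIdx.

// Flat element index for a thread in a 2D grid of 1D blocks
// (blocks numbered row by row: block_y * grid_x + block_x).
constexpr unsigned long long flat_index(unsigned long long block_x,
                                        unsigned long long block_y,
                                        unsigned long long grid_x,
                                        unsigned long long threads_per_block,
                                        unsigned long long thread_x) {
    return (block_y * grid_x + block_x) * threads_per_block + thread_x;
}

// True when the thread maps to a real element; a kernel would return early otherwise.
constexpr bool in_range(unsigned long long index, unsigned long long n) {
    return index < n;
}

// Last thread of the 9766 x 1 grid from the output above.
constexpr unsigned long long kLastIndex = flat_index(9765, 0, 9766, 1024, 1023);  // 10,000,383
constexpr bool kLastInRange = in_range(kLastIndex, 10000000ULL);                  // false
```

Here the last thread maps to index 10,000,383, which lies beyond the 10,000,000 elements, so a guarded kernel would leave it idle.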
