CUDA 5.x on Kepler, dynamic kernel execution and maximum recursion "depth"


Problem description


The CUDA 5 programming guide states:

Launches may continue to a depth of 24 generations, but this depth will typically be limited by available resources on the GPU

My questions are the following:

  • Does the CUDA runtime on the GPU guarantee that a depth of 24 can always be achieved, and might it even go beyond 24 in some circumstances (case A)? Or do they mean that 24 is the absolute maximum and that this number might not actually be reached at runtime (case B)?

  • If case B, what happens when a kernel is launched on the GPU and there are not enough resources? Does the launch fail? (Weird if this is the case!)

I plan to write a CUDA program and would like to take advantage of the Kepler architecture. My algorithm absolutely needs function recursion to a depth of typically 15-19 (the recursion level is bound to my data structures).

Ref: TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf

Solution

CUDA does not guarantee that even a recursion depth of 1 will be achieved, just as a traditional OS does not guarantee that launching a new process or thread will succeed. For example, consider the following program:

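#include <sys/types.h>
#include <unistd.h>

// Empty kernel; defined before main() so the launch below compiles.
__global__ void dummy() {}

// Deliberately exhausts resources: do not run this on a machine you care about.
int main() {
    pid_t pid;
    while ((pid = fork()));   // the parent keeps forking; each child falls through
    while (true) {            // every child queues kernel launches without ever synchronizing
        dummy<<<1, 1024>>>();
    }
}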

At some point something will fail because you run out of either CPU or GPU memory: fork will return -1 or the CUDA runtime will report an error. In the same way, a launch issued on the GPU may fail and return an error.
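To make "the launch fails and returns an error" concrete for launches issued on the GPU, here is a minimal sketch. It assumes a device with dynamic parallelism (compute capability 3.5 or later) and a build with relocatable device code; the kernel `descend`, the helper `run_descend` and the `status` layout are hypothetical names chosen for illustration. It relies on `cudaGetLastError()` being callable from device code, which the device runtime provides.

```cuda
// Assumed build line: nvcc -arch=sm_35 -rdc=true nested.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each generation launches one child until max_depth is
// reached. status[0] = deepest generation that actually ran,
// status[1] = first device-side launch error seen (0 == cudaSuccess).
__global__ void descend(int depth, int max_depth, int *status)
{
    atomicMax(&status[0], depth);
    if (depth >= max_depth) return;

    descend<<<1, 1>>>(depth + 1, max_depth, status);

    // A failed device-side launch is reported through cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        atomicCAS(&status[1], 0, (int)err);
}

// Illustrative host helper: runs the chain and copies both counters back.
cudaError_t run_descend(int max_depth, int out[2])
{
    int *status = 0;
    cudaError_t err = cudaMalloc((void **)&status, 2 * sizeof(int));
    if (err != cudaSuccess) return err;
    cudaMemset(status, 0, 2 * sizeof(int));

    descend<<<1, 1>>>(1, max_depth, status);
    err = cudaDeviceSynchronize();  // the parent completes only after all descendants
    if (err == cudaSuccess)
        err = cudaMemcpy(out, status, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(status);
    return err;
}

int main()
{
    int out[2] = {0, 0};
    cudaError_t err = run_descend(19, out);
    printf("host status: %s, deepest generation: %d, first device error: %d\n",
           cudaGetErrorString(err), out[0], out[1]);
    return 0;
}
```

With a branching factor of one and no device-side synchronization this chain is about as cheap as nested launches get, so it says little about what a wider tree can reach; the point is only that the failure is observable and can be handled rather than assumed away.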

Another way of looking at it: in the worst case each launch can have (2^31-1)^2 * (2^10-1) ≃ 2^72 blocks, each with 2^10 threads, i.e. 2^82 threads in a single launch. Each level of recursion multiplies that count, so the growth is exponential: even if every thread terminated right after launching, guaranteeing 24 generations in the worst case would mean guaranteeing that (2^82)^24 = 2^1968 threads can be scheduled. If the state of each thread took just 1/32 of a bit (one bit per warp, recording whether the warp has finished), that would require 2^1963 bits, or 2^1930 GiB, of memory, which is "slightly" more than the informational capacity of the observable universe (roughly 2^1580 times more).
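Spelled out, the exponent arithmetic behind the thread count looks like this. The per-launch block count is taken as given from the paragraph above rather than checked against any particular device's grid limits, and the exponent 24 is the generation limit quoted in the question.

```latex
\begin{align*}
T_{\text{launch}} &\approx \underbrace{(2^{31}-1)^2\,(2^{10}-1)}_{\approx\, 2^{72}\ \text{blocks}} \times 2^{10}\ \tfrac{\text{threads}}{\text{block}} \approx 2^{82} \\
T_{24} &\approx \left(2^{82}\right)^{24} = 2^{82 \cdot 24} = 2^{1968} \\
\text{state bits} &\approx 2^{1968} \cdot \tfrac{1}{32} = 2^{1968-5} = 2^{1963}
\end{align*}
```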

Hence it is definitely case B; case A is not a sane possibility, since the state of a warp must include at least the instruction pointer, which is far more than 1/32 of a bit per thread. Depending on your algorithm's branching factor, and on whether you synchronize, a recursion depth of 15-19 might be achievable.
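Whether the child kernels are synchronized matters because the device runtime reserves memory per nesting level at which a parent may wait on its children. A hedged sketch of requesting those resources from the host before the first launch follows. The limit names `cudaLimitDevRuntimeSyncDepth` and `cudaLimitDevRuntimePendingLaunchCount` are part of the CUDA runtime API of the CUDA 5.x era; the helper `reserve_for_depth` and the values passed to it are illustrative assumptions. Later toolkits deprecated device-side synchronization, so treat this as specific to the toolkit generation discussed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative host helper: reserve device-runtime resources up front.
cudaError_t reserve_for_depth(size_t sync_depth, size_t pending_launches)
{
    // Deepest nesting level at which a kernel may call cudaDeviceSynchronize()
    // to wait on its children; each extra level reserves more device memory.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, sync_depth);
    if (err != cudaSuccess) return err;

    // Capacity of the buffer holding launches that have not started running yet.
    return cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, pending_launches);
}

int main()
{
    // 19 matches the deepest recursion mentioned in the question; 4096 is arbitrary.
    cudaError_t err = reserve_for_depth(19, 4096);

    size_t depth = 0, pending = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("set limits: %s, sync depth: %lu, pending launches: %lu\n",
           cudaGetErrorString(err), (unsigned long)depth, (unsigned long)pending);
    return 0;
}
```

If the request cannot be satisfied, `cudaDeviceSetLimit` returns an error on the host, which is a far easier place to handle "not enough resources" than the middle of a recursive launch tree.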

EDIT: If you mean plain recursion rather than recursive launches, then in practice it is limited by the stack. Depending on the exact code, it might be practically unlimited on Fermi and later (the Tesla generation does not support recursion, IIRC). Similarly, there is no guaranteed minimum depth: allocate a large array on the stack (local memory) and you will run out of space, although the optimizer does a good job of getting rid of such allocations.
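For plain recursion the relevant knob is the per-thread stack size, which the host can raise with `cudaDeviceSetLimit(cudaLimitStackSize, ...)`. The sketch below is illustrative only: `sum_to`, `recurse` and `run_recursion` are hypothetical names, the 4096-byte stack is an arbitrary choice, and the compiler may well turn this particular recursion into a loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical recursive device function: sums 1..n, one stack frame per level
// unless the optimizer removes the recursion.
__device__ int sum_to(int n)
{
    return n <= 0 ? 0 : n + sum_to(n - 1);
}

__global__ void recurse(int n, int *out) { *out = sum_to(n); }

// Illustrative host helper: sets the per-thread stack size, then runs one thread.
cudaError_t run_recursion(int n, size_t stack_bytes, int *result)
{
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, stack_bytes);
    if (err != cudaSuccess) return err;

    int *d_out = 0;
    err = cudaMalloc((void **)&d_out, sizeof(int));
    if (err != cudaSuccess) return err;

    recurse<<<1, 1>>>(n, d_out);
    err = cudaDeviceSynchronize();  // a stack overflow would surface here as an error
    if (err == cudaSuccess)
        err = cudaMemcpy(result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}

int main()
{
    int result = 0;
    cudaError_t err = run_recursion(19, 4096, &result);
    printf("status: %s, sum_to(19) = %d\n", cudaGetErrorString(err), result);
    return 0;
}
```

The stack limit applies to every resident thread, so raising it for a deep recursion costs memory in proportion to the number of threads the device can keep in flight.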
