Dynamic parallelism - launching many small kernels is very slow


Question

I am trying to use dynamic parallelism to improve an algorithm I have in CUDA. In my original CUDA solution, every thread computes a number that is common to all threads in its block. What I want to do is first launch a coarse (low-resolution) kernel in which each thread computes the common value just once (as if every thread represented one block). Each thread then creates a small grid of one block (16x16 threads) and launches a child kernel on it, passing the common value. In theory this should be faster because it saves many redundant operations. In practice, the solution runs very slowly, and I don't know why.

This is the code, heavily simplified to show just the idea.

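// Forward declaration so the parent kernel can launch the child.
// "parameters" is a placeholder for the real argument list.
__global__ void child_kernel( int common_val, parameters );

__global__ void coarse_kernel( parameters ){
    // each parent thread computes the shared value once
    int common_val = compute_common_val();
    dim3 dimblock(16, 16, 1);
    dim3 dimgrid(1, 1, 1);
    // device-side launch: one 16x16 child grid per parent thread
    child_kernel <<< dimgrid, dimblock >>> (common_val, parameters);
}

__global__ void child_kernel( int common_val, parameters ){
    // use common value
    do_computations(common_val, parameters);
}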

There are a lot of child kernels, one per parent thread, and there must be around 400x400 threads. From what I understand, the GPU should process all these kernels in parallel, right? Or are child kernels somehow processed sequentially? My results show that performance is more than 10 times slower than my original solution.

Recommended answer

There is a cost to launching a kernel, whether parent or child. If your child kernels do not extract much parallelism and offer little benefit over their non-parallel counterparts, then that small benefit may be cancelled out by the child kernel launch overhead.
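To get a feel for the size of that cost, the launch overhead can be estimated by timing many launches of a kernel that does nothing. The sketch below is only a rough illustration: `empty_kernel` and `average_launch_us` are hypothetical names, and it times host-side launches, so the figure for device-side (child) launches on a given GPU may well differ.

```cuda
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so the measured time
// is dominated by launch cost rather than by computation.
__global__ void empty_kernel() {}

// Returns the average time in microseconds per launch of a single
// 16x16 block, or a negative value if a CUDA call fails.
float average_launch_us(int launches) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    dim3 block(16, 16, 1);
    dim3 grid(1, 1, 1);

    // Warm-up launch so one-time initialization is not counted.
    empty_kernel<<<grid, block>>>();
    cudaDeviceSynchronize();

    // Time a back-to-back sequence of tiny launches.
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<grid, block>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || cudaGetLastError() != cudaSuccess) return -1.0f;

    // Convert the total in milliseconds to microseconds per launch.
    return ms * 1000.0f / launches;
}
```

Calling `average_launch_us(160000)` from `main` would mimic the roughly 400x400 launches described in the question. Comparing the result with the time the child kernel spends on useful work gives a first indication of whether the overhead dominates.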

In formulas, let t_o be the overhead of launching a child kernel, t_e its execution time, and t_s the time to execute the same code without dynamic parallelism. The speedup from using dynamic parallelism is t_s/(t_o + t_e). Perhaps (though this cannot be inferred from your code) t_e < t_s but t_e, t_s << t_o, so that t_s/(t_o + t_e) is approximately t_s/t_o < 1, and you observe a slowdown instead of a speedup.
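As a worked illustration, write $t_o$ for the child launch overhead, $t_e$ for the child execution time, and $t_s$ for the execution time without dynamic parallelism. The numbers below are made up purely to show the arithmetic and are not measurements from the question:

```latex
% Hypothetical values, chosen only for illustration:
%   t_o = 10 \mu s   (child launch overhead)
%   t_e =  1 \mu s   (child execution time)
%   t_s =  2 \mu s   (execution time without dynamic parallelism)
\[
  \text{speedup} = \frac{t_s}{t_o + t_e}
                 = \frac{2}{10 + 1}
                 \approx 0.18
                 \approx \frac{t_s}{t_o} = 0.2 < 1
\]
```

With these assumed values the child kernel itself is twice as fast as the original code ($t_e < t_s$), yet the overall result is about a 5.5x slowdown because $t_o$ dominates both terms.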
