CUDA concurrent kernel execution with multiple kernels per stream

Question

Using different streams for CUDA kernels makes concurrent kernel execution possible. Therefore, n kernels on n streams could theoretically run concurrently if they fit into the hardware, right?

Now I'm facing the following problem: there are not n distinct kernels but n*m, where the m kernels of each stream need to be executed in order. For instance, n=2 and m=3 would lead to the following execution scheme with streams:

Stream 1: <<<Kernel 0.1>>> <<<Kernel 1.1>>> <<<Kernel 2.1>>>
Stream 2: <<<Kernel 0.2>>> <<<Kernel 1.2>>> <<<Kernel 2.2>>>

My naive assumption is that the kernels x.1 and y.2 should execute concurrently (from a theoretical point of view), or at least not consecutively (from a practical point of view). But my measurements show that this is not the case; it seems that execution is consecutive (i.e. K0.1, K1.1, K2.1, K0.2, K1.2, K2.2). The kernels themselves are very small, so concurrent execution should not be a problem.

Now my approach would be to implement a kind of dispatching to make sure that the kernels are enqueued into the GPU's scheduler in an interleaved style. But when dealing with a large number of streams/kernels this could do more harm than good.

Alright, coming straight to the point: What would be an appropriate (or at least different) approach to solve this situation?

Edit: Measurements are done using CUDA events. I've measured the time needed to fully solve the computation, i.e. the GPU has to compute all n*m kernels. The assumption is: with fully concurrent kernel execution, the execution time is roughly (ideally) 1/n of the time needed to execute all kernels in order, which requires that two or more kernels can be executed concurrently. I'm ensuring this by using only two distinct streams right now.
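
For reference, a minimal sketch of what such a whole-batch measurement with CUDA events could look like; the kernel smallKernel, the buffer size, and the launch configuration are placeholders, not taken from the original post:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the small kernels from the question.
__global__ void smallKernel(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;
}

int main() {
    const int n = 2;    // number of streams
    const int m = 3;    // kernels per stream
    const int N = 256;  // elements (and threads) per stream

    float *d_data;
    cudaMalloc(&d_data, n * N * sizeof(float));

    cudaStream_t streams[n];
    for (int i = 0; i < n; ++i) cudaStreamCreate(&streams[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One event before and one after the whole batch; no events inside the loop.
    cudaEventRecord(start, 0);
    for (int j = 0; j < m; ++j)       // interleaved dispatch
        for (int i = 0; i < n; ++i)
            smallKernel<<<1, N, 0, streams[i]>>>(d_data + i * N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // wait until all n*m kernels are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("total time for %d kernels: %f ms\n", n * m, ms);

    for (int i = 0; i < n; ++i) cudaStreamDestroy(streams[i]);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}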

I can measure a clear difference in execution time between dispatching the kernels interleaved and enqueuing them stream by stream as described above, i.e.:

Loop: i = 0 to m-1
    EnqueueKernel(Kernel i.1, Stream 1)
    EnqueueKernel(Kernel i.2, Stream 2)

versus

Loop: i = 1 to n
    Loop: j = 0 to m-1
        EnqueueKernel(Kernel j.i, Stream i)

The latter leads to a longer runtime.
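
In CUDA runtime terms, the two dispatch orders above could look like the following sketch, reusing the placeholder smallKernel and the streams array from the timing example above:

// Breadth-first (interleaved): one launch per stream per round,
// as in the first pseudocode loop.
void launchInterleaved(cudaStream_t *streams, float *d_data,
                       int n, int m, int N) {
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i)
            smallKernel<<<1, N, 0, streams[i]>>>(d_data + i * N);
}

// Depth-first (per stream): fill each stream completely before the next,
// as in the second pseudocode loop.
void launchDepthFirst(cudaStream_t *streams, float *d_data,
                      int n, int m, int N) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            smallKernel<<<1, N, 0, streams[i]>>>(d_data + i * N);
}

Within one stream, launches are serialized by the stream itself, so both versions preserve the required in-order execution per stream; they differ only in the order in which the work becomes visible to the hardware scheduler.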

Edit #2: Changed the stream numbers to begin at 1 (instead of 0; see comments below).

Edit #3: Hardware is an NVIDIA Tesla M2090 (i.e. Fermi, compute capability 2.0).

Solution

On Fermi (aka Compute Capability 2.0) hardware it is best to interleave kernel launches across multiple streams rather than launching all kernels to one stream, then all to the next stream, and so on. This is because the hardware can immediately launch kernels to different streams if there are sufficient resources, whereas consecutive launches to the same stream often introduce a delay that reduces concurrency. This is why your first approach performs better, and it is the one you should choose.

Enabling profiling can also disable concurrency on Fermi, so be careful with that. Also, be careful about using CUDA events inside your launch loop, as these can interfere; for example, it is best to time the whole loop with a single pair of events, as you are doing.
