Synchronising multiple Cuda streams


Question


For the application that I'm currently developing, I want to have a long kernel (that is, a kernel that takes long to finish relative to the others) to execute concurrently with a sequence of multiple shorter kernels that also run concurrently. What makes this more complicated however, is the fact that the four shorter kernels each need to be synchronised after they're done, in order to execute another short kernel that collects and processes the data output by the other short kernels.

The following is a schematic of what I have in mind, with the numbered green bars representing different kernels:

In order to achieve this, I have written code that looks somewhat like the following:

// definitions of kernels 1-6

class Calc
{
    cudaStream_t stream[5]; // member, so the destructor can also see the streams

public:
    Calc()
    {
        // ...
        for(int i=0; i<5; i++) cudaStreamCreate(&stream[i]);
        // ...
    }

    ~Calc()
    {
        // ...
        for(int i=0; i<5; i++) cudaStreamDestroy(stream[i]);
        // ...
    }

    void compute()
    {
        kernel1<<<32, 32, 0, stream[0]>>>(...);
        for(int i=0; i<20; i++) // this 20 is a constant throughout the program
        {
            kernel2<<<1, 32, 0, stream[1]>>>(...);
            kernel3<<<1, 32, 0, stream[2]>>>(...);
            kernel4<<<1, 32, 0, stream[3]>>>(...);
            kernel5<<<1, 32, 0, stream[4]>>>(...);
            // ?? synchronisation ??
            kernel6<<<1, 32, 0, stream[1]>>>(...);
        }
    }
}

int main()
{
    // preparation

    Calc C;

    // run compute-heavy function as many times as needed
    for(int i=0; i<100; i++)
    {
        C.compute();
    }

    // ...

    return 0;
}

Note: the amount of blocks, threads and shared memory are just arbitrary numbers.

Now, how would I go about properly synchronising kernels 2–5 every iteration? For one, I don't know which of the kernels will take the longest to complete, as this may depend on user input. Furthermore, I've tried using cudaDeviceSynchronize() and cudaStreamSynchronize(), but those more than trebled the total execution time.

Are CUDA events perhaps the way to go? If so, how should I apply them? If not, what would be the proper way to do this?

Thank you very much.

Solution

There are two comments that need to be made first.

  1. Launching small kernels (one block) is generally not the way to get good performance out of the GPU. Likewise kernels with a small number of threads per block (32) will generally impose an occupancy limit which will prevent full performance from the GPU. Launching multiple concurrent kernels doesn't mitigate this second consideration. I'll not spend any further time here since you've said the numbers are arbitrary (but see the next comment below).
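To make comment #1 concrete, the CUDA runtime's occupancy API can report how many blocks of a given kernel can be resident per SM. The following is an illustrative sketch only; the kernel and its launch configuration are placeholders, not taken from the question:

```cuda
#include <cstdio>

// Placeholder kernel standing in for one of the small kernels.
__global__ void smallKernel(int *data) { if (data) data[threadIdx.x] = threadIdx.x; }

int main()
{
    int numBlocksPerSm = 0;
    // For a 32-thread block with no dynamic shared memory, ask how many
    // such blocks a single SM can hold concurrently.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, smallKernel, 32, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // A 32-thread block occupies one warp slot; comparing resident blocks
    // against the SM's warp capacity shows the occupancy ceiling.
    printf("resident blocks/SM: %d, warp capacity/SM: %d\n",
           numBlocksPerSm, prop.maxThreadsPerMultiProcessor / 32);
    return 0;
}
```

A single-block launch can at best occupy one SM, and a 32-thread block occupies only one warp slot on it, which is why such launches leave most of the GPU idle regardless of concurrency.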

  2. Witnessing actual kernel concurrency is hard. We need kernels with a relatively long execution time but a relatively low demand on GPU resources. A kernel launched with <<<32,32>>> could possibly fill the GPU you are running on, leaving no room for blocks from a concurrent kernel to run.

Your question seems to boil down to "how do I prevent kernel6 from starting until kernel2-5 are finished?"

It's possible to use events for this. Basically, you would record an event into each stream, after the kernel2-5 launches, and you would put a cudaStreamWaitEvent call, one for each of the 4 events, prior to the launch of kernel6.

Like so:

        kernel2<<<1, 32, 0, stream[1]>>>(...);
        cudaEventRecord(event1, stream[1]);
        kernel3<<<1, 32, 0, stream[2]>>>(...);
        cudaEventRecord(event2, stream[2]);
        kernel4<<<1, 32, 0, stream[3]>>>(...);
        cudaEventRecord(event3, stream[3]);
        kernel5<<<1, 32, 0, stream[4]>>>(...);
        cudaEventRecord(event4, stream[4]);
        // ?? synchronisation ??
        cudaStreamWaitEvent(stream[1], event1);
        cudaStreamWaitEvent(stream[1], event2);
        cudaStreamWaitEvent(stream[1], event3);
        cudaStreamWaitEvent(stream[1], event4);
        kernel6<<<1, 32, 0, stream[1]>>>(...);

Note that all of the above calls are asynchronous. None of them should take more than a few microseconds to process, and none of them will block the CPU thread from continuing, unlike your usage of cudaDeviceSynchronize() or cudaStreamSynchronize(), which generally will block the CPU thread.

As a result, you may want some kind of synchronization after the above sequence (e.g. cudaStreamSynchronize(stream[1]);) when it is performed in a loop, or else the asynchronous nature of all this will get hairy to figure out (plus, based on your schematic diagram, it seems you probably don't want kernel2-5 of iteration i+1 to begin until kernel6 of iteration i is finished?). Note that I've left out event creation and perhaps other boilerplate for this; I'm assuming you can figure that out, refer to any of the sample codes that use events, or refer to the documentation.
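The omitted boilerplate might look like the sketch below (event names match the snippet above; where the events live, e.g. in the Calc constructor and destructor, is an assumption on my part). cudaEventDisableTiming is used because the events serve purely for cross-stream ordering, not timing:

```cuda
// One-time setup (e.g. in the Calc constructor). Disabling timing removes
// the overhead of recording timestamps, since only ordering is needed.
cudaEvent_t event1, event2, event3, event4;
cudaEventCreateWithFlags(&event1, cudaEventDisableTiming);
cudaEventCreateWithFlags(&event2, cudaEventDisableTiming);
cudaEventCreateWithFlags(&event3, cudaEventDisableTiming);
cudaEventCreateWithFlags(&event4, cudaEventDisableTiming);

// Inside compute(): the record/wait sequence from above, plus one blocking
// synchronisation per iteration so that kernel2-5 of iteration i+1 cannot
// overlap kernel6 of iteration i.
for (int i = 0; i < 20; i++)
{
    // ... kernel2-5 launches, cudaEventRecord calls,
    //     cudaStreamWaitEvent calls, kernel6 launch, as shown above ...
    cudaStreamSynchronize(stream[1]); // blocks the CPU until kernel6 is done
}

// One-time teardown (e.g. in the destructor).
cudaEventDestroy(event1);
cudaEventDestroy(event2);
cudaEventDestroy(event3);
cudaEventDestroy(event4);
```

Synchronising only stream[1] is sufficient at the iteration boundary: kernel6 runs in stream[1] and already waits on all four events, so its completion implies kernel2-5 have completed too.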

And even if you implement all this infrastructure, your ability to witness (or not) actual kernel concurrency will be dictated by your kernels themselves, not by anything I've suggested in this answer. So if you come back and say "I did that, but my kernels are not running concurrently", that is actually a different question than the one you have posed here, and I would refer you for starters to comment #2 above.
