Creating a CUDA stream on each host thread (multi-threaded CPU)

Question

I have a multi-threaded CPU application, and I would like each CPU thread to be able to launch a separate CUDA stream. The separate CPU threads will be doing different things at different times, so there is a chance that they won't overlap, but if they do launch CUDA kernels at the same time I would like those kernels to run concurrently.

I'm pretty sure this is possible, because section 3.2.5.5 of the CUDA Toolkit documentation says: "A stream is a sequence of commands (possibly issued by different host threads)..."

So if I want to implement this, I would do something like:

void main(int CPU_ThreadID) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));
    cudaMallocHost((void**)&a, 100*8*sizeof(int));
    cudaMemcpyAsync(d_a, &a[100*CPU_ThreadID], 100*sizeof(int), cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);

    cudaStreamDestroy(stream);
}

That is just a simple example. If I know there are only 8 CPU Threads then I know at most 8 streams will be created. Is this the proper way to do this? Will this run concurrently if two or more different host threads reach this code around the same time? Thanks for any help!

EDIT:

I corrected some of the syntax issues in the code block and put in the cudaMemcpyAsync as sgar91 suggested.

Answer

It really looks to me like you are proposing a multi-process application, not a multithreaded one. You don't mention which threading architecture you have in mind, nor even an OS, but the threading architectures I know of don't posit a thread routine called "main", and you haven't shown any preamble to the thread code.

A multi-process environment will generally create one device context per process, which will inhibit fine-grained concurrency.

Even if that's just an oversight, I would point out that a multi-threaded application should establish a GPU context on the desired device before threads are spawned.

Each thread can then issue a cudaSetDevice(0); or similar call, which should cause each thread to pick up the established context on the indicated device.

Once that is in place, you should be able to issue commands to the desired streams from whichever threads you like.
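A minimal sketch of the pattern described above, assuming a single GPU (device 0), C++11 `std::thread` as the threading layer, and a trivial placeholder `sum` kernel (none of these specifics appear in the original post):

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder kernel (assumption): increment each of the 100 elements.
__global__ void sum(int *d_a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 100) d_a[i] += 1;
}

void workerThread(int tid, int *a) {
    cudaSetDevice(0);              // pick up the context established by the main thread
    cudaStream_t stream;
    cudaStreamCreate(&stream);     // one stream per host thread

    int *d_a;
    cudaMalloc((void**)&d_a, 100 * sizeof(int));
    cudaMemcpyAsync(d_a, &a[100 * tid], 100 * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    sum<<<100, 32, 0, stream>>>(d_a);
    cudaStreamSynchronize(stream); // wait for this thread's work only

    cudaFree(d_a);
    cudaStreamDestroy(stream);
}

int main() {
    cudaSetDevice(0);              // establish the GPU context before spawning threads
    cudaFree(0);                   // idiomatic no-op that forces context creation

    int *a;
    cudaMallocHost((void**)&a, 100 * 8 * sizeof(int)); // pinned memory, needed for async copies

    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back(workerThread, t, a);
    for (auto &th : threads)
        th.join();

    cudaFreeHost(a);
    return 0;
}
```

The key difference from the code in the question is the preamble: the context is created once in `main` before any worker thread starts, and each worker then attaches to it via `cudaSetDevice(0)` before creating its own stream.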

You may wish to refer to the cudaOpenMP sample code. Although it omits the streams concept, it demonstrates a multi-threaded app with the potential for multiple threads to issue commands to the same device (and it could be extended to issue commands to the same stream).

Whether or not kernels happen to run concurrently or not after the above issues have been addressed is a separate issue. Concurrent kernel execution has a number of requirements, and the kernels themselves must have compatible resource requirements (blocks, shared memory, registers, etc.), which generally implies "small" kernels.
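One quick way to check whether the device can overlap kernels from different streams at all is to query its properties; this sketch assumes device 0:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // concurrentKernels is 1 if the device can execute kernels from
    // different streams at the same time.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    // asyncEngineCount indicates how many copy engines are available
    // for overlapping memory transfers with kernel execution.
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}
```

Even when `concurrentKernels` is 1, actual overlap still depends on the kernels' combined resource usage fitting on the device at once.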
