CUDA streams and context


Question

I am using an application presently that spawns a bunch of pthreads (linux), and each of those creates its own CUDA context. (Using CUDA 3.2 right now.)

The problem I am having is that it seems like each thread having its own context costs a lot of memory on the GPU. Something like 200MB per thread, so this is really limiting me.

Can I simply create streams in the host thread, pass the stream references to the worker threads, which would then be able to pass their stream handles to my CUDA library, and have all of them work out of the same context?

Does a worker thread automatically share the same CUDA context as its parent thread?

Thanks

Answer

Each CUDA context does cost quite a bit of device memory, and their resources are strictly partitioned from one another. For example, device memory allocated in context A cannot be accessed by context B. Streams also are valid only in the context in which they were created.
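A minimal sketch of this partitioning, using the CUDA driver API (untested here, since it assumes a CUDA-capable GPU and the CUDA toolkit; error checking omitted for brevity):

```c
#include <cuda.h>

int main(void) {
    CUdevice dev;
    CUcontext ctxA, ctxB;
    CUdeviceptr ptrA;
    CUstream streamA;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    cuCtxCreate(&ctxA, 0, dev);   /* ctxA is now current on this thread */
    cuMemAlloc(&ptrA, 1 << 20);   /* this allocation belongs to ctxA */
    cuStreamCreate(&streamA, 0);  /* this stream belongs to ctxA */

    cuCtxCreate(&ctxB, 0, dev);   /* ctxB is now current instead */
    /* Dereferencing ptrA or submitting work to streamA here would fail:
       both are valid only while ctxA is the current context. */

    cuCtxDestroy(ctxB);
    cuCtxDestroy(ctxA);
    return 0;
}
```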

最佳实践是为每个设备创建一个CUDA上下文。默认情况下,该CUDA上下文只能从创建它的CPU线程访问。如果要从其他线程访问CUDA上下文,请调用cuCtxPopCurrent()从创建它的线程中弹出它。上下文可以被推送到任何其他CPU线程的当前上下文栈,并且随后的CUDA调用将引用该上下文。

The best practice would be to create one CUDA context per device. By default, that CUDA context can be accessed only from the CPU thread that created it. If you want to access the CUDA context from other threads, call cuCtxPopCurrent() to pop it from the thread that created it. The context then can be pushed onto any other CPU thread's current context stack, and subsequent CUDA calls would reference that context.
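A sketch of that initialization pattern (untested, GPU required; `shared_ctx` and the function names are illustrative, not part of any API):

```c
#include <cuda.h>

CUcontext shared_ctx;                  /* one context for the whole process */

void init_cuda(void) {                 /* call once from the main thread */
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&shared_ctx, 0, dev);  /* context is current here...      */
    cuCtxPopCurrent(&shared_ctx);      /* ...pop it so no thread owns it  */
}

void worker_attach(void) {
    cuCtxPushCurrent(shared_ctx);      /* subsequent CUDA calls on this
                                          thread now target shared_ctx    */
}

void worker_detach(void) {
    cuCtxPopCurrent(NULL);             /* release it for other threads    */
}
```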

Context push/pop are lightweight operations, and as of CUDA 3.2 they can be done in CUDA runtime apps. So my suggestion would be to initialize the CUDA context, then call cuCtxPopCurrent() to make the context "floating" until some thread wants to operate on it. Consider the "floating" state to be the natural one - whenever a thread wants to manipulate the context, bracket its usage with cuCtxPushCurrent()/cuCtxPopCurrent().
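One way to implement that bracketing in the pthread setting from the question (a sketch, untested; the mutex is my addition, since only one thread may have a given context current at a time):

```c
#include <cuda.h>
#include <pthread.h>

static CUcontext g_ctx;  /* the floating context, created and popped at init */
static pthread_mutex_t g_ctx_lock = PTHREAD_MUTEX_INITIALIZER;

/* Run a unit of GPU work with g_ctx temporarily bound to this thread. */
static void with_context(void (*work)(void *), void *arg) {
    pthread_mutex_lock(&g_ctx_lock);  /* one thread owns the context at a time */
    cuCtxPushCurrent(g_ctx);          /* bind the floating context             */
    work(arg);                        /* CUDA calls in work() see g_ctx        */
    cuCtxPopCurrent(NULL);            /* unbind: the context floats again      */
    pthread_mutex_unlock(&g_ctx_lock);
}
```

Streams created inside such a bracket belong to `g_ctx`, so any worker that later pushes the same context can use them - which answers the original question: yes, one context plus per-worker streams works, as long as each worker brackets its CUDA calls this way.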

