Slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice


Problem Description

I understand that CUDA performs initialization during the first API call, but the time spent is just too much, even after a separate cudaSetDevice call.

The test program:

The same program was built with CUDA 7.0 (compute_35) + Visual Studio 2012 + NSight 4.5, then run on two separate machines (no rebuilding).

Before the first cudaMalloc, I called cudaSetDevice.

On my PC (Win7 + Tesla K20), the first cudaMalloc takes 150 ms.

On my server (Win2012 + Tesla K40), it takes 1100 ms!

On both machines, subsequent cudaMalloc calls are much faster.
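
The question does not include the test program itself; a minimal sketch of how such a measurement might look (device 0 and the 1 MB allocation size are arbitrary illustrative choices, not taken from the question):

    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    int main()
    {
        using clock = std::chrono::steady_clock;
        using ms = std::chrono::milliseconds;

        cudaSetDevice(0);                 // called before the first cudaMalloc

        void *a = nullptr, *b = nullptr;

        auto t0 = clock::now();
        cudaMalloc(&a, 1 << 20);          // 1st cudaMalloc: pays the remaining init cost
        auto t1 = clock::now();
        cudaMalloc(&b, 1 << 20);          // subsequent cudaMalloc: much faster
        auto t2 = clock::now();

        printf("1st cudaMalloc: %lld ms\n",
               (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
        printf("2nd cudaMalloc: %lld ms\n",
               (long long)std::chrono::duration_cast<ms>(t2 - t1).count());

        cudaFree(a);
        cudaFree(b);
        return 0;
    }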

My questions are:

1. Why does the K40 take much longer (1100 ms vs 150 ms) for the first cudaMalloc, given that the K40 is supposed to be better than the K20?

2. I thought cudaSetDevice could capture the init time? E.g., this answer from talonmies.

3. If the initialization is unavoidable, can process A maintain its status (or context) on the GPU while process B is running on the same GPU? I understand I should run the GPU in "exclusive" mode, but can process A "suspend" so that it does not need to initialize the GPU again later?

Thanks in advance.

Recommended Answer


1. Why does the K40 take a much longer time (1100 ms vs 150 ms) for the first cudaMalloc, given that the K40 is supposed to be better than the K20?

The details of the initialization process are not specified; however, by observation, the amount of system memory affects initialization time. CUDA initialization usually includes the establishment of UVM, which involves harmonizing the device and host memory maps. If your server has more system memory than your PC, that is one possible explanation for the disparity in initialization time. The OS may have an effect as well, and finally the memory size of the GPU may have an effect.


2. I thought cudaSetDevice could capture the init time? E.g., this answer from talonmies.

The CUDA initialization process is a "lazy" initialization. That means just enough of the initialization process will be completed to support the requested operation. If the requested operation is cudaSetDevice, this may require less of the initialization to be complete (meaning the apparent time required may be shorter) than if the requested operation is cudaMalloc. So some of the initialization overhead may be absorbed into the cudaSetDevice operation, while additional initialization overhead is absorbed into a subsequent cudaMalloc operation.
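
The lazy-initialization behavior also suggests a way to pay the cost at a point of your choosing. A commonly used idiom (not something prescribed by this answer) is a do-nothing runtime call such as cudaFree(0), which forces full context creation; a minimal sketch with a hypothetical helper name:

    #include <cuda_runtime.h>

    void warm_up_cuda(int device)      // hypothetical helper name
    {
        cudaSetDevice(device);         // may absorb part of the init cost
        cudaFree(0);                   // no-op free that forces the context to be
                                       // fully created here, so a later cudaMalloc
                                       // no longer pays the one-time init cost
    }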


3. If the initialization is unavoidable, can process A maintain its status (or context) on the GPU while process B is running on the same GPU? I understand I should run the GPU in "exclusive" mode, but can process A "suspend" so that it does not need to initialize the GPU again later?

Independent host processes generally spawn independent CUDA contexts. A CUDA context carries the initialization requirement with it, so the fact that another, separate CUDA context may already be initialized on the device will not provide much benefit if a new CUDA context must be initialized (perhaps from a separate host process). Normally, keeping a process active simply means keeping an application running in that process. Applications have various mechanisms to "sleep" or suspend behavior. As long as the application has not terminated, any context established by that application should not require re-initialization (except, perhaps, if cudaDeviceReset is called).
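
A sketch of the "keep the application alive" idea (the idle loop is purely illustrative; any mechanism that keeps the process from terminating would do):

    #include <cuda_runtime.h>
    #include <chrono>
    #include <thread>

    int main()
    {
        cudaSetDevice(0);
        cudaFree(0);   // pay the initialization cost once, up front

        // As long as this process stays alive, the CUDA context it
        // established remains valid; sleeping does not destroy it.
        // (Calling cudaDeviceReset() would destroy it, as noted above.)
        for (;;) {
            // ... perform GPU work when it arrives ...
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }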

In general, some benefit may be obtained on systems that allow the GPUs to go into a deep idle mode by setting GPU persistence mode (using nvidia-smi). However, this is not relevant for GeForce GPUs, nor is it generally relevant on a Windows system.
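
For example, on a Linux system where it applies (requires administrative privileges; not applicable to GeForce GPUs or, generally, to Windows, as noted above):

    nvidia-smi -pm 1    # enable persistence mode
    nvidia-smi -pm 0    # disable it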

Additionally, on multi-GPU systems, if the application does not need multiple GPUs, some initialization time can usually be avoided by using the CUDA_VISIBLE_DEVICES environment variable to restrict the CUDA runtime to using only the necessary devices.
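
For illustration (myapp is a hypothetical executable name):

    :: Windows (cmd): expose only GPU 0 to the CUDA runtime
    set CUDA_VISIBLE_DEVICES=0
    myapp.exe

    # Linux shell equivalent
    CUDA_VISIBLE_DEVICES=0 ./myapp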

