cudaGetCacheConfig takes 0.5 seconds - how/why?


Question

I'm using CUDA 8.0 on a Xeon-based system with a GTX Titan X (GM200). It works fine, but I see long overheads compared to my weaker GTX 600 series card at home. Specifically, when I look at a profiling timeline, I find that a call to cudaGetCacheConfig() consistently costs the CUDA runtime API an incredible amount of time: 530-560 msec, or over 0.5 seconds. Other calls take nowhere near as long. For example, cuDeviceGetTotalMem takes 0.7 msec (still quite a bit of time, but orders of magnitude less), and cuDeviceGetAttribute (which is probably limited to host-side code only) takes 0.031 msec.

Why is this happening? Or rather, how is that even possible? And can I do anything to ameliorate the situation?

Notes:

  • cudaGetCacheConfig() is called after cudaGetDeviceCount(), but probably (not 100% certain) not before any other runtime API calls.
  • If I add a cudaGetDeviceProperties() call before the cudaGetCacheConfig() call, the former takes ~0.6 msec and the latter still takes over 0.5 sec (581 msec in my last measurement).
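
The per-call measurements described above can be reproduced without a profiler. The following is a minimal host-side sketch, not the asker's code. It assumes that device 0 is the GPU of interest and that the call the question names cudaGetCacheConfig() is the runtime API function cudaDeviceGetCacheConfig(); the helpers time_ms and report are hypothetical names introduced here.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Run `call` once and return its wall-clock duration in milliseconds.
template <typename F>
double time_ms(F&& call) {
    const auto start = std::chrono::steady_clock::now();
    call();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Print the duration and status string of one timed runtime API call.
void report(const char* name, cudaError_t status, double ms) {
    std::printf("%-26s %10.3f ms  (%s)\n", name, ms, cudaGetErrorString(status));
}

int main() {
    int device_count = 0;
    cudaDeviceProp props{};
    cudaFuncCache cache_config{};
    cudaError_t status = cudaSuccess;

    // Same order as in the notes: device count, properties, then cache config.
    double ms = time_ms([&] { status = cudaGetDeviceCount(&device_count); });
    report("cudaGetDeviceCount", status, ms);

    // Assumption: device 0 exists and is the GPU being investigated.
    ms = time_ms([&] { status = cudaGetDeviceProperties(&props, 0); });
    report("cudaGetDeviceProperties", status, ms);

    // Assumption: this is the call the question refers to as cudaGetCacheConfig().
    ms = time_ms([&] { status = cudaDeviceGetCacheConfig(&cache_config); });
    report("cudaDeviceGetCacheConfig", status, ms);

    return 0;
}
```

Built with nvcc (or a host compiler linked against the CUDA runtime), this prints one line per call, which makes it easy to see which call absorbs the large one-time cost on a given machine.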

Recommended answer

TL;DR: CUDA lazy initialization (as @RobertCrovella suggested).

@RobertCrovella explains in the dupe bug report:

CUDA initialization usually includes establishment of UVM, which involves harmonizing of device and host memory maps. If your server has more system memory than your PC, it is one possible explanation for the disparity in initialization time. The OS may have an effect as well, finally the memory size of the GPU may have an effect.

The machine on which I get this behavior has 256 GB of memory, 32 times more than my home machine, and the GPU itself has 12 GB, 4 times more than the GPU in my home machine. This means I can, unfortunately, expect initialization of the CUDA driver and/or runtime API to take much longer than on my home machine. Some or all of this initialization is performed lazily, which in my case happens to be when cudaGetCacheConfig() is called; I suppose the other calls only require some of the initialization (it's not clear why, though).
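
If lazy initialization is indeed the cause, the cost should move to whichever call first forces full initialization. A commonly used idiom is to call cudaFree() with a null pointer at startup, which does no real work but triggers context creation. The sketch below illustrates that idea under the same assumption as before, namely that the call in question is cudaDeviceGetCacheConfig(); now_ms is a hypothetical helper. Note that this only relocates the initialization cost to a predictable point. It does not make it smaller.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Current reading of a monotonic clock, in milliseconds.
static double now_ms() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double, std::milli>(
               clock::now().time_since_epoch()).count();
}

int main() {
    const double t0 = now_ms();

    // Freeing a null pointer is a no-op, but it forces the runtime to
    // finish its lazy initialization (context creation) here.
    cudaError_t status = cudaFree(nullptr);
    const double t1 = now_ms();
    if (status != cudaSuccess) {
        std::fprintf(stderr, "initialization failed: %s\n",
                     cudaGetErrorString(status));
        return 1;
    }

    // With initialization already done, this call should now be cheap.
    cudaFuncCache cache_config{};
    status = cudaDeviceGetCacheConfig(&cache_config);
    const double t2 = now_ms();
    if (status != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceGetCacheConfig failed: %s\n",
                     cudaGetErrorString(status));
        return 1;
    }

    std::printf("forced initialization:    %10.3f ms\n", t1 - t0);
    std::printf("cudaDeviceGetCacheConfig: %10.3f ms\n", t2 - t1);
    return 0;
}
```

If the second figure drops to a small fraction of a millisecond while the first stays around half a second, that supports the lazy-initialization explanation.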
