cudaGetCacheConfig takes 0.5 seconds - how/why?


Problem Description


I'm using CUDA 8.0 on a Xeon-based system with a GTX Titan X (GM 200). It works fine, but I get long overheads compared to my weak GTX 600-series card at home. Specifically, when I examine the profiler timeline I find that a call to cudaGetCacheConfig() consistently takes the CUDA runtime API an incredible amount of time: 530-560 msec, or over 0.5 seconds. This, while other calls don't take nearly as much. For example, cuDeviceGetTotalMem takes 0.7 msec (also quite a bit of time, but an order of magnitude less), and cuDeviceGetAttribute (which is probably limited to host-side code only) takes 0.031 msec.

Why is this happening? Or rather - how could that be possible? And can I do anything to ameliorate this situation?

Notes:

  • The cudaGetCacheConfig() gets called after cudaGetDeviceCount(), but probably (not 100% certain) not before any other runtime API calls.
  • If I prepend a cudaGetDeviceProperties() call before the cudaGetCacheConfig() call, the former takes ~0.6 msec and the latter still takes over 0.5 sec (581 msec in my last measurement).
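
For reference, here is a minimal sketch of how such per-call timings could be reproduced with a host-side clock. This is only an assumption about the measurement setup (the figures above came from a profiler timeline), and it uses cudaDeviceGetCacheConfig() as a stand-in for the cudaGetCacheConfig() name shown in that timeline, since that is the documented runtime entry point for querying the cache configuration:

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Measure the wall-clock time of a single call, in milliseconds.
    template <typename F>
    static double time_ms(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        int count = 0;
        std::printf("cudaGetDeviceCount:       %8.3f ms\n",
                    time_ms([&] { cudaGetDeviceCount(&count); }));

        // On the system described above, this is the call that absorbs ~0.5 s.
        cudaFuncCache cache_config;
        std::printf("cudaDeviceGetCacheConfig: %8.3f ms\n",
                    time_ms([&] { cudaDeviceGetCacheConfig(&cache_config); }));

        cudaDeviceProp prop;
        std::printf("cudaGetDeviceProperties:  %8.3f ms\n",
                    time_ms([&] { cudaGetDeviceProperties(&prop, 0); }));
        return 0;
    }

Built with nvcc -std=c++11, whichever of these calls happens to trigger the runtime's lazy initialization is the one that will show the half-second cost.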

Solution

TL;DR: CUDA lazy initialization (as @RobertCrovella suggests).

@RobertCrovella explains on the duplicate question:

CUDA initialization usually includes establishment of UVM, which involves harmonizing of device and host memory maps. If your server has more system memory than your PC, it is one possible explanation for the disparity in initialization time. The OS may have an effect as well, finally the memory size of the GPU may have an effect.

The machine on which I get this behavior has 256 GB of system memory, 32 times more than my home machine; and the GPU itself has 12 GB, 4 times more than the GPU in my home machine. This means I can, unfortunately, expect much longer initialization of the CUDA driver and/or runtime API than on my home machine. Some or all of this initialization is performed lazily, and in my case it happens when cudaGetCacheConfig() is called; I suppose the other calls only require part of the initialization (it's not clear why, though).
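
One standard way to work around this (not part of the answer above, just the usual idiom) is to trigger the lazy initialization deliberately at program start, so the cost is paid at a predictable point rather than inside whichever runtime call happens to hit it first:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Force context creation / lazy runtime initialization up front.
        // cudaFree(0) is the classic no-op that establishes the context;
        // cudaSetDevice(0) just makes the device choice explicit beforehand.
        cudaSetDevice(0);
        cudaError_t err = cudaFree(0);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "CUDA initialization failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }

        // From here on, calls such as cudaDeviceGetCacheConfig() should no
        // longer absorb the half-second initialization cost observed above.
        return 0;
    }

Note that this does not make the initialization itself any faster; it only moves the ~0.5 seconds to a point of your choosing.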
