Any particular function to initialize GPU other than the first cudaMalloc call?
Question
The first cudaMalloc call is slow (around 0.2 s) because of initialization work on the GPU. Is there any function that solely does the initialization, so that I can separate out that time? cudaSetDevice seems to reduce the time to 0.15 s, but it still does not eliminate all of the init overhead.
Answer
Calling
cudaFree(0);
is the canonical way to force lazy context establishment in the CUDA runtime. You can't reduce the overhead itself; it is a function of driver, runtime, and operating-system latencies. But the call above lets you control how and when those overheads occur during program execution.
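A minimal sketch of the idea, assuming you just want to measure where the init cost lands (the timing code and sizes here are illustrative, not from the original answer):

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;

    // Force lazy context establishment up front; this first CUDA call
    // pays the driver/runtime/OS initialization cost.
    auto t0 = clock::now();
    cudaFree(0);
    auto t1 = clock::now();

    // Subsequent allocations no longer carry the init overhead.
    void *p = nullptr;
    cudaMalloc(&p, 1 << 20);
    auto t2 = clock::now();
    cudaFree(p);

    printf("context init: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    printf("cudaMalloc:   %.3f ms\n",
           std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}
```

Placing the `cudaFree(0)` at program start-up moves the one-time overhead out of whatever code path you actually care about timing.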
EDIT in 2015: the heuristics of context initialisation in the runtime API have subtly changed over time, so that cudaSetDevice now establishes a context. The cudaFree() call therefore isn't explicitly required to initialise a context; you can use cudaSetDevice instead. Also note that some set-up time will still be incurred at the first kernel launch, whereas before this wasn't the case. For kernel timing, it is best to include a warm-up call before launching the kernel you will time, to remove this set-up latency. It appears that the various profiling tools have enough granularity built in to avoid this without any extra API calls or kernel calls.
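The warm-up pattern described above might look like the following sketch; the `scale` kernel, grid sizes, and buffer size are hypothetical, chosen only to illustrate the warm-up-then-time structure:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel used only for illustration.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // Warm-up launch: absorbs the one-time set-up cost of the first
    // kernel launch so it doesn't pollute the timed run.
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Timed launch, using CUDA events for device-side timing.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Comparing the warm-up launch's wall time against the event-timed launch makes the set-up latency visible directly.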