Linking with 3rd party CUDA libraries slows down cudaMalloc
Question
It is not a secret that on CUDA 4.x the first call to cudaMalloc can be ridiculously slow (this has been reported several times), seemingly due to a bug in the CUDA drivers.
Recently, I noticed weird behaviour: the running time of cudaMalloc directly depends on how many 3rd-party CUDA libraries I linked to my program (note that I do NOT use these libraries, I just link my program with them).
I ran some tests using the following program:
int main() {
    cudaSetDevice(0);
    unsigned int *ptr = 0;
    cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    cudaFree(ptr);
    return 1;
}
The results:

- Linked with -lcudart -lnpp -lcufft -lcublas -lcusparse -lcurand: running time 5.852449
- Linked with -lcudart -lnpp -lcufft -lcublas: running time 1.425120
- Linked with -lcudart -lnpp -lcufft: running time 0.905424
- Linked with -lcudart: running time 0.394558
According to gdb, the time indeed goes into my cudaMalloc call, so it is not caused by some library initialization routine.
Answer
In your example, the cudaMalloc call initiates lazy context establishment on the GPU. When runtime API libraries are included, their binary payloads have to be inspected and the GPU ELF symbols and objects they contain merged into the context. The more libraries there are, the longer you can expect that process to take. Further, if there is an architecture mismatch in any of the cubins and you have a backwards-compatible GPU, it can also trigger driver recompilation of the device code for the target GPU. In a very extreme case, I have seen an old application linked with an old version of CUBLAS take tens of seconds to load and initialise when run on a Fermi GPU.
You can explicitly force lazy context establishment by issuing a cudaFree call like this:
int main() {
    cudaSetDevice(0);
    cudaFree(0); // context establishment happens here
    unsigned int *ptr = 0;
    cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    cudaFree(ptr);
    return 1;
}
If you profile or instrument this version with timers, you should find that the first cudaFree call consumes most of the runtime and the cudaMalloc call becomes almost free.