Linking with 3rd party CUDA libraries slows down cudaMalloc


Question

It is not a secret that on CUDA 4.x the first call to cudaMalloc can be ridiculously slow (which was reported several times), seemingly a bug in CUDA drivers.

Recently, I noticed weird behaviour: the running time of cudaMalloc directly depends on how many 3rd-party CUDA libraries I link to my program (note that I do NOT use these libraries, just link my program with them).

I ran some tests using the following program:

#include <cuda_runtime.h>

int main() {
  cudaSetDevice(0);
  unsigned int *ptr = 0;
  cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
  cudaFree(ptr);
  return 0;
}

Here are the results:

  • Linked with: -lcudart -lnpp -lcufft -lcublas -lcusparse -lcurand, running time: 5.852449
  • Linked with: -lcudart -lnpp -lcufft -lcublas, running time: 1.425120
  • Linked with: -lcudart -lnpp -lcufft, running time: 0.905424
  • Linked with: -lcudart, running time: 0.394558

According to gdb, the time indeed goes into my cudaMalloc, so it's not caused by some library initialization routine.

Answer

In your example, the cudaMalloc call initiates lazy context establishment on the GPU. When runtime API libraries are included, their binary payloads have to be inspected and the GPU ELF symbols and objects they contain merged into the context. The more libraries there are, the longer you can expect the process to take. Further, if there is an architecture mismatch in any of the cubins and you have a backwards-compatible GPU, it can also trigger driver recompilation of the device code for the target GPU. In a very extreme case, I have seen an old application linked with an old version of CUBLAS take tens of seconds to load and initialise when run on a Fermi GPU.

You can explicitly force lazy context establishment by issuing a cudaFree call like this:

#include <cuda_runtime.h>

int main() {
    cudaSetDevice(0);
    cudaFree(0); // context establishment happens here
    unsigned int *ptr = 0;
    cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    cudaFree(ptr);
    return 0;
}

If you profile or instrument this version with timers you should find that the first cudaFree call consumes most of the runtime and the cudaMalloc call becomes almost free.
