Linking with 3rd party CUDA libraries slows down cudaMalloc

Question

It is no secret that on CUDA 4.x the first call to cudaMalloc can be ridiculously slow (which has been reported several times), seemingly a bug in the CUDA drivers.

Recently, I noticed some weird behaviour: the running time of cudaMalloc depends directly on how many 3rd-party CUDA libraries I linked to my program (note that I do NOT use these libraries, I just link my program against them).

I ran some tests using the following program:

#include <cuda_runtime.h>

int main() {
  cudaSetDevice(0);
  unsigned int *ptr = 0;
  cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
  cudaFree(ptr);
  return 0;
}

Here are the results:

  • Linked with -lcudart -lnpp -lcufft -lcublas -lcusparse -lcurand: running time 5.852449

  • Linked with -lcudart -lnpp -lcufft -lcublas: running time 1.425120

  • Linked with -lcudart -lnpp -lcufft: running time 0.905424

  • Linked with -lcudart: running time 0.394558

According to gdb, the time indeed goes into my cudaMalloc call, so it is not caused by some library initialization routine.

I wonder if somebody has a plausible explanation for this?

Answer

In your example, the cudaMalloc call initiates lazy context establishment on the GPU. When runtime API libraries are linked in, their binary payloads have to be inspected and the GPU ELF symbols and objects they contain merged into the context. The more libraries there are, the longer you can expect that process to take. Furthermore, if there is an architecture mismatch in any of the cubins and you have a backwards-compatible GPU, it can also trigger driver recompilation of the device code for the target GPU. In a very extreme case, I have seen an old application linked with an old version of CUBLAS take tens of seconds to load and initialise when run on a Fermi GPU.

You can explicitly force lazy context establishment up front by issuing a cudaFree call like this:

#include <cuda_runtime.h>

int main() {
  cudaSetDevice(0);
  cudaFree(0); // context establishment happens here
  unsigned int *ptr = 0;
  cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
  cudaFree(ptr);
  return 0;
}

If you profile or instrument this version with timers, you should find that the first cudaFree call consumes most of the runtime and the cudaMalloc call becomes almost free.
