Poor performance when calling cudaMalloc with 2 GPUs simultaneously


Problem Description


I have an application where I split the processing load among the GPUs on a user's system. Basically, there is a CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread.

Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval -- here the application is using a single GPU.

As you can see, a big portion of the GPU processing time is consumed by the two sorting operations, for which I am using the Thrust library (thrust::sort_by_key). Also, it looks like thrust::sort_by_key makes a few cudaMalloc calls under the hood before it starts the actual sort.
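For reference, a minimal sketch of the kind of call involved (the key/value types here are assumptions, not the actual application data). The hidden temporary allocations happen inside the Thrust call itself:

    // Sketch only: thrust::sort_by_key allocates scratch device memory
    // internally with cudaMalloc, which is what appears in the profiler
    // timeline before the sort kernels launch.
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    void sort_interval(thrust::device_vector<unsigned int>& keys,
                       thrust::device_vector<float>& values)
    {
        // Hidden cudaMalloc/cudaFree occur inside this call for temp storage.
        thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    }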

Now consider the same processing interval where the application has spread the processing load over two GPUs:

In a perfect world you would expect the 2 GPU processing interval to be exactly half that of the single GPU (because each GPU is doing half the work). As you can see, this is not the case, partially because the cudaMallocs seem to take longer when they are called simultaneously (sometimes 2-3 times longer) due to some sort of contention issue. I don't see why this needs to be the case, because the memory allocation spaces for the 2 GPUs are completely independent, so there shouldn't be a system-wide lock on cudaMalloc -- a per-GPU lock would be more reasonable.

To prove my hypothesis that the issue is with simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (one for each GPU), each calling cudaMalloc several times. I first ran this program so that the separate threads do not call cudaMalloc at the same time:
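A sketch of that kind of test program is shown below (the allocation size and iteration count are assumptions, not the exact values used in the original test). Each CPU thread binds to one GPU and times a series of cudaMalloc calls; launching the threads concurrently versus joining the first before starting the second exposes the difference between the two cases:

    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    void malloc_loop(int device, int iterations, size_t bytes)
    {
        cudaSetDevice(device);                   // bind this thread to one GPU

        auto start = std::chrono::high_resolution_clock::now();
        std::vector<void*> ptrs(iterations);
        for (int i = 0; i < iterations; ++i)
            cudaMalloc(&ptrs[i], bytes);         // the call being timed
        auto stop = std::chrono::high_resolution_clock::now();

        double us = std::chrono::duration<double, std::micro>(stop - start).count();
        printf("GPU %d: %.1f us per cudaMalloc\n", device, us / iterations);

        for (void* p : ptrs) cudaFree(p);
    }

    int main()
    {
        const int    iterations = 100;
        const size_t bytes      = 4 * 1024 * 1024;   // arbitrary 4 MB

        // Concurrent case: both threads call cudaMalloc at the same time.
        // (Joining t0 before starting t1 gives the serial baseline instead.)
        std::thread t0(malloc_loop, 0, iterations, bytes);
        std::thread t1(malloc_loop, 1, iterations, bytes);
        t0.join();
        t1.join();
        return 0;
    }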

You see it takes ~175 microseconds per allocation. Next, I ran the program with the threads calling cudaMalloc simultaneously:

Here, each call took ~538 microseconds, or about 3 times longer than in the previous case! Needless to say, this is slowing down my application tremendously, and it stands to reason that the issue would only get worse with more than 2 GPUs.

I have noticed this behavior on both Linux and Windows. On Linux, I am using Nvidia driver version 319.60, and on Windows I am using version 327.23. I am using CUDA toolkit 5.5.

Possible Reason: I am using a GTX 690 in these tests. This card is basically 2 680-like GPUs housed in the same unit. This is the only "multi-GPU" setup I've run, so perhaps the cudaMalloc issue has something to do with some hardware dependence between the 690's 2 GPUs?

Solution

To summarize the problem and give a possible solution:

The cudaMalloc contention probably stems from driver-level contention (possibly due to the need to switch device contexts, as talonmies suggested), and one could avoid this extra latency in performance-critical sections by cudaMalloc-ing temporary buffers beforehand.
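A minimal sketch of that pre-allocation idea follows (the struct and function names are purely illustrative): each per-GPU thread allocates its scratch memory once during setup, so no cudaMalloc calls, and therefore none of the contention above, occur inside the timed processing interval.

    #include <cuda_runtime.h>

    struct GpuScratch {
        void*  buffer = nullptr;
        size_t bytes  = 0;
    };

    // Called once per GPU thread during initialization.
    void init_scratch(int device, GpuScratch& s, size_t bytes)
    {
        cudaSetDevice(device);
        s.bytes = bytes;
        cudaMalloc(&s.buffer, bytes);   // pay the allocation cost up front
    }

    // Performance-critical interval: reuse s.buffer, never cudaMalloc here.
    void process_interval(int device, GpuScratch& s)
    {
        cudaSetDevice(device);
        // ... launch kernels / sorts that use s.buffer as temporary storage ...
    }

    // Called once per GPU thread during shutdown.
    void free_scratch(GpuScratch& s)
    {
        cudaFree(s.buffer);
        s.buffer = nullptr;
    }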

It looks like I probably need to refactor my code so that I am not calling any sorting routine that calls cudaMalloc under the hood (in my case thrust::sort_by_key). The CUB library looks promising in this regard. As a bonus, CUB also exposes a CUDA stream parameter to the user, which could also serve to boost performance.

See CUB (CUDA UnBound) equivalent of thrust::gather for some details on moving from thrust to CUB.

UPDATE:

I backed out the calls to thrust::sort_by_key in favor of cub::DeviceRadixSort::SortPairs.
Doing this shaved milliseconds off my per-interval processing time. Also the multi-GPU contention issue has resolved itself -- offloading to 2 GPUs almost drops the processing time by 50%, as expected.
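For completeness, here is a sketch of the CUB call pattern (buffer names and key/value types are assumptions for illustration). CUB uses a two-pass idiom: the first call, with a null temp-storage pointer, only reports how much temporary storage is needed, so that allocation can be done ahead of the performance-critical interval, and the sort itself can be issued on a per-GPU stream.

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    void sort_pairs_cub(const unsigned int* d_keys_in,   unsigned int* d_keys_out,
                        const float*        d_values_in, float*        d_values_out,
                        int num_items, cudaStream_t stream,
                        void*& d_temp_storage, size_t& temp_storage_bytes)
    {
        if (d_temp_storage == nullptr) {
            // Query pass: with a null temp pointer, SortPairs only fills in
            // temp_storage_bytes. Do this (and the cudaMalloc) during setup.
            cub::DeviceRadixSort::SortPairs(nullptr, temp_storage_bytes,
                                            d_keys_in, d_keys_out,
                                            d_values_in, d_values_out,
                                            num_items, 0, sizeof(unsigned int) * 8,
                                            stream);
            cudaMalloc(&d_temp_storage, temp_storage_bytes);
        }

        // Sort pass: no hidden allocations, runs asynchronously on 'stream'.
        cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                        d_keys_in, d_keys_out,
                                        d_values_in, d_values_out,
                                        num_items, 0, sizeof(unsigned int) * 8,
                                        stream);
    }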

