CUDA performance penalty when running in Windows


Question

I've noticed a big performance hit when I run my CUDA application in Windows 7 (versus Linux). I think I may know where the slowdown occurs: for whatever reason, the Windows Nvidia driver (version 331.65) does not immediately dispatch a CUDA kernel when it is invoked via the runtime API. To illustrate the problem, I profiled the mergeSort application (from the examples that ship with CUDA 5.5).

Consider first the kernel launch time when running in Linux:

Next, consider the launch time when running in Windows:

A linked post suggests the problem might have something to do with the Windows driver batching the kernel launches. Is there any way I can disable this batching?

I am running with a GTX 690 GPU, Windows 7, and version 331.65 of the Nvidia driver.
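The launch times compared above came from a profiler. As a rough cross-check without one, the sketch below times an empty kernel launch plus a synchronize on the host. It is not the method used in the question: emptyKernel and the iteration count are made up for illustration, and because cudaDeviceSynchronize() itself forces pending work to be submitted, the figure reflects the round-trip launch overhead on a given driver rather than the batching delay itself.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so the measured time is
// dominated by launch and synchronization overhead.
__global__ void emptyKernel() {}

int main() {
    // Warm-up launch so that context creation is excluded from the timing.
    emptyKernel<<<1, 1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::fprintf(stderr, "warm-up launch failed\n");
        return 1;
    }

    const int iterations = 1000;  // arbitrary sample count (assumption)
    double totalUs = 0.0;
    for (int i = 0; i < iterations; ++i) {
        auto start = std::chrono::steady_clock::now();
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // block until the kernel has finished
        auto stop = std::chrono::steady_clock::now();
        totalUs +=
            std::chrono::duration<double, std::micro>(stop - start).count();
    }

    std::printf("mean launch + sync time: %.2f us\n", totalUs / iterations);
    return 0;
}
```

Building the same file on both operating systems (for example with nvcc) and comparing the printed means gives a coarse picture of the per-launch overhead on each driver.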

Recommended answer

There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.

As you've discovered, this means that under WDDM (only), GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can vary depending on what else is going on.

The best solution under Windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command. However, TCC is only supported on Tesla GPUs and certain members of the Quadro family of GPUs, i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)
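To see which driver model each GPU is currently using from inside a CUDA program, a minimal sketch along these lines may help. It relies on the tccDriver field of cudaDeviceProp from the CUDA runtime API; whether that field is available should be checked against the toolkit version in use, and the output wording is only illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "could not enumerate CUDA devices\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            std::fprintf(stderr, "could not query device %d\n", dev);
            continue;
        }
        // tccDriver is nonzero when the device runs under the TCC driver.
        // Zero means it is not in TCC mode (WDDM on Windows).
        std::printf("device %d (%s): %s\n", dev, prop.name,
                    prop.tccDriver ? "TCC" : "not TCC");
    }
    return 0;
}
```

On a GeForce board such as the GTX 690 from the question, this would be expected to report "not TCC", consistent with TCC not being supported there.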

AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver. Unofficially, according to Greg@NV in the linked post, the command to issue after the CUDA kernel call is cudaEventQuery(0); which may/should cause the WDDM batch queue to "flush" to the GPU.

As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.
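As a minimal sketch of the unofficial cudaEventQuery(0) approach described above: the kernel, buffer size, and scale factor here are hypothetical, and since the flush behaviour is undocumented, it may differ between driver versions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to launch.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;  // arbitrary problem size (assumption)
    float* dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dData, 0, n * sizeof(float));

    scaleKernel<<<(n + 255) / 256, 256>>>(dData, n, 2.0f);
    if (cudaGetLastError() != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed\n");
        return 1;
    }

    // Unofficial hint: a non-blocking query that may prompt the WDDM
    // driver to submit its batched commands to the GPU. The null handle
    // is not a real event, so the returned status is deliberately
    // ignored and any error it leaves behind is cleared.
    (void)cudaEventQuery(0);
    (void)cudaGetLastError();

    // ... independent host work could overlap with the kernel here ...

    cudaError_t status = cudaDeviceSynchronize();
    cudaFree(dData);
    if (status != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    std::printf("kernel completed\n");
    return 0;
}
```

The query only matters when the host has other work to do before it synchronizes. If the next call is a synchronize anyway, the queue is submitted at that point regardless.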

EDIT: Moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
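The sketch below shows one way cudaStreamQuery(stream) might be used while keeping some of the amortization benefit: it queries the stream only once every few launches rather than after each one. The kernel, the launch count, and the flushEvery interval are hypothetical values chosen for illustration, not a recommendation from the answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a single thread increments a device counter.
__global__ void incrementKernel(int* counter) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *counter += 1;
}

int main() {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        std::fprintf(stderr, "cudaStreamCreate failed\n");
        return 1;
    }

    int* dCounter = nullptr;
    if (cudaMalloc(&dCounter, sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemsetAsync(dCounter, 0, sizeof(int), stream);

    const int launches = 64;    // arbitrary (assumption)
    const int flushEvery = 16;  // arbitrary flush interval (assumption)
    for (int i = 0; i < launches; ++i) {
        incrementKernel<<<1, 1, 0, stream>>>(dCounter);
        if ((i + 1) % flushEvery == 0) {
            // Non-blocking: returns cudaSuccess or cudaErrorNotReady.
            // The status is ignored; the call is used only as a hint
            // for the driver to submit queued commands.
            (void)cudaStreamQuery(stream);
        }
    }

    cudaStreamSynchronize(stream);

    int hCounter = 0;
    cudaMemcpy(&hCounter, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("counter = %d (launches issued: %d)\n", hCounter, launches);

    cudaFree(dCounter);
    cudaStreamDestroy(stream);
    return 0;
}
```

Whether a flush interval helps or hurts depends on the workload, so it is worth profiling with and without the query.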

EDIT2: Using recent drivers on Windows, you should be able to place Titan family GPUs in TCC mode, assuming you have some other GPU set up for primary display. The nvidia-smi tool will allow you to switch modes (use nvidia-smi --help for more info).

Additional info about the TCC driver model can be found in the Windows install guide, including that it may reduce the latency of kernel launches.
