CUDA performance penalty when running in Windows

Question

I've noticed a big performance hit when I run my CUDA application in Windows 7 (versus Linux). I think I may know where the slowdown occurs: For whatever reason, the Windows Nvidia driver (version 331.65) does not immediately dispatch a CUDA kernel when invoked via the runtime API. To illustrate the problem I profiled the mergeSort application (from the examples that ship with CUDA 5.5).

Consider first the kernel launch time when running in Linux:

Next, consider the launch time when running in Windows:

This post suggests the problem might have something to do with the Windows driver batching the kernel launches. Is there any way I can disable this batching?

I am running with a GTX 690 GPU, Windows 7, and version 331.65 of the Nvidia driver.

Recommended answer

There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.

As you've discovered, this means that under WDDM (only) GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.
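
One way to observe this dispatch latency without a profiler is to have a kernel raise a flag in host-mapped memory and time how long the host waits to see it. The following is a rough sketch only: the kernel signalKernel, the helper microsSince, and the 1-second timeout are hypothetical names and values chosen for illustration, and the measured figure also includes the ordinary launch and memory-visibility overhead, not just batching.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Hypothetical kernel: raises a flag in host-mapped memory as soon as it runs.
__global__ void signalKernel(volatile int *flag)
{
    *flag = 1;
    __threadfence_system();  // push the write out to host-visible memory
}

// Hypothetical helper: microseconds elapsed since `start`.
static double microsSince(Clock::time_point start)
{
    return std::chrono::duration<double, std::micro>(Clock::now() - start).count();
}

int main()
{
    int *h_flag = nullptr;  // host view of the flag
    int *d_flag = nullptr;  // device view of the same memory

    // Request mapped pinned memory support before the context is created.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess ||
        cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped) != cudaSuccess ||
        cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0) != cudaSuccess) {
        std::fprintf(stderr, "mapped pinned memory is not available\n");
        return 1;
    }

    // Warm-up launch so context creation and module loading are not timed.
    *h_flag = 0;
    signalKernel<<<1, 1>>>(d_flag);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::fprintf(stderr, "warm-up launch failed\n");
        return 1;
    }

    volatile int *flag = h_flag;
    const double timeoutUs = 1e6;  // arbitrary 1 s cap for this sketch

    *flag = 0;
    Clock::time_point start = Clock::now();
    signalKernel<<<1, 1>>>(d_flag);

    // Poll from the host without calling into the CUDA runtime, so that
    // nothing in this loop asks the driver to submit queued work.
    double waitedUs = 0.0;
    while (*flag == 0 && waitedUs < timeoutUs) {
        waitedUs = microsSince(start);
    }

    if (*flag != 0) {
        std::printf("kernel observed running after about %.1f us\n", waitedUs);
    } else {
        std::printf("kernel not observed within %.0f us without a flush\n", timeoutUs);
    }

    cudaDeviceSynchronize();  // make sure the kernel has finished before cleanup
    cudaFreeHost(h_flag);
    return 0;
}
```

Running the same binary on Linux and on Windows under WDDM should give a rough feel for the difference; the numbers will vary from run to run.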

The best solution under Windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command, but this is only supported on Tesla GPUs and certain members of the Quadro family of GPUs, i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)

AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver. Unofficially, according to Greg@NV in this link, the command to issue after the CUDA kernel call is cudaEventQuery(0); which may/should cause the WDDM batch queue to "flush" to the GPU.

As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.

Moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
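
As a minimal sketch of that pattern, the program below launches a kernel on a stream and immediately issues cudaStreamQuery on the same stream. The kernel scaleKernel and the helper ok are hypothetical names introduced for illustration, and the flushing effect itself is the unofficial behavior described above, not something the code can verify.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: doubles each element.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: print a message and return false on a CUDA error.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaStream_t stream;

    if (!ok(cudaStreamCreate(&stream), "cudaStreamCreate")) return 1;
    if (!ok(cudaMalloc((void **)&d_data, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!ok(cudaMemsetAsync(d_data, 0, n * sizeof(float), stream), "cudaMemsetAsync")) return 1;

    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    if (!ok(cudaGetLastError(), "kernel launch")) return 1;

    // Non-blocking query issued right after the launch. Under WDDM this is
    // the unofficial nudge to submit the batched commands; the older
    // suggestion in the answer was cudaEventQuery(0).
    cudaError_t q = cudaStreamQuery(stream);
    if (q != cudaSuccess && q != cudaErrorNotReady) {
        // cudaErrorNotReady only means the work is still in flight.
        ok(q, "cudaStreamQuery");
        return 1;
    }

    // ... independent CPU work that should overlap with the kernel ...

    if (!ok(cudaStreamSynchronize(stream), "cudaStreamSynchronize")) return 1;

    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return 0;
}
```

Issuing the query once after a group of launches, rather than after every launch, keeps more of the amortization benefit mentioned above.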

Using recent drivers on Windows, you should be able to place Titan family GPUs in TCC mode, assuming you have some other GPU set up for primary display. The nvidia-smi tool will allow you to switch modes (use nvidia-smi --help for more info).

Additional info about the TCC driver model can be found in the Windows installation guide, including the fact that it may reduce the latency of kernel launches.

The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.
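
To check from inside a program which driver model each device is currently using, the tccDriver field of cudaDeviceProp can be read. This is a small sketch: the helper driverModelLabel is a hypothetical name, and the program only reports the current mode. It does not tell you whether a GPU could be switched to TCC, which remains a question for nvidia-smi.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the tccDriver property into a readable label.
static const char *driverModelLabel(int tccDriver)
{
    return tccDriver ? "TCC" : "not TCC (WDDM when running on Windows)";
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        // tccDriver is 1 when the device is running under the TCC driver.
        std::printf("device %d (%s): %s\n", dev, prop.name, driverModelLabel(prop.tccDriver));
    }
    return 0;
}
```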
