Time between Kernel Launch and Kernel Execution


Problem Description



I'm trying to optimize my CUDA program using the Parallel Nsight 2.1 edition for VS 2010.

My program runs on a Windows 7 (32-bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32-bit toolkit and the 301.32 driver.

One cycle in the program consists of copying host data to the device, executing the kernels, and copying the results from the device back to the host.

As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernels in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before the kernels are launched in the different streams.
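
For reference, the cycle described above would look roughly like the following sketch. The kernel, function, and buffer names (`myKernel`, `runCycle`, `d_input`, `h_output`, ...) are placeholders, not taken from the actual program; error checking is omitted, and the host buffers are assumed to be pinned so the async copies can actually run asynchronously.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-stream computation
// (no bounds check; illustrative only).
__global__ void myKernel(const float* in, float* out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx];
}

// One processing cycle: H2D copy on stream 2, one kernel per stream,
// then a D2H copy of each stream's result.
void runCycle(const float* h_input, float* d_input, size_t inputBytes,
              float* d_output[4], float* h_output[4], size_t outputBytes,
              dim3 grid, dim3 block, cudaStream_t streams[4])
{
    // Copy the shared input on stream 2 and wait for it on the host,
    // because every kernel depends on this data.
    cudaMemcpyAsync(d_input, h_input, inputBytes,
                    cudaMemcpyHostToDevice, streams[2]);
    cudaStreamSynchronize(streams[2]);

    // Launch one kernel per stream.
    for (int i = 0; i < 4; ++i)
        myKernel<<<grid, block, 0, streams[i]>>>(d_input, d_output[i]);

    // Copy each stream's result back to the host.
    for (int i = 0; i < 4; ++i)
        cudaMemcpyAsync(h_output[i], d_output[i], outputBytes,
                        cudaMemcpyDeviceToHost, streams[i]);
}
```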

What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 µs to launch the kernel, which is a huge overhead in a processing cycle of less than 1 ms.

Furthermore, there is no overlap between kernel execution and the copy of the results back to the host, which increases the overhead even more.

Are there any obvious reasons for this behavior?

Solution

There are three issues that I can tell from the trace.

  1. Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA driver API trace enabled. If you were to disable CUDA runtime trace, I would guess that you would reduce the width by 50 µs.

  2. Since you are on a GTX 480 on Windows 7, you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work, which introduces a lot of overhead. To reduce this overhead, the CUDA driver buffers requests in an internal SW queue and sends the requests to the driver when the queue is full or when it is flushed by a synchronize call. It is possible to use cudaEventQuery to force the driver to flush the work, but this can have other performance implications (see the first sketch after this list).

  3. It appears you are submitting your work to streams in a depth-first manner. On compute capability 2.x and 3.0 devices you will have better results if you submit to streams in a breadth-first manner (see the second sketch after this list). In your case you may see overlap between your kernels.
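
To illustrate the flushing technique from point 2, here is a minimal sketch; it reuses the placeholder names from the sketch in the question above, the event name is made up, and error checking is omitted:

```cpp
// Hypothetical example: force the WDDM software queue to be flushed after
// queuing work, without blocking the host.
cudaEvent_t flushEvent;
cudaEventCreateWithFlags(&flushEvent, cudaEventDisableTiming);

// Queue some work; under WDDM it may sit in the driver's SW queue.
myKernel<<<grid, block, 0, streams[0]>>>(d_input, d_output[0]);

// Record an event behind the queued work and poll it once.
// cudaEventQuery does not block the host (it returns cudaSuccess or
// cudaErrorNotReady), but it pushes the buffered commands to the driver.
cudaEventRecord(flushEvent, streams[0]);
cudaEventQuery(flushEvent);
```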
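
And for point 3, a sketch of the difference between depth-first and breadth-first submission, again with the placeholder names from above:

```cpp
// Depth-first submission (what the trace suggests): all work for stream 0,
// then all work for stream 1, and so on.
for (int i = 0; i < 4; ++i) {
    myKernel<<<grid, block, 0, streams[i]>>>(d_input, d_output[i]);
    cudaMemcpyAsync(h_output[i], d_output[i], outputBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}

// Breadth-first submission: issue the same kind of operation across all
// streams before moving on to the next kind, which gives compute
// capability 2.x/3.0 devices a better chance to overlap kernels and copies.
for (int i = 0; i < 4; ++i)
    myKernel<<<grid, block, 0, streams[i]>>>(d_input, d_output[i]);
for (int i = 0; i < 4; ++i)
    cudaMemcpyAsync(h_output[i], d_output[i], outputBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
```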

The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting only after completion of all of the kernels. Given the API call pattern, you should be able to see transfers starting after each stream completes its launches.

If you are waiting on all streams to complete, it is likely faster to do a single cudaDeviceSynchronize than four cudaStreamSynchronize calls.
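
A sketch of that substitution, once more with the placeholder stream array from above:

```cpp
// Waiting on each of the four streams individually ...
for (int i = 0; i < 4; ++i)
    cudaStreamSynchronize(streams[i]);

// ... versus one device-wide barrier, which is likely faster here since
// all streams need to finish before the next cycle anyway.
cudaDeviceSynchronize();
```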

The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.
