How Do You Profile & Optimize CUDA Kernels?

Question

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.

There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to how to figure out both what the right questions are and what tools I can get the answers from.

How do you identify ways to make your CUDA kernels perform faster?

Answer

If you're developing on Linux, then the CUDA Visual Profiler gives you a whole load of information, although knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
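
As a rough complement to the graphical profilers, you can also time individual kernels by hand with CUDA events. Below is a minimal sketch of that technique; the kernel myKernel and the problem size are placeholders invented for illustration, not anything from the original question.

// Minimal sketch: timing one kernel launch with CUDA events.
// myKernel is a hypothetical stand-in workload, not code from the question.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                     // trivial stand-in workload
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // marker before the launch
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);                   // marker after the launch
    cudaEventSynchronize(stop);              // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

Dividing the bytes your kernel moves by the measured time gives an achieved-bandwidth figure you can compare against the card's theoretical peak, which feeds directly into the points below.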

Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:

  • Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current hardware); any other loads are inefficient. The profiling information will probably improve in future hardware. (See the first sketch after this list.)
  • Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization; the presentation goes into more detail on what to do about this, as does the SDK (e.g. the reduction sample).
  • Overlap I/O and compute: this is where Nexus really shines (you can get the same information manually using cudaEvents); if you have a large amount of data transfer, you want to overlap the compute and the I/O. (See the stream sketch after this list.)
  • Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to compare expected vs. measured bandwidth are really useful (and vice versa for compute throughput).
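
To make the first point concrete, here is a hedged sketch of the two access patterns; the kernel names are illustrative, not from the original answer. In coalescedLoad, consecutive threads in a warp read consecutive floats, so the hardware can combine them into a few wide transactions; in stridedLoad they read addresses far apart, and the profiler will report many more (and less efficient) load transactions for the same amount of useful data.

// Hypothetical kernels contrasting coalesced and strided global loads.
__global__ void coalescedLoad(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];              // thread k reads element k: loads coalesce
}

__global__ void stridedLoad(const float *in, float *out, int n, int stride)
{
    // Assumes 'in' holds at least n * stride elements.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];     // neighbouring threads hit distant addresses:
                                     // each warp needs many separate transactions
}

For the overlap point, here is a minimal sketch of splitting the work across two streams so the copy for one chunk can overlap the kernel for another. The chunk count, sizes, and the reuse of myKernel from the timing sketch above are assumptions for illustration; the host buffer must be pinned (allocated with cudaMallocHost) and n must divide evenly by the number of streams for this simple version to work.

// Sketch: overlapping host<->device copies with kernel execution using streams.
// Assumes h_data was allocated with cudaMallocHost, d_data with cudaMalloc,
// and myKernel(float*, int) as in the timing sketch above.
const int NSTREAMS = 2;
const int CHUNK = n / NSTREAMS;          // assumes n % NSTREAMS == 0

cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s)
    cudaStreamCreate(&streams[s]);

for (int s = 0; s < NSTREAMS; ++s) {
    int offset = s * CHUNK;
    // queue chunk s in its own stream so it can overlap work in the other stream
    cudaMemcpyAsync(d_data + offset, h_data + offset,
                    CHUNK * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, CHUNK);
    cudaMemcpyAsync(h_data + offset, d_data + offset,
                    CHUNK * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
}

for (int s = 0; s < NSTREAMS; ++s) {
    cudaStreamSynchronize(streams[s]);   // wait for the copies and kernel in stream s
    cudaStreamDestroy(streams[s]);
}

In the Visual Profiler or Nexus timeline you should then see the copies for one chunk running while the kernel for the other chunk executes, which is exactly the overlap the bullet describes.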

This is just a start; check out the GTC presentation and the other webinars on the NVIDIA website.
