Calculating achieved bandwidth and flops/Gflops, and evaluating CUDA kernel performance

Question

Most papers report the flops/Gflops and achieved bandwidth of their CUDA kernels. I have also read the answers on Stack Overflow to the following questions:

How to evaluate CUDA performance?

How Do You Profile & Optimize CUDA Kernels?

How to calculate Gflops of a kernel

Counting FLOPS/GFLOPS in program - CUDA

How to calculate the achieved bandwidth of a CUDA kernel

Most of it seems clear, but I still do not feel comfortable calculating these numbers myself. Can anyone write a simple CUDA kernel, give the output of deviceQuery, and then compute, step by step, the flops/Gflops and achieved bandwidth for that kernel? Then show the Visual Profiler results for the same kernel. In other words, show the results in detail, with all the information obtained step by step for this simple CUDA kernel. That would be really helpful for most of us. Thanks!

Recommended answer

Nsight Visual Studio Edition 2.1 and above

Collect the Achieved FLOPS experiment and the Memory Statistics - Buffers experiment.

Visual Profiler 4.2 and above

Achieved bandwidth: when you hover the mouse over a kernel in the Timeline, this information is available in the Properties pane under Memory\DRAM Utilization.
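As a hand cross-check of the profiler's figure, the usual arithmetic is bytes moved divided by kernel duration, compared against a theoretical peak derived from the memory clock and bus width that deviceQuery prints. The sketch below is only an illustration: the function names are hypothetical, 1 GB is taken as 1e9 bytes, and the default of two transfers per clock assumes double-data-rate memory, which may not match every device.

```cpp
#include <cstddef>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes).
// bytes_read / bytes_written: what the kernel is expected to move to and from
// global memory, counted by hand from the algorithm (an estimate, not a measurement).
// duration_ms: kernel duration, e.g. the Duration value shown by the profiler.
double effective_bandwidth_gbps(std::size_t bytes_read,
                                std::size_t bytes_written,
                                double duration_ms) {
    const double seconds = duration_ms * 1e-3;
    return static_cast<double>(bytes_read + bytes_written) / 1e9 / seconds;
}

// Theoretical peak bandwidth in GB/s from the memory clock rate (MHz) and
// memory bus width (bits) reported by deviceQuery.
// transfers_per_clock = 2 is an ASSUMPTION (double-data-rate memory).
double theoretical_peak_gbps(double memory_clock_mhz,
                             int bus_width_bits,
                             int transfers_per_clock = 2) {
    const double bytes_per_transfer = bus_width_bits / 8.0;
    return memory_clock_mhz * 1e6 * bytes_per_transfer * transfers_per_clock / 1e9;
}
```

For example, a kernel computing y = a*x + y over N = 1048576 floats reads 2*N*4 bytes and writes N*4 bytes. With a made-up duration of 0.1 ms, effective_bandwidth_gbps(8388608, 4194304, 0.1) gives about 125.8 GB/s, and theoretical_peak_gbps(1848, 384) gives about 177.4 GB/s for a hypothetical device with a 1848 MHz memory clock and a 384-bit bus.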

The profiler cannot yet collect a FLOPS count. You can obtain one by running cuobjdump -sass to view the assembly code. Step through the kernel and count the single- and double-precision floating-point instructions, multiplying FMA and DFMA operations by 2. Each instruction count should also be multiplied by the number of threads for which the predicate is true. You also have to account for control flow. This is not fun and requires a strong knowledge of the instruction set. It may be easier to do by single-stepping through the assembly in the debugger. The duration of the kernel is available in the Visual Profiler Properties pane and Details pane as Duration.
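Once the instructions have been counted, the remaining arithmetic is small. The following is a minimal sketch of it, not a tool that reads SASS: the struct and function names are hypothetical, the counts are assumed to be already multiplied by the number of predicated-true threads that executed each instruction, and only add, multiply, and fused multiply-add are modelled, so any other floating-point instructions would need their own entries.

```cpp
#include <cstdint>

// Thread-level execution counts per instruction class, gathered by hand from
// the cuobjdump -sass listing (hypothetical layout; add fields as needed).
struct FpInstructionCounts {
    std::uint64_t sp_add;  // single-precision adds
    std::uint64_t sp_mul;  // single-precision multiplies
    std::uint64_t sp_fma;  // single-precision fused multiply-adds
    std::uint64_t dp_add;  // double-precision adds
    std::uint64_t dp_mul;  // double-precision multiplies
    std::uint64_t dp_fma;  // double-precision fused multiply-adds (DFMA)
};

// Total floating-point operations: an FMA counts as two operations
// (one multiply plus one add); every other instruction counts as one.
std::uint64_t total_flops(const FpInstructionCounts& c) {
    return c.sp_add + c.sp_mul + 2 * c.sp_fma
         + c.dp_add + c.dp_mul + 2 * c.dp_fma;
}

// Achieved GFLOP/s given the total operation count and the kernel duration
// in milliseconds (the Duration value reported by the profiler).
double gflops(std::uint64_t flops, double duration_ms) {
    const double seconds = duration_ms * 1e-3;
    return static_cast<double>(flops) / seconds / 1e9;
}
```

For example, if 1048576 threads each execute a single single-precision FMA, FpInstructionCounts{0, 0, 1048576, 0, 0, 0} yields 2097152 operations, and with a made-up duration of 0.1 ms gflops returns about 20.97.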
