How to evaluate CUDA performance?


Problem Description

I programmed a CUDA kernel myself. Compared to the CPU code, my kernel is about 10 times faster than the CPU version.

But I have some questions about my experiments.

Is my program fully optimized: does it use all GPU cores, with proper shared memory use, an adequate register count, and enough occupancy?

How can I evaluate my kernel code's performance?

How can I calculate CUDA's theoretical maximum throughput?

Am I right to compare the CPU's GFLOPS with the GPU's GFLOPS, and is the GFLOPS ratio a transparent measure of their theoretical performance?

Thanks in advance.

Solution

Is my program fully optimized: does it use all GPU cores, with proper shared memory use, an adequate register count, and enough occupancy?

To find this out, you can use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?
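Beyond the profilers, a quick first-order check is to time the kernel with CUDA events and convert the elapsed time into achieved bandwidth or GFLOPS, which you can then compare against the chip's theoretical peak. Below is a minimal sketch; the kernel myKernel and the problem size are hypothetical placeholders, not taken from the original question.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: y[i] = 2 * x[i].
__global__ void myKernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];
}

int main() {
    const int n = 1 << 24;  // 16M elements
    float *x, *y;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));

    // CUDA events measure time on the GPU timeline, not CPU wall-clock time.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(x, y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth: bytes read plus bytes written, per second.
    double gbytes = 2.0 * n * sizeof(float) / 1e9;
    printf("kernel time: %.3f ms, effective bandwidth: %.1f GB/s\n",
           ms, gbytes / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Comparing the measured figure with the theoretical peak for your card (see below) tells you roughly how close you are to the hardware limit; the profiler then shows where the remaining gap comes from (occupancy, shared memory conflicts, register pressure, and so on).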

How can I calculate CUDA's theoretical maximum throughput?

That math is slightly involved, different for each architecture, and easy to get wrong. It is better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one for the GTX 500 cards. For instance, you can see from the table that a GTX 580 has a theoretical peak bandwidth of 192.4 GB/s and a compute throughput of 1581.1 GFLOPS.
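For reference, here is roughly where those GTX 580 numbers come from, as a sketch of the arithmetic rather than an exact vendor formula: 512 CUDA cores at a 1544 MHz shader clock, counting a fused multiply-add as two FLOPs, and a 384-bit memory bus at an effective 4008 MT/s.

```
#include <cstdio>

int main() {
    // GTX 580 published specs (as in the Wikipedia table mentioned above).
    const double cuda_cores      = 512;      // single-precision units
    const double shader_clock_hz = 1.544e9;  // 1544 MHz
    const double flops_per_cycle = 2;        // one FMA = 2 FLOPs per core per cycle

    const double mem_transfers   = 4.008e9;  // effective transfer rate (GDDR5)
    const double bus_width_bytes = 384 / 8;  // 384-bit bus = 48 bytes per transfer

    printf("peak SP compute: %.1f GFLOPS\n",
           cuda_cores * shader_clock_hz * flops_per_cycle / 1e9);  // ~1581.1
    printf("peak bandwidth:  %.1f GB/s\n",
           mem_transfers * bus_width_bytes / 1e9);                 // ~192.4
    return 0;
}
```

The per-cycle factor is exactly the part that changes between architectures (and between SP and DP), which is why looking the figure up is safer than deriving it yourself.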

Am I right to compare the CPU's GFLOPS with the GPU's GFLOPS, and is the GFLOPS ratio a transparent measure of their theoretical performance?

If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:

  • Older GPUs did not support double precision (DP) floating point, only single precision (SP).

  • GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.

  • CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that can approach that theoretical maximum (they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all computing resources available through different types of instructions, and it is often virtually impossible to write a program that can exploit them all simultaneously. (A worked example of how such a quoted peak is built up follows this list.)

  • The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
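To illustrate the SIMD point in the list above, here is how a quoted CPU peak is typically built up. The figures (4 cores, 3.0 GHz, 8 single-precision lanes with fused multiply-add, as on an AVX2/FMA-capable part) are hypothetical and chosen only to show the arithmetic.

```
#include <cstdio>

int main() {
    // Hypothetical CPU: 4 cores at 3.0 GHz, 8-wide SP vectors with FMA.
    const double cores           = 4;
    const double clock_hz        = 3.0e9;
    const double simd_lanes      = 8;   // single-precision lanes per vector
    const double flops_per_cycle = 2;   // one FMA = 2 FLOPs per lane per cycle

    // The quoted peak assumes every core retires a full-width FMA every cycle,
    // which scalar (non-vectorized) code never approaches.
    printf("quoted SP peak: %.0f GFLOPS\n",
           cores * clock_hz * simd_lanes * flops_per_cycle / 1e9);  // 192 GFLOPS
    return 0;
}
```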
