How to calculate Gflops of a kernel




I want a measure of how much of the peak performance my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a peak of 622.08 GFLOPS (~= 240 cores * 1300 MHz * 2). Now in my kernel I counted 16,000 FLOPs per thread (4000 x (2 subtractions, 1 multiplication and 1 sqrt)). So with 1,000,000 threads I would come up with 16 GFLOP. And as the kernel takes 0.1 seconds I would achieve 160 GFLOPS, which would be a quarter of the peak performance. Now my questions:

  • Is this approach correct?
  • What about comparisons (if(a>b) then....)? Do I have to consider them as well?
  • Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out what the figure means.
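
For reference, the arithmetic described above can be sketched as follows (all figures are taken from the question; the 622.08 GFLOPS peak corresponds to a 1.296 GHz shader clock, which is the "~1300 MHz" quoted):

```python
# Peak: cores * shader clock * 2 FLOPs/cycle (FMAD counted as 2 FLOPs)
cores = 240
clock_ghz = 1.296                 # 240 * 1.296 * 2 = 622.08 GFLOPS
peak_gflops = cores * clock_ghz * 2

# Counted workload, as in the question
threads = 1_000_000
flops_per_thread = 4000 * 4       # 4000 x (2 sub + 1 mul + 1 sqrt)
kernel_time_s = 0.1

total_gflop = threads * flops_per_thread / 1e9   # 16 GFLOP
achieved_gflops = total_gflop / kernel_time_s    # 160 GFLOP/s

print(f"peak: {peak_gflops} GFLOPS, achieved: {achieved_gflops} GFLOPS")
print(f"fraction of peak: {achieved_gflops / peak_gflops:.2%}")
```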

Sister question: How to calculate the achieved bandwidth of a CUDA kernel

Solution

First some general remarks:

In general, what you are doing is mostly an exercise in futility and is the reverse of how most people would probably go about performance analysis.

The first point to make is that the peak value you are quoting is strictly for floating point multiply-add instructions (FMAD), which count as two FLOPs and can be retired at a maximum rate of one per cycle. Other floating point operations which retire at a maximum rate of one per cycle would formally be classified as only a single FLOP, while others might require many cycles to be retired. So if you decide to quote kernel performance against that peak, you are really comparing your code's performance against a stream of pure FMAD instructions, and nothing more than that.

The second point is that when researchers quote FLOP/s values from a piece of code, they are usually using a model FLOP count for the operation, not trying to count instructions. Matrix multiplication and the Linpack LU factorization benchmarks are classic examples of this approach to performance benchmarking. The lower bound of the operation count of those calculations is exactly known, so the calculated throughput is simply that lower bound divided by the time. The actual instruction count is irrelevant. Programmers often use all sorts of techniques, including redundant calculations, speculative or predictive calculations, and a host of other ideas, to make code run faster. The actual FLOP count of such code is irrelevant; the reference is always the model FLOP count.
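
As a concrete illustration of model-FLOP-count benchmarking (a minimal sketch using NumPy's matrix multiply; the 2·N³ figure is the standard model count for an N×N product, regardless of how the library actually implements it):

```python
import time
import numpy as np

# Model FLOP count for an N x N matrix multiply: each of the N*N output
# elements takes ~N multiplies and ~N adds, giving 2*N**3 model FLOPs.
N = 512
a = np.random.rand(N, N)
b = np.random.rand(N, N)

t0 = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - t0

model_flops = 2 * N**3
print(f"model throughput: {model_flops / elapsed / 1e9:.2f} GFLOP/s")
```

The reported throughput is the model count divided by wall time; whether the library internally executes more (or fewer) instructions does not enter into it.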

Finally, when looking at quantifying performance, there are usually only two points of comparison of any real interest:

  • Does version A of the code run faster than version B on the same hardware?
  • Does hardware A perform better than hardware B doing the task of interest?

In the first case you really only need to measure execution time. In the second, a suitable measure usually isn't FLOP/s, it is useful operations per unit time (records per second in sorting, cells per second in a fluid mechanical simulation, etc). Sometimes, as mentioned above, the useful operations can be the model FLOP count of an operation of known theoretical complexity. But the actual floating point instruction count rarely, if ever, enters into the analysis.
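
For example, a domain metric such as "records sorted per second" can be measured directly (a minimal sketch, not tied to any particular code from the question):

```python
import random
import time

# "Useful operations per unit time": records sorted per second,
# a domain metric rather than FLOP/s.
records = [random.random() for _ in range(1_000_000)]

t0 = time.perf_counter()
records.sort()
elapsed = time.perf_counter() - t0

print(f"{len(records) / elapsed:,.0f} records/s")
```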

If your interest is really about optimization and understanding the performance of your code, then maybe this presentation by Paulius Micikevicius from NVIDIA might be of interest.

Addressing the bullet point questions:

Is this approach correct?

Strictly speaking, no. If you are counting floating point operations, you would need to know the exact FLOP count from the code the GPU is running. The sqrt operation can consume a lot more than a single FLOP, depending on its implementation and the characteristics of the number it is operating on, for example. The compiler can also perform a lot of optimizations which might change the actual operation/instruction count. The only way to get a truly accurate count would be to disassemble compiled code and count the individual floating point operations, perhaps even requiring assumptions about the characteristics of values the code will compute.
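
A rough sketch of what such instruction counting might look like, applied to SASS text from a disassembler such as `cuobjdump -sass`. The opcode names and weights below are illustrative assumptions, not a complete table; in particular MUFU (sqrt, rsqrt, etc.) expands to several operations in practice:

```python
import re

# Illustrative weights only: FADD/FMUL as 1 FLOP, FFMA as 2.
FLOP_WEIGHTS = {"FADD": 1, "FMUL": 1, "FFMA": 2, "MUFU": 1}

# Hypothetical SASS fragment standing in for real disassembler output.
sample_sass = """
FADD R0, R1, R2;
FFMA R3, R0, R4, R5;
MUFU.RSQ R6, R3;
MOV R7, R6;
"""

def count_flops(sass: str) -> int:
    """Sum FLOP weights over the leading opcode mnemonic of each line."""
    total = 0
    for line in sass.splitlines():
        m = re.match(r"\s*([A-Z]+)", line)
        if m and m.group(1) in FLOP_WEIGHTS:
            total += FLOP_WEIGHTS[m.group(1)]
    return total

print(count_flops(sample_sass))  # 1 (FADD) + 2 (FFMA) + 1 (MUFU) = 4
```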

What about comparisons (if(a>b) then....)? Do I have to consider them as well?

They are not floating point multiply-add operations, so no.

Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out what the figure means.

Not really. The profiler can't differentiate between a floating point instruction and any other type of instruction, so (as of 2011) counting FLOPs from a piece of code via the profiler is not possible. [EDIT: see Greg's excellent answer below for a discussion of the FLOP counting facilities available in versions of the profiling tools released since this answer was written]
