How to calculate Gflops of a kernel


Question

I want a measure of how much of the peak performance my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a peak of 622.08 GFLOPS (~= 240 cores * 1300 MHz * 2). Now in my kernel I counted 16000 flops per thread (4000 x (2 subtractions, 1 multiplication and 1 sqrt)). So with 1,000,000 threads I would come up with 16 GFLOP. And as the kernel takes 0.1 seconds I would achieve 160 GFLOPS, which would be a quarter of the peak performance. Now my questions:

  • Is this approach correct?
  • What about comparisons (if(a>b) then....)? Do I have to consider them as well?
  • Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out what the figure means.
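The back-of-envelope arithmetic described in the question can be written out as a short script. All the numbers come from the question itself; the peak formula assumes one FMAD (counted as 2 FLOPs) per core per cycle:

```python
# Back-of-envelope GFLOP/s estimate, using the numbers from the question.
cores = 240
clock_hz = 1.3e9                 # 1300 MHz (the question's quoted 622.08
                                 # corresponds to the exact 1296 MHz shader clock)
peak_gflops = cores * clock_hz * 2 / 1e9        # one FMAD = 2 FLOPs per cycle

flops_per_thread = 4000 * (2 + 1 + 1)           # 2 sub, 1 mul, 1 sqrt per iteration
threads = 1_000_000
total_gflop = threads * flops_per_thread / 1e9  # 16 GFLOP in total

elapsed_s = 0.1
achieved_gflops = total_gflop / elapsed_s       # 160 GFLOP/s
print(f"achieved: {achieved_gflops} GFLOP/s "
      f"({achieved_gflops / peak_gflops:.0%} of peak)")
```

This reproduces the figures in the question (16 GFLOP of work, 160 GFLOP/s, roughly a quarter of peak); the answer below explains why this kind of instruction counting is usually not the right way to assess performance.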

Sister question: How to calculate the achieved bandwidth of a CUDA kernel

Answer

First, some general remarks:

In general, what you are doing is mostly an exercise in futility and is the reverse of how most people would probably go about performance analysis.

The first point to make is that the peak value you are quoting is strictly for floating point multiply-add instructions (FMAD), which count as two FLOPs and can be retired at a maximum rate of one per cycle. Other floating point operations which retire at a maximum rate of one per cycle would formally only be classified as a single FLOP, while others might require many cycles to be retired. So if you decide to quote kernel performance against that peak, you are really comparing your code's performance against a stream of pure FMAD instructions, and nothing more than that.

The second point is that when researchers quote FLOP/s values for a piece of code, they are usually using a model FLOP count for the operation, not trying to count instructions. Matrix multiplication and the Linpack LU factorization benchmarks are classic examples of this approach to performance benchmarking. The lower bound on the operation count of those calculations is exactly known, so the calculated throughput is simply that lower bound divided by the time. The actual instruction count is irrelevant. Programmers often use all sorts of techniques, including redundant calculations, speculative or predictive calculations, and a host of other ideas, to make code run faster. The actual FLOP count of such code is irrelevant; the reference is always the model FLOP count.
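As a minimal illustration of the model-FLOP convention: for a naive N x N matrix multiplication the model count is 2*N**3 (one multiply and one add per inner-loop step), and that number is used regardless of how the implementation actually arrives at the result. The sizes and values below are illustrative, not from the original answer:

```python
import time

def matmul(a, b, n):
    """Naive triple-loop N x N matrix multiply."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

n = 64
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]

t0 = time.perf_counter()
c = matmul(a, b, n)
elapsed = time.perf_counter() - t0

# Model FLOP count: 2*N^3, independent of the actual instructions executed.
model_flops = 2 * n ** 3
print(f"{model_flops / elapsed / 1e9:.4f} model GFLOP/s")
```

A blocked or vectorized version of the same routine would execute different (and possibly more) instructions, but would still be scored against the same 2*N**3 figure.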

Finally, when looking at quantifying performance, there are usually only two points of comparison of any real interest:

  • Does version A of the code run faster than version B on the same hardware?
  • Does hardware A perform better than hardware B doing the task of interest?

In the first case you really only need to measure execution time. In the second, a suitable measure usually isn't FLOP/s; it is useful operations per unit time (records per second in sorting, cells per second in a fluid mechanics simulation, etc). Sometimes, as mentioned above, the useful operations can be the model FLOP count of an operation of known theoretical complexity. But the actual floating point instruction count rarely, if ever, enters into the analysis.

If your interest is really optimization and understanding the performance of your code, then maybe this presentation by Paulius Micikevicius from NVIDIA might be of interest.

To address the bullet-point questions:

Is this approach correct?

Strictly speaking, no. If you are counting floating point operations, you would need to know the exact FLOP count of the code the GPU is running. The sqrt operation can consume a lot more than a single FLOP, for example, depending on its implementation and the characteristics of the number it is operating on. The compiler can also perform a lot of optimizations which might change the actual operation/instruction count. The only way to get a truly accurate count would be to disassemble the compiled code and count the individual floating point operations, perhaps even requiring assumptions about the characteristics of the values the code will compute.

What about comparisons (if(a>b) then....)? Do I have to consider them as well?

They are not floating point multiply-add operations, so no.

Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out what the figure means.

Not really. The profiler can't differentiate between a floating point instruction and any other type of instruction, so (as of 2011) obtaining a FLOP count for a piece of code via the profiler is not possible.
