Estimating the efficiency of GPU in FLOPS (CUDA SAMPLES)


Question

It seems to me that I don't completely understand the concept of FLOPS. In the CUDA SAMPLES, there is a Matrix Multiplication Example (0_Simple/matrixMul). In this example, the number of FLOPs (floating-point operations) per matrix multiplication is calculated via the formula:

 double flopsPerMatrixMul = 2.0 * (double)dimsA.x * (double)dimsA.y * (double)dimsB.x;

So this means that, in order to multiply a matrix A (n x m) by B (m x k), we need to perform 2*n*m*k floating-point operations.
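
For context, here is a minimal sketch of how this count is typically turned into a GFLOP/s figure, in the style of the matrixMul sample; the variable msecPerMatrixMul (average kernel time in milliseconds, e.g. obtained via cudaEventElapsedTime) is assumed to be measured elsewhere:

 // Sketch: converting the FLOP count into GFLOP/s. msecPerMatrixMul is
 // assumed to hold the measured average kernel time in milliseconds
 // (e.g., from CUDA events).
 double flopsPerMatrixMul = 2.0 * (double)dimsA.x * (double)dimsA.y * (double)dimsB.x;
 double gigaFlops = (flopsPerMatrixMul * 1.0e-9) / (msecPerMatrixMul / 1000.0);
 printf("Performance = %.2f GFlop/s\n", gigaFlops);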

However, in order to calculate one element of the resulting matrix C (n x k), one has to perform m multiplications and (m-1) additions. So the total number of operations (to calculate all n x k elements) is m*n*k multiplications and (m-1)*n*k additions.

Of course, we could count m*n*k additions as well (initializing the accumulator to zero costs one extra addition per element), and the total number of operations would then be 2*n*m*k, half of them multiplications and half additions.
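
To make that count concrete, here is a hypothetical plain reference loop for C = A * B (host code, not the sample's tiled kernel); with the accumulator initialized to zero, each of the n*k output elements costs exactly m multiplies and m adds:

 // Hypothetical reference implementation. Starting sum at 0.0f yields
 // exactly m multiplies and m additions per element of C, i.e. 2*n*m*k
 // floating-point operations in total.
 for (int i = 0; i < n; ++i) {         // rows of A and C
     for (int j = 0; j < k; ++j) {     // columns of B and C
         float sum = 0.0f;             // zero init makes the add count m, not m-1
         for (int p = 0; p < m; ++p)
             sum += A[i * m + p] * B[p * k + j];  // one multiply + one add
         C[i * k + j] = sum;
     }
 }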

But I guess multiplication is more computationally expensive than addition. Why are these two types of operations lumped together? Is that always the case in computer science? How can one take two different types of operations into account?

Sorry for my English)

Answer

The short answer is that yes, they count both the multiplications and the additions. Even though most floating-point processors have a fused multiply/add operation, they still count the multiply and the add as two separate floating-point operations.
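
As an illustration (a sketch, not the sample's actual tiled kernel), the inner loop of a naive CUDA kernel typically compiles to one fused FFMA instruction per iteration, yet by convention each fmaf() still counts as two FLOPs:

 // Sketch of a naive kernel: each thread computes one element of C.
 // The fmaf() in the inner loop usually maps to a single FFMA
 // instruction, but is still counted as 2 FLOPs (1 mul + 1 add).
 __global__ void matmulNaive(const float *A, const float *B, float *C,
                             int n, int m, int k) {
     int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
     int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
     if (row < n && col < k) {
         float sum = 0.0f;
         for (int p = 0; p < m; ++p)
             sum = fmaf(A[row * m + p], B[p * k + col], sum);
         C[row * k + col] = sum;
     }
 }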

This is part of why people have been complaining for decades that FLOPs is basically a meaningless measurement. For it to mean even a little, you nearly need to specify some particular body of code for which you're measuring the FLOPs (e.g., "Linpack gigaflops"). Even then, you sometimes need fairly tight control over things like which compiler optimizations are allowed, to ensure that what you're measuring is really machine speed rather than the compiler's ability to simply eliminate some operations.

Ultimately, it's concerns like these that have led to organizations being formed to set up benchmarks, along with rules about how those benchmarks must be run and the results reported (e.g., SPEC). Otherwise, it can be difficult to be at all certain that the results you see reported for two different processors are really comparable in any meaningful way. Even with such rules, comparisons can be difficult; without them, they can border on meaningless.

