Can 1 CUDA core process more than 1 floating-point instruction per clock (Maxwell)?


Question

The wiki page "List of Nvidia GPUs", in the GeForce 900 Series section, states in footnote 4:

Single precision performance is calculated as 2 times the number of shaders multiplied by the base core clock speed.

For example, for the GeForce GTX 970 we can calculate the performance:

1664 Cores * 1050 MHz * 2 = 3 494 GFlops peak (3 494 400 MFlops)

We can see this value in the column "Processing Power (peak) GFLOPS - Single Precision".
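The arithmetic above can be reproduced with a short sketch. The function name `peak_gflops` and its parameters are illustrative only and do not come from any NVIDIA tool; the default factor of 2 is simply the value the wiki footnote uses.

```python
def peak_gflops(cores: int, clock_mhz: float, flops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPS: cores * clock * FLOPs per core per clock."""
    # Multiplying by a clock in MHz yields MFLOPS.
    mflops = cores * clock_mhz * flops_per_core_per_clock
    return mflops / 1000.0


# GeForce GTX 970 figures quoted in the question: 1664 cores at 1050 MHz.
gtx970_peak = peak_gflops(1664, 1050)
print(f"{gtx970_peak:.1f} GFLOPS")
```

Passing `flops_per_core_per_clock=1` instead gives the figure for a stream of plain adds or multiplies, which is half of the quoted peak.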

But why must we multiply by 2?

The NVIDIA blog post at http://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/ says:

SMM uses a quadrant-based design with four 32-core processing blocks each with a dedicated warp scheduler capable of dispatching two instructions per clock.

OK, NVIDIA Maxwell is a superscalar architecture that dispatches two instructions per clock, but can 1 CUDA core (FP32 ALU) process more than 1 instruction per clock?

We know that 1 CUDA core contains two units: an FP32 unit and an INT unit. But the INT unit is irrelevant to GFlops (billions of floating-point operations per second).

That is, one SMM contains:

  • 128 FP32 units
  • 128 INT units
  • 32 SFU units
  • 32 LD/ST units

To get the performance in GFlops we should count only the 128 FP32 units and the 32 SFU units.

That is, if we use both the 128 FP32 units and the 32 SFU units simultaneously, then we can get 160 instructions with floating-point operations per clock per SM.

That is, we would have to multiply by 1.25 (= 160/128) instead of 2:

1664 Cores * 1050 MHz * 1.25 = 2 184 GFlops peak

Why does the wiki say that we must multiply Cores * MHz by 2?

Solution

Summary: One FMA counts as 2 FLOPs in the standard accounting of FP throughput, even on machines that do it in a single instruction for a single execution unit (which is how it avoids intermediate rounding, the fused part of FMA).


A CUDA "core" (also called an SP, streaming processor) most commonly refers to one of the single-precision floating point units in an SM (streaming multiprocessor). A CUDA core can initiate one single-precision floating point instruction per clock cycle. (The unit is pipelined, so it can initiate one instruction per clock, and it can retire one instruction per clock, but it cannot fully process a given instruction in a given clock cycle.)
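The pipelining remark can be illustrated with a toy model. This is only a sketch: the latency of 6 cycles is a made-up value chosen for illustration, not a documented Maxwell figure, and the model assumes all instructions are independent so the unit never stalls.

```python
def pipelined_cycles(num_instructions: int, latency_cycles: int) -> int:
    """Cycles to finish a stream when one instruction is issued per clock.

    The first result appears after `latency_cycles`; every later
    instruction retires one cycle after the previous one.
    """
    if num_instructions == 0:
        return 0
    return latency_cycles + (num_instructions - 1)


LATENCY = 6  # hypothetical pipeline depth, for illustration only

for n in (1, 10, 1000):
    cycles = pipelined_cycles(n, LATENCY)
    print(f"{n:5d} instructions: {cycles:5d} cycles, {n / cycles:.3f} instructions/clock")
```

A single instruction takes several cycles from start to finish, yet over a long stream the rate approaches one instruction per clock, which is the figure that matters for peak throughput.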

If that instruction is, for example, a single-precision add or a single-precision multiply, then that core can contribute one floating point operation per clock, since an add or a multiply counts as one floating point operation. If, on the other hand, the instruction is an FMA instruction (floating point fused multiply-add), then the core will perform both a floating point multiply AND a floating point add in the same time period. This means that effectively two operations are performed by a single instruction. This usage of FMA gives rise to the 2x multiplier when computing peak theoretical throughput.
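The "fused" aspect, one rounding instead of two, can be shown with a small emulation. This sketch runs on the CPU in double precision rather than on a GPU in single precision, and it emulates the fused result with exact rational arithmetic from the standard `fractions` module instead of calling a hardware FMA; the input values are chosen only to make the difference visible.

```python
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate multiply, then add: the product 1 - 2**-60 is rounded to 1.0
# before the add, so the small term is lost.
unfused = a * b + c

# Fused behaviour, emulated: compute a*b + c exactly, then round once.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print(unfused)                # 0.0
print(fused == -(2.0**-60))   # True
```

Both paths perform one multiply and one add, which is why an FMA is counted as two floating point operations even though it is a single instruction with a single rounding step.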

So a core can only process (i.e. initiate, retire) a single instruction per clock, but if that instruction is an FMA, it counts as two floating point operations.
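As a rough sketch of this accounting, the snippet below counts floating point operations for a given instruction mix. The table `FLOPS_PER_INSTRUCTION` and the example mixes are illustrative assumptions; only the rule "add or multiply counts as 1, FMA counts as 2" comes from the answer.

```python
# FLOPs credited to each instruction type under the standard accounting.
FLOPS_PER_INSTRUCTION = {"add": 1, "mul": 1, "fma": 2}


def flops_per_instruction(mix: dict) -> float:
    """Average FLOPs per issued instruction for a mix like {"fma": 600, "add": 400}."""
    total_instructions = sum(mix.values())
    total_flops = sum(FLOPS_PER_INSTRUCTION[op] * count for op, count in mix.items())
    return total_flops / total_instructions


only_fma = flops_per_instruction({"fma": 1000})
only_add = flops_per_instruction({"add": 1000})
mixed = flops_per_instruction({"fma": 600, "add": 300, "mul": 100})

print(only_fma, only_add, mixed)  # 2.0 1.0 1.6
```

With one instruction per core per clock, only a stream made entirely of FMAs reaches the factor of 2 used in the peak figure; any other mix lands between 1 and 2.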
