Can one CUDA core process more than one floating-point instruction per clock (Maxwell)?

Problem description

The Wikipedia list of Nvidia GPUs, in the GeForce 900 series section (footnote 4), states:

Single precision performance is calculated as 2 times the number of shaders multiplied by the base core clock speed.

For example, for the GeForce GTX 970 we can calculate the performance:

1664 Cores * 1050 MHz * 2 = 3 494 GFlops peak (3 494 400 MFlops)

This value appears in the column "Processing Power (peak) GFLOPS - Single Precision".

But why must we multiply by 2?

The NVIDIA blog post at http://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/ says:

SMM uses a quadrant-based design with four 32-core processing blocks each with a dedicated warp scheduler capable of dispatching two instructions per clock.

OK, NVIDIA Maxwell is a superscalar architecture that dispatches two instructions per clock, but can one CUDA core (FP32 ALU) process more than one instruction per clock?

We know that one CUDA core contains two units: an FP32 unit and an INT unit. The INT unit is irrelevant to GFLOPS (floating-point operations per second).

That is, one SMM contains:

  • 128 FP32 units
  • 128 INT units
  • 32 SFUs
  • 32 LD/ST units

To get performance in GFLOPS we should count only the 128 FP32 units and the 32 SFUs.

That is, if we use the 128 FP32 units and the 32 SFUs simultaneously, we get 160 floating-point instructions per clock per SMM.

So we should multiply by 1.25 (= 160/128) instead of 2:

1664 Cores * 1050 MHz * 1.25 = 2 184 GFlops peak

Why does the wiki say that we must multiply Cores * MHz by 2?

Solution

Summary: One FMA counts as 2 FLOPs in the standard accounting of FP throughput, even on machines that do it in a single instruction for a single execution unit (which is how it avoids intermediate rounding, the fused part of FMA).


A CUDA "core" (also called SP, streaming processor) most commonly refers to one of the single-precision floating-point units in an SM (streaming multiprocessor). A CUDA core can initiate one single-precision floating-point instruction per clock cycle. (The unit is pipelined, so it can initiate one instruction per clock and retire one instruction per clock, but it cannot fully process a given instruction within a single clock cycle.)
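
As a rough illustration of the pipelining point, here is a minimal Python sketch of a fully pipelined unit that accepts one independent instruction per clock. The function name `cycles_to_finish` and the latency of 6 cycles are hypothetical choices for the example, not measured Maxwell figures.

```python
def cycles_to_finish(num_instructions: int, latency_cycles: int) -> int:
    """Cycles needed by a fully pipelined unit that starts one independent
    instruction per clock, where each instruction takes latency_cycles
    from start to retirement."""
    if num_instructions == 0:
        return 0
    # The first instruction retires after latency_cycles; every later
    # instruction retires one clock after the one before it.
    return latency_cycles + num_instructions - 1


LATENCY = 6  # hypothetical pipeline depth, chosen only for illustration

for n in (1, 10, 1000):
    total = cycles_to_finish(n, LATENCY)
    print(f"{n} instructions: {total} cycles, {n / total:.3f} instructions per clock")
```

For a long stream of independent instructions the throughput approaches one instruction per clock, even though no single instruction completes in one clock.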

If that instruction is, for example, a single-precision add or a single-precision multiply, then that core can contribute one floating-point operation per clock, since an add or a multiply counts as one floating-point operation. If, on the other hand, the instruction is an FMA instruction (fused multiply-add), then the core will perform both a floating-point multiply AND a floating-point add in the same time period. This means that effectively two operations are performed by a single instruction. This usage of FMA gives rise to the multiplier of 2 when computing peak theoretical throughput.
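
The "fused" part can be illustrated with a small Python sketch. It emulates an FMA by computing a*b + c exactly with `fractions.Fraction` and rounding once at the end, then compares that with a separate multiply followed by an add. This is only an emulation under stated assumptions: Python floats are double precision rather than the single precision of a CUDA core, the inputs must be finite, and the helper names are made up for the example.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated FMA: a*b + c is computed exactly, then rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Two instructions: the product is rounded, then the sum is rounded."""
    product = a * b
    return product + c


# Inputs chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print("separate multiply and add:", separate_multiply_add(a, b, c))
print("emulated fused multiply-add:", fused_multiply_add(a, b, c))
```

Either way, two floating-point operations (one multiply and one add) are performed, which is why an FMA is counted as 2 FLOPs even though it is a single instruction.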

So a core can only process (i.e. initiate, retire) a single instruction per clock, but if that instruction is an FMA, it counts as two floating point operations.
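
The accounting can be sketched in a few lines of Python using the GTX 970 figures from the question (1664 cores, 1050 MHz base clock). The function name `peak_gflops` and its parameter names are illustrative only.

```python
def peak_gflops(cores: int, clock_mhz: float, flops_per_instruction: int = 2) -> float:
    """Peak theoretical throughput in GFLOPS, assuming every core starts
    one instruction per clock. flops_per_instruction is 2 for FMA and
    1 for a plain add or multiply."""
    mflops = cores * clock_mhz * flops_per_instruction
    return mflops / 1000.0


fma_peak = peak_gflops(1664, 1050)                            # all FMA
add_peak = peak_gflops(1664, 1050, flops_per_instruction=1)   # all add or mul

print(f"FMA-only peak:         {fma_peak:.1f} GFLOPS")
print(f"add/multiply-only peak: {add_peak:.1f} GFLOPS")
```

The factor of 2 therefore comes from counting operations per instruction, not from a core starting two instructions per clock.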
