Interpreting compute workload analysis in Nsight Compute

This article explains how to interpret the Compute Workload Analysis section in Nsight Compute; the question and answer below may be a useful reference for anyone facing the same problem.

Problem Description

Compute Workload Analysis displays the utilization of different compute pipelines. I know that in a modern GPU the integer and floating point pipelines are separate hardware units and can execute in parallel. However, for the other pipelines it is not clear which pipeline corresponds to which hardware unit, and I couldn't find any documentation online about the abbreviations and meaning of the pipelines.

My questions are:

1) What are the full names of ADU, CBU, TEX, and XU? How do they map to the hardware?

2) Which of the pipelines utilize the same hardware unit (e.g. do FP16, FMA, and FP64 use the floating point unit)?

3) A warp scheduler in a modern GPU can schedule 2 instructions per cycle (using different pipelines). Which pipelines can be used at the same time (e.g. FMA-ALU, FMA-SFU, ALU-Tensor, etc.)?

P.S.: I am adding the screenshot for those who are not familiar with Nsight Compute.
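
The view in that screenshot can also be collected with the Nsight Compute command-line interface. The sketch below is illustrative only: it assumes the ncu CLI is on the PATH, that the application binary is called ./my_app, and that the section identifier ComputeWorkloadAnalysis matches the GUI section of the same name.

    # List the report sections available in this ncu version
    ncu --list-sections

    # Collect only the Compute Workload Analysis section for each kernel launch
    ncu --section ComputeWorkloadAnalysis ./my_app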

Recommended Answer

The Volta (CC 7.0) and Turing (CC 7.5) SM is composed of 4 sub-partitions (SMSP). Each sub-partition contains:

• warp scheduler
• register file
• immediate constant cache
• execution units
  • ALU, FMA, FP16, UDP (7.5+), and XU
  • FP64 (GV100) on compute-centric parts
  • Tensor units

The SM contains several other partitions that contain execution units and resources shared by the 4 sub-partitions, including:

• instruction cache
• indexed constant cache
• L1 data cache, which is split into tagged RAM and shared memory
• execution units
  • ADU, LSU, TEX
  • FP64 and Tensor may be implemented as shared execution units on non compute-centric parts

In Volta (CC 7.0, 7.2) and Turing (CC 7.5) each SM sub-partition can issue 1 instruction per cycle. The instruction can be issued to a local execution unit or to the SM shared execution units.

• ADU - Address Divergence Unit. The ADU is responsible for per-thread address divergence handling for branches/jumps and indexed constant loads prior to instructions being forwarded to the other execution units.
• ALU - Arithmetic Logic Unit. The ALU is responsible for execution of most integer instructions, bit manipulation instructions, and logic instructions.
• CBU - Convergence Barrier Unit. The CBU is responsible for barrier, convergence, and branch instructions.
• FMA - Floating point Multiply and Accumulate Unit. The FMA unit is responsible for most FP32 instructions, integer multiply and accumulate instructions, and integer dot products.
• FP16 - Paired half-precision floating point unit. The FP16 unit is responsible for execution of paired half-precision floating point instructions.
• FP64 - Double precision floating point unit. The FP64 unit is responsible for all FP64 instructions. FP64 is often implemented as several different pipes on NVIDIA GPUs, and its throughput varies greatly per chip.
• LSU - Load Store Unit. The LSU is responsible for load, store, and atomic instructions to global, local, and shared memory.
• Tensor (FP16) - Half-precision floating point matrix multiply and accumulate unit.
• Tensor (INT) - Integer matrix multiply and accumulate unit.
• TEX - Texture Unit. The texture unit is responsible for sampling, load, and filtering instructions on textures and surfaces.
• UDP (Uniform) - Uniform Data Path. A scalar unit used to execute instructions where the input and output are identical for all threads in a warp.
• XU - Transcendental and Data Type Conversion Unit. The XU is responsible for special functions such as sin, cos, and reciprocal square root, as well as data type conversions.
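
As a rough illustration of how an instruction mix maps onto these pipelines, below is a hypothetical CUDA kernel (a sketch, not taken from the answer above): fmaf targets the FMA pipe, the integer add and XOR target the ALU pipe, __sinf and rsqrtf are special functions handled by the XU, and the paired-half __hfma2 targets the FP16 pipe. Profiling it with the Compute Workload Analysis section should show activity spread across those pipes, although the compiler may fold or reorder some operations, so the exact percentages depend on the generated SASS rather than on the CUDA source.

    // Hypothetical example; compile with e.g. nvcc -arch=sm_70 pipe_mix.cu
    #include <cuda_fp16.h>
    #include <cstdio>

    // Kernel mixing instruction types that target different pipelines.
    __global__ void pipe_mix(float *f, int *i, __half2 *h)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;

        // FMA pipe: FP32 fused multiply-add
        float a = fmaf(f[t], 2.0f, 1.0f);

        // ALU pipe: integer add and bitwise XOR
        int b = (i[t] + t) ^ 0x5A;

        // XU pipe: transcendental special functions (sin, reciprocal square root)
        a += __sinf(a) + rsqrtf(a + 2.0f);

        // FP16 pipe: paired half-precision fused multiply-add
        __half2 c = __hfma2(h[t], __float2half2_rn(2.0f), __float2half2_rn(1.0f));

        f[t] = a;
        i[t] = b;
        h[t] = c;
    }

    int main()
    {
        const int n = 1 << 20;
        float *f; int *d_i; __half2 *h;
        cudaMalloc(&f, n * sizeof(float));
        cudaMalloc(&d_i, n * sizeof(int));
        cudaMalloc(&h, n * sizeof(__half2));
        cudaMemset(f, 0, n * sizeof(float));
        cudaMemset(d_i, 0, n * sizeof(int));
        cudaMemset(h, 0, n * sizeof(__half2));

        pipe_mix<<<n / 256, 256>>>(f, d_i, h);
        cudaDeviceSynchronize();

        cudaFree(f); cudaFree(d_i); cudaFree(h);
        printf("done\n");
        return 0;
    }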

