What are the "long" and "short" scoreboards w.r.t. MIO/L1TEX?


Question

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states.

Two of the items in this taxonomy are:

  • Short scoreboard - scoreboard dependency on an MIO queue operation.
  • Long scoreboard - scoreboard dependency on an L1TEX operation.

where, I presume, "scoreboard" is used in the sense of out-of-order execution data dependency tracking (see e.g. here).

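My questions:

  • What does the adjective "short" or "long" describe? Is it the length of a single scoreboard? Or are there two different scoreboards for the two different kinds of operations?
  • What is the meaning of this somewhat unintuitive dichotomy between MIO operations, some but not all of which are memory operations, and L1TEX operations, which are all memory operations? Is it a dichotomy only in terms of stall reasons, or does it reflect the actual hardware?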

Recommended answer

The NVIDIA GPU has two classifications of instructions:

  1. Fixed latency - math, bitwise, register movement
  2. Variable latency - ld/st to shared, local, global, and texture, as well as slow math

The Short Scoreboard and Long Scoreboard stall reasons are reported on instructions that depend on data returned from a variable latency instruction. Short scoreboard stalls are reported for dependencies on variable latency instructions that will not leave the SM, such as slow math (e.g. reciprocal sqrt) or shared memory accesses. Long scoreboard stalls are reported for dependencies on operations that may leave the SM, such as global/local memory accesses and texture fetches.
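As a rough illustration of the two categories, the following sketch contains one kernel whose result depends on a global load and another whose result depends on a shared memory load and a reciprocal square root. The kernel names, sizes, and launch configuration are hypothetical. Which stall reason a profiler actually attributes depends on the SASS the compiler generates and on the target architecture, so treat the comments as expectations rather than guarantees.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the add consumes a value from a global load (an L1TEX
// operation that may leave the SM), so a warp waiting on it would be expected
// to show up under the long scoreboard stall reason.
__global__ void globalDependent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];      // variable latency, may leave the SM
        out[i] = v + 1.0f;    // consumer of the loaded value
    }
}

// Hypothetical kernel: the consumers depend on a shared memory load and on a
// special-function result (rsqrtf typically compiles to a MUFU instruction).
// Neither leaves the SM, so short scoreboard is the expected category.
__global__ void sharedDependent(float* out, int n)
{
    __shared__ float tile[256];           // assumes blockDim.x <= 256
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    tile[t] = static_cast<float>(t + 1);
    __syncthreads();
    if (i < n) {
        float s = tile[(t + 1) % blockDim.x];  // shared memory load
        out[i] = rsqrtf(s);                    // slow math on the loaded value
    }
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    const int block = 256;                     // matches the tile size above
    const int grid = (n + block - 1) / block;
    globalDependent<<<grid, block>>>(in, out, n);
    sharedDependent<<<grid, block>>>(out, n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Profiling both kernels with Nsight Compute and comparing their Warp State Statistics sections is one way to check whether the expected stall reasons actually dominate on a given GPU.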

The Nsight Compute v2020.3.1 Kernel Profiling Guide gives detailed descriptions:

Long Scoreboard

Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, tex) operation. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality, or by changing the cache configuration, and consider moving frequently used data to shared memory.

Short Scoreboard

Warp was stalled waiting for a scoreboard dependency on a MIO (memory input/output) operation (not to L1TEX). The primary reason for a high number of stalls due to short scoreboards is typically memory operations to shared memory. Other reasons include frequent execution of special math instructions (e.g. MUFU) or dynamic branching (e.g. BRX, JMX). Verify if there are shared memory operations and reduce bank conflicts, if applicable.
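To make the bank conflict advice concrete, here is a minimal sketch of the common padding technique, applied to a tiled matrix transpose. It is not taken from the guide. The kernel name, the tile size, and the assumption of 32 shared memory banks of 4-byte words are illustrative, and the matrix is assumed to be square with a width that is a multiple of the tile size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 32;

// PAD = 0: the threads of a warp read one column of the tile, so their
//          addresses are 32 words apart and (assuming 32 banks of 4-byte
//          words) all map to the same bank.
// PAD = 1: the row stride becomes 33 words, so the same column read is
//          spread across 32 different banks.
template <int PAD>
__global__ void transposeTile(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + PAD];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // row-wise store
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column-wise load
}

int main()
{
    const int width = 1024;  // assumed to be a multiple of TILE
    const size_t bytes = size_t(width) * width * sizeof(float);
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);

    dim3 block(TILE, TILE);
    dim3 grid(width / TILE, width / TILE);
    transposeTile<0><<<grid, block>>>(in, out, width);  // conflicting version
    transposeTile<1><<<grid, block>>>(in, out, width);  // padded version

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Both instantiations compute the same transpose. Only the shared memory layout differs, so comparing them in a profiler isolates the effect of the conflicts on shared memory traffic and on any resulting short scoreboard stalls.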

MIO vs. L1TEX

MIO and L1TEX are partitions in the NVIDIA SM. The MIO unit is responsible for shared execution units (shared by one or more SM sub-partitions), including lower rate math units (e.g. double precision on a GeForce chip) and memory input/output. The memory subsystem contains the L1, the TEX unit, the shared memory unit, and other domain specific (e.g. graphics) interfaces to the SM. The implementation of the MIO subsystem, including L1, TEX, and shared memory, varies greatly between Kepler, Maxwell-Pascal, and Volta-Ampere. SM sub-partitions (warp schedulers) issue instructions to shared execution units through instruction queues rather than by direct dispatch. For SM 7.0+ there are stall reasons (mio_throttle, lg_throttle, and tex_throttle) that occur if the instruction queues for those units are full.

What is included in the definition of MIO varies by architecture. L1TEX is technically in the MIO partition. The L1TEX is complicated as it has two input interfaces:

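  1. The LSU interface is used for shared memory, local/global memory (tagged), and special operations such as shuffle and special purpose registers.
  2. The TEX interface is used for texture fetches and, on 7.0-8.x, for a subset of the slower math operations (e.g. FP64 on a GeForce card). The latter is a little confusing. The slow math unit exists for binary compatibility and is not expected to be used at the same time as texture fetches.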

The term MIO can be confusing. The term L1TEX can also be confusing given the two different interfaces. Although there are two interfaces, local/global and texture/surface accesses share the same cache lookup stages, the same cache RAM, and the same SM to L2 interface, so for many metrics the term L1TEX is used to refer to the unit as a whole.
