How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?

Question

Note: This question is specific to nVIDIA Compute Capability 2.1 devices. The following information is obtained from the CUDA Programming Guide v4.1:

In compute capability 2.1 devices, each SM has 48 SP (cores) for integer and floating point operations. Each warp is composed of 32 consecutive threads. Each SM has 2 warp schedulers. At every instruction issue time, one warp scheduler picks a ready warp of threads and issues 2 instructions for the warp on the cores.

My doubts:

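  • A thread executes on one core. How can the device issue 2 instructions to a thread in a single clock cycle or in a single multi-cycle operation?
  • Does this mean the 2 instructions must be independent of each other?
  • Can the 2 instructions execute in parallel on the core, perhaps because they use different execution units in the core? Does this also mean that the warp becomes ready again only after both instructions have finished executing, or as soon as one of them has finished?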

Recommended answer

This is instruction-level parallelism (ILP). The instructions issued from a warp simultaneously must be independent of each other. They are issued by the SM instruction scheduler to separate functional units in the SM.

For example, if there are two independent FMAD instructions in the warp's instruction stream that are ready to issue and the SM has two available sets of FMAD units on which to issue them, they can both be issued in the same cycle. (Instructions can be issued together in various combinations, but I have not memorized them so I won't provide details here.)
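
To make "independent" concrete, here is a minimal sketch of a dependency check between two instructions. It is only a toy model: the `Instr` type, the register names, and the conservative rule (no read-after-write, write-after-write, or write-after-read overlap) are my assumptions for illustration, not the hardware's actual pairing logic, which is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, several sources."""
    name: str
    dst: str       # register written
    srcs: tuple    # registers read


def independent(first: Instr, second: Instr) -> bool:
    """Conservative check that `second` has no hazard against `first`."""
    raw = first.dst in second.srcs   # second reads what first writes
    waw = first.dst == second.dst    # both write the same register
    war = second.dst in first.srcs   # second overwrites an input of first
    return not (raw or waw or war)


# Register names are made up for the example.
fmad_1 = Instr("fmad_1", "r4", ("r0", "r1", "r2"))
fmad_2 = Instr("fmad_2", "r5", ("r0", "r3", "r2"))
fmad_3 = Instr("fmad_3", "r6", ("r4", "r1", "r2"))  # reads r4 from fmad_1

print(independent(fmad_1, fmad_2))  # True: candidates for issuing together
print(independent(fmad_1, fmad_3))  # False: fmad_3 must wait for fmad_1
```

In this model only the first pair could be issued in the same cycle; whether the hardware actually pairs them also depends on which functional units are free.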

The FMAD/IMAD execution units in SM 2.1 are 16 SPs wide. This means that it takes 2 cycles to issue a warp (32-thread) instruction to one of the 16-wide execution units. There are multiple (3) of these 16-wide execution units (48 SPs total) per SM, plus special function units. Each warp scheduler can issue to two of them per cycle.

Assume the FMAD execution units are pipe_A, pipe_B and pipe_C. Let us say that at cycle 135, there are two independent FMAD instructions fmad_1 and fmad_2 that are waiting:

  • At cycle 135, the instruction scheduler will issue the first half warp (16 threads) of fmad_1 to FMAD pipe_A, and the first half warp of fmad_2 to FMAD pipe_B.
  • At cycle 136, the first half warp of fmad_1 will have moved to the next stage in FMAD pipe_A, and similarly the first half warp of fmad_2 will have moved to the next stage in FMAD pipe_B. The warp scheduler now issues the second half warp of fmad_1 to FMAD pipe_A, and the second half warp of fmad_2 to FMAD pipe_B.
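
The two steps above can be written out as a small timeline generator. This is just a sketch of the abstract picture described here, using the 32-thread warp and 16-wide pipes from the text. The function name `issue_timeline` is hypothetical, and treating threads 0-15 as the "first half warp" is my assumption.

```python
WARP_SIZE = 32    # threads per warp (from the question)
PIPE_WIDTH = 16   # SPs per FMAD/IMAD execution unit (from the answer)


def issue_timeline(start_cycle, pairs):
    """List issue events for instructions dual-issued from one warp.

    `pairs` is a list of (instruction, pipe) tuples. Each event is
    (cycle, instruction, pipe, first_thread, last_thread).
    """
    slices = WARP_SIZE // PIPE_WIDTH  # 2 half warps per instruction
    events = []
    for s in range(slices):
        for instr, pipe in pairs:
            lo = s * PIPE_WIDTH  # assumed: lower-numbered threads go first
            events.append((start_cycle + s, instr, pipe, lo, lo + PIPE_WIDTH - 1))
    return events


timeline = issue_timeline(135, [("fmad_1", "pipe_A"), ("fmad_2", "pipe_B")])
for cycle, instr, pipe, lo, hi in timeline:
    print(f"cycle {cycle}: {instr} threads {lo}-{hi} -> {pipe}")
```

It prints two events for cycle 135 (threads 0-15 of each instruction) and two for cycle 136 (threads 16-31), matching the walkthrough.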

So it takes 2 cycles to issue 2 instructions from the same warp. But as the OP mentions, there are two warp schedulers, which means this whole process can be done simultaneously for instructions from another warp (assuming there are sufficient functional units). Hence the maximum issue rate is 2 warp instructions per cycle. Note that this is an abstracted view from a programmer's perspective; the actual low-level architectural details may be different.
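
The issue-rate arithmetic can be spelled out from the numbers quoted above. This is only back-of-the-envelope arithmetic under the same abstract model. The last figure (the FMAD-only bound) is my own derivation from those numbers, not something stated in the Programming Guide excerpt.

```python
WARP_SIZE = 32          # threads per warp (from the question)
PIPE_WIDTH = 16         # SPs per FMAD/IMAD execution unit (from the answer)
FMAD_PIPES = 3          # 16-wide FMAD/IMAD units per SM (from the answer)
SCHEDULERS_PER_SM = 2   # warp schedulers per SM (from the question)
DUAL_ISSUE = 2          # instructions one scheduler issues together per warp

# A 32-thread instruction needs 2 cycles on a 16-wide unit.
cycles_per_warp_instr = WARP_SIZE // PIPE_WIDTH

# One scheduler: 2 instructions spread over 2 cycles.
per_scheduler_rate = DUAL_ISSUE / cycles_per_warp_instr

# Both schedulers working at once.
sm_peak_rate = SCHEDULERS_PER_SM * per_scheduler_rate

# Derived bound if every instruction needed an FMAD/IMAD pipe:
# 48 lanes per cycle divided by 32 threads per warp instruction.
fmad_only_rate = FMAD_PIPES * PIPE_WIDTH / WARP_SIZE

print(cycles_per_warp_instr)  # 2
print(per_scheduler_rate)     # 1.0
print(sm_peak_rate)           # 2.0
print(fmad_only_rate)         # 1.5
```

Under these assumptions the peak of 2 warp instructions per cycle is only reachable when the instruction mix also uses units other than the three FMAD/IMAD pipes, which is what the "sufficient functional units" caveat amounts to.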

As for your question about when the warp will be ready next, if there are more instructions that don't depend on any outstanding (already issued but not retired) instructions, then they can be issued in the very next cycle. But as soon as the only available instructions are dependent on in-flight instructions, the warp will not be able to issue. However, that is where other warps come in: the SM can issue instructions for any resident warp that has available (non-blocked) instructions. This arbitrary switching between warps is what provides the "latency hiding" that GPUs depend on for high throughput.
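
A toy simulation shows why switching between warps hides latency. Everything here is an assumption made for illustration: a single scheduler that issues at most one instruction per cycle, warps that each run a chain of fully dependent instructions, a made-up latency of 10 cycles, and a "lowest-numbered ready warp first" policy. None of these values describe real hardware.

```python
def simulate(num_warps, instrs_per_warp, latency):
    """Return the number of cycles needed to issue every instruction.

    Each warp runs a chain of dependent instructions, so it can issue
    its next instruction only `latency` cycles after the previous one.
    """
    ready_at = [0] * num_warps               # earliest cycle each warp may issue
    remaining = [instrs_per_warp] * num_warps
    cycle = 0
    while any(remaining):
        for w in range(num_warps):
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + latency
                break                        # at most one issue per cycle
        cycle += 1
    return cycle


LATENCY = 10  # arbitrary illustrative value, not a hardware figure

print(simulate(1, 4, LATENCY))    # 31 cycles to issue 4 instructions
print(simulate(10, 4, LATENCY))   # 40 cycles to issue 40 instructions
```

With one warp the scheduler idles for most cycles while each result is in flight. With ten resident warps there is always some warp ready, so an instruction issues every cycle in this model.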
