How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?


Question

Note: This question is specific to nVIDIA Compute Capability 2.1 devices. The following information is obtained from the CUDA Programming Guide v4.1:


In compute capability 2.1 devices, each SM has 48 SP (cores) for integer and floating point operations. Each warp is composed of 32 consecutive threads. Each SM has 2 warp schedulers. At every instruction issue time, one warp scheduler picks a ready warp of threads and issues 2 instructions for the warp on the cores.

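My doubts:


  • A thread executes on one core. How can the device issue 2 instructions to a thread in a single clock cycle or in a single multi-cycle operation?

  • Does this mean the 2 instructions must be independent of each other?

  • Can the 2 instructions execute in parallel on the core, perhaps because they use different execution units within the core? Does this also mean that the warp becomes ready again only after both instructions have finished executing, or after one of them has?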

Recommended answer

This is instruction-level parallelism (ILP). The instructions issued from a warp simultaneously must be independent of each other. They are issued by the SM instruction scheduler to separate functional units in the SM.
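The independence requirement can be illustrated with a toy check. This is only a sketch: the `Instr` tuple, the register names, and the `independent` helper are hypothetical, and the test is deliberately conservative (it rejects any read-after-write, write-after-write, or write-after-read overlap). It does not describe how the hardware scheduler actually decides.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, sources."""
    op: str
    dst: str
    srcs: Tuple[str, ...]


def independent(first: Instr, second: Instr) -> bool:
    """Conservative check that `second` has no register hazard on `first`."""
    raw = first.dst in second.srcs   # second reads what first writes
    waw = first.dst == second.dst    # both write the same register
    war = second.dst in first.srcs   # second overwrites an input of first
    return not (raw or waw or war)


# Hypothetical instruction stream for one warp.
a = Instr("FMAD", "r4", ("r0", "r1", "r2"))
b = Instr("FMAD", "r5", ("r0", "r3", "r2"))  # shares only inputs with a
c = Instr("FMAD", "r6", ("r4", "r1", "r2"))  # reads r4, which a writes

print(independent(a, b))  # True: candidates for issue in the same cycle
print(independent(a, c))  # False: c must wait for a's result
```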

For example, if there are two independent FMAD instructions in the warp's instruction stream that are ready to issue and the SM has two available sets of FMAD units on which to issue them, they can both be issued in the same cycle. (Instructions can be issued together in various combinations, but I have not memorized them so I won't provide details here.)

The FMAD/IMAD execution units in SM 2.1 are 16 SPs wide. This means that it takes 2 cycles to issue a warp (32-thread) instruction to one of the 16-wide execution units. There are multiple (3) of these 16-wide execution units (48 SPs total) per SM, plus special function units. Each warp scheduler can issue to two of them per cycle.
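The arithmetic behind these figures, written out as a short check. The constant names are made up for illustration; the values are the ones quoted above.

```python
WARP_SIZE = 32    # threads per warp
PIPE_WIDTH = 16   # SPs per FMAD/IMAD execution unit
NUM_PIPES = 3     # 16-wide execution units per SM

# A 32-thread warp instruction is fed to a 16-wide unit in two halves.
cycles_per_warp_instr = WARP_SIZE // PIPE_WIDTH

# Three 16-wide units account for the SP count per SM.
total_sps = NUM_PIPES * PIPE_WIDTH

print(cycles_per_warp_instr)  # 2
print(total_sps)              # 48
```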

Assume the FMAD execution units are pipe_A, pipe_B and pipe_C. Let us say that at cycle 135, there are two independent FMAD instructions fmad_1 and fmad_2 that are waiting:


  • At cycle 135, the instruction scheduler will issue the first half warp (16 threads) of fmad_1 to FMAD pipe_A, and the first half warp of fmad_2 to FMAD pipe_B.
  • At cycle 136, the first half warp of fmad_1 will have moved to the next stage in FMAD pipe_A, and similarly the first half warp of fmad_2 will have moved to the next stage in FMAD pipe_B. The warp scheduler now issues the second half warp of fmad_1 to FMAD pipe_A, and the second half warp of fmad_2 to FMAD pipe_B.
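The two steps above can be sketched as a small issue timeline. The `issue_timeline` function is hypothetical, and the assumption that threads 0-15 form the first half warp is mine; the answer does not say which threads go first.

```python
WARP_SIZE = 32
PIPE_WIDTH = 16


def issue_timeline(start_cycle, assignments):
    """List (cycle, instr, pipe, (first_thread, last_thread)) issue events.

    `assignments` pairs each instruction with the pipe it is issued to.
    Assumes each half warp of every instruction is issued one cycle apart.
    """
    halves = WARP_SIZE // PIPE_WIDTH
    events = []
    for half in range(halves):
        first = half * PIPE_WIDTH
        last = first + PIPE_WIDTH - 1
        for instr, pipe in assignments:
            events.append((start_cycle + half, instr, pipe, (first, last)))
    return events


timeline = issue_timeline(135, [("fmad_1", "pipe_A"), ("fmad_2", "pipe_B")])
for cycle, instr, pipe, (first, last) in timeline:
    print(f"cycle {cycle}: {instr} threads {first}-{last} -> {pipe}")
```

Under these assumptions both instructions occupy cycles 135 and 136, and pipe_C stays free for another warp.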

So it takes 2 cycles to issue 2 instructions from the same warp. But as the OP mentions, there are two warp schedulers, which means this whole process can be done simultaneously for instructions from another warp (assuming there are sufficient functional units). Hence the maximum issue rate is 2 warp instructions per cycle. Note that this is an abstracted view from a programmer's perspective; the actual low-level architectural details may be different.
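The peak rate follows from the numbers in this abstracted model. The variable names below are illustrative only.

```python
SCHEDULERS = 2             # warp schedulers per SM
INSTRS_PER_DUAL_ISSUE = 2  # instructions a scheduler issues for one warp
CYCLES_PER_DUAL_ISSUE = 2  # cycles to feed both half warps to the pipes

# Peak warp instructions issued per cycle across the SM, assuming enough
# free functional units for both schedulers.
peak_rate = SCHEDULERS * INSTRS_PER_DUAL_ISSUE / CYCLES_PER_DUAL_ISSUE

print(peak_rate)  # 2.0
```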

As for your question about when the warp will be ready next: if there are more instructions that don't depend on any outstanding (already issued but not retired) instructions, then they can be issued in the very next cycle. But as soon as the only available instructions depend on in-flight instructions, the warp will not be able to issue. However, that is where other warps come in: the SM can issue instructions for any resident warp that has available (non-blocked) instructions. This arbitrary switching between warps is what provides the "latency hiding" that GPUs depend on for high throughput.
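A toy simulation can make the latency-hiding effect concrete. Everything here is an assumption made for illustration: a single scheduler that issues at most one instruction per cycle, a fixed latency, and warps whose instructions form a fully dependent chain. It is not a model of the real CC 2.1 scheduler.

```python
def simulate(num_warps, instrs_per_warp, latency):
    """Return the number of cycles needed to issue every instruction.

    Each warp's next instruction depends on its previous one, so the warp
    is blocked for `latency` cycles after each issue. The scheduler picks
    the first warp that is ready and issues at most one instruction per cycle.
    """
    remaining = [instrs_per_warp] * num_warps
    ready_at = [0] * num_warps  # earliest cycle each warp may issue again
    total = num_warps * instrs_per_warp
    issued = 0
    cycle = 0
    while issued < total:
        for w in range(num_warps):
            if remaining[w] > 0 and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + latency
                issued += 1
                break  # only one issue slot per cycle in this toy model
        cycle += 1
    return cycle


print(simulate(1, 4, 4))  # 13: one warp leaves most issue slots idle
print(simulate(4, 4, 4))  # 16: four warps fill every issue slot
```

With one warp, 4 instructions take 13 cycles; with four warps, 16 instructions take 16 cycles, because another warp is always ready while the others wait.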
