How is a warp formed and handled by the hardware warp scheduler?


Problem description

My questions are about warps and scheduling. I'm using NVIDIA Fermi terminology here. My observations are below; are they correct?

A. Threads in the same warp execute the same instruction. Each warp includes 32 threads.

According to the Fermi Whitepaper: "Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. "

From here, I think a warp (32 threads) is scheduled twice, since 16 cores out of 32 are grouped together. Each scheduler issues half of a warp to 16 cores in a cycle, and in all, two schedulers issue two warp-halves to two 16-core scheduling groups in a cycle. In other words, one warp needs to be scheduled twice, half by half, in this Fermi architecture. If a warp contains only SFU operations, then this warp needs to be issued 8 times (32/4), since there are only 4 SFUs in an SM.

B. When a large number of threads (say a 1-D array of 320 threads) is launched, consecutive threads will automatically be grouped into 10 warps, each with 32 threads. Therefore, if all threads are doing the same work, they will execute exactly the same instruction. In this case, all warps always carry the same instruction.
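As a minimal sketch of the grouping rule (plain Python modeling the partitioning arithmetic, not CUDA code), the 320 consecutively numbered threads split into warps of 32 like this:

```python
WARP_SIZE = 32      # warp width on Fermi (and all CUDA GPUs to date)
NUM_THREADS = 320   # the 1-D launch from the example above

# Each thread's warp is determined purely by its linear thread index.
warps = {}
for tid in range(NUM_THREADS):
    warps.setdefault(tid // WARP_SIZE, []).append(tid)

print(len(warps))        # 10 warps
print(warps[0][0], warps[0][-1])   # warp 0 holds threads 0..31
```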

Questions:

Q1. Which part handles the grouping of threads into warps? Software or hardware? If hardware, is it the warp scheduler? And how is the hardware warp scheduler implemented, and how does it work?

Q2. If I have 64 threads, where threads 0-15 and 32-47 are executing one instruction while threads 16-31 and 48-63 execute another, is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into one warp, and threads 16-31 and 48-63 into another)?

Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)? (This is a hardware question.) Since in this case (Fermi), a warp will be scheduled twice (in two cycles) anyway. If a warp were 16 wide, two warps would simply be scheduled (also in two cycles), which seems the same as the previous case. I wonder whether this organization is due to performance concerns.

What I can imagine now is: threads in the same warp are guaranteed to be synchronized, which can sometimes be useful; or other resources, such as registers and memory, are organized on a per-warp basis. I'm not sure whether this is correct.

Answer

To correct a few misconceptions:


A. ...From here, I think a warp (32 threads) is scheduled twice since 16 cores out of 32 are grouped together.

When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock"), so that each core actually executes two threads' worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice.
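The arithmetic behind this correction, sketched in plain Python using the numbers from the whitepaper quote above (16 cores per scheduler group, cores clocked at twice the scheduler clock):

```python
CORES_PER_GROUP = 16           # one scheduler's group of CUDA cores
HOTCLOCKS_PER_SCHED_CYCLE = 2  # cores run at 2x the scheduler clock ("hotclock")
WARP_SIZE = 32

# Lanes of work a 16-core group completes in one scheduler cycle:
lanes_serviced = CORES_PER_GROUP * HOTCLOCKS_PER_SCHED_CYCLE
print(lanes_serviced)               # 32
print(lanes_serviced == WARP_SIZE)  # True: one issue covers the whole warp
```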


B. ...Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.

It's true that all threads in a block (and therefore all warps) are executing from the same instruction stream, but they are not necessarily executing the same instruction. Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time. The diagram on page 10 of the Fermi whitepaper makes this evident.


Q1: Which part handles the threads grouping (into warps)? software or hardware?

It is done by hardware, as explained in the hardware implementation section of the programming guide: "The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block."
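A sketch of that fixed rule in plain Python. The linearization formula for a 3-D thread index (tid = x + y*Dx + z*Dx*Dy) is the one given in the programming guide's Thread Hierarchy section; the block shape used below is just an illustrative example:

```python
WARP_SIZE = 32

def linear_thread_id(x, y, z, dim_x, dim_y):
    # CUDA's linearization of a 3-D thread index within a block:
    # tid = x + y*Dx + z*Dx*Dy
    return x + y * dim_x + z * dim_x * dim_y

def warp_id(x, y, z, dim_x, dim_y):
    # Consecutive, increasing linear IDs are packed into warps of 32.
    return linear_thread_id(x, y, z, dim_x, dim_y) // WARP_SIZE

# In a 16x4 block: threads (0,0,0) and (15,1,0) land in warp 0;
# thread (0,2,0) has linear id 32 and starts warp 1.
print(warp_id(0, 0, 0, 16, 4))   # 0
print(warp_id(15, 1, 0, 16, 4))  # 0  (linear id 31)
print(warp_id(0, 2, 0, 16, 4))   # 1  (linear id 32)
```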


And how is the hardware warp scheduler implemented, and how does it work?

I don't believe this is formally documented anywhere. Greg Smith has provided various explanations about it, and you may wish to search on "user:124092 scheduler" or a similar search to read some of his comments.


Q2. If I have 64 threads, threads 0-15 and 32-47 are executing the same instruction while 16-31 and 48-63 execute another instruction; is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into the same warp, and threads 16-31 and 48-63 into another warp)?

This question is predicated on misconceptions outlined earlier. The grouping of threads into a warp is not dynamic; it is fixed at threadblock launch time, and it follows the methodology described above in the answer to Q1. Furthermore, threads 0-15 will never be scheduled with any threads other than 16-31, as 0-31 comprise a warp, which is indivisible for scheduling purposes, on Fermi.
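To make that concrete, here is a toy Python model of the 64-thread case (an illustrative sketch, not how the hardware is literally built): the warp boundaries stay fixed at 0-31 and 32-63, and the half-warp divergence inside each warp is handled by executing both branch paths serially, not by regrouping threads across warps:

```python
WARP_SIZE = 32
NUM_THREADS = 64

# Per the question: threads 0-15 and 32-47 take branch "A", the rest branch "B".
branch = {tid: "A" if (tid % 32) < 16 else "B" for tid in range(NUM_THREADS)}

# Warp membership is fixed at launch: consecutive blocks of 32 thread IDs.
warps = [list(range(w, w + WARP_SIZE)) for w in range(0, NUM_THREADS, WARP_SIZE)]

for w, lanes in enumerate(warps):
    paths = sorted({branch[t] for t in lanes})
    # Each distinct path taken within a warp costs a separate serialized pass.
    print(f"warp {w}: threads {lanes[0]}-{lanes[-1]}, "
          f"{len(paths)} serialized path(s): {paths}")
```

Both warps end up executing both paths; regrouping threads by branch target never happens.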

Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)?

Again, I believe this question is predicated on the previous misconceptions. The hardware units used to provide resources for a warp may exist in groups of 16 (or some other number) at some functional level, but operationally the warp is scheduled as 32 threads, and each instruction is scheduled for the entire warp and executed together, within some number of Fermi hotclocks.
