CUDA Warps and Optimal Number of Threads Per Block


Problem Description

From what I understand about Kepler GPUs, and CUDA in general, when a single SMX unit works on a block, it launches warps, which are groups of 32 threads. Now here are my questions:

1) If the SMX unit can work on 64 warps, that means there is a limit of 32x64 = 2048 threads per SMX unit. But Kepler GPUs have 4 warp schedulers, so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel? And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32? This is, of course, ignoring any divergence, or even cases where something like a global memory access can cause a warp to stall and have the scheduler switch to another.

2) If the above is correct, is the best possible outcome for a single SMX unit to work on 128 threads simultaneously? And for something like a GTX Titan that has 14 SMX units, a total of 128x14 = 1792 threads? I see numbers online that say otherwise: that a Titan can run 14x64 (max warps per SMX) x 32 (threads per warp) = 28,672 concurrently. How can that be, if SMX units launch warps and only have 4 warp schedulers? They cannot launch all 2048 threads per SMX at once? Maybe I'm confusing the definition of the maximum number of threads the GPU can run concurrently with what you are allowed to queue?

Any answers and clarifications are appreciated.

Recommended Answer


so does this mean that only 4 warps can be worked on simultaneously within a GPU kernel?

Instructions from up to 4 warps can be scheduled in any given clock cycle on a Kepler SMX. However, because the execution units are pipelined, in any given clock cycle instructions from any, and up to all, of the warps currently resident on the SMX may be in various stages of pipelined execution.


And if so, does this mean I should really be looking for blocks that have multiples of 128 threads (assuming no divergence in threads) as opposed to the recommended 32?

I'm not sure how you jumped from the previous point to this one. Since the instruction mix presumably varies from warp to warp (different warps are presumably at different points in the instruction stream), and the instruction mix varies from one place to another within the instruction stream, I don't see any logical connection between the 4 warps that are schedulable in a given clock cycle and any need to have groups of 4 warps. A given warp may be at a point where its instructions are highly schedulable (perhaps at a sequence of SP FMAs, requiring SP cores, which are plentiful), while another 3 warps may be at another point in the instruction stream where their instructions are "harder to schedule" (perhaps requiring SFUs, of which there are fewer). Therefore, arbitrarily grouping warps into sets of 4 doesn't make much sense. Note that divergence is not required for warps to get out of sync with each other; the natural behavior of the scheduler, coupled with the varying availability of execution resources, can cause warps that started out together to end up at different points in the instruction stream.
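
As an illustration, here is a minimal launch-configuration sketch (the kernel name, body, and sizes below are hypothetical, not taken from the original answer): the block size only needs to be a multiple of the 32-thread warp size, and the grid should supply enough blocks to keep many warps resident; nothing requires grouping in fours.

#include <cuda_runtime.h>

// Hypothetical kernel: the name and body are illustrative only.
__global__ void scaleKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

void launchExample(float *d_data, int n)
{
    int block = 256;                      // any multiple of 32 works: 64, 96, 128, 256, ...
    int grid  = (n + block - 1) / block;  // enough blocks to cover n and keep many warps resident per SMX
    scaleKernel<<<grid, block>>>(d_data, n);
}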

For your second question, I think your fundamental knowledge gap is in understanding how a GPU hides latency. Suppose a GPU has a set of 3 instructions to issue across a warp:

LD R0, a[idx]      // load a[idx] from global memory into R0
LD R1, b[idx]      // load b[idx] from global memory into R1
MPY R2, R0, R1     // multiply; cannot issue until R0 and R1 are populated

The first instruction is an LD from global memory, and it can be issued without stalling the warp. The second instruction likewise can be issued. The warp will stall at the third instruction, however, due to the latency of global memory: until R0 and R1 are properly populated, the multiply instruction cannot be dispatched. The GPU deals with this problem by (hopefully) having a ready supply of "other work" it can turn to, namely other warps in an unstalled state (i.e. warps that have an instruction that can be issued). The best way to facilitate this latency-hiding process is to have many warps available to the SMX. There isn't any granularity to this (such as needing 4 warps); generally speaking, the more threads/warps/blocks there are in your grid, the better chance the GPU has of hiding latency.
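
To make the example concrete, the following is the sort of CUDA C source whose body (two global loads feeding a multiply) might compile to roughly an LD/LD/MPY sequence like the one above; the kernel name and signature are illustrative assumptions, not part of the original answer.

__global__ void mul(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = a[idx] * b[idx];   // the multiply cannot issue until both loads have completed
}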

So it is true that the GPU cannot "launch" 2048 threads (i.e. issue instructions from 2048 threads) in a single clock cycle. But when a warp stalls, it is put into a waiting queue until the stall condition is lifted, and until then it is helpful to have other warps "ready to go" for the next clock cycle(s).
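
To connect this back to the figures in the question: the 2048 threads per SMX and the 14 x 2048 = 28,672 threads on a GTX Titan are limits on resident threads, i.e. work that can be queued and kept ready for scheduling, not on instructions issued in one cycle. Below is a minimal sketch that queries these limits through the CUDA runtime API (the printed values depend on the device; the 28,672 figure applies to a 14-SMX GTX Titan):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Threads that can be resident (queued and ready for scheduling) at once,
    // as opposed to threads whose instructions issue in a single clock cycle.
    int residentPerSMX = prop.maxThreadsPerMultiProcessor;           // 2048 on Kepler
    int residentTotal  = residentPerSMX * prop.multiProcessorCount;  // 2048 x 14 = 28,672 on a GTX Titan

    printf("warp size            : %d\n", prop.warpSize);
    printf("SMX count            : %d\n", prop.multiProcessorCount);
    printf("resident threads/SMX : %d\n", residentPerSMX);
    printf("resident threads/GPU : %d\n", residentTotal);
    return 0;
}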

GPU latency hiding is a commonly misunderstood topic. There are many available resources to learn about it if you search for them.
