How can warps in the same block diverge


Problem description



I am a bit confused about how it is possible that warps diverge and need to be synchronized via the __syncthreads() function. All elements in a block handle the same code in SIMT fashion. How can it be that they are not in sync? Is it related to the scheduler? Do the different warps get different computing times? And why is there an overhead when using __syncthreads()?

Let's say we have 12 different warps in a block, and 3 of them have finished their work. So now they are idling while the other warps get their computation time. Or do they still get computation time to execute the __syncthreads() function?

Solution

First let's be careful with terminology. Warp divergence refers to threads within a single warp that take different execution paths, due to control structures in the code (if, while, etc.) Your question really has to do with warps and warp scheduling.
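To make the distinction concrete, here is a minimal sketch of warp divergence (the kernel and its launch parameters are my own illustration, not from the question): threads within a single warp branch on their lane index, so the hardware must serialize the two paths:

```cuda
#include <cstdio>

// Threads in the SAME warp take different paths through the if/else,
// so the hardware serializes the two paths. This is warp divergence,
// which is distinct from the warp-scheduling question above.
__global__ void divergentKernel(int *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0) {
        out[tid] = tid * 2;   // even lanes take this path
    } else {
        out[tid] = tid + 100; // odd lanes take this path
    }
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    divergentKernel<<<1, 32>>>(d_out);  // exactly one warp of 32 threads
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```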

Although the SIMT model might suggest that all threads execute in lockstep, this is not the case. First of all, threads within different blocks are completely independent. They may execute in any order with respect to each other. For your question about threads within the same block, let's first observe that a block can have up to 1024 (or perhaps more) threads, but today's SMs (SM or SMX is the "engine" inside the GPU that processes a threadblock) don't have 1024 CUDA cores, so it's not even theoretically possible for an SM to execute all threads of a threadblock in lockstep. Note that a single threadblock executes on a single SM, not across all (or more than one) SMs simultaneously. So even if a machine has 512 or more total CUDA cores, they cannot all be used to handle the threads of a single threadblock, because a single threadblock executes on a single SM. (One reason for this is so that SM-specific resources, like shared memory, can be accessible to all threads within a threadblock.)
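As a quick check of these limits on your own GPU, the following sketch queries the runtime for the figures discussed above (the field names are from the standard cudaDeviceProp struct):

```cuda
#include <cstdio>

// Query the device limits discussed above: the per-block thread cap is
// well above the number of cores a single SM has, so lockstep execution
// of a whole block is not possible.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}
```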

So what happens? It turns out each SM has a warp scheduler. A warp is nothing more than a collection of 32 threads that gets grouped together, scheduled together, and executed together. If a threadblock has 1024 threads then it has 32 warps of 32 threads per warp. Now, for example, on Fermi, an SM has 32 CUDA cores, so it is reasonable to think about an SM executing a warp in lockstep (and that is what happens, on Fermi). By lockstep, I mean that (ignoring the case of warp divergence, and also certain aspects of instruction-level-parallelism, I'm trying to keep the explanation simple here...) no instruction in the warp is executed until the previous instruction has been executed by all threads in the warp. So a Fermi SM can only actually be executing one of the warps in a threadblock at any given instant. All other warps in that threadblock are queued up, ready to go, waiting.
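As a small illustration of that arithmetic (the identifiers here are my own), each thread can compute which warp it belongs to from its index:

```cuda
#include <cstdio>

// Sketch: how threads map onto warps. With a 1024-thread block,
// threadIdx.x / warpSize yields warp IDs 0..31, i.e. 32 warps of
// 32 threads each, matching the arithmetic above.
__global__ void printWarpIds()
{
    int warpId = threadIdx.x / warpSize; // which warp this thread belongs to
    int laneId = threadIdx.x % warpSize; // this thread's lane within its warp
    if (laneId == 0)                     // one printout per warp
        printf("block %d, warp %d\n", (int)blockIdx.x, warpId);
}

int main()
{
    printWarpIds<<<1, 1024>>>(); // one block of 1024 threads = 32 warps
    cudaDeviceSynchronize();
    return 0;
}
```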

Now, when the execution of a warp hits a stall for any reason, the warp scheduler is free to move that warp out and bring another ready-to-go warp in (this new warp might not even be from the same threadblock, but I digress.) Hopefully by now you can see that if a threadblock has more than 32 threads in it, not all the threads are actually getting executed in lockstep. Some warps are proceeding ahead of other warps.

This behavior is normally desirable, except when it isn't. There are times when you do not want any thread in the threadblock to proceed beyond a certain point, until a condition is met. This is what __syncthreads() is for. For example, you might be copying data from global to shared memory, and you don't want any of the threadblock data processing to commence until shared memory has been properly populated. __syncthreads() ensures that all threads have had a chance to copy their data element(s) before any thread can proceed beyond the barrier and presumably begin computations on the data that is now resident in shared memory.
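Here is a minimal sketch of that pattern (the array names, block size, and toy computation are assumptions of mine, not from the original answer): each thread stages one element into shared memory, and the barrier guarantees staging is complete before any thread reads a neighbor's element:

```cuda
#include <cstdio>

// Stage data in shared memory, then barrier before any thread computes
// on data that may have been loaded by a different warp.
__global__ void smoothKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];            // one slot per thread in the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < n)
        tile[threadIdx.x] = in[gid];       // each thread copies its element

    __syncthreads();  // barrier: no thread continues until all warps in the
                      // block have finished populating shared memory

    if (gid < n) {
        // Safe only because of the barrier: tile[threadIdx.x - 1] may have
        // been written by a thread in a different warp.
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[gid] = 0.5f * (tile[threadIdx.x] + tile[left]);
    }
}

int main()
{
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    smoothKernel<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```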

The overhead of __syncthreads() comes in two flavors. First of all, there's a very small cost just to process the machine-level instructions associated with this built-in function. Second, __syncthreads() will normally have the effect of forcing the warp scheduler and SM to shuffle through all the warps in the threadblock until each warp has met the barrier. If this is useful, great. But if it's not needed, then you're spending time doing something that isn't needed. Hence the advice not to just liberally sprinkle __syncthreads() through your code: use it sparingly and where needed. If you can craft an algorithm that doesn't use it as much as another, that algorithm may be better (faster).
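As one example of trimming __syncthreads() usage (my sketch, not from the original answer): on Kepler and later, work confined to a single warp can use shuffle intrinsics instead of shared memory, and since a warp is scheduled as a unit, no block-wide barrier is needed at all (the _sync variants require CUDA 9+):

```cuda
#include <cstdio>

// A warp-level shuffle reduction: 32 values are summed without any
// __syncthreads(), because all participating threads belong to one warp.
__global__ void warpSum(const float *in, float *out)
{
    float val = in[threadIdx.x];
    // Each step halves the number of active summands within the warp.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x == 0)
        *out = val; // lane 0 now holds the sum of all 32 inputs
}

int main()
{
    float h_in[32], h_out = 0.0f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f; // expected sum: 32
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```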

