CUDA: synchronizing threads

Question

Almost anywhere I read about programming with CUDA there is a mention of the importance that all of the threads in a warp do the same thing.

In my code I have a situation where I can't avoid a certain condition. It looks like this:

// some math code, calculating d1, d2
if (d1 < 0.5)
{
    buffer[x1] += 1;  // buffer is in the global memory
}
if (d2 < 0.5)
{
    buffer[x2] += 1;
}
// some more math code.

Some of the threads might enter into one of the conditions, some might enter into both, and others might not enter into either.

Now in order to make all the threads get back to "doing the same thing" again after the conditions, should I synchronize them after the conditions using __syncthreads()? Or does this somehow happen automagically?

Can two threads be not doing the same thing due to one of them being one operation behind, thus ruining it for everyone? Or is there some behind-the-scenes effort to get them to do the same thing again after a branch?

Answer

Within a warp, no threads will "get ahead" of any others. If there is a conditional branch and it is taken by some threads in the warp but not others (a.k.a. warp "divergence"), the other threads will just idle until the branch is complete and they all "converge" back together on a common instruction. So if you only need within-warp synchronization of threads, that happens "automagically."

But different warps are not synchronized this way. So if your algorithm requires that certain operations be complete across many warps then you'll need to use explicit synchronization calls (see the CUDA Programming Guide, Section 5.4).

EDIT: reorganized the next few paragraphs to clarify some things.

There are really two different issues here: Instruction synchronization and memory visibility.


  • __syncthreads() enforces instruction synchronization and ensures memory visibility, but only within a block, not across blocks (CUDA Programming Guide, Appendix B.6). It is useful for write-then-read on shared memory (see the first sketch after this list), but is not appropriate for synchronizing global memory access.

  • __threadfence() ensures global memory visibility but doesn't do any instruction synchronization, so in my experience it is of limited use (but see the sample code in Appendix B.5, condensed in the second sketch below).
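
First, a minimal sketch of the write-then-read pattern that __syncthreads() is good for. The kernel name and the fixed tile size are made up for the example: every thread writes one slot of a shared-memory tile, the barrier guarantees all of those writes are visible block-wide, and only then does any thread read a slot written by a different thread.

__global__ void reverseTile(const float *in, float *out)
{
    __shared__ float tile[256];   // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];    // each thread writes its own slot

    __syncthreads();              // barrier: all writes to tile[] are now
                                  // visible to every thread in this block

    // safe: this slot was written by a *different* thread in the same block
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

Second, a condensed version of the Appendix B.5 idea, to show what __threadfence() buys you: each block publishes a value to global memory, fences, then signals completion through an atomic counter, so the last block to finish knows it can safely read everyone's published values. The names here are my own; treat it as a sketch of the pattern rather than the guide's exact code.

__device__ unsigned int blocksDone = 0;

__global__ void publishThenReduce(volatile float *partial, float *result)
{
    // each block publishes one value (just its own index here, for brevity)
    if (threadIdx.x == 0)
        partial[blockIdx.x] = (float)blockIdx.x;

    __shared__ bool isLastBlock;
    if (threadIdx.x == 0)
    {
        __threadfence();  // make the write to partial[] visible to all blocks
                          // before the counter below is incremented
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        isLastBlock = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // only the last block to arrive reads the published values
    if (isLastBlock && threadIdx.x == 0)
    {
        float sum = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            sum += partial[b];
        *result = sum;
        blocksDone = 0;   // reset the counter for the next launch
    }
}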

Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host.
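
A minimal sketch of that split, with hypothetical kernels f_kernel and g_kernel: the launch boundary on the host is what guarantees that every thread of the first kernel has finished before any thread of the second starts, because kernels launched into the same (default) stream execute in order.

__global__ void f_kernel(float *data, int n) { /* f(): phase 1 on every element */ }
__global__ void g_kernel(float *data, int n) { /* g(): may read any phase-1 result */ }

void runBothPhases(float *d_data, int n)
{
    int block = 256;
    int grid = (n + block - 1) / block;

    f_kernel<<<grid, block>>>(d_data, n);
    // no explicit sync needed: launches on the same stream run in order,
    // so g_kernel starts only after f_kernel has completed on all threads
    g_kernel<<<grid, block>>>(d_data, n);
}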

If you just need to increment shared or global counters, consider using the atomic increment function atomicInc() (Appendix B.10). In the case of your code above, if x1 and x2 are not globally unique (across all threads in your grid), non-atomic increments will result in a race condition, similar to the last paragraph of Appendix B.2.4.
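
Applied to the snippet in the question, a hedged rewrite might look like the following. I've used atomicAdd() rather than atomicInc() for simplicity, and assumed buffer is declared as an int array in global memory; adjust to your actual types.

// some math code, calculating d1, d2
if (d1 < 0.5)
{
    atomicAdd(&buffer[x1], 1);  // safe even if another thread also hits x1
}
if (d2 < 0.5)
{
    atomicAdd(&buffer[x2], 1);
}
// some more math code.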

Finally, keep in mind that any operations on global memory, and synchronization functions in particular (including atomics), are bad for performance.

Without knowing the problem you're solving it is hard to speculate, but perhaps you can redesign your algorithm to use shared memory instead of global memory in some places. This will reduce the need for synchronization and give you a performance boost.
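
One common shape for such a redesign, sketched under the assumption that many threads in a block bump the same counter: accumulate into a per-block counter in shared memory (cheap atomics, no global traffic), then have one thread flush the block's total to global memory once.

__global__ void countHits(const float *d, int n, int *globalCount)
{
    __shared__ int blockCount;   // per-block partial counter
    if (threadIdx.x == 0)
        blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && d[i] < 0.5f)
        atomicAdd(&blockCount, 1);   // atomic on shared memory: much cheaper

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount);  // one global atomic per block
}

Only the per-block totals ever touch global memory, which is exactly the kind of traffic reduction being recommended here.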
