How to avoid divergent branches in simple if statements in CUDA


Question

I am wondering how to avoid branching in kernels when threads have to compare and store values from local, shared, or global variables. For example, the following code checks a shared variable and sets a bool flag to true accordingly:

if ( shared_variable < local_value ){
    shared_bool_var = true;
}
__syncthreads();

The problem here is that all threads access the same variable, and every thread that passes the test overwrites it with true. So I would use a threadIdx.x check to let only one thread access that variable, but this would cause branch divergence.

if ( threadIdx.x == 0 && shared_variable < local_value ){
    shared_bool_var = true;
}
__syncthreads();

The question is: which should I prefer? Both cases seem safe, since __syncthreads() will protect against hazards (read before write, etc.). My preference is the second solution, but usually the code is not that simple.

In the aforementioned case, is it safe to let all threads access one shared memory location, or would this cause a bank conflict or serialization of the memory accesses? Thanks.

Recommended answer

One important thing to note: semantically and functionally, the two code stanzas are not equivalent:

// set var to true if ANY thread in the block verifies the predicate
if (shared_variable < local_value) {
    shared_bool_var = true;
}

// set var to true if THE FIRST thread in the block verifies the predicate
if (threadIdx.x == 0 && shared_variable < local_value) {
    shared_bool_var = true;
}
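To make the difference concrete, here is a minimal sketch that runs both stanzas in one block and reports each flag. The names `compare_stanzas`, `run_compare`, `flag_any`, and `flag_first` are hypothetical and not part of the original question; the kernel also initializes the shared variables explicitly, which the snippets above leave implicit.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: evaluates both stanzas in one block and reports each flag.
__global__ void compare_stanzas(const int* values, int threshold,
                                int* out_any, int* out_first)
{
    __shared__ int  shared_variable;
    __shared__ bool flag_any;
    __shared__ bool flag_first;

    // One thread initializes the shared state before anyone reads it.
    if (threadIdx.x == 0) {
        shared_variable = threshold;
        flag_any = false;
        flag_first = false;
    }
    __syncthreads();

    int local_value = values[threadIdx.x];

    // Stanza 1: true if ANY thread in the block satisfies the predicate.
    if (shared_variable < local_value) {
        flag_any = true;
    }
    // Stanza 2: true only if thread 0 satisfies the predicate.
    if (threadIdx.x == 0 && shared_variable < local_value) {
        flag_first = true;
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        *out_any = flag_any ? 1 : 0;
        *out_first = flag_first ? 1 : 0;
    }
}

// Hypothetical host helper: launches one block of n threads (n <= 1024).
// Returns false if a CUDA call fails.
bool run_compare(const int* h_values, int n, int threshold,
                 int* h_any, int* h_first)
{
    assert(n > 0 && n <= 1024);
    int* d_values = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_values, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) {
        cudaFree(d_values);
        return false;
    }
    cudaMemcpy(d_values, h_values, n * sizeof(int), cudaMemcpyHostToDevice);
    compare_stanzas<<<1, n>>>(d_values, threshold, d_out, d_out + 1);
    int h_out[2] = {-1, -1};
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_out);
    *h_any = h_out[0];
    *h_first = h_out[1];
    return err == cudaSuccess;
}
```

With 64 values that are all 0 except `values[5] = 10` and a threshold of 3, the first flag comes back as 1 and the second as 0, because thread 0 itself does not satisfy the predicate.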

But back to your question:


In the aforementioned case, is it safe to allow all threads to access 1 shared memory location or this would cause a bank conflict or serialization of memory access?

After checking the CUDA programming guide, it seems there is a write-collapsing mechanism that prevents write accesses to the same address from being serialized: instead, only one thread writes its value (which thread is undefined).

CC 1.x:


If a non-atomic instruction executed by a warp writes to the same location in shared memory for more than one of the threads of the warp, only one thread per half-warp performs a write and which thread performs the final write is undefined.

CC 2.x and above:


A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, [...] for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).
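The practical consequence of "which thread performs the write is undefined" can be sketched as follows. This is an illustration under the behaviour quoted above, not code from the original answer; `same_address_writes`, `run_same_address`, and `shared_winner` are hypothetical names. When every thread stores the same value, the result does not depend on which write survives; when each thread stores a different value, only the range of possible results is known.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: all threads of one block write to the same shared addresses.
__global__ void same_address_writes(int* out_flag, int* out_winner)
{
    __shared__ bool shared_bool_var;
    __shared__ int  shared_winner;

    if (threadIdx.x == 0) {
        shared_bool_var = false;
        shared_winner = -1;
    }
    __syncthreads();

    // Every thread stores the same value: whichever write lands, the flag is true.
    shared_bool_var = true;
    // Every thread stores a different value: one of them survives,
    // but which one is undefined, so code must not rely on a particular winner.
    shared_winner = static_cast<int>(threadIdx.x);
    __syncthreads();

    if (threadIdx.x == 0) {
        *out_flag = shared_bool_var ? 1 : 0;
        *out_winner = shared_winner;
    }
}

// Hypothetical host helper: launches one block of n threads (n <= 1024).
// Returns false if a CUDA call fails.
bool run_same_address(int n, int* h_flag, int* h_winner)
{
    assert(n > 0 && n <= 1024);
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) return false;
    same_address_writes<<<1, n>>>(d_out, d_out + 1);
    int h_out[2] = {-1, -1};
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    *h_flag = h_out[0];
    *h_winner = h_out[1];
    return err == cudaSuccess;
}
```

For a 64-thread launch the flag is 1, while the winner is some thread index between 0 and 63. This is why the question's first stanza is workable: all writers store `true`.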

Also:


So I would use a threadIdx.x check to only let one thread access that variable but this would cause branch divergence.

This isn't "more divergent" than the first code. The first stanza diverges whenever the threads of a warp don't all evaluate the predicate identically. The second stanza diverges only in the first warp of every block. In both cases, the branch has no impact on performance: there is no else body, and the if body is a single instruction.
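If the goal is the "any thread in the block" semantics of the first stanza, one alternative that the original answer does not mention is the `__syncthreads_or()` intrinsic (available on CC 2.x and above), which combines the barrier with a block-wide OR of a per-thread predicate, so no shared flag or explicit conditional store is needed. A minimal sketch, with `any_greater` and `run_any_greater` as hypothetical names and the threshold passed as a kernel argument instead of living in shared memory:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: one flag per block, set if any thread's value exceeds threshold.
__global__ void any_greater(const int* values, int threshold, int* block_flags)
{
    int local_value = values[blockIdx.x * blockDim.x + threadIdx.x];

    // Barrier plus reduction: returns non-zero in every thread of the block
    // if the predicate is non-zero for at least one thread of the block.
    int any = __syncthreads_or(threshold < local_value);

    if (threadIdx.x == 0) {
        block_flags[blockIdx.x] = (any != 0) ? 1 : 0;
    }
}

// Hypothetical host helper: h_values holds blocks * threads ints,
// h_flags receives one int per block. Returns false if a CUDA call fails.
bool run_any_greater(const int* h_values, int blocks, int threads,
                     int threshold, int* h_flags)
{
    assert(blocks > 0 && threads > 0 && threads <= 1024);
    size_t value_bytes = static_cast<size_t>(blocks) * threads * sizeof(int);
    size_t flag_bytes = static_cast<size_t>(blocks) * sizeof(int);
    int* d_values = nullptr;
    int* d_flags = nullptr;
    if (cudaMalloc(&d_values, value_bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_flags, flag_bytes) != cudaSuccess) {
        cudaFree(d_values);
        return false;
    }
    cudaMemcpy(d_values, h_values, value_bytes, cudaMemcpyHostToDevice);
    any_greater<<<blocks, threads>>>(d_values, threshold, d_flags);
    cudaError_t err = cudaMemcpy(h_flags, d_flags, flag_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_flags);
    return err == cudaSuccess;
}
```

The remaining `threadIdx.x == 0` test only guards the global store of the result; as noted above, a branch of that shape diverges in one warp per block at most.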
