Is branch divergence really so bad?

Question

I've seen many questions scattered across the Internet about branch divergence and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".

It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, we have the following scenario:

if (A):
  foo(A)
else:
  bar(B)

If we have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we might change the block above to read like this:

foo(A)
bar(B)

Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:

In the first case, threads 1 and 2 execute serially. Call this two clock cycles.

In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks to me like two clock cycles. The difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency, which results in latency hiding. If this is the case, the branch-divergent code is faster.

Recommended answer

You're assuming (at least, it's the example you give and the only reference you make) that the only way to avoid branch divergence is to have all threads execute all the code.

In that case I agree there's not much difference.
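To make that comparison concrete, here is a minimal host-side C++ sketch, not a CUDA kernel. It assumes a simplified SIMT cost model in which a warp issues each path that at least one of its lanes takes, once, with the other lanes masked off. The function names and the `fooCost`/`barCost` parameters are hypothetical and only serve the illustration.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // warp width on current NVIDIA GPUs

// Simplified SIMT cost model (an assumption, not a hardware spec):
// for `if (pred) foo(); else bar();` a warp issues each path that at
// least one of its lanes takes, once, with the other lanes masked off.
// fooCost and barCost are hypothetical per-path issue counts.
int branchIssueSlots(const std::array<bool, kWarpSize>& pred,
                     int fooCost, int barCost) {
    assert(fooCost >= 0 && barCost >= 0);
    bool anyFoo = false;
    bool anyBar = false;
    for (bool takesFoo : pred) {
        if (takesFoo) anyFoo = true;
        else          anyBar = true;
    }
    return (anyFoo ? fooCost : 0) + (anyBar ? barCost : 0);
}

// The "fix" from the question: every lane runs both functions.
int unconditionalIssueSlots(int fooCost, int barCost) {
    return fooCost + barCost;
}
```

Under this model a warp whose lanes disagree pays `fooCost + barCost`, which is exactly what the unconditional version pays, so removing the `if` this way gains nothing. A warp whose lanes all agree pays for only one path.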

But avoiding branch divergence probably has more to do with restructuring the algorithm at a higher level than with just adding or removing some if statements and making code "safe" to execute in all threads.

I'll offer one example. Suppose I know that odd threads need to handle the blue component of a pixel and even threads need to handle the green component:

#define N 2 // number of pixel components
#define BLUE 0
#define GREEN 1
// pixel order: px0BL px0GR px1BL px1GR ...


if (threadIdx.x & 1)  foo(pixel(N*threadIdx.x+BLUE));
else                  bar(pixel(N*threadIdx.x+GREEN));

This means that alternate threads take different paths, foo or bar, so every warp contains both. So now my warp takes about twice as long to execute (assuming foo and bar cost about the same), because it has to run both paths one after the other.
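The effect of the odd/even test can be checked with a small host-side C++ sketch. It is a model, not device code, and it assumes a 1-D block whose threads are grouped into warps by consecutive `threadIdx.x`, 32 at a time. The names `countDivergentWarps` and `oddThread` are made up for the illustration.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;

// Counts warps in a 1-D block whose lanes disagree on `pred(tid)`.
// Host-side model only; assumes threads are grouped into warps by
// consecutive thread index, 32 at a time.
template <typename Pred>
int countDivergentWarps(int blockSize, Pred pred) {
    assert(blockSize > 0 && blockSize % kWarpSize == 0);
    int divergent = 0;
    for (int warpStart = 0; warpStart < blockSize; warpStart += kWarpSize) {
        bool anyTrue = false;
        bool anyFalse = false;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if (pred(warpStart + lane)) anyTrue = true;
            else                        anyFalse = true;
        }
        if (anyTrue && anyFalse) ++divergent;
    }
    return divergent;
}

// Predicate from the snippet above: odd threads take one path.
inline bool oddThread(int tid) { return (tid & 1) != 0; }
```

For a block of 128 threads, `countDivergentWarps(128, oddThread)` reports that all four warps diverge, since every warp holds both odd and even lanes.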

However, if I rearrange my pixel data so that the color components are contiguous, perhaps in chunks of 32 pixels (BL0 BL1 BL2 ... GR0 GR1 GR2 ...), I can write similar code:

if (threadIdx.x & 32)  foo(pixel(threadIdx.x));
else                   bar(pixel(threadIdx.x));

It still looks like I have the possibility for divergence. But since the split happens on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.
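One possible way to produce that rearranged layout is sketched below in host-side C++. The function names, the use of `int` for a color component, and the assumption that the input is interleaved as BL0 GR0 BL1 GR1 ... with a length that is a multiple of 64 are all illustrative choices, not something the original code specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kChunk = 32;  // pixels per chunk, matching the warp width

// Converts interleaved data (BL0 GR0 BL1 GR1 ...) into chunks of
// 32 blue values followed by 32 green values. The function name and
// the use of int for a component are illustrative assumptions.
std::vector<int> toChunkedLayout(const std::vector<int>& interleaved) {
    assert(interleaved.size() % (2 * kChunk) == 0);
    std::vector<int> chunked(interleaved.size());
    const std::size_t pixels = interleaved.size() / 2;
    for (std::size_t p = 0; p < pixels; ++p) {
        const std::size_t base = (p / kChunk) * 2 * kChunk;  // start of this 64-value group
        const std::size_t slot = p % kChunk;                 // position inside the chunk
        chunked[base + slot]          = interleaved[2 * p];      // blue
        chunked[base + kChunk + slot] = interleaved[2 * p + 1];  // green
    }
    return chunked;
}

// True when element i of the chunked layout is a green value; this is
// the same bit test as `threadIdx.x & 32` in the snippet above.
inline bool isGreenSlot(std::size_t i) { return (i & kChunk) != 0; }
```

With this layout, elements 0 to 31 are blue, 32 to 63 are green, 64 to 95 are blue again, and so on. If thread `i` works on element `i`, all 32 threads of a warp see the same color and therefore take the same branch.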

This is a trivial example, and probably stupid, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.
