Is branch divergence really so bad?


Question


I've seen many questions scattered across the Internet about branch divergence, and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't seem to see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".


It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, we have the following scenario:

if (A):
  foo(A)
else:
  bar(B)




If we have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we might change the block above to read like this:

foo(A)
bar(B)

假设在线程2上调用 foo(A)是安全的,在线程1上调用 bar(B)人们可能期望性能提高。但是,这是我看到的方式:

Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:
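For concreteness, here is a minimal CUDA sketch of the two variants as the question frames them; foo, bar, and the A/B arrays are hypothetical stand-ins, not from any real API:

__device__ void foo(int a) { /* hypothetical work on the A data */ }
__device__ void bar(int b) { /* hypothetical work on the B data */ }

__global__ void divergent(const int *A, const int *B)
{
    int t = threadIdx.x;
    // Threads of one warp that disagree on the predicate serialize:
    // the warp runs the if-path, then the else-path.
    if (A[t]) foo(A[t]);
    else      bar(B[t]);
}

__global__ void branch_free(const int *A, const int *B)
{
    int t = threadIdx.x;
    // Every thread executes both calls, so the warp never splits,
    // but each thread also performs both pieces of work.
    foo(A[t]);
    bar(B[t]);
}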


In the first case, threads 1 and 2 execute in serial. Call this two clock cycles.


In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks to me like two clock cycles; the difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency, which results in latency hiding. If this is the case, the branch-divergent code is faster.

Answer


You're assuming (at least it's the example you give and the only reference you make) that the only way to avoid branch divergence is to allow all threads to execute all the code.


But avoiding branch divergence probably has more to do with algorithm restructuring at a higher level than just the addition or removal of some if statements and making code "safe" to execute in all threads.


I'll offer up one example. Suppose I know that odd threads will need to handle the blue component of a pixel and even threads will need to handle the green component:

#define N 2 // number of pixel components
#define BLUE 0
#define GREEN 1
// pixel order: px0BL px0GR px1BL px1GR ...

if (threadIdx.x & 1)  foo(pixel(N*threadIdx.x+BLUE));
else                  bar(pixel(N*threadIdx.x+GREEN));


This means that alternate threads take different paths, foo or bar: within each 32-thread warp, half the threads want one path and half want the other, so the warp must execute both. Now my warp takes twice as long to execute.


However, if I rearrange my pixel data so that the color components are contiguous, perhaps in chunks of 32 pixels: BL0 BL1 BL2 ... GR0 GR1 GR2 ...

then I can write code like this:

if (threadIdx.x & 32)  foo(pixel(threadIdx.x));
else                   bar(pixel(threadIdx.x));


It still looks like I have the possibility for divergence. But since the divergence happens on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.

这是一个简单的例子,可能是愚蠢的,但它说明了可能有办法解决绕线发散,不涉及运行所有所有分歧路径的代码。

This is a trivial example, and probably stupid, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.
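To make the layout idea concrete, here is a minimal self-contained sketch, under assumptions not in the original answer: one block of 64 threads, a planar buffer px holding 32 blue values followed by 32 green values, and foo/bar reduced to placeholder arithmetic:

__device__ float foo(float blue)  { return blue  * 0.5f; }  // placeholder work
__device__ float bar(float green) { return green * 2.0f; }  // placeholder work

// px holds BL0..BL31 followed by GR0..GR31; launch with 64 threads per block.
__global__ void process_planar(float *px)
{
    int t = threadIdx.x;
    // Bit 5 of the thread index (value 32) is constant across each
    // 32-thread warp, so every warp takes exactly one side of the
    // branch: threads 0..31 (one warp) handle the blue half, and
    // threads 32..63 (the other warp) handle the green half.
    if (t & 32) px[t] = bar(px[t]);
    else        px[t] = foo(px[t]);
}

The branch is still present in the source, but its predicate is uniform within each warp, so each warp executes straight-line code with no intra-warp divergence.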

