Branch and predicated instructions

Question

Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to better performance than the other.

This comment suggests that branch instructions lead to a greater number of executed instructions, stalling due to "branch address resolution and fetch", and overhead due to "the branch itself" and "book keeping for divergence", while predicated instructions incur only the "instruction execution latency to do the condition test and set the predicate". Why?

Recommended answer

Instruction predication means that an instruction is conditionally executed by a thread depending on a predicate. Threads for which the predicate is true execute the instruction; the rest do nothing.

For example:

var = 0;

// Not taken by all threads
if (condition) {
    var = 1;
} else {
    var = 2;
}

output = var;

This would result in something like the following (not actual compiler output):

       mov.s32 var, 0;       // Executed by all threads.
       setp pred, condition; // Executed by all threads, sets predicate.

@pred  mov.s32 var, 1;       // Executed only by threads where pred is true.
@!pred mov.s32 var, 2;       // Executed only by threads where pred is false.
       mov.s32 output, var;  // Executed by all threads.

All in all, that's 3 instructions for the if (the setp and the two predicated movs), with no branching. Very efficient.
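To make the predicated sequence concrete, here is a minimal sketch in plain C that emulates one warp on the CPU. It is a toy model under stated assumptions, not a description of how the hardware is built: `run_predicated`, `WARP_SIZE`, and the per-lane arrays are names invented for this illustration. Every lane steps through the same instruction stream, and the predicate only decides whether a lane's write takes effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WARP_SIZE 32 /* threads (lanes) per warp on NVIDIA GPUs */

/* Hypothetical helper: emulates the predicated listing for one warp.
   cond[i] is lane i's condition, output[i] receives lane i's result.
   Returns how many instructions were issued for the whole warp. */
int run_predicated(const bool cond[WARP_SIZE], int output[WARP_SIZE])
{
    assert(cond != NULL && output != NULL);

    int var[WARP_SIZE];
    bool pred[WARP_SIZE];
    int issued = 0;

    /* mov.s32 var, 0: all lanes */
    for (int i = 0; i < WARP_SIZE; i++) var[i] = 0;
    issued++;

    /* setp pred, condition: all lanes */
    for (int i = 0; i < WARP_SIZE; i++) pred[i] = cond[i];
    issued++;

    /* @pred mov.s32 var, 1: issued once, only lanes with pred write */
    for (int i = 0; i < WARP_SIZE; i++) if (pred[i]) var[i] = 1;
    issued++;

    /* @!pred mov.s32 var, 2: issued once, only lanes with !pred write */
    for (int i = 0; i < WARP_SIZE; i++) if (!pred[i]) var[i] = 2;
    issued++;

    /* mov.s32 output, var: all lanes */
    for (int i = 0; i < WARP_SIZE; i++) output[i] = var[i];
    issued++;

    return issued;
}
```

In this model the issue count is 5 no matter how the lanes' conditions are distributed, which is the point of predication: there is no control flow to diverge on.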

The equivalent code with branches would look like:

       mov.s32 var, 0;       // Executed by all threads.
       setp pred, condition; // Executed by all threads, sets predicate.

@!pred bra IF_FALSE;         // Conditional branches are predicated instructions.
IF_TRUE:                    // Label for clarity, not actually used.
       mov.s32 var, 1;
       bra IF_END;
IF_FALSE:
       mov.s32 var, 2;
IF_END:
       mov.s32 output, var;

Notice how much longer it is (5 instructions for the if). When the warp diverges, the conditional branch requires disabling part of the warp, executing the first path, then rolling back to the point where the warp diverged and executing the second path until both converge. That takes longer and requires extra bookkeeping and more code loading (particularly when there are many instructions to execute), and hence more memory requests. All of that makes branching slower than simple predication.
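The serialization described above can be sketched with a small CPU-side emulation in plain C. This is a toy model only: `run_branched` and `WARP_SIZE` are hypothetical names, and the count covers issued instructions alone, ignoring fetch stalls, branch address resolution, and the hardware's divergence bookkeeping, so the real cost gap is likely larger than what it shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WARP_SIZE 32 /* threads (lanes) per warp on NVIDIA GPUs */

/* Hypothetical helper: emulates the branched listing for one warp.
   Each path is issued only if at least one lane needs it, with the
   other lanes masked off while it runs.
   Returns how many instructions were issued for the whole warp. */
int run_branched(const bool cond[WARP_SIZE], int output[WARP_SIZE])
{
    assert(cond != NULL && output != NULL);

    int var[WARP_SIZE];
    bool any_true = false, any_false = false;
    int issued = 0;

    /* mov.s32 var, 0 */
    for (int i = 0; i < WARP_SIZE; i++) var[i] = 0;
    issued++;

    /* setp pred, condition */
    for (int i = 0; i < WARP_SIZE; i++) {
        if (cond[i]) any_true = true; else any_false = true;
    }
    issued++;

    /* @!pred bra IF_FALSE */
    issued++;

    if (any_true) {
        /* IF_TRUE path: lanes with a false condition are disabled */
        for (int i = 0; i < WARP_SIZE; i++) if (cond[i]) var[i] = 1;
        issued++; /* mov.s32 var, 1 */
        issued++; /* bra IF_END */
    }
    if (any_false) {
        /* IF_FALSE path: lanes with a true condition are disabled */
        for (int i = 0; i < WARP_SIZE; i++) if (!cond[i]) var[i] = 2;
        issued++; /* mov.s32 var, 2 */
    }

    /* IF_END: the warp has reconverged; mov.s32 output, var */
    for (int i = 0; i < WARP_SIZE; i++) output[i] = var[i];
    issued++;

    return issued;
}
```

Under these assumptions a diverged warp issues 7 instructions, a warp where every lane takes the true path issues 6, and a warp where every lane takes the false path issues 5. The five-line predicated listing always issues 5, so in this model branching is never cheaper for such a short body.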

And actually, in the case of this very simple conditional assignment, the compiler can do even better, with only 2 instructions for the if:

mov.s32 var, 0;       // Executed by all threads.
setp pred, condition; // Executed by all threads, sets predicate.
selp var, 1, 2, pred; // Sets var depending on predicate (true: 1, false: 2).
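For readers less familiar with PTX, the per-thread meaning of `selp` can be written as a one-line C function. This is only a sketch of the semantics: `selp_s32` is a name made up here, and whether a compiler actually emits `selp` for a given source-level conditional depends on the compiler and the surrounding code.

```c
#include <stdbool.h>

/* Hypothetical model of `selp.s32 d, a, b, pred` for a single thread:
   the result is a when the predicate is true, otherwise b. */
int selp_s32(int a, int b, bool pred)
{
    return pred ? a : b;
}
```

With this helper, the conditional assignment from the example reduces to `var = selp_s32(1, 2, condition);`, which has no control flow at all.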
