Avoid stalling pipeline by calculating conditional early


Question

When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:

  1. Trust the branch predictor if the branch will usually go one way; or
  2. avoid branching with a little bit of bit-magic where reasonably possible (see the sketch just after this list); or
  3. conditional moves where possible.
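
For context, options 2 and 3 might look something like the sketch below (the function and values here are made up purely for illustration; whether the ternary actually becomes a cmov is up to the compiler):

#include <cstdint>

// Option 2 ("bit-magic"): derive an all-ones/all-zeros mask from the
// comparison and blend the two results, so no branch instruction is needed.
int32_t select_bitmagic(int32_t a, int32_t b, int32_t x, int32_t y) {
    int32_t mask = -static_cast<int32_t>(a > b);  // -1 if a > b, else 0
    return (x & mask) | (y & ~mask);              // x if a > b, else y
}

// Option 3: write the selection so the compiler can use a conditional move
// (e.g. cmov on x86) instead of a branch; whether it does is its choice.
int32_t select_cmov(int32_t a, int32_t b, int32_t x, int32_t y) {
    return (a > b) ? x : y;
}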

What I couldn't find was whether or not we can calculate the condition early to help where possible. So, instead of:

... work
if (a > b) {
    ... more work
}

Do the following:

bool aGreaterThanB = a > b;
... work
if (aGreaterThanB) {
    ... more work
}

Could something like this potentially avoid stalls on this conditional altogether (depending on the length of the pipeline and the amount of work we can put between the bool and the if)? It doesn't have to be as I've written it, but is there a way to evaluate conditionals early so the CPU doesn't have to try and predict branches?

Also, if that helps, is it something a compiler is likely to do anyway?

Answer

Yes, it can be beneficial to allow the branch condition to be calculated as early as possible, so that any misprediction can be resolved early and the front-end part of the pipeline can start re-filling early. In the best case, the misprediction can be free if there is enough work already in flight to totally hide the front-end bubble.

Unfortunately, on out-of-order CPUs, early has a somewhat subtle definition, so getting the branch to resolve early isn't as simple as just moving lines around in the source - you'll probably have to make a change to the way the condition is calculated.

Unfortunately, earlier doesn't refer to the position of the condition/branch in the source file, nor does it refer to the positions of the assembly instructions corresponding to the comparison or branch. So at a fundamental level, it mostly7 doesn't work as in your example.

Even if source level positioning mattered, it probably wouldn't work in your example because:

You've moved the evaluation of the condition up and assigned it to a bool, but it's not the test (the > operator) that can mispredict, it's the subsequent conditional branch: after all, it's a branch misprediction. In your example, the branch is in the same place in both versions: its form has simply changed from if (a > b) to if (aGreaterThanB).

Beyond that, the way you've transformed the code isn't likely to fool most compilers. Optimizing compilers don't emit code line-by-line in the order you've written it, but rather schedule things as they see fit based on the source-level dependencies. Pulling the condition up earlier will likely just be ignored, since compilers will want to put the check where it would naturally go: approximately right before the branch, on architectures with a flag register1.

For example, consider the following two implementations of a simple function, which follow the pattern you suggested. The second function moves the condition up to the top of the function.

int test1(int a, int b) {
    int result = a * b;
    result *= result;
    if (a > b) {
        return result + a;
    }
    return result + b * 3;
}

int test2(int a, int b) {
    bool aGreaterThanB = a > b;
    int result = a * b;
    result *= result;
    if (aGreaterThanB) {
        return result + a;
    }
    return result + b * 3;
}

I checked gcc, clang2 and MSVC, and all compiled both functions identically (the output differed between compilers, but for each compiler the output was the same for the two functions). For example, compiling test2 with gcc resulted in:

test2(int, int):
  mov eax, edi
  imul eax, esi
  imul eax, eax
  cmp edi, esi
  jg .L4
  lea edi, [rsi+rsi*2]
.L4:
  add eax, edi
  ret

The cmp instruction corresponds to the a > b condition, and gcc has moved it back down past all the "work" and put it right next to the jg which is the conditional branch.

So if we know that simple manipulation of the order of operations in the source doesn't work, what does work? As it turns out, anything you can do to move the branch condition "up" in the data flow graph might improve performance by allowing the misprediction to be resolved earlier. I'm not going to get deep into how modern CPUs depend on dataflow, but you can find a brief overview here with pointers to further reading at the end.

Here's a real-world example involving linked-list traversal.

Consider the task of summing all the values of a null-terminated linked list which also stores its length3 as a member of the list head structure. The linked list is implemented as one list_head object and zero or more list nodes (with a single int value payload), defined like so:

struct list_node {
    int value;
    list_node* next;
};

struct list_head {
    int size;
    list_node *first;
};

The canonical search loop would use the node->next == nullptr sentinel in the last node to determine that it has reached the end of the list, like this:

long sum_sentinel(list_head list) {
    int sum = 0;
    for (list_node* cur = list.first; cur; cur = cur->next) {
        sum += cur->value;
    }
    return sum;
}

That's about as simple as you get.

However, this puts the branch that ends the summation (the one that checks cur == null) at the end of the node-to-node pointer chasing, which is the longest dependency chain in the data flow graph. If this branch mispredicts, the resolution of the mispredict will occur "late" and the front-end bubble will add directly to the runtime.

On the other hand, you could do the summation by explicitly counting nodes, like so:

long sum_counter(list_head list) {
    int sum = 0;
    list_node* cur = list.first;
    for (int i = 0; i < list.size; cur = cur->next, i++) {
        sum += cur->value;
    }
    return sum;
}

Comparing this to the sentinel solution, it seems like we have added extra work: we now need to initialize, track and decrement the count4. The key, however, is that this decrement dependency chain is very short, so it will "run ahead" of the pointer-chasing work and the misprediction will occur early while there is still remaining pointer-chasing work to do, possibly with a large improvement in runtime.

Let's actually try this. First we examine the assembly for the two solutions, so we can verify there isn't anything unexpected going on:

<sum_sentinel(list_head)>:
test   rsi,rsi
je     1fe <sum_sentinel(list_head)+0x1e>
xor    eax,eax
loop:
add    eax,DWORD PTR [rsi]
mov    rsi,QWORD PTR [rsi+0x8]
test   rsi,rsi
jne    loop
cdqe   
ret    


<sum_counter(list_head)>:
test   edi,edi
jle    1d0 <sum_counter(list_head)+0x20>
xor    edx,edx
xor    eax,eax
loop:
add    edx,0x1
add    eax,DWORD PTR [rsi]
mov    rsi,QWORD PTR [rsi+0x8]
cmp    edi,edx
jne    loop
cdqe   
ret    

As expected, the sentinel approach is slightly simpler: one less instruction during setup, and one less instruction in the loop5, but overall the key pointer chasing and addition steps are identical and we expect this loop to be dominated by the latency of successive node pointers.

Indeed, the loops perform virtually identically when summing short or long lists when the prediction impact is negligible. For long lists the branch prediction impact is automatically small since the single mis-prediction when the end of the list is reached is amortized across many nodes, and the runtime asymptotically reaches almost exactly 4 cycles per node for lists contained in L1, which is what we expect with Intel's best-case 4 cycle load-to-use latency.

For short lists, branch misprediction is negligible if the pattern of lists is predictable: either always the same or cycling with some moderate period (which can be 1000 or more with good prediction!). In this case the time per node can be less than 4 cycles when summing many short lists since multiple lists can be in flight at once (e.g., if summing an array of lists). In any case, both implementations perform almost identically. For example, when lists always have 5 nodes, the time to sum one list is about 12 cycles with either implementation:

** Running benchmark group Tests written in C++ **
                     Benchmark   Cycles   BR_MIS
       Linked-list w/ Sentinel    12.19     0.00
          Linked-list w/ count    12.40     0.00

Let's add branch misprediction to the mix, by changing the list generation code to create lists with an average length of 5, but with actual length uniformly distributed in [0, 10]. The summation code is unchanged: only the input differs (a sketch of what such a generator might look like follows the results below). The results with random list lengths:

** Running benchmark group Tests written in C++ **
                     Benchmark   Cycles   BR_MIS
       Linked-list w/ Sentinel    43.87     0.88
          Linked-list w/ count    27.48     0.89
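
The original benchmark's list-generation code isn't shown here, but the input change described above amounts to something like this sketch (the function name, RNG choice, and node ownership via a deque are my own assumptions, not the answer's code):

#include <deque>
#include <random>

// Build a list whose node count is drawn uniformly from [0, 10], so the
// loop-exit branch becomes hard to predict (average length ~5). Uses the
// list_node / list_head definitions from above; `storage` is a deque so
// that pointers to existing nodes stay valid as more nodes are appended.
list_head make_random_list(std::mt19937& rng, std::deque<list_node>& storage) {
    std::uniform_int_distribution<int> len_dist(0, 10);
    int len = len_dist(rng);

    list_head head{len, nullptr};
    list_node* prev = nullptr;
    for (int i = 0; i < len; i++) {
        storage.push_back({i, nullptr});   // the payload value is arbitrary
        list_node* node = &storage.back();
        if (prev) prev->next = node; else head.first = node;
        prev = node;
    }
    return head;
}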

The BR_MIS column shows that we get nearly one branch misprediction per list6, as expected, since the loop exit is unpredictable.

However, the sentinel algorithm now takes ~44 cycles versus the ~27.5 cycles of the count algorithm. The count algorithm is about 16.5 cycles faster. You can play with the list lengths and other factors, and change the absolute timings, but the delta is almost always around 16-17 cycles, which not coincidentally is about the same as the branch misprediction penalty on recent Intel! By resolving the branch condition early, we are avoiding the front end bubble, where nothing would be happening at all.

Another example would be something like a loop which calculates a floating point value, say a Taylor series approximation, where the termination condition depends on some function of the calculated value. This has the same effect as above: the termination condition depends on the slow loop-carried dependency, so it is just as slow to resolve as the calculation of the value itself. If the exit is unpredictable, you'll suffer a stall on exit.
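
As a made-up illustration (this specific exp(x) loop is mine, not from the answer), the value-dependent exit might look like this; the exit test can only resolve after the floating-point dependency chain has produced each new term:

#include <cmath>

// exp(x) via its Taylor series: the loop keeps going until the current term
// is negligible, so the exit condition sits at the end of the slow
// floating-point dependency chain and resolves late.
double exp_taylor_value_exit(double x) {
    double term = 1.0;   // x^n / n!
    double sum  = 0.0;
    for (int n = 1; std::fabs(term) > 1e-12; n++) {
        sum += term;
        term *= x / n;   // each term depends on the previous one
    }
    return sum;
}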

If you could change that to calculate the iteration count up front, you could use a decoupled integer counter as the termination condition, avoiding the bubble. Even if the up-front calculation adds some time, it could still provide an overall speedup (and the calculation can run in parallel with the first iterations of the loop anyway, so it may be much less costly than you'd expect by looking at its latency).
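
A minimal sketch of that decoupling, again hypothetical (the trip-count formula below is just a crude placeholder; a real bound would come from the series' error analysis):

#include <cmath>

// Same series, but the trip count is chosen up front, so the exit branch
// depends only on a trivial integer counter that runs far ahead of the
// floating-point work.
double exp_taylor_counted(double x) {
    // Placeholder bound: enough terms that x^n/n! is tiny for moderate |x|.
    int iters = 20 + static_cast<int>(4.0 * std::fabs(x));

    double term = 1.0, sum = 0.0;
    for (int n = 1; n <= iters; n++) {   // exit depends only on n
        sum += term;
        term *= x / n;
    }
    return sum;
}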

1 MIPS is an interesting exception here having no flag registers - test results are stored directly into general purpose registers.

2 Clang compiled this and many other variants in a branch-free manner, but it is still interesting because you still have the same structure of a test instruction and a conditional move (taking the place of the branch).

3 Like the C++11 std::list.

4 As it turns out, on x86, the per-node work is actually very similar between the two approaches because of the way that dec implicitly sets the zero flag, so we don't need an extra test instruction, while the mov used in pointer chasing doesn't set flags; so the counter approach has an extra dec while the sentinel approach has an extra test, making it about a wash.

5 Although this part is just because gcc didn't manage to transform the incrementing for-loop to a decrementing one to take advantage of dec setting the zero flag, avoiding the cmp. Maybe newer gcc versions do better. See also footnote 4.

6 I guess this is closer to 0.9 than to 1.0 since perhaps the branch predictors still get the length = 10 case correct, since once you've looped 9 times the next iteration will always exit. A less synthetic/exact distribution wouldn't exhibit that.

7 I say mostly because in some cases you might save a cycle or two via such source- or assembly-level re-orderings, since such things can have a minor effect on the execution order in out-of-order processors: execution order is also affected by assembly order, but only within the constraints of the data-flow graph. See also this comment.
