Why does tree vectorization make this sorting algorithm 2x slower?

Question

The sorting algorithm of this question becomes twice as fast (!) if -fprofile-arcs is enabled in gcc (4.7.2). The heavily simplified C code of that question (it turned out that I can initialize the array with all zeros; the weird performance behavior remains, but it makes the reasoning much simpler):

#include <time.h>
#include <stdio.h>

#define ELEMENTS 100000

int main() {
  int a[ELEMENTS] = { 0 };
  clock_t start = clock();
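  /* Selection sort: the inner-loop if below is the branch in question. */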
  for (int i = 0; i < ELEMENTS; ++i) {
    int lowerElementIndex = i;
    for (int j = i+1; j < ELEMENTS; ++j) {
      if (a[j] < a[lowerElementIndex]) {
        lowerElementIndex = j;
      }
    }
    int tmp = a[i];
    a[i] = a[lowerElementIndex];
    a[lowerElementIndex] = tmp;
  } 
  clock_t end = clock();
  float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
  printf("Time: %2.3f\n", timeExec);
  printf("ignore this line %d\n", a[ELEMENTS-1]);
}

After playing with the optimization flags for a long while, it turned out that -ftree-vectorize also yields this weird behavior, so we can take -fprofile-arcs out of the question. After profiling with perf I found that the only relevant difference is:

Fast case: gcc -std=c99 -O2 simp.c (runs in 3.1s)

    cmpl    %esi, %ecx
    jge .L3
    movl    %ecx, %esi
    movslq  %edx, %rdi
.L3:

Slow case: gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1s)

    cmpl    %ecx, %esi
    cmovl   %edx, %edi
    cmovl   %esi, %ecx

As for the first snippet: Given that the array only contains zeros, we always jump to .L3. It can greatly benefit from branch prediction.

I guess the cmovl instructions cannot benefit from branch prediction.

Questions:


  1. Are all my above guesses correct? Does this make the algorithm slow?

  2. If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorize workaround, of course) while still keeping as many optimizations as possible?

  3. What does this -ftree-vectorize actually do? The documentation is quite vague; I'd need a little more explanation to understand what's happening.

Update: Since it came up in the comments: The weird performance behavior w.r.t. the -ftree-vectorize flag remains with random data. As Yakk points out (http://stackoverflow.com/questions/21055946/why-does-tree-vectorization-make-this-sorting-algorithm-2x-slower#comment31663424_21055946), for selection sort it is actually hard to create a dataset that would result in a lot of branch mispredictions.
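
To make Yakk's point concrete, here is a minimal counting sketch (assuming uniformly random input; the size N and the structure are just for illustration): on a random array the k-th comparison of a pass finds a new minimum only with probability about 1/k, so only a tiny fraction of all comparisons take the branch.

#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Counting sketch: how often is the selection-sort inner-loop branch
// (a[j] < a[lowerElementIndex]) actually taken on random data?
int main() {
  constexpr int N = 10000;
  std::mt19937 mt(1759);
  std::uniform_int_distribution<int> dist(0, 1 << 30);

  std::vector<int> a(N);
  for (int& x : a) x = dist(mt);

  long long comparisons = 0, taken = 0;
  for (int i = 0; i < N; ++i) {
    int lowerElementIndex = i;
    for (int j = i + 1; j < N; ++j) {
      ++comparisons;
      if (a[j] < a[lowerElementIndex]) {  // the branch in question
        lowerElementIndex = j;
        ++taken;
      }
    }
    std::swap(a[i], a[lowerElementIndex]);
  }
  printf("branch taken %lld times out of %lld comparisons (%.3f%%)\n",
         taken, comparisons, 100.0 * taken / comparisons);
}

On random input this should report that well under 1% of the comparisons take the branch, which is why it is so hard to construct data that makes the predictor fail for this sort.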

Since it also came up: I have a Core i5 CPU.

Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I only took out the inner loop. Its only goal is to examine the effect of branch prediction: We skip the if branch in the for loop with probability p.

#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8; 
constexpr double p = 0.50;

int main() {
  printf("p = %.2f\n", p);
  int* a = new int[ELEMENTS];
  mt19937 mt(1759);
  bernoulli_distribution rnd(p);
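  // a[i] is non-negative with probability p; such elements never become a new
  // minimum, so the if branch in the timed loop below is skipped with probability p.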
  for (int i = 0 ; i < ELEMENTS; ++i){
    a[i] = rnd(mt)? i : -i;
  }
  auto start = high_resolution_clock::now();
  int lowerElementIndex = 0;
  for (int i=0; i<ELEMENTS; ++i) {
    if (a[i] < a[lowerElementIndex]) {
      lowerElementIndex = i;
    }
  } 
  auto finish = high_resolution_clock::now();
  printf("%ld  ms\n", duration_cast<milliseconds>(finish-start).count());
  printf("Ignore this line   %d\n", a[lowerElementIndex]);
  delete[] a;
}

The loop of interest:

This will be referred to as cmov:

g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L30:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    movslq  %eax, %rdx
    cmovl   %rdx, %rbp
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L30


This will be referred to as no cmov; the -fno-if-conversion flag was pointed out by Turix in his answer.

g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L29:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    jge .L28
    movslq  %eax, %rbp
.L28:
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L29


The difference side by side:

cmpl    %edx, (%rbx,%rax,4) |     cmpl  %edx, (%rbx,%rax,4)
movslq  %eax, %rdx          |     jge   .L28
cmovl   %rdx, %rbp          |     movslq    %eax, %rbp
                            | .L28:


Execution time as a function of the Bernoulli parameter p:

The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p < 0.26 or 0.81 < p, and is at most 4.38x faster (at p = 1). Of course, the worst situation for the branch predictor is around p = 0.5, where the code is 1.58x slower than the code with the cmov instruction.
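
A crude cost model (a back-of-the-envelope sketch with made-up constants, not something measured) that is consistent with this picture: the cmov loop pays a fixed cost per element, while the branchy loop pays a misprediction penalty whose rate depends on p:

    T_{\text{cmov}}(p)   \approx N\,c_{\text{cmov}}
    T_{\text{branch}}(p) \approx N\,\bigl(c_{\text{base}} + m(p)\,c_{\text{miss}}\bigr), \qquad m(p) \approx \min(p,\,1-p)

This reproduces the flat cmov curve and a branchy curve that peaks around p = 0.5; the asymmetric crossover points (0.26 vs. 0.81) suggest that correctly predicted taken and not-taken branches are not equally cheap, which this simple m(p) term ignores.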

Answer

Note: Answered before graph update was added to the question; some assembly code references here may be obsolete.

(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)

First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)

Re your 3rd question: Even though in your case, there is actually no vectorization being done, from step 11 ("Conditional Execution") here it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:

if (a[j] < a[lowerElementIndex])
    lowerElementIndex = j;

Apparently, this happens even if there is no vectorization.

This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).
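
As a source-level picture of that flattening (a sketch of the transformation only, with made-up helper names, not actual compiler output), the two forms of the inner-loop body look roughly like this:

// Branchy form: the update only executes when the (rarely true) condition
// holds, so a correctly predicted branch lets the loop run ahead of the load.
int stepBranchy(const int* a, int lowerElementIndex, int j) {
  if (a[j] < a[lowerElementIndex])
    lowerElementIndex = j;
  return lowerElementIndex;
}

// If-converted form: both alternatives are evaluated and one is selected,
// which gcc typically lowers to cmpl + cmovl; the selected value now always
// depends on the load of a[lowerElementIndex].
int stepBranchless(const int* a, int lowerElementIndex, int j) {
  return (a[j] < a[lowerElementIndex]) ? j : lowerElementIndex;
}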

Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations through the loop could effectively be accomplished in parallel.

However, in your case, the attempt at optimization actually backfires because in the flattened loop, the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW data hazard that causes the second move (cmovl %esi, %ecx) to have to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)

On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, but [even] in a random array should [still] be more than half the time [edited per @Yakk's comment]), it does not have to wait for the memory access to finish to go ahead and queue up the next few instructions in the loop. So in correct predictions, you get a boost, whereas in incorrect predictions, the result is no worse than in the "optimized" case and, furthermore, better because of the ability to sometimes avoid having the 2 "wasted" cmovl instructions in the pipeline.

[The following was removed because of my erroneous assumption about your processor, per your comment.]

Coming back to your questions, I'd suggest looking at that link above to get more about the relevant vectorization flags, but in the end I'm pretty sure it's fine to just ignore that optimization, given that your Celeron isn't capable of using it (in that regard) anyway.

[Added after above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."), you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my Mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful given @Yakk's point about your sort algorithm).
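
Another source-level option worth trying (a sketch only; the helper name below is just for illustration, and whether it actually suppresses the if-conversion depends on the gcc version) is to mark the condition as unlikely with __builtin_expect, so gcc knows the branch is heavily biased and may keep it as a real branch:

#include <cstddef>

// Sketch: the inner loop of the selection sort with a branch hint.
// __builtin_expect(cond, 0) tells gcc the condition is almost always false;
// with some gcc versions this keeps the compare as a predicted branch
// instead of if-converting it into cmovl.
std::size_t minIndexFrom(const int* a, std::size_t i, std::size_t n) {
  std::size_t lowerElementIndex = i;
  for (std::size_t j = i + 1; j < n; ++j) {
    if (__builtin_expect(a[j] < a[lowerElementIndex], 0)) {
      lowerElementIndex = j;
    }
  }
  return lowerElementIndex;
}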
