An expensive jump with GCC 5.4.0


Problem Description


I had a function which looked like this (showing only the important part):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) && (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

Written like this, the function took ~34ms on my machine. After changing the condition to bool multiplication (making the code look like this):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) * (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

the execution time decreased to ~19ms.

The compiler used was GCC 5.4.0 with -O3 and after checking the generated asm code using godbolt.org I found out that the first example generates a jump, while the second one does not. I decided to try GCC 6.2.0 which also generates a jump instruction when using the first example, but GCC 7 seems to not generate one anymore.

Finding out this way to speed up the code was rather gruesome and took quite some time. Why does the compiler behave this way? Is it intended and is it something the programmers should look out for? Are there any more things similar to this?

EDIT: link to godbolt https://godbolt.org/g/5lKPF3

Solution

The logical AND operator (&&) uses short-circuit evaluation, which means that the second test is only done if the first comparison evaluates to true. This is often exactly the semantics that you require. For example, consider the following code:

if ((p != nullptr) && (p->first > 0))

You must ensure that the pointer is non-null before you dereference it. If this wasn't a short-circuit evaluation, you'd have undefined behavior because you'd be dereferencing a null pointer.
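The null-pointer guard above can be shown working in a small, self-contained sketch. (The `std::pair` and the values used here are illustrative stand-ins, not from the original code.)

```cpp
#include <cassert>
#include <utility>

// Returns true only if p is non-null AND its first member is positive.
// Thanks to short-circuit evaluation, p->first is never evaluated when
// p is null, so there is no undefined behavior.
bool first_is_positive(const std::pair<int, int>* p) {
    return (p != nullptr) && (p->first > 0);
}
```

Passing `nullptr` is safe precisely because the second operand is skipped.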

It is also possible that short circuit evaluation yields a performance gain in cases where the evaluation of the conditions is an expensive process. For example:

if (DoLengthyCheck1(p) && DoLengthyCheck2(p))

If DoLengthyCheck1 fails, there is no point in calling DoLengthyCheck2.
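You can observe the skipped call directly by counting invocations. (The checks below are trivial stand-ins for the hypothetical `DoLengthyCheck` functions.)

```cpp
#include <cassert>

int calls1 = 0;
int calls2 = 0;

bool Check1(bool result) { ++calls1; return result; }
bool Check2(bool result) { ++calls2; return result; }

// With &&, Check2 runs only when Check1 returned true.
void run_short_circuit() {
    calls1 = calls2 = 0;
    if (Check1(false) && Check2(true)) {
        // not reached: Check1 returned false
    }
}
```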

However, in the resulting binary, a short-circuit operation often results in two branches, since this is the easiest way for the compiler to preserve these semantics. (Which is why, on the other side of the coin, short-circuit evaluation can sometimes inhibit optimization potential.) You can see this by looking at the relevant portion of object code generated for your if statement by GCC 5.4:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L5

    cmp     ax, 478           ; (l[i + shift] < 479)
    ja      .L5

    add     r8d, 1            ; nontopOverlap++

You can see the two comparisons here (cmp instructions), each followed by a separate conditional jump/branch (ja, or jump if above).

It is a general rule of thumb that branches are slow and are therefore to be avoided in tight loops. This has been true on virtually all x86 processors, from the humble 8088 (whose slow fetch times and extremely small prefetch queue [comparable to an instruction cache], combined with utter lack of branch prediction, meant that taken branches required the cache to be dumped) to modern implementations (whose long pipelines make mispredicted branches similarly expensive). Note the little caveat that I slipped in there. Modern processors since the Pentium Pro have advanced branch prediction engines that are designed to minimize the cost of branches. If the direction of the branch can be properly predicted, the cost is minimal. Most of the time, this works well, but if you get into pathological cases where the branch predictor is not on your side, your code can get extremely slow. This is presumably where you are here, since you say that your array is unsorted.

You say that benchmarks confirmed that replacing the && with a * makes the code noticeably faster. The reason for this is evident when we compare the relevant portion of the object code:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    xor     r15d, r15d        ; (curr[i] < 479)
    cmp     r13w, 478
    setbe   r15b

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     ax, 478
    setbe   r14b

    imul    r14d, r15d        ; meld results of the two comparisons

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

It is a bit counter-intuitive that this could be faster, since there are more instructions here, but that is how optimization works sometimes. You see the same comparisons (cmp) being done here, but now, each is preceded by an xor and followed by a setbe. The XOR is just a standard trick for clearing a register. The setbe is an x86 instruction that sets a bit based on the value of a flag, and is often used to implement branchless code. Here, setbe is the inverse of ja. It sets its destination register to 1 if the comparison was below-or-equal (since the register was pre-zeroed, it will be 0 otherwise), whereas ja branched if the comparison was above. Once these two values have been obtained in the r15b and r14b registers, they are multiplied together using imul. Multiplication was traditionally a relatively slow operation, but it is darn fast on modern processors, and this will be especially fast, because it's only multiplying two byte-sized values.
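At the source level, the transformation relies on the fact that a `bool` converts to exactly 0 or 1, so the product of the two comparisons equals their logical AND. A minimal sketch (the threshold 479 is taken from the question; the function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Both forms yield 1 when both values are below 479, and 0 otherwise.
int count_via_and(std::uint16_t a, std::uint16_t b) {
    return (a < 479) && (b < 479);
}

int count_via_mul(std::uint16_t a, std::uint16_t b) {
    return (a < 479) * (b < 479);  // bool converts to 0 or 1, so * == logical AND
}
```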

You could just as easily have replaced the multiplication with the bitwise AND operator (&), which does not do short-circuit evaluation. This makes the code much clearer, and is a pattern that compilers generally recognize. But when you do this with your code and compile it with GCC 5.4, it continues to emit the first branch:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L4

    cmp     ax, 478           ; (l[i + shift] < 479)
    setbe   r14b

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

There is no technical reason it had to emit the code this way, but for some reason, its internal heuristics are telling it that this is faster. It would probably be faster if the branch predictor was on your side, but it will likely be slower if branch prediction fails more often than it succeeds.

Newer generations of the compiler (and other compilers, like Clang) know this rule, and will sometimes use it to generate the same code that you would have sought by hand-optimizing. I regularly see Clang translate && expressions to the same code that would have been emitted if I had used &. The following is the relevant output from GCC 6.2 with your code using the normal && operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L7

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r14b

    add     esi, r14d         ; nontopOverlap++

Note how clever this is! It is using signed conditions (jg and setle) as opposed to unsigned conditions (ja and setbe), but this isn't important. You can see that it still does the compare-and-branch for the first condition like the older version, and uses the same setCC instruction to generate branchless code for the second condition, but it has gotten a lot more efficient in how it does the increment. Instead of doing a second, redundant comparison to set the flags for a sbb operation, it uses the knowledge that r14d will be either 1 or 0 to simply unconditionally add this value to nontopOverlap. If r14d is 0, then the addition is a no-op; otherwise, it adds 1, exactly like it is supposed to do.
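The increment trick GCC 6.2 uses maps back to a simple source-level identity: unconditionally adding a comparison result (which is 0 or 1) is equivalent to a conditional increment. A sketch, with illustrative names:

```cpp
#include <cassert>
#include <cstdint>

// Branchy version: increment only when the condition holds.
int count_branchy(std::uint16_t v, int count) {
    if (v < 479) {
        count++;
    }
    return count;
}

// Branchless version: unconditionally add the 0-or-1 comparison result,
// which is what the unconditional `add esi, r14d` does in the generated code.
int count_branchless(std::uint16_t v, int count) {
    return count + (v < 479);
}
```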

GCC 6.2 actually produces more efficient code when you use the short-circuiting && operator than the bitwise & operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L6

    cmp     eax, 478          ; (l[i + shift] < 479)
    setle   r14b

    cmp     r14b, 1           ; nontopOverlap++
    sbb     esi, -1

The branch and the conditional set are still there, but now it reverts back to the less-clever way of incrementing nontopOverlap. This is an important lesson in why you should be careful when trying to out-clever your compiler!

But if you can prove with benchmarks that the branching code is actually slower, then it may pay to try and out-clever your compiler. You just have to do so with careful inspection of the disassembly—and be prepared to re-evaluate your decisions when you upgrade to a later version of the compiler. For example, the code you have could be rewritten as:

nontopOverlap += ((curr[i] < 479) & (l[i + shift] < 479));

There is no if statement here at all, and the vast majority of compilers will never think about emitting branching code for this. GCC is no exception; all versions generate something akin to the following:

    movzx   r14d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r14d, 478         ; (curr[i] < 479)
    setle   r15b

    xor     r13d, r13d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r13b

    and     r13d, r15d        ; meld results of the two comparisons
    add     esi, r13d         ; nontopOverlap++

If you've been following along with the previous examples, this should look very familiar to you. Both comparisons are done in a branchless way, the intermediate results are anded together, and then this result (which will be either 0 or 1) is added to nontopOverlap. If you want branchless code, this will virtually ensure that you get it.
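Putting it together, a small harness confirms that the `&`-accumulate rewrite counts exactly the same pairs as the original branchy `if`. (This sketch drops the `shift` offset for simplicity, and the sample data is made up for illustration.)

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Original branchy formulation from the question.
int count_with_if(const std::vector<std::uint16_t>& l,
                  const std::vector<std::uint16_t>& curr) {
    int n = 0;
    for (std::size_t i = 0; i < curr.size(); i++) {
        if ((curr[i] < 479) && (l[i] < 479)) {
            n++;
        }
    }
    return n;
}

// Branchless rewrite: accumulate the 0-or-1 result of the bitwise AND.
int count_with_and(const std::vector<std::uint16_t>& l,
                   const std::vector<std::uint16_t>& curr) {
    int n = 0;
    for (std::size_t i = 0; i < curr.size(); i++) {
        n += (curr[i] < 479) & (l[i] < 479);
    }
    return n;
}
```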

GCC 7 has gotten even smarter. It now generates virtually identical code (excepting some slight rearrangement of instructions) for the original code as for the above trick. So, the answer to your question, "Why does the compiler behave this way?", is probably that compilers aren't perfect! They try to use heuristics to generate the most optimal code possible, but they don't always make the best decisions. But at least they can get smarter over time!

One way of looking at this situation is that the branching code has the better best-case performance. If branch prediction is successful, skipping unnecessary operations will result in a slightly faster running time. However, branchless code has the better worst-case performance. If branch prediction fails, executing a few additional instructions as necessary to avoid a branch will definitely be faster than a mispredicted branch. Even the smartest and most clever of compilers will have a hard time making this choice.
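A rough way to observe this trade-off yourself is to time the branchy loop against predictable (sorted) and unpredictable (random) data. The exact numbers depend entirely on your CPU, compiler, and flags, so treat this purely as a measurement sketch, not a definitive benchmark:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Generate n pseudo-random values in [0, 1000]; sorting them makes the
// `< 479` branch highly predictable (all-taken, then all-not-taken).
std::vector<std::uint16_t> make_data(std::size_t n, bool sorted) {
    std::mt19937 gen(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<int> dist(0, 1000);
    std::vector<std::uint16_t> v(n);
    for (auto& x : v) x = static_cast<std::uint16_t>(dist(gen));
    if (sorted) std::sort(v.begin(), v.end());
    return v;
}

// Time one pass of the branchy adjacent-pair count; returns {count, nanoseconds}.
std::pair<int, long long> timed_count(const std::vector<std::uint16_t>& v) {
    auto t0 = std::chrono::steady_clock::now();
    int n = 0;
    for (std::size_t i = 0; i + 1 < v.size(); i++) {
        if ((v[i] < 479) && (v[i + 1] < 479)) n++;
    }
    auto t1 = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    return {n, ns};
}
```

On typical hardware the sorted input tends to run faster per element, but you should only trust numbers you measure on your own machine, with your own optimization flags.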

And for your question of whether this is something programmers need to watch out for, the answer is almost certainly no, except in certain hot loops that you are trying to speed up via micro-optimizations. Then, you sit down with the disassembly and find ways to tweak it. And, as I said before, be prepared to revisit those decisions when you update to a newer version of the compiler, because it may either do something stupid with your tricky code, or it may have changed its optimization heuristics enough that you can go back to using your original code. Comment thoroughly!
