Is an extra move somehow faster when doing division-by-multiplication?


Question

Consider this function:

unsigned long f(unsigned long x) {
    return x / 7;
}

With -O3, Clang turns the division into a multiplication, as expected:

f:                                      # @f
        movabs  rcx, 2635249153387078803
        mov     rax, rdi
        mul     rcx
        sub     rdi, rdx
        shr     rdi
        lea     rax, [rdi + rdx]
        shr     rax, 2
        ret
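
For readers who want to check what this sequence computes, here is a C sketch of the same arithmetic. It assumes a 64-bit unsigned long (as on x86-64 Linux) and the unsigned __int128 extension available in GCC and Clang. The name div7_mulhi is made up for illustration. The constant equals ceil(2^67 / 7) - 2^64: the full multiplier needs 65 bits, which is why the sub/shr/add fixup follows the multiply.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the compiler's instruction sequence. */
uint64_t div7_mulhi(uint64_t x) {
    const uint64_t magic = 2635249153387078803ULL;  /* movabs rcx, ... */

    /* mul rcx: take the high 64 bits of the 128-bit product (ends up in rdx). */
    uint64_t hi = (uint64_t)(((unsigned __int128)x * magic) >> 64);

    /* sub rdi, rdx ; shr rdi : hi <= x, so this cannot wrap. */
    uint64_t t = (x - hi) >> 1;

    /* lea rax, [rdi + rdx] ; shr rax, 2 */
    return (t + hi) >> 2;
}
```

Nothing in this arithmetic depends on which of the two multiply operands is the constant.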

GCC does basically the same thing, except for using rdx where Clang uses rcx. But they both appear to be doing an extra move. Why not this instead?

f:
        movabs  rax, 2635249153387078803
        mul     rdi
        sub     rdi, rdx
        shr     rdi
        lea     rax, [rdi + rdx]
        shr     rax, 2
        ret

In particular, they both put the numerator in rax, but by putting the magic number there instead, you avoid having to move the numerator at all. If this is actually better, I'm surprised that neither GCC nor Clang does it this way, since it feels so obvious. Is there some microarchitectural reason that their way is actually faster than my way?
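
A sketch of the proposed ordering, for anyone who wants to confirm it gives the same results. It assumes GCC or Clang extended asm (default AT&T syntax) on x86-64 and falls back to unsigned __int128 elsewhere. The function names are hypothetical. The "+a" constraint pins the constant to rax, while the input x may live in any other register.

```c
#include <stdint.h>

/* High 64 bits of magic * x, with the constant (not x) placed in rax. */
static uint64_t mulhi_magic_in_rax(uint64_t x) {
    const uint64_t magic = 2635249153387078803ULL;
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t lo = magic, hi;
    /* mul reads rax and the source operand, then writes rdx:rax. */
    __asm__("mulq %2" : "+a"(lo), "=d"(hi) : "r"(x) : "cc");
    return hi;
#else
    /* Portable fallback, assuming the compiler provides unsigned __int128. */
    return (uint64_t)(((unsigned __int128)magic * x) >> 64);
#endif
}

/* Same fixup steps as before: sub, shr, add, shr 2. */
uint64_t div7_magic_in_rax(uint64_t x) {
    uint64_t hi = mulhi_magic_in_rax(x);
    return (((x - hi) >> 1) + hi) >> 2;
}
```

Because multiplication is commutative, the high half of the product is identical whichever operand sits in rax, so the results match x / 7.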

Godbolt link.

Recommended answer

This very much looks like a missed optimization by both gcc and clang; no benefit to that extra mov.

If it's not already reported, GCC and LLVM both accept missed-optimization bug reports: https://bugs.llvm.org/ and https://gcc.gnu.org/bugzilla/. For GCC there's even a bug tag "missed-optimization".

Wasted mov instructions are unfortunately not rare, especially when looking at tiny functions where the input / output regs are nailed down by the calling convention, not up to the register allocator. They do still happen in loops sometimes, like doing a bunch of extra work each iteration so everything is in the right places for the code that runs once after a loop. /facepalm.

Zero-latency mov (mov-elimination) helps reduce the cost of such missed optimizations (and cases where mov isn't avoidable), but it still takes a front-end uop so it's pretty much strictly worse. (Except by chance where it helps alignment of something later, but if that's the reason then a nop would have been as good).

It also takes up space in the ROB, reducing how far ahead out-of-order exec can see past a cache miss or other stall. mov is never truly free; only the execution-unit and latency costs are eliminated. See the related question: Can x86's MOV really be "free"? Why can't I reproduce this at all?

My total guess about compiler internals:

Probably gcc/clang's internal machinery needs to learn that this division pattern is commutative and can take the input value in some other register and put the constant in RAX.

In a loop they'd want the constant in some other register so they could reuse it, but hopefully the compiler could still figure that out for cases where it's useful.
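
To illustrate the loop case, here is a hand-written C sketch in which the magic constant is loop-invariant. It assumes unsigned __int128 support (GCC/Clang). The function name sum_div7 is hypothetical, and a real compiler would derive the constant itself from a plain a[i] / 7.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of a[i] / 7, with the division spelled out as multiply-high plus fixup. */
uint64_t sum_div7(const uint64_t *a, size_t n) {
    /* Loop-invariant: worth keeping in its own register across iterations. */
    const uint64_t magic = 2635249153387078803ULL;
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t x = a[i];
        uint64_t hi = (uint64_t)(((unsigned __int128)x * magic) >> 64);
        sum += (((x - hi) >> 1) + hi) >> 2;  /* equals x / 7 */
    }
    return sum;
}
```

Here mul overwrites rax on every iteration, so one of the two operands has to be reloaded or copied each time. Keeping the constant in a separate register that survives the loop is the natural choice, unlike in the single-call function above.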
