Why does Clang do this optimization trick only from Sandy Bridge onward?


Problem description


I noticed that Clang does an interesting division optimization trick for the following snippet

#include <stdint.h>

int64_t s2(int64_t a, int64_t b)
{
    return a/b;
}

Below is the assembly output when -march is specified as Sandy Bridge or newer

        mov     rax, rdi        ; dividend a into rax
        mov     rcx, rdi
        or      rcx, rsi        ; rcx = a | b
        shr     rcx, 32         ; ZF=1 iff the high 32 bits of both operands are zero
        je      .LBB1_1         ; both fit in 32 bits: take the fast path
        cqo                     ; sign-extend rax into rdx:rax
        idiv    rsi             ; full 64-bit signed division
        ret
.LBB1_1:
        xor     edx, edx        ; zero the high half of the dividend (edx:eax)
        div     esi             ; cheap 32-bit unsigned division
        ret

Here are the Godbolt links for the signed version and the unsigned version

From what I understand, it checks whether the high 32 bits of both operands are zero, and does a 32-bit division if that's true
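
In C terms, the fast-path check is roughly equivalent to the following sketch (my reconstruction of the asm above, not Clang's actual transformation; s2_sketch is a made-up name):

#include <stdint.h>

int64_t s2_sketch(int64_t a, int64_t b)
{
    /* If the high 32 bits of both operands are zero, both values are
       non-negative and fit in 32 bits, so an unsigned 32-bit divide
       yields the same quotient as the full 64-bit idiv. */
    if ((((uint64_t)a | (uint64_t)b) >> 32) == 0)
        return (int64_t)((uint32_t)a / (uint32_t)b);  /* fast path: div r32 */
    return a / b;                                     /* slow path: cqo + idiv r64 */
}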

I checked this table and see that the latencies for 32/64-bit division on Core2 and Nehalem are 40/116 and 26/89 cycles respectively, i.e. skipping a 64-bit division saves up to roughly 76 cycles on Core2 and 63 on Nehalem. Hence if the operands are indeed often not wide, then the savings from doing a 32-bit division instead of a 64-bit one may be just as worthwhile as on SnB

So why is it enabled only for SnB and later microarchitectures? Why don't other compilers like GCC or ICC do it?

Solution

I'm guessing that clang devs tested which uarches it was good on, and found it was only SnB-family.

That sounds right, because of a funky stall on P6-family, and AMD's different dividers.


Using the flag result from a shift imm8 (not a shift-by-implicit-1) on P6-family causes the front-end to stall before issuing the flag-reading instruction until the shift is retired (because the P6 decoders don't check for the imm8=0 case that leaves flags unmodified, while SnB does); see INC instruction vs ADD 1: Does it matter?. That might be why clang doesn't use it for P6-family.

Probably a different way of checking the relevant condition that didn't cause this stall (like a test rcx,rcx before the je) would make it worth it on Core2/Nehalem. But if clang devs didn't realize the reason it was slow on P6-family, they wouldn't have thought to fix it, and just left the optimization disabled for pre-SnB targets. (Nobody added me to a patch review or bug CC list about this one, unfortunately; this is the first I've seen of clang doing this optimization. Although I think I might have mentioned shift flag stalls in comments on some other LLVM review or bug. Anyway, might be fun to try adding a test and see if that makes it worthwhile on Nehalem.)
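
As a sketch (my own assembly, not output from any compiler), that variant would only change where the branch's flags come from:

        mov     rax, rdi
        mov     rcx, rdi
        or      rcx, rsi
        shr     rcx, 32
        test    rcx, rcx        ; rewrites ZF, so the je reads test's flags
        je      .LBB1_1         ; rather than the shr's, avoiding the P6 stall
        ; ... remainder identical to the listing above

The extra test costs one more uop on targets that don't stall, which is presumably why it wouldn't be emitted unconditionally.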


AMD's dividers have the same best-case div performance regardless of operand-size, presumably depending only on the actual magnitude of the inputs, according to Agner Fog. Only the worst case grows with operand-size. So I think it's harmless to run idiv r64 with small inputs sign-extended to 128/64-bit on AMD. (div/idiv on AMD is 2 uops for all operand sizes, except 8-bit where it's 1 uop because it only has to write one output register: AH and AL = AX. Unlike Intel's microcoded integer division.)

Intel is very different: idiv r32 is 9 uops, vs. idiv r64 being 59 uops, with a best-case throughput that's 3x worse, on Haswell. Other members of SnB-family are similar.
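
If you want to observe the gap, a rough micro-benchmark sketch (hypothetical: the function name, constants, and structure are mine, not from the answer) is to chain dependent divisions so divider latency dominates, once with wide operands and once with operands that fit in 32 bits. Built with clang -O2 -march=sandybridge, the second run should be far faster on SnB-family:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double time_chain(int64_t seed, long iters)
{
    volatile int64_t d = 7;     /* volatile divisor: stops the compiler from
                                   strength-reducing the division to a multiply */
    int64_t x = seed;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        x = x / d + seed;       /* each division depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (x == 42) puts("");      /* keep x (and the loop) live */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    long n = 100000000;         /* 100M dependent divisions per run */
    /* seed 1<<40 keeps x above 2^32 (slow path); seed 12345 keeps it small */
    printf("wide operands:  %.3f s\n", time_chain((int64_t)1 << 40, n));
    printf("small operands: %.3f s\n", time_chain(12345, n));
    return 0;
}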

Why don't other compilers like GCC or ICC do it?

Probably because clang developers thought of it, and gcc/icc haven't copied them yet. If you've watched Chandler Carruth's talks about perf, one example he used was playing around with a branch to skip a div. I'm guessing this optimization was his idea. Looks nifty. :)
