Clang生成的7次比较比8次比较糟糕的代码 [英] Clang generates worse code for 7 comparisons than for 8 comparisons

查看:102
本文介绍了Clang生成的7次比较比8次比较糟糕的代码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对clang将小整数的许多==比较转换为一条大SIMD指令的能力感到很感兴趣,但是后来我注意到了一些奇怪的事情. 当我进行7次比较时,Clang生成了更糟糕"的代码(在我的业余评估中),而当我进行8次比较时,Clang生成了该代码.

I was intrigued by clang's ability to convert many == comparisons of small integers to to one big SIMD instruction, but then I noticed something strange. Clang generated "worse" code(in my amateur evaluation) when I had 7 comparisons compared to the code when I had 8 comparisons.

bool f1(short x){
    return (x==-1) | (x == 150) |
           (x==5) | (x==64) | 
           (x==15) | (x==223) | 
           (x==42) | (x==47);
}

bool f2(short x){
    return (x==-1) | (x == 150) |
           (x==5) | (x==64) | 
           (x==15) | (x==223) | 
           (x==42);
}

我的问题是这是一个很小的性能错误,或者clang有一个很好的理由不想引入虚拟比较(即假装与7个值中的一个进行额外的比较),并在其中使用一个更多的常量代码来实现它.

My question is this a small performance bug, or clang has a very good reason for not wanting to introduce a dummy comparison(i.e. pretend that there is one extra comparison with one of the 7 values) and use one more constant in the code to achieve it.

godbolt链接此处:

godbolt link here:

# clang(trunk) -O2 -march=haswell
f1(short):
    vmovd   xmm0, edi
    vpbroadcastw    xmm0, xmm0             # set1(x)
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]  # 16 bytes = 8 shorts
    vpacksswb       xmm0, xmm0, xmm0
    vpmovmskb       eax, xmm0
    test    al, al
    setne   al           # booleanize the parallel-compare bitmask
    ret

vs.

f2(short):
    cmp     di, -1
    sete    r8b
    cmp     edi, 150
    sete    dl
    cmp     di, 5             # scalar checks of 3 conditions
    vmovd   xmm0, edi
    vpbroadcastw    xmm0, xmm0
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI1_0]  # low 8 bytes = 4 shorts
    sete    al
    vpmovsxwd       xmm0, xmm0
    vmovmskps       esi, xmm0
    test    sil, sil
    setne   cl                # SIMD check of the other 4
    or      al, r8b
    or      al, dl
    or      al, cl            # and combine.
    ret

quickbench似乎不起作用,因为IDK如何为其提供-mavx2标志. (编者注:简单地将uops计入前端成本表明,这显然会降低吞吐量.而且还会增加延迟.)

quickbench does not seem to work because IDK how to provide -mavx2 flag to it. (Editor's note: simply counting uops for front-end cost shows this is obviously worse for throughput. And also latency.)

推荐答案

似乎clang的优化程序没有考虑复制元素以使其达到SIMD方便的比较数量.但是您是对的,那比做额外的标量工作更好. 显然错过了优化,应该将其报告为clang/LLVM优化器错误. https://bugs.llvm.org/

It looks like clang's optimizer didn't think of duplicating an element to bring it up to a SIMD-convenient number of comparisons. But you're right, that would be better than doing extra scalar work. Clearly a missed optimization which should get reported as a clang/LLVM optimizer bug. https://bugs.llvm.org/

f1()的asm明显优于f2():vpacksswb xmm在主流Intel和AMD CPU上的成本与vpmovsxwd xmm相同,就像其他单UOP洗牌一样.而且如果vpmovsx-> vmovmskps可能在整数域和FP域之间有旁路延迟 1 .

The asm for f1() is clearly better than f2(): vpacksswb xmm has the same cost as vpmovsxwd xmm on mainstream Intel and AMD CPUs, like other single-uop shuffles. And if anything vpmovsx -> vmovmskps could have bypass latency between integer and FP domains1.

脚注1:在装有AVX2的主流Intel CPU(桑迪布里奇家族)上,可能没有多余的旁路延迟; FP运算之间的整数混洗通常很好,IIRC. ( https://agner.org/optimize/).但是对于Nehalem上的SSE4.1版本,是的,可能会有整数版本所没有的额外惩罚.

Footnote 1: Probably no extra bypass latency on mainstream Intel CPUs with AVX2 (Sandybridge-family); integer shuffles between FP ops are typically fine, IIRC. (https://agner.org/optimize/). But for an SSE4.1 version on Nehalem, yes there might be an extra penalty the integer version wouldn't have.

您不需要AVX2,但是在没有pshufb控制向量的一条指令中进行单词广播确实可以提高效率.然后clang为-march=nehalem

You don't need AVX2, but word-broadcast in one instruction without a pshufb control vector does make it more efficient. And clang chooses pshuflw -> pshufd for -march=nehalem

当然,这两个版本都是次优的.在移动掩码之前,无需重新整理即可压缩比较结果.

Of course, both versions are sub-optimal. There's no need to shuffle to compress the compare result before movemask.

例如,可以选择要使用test sil, 0b00001010检查的位,而不是test al, al,以检查位1和3,而忽略其他位置的非零位.

Instead of test al, al, it's possible to select which bits you want to check with test sil, 0b00001010 for example, to check bits 1 and 3 but ignore non-zero bits in other positions.

pcmpeqw在word元素内将两个字节设置为相同,因此pmovmskb可以得到结果,并获得带有位对的整数.

pcmpeqw sets both bytes the same inside a word element so it's fine to pmovmskb that result and get an integer with pairs of bits.

使用字节寄存器代替双字寄存器也有零好处:test sil,sil应避免使用REX前缀,而应使用test esi,esi.

There's also zero benefit to using a byte register instead of a dword register: test sil,sil should avoid the REX prefix and use test esi,esi.

因此,即使没有复制条件之一,f2()也可能是:

So even without duplicating one of the conditions, f2() could be:

f2:
    vmovd           xmm0, edi
    vpbroadcastw    xmm0, xmm0             # set1(x)
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
    vpmovmskb       eax, xmm0
    test    eax, 0b011111111111111    # (1<<15) - 1 = low 14 bits set
    setne   al
    ret

test将根据pmovmksb结果的低14位设置ZF,因为在测试掩码中清除了较高的位. TEST = AND不会写入其输出.通常对于选择比较蒙版的部分很有用.

That test will set ZF according to the low 14 bits of the pmovmksb result, because the higher bits are cleared in the TEST mask. TEST = AND that doesn't write its output. Often useful for selecting parts of a compare mask.

但是,由于我们首先需要在内存中使用16个字节的常量,因此我们应该复制其中一个元素以将其填充为最多8个元素.然后,我们可以像普通人一样使用test eax,eax.压缩掩码以适合8位AL会浪费时间和代码大小. test r32, r32test r8,r8一样快,并且不需要SIL,DIL或BPL的REX前缀.

But since we need a 16-byte constant in memory in the first place, yes we should duplicate one of the elements to pad it up to 8 elements. Then we can use test eax,eax like a normal person. Compressing the mask to fit in 8-bit AL is a total waste of time and code-size. test r32, r32 is just as fast as test r8,r8 and doesn't need a REX prefix for SIL, DIL, or BPL.

有趣的事实:AVX512VL让我们使用vpbroadcastw xmm0, edi来组合movd和广播.

Fun fact: AVX512VL would let us use vpbroadcastw xmm0, edi to combine the movd and broadcast.

或者仅比较4个元素,而不是对movmskps进行额外的改组,我们在这里只需要SSE2.真正使用口罩是很有用的.

Or to compare only 4 elements, instead of extra shuffling for movmskps, we only need SSE2 here. And using a mask truly is useful.

test_4_possibilities_SSE2:
    movd            xmm0, edi
    pshufd          xmm0, xmm0, 0             # set1_epi32(x)
    pcmpeqw         xmm0, [const]             # == set_epi32(a, b, c, d)
    pmovmskb        eax, xmm0
    test    eax, 0b0001000100010001     # the low bit of each group of 4
    setne   al
    ret

我们进行双字广播,并且忽略每个32位元素的高16位中的比较结果.为test使用蒙版可以使我们比任何额外的指令都更便宜地进行此操作.

We do a dword broadcast and ignore the compare result in the high 16 bits of each 32-bit element. Using a mask for test lets us do that more cheaply than any extra instruction would.

没有AVX2,用pshufd广播的SIMD双字比需要广播的单词便宜.

Without AVX2, a SIMD dword broadcast with pshufd is cheaper than needing a word broadcast.

另一个选择是将imul0x00010001一起广播一个单词到32位寄存器中,但是具有3个周期的延迟,因此它可能比punpcklwd-> pshufd

Another option is to imul with 0x00010001 to broadcast a word into a 32-bit register, but that has 3 cycle latency so it's potentially worse than punpcklwd -> pshufd

但是,在循环内,值得为pshufb(SSSE3)加载控制向量,而不是使用2个shuffle或imul.

Inside a loop, though, it would be worth loading a control vector for pshufb (SSSE3) instead of using 2 shuffles or an imul.

这篇关于Clang生成的7次比较比8次比较糟糕的代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆