Clang生成的7次比较比8次比较糟糕的代码 [英] Clang generates worse code for 7 comparisons than for 8 comparisons
问题描述
我对clang将小整数的许多==比较转换为一条大SIMD指令的能力感到很感兴趣,但是后来我注意到了一些奇怪的事情. 当我进行7次比较时,Clang生成了更糟糕"的代码(在我的业余评估中),而当我进行8次比较时,Clang生成了该代码.
I was intrigued by clang's ability to convert many == comparisons of small integers to to one big SIMD instruction, but then I noticed something strange. Clang generated "worse" code(in my amateur evaluation) when I had 7 comparisons compared to the code when I had 8 comparisons.
bool f1(short x){
return (x==-1) | (x == 150) |
(x==5) | (x==64) |
(x==15) | (x==223) |
(x==42) | (x==47);
}
bool f2(short x){
return (x==-1) | (x == 150) |
(x==5) | (x==64) |
(x==15) | (x==223) |
(x==42);
}
我的问题是这是一个很小的性能错误,或者clang有一个很好的理由不想引入虚拟比较(即假装与7个值中的一个进行额外的比较),并在其中使用一个更多的常量代码来实现它.
My question is this a small performance bug, or clang has a very good reason for not wanting to introduce a dummy comparison(i.e. pretend that there is one extra comparison with one of the 7 values) and use one more constant in the code to achieve it.
godbolt链接此处:
godbolt link here:
# clang(trunk) -O2 -march=haswell
f1(short):
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0 # set1(x)
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0] # 16 bytes = 8 shorts
vpacksswb xmm0, xmm0, xmm0
vpmovmskb eax, xmm0
test al, al
setne al # booleanize the parallel-compare bitmask
ret
vs.
f2(short):
cmp di, -1
sete r8b
cmp edi, 150
sete dl
cmp di, 5 # scalar checks of 3 conditions
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # low 8 bytes = 4 shorts
sete al
vpmovsxwd xmm0, xmm0
vmovmskps esi, xmm0
test sil, sil
setne cl # SIMD check of the other 4
or al, r8b
or al, dl
or al, cl # and combine.
ret
quickbench似乎不起作用,因为IDK如何为其提供-mavx2标志. (编者注:简单地将uops计入前端成本表明,这显然会降低吞吐量.而且还会增加延迟.)
quickbench does not seem to work because IDK how to provide -mavx2 flag to it. (Editor's note: simply counting uops for front-end cost shows this is obviously worse for throughput. And also latency.)
推荐答案
似乎clang的优化程序没有考虑复制元素以使其达到SIMD方便的比较数量.但是您是对的,那比做额外的标量工作更好. 显然错过了优化,应该将其报告为clang/LLVM优化器错误. https://bugs.llvm.org/
It looks like clang's optimizer didn't think of duplicating an element to bring it up to a SIMD-convenient number of comparisons. But you're right, that would be better than doing extra scalar work. Clearly a missed optimization which should get reported as a clang/LLVM optimizer bug. https://bugs.llvm.org/
f1()
的asm明显优于f2()
:vpacksswb xmm
在主流Intel和AMD CPU上的成本与vpmovsxwd xmm
相同,就像其他单UOP洗牌一样.而且如果vpmovsx
-> vmovmskps
可能在整数域和FP域之间有旁路延迟 1 .
The asm for f1()
is clearly better than f2()
: vpacksswb xmm
has the same cost as vpmovsxwd xmm
on mainstream Intel and AMD CPUs, like other single-uop shuffles. And if anything vpmovsx
-> vmovmskps
could have bypass latency between integer and FP domains1.
脚注1:在装有AVX2的主流Intel CPU(桑迪布里奇家族)上,可能没有多余的旁路延迟; FP运算之间的整数混洗通常很好,IIRC. ( https://agner.org/optimize/).但是对于Nehalem上的SSE4.1版本,是的,可能会有整数版本所没有的额外惩罚.
Footnote 1: Probably no extra bypass latency on mainstream Intel CPUs with AVX2 (Sandybridge-family); integer shuffles between FP ops are typically fine, IIRC. (https://agner.org/optimize/). But for an SSE4.1 version on Nehalem, yes there might be an extra penalty the integer version wouldn't have.
您不需要AVX2,但是在没有pshufb
控制向量的一条指令中进行单词广播确实可以提高效率.然后clang为-march=nehalem
You don't need AVX2, but word-broadcast in one instruction without a pshufb
control vector does make it more efficient. And clang chooses pshuflw
-> pshufd
for -march=nehalem
当然,这两个版本都是次优的.在移动掩码之前,无需重新整理即可压缩比较结果.
Of course, both versions are sub-optimal. There's no need to shuffle to compress the compare result before movemask.
例如,可以选择要使用test sil, 0b00001010
检查的位,而不是test al, al
,以检查位1和3,而忽略其他位置的非零位.
Instead of test al, al
, it's possible to select which bits you want to check with test sil, 0b00001010
for example, to check bits 1 and 3 but ignore non-zero bits in other positions.
pcmpeqw
在word元素内将两个字节设置为相同,因此pmovmskb
可以得到结果,并获得带有位对的整数.
pcmpeqw
sets both bytes the same inside a word element so it's fine to pmovmskb
that result and get an integer with pairs of bits.
使用字节寄存器代替双字寄存器也有零好处:test sil,sil
应避免使用REX前缀,而应使用test esi,esi
.
There's also zero benefit to using a byte register instead of a dword register: test sil,sil
should avoid the REX prefix and use test esi,esi
.
因此,即使没有复制条件之一,f2()
也可能是:
So even without duplicating one of the conditions, f2()
could be:
f2:
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0 # set1(x)
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpmovmskb eax, xmm0
test eax, 0b011111111111111 # (1<<15) - 1 = low 14 bits set
setne al
ret
该test
将根据pmovmksb
结果的低14位设置ZF,因为在测试掩码中清除了较高的位. TEST = AND不会写入其输出.通常对于选择比较蒙版的部分很有用.
That test
will set ZF according to the low 14 bits of the pmovmksb
result, because the higher bits are cleared in the TEST mask. TEST = AND that doesn't write its output. Often useful for selecting parts of a compare mask.
但是,由于我们首先需要在内存中使用16个字节的常量,因此我们应该复制其中一个元素以将其填充为最多8个元素.然后,我们可以像普通人一样使用test eax,eax
.压缩掩码以适合8位AL
会浪费时间和代码大小. test r32, r32
与test r8,r8
一样快,并且不需要SIL,DIL或BPL的REX前缀.
But since we need a 16-byte constant in memory in the first place, yes we should duplicate one of the elements to pad it up to 8 elements. Then we can use test eax,eax
like a normal person. Compressing the mask to fit in 8-bit AL
is a total waste of time and code-size. test r32, r32
is just as fast as test r8,r8
and doesn't need a REX prefix for SIL, DIL, or BPL.
有趣的事实:AVX512VL让我们使用vpbroadcastw xmm0, edi
来组合movd
和广播.
Fun fact: AVX512VL would let us use vpbroadcastw xmm0, edi
to combine the movd
and broadcast.
或者仅比较4个元素,而不是对movmskps
进行额外的改组,我们在这里只需要SSE2.真正使用口罩是很有用的.
Or to compare only 4 elements, instead of extra shuffling for movmskps
, we only need SSE2 here. And using a mask truly is useful.
test_4_possibilities_SSE2:
movd xmm0, edi
pshufd xmm0, xmm0, 0 # set1_epi32(x)
pcmpeqw xmm0, [const] # == set_epi32(a, b, c, d)
pmovmskb eax, xmm0
test eax, 0b0001000100010001 # the low bit of each group of 4
setne al
ret
我们进行双字广播,并且忽略每个32位元素的高16位中的比较结果.为test
使用蒙版可以使我们比任何额外的指令都更便宜地进行此操作.
We do a dword broadcast and ignore the compare result in the high 16 bits of each 32-bit element. Using a mask for test
lets us do that more cheaply than any extra instruction would.
没有AVX2,用pshufd
广播的SIMD双字比需要广播的单词便宜.
Without AVX2, a SIMD dword broadcast with pshufd
is cheaper than needing a word broadcast.
另一个选择是将imul
与0x00010001
一起广播一个单词到32位寄存器中,但是具有3个周期的延迟,因此它可能比punpcklwd
-> pshufd
Another option is to imul
with 0x00010001
to broadcast a word into a 32-bit register, but that has 3 cycle latency so it's potentially worse than punpcklwd
-> pshufd
但是,在循环内,值得为pshufb
(SSSE3)加载控制向量,而不是使用2个shuffle或imul.
Inside a loop, though, it would be worth loading a control vector for pshufb
(SSSE3) instead of using 2 shuffles or an imul.
这篇关于Clang生成的7次比较比8次比较糟糕的代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!