Wrong gcc generated assembly ordering, results in performance hit
Problem description
I have the following code, which copies data from memory to a DMA buffer:
for (; likely(l > 0); l-=128)
{
__m256i m0 = _mm256_load_si256( (__m256i*) (src) );
__m256i m1 = _mm256_load_si256( (__m256i*) (src+32) );
__m256i m2 = _mm256_load_si256( (__m256i*) (src+64) );
__m256i m3 = _mm256_load_si256( (__m256i*) (src+96) );
_mm256_stream_si256( (__m256i *) (dst), m0 );
_mm256_stream_si256( (__m256i *) (dst+32), m1 );
_mm256_stream_si256( (__m256i *) (dst+64), m2 );
_mm256_stream_si256( (__m256i *) (dst+96), m3 );
src += 128;
dst += 128;
}
This is what the gcc assembly output looks like:
405280: c5 fd 6f 50 20 vmovdqa 0x20(%rax),%ymm2
405285: c5 fd 6f 48 40 vmovdqa 0x40(%rax),%ymm1
40528a: c5 fd 6f 40 60 vmovdqa 0x60(%rax),%ymm0
40528f: c5 fd 6f 18 vmovdqa (%rax),%ymm3
405293: 48 83 e8 80 sub $0xffffffffffffff80,%rax
405297: c5 fd e7 52 20 vmovntdq %ymm2,0x20(%rdx)
40529c: c5 fd e7 4a 40 vmovntdq %ymm1,0x40(%rdx)
4052a1: c5 fd e7 42 60 vmovntdq %ymm0,0x60(%rdx)
4052a6: c5 fd e7 1a vmovntdq %ymm3,(%rdx)
4052aa: 48 83 ea 80 sub $0xffffffffffffff80,%rdx
4052ae: 48 39 c8 cmp %rcx,%rax
4052b1: 75 cd jne 405280 <sender_body+0x6e0>
Note the reordering of the last vmovdqa and vmovntdq instructions. With the gcc-generated code above, I am able to reach a throughput of ~10 227 571 packets per second in my application.
Next, I reordered those instructions manually in a hex editor. The loop now looks like this:
405280: c5 fd 6f 18 vmovdqa (%rax),%ymm3
405284: c5 fd 6f 50 20 vmovdqa 0x20(%rax),%ymm2
405289: c5 fd 6f 48 40 vmovdqa 0x40(%rax),%ymm1
40528e: c5 fd 6f 40 60 vmovdqa 0x60(%rax),%ymm0
405293: 48 83 e8 80 sub $0xffffffffffffff80,%rax
405297: c5 fd e7 1a vmovntdq %ymm3,(%rdx)
40529b: c5 fd e7 52 20 vmovntdq %ymm2,0x20(%rdx)
4052a0: c5 fd e7 4a 40 vmovntdq %ymm1,0x40(%rdx)
4052a5: c5 fd e7 42 60 vmovntdq %ymm0,0x60(%rdx)
4052aa: 48 83 ea 80 sub $0xffffffffffffff80,%rdx
4052ae: 48 39 c8 cmp %rcx,%rax
4052b1: 75 cd jne 405280 <sender_body+0x6e0>
With the properly ordered instructions I get ~13 668 313 packets per second. So it is obvious that the reordering introduced by gcc reduces performance.
Have you come across this? Is this a known bug, or should I file a bug report?
Compilation flags:
-O3 -pipe -g -msse4.1 -mavx
My gcc version:
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
I find this problem interesting. GCC is known for producing less than optimal code, but I find it fascinating to find ways to "encourage" it to produce better code (for hottest/bottleneck code only, of course), without micro-managing too heavily. In this particular case, I looked at three "tools" I use for such situations:
- volatile: If it is important that the memory accesses occur in a specific order, then volatile is a suitable tool. Note that it can be overkill, and will lead to a separate load every time a volatile pointer is dereferenced.
SSE/AVX load/store intrinsics can't be used with volatile pointers, because they are functions. Using something like _mm256_load_si256((volatile __m256i *)src); implicitly casts the pointer to const __m256i *, losing the volatile qualifier. We can directly dereference volatile pointers, though. (The load/store intrinsics are only needed when we need to tell the compiler that the data might be unaligned, or that we want a streaming store.)
m0 = ((volatile __m256i *)src)[0];
m1 = ((volatile __m256i *)src)[1];
m2 = ((volatile __m256i *)src)[2];
m3 = ((volatile __m256i *)src)[3];
Unfortunately, this doesn't help with the stores, because we want to emit streaming stores. A *(volatile ...)dst = tmp; won't give us what we want.
- __asm__ __volatile__ (""); as a compiler reordering barrier: This is the GNU C way of writing a compiler memory barrier. (It stops compile-time reordering without emitting an actual barrier instruction like mfence.) It stops the compiler from reordering memory accesses across this statement.
- Using an index limit for loop structures.
GCC is known for pretty poor register usage. Earlier versions made a lot of unnecessary moves between registers, although that is pretty minimal nowadays. However, testing on x86-64 across many versions of GCC indicates that in loops it is better to use an index limit, rather than an independent loop variable, for best results.
Combining all the above, I constructed the following function (after a few iterations):
#include <stdlib.h>
#include <immintrin.h>
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
void copy(void *const destination, const void *const source, const size_t bytes)
{
__m256i *dst = (__m256i *)destination;
const __m256i *src = (const __m256i *)source;
const __m256i *end = (const __m256i *)source + bytes / sizeof (__m256i);
while (likely(src < end)) {
const __m256i m0 = ((volatile const __m256i *)src)[0];
const __m256i m1 = ((volatile const __m256i *)src)[1];
const __m256i m2 = ((volatile const __m256i *)src)[2];
const __m256i m3 = ((volatile const __m256i *)src)[3];
_mm256_stream_si256( dst, m0 );
_mm256_stream_si256( dst + 1, m1 );
_mm256_stream_si256( dst + 2, m2 );
_mm256_stream_si256( dst + 3, m3 );
__asm__ __volatile__ ("");
src += 4;
dst += 4;
}
}
Compiling it (example.c) with GCC-4.8.4 using
gcc -std=c99 -mavx2 -march=x86-64 -mtune=generic -O2 -S example.c
yields (example.s):
.file "example.c"
.text
.p2align 4,,15
.globl copy
.type copy, @function
copy:
.LFB993:
.cfi_startproc
andq $-32, %rdx
leaq (%rsi,%rdx), %rcx
cmpq %rcx, %rsi
jnb .L5
movq %rsi, %rax
movq %rdi, %rdx
.p2align 4,,10
.p2align 3
.L4:
vmovdqa (%rax), %ymm3
vmovdqa 32(%rax), %ymm2
vmovdqa 64(%rax), %ymm1
vmovdqa 96(%rax), %ymm0
vmovntdq %ymm3, (%rdx)
vmovntdq %ymm2, 32(%rdx)
vmovntdq %ymm1, 64(%rdx)
vmovntdq %ymm0, 96(%rdx)
subq $-128, %rax
subq $-128, %rdx
cmpq %rax, %rcx
ja .L4
vzeroupper
.L5:
ret
.cfi_endproc
.LFE993:
.size copy, .-copy
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
.section .note.GNU-stack,"",@progbits
The disassembly of the actual compiled (-c
instead of -S
) code is
0000000000000000 <copy>:
0: 48 83 e2 e0 and $0xffffffffffffffe0,%rdx
4: 48 8d 0c 16 lea (%rsi,%rdx,1),%rcx
8: 48 39 ce cmp %rcx,%rsi
b: 73 41 jae 4e <copy+0x4e>
d: 48 89 f0 mov %rsi,%rax
10: 48 89 fa mov %rdi,%rdx
13: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
18: c5 fd 6f 18 vmovdqa (%rax),%ymm3
1c: c5 fd 6f 50 20 vmovdqa 0x20(%rax),%ymm2
21: c5 fd 6f 48 40 vmovdqa 0x40(%rax),%ymm1
26: c5 fd 6f 40 60 vmovdqa 0x60(%rax),%ymm0
2b: c5 fd e7 1a vmovntdq %ymm3,(%rdx)
2f: c5 fd e7 52 20 vmovntdq %ymm2,0x20(%rdx)
34: c5 fd e7 4a 40 vmovntdq %ymm1,0x40(%rdx)
39: c5 fd e7 42 60 vmovntdq %ymm0,0x60(%rdx)
3e: 48 83 e8 80 sub $0xffffffffffffff80,%rax
42: 48 83 ea 80 sub $0xffffffffffffff80,%rdx
46: 48 39 c1 cmp %rax,%rcx
49: 77 cd ja 18 <copy+0x18>
4b: c5 f8 77 vzeroupper
4e: c3 retq
Without any optimizations at all, the code is completely disgusting, full of unnecessary moves, so some optimization is necessary. (The above uses -O2
, which is generally the optimization level I use.)
If optimizing for size (-Os
), the code looks excellent at first glance,
0000000000000000 <copy>:
0: 48 83 e2 e0 and $0xffffffffffffffe0,%rdx
4: 48 01 f2 add %rsi,%rdx
7: 48 39 d6 cmp %rdx,%rsi
a: 73 30 jae 3c <copy+0x3c>
c: c5 fd 6f 1e vmovdqa (%rsi),%ymm3
10: c5 fd 6f 56 20 vmovdqa 0x20(%rsi),%ymm2
15: c5 fd 6f 4e 40 vmovdqa 0x40(%rsi),%ymm1
1a: c5 fd 6f 46 60 vmovdqa 0x60(%rsi),%ymm0
1f: c5 fd e7 1f vmovntdq %ymm3,(%rdi)
23: c5 fd e7 57 20 vmovntdq %ymm2,0x20(%rdi)
28: c5 fd e7 4f 40 vmovntdq %ymm1,0x40(%rdi)
2d: c5 fd e7 47 60 vmovntdq %ymm0,0x60(%rdi)
32: 48 83 ee 80 sub $0xffffffffffffff80,%rsi
36: 48 83 ef 80 sub $0xffffffffffffff80,%rdi
3a: eb cb jmp 7 <copy+0x7>
3c: c3 retq
until you notice that the last jmp is to the comparison, essentially doing a jmp, cmp, and a jae at every iteration, which probably yields pretty poor results.
Note: If you do something similar for real-world code, please do add comments (especially for the __asm__ __volatile__ ("");), and remember to periodically check with all compilers available, to make sure the code is not compiled too badly by any of them.
Looking at Peter Cordes' excellent answer, I decided to iterate the function a bit further, just for fun.
As Ross Ridge mentions in the comments, when using _mm256_load_si256() the pointer is not dereferenced (prior to being re-cast to aligned __m256i * as a parameter to the function), thus volatile won't help when using _mm256_load_si256(). In another comment, Seb suggests a workaround: _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) }), which supplies the function with a pointer to src by accessing the element via a volatile pointer and casting it to an array. For a simple aligned load, I prefer the direct volatile pointer; it matches my intent in the code. (I do aim for KISS, although often I hit only the stupid part of it.)
On x86-64, the start of the inner loop is aligned to 16 bytes, so the number of operations in the function "header" part is not really important. Still, avoiding the superfluous binary AND (masking the five least significant bits of the amount to copy in bytes) is certainly useful in general.
GCC provides two options for this. One is the __builtin_assume_aligned() built-in, which allows a programmer to convey all sorts of alignment information to the compiler. The other is typedef'ing a type that has extra attributes, here __attribute__((aligned (32))), which can be used to convey the alignedness of function parameters, for example. Both of these should be available in clang (although support is recent, not in 3.5 yet), and may be available in others such as icc (although ICC, AFAIK, uses __assume_aligned()).
One way to mitigate the register shuffling GCC does is to use a helper function. After some further iterations, I arrived at this, another.c:
#include <stdlib.h>
#include <immintrin.h>
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
#if (__clang_major__+0 >= 3)
#define IS_ALIGNED(x, n) ((void *)(x))
#elif (__GNUC__+0 >= 4)
#define IS_ALIGNED(x, n) __builtin_assume_aligned((x), (n))
#else
#define IS_ALIGNED(x, n) ((void *)(x))
#endif
typedef __m256i __m256i_aligned __attribute__((aligned (32)));
void do_copy(register __m256i_aligned *dst,
register volatile __m256i_aligned *src,
register __m256i_aligned *end)
{
do {
register const __m256i m0 = src[0];
register const __m256i m1 = src[1];
register const __m256i m2 = src[2];
register const __m256i m3 = src[3];
__asm__ __volatile__ ("");
_mm256_stream_si256( dst, m0 );
_mm256_stream_si256( dst + 1, m1 );
_mm256_stream_si256( dst + 2, m2 );
_mm256_stream_si256( dst + 3, m3 );
__asm__ __volatile__ ("");
src += 4;
dst += 4;
} while (likely(src < end));
}
void copy(void *dst, const void *src, const size_t bytes)
{
if (bytes < 128)
return;
do_copy(IS_ALIGNED(dst, 32),
IS_ALIGNED(src, 32),
IS_ALIGNED((void *)((char *)src + bytes), 32));
}
which compiles with gcc -march=x86-64 -mtune=generic -mavx2 -O2 -S another.c
to essentially (comments and directives omitted for brevity):
do_copy:
.L3:
vmovdqa (%rsi), %ymm3
vmovdqa 32(%rsi), %ymm2
vmovdqa 64(%rsi), %ymm1
vmovdqa 96(%rsi), %ymm0
vmovntdq %ymm3, (%rdi)
vmovntdq %ymm2, 32(%rdi)
vmovntdq %ymm1, 64(%rdi)
vmovntdq %ymm0, 96(%rdi)
subq $-128, %rsi
subq $-128, %rdi
cmpq %rdx, %rsi
jb .L3
vzeroupper
ret
copy:
cmpq $127, %rdx
ja .L8
rep ret
.L8:
addq %rsi, %rdx
jmp do_copy
Further optimization at -O3
just inlines the helper function,
do_copy:
.L3:
vmovdqa (%rsi), %ymm3
vmovdqa 32(%rsi), %ymm2
vmovdqa 64(%rsi), %ymm1
vmovdqa 96(%rsi), %ymm0
vmovntdq %ymm3, (%rdi)
vmovntdq %ymm2, 32(%rdi)
vmovntdq %ymm1, 64(%rdi)
vmovntdq %ymm0, 96(%rdi)
subq $-128, %rsi
subq $-128, %rdi
cmpq %rdx, %rsi
jb .L3
vzeroupper
ret
copy:
cmpq $127, %rdx
ja .L10
rep ret
.L10:
leaq (%rsi,%rdx), %rax
.L8:
vmovdqa (%rsi), %ymm3
vmovdqa 32(%rsi), %ymm2
vmovdqa 64(%rsi), %ymm1
vmovdqa 96(%rsi), %ymm0
vmovntdq %ymm3, (%rdi)
vmovntdq %ymm2, 32(%rdi)
vmovntdq %ymm1, 64(%rdi)
vmovntdq %ymm0, 96(%rdi)
subq $-128, %rsi
subq $-128, %rdi
cmpq %rsi, %rax
ja .L8
vzeroupper
ret
and even with -Os
the generated code is very nice,
do_copy:
.L3:
vmovdqa (%rsi), %ymm3
vmovdqa 32(%rsi), %ymm2
vmovdqa 64(%rsi), %ymm1
vmovdqa 96(%rsi), %ymm0
vmovntdq %ymm3, (%rdi)
vmovntdq %ymm2, 32(%rdi)
vmovntdq %ymm1, 64(%rdi)
vmovntdq %ymm0, 96(%rdi)
subq $-128, %rsi
subq $-128, %rdi
cmpq %rdx, %rsi
jb .L3
ret
copy:
cmpq $127, %rdx
jbe .L5
addq %rsi, %rdx
jmp do_copy
.L5:
ret
Of course, without optimizations GCC-4.8.4 still produces pretty bad code. With clang-3.5 -march=x86-64 -mtune=generic -mavx2 -O2
and -Os
we get essentially
do_copy:
.LBB0_1:
vmovaps (%rsi), %ymm0
vmovaps 32(%rsi), %ymm1
vmovaps 64(%rsi), %ymm2
vmovaps 96(%rsi), %ymm3
vmovntps %ymm0, (%rdi)
vmovntps %ymm1, 32(%rdi)
vmovntps %ymm2, 64(%rdi)
vmovntps %ymm3, 96(%rdi)
subq $-128, %rsi
subq $-128, %rdi
cmpq %rdx, %rsi
jb .LBB0_1
vzeroupper
retq
copy:
cmpq $128, %rdx
jb .LBB1_3
addq %rsi, %rdx
.LBB1_2:
vmovaps (%rsi), %ymm0
vmovaps 32(%rsi), %ymm1
vmovaps 64(%rsi), %ymm2
vmovaps 96(%rsi), %ymm3
vmovntps %ymm0, (%rdi)
vmovntps %ymm1, 32(%rdi)
vmovntps %ymm2, 64(%rdi)
vmovntps %ymm3, 96(%rdi)
subq $-128, %rsi
subq $-128, %rdi
cmpq %rdx, %rsi
jb .LBB1_2
.LBB1_3:
vzeroupper
retq
I like the another.c code (it suits my coding style), and I'm happy with the code generated by GCC-4.8.4 and clang-3.5 at -O1, -O2, -O3, and -Os on both, so I think it is good enough for me. (Note, however, that I haven't actually benchmarked any of this, because I don't have the relevant code. We use both temporal and non-temporal (nt) memory accesses, and cache behaviour (and cache interaction with the surrounding code) is paramount for things like this, so it would make no sense to microbenchmark this, I think.)