Wrong gcc generated assembly ordering, results in performance hit


I have got the following code, which copies data from memory to DMA buffer:

for (; likely(l > 0); l-=128)
{
    __m256i m0 = _mm256_load_si256( (__m256i*) (src) );
    __m256i m1 = _mm256_load_si256( (__m256i*) (src+32) );
    __m256i m2 = _mm256_load_si256( (__m256i*) (src+64) );
    __m256i m3 = _mm256_load_si256( (__m256i*) (src+96) );

    _mm256_stream_si256( (__m256i *) (dst), m0 );
    _mm256_stream_si256( (__m256i *) (dst+32), m1 );
    _mm256_stream_si256( (__m256i *) (dst+64), m2 );
    _mm256_stream_si256( (__m256i *) (dst+96), m3 );

    src += 128;
    dst += 128;
}
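
Here likely() is presumably the usual __builtin_expect wrapper; the answer below uses the same definition:

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)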

This is what the gcc assembly output looks like:

405280:       c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
405285:       c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
40528a:       c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
40528f:       c5 fd 6f 18             vmovdqa (%rax),%ymm3
405293:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
405297:       c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
40529c:       c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
4052a1:       c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
4052a6:       c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
4052aa:       48 83 ea 80             sub    $0xffffffffffffff80,%rdx
4052ae:       48 39 c8                cmp    %rcx,%rax
4052b1:       75 cd                   jne    405280 <sender_body+0x6e0>

Note the reordering of last vmovdqa and vmovntdq instructions. With the gcc generated code above I am able to reach throughput of ~10 227 571 packets per second in my application.

Next, I reordered those instructions manually in a hex editor. The loop now looks like this:

405280:       c5 fd 6f 18             vmovdqa (%rax),%ymm3
405284:       c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
405289:       c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
40528e:       c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
405293:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
405297:       c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
40529b:       c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
4052a0:       c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
4052a5:       c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
4052aa:       48 83 ea 80             sub    $0xffffffffffffff80,%rdx
4052ae:       48 39 c8                cmp    %rcx,%rax
4052b1:       75 cd                   jne    405280 <sender_body+0x6e0>

With the properly ordered instructions I get ~13 668 313 packets per second, so it is obvious that the reordering introduced by gcc reduces performance.

Have you come across this? Is this a known bug, or should I file a bug report?

Compilation flags:

-O3 -pipe -g -msse4.1 -mavx

My gcc version:

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Solution

I find this problem interesting. GCC is known for producing less than optimal code, but I find it fascinating to find ways to "encourage" it to produce better code (for the hottest/bottleneck code only, of course), without micro-managing too heavily. In this particular case, I looked at three "tools" I use for such situations:

  • volatile: If it is important that the memory accesses occur in a specific order, then volatile is a suitable tool. Note that it can be overkill, and will lead to a separate load every time a volatile pointer is dereferenced.

    SSE/AVX load/store intrinsics can't be used with volatile pointers, because they are functions. Using something like _mm256_load_si256((volatile __m256i *)src); implicitly casts it to const __m256i*, losing the volatile qualifier.

    We can directly dereference volatile pointers, though. (load/store intrinsics are only needed when we need to tell the compiler that the data might be unaligned, or that we want a streaming store.)

    m0 = ((volatile __m256i *)src)[0];
    m1 = ((volatile __m256i *)src)[1];
    m2 = ((volatile __m256i *)src)[2];
    m3 = ((volatile __m256i *)src)[3];
    

    Unfortunately this doesn't help with the stores, because we want to emit streaming stores. A *(volatile...)dst = tmp; won't give us what we want (see the short sketch after this list).

  • __asm__ __volatile__ (""); as a compiler reordering barrier.

    This is the GNU C way of writing a compiler memory barrier. It stops the compiler from reordering memory accesses across this statement, without emitting an actual barrier instruction such as mfence.

  • Using an index limit for loop structures.

    GCC is known for pretty poor register usage. Earlier versions made a lot of unnecessary moves between registers, although that is pretty minimal nowadays. However, testing on x86-64 across many versions of GCC indicates that in loops, it is better to use an index limit rather than an independent loop variable, for best results.
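
To make the store limitation from the first point concrete, here is a minimal sketch (the function names are mine, for illustration only): the plain volatile store keeps the compiler from reordering or eliding it, but it compiles to an ordinary temporal store (vmovdqa); only the intrinsic yields the non-temporal vmovntdq we are after.

#include <immintrin.h>

/* Ordered, but an ordinary cached store: compiles to vmovdqa. */
void store_volatile(__m256i *dst, __m256i v)
{
    *(volatile __m256i *)dst = v;
}

/* The streaming (non-temporal) store we actually want: compiles to vmovntdq. */
void store_streaming(__m256i *dst, __m256i v)
{
    _mm256_stream_si256(dst, v);
}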

Combining all the above, I constructed the following function (after a few iterations):

#include <stdlib.h>
#include <immintrin.h>

#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

void copy(void *const destination, const void *const source, const size_t bytes)
{
    __m256i       *dst = (__m256i *)destination;
    const __m256i *src = (const __m256i *)source;
    const __m256i *end = (const __m256i *)source + bytes / sizeof (__m256i);

    while (likely(src < end)) {
        const __m256i m0 = ((volatile const __m256i *)src)[0];
        const __m256i m1 = ((volatile const __m256i *)src)[1];
        const __m256i m2 = ((volatile const __m256i *)src)[2];
        const __m256i m3 = ((volatile const __m256i *)src)[3];

        _mm256_stream_si256( dst,     m0 );
        _mm256_stream_si256( dst + 1, m1 );
        _mm256_stream_si256( dst + 2, m2 );
        _mm256_stream_si256( dst + 3, m3 );

        __asm__ __volatile__ ("");

        src += 4;
        dst += 4;
    }
}
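
For clarity, a possible usage sketch (not part of example.c; the caller is entirely hypothetical). copy() assumes both buffers are 32-byte aligned and, since each iteration moves 128 bytes, that the byte count is a multiple of 128:

#include <stdlib.h>
#include <string.h>

void copy(void *destination, const void *source, size_t bytes);

int main(void)
{
    const size_t bytes = 1 << 20;           /* 1 MiB, a multiple of 128 */
    void *src = aligned_alloc(32, bytes);   /* C11; posix_memalign works as well */
    void *dst = aligned_alloc(32, bytes);   /* stand-in for the DMA buffer */

    if (!src || !dst)
        return 1;

    memset(src, 0x5a, bytes);
    copy(dst, src, bytes);

    free(dst);
    free(src);
    return 0;
}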

Compiling it (example.c) with GCC-4.8.4 using

gcc -std=c99 -mavx2 -march=x86-64 -mtune=generic -O2 -S example.c

yields (example.s):

        .file   "example.c"
        .text
        .p2align 4,,15
        .globl  copy
        .type   copy, @function
copy:
.LFB993:
        .cfi_startproc
        andq    $-32, %rdx
        leaq    (%rsi,%rdx), %rcx
        cmpq    %rcx, %rsi
        jnb     .L5
        movq    %rsi, %rax
        movq    %rdi, %rdx
        .p2align 4,,10
        .p2align 3
.L4:
        vmovdqa (%rax), %ymm3
        vmovdqa 32(%rax), %ymm2
        vmovdqa 64(%rax), %ymm1
        vmovdqa 96(%rax), %ymm0
        vmovntdq        %ymm3, (%rdx)
        vmovntdq        %ymm2, 32(%rdx)
        vmovntdq        %ymm1, 64(%rdx)
        vmovntdq        %ymm0, 96(%rdx)
        subq    $-128, %rax
        subq    $-128, %rdx
        cmpq    %rax, %rcx
        ja      .L4
        vzeroupper
.L5:
        ret
        .cfi_endproc
.LFE993:
        .size   copy, .-copy
        .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
        .section        .note.GNU-stack,"",@progbits

The disassembly of the actual compiled (-c instead of -S) code is

0000000000000000 <copy>:
   0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
   4:   48 8d 0c 16             lea    (%rsi,%rdx,1),%rcx
   8:   48 39 ce                cmp    %rcx,%rsi
   b:   73 41                   jae    4e <copy+0x4e>
   d:   48 89 f0                mov    %rsi,%rax
  10:   48 89 fa                mov    %rdi,%rdx
  13:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  18:   c5 fd 6f 18             vmovdqa (%rax),%ymm3
  1c:   c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
  21:   c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
  26:   c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
  2b:   c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
  2f:   c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
  34:   c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
  39:   c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
  3e:   48 83 e8 80             sub    $0xffffffffffffff80,%rax
  42:   48 83 ea 80             sub    $0xffffffffffffff80,%rdx
  46:   48 39 c1                cmp    %rax,%rcx
  49:   77 cd                   ja     18 <copy+0x18>
  4b:   c5 f8 77                vzeroupper 
  4e:   c3                      retq

Without any optimizations at all, the code is completely disgusting, full of unnecessary moves, so some optimization is necessary. (The above uses -O2, which is generally the optimization level I use.)

If optimizing for size (-Os), the code looks excellent at first glance,

0000000000000000 <copy>:
   0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
   4:   48 01 f2                add    %rsi,%rdx
   7:   48 39 d6                cmp    %rdx,%rsi
   a:   73 30                   jae    3c <copy+0x3c>
   c:   c5 fd 6f 1e             vmovdqa (%rsi),%ymm3
  10:   c5 fd 6f 56 20          vmovdqa 0x20(%rsi),%ymm2
  15:   c5 fd 6f 4e 40          vmovdqa 0x40(%rsi),%ymm1
  1a:   c5 fd 6f 46 60          vmovdqa 0x60(%rsi),%ymm0
  1f:   c5 fd e7 1f             vmovntdq %ymm3,(%rdi)
  23:   c5 fd e7 57 20          vmovntdq %ymm2,0x20(%rdi)
  28:   c5 fd e7 4f 40          vmovntdq %ymm1,0x40(%rdi)
  2d:   c5 fd e7 47 60          vmovntdq %ymm0,0x60(%rdi)
  32:   48 83 ee 80             sub    $0xffffffffffffff80,%rsi
  36:   48 83 ef 80             sub    $0xffffffffffffff80,%rdi
  3a:   eb cb                   jmp    7 <copy+0x7>
  3c:   c3                      retq

until you notice that the last jmp is to the comparison, essentially doing a jmp, cmp, and a jae at every iteration, which probably yields pretty poor results.

Note: If you do something similar for real-world code, please do add comments (especially for the __asm__ __volatile__ ("");), and remember to periodically check with all compilers available, to make sure the code is not compiled too badly by any.


Looking at Peter Cordes' excellent answer, I decided to iterate the function a bit further, just for fun.

As Ross Ridge mentions in the comments, when using _mm256_load_si256() the pointer is not dereferenced (prior to being re-cast to aligned __m256i * as a parameter to the function), thus volatile won't help when using _mm256_load_si256(). In another comment, Seb suggests a workaround: _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) }), which supplies the function with a pointer to src by accessing the element via a volatile pointer and casting it to an array. For a simple aligned load, I prefer the direct volatile pointer; it matches my intent in the code. (I do aim for KISS, although often I hit only the stupid part of it.)
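
For illustration, a minimal sketch (mine, not taken from the comments) of how Seb's workaround would look applied to the loads, with src being the const __m256i * from copy() above:

const __m256i m0 = _mm256_load_si256((__m256i []){ *(volatile const __m256i *)(src + 0) });
const __m256i m1 = _mm256_load_si256((__m256i []){ *(volatile const __m256i *)(src + 1) });
const __m256i m2 = _mm256_load_si256((__m256i []){ *(volatile const __m256i *)(src + 2) });
const __m256i m3 = _mm256_load_si256((__m256i []){ *(volatile const __m256i *)(src + 3) });

Each compound literal is an unnamed one-element array whose initializer does the volatile read, so the intrinsic still receives a plain __m256i * argument.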

On x86-64, the start of the inner loop is aligned to 16 bytes, so the number of operations in the function "header" part is not really important. Still, avoiding the superfluous binary AND (masking the five least significant bits of the amount to copy in bytes) is certainly useful in general.

GCC provides two options for this. One is the __builtin_assume_aligned() built-in, which allows a programmer to convey all sorts of alignment information to the compiler. The other is typedef'ing a type that has extra attributes, here __attribute__((aligned (32))), which can be used to convey the alignedness of function parameters for example. Both of these should be available in clang (although support is recent, not in 3.5 yet), and may be available in others such as icc (although ICC, AFAIK, uses __assume_aligned()).
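
For reference, a minimal sketch of the two approaches in isolation (the function names are mine; another.c below uses both together):

#include <immintrin.h>

/* Option 1: assert the alignment inside the function with the built-in. */
void stream_one_hinted(void *dst, const void *src)
{
    __m256i       *d = __builtin_assume_aligned(dst, 32);
    const __m256i *s = __builtin_assume_aligned(src, 32);
    _mm256_stream_si256(d, *s);
}

/* Option 2: bake the alignment into the parameter types via a typedef. */
typedef __m256i __m256i_aligned __attribute__((aligned (32)));

void stream_one_typed(__m256i_aligned *dst, const __m256i_aligned *src)
{
    _mm256_stream_si256(dst, *src);
}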

One way to mitigate the register shuffling GCC does is to use a helper function. After some further iterations, I arrived at this, another.c:

#include <stdlib.h>
#include <immintrin.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

#if (__clang_major__+0 >= 3)
#define IS_ALIGNED(x, n) ((void *)(x))
#elif (__GNUC__+0 >= 4)
#define IS_ALIGNED(x, n) __builtin_assume_aligned((x), (n))
#else
#define IS_ALIGNED(x, n) ((void *)(x))
#endif

typedef __m256i __m256i_aligned __attribute__((aligned (32)));


void do_copy(register          __m256i_aligned *dst,
             register volatile __m256i_aligned *src,
             register          __m256i_aligned *end)
{
    do {
        register const __m256i m0 = src[0];
        register const __m256i m1 = src[1];
        register const __m256i m2 = src[2];
        register const __m256i m3 = src[3];

        __asm__ __volatile__ ("");

        _mm256_stream_si256( dst,     m0 );
        _mm256_stream_si256( dst + 1, m1 );
        _mm256_stream_si256( dst + 2, m2 );
        _mm256_stream_si256( dst + 3, m3 );

        __asm__ __volatile__ ("");

        src += 4;
        dst += 4;

    } while (likely(src < end));
}

void copy(void *dst, const void *src, const size_t bytes)
{
    if (bytes < 128)
        return;

    do_copy(IS_ALIGNED(dst, 32),
            IS_ALIGNED(src, 32),
            IS_ALIGNED((void *)((char *)src + bytes), 32));
}

which compiles with gcc -march=x86-64 -mtune=generic -mavx2 -O2 -S another.c to essentially (comments and directives omitted for brevity):

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        vzeroupper
        ret

copy:
        cmpq     $127, %rdx
        ja       .L8
        rep ret
.L8:
        addq     %rsi, %rdx
        jmp      do_copy

Further optimization at -O3 just inlines the helper function,

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        vzeroupper
        ret

copy:
        cmpq     $127, %rdx
        ja       .L10
        rep ret
.L10:
        leaq     (%rsi,%rdx), %rax
.L8:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rsi, %rax
        ja       .L8
        vzeroupper
        ret

and even with -Os the generated code is very nice,

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        ret

copy:
        cmpq     $127, %rdx
        jbe      .L5
        addq     %rsi, %rdx
        jmp      do_copy
.L5:
        ret

Of course, without optimizations GCC-4.8.4 still produces pretty bad code. With clang-3.5 -march=x86-64 -mtune=generic -mavx2 -O2 and -Os we get essentially

do_copy:
.LBB0_1:
        vmovaps  (%rsi), %ymm0
        vmovaps  32(%rsi), %ymm1
        vmovaps  64(%rsi), %ymm2
        vmovaps  96(%rsi), %ymm3
        vmovntps %ymm0, (%rdi)
        vmovntps %ymm1, 32(%rdi)
        vmovntps %ymm2, 64(%rdi)
        vmovntps %ymm3, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .LBB0_1
        vzeroupper
        retq

copy:
        cmpq     $128, %rdx
        jb       .LBB1_3
        addq     %rsi, %rdx
.LBB1_2:
        vmovaps  (%rsi), %ymm0
        vmovaps  32(%rsi), %ymm1
        vmovaps  64(%rsi), %ymm2
        vmovaps  96(%rsi), %ymm3
        vmovntps %ymm0, (%rdi)
        vmovntps %ymm1, 32(%rdi)
        vmovntps %ymm2, 64(%rdi)
        vmovntps %ymm3, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .LBB1_2
.LBB1_3:
        vzeroupper
        retq

I like the another.c code (it suits my coding style), and I'm happy with the code generated by GCC-4.8.4 and clang-3.5 at -O1, -O2, -O3, and -Os on both, so I think it is good enough for me. (Note, however, that I haven't actually benchmarked any of this, because I don't have the relevant code. We use both temporal and non-temporal (nt) memory accesses, and cache behaviour (and cache interaction with the surrounding code) is paramount for things like this, so it would make no sense to microbenchmark this, I think.)
