Produce loops without cmp instruction in GCC

Problem Description

I have a number of tight loops I'm trying to optimize with GCC and intrinsics. Consider for example the following function.

#include <x86intrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    /* z[i] = x[i] + k*y[i], 8 floats per AVX iteration */
    for(i=0; i<n; i+=8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}

This produces a main loop like this

20: vmulps ymm0,ymm1,[rsi+rax*1]
25: vaddps ymm0,ymm0,[rdi+rax*1]
2a: vmovaps [rdx+rax*1],ymm0
2f: add    rax,0x20
33: cmp    rax,rcx
36: jne    20 

But the cmp instruction is unnecessary. Instead of having rax start at zero and finish at sizeof(float)*n, we can set the base pointers (rsi, rdi, and rdx) to the end of the arrays, set rax to -sizeof(float)*n, and then test for zero. I am able to do this with my own assembly code like this

.L2  vmulps          ymm1, ymm2, [rdi+rax]
     vaddps          ymm0, ymm1, [rsi+rax]
     vmovaps         [rdx+rax], ymm0
     add             rax, 32
     jne             .L2

But I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently GCC and intrinsics have served me well, so I'm wondering if there is a compiler switch or a way to reorder/change my code so the cmp instruction is not produced with GCC.

I tried the following, but it still produces cmp. All variations I have tried still produce cmp.

void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x+n;
    float *y2 = y+n;
    float *z2 = z+n;    
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=-n; i<0; i+=8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}

Edit:
I'm interested in maximizing instruction level parallelism (ILP) for these functions for arrays which fit in the L1 cache (actually for n=2048). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).
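
For concreteness, here is a minimal sketch of what a 2x unroll of the intrinsics loop could look like (an illustration only, not the unroll16 assembly referred to below; it assumes n is a multiple of 16):

#include <x86intrin.h>

/* Sketch: 2x-unrolled AVX triad, assuming n is a multiple of 16. */
void triad_unroll2(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    int i;
    for(i=0; i<n; i+=16) {
        _mm256_store_ps(&z[i],   _mm256_add_ps(_mm256_load_ps(&x[i]),   _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        _mm256_store_ps(&z[i+8], _mm256_add_ps(_mm256_load_ps(&x[i+8]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+8]))));
    }
}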

Edit:
Here is a table of results for a Core2 (pre Nehalem), an IvyBridge, and a Haswell system. Intrinsic is the result of using intrinsics, unroll1 is my assembly code not using cmp, and unroll16 is my assembly code unrolled 16 times. The percentages are the percentage of the peak performance (frequency*num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX, and 96 for FMA); a worked example follows the table.

                 SSE         AVX         FMA
intrinsic      71.3%       90.9%       53.6%      
unroll1        97.0%       96.1%       63.5%
unroll16       98.6%       90.4%       93.6%
ScottD         96.5%
32B code align             95.5%
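
For example, at a (hypothetical) 3.6 GHz clock the peaks would be 3.6*24 = 86.4 GB/s for SSE, 3.6*48 = 172.8 GB/s for AVX, and 3.6*96 = 345.6 GB/s for FMA; the measured bandwidth divided by the corresponding figure gives the percentages above.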

For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use cmp. On AVX I get the best result without unrolling and without using cmp. It's interesting that on IB unrolling is actually worse. On Haswell I get by far the best result by unrolling, which is why I asked this question: https://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62. The source code to test this can be found in that question.

Edit:

Based on ScottD's answer I now get almost 97% with intrinsics for my Core2 system (pre Nehalem, 64-bit mode). I'm not sure why the cmp actually matters, since it should take 2 clock cycles per iteration anyway. For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra cmp. On Haswell only unrolling works anyway.

Recommended Answer

How about this. Compiler is gcc 4.9.0 mingw x64:

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);

    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}

gcc -c -O3 -march=corei7 -mavx2 triad.c

0000000000000000 <triad>:
   0:   44 89 c8                mov    eax,r9d
   3:   f7 d8                   neg    eax
   5:   48 98                   cdqe
   7:   48 85 c0                test   rax,rax
   a:   79 31                   jns    3d <triad+0x3d>
   c:   c5 fc 28 0d 00 00 00 00 vmovaps ymm1,YMMWORD PTR [rip+0x0]
  14:   4d 63 c9                movsxd r9,r9d
  17:   49 c1 e1 02             shl    r9,0x2
  1b:   4c 01 ca                add    rdx,r9
  1e:   4c 01 c9                add    rcx,r9
  21:   4d 01 c8                add    r8,r9

  24:   c5 f4 59 04 82          vmulps ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
  29:   c5 fc 58 04 81          vaddps ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
  2e:   c4 c1 7c 29 04 80       vmovaps YMMWORD PTR [r8+rax*4],ymm0
  34:   48 83 c0 08             add    rax,0x8
  38:   78 ea                   js     24 <triad+0x24>

  3a:   c5 f8 77                vzeroupper
  3d:   c3                      ret

Like your hand-written code, gcc is using 5 instructions for the loop. The gcc code uses scale=4 where yours uses scale=1. I was able to get gcc to use scale=1 with a 5-instruction loop, but the C code is awkward and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.
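
For reference, here is a sketch of what such a scale=1 variant could look like (a guess at the awkward C code, not ScottD's actual version): the offset is kept in bytes over char* bases so the addressing mode becomes [base+rax*1].

#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical scale=1 variant: negative byte offsets over char* bases (sketch only). */
void triad_scale1(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    char *xb = (char*)(x + n), *yb = (char*)(y + n), *zb = (char*)(z + n);
    intptr_t i;
    for(i = (intptr_t)n * -4; i < 0; i += 32) {   /* -sizeof(float)*n up to 0, 32 bytes per step */
        _mm256_store_ps((float*)(zb + i),
            _mm256_add_ps(_mm256_load_ps((float*)(xb + i)),
                          _mm256_mul_ps(k4, _mm256_load_ps((float*)(yb + i)))));
    }
}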
