Produce loops without cmp instruction in GCC
Question
I have a number of tight loops I'm trying to optimize with GCC and intrinsics. Consider for example the following function.
#include <immintrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=0; i<n; i+=8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}
This produces a main loop like this
20: vmulps ymm0,ymm1,[rsi+rax*1]
25: vaddps ymm0,ymm0,[rdi+rax*1]
2a: vmovaps [rdx+rax*1],ymm0
2f: add rax,0x20
33: cmp rax,rcx
36: jne 20
But the cmp instruction is unnecessary. Instead of having rax start at zero and finish at sizeof(float)*n, we can set the base pointers (rsi, rdi, and rdx) to the end of the arrays, set rax to -sizeof(float)*n, and then test for zero. I am able to do this with my own assembly code like this
.L2: vmulps ymm1, ymm2, [rdi+rax]
vaddps ymm0, ymm1, [rsi+rax]
vmovaps [rdx+rax], ymm0
add rax, 32
jne .L2
but I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently GCC and intrinsics have served me well, so I'm wondering if there is a compiler switch or a way to reorder/change my code so the cmp instruction is not produced with GCC.
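For reference, the same count-up-through-negative-offsets idea can be written out in plain scalar C (my own sketch, not the vectorized code from this post): the base pointers point one-past-the-end and a signed index runs from -n up to zero, so the loop-ending test can reuse the flags set by the index update instead of needing a separate cmp.

```c
#include <stddef.h>

/* Scalar sketch of the negative-index idiom (illustrative only).
 * xe/ye/ze point at the ends of the arrays; i counts from -n to 0,
 * so the compiler can branch on the sign/zero flags of the add. */
void triad_neg_index(float *x, float *y, float *z, ptrdiff_t n) {
    const float k = 3.14159f;
    float *xe = x + n, *ye = y + n, *ze = z + n;
    for (ptrdiff_t i = -n; i < 0; i++) {
        ze[i] = xe[i] + k * ye[i];  /* same computation as triad() */
    }
}
```

Whether the compiler actually emits the flag-reusing branch for this still depends on its loop transformations, which is exactly the problem described above.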
I tried the following, but it still produces cmp. All variations I have tried still produce cmp.
void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x+n;
    float *y2 = y+n;
    float *z2 = z+n;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=-n; i<0; i+=8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}
Edit:
I'm interested in maximizing instruction level parallelism (ILP) for these functions for arrays which fit in the L1 cache (actually for n=2048). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).
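To make the unrolling trade-off concrete, here is a scalar sketch (my addition, not the actual assembly benchmarked below) of a 4x-unrolled triad loop: several independent element computations per iteration amortize the counter update and branch over more work. It assumes n is a multiple of 4, just as the original code assumes n is a multiple of the vector width.

```c
#include <stddef.h>

/* Illustrative 4x-unrolled scalar triad. The four statements in the
 * body are independent, so the loop overhead (add + branch) is paid
 * once per four elements instead of once per element. */
void triad_unroll4(float *x, float *y, float *z, ptrdiff_t n) {
    const float k = 3.14159f;
    for (ptrdiff_t i = 0; i < n; i += 4) {
        z[i]     = x[i]     + k * y[i];
        z[i + 1] = x[i + 1] + k * y[i + 1];
        z[i + 2] = x[i + 2] + k * y[i + 2];
        z[i + 3] = x[i + 3] + k * y[i + 3];
    }
}
```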
Edit:
Here is a table of results for a Core2 (pre-Nehalem), an IvyBridge, and a Haswell system. "intrinsic" is the result of using intrinsics, "unroll1" is my assembly code not using cmp, and "unroll16" is my assembly code unrolled 16 times. The percentages are the percentage of peak performance (frequency*num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX, and 96 for FMA).
                  SSE     AVX     FMA
intrinsic        71.3%   90.9%   53.6%
unroll1          97.0%   96.1%   63.5%
unroll16         98.6%   90.4%   93.6%
ScottD           96.5%
32B code align           95.5%
For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use cmp. On AVX I get the best result without unrolling and without using cmp. It's interesting that on IB unrolling is actually worse. On Haswell I get by far the best result by unrolling, which is why I asked this question: https://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62 — the source code to test this can be found in that question.
Edit:
Based on ScottD's answer I now get almost 97% with intrinsics for my Core2 system (pre-Nehalem, 64-bit mode). I'm not sure why the cmp actually matters, since it should take 2 clock cycles per iteration anyway. For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra cmp. On Haswell only unrolling works anyway.
Answer
How about this. Compiler is gcc 4.9.0 mingw x64:
#include <stdint.h>
#include <immintrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}
gcc -c -O3 -march=corei7 -mavx2 triad.c
0000000000000000 <triad>:
0: 44 89 c8 mov eax,r9d
3: f7 d8 neg eax
5: 48 98 cdqe
7: 48 85 c0 test rax,rax
a: 79 31 jns 3d <triad+0x3d>
c: c5 fc 28 0d 00 00 00 00 vmovaps ymm1,YMMWORD PTR [rip+0x0]
14: 4d 63 c9 movsxd r9,r9d
17: 49 c1 e1 02 shl r9,0x2
1b: 4c 01 ca add rdx,r9
1e: 4c 01 c9 add rcx,r9
21: 4d 01 c8 add r8,r9
24: c5 f4 59 04 82 vmulps ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
29: c5 fc 58 04 81 vaddps ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
2e: c4 c1 7c 29 04 80 vmovaps YMMWORD PTR [r8+rax*4],ymm0
34: 48 83 c0 08 add rax,0x8
38: 78 ea js 24 <triad+0x24>
3a: c5 f8 77 vzeroupper
3d: c3 ret
Like your hand-written code, gcc uses 5 instructions for the loop. The gcc code uses scale=4 where yours uses scale=1. I was able to get gcc to use scale=1 with a 5-instruction loop, but the C code is awkward and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.
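The answer does not show that "awkward" scale=1 variant, but one plausible way to express it (a hypothetical sketch of mine, not the answerer's actual code) is to carry the loop counter in bytes and index through char*, so each access is base + byte_offset with no implicit *4 scaling:

```c
#include <stddef.h>

/* Hypothetical scalar sketch of byte-offset (scale=1) addressing:
 * the counter i runs in bytes from -n*sizeof(float) up to zero,
 * and every access is a plain base + i with no scaling. */
void triad_scale1(float *x, float *y, float *z, ptrdiff_t n) {
    const float k = 3.14159f;
    char *xb = (char *)(x + n);
    char *yb = (char *)(y + n);
    char *zb = (char *)(z + n);
    for (ptrdiff_t i = -n * (ptrdiff_t)sizeof(float); i < 0; i += sizeof(float)) {
        *(float *)(zb + i) = *(float *)(xb + i) + k * *(float *)(yb + i);
    }
}
```

Whether a given gcc version actually keeps the 5-instruction loop for this form would have to be checked in the disassembly, as above.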