C code loop performance
Problem description
I have a multiply-add kernel inside my application and I want to increase its performance.
I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:
for(int i=0; i<iterations; i+=4) {
y1 = _mm_set_ss(output[i]);
y2 = _mm_set_ss(output[i+1]);
y3 = _mm_set_ss(output[i+2]);
y4 = _mm_set_ss(output[i+3]);
for(k=0; k<ksize; k++){
for(l=0; l<ksize; l++){
w = _mm_set_ss(weight[i+k+l]);
x1 = _mm_set_ss(input[i+k+l]);
y1 = _mm_add_ss(y1,_mm_mul_ss(w,x1));
…
x4 = _mm_set_ss(input[i+k+l+3]);
y4 = _mm_add_ss(y4,_mm_mul_ss(w,x4));
}
}
_mm_store_ss(&output[i],y1);
_mm_store_ss(&output[i+1],y2);
_mm_store_ss(&output[i+2],y3);
_mm_store_ss(&output[i+3],y4);
}
I know I can use packed FP vectors to increase the performance, and I have already done so successfully, but I want to know why the single scalar code isn't able to meet the processor's peak performance.
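For context, a packed variant of such a kernel might look roughly like the sketch below. This is an illustration only, not the asker's actual packed code: the function name, the simplified weight indexing (weight[k*ksize+l] rather than the question's weight[i+k+l]), and the use of unaligned loads are all my own assumptions.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical packed version: each _ps instruction processes 4 adjacent
   output elements at once instead of one element per _ss instruction. */
void madd_packed(const float *input, const float *weight,
                 float *output, int n, int ksize)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 y = _mm_loadu_ps(&output[i]);   /* 4 accumulators in 1 reg */
        for (int k = 0; k < ksize; k++) {
            for (int l = 0; l < ksize; l++) {
                /* one weight broadcast to all four lanes */
                __m128 w = _mm_set1_ps(weight[k * ksize + l]);
                __m128 x = _mm_loadu_ps(&input[i + k + l]);
                y = _mm_add_ps(y, _mm_mul_ps(w, x));
            }
        }
        _mm_storeu_ps(&output[i], y);
    }
}
```

Compared with the scalar version, this performs one load per four multiply-adds on the input side, which is why the packed code gets closer to peak.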
The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).
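In absolute terms (a quick sanity check at the stated 3.2 GHz clock, one core, scalar SSE), these per-cycle rates work out as follows; note that their ratio is the ~80% of peak mentioned below:

```c
/* Absolute FLOP rates implied by the per-cycle numbers at 3.2 GHz. */
double measured_gflops(void) { return 1.6 * 3.2; } /* ~5.12 GFLOP/s achieved */
double peak_gflops(void)     { return 2.0 * 3.2; } /* 6.4 GFLOP/s scalar peak */
```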
If I'm reading the generated assembly code correctly, the ideal schedule would look as follows: the mov instruction takes 3 cycles, the switch latency from the load domain to the FP domain for a dependent instruction takes 2 cycles, the FP multiply takes 4 cycles, and the FP add takes 3 cycles. (Note that the multiply -> add dependence doesn't incur any switch latency, because both operations belong to the same domain.)
According to the measured performance (~80% of the maximum theoretical performance), there is an overhead of ~3 instructions per 8 cycles.
I am trying to either:
- get rid of this overhead, or
- explain where it comes from
Of course there are the problems of cache misses and data misalignment, which can increase the latency of the move instructions, but are there any other factors that could play a role here? Like register read stalls or something?
I hope my problem is clear, thanks in advance for your responses!
Update: The assembly of the inner-loop looks as follows:
...
Block 21:
movssl (%rsi,%rdi,4), %xmm4
movssl (%rcx,%rdi,4), %xmm0
movssl 0x4(%rcx,%rdi,4), %xmm1
movssl 0x8(%rcx,%rdi,4), %xmm2
movssl 0xc(%rcx,%rdi,4), %xmm3
inc %rdi
mulss %xmm4, %xmm0
cmp $0x32, %rdi
mulss %xmm4, %xmm1
mulss %xmm4, %xmm2
mulss %xmm3, %xmm4
addss %xmm0, %xmm5
addss %xmm1, %xmm6
addss %xmm2, %xmm7
addss %xmm4, %xmm8
jl 0x401b52 <Block 21>
...
I noticed in the comments that:
- The loop takes 5 cycles to execute.
- It's "supposed" to take 4 cycles. (Since there are 4 adds and 4 multiplies.)
However, your assembly shows 5 SSE movssl
instructions. According to Agner Fog's tables, all floating-point SSE move instructions have a reciprocal throughput of at least 1 cycle per instruction on Nehalem.
Since you have 5 of them, you can't do better than 5 cycles/iteration.
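Note that this bound reproduces the measured number exactly: each iteration retires 4 mulss + 4 addss = 8 FP operations in at best 5 cycles, i.e. 8/5 = 1.6 FP ops/cycle, which is the ~1.6 reported in the question. A trivial check of that arithmetic:

```c
/* Throughput bound implied by the loop body: 5 load-limited cycles per
   iteration, 8 FP operations (4 mulss + 4 addss) retired per iteration. */
double fp_ops_per_cycle(void)
{
    int fp_ops_per_iter = 4 + 4;   /* mulss + addss */
    int cycles_per_iter = 5;       /* one cycle per movssl */
    return (double)fp_ops_per_iter / cycles_per_iter;
}
```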
So in order to get to peak performance, you need to reduce the number of loads you have. How you can do that in this particular case I can't immediately see, but it might be possible.
One common approach is tiling, where you add nesting levels to improve locality. Although it's used mostly for improving cache access, it can also be applied at the register level to reduce the number of loads/stores that are needed.
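As a hypothetical illustration of register tiling (a simplified 1-D convolution with a single kernel loop, not the asker's actual kernel; the names and indexing are my own): four adjacent outputs are accumulated together and already-loaded input values are rotated between iterations, so the inner loop performs only 2 loads per 8 FP operations instead of 5.

```c
/* Register-tiled 1-D convolution sketch. The accumulators y0..y3 and the
   sliding window x0..x2 stay in registers; each inner iteration loads one
   weight and one new input value, feeding 4 multiplies and 4 adds. */
void conv1d_tiled(const float *input, const float *weight,
                  float *output, int n, int ksize)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        float y0 = 0.f, y1 = 0.f, y2 = 0.f, y3 = 0.f;   /* accumulators */
        float x0 = input[i], x1 = input[i + 1], x2 = input[i + 2];
        for (int k = 0; k < ksize; k++) {
            float w  = weight[k];          /* 1 load, reused 4 times  */
            float x3 = input[i + k + 3];   /* the only new input load */
            y0 += w * x0;
            y1 += w * x1;
            y2 += w * x2;
            y3 += w * x3;
            x0 = x1; x1 = x2; x2 = x3;     /* rotate the window */
        }
        output[i]     = y0;
        output[i + 1] = y1;
        output[i + 2] = y2;
        output[i + 3] = y3;
    }
}
```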
Ultimately, your goal is to reduce the number of loads to less than the number of adds/muls. So this might be the way to go.