C code loop performance


Problem description


I have a multiply-add kernel inside my application and I want to increase its performance.

I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:

 for(int i=0; i<iterations; i+=4) {
    y1 = _mm_set_ss(output[i]);
    y2 = _mm_set_ss(output[i+1]);
    y3 = _mm_set_ss(output[i+2]);
    y4 = _mm_set_ss(output[i+3]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ss(weight[i+k+l]);

            x1 = _mm_set_ss(input[i+k+l]);
            y1 = _mm_add_ss(y1,_mm_mul_ss(w,x1));
            …
            x4 = _mm_set_ss(input[i+k+l+3]);
            y4 = _mm_add_ss(y4,_mm_mul_ss(w,x4));
        }
    }
    _mm_store_ss(&output[i],y1);
    _mm_store_ss(&output[i+1],y2);
    _mm_store_ss(&output[i+2],y3);
    _mm_store_ss(&output[i+3],y4);
 }

I know I can use packed FP vectors to increase the performance, and I already did so successfully, but I want to know why the single scalar code isn't able to meet the processor's peak performance.
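For context, a packed version of the multiply-add might look like the following sketch. This is my own simplification (a flat loop with unaligned loads, not the poster's actual nested kernel; `madd_ps` and its arguments are illustrative names), just to show the packed intrinsics that replace the `_ss` scalar ones:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed multiply-accumulate sketch: out[i] += w[i] * in[i].
   Illustrative only; assumes n is handled in groups of 4. */
void madd_ps(float *out, const float *in, const float *w, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 x  = _mm_loadu_ps(&in[i]);   /* 4 inputs per load  */
        __m128 wv = _mm_loadu_ps(&w[i]);    /* 4 weights per load */
        __m128 y  = _mm_loadu_ps(&out[i]);  /* 4 accumulators     */
        y = _mm_add_ps(y, _mm_mul_ps(wv, x));
        _mm_storeu_ps(&out[i], y);
    }
}
```

Each `_mm_mul_ps`/`_mm_add_ps` pair does four multiplies and four adds per instruction, which is why the packed route raises throughput even when the scalar version is already dual-issuing add and mul.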

The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).

If I'm reading the generated assembly code correctly, the ideal schedule would look as follows, where the mov instruction takes 3 cycles, the switch latency from the load domain to the FP domain for the dependent instructions takes 2 cycles, the FP multiply takes 4 cycles and the FP add takes 3 cycles. (Note that the dependence from the multiply -> add doesn't incur any switch latency because the operations belong to the same domain.)

According to the measured performance (~80% of the maximum theoretical performance) there is an overhead of ~3 instructions per 8 cycles.

I am trying to either:

  • get rid of this overhead, or
  • explain where it comes from

Of course there are the issues of cache misses and data misalignment, which can increase the latency of the move instructions, but are there any other factors that could play a role here? Like register read stalls or something?

I hope my problem is clear, thanks in advance for your responses!


Update: The assembly of the inner loop looks as follows:

...
Block 21: 
  movssl  (%rsi,%rdi,4), %xmm4 
  movssl  (%rcx,%rdi,4), %xmm0 
  movssl  0x4(%rcx,%rdi,4), %xmm1 
  movssl  0x8(%rcx,%rdi,4), %xmm2 
  movssl  0xc(%rcx,%rdi,4), %xmm3 
  inc %rdi 
  mulss %xmm4, %xmm0 
  cmp $0x32, %rdi 
  mulss %xmm4, %xmm1 
  mulss %xmm4, %xmm2 
  mulss %xmm3, %xmm4 
  addss %xmm0, %xmm5 
  addss %xmm1, %xmm6 
  addss %xmm2, %xmm7 
  addss %xmm4, %xmm8 
  jl 0x401b52 <Block 21> 
...

Solution

I noticed in the comments that:

  • The loop takes 5 cycles to execute.
  • It's "supposed" to take 4 cycles. (since there are 4 adds and 4 multiplies)

However, your assembly shows 5 SSE movssl instructions. According to Agner Fog's tables, all floating-point SSE move instructions have a reciprocal throughput of at least 1 cycle per instruction on Nehalem.

Since you have 5 of them, you can't do better than 5 cycles/iteration (and 8 FP operations in 5 cycles is exactly the measured ~1.6 FP ops/cycle).


So in order to get to peak performance, you need to reduce the number of loads you have. How you can do that, I can't immediately see in this particular case, but it might be possible.

One common approach is tiling, where you add nesting levels to improve locality. Although it's mostly used to improve cache access, it can also be applied at the register level to reduce the number of loads/stores that are needed.

Ultimately, your goal is to reduce the number of loads below the number of adds/muls. So this might be the way to go.
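As a sketch of the register-tiling idea (my own illustrative 1-D shape, not the poster's kernel; `conv_tiled` is a hypothetical name), unrolling over both outputs and weights lets the overlapping input values be reused from registers: 5 loads then feed 8 multiply-adds, instead of the roughly 5 loads per 4 multiply-adds of the untiled loop.

```c
/* Register-tiled 1-D multiply-accumulate sketch (hypothetical shape).
   Unrolling 2 outputs x 2 weights lets x1 and x2 be shared between
   the two accumulators, cutting loads per multiply-add. */
void conv_tiled(float *out, const float *in, const float *w, int n, int k)
{
    /* assumes n and k are even and in[] is valid up to in[n - 1 + k] */
    for (int i = 0; i + 2 <= n; i += 2) {
        float y0 = out[i], y1 = out[i + 1];  /* accumulators stay in registers */
        for (int j = 0; j < k; j += 2) {
            float w0 = w[j],      w1 = w[j + 1];
            float x0 = in[i + j], x1 = in[i + j + 1];
            float x2 = in[i + j + 2];        /* x1, x2 each used twice below */
            y0 += w0 * x0 + w1 * x1;
            y1 += w0 * x1 + w1 * x2;
        }
        out[i]     = y0;
        out[i + 1] = y1;
    }
}
```

Per inner iteration this does 5 loads (2 weights, 3 inputs) against 8 FP operations, so loads are no longer the bottleneck relative to the add/mul ports; widening the tile further reduces the ratio more, at the cost of register pressure.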
