Dependency chain analysis


Question




From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:

[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.

There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining makes it possible that each calculation can start before the preceding calculation is finished. The only loop-carried dependency chain is add eax,16 which has a latency of only 1.
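The arithmetic in this quoted paragraph can be put into a toy model (my own C sketch, not from the guide): each iteration carries an independent 13-cycle chain, but chain starts are gated only by the 4.33c-per-iteration issue rate, so in steady state results also complete every 4.33c.

```c
/* Toy overlap model: iterations issue every ISSUE cycles and each runs
   its own CHAIN-cycle dependency chain (load 2 + mulpd 5 + subpd 3 +
   store 3 = 13c). Because no iteration's chain feeds the next one,
   completion times advance by ISSUE, not by CHAIN. */
#define ISSUE (13.0 / 3.0)   /* 13 uops at 3 per clock = 4.33c/iteration */
#define CHAIN 13.0

double completion_cycle(int i)
{
    /* Iteration i can start as soon as the front end issues it, at
       i * ISSUE; its own chain then takes CHAIN further cycles. */
    return i * ISSUE + CHAIN;
}
```

The steady-state gap `completion_cycle(i+1) - completion_cycle(i)` is ISSUE (about 4.33c), even though each individual result takes 13c end to end.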

; Example 12.6b.  DAXPY algorithm, 32-bit mode
[...]   ; not shown: initialize some regs before the loop
L1:
    movapd xmm1, [esi+eax]   ; X[i], X[i+1]
    mulpd  xmm1, xmm2        ; X[i] * DA, X[i+1] * DA
    movapd xmm0, [edi+eax]   ; Y[i], Y[i+1]
    subpd  xmm0, xmm1        ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
    movapd [edi+eax], xmm0   ; Store result
    add eax, 16              ; Add size of two elements to index
    cmp eax, ecx             ; Compare with n*8
    jl L1                    ; Loop back
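For reference, here is a scalar C sketch of what the loop computes (my own rendering, not from the guide; the assembly does the same work two doubles per iteration with SSE2):

```c
#include <stddef.h>

/* Scalar sketch of the DAXPY-style loop above: y[i] = y[i] - x[i]*da.
   Each iteration is a load, multiply, subtract, and store, mirroring
   movapd / mulpd / subpd / movapd in the assembly, which processes
   two doubles at a time. */
void daxpy_sub(double *y, const double *x, double da, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] - x[i] * da;
}
```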

I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.

  1. I see that the author explains that it is connected with out-of-order execution and pipelining, but I cannot see how. After all, only the multiplication has a latency (5 cycles) greater than the 4.33-cycle retirement time.

  2. I also cannot understand why the author doesn't care about the dependency here: add eax, 16 -> cmp eax, ecx -> jl L1. After all, the addition must execute before the cmp, and the cmp must execute before the jl.


PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core 2, FMA4 Bulldozer, and Sandy Bridge.

Solution

  1. The mulpd isn't part of a loop-carried dependency chain, so there can be mulpd insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all; what matters is dependency chains. Each iteration has its own independent 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.

  2. The cmp / jl in each iteration depend on the add from that iteration, but the add in the next iteration doesn't depend on the cmp. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before the jl from the preceding iteration retires.

    By comparison, cmov is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.

    Each loop iteration has a separate cmp/jl dependency chain, just like the FP dependency chain.
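To make the cmov point concrete, here is a hypothetical C sketch (mine, not from the answer above): a branchless min-reduction typically compiles the conditional select to cmp + cmov, so each iteration's result is a data input to the next select, forming a loop-carried dependency chain with no branch to predict.

```c
#include <stddef.h>

/* Branchless min-reduction: the conditional select usually compiles to
   cmp + cmov, so each new value of m data-depends on the previous m.
   That is a loop-carried chain of one compare+select per element; a
   well-predicted branch would be only a control dependency instead. */
int min_branchless(const int *a, size_t n)
{
    int m = a[0];
    for (size_t i = 1; i < n; i++)
        m = (a[i] < m) ? a[i] : m;   /* next iteration reads this m */
    return m;
}
```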


I cannot understand why dependency chain doesn't increase a whole throughput.

I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (e.g. "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.
