依赖链分析 [英] Dependency chain analysis
问题描述
瓦格纳雾的优化大会指南,12.7节:一个循环的例子。其中有一段讨论的例子code的:
[...]分析奔腾M:。... 13微指令在3%=时钟每4.33c退休时一次迭代
有在循环依赖关系链。潜伏期为:2为
存储器读取,5为乘法,3为减法,和3存储器
写,总共13个时钟周期。这是三倍之多
退休时间,但它不是一个循环携带的依赖性,因为
从每次迭代结果保存到内存中,并在不重用
下一次迭代。在乱序执行机制,
流水线能够每个计算可以开始之前
的preceding计算完成。唯一的循环体
依赖链是添加EAX,16
其中有1只潜伏期。
块引用>##例12.6b。 DAXPY算法,32位模式
[...];未显示:循环前初始化一些暂存器
L1:
MOVAPD将xmm1,[ESI + EAX]; X [I]中,X第[i + 1]
mulpd将xmm1,XMM2; X [I] * DA,X [I + 1] * DA
MOVAPD XMM0 [EDI + EAX];值Y [i],Y [I + 1]
subpd XMM0,xmm1中; Y [I] -X [I] * DA,Y [I + 1] -X [I + 1] * DA
MOVAPD [EDI + EAX],XMM0;存储结果
添加EAX,16;添加两个要素指标的大小
CMP EAX,ECX;有n个比较* 8
JL L1;环回我不明白为什么依赖链不会增加整体吞吐量。我知道,它找到最糟糕的瓶颈是唯一重要的。考虑依存关系链之前确定最糟糕的瓶颈融合域UOP吞吐量,在每个迭代周期4.33。我不明白为什么依赖链并不比一个更大的瓶颈。
我看到作者解释说,它与乱序执行和流水线连接,但我无法看到它。我的意思是,虽然只有乘法导致延迟5个周期所以才有此值大于4周大。
我不明白为什么笔者不关心这里的依赖性:
添加EAX,16 - > CMP EAX,ECX - > JL L1
毕竟,除了必须在CMP
执行和CMP
必须在执行JL
。PS:后段列出的最大瓶颈奔腾M作为德code,它限制每6C一次迭代,因为128B矢量OPS德code至每两个微指令。见瓦格纳雾的指南分析的休息和分析+调优酷睿2,FMA4推土机和SandyBridge的。
解决方案
在MUL不是循环携带依赖链的一部分,所以不可能有
mulpd
在多次反复的insn飞行一次。单指令的等待时间是不是这里的问题可言,它是依赖的链的。每次迭代有负载,mulpd,subpd,店的的独立的13C依存关系链。乱序执行是允许多个迭代微指令是在飞行一次。的
CMP
/JL
在每次迭代取决于添加
从迭代,但添加
在下一次迭代不依赖于CMP
。推测执行和分支prediction意味着控制依赖(条件分支和间接跳转/调用)是的不的数据依赖链的一部分。这就是为什么从一个迭代的指令可以启动JL
从preceding迭代退休前运行。相比之下,
CMOV
的是的数据,而不是依赖的控制依赖,所以网点的循环往往有循环携带依赖性链。这往往比如果分支predicts分支以及要慢一些。每个循环迭代有一个单独的
CMP
/JL
依赖链,就像FP依赖链。
我不明白为什么依赖链不会增加整体吞吐量。
块引用>我不知道这句话的意思。我想我是能够找出所有其他混合单词和句式。 (例如,连锁的依赖,而不是依赖链。)有一个在我的编辑你的问题;他们中的一些可能会帮助您的理解了。
From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:
[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.
There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining makes it possible that each calculation can start before the preceding calculation is finished. The only loop-carried dependency chain is
add eax,16
which has a latency of only 1.
## Example 12.6b. DAXPY algorithm, 32-bit mode [...] ; not shown: initialize some regs before the loop L1: movapd xmm1, [esi+eax] ; X[i], X[i+1] mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA movapd xmm0, [edi+eax] ; Y[i], Y[i+1] subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA movapd [edi+eax], xmm0 ; Store result add eax, 16 ; Add size of two elements to index cmp eax, ecx ; Compare with n*8 jl L1 ; Loop back
I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.
I see that the author explains that it is connected with out-of-order execution and pipelining but I cannot see it. I mean, though, only multiplication causes latency 5 cycles so only this value is greater than 4 cycle.
I also cannot understand why the author doesn't care about the dependency here:
add eax, 16 -> cmp eax, ecx -> jl L1
After all, addition must be executed beforecmp
andcmp
must be executed beforejl
.
PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.
解决方案
the mul isn't part of a loop-carried dependency chain, so there can be
mulpd
insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all, it's the dependency chain. Each iteration has a separate 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.The
cmp
/jl
in each iteration depend on theadd
from that iteration, but theadd
in the next iteration doesn't depend on thecmp
. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before thejl
from the preceding iteration retires.By comparison,
cmov
is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.Each loop iteration has a separate
cmp
/jl
dependency chain, just like the FP dependency chain.
I cannot understand why dependency chain doesn't increase a whole throughput.
I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (e.g. "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.
这篇关于依赖链分析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!