使执行吞吐量最大化的最小依赖链数是多少? [英] What is the minimal number of dependency chains to maximize the execution throughput?

查看:162
本文介绍了使执行吞吐量最大化的最小依赖链数是多少?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出由真正的依赖关系链接并周期性重复的指令链,例如(a-> b-> c)->(a-> b-> c)-> ...

假定它可以分为几个较短的独立子依赖链,以从乱序执行中受益:

  • (a0-> b0-> c0)->(a0-> b0-> c0)-> ...
  • (a1-> b1-> c1)->(a1-> b1-> c1)-> ...

乱序引擎将每条指令调度到具有延迟和互惠吞吐量的相应CPU单元.

最大化执行吞吐量的子依赖链的最佳数量是多少?

根据Agner的手册使用汇编语言优化子例程,第12.15节:最佳如果CPU没有其他事情可做,则累加器的数量是依赖性链中最关键指令的等待时间除以该指令的倒数吞吐量. 最关键的指令"是什么意思?是否有其他技术文档可以解决此类问题?

解决方案

这取决于它们有多长时间,以及每个周期可以独立运行多少微秒.

这还取决于硬件的宽度.例如

  • 具有两个ALU执行单元和每个时钟3个融合域uop吞吐量的PIII与
  • 具有四个ALU执行单元(其中只有三个可以处理向量),每个时钟有四个融合域uop吞吐量.

我认为最关键的指令"是指构成关键路径大部分长度的指令.如果循环承载的依赖链由具有不同延迟的多个指令组成,则它是某种平均值. (也许是几何平均值?)


一个很好的例子是FP add(例如,求和一个数组):

在Sandybridge上,它具有每秒时钟吞吐量,但延迟为3c,因此,每条3c单链相关的addps指令以一个uop运行,仅维持最大FP乘法吞吐量的1/3. (并且不占用其他两个执行端口.)

三个并行的dep链可以使addps指令保持端口1饱和.因此,如果使用三个累加器,则可以使三个累加器处于运行状态.如果还让5个FP乘以飞行,则也可以使port0饱和.循环开销可以在端口5上运行(希望不会从p01窃取周期).负载微指令可以与加法微融合,因此它们不会占用融合域带宽.但是,您可以使用单独的movaps指令来执行某些负载,并且仍然不会使每个时钟的融合域uop吞吐量达到4,但是前端的瓶颈可能会限制您的吞吐量.


对于FP添加,Haswell的每个时钟吞吐量仍然只有一个,但是对于FP mul和FMA,每个时钟只有两个.

因此,如果使用FMA(乘数为1.0)对一个数组求和,则需要10个矢量累加器(10个深度链)来保持10个FMA处于飞行状态,从而使p01饱和. p5和p6不再使用,但您也可以用微熔丝负载使负载端口饱和.


Skylake将FMA的延迟减少了1个周期,降至4个,并删除了FP添加单元. (因此,FP添加是在FMA单元中完成的,这使[v]addps的吞吐量增加了一倍,但延迟增加了1c.)

因此,在SKL上,您仅需要8个矢量累加器(8个dep链)即可饱和p01.但是,只要您不用完寄存器,拥有更多的dep链不会有什么坏处.因此,使用10个累加器在Haswell上理想的代码在SKL上仍然是理想的.不过,您可以使用常数向量为1.0的addps而不是fma213ps(或其他任何东西)来节省一些电量.


有关吞吐量/延迟/端口号的信息,请参见Agner的说明表;有关详细信息,请参见他的microarch PDF.我没有检查端口号或延迟号,但我经常键入此示例,以至于我确定它是正确的:P.

另请参见标记的其他链接Wiki.

Given a chain of instructions linked by true dependencies and repeated periodically (i.e. a loop), for example (a->b->c)->(a->b->c)->...

Assuming that it can be split into several shorter and independent sub-dependency chains to benefit from out-of-order execution :

  • (a0->b0->c0)->(a0->b0->c0)->...
  • (a1->b1->c1)->(a1->b1->c1)->...

The out-of-order engine schedules each instruction to the corresponding CPU unit which have a latency and a reciprocal throughput.

What is the optimal number of sub-dependency chains maximizing the execution throughput ?

According to Agner's manual Optimizing subroutines in assembly language, Section 12.15 : "The optimal number of accumulators if the CPU has nothing else to do is the latency of the most critical instruction in the dependency chain divided by the reciprocal throughput for that instruction". What does "most critical instruction" mean ? Is there any other technical documentation tackling this kind of problem ?

解决方案

It depends how long they are, and how many uops per cycle each one can run on its own.

It also depends on how wide the hardware is. e.g.

  • PIII with two ALU execution units and 3 per clock fused-domain uop throughput vs.
  • Haswell with four ALU execution units (only three of which can handle vectors), and 4 per clock fused-domain uop throughput.

I think "most critical instruction" means the one that makes up most of the length of the critical path. If a loop-carried dependency chain is composed of multiple instructions with different latencies, it's some sort of average. (Like maybe a geometric average?)


A good example is FP add (e.g. summing an array):

On Sandybridge, it has one-per-clock throughput but 3c latency, so a single chain of dependent addps instructions runs at one uop per 3c, sustaining only 1/3rd of the max FP multiply throughput. (And leaving the other two execution ports totally unoccupied.)

Three parallel dep chains can keep port1 saturated with addps instructions. So if you use three accumulators, you can keep three adds in flight. If you also keep 5 FP multiplies in flight, you can saturate port0 as well. Loop overhead can run on port5 (and hopefully not steal cycles from p01). The load uops can micro-fuse with the adds, so they don't take up fused-domain bandwidth. But you could do some of the loads with separate movaps instructions and still not saturate the 4 per clock fused-domain uop throughput, but bottlenecks in the frontend might limit you to less throughput.


Haswell still only has one per clock throughput for FP add, but two per clock for FP mul and FMA.

So if you sum an array using FMA (with a multiplier of 1.0), you need 10 vector accumulators (10 dep chains) to keep 10 FMAs in flight, saturating p01. p5 and p6 go unused, but you can also saturate the load ports with micro-fused loads.


Skylake reduced the latency of FMA by 1 cycle, down to 4, and dropped the FP add unit. (So FP add is done in the FMA unit, which doubled the throughput for [v]addps, at the cost of 1c more latency).

So on SKL you only need 8 vector accumulators (8 dep chains) to saturate p01. But having more dep chains doesn't hurt, as long as you don't run out of registers. So code that's ideal on Haswell, using 10 accumulators, should still be ideal on SKL. You could maybe save some power by just using addps instead of fma213ps (or whatever) with a constant vector of 1.0, though.


See Agner's instruction tables for throughput/latency/port numbers, and his microarch PDF for more details. I didn't check the port numbers or latency numbers, but I've typed this example so often that I'm pretty sure it's correct :P.

Also see other links in the tag wiki.

这篇关于使执行吞吐量最大化的最小依赖链数是多少?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆