What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
Problem description
I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
- As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
- An instruction can be issued on a given cycle if:
  - No instructions that affect its operands are still being executed. And:
  - If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
  - There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. `addps` and `subps` use the same functional unit? How do I determine this?). And:
  - Fewer than the superscalar width (typically `4`) number of instructions have already been issued this cycle.
As an example, consider the following code (which computes a cross product):
```
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps  xmm0, xmm3
shufps xmm1, xmm1, 210
mulps  xmm1, xmm2
subps  xmm0, xmm1
```
My attempt to predict the latency for Haswell looks something like this:
```
; `mulps`  Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps`  Haswell latency=3, CPI=1

shufps xmm3, xmm2, 210   ; cycle 1
shufps xmm0, xmm1, 201   ; cycle 2
shufps xmm2, xmm2, 201   ; cycle 3
mulps  xmm0, xmm3        ;   (superscalar execution)
shufps xmm1, xmm1, 210   ; cycle 4
mulps  xmm1, xmm2        ; cycle 5
                         ; cycle 6 (stall `xmm0` and `xmm1`)
                         ; cycle 7 (stall `xmm1`)
                         ; cycle 8 (stall `xmm1`)
subps  xmm0, xmm1        ; cycle 9
                         ; cycle 10 (stall `xmm0`)
```
Recommended answer
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and what that means for sequences of multiple instructions.
This is called static (performance) analysis. Wikipedia (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) says that AMD's CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for `shld` on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
- "issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (Later uarches like Haswell and especially Skylake often come very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, plus back-end and cache bandwidth improvements.) Issue counts are in fused-domain uops: micro-fusion lets you send 2 uops through the front-end that take up only one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock.) See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
- "dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. (How are x86 uops scheduled, exactly?) Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and in the names of hardware performance counters like `uops_issued.any` or `uops_dispatched_port.port_5`.
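To make the issue/dispatch (fused vs. unfused domain) distinction concrete, here's a minimal sketch; the uop counts are my reading of Agner Fog's tables for Haswell, so treat the exact numbers as an assumption:

```
add  eax, [rdi]        ; 1 fused-domain uop at issue (one ROB entry), but
                       ; 2 unfused-domain uops at dispatch: a load (p2/p3)
                       ; plus an ALU add (p0/p1/p5/p6)
add  eax, [rdi + rsi]  ; indexed addressing: stays micro-fused on Haswell/
                       ; Skylake for RMW instructions like add, but
                       ; un-laminates into 2 fused-domain uops on Sandybridge
```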
> exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec

Your final `subps` result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
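As a sketch of what that looks like (the register allocation for the second iteration is hypothetical), two independent cross-product chains can interleave, so one chain's latency is hidden by the other's work:

```
mulps  xmm0, xmm3      ; iteration A
mulps  xmm4, xmm7      ; iteration B: independent, can dispatch while A's
                       ; multiplies are still in flight
mulps  xmm1, xmm2      ; iteration A
mulps  xmm5, xmm6      ; iteration B
subps  xmm0, xmm1      ; finish A
subps  xmm4, xmm5      ; finish B: neither subps waits on the other chain
```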
You can approximately characterize a short block of non-branching code in terms of these three factors (a worked count for the cross-product block follows the list below). Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple of different blocks, if they're not so long that OoO window size prevents finding all the ILP.
- latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
- total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
- uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
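For example, applied to the cross-product block above, a hand count might look like the following (port assignments from Agner Fog's Haswell tables; treat the exact split as an estimate):

```
; latency:   shufps (1c) -> mulps (5c) -> subps (3c) = 9 cycles from the
;            last-ready input to the subps result
; front-end: 7 single-uop instructions = 7 fused-domain uops
;            -> about 7/4 = 1.75 cycles of issue bandwidth
; back-end:  p5:    4x shufps           -> 4 cycles (the likely throughput
;                                          bottleneck if this block repeats)
;            p0/p1: 2x mulps + 1x subps -> ~1.5 cycles per port
```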
Generally you can assume best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen sometimes. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run `psadbw` on port 0 (5c latency, 1c throughput, i.e. CPI=1), but it's a single uop so a mix of 1 `psadbw` + 3 `add` instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
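A sketch of such a mix (the integer registers and surrounding loop are hypothetical):

```
psadbw xmm0, xmm1      ; p0 only on Haswell (5c latency, 1c throughput)
add    eax, 1          ; scalar adds can go to p1/p5/p6 while p0 is busy,
add    ebx, 1          ; so all four instructions can be dispatched in the
add    ecx, 1          ; same cycle: 4 instructions per clock sustained
```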
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like `imul eax, ecx, 10` (1 uop, 3c latency on Intel) vs. `lea eax, [rcx + rcx*4]` / `add eax, eax` (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.

See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
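Side by side (latencies as quoted above for Intel SnB-family):

```
; Strategy 1: fewer uops, longer latency
imul eax, ecx, 10          ; 1 uop, 3c latency

; Strategy 2: more uops, shorter latency
lea  eax, [rcx + rcx*4]    ; eax = ecx*5   (1 uop, 1c)
add  eax, eax              ; eax = ecx*10  (1 uop, 1c) => 2 uops, 2c total
```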
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
> Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one `divps` mixed in with dozens of `mulps` / `addps`, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. `rcpps` + a Newton iteration is worse throughput and about the same latency.)
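For instance (a sketch; how many independent multiplies it takes to fully hide one divide depends on the uarch):

```
divps xmm0, xmm1       ; occupies the partially-pipelined divider for a while
mulps xmm2, xmm3       ; independent multiplies keep flowing through p0/p1,
mulps xmm4, xmm5       ; so one divide among many muls barely dents
mulps xmm6, xmm7       ; throughput, as long as nothing needs xmm0 soon
```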
Everything else is fully pipelined on mainstream Intel CPUs; multi-cycle (reciprocal) throughput for a single uop. (Variable-count integer shifts like `shl eax, cl` have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an `add` or something, you can get better throughput and latency.)
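A sketch of that flag-breaking idea (the `add` as a flag-breaker is my illustration, not a measured recipe; I'd verify any win with performance counters):

```
shl  eax, cl        ; 3 uops on SnB-family; the flag-merge uop also reads the
                    ; old FLAGS, coupling this shift to the last FLAGS writer
add  edx, 0         ; writes all of FLAGS without reading it, so the next
                    ; shift's flag-merge waits only on this cheap instruction
shl  ebx, cl
```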
imul ecx, edx
仅1 uop,但延迟为4c,吞吐量为2c.On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's
imul ecx, edx
is only 1 uop, but with 4c latency, 2c throughput.至强融核(KNL)也有一些未完全流水线的改组指令,但它往往会在前端(指令解码)而不是后端造成瓶颈,并且确实具有较小的缓冲区+ OoO exec功能隐藏后端气泡.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
> If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the Silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
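A sketch of that ordering heuristic (which chain is critical is of course workload-specific):

```
mulps xmm0, xmm1    ; critical path: a later dependent chain needs this
                    ; result, so list it first and it grabs p0/p1 as soon
                    ; as its inputs are ready
mulps xmm4, xmm5    ; independent filler work takes whichever port is left
```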
> My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. `shufps` runs on port 5, `addps` runs on p1, `mulps` runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
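A sketch of the SoA version (the `ay`/`az`/`by`/`bz` labels are hypothetical 16-byte-aligned arrays, each holding one component of 4 different vectors); one component of 4 cross products needs only multiplies and a subtract:

```
movaps xmm0, [ay]      ; y components of 4 "a" vectors
mulps  xmm0, [bz]      ; ay*bz for all 4 vectors at once
movaps xmm1, [az]
mulps  xmm1, [by]      ; az*by
subps  xmm0, xmm1      ; cx = ay*bz - az*by, with no shuffles at all
; repeat the same pattern for the cy and cz components
```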
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015), which covers the array-of-structs vs. struct-of-arrays issue for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.