Intel Intrinsics Guide - Latency and Throughput


Question

Can someone explain the latency and throughput values given in the Intel Intrinsics Guide?

Have I understood it correctly that the latency is the amount of time units an instruction takes to run, and the throughput is the number of instructions that can be started per time unit?

If my definition is correct, why is the latency for some instructions higher on newer CPU versions (e.g. mulps)?

Answer

Missing from that table: MULPS latency on Broadwell: 3. On Skylake: 4.

The intrinsic finder's latency is accurate in this case, although it occasionally doesn't match Agner Fog's experimental testing. (That VEXTRACTF128 latency may be a case of Intel not including a bypass delay in their table). See my answer on that linked question for more details about what to do with throughput and latency numbers, and what they mean for a modern out-of-order CPU.
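As a rough illustration of how the two numbers are used, here is a minimal sketch of a back-of-the-envelope cycle estimate. It is a deliberately simplified model, not a simulator: the function names are made up for this example, and the figure of one MULPS started per cycle is an assumption chosen for illustration, not a value taken from the table. Note also that the guide may report throughput as cycles per instruction rather than instructions per cycle, so check which way round a number is before plugging it in.

```python
import math

def dependent_chain_cycles(n_ops, latency):
    """Each op consumes the previous result, so latency is the bottleneck."""
    return n_ops * latency

def independent_ops_cycles(n_ops, latency, ops_per_cycle):
    """Independent ops are limited by how fast they can start.

    The last result is ready `latency` cycles after the last op starts.
    """
    last_start_cycle = math.ceil(n_ops / ops_per_cycle) - 1
    return last_start_cycle + latency

# 100 multiplies at 5c latency (the Sandybridge MULPS figure),
# assuming one can start per cycle (illustrative assumption).
print(dependent_chain_cycles(100, 5))      # 500
print(independent_ops_cycles(100, 5, 1))   # 104
```

The gap between the two results is why latency mostly matters for loop-carried dependency chains, while throughput matters when there is plenty of independent work.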

MULPS latency did increase from 4 (Nehalem) to 5 (Sandybridge). This may have been to save power or transistors, but more likely because SandyBridge standardized uop latencies to only a few different values, to avoid writeback conflict: i.e. when the same execution unit would produce two results in the same cycle, e.g. from starting a 2c uop one cycle, then a 1c uop the next cycle.
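The write-back conflict described above can be sketched with a toy model. This is only an illustration under simple assumptions: one execution unit, one result written back per cycle, and a hypothetical helper `writeback_conflicts` that takes `(start_cycle, latency)` pairs. It does not model how real hardware resolves the conflict.

```python
from collections import defaultdict

def writeback_conflicts(uops):
    """uops: list of (start_cycle, latency) pairs on one execution unit.

    Returns the cycles in which more than one result would be produced.
    """
    finishing = defaultdict(int)
    for start, latency in uops:
        finishing[start + latency] += 1
    return sorted(cycle for cycle, count in finishing.items() if count > 1)

# A 2c uop started in cycle 0, then a 1c uop started in cycle 1:
# both results are due in cycle 2.
mixed = [(0, 2), (1, 1)]
# Same latency everywhere, one uop started per cycle: no collisions.
uniform = [(0, 3), (1, 3), (2, 3)]

print(writeback_conflicts(mixed))    # [2]
print(writeback_conflicts(uniform))  # []
```

With a single latency per unit, results come out in the same order and spacing the uops went in, which is what makes the conflict impossible by construction.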

This simplifies the uop scheduler, which dispatches uops from the Reservation Station to the execution units. It picks them in more or less oldest-first order, but it has to filter by which ones have their inputs ready. The scheduler is power-hungry, and this is a significant part of the power cost of out-of-order execution. (It's unfortunately not practical to make a scheduler that picks uops in critical-path-first order, to avoid having independent uops steal cycles from the critical path with resource conflicts.)

Agner Fog explains the same thing (in the SnB section of his microarch pdf):

Mixing μops with different latencies

Previous processors have a write-back conflict when μops with different latencies are issued to the same execution port, as described on page 114. This problem is largely solved on the Sandy Bridge. Execution latencies are standardized so that all μops with a latency of 3 are issued to port 1 and all μops with a latency of 5 go to port 0. μops with a latency of 1 can go to port 0, 1 or 5. No other latencies are allowed, except for division and square root.

The standardization of latencies has the advantage that write-back conflicts are avoided. The disadvantage is that some μops have higher latencies than necessary.

Hmm, I just realized that Agner's numbers for VEXTRACTF128 xmm, ymm, imm8 are weird. Agner lists it as 1 uop 2c latency on SnB, but Intel lists it as 1c latency. Maybe the execution unit is 1c latency, but there's a built-in 1c bypass delay (for lane-crossing?) before you can use the result. That would explain the discrepancy between Intel's numbers and Agner's experimental test.


Some instructions are still 2c latency, because they decode to 2 dependent uops that are each 1c latency. MULPS is a single uop, even the AVX 256b version, because even Intel's first-gen AVX CPUs have full-width 256b execution units (except the divide/sqrt unit). Needing twice as many copies of the FP multiplier circuitry is a good reason for optimizing it to save transistors at the cost of latency.


This pattern holds up to and including Broadwell, AFAICT from searching Agner's tables. (Using LibreOffice, I selected the whole table, did data->filter->standard filter, and looked for rows with column C = 1 and column F = 4, then repeated for 2.) Look for any uops that aren't loads or stores.

Haswell sticks to the pattern of only 1, 3 and 5 cycle ALU uop latencies, except for AESENC/AESDEC (1 uop for port5 with 7c latency) and of course DIVPS and SQRTPS. There's also CVTPI2PS xmm, mm, at 1 uop 4c latency, but maybe that's 3c for the p1 uop plus 1c of bypass delay, which could be an artifact of how Agner Fog measured it or could be unavoidable. VMOVMSKPS r32, ymm is also 2c (vs. 3c for the r32,xmm version).
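The spreadsheet filter described above can also be expressed as a few lines of code. The sketch below is only an illustration: the rows are a handful of figures quoted in this answer, not Agner's actual spreadsheet, and the field names (`uops`, `latency`) are made up stand-ins for his columns C and F.

```python
# A few single-uop entries quoted in this answer (not the full tables).
ROWS = [
    {"arch": "Sandybridge", "insn": "MULPS", "uops": 1, "latency": 5},
    {"arch": "Sandybridge", "insn": "VEXTRACTF128 xmm, ymm, imm8", "uops": 1, "latency": 2},
    {"arch": "Haswell", "insn": "AESENC", "uops": 1, "latency": 7},
    {"arch": "Haswell", "insn": "CVTPI2PS xmm, mm", "uops": 1, "latency": 4},
]

# The standardized ALU latencies from Sandybridge through Broadwell.
STANDARD_LATENCIES = {1, 3, 5}

def odd_latency_rows(rows, standard=STANDARD_LATENCIES):
    """Return single-uop rows whose latency is not a standardized value."""
    return [r for r in rows if r["uops"] == 1 and r["latency"] not in standard]

for row in odd_latency_rows(ROWS):
    print(row["arch"], row["insn"], row["latency"])
```

With the real tables exported to CSV, the same filter would need an extra condition to skip loads and stores, as noted above.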

Broadwell dropped MULPS latency to 3, same as ADDPS, but kept FMA at 5c. Presumably they figured out how to shortcut the FMA unit to produce just a multiply when no add was needed.


Skylake is able to handle uops with latency=4. Latency for FMA, ADDPS/D, and MULPS/D = 4 cycles. (SKL drops the dedicated vector-FP add unit, and does everything with the FMA unit. So ADDPS/D throughput is doubled to match MULPS/D and FMA...PS/D. I'm not sure which change motivated what, and whether they would have introduced 4c latency instructions at all if they hadn't wanted to drop the vec-FP adder without hurting ADDPS latency too badly.)
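One practical consequence of these numbers is how many independent dependency chains (e.g. separate accumulators in a reduction loop) are needed to keep the units busy. A minimal sketch, assuming the usual latency times throughput rule of thumb; the figure of two FP uops started per cycle is an assumption for illustration, not something stated in the tables quoted here, and `chains_to_hide_latency` is a hypothetical helper.

```python
def chains_to_hide_latency(latency_cycles, ops_per_cycle):
    """Independent chains needed so a new op can start every slot.

    While one op is in flight for `latency_cycles`, the hardware could
    start `ops_per_cycle` others each cycle, so that many must be ready.
    """
    return latency_cycles * ops_per_cycle

# 4c latency (the Skylake figure above), assuming 2 ops can start per cycle.
print(chains_to_hide_latency(4, 2))  # 8
# 5c latency (e.g. FMA on Broadwell), same assumed throughput.
print(chains_to_hide_latency(5, 2))  # 10
```

With fewer chains than this, the loop is bound by latency rather than by throughput.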

Other SKL instructions with 4c latency: PHMINPOSUW (down from 5c), AESDEC/AESENC, CVTDQ2PS (up from 3c, but this might be 3c + bypass), RCPPS (down from 5c), RSQRTPS, CMPPS/D (up from 3c). Hmm, I guess FP compares were done in the adder, and now have to use FMA.

MOVD r32, xmm and MOVD xmm, r32 are listed as 2c, perhaps a bypass delay from int-vec to int? Or a glitch in Agner's testing? Testing the latency would require other instructions to create a round-trip back to xmm. It's 1c on HSW. Agner lists SKL MOVQ r64, xmm as 2 cycles (port0), but MOVQ xmm, r64 as 1c (port5), and it seems extremely weird that reading a 64-bit register is faster than reading a 32-bit register. Agner has had mistakes in his table in the past; this may be another.
