Long latency instruction
Question
I would like a long-latency single-uop x86 [1] instruction, in order to create long dependency chains as part of testing microarchitectural features.
Currently I'm using `fsqrt`, but I'm wondering if there is something better.
Ideally, the instruction will score well on the following criteria:
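- Long latency
- Stable/fixed latency
- One or a few uops (in particular: not microcoded)
- Consumes as few uarch resources as possible (load/store buffers, page walkers, etc.)
- Able to chain (latency-wise) with itself
- Able to chain input and output with GP registers
- Doesn't interfere with normal OoO execution (beyond whatever ROB, RS, etc. resources it consumes)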
So `fsqrt` is OK in most senses, but the latency isn't that long and it seems hard to chain with GP regs.
[1] On modern Intel x86 in particular, with bonus points if it also works well on AMD Zen*.
Recommended answer
Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.
The div/sqrt unit is the only truly high-latency ALU, but integer `div`/`idiv` are microcoded on Intel, so yes, use FP, where div/sqrt are typically single-uop instructions.
AMD's integer `div` / `idiv` are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.
Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for `movd xmm, r32` (10c, 2 uops) and `movd r32, xmm` (8c, 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle, 1-uop `movd` in either direction.
`movd` to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (https://agner.org/optimize/)
`sqrtss` has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just `movd xmm, r32` of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; they don't cause a slowdown for SSE/AVX math, only for x87.
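As a rough illustration of the GP-register chaining described above, here is a minimal C sketch using SSE intrinsics. The function names (`enable_ftz_daz`, `sqrt_link`, `sqrt_chain`) are hypothetical, not from the original answer. It assumes an x86-64 target and relies on the documented MXCSR layout (FTZ is bit 15, DAZ is bit 6). A compiler is free to restructure this code, so check the generated asm before trusting any measurement made with it.

```c
#include <immintrin.h>
#include <stdint.h>

/* Set FTZ (MXCSR bit 15) and DAZ (MXCSR bit 6) so that subnormal
   inputs or results cannot trigger FP assists. */
void enable_ftz_daz(void) {
    _mm_setcsr(_mm_getcsr() | 0x8000u | 0x0040u);
}

/* Hypothetical helper: one link of the chain, GP reg -> XMM -> sqrtss -> GP reg. */
static inline uint32_t sqrt_link(uint32_t bits) {
    __m128 x = _mm_castsi128_ps(_mm_cvtsi32_si128((int)bits)); /* movd xmm, r32 */
    x = _mm_sqrt_ss(x);                                        /* sqrtss xmm, xmm */
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(x));   /* movd r32, xmm */
}

/* Each iteration depends on the previous result, forming one long
   dependency chain through both the integer and the FP domain. */
uint32_t sqrt_chain(uint32_t bits, int links) {
    for (int i = 0; i < links; i++)
        bits = sqrt_link(bits);
    return bits;
}
```

Calling `enable_ftz_daz()` once and then `sqrt_chain(0x3f800000u, n)` (the bit pattern of 1.0f) keeps the value at 1.0f for every link, so every `sqrtss` in the chain sees the same input.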
Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency `sqrtss`, so you probably want to control the starting bit-pattern there.
The same goes if you want to use `sqrtsd` for higher latency per uop than `sqrtss`. It's still variable latency even on Skylake (15-16 cycles).
You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of `sqrtss` instructions with the same input every time will give the same sequence of latencies. Or with a starting input of `0.0`, `1.0`, `+inf`, or `NaN`, you'll get the same latency for every uop in the sequence.
(Simple inputs like 1.0 and 0.0, with few significant figures in the input and output, presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. The same goes for sqrt(NaN) = NaN.)
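One way to check the claim about self-perpetuating inputs on a given machine is to time a chain of dependent `sqrtss` operations. The following is only a sketch under several assumptions: the helper name `tsc_per_link` is made up for illustration, it uses GCC/Clang extensions (`__rdtsc` from `<x86intrin.h>` and empty inline-asm barriers), and the TSC counts reference cycles rather than core clock cycles, so the result needs scaling (or a fixed CPU frequency) before comparing against published latency numbers.

```c
#include <x86intrin.h>
#include <stdint.h>

/* Hypothetical helper: average TSC ticks per sqrtss in a dependent chain.
   The final value is written to *out so the caller can confirm the input
   was a fixed point (e.g. 1.0f stays 1.0f). */
double tsc_per_link(float start, long links, float *out) {
    __m128 x = _mm_set_ss(start);
    /* Empty asm: makes x opaque so the compiler cannot constant-fold
       the chain or move it across the timestamp reads. */
    __asm__ volatile("" : "+x"(x) : : "memory");
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < links; i++)
        x = _mm_sqrt_ss(x);            /* each sqrtss depends on the previous one */
    __asm__ volatile("" : "+x"(x) : : "memory");
    uint64_t t1 = __rdtsc();
    *out = _mm_cvtss_f32(x);
    return (double)(t1 - t0) / (double)links;
}
```

`rdtsc` is not a serializing instruction, so a few ticks of error at each end are expected; a chain of many thousands of links makes that error negligible relative to the total.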
You might use `and reg, 0` or other non-dep-breaking zeroing as part of your chain to control the input bit-pattern. Or perhaps `or reg, -1` to create NaN. Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
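A compiler would normally fold `x & 0` into a constant and break the dependency, so expressing this trick from C needs inline asm. Here is a minimal sketch assuming GCC or Clang on x86 with the default AT&T asm syntax; the helper names `dep_zero` and `dep_ones` are hypothetical.

```c
#include <stdint.h>

/* Returns 0, but the result still carries a true data dependency on x,
   because `and reg, 0` is not recognized as a zeroing idiom. */
static inline uint32_t dep_zero(uint32_t x) {
    __asm__ volatile("andl $0, %0" : "+r"(x) : : "cc");
    return x;
}

/* Returns 0xFFFFFFFF (a NaN bit pattern when viewed as a float),
   again with a true dependency on x. */
static inline uint32_t dep_ones(uint32_t x) {
    __asm__ volatile("orl $-1, %0" : "+r"(x) : : "cc");
    return x;
}
```

Feeding the output of one chain link through `dep_zero` or `dep_ones` before the next `movd xmm, r32` pins the input bit pattern while keeping the chain intact.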
Or perhaps `pinsrw xmm0, eax, 7` (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known `0.0` or `1.0`. It's probably cheaper to just `and` with 0 and use `movd`, unless port-5 pressure is a non-issue.
To create a throughput bottleneck (not latency), your best bet on Skylake is `vsqrtpd ymm`: 1 uop for p0, latency = 15-16, throughput = 9-12.
On Broadwell and earlier, it was 3 uops (2p0 p15), but I think Skylake widened the SIMD divider (in preparation for AVX512, I guess).