Long latency instruction

Question

I would like a long-latency single-uop x86 [1] instruction, in order to create long dependency chains as part of testing microarchitectural features.

Currently I'm using fsqrt, but I'm wondering if there is something better.

Ideally, the instruction will score well on the following criteria:

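  • Long latency
  • Stable/fixed latency
  • One or a few uops (in particular: not microcoded)
  • Consumes as few other uarch resources as possible (load/store buffers, page walkers, etc.)
  • Able to chain (latency-wise) with itself
  • Able to chain input and output with GP registers
  • Doesn't interfere with normal OoO execution (beyond whatever ROB, RS, etc. resources it consumes)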

So fsqrt is OK in most senses, but the latency isn't that long and it seems hard to chain with GP regs.

[1] On modern Intel x86 in particular, with bonus points if it also works well on AMD Zen*.

Recommended answer

Mainstream Intel CPUs don't have any very long latency single-uop integer instructions. There are integer ALUs for 1-cycle latency uops on all ALU ports, and a 3-cycle-latency pipelined ALU on port 1. I think AMD is similar.

The div/sqrt unit is the only truly high-latency ALU, but integer div/idiv are microcoded on Intel, so yes, use FP, where div/sqrt are typically single-uop instructions.

AMD's integer div/idiv are 2-uop instructions (presumably to write the 2 outputs), with data-dependent latency.

Also, AMD Bulldozer/Piledriver (where 2 integer cores share a SIMD/FP unit) has pretty high latency for movd xmm, r32 (10c, 2 uops) and movd r32, xmm (8c, 1 uop). Steamroller shortens that by 1c each. Ryzen has 3-cycle latency, 1 uop, in either direction.

movd to/from XMM regs is cheap on Intel: single-uop with 1-cycle (Broadwell and earlier) or 2-cycle latency (Skylake). (https://agner.org/optimize/)

sqrtss has fixed latency (on IvB and later), other than maybe with subnormal inputs. If your chain-with-integer involves just movd xmm, r32 of an arbitrary integer bit-pattern, you might want to set DAZ/FTZ to remove the possibility of FP assists. NaN inputs are fine; that doesn't cause a slowdown for SSE/AVX math, only x87.
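The GP -> XMM -> sqrtss -> GP round trip described above could be sketched in C with SSE intrinsics roughly as follows. This is only an illustration under assumptions: the function names (enable_ftz_daz, sqrt_chain_step, sqrt_chain) are made up for this example, and a compiler is free to pick different instructions or to constant-fold known inputs, so a real microbenchmark would normally check the generated assembly or be written in asm directly.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_sqrt_ss, _mm_getcsr, _mm_setcsr */
#include <emmintrin.h>  /* _mm_cvtsi32_si128, _mm_cvtsi128_si32, casts */

/* Hypothetical helper: set FTZ (MXCSR bit 15) and DAZ (MXCSR bit 6)
 * so subnormal inputs/outputs cannot trigger FP assists. */
static void enable_ftz_daz(void) {
    _mm_setcsr(_mm_getcsr() | 0x8040u);
}

/* One link of the chain: integer bits -> XMM -> sqrt -> integer bits.
 * Intended to map to movd xmm, r32 / sqrtss / movd r32, xmm. */
static uint32_t sqrt_chain_step(uint32_t bits) {
    __m128 x = _mm_castsi128_ps(_mm_cvtsi32_si128((int)bits));
    x = _mm_sqrt_ss(x);
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(x));
}

/* n dependent links; each step consumes the previous step's result. */
static uint32_t sqrt_chain(uint32_t bits, int n) {
    for (int i = 0; i < n; i++) {
        bits = sqrt_chain_step(bits);
    }
    return bits;
}
```

Calling enable_ftz_daz() once before the timed region and then sqrt_chain(0x3F800000u, n) (the bit pattern of 1.0f) gives a chain whose value never changes, which makes it easy to confirm that the loop did what was intended.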

Other CPUs (Sandybridge and earlier, and all AMD) have variable-latency sqrtss, so you probably want to control the starting bit-pattern there.

The same goes if you want to use sqrtsd for higher latency per uop than sqrtss. It's still variable latency even on Skylake (15-16 cycles).

You can assume that the latency is a pure function of the input bit-pattern, so starting a chain of sqrtss instructions with the same input every time will give the same sequence of latencies. Or with a starting input of 0.0, 1.0, +inf, or NaN, you'll get the same latency for every uop in the sequence.

(Simple inputs like 1.0 and 0.0 (few significant figures in the input and output) presumably run with the lowest latency. sqrt(1.0) = 1.0 and sqrt(0) = 0, so these are self-perpetuating. Same for sqrt(NaN) = NaN.)
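The self-perpetuating property can be checked at the bit level. The following is a minimal sketch, assuming SSE2 intrinsics are available; the helper names (sqrtss_bits, is_sqrt_fixed_point, is_nan_bits) are invented for this example and are not from the original answer.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_sqrt_ss */
#include <emmintrin.h>  /* _mm_cvtsi32_si128, _mm_cvtsi128_si32, casts */

/* Apply a scalar single-precision sqrt to a raw 32-bit pattern. */
static uint32_t sqrtss_bits(uint32_t bits) {
    __m128 x = _mm_castsi128_ps(_mm_cvtsi32_si128((int)bits));
    x = _mm_sqrt_ss(x);
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(x));
}

/* True if sqrt maps the pattern to itself, so every link in a chain
 * started from it sees the same input (and hence the same latency). */
static int is_sqrt_fixed_point(uint32_t bits) {
    return sqrtss_bits(bits) == bits;
}

/* True if the pattern encodes a NaN: exponent all ones, mantissa nonzero. */
static int is_nan_bits(uint32_t bits) {
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
}
```

For example, is_sqrt_fixed_point(0x3F800000u) tests 1.0f, 0x00000000u tests +0.0f, and 0x7F800000u tests +inf, while is_nan_bits(sqrtss_bits(0x7FC00000u)) confirms that a quiet NaN stays a NaN.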

You might use and reg, 0 or other non-dep-breaking zeroing as part of your chain to control the input bit-pattern. Or perhaps or reg, -1 to create NaN. Then you can get fixed latency on Sandybridge or earlier, and on AMD including Zen.
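In C, a plain `v & 0` would simply be folded to a constant by the compiler, dropping the dependency on v. One way to keep the instruction (and with it the data dependency) is inline asm. The sketch below assumes GCC or Clang extended asm with AT&T syntax on x86; the function names are hypothetical.

```c
#include <stdint.h>

/* Force the value to 0 with `and reg, 0`. Unlike xor-zeroing, this form
 * is described above as non-dep-breaking, so the result still waits for
 * the old value of v. The asm statement stops the compiler from folding it. */
static inline uint32_t zero_keep_dep(uint32_t v) {
    __asm__ volatile("andl $0, %0" : "+r"(v) : : "cc");
    return v;
}

/* Force the value to all-ones with `or reg, -1`. Interpreted as a float,
 * 0xFFFFFFFF is a NaN bit pattern. */
static inline uint32_t ones_keep_dep(uint32_t v) {
    __asm__ volatile("orl $-1, %0" : "+r"(v) : : "cc");
    return v;
}
```

Feeding the result of zero_keep_dep or ones_keep_dep into movd xmm, r32 followed by sqrtss then gives a known input pattern while keeping the integer value in the dependency chain.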

Or perhaps pinsrw xmm0, eax, 7 (2 uops for port 5 on Intel) to only modify the high qword of an XMM, leaving the bottom as known 0.0 or 1.0. It's probably cheaper to just AND with 0 and use movd, unless port-5 pressure is a non-issue.
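The pinsrw variant can be illustrated with the _mm_insert_epi16 intrinsic, which normally compiles to pinsrw. This is a sketch with made-up helper names: it writes the low 16 bits of a GP value into word 7 (the top word of the register) and leaves the low dword, where the scalar float lives, untouched.

```c
#include <stdint.h>
#include <emmintrin.h>  /* _mm_insert_epi16, _mm_cvtsi128_si32 */

/* Couple a GP value into the XMM register's dependency chain without
 * disturbing the low float: only bits 112..127 are overwritten.
 * pinsrw takes just the low 16 bits of the GP register. */
static __m128i tie_gp_into_high_word(__m128i x, uint32_t gp) {
    return _mm_insert_epi16(x, (int)(gp & 0xFFFFu), 7);
}

/* Read back the low 32 bits (the scalar float's bit pattern). */
static uint32_t low_dword(__m128i x) {
    return (uint32_t)_mm_cvtsi128_si32(x);
}
```

Because sqrtss xmm0, xmm0 reads the whole destination register (it merges the untouched upper elements), a following sqrtss still has to wait for the pinsrw, even though the float it operates on is unchanged.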

To create a throughput bottleneck (not latency), your best bet on Skylake is vsqrtpd ymm: 1 uop for p0, latency = 15-16 cycles, reciprocal throughput = 9-12 cycles.

On Broadwell and earlier, it was 3 uops (2p0 p15), but I think Skylake widened the SIMD divider (in preparation for AVX512, I guess).
