What do multiple values or ranges mean as the latency for a single instruction?


Question



I have a question about instruction latency on https://uops.info/.

For some instructions like PCMPEQB(XMM, M128) the latency in the table entry for Skylake is listed as [1;≤8]

I know a little about latency, but what I know is that it's just a single number, for example 1 or 2 or 3. So what does [1;≤8] mean? Does it mean the latency depends on memory and is somewhere between 1 and 8?

If so, when is it 1 and when is it 3, and so on?

For example, what is the latency of this instruction:

pcmpeqb xmm0, xword [.my_aligned_data]

....
....

align 16
.my_aligned_data db 5,6,7,2,5,6,7,2,5,6,7,2,5,6,7,2

What is the exact latency value for this pcmpeqb xmm0, xword [.my_aligned_data]?

Or, for example:

PMOVMSKB (R32, XMM)

The latency for this instruction is listed as ≤3. What does that mean? Does it mean the latency is between 1 and 3? But this instruction only works on registers, so when is it 1 vs. some higher number?

Solution

Why 2 numbers, : separated?

The instruction has 2 inputs and 2 uops (unfused domain), so both inputs aren't needed at the same time. e.g. the memory address is needed for the load, but the vector register input isn't needed until the load is ready.

That's why there are 2 separate fields in the latency value.
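As a toy model of why the two operands get separate latency numbers, consider a loop-carried dependency chain looping the result back through each operand in turn. This sketch is my own illustration, and the constants are assumptions for the example, not measured values:

```python
# Toy model (illustration only, not uops.info's methodology).
# Assumed constants:
ALU_LATENCY = 1       # the compare uop itself: 1 cycle
LOAD_USE_LATENCY = 5  # assumed L1d load-use latency for the memory operand

def loop_carried_cycles(iterations, chain_through_memory):
    """Cycles for a dep chain that loops the result back into one operand."""
    per_iter = ALU_LATENCY + (LOAD_USE_LATENCY if chain_through_memory else 0)
    return iterations * per_iter

# Chaining through the XMM register operand pays only the 1-cycle ALU latency;
# chaining through the memory operand puts the load on the critical path.
print(loop_carried_cycles(100, chain_through_memory=False))  # -> 100
print(loop_carried_cycles(100, chain_through_memory=True))   # -> 600
```

This is why a dep chain through the register operand measures 1 cycle per iteration, while a chain through the address or memory contents measures much more.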

Click on the latency number link in https://uops.info/ for the breakdown of which operand to which result has which latency.

https://www.uops.info/html-lat/SKL/PCMPEQB_XMM_M128-Measurements.html breaks it down for this specific instruction for Skylake, which has 2 inputs and one output (in the same operand as one of the inputs, because this is the non-VEX version; fun fact: that lets it keep a uop micro-fused even with an indexed addressing mode on HSW and later, unlike the VEX version):

  • Operand 1 (r/w): the XMM register
  • Operand 2 (r): memory

  • Latency operand 1 → 1: 1
  • Latency operand 2 → 1 (address, base register): ≤8
  • Latency operand 2 → 1 (memory): ≤5

And below that there are the specific instruction sequences that were used to test this instruction.

This detailed breakdown is where uops.info testing really shines compared to any other testing results or published numbers, especially for multi-uop instructions like mul or shr reg, cl. e.g. for shifts, the latency from reg or shift count to output is only 1 cycle; the extra uops are just for FLAGS merging.


Variable latency: why ≤8

Store-forwarding latency is variable on SnB family, and address-generation / L1d Load-use latency can be as well (Is there a penalty when base+offset is in a different page than the base?). Notice this has a memory source operand. But that's not why the latency is listed as ≤ n.

The ≤n latency values are an upper limit, I think. It does not mean that the latency from that operand could be as low as 1.

I think they only give an upper bound in cases where they weren't able to definitively test accurately for a definite lower bound.

Instructions like PMOVMSKB (R32, XMM) that produce their output in a different domain than their input are very hard to pin down. You need to use other instructions to feed the output back into the input to create a loop-carried dependency chain, and it's hard to design experiments to pin the blame on one part of the chain vs. another.

But unlike InstLatx64, the people behind https://uops.info/ didn't just give up in those cases. Their tests are vastly better than nothing!

e.g. a store/reload has some latency but how do you choose which of it to blame on the store vs. the load? (A sensible choice would be to list the load's latency as the L1d load-use latency, but unfortunately that's not what Agner Fog chose. His load vs. store latencies are totally arbitrary, like divided in half or something, leading to insanely low load latencies that aren't the load-use latency :/)

There are different ways of getting data from integer regs back into XMM regs as an input dependency for pmovmskb: ALU via movd or pinsrb/w/d/q, or a load. Or on AVX512 CPUs, via kmov and then using a masked instruction. None of these are simple and you can't assume that load-use latency for a SIMD load will be the same as an integer load. (We know store-forwarding latency is higher.)

As @BeeOnRope comments, uops.info typically times a round trip, and the displayed latency is the value of the entire cycle, minus any known padding instructions, minus 1. For example, if you time a GP -> SIMD -> GP roundtrip at 4 cycles (no padding), both of those instructions will be shown as <= 3.

When getting an upper bound for each one, you presumably can assume that any instruction has at least 1 cycle latency. e.g. for a pmovmskb -> movd chain, you can assume that movd has at least 1 cycle of latency, so the pmovmskb latency is at most the round-trip latency minus 1. But really it's probably less.
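The round-trip bookkeeping described above can be written out as plain arithmetic. A minimal sketch (my own illustration of the accounting, not uops.info's actual code):

```python
def displayed_upper_bound(round_trip_cycles, padding_cycles, other_min_latency=1):
    """Upper bound reported for one instruction in a 2-instruction round trip.

    Subtract the known latency of any padding instructions, then assume the
    other instruction in the chain takes at least 1 cycle; whatever is left
    is the most the instruction under test could be responsible for.
    """
    return round_trip_cycles - padding_cycles - other_min_latency

# A GP -> SIMD -> GP round trip (e.g. movd + pmovmskb) measured at 4 cycles
# with no padding: each of the two instructions is reported as <= 3.
print(displayed_upper_bound(4, 0))  # -> 3
```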


https://www.uops.info/html-lat/SKL/DIVPD_XMM_M128-Measurements.html for example shows different "Chain latency" values for different experiments. e.g. for one of the 1 -> 1 tests that runs divpd and uses ORPD and ANDPD to create a dep chain with the same dividend repeatedly, uops.info lists the known latency of those extra instructions in the dep chain. It lists that as Chain latency: ≥10. (It could theoretically be higher if resource conflicts or some other effect make it not always produce a result exactly 10 cycles after the divpd output was ready. The point of these experiments is to catch weird effects that we might not have expected.) So given the "Core cycles: 44.0" minus the chain latency of at least 10, we can say that the divpd latency is at most 34, with the rest of the dep chain accounting for the other 10 (but possibly more).

(34.0 seems high; maybe I'm misinterpreting something. The inputs do have lots of significant mantissa bits, vs. experiment 2 which I think is doing 1.0 / 1.0 with nothing else in the loop, measuring 6 cycle latency from XMM -> XMM as a best case.)
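The divpd accounting above, written out as arithmetic (the numbers are the ones quoted from the uops.info experiment):

```python
core_cycles = 44.0      # "Core cycles: 44.0" for one iteration of the test loop
chain_latency_min = 10  # known latency of the ORPD/ANDPD chain: >= 10

# Everything the known chain doesn't account for is attributed to divpd,
# giving an upper bound on its latency (the true value may be lower).
divpd_latency_max = core_cycles - chain_latency_min
print(divpd_latency_max)  # -> 34.0
```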

Note that I'm just talking about the xmm -> xmm case here, not their more complex tests that feed back the XMM output as a dependency for the address or for memory contents.
