Is mov r64, m64 one cycle or two cycle latency?


Question

I'm on IvyBridge. I wrote the following simple program to measure the latency of mov:

section .bss
align   64
buf:    resb    64

section .text
global _start
_start:
    mov rcx,    1000000000
    xor rax,    rax
.loop:                          ; local label: 'loop' itself is an instruction mnemonic
    mov rax,    [buf+rax]       ; each load's address depends on the previous load

    dec rcx
    jne .loop

    xor rdi,    rdi
    mov rax,    60
    syscall

perf shows the result:

 5,181,691,439      cycles

So every iteration has 5 cycle latency. According to multiple online resources, the latency of the L1 cache is 4. Therefore the latency of mov itself should be 1.

However, Agner's instruction tables show mov r64, m64 as having 2 cycle latency for IvyBridge. I don't know of anywhere else to find this latency.

Did I make a mistake in the measuring program above? Why does this program show the mov latency as 1 rather than 2?

(I got the same result using the L2 cache: if buf+rax misses L1 and hits L2, a similar measurement shows mov rax, [buf+rax] has 12 cycle latency. IvyBridge's L2 cache has 11 cycle latency, so the mov latency is still 1 cycle.)

Answer

Therefore the latency of mov itself should be 1.

No, the mov is the load. There isn't also an ALU mov operation that the data has to go through.

Agner Fog's instruction tables don't contain the load-use latency (like you're measuring). They're in his microarch PDF in tables in the "cache and memory access" section for each uarch. e.g. SnB/IvB (Section 9.13) has a "Level 1 data" row with "32 kB, 8 way, 64 B line size, latency 4, per core".

This 4-cycle latency is the load-use latency for a chain of dependent instructions like mov rax, [rax]. You're measuring 5 cycles because you're using an addressing mode other than [reg + 0..2047]. With small displacements, the load unit speculates that using the base register directly as the input to the TLB lookup will give the same result as using the adder result. (See "Is there a penalty when base+offset is in a different page than the base?".) So your addressing mode [disp32 + rax] uses the normal path, waiting one more cycle for the adder result before starting the TLB lookup in the load port.

For most operations between different domains (like integer registers and XMM registers), you can only really measure a round trip like movd xmm0, eax / movd eax, xmm0, and it's hard to pick that apart and figure out what the latency of each instruction is separately (footnote 1).

For loads, you can chain to another load to measure cache load-use latency, instead of a chain of store/reload.

Agner for some reason decided to only look at store-forwarding latency for his tables, and to make a totally arbitrary choice of how to split up the store-forwarding latency between the store and the reload.

(from the "definition of terms" sheet of his instruction table spreadsheet, way at the left after the Introduction)

It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time.

This is obviously incorrect: L1d load-use latency is a thing for pointer-chasing through levels of indirection. You could argue that it's simply variable because some loads can miss in cache, but if you're going to pick something to put in your table you might as well pick the L1d load-use latency. And then calculate the store latency numbers such that store+load latency = store-forwarding latency like now. Intel Atom would then have store latency = -2, because it has 3c L1d load-use latency, but 1c store-forwarding according to Agner's uarch guide.

This is less easy for loads into XMM or YMM registers, for example, but still possible once you work out the latency of movq rax, xmm0. It's harder for x87 registers, because there's no way to directly get the data from st0 into eax/rax through the ALU, instead of a store/reload. But perhaps you could do something with an FP compare like fucomi that sets integer FLAGS directly (on CPUs that have it: P6 and later).

Still, it would have been a lot better for at least the integer load latency to reflect pointer-chasing latency. IDK if anyone's offered to update Agner's tables for him, or if he'd accept such an update. It would take fresh testing on most uarches to be sure you had the right load-use latency for different register sets, though.

footnote 1: For example, http://instlatx64.atw.hu doesn't try, and just says "diff. reg. set" in the latency column, with useful data only in the throughput column. But they have lines for the MOVD r64, xmm+MOVD xmm, r64 round trip, in this case 2 cycles total on IvB so we can be pretty confident they're only 1c each way. Not zero one way. :P

But for loads into integer registers, they do show IvB's 4-cycle load-use latency for MOV r32, [m32], because apparently they test with a [reg + 0..2047] addressing mode.

https://uops.info/ is good, but gives pretty loose bounds on latency: IIRC, they construct a loop with a round trip (e.g. store and reload, or xmm->integer and integer->xmm), and then assume every other step is only 1 cycle, so the result is only an upper bound per instruction. See "What do multiple values or ranges mean as the latency for a single instruction?" for more.

Other sources of cache latency info:

https://www.7-cpu.com/ has good details for lots of other uarches, even many non-x86 like ARM, MIPS, PowerPC, and IA-64.

The pages have other details like cache and TLB sizes, TLB timing, branch miss experiment results, and memory bandwidth. The cache latency details look like this:

(from their Skylake page)

  • L1 Data Cache Latency = 4 cycles for simple access via pointer
  • L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
  • L2 Cache Latency = 12 cycles
  • L3 Cache Latency = 42 cycles (core 0) (i7-6700 Skylake 4.0 GHz)
  • L3 Cache Latency = 38 cycles (i7-7700K 4 GHz, Kaby Lake)
  • RAM Latency = 42 cycles + 51 ns (i7-6700 Skylake)
