Is there a penalty when base+offset is in a different page than the base?


Problem Description


The execution times for these three snippets:

pageboundary: dq (pageboundary + 8)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx - 8]
    sub ecx, 1
    jnz .loop

And this:

pageboundary: dq (pageboundary - 8)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx + 8]
    sub ecx, 1
    jnz .loop

And this:

pageboundary: dq (pageboundary - 4096)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx + 4096]
    sub ecx, 1
    jnz .loop

Are, on a 4770K, roughly 5 cycles per iteration for the first snippet and roughly 9 cycles per iteration for the second snippet, then 5 cycles for the third snippet. They all access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, but the load is still aligned. With a large offset it's back to 5 cycles again.

How does this effect work in general?


Routing the result from the load through an ALU instruction like this:

.loop:
    mov rdx, [rdx + 8]
    or rdx, 0
    sub ecx, 1
    jnz .loop

Makes it take 6 cycles per iteration, which makes sense as 5+1. Reg+8 should be a special fast load and AFAIK take 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.


A test like this was used in response to some of the comments:

.loop:
    lfence
    ; or rdx, 0
    mov rdx, [rdx + 8]
    ; or rdx, 0
    ; uncomment one of the ORs
    lfence
    sub ecx, 1
    jnz .loop

Putting the or before the mov makes the loop faster than without any or, putting the or after the mov makes it a cycle slower.

Solution

Optimization rule: in pointer-connected data structures like linked-lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(max_align_t)), so this will ensure the linking pointers are in the same page as the start of the object.
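
A minimal sketch of that layout rule in NASM struc notation (the field names here are hypothetical):

struc node
    .next:      resq 1      ; offset 0: link pointer within the first 16 bytes
    .prev:      resq 1      ; offset 8: also within the first 16 bytes
    .payload:   resb 48     ; bulkier, less latency-critical members afterwards
endstruc

With a 16-byte-aligned allocation, node.next and node.prev can never be in a different page than the start of the node, so pointer-chasing loads of them stay eligible for the fast path described below.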

Any other way of ensuring that important struct members are in the same page as the start of the object will also work.


Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements with base+disp addressing modes.

Sandybridge-family has 4 cycle load-use latency for [reg + 0..2047] addressing modes, when the base reg is the result of a mov load, not an ALU instruction. Or a penalty if reg+disp is in a different page than reg.

Based on these test results on Haswell and Skylake (and probably original SnB but we don't know), it appears that all of the following conditions must be true:

  • base reg comes from another load. (A rough heuristic for pointer-chasing, and usually means that load latency is probably part of a dep chain). If objects are usually allocated not crossing a page boundary, then this is a good heuristic. (The HW can apparently detect which execution unit the input is being forwarded from.)
  • Addressing mode is [reg] or [reg+disp8/disp32]. (Or an indexed load with an xor-zeroed index register! Usually not practically useful, but might provide some insight into the issue/rename stage transforming load uops.)
  • displacement < 2048, i.e. bit 11 and all higher bits of the displacement are zero (a condition HW can check without a full integer adder/comparator.)
  • (Skylake but not Haswell/Broadwell): the last load wasn't a retried-fastpath. (So base = result of a 4 or 5 cycle load, it will attempt the fast path. But base = result of a 10 cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW).

    I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.

If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.
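
As a concrete illustration, assuming the other conditions above hold (base reg produced by a load, etc.); the base value is made up for the example, only its low 12 bits matter:

    ; suppose rdx holds an address whose low 12 bits are 0xff8, i.e. 8 bytes below a page boundary
    mov rax, [rdx + 4]       ; 0xff8 + 4 = 0xffc: same page as the base, the speculation holds (4c)
    mov rax, [rdx + 8]       ; 0xff8 + 8 carries out of bit 11: the EA is in the next page,
                             ;   so the speculation was wrong (~9c on HSW / ~10c on SKL after the retry)
    mov rax, [rdx + 4096]    ; disp >= 2048: the fast path isn't attempted at all (normal 5c)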

microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):

It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page which is the same in both the virtual and physical address. i.e. the low 12 bits translate for free from virt to phys.)
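
For reference, the geometry behind that claim, written out as NASM constants (64-byte lines assumed, as on these CPUs):

LINE_BYTES  equ 64
WAYS        equ 8
SETS        equ (32*1024) / (WAYS * LINE_BYTES)    ; = 64 sets
; 64 sets * 64 B per line = 4096 B of index + offset per way, i.e. the set index
; (address bits [11:6]) and the line offset (bits [5:0]) fit entirely within the
; untranslated low 12 bits, the offset within a page.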

Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.

As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).

Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).

I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary (e.g. an 8-byte load starting 4 bytes before a page boundary), you pay the page-split penalty, not just the cache-line split penalty, regardless of hugepages.)


Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.

 (Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type             |  Base + Offset > 2048    | Base + Offset < 2048
                      |  Base + Index [+ Offset] |
----------------------+--------------------------+----------------------
Integer               |            5             |  4
MMX, SSE, 128-bit AVX |            6             |  5
X87                   |            7             |  6
256-bit AVX           |            7             |  7
 (remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)

The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).

Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.


@Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx + 8] dep chain in this specific page-crossing case.


Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)

We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.

%define off  16
    lea    rdi, [buf+4096 - 16]   ; rdi = 16 bytes before a page boundary inside buf
    mov    [rdi], rdi             ; store the pointer just below the boundary
    mov    [rdi+off], rdi         ; and at rdi+off (past the boundary when off >= 16)

    mov     ebp, 100000000
.loop:

    and    rdi, rdi
    mov    rdi, [rdi]        ; base comes from AND
    mov    rdi, [rdi+off]    ; base comes from a load

    dec   ebp
    jnz  .loop

    ... sys_exit_group(0)

section .bss
align 4096
buf:    resb 4096*2

Timed with Linux perf on SKL i7-6700k.

  • off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4. (10 cycles per iteration).

  • off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW)

With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.

Without the and, and off=8, we get the expected 8c per iter: both use the fast path. (@harold confirms HSW also gets 8 here).

Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (@harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL)

It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.


On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.
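
That measurement corresponds to a loop like this (my reconstruction, reusing the earlier setup where rdi starts 16 bytes below a page boundary):

.loop:
    mov    rdi, [rdi+16]    ; every EA crosses the page boundary relative to the base;
                            ;   seems to alternate ~5c (fast path not attempted) and
                            ;   ~10c (attempted and failed) on SKL
    dec    ebp
    jnz    .loop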

Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.

Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.
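
A hedged illustration of that caveat (the exact un-lamination rules differ between SnB, HSW and SKL):

    mov     eax, [rdi + rsi*4]    ; pure load: the indexed addressing mode costs nothing extra
    add     eax, [rdi + rsi*4]    ; micro-fused load+ALU: with an indexed address this can
                                  ;   un-laminate into 2 separate fused-domain uops on some
                                  ;   SnB-family CPUs, costing front-end throughput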
