Is there a penalty when base+offset is in a different page than the base?


Question


The execution times for these three snippets:

pageboundary: dq (pageboundary + 8)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx - 8]
    sub ecx, 1
    jnz .loop

And this:

pageboundary: dq (pageboundary - 8)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx + 8]
    sub ecx, 1
    jnz .loop

And this:

pageboundary: dq (pageboundary - 4096)
...

    mov rdx, [rel pageboundary]
.loop:
    mov rdx, [rdx + 4096]
    sub ecx, 1
    jnz .loop

Are, on a 4770K, roughly 5 cycles per iteration for the first snippet, roughly 9 cycles per iteration for the second snippet, and 5 cycles per iteration for the third. All three access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, but the load itself is still aligned. With a large offset it's back to 5 cycles again.

How does this effect work in general?


Routing the result from the load through an ALU instruction like this:

.loop:
    mov rdx, [rdx + 8]
    or rdx, 0
    sub ecx, 1
    jnz .loop

Makes it take 6 cycles per iteration, which makes sense as 5+1. Reg+8 should be a special fast load and AFAIK take 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.


A test like this was used in response to some of the comments:

.loop:
    lfence
    ; or rdx, 0
    mov rdx, [rdx + 8]
    ; or rdx, 0
    ; uncomment one of the ORs
    lfence
    sub ecx, 1
    jnz .loop

Putting the or before the mov makes the loop faster than without any or; putting the or after the mov makes it a cycle slower.

Solution

Optimization rule: in pointer-connected data structures like linked-lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(maxalign_t)), so this will ensure the linking pointers are in the same page as the start of the object.

Any other way of ensuring that important struct members are in the same page as the start of the object will also work.
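
As a concrete illustration (a minimal C sketch; the type and function names are invented for this example), node layouts that follow this rule look like:

#include <stdlib.h>

/* Hypothetical node layouts illustrating the rule above: the link pointers
   sit in the first 16 bytes, so with a 16-byte-aligned allocation they are
   always in the same page as the start of the object. */
struct list_node {
    struct list_node *next;   /* offset 0 */
    long              key;    /* payload follows the pointer */
};

struct tree_node {
    struct tree_node *left;   /* offset 0 */
    struct tree_node *right;  /* offset 8: still inside the first 16 bytes */
    long              key;
};

struct list_node *make_node(long key) {
    struct list_node *n = malloc(sizeof *n);   /* typically 16-byte aligned */
    if (n) { n->next = NULL; n->key = key; }
    return n;
}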


Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements with base+disp addressing modes.

Sandybridge-family has 4 cycle load-use latency for [reg + 0..2047] addressing modes when the base reg is the result of a mov load, not an ALU instruction. Or there's a penalty if reg+disp turns out to be in a different page than reg.

Based on these test results on Haswell and Skylake (and probably original SnB but we don't know), it appears that all of the following conditions must be true:

  • base reg comes from another load. (A rough heuristic for pointer-chasing, and usually means that load latency is probably part of a dep chain). If objects are usually allocated not crossing a page boundary, then this is a good heuristic. (The HW can apparently detect which execution unit the input is being forwarded from.)

  • Addressing mode is [reg] or [reg+disp8/disp32]. (Or an indexed load with an xor-zeroed index register! Usually not practically useful, but might provide some insight into the issue/rename stage transforming load uops.)

  • displacement < 2048. i.e. all bits above bit 11 are zero (a condition HW can check without a full integer adder/comparator.)

  • (Skylake but not Haswell/Broadwell): the last load wasn't a retried fast path. (So with base = the result of a 4 or 5 cycle load, it will attempt the fast path. But with base = the result of a 10-cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW).

    I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.

If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.
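
Putting the observed conditions together, a rough model of that decision might look like the following (a hedged sketch inferred from the tests above, not anything Intel documents; all names are invented, and the zeroed-index-register corner case is omitted):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of when the load port appears to attempt the 4-cycle
   fast path, based on the observations listed above. */
struct load_uop {
    bool    base_from_load;  /* base reg was produced by a mov load, not ALU */
    bool    has_index;       /* addressing mode uses an index register */
    int32_t disp;            /* encoded displacement */
    bool    prev_was_retry;  /* SKL only: last load was a retried fast path */
};

static bool attempts_fast_path(const struct load_uop *u, bool is_skylake)
{
    if (!u->base_from_load)              return false; /* pointer-chasing heuristic */
    if (u->has_index)                    return false; /* only [reg] / [reg+disp]   */
    if (u->disp < 0 || u->disp >= 2048)  return false; /* disp must be 0..2047      */
    if (is_skylake && u->prev_was_retry) return false;
    return true;  /* speculate: reg+disp is in the same page as reg */
}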

Microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):

It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page which is the same in both the virtual and physical address. i.e. the low 12 bits translate for free from virt to phys.)
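
To make that arithmetic concrete (a small illustrative C calculation under the 32 KiB / 8-way / 64 B-line geometry described above; the address is made up):

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: with 64 B lines, 64 sets and 8 ways (64*8*64 = 32 KiB),
   the set index comes entirely from the untranslated low 12 bits, so the
   cache can be indexed before / in parallel with the TLB lookup. */
int main(void)
{
    uint64_t vaddr = 0x7f1234567fc0ULL;             /* made-up virtual address */

    uint64_t page_off  = vaddr & 0xFFF;             /* identical in the physical address */
    unsigned set_index = (unsigned)((page_off >> 6) & 0x3F);  /* bits [11:6] pick 1 of 64 sets */

    printf("page offset 0x%03llx -> L1d set %u\n",
           (unsigned long long)page_off, set_index);
    return 0;
}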

Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.

As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).
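
In other words (restating that guess as a hedged sketch; the helper name is invented), the mis-speculation case is simply a carry out of the page-offset bits:

#include <stdbool.h>
#include <stdint.h>

/* Under that guess, the speculation only goes wrong when adding the small
   displacement carries out of the low 12 bits, i.e. base+disp lands in the
   next 4 KiB page. */
static bool fast_path_mispredicts(uint64_t base, uint32_t disp /* 0..2047 */)
{
    return ((base & 0xFFF) + disp) > 0xFFF;   /* carry out of bit 11 */
}
/* e.g. page offset 0xFF8, disp 8 -> true:  retried via the normal path (9c HSW / 10c SKL)
        page offset 0x008, disp 8 -> false: the 4-cycle fast path succeeds */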

Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).

I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary (e.g. an 8-byte load at page-4, i.e. starting 4 bytes before a page boundary), you pay the page-split penalty, not just the cache-line-split penalty, regardless of hugepages.)
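
As a tiny illustration of that page-split condition (a hedged sketch, not something taken from the measurements above):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An access of `size` bytes starting at `addr` crosses a 4 KiB page exactly
   when its page offset plus size runs past 0x1000; that is what triggers the
   page-split penalty described above. */
static bool is_page_split(uint64_t addr, size_t size)
{
    return (addr & 0xFFF) + size > 0x1000;
}
/* e.g. an 8-byte load at page offset 0xFFC -> true  (the "load at page-4" case)
        an 8-byte load at page offset 0xFF8 -> false (ends exactly at the boundary) */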


Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.

 (Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type             |  Base + Offset > 2048    | Base + Offset < 2048
                      |  Base + Index [+ Offset] |
----------------------+--------------------------+----------------------
Integer               |            5             |  4
MMX, SSE, 128-bit AVX |            6             |  5
X87                   |            7             |  6
256-bit AVX           |            7             |  7
 (remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)

The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).

Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.


@Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx-8] dep chain in this specific page-crossing case.


Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)

We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.

%define off  16
    lea    rdi, [buf+4096 - 16]
    mov    [rdi], rdi
    mov    [rdi+off], rdi

    mov     ebp, 100000000
.loop:

    and    rdi, rdi
    mov    rdi, [rdi]        ; base comes from AND
    mov    rdi, [rdi+off]    ; base comes from a load

    dec   ebp
    jnz  .loop

    ... sys_exit_group(0)

section .bss
align 4096
buf:    resb 4096*2

Timed with Linux perf on SKL i7-6700k.

  • off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4. (10 cycles per iteration).

  • off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW)

With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.

Without the and, and off=8, we get the expected 8c per iter: both use the fast path. (@harold confirms HSW also gets 8 here).

Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (@harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL)

It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.


On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.

Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.

Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.


On Ice Lake, this special 4 cycle fast path for pointer chasing loads is gone: GP register loads that hit in L1 now generally take 5 cycles, with no difference based on the presence of indexing or the size of the offset.
