Can modern x86 implementations store-forward from more than one prior store?

Question

In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load?

For example, consider the following sequence:

mov [rdx + 0], eax
mov [rdx + 2], eax
mov ax, [rdx + 1]

The final 2-byte load takes its second byte from the immediate preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?
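To make the byte-level overlap concrete, here is a minimal Python sketch that works out which store is the most recent writer of each byte the load reads. The function name `byte_sources` and the (address, size) tuples are illustrative assumptions; this only models which store last wrote each byte, not how any real store buffer behaves.

```python
def byte_sources(stores, load):
    """Map each byte address of `load` to the index of the youngest store covering it.

    `stores` is a list of (address, size) pairs in program order (oldest first);
    `load` is an (address, size) pair. A byte that no store covers maps to None.
    """
    load_addr, load_size = load
    sources = {}
    for byte in range(load_addr, load_addr + load_size):
        sources[byte] = None
        for index, (store_addr, store_size) in enumerate(stores):
            if store_addr <= byte < store_addr + store_size:
                sources[byte] = index  # a later store overrides an earlier one
    return sources


# Offsets relative to rdx, taken from the sequence in the question.
stores = [(0, 4), (2, 4)]  # mov [rdx + 0], eax ; mov [rdx + 2], eax
load = (1, 2)              # mov ax, [rdx + 1]
print(byte_sources(stores, load))  # {1: 0, 2: 1}
```

Byte 1 comes from the first store (index 0) and byte 2 from the second (index 1), so no single store can supply the whole load.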

Note that by store-forwarding here I'm including any mechanism that can satisfy the read from stores still in the store buffer, rather than waiting for them to commit to L1, even if it is a slower path than the best-case "forwards from a single store" case.

Recommended answer

No.

At least, not on Haswell, Broadwell or Skylake processors. On other Intel processors, the restrictions are either similar (Sandy Bridge, Ivy Bridge) or even tighter (Nehalem, Westmere, Pentium Pro/II/III/4). On AMD, similar limitations apply.

From Agner Fog's excellent optimization manuals:

The microarchitecture of Intel and AMD CPUs

§ 10.12 Store forwarding stalls

The processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding works in the following cases:

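  • When a write of 64 bits or less is followed by a read of the same size and the same address, regardless of alignment.
  • When a write of 128 or 256 bits is followed by a read of the same size and the same address, fully aligned.
  • When a write of 64 bits or less is followed by a read of a smaller size which is fully contained in the write address range, regardless of alignment.
  • When an aligned write of any size is followed by a read of the two halves or the four quarters, etc., with their natural alignment within the write address range.
  • When an aligned write of 128 bits or 256 bits is followed by a read of 64 bits or less that does not cross an 8-byte boundary.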

A delay of 2 clocks occur if the memory block crosses a 64-bytes cache line boundary. This can be avoided if all data have their natural alignment.

Store forwarding fails in the following cases:

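  • When a write of any size is followed by a read of a larger size
  • When a write of any size is followed by a partially overlapping read
  • When a write of 128 bits is followed by a smaller read crossing the boundary between the two 64-bit halves
  • When a write of 256 bits is followed by a 128-bit read crossing the boundary between the two 128-bit halves
  • When a write of 256 bits is followed by a read of 64 bits or less crossing any boundary between the four 64-bit quarters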

A failed store forwarding takes 10 clock cycles more than a successful store forwarding. The penalty is much higher - approximately 50 clock cycles - after a write of 128 or 256 bits which is not aligned by at least 16.

Emphasis mine
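The rules quoted above for writes of 64 bits or less can be sketched as a small Python classifier for one store and one later load. This is a deliberately simplified model: the name `classify` and the (address, size) tuples are assumptions made for illustration, the 128-bit and 256-bit rules are left out, and the cache-line-crossing delay is ignored.

```python
def classify(store, load):
    """Classify one store/load pair under the quoted rules for small stores.

    `store` and `load` are (address, size) pairs with sizes in bytes.
    Returns "forwards", "fails", or "no overlap".
    """
    store_addr, store_size = store
    load_addr, load_size = load
    if store_size > 8:
        raise ValueError("this sketch only models stores of 64 bits or less")
    store_end = store_addr + store_size
    load_end = load_addr + load_size
    if load_end <= store_addr or store_end <= load_addr:
        return "no overlap"  # the load reads none of the stored bytes
    if store_addr <= load_addr and load_end <= store_end:
        return "forwards"    # same size and address, or smaller and fully contained
    return "fails"           # larger read, or partially overlapping read


print(classify((0, 4), (0, 4)))  # forwards
print(classify((0, 8), (3, 2)))  # forwards
print(classify((0, 4), (0, 8)))  # fails
print(classify((0, 4), (2, 4)))  # fails
```

The two failing calls correspond to the first two failure cases in the list: a read larger than the write, and a partially overlapping read.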

The microarchitecture of Intel and AMD CPUs

§ 11.12 Store forwarding stalls

The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.
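Whether an operand crosses a cache line boundary depends only on its address and size. A minimal Python sketch of that check follows, assuming 64-byte lines as in the quoted text; the helper name `crosses_cache_line` is hypothetical.

```python
CACHE_LINE_BYTES = 64  # line size assumed from the quoted text


def crosses_cache_line(address, size):
    """Return True if the bytes [address, address + size) span two cache lines."""
    first_line = address // CACHE_LINE_BYTES
    last_line = (address + size - 1) // CACHE_LINE_BYTES
    return first_line != last_line


print(crosses_cache_line(56, 8))  # False: bytes 56..63 stay in one line
print(crosses_cache_line(60, 8))  # True: bytes 60..67 straddle address 64
```

A naturally aligned operand of 64 bytes or less never crosses a line, which is why natural alignment avoids this penalty.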

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.

Emphasis mine

A common point across microarchitectures that Agner Fog's document points out is that store forwarding is more likely to happen if the write was aligned and the reads are halves or quarters of the written value.

A test with the following tight loop:

mov [rsp-16], eax
mov [rsp-12], ebx
mov ecx, [rsp-15]

shows that the ld_blocks.store_forward PMU counter does indeed increment. This event is documented as follows:

ld_blocks.store_forward [This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when:

  • preceding store conflicts with the load (incomplete overlap)
  • store forwarding is impossible due to u-arch limitations
  • preceding lock RMW operations are not forwarded
  • store has the no-forward bit set (uncacheable/page-split/masked stores)
  • all-blocking stores are used (mostly, fences and port I/O)

This indicates that the store-forwarding does indeed fail when a read only partially overlaps the most recent earlier store (even if it is fully contained when even earlier stores are considered).
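That conclusion can be expressed as a short Python sketch: find the youngest pending store that overlaps the load, and forward only if that one store contains the load entirely. The function `classify_load` and its return strings are illustrative assumptions, and the model covers only the small (64 bits or less) stores used in the test, not the 128-bit and 256-bit cases.

```python
def classify_load(stores, load):
    """Classify a load against pending stores given oldest first.

    Returns "forwarded" if the youngest store overlapping the load contains it
    entirely, "blocked" if that store only partially overlaps it, and
    "no pending store" if no store overlaps the load at all.
    """
    load_start, load_size = load
    load_end = load_start + load_size
    for store_start, store_size in reversed(stores):  # youngest first
        store_end = store_start + store_size
        if load_end <= store_start or store_end <= load_start:
            continue  # no bytes in common; look at the next older store
        if store_start <= load_start and load_end <= store_end:
            return "forwarded"
        return "blocked"  # older stores are not combined with this one
    return "no pending store"


# Offsets relative to rsp, taken from the test loop.
test_loop = [(-16, 4), (-12, 4)]           # mov [rsp-16], eax ; mov [rsp-12], ebx
print(classify_load(test_loop, (-15, 4)))  # blocked
print(classify_load(test_loop, (-12, 4)))  # forwarded
```

The load at rsp-15 takes three bytes from the older store and one from the younger, so the sketch reports it as blocked, matching the counter increment observed in the test. The sequence from the question, stores at offsets 0 and 2 with a 2-byte load at offset 1, is classified as blocked for the same reason.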
