Can modern x86 implementations store-forward from more than one prior store?

Question

In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load?

For example, consider the following sequence:

mov [rdx + 0], eax
mov [rdx + 2], eax
mov ax, [rdx + 1]

The final 2-byte load takes its second byte from the immediately preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?
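
To make the byte-level overlap concrete, here is a minimal sketch that works out which store supplies each byte of the load. The helper name `byte_sources` and the representation of a store as an `(address, size)` pair are assumptions made for illustration; addresses are offsets relative to rdx.

```python
def byte_sources(stores, load_addr, load_size):
    """For each byte of the load, return the index of the youngest
    earlier store that covers it, or None if no store covers it."""
    sources = []
    for addr in range(load_addr, load_addr + load_size):
        src = None
        # Stores are listed in program order, oldest first,
        # so a later match overrides an earlier one.
        for i, (st_addr, st_size) in enumerate(stores):
            if st_addr <= addr < st_addr + st_size:
                src = i
        sources.append(src)
    return sources

# Both stores write eax (4 bytes); the load reads ax (2 bytes).
stores = [(0, 4), (2, 4)]
print(byte_sources(stores, load_addr=1, load_size=2))  # [0, 1]
```

The result `[0, 1]` says the first byte comes from the older store and the second byte from the younger one, which is exactly the situation the question asks about.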

Note that by store-forwarding here I'm including any mechanism that can satisfy the reads from stores still in the store buffer, rather than waiting for them to commit to L1, even if it is a slower path than the best-case "forwards from a single store" case.

Recommended answer

No.

At least, not on Haswell, Broadwell or Skylake processors. On other Intel processors, the restrictions are either similar (Sandy Bridge, Ivy Bridge) or even tighter (Nehalem, Westmere, Pentium Pro/II/III/4). On AMD, similar limitations apply.

From Agner Fog's excellent optimization manuals:

The microarchitecture of Intel and AMD CPUs

§ 10.12 Store forwarding stalls

The processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding works in the following cases:

  • When a write of 64 bits or less is followed by a read of the same size and the same address, regardless of alignment.
  • When a write of 128 or 256 bits is followed by a read of the same size and the same address, fully aligned.
  • When a write of 64 bits or less is followed by a read of a smaller size which is fully contained in the write address range, regardless of alignment.
  • When an aligned write of any size is followed by two reads of the two halves, or four reads of the four quarters, etc. with their natural alignment within the write address range.
  • When an aligned write of 128 bits or 256 bits is followed by a read of 64 bits or less that does not cross an 8 bytes boundary.

A delay of 2 clocks occur if the memory block crosses a 64-bytes cache line boundary. This can be avoided if all data have their natural alignment.

Store forwarding fails in the following cases:

  • When a write of any size is followed by a read of a larger size
  • When a write of any size is followed by a partially overlapping read
  • When a write of 128 bits is followed by a smaller read crossing the boundary between the two 64-bit halves
  • When a write of 256 bits is followed by a 128 bit read crossing the boundary between the two 128-bit halves
  • When a write of 256 bits is followed by a read of 64 bits or less crossing any boundary between the four 64-bit quarters

A failed store forwarding takes 10 clock cycles more than a successful store forwarding. The penalty is much higher - approximately 50 clock cycles - after a write of 128 or 256 bits which is not aligned by at least 16.

Emphasis added
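
The rules quoted above for writes of 64 bits or less reduce to a containment check. The sketch below is a simplified model of those rules only: the function name `classify` is hypothetical, it deliberately ignores the 128-bit and 256-bit cases and all alignment effects, and it describes the quoted rules rather than any documented hardware interface.

```python
def classify(store_addr, store_size, load_addr, load_size):
    """Classify a load against ONE earlier store of 8 bytes or less.

    Returns "forward" if the load is fully contained in the store,
    "stall" if it is larger or only partially overlapping, and
    "no overlap" if the two address ranges are disjoint.
    """
    if store_size > 8:
        raise ValueError("vector-store rules are not modelled here")
    store_end = store_addr + store_size
    load_end = load_addr + load_size
    if load_end <= store_addr or load_addr >= store_end:
        return "no overlap"
    if store_addr <= load_addr and load_end <= store_end:
        return "forward"
    return "stall"

print(classify(0, 8, 0, 8))  # same size, same address: forward
print(classify(0, 8, 3, 2))  # smaller read, fully contained: forward
print(classify(0, 4, 0, 8))  # read larger than the write: stall
print(classify(0, 4, 2, 4))  # partially overlapping read: stall
```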

The microarchitecture of Intel and AMD CPUs

§ 11.12 Store forwarding stalls

The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.

Emphasis added
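
The Skylake figures quoted above can be combined into a rough back-of-the-envelope estimate. This is only an illustration: the function name `skylake_forward_latency` is hypothetical, and adding the approximately 11-cycle penalty directly onto the best-case latency is an assumption, not a formula given in the manual. Misalignment and cache-line-crossing penalties are left out.

```python
def skylake_forward_latency(operand_bits, forwarding_fails=False):
    """Rough store-to-load latency in clock cycles, from the quoted figures."""
    # Best case: 4 cycles for 32- or 64-bit operands, 5 for other sizes.
    best_case = 4 if operand_bits in (32, 64) else 5
    # "Approximately 11 clock cycles extra" when forwarding fails
    # (assumed here to be simply additive).
    penalty = 11 if forwarding_fails else 0
    return best_case + penalty

print(skylake_forward_latency(32))        # 4
print(skylake_forward_latency(16))        # 5
print(skylake_forward_latency(32, True))  # 15
```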

A common point across microarchitectures that Agner Fog's document points out is that store forwarding is more likely to happen if the write was aligned and the reads are halves or quarters of the written value.

A test with the following tight loop:

mov [rsp-16], eax
mov [rsp-12], ebx
mov ecx, [rsp-15]

shows that the ld_blocks.store_forward PMU counter does indeed increment. This event is documented as follows:

ld_blocks.store_forward
This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when:

  • preceding store conflicts with the load (incomplete overlap)
  • store forwarding is impossible due to u-arch limitations
  • preceding lock RMW operations are not forwarded
  • store has the no-forward bit set (uncacheable/page-split/masked stores)
  • all-blocking stores are used (mostly, fences and port I/O)

This indicates that the store-forwarding does indeed fail when a read only partially overlaps the most recent earlier store (even if it is fully contained when even earlier stores are considered).
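
The observed behaviour can be summarised in a small model: only the youngest store that overlaps the load is considered, and the load is forwarded only if it lies entirely inside that store. The sketch below is a simplification consistent with the conclusion above, not a description of the actual hardware logic; the name `load_outcome` and the `(address, size)` store representation are assumptions, with addresses given as offsets from rsp.

```python
def load_outcome(stores, load_addr, load_size):
    """Decide the fate of a load given earlier stores in program order."""
    load_end = load_addr + load_size
    # Walk the stores youngest first; the first overlapping store decides.
    for st_addr, st_size in reversed(stores):
        st_end = st_addr + st_size
        if load_addr < st_end and st_addr < load_end:
            contained = st_addr <= load_addr and load_end <= st_end
            return "forwarded" if contained else "blocked"
    return "from cache"

# The test loop: two 4-byte stores, then a 4-byte load at rsp-15.
stores = [(-16, 4), (-12, 4)]
print(load_outcome(stores, -15, 4))  # blocked
print(load_outcome(stores, -16, 4))  # forwarded
```

In this model the load at rsp-15 is blocked even though the two stores together cover all four of its bytes, which matches the incrementing ld_blocks.store_forward counter.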
