Store forwarding Address vs Data: What's the difference between STD and STA in the Intel Optimization guide?


Question

I'm wondering if any Intel experts out there can tell me the difference between STD and STA with respect to the Intel Skylake core.

In the Intel optimization guide, there's a picture describing the "super-scalar ports" of the Intel Cores.

Here's another picture, from page 78, which describes "Store Address" and "Store Data":

  1. Prepares the store forwarding and store retirement logic with the address of the data being stored.

  2. Prepares the store forwarding and store retirement logic with the data being stored.

Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle, I was curious what the difference was between these two.

It seems "natural" to me that store-forwarding would be done to the address of the data. But I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done. Are there any assembly / optimization experts out there that can help me understand exactly the difference between STD and STA is?

Answer

Intel CPUs have been splitting stores into store-address and store-data since the first P6-family microarchitecture, Pentium Pro.
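
To make the split concrete, here is a minimal C sketch (my example, not from the answer) of a single store and the two uops it typically becomes; the instruction in the comments is just common compiler output under the System V calling convention.

```c
/* Minimal sketch (assumed codegen, not specified by the answer):
   one C store and the uops it decodes into. */
void store_value(int *p, int val)
{
    *p = val;   /* typically: mov DWORD PTR [rdi], esi
                   -> store-address uop (STA): computes and records the address
                      (p2/p3/p7 on Haswell/Skylake)
                   -> store-data uop (STD): supplies the data to be stored (p4)
                   The two uops micro-fuse into one fused-domain uop. */
}
```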

But store-address and store-data uops can micro-fuse into one fused-domain uop. On Sandy/IvyBridge, indexed addressing modes are un-laminated as described in Intel's optimization manual. But Haswell and later can keep them micro-fused even in the ROB, so they aren't un-laminated. See Micro fusion and addressing modes. (Intel doesn't mention this, and Agner Fog hasn't had time to test extensively for Haswell/Skylake so his usually-good microarch PDF doesn't even mention un-lamination at all. But you should still definitely read it to learn more about how uops work and how instructions are decoded and go through the pipeline. See also other x86 performance links in the x86 tag wiki)
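
As a sketch of what "indexed addressing mode" means here (hypothetical functions, assuming typical x86-64 codegen):

```c
/* Sketch: the store's addressing mode decides where the STA uop can run
   and, on Sandy/IvyBridge, whether the pair stays micro-fused. */
void store_simple(int *p, int val)
{
    *p = val;        /* mov [rdi], esi        -- simple [base] mode:
                        STA eligible for port 7 (Haswell/Skylake);
                        stays micro-fused */
}

void store_indexed(int *base, long i, int val)
{
    base[i] = val;   /* mov [rdi+rsi*4], edx  -- indexed [base+index*scale]:
                        STA limited to p2/p3; un-laminated on Sandy/IvyBridge,
                        kept micro-fused on Haswell and later */
}
```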

Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle

Ports 2 and 3 can also run store-address uops on their AGUs, leaving the load-data part of the port unused that cycle. Port7 only has a dedicated store-AGU for simple addressing modes.

Store addressing modes with an index register can't use port 7, only p2/p3. But if you do use "simple" addressing modes for stores, the peak throughput is 2 loads + 1 store per clock.
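
For example (a sketch of mine, not from the answer): a loop can be written so the store goes through a plain pointer rather than base+index, making its store-address uop eligible for the port-7 AGU and leaving p2/p3 free for the two loads. Whether a given compiler actually emits it that way is not guaranteed.

```c
#include <stddef.h>

/* Sketch: keep the store's addressing mode "simple" ([reg] or [reg+disp])
   so its STA uop can go to port 7, while p2/p3 handle the two loads.
   Compilers may or may not generate exactly this form. */
void add_arrays(int *dst, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        *dst++ = a[i] + b[i];   /* 2 loads (possibly indexed) + 1 store
                                   through a plain pointer per iteration */
    }
}
```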

On Nehalem and earlier (P6 family), p2 was the only load port, p3 was the store-address port, and p4 was store-data.

On IvyBridge/Sandybridge, there weren't separate ports for store-address uops, they always just ran on the AGU (Address Generation Unit) in the load ports (p23). With 256b loads / stores, the AGU was only needed every other cycle (256b load or store uops occupy the load or store-data ports for 2 cycles, but the load ports can accept a store-address uop during that 2nd cycle). So 2 load / 1 store per clock was in theory sustainable on Sandybridge, but only if most of it was with AVX 256-bit vector loads / stores running as two 128-bit halves.

Haswell added the dedicated store-AGU on port7 when it widened the load/store execution units to 256b: with full-width 256b loads completing in a single cycle, a steady supply of loads leaves no spare cycles where a load port's AGU is free to run a store-address uop.

A store-address uop writes the address (and width, I guess) into the store buffer (aka Memory Order Buffer in Intel's terminology). Having this happen separately, and possibly before the data to be stored is even ready, lets later loads (in program order) detect whether they overlap the store or not.
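
A sketch of why that matters (hypothetical function and names, assuming the pointers don't alias): the store's data may come from a long dependency chain, but once its address is known the hardware can see that a later load doesn't overlap it and let the load run ahead.

```c
/* Sketch: the STA uop can execute early, so the independent load below
   doesn't have to wait for slow_value. */
int independent_load(int *a, const int *b, long i, int slow_value)
{
    a[i] = slow_value;   /* STA executes as soon as a and i are ready;
                            STD still waits for slow_value */
    return b[i];         /* independent load: can run ahead of the store
                            once the addresses are known not to overlap */
}
```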

Out-of-order execution of loads when there are pending stores with unknown address is problematic: a wrong guess means having to roll back the pipeline. (I think the machine_clears.memory_ordering perf counter event includes this. It is possible to get non-zero counts for this from single-threaded code, but I forget if I had definite evidence that Skylake sometimes speculatively guesses that loads don't overlap unknown-address stores).

As David Kanter points out in his Haswell microarch writeup, a load uop also needs to probe the store buffer to check for forwarding / conflicts, so an execution unit that only runs store-address uops is cheaper to build.

Anyway, I'm not sure what the performance implications would be if Intel redesigned things so port7 had a full AGU that could handle indexed addressing modes, too, and made store-address uops only run on p7, not p2/p3.

That would stop store-address uops from "stealing" p23, which does happen and which reduces max sustained L1D bandwidth from 96 bytes / cycle (2 load + 1 store of 32-byte YMM vectors) down to ~81 bytes / cycle for Skylake according to a table in Intel's optimization manual. But under the right circumstances, Skylake can sustain 2 loads + 1 store per clock of 4-byte operands, so maybe that 81-byte / cycle number is limited by some other microarchitectural limit. The peak is 96B/clock, but apparently that can't happen back-to-back indefinitely.
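
The 2-load + 1-store access pattern behind those peak numbers looks roughly like this (my sketch with AVX intrinsics; sustaining anywhere near the peak also requires the data to stay hot in L1D and the store-address uops to go to port 7):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch of the 96 B/cycle peak pattern: two 32-byte loads + one 32-byte
   store per iteration. Assumes n is a multiple of 8; real sustained
   throughput is lower, as noted above. */
void axpy256(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               /* 32-byte load */
        __m256 vb = _mm256_loadu_ps(b + i);               /* 32-byte load */
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb)); /* 32-byte store:
                                                             STA + STD uops */
    }
}
```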

One downside to stopping store-address uops from running on p23 is that it would take longer for store addresses to be known, maybe delaying loads more.

I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done.

A store/reload can have the load take the data from the store buffer, instead of waiting for it to commit to L1D and reading it from there.

  • How does store to load forwarding happens in case of unaligned memory access?
  • Store-to-Load Forwarding and Memory Disambiguation in x86 Processors

Store/reload can happen when a function spills some registers before calling a function, or as part of passing args on the stack (especially with crappy stack-args calling conventions that pass all args on the stack). Or passing something by reference to a non-inline function. Or in a histogram, if the same bin is hit repeatedly, you're basically doing a memory-destination increment in a loop.
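
The histogram case, for instance, looks like this (a sketch, not code from the answer): when data[i] repeats, each increment reloads the count that the previous iteration just stored, and that load is served by store-to-load forwarding from the store buffer.

```c
#include <stddef.h>

/* Sketch: repeated hits to the same bin turn counts[bin]++ into a
   store/reload chain; the reload gets its data forwarded from the store
   buffer instead of waiting for the older store to commit to L1D. */
void histogram(unsigned counts[256], const unsigned char *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        counts[data[i]]++;   /* load + increment + store; store forwarding
                                kicks in when the same bin repeats */
    }
}
```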
