How does store-to-load forwarding happen in the case of unaligned memory access?

Question

I understand that the load/store queue architecture facilitates store-to-load forwarding and the disambiguation of out-of-order speculative loads. This is accomplished by matching load and store addresses.

This address-matching technique will not work if the earlier store is to an unaligned address and the load depends on it. My question is: if this later load is issued out of order, how does it get disambiguated against earlier stores? Or what policies do modern architectures use to handle this condition?

Answer

Short

The short answer is that it depends on the architecture, but in theory unaligned operations don't necessarily prevent the architecture from performing store forwarding. As a practical matter, however, the much larger number of forwarding possibilities that unaligned load operations represent means that forwarding in such cases may not be supported at all, or may be less well supported than in the aligned cases.

Long

The long answer is that any particular architecture will have some scenarios it can handle efficiently, and others it cannot.

Old or very simple architectures may not have any store-forwarding capabilities at all. These architectures may not execute out of order at all, or may have some out-of-order capability but may simply wait until all prior stores have committed before executing any load.

The next level of sophistication is an architecture that at least has some kind of CAM to check prior store addresses. This architecture may not have store forwarding, but may allow loads to execute in-order or out-of-order once the load address and all prior store addresses are known (and there is no match). If there is a match with a prior store, the architecture may wait until the store commits before executing the load (which will read the stored value from the L1, if any).

Next up, we have architectures like the above that wait until prior store addresses are known and also do store forwarding. The behavior is the same as above, except that when a load address hits a prior store, the store data is forwarded to the load without waiting for it to commit to L1.
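As a rough illustration of that lookup, here is a minimal software model of a store queue that forwards only on an exact address match. The names `StoreEntry` and `lookup`, and the policy of stalling on any other overlap, are assumptions made for this sketch and do not describe any particular CPU.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    addr: int    # byte address of the store
    size: int    # number of bytes written
    data: bytes  # store data in memory order, len(data) == size


def lookup(store_queue: List[StoreEntry], load_addr: int,
           load_size: int) -> Tuple[str, Optional[bytes]]:
    """Search older stores youngest-first (store_queue is oldest-first).

    Returns ("forward", data) when the youngest overlapping store has the
    same address and is at least as large as the load, ("stall", None) when
    the youngest overlapping store cannot be forwarded under this policy,
    and ("cache", None) when no prior store overlaps the load.
    """
    for st in reversed(store_queue):
        overlaps = (st.addr < load_addr + load_size
                    and load_addr < st.addr + st.size)
        if not overlaps:
            continue
        if st.addr == load_addr and load_size <= st.size:
            return ("forward", st.data[:load_size])
        # Assumed policy: wait for the store to commit, then read L1.
        return ("stall", None)
    return ("cache", None)
```

In this model, a 4-byte load at `0x102` that falls inside an 8-byte store at `0x100` is not forwarded even though every byte it needs is in the queue, which is exactly the kind of case a more general comparison would catch.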

A big problem with the above designs is that loads still can't execute until all prior store addresses are known. This inhibits out-of-order execution. So next up, we add speculation: if a load at a particular IP has been observed to not depend on prior stores, we just let it execute (read its value) even if prior store addresses aren't known. At retirement there will be a second check to ensure that the assumption that there was no hit to a prior store was correct, and if not there will be some type of pipeline clear and recovery. Loads that are predicted to hit a prior store wait until the store data (and possibly address) is available, since they'll need store forwarding.[1]
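The speculation step can be sketched as a tiny predictor keyed by the load's instruction pointer. This is a deliberately simplified model: the class name `DependencePredictor`, its methods, and the choice to let never-seen loads speculate are all assumptions for illustration, and real predictors are considerably more elaborate.

```python
class DependencePredictor:
    """Toy memory-dependence predictor keyed by load IP (hypothetical)."""

    def __init__(self):
        # IPs of loads that were caught depending on a prior store.
        self.conflicting_ips = set()

    def may_speculate(self, load_ip: int) -> bool:
        """True if the load may execute before prior store addresses are known."""
        return load_ip not in self.conflicting_ips

    def retire(self, load_ip: int, executed_early: bool,
               hit_prior_store: bool) -> bool:
        """Second check at retirement.

        Records the dependence for future predictions and returns True
        when the load ran early on a wrong assumption, meaning a pipeline
        recovery is needed.
        """
        if hit_prior_store:
            self.conflicting_ips.add(load_ip)
        return executed_early and hit_prior_store
```

After one misprediction at a given IP, `may_speculate` returns `False` for that load, so later instances wait for the store data instead of triggering another recovery.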

That's kind of where we are at today. There are yet more advanced techniques, many of which fall under the banner of memory renaming, but as far as I know they are not widely deployed.

Finally, we get to answer your original question: how all of this interacts with unaligned loads. Most of the above doesn't change. We only need to be more precise about the definition of a hit, that is, about when a load reads data from one of the prior stores discussed above.

You have several scenarios:

  1. A later load is totally contained within a prior store. This means that all the bytes read by a load come from the earlier store.
  2. A later load is partially contained within a prior store. This means that one or more bytes of the load come from an earlier store, but one or more bytes do not.
  3. A later load is not contained at all within any earlier store.
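The three scenarios can be made concrete with a small classifier over byte ranges. The function name `classify` and its return strings are hypothetical, and the sketch compares a load against a single prior store only.

```python
def classify(load_addr: int, load_size: int,
             store_addr: int, store_size: int) -> str:
    """Classify a later load against one prior store.

    Returns "full" (scenario 1), "partial" (scenario 2) or "none"
    (scenario 3) by comparing the sets of byte addresses touched.
    """
    load_bytes = set(range(load_addr, load_addr + load_size))
    store_bytes = set(range(store_addr, store_addr + store_size))
    common = load_bytes & store_bytes
    if not common:
        return "none"
    if common == load_bytes:
        return "full"
    return "partial"
```

For an 8-byte store at `0x100`, a 4-byte load at `0x101` is "full", one at `0x106` is "partial" (bytes `0x108` and `0x109` come from elsewhere), and one at `0x108` is "none".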

On most platforms, all three possible scenarios exist regardless of alignment. However, in the case of aligned values, the second case (partial overlap) can only occur when a larger load follows a smaller store, and if the platform supports only one size of access, situation (2) cannot occur at all.

Theoretically, direct[2] store-to-load forwarding is possible in scenario (1), but not in scenarios (2) or (3).

To catch many practical cases of (1), you only need to check that the store and load addresses are the same, and that the load is not larger than the store. This still misses cases where a small load is fully contained in a larger store, whether aligned or not.
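The gap between the cheap check and full containment can be seen by writing both predicates side by side. The function names here are hypothetical, and this is a sketch of the comparison logic only, not of how any hardware implements it.

```python
def simple_hit(load_addr: int, load_size: int,
               store_addr: int, store_size: int) -> bool:
    """Cheap check: same start address, load no larger than the store."""
    return load_addr == store_addr and load_size <= store_size


def fully_contained(load_addr: int, load_size: int,
                    store_addr: int, store_size: int) -> bool:
    """General check for scenario (1): every load byte lies in the store."""
    return (store_addr <= load_addr
            and load_addr + load_size <= store_addr + store_size)
```

A 4-byte load at `0x104` after an 8-byte store at `0x100` satisfies `fully_contained` but not `simple_hit`, so a design using only the cheap check would miss that forwarding opportunity.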

Where alignment helps is that the checks above are easier: you need to compare fewer bits of the addresses (e.g., a 32-bit load can ignore the bottom two bits of the address), and there are fewer possibilities to compare. An aligned 4-byte load can only be contained in an 8-byte store in two possible ways (at the store address or at the store address + 4), while a misaligned load can be fully contained in five different ways (at a load address offset by any of 0, 1, 2, 3 or 4 bytes from the store).
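The two-versus-five count can be checked by enumerating the offsets at which a load fits entirely inside a store. The helper name `containing_offsets` is hypothetical, and the aligned case assumes the store itself is aligned to its own size, so that an offset that is a multiple of the load size gives a naturally aligned load.

```python
from typing import List


def containing_offsets(load_size: int, store_size: int,
                       aligned: bool) -> List[int]:
    """Offsets from the store address at which the load is fully contained.

    With aligned=True only offsets that are multiples of the load size are
    considered; with aligned=False every byte offset is a candidate.
    """
    step = load_size if aligned else 1
    return list(range(0, store_size - load_size + 1, step))
```

`containing_offsets(4, 8, aligned=True)` gives `[0, 4]` and `containing_offsets(4, 8, aligned=False)` gives `[0, 1, 2, 3, 4]`, matching the counts in the text.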

These differences are important in hardware, where the store queue has to look something like a fully-associative CAM implementing these comparisons. The more general the comparison, the more hardware is needed (or the longer the latency to do a lookup). Early hardware may have only caught the "same address" cases of (1), but the trend is towards catching more cases, both aligned and unaligned. Here is a great overview.


[1] How best to do this type of memory-dependence speculation is something on which WARF holds patents, and based on which it is actively suing all sorts of CPU manufacturers.

[2] By direct I mean from a single store to a following load. In principle, you might also have more complex forms of store forwarding that can take parts of multiple prior stores and forward them to a single load, but it isn't clear to me if current architectures implement this.
