Dependent loads reordering in CPU

Problem description

I have been reading Memory Barriers: A Hardware View For Software Hackers, a very popular article by Paul E. McKenney.

One of the things the paper highlights is that very weakly ordered processors like Alpha can reorder dependent loads, which seems to be a side effect of a partitioned cache.

Excerpt from the paper:

1 struct el *insert(long key, long data)
2 {
3     struct el *p;
4     p = kmalloc(sizeof(*p), GPF_ATOMIC);
5     spin_lock(&mutex);
6     p->next = head.next;
7     p->key = key;
8     p->data = data; 
9     smp_wmb();
10    head.next = p;
11    spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         /* BUG ON ALPHA!!! */
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }

  1. There are 2 processors: CPU0 and CPU1.
  2. Each CPU has 2 cache banks: CB0 (odd addresses), CB1 (even addresses).
  3. head is in CB0 and p is in CB1.
  4. insert() has a write barrier which ensures that the invalidates for lines 6-8 reach the bus before the invalidate for line 10.
  5. However, another processor executing search can have CB0 lightly loaded and CB1 heavily loaded.
  6. This means the processor can see the latest value of head alongside a stale value of p (because CB1 has not yet processed the invalidate request for p).
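McKenney's fix for this scenario pairs the writer's smp_wmb() with a reader-side dependency barrier (smp_read_barrier_depends() in the Linux kernel). A user-space sketch of the same publish/read pattern, using C11 atomics instead of the kernel primitives (the spinlock is omitted, and the acquire load is a stand-in for the reader-side barrier Alpha needs; the names here are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct el {
    struct el *next;
    long key;
    long data;
};

static struct el head = { &head, 0, 0 };
/* published head pointer; _Atomic so ordering can be attached to it */
static _Atomic(struct el *) head_next = &head;

static void insert(long key, long data)
{
    struct el *p = malloc(sizeof(*p));
    p->next = atomic_load_explicit(&head_next, memory_order_relaxed);
    p->key = key;
    p->data = data;
    /* the release store plays the role of smp_wmb() plus the store to
       head.next: *p is fully initialised before it is published */
    atomic_store_explicit(&head_next, p, memory_order_release);
}

static struct el *search(long key)
{
    /* the acquire load stands in for the reader-side dependency
       barrier; on most CPUs other than Alpha the address dependency
       alone would order the pointer load before the loads through it */
    struct el *p = atomic_load_explicit(&head_next, memory_order_acquire);
    while (p != &head) {
        if (p->key == key)
            return p;
        p = p->next;
    }
    return NULL;
}
```

The key point is the pairing: the writer's release publication is only half of the contract; the reader must consume the pointer with matching ordering.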

Question: It looks like all architectures except Alpha honor dependent load ordering. For example, IA64 can reorder all of the following except dependent loads:

  1. Loads reordered after loads
  2. Loads reordered after stores
  3. Stores reordered after stores
  4. Stores reordered after loads
  5. Atomic instructions reordered with loads
  6. Atomic instructions reordered with stores

This makes me wonder what hardware support is required to prevent dependent load reordering.

One possible answer is that all other architectures (e.g. IA64) do not have a partitioned cache, hence would not run into this issue, and no explicit hardware support is required.

Any insights?

Answer

Short answer:

In an out-of-order processor the load-store queue is used to track and enforce memory ordering constraints. Processors such as the Alpha 21264 have the necessary hardware to prevent dependent load reordering, but enforcing this dependency could add overhead for inter-processor communication.

This is probably best explained using an example. Imagine that you had the following sequence of instructions (pseudo-code instructions used for simplicity):

ST R1, A       // store value in register R1 to memory at address A
LD B, R2       // load value from memory at address B to register R2
ADD R2, 1, R2  // add immediate value 1 to R2 and save result in R2
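The ambiguity can be seen in a C rendering of the pseudo-code (the function name is hypothetical): whether the load must observe the store depends on pointer values that are only known at run time.

```c
#include <assert.h>

/* C analogue of the three pseudo-code instructions.  Whether the LD
   depends on the ST is only known once the addresses a and b are
   resolved: neither the compiler nor the issue logic can tell
   statically whether they alias. */
int store_then_load(int *a, int *b, int r1)
{
    *a = r1;         /* ST R1, A */
    int r2 = *b;     /* LD B, R2 -- must see the store only if b == a */
    r2 = r2 + 1;     /* ADD R2, 1, R2 */
    return r2;
}
```

With distinct addresses the load reads the old memory value; with aliasing addresses it must return the freshly stored one.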

In this example there is a dependency between the LD and the ADD instruction. The ADD reads the value of R2 and so it cannot execute until the LD makes that value available. This dependency is through a register and it is something that the processor's issue logic can track.

However, there could also be a dependency between the ST and the LD, if address A and B were the same. But unlike the dependence between the LD and the ADD, the possible dependence between the ST and the LD is not known at the time the instruction is issued (begins execution).

Instead of trying to detect memory dependencies at issue time, the processor keeps track of them using a structure called the load-store queue. What this structure does is keep track of the addresses of pending loads and stores for instructions that have been issued but not yet retired. If there is a memory ordering violation this can be detected and execution can be restarted from the point where the violation occurred.

So going back to the pseudo-code example, you could imagine a situation where the LD is executed before the ST (perhaps the value needed in R1 wasn't ready for some reason). But when the ST executes it sees that address A and B are the same. So the LD should really have read the value that was produced by the ST, rather than the stale value that was already in the cache. As a result the LD will need to be re-executed, along with any instructions that came after the LD. There are various optimizations possible to reduce some of this overhead, but the basic idea holds.
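A toy version of the check just described might look like this (purely illustrative; a real load-store queue also tracks instruction age, access size, and partial overlaps):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the memory-ordering check in a load-store queue: when
   an older store's address finally resolves, a younger load that has
   already executed speculatively to the same address read a stale
   value and must be squashed and replayed. */
struct pending_load {
    unsigned addr;      /* address the load accessed */
    bool     executed;  /* did the load already run ahead of the store? */
};

/* Returns true if the younger load must be replayed when a store to
   store_addr resolves. */
bool must_replay(unsigned store_addr, const struct pending_load *load)
{
    return load->executed && load->addr == store_addr;
}
```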

As I mentioned earlier the logic to detect this dependence exists in all out-of-order processors that allow speculative execution of memory instructions (including Alpha processors).

However, memory ordering rules don't just constrain the order in which a processor sees the results of its own memory operations. They also constrain the relative order in which memory operations performed on one processor become visible to other processors.

In the case of dependent load reordering, the processor has to track this information for its own use, but the Alpha ISA does not require it to ensure that other processors also see this ordering. One example of how this can occur is the following (quoted from this link):

Initially: p = & x, x = 1, y = 0

    Thread 1         Thread 2
--------------------------------
  y = 1         |    
  memoryBarrier |    i = *p
  p = & y       |
--------------------------------
Can result in: i = 0

The anomalous behavior is currently only possible on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.

Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which causes an "invalidate y" to be sent to P2. This invalidate goes into the incoming "probe queue" of P2; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without an MB being done on P2. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in P2's cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for read p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, P2 can dereference p to read the old value of y that is sitting in its cache (the inval y in P2's probe queue is still sitting there).

How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of p, you do an MB which pulls in the inval to y for sure. And you can no longer see the old cached value for y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you set up the caching properly, P2 will likely have ample opportunity to service the messages (i.e., the inval) in its probe queue before it receives the data reply for "read p". Nonetheless, if you get into a situation where you have placed many things in P2's probe queue ahead of the inval to y, then it is possible that the reply to p comes back and bypasses this inval. It would be difficult for you to set up the scenario, though, and actually observe the anomaly.

The above addresses how current Alphas may violate what you have shown. Future Alphas could violate it due to other optimizations. One interesting optimization is value prediction.
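The orderings the quoted explanation calls for can be written down in user-space C11 (a sketch, not Alpha code: the release store stands in for Thread 1's memoryBarrier, and the acquire load supplies the reader-side ordering that the MB on P2 provides):

```c
#include <assert.h>
#include <stdatomic.h>

static int x = 1, y = 0;
static _Atomic(int *) p = &x;    /* initially: p = &x */

static void thread1(void)
{
    y = 1;
    /* the release ordering plays the role of the memoryBarrier: the
       store to y becomes visible no later than the new pointer */
    atomic_store_explicit(&p, &y, memory_order_release);
}

static int thread2(void)
{
    /* acquire (consume would also suffice) supplies the ordering that
       the MB after "read p" provides on the 21264 */
    int *q = atomic_load_explicit(&p, memory_order_acquire);
    return *q;    /* if q == &y, this must read 1, never 0 */
}
```

Run single-threaded the race cannot be observed, of course; the point is the pairing: a release publication on one side, and an acquire (or dependency-ordered) consumption on the other.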

Summary

The basic hardware needed to enforce the ordering of dependent loads is already present in all out-of-order processors. But ensuring that this memory ordering is seen by all processors adds additional constraints to handling of cache-line invalidation. And it may add additional constraints in other scenarios as well. However, in practice it seems likely that the potential advantages of the weak Alpha memory model for hardware designers were not worth the cost in software complexity and added overhead of requiring more memory barriers.

