Dependent loads reordering in CPU


Question


I have been reading Memory Barriers: A Hardware View For Software Hackers, a very popular article by Paul E. McKenney.

One of the things the paper highlights is that very weakly ordered processors like Alpha can reorder dependent loads, which seems to be a side effect of a partitioned cache.

Snippet from the paper:

1 struct el *insert(long key, long data)
2 {
3     struct el *p;
4     p = kmalloc(sizeof(*p), GPF_ATOMIC);
5     spin_lock(&mutex);
6     p->next = head.next;
7     p->key = key;
8     p->data = data; 
9     smp_wmb();
10    head.next = p;
11    spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         /* BUG ON ALPHA!!! */
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }

  1. There are 2 processors, CPU0 and CPU1.
  2. Each CPU has 2 cache banks, CB0 (odd addresses) and CB1 (even addresses).
  3. head is in CB0 and p is in CB1.
  4. insert() has a write barrier which ensures that the invalidations for lines 6-8 appear on the bus before the invalidation for line 10.
  5. However, the other processor, executing search(), can have CB0 lightly loaded and CB1 heavily loaded.
  6. This means the processor sees the latest value of head but an old value of p (because the invalidation request for p has not yet been processed by CB1).
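
For reference, McKenney's paper resolves this on Alpha by adding a read-side dependency barrier to the traversal. A minimal sketch, assuming the Linux-kernel primitive smp_read_barrier_depends() (a no-op everywhere except Alpha, and folded into READ_ONCE() in modern kernels):

struct el *search(long key)
{
    struct el *p;

    p = head.next;
    while (p != &head) {
        /* order the load of p before any loads through p (matters on Alpha only) */
        smp_read_barrier_depends();
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}

In current Linux, rcu_dereference()/READ_ONCE() hide this barrier, which is why the explicit primitive has disappeared from most code.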

Question: It looks like all architectures except Alpha honor dependent loads. For example, IA64 can reorder all of the following except dependent loads:

  1. Loads reordered after loads
  2. Loads reordered after stores
  3. Stores reordered after stores
  4. Stores reordered after loads
  5. Atomic instructions reordered with loads
  6. Atomic instructions reordered with stores

This makes me wonder what hardware support is required to prevent dependent load reordering.

One possible answer is that all other architectures (such as IA64) do not have a partitioned cache and hence would not run into this issue, so no explicit hardware support is required.

Any insights?

Solution

Short answer:

In an out-of-order processor the load-store queue is used to track and enforce memory ordering constraints. Processors such as the Alpha 21264 have the necessary hardware to prevent dependent load reordering, but enforcing this dependency could add overhead for inter-processor communication.

Long answer:

Background on dependence tracking

This is probably best explained using an example. Imagine that you had the following sequence of instructions (pseudo-code instructions used for simplicity):

ST R1, A       // store value in register R1 to memory at address A
LD B, R2       // load value from memory at address B to register R2
ADD R2, 1, R2  // add immediate value 1 to R2 and save result in R2

In this example there is a dependency between the LD and the ADD instruction. The ADD reads the value of R2 and so it cannot execute until the LD makes that value available. This dependency is through a register and it is something that the processor's issue logic can track.

However, there could also be a dependency between the ST and the LD if addresses A and B are the same. But unlike the dependence between the LD and the ADD, the possible dependence between the ST and the LD is not known at the time the instruction is issued (begins execution).

Instead of trying to detect memory dependencies at issue time, the processor keeps track of them using a structure called the load-store queue. What this structure does is keep track of the addresses of pending loads and stores for instructions that have been issued but not yet retired. If there is a memory ordering violation this can be detected and execution can be restarted from the point where the violation occurred.

So going back to the pseudo-code example, you could imagine a situation where the LD is executed before the ST (perhaps the value needed in R1 wasn't ready for some reason). But when the ST executes, it sees that addresses A and B are the same. So the LD should really have read the value that was produced by the ST, rather than the stale value that was already in the cache. As a result the LD will need to be re-executed, along with any instructions that came after the LD. There are various optimizations possible to reduce some of this overhead, but the basic idea holds.
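
To make this concrete, here is a toy model of that violation check in C; it is a sketch of the bookkeeping only, not a description of any real pipeline, and every name in it is invented for illustration. When a store finally executes, it scans the queue for younger loads to the same address that already executed out of order:

#include <stdbool.h>
#include <stdio.h>

/* Toy load-store queue entry: program-order age, address, status. */
struct lsq_entry {
    int  age;       /* program-order position */
    bool is_store;
    long addr;
    bool executed;
};

/* Called when a store executes: find any younger load to the same
 * address that already ran; that load read a stale value and must be
 * replayed along with everything after it. Returns the replay point,
 * or -1 if ordering was not violated. */
int store_executed(const struct lsq_entry *lsq, int n, int store_age, long addr)
{
    for (int i = 0; i < n; i++) {
        if (!lsq[i].is_store && lsq[i].executed &&
            lsq[i].age > store_age && lsq[i].addr == addr)
            return lsq[i].age;
    }
    return -1;
}

int main(void)
{
    /* The ST/LD example above with A == B: the load (age 1) ran early. */
    struct lsq_entry lsq[] = {
        { .age = 0, .is_store = true,  .addr = 0x100, .executed = false },
        { .age = 1, .is_store = false, .addr = 0x100, .executed = true  },
    };
    int replay = store_executed(lsq, 2, 0, 0x100);
    if (replay >= 0)
        printf("ordering violation: replay from age %d\n", replay);
    return 0;
}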

As I mentioned earlier the logic to detect this dependence exists in all out-of-order processors that allow speculative execution of memory instructions (including Alpha processors).

Memory ordering rules

However, memory ordering rules don't just constrain the order in which a processor sees results from its own memory operations. Instead, memory ordering rules constrain the relative order in which memory operations performed on one processor become visible to other processors.

Alpha example

In the case of dependent load reordering, the processor has to track this information for its own use, but the Alpha ISA does not require it to ensure that other processors see this ordering. One example of how this can occur is the following (quoted from this link):

Initially: p = &x, x = 1, y = 0

    Thread 1         Thread 2
--------------------------------
  y = 1         |
  memoryBarrier |    i = *p
  p = &y        |
--------------------------------
Can result in: i = 0

The anomalous behavior is currently only possible on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.

Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which causes an "invalidate y" to be sent to P2. This invalidate goes into the incoming "probe queue" of P2; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without an MB being executed on P2. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in P2's cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for read p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, P2 can dereference p to read the old value of y that is sitting in its cache (the inval y in P2's probe queue is still sitting there).

How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of p, you do an MB which pulls in the inval to y for sure. And you can no longer see the old cached value for y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you set up the caching properly, P2 will likely have ample opportunity to service the messages (i.e., inval) in its probe queue before it receives the data reply for "read p". Nonetheless, if you get into a situation where you have placed many things in P2's probe queue ahead of the inval to y, then it is possible that the reply to p comes back and bypasses this inval. It would be difficult for you to set up the scenario though and actually observe the anomaly.

The above addresses how current Alphas may violate what you have shown. Future Alphas could violate it due to other optimizations. One interesting optimization is value prediction.
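
To tie this back to software, here is a hedged sketch of the same litmus test in portable C11 atomics; it is my own illustration, not part of the quoted material. memory_order_consume was specified for precisely this dependent-load case, but compilers in practice promote it to the slightly stronger acquire used below (C11 threads.h assumed available):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int x = 1, y = 0;
static _Atomic(int *) p = &x;

static int thread1(void *arg)
{
    (void)arg;
    y = 1;
    /* the release store plays the role of P1's memory barrier */
    atomic_store_explicit(&p, &y, memory_order_release);
    return 0;
}

static int thread2(void *arg)
{
    (void)arg;
    /* acquire is what Alpha would need an MB for: with it, i can never be 0 */
    int *q = atomic_load_explicit(&p, memory_order_acquire);
    int i = *q;                 /* q == &x gives 1; q == &y also gives 1 */
    printf("i = %d\n", i);
    return 0;
}

int main(void)
{
    thrd_t t1, t2;
    thrd_create(&t1, thread1, NULL);
    thrd_create(&t2, thread2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}

On Alpha, the acquire load is where the MB described above would be emitted; on more strongly ordered machines it typically costs nothing extra.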

Summary

The basic hardware needed to enforce the ordering of dependent loads is already present in all out-of-order processors. But ensuring that this memory ordering is seen by all processors adds additional constraints to the handling of cache-line invalidations, and it may add constraints in other scenarios as well. However, in practice it seems likely that the potential advantages of the weak Alpha memory model for hardware designers were not worth the cost in software complexity and the added overhead of requiring more memory barriers.
