Can any reasonable CPU implementation give foo = 2 in this case?


Question


Reading a very interesting blog post by Dan Luu about advances in x86 architecture over the past few decades, he says:

If we set _foo to 0 and have two threads that both execute incl (_foo) 10000 times each, incrementing the same location with a single instruction 20000 times, is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2. If it’s not obvious why the theoretical minimum is 2 and not 10000, figuring that out is a good exercise.

where _foo is some memory address.

Obviously this is because (as he says farther down) incl is implemented as a load followed by an add followed by a store. So if you "desugar" it into:

mov reg, _foo ;; #1
inc reg       ;; #2
mov _foo, reg ;; #3

Then the following ordering of u-ops results in _foo = 2:

Thread A executes #1, #2
Thread B executes #1, #2
Thread A executes #3
Thread B executes #3
Thread A executes #1, #2.... etc

(I may be muddling the details of assembler here a little bit, but as far as I know this is a reasonably accurate description of the case where _foo = 2.)
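
For concreteness, here is a minimal sketch of the kind of test being described (this is not Dan Luu's actual code; the file layout, names, and use of GCC extended inline asm are my own assumptions). Two threads each execute a plain, non-lock-prefixed incl on the same location 10000 times, so any final value from 2 to 20000 is permitted:

/* Hypothetical sketch of the experiment described above (not Dan Luu's
   actual code). Assumes x86-64 and GCC extended inline asm.
   Build with: gcc -O2 -pthread inc.c */
#include <pthread.h>
#include <stdio.h>

static volatile int counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        /* A single, non-lock-prefixed increment of the shared location.
           The CPU still performs it internally as load + add + store. */
        __asm__ volatile("incl %0" : "+m"(counter));
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Anything from 2 to 20000 is allowed; in practice the result is
       usually well below 20000 but far above 2. */
    printf("counter = %d\n", counter);
    return 0;
}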

What I wonder about is his next "exercise:"

[M]y bonus exercise for you is, can any reasonable CPU implementation get that result, or is that some silly thing the spec allows that will never happen? There isn’t enough information in this post to answer the bonus question...

Can it? My instinct is no because I believe that when A executes #3, then either:

  • A and B are on the same CPU. B won't get to run until A's timeslice is up, and there's no way that it will take a whole timeslice to execute a single instruction, so eventually someone is going to write out a value > 2, or

  • A and B are on different CPUs. A's write causes B's cache to become invalidated and A gets to continue executing, writing out a value > 2.

But I'm not positive if every store causes every other cache to get invalidated, or if A is able to continue running during that time, and I'm not sure if OS-level things like timeslices should apply to reasoning about CPUs.

Solution

tl;dr summary: not possible on a single core with the one-instruction inc [foo]. Maybe possible with each thread on its own core, but I think only with hyperthreading to create extra delays on stores by causing cache evictions between the load/inc and the store.

I think not even multi-socket cache coherency can be slow enough for B's final store to be delayed 50k cycles after B's final load, but hyperthreading might be able to queue multiple cache/TLB misses ahead of it.


In the single-core case: your assumption that B won't get to run until A's timeslice is up doesn't necessarily hold. An interrupt (e.g. a timer interrupt or NIC) can come in at any point, suspending execution of a user-space thread at any instruction boundary. Perhaps after the interrupt, a higher-priority process wakes up and is scheduled onto the CPU for a while, so there's no reason for the scheduler to prefer the thread that had already run for a fraction of a timeslice.

However, if we're just talking about the single-core case, and concurrency can only happen via context switches, inc [mem] is very different from mov reg, [mem] / inc reg / mov [mem], reg. Regardless of how the CPU internals handle inc [mem], a context switch only saves/restores the architectural state. If the load and inc part had already internally completed, but not the store, the whole instruction couldn't have retired. A context switch wouldn't save/restore that progress: the load and inc would have to be re-run when the thread started executing again and the CPU saw the inc [mem] instruction again.

If the test had used separate load/inc/store instructions, even a single-core machine could in theory get 2 by the sequence Michael Burr points out:

A loads 0 from _foo
B loops 9999 times (finally storing _foo = 9999)
A stores _foo = 1  (end of first iteration)
B's final iteration loads 1 from _foo
A loops 9999 times (eventually storing _foo = 10000)
B's final iteration stores _foo = 2

This is possible, but would require several context switches triggered by interrupts arriving at extremely specific times. It takes many cycles from an interrupt that causes the scheduler to preempt a thread to the point where the first instruction from a new thread actually runs. There's probably enough time for another interrupt to arrive. We're just interested in it being possible, not likely enough to be observable even after days of trials!

Again, with inc [mem], this is impossible on a single core, because context switches can only happen after whole instructions. The CPU's architectural state has either executed the inc or not.


In a multicore situation, with both threads running at the same time, things are entirely different. Cache coherency operations can happen between the uops that a single instruction is decoded into. So inc [mem] isn't a single operation in this context.

I'm not sure about this, but I think it might be possible even for a single-instruction inc [foo] loop to produce a final result of 2. Interrupts / context switches can't account for delays between load and store, though, so we need to come up with other possible reasons.

  • A loads 0 from foo
  • B loops 9999 times (finally storing foo = 9999). The cache line is now in the E state on B's CPU core
  • A stores _foo = 1 (end of first iteration). This could conceivably be delayed this long on a hyperthreaded CPU by the other logical thread saturating the store port with a big backlog of stores which miss in cache and/or TLB, and by the store being buffered for a while. Possibly this can happen without hyperthreading with several cache-miss stores waiting to complete. Remember that stores become globally visible in program order in x86's strong memory model, so a later store to a hot cache line still has to wait. Having it complete just in time for B's last iteration is just a coincidence in timing, which is fine.
  • B's final iteration loads 1 from foo. (An interrupt or something could delay B's execution of the final iteration. This doesn't require anything to happen between the load/inc/store uops of a single instruction, so I don't need to figure out if receiving a coherency message (from A) that invalidates the cache line will prevent store-forwarding from forwarding the 9999 value from the previous iteration's store to this iteration's load. I'm not sure, but I think it could.)
  • A loops 9999 times (eventually storing _foo = 10000)
  • B's final iteration stores _foo = 2. Explaining how this store can be delayed until after A's loop completes seems like the biggest stretch. Hyperthreading could do it: the other logical core could evict the TLB entry for _foo, and maybe also evict the L1 D$ line containing the value. This eviction could happen between the load and the store uops for the final inc instruction. I'm not sure how long it can take for the coherency protocol to obtain write access to a cache line that's currently owned by another core. I'm sure it's usually far less than 50k cycles, actually less than a main memory access on CPUs with large inclusive last-level caches that act as a backstop for coherency traffic (e.g. Intel's Nehalem and later designs). Very-many-core systems with multiple sockets are potentially slow, but I think they still use a ring bus for coherency traffic.

    I'm not sure it's plausible for B's final store to be delayed 50k cycles without hyperthreading to pile up some store-port contention and cause cache evictions. The load (which has to see A's store of 1, but not any of A's other stores) can't get too far ahead of the store in the OOO scheduler, since it still has to come after the store from the penultimate iteration. (A core must maintain in-order semantics within a single execution context.)

Since there's only a single memory location that's read and then written in both threads, there isn't any reordering of stores and loads. A load will always see previous stores from the same thread, so it can't become globally visible until after a store to the same location.

On x86, only StoreLoad reordering is possible, but in this case the only thing that matters is that the out-of-order machinery can delay the store for a long time, even without reordering it relative to any loads.



The original blog post you're referring to looks good in general, but I did notice at least one mistake. There are a lot of good links in there.

it turns out that on modern x86 CPUs, using locking to implement concurrency primitives is often cheaper than using memory barriers

That link just shows that using lock add [mem], 0 as a barrier is cheaper on Nehalem, and esp. that it interleaves better with other instructions. It has nothing to say about using locking vs. lockless algorithms that depend on barriers. If you want to atomically increment a memory location, then the simplest choice by far is a locked instruction. Using just MFENCE would require some kind of separate lock implemented without atomic RMW operations, if that's possible.
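
To make that distinction concrete, here is a small sketch (my own illustration, not code from the post or the linked benchmark), contrasting an atomic read-modify-write with a plain increment followed by a fence, assuming GCC on x86-64:

/* Sketch contrasting an atomic RMW increment with a plain increment
   plus a fence (illustrative only; assumes GCC on x86-64). */
#include <stdatomic.h>

static _Atomic int atomic_counter;
static volatile int plain_counter;

static inline void locked_inc(void)
{
    /* Compiles to a single lock-prefixed add: the load, increment and
       store happen as one atomic operation, so no increments are lost. */
    atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_seq_cst);
}

static inline void fenced_inc(void)
{
    /* NOT a substitute: mfence only orders this core's own accesses.
       The read-modify-write is still non-atomic, so concurrent
       increments from another thread can still be lost. */
    plain_counter++;
    __asm__ volatile("mfence" ::: "memory");
}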

Clearly he wanted to introduce the topic of lock inc [mem] vs. inc [mem], and just wasn't careful about the wording. In most cases his generalizations work better.


The example code is also weird, and compiling with -O0 makes quite nasty code as always. I fixed the inline asm to ask the compiler for a memory operand, rather than manually writing incl (reg), so with optimization on, it produces incl counter(%rip) instead of loading the pointer into a register. More importantly, -O3 also avoids keeping the loop counter in memory, even with the original source. -O3 on the original source still appears to produce correct code, even though it doesn't tell the compiler that it writes to memory.
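
I won't reproduce the post's code here, but the change is roughly of this shape (a sketch with assumed names, in GCC extended-asm syntax):

static volatile int counter;

/* Original style: pass a pointer in a register and dereference it by
   hand, which pins the addressing mode to incl (%reg) and needs a
   "memory" clobber because the operands don't describe the write. */
static inline void inc_via_register(void)
{
    __asm__ volatile("incl (%0)" : : "r"(&counter) : "memory");
}

/* Fixed style: ask the compiler for a memory operand. With optimization
   enabled it can emit incl counter(%rip) directly, and the "+m"
   constraint tells it the location is both read and written. */
static inline void inc_via_memory_operand(void)
{
    __asm__ volatile("incl %0" : "+m"(counter));
}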

Anyway, flawed as it is, I think the experiment is still valid, and it's unlikely that the huge loop overhead of compiling with -O0 added an artificial limit to the range the final counter could end up in.


Dan Luu's example asm syntax is a weird mix of Intel and AT&T syntax: mov [_foo], %eax is a load. It should be written mov eax, [_foo], or mov _foo, %eax, or maybe mov (_foo), %eax if you're trying to make it clear that it's a load rather than a mov-immediate. Anyway, I think it would be confusing if I didn't already know what he meant and was trying to demonstrate.

