What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?


Question


Two different threads within a single process can share a common memory location by reading and/or writing to it.

Usually, such (intentional) sharing is implemented using atomic operations using the lock prefix on x86, which has fairly well-known costs both for the lock prefix itself (i.e., the uncontended cost) and also additional coherence costs when the cache line is actually shared (true or false sharing).
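For contrast with the plain accesses discussed next, here is what the usual lock-prefixed form looks like (my illustration, not code from the question; on x86, std::atomic's read-modify-write operations compile to lock-prefixed instructions):

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> counter{0};

    // fetch_add is a read-modify-write; on x86 it compiles to a
    // lock-prefixed instruction (e.g. lock xadd), paying the uncontended
    // lock cost plus coherence costs when the line is actually shared.
    uint64_t bump() {
        return counter.fetch_add(1, std::memory_order_relaxed);
    }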

Here I'm interested in producer-consumer costs where a single thread P writes to a memory location and another thread C reads from the memory location, both using plain reads and writes.

What are the latency and throughput of such an operation when performed on separate cores on the same socket, compared with when it is performed on sibling hyperthreads of the same physical core, on recent x86 CPUs?

In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
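To make the setup concrete, below is a minimal sketch of such a benchmark (my construction, not code from the question). It models the "plain" accesses with std::atomic and memory_order_relaxed, and the CPU numbers passed to the pinning helper are placeholders that must be chosen from your machine's topology (lscpu shows which logical CPUs are hyper-siblings):

    // Producer-consumer sharing sketch. Assumes Linux; build with
    // g++ -O2 -pthread (or clang++).
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    static std::atomic<uint64_t> shared_value{0};
    static constexpr uint64_t kIters = 100000000;

    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void producer(int cpu) {
        pin_to_cpu(cpu);
        for (uint64_t i = 1; i <= kIters; ++i)
            shared_value.store(i, std::memory_order_relaxed);  // plain write
    }

    static void consumer(int cpu) {
        pin_to_cpu(cpu);
        uint64_t seen = 0;
        while (seen < kIters)
            seen = shared_value.load(std::memory_order_relaxed);  // plain read
    }

    int main() {
        // Placeholder CPU numbers: pick two hyper-siblings of one core, or
        // two different physical cores, according to your topology.
        auto t0 = std::chrono::steady_clock::now();
        std::thread p(producer, 0), c(consumer, 1);
        p.join();
        c.join();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        std::printf("%.0f writes/sec observed end-to-end\n", kIters / dt.count());
    }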

Solution

The killer problem is that the core makes speculative reads, which means that each time a write to the speculatively read address (or more correctly, to the same cache line) arrives before the read is "fulfilled", the CPU must undo the read (at least if you're on x86), which effectively means it cancels all speculative instructions from that instruction onward.

At some point before the read is retired it gets "fulfilled", i.e. no earlier instruction can fail, there is no longer any reason to reissue it, and the CPU can act as if it had executed all preceding instructions.
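Note the parenthetical above: the squash is triggered by the cache line, not the exact address. A small illustration of that point (my construction, not from the answer): the polling function below never reads the producer's variable, yet its loads can still be squashed because both variables sit on one line. On Intel CPUs these squashes are counted by the machine_clears.memory_ordering perf event (an Intel-specific counter name; treat its availability on other microarchitectures as an assumption):

    #include <atomic>
    #include <cstdint>

    // Two unrelated variables deliberately packed into one 64-byte line.
    struct alignas(64) SharedLine {
        std::atomic<uint64_t> producer_side{0};  // written by the producer
        std::atomic<uint64_t> consumer_side{0};  // only this is read below
    };
    SharedLine line;

    // The consumer's speculative loads of consumer_side can be squashed by
    // writes to producer_side, because the invalidation (cross-core) or
    // update (hyper-sibling) arrives for the whole cache line, not for an
    // individual address.
    uint64_t poll_consumer_side(uint64_t iters) {
        uint64_t v = 0;
        while (iters--)
            v = line.consumer_side.load(std::memory_order_relaxed);
        return v;
    }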

Other-core example

Here the two cores are playing cache ping-pong in addition to cancelling instructions, so this should be worse than the HT version.

Let's start at a point in the process where the cache line with the shared data has just been marked Shared because the Consumer asked to read it.

  1. The Producer now wants to write to the shared data and sends out a request for exclusive ownership of the cache line.
  2. The Consumer receives its cache line still in the Shared state and happily reads the value.
  3. The Consumer continues to read the shared value until the exclusive request arrives.
  4. At that point the Consumer sends a new shared request for the cache line.
  5. At this point the Consumer clears its instructions from the first unfulfilled load of the shared value onward.
  6. While the Consumer waits for the data, it runs ahead speculatively.

So the Consumer can advance in the period between getting its shared cache line and having it invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely two, as the CPU has two read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied that they can't fail.

Same-core HT

Here the two hyperthreads share the core and must share its resources.

The cache line should stay in the Exclusive state the whole time, as the two threads share the cache and therefore don't need the coherence protocol.

Now why does it take so many cycles on the HT core? Let's start with the Consumer just having read the shared value.

  1. The next cycle a write from the Producer occurs.
  2. The Consumer thread detects the write and cancels all its instructions from the first unfulfilled read onward.
  3. The Consumer re-issues its instructions, taking ~5-14 cycles to run again.
  4. Finally the first instruction, which is a read, is issued and executed; it does not read a speculative value but the correct one, since it is at the front of the queue.

So for every read of the shared value the Consumer is reset.

Conclusion

The different-core version apparently advances so much between each cache ping-pong that it performs better than the HT one.

What would have happened if the CPU waited to see if the value had actually changed?

For the test code, the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was covering the reissue latency.

But if the data had been different, the same problem would arise, except it would be worse for the different-core version, since it would then also have to wait for the cache line and then reissue.

So if the OP can change some of the roles, letting the timestamp producer read from the shared location and take the performance hit, it would be better.
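A software-side footnote to the "what if the CPU waited to see if the value actually changed" question above (again my addition, not part of the answer): a consumer can approximate that behavior itself by spinning with pause, which throttles the next speculative load. Intel recommends exactly this for spin-wait loops, in part to reduce these memory-order squashes:

    #include <atomic>
    #include <cstdint>
    #include <immintrin.h>  // _mm_pause

    std::atomic<uint64_t> shared_value{0};

    // Spin until the value differs from `last`. _mm_pause() tells the core
    // this is a spin-wait loop: it delays the loop's speculative loads
    // (fewer machine clears) and yields pipeline resources to the sibling
    // hyperthread while waiting.
    uint64_t wait_for_change(uint64_t last) {
        uint64_t v;
        while ((v = shared_value.load(std::memory_order_relaxed)) == last)
            _mm_pause();
        return v;
    }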

Read more here
