How to force CPU core to flush store buffer in C?


Problem description

I have an application with two threads: thread A has affinity to core 1 and thread B has affinity to core 2, and core 1 and core 2 are in the same x86 socket.

Thread A does a busy spin on an integer x, and thread B increments x under certain conditions. When thread B decides to increment x, it invalidates the cache line that x lives in, and according to the x86 MESI protocol it stores the new x into its store buffer before core 2 receives the invalidate ack; then, after core 2 receives the invalidate ack, core 2 flushes the store buffer.

I would like to know: after core 2 receives the invalidate ack, does core 2 flush the store buffer immediately? Is there any way I can force the CPU to flush the store buffer from C? In my case, thread A spinning on x on core 1 should get the new value of x as early as possible.
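Not part of the original post, but a minimal sketch of the setup described above, assuming Linux, pthreads, and GCC/Clang (the core numbers, variable name, and functions are illustrative only):

```c
#define _GNU_SOURCE           /* for pthread_setaffinity_np / CPU_SET */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int x = 0;

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *thread_a(void *arg) {        /* core 1: busy-spin on x */
    (void)arg;
    pin_to_core(1);
    while (atomic_load_explicit(&x, memory_order_acquire) == 0)
        ;                                  /* wait for the new value */
    return NULL;
}

static void *thread_b(void *arg) {        /* core 2: publish a new value of x */
    (void)arg;
    pin_to_core(2);
    atomic_store_explicit(&x, 1, memory_order_release);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("reader observed the new value");
    return 0;
}
```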

Solution

A core always tries to commit its store buffer to L1d cache (and thus become globally visible) as fast as possible, to make room for more stores.

You can use a barrier (like atomic_thread_fence(memory_order_seq_cst)) to make a thread wait for its stores to become globally visible before doing any more loads or stores, but that works by blocking this core, not by speeding up flushing the store buffer.
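A minimal sketch of using such a fence, assuming x86 and a C11 compiler (the flag variable and function name are illustrative):

```c
#include <stdatomic.h>

extern _Atomic int x;   /* hypothetical shared variable */

void publish_then_wait(void) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* Full barrier: on x86 this compiles to an mfence-like instruction, so this
       core stalls until its earlier stores have drained from the store buffer
       before any later loads or stores execute. It does not speed up the drain. */
    atomic_thread_fence(memory_order_seq_cst);
}
```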

Obviously to avoid undefined behaviour in C11, the variable has to be _Atomic. If there's only one writer, you might use tmp = atomic_load_explicit(&x, memory_order_relaxed) and store_explicit of tmp+1 to avoid a more expensive seq_cst store or atomic RMW. acq / rel ordering would work too, just avoid the default seq_cst, and avoid an atomic_fetch_add RMW if there's only one writer.

You don't need the whole RMW operation to be atomic if only one thread ever modifies it, and other threads access it read-only.
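A minimal sketch of that single-writer pattern; the function names are illustrative, and it assumes thread B is the only thread that ever modifies x:

```c
#include <stdatomic.h>

extern _Atomic int x;   /* single writer, any number of readers */

/* Writer (thread B): no atomic RMW needed, since no other thread writes x. */
void writer_increment(void) {
    int tmp = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&x, tmp + 1, memory_order_release); /* cheaper than seq_cst */
}

/* Reader (thread A): spin until a newer value becomes visible. */
int reader_wait_for_change(int last_seen) {
    int v;
    do {
        v = atomic_load_explicit(&x, memory_order_acquire);
    } while (v == last_seen);
    return v;
}
```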


Before another core can read data you wrote, it has to make its way from Modified state in the L1d of the core that wrote it out to L3 cache, and from there to the L1d of the reader core.

You might be able to speed this part along, which happens after the data leaves the store buffer. But there's not much you can usefully do. You don't want to clflush/clflushopt, which would write-back + evict the cache line entirely so the other core would have to get it from DRAM, if it didn't try to read it at some point along the way (if that's even possible).

Ice Lake has clwb which (hopefully) leaves the data cached as well as forcing write-back to DRAM. But again that forces data to actually go all the way to DRAM, not just a shared outer cache, so it costs DRAM bandwidth and is presumably slower than we'd like. (Skylake-Xeon has it, too, but handles it the same as clflushopt. I expect & hope that Ice Lake client/server has/will have a proper implementation.)
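For illustration only, since the answer argues this is usually counterproductive here: a sketch of issuing clwb through its intrinsic, assuming a compiler and CPU with CLWB support (e.g. building with -mclwb on GCC/Clang):

```c
#include <immintrin.h>   /* _mm_clwb; requires CLWB hardware + -mclwb */
#include <stdatomic.h>

extern _Atomic int x;

void store_and_writeback(void) {
    atomic_store_explicit(&x, 1, memory_order_release);
    _mm_clwb((void *)&x);   /* write the line back toward memory, but keep it cached */
    /* clwb is only ordered by fencing instructions; a seq_cst fence (mfence on x86)
       is stronger than the sfence that would suffice, but portable C has no sfence. */
    atomic_thread_fence(memory_order_seq_cst);
}
```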


Tremont (successor to Goldmont Plus, atom/silvermont series) has _mm_cldemote (cldemote). That's like the opposite of a SW prefetch; it's an optional performance hint to write the cache line out to L3, but doesn't force it to go to DRAM or anything.


Without special instructions, maybe you can write to 8 other locations that alias the same set in L2 and L1d cache, forcing a conflict eviction. That would cost extra time in the writing thread, but could make the data available sooner to other threads that want to read it. I haven't tried this.
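An untested sketch of that idea, assuming a typical 32 KiB, 8-way, 64-byte-line L1d (so addresses 4 KiB apart map to the same L1d set); every name and size here is an assumption, and the L2 has different geometry, so this stride only targets L1d:

```c
#include <stdatomic.h>
#include <stdint.h>

#define L1D_WAY_STRIDE 4096   /* assumed: 64 sets * 64-byte lines */
#define L1D_WAYS       8      /* assumed associativity */

extern _Atomic int x;

/* Buffer whose lines can alias any L1d set, thanks to 4 KiB alignment. */
static _Alignas(4096) volatile char eviction_buf[L1D_WAYS * L1D_WAY_STRIDE];

void store_and_try_to_evict(int new_val) {
    atomic_store_explicit(&x, new_val, memory_order_release);
    /* Touch 8 lines that (by assumption) map to the same L1d set as x,
       hoping to force a conflict eviction of x's line out of this core's L1d. */
    uintptr_t set_offset = (uintptr_t)&x & (L1D_WAY_STRIDE - 1);
    for (int way = 0; way < L1D_WAYS; way++)
        eviction_buf[way * L1D_WAY_STRIDE + set_offset] = 1;
}
```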
