How to force a CPU core to flush its store buffer in C?


Problem description

I have an application with two threads: thread A is pinned to core 1 and thread B to core 2, and both cores are in the same x86 socket.

Thread A busy-spins on an integer x, and thread B increments x under some condition. When thread B decides to increment x, it invalidates the cache line where x lives; under the x86 MESI protocol, core 2 holds the new value of x in its store buffer until it receives the invalidate acknowledgement, and only then drains the store buffer.

I am wondering: does core 2 flush its store buffer immediately after receiving the invalidate ack? And is there any way to force the CPU to flush the store buffer from C? In my case, thread A spinning on x on core 1 should see the new value of x as early as possible.

Answer

A core always tries to commit its store buffer to L1d cache (and thus make its stores globally visible) as quickly as possible, to make room for more stores.

You can use a barrier (like atomic_thread_fence(memory_order_seq_cst)) to make a thread wait for its stores to become globally visible before doing any more loads or stores, but that works by blocking this core, not by speeding up the flushing of the store buffer.
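A minimal two-thread sketch of that idea (function and variable names here are illustrative, not from the question): the writer does a relaxed store followed by a seq_cst fence, and the reader busy-spins on the variable. The fence makes the writer's core wait; it does not drain the store buffer any faster.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int x = 0;

static void *writer(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    // Blocks this core's later loads/stores until the store above is
    // globally visible; it does not speed up the store-buffer drain.
    atomic_thread_fence(memory_order_seq_cst);
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&x, memory_order_relaxed) == 0)
        ;  // busy spin until the writer's store becomes visible
    return NULL;
}

// Runs both threads and returns the final value of x.
int run_demo(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, reader, NULL);
    pthread_create(&b, NULL, writer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load_explicit(&x, memory_order_relaxed);
}
```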

Obviously, to avoid undefined behaviour in C11, the variable has to be _Atomic. If there's only one writer, you can use tmp = atomic_load_explicit(&x, memory_order_relaxed) and a store_explicit of tmp+1 to avoid a more expensive seq_cst store or atomic RMW. acquire/release ordering would work too; just avoid the default seq_cst, and avoid an atomic_fetch_add RMW if there's only one writer.

You don't need the whole RMW operation to be atomic if only one thread ever modifies the variable and other threads access it read-only.
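A sketch of that single-writer pattern (helper names are mine, not from the answer): a relaxed load plus a release store replaces an atomic RMW or a seq_cst store. This is only correct if exactly one thread ever calls the increment.

```c
#include <stdatomic.h>

static _Atomic int x = 0;

// Only the single writer thread may call this. Because no other thread
// writes x, the separate load and store need not be one atomic RMW.
static void writer_increment(void) {
    int tmp = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&x, tmp + 1, memory_order_release);
}

// Readers pair the release store with an acquire load.
static int reader_get(void) {
    return atomic_load_explicit(&x, memory_order_acquire);
}
```

On x86 the release store and acquire load compile to plain mov, avoiding the full-barrier cost of a seq_cst store or a lock-prefixed RMW.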

Before another core can read data you wrote, it has to make its way from Modified state in the L1d of the core that wrote it, out to L3 cache, and from there to the L1d of the reader core.

You might be able to speed up this part, which happens after the data leaves the store buffer, but there's not much you can usefully do. You don't want clflush/clflushopt, which write back and fully evict the cache line, so the other core would have to fetch it from DRAM, unless it happened to try to read it at some point along the way (if that's even possible).

Ice Lake has clwb, which (hopefully) leaves the data cached as well as forcing write-back to DRAM. But again, that forces the data to actually go all the way to DRAM, not just to a shared outer cache, so it costs DRAM bandwidth and is presumably slower than we'd like. (Skylake-Xeon has it too, but handles it the same as clflushopt. I expect and hope that Ice Lake client/server has or will have a proper implementation.)

Tremont (the successor to Goldmont Plus in the Atom/Silvermont line) has _mm_cldemote (cldemote). It's like the opposite of a software prefetch: an optional performance hint to write the cache line out to L3, but it doesn't force the line to go to DRAM or anything.

Without special instructions, you might be able to write to 8 other locations that alias the same set in L2 and L1d cache, forcing a conflict eviction. That would cost extra time in the writing thread, but could make the data available sooner to other threads that want to read it. I haven't tried this.

And this would probably evict other lines too, costing extra L3 traffic (a system-wide shared resource), not just time in the producer thread. You'd only consider this for latency, not throughput, unless the other lines were ones you wanted to write and evict anyway.
