Force a migration of a cache line to another core


Problem Description


In C++ (using any of the low-level intrinsics available on the platform), on x86 hardware (say Intel Skylake, for example), is it possible to send a cache line to another core without forcing the thread on that core to load the line explicitly?


My use case is in a concurrent data structure. In it, for some cases, a core goes through some places in memory that might be owned by some other core(s) while probing for spots. The threads on those cores are typically blocked on a condition variable, so they have some spare cycles where they can run additional "useful work". One example of "useful work" here might be that they stream the data to the other core that will load it in the future, so the loading core doesn't have to wait for the line to come into its cache before processing it. Is there some intrinsic/instruction available on x86 hardware that makes this possible?


A __builtin_prefetch didn't work really well because, for some reason, it ends up adding that latency back to the code doing the loading :( Maybe the strides were not well configured, but I haven't been able to find good strides so far. This might be handled better, and deterministically, from the other cores that know their lines might eventually be loaded.
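
For concreteness, a minimal sketch of the kind of prefetch-ahead loop being described, assuming 64-byte nodes; the prefetch distance of 8 lines here is an arbitrary placeholder, not a known-good value:

```cpp
#include <cstddef>

struct alignas(64) Node { long data[8]; };  // one node per cache line (assumed layout)

// Tuning knob: how many lines ahead to prefetch. Too small and the load
// still stalls; too large and lines arrive too early or get evicted.
constexpr std::size_t kPrefetchDistance = 8;

long sum_with_prefetch(const Node* nodes, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchDistance < n)
            __builtin_prefetch(&nodes[i + kPrefetchDistance], /*rw=*/0, /*locality=*/3);
        sum += nodes[i].data[0];
    }
    return sum;
}
```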

Recommended Answer


There is no "push"; a cache line enters L1d on a physical core only after that core requests it. (Because of a load, SW prefetch, or even HW prefetch.)


2 logical cores can share the same physical core, in case that helps: it might be less horrible to wake up a prefetch-assistant thread to prime the cache, if the latency of some future load is far more important than throughput. I'm picturing having the writer use a condition variable, or send a POSIX signal, or write to a pipe, or anything that will result in an OS-assisted wakeup of another thread whose CPU affinity is set to one or both of the logical cores that the other thread you care about is also pinned to.
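
A sketch of that idea, assuming Linux with glibc (pthread_setaffinity_np); the logical-core ID is a placeholder and would have to come from the machine's real topology, e.g. /sys/devices/system/cpu/cpuN/topology/thread_siblings_list:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <pthread.h>  // pthread_setaffinity_np, cpu_set_t (Linux/glibc)

constexpr int kSiblingLogicalCore = 5;  // assumed sibling of the consumer's core

std::mutex              m;
std::condition_variable cv;
const void*             pending = nullptr;  // line the writer wants warmed up
std::atomic<bool>       stop{false};

void pin_self(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Runs on the consumer's sibling hyperthread: prefetching here warms the
// physical core's L1d/L2, which both logical cores share.
void prefetch_assistant() {
    pin_self(kSiblingLogicalCore);
    std::unique_lock<std::mutex> lk(m);
    while (!stop.load()) {
        cv.wait(lk, [] { return pending != nullptr || stop.load(); });
        if (pending) {
            __builtin_prefetch(pending, /*rw=*/0, /*locality=*/3);
            pending = nullptr;
        }
    }
}

// Writer side: publish the address and trigger the OS-assisted wakeup.
void request_warmup(const void* addr) {
    { std::lock_guard<std::mutex> lk(m); pending = addr; }
    cv.notify_one();
}
```

Keep in mind the wakeup itself costs on the order of microseconds, so this only makes sense when the latency of that future load matters far more than throughput.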


The best you can possibly do from the writer side is to trigger write-back to shared (L3) cache, so the other core can hit in L3 instead of finding the line owned by some other core and having to wait for that write-back, too. (Or, depending on the uarch, wait for a direct core->core transfer.)


e.g. on Ice Lake or later, use clwb to force a write-back, resulting in the line being clean but still cached. (But note that it forces the data to go all the way to DRAM.) clwb on SKX does evict, like clflushopt.
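
As an illustration, a writer-side sketch using the _mm_clwb intrinsic (compile with -mclwb on a CPU that has CLWB; the record layout and the trailing sfence are assumptions of this sketch, following the usual clwb ordering pattern rather than anything from the answer):

```cpp
#include <immintrin.h>  // _mm_clwb, _mm_sfence
#include <cstdint>

struct alignas(64) Record { std::uint64_t payload[8]; };  // exactly one cache line

void publish(Record* r, std::uint64_t v) {
    for (auto& p : r->payload)
        p = v;
    _mm_clwb(r);    // write the dirty line back (ideally keeping it cached);
                    // on SKX this evicts, like clflushopt
    _mm_sfence();   // order the write-back before any later release-store
}
```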


See also CPU cache inhibition where I suggested possibly using a memory region set to write-through caching, if that's possible under a mainstream OS. See also How to force cpu core to flush store buffer in c?


Or of course you could pin both writer and reader to the same physical core so they communicate via L1d. But then they compete for execution resources.
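
A minimal sketch of that pinning on Linux, assuming logical cores 2 and 6 are hyperthread siblings (the real pairing is machine-specific):

```cpp
#include <pthread.h>  // pthread_setaffinity_np (Linux/glibc)
#include <thread>

void pin(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::thread writer([] { /* produce into the shared structure */ });
    std::thread reader([] { /* consume from it */ });
    pin(writer, 2);  // logical core 2
    pin(reader, 6);  // assumed hyperthread sibling of core 2
    writer.join();
    reader.join();
}
```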
