Exclusive access to L1 cacheline on x86?

Question

If one has a 64-byte buffer that is heavily read/written to, then it's likely that it'll be kept in L1; but is there any way to force that behaviour?

As in, give one core exclusive access to those 64 bytes and tell it not to sync the data with other cores nor the memory controller so that those 64 bytes always live in one core's L1 regardless of whether or not the CPU thinks it's used often enough.

Answer

No, x86 doesn't let you do this. You can force eviction with clflushopt, or (on upcoming CPUs) write a line back without evicting it with clwb, but you can't pin a line in cache or disable coherency.
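
For reference, a minimal sketch of what those instructions look like from C via the intrinsics (the buffer name is illustrative; it assumes a CPU and compiler with CLFLUSHOPT and CLWB support, e.g. gcc -O2 -mclflushopt -mclwb):

```c
// Sketch: explicitly flushing or writing back one cache line with intrinsics.
#include <immintrin.h>
#include <stdint.h>

_Alignas(64) static uint8_t buf[64];   // one hypothetical cache line of data

void flush_line(void) {
    _mm_clflushopt(buf);   // evict the line from all cache levels
    _mm_sfence();          // order the flush relative to later stores
}

void writeback_line(void) {
    _mm_clwb(buf);         // write back dirty data; the line may stay cached
    _mm_sfence();
}
```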

You can put the whole CPU (or a single core?) into cache-as-RAM (aka no-fill) mode to disable sync with the memory controller, and disable ever writing back the data. Cache-as-Ram (no fill mode) Executable Code. It's typically used by BIOS / firmware in early boot before configuring the memory controllers. It's not available on a per-line basis, and is almost certainly not practically useful here. Fun fact: leaving this mode is one of the use-cases for invd, which drops cached data without writeback, as opposed to wbinvd.

I'm not sure if no-fill mode prevents eviction from L1d to L3 or whatever; or if data is just dropped on eviction. So you'd just have to avoid accessing more than 7 other cache lines that alias the one you care about in your L1d, or the equivalent for L2/L3.
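
To make "alias" concrete: assuming the typical Intel L1d geometry (32 KiB, 8-way, 64-byte lines, so 64 sets), lines whose addresses differ by a multiple of 4 KiB land in the same set, and only 8 of them can live there at once. A rough sketch of the set-index arithmetic, with a hypothetical address:

```c
// Sketch: L1d set index, assuming 64-byte lines and 64 sets (32 KiB, 8-way).
// Addresses 4 KiB apart map to the same set and compete for its 8 ways.
#include <stdint.h>
#include <stdio.h>

static unsigned l1d_set(uintptr_t addr) {
    return (addr >> 6) & 63;            // drop the 6 offset bits, keep 6 set bits
}

int main(void) {
    uintptr_t hot = 0x55501040;         // hypothetical address of the hot line
    printf("hot line set: %u\n", l1d_set(hot));
    printf("hot+4K same set? %d\n", l1d_set(hot) == l1d_set(hot + 4096));  // 1
    return 0;
}
```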

Being able to force one core to hang on to a line of L1d indefinitely and not respond to MESI requests to write it back / share it would make the other cores vulnerable to lockups if they ever touched that line. So obviously if such a feature existed, it would require kernel mode. (And with HW virtualization, require hypervisor privilege.) It could also block hardware DMA (because modern x86 has cache-coherent DMA).

So supporting such a feature would require lots of parts of the CPU to handle indefinite delays, where currently there's probably some upper bound, which may be shorter than a PCIe timeout, if there is such a thing. (I don't write drivers or build real hardware, just guessing about this).

As @fuz points out, a coherency-violating instruction (xdcbt) was tried on PowerPC (in the Xbox 360 CPU), with disastrous results from mis-speculated execution of the instruction. So it's hard to implement.

If the line is frequently used, LRU replacement will keep it hot. And if it's lost from L1d at frequent enough intervals, then it will probably stay hot in L2 which is also on-core and private, and very fast, in recent designs (Intel since Nehalem). Intel's inclusive L3 on CPUs other than Skylake-AVX512 means that staying in L1d also means staying in L3.

All this means that full cache misses all the way to DRAM are very unlikely with any kind of frequency for a line that's heavily used by one core. So throughput shouldn't be a problem. I guess you could maybe want this for realtime latency, where the worst-case run time for one call of a function mattered. Dummy reads from the cache line in some other part of the code could be helpful in keeping it hot.
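
A minimal sketch of such a dummy access, assuming a hypothetical 64-byte aligned buffer (hot_buf) used by the latency-critical code; a volatile read (or a prefetch) just touches the line so the replacement policy keeps seeing it as recently used:

```c
// Sketch: periodically touching the line so LRU keeps it hot in L1d.
// hot_buf is a hypothetical cache-line-aligned buffer used by the critical path.
#include <stdint.h>

_Alignas(64) uint8_t hot_buf[64];

static inline void keep_hot(void) {
    (void)*(volatile const uint8_t *)hot_buf;   // dummy read the compiler can't drop
    // or: __builtin_prefetch(hot_buf, 0, 3);   // prefetch with high temporal locality
}
```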

However, if pressure from other cores in L3 cache causes eviction of this line from L3, Intel CPUs with an inclusive L3 also have to force eviction from inner caches that still have it hot. IDK if there's any mechanism to let L3 know that a line is heavily used in a core's L1d, because that doesn't generate any L3 traffic.

I'm not aware of this being much of a problem in real code. L3 is highly associative (like 16 or 24 way), so it takes a lot of conflicts before you'd get an eviction. L3 also uses a more complex indexing function (like a real hash function, not just modulo by taking a contiguous range of bits). In IvyBridge and later, it also uses an adaptive replacement policy to mitigate eviction from touching a lot of data that won't be reused often. http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.

See also: Which cache mapping technique is used in Intel Core i7 processor?

@AlexisWilke points out that you could maybe use vector register(s) instead of a line of cache, for some use-cases. Using ymm registers as a "memory-like" storage location. You could globally dedicate some vector regs to this purpose. To get this in gcc-generated code, maybe use -ffixed-ymm8, or declare it as a volatile global register variable. (How to inform GCC to not use a particular register)

Using ALU instructions or store-forwarding to get data to/from the vector reg will give you guaranteed latency with no possibility of data-cache misses. But code-cache misses are still a problem for extremely low latency.
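
A rough sketch of that idea, assuming the file is compiled with -mavx -ffixed-ymm8 so the compiler never allocates ymm8 for anything else (the helper names are illustrative, not a standard API):

```c
// Sketch: (ab)using ymm8 as a 32-byte "storage location" that can never miss in cache.
// Assumes -mavx -ffixed-ymm8 so ymm8 is reserved for this purpose in this code.
#include <immintrin.h>

static inline void scratch_store(__m256i v) {
    __asm__ volatile("vmovdqa %0, %%ymm8" :: "x"(v));   // copy data into the reserved reg
}

static inline __m256i scratch_load(void) {
    __m256i v;
    __asm__ volatile("vmovdqa %%ymm8, %0" : "=x"(v));   // copy it back out, reg-to-reg
    return v;
}
```

The register-to-register round trip never touches the data cache at all; store-forwarding through a stack slot would be the alternative if an actual memory operand is needed.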
