How to write or read memory without touching cache

Question

Is there any way to write or read memory without touching the L1/L2/L3 caches on x86 CPUs?

And is the cache in x86 CPUs managed entirely by hardware?

I want to do this because I want to sample memory speed and see whether the performance of any part of memory degrades.

Recommended answer

The CPU does manage its own caches in hardware, but x86 gives you some ways to influence that management.

To access memory without caching, you could:

  1. Use the x86 non-temporal instructions. They tell the CPU that you won't reuse the data, so there's no point in retaining it in the cache. These instructions are usually named movnt*, with a suffix that depends on the data type (for example, movnti stores an integer from a general-purpose register to memory). There are also instructions for streaming loads and stores that use a similar technique but are better suited to high-bandwidth streams (when you access full lines consecutively). To use them, either code them in inline assembly or use the intrinsics provided by your compiler; most compilers name that family _mm_stream_*.

  2. Change the memory type of the specific region to uncacheable. Since you stated you don't want to disable all caching (and rightfully so, since that would also affect code, stack, page map, etc.), you could define the specific region your benchmark's data set resides in as uncacheable, using MTRRs (memory type range registers). There are several ways of doing that, and you'll need to read the relevant documentation.

  3. Fetch the line normally, which means it does get cached initially, but then force it out of all cache levels with the dedicated clflush instruction (or the full wbinvd if you want to flush the entire cache). Make sure to fence these operations properly so that you can guarantee they have completed (and of course don't measure them as part of the latency).
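Options 1 and 3 can be sketched with compiler intrinsics. This is a minimal illustration, not a calibrated benchmark: it assumes GCC or Clang on x86 (for the `<x86intrin.h>` header) and 64-byte cache lines, and the helper names `fill_nontemporal`, `flush_range`, and `time_load_ticks` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang umbrella header for the intrinsics below */

#define CACHE_LINE 64u  /* assumption: 64-byte cache lines */

/* Option 1: write `count` ints with non-temporal stores (MOVNTI). */
static void fill_nontemporal(int *dst, size_t count, int value)
{
    assert(dst != NULL);
    for (size_t i = 0; i < count; i++)
        _mm_stream_si32(&dst[i], value);
    _mm_sfence();  /* order the streaming stores before later stores */
}

/* Option 3: flush every cache line that overlaps [buf, buf + bytes). */
static void flush_range(const void *buf, size_t bytes)
{
    uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + bytes;
    for (; addr < end; addr += CACHE_LINE)
        _mm_clflush((const void *)addr);
    _mm_mfence();  /* fence so the flushes are not part of a later timed region */
}

/* Time a single load in TSC ticks, with fences around the timed window. */
static uint64_t time_load_ticks(const volatile int *p)
{
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    int value = *p;  /* the load being measured */
    _mm_lfence();
    uint64_t stop = __rdtsc();
    (void)value;
    return stop - start;
}
```

A typical sequence would be `flush_range(buf, size)` followed by `time_load_ticks(&buf[i])`, so the flush itself stays outside the measured interval. A single reading is noisy, so it is probably better to repeat the measurement and look at the distribution.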

Having said that, if you want all this just to time your memory reads, you may get bad results, since most CPUs handle non-temporal or uncacheable accesses "inefficiently". If you're just after forcing reads to come from memory, this is best achieved by manipulating the caches' LRU state: sequentially access a data set that's too large to fit in any cache. This makes most LRU schemes (not all!) drop the oldest lines first, so the next time you wrap around, the lines have to come from memory.
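The sweep described above might look like the following sketch. It assumes 64-byte cache lines, and `sweep_lines` is a hypothetical helper name. The caller is expected to pass a buffer several times larger than the last-level cache, whose size has to be looked up for the machine in question.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* assumption: 64-byte cache lines */

/*
 * Read one byte from every cache line of `buf`, in address order.
 * The volatile qualifier and the returned checksum keep the compiler
 * from removing the loads.
 */
static uint64_t sweep_lines(const volatile uint8_t *buf, size_t bytes)
{
    uint64_t sum = 0;
    for (size_t off = 0; off < bytes; off += LINE_BYTES)
        sum += buf[off];
    return sum;
}
```

Calling `sweep_lines` once warms up and wraps the buffer; timing a second call then measures accesses that, under a plain LRU policy, should miss in every cache level. Sequential access like this is exactly what hardware prefetchers are good at, so the caveat about prefetching applies.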

Note that for this to work, you need to make sure your hardware prefetcher does not help (and accidentally hide the latency you want to measure): either disable it, or make the access stride large enough for it to be ineffective.
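One way to apply a large stride is a chain of dependent loads, where each cache line stores the address of the next line to visit. The sketch below assumes 64-byte cache lines; the `node` type and the helpers `build_strided_chain` and `chase` are invented for illustration. Whether a given stride actually defeats the prefetcher depends on the microarchitecture, so treat the stride as a parameter to experiment with.

```c
#include <assert.h>
#include <stddef.h>

/* One node per cache line (assumption: 64-byte lines). */
typedef struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];
} node;

/*
 * Link `n` nodes into a cycle that jumps `stride` nodes per step.
 * The cycle visits every node only if gcd(stride, n) == 1.
 */
static void build_strided_chain(node *nodes, size_t n, size_t stride)
{
    assert(nodes != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        nodes[i].next = &nodes[(i + stride) % n];
}

/* Follow the chain for `steps` dependent loads and return the final node. */
static node *chase(node *start, size_t steps)
{
    node *p = start;
    while (steps--)
        p = *(node *volatile *)&p->next;  /* volatile read: the load must happen */
    return p;
}
```

Because each load depends on the previous one, the total time divided by `steps` approximates the per-access latency. A randomized visiting order is a common alternative to a fixed stride when the prefetcher still detects the pattern.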
