How to write or read memory without touching cache


Problem Description


Is there any way to write/read memory without touching L1/L2/L3 cache under x86 CPUs?

And is cache in x86 CPUs totally managed by hardware?

EDIT: I want to do this because I want to sample the speed of memory and see if any part of memory's performance degrades.

Solution

The CPU indeed manages its own caches in hardware, but x86 provides you some ways to affect this management.

To access memory without caching, you could:

  1. Use the x86 non-temporal instructions. They're meant to tell the CPU that you won't be reusing this data again, so there's no point in retaining it in the cache. These instructions in x86 are usually called movnt* (with a suffix according to the data type, e.g. movnti for storing normal integers from general-purpose registers). There are also streaming load/store instructions that use a similar technique but are more appropriate for high-BW streams (when you load full lines consecutively). To use these, either code them in inline assembly or use the intrinsics provided by your compiler; most compilers call that family _mm_stream_* (a minimal sketch follows this list).

  2. Change the memory type of the specific region to uncacheable. Since you stated you don't want to disable all caching (and rightfully so, since that would also include code, stack, page map, etc.), you could define the specific region your benchmark's data set resides in as uncacheable, using MTRRs (memory type range registers). There are several ways of doing that; you'll need to read some documentation (a rough Linux-specific sketch follows this list).

  3. The last option is to fetch the line normally, which means it does get cached initially, but then force it to clear out of all cache levels using the dedicated clflush instruction (or the full wbinvd if you want to flush the entire cache). Make sure to properly fence these operations so that you can guarantee they're done, and of course don't measure them as part of the latency (see the clflush sketch after this list).
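
For option 1, here is a minimal sketch of a non-temporal store loop built on the SSE2 _mm_stream_si128 intrinsic. The function name, the assumption that the buffer is 16-byte aligned, and the loop structure are illustrative additions, not part of the original answer:

    /* Sketch: fill a buffer with non-temporal (streaming) stores.
     * Assumes an SSE2-capable x86 CPU and a 16-byte-aligned buffer. */
    #include <emmintrin.h>   /* _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
    #include <stdint.h>
    #include <stddef.h>

    static void fill_nontemporal(int32_t *dst, size_t n, int32_t value)
    {
        __m128i v = _mm_set1_epi32(value);
        size_t i = 0;

        /* _mm_stream_si128 writes 16 bytes with a non-temporal hint,
         * so the stores go through write-combining buffers instead of
         * being kept in the cache hierarchy. */
        for (; i + 4 <= n; i += 4)
            _mm_stream_si128((__m128i *)(dst + i), v);

        /* Handle any tail that is not a multiple of 4 with plain stores. */
        for (; i < n; ++i)
            dst[i] = value;

        /* Streaming stores are weakly ordered: fence before anything
         * reads or times the buffer. */
        _mm_sfence();
    }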
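For option 2, one possible route on Linux (assuming root privileges and a kernel built with CONFIG_MTRR) is the /proc/mtrr interface. The base and size values below are placeholders; you would substitute the physical range your benchmark buffer actually occupies, and note that on recent systems PAT-based mechanisms are often preferred over MTRRs:

    /* Sketch: mark a physical range as uncachable via /proc/mtrr.
     * The base/size values are placeholders, not real addresses. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mtrr", "w");
        if (!f) {
            perror("open /proc/mtrr");
            return 1;
        }
        /* Format as documented in the kernel's MTRR documentation;
         * "uncachable" is the spelling the interface expects. */
        fprintf(f, "base=0xd8000000 size=0x01000000 type=uncachable\n");
        fclose(f);
        return 0;
    }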
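For option 3, a rough sketch of flushing a line and then timing a single reload with rdtsc. The fence placement shown is the common pattern for this kind of micro-measurement, and a single sample is only meaningful when averaged over many runs:

    /* Sketch: evict one line with clflush, fence, then time one reload. */
    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence, _mm_lfence */
    #include <x86intrin.h>   /* __rdtsc */
    #include <stdint.h>

    static uint64_t time_one_load(volatile int *p)
    {
        _mm_clflush((const void *)p);  /* flush the line from all cache levels */
        _mm_mfence();                  /* make sure the flush has completed    */
        _mm_lfence();                  /* keep rdtsc from starting early       */

        uint64_t start = __rdtsc();
        (void)*p;                      /* this load should come from DRAM      */
        _mm_lfence();                  /* wait for the load before stopping    */
        uint64_t stop = __rdtsc();

        return stop - start;
    }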

Having said that, if you want to do all this just to time your memory reads, you may get bad results, since most CPUs handle non-temporal or uncacheable accesses "inefficiently". If you're just after forcing reads to come from memory, this is best achieved by working with the caches' LRU replacement: sequentially access a data set that's large enough not to fit in any cache. This makes most LRU schemes (not all!) drop the oldest lines first, so the next time you wrap around, they'll have to come from memory.
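
A minimal sketch of that approach: touch one byte per cache line across a buffer assumed to be several times larger than the last-level cache, so each pass has already evicted the lines read earliest in the previous pass. The buffer size, line size, and pass count below are illustrative choices:

    /* Sketch: defeat the cache by sweeping a buffer much larger than the LLC. */
    #include <stdlib.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BUF_SIZE   (256u * 1024u * 1024u)   /* 256 MiB, assumed >> LLC */
    #define LINE_SIZE  64u                      /* typical x86 line size   */

    int main(void)
    {
        volatile uint8_t *buf = malloc(BUF_SIZE);
        if (!buf) return 1;

        uint64_t sum = 0;
        /* Read one byte per line, front to back, several times.  By the
         * time a pass wraps around, the earliest lines have been evicted,
         * so (on most LRU-like policies) every access misses to memory. */
        for (int pass = 0; pass < 4; ++pass)
            for (size_t i = 0; i < BUF_SIZE; i += LINE_SIZE)
                sum += buf[i];

        printf("checksum: %llu\n", (unsigned long long)sum);
        free((void *)buf);
        return 0;
    }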

Note that for this to work, you need to make sure your HW prefetcher does not help (and accidentally hide the latency you want to measure) - either disable it, or make the access stride large or irregular enough that it becomes ineffective.
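
One common way to keep the prefetcher out of the picture (in addition to, or instead of, disabling it) is pointer chasing through a randomly permuted cycle, so the next address is unknown until the current load returns. A sketch, with the node count chosen arbitrarily so the chain is assumed not to fit in the LLC:

    /* Sketch: dependent-load (pointer-chasing) access pattern that the
     * hardware prefetcher cannot anticipate. */
    #include <stdlib.h>
    #include <stdint.h>
    #include <stdio.h>

    #define N_NODES  (4u * 1024u * 1024u)   /* ~32 MB of pointers, assumed > LLC */

    int main(void)
    {
        size_t *next = malloc(N_NODES * sizeof *next);
        if (!next) return 1;

        /* Build a random single cycle (Sattolo's algorithm), so the chain
         * visits every slot exactly once before returning to the start. */
        for (size_t i = 0; i < N_NODES; ++i)
            next[i] = i;
        for (size_t i = N_NODES - 1; i > 0; --i) {
            size_t j = (size_t)rand() % i;   /* j in [0, i) */
            size_t tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }

        /* Chase the chain: each load's address depends on the previous
         * load's result, so strides are irregular and prefetching fails. */
        size_t p = 0;
        for (size_t step = 0; step < N_NODES; ++step)
            p = next[p];

        printf("final index: %zu\n", p);     /* keep the loop from being optimized away */
        free(next);
        return 0;
    }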
