Is there a way to flush the entire CPU cache related to a program?


Question

On x86-64 platforms, the CLFLUSH assembly instruction allows to flush the cache line corresponding to a given address. Instead of flushing the cache related to a specific address, would there be a way to flush the entire cache (either the cache related to the program being executed, or the entire cache), for example by making it full of dummy contents (or any other approach I would not be aware of):

  • using only standard C++17?
  • using standard C++17 and compiler intrinsics if necessary?

What would be the contents of the following function: (the function should work regardless of compiler optimizations)?

void flush_cache() 
{
    // Contents
}

Solution

For links to related questions about clearing caches (especially on x86), see the first answer on WBINVD instruction usage.


No, you cannot do this reliably or efficiently with pure ISO C++17. It doesn't know or care about CPU caches. The best you could do is touch a lot of memory so everything else ends up getting evicted (see footnote 1), but this is not what you're really asking for. (Of course, flushing all cache is by definition inefficient...)

CPU cache management functions / intrinsics / asm instructions are implementation-specific extensions to the C++ language. But other than inline asm, no C or C++ implementations that I'm aware of provide a way to flush all cache, rather than a range of addresses. That's because it's not a normal thing to do.


On x86, for example, the asm instruction you're looking for is wbinvd. It writes back any dirty lines before evicting, unlike invd (which drops cache contents without write-back, useful when leaving cache-as-RAM mode). So in theory wbinvd has no architectural effect, only microarchitectural, but it's so slow that it's a privileged instruction. As Intel's insn ref manual entry for wbinvd points out, it will increase interrupt latency, because it is not itself interruptible and may have to wait for 8 MiB or more of dirty L3 cache to be flushed. i.e. delaying interrupts for that long can be considered an architectural effect, unlike most timing effects. It's also complicated on a multi-core system because it has to flush caches for all cores.

I don't think there's any way to use it in user-space (ring 3) on x86. Unlike cli / sti and in/out, it's not enabled by the IO-privilege level (which you can set on Linux with an iopl() system call). So wbinvd only works when actually running in ring 0 (i.e. in kernel code). See Privileged Instructions and CPU Ring Levels.

But if you're writing a kernel (or freestanding program that runs in ring0) in GNU C or C++, you could use asm("wbinvd" ::: "memory");. On a computer running actual DOS, normal programs run in real mode (which doesn't have any lower-privilege levels; everything is effectively kernel). That would be another way to run a microbenchmark that needs to run privileged instructions to avoid kernel<->userspace transition overhead for wbinvd, and also has the convenience of running under an OS so you can use a filesystem. Putting your microbenchmark into a Linux kernel module might be easier than booting FreeDOS from a USB stick or something, though. Especially if you want control of turbo frequency stuff.
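
As a concrete sketch of that one-liner wrapped in a function (GNU C/C++ inline asm; only valid when actually executing in ring 0, e.g. in a kernel module or a freestanding ring-0 program):

// Ring 0 only: wbinvd writes back all dirty lines, then invalidates all
// caches. Executing it in ring 3 raises #GP(0).
static inline void flush_all_caches()
{
    // The "memory" clobber keeps the compiler from reordering memory
    // accesses across the flush.
    asm volatile("wbinvd" ::: "memory");
}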


The only reason I can think of that you might want this is for some kind of experiment to figure out how the internals of a specific CPU are designed. So the details of exactly how it's done are critical. It doesn't make sense to me to even want a portable / generic way to do this.

Or maybe in a kernel before reconfiguring physical memory layout, e.g. so there's now an MMIO region for an ethernet card where there used to be normal DRAM. But in that case your code is already totally arch-specific.


Normally when you want / need to flush caches for correctness reasons, you know which address range needs flushing. e.g. when writing drivers on architectures with DMA that isn't cache coherent, so write-back happens before a DMA read, and doesn't step on a DMA write. (And the eviction part is important for DMA reads, too: you don't want the old cached value). But x86 has cache-coherent DMA these days, because modern designs build the memory controller into the CPU die so system traffic can snoop L3 on the way from PCIe to memory.

The major case outside of drivers where you need to worry about caches is with JIT code-generation on non-x86 architectures with non-coherent instruction caches. If you (or a JIT library) write some machine code into a char[] buffer and cast it to a function pointer, architectures like ARM don't guarantee that code-fetch will "see" that newly-written data.

This is why gcc provides __builtin___clear_cache. It doesn't necessarily flush anything, only makes sure it's safe to execute that memory as code. x86 has instruction caches that are coherent with data caches and supports self-modifying code without any special syncing instructions. See godbolt for x86 and AArch64, and note that __builtin___clear_cache compiles to zero instructions for x86, but has an effect on surrounding code: without it, gcc can optimize away stores to a buffer before casting to a function pointer and calling. (It doesn't realize that data is being used as code, so it thinks they're dead stores and eliminates them.)
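
As a minimal sketch of that JIT scenario (assuming POSIX mmap and x86-64; the byte string encodes mov eax, 42 / ret and is purely illustrative):

#include <cstring>
#include <sys/mman.h>

int main()
{
    // x86-64 machine code for: mov eax, 42; ret
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    void *buf = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    std::memcpy(buf, code, sizeof(code));
    // Compiles to nothing on x86, but keeps the memcpy from being treated
    // as dead stores; on ARM/AArch64 it emits the required cache syncing.
    __builtin___clear_cache(static_cast<char *>(buf),
                            static_cast<char *>(buf) + sizeof(code));

    auto fn = reinterpret_cast<int (*)()>(buf);
    return fn();  // returns 42
}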

Despite the name, __builtin___clear_cache is totally unrelated to wbinvd. It takes an address range as args, so it's not going to flush and invalidate the entire cache. It also doesn't use clflush, clflushopt, or clwb to actually write back (and optionally evict) data from cache.

When you need to flush some cache for correctness, you only want to flush a range of addresses, not slow the system down by flushing all the caches.


It rarely if ever makes sense to intentionally flush caches for performance reasons, at least on x86. Sometimes you can use pollution-minimizing prefetch to read data without as much cache pollution, or use NT stores to write around cache. But doing "normal" stuff and then clflushopt after touching some memory for the last time is generally not worth it in normal cases. Like a store, it has to go all the way through the memory hierarchy to make sure it finds and flushes any copy of that line anywhere.

There isn't a light-weight instruction designed as a performance hint, like the opposite of _mm_prefetch.
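
For illustration, a sketch of the hints that do exist, using SSE intrinsics (the prefetch distance of 16 ints is an arbitrary illustrative choice, not a tuned value):

#include <immintrin.h>
#include <cstddef>

// Copy ints while trying to minimize cache pollution: NT prefetch for the
// reads, NT (cache-bypassing) stores for the writes. Both are hints whose
// actual effect depends on the microarchitecture.
void copy_nt(int *dst, const int *src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char *>(src + i + 16),
                         _MM_HINT_NTA);
        _mm_stream_si32(dst + i, src[i]);  // write-combining store
    }
    _mm_sfence();  // make the WC stores globally visible before returning
}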


The only cache-flushing you can do in user-space on x86 is with clflush / clflushopt. (Or with NT stores, which also evict the cache line if it was hot beforehand). Or of course creating conflict evictions for known L1d size and associativity, like writing to multiple lines at multiples of 4 kiB which all map to the same set in a 32k / 8-way L1d.
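
A sketch of that conflict-eviction idea, assuming the 32k / 8-way / 64 B-line L1d geometry above (so addresses 4 kiB apart index the same set; real geometry varies by CPU, treat the constants as assumptions):

#include <cstddef>

// Write 16 lines at 4 kiB stride: they all index the same L1d set, and 16
// lines competing for 8 ways forces evictions from that set.
void evict_one_l1d_set(volatile unsigned char *buf)  // buf spans >= 64 KiB
{
    for (std::size_t i = 0; i < 16; ++i)
        buf[i * 4096] = 1;
}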

There's an Intel intrinsic _mm_clflush(void const *p) wrapper for clflush (and another for clflushopt), but these can only flush cache lines by (virtual) address. You could loop over all the cache lines in all the pages your process has mapped... (But that can only flush your own memory, not cache lines that are caching kernel data, like the kernel stack for your process or its task_struct, so the first system call will still be faster than if you had flushed everything).
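
For example, a sketch of flushing one known address range (64-byte line size assumed; mfence is the conservative fence that clflush is guaranteed to be ordered by):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Flush every cache line covering [p, p + bytes).
void flush_range(void const *p, std::size_t bytes)
{
    constexpr std::size_t kLine = 64;  // assumption: 64 B cache lines
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p) & ~(kLine - 1);
    std::uintptr_t end  = reinterpret_cast<std::uintptr_t>(p) + bytes;
    for (; addr < end; addr += kLine)
        _mm_clflush(reinterpret_cast<void const *>(addr));
    _mm_mfence();  // order the flushes before later loads/stores
}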

There's a Linux system call wrapper to portably evict a range of addresses: cacheflush(char *addr, int nbytes, int flags). Presumably the implementation on x86 uses clflush or clflushopt in a loop, if it's supported on x86 at all. The man page says it first appeared in MIPS Linux "but nowadays, Linux provides a cacheflush() system call on some other architectures, but with different arguments."

I don't think there's a Linux system call that exposes wbinvd, but you could write a kernel module that adds one.


Recent x86 extensions introduced more cache-control instructions, but still only by address to control specific cache lines. The use-case is for non-volatile memory attached directly to the CPU, such as Intel Optane DC Persistent Memory. If you want to commit to persistent storage without making the next read slow, you can use clwb. But note that clwb is not guaranteed to avoid eviction, it's merely allowed to. It might run the same as clflushopt, as may be the case on SKX.

See https://danluu.com/clwb-pcommit/, but note that pcommit isn't required: Intel decided to simplify the ISA before releasing any chips that need it, so clwb or clflushopt + sfence are sufficient. See https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.
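
Putting that together, a sketch of the clwb + sfence commit recipe (64-byte lines assumed; needs CLWB support and e.g. gcc -mclwb):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Write back every line of [p, p + bytes) toward persistent memory without
// necessarily evicting it; on CPUs where clwb runs like clflushopt the
// lines get evicted anyway.
void commit_to_pmem(void *p, std::size_t bytes)
{
    constexpr std::size_t kLine = 64;
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p) & ~(kLine - 1);
    std::uintptr_t end  = reinterpret_cast<std::uintptr_t>(p) + bytes;
    for (; addr < end; addr += kLine)
        _mm_clwb(reinterpret_cast<void *>(addr));
    _mm_sfence();  // ensure the write-backs are ordered before later stores
}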

Anyway, this is the kind of cache-control that's relevant for modern CPUs. Whatever experiment you're doing requires ring0 and assembly on x86.


Footnote 1: Touching a lot of memory: pure ISO C++17

You could maybe allocate a very large buffer and then memset it (so those writes will pollute all the (data) caches with that data), then unmap it. If delete or free actually returns the memory to the OS right away, then it will no longer be part of your process's address space, so only a few cache lines of other data will still be hot: probably a line or two of stack (assuming you're on a C++ implementation that uses a stack, as well as running programs under an OS...). And of course this only pollutes data caches, not instruction caches, and as Basile points out, some levels of cache are private per-core, and OSes can migrate processes between CPUs.

Also, beware that using an actual memset or std::fill function call, or a loop that optimizes to that, could be optimized to use cache-bypassing or pollution-reducing stores. And I also implicitly assumed that your code is running on a CPU with write-allocate caches, instead of write-through on store misses (because all modern CPUs are designed this way).

Doing something that can't optimize away and touches a lot of memory (e.g. a prime sieve with a long array instead of a bitmap) would be more reliable, but of course still dependent on cache pollution to evict other data. Just reading large amounts of data isn't reliable either; some CPUs implement adaptive replacement policies that reduce pollution from sequential accesses, so looping over a big array hopefully doesn't evict lots of useful data. E.g. the L3 cache in Intel IvyBridge and later does this.
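
To make the footnote concrete, a pure ISO C++17 sketch of the "touch a lot of memory" approach (256 MiB is an arbitrary size chosen to exceed typical L3 capacities; all the caveats above still apply):

#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Write, then read, a buffer much larger than any cache level, hoping to
// evict most other data. No guarantees: the fill may be lowered to
// cache-bypassing stores, and adaptive L3 replacement limits pollution.
std::int64_t thrash_caches(std::size_t bytes = std::size_t{256} << 20)
{
    std::vector<unsigned char> buf(bytes);
    for (std::size_t i = 0; i < buf.size(); ++i)
        buf[i] = static_cast<unsigned char>(i);
    // A data-dependent result keeps the writes from being dead stores.
    return std::accumulate(buf.begin(), buf.end(), std::int64_t{0});
}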
