Bring code into the L1 instruction cache without executing it

Question

Let's say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing it, since I don't want to measure the cost of I$ misses as part of the benchmark.

The obvious way to do this is to simply execute the code at least once before the benchmark, hence "warming it up" and bringing it into the L1 instruction cache and possibly the uop cache, etc.

What are my alternatives in the case I don't want to execute the code (e.g., because I want the various predictors which key off of instruction addresses to be cold)?

Answer

Map the same physical page to two different virtual addresses.

L1I$ is physically addressed. (VIPT but with all the index bits from below the page offset, so effectively PIPT).

Branch-prediction and uop caches are virtually addressed, so with the right choice of virtual addresses, a warm-up run of the function at the alternate virtual address will prime L1I, but not branch prediction or uop caches. (This only works if branch aliasing happens modulo something larger than 4096 bytes, because the position within the page is the same for both mappings.)
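
To make this concrete, here is a minimal sketch of the two-mapping setup on Linux/x86-64 (not from the original answer): it assumes a `memfd_create`-backed page and a small, position-independent test function whose bytes are bracketed by the hypothetical symbols `bench_code`/`bench_code_end`, and it omits error checking.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef void (*fn_t)(void);

/* Hypothetical symbols bracketing a small, position-independent function. */
extern const char bench_code[], bench_code_end[];

int main(void) {
    long page = sysconf(_SC_PAGESIZE);

    /* One physical page backed by an anonymous memfd... */
    int fd = memfd_create("bench-code", 0);
    ftruncate(fd, page);

    /* ...mapped at two different virtual addresses, both executable.
       (Error checking omitted for brevity.) */
    uint8_t *primary = mmap(NULL, page, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_SHARED, fd, 0);
    uint8_t *alias   = mmap(NULL, page, PROT_READ | PROT_EXEC,
                            MAP_SHARED, fd, 0);

    /* Copy the function under test into the shared page. */
    memcpy(primary, bench_code, (size_t)(bench_code_end - bench_code));

    /* Warm-up through the alias: fills L1I (physically addressed), but
       trains branch predictors / uop cache only for the alias's virtual
       address, not for 'primary'. */
    ((fn_t)alias)();

    /* (iTLB priming for 'primary' would go here -- see the next point.) */

    /* Timed run through the primary mapping: L1I is hot, but predictors
       and uop cache for this virtual address are still cold. */
    ((fn_t)primary)();

    return 0;
}
```

The warm-up is still a real execution, but because L1I is physically indexed while the predictors and the uop cache key on virtual addresses, only the instruction-cache state carries over to the run at `primary`.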

Prime the iTLB by calling a ret in the same page as the test function, but outside it.
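
Continuing the sketch above (again only an illustration; it assumes the test function does not reach the last byte of the page, and 0xC3 is the single-byte x86 `ret` opcode), the iTLB entry for `primary` could be primed just before the timed run like this:

```c
    /* Plant a lone 'ret' in the same page as the test function but outside
       it, then call it once through the primary mapping. This loads the
       iTLB entry for primary's virtual address without executing the
       function itself or training its branch predictors. */
    uint8_t *dummy_ret = primary + page - 1;   /* last byte of the page */
    *dummy_ret = 0xC3;                         /* x86 'ret' opcode      */
    ((fn_t)dummy_ret)();
```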

After setting this up, no modification of the page tables is required between the warm-up run and the timing run. This is why you use two mappings of the same page instead of remapping a single mapping.

Margaret Bloom suggests that CPUs vulnerable to Meltdown might speculatively fetch instructions from a no-exec page if you jump there (in the shadow of a mispredict so it doesn't actually fault), but that would then require changing the page table, and thus a system call which is expensive and might evict that line of L1I. But if it doesn't pollute the iTLB, you could then re-populate the iTLB entry with a mispredicted branch anywhere into the same page as the function. Or just a call to a dummy ret outside the function in the same page.

None of this will let you get the uop cache warmed up, though, because it's virtually addressed. OTOH, in real life, if branch predictors are cold then probably the uop cache will also be cold.
