Slowdown when accessing data at page boundaries?

Views: 84. Posted: 2020/5/8 20:01:18. Tags: performance memory cpu-architecture cpu-cache

Problem description

(My question is about computer architecture and performance. I did not find a more relevant forum, so I am posting it here as a general question.)

I have a C program which accesses memory words that are located X bytes apart in virtual address space. For instance, for (int i=0;<some stop condition>;i+=X){array[i]=4;}.

I measure the execution time for varying values of X. Interestingly, when X is a power of 2 and close to the page size, e.g., X=1024,2048,4096,8192..., I see a huge slowdown. But for all other values of X, like 1023 and 1025, there is no slowdown. The performance results are attached in the figure below.

I tested my program on several personal machines, all running Linux on x86_64 Intel CPUs.

What could be the cause of this slowdown? We have considered the DRAM row buffer, the L3 cache, etc., but none of these seem to explain it.

Update (July 11)

We did a small test by adding NOP instructions to the original code, and the slowdown is still there. This more or less rules out 4k aliasing. Conflict cache misses are the more likely cause here.

Solution

There are two things here:

First, set-associative cache aliasing creates conflict misses if you only touch addresses that are multiples of 1024. The fast inner caches (L1 and L2) are normally indexed by a small range of bits from the physical address. Striding by 1024 bytes means several of those address bits are the same for every access, so you're only using a few of the sets in the cache.

With a non-power-of-2 stride, your accesses are distributed over more sets in the cache. See "Performance advantages of powers-of-2 sized data?" (the answer there describes this disadvantage).

See also "Which cache mapping technique is used in intel core i7 processor?": the shared L3 cache is resistant to aliasing from big power-of-2 offsets because it uses a more complex indexing function.

Second, 4k aliasing (e.g. in some Intel CPUs), although with only stores this probably doesn't matter. It's mainly a factor for memory disambiguation, when the CPU has to quickly figure out whether a load might be reloading recently stored data, and it does so in the first pass by looking only at the page-offset bits.

This is probably not what's going on in your case, but for more details see "L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes" and "Why are elementwise additions much faster in separate loops than in a combined loop?".

Either or both of these effects could be a factor in "Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?"

Another possible factor is that HW prefetching stops at physical page boundaries (see "Why does the speed of memcpy() drop dramatically every 4KB?"). But changing the stride from 1024 to 1023 wouldn't help with that by a big factor. "Next-page" prefetching in IvyBridge and later is only TLB prefetching, not prefetching of data from the next page.

I mostly assumed x86 for this answer, but the cache aliasing / conflict-miss stuff applies generally. Set-associative caches with simple indexing are universally used for L1d caches (or, on older CPUs, direct-mapped caches, where each "set" has only 1 member). The 4k aliasing stuff might be mostly Intel-specific.

Prefetching across virtual page boundaries is likely also a general problem.
