Why does the speed of memcpy() drop dramatically every 4KB?


Question

I tested the speed of memcpy(), noticing that the speed drops dramatically at i*4KB. The result is as follows: the Y-axis is the speed (MB/second) and the X-axis is the size of the buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the parts from 1KB-150KB and 1KB-32KB.

Environment:

CPU : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

OS : 2.6.35-22-generic #33-Ubuntu

GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99

I guess it must be related to caches, but I can't find a reason in the following cache-unfriendly cases:

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

The performance degradation in both of those cases is caused by unfriendly loops that read scattered bytes into the cache, wasting the rest of each cache line.

Here is my code:

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/time.h&gt;

void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;

    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);

    gettimeofday(&start, NULL);
    for(unsigned long i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);
    /* bytes / 1.048576 / microseconds == MB (2^20 bytes) per second */
    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024)) /
           ((end.tv_sec - start.tv_sec)*1000*1000 +
            (end.tv_usec - start.tv_usec)));
    free(pbuff_1);
    free(pbuff_2);
}

UPDATE

Considering suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely and the graph below shows the results. The buffer size runs from 26KB to 38KB, tested at 64B intervals (26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 seconds. The interesting thing is that the drop not only occurs exactly at 4KB boundaries, but also shows up at 4*i+2 KB, with a much smaller amplitude.

@Leeor offered a way to fill in the drops by adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.

Answer

Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but that's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map), would usually try to keep the physical pages together as well, but that's not always possible and they may become fragmented (especially with long usage, where they may occasionally be swapped out).
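As a quick sanity check, the 4k page size the answer refers to can be confirmed at runtime with POSIX sysconf(3). This is just a small sketch; on the questioner's Linux/Xeon setup it would return 4096, but other platforms may use larger pages:

```c
#include <unistd.h>

/* Query the system page size at runtime via POSIX sysconf(3).
 * On x86 Linux this is typically 4096 bytes. */
long runtime_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```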

When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation. If it has already seen the page, the translation may be cached in the TLB and the access is optimized to be fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries. That's relatively long, as each level is in fact a memory read by itself (on virtual machines it's even longer, as each level may need a full page walk on the host).

Your memcpy function may have another issue: when memory is first allocated, the OS just builds the pages into the page map but marks them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stored into, for the target buffer pages), which would take an expensive transition into some OS handler.

In order to eliminate this noise, allocate the buffers once and perform several repetitions of the copy, then calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after the caches have been warmed up), so you'll see the cache sizes reflected in your graphs. If you want a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
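The methodology above can be sketched roughly as follows. The function name and the memset-based pre-faulting are my own illustration, not part of the original benchmark; the pre-touch ensures first-access page faults and page walks don't pollute the timed region, and the result is amortized over all iterations:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Sketch: pre-fault both buffers with memset so first-touch costs fall
 * outside the timed region, then report amortized bandwidth in MB/s. */
double timed_copy_mbps(unsigned long buf_size, unsigned long iters)
{
    unsigned char *src = malloc(buf_size);
    unsigned char *dst = malloc(buf_size);
    memset(src, 1, buf_size);   /* fault in + warm the source pages */
    memset(dst, 1, buf_size);   /* fault in + warm the destination pages */

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (unsigned long i = 0; i < iters; ++i)
        memcpy(dst, src, buf_size);
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    free(src);
    free(dst);
    /* bytes/usec divided by 1.048576 gives MB/s (1 MB = 2^20 bytes) */
    return (double)buf_size * iters / 1.048576 / usec;
}
```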

Rereading the question, you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again but then enjoy a free ride until the next 4k. It doesn't explain why there are such "spikes" after which the speed returns to normal.

I think you are facing an issue similar to the critical stride issue linked in your question: when your buffer size is a nice round 4k, both buffers align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways, it in fact wraps around to the same sets every 4k, and you have 2*4k blocks with the exact same alignment (assuming the allocations were done contiguously), so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect, and you'll keep having conflicts.

To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large, and hope that it breaks the alignment.
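A sketch of that workaround is below. Note the caveats: the function name is mine, the trick assumes the allocator places the three same-request-stream blocks roughly contiguously (glibc's malloc often does for such sizes, but nothing guarantees it), and comparing pointers from different allocations is done via uintptr_t purely for diagnostics:

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch of the padding workaround: a 2KB dummy block between the two
 * allocations shifts pbuff_2 so the buffers no longer sit an exact
 * multiple of 4KB apart (assuming contiguous placement by malloc).
 * Returns the byte distance between the two buffers for inspection. */
long padded_distance(unsigned long buf_size)
{
    unsigned char *pbuff_1 = malloc(buf_size);
    unsigned char *dummy   = malloc(2048);   /* padding between buffers */
    unsigned char *pbuff_2 = malloc(buf_size);

    long delta = (long)((uintptr_t)pbuff_2 - (uintptr_t)pbuff_1);

    free(pbuff_1);
    free(dummy);
    free(pbuff_2);
    return delta;
}
```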

Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any thrashing issue yet; they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set - I'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iterations to copy a single line, meaning you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set; I'm pretty sure there are a bunch of collisions hiding there.
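The set mapping described above can be computed directly. For a 32KB, 8-way L1 with 64-byte lines there are 32768 / (8 * 64) = 64 sets, so the set index is just bits [11:6] of the address and any two addresses 4KB apart land in the same set (the geometry here matches the E5620's L1d; the helper function is illustrative):

```c
/* L1d geometry assumed: 32KB, 8-way associative, 64-byte lines,
 * so 32768 / (8 * 64) = 64 sets. The set index is bits [11:6] of the
 * address, hence addresses 4KB apart always map to the same set. */
enum { LINE_SIZE = 64, NUM_SETS = 64 };

unsigned set_index(unsigned long addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}
```

With this, 0x1000 and 0x2000 both map to set 0, and 0x1040 and 0x2040 both map to set 1, exactly as the answer states.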

I also see that the Intel optimization guide has something to say specifically about that (see 3.6.8.2):

4-KByte memory aliasing occurs when the code accesses two different memory locations with a 4-KByte offset between them. The 4-KByte aliasing situation can manifest in a memory copy routine where the addresses of the source buffer and destination buffer maintain a constant offset and the constant offset happens to be a multiple of the byte increment from one iteration to the next.

...

loads have to wait until stores have been retired before they can continue. For example at offset 16, the load of the next iteration is 4-KByte aliased with the current iteration's store, therefore the loop must wait until the store operation completes, making the entire loop serialized. The amount of time needed to wait decreases with larger offsets, until an offset of 96 resolves the issue (as there are no pending stores with the same address by the time of the load).
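One way to probe this effect is to time a fixed-size copy at various src/dst distances inside one large allocation; per the guide's description, distances that are an exact multiple of 4096 should be slower than, say, 4096 + 96. This is only a sketch with illustrative sizes and iteration counts, not a calibrated benchmark:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Sketch: time `iters` copies of LEN bytes whose destination sits `off`
 * bytes past the source within one pre-faulted arena. On CPUs subject
 * to 4-KByte aliasing, off == 4096 should measure slower than
 * off == 4096 + 96. Requires LEN <= off and off + LEN <= arena size. */
double probe_offset_usec(size_t off, unsigned long iters)
{
    enum { LEN = 2048, ARENA = 16 * 4096 };
    unsigned char *base = malloc(ARENA);
    memset(base, 0, ARENA);              /* pre-fault all arena pages */

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (unsigned long i = 0; i < iters; ++i)
        memcpy(base + off, base, LEN);   /* dst trails src by `off` bytes */
    gettimeofday(&end, NULL);

    free(base);
    return (end.tv_sec - start.tv_sec) * 1e6
         + (end.tv_usec - start.tv_usec);
}
```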
