Why does the speed of memcpy() drop dramatically every 4KB?
Problem Description
I tested the speed of memcpy() and noticed that the speed drops dramatically at i*4KB. The result: the Y axis is the speed (MB/second) and the X axis is the buffer size of memcpy(), increasing from 1KB to 2MB. Figure 2 and Figure 3 detail the 1KB-150KB and 1KB-32KB ranges.
Environment:
CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
OS: 2.6.35-22-generic #33-Ubuntu
GCC compiler flags: -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to the cache, but I can't find the reason in the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
The performance degradation in those two cases is caused by unfriendly loops that read scattered bytes into the cache, wasting the rest of the space of a cache line, so they don't seem to explain what I'm seeing here.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;

    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);

    gettimeofday(&start, NULL);
    for(int i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);

    /* (bytes copied / 2^20) divided by the elapsed microseconds gives MB/s */
    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024)) /
           ((end.tv_sec - start.tv_sec)*1000*1000 + (end.tv_usec - start.tv_usec)));

    free(pbuff_1);
    free(pbuff_2);
}
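For context, a minimal driver for the function above might look like the following sketch; the sweep range and the iteration count are illustrative choices, not the exact values used to produce the graphs:

int main(void){
    /* Sweep buffer sizes from 1KB to 2MB in 1KB steps and print one
       MB/s figure per size; 1,000 copies per size is an arbitrary choice. */
    for(unsigned long size = 1024; size <= 2*1024*1024; size += 1024){
        memcpy_speed(size, 1000);
    }
    return 0;
}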
UPDATE
Considering suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely, and the graph below shows the results. The buffer size runs from 26KB to 38KB, tested in 64B steps (26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 seconds. The interesting thing is that the drop not only occurs exactly at the 4KB boundaries, but also shows up at 4*i+2 KB, with a much smaller amplitude.
PS
@Leeor offered a way to fill the drop, adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.

Solution
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
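As a rough illustration of that cost (my own sketch, not part of the answer), a loop that strides one full page per access forces a fresh translation on nearly every load, while a byte-by-byte loop reuses the same TLB entry thousands of times:

#include <stddef.h>

/* Illustrative only: with stride == 4096 almost every access lands on a new
   page and may need a TLB fill or a full page walk; with stride == 1 the
   same translation is reused for 4096 consecutive accesses. */
unsigned char touch(const unsigned char *buf, size_t len, size_t stride){
    unsigned char sum = 0;
    for(size_t off = 0; off < len; off += stride)
        sum += buf[off];
    return sum;
}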
Your memcpy function may have another issue - when first allocating memory, the OS would just build the pages into the pagemap but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stored into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up), so you'll see the cache sizes reflected in your graphs. If you want a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
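A sketch of that measurement scheme (my own wording of the advice, with a hypothetical helper name): fault the pages in once with memset outside the timed region, then time many copies and report the amortized figure:

#include <string.h>

/* warm_copy: assumes dst and src were already malloc'd with at least
   buf_size bytes each.  The memset calls take the first-touch page faults
   before the loop, so the caller's timing sees only the steady-state cost. */
void warm_copy(unsigned char *dst, unsigned char *src,
               unsigned long buf_size, unsigned long iters){
    memset(src, 1, buf_size);          /* first touch of the source pages */
    memset(dst, 1, buf_size);          /* first touch of the target pages */
    for(unsigned long i = 0; i < iters; ++i)   /* timed by the caller */
        memcpy(dst, src, buf_size);
}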
EDIT
Rereading the question, you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
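A sketch of that experiment (hypothetical helper; it relies on the allocator handing out the three blocks roughly back to back, which is not guaranteed):

#include <stdlib.h>

/* The 2k dummy allocation shifts pbuff_2 relative to pbuff_1 by a
   non-multiple of 4k, so the two streams no longer map to the same L1
   sets - assuming malloc places the blocks roughly contiguously. */
void alloc_with_gap(unsigned long buf_size, unsigned char **pbuff_1,
                    unsigned char **dummy, unsigned char **pbuff_2){
    *pbuff_1 = malloc(buf_size);
    *dummy   = malloc(2 * 1024);   /* never touched, only occupies addresses */
    *pbuff_2 = malloc(buf_size);
}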
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet, they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set - I'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iterations to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set, and I'm pretty sure there are a bunch of collisions hiding there.
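To make the set arithmetic concrete, here is a small sketch of my own using the parameters stated above (64-byte lines, a 32k data L1 with 8 ways, hence 64 sets):

#include <stdint.h>
#include <stdio.h>

/* 32k / (8 ways * 64-byte lines) = 64 sets, so the set index is address
   bits 6..11 and repeats every 4k. */
static unsigned l1_set(uintptr_t addr){
    return (addr >> 6) & 63;
}

int main(void){
    printf("%u %u\n", l1_set(0x1000), l1_set(0x2000));   /* prints 0 0 */
    printf("%u %u\n", l1_set(0x1040), l1_set(0x2040));   /* prints 1 1 */
    return 0;
}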
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different memory locations with a 4-KByte offset between them. The 4-KByte aliasing situation can manifest in a memory copy routine where the addresses of the source buffer and destination buffer maintain a constant offset and the constant offset happens to be a multiple of the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can continue. For example at offset 16, the load of the next iteration is 4-KByte aliased current iteration store, therefore the loop must wait until the store operation completes, making the entire loop serialized. The amount of time needed to wait decreases with larger offset until offset of 96 resolves the issue (as there is no pending stores by the time of the load with same address).
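One way to act on that guideline in a benchmark like this (my own sketch, not from the guide): over-allocate the destination and nudge its start so that source and destination are never a multiple of 4KB apart. The 2KB target offset is an arbitrary value comfortably above the 96-byte threshold quoted above:

#include <stdint.h>
#include <stdlib.h>

/* Returns a destination pointer whose offset from src modulo 4KB is 2KB.
   raw_out receives the underlying allocation for the eventual free(). */
unsigned char *make_unaliased_dst(const unsigned char *src,
                                  unsigned long buf_size,
                                  unsigned char **raw_out){
    unsigned char *raw = malloc(buf_size + 4096);    /* room for the nudge  */
    uintptr_t delta = ((uintptr_t)raw - (uintptr_t)src) & 0xFFF;
    uintptr_t nudge = (2048 - delta) & 0xFFF;        /* bring offset to 2KB */
    *raw_out = raw;
    return raw + nudge;
}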