Why is memmove faster than memcpy?

Question

I am investigating performance hotspots in an application which spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays, and uses memmove to shift the data "to the right" in order to make space for the inserted value.
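
The pattern looks roughly like the sketch below (illustrative only; the function and variable names are my own, not the application's actual code): the tail of the sorted array is shifted one slot to the right with memmove, whose source and destination overlap, and the new value is written into the gap.

    // Minimal sketch of the insertion pattern described above (illustrative,
    // not the actual application code).
    #include <cstring>
    #include <cstdint>
    #include <cstddef>

    // Assumes the array has capacity for at least count + 1 elements.
    void insert_sorted(uint32_t *arr, size_t &count, uint32_t value)
    {
        // Find the insertion position (linear scan for brevity; a real
        // implementation would typically binary-search the sorted array).
        size_t pos = 0;
        while (pos < count && arr[pos] < value)
            ++pos;

        // Source and destination overlap, so memmove (not memcpy) is required.
        memmove(&arr[pos + 1], &arr[pos], (count - pos) * sizeof(uint32_t));
        arr[pos] = value;
        ++count;
    }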

My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it's moving overlapping regions, which must be implemented in a tight loop, instead of copying large pages of memory. I wrote a small microbenchmark to find out whether there was a performance difference between memcpy and memmove, expecting memcpy to win hands down.

I ran my benchmark on two machines (Core i5, Core i7) and saw that memmove is actually faster than memcpy; on the older Core i7 it was even nearly twice as fast! Now I am looking for explanations.

Here is my benchmark. It copies 100 MB with memcpy, and then moves about 100 MB with memmove; source and destination are overlapping. Various "distances" for source and destination are tried. Each test is run 10 times, and the average time is printed.

https://gist.github.com/cruppstahl/78a57cdf937bca3d062c
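
For reference, here is a simplified sketch of what the benchmark does (the authoritative version is the gist above; the buffer names and timing code below are illustrative, and this sketch runs each test only once instead of averaging 10 runs):

    // Simplified benchmark sketch: memcpy between two separate buffers vs.
    // memmove within one buffer, with a small gap between source and
    // destination (so the regions overlap).
    #include <cstring>
    #include <cstdlib>
    #include <cstdio>
    #include <chrono>

    static const size_t BUFFERSIZE = 100 * 1024 * 1024;   // ~100 MB

    int main()
    {
        char *b1 = (char *)malloc(BUFFERSIZE);
        char *b2 = (char *)malloc(BUFFERSIZE);
        memset(b1, 1, BUFFERSIZE);              // the source is touched up front

        // memcpy: source and destination are two completely separate buffers.
        auto t0 = std::chrono::steady_clock::now();
        memcpy(b2, b1, BUFFERSIZE);
        auto t1 = std::chrono::steady_clock::now();
        printf("memcpy        %f\n",
               std::chrono::duration<double>(t1 - t0).count());

        // memmove: the destination overlaps the source, shifted by "gap" bytes.
        for (size_t gap = 2; gap <= 128; gap *= 2) {
            t0 = std::chrono::steady_clock::now();
            memmove(b1 + gap, b1, BUFFERSIZE - gap);
            t1 = std::chrono::steady_clock::now();
            printf("memmove (%03zu) %f\n", gap,
                   std::chrono::duration<double>(t1 - t0).count());
        }

        free(b1);
        free(b2);
        return 0;
    }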

Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux, gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)). The number in brackets is the distance (gap size) between source and destination:

memcpy        0.0140074
memmove (002) 0.0106168
memmove (004) 0.01065
memmove (008) 0.0107917
memmove (016) 0.0107319
memmove (032) 0.0106724
memmove (064) 0.0106821
memmove (128) 0.0110633

Memmove is implemented as SSE-optimized assembly code, copying from back to front. It uses hardware prefetch to load the data into the cache, copies 128 bytes at a time into XMM registers, and then stores them at the destination.

(memcpy-ssse3-back.S, lines 1650 ff.)
http://downloads.yoctoproject.org/mirror/sources/svn/www.eglibc.org/svn/branches/eglibc-2_13/libc/sysdeps/x86_64/multiarch/memcpy-ssse3-back.S

L(gobble_ll_loop):
    prefetchnta -0x1c0(%rsi)
    prefetchnta -0x280(%rsi)
    prefetchnta -0x1c0(%rdi)
    prefetchnta -0x280(%rdi)
    sub $0x80, %rdx
    movdqu  -0x10(%rsi), %xmm1
    movdqu  -0x20(%rsi), %xmm2
    movdqu  -0x30(%rsi), %xmm3
    movdqu  -0x40(%rsi), %xmm4
    movdqu  -0x50(%rsi), %xmm5
    movdqu  -0x60(%rsi), %xmm6
    movdqu  -0x70(%rsi), %xmm7
    movdqu  -0x80(%rsi), %xmm8
    movdqa  %xmm1, -0x10(%rdi)
    movdqa  %xmm2, -0x20(%rdi)
    movdqa  %xmm3, -0x30(%rdi)
    movdqa  %xmm4, -0x40(%rdi)
    movdqa  %xmm5, -0x50(%rdi)
    movdqa  %xmm6, -0x60(%rdi)
    movdqa  %xmm7, -0x70(%rdi)
    movdqa  %xmm8, -0x80(%rdi)
    lea -0x80(%rsi), %rsi
    lea -0x80(%rdi), %rdi
    jae L(gobble_ll_loop)

Why is memmove faster than memcpy? I would expect memcpy to copy memory pages, which should be much faster than looping. In the worst case I would expect memcpy to be as fast as memmove.

PS: I know that I cannot replace memmove with memcpy in my code. I know that the code sample mixes C and C++. This question is really just for academic purposes.

I ran some variations of the tests, based on the various answers.


  1. When running memcpy twice, the second run is faster than the first one.
  2. When "touching" the destination buffer of memcpy beforehand (memset(b2, 0, BUFFERSIZE...)), the first run of memcpy is also faster (see the sketch after this list).
  3. memcpy is still a little bit slower than memmove.
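
A sketch of the "touching" variation from point 2 (b1, b2 and BUFFERSIZE are the names used in the question; the exact code here is illustrative, not taken from the gist). Writing the destination once beforehand forces the kernel to back those pages with physical memory, so the timed memcpy no longer pays the page-fault cost:

    // Pre-touch the destination, then time memcpy (illustrative sketch;
    // buffersize plays the role of BUFFERSIZE from the question).
    #include <cstring>
    #include <cstdio>
    #include <cstddef>
    #include <chrono>

    void timed_memcpy_pretouched(char *b2, const char *b1, size_t buffersize)
    {
        memset(b2, 0, buffersize);          // fault in the destination pages

        auto t0 = std::chrono::steady_clock::now();
        memcpy(b2, b1, buffersize);
        auto t1 = std::chrono::steady_clock::now();
        printf("memcpy (pre-touched) %f\n",
               std::chrono::duration<double>(t1 - t0).count());
    }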

Here are the results:

memcpy        0.0118526
memcpy        0.0119105
memmove (002) 0.0108151
memmove (004) 0.0107122
memmove (008) 0.0107262
memmove (016) 0.0108555
memmove (032) 0.0107171
memmove (064) 0.0106437
memmove (128) 0.0106648

My conclusion: based on a comment from @Oliver Charlesworth, the operating system has to commit physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to "prove" this, please add an answer!). In addition, as @Mats Petersson said, memmove is more cache-friendly than memcpy.
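
One way to observe the commit-on-first-write effect is to compare the process's minor page-fault count around a first and a second copy into a freshly allocated destination; the sketch below (Linux-specific, using getrusage(2), and not part of the original question) illustrates the idea:

    // Count minor page faults around the first and second memcpy into a
    // freshly malloc'd destination. A large count on the first copy and a
    // near-zero count on the second suggests the kernel commits physical
    // pages on first access.
    #include <cstring>
    #include <cstdlib>
    #include <cstdio>
    #include <sys/resource.h>

    static long minor_faults()
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main()
    {
        const size_t N = 100 * 1024 * 1024;
        char *src = (char *)malloc(N);
        char *dst = (char *)malloc(N);
        memset(src, 1, N);                  // the source is already touched

        long before = minor_faults();
        memcpy(dst, src, N);                // first access to dst pages
        long first = minor_faults() - before;

        before = minor_faults();
        memcpy(dst, src, N);                // dst pages are committed by now
        long second = minor_faults() - before;

        printf("minor faults: first copy %ld, second copy %ld\n", first, second);
        free(src);
        free(dst);
        return 0;
    }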

Thanks for all the great answers and comments!

Answer

Your memmove calls are shuffling memory along by 2 to 128 bytes, while your memcpy source and destination are completely different. Somehow that's accounting for the performance difference: if you copy to the same place, you'll see memcpy ends up possibly a smidge faster, e.g. on ideone.com:

memmove (002) 0.0610362
memmove (004) 0.0554264
memmove (008) 0.0575859
memmove (016) 0.057326
memmove (032) 0.0583542
memmove (064) 0.0561934
memmove (128) 0.0549391
memcpy 0.0537919

Hardly anything in it though - there's no evidence that writing back to an already faulted-in memory page has much impact, and we're certainly not seeing a halving of time... but it does show that there's nothing making memcpy unnecessarily slower when compared apples-for-apples.
