Memcpy takes the same time as memset
Question
I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice the memory.
More specifically, I run over 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.
operation time(s)
-----------------------------
memset(a,0xff,LEN) 3.7
memcpy(a,b,LEN) 3.9
a[j] += b[j] 9.4
memcpy(a,b,LEN) 3.8
Notice that memcpy is only slightly slower than memset. The operation a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memcpy because it operates on three times as much data. However, it's only about 2.5 times as slow as memset.
Then I initialized b to zero with memset(b,0,LEN) and tested again:
operation time(s)
-----------------------------
memcpy(a,b,LEN) 8.2
a[j] += b[j] 11.5
Now we see that memcpy is about twice as slow as memset, and a[j] += b[j] is about thrice as slow as memset, as I expect.
At the very least I would have expected that, before memset(b,0,LEN), memcpy would be slower because of lazy allocation (first touch) on the first of the 100 iterations.
Why do I only get the time I expect after memset(b,0,LEN)?
test.c
#include <time.h>
#include <string.h>
#include <stdio.h>

void tests(char *a, char *b, const int LEN) {
    clock_t time0, time1;

    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a, 0xff, LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a, b, LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for (int j = 0; j < LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a, b, LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b, 0, LEN);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a, b, LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for (int j = 0; j < LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}
main.c
#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1 GB
    char *a = (char*)calloc(LEN, 1);
    char *b = (char*)calloc(LEN, 1);
    tests(a, b, LEN);
}
Compile with (gcc 6.2) gcc -O3 test.c main.c. Clang 3.8 gives essentially the same result.
Test system: i7-6700HQ@2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN), i.e. I only see the problem on my Skylake system.
I first discovered this issue from the a[j] += b[k] operations in this answer, which was overestimating the bandwidth.
I came up with a simpler test:
#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline)) foo(char *a, char *b, const int LEN) {
    for (int i = 0; i < 100; i++) for (int j = 0; j < LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b, 0, LEN);
    foo(a, b, LEN);
}
This outputs:
9.472976
12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below), then it outputs
12.5
12.5
This leads me to think this is an OS allocation issue and not a compiler issue.
#include <stdlib.h>
#include <string.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1 GB
    char *a = (char*)calloc(LEN, 1);
    char *b = (char*)calloc(LEN, 1);
    // GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b, 1, LEN);
    tests(a, b, LEN);
}
Answer
The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.
malloc etc. work by:
- if the request can be fulfilled by the freelist, carve a chunk out of it
- in the case of calloc: the equivalent of memset(ptr, 0, size) is issued
For systems with demand paging (COW) (an MMU could help here), the second option winds down to:
- create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
- add these PTEs to the address space of the process
This will consume no physical memory, except for the page tables.
- Once the new memory is referenced for read, the read will come from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
- But if a new page is written, the COW logic kicks in (via a page fault):
  - physical memory is allocated
  - the /dev/zero page is copied to the new page
  - the new page is detached from the mother page
  - and the calling process can finally do the update which started all this