Memcpy takes the same time as memset


Question

I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice the memory.

More specifically, I run over 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.

operation             time(s)
-----------------------------
memset(a,0xff,LEN)    3.7
memcpy(a,b,LEN)       3.9
a[j] += b[j]          9.4
memcpy(a,b,LEN)       3.8

Notice that memcpy is only slightly slower than memset. The operation a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memset because it operates on three times as much data. However, it is only about 2.5 times as slow as memset.

Then I initialized b to zero with memset(b,0,LEN) and tested again:

operation             time(s)
-----------------------------
memcpy(a,b,LEN)       8.2
a[j] += b[j]          11.5

Now we see that memcpy is about twice as slow as memset, and a[j] += b[j] is about three times as slow as memset, as I expect.

At the very least I would have expected memcpy to be slower before memset(b,0,LEN), because of lazy allocation (first touch) on the first of the 100 iterations.

Why do I only get the time I expect after memset(b,0,LEN)?

test.c

#include <time.h>
#include <string.h>
#include <stdio.h>

void tests(char *a, char *b, const int LEN){
    clock_t time0, time1;
    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);
    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}

main.c

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}

Compile with gcc -O3 test.c main.c (gcc 6.2). Clang 3.8 gives essentially the same result.

Test system: i7-6700HQ@2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN); i.e. I only see the problem on my Skylake system.

I first discovered this issue from the a[j] += b[k] operations in this answer, which was overestimating the bandwidth.

I came up with a simpler test:

#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline))  foo(char *a, char *b, const int LEN) {
  for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b,0,LEN);
    foo(a, b, LEN);
}

This outputs:

9.472976
12.728426

However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs

12.5
12.5

This leads me to think this is an OS allocation issue and not a compiler issue.

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    //GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}

Answer

The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.

malloc etc. work by:

  • if the request can be fulfilled from the freelist, carve a chunk out of it
    • in the case of calloc: the equivalent of memset(ptr, 0, size) is issued
  • if not, obtain fresh address space from the OS (e.g. via sbrk or mmap)

For systems with demand paging (COW) (an MMU helps here), the second option boils down to:

  • create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
  • add these PTEs to the address space of the process

This will consume no physical memory, except for the page tables themselves.

  • Once the new memory is referenced for a read, the read comes from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
  • But if a new page is written, the COW logic kicks in (via a page fault):
    • physical memory is allocated
    • the /dev/zero page is copied to the new page
    • the new page is detached from the mother page
    • and the calling process can finally perform the update that started all this
