Memcpy takes the same time as memset


Question

I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice the memory.

More specifically, I run over 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.

operation             time(s)
-----------------------------
memset(a,0xff,LEN)    3.7
memcpy(a,b,LEN)       3.9
a[j] += b[j]          9.4
memcpy(a,b,LEN)       3.8

Notice that memcpy is only slightly slower than memset. The operation a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memset because it operates on three times as much data. However, it is only about 2.5 times as slow as memset.

Then I initialized b to zero with memset(b,0,LEN) and tested again:

operation             time(s)
-----------------------------
memcpy(a,b,LEN)       8.2
a[j] += b[j]          11.5

Now we see that memcpy is about twice as slow as memset, and a[j] += b[j] is about three times as slow as memset, as I expect.

At the very least I would have expected memcpy to be slower before memset(b,0,LEN), because of lazy allocation (first touch) on the first of the 100 iterations.

Why do I only get the time I expect after memset(b,0,LEN)?

test.c

#include <time.h>
#include <string.h>
#include <stdio.h>

void tests(char *a, char *b, const int LEN){
    clock_t time0, time1;
    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);
    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}

main.c

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}

Compile with gcc -O3 test.c main.c (gcc 6.2). Clang 3.8 gives essentially the same result.

Test system: i7-6700HQ@2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN); i.e. I only see the problem on my Skylake system.

I first discovered this issue from the a[j] += b[k] operations in this answer, which was overestimating the bandwidth.

I came up with a simpler test:

#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline))  foo(char *a, char *b, const int LEN) {
  for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b,0,LEN);
    foo(a, b, LEN);
}

This outputs:

9.472976
12.728426

However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs

12.5
12.5

This leads me to think this is an OS allocation issue and not a compiler issue.

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    //GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}

Answer

The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.

malloc etc. work by:

  • if the request can be fulfilled from the freelist, carve a chunk out of it
    • in the case of calloc: the equivalent of memset(ptr, 0, size) is issued
  • if not, obtain fresh address space from the OS (e.g. via sbrk or mmap)

For systems with demand paging (COW) (an MMU helps here), the second option boils down to:

  • create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
  • add these PTEs to the address space of the process

This will consume no physical memory, except for the page tables themselves.

  • Once the new memory is referenced for a read, the read comes from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
  • But if a new page is written, the COW logic kicks in (via a page fault):
    • physical memory is allocated
    • the /dev/zero page is copied to the new page
    • the new page is detached from the mother page
    • and the calling process can finally perform the update that started all this
