C: memcpy speed on dynamically allocated arrays


Question

I need help with the performance of the following code. It does a memcpy on two dynamically allocated arrays of arbitrary size:

#include <stdlib.h>   /* malloc */
#include <string.h>   /* memcpy */
#include <strings.h>  /* bzero */

/* tic()/toc() are timing helpers that print the elapsed time
   (their definition is not part of this snippet). */

int main()
{
  double *a, *b;
  unsigned n = 10000000, i;
  a = malloc(n*sizeof(double));
  b = malloc(n*sizeof(double));
  for(i=0; i<n; i++) {
    a[i] = 1.0;
    /* b[i] = 0.0; */
  }

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero1");

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero2");

  tic();
  memcpy(b, a, n*sizeof(double));
  toc("memcpy");

  return 0;
}

tic/toc measure the execution time.
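
The tic/toc helpers are not shown here; a minimal sketch of what they might look like, assuming a simple wall-clock timer built on clock_gettime (the actual implementation used for the timings is not known), is:

#include <stdio.h>
#include <time.h>

/* Hypothetical tic()/toc() timing helpers - not the actual code used
   above, just one way to measure elapsed wall-clock time on Linux. */
static struct timespec tic_start;

void tic(void)
{
  clock_gettime(CLOCK_MONOTONIC, &tic_start);
}

void toc(const char *label)
{
  struct timespec now;
  clock_gettime(CLOCK_MONOTONIC, &now);
  double elapsed = (now.tv_sec - tic_start.tv_sec)
                 + (now.tv_nsec - tic_start.tv_nsec)*1e-9;
  printf("%s %f\n", label, elapsed);
}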

On my computer it takes 0.035s to memcpy (Linux, gcc version 4.4.6). If I now uncomment the line which initializes the destination array b, the code is three times faster (!) - 0.011s.

I have observed similar behavior when using a loop instead of memcpy. Usually I do not care about this since it is enough to 'initialize' the memory before using it. However, I now need to perform a simple memory copy, and do it as fast as possible. Initializing the data requires writing e.g. 0 to the memory, which is not necessary and takes time. And I would like to perform a memory copy with all available memory bandwidth.

Is there a solution to this problem? Or is it connected to the way Linux handles dynamic memory (some sort of lazy page allocation?) and cannot be worked around? How does it behave on other systems?

EDIT: The same results are obtained with gcc 4.6. I used -O3 to compile.

EDIT: Thank you all for your comments. I do understand that memory mapping takes time. I guess I just have a hard time accepting that it takes so long, much longer than the actual memory access. The code has been modified to include a benchmark of the initialization of array b using two subsequent bzero calls. The timings now show

bzero1 0.273981
bzero2 0.056803
memcpy 0.117934

Clearly, the first bzero call does much more than just stream zeros to memory - that is memory mapping and memory zeroing. The second bzero call, on the other hand, takes half of the time required to do a memcpy, which is exactly as expected - write-only time vs. read-and-write time. I understand that the overhead of the first bzero call must be there for OS security reasons. What about the rest? Can I not decrease it somehow, e.g. by using larger memory pages? Different kernel settings?
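
One option (a sketch only, assuming the goal is just to move the one-time mapping and zeroing cost out of the timed region rather than to eliminate it) would be to pre-fault the destination buffer, for example by allocating b with mmap and MAP_POPULATE:

#include <stddef.h>
#include <sys/mman.h>

/* Sketch (assumption, not from the measurements above): allocate b with
   mmap and MAP_POPULATE so the pages are mapped and zeroed up front,
   moving the fault/zeroing cost out of the timed bzero/memcpy calls. */
double *b = mmap(NULL, n*sizeof(double),
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                 -1, 0);
if (b == MAP_FAILED) { /* handle the allocation failure */ }

This does not make the copy itself faster; it only keeps the one-time mapping cost out of the measured section.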

I should mention that I run this on Ubuntu wheezy.

Answer

The first bzero runs longer because of (1) lazy page allocation and (2) lazy page zero-initialization by the kernel. While the second reason is unavoidable for security reasons, lazy page allocation can be optimized by using larger ("huge") pages.

There are at least two ways to use huge pages on Linux. The hard way is hugetlbfs. The easy way is transparent huge pages.
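
For the hugetlbfs route, a rough sketch (assuming huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages; the exact setup is system-dependent) is to request them explicitly with mmap and MAP_HUGETLB:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Sketch of the "hard way": explicitly request huge pages.
   Assumes huge pages were reserved beforehand, e.g.
     echo 64 > /proc/sys/vm/nr_hugepages
   The mapping returns MAP_FAILED if no huge pages are available. */
double *b = mmap(NULL, n*sizeof(double),
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0);
if (b == MAP_FAILED) { /* fall back to a regular malloc */ }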

Search for khugepaged in the list of processes on your system. If such a process exists, transparent huge pages are supported, and you can use them in your application if you change the malloc to this:

/* Align b to a 2 MB boundary and advise the kernel to back it with
   transparent huge pages (madvise/MADV_HUGEPAGE are in <sys/mman.h>). */
posix_memalign((void **)&b, 2*1024*1024, n*sizeof(double));
madvise((void *)b, n*sizeof(double), MADV_HUGEPAGE);
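
For reference, the 2*1024*1024 alignment corresponds to the typical 2 MB huge-page size on x86-64; posix_memalign returns 0 on success and should be checked, and whether transparent huge pages are enabled on a given system can also be seen in /sys/kernel/mm/transparent_hugepage/enabled.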
