Can someone explain the performance behavior of the following memory-allocating C program?

Question

On my machine Time A and Time B swap depending on whether A is defined or not (which changes the order in which the two callocs are called).

I initially attributed this to the paging system. Weirdly, when mmap is used instead of calloc, the situation is even more bizarre: both loops take the same amount of time, as expected. As can be seen with strace, the callocs ultimately result in two mmaps, so there is no return-already-allocated-memory magic going on.

I'm running Debian testing on an Intel i7.

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, (a) * (b), PROT_READ | PROT_WRITE,  \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
  clock_t start, finish;
#ifdef A
  int *arr1 = ALLOC(sizeof(int), SIZE);
  int *arr2 = ALLOC(sizeof(int), SIZE);
#else
  int *arr2 = ALLOC(sizeof(int), SIZE);
  int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
  int i;

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr1[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr2[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  return 0;
}

The output I get:

 ~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop 
Time A: 0.94
Time B: 0.34
 ~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                               
Time A: 0.34
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                                          
Time A: 0.89
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop 
 ~/directory $ ./bench-loop                                      
Time A: 0.91
Time B: 0.92

Answer

Short Answer

The first time calloc is called, it explicitly zeroes out the memory. The next time it is called, it assumes that the memory returned from mmap is already zeroed out.

Details

Here are some of the things I checked to reach this conclusion, which you could try yourself if you wanted:

  1. Insert a calloc call before your first ALLOC call. You will see that after this, Time A and Time B are the same.

  2. Use the clock() function to check how long each of the ALLOC calls takes. In the case where they are both using calloc, you will see that the first call takes much longer than the second one.

  3. Use the time command to measure the total execution time of the calloc version and the USE_MMAP version. When I did this, I saw that the execution time for USE_MMAP was consistently slightly less.

  4. I ran with strace -tt -T, which shows both the time at which each system call was made and how long it took. Here is part of the output:

Strace output:

21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>

You can see that the first mmap call took 0.000014 seconds, but that about 1.65 seconds elapsed before the next system call. Then the second mmap call took 0.000021 seconds, and was followed by the times call roughly 120 microseconds later.

I also stepped through part of the application execution with gdb and saw that the first call to calloc resulted in numerous calls to memset, while the second call to calloc did not make any calls to memset. You can read the glibc source code for calloc (look for __libc_calloc) if you are interested. As for why calloc does the memset on the first call but not on subsequent ones, I don't know. But I feel fairly confident that this explains the behavior you have asked about.

As for why the array that was zeroed with memset performs better, my guess is that it is because of values being loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference that you asked about is that the two calloc calls behave differently when they are executed.
