为什么BSS中静态数组上的第二个循环比第一个循环快? [英] Why is the second loop over a static array in the BSS faster than the first?

查看:67
本文介绍了为什么BSS中静态数组上的第二个循环比第一个循环快?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下代码,两次用零写入一个全局数组,一次正向一次,一次向后一次.

I have the following code that writes a global array with zeros twice, once forward and once backward.

#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

int main()
{
   int i;
   clock_t t = clock();
   for(i = 0; i < SIZE; i++)
       c[i] = 0;

   t = clock() - t;
   printf("%d\n\n", t);

   t = clock(); 
   for(i = SIZE - 1; i >= 0; i--)
      c[i] = 0;

   t = clock() - t;
   printf("%d\n\n", t);
}

我已经运行了几次,第二张打印总是显示较小的值...

I've run it a couple and the second print is always showing a smaller value...

但是,如果我在一个循环中将更改c更改为c2,则两次打印之间的时间差可以忽略不计...造成这种差异的原因是什么?

However, if I change change c to c2 in one of the loops, the time difference between both prints becomes negligible... what is the reason for that difference?

我尝试使用-O3进行编译,并查看了程序集:有2个对memset的调用,但第二个仍在打印较小的值.

I've tried compiling with -O3 and looked into the assembly: there were 2 calls to memset but the second was still printing a smaller value.

推荐答案

在C中定义某些全局数据后,该数据将被初始化为零:

When you defined some global data in C, it is zero-initialized:

char c[SIZE];
char c2[SIZE];

在Linux(unix)世界中,这意味着cc2都将在特殊的ELF文件部分

In linux (unix) world this means, than both c and c2 will be allocated in special ELF file section, the .bss:

...包含静态分配变量的数据段,最初仅由零值位表示

... data segment containing statically-allocated variables represented solely by zero-valued bits initially

创建.bss段的目的是不将所有的零存储在二进制文件中,而只是说该程序希望具有200MB的归零内存".

The .bss segment is created to not store all zeroes in the binary, it just says something like "this program wants to have 200MB of zeroed memory".

在加载程序时,ELF加载器(对于经典静态二进制文件为内核,或者为ld.so动态加载器,也称为interp)将为.bss分配内存,通常类似于MAP_ANONYMOUS标志和READ + WRITE权限/保护请求的http://man7.org/linux/man-pages/man2/mmap.2.html> mmap .

When you program is loaded, ELF loader (kernel in case of classic static binaries, or ld.so dynamic loader also known as interp) will allocate the memory for .bss, usually like something like mmap with MAP_ANONYMOUS flag and READ+WRITE permissions/protection request.

但是OS内核中的内存管理器不会为您提供全部200 MB的零内存.而是将进程的虚拟内存的一部分标记为零初始化,并且此内存的每一页都将指向物理内存中的特殊零页.此页面有4096个字节的零字节,因此,如果您从cc2进行读取,则将得到零字节.并且这种机制允许内核减少内存需求.

But memory manager in the OS kernel will not give you all 200 MB of zero memory. Instead it will mark part of virtual memory of your process as zero-initialized, and every page of this memory will point to the special zero page in physical memory. This page has 4096 bytes of zero byte, so if you are reading from c or c2, you will get zero bytes; and this mechanism allow kernel cut down memory requirements.

到零页面的映射是特殊的;它们被标记为(在页表中)为只读.当您写入任何此类虚拟页面时,常规保护错误或<硬件将生成href ="http://en.wikipedia.org/wiki/Page_fault"> pagefault 异常(我会说,是MMU和TLB).此错误将由内核处理,对于您而言,将由次要pagefault 处理程序处理.它将分配一个物理页面,将其填充为零字节,然后将刚刚访问的虚拟页面的映射重置为该物理页面.然后它将重新运行错误的指令.

The mappings to zero page are special; they are marked (in page table) as read-only. When you do first write to the any of such virtual pages, the General protection fault or pagefault exception will be generated by hardware (I'll say, by MMU and TLB). This fault will be handled by kernel, and in your case, by minor pagefault handler. It will allocate one physical page, fill it by zero bytes, and reset mapping of just-accesed virtual page to this physical page. Then it will rerun faulted instruction.

我对您的代码进行了一些转换(两个循环都移到了单独的函数中):

I converted your code a bit (both loops are moved to separate function):

$ cat b.c
#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

void FIRST()
{
   int i;
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
}

void SECOND()
{
   int i;
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
}


int main()
{
   int i;
   clock_t t = clock();
   FIRST();
   t = clock() - t;
   printf("%d\n\n", t);

   t = clock(); 
   SECOND();

   t = clock() - t;
   printf("%d\n\n", t);
}

使用gcc b.c -fno-inline -O2 -o b进行编译,然后在linux的perf stat或更通用的/usr/bin/time下运行以获取pagefault计数:

Compile with gcc b.c -fno-inline -O2 -o b, then run under linux's perf stat or more generic /usr/bin/time to get pagefault count:

$ perf stat ./b
139599

93283


 Performance counter stats for './b':
 ....
            24 550 page-faults               #    0,100 M/sec           


$ /usr/bin/time ./b
234246

92754

Command exited with non-zero status 7
0.18user 0.15system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 98136maxresident)k
0inputs+8outputs (0major+24576minor)pagefaults 0swaps

因此,我们有24.5万个次要的页面错误.如果x86/x86_64上的标准页面大小为4096,则接近100兆字节.

So, we have 24,5 thousands of minor pagefaults. With standard page size on x86/x86_64 of 4096 this is near 100 megabytes.

使用perf record/perf report linux profiler,我们可以找到发生页面错误的位置(生成):

With perf record/perf report linux profiler we can find, where pagefaults occur (are generated):

$ perf record -e page-faults ./b
...skip some spam from non-root run of perf...
213322

97841

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (~801 samples) ]

$ perf report -n |cat
...
# Samples: 467  of event 'page-faults'
# Event count (approx.): 24583
#
# Overhead       Samples  Command      Shared Object                   Symbol
# ........  ............  .......  .................  .......................
#
    98.73%           459        b  b                  [.] FIRST              
     0.81%             1        b  libc-2.19.so       [.] __new_exitfn       
     0.35%             1        b  ld-2.19.so         [.] _dl_map_object_deps
     0.07%             1        b  ld-2.19.so         [.] brk                
     ....

因此,现在我们可以看到,只有FIRST函数会生成页面错误(第一次写入bss页面时),而SECOND则不会生成任何页面错误.每个pagefault都对应于OS内核完成的某些工作,并且每页bss只能完成一次此工作(因为bss不会被取消映射并重新映射回去).

So, now we can see, that only FIRST function generates pagefaults (on first write to bss pages), and SECOND does not generate any. Every pagefault corresponds to some work, done by OS kernel, and this work is done only one time per page of bss (because bss is not unmapped and remapped back).

这篇关于为什么BSS中静态数组上的第二个循环比第一个循环快?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆