Kernel copying CoW pages after child process exit

Problem Description

In Linux, whenever a process is forked, the memory mappings of the parent process are cloned into the child process. In reality, for performance reasons, the pages are set to copy-on-write: initially they are shared and, if either of the two processes writes to one of them, that page is then cloned (MAP_PRIVATE).

This is a very common mechanism for getting a snapshot of the state of a running program: you fork, and that gives you a (consistent) view of the process's memory at that point in time.
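
As a minimal illustration of the pattern (a sketch, not part of the benchmark below): the parent keeps mutating its state after the fork, while the child still observes the values as they were at the instant of the fork:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int state = 42;                        // state to snapshot
  pid_t pid = fork();                    // the fork itself is the snapshot
  if (pid == 0) {                        // child: sees the CoW copy
    sleep(1);                            // parent has long since moved on...
    printf("child sees: %d\n", state);   // ...but this still prints 42
    return 0;
  }
  state = 99;                            // parent mutates its own copy
  waitpid(pid, nullptr, 0);
  printf("parent sees: %d\n", state);    // prints 99
  return 0;
}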

I did a simple benchmark with two components:

  • A parent process with a pool of threads writing into an array
  • A child process with a pool of threads making a snapshot of the array and unmapping it

Under some circumstances (machine/architecture/memory placement/number of threads/...) I am able to make the copy finish much earlier than the threads finish writing into the array.

However, when the child process exits, I still see in htop that most of the CPU time is being spent in the kernel, which is consistent with it being used to handle copy-on-write whenever the parent process writes to a page.

In my understanding, if an anonymous page marked as copy-on-write is mapped by only a single process, it should not be copied; it should be used directly.

How can I be sure that this is indeed time spent copying memory?

If I am right, how can I avoid this overhead?

The core of the benchmark, in modern C++, is below.

Define WITH_FORK to enable the snapshot; leave it undefined to disable the child process.

#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <numaif.h>
#include <numa.h>

#include <algorithm>
#include <array>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <numeric>
#include <thread>
#include <vector>

#define ARRAY_SIZE 1073741824 // 1GB
#define NUM_WORKERS 28
#define NUM_CHECKPOINTERS 4
#define BATCH_SIZE 2097152 // 2MB

using inttype = uint64_t;
using timepoint = std::chrono::time_point<std::chrono::high_resolution_clock>;

constexpr uint64_t NUM_ELEMS() {
  return ARRAY_SIZE / sizeof(inttype);
}

int main() {

  // allocate array
  std::array<inttype, NUM_ELEMS()> *arrayptr = new std::array<inttype, NUM_ELEMS()>();
  std::array<inttype, NUM_ELEMS()> & array = *arrayptr;

  // allocate checkpoint space
  std::array<inttype, NUM_ELEMS()> *cpptr = new std::array<inttype, NUM_ELEMS()>();
  std::array<inttype, NUM_ELEMS()> & cp = *cpptr;

  // initialize array
  std::fill(array.begin(), array.end(), 123);

#ifdef WITH_FORK
  // spawn checkpointer threads
  int pid = fork();
  if (pid == -1) {
    perror("fork");
    exit(-1);
  }

  // child process -- do checkpoint
  if (pid == 0) {
    std::array<std::thread, NUM_CHECKPOINTERS> cpthreads;
    for (size_t tid = 0; tid < NUM_CHECKPOINTERS; tid++) {
      cpthreads[tid] = std::thread([&, tid] {
        // copy array
        const size_t numBatches = ARRAY_SIZE / BATCH_SIZE;
        for (size_t i = tid; i < numBatches; i += NUM_CHECKPOINTERS) {
          void *src = reinterpret_cast<void*>(
            reinterpret_cast<intptr_t>(array.data()) + i * BATCH_SIZE);
          void *dst = reinterpret_cast<void*>(
            reinterpret_cast<intptr_t>(cp.data()) + i * BATCH_SIZE);
          memcpy(dst, src, BATCH_SIZE);
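          // note: munmap() requires a page-aligned address; operator new does
          // not guarantee one, though 1GB allocations are page-aligned in
          // practice (glibc services them via mmap)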
          munmap(src, BATCH_SIZE);
        }
      });
    }
    for (std::thread& thread : cpthreads) {
      thread.join();
    }
    printf("CP finished successfully! Child exiting.\n");
    exit(0);
  }
#endif  // #ifdef WITH_FORK

  // spawn worker threads
  std::array<std::thread, NUM_WORKERS> threads;
  for (size_t tid = 0; tid < NUM_WORKERS; tid++) {
    threads[tid] = std::thread([&, tid] {
      // write to array
      std::array<inttype, NUM_ELEMS()>::iterator it;
      for (it = array.begin() + tid; it < array.end(); it += NUM_WORKERS) {
        *it = tid;
      }
    });
  }

  timepoint tStart = std::chrono::high_resolution_clock::now();

#ifdef WITH_FORK
  // allow reaping child process while workers work
  std::thread childWaitThread = std::thread([&] {
    if (waitpid(pid, nullptr, 0) == -1) {
      perror("waitpid");
    }
    timepoint tChild = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> durationChild = tChild - tStart;
    printf("reunited with child after (s): %lf\n", durationChild.count());
  });
#endif

  // wait for workers to finish
  for (std::thread& thread : threads) {
    thread.join();
  }
  timepoint tEnd = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> duration = tEnd - tStart;
  printf("duration (s): %lf\n", duration.count());

#ifdef WITH_FORK
  childWaitThread.join();
#endif
}
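
A plausible way to build and run it (the compiler flags are an assumption; the post does not give a build line):

g++ -std=c++17 -O2 -pthread -DWITH_FORK benchmark.cpp -o benchmark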

Recommended Answer

The size of the array is 1GB, which is about 250K pages of 4KB each (1073741824 / 4096 = 262144 pages). For this program, the number of page faults that occur due to writing to CoW pages can be estimated easily; it can also be measured using the Linux perf tool. The new operator value-initializes the array to zero, so the following line of code:

std::array<inttype, NUM_ELEMS()> *arrayptr = new std::array<inttype, NUM_ELEMS()>();

will cause about 250K page faults. Similarly, the following line of code:

std::array<inttype, NUM_ELEMS()> *cpptr = new std::array<inttype, NUM_ELEMS()>();

will cause another 250K page faults. All of these page faults are minor, i.e., they can be handled without accessing the disk. Allocating two 1GB arrays will not cause any major faults on a system with much more physical memory than that.

At this point, about 500K page faults have already occurred (there will of course be other page faults caused by other memory accesses from the program, but they can be neglected). The execution of std::fill will not cause any minor faults because the virtual pages of the arrays have already been mapped to dedicated physical pages.

The execution of the program then proceeds to forking the child process and creating the worker threads of the parent process. The creation of the child process is by itself sufficient to make a snapshot of the array, so there is really no need to do anything in the child process. In fact, when the child process is forked, the virtual pages of both arrays are marked copy-on-write. The child process reads from arrayptr and writes to cpptr, which results in an additional 250K minor faults. The parent process also writes to arrayptr, which causes another 250K minor faults. So making a copy in the child process and unmapping the pages does not improve performance; on the contrary, the number of page faults is doubled and performance is significantly degraded.
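
To make that concrete, here is a sketch (not the author's code) of a child that consumes the CoW snapshot in place, with no memcpy/munmap pass; the checksum loop is a hypothetical stand-in for whatever the checkpoint actually does with the data:

// The fork itself is the snapshot: the child can read the frozen array
// directly. Only pages the parent subsequently writes get copied, once.
if (pid == 0) {
  inttype checksum = 0;
  for (inttype v : array) {    // reads the snapshot; no extra faults
    checksum += v;
  }
  printf("snapshot checksum: %llu\n", (unsigned long long)checksum);
  exit(0);
}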

You can measure the number of minor and major faults with the following command:

perf stat -r 3 -e minor-faults,major-faults ./binary

By default, this counts minor and major faults for the whole process tree. The -r 3 option tells perf to repeat the experiment three times and report the average and standard deviation.
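
If perf is not available, a rough in-process alternative (a sketch, not part of the original answer) is to read the per-process fault counters the kernel maintains, via getrusage(2), before and after the phase of interest:

#include <sys/resource.h>
#include <cstdio>

// Prints the cumulative minor/major fault counters of the calling process.
// Call it before and after a phase and subtract to attribute faults to it.
void printFaultCounters(const char *label) {
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) == 0) {
    printf("%s: minor = %ld, major = %ld\n",
           label, usage.ru_minflt, usage.ru_majflt);
  }
}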

I also noticed that the total number of threads is 28 + 4. The optimal number of threads is approximately equal to the total number of online logical cores on your system. If the thread count is much larger than that, performance is degraded by the overhead of creating too many threads and switching between them; if it is much smaller, the cores are underutilized.
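
The core count does not need to be hard-coded; it can be queried at run time (a sketch, requires <thread>):

// std::thread::hardware_concurrency() returns the number of logical cores,
// or 0 if it cannot be determined, hence the arbitrary fallback.
unsigned hw = std::thread::hardware_concurrency();
size_t numWorkers = hw != 0 ? hw : 4;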

Another potential issue lies in the following loop:

for (it = array.begin() + tid; it < array.end(); it += NUM_WORKERS) {
  *it = tid;
}

Different threads may try to write to the same cache line at the same time, resulting in false sharing. This may or may not be a significant issue depending on your processor's cache line size, the number of threads, and whether all cores run at the same frequency, so it is hard to say without measuring. A better loop shape would give each thread a contiguous slice of the array.
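
For example, a sketch of the worker lambda body rewritten so that each thread owns one contiguous slice (the slicing scheme here is one possible choice, not the original code):

// Contiguous partitioning: thread tid owns elements [begin, end), so
// threads never interleave within a cache line except at slice borders.
const size_t chunk = NUM_ELEMS() / NUM_WORKERS;
const size_t begin = tid * chunk;
const size_t end = (tid == NUM_WORKERS - 1) ? NUM_ELEMS() : begin + chunk;
for (size_t i = begin; i < end; i++) {
  array[i] = tid;
}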
