What causes increasing memory consumption in OpenMP-based simulation?

Question

I am having a big struggle with memory consumption in my Monte Carlo particle simulation, where I am using OpenMP for parallelization. Not going into the details of the simulation method, one parallel part is "particle moves" using some number of threads, and the other is "scaling moves" using some, possibly different, number of threads. These 2 parallel codes are run interchangeably, separated by some serial code, and each takes milliseconds to run.

I have an 8-core, 16-thread machine running Linux Ubuntu 18.04 LTS and I'm using gcc and the GNU OpenMP implementation. Now:

  • using 8 threads for "particle moves" and 8 threads for "scaling moves" yields a stable 8-9 MB memory usage
  • using 8 threads for "particle moves" and 16 threads for "scaling moves" causes memory consumption to grow from those 8 MB to tens of GB over a long simulation, eventually resulting in an OOM kill
  • using 16 threads and 16 threads is OK
  • using 16 threads and 8 threads causes increasing consumption

So something is wrong if the numbers of threads for those 2 types of moves don't match.

Unfortunately, I was not able to reproduce the issue in a minimal example and I can only give a summary of the OpenMP code. A link to a minimal example is at the bottom.

In the simulation I have N particles with some positions. "Particle moves" are organized in a grid; I am using collapse(3) to distribute the threads. The code looks more or less like this:

// Each threads has its own cell in a 2 x 2 x 2 grid
#pragma omp parallel for collapse(3) num_threads(8 or 16)
for (std::size_t i = 0; i < 2; i++) {
    for (std::size_t j = 0; j < 2; j++) {
        for (std::size_t k = 0; k < 2; k++) {
            std::array<std::size_t, 3> gridCoords = {i, j, k};
            
            // This does something for all particles in {i, j, k} grid cell
            doIndependentParticleMovesInAGridCellGivenByCoords(gridCoords);
        }
    }
}

(Notice that only 8 threads receive work in both cases - 8 and 16 - but those additional, jobless 8 threads magically fix the problem when 16 scaling threads are used.)

In "volume moves" I am doing an overlap check on each particle independently and exit when a first overlap is found. It looks like this:

// We independently check each particle; the atomic flag lets the
// remaining iterations bail out early once the first overlap is found
// (an OpenMP for loop cannot be exited with break)
std::atomic<bool> overlapFound{false};
#pragma omp parallel for num_threads(8 or 16)
for (std::size_t i = 0; i < N; i++) {
    if (overlapFound)
        continue;
    if (isParticleOverlappingAnything(i))
        overlapFound = true;
}

Now, in parallel regions I don't allocate any new memory and don't need any critical sections - there should be no race conditions.

Moreover, all memory management in the whole program is done in a RAII fashion by std::vector, std::unique_ptr, etc. - I don't use new or delete anywhere.

I tried some Valgrind tools. I ran the simulation for a time that produces about 16 MB of (still increasing) memory consumption in the non-matching thread numbers case, while it stays at 8 MB in the matching case.

  • Valgrind Memcheck shows no memory leaks in either case (only a few kB "still reachable" or "possibly lost" in OpenMP control structures, see here).
  • Valgrind Massif reports only those "correct" 8 MB of allocated memory in both cases.
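
Since the Valgrind tools report nothing unusual, another way to watch the growth is to poll the process's resident set size from inside the program. Here is a minimal, Linux-specific sketch of such a probe (the helper name residentSetKiB is mine, not part of the simulation code); it parses the VmRSS field of /proc/self/status:

#include <fstream>
#include <iostream>
#include <string>

// Returns the current resident set size in kiB, or -1 if the VmRSS
// field could not be found
long residentSetKiB() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        // The field looks like: "VmRSS:      8456 kB"
        if (line.rfind("VmRSS:", 0) == 0)
            return std::stol(line.substr(6));
    }
    return -1;
}

int main() {
    std::cout << "RSS: " << residentSetKiB() << " kiB\n";
    return 0;
}

Calling such a probe periodically between the parallel sections makes the growth visible without any external tool.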

I also tried to surround the contents of main in { } and add while(true):

int main() {
    {
        // Do the simulation and let RAII do all the cleanup when destructors are called
    }

    // Hang
    while(true) { }
}

During the simulation, memory consumption increases, let's say up to 100 MB. When { ... } finishes executing, memory consumption drops by around 6 MB and stays at 94 MB in while(true). Those 6 MB are the actual size of the biggest data structures (I estimated it), but where the remaining part comes from is unknown.

So I assume it must be something in OpenMP memory management. Maybe using 8 and 16 threads interchangeably causes OpenMP to constantly create new thread pools, abandoning the old ones without releasing their resources? I found something like this here, but it seems to concern another OpenMP implementation.

I would be very grateful for ideas about what else I can check and where the issue might lie.

  • re @1201ProgramAlarm: I have changed volatile to std::atomic
  • re @Gilles: I have checked "particle moves" with 16 threads and updated accordingly

I was finally able to reproduce the issue in a minimal example; it ended up being extremely simple, and all the details here are unnecessary. I created a new question without all the mess here.

Answer

Where lies the problem?

It seems that the problem is not connected with what this particular code does or how the OpenMP clauses are structured, but solely with two alternating OpenMP parallel regions using different numbers of threads. After millions of those alternations, the process uses a substantial amount of memory, regardless of what is inside the sections. They may even be as simple as sleeping for a couple of milliseconds.

As this question contains too many unnecessary details, I have moved the discussion to a more direct question here. I refer the interested reader there.

Here I give a brief summary of what StackOverflow members and I were able to determine. Let's say we have 2 OpenMP sections with different numbers of threads, such as here:

#include <unistd.h>

int main() {
    while (true) {
        #pragma omp parallel num_threads(16)
        usleep(30);

        #pragma omp parallel num_threads(8)
        usleep(30);
    }
    return 0;
}
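
Compiled with g++ -fopenmp and left running, a loop as trivial as this should be enough to show the steadily growing memory usage under GCC (watched, for example, with top).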

As described in more detail here, OpenMP reuses the common 8 threads, but the other 8 needed for the 16-thread section are constantly created and destroyed. This constant thread creation causes the increasing memory consumption, whether because of an actual memory leak or because of memory fragmentation, I don't know. Moreover, the problem seems to be specific to the GOMP OpenMP implementation in GCC (up to at least version 10). Clang and Intel compilers do not seem to replicate the issue.
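
One way to confirm the constant re-creation is to count the distinct OS thread ids that show up across many alternations; with proper thread reuse the count should saturate at 16. A rough sketch of such a check (Linux-specific, compiled with g++ -fopenmp; all names are mine):

#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <mutex>
#include <set>

int main() {
    std::set<long> seenTids; // every OS thread id observed so far
    std::mutex mtx;

    for (int iter = 0; iter < 100000; iter++) {
        #pragma omp parallel num_threads(16)
        {
            long tid = syscall(SYS_gettid);
            std::lock_guard<std::mutex> lock(mtx);
            seenTids.insert(tid);
        }

        #pragma omp parallel num_threads(8)
        usleep(30);
    }

    // With proper reuse this prints at most 16; a much larger number
    // means threads are created and destroyed on every iteration
    std::printf("distinct OS threads seen: %zu\n", seenTids.size());
    return 0;
}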

尽管 OpenMP 标准没有明确说明,但大多数实现倾向于重用已经产生的线程,但 GOMP 似乎并非如此,这可能是一个错误.我将提交错误问题并更新答案.目前,唯一的解决方法是在每个并行区域中使用相同数量的线程(然后 GOMP 正确重用旧线程).在问题中的 collapse 循环之类的情况下,当要分发的线程比其他部分少时,一个人总是可以请求 16 个线程而不是 8 个线程,而让其他 8 个线程什么都不做.它适用于我的生产"代码很好.

Although not stated explicitly by the OpenMP standard, most implementations tend to reuse the already spawned threads, but it seems not to be the case for GOMP, and it is probably a bug. I will file the bug report and update the answer. For now, the only workaround is to use the same number of threads in every parallel region (then GOMP properly reuses the old threads). In cases like the collapse loop from the question, where there are fewer threads to distribute than in the other section, one can always request 16 threads instead of 8 and let the other 8 just do nothing, as sketched below. It worked in my "production" code quite well.
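
For illustration, this is the workaround applied to the collapse(3) loop from the question (a sketch reusing the names from the snippet above):

// Always request 16 threads, matching the other parallel region,
// even though the 2 x 2 x 2 grid gives work to only 8 of them -
// the extra threads simply receive no iterations
#pragma omp parallel for collapse(3) num_threads(16)
for (std::size_t i = 0; i < 2; i++) {
    for (std::size_t j = 0; j < 2; j++) {
        for (std::size_t k = 0; k < 2; k++) {
            doIndependentParticleMovesInAGridCellGivenByCoords({i, j, k});
        }
    }
}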
