Unexpected memory footprint differences when spawning python multiprocessing pool


Question

Trying to contribute some optimization for the parallelization in the pystruct module, and trying in discussions to explain my thinking for why I wanted to instantiate pools as early in the execution as possible and keep them around as long as possible, reusing them, I realized that I know it works best to do this, but I don't completely know why.

I know that the claim, on *nix systems, is that a pool worker subprocess copies on write from all the globals in the parent process. This is definitely the case on the whole, but I think a caveat should be added that when one of those globals is a particularly dense data structure like a numpy or scipy matrix, it appears that whatever references get copied down into the worker are actually pretty sizeable even if the whole object isn't being copied, and so spawning new pools late in the execution can cause memory issues. I have found the best practice is to spawn a pool as early as possible, so that any data structures are small.
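
As a minimal sketch of that practice (the shapes and helper names here are illustrative, not pystruct's API): create the pool while the parent process is still small, then allocate the big data and reuse the same pool for every map call.

import multiprocessing as mp
import numpy as np

def row_sum(row):
    return row.sum()

if __name__ == '__main__':
    pool = mp.Pool()                      # forked while the parent is still small
    big_matrix = np.ones((5000, 500))     # large data allocated only after the fork
    for _ in range(10):                   # reuse the same pool instead of respawning it
        row_sums = pool.map(row_sum, big_matrix)
    pool.close()
    pool.join()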

I have known this for a while and engineered around it in applications at work, but the best explanation I've gotten is what I posted in the thread here:

https://github.com/pystruct/pystruct/pull/129#issuecomment-68898032

Looking at the python script below, you would essentially expect the free memory at the "pool created" step in the first run and at the "matrix created" step in the second to be basically equal, as with both final "pool terminated" calls. But they never are; there is always (unless something else is going on on the machine, of course) more free memory when you create the pool first. That effect increases with the complexity (and size) of the data structures in the global namespace at the time the pool is created (I think). Does anyone have a good explanation for this?

I made the little picture below with the bash loop and the R script (also below) to illustrate, showing the overall free memory after both the pool and the matrix are created, depending on the order:

pool_memory_test.py:

import numpy as np
import multiprocessing as mp
import logging

def memory():
    """
    Get node total memory and memory usage
    """
    with open('/proc/meminfo', 'r') as mem:
        ret = {}
        tmp = 0
        for i in mem:
            sline = i.split()
            if str(sline[0]) == 'MemTotal:':
                ret['total'] = int(sline[1])
            elif str(sline[0]) in ('MemFree:', 'Buffers:', 'Cached:'):
                tmp += int(sline[1])
        ret['free'] = tmp
        ret['used'] = int(ret['total']) - int(ret['free'])
    return ret

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--pool_first', action='store_true')
    parser.add_argument('--call_map', action='store_true')
    args = parser.parse_args()

    if args.pool_first:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        print(memory()['free'])
    else:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        print(memory()['free'])
    if args.call_map:
        row_sums = p.map(sum, biggish_matrix)
        logging.debug('sum mapped:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))
        p.terminate()
        p.join()
        logging.debug('pool terminated:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
            for k,v in memory().items()])))

pool_memory_test.sh

#! /bin/bash
rm pool_first_obs.txt > /dev/null 2>&1;
rm matrix_first_obs.txt > /dev/null 2>&1;
for ((n=0;n<100;n++)); do
    python pool_memory_test.py --pool_first >> pool_first_obs.txt;
    python pool_memory_test.py >> matrix_first_obs.txt;
done

pool_memory_test_plot.R:

library(ggplot2)
library(reshape2)
pool_first = as.numeric(readLines('pool_first_obs.txt'))
matrix_first = as.numeric(readLines('matrix_first_obs.txt'))
df = data.frame(i=seq(1,100), pool_first, matrix_first)
ggplot(data=melt(df, id.vars='i'), aes(x=i, y=value, color=variable)) +
    geom_point() + geom_smooth() + xlab('iteration') + 
    ylab('free memory') + ggsave('multiprocessing_pool_memory.png')

Edit: fixed a small bug in the script caused by an overzealous find/replace, and reran.

-0"切片?你能做到吗? :)

"-0" slicing? You can do that? :)

Edit: better python script, bash looping and visualization; ok, done with this rabbit hole for now :)

Answer

Your question touches on several loosely coupled mechanisms. It also seems like an easy target for some additional karma points, but then you feel something's wrong and, 3 hours later, it's a completely different question. So in return for all the fun I had, you may find some of the information below useful.

TL;DR: Measure used memory, not free memory. That gives me consistent, (almost) identical results for both pool/matrix orders, even with a large object.

def memory():
    import resource
    # RUSAGE_BOTH is not always available
    self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return self + children
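
A note on this helper: on Linux ru_maxrss is reported in kilobytes (on macOS it is in bytes), and it is the peak resident set size of the process rather than its current usage, so it is not perturbed by whatever else happens to be running on the machine the way a /proc/meminfo free-memory snapshot is.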

Before answering questions you didn't ask (but closely related ones), here's some background.

The most widespread implementation, CPython (both the 2 and 3 versions), uses reference-counting memory management [1]. Whenever you use a Python object as a value, its reference counter is incremented by one, and decremented again when the reference is lost. The counter is an integer defined in the C struct that holds each Python object's data [2]. Takeaway: the reference counter changes all the time, and it is stored along with the rest of the object's data.
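
A quick way to watch that counter change from Python itself (the exact numbers can vary slightly between interpreter versions, because sys.getrefcount itself holds a temporary reference):

import sys

x = []                       # one reference held by the name `x`
print(sys.getrefcount(x))    # typically 2: `x` plus the temporary argument reference

y = x                        # binding another name writes to the same counter
print(sys.getrefcount(x))    # typically 3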

Most "Unix inspired OS" (BSD family, Linux, OSX, etc) sport copy-on-write [3] memory access semantic. After fork(), two processes have distinct memory page tables pointing to the same physical pages. But OS has marked the pages as write-protected, so when you do any memory write, CPU raises memory access exception, which is handled by OS to copy original page into new place. It walks and quacks like process has isolated memory, but hey, let's save some time (on copying) and RAM while parts of memory are equivalent. Takeaway: fork (or mp.Pool) create new processes, but they (almost) don't use any extra memory just yet.

CPython stores "small" objects in large pools (arenas) [4]. In the common scenario where you create and destroy a large number of small objects, for example temporary variables inside a function, you don't want to call the OS memory management too often. Other programming languages (most compiled ones, at least) use the stack for this purpose.

  • Different memory usage right after mp.Pool(), without any work done by the pool: multiprocessing.Pool.__init__ creates N (for the number of CPUs detected) worker processes. Copy-on-write semantics begin at this point.
  • "the claim, on *nix systems, is that a pool worker subprocess copies on write from all the globals in the parent process": multiprocessing copies the globals of its "context", not the globals from your module, and it does so unconditionally, on any OS. [5]
  • Different memory usage of numpy.ones and a Python list: matrix = [[1,1,...],[1,2,...],...] is a Python list of Python lists of Python integers. Lots of Python objects = lots of PyObject_HEAD = lots of ref-counters. Accessing all of them in the forked environment touches all of the ref-counters and therefore copies their memory pages. matrix = numpy.ones((50000, 5000)) is a single Python object of type numpy.array: one Python object, one ref-counter. The rest is pure low-level numbers stored next to each other in memory, with no ref-counters involved. For the sake of simplicity, you could use data = '.'*size [5]; that also creates a single object in memory. (See the sketch after this list.)
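
Below is a hedged sketch of that last point (the sizes and the crude /proc/meminfo measurement are only for illustration): a worker that merely reads an inherited list of distinct Python ints touches every element's reference counter (a write, from the OS's point of view), while reading an inherited numpy array does not.

import multiprocessing as mp
import numpy as np

def free_kb():
    """MemFree + Buffers + Cached from /proc/meminfo, in KB (Linux-only)."""
    total = 0
    with open('/proc/meminfo') as mem:
        for line in mem:
            fields = line.split()
            if fields[0] in ('MemFree:', 'Buffers:', 'Cached:'):
                total += int(fields[1])
    return total

def read_matrix(_):
    # `matrix` is inherited from the parent via fork(), not pickled.
    # Iterating over Python ints increments their reference counters,
    # so their pages get copied; summing a numpy array only reads raw memory.
    if isinstance(matrix, np.ndarray):
        return float(matrix.sum())
    return sum(sum(row) for row in matrix)

if __name__ == '__main__':
    rows, cols = 1000, 5000
    for name, build in [('list of lists', lambda: [list(range(r * cols, (r + 1) * cols)) for r in range(rows)]),
                        ('numpy array', lambda: np.arange(rows * cols).reshape(rows, cols))]:
        matrix = build()
        before = free_kb()
        pool = mp.Pool(1)                 # forked after `matrix` exists
        pool.apply(read_matrix, (None,))
        pool.terminate()
        pool.join()
        print('%s: free memory changed by ~%d KB while the worker read it'
              % (name, before - free_kb()))
        del matrix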
  1. https://docs.python.org/2/c-api/refcounting.html
  2. https://docs.python.org/2/c-api/structures.html#c.PyObject_HEAD
  3. http://minnie.tuhs.org/CompArch/Lectures/week09.html#tth_sEc2.8
  4. http://www.evanjones.ca/memoryallocator/
  5. https://github.com/python/cpython/search?utf8=%E2%9C%93&q=globals+path%3ALib%2Fmultiprocessing%2F&type=Code
  6. Putting it all together: https://gist.github.com/temoto/af663106a3da414359fa
