Python multiprocessing memory usage


Problem description

I have written a program that can be summarized as follows:

import multiprocessing

def loadHugeData():
    #load it
    return data

def processHugeData(data, res_queue):
    for item in data:
        #process it
        res_queue.put(result)
    res_queue.put("END")

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res=res_queue.get()
        while res!='END':
            f.write(res)
            res=res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data=loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()

The real code (especially writeOutput()) is a lot more complicated. writeOutput() only uses the values that it takes as its arguments (meaning it does not reference data).

Basically it loads a huge dataset into memory and processes it. Writing the output is delegated to a sub-process (it actually writes into multiple files, which takes a lot of time). So each time a data item is processed, it is sent to the sub-process through res_queue, which in turn writes the result into files as needed.

The sub-process does not need to access, read or modify the data loaded by loadHugeData() in any way. It only needs to use what the main process sends it through res_queue. And this leads me to my problem and question.

It seems to me that the sub-process gets its own copy of the huge dataset (when checking memory usage with top). Is this true? And if so, how can I avoid it (essentially using double the memory)?

I am using Python 2.6 and the program is running on Linux.

Recommended answer

The multiprocessing module is effectively based on the fork system call, which creates a copy of the current process. Since you are loading the huge data before you fork (or create the multiprocessing.Process), the child process inherits a copy of the data.

However, if the operating system you are running on implements COW (copy-on-write), there will actually be only one copy of the data in physical memory unless you modify it in either the parent or the child process (both will share the same physical memory pages, albeit in different virtual address spaces); and even then, additional memory will only be allocated for the changes (in page-size increments).
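
One caveat when checking this with top: the RES column charges shared copy-on-write pages to every process that maps them, so memory can look duplicated even when it is not. As a rough, Linux-only sanity check (this reads /proc/&lt;pid&gt;/smaps and is not part of the question's code, just one way to inspect it), you can sum a process's private vs. shared pages:

import os

def private_vs_shared_kb(pid):
    """Sum the Private_* and Shared_* fields from /proc/<pid>/smaps (values are in kB)."""
    private = shared = 0
    with open('/proc/%d/smaps' % pid) as f:
        for line in f:
            if line.startswith('Private_'):
                private += int(line.split()[1])
            elif line.startswith('Shared_'):
                shared += int(line.split()[1])
    return private, shared

# e.g. compare the parent and the child after the fork
print(private_vs_shared_kb(os.getpid()))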

You can avoid this situation by calling multiprocessing.Process before you load your huge data. Then the additional memory allocations will not be reflected in the child process when you load the data in the parent.
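
Applied to the summary in the question, that reordering looks roughly like this (same function names as above; only the order of operations changes):

import multiprocessing

if __name__ == '__main__':
    res_queue = multiprocessing.Queue()

    # fork the writer first, while the parent process is still small;
    # the child never inherits the huge dataset
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()

    # only now load and process the data in the parent
    data = loadHugeData()
    processHugeData(data, res_queue)
    p.join()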

Reflecting @Janne Karila's comment in the answer, as it is so relevant: "Note also that every Python object contains a reference count that is modified whenever the object is accessed. So, just reading a data structure can cause COW to copy."
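
To see why even read-only access dirties pages: CPython stores the reference count in the object header, and any temporary reference writes to it. A tiny illustration (unrelated to the question's data, just demonstrating the mechanism):

import sys

item = object()
data = [item]

print(sys.getrefcount(item))      # 3: `item`, `data[0]`, and getrefcount's own argument
for x in data:                    # a read-only loop still binds `x` to the object
    print(sys.getrefcount(item))  # 4 inside the loop -- the refcount field was written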
