Python multiprocessing memory usage

Question

I have written a program that can be summarized as follows:

import multiprocessing

def loadHugeData():
    # load it
    return data

def processHugeData(data, res_queue):
    for item in data:
        # process it
        res_queue.put(result)
    res_queue.put("END")          # sentinel telling the writer to stop

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res = res_queue.get()
        while res != 'END':
            f.write(res)
            res = res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data = loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()

The real code (especially writeOutput()) is a lot more complicated. writeOutput() only uses the values it takes as its arguments (meaning it does not reference data).

Basically it loads a huge dataset into memory and processes it. Writing of the output is delegated to a sub-process (it actually writes into multiple files, and this takes a lot of time). So each time one data item gets processed, it is sent to the sub-process through res_queue, which in turn writes the result into files as needed.

The sub-process does not need to access, read or modify the data loaded by loadHugeData() in any way. It only needs to use what the main process sends it through res_queue. And this leads me to my problem and question.

It seems to me that the sub-process gets its own copy of the huge dataset (when checking memory usage with top). Is this true? And if so, how can I avoid it (essentially using double the memory)?

I am using Python 2.6 and the program is running on Linux.

Answer

The multiprocessing module is effectively based on the fork system call, which creates a copy of the current process. Since you are loading the huge data before you fork (or create the multiprocessing.Process), the child process inherits a copy of the data.

However, if the operating system you are running on implements COW (copy-on-write), there will actually be only one copy of the data in physical memory unless you modify it in either the parent or the child process (both will share the same physical memory pages, albeit in different virtual address spaces); and even then, additional memory will only be allocated for the changes (in page-size increments).
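
One way to check whether the pages are really still shared, rather than trusting top (whose RES column counts copy-on-write pages in both processes, so totals can look doubled even while the memory is physically shared), is to sum the Private_* lines of /proc/&lt;pid&gt;/smaps. A minimal, Linux-only sketch; the helper name private_kb is my own:

import os

def private_kb(pid):
    """Sum Private_Clean + Private_Dirty (in kB) from /proc/<pid>/smaps."""
    total = 0
    with open('/proc/%d/smaps' % pid) as f:
        for line in f:
            if line.startswith('Private_'):
                total += int(line.split()[1])
    return total

# Pages this process actually owns, excluding memory still shared via COW.
print('truly private memory: %d kB' % private_kb(os.getpid()))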

You can avoid this situation by starting the multiprocessing.Process before you load your huge data. Then the additional memory allocated when you load the data in the parent will not be reflected in the child process.
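
Applied to the code in the question, that reordering might look roughly like this (a sketch only; loadHugeData(), processHugeData(), writeOutput() and outFile are the placeholders from above):

import multiprocessing

if __name__ == '__main__':
    res_queue = multiprocessing.Queue()

    # Fork the writer before the huge dataset exists, so the child never
    # inherits a view of it.
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()

    data = loadHugeData()          # allocated only in the parent, after the fork
    processHugeData(data, res_queue)
    p.join()

On Linux with Python 2.6, multiprocessing forks the child, so only what already exists at p.start() time is visible to it.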

Reflecting @Janne Karila's comment here, since it is so relevant: "Note also that every Python object contains a reference count that is modified whenever the object is accessed. So, just reading a data structure can cause COW to copy."
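
A rough, Linux-only way to observe this (the sizes, names and output here are purely illustrative): the child below never modifies the list, yet iterating over it writes to every element's reference count, which dirties the shared pages and forces them to be copied.

import multiprocessing

def private_kb():
    """Sum the Private_* lines of /proc/self/smaps, in kB (Linux only)."""
    total = 0
    with open('/proc/self/smaps') as f:
        for line in f:
            if line.startswith('Private_'):
                total += int(line.split()[1])
    return total

def reader(data):
    before = private_kb()
    for item in data:              # read-only access, but it touches refcounts
        pass
    print('child private memory grew from %d kB to %d kB' % (before, private_kb()))

if __name__ == '__main__':
    data = [str(i) for i in range(2 * 10**6)]   # a sizeable list of small objects
    p = multiprocessing.Process(target=reader, args=(data,))
    p.start()
    p.join()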
