multiprocessing - sharing a complex object

Question

I've got a large dict-like object that needs to be shared between a number of worker processes. Each worker reads a random subset of the information in the object and does some computation with it. I'd like to avoid copying the large object as my machine quickly runs out of memory.

I was playing with the code for this SO question and I modified it a bit to use a fixed-size process pool, which is better suited to my use case. This however seems to break it.

from multiprocessing import Process, Pool
from multiprocessing.managers import BaseManager

class numeri(object):
    def __init__(self):
        self.nl = []

    def getLen(self):
        return len(self.nl)

    def stampa(self):
        print(self.nl)

    def appendi(self, x):
        self.nl.append(x)

    def svuota(self):
        del self.nl[:]

class numManager(BaseManager):
    pass

def produce(listaNumeri):
    print('producing', id(listaNumeri))
    return id(listaNumeri)

def main():
    numManager.register('numeri', numeri,
                        exposed=['getLen', 'appendi', 'svuota', 'stampa'])
    mymanager = numManager()
    mymanager.start()
    listaNumeri = mymanager.numeri()
    print(id(listaNumeri))

    print('------------ Process')
    for i in range(5):
        producer = Process(target=produce, args=(listaNumeri,))
        producer.start()
        producer.join()

    print('--------------- Pool')
    pool = Pool(processes=1)
    for i in range(5):
        pool.apply_async(produce, args=(listaNumeri,)).get()
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

The output is:

4315705168
------------ Process
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
--------------- Pool
producing 4299771152
producing 4315861712
producing 4299771152
producing 4315861712
producing 4299771152

As you can see, in the first case all worker processes get the same object (by id). In the second case, the id is not the same. Does that mean the object is being copied?

P.S. I don't think it matters, but I am using joblib, which internally uses a Pool:

from joblib import delayed, Parallel

print('------------- Joblib')
Parallel(n_jobs=4)(delayed(produce)(listaNumeri) for i in range(5))

Output:

------------- Joblib
producing 4315862096
producing 4315862288
producing 4315862480
producing 4315862672
producing 4315862352

Answer

I'm afraid virtually nothing here works the way you hope it works :-(

First note that identical id() values produced by different processes tell you nothing about whether the objects are really the same object. Each process has its own virtual address space, assigned by the operating system. The same virtual address in two processes can refer to entirely different physical memory locations. Whether your code produces the same id() output or not is pretty much purely accidental. Across multiple runs, sometimes I see different id() output in your Process section and repeated id() output in your Pool section, or vice versa, or both.

Second, a Manager supplies semantic sharing but not physical sharing. The data for your numeri instance lives only in the manager process. All your worker processes see (copies of) proxy objects. Those are thin wrappers that forward all operations to be performed by the manager process. This involves lots of inter-process communication, and serialization inside the manager process. This is a great way to write really slow code ;-) Yes, there is only one copy of the numeri data, but all work on it is done by a single process (the manager process).

To see this more clearly, make the changes @martineau suggested, and also change get_list_id() to this:

def get_list_id(self):  # added method
    import os
    print("get_list_id() running in process", os.getpid())
    return id(self.nl)

Here's sample output:

41543664
------------ Process
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 46268496
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 44153904
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
--------------- Pool
producing 41639248
get_list_id() running in process 5856
with list_id 44544608
producing 41777200
get_list_id() running in process 5856
with list_id 44544608
producing 41776816
get_list_id() running in process 5856
with list_id 44544608
producing 41777168
get_list_id() running in process 5856
with list_id 44544608
producing 41777136
get_list_id() running in process 5856
with list_id 44544608

Clear? The reason you get the same list id each time is not because each worker process has the same self.nl member, it's because all numeri methods run in a single process (the manager process). That's why the list id is always the same.

If you're running on a Linux-y system (an OS that supports fork()), a much better idea is to forget all this Manager stuff and create your complex object at module level before starting any worker processes. Then the workers will inherit (address-space copies of) your complex object. The usual copy-on-write fork() semantics will make that about as memory-efficient as possible. That's sufficient if mutations don't need to be folded back into the main program's copy of the complex object. If mutations do need to be folded back in, then you're back to needing lots of inter-process communication, and multiprocessing becomes correspondingly less attractive.

There are no easy answers here. Don't shoot the messenger ;-)
