multiprocessing - sharing a complex object


Problem description

I've got a large dict-like object that needs to be shared among a number of worker processes. Each worker reads a random subset of the information in the object and does some computation with it. I'd like to avoid copying the large object, as my machine quickly runs out of memory.

I was playing with the code from this SO question, and I modified it a bit to use a fixed-size process pool, which is better suited to my use case. This, however, seems to break it.

from multiprocessing import Process, Pool
from multiprocessing.managers import BaseManager

class numeri(object):
    def __init__(self):
        self.nl = []

    def getLen(self):
        return len(self.nl)

    def stampa(self):
        print(self.nl)

    def appendi(self, x):
        self.nl.append(x)

    def svuota(self):
        for i in range(len(self.nl)):
            del self.nl[0]

class numManager(BaseManager):
    pass

def produce(listaNumeri):
    print('producing', id(listaNumeri))
    return id(listaNumeri)

def main():
    numManager.register('numeri', numeri, exposed=['getLen', 'appendi',
                        'svuota', 'stampa'])
    mymanager = numManager()
    mymanager.start()
    listaNumeri = mymanager.numeri()
    print(id(listaNumeri))

    print('------------ Process')
    for i in range(5):
        producer = Process(target=produce, args=(listaNumeri,))
        producer.start()
        producer.join()

    print('--------------- Pool')
    pool = Pool(processes=1)
    for i in range(5):
        pool.apply_async(produce, args=(listaNumeri,)).get()

if __name__ == '__main__':
    main()

The output is:

4315705168
------------ Process
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
--------------- Pool
producing 4299771152
producing 4315861712
producing 4299771152
producing 4315861712
producing 4299771152

As you can see, in the first case all worker processes get the same object (judging by id). In the second case, the ids are not the same. Does that mean the object is being copied?

P.S. I don't think it matters, but I am using joblib, which internally uses a Pool:

from joblib import delayed, Parallel

print('------------- Joblib')
Parallel(n_jobs=4)(delayed(produce)(listaNumeri) for i in range(5))

Output:

------------- Joblib
producing 4315862096
producing 4315862288
producing 4315862480
producing 4315862672
producing 4315862352

Answer

I'm afraid virtually nothing here works the way you hope it works :-(

First, note that identical id() values produced by different processes tell you nothing about whether the objects are really the same object. Each process has its own virtual address space, assigned by the operating system, and the same virtual address in two processes can refer to entirely different physical memory locations. Whether your code produces the same id() output is pretty much purely accidental: across multiple runs, sometimes I see different id() output in your Process section and repeated id() output in your Pool section, or vice versa, or both.
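
To make that concrete, here is a minimal sketch (not part of the original post; the function and variable names are made up) that prints the pid next to each id(). Whether the child's id() matches the parent's depends on the platform's process start method, and is meaningless either way:

import os
from multiprocessing import Process

def report(obj):
    # id() is just an address inside *this* process's address space
    print('pid', os.getpid(), 'sees id', id(obj))

if __name__ == '__main__':
    data = {'a': 1}
    report(data)  # the id in the parent process
    p = Process(target=report, args=(data,))
    p.start()     # the child works on its own copy of data, which may
    p.join()      # or may not land at the same virtual address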

Second, a Manager supplies semantic sharing but not physical sharing. The data for your numeri instance lives only in the manager process. All your worker processes see (copies of) proxy objects: thin wrappers that forward all operations to be performed by the manager process. This involves lots of inter-process communication, and serialization inside the manager process. It's a great way to write really slow code ;-) Yes, there is only one copy of the numeri data, but all work on it is done by a single process, the manager process.
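
As an illustration (the Numeri class and its where() method below are made up, modeled on the question's code), this sketch shows that a caller holds an AutoProxy rather than the real object, and that exposed methods actually execute in the manager's server process:

import os
from multiprocessing.managers import BaseManager

class Numeri(object):
    def __init__(self):
        self.nl = []

    def where(self):
        # runs in the manager process, no matter which process calls it
        return os.getpid()

class NumManager(BaseManager):
    pass

NumManager.register('Numeri', Numeri, exposed=['where'])

if __name__ == '__main__':
    mgr = NumManager()
    mgr.start()
    proxy = mgr.Numeri()
    print(type(proxy))                    # an AutoProxy, not a Numeri
    print('caller pid :', os.getpid())
    print('manager pid:', proxy.where())  # a different process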

To see this more clearly, make the changes @martineau suggested, and also change get_list_id() to this:

def get_list_id(self):  # added method
    import os
    print("get_list_id() running in process", os.getpid())
    return id(self.nl)
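
(The changes @martineau suggested aren't reproduced here; a plausible reconstruction of the modified produce(), consistent with the output below and purely a guess, is:)

def produce(listaNumeri):
    # hypothetical reconstruction -- get_list_id would also need to be
    # added to the exposed list when registering numeri with the manager
    print('producing', id(listaNumeri))
    print('with list_id', listaNumeri.get_list_id())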

Here's sample output:

41543664
------------ Process
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 46268496
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 44153904
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
--------------- Pool
producing 41639248
get_list_id() running in process 5856
with list_id 44544608
producing 41777200
get_list_id() running in process 5856
with list_id 44544608
producing 41776816
get_list_id() running in process 5856
with list_id 44544608
producing 41777168
get_list_id() running in process 5856
with list_id 44544608
producing 41777136
get_list_id() running in process 5856
with list_id 44544608

Clear? The reason you get the same list id each time is not that each worker process has the same self.nl member; it's that all numeri methods run in a single process (the manager process). That's why the list id is always the same.

If you're running on a Linux-y system (an OS that supports fork()), a much better idea is to forget all this Manager business and create your complex object at module level before starting any worker processes. Then the workers will inherit (address-space copies of) your complex object. The usual copy-on-write fork() semantics will make that about as memory-efficient as possible. That's sufficient if mutations don't need to be folded back into the main program's copy of the complex object. If mutations do need to be folded back in, then you're back to needing lots of inter-process communication, and multiprocessing becomes correspondingly less attractive.
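
Here's a minimal sketch of that fork-based approach (BIG_OBJECT and work() are made-up names; this assumes the fork start method, the default on Linux):

from multiprocessing import Pool

# create the large object at module level, *before* the pool starts
BIG_OBJECT = {i: i * i for i in range(10 ** 6)}

def work(key):
    # under fork(), each worker inherits BIG_OBJECT via copy-on-write:
    # reads are cheap and the big dict is never pickled
    return BIG_OBJECT[key]

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(work, [1, 2, 3]))  # -> [1, 4, 9]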

There are no easy answers here. Don't shoot the messenger ;-)

