Python multiprocessing - sharing a complex object

Question

I've got a large dict-like object that needs to be shared between a number of worker processes. Each worker reads a random subset of the information in the object and does some computation with it. I'd like to avoid copying the large object as my machine quickly runs out of memory.

I was playing with the code for this SO question and I modified it a bit to use a fixed-size process pool, which is better suited to my use case. This, however, seems to break it.

from multiprocessing import Process, Pool
from multiprocessing.managers import BaseManager

class numeri(object):
    def __init__(self):
        self.nl = []

    def getLen(self):
        return len(self.nl)

    def stampa(self):  # "stampa": print the list
        print self.nl

    def appendi(self, x):
        self.nl.append(x)

    def svuota(self):  # "svuota": empty the list
        for i in range(len(self.nl)):
            del self.nl[0]

class numManager(BaseManager):
    pass


def produce(listaNumeri):
    print 'producing', id(listaNumeri)
    return id(listaNumeri)


def main():
    numManager.register('numeri', numeri, exposed=['getLen', 'appendi', 'svuota', 'stampa'])
    mymanager = numManager()
    mymanager.start()
    listaNumeri = mymanager.numeri()
    print id(listaNumeri)

    print '------------ Process'
    for i in range(5):
        producer = Process(target=produce, args=(listaNumeri,))
        producer.start()
        producer.join()

    print '--------------- Pool'
    pool = Pool(processes=1)
    for i in range(5):
        pool.apply_async(produce, args=(listaNumeri,)).get()


if __name__ == '__main__':
    main()

The output is

4315705168
------------ Process
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
producing 4315705168
--------------- Pool
producing 4299771152
producing 4315861712
producing 4299771152
producing 4315861712
producing 4299771152

As you can see, in the first case all worker processes get the same object (by id). In the second case, the id is not the same. Does that mean the object is being copied?

PS I don't think it matters, but I am using joblib, which internally uses a Pool:

from joblib import delayed, Parallel
print '------------- Joblib'
Parallel(n_jobs=4)(delayed(produce)(listaNumeri) for i in range(5))

which outputs

------------- Joblib
producing 4315862096
producing 4315862288
producing 4315862480
producing 4315862672
producing 4315862352

Solution

I'm afraid virtually nothing here works the way you hope it works :-(

First note that identical id() values produced by different processes tell you nothing about whether the objects are really the same object. Each process has its own virtual address space, assigned by the operating system. The same virtual address in two processes can refer to entirely different physical memory locations. Whether your code produces the same id() output or not is pretty much purely accidental. Across multiple runs, sometimes I see different id() output in your Process section and repeated id() output in your Pool section, or vice versa, or both.
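
To convince yourself, here's a tiny demo (hypothetical, not part of the original post): each process creates its own brand-new list and prints its id(). The values can coincide even though the objects are completely unrelated:

from multiprocessing import Process

def show_id():
    nl = []  # a brand-new list, private to this child process
    print 'id in child:', id(nl)

if __name__ == '__main__':
    nl = []
    print 'id in parent:', id(nl)
    for i in range(3):
        p = Process(target=show_id)
        p.start()
        p.join()

On a fork-based system the children's ids will often agree with one another, since each child starts from an identical copy of the parent's address space; the lists are still distinct objects in distinct processes.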

Second, a Manager supplies semantic sharing but not physical sharing. The data for your numeri instance lives only in the manager process. All your worker processes see (copies of) proxy objects. Those are thin wrappers that forward all operations to be performed by the manager process. This involves lots of inter-process communication, and serialization inside the manager process. This is a great way to write really slow code ;-) Yes, there is only one copy of the numeri data, but all work on it is done by a single process (the manager process).

To see this more clearly, make the changes @martineau suggested, and also change get_list_id() to this:

def get_list_id(self):  # added method
    import os
    print("get_list_id() running in process", os.getpid())
    return id(self.nl)
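
(The changes @martineau suggested aren't reproduced in this answer. Presumably they amounted to something like the sketch below, with the registration and the extra print in produce() assumed; that is what makes the with list_id lines show up in the output.)

# Presumed supporting changes, for context:
numManager.register('numeri', numeri,
                    exposed=['getLen', 'appendi', 'svuota', 'stampa',
                             'get_list_id'])  # expose the added method too

def produce(listaNumeri):
    print 'producing', id(listaNumeri)
    # the proxy forwards this call to the manager process
    print 'with list_id', listaNumeri.get_list_id()
    return id(listaNumeri)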

Here's sample output:

41543664
------------ Process
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 46268496
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
producing 44153904
get_list_id() running in process 5856
with list_id 44544608
producing 42262032
get_list_id() running in process 5856
with list_id 44544608
--------------- Pool
producing 41639248
get_list_id() running in process 5856
with list_id 44544608
producing 41777200
get_list_id() running in process 5856
with list_id 44544608
producing 41776816
get_list_id() running in process 5856
with list_id 44544608
producing 41777168
get_list_id() running in process 5856
with list_id 44544608
producing 41777136
get_list_id() running in process 5856
with list_id 44544608

Clear? The reason you get the same list id each time is not that each worker process has the same self.nl member; it's that all numeri methods run in a single process (the manager process). That's why the list id is always the same.

If you're running on a Linux-y system (an OS that supports fork()), a much better idea is to forget all this Manager stuff and create your complex object at module level before starting any worker processes. Then the workers will inherit (address-space copies of) your complex object. The usual copy-on-write fork() semantics will make that about as memory-efficient as possible. That's sufficient if mutations don't need to be folded back into the main program's copy of the complex object. If mutations do need to be folded back in, then you're back to needing lots of inter-process communication, and multiprocessing becomes correspondingly less attractive.
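
Here's a minimal sketch of that approach (the names and the data are made up for illustration):

from multiprocessing import Pool

# Built at module level, before any workers exist, so that fork()ed
# workers inherit it copy-on-write instead of receiving pickled copies.
big_object = dict((i, i * i) for i in range(10 ** 6))  # hypothetical data

def work(key):
    # reads the inherited copy; nothing is shipped between processes
    return big_object[key] + 1

if __name__ == '__main__':
    pool = Pool(processes=4)
    print pool.map(work, [1, 2, 3])

Note this depends on fork(): on Windows, each worker re-imports the module and rebuilds big_object itself, so the memory savings disappear.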

There are no easy answers here. Don't shoot the messenger ;-)
