multiprocessing: sharing a large read-only object between processes?

Question

Do child processes spawned via multiprocessing share objects created earlier in the program?

I have the following setup:

import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in open(filename):
        if line.split(',')[0] in big_lookup_object:
            pass  # something here

if __name__ == '__main__':
    with open('file.bin', 'rb') as f:
        big_lookup_object = marshal.load(f)  # marshal.load expects a file object
    pool = Pool(processes=4)
    print(pool.map(do_some_processing, glob.glob('*.data')))

I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only; I don't need to pass modifications of it between processes.

My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?

Update: to clarify further – big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split is reading lots of other large files and looking up the items in those large files against the lookup object.

Further update: a database is a fine solution, memcached might be a better solution, and a file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in-memory solution. For the final solution I'll be using Hadoop, but I wanted to see if I can have a local in-memory version as well.

Answer

"Do child processes spawned via multiprocessing share objects created earlier in the program?"

No before Python 3.8; yes in Python 3.8 and later, via the multiprocessing.shared_memory module (https://docs.python.org/3/library/multiprocessing.shared_memory.html).

Processes have independent memory space.
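
As a minimal sketch of the 3.8+ shared_memory module linked above (assuming the read-only data can be flattened into a byte buffer; the payload and worker logic here are made up for illustration):

from multiprocessing import Pool, shared_memory

def worker(args):
    name, size = args
    shm = shared_memory.SharedMemory(name=name)  # attach to the existing block, no copy
    data = bytes(shm.buf[:size])                 # read-only use of the shared bytes
    shm.close()
    return data[:7]

if __name__ == '__main__':
    payload = b'pretend this is the big read-only object'
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    with Pool(processes=4) as pool:
        print(pool.map(worker, [(shm.name, len(payload))] * 4))
    shm.close()
    shm.unlink()                                 # release the block when done

Every worker attaches to the same OS-level block by name, so the bytes exist only once; turning them back into a rich Python structure is still up to the application.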

Solution 1

To make best use of a large structure with lots of workers, do this.

  1. Write each worker as a "filter" – reads intermediate results from stdin, does work, writes intermediate results on stdout.

  2. Connect all the workers as a pipeline:

process1 <source | process2 | process3 | ... | processn >result

Each process reads, does work and writes.

This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
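
A sketch of one such filter stage, assuming comma-separated input lines; the file name follows the question, the rest is illustrative. Note that in this pattern each stage loads its own copy of the lookup at startup.

# filter_stage.py – one pipeline stage: read stdin, keep matching lines, write stdout.
import sys
import marshal

with open('file.bin', 'rb') as f:
    big_lookup_object = marshal.load(f)   # loaded once, at stage startup

for line in sys.stdin:
    if line.split(',')[0] in big_lookup_object:
        sys.stdout.write(line)             # pass matching lines downstream

It would run as one element of the shell pipeline above, for example: python filter_stage.py <source >result.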

Solution 2

In some cases, you have a more complex structure – often a "fan-out" structure. In this case you have a parent with multiple children.

  1. Parent opens source data. Parent forks a number of children.

  2. Parent reads source, farms parts of the source out to each concurrently running child.

  3. When the parent reaches the end, it closes the pipe. The child gets end-of-file and finishes normally.

The child parts are pleasant to write because each child simply reads sys.stdin.

The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.
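
A sketch of that fan-out parent, using subprocess pipes and round-robin dispatch; filter_stage.py is the hypothetical worker sketched under Solution 1 and source.data is a made-up input file.

# fan_out.py – parent opens the source and farms lines out to N children.
import subprocess
import sys
from itertools import cycle

NUM_CHILDREN = 4

children = [
    subprocess.Popen([sys.executable, 'filter_stage.py'],
                     stdin=subprocess.PIPE, text=True)
    for _ in range(NUM_CHILDREN)
]

with open('source.data') as source:
    for child, line in zip(cycle(children), source):  # round-robin dispatch
        child.stdin.write(line)

for child in children:
    child.stdin.close()   # end of file: each child finishes normally
    child.wait()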

Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.

Reading from many named pipes is often done using the select module to see which pipes have pending input.
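
A simplified collector along those lines, assuming a Unix platform and line-buffered children so that a readable pipe has a whole line available:

# fan_in.py – interleave lines from several pipes as they become ready.
import select

def collect(pipes):
    open_pipes = list(pipes)              # e.g. the children's stdout file objects
    while open_pipes:
        ready, _, _ = select.select(open_pipes, [], [])
        for pipe in ready:
            line = pipe.readline()
            if line:
                yield line
            else:
                open_pipes.remove(pipe)   # end of file on this pipe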

Solution 3

Shared lookup is the definition of a database.

Solution 3A – load a database. Let the workers process the data in the database.
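
For example, with the sqlite3 module from the standard library (the table and file names are made up; any shared database would do):

# Build the lookup table once, then let every worker query it.
import sqlite3

def build_db(big_lookup_object, path='lookup.db'):
    con = sqlite3.connect(path)
    con.execute('CREATE TABLE IF NOT EXISTS lookup (key TEXT PRIMARY KEY)')
    con.executemany('INSERT OR IGNORE INTO lookup VALUES (?)',
                    ((k,) for k in big_lookup_object))
    con.commit()
    con.close()

def key_exists(key, path='lookup.db'):
    con = sqlite3.connect(path)           # each worker opens its own connection
    found = con.execute('SELECT 1 FROM lookup WHERE key = ?', (key,)).fetchone()
    con.close()
    return found is not None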

Solution 3B – create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.
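
A minimal sketch of such a server, assuming werkzeug is installed and that workers send the key as a ?key=... query parameter (an invented protocol):

# lookup_server.py – tiny WSGI app: GET /?key=foo answers 1 or 0.
from werkzeug.wrappers import Request, Response
from werkzeug.serving import run_simple

big_lookup_object = {'foo', 'bar'}        # stand-in for the real lookup data

@Request.application
def application(request):
    key = request.args.get('key', '')
    return Response('1' if key in big_lookup_object else '0')

if __name__ == '__main__':
    run_simple('127.0.0.1', 5000, application)

Workers then issue plain HTTP GETs against it instead of each holding their own copy of the lookup.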

Solution 4

Shared filesystem object. Unix offers shared memory objects. These are just files that are mapped to memory, so that swapping I/O is done instead of more conventional buffered reads.

You can do this from a Python context in several ways:

  1. Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.

  2. Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file using seek operations to assure that individual sections are easy to find with simple seeks. This is what a database engine does – break the data into pages, make each page easy to locate via a seek.

  3. Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do its work there, as in the sketch below.
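
A sketch of the worker side, assuming fixed-width records so that a record index maps directly to a file offset (the record size and file layout are illustrative):

# Workers map the shared page-structured file; the OS page cache holds one copy.
import mmap

RECORD_SIZE = 64                          # every record padded to this width

def read_record(path, index):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * RECORD_SIZE
            return mm[offset:offset + RECORD_SIZE].rstrip(b'\x00')

Because every worker maps the same file, its pages live only once in the OS page cache no matter how many processes read them.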
