Creating and reusing objects in python processes

Question

I have an embarrassingly parallel problem consisting of a bunch of tasks that are solved independently of each other. Solving each task is quite lengthy, so this is a prime candidate for multiprocessing.

The problem is that solving my tasks requires creating a specific object that is very time-consuming on its own but can be reused for all the tasks (think of an external binary program that needs to be launched), so in the serial version I do something like this:

def costly_function(task, my_object):
    # the expensive object is reused for every task
    solution = solve_task_using_my_object(task, my_object)
    return solution

def solve_problem():
    my_object = create_costly_object()  # created once, very slow
    tasks = get_list_of_tasks()
    all_solutions = [costly_function(task, my_object) for task in tasks]
    return all_solutions

When I try to parallelize this program using multiprocessing, my_object cannot be passed as a parameter for a number of reasons (it cannot be pickled, and it should not run more than one task at the same time), so I have to resort to creating a separate instance of the object for each task:

import multiprocessing

def costly_function(task):
    # a fresh (and slow) instance is built for every single task
    my_object = create_costly_object()
    solution = solve_task_using_my_object(task, my_object)
    return solution

def psolve_problem():
    pool = multiprocessing.Pool()
    tasks = get_list_of_tasks()
    all_solutions = pool.map_async(costly_function, tasks)
    return all_solutions.get()

but the added cost of creating multiple instances of my_object makes this code only marginally faster than the serial one.
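
As an aside on the pickling constraint mentioned above: multiprocessing serializes task arguments with pickle before sending them to the worker processes, so an object that holds something like a live process handle or a callback never makes it across. A minimal sketch of that failure (CostlyObject and its callback attribute are hypothetical stand-ins, not part of the original code):

import pickle

class CostlyObject:
    """Hypothetical stand-in for the expensive object; the lambda makes it unpicklable."""
    def __init__(self):
        self.callback = lambda x: x  # e.g. a handle to an external binary

try:
    pickle.dumps(CostlyObject())
except Exception as exc:
    print(f"cannot be sent to a worker: {exc!r}")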

If I could create a separate instance of my_object in each process and then reuse it for all the tasks that get run in that process, my timings would improve significantly. Any pointers on how to do that?

Answer

I found a simple way of solving my own problem without bringing in any tools besides the standard library. I thought I'd write it down here in case somebody else runs into a similar problem.

multiprocessing.Pool accepts an initializer function (and optional initargs) that is run once when each worker process is launched. The return value of this function is not stored anywhere, but one can take advantage of it to set up a global variable:

import multiprocessing

def init_process():
    # runs once in each worker process, right after it is launched
    global my_object
    my_object = create_costly_object()

def costly_function(task):
    # my_object was created by init_process in this worker's global namespace
    solution = solve_task_using_my_object(task, my_object)
    return solution

def psolve_problem():
    pool = multiprocessing.Pool(initializer=init_process)
    tasks = get_list_of_tasks()
    all_solutions = pool.map_async(costly_function, tasks)
    return all_solutions.get()

Since each worker process has its own global namespace, the instantiated objects do not clash, and each one is created only once per process.
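
A quick way to confirm this behaviour is to log the process id in both the initializer and the worker function; each pid should report exactly one creation while handling many tasks. This is a hypothetical check, not part of the original code, and uses a plain object() as a stand-in for the costly object:

import multiprocessing
import os

def init_process():
    global my_object
    print(f"creating costly object in worker {os.getpid()}")
    my_object = object()  # stand-in for create_costly_object()

def costly_function(task):
    # every task reports which worker (and hence which object instance) served it
    return task, os.getpid(), id(my_object)

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2, initializer=init_process) as pool:
        for task, pid, obj_id in pool.map(costly_function, range(8)):
            print(f"task {task} handled by pid {pid} using object {obj_id}")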

Probably not the most elegant solution, but it's simple enough and gives me a near-linear speedup.
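
If the costly object needs per-run configuration, the initargs parameter mentioned above can forward it to the initializer. A minimal sketch, assuming (hypothetically) that create_costly_object accepts the path to the external binary; adapt the signature to your own constructor:

import multiprocessing

def init_process(binary_path):
    # each worker builds its own instance, configured with the given path
    global my_object
    my_object = create_costly_object(binary_path)

def psolve_problem(binary_path):
    pool = multiprocessing.Pool(initializer=init_process,
                                initargs=(binary_path,))
    tasks = get_list_of_tasks()
    return pool.map_async(costly_function, tasks).get()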
