multiprocessing.Pool with a global variable


Question

I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.

Here is an abstraction of what I am trying to do:

from multiprocessing import Pool

def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x, myObject):
    myObject.modify()  # here I am calling some method that changes myObject
    return myObject.f(x)

poolVar = Pool()
argsArray = [ARGS ARRAY GOES HERE]
output = poolVar.map(myFunction, argsArray)

The function f(x) is contained in a *.so file, i.e., it is calling a C function.

The problem I am having is that the value of the output variable is different each time I run my program, even though myObject.f() is a deterministic function. (If I only use one process, the output is the same on every run.)

I have tried creating the object rather than storing it as a global variable:

def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)

However, in my program the object creation is expensive, so it is much easier to create myObject once and then modify it each time I call myFunction2(). I would therefore like to avoid creating the object on every call.

Do you have any tips? I am very new to parallel programming, so I could be going about this all wrong. I decided to use the Pool class since I wanted to start with something simple, but I am willing to try a better way of doing it.

Answer

"I am using the Pool class from python's multiprocessing library to do some shared memory processing on an HPC cluster."

Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same. Processes do not share memory, which means that the global variables are copied, so their value in the original process does not change.
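
For instance, here is a minimal sketch (with illustrative names, not taken from the question) showing that a child process only changes its own copy of a global:

from multiprocessing import Process

counter = 0                # a plain global in the parent process

def work():
    global counter
    counter += 1           # changes only this child's copy
    print("child sees:", counter)     # prints 1

if __name__ == '__main__':
    p = Process(target=work)
    p.start()
    p.join()
    print("parent sees:", counter)    # still 0 -- the parent's global is unchanged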

If you want to use shared memory between processes, then you must use multiprocessing's data types, such as Value and Array, or use a Manager to create shared lists, etc.
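
As a rough sketch of that approach (init_worker and work are illustrative names, not from the original post), a counter can be shared through a Value that is handed to the workers when the Pool is created:

from multiprocessing import Pool, Value

counter = None                       # will hold the shared Value in each worker

def init_worker(shared_counter):
    global counter
    counter = shared_counter

def work(x):
    with counter.get_lock():         # a synchronized Value carries its own lock
        counter.value += 1
    return x * x

if __name__ == '__main__':
    shared = Value('i', 0)           # a shared C int, initialised to 0
    with Pool(4, init_worker, (shared,)) as pool:
        results = pool.map(work, range(10))
    print(results)                   # [0, 1, 4, ..., 81]
    print(shared.value)              # 10 -- every worker incremented the same counter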

In particular you might be interested in the Manager.register method, which allows the Manager to create shared custom objects (although they must be picklable).
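
A rough sketch of what that could look like, using a made-up Counter class registered on a BaseManager subclass (all names here are illustrative assumptions):

from multiprocessing.managers import BaseManager

class Counter:                       # a made-up custom class
    def __init__(self):
        self._n = 0
    def increment(self):
        self._n += 1
    def value(self):
        return self._n

class MyManager(BaseManager):
    pass

# expose Counter through the manager; method calls on the proxy
# are executed on the real object living in the manager process
MyManager.register('Counter', Counter)

if __name__ == '__main__':
    manager = MyManager()
    manager.start()
    counter = manager.Counter()      # a proxy object shared between processes
    counter.increment()
    counter.increment()
    print(counter.value())           # 2
    manager.shutdown()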

However, I'm not sure whether this will improve performance: any communication between processes requires pickling, and pickling usually takes more time than simply instantiating the object.

Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.

For example, in its simplest form, to create a global variable in the worker process:

def initializer():
    # runs once in each worker process when the Pool is created
    global data
    data = createObject()

Use it as:

pool = Pool(4, initializer, ())

Then the worker functions can use the data global variable without worries.
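
Putting this together for the scenario in the question, a sketch might look like the following; ExpensiveObject is a stand-in assumed here for illustration, not the asker's real C-backed object:

from multiprocessing import Pool

class ExpensiveObject:               # stand-in for the asker's real object
    def __init__(self):
        self.scale = 2               # pretend this setup is the costly part

    def f(self, x):
        return self.scale * x        # deterministic, like the asker's f()

def initializer():
    # runs once per worker: the expensive object is built a single time
    global data
    data = ExpensiveObject()

def myFunction(x):
    return data.f(x)                 # uses this worker's private global

if __name__ == '__main__':
    with Pool(4, initializer, ()) as pool:
        output = pool.map(myFunction, range(8))
    print(output)                    # [0, 2, 4, 6, 8, 10, 12, 14]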

Style note: never use the name of a built-in for your variables/modules. In your case, object is a built-in. Otherwise you'll end up with unexpected errors which may be obscure and hard to track down.
