How to write to a shared variable in python joblib


Question

The following code parallelizes a for-loop.

import networkx as nx
import numpy as np
from joblib import Parallel, delayed

def core_func(repeat_index, G, numpy_array_2D):
  for u in G.nodes():
    numpy_array_2D[repeat_index][u] = 2

if __name__ == "__main__":
  G = nx.erdos_renyi_graph(100000, 0.99)
  nRepeat = 5000
  numpy_array = np.zeros([nRepeat, G.number_of_nodes()])
  Parallel(n_jobs=4)(delayed(core_func)(repeat_index, G, numpy_array)
                     for repeat_index in range(nRepeat))
  print(np.mean(numpy_array))

As can be seen, the expected value to be printed is 2. However, when I run my code on a cluster (multi-core, shared memory), it returns 0.0.

I think the problem is that each worker creates its own copy of the numpy_array object, and the one created in the main function is not updated. How can I modify the code such that the numpy array numpy_array can be updated?

Answer

By default, joblib uses a multiprocessing pool of processes, as its manual says:

Under the hood, the Parallel object creates a multiprocessing pool that forks the Python interpreter in multiple processes to execute each of the items of the list. The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.

This means that every process inherits the original state of the array, but whatever it writes into it is lost when the process exits. Only the function's return value is delivered back to the calling (main) process. But you return nothing, so None comes back.
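One way to work with this model rather than against it is to have each worker return its row and assemble the result in the parent process. A minimal sketch (the function name and the small sizes here are illustrative, not the original graph code):

```python
import numpy as np
from joblib import Parallel, delayed

def compute_row(repeat_index, n_nodes):
    # Build and return the row instead of mutating shared state.
    return np.full(n_nodes, 2.0)

if __name__ == "__main__":
    n_nodes, n_repeat = 100, 50  # small sizes for illustration
    rows = Parallel(n_jobs=4)(
        delayed(compute_row)(i, n_nodes) for i in range(n_repeat)
    )
    numpy_array = np.vstack(rows)  # assemble results in the main process
    print(np.mean(numpy_array))  # 2.0
```

This keeps the default process backend and avoids shared state entirely, at the cost of pickling each returned row back to the parent.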

To make the shared array modifiable, you have two options: using threads, or using shared memory.

Threads, unlike processes, share memory. So you can write to the array and every job will see the change. According to the joblib manual, it is done this way:

  Parallel(n_jobs=4, backend="threading")(delayed(core_func)(repeat_index, G, numpy_array) for repeat_index in range(nRepeat))

When run:

$ python r1.py 
2.0

However, when you write more complex things into the array, make sure you properly handle the locks around the data or pieces of it, or you will hit race conditions (google it).
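For instance, a read-modify-write on a shared value is not atomic across threads, so it should be guarded. A minimal sketch with the threading backend (the lock and the accumulate function are illustrative, not part of the original code):

```python
import threading
import numpy as np
from joblib import Parallel, delayed

lock = threading.Lock()  # guards the read-modify-write below

def accumulate(repeat_index, totals):
    # Without the lock, concurrent threads could interleave the
    # read and the write and lose updates.
    with lock:
        totals[0] += 1

totals = np.zeros(1)
Parallel(n_jobs=4, backend="threading")(
    delayed(accumulate)(i, totals) for i in range(1000)
)
print(totals[0])  # 1000.0
```

With the threading backend no pickling happens, so the workers mutate the very same array object.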

Also read carefully about the GIL, as computational multithreading in Python is limited (unlike I/O multithreading).

If you still need processes (e.g. because of the GIL), you can put the array into shared memory.

This is a somewhat more complicated topic, but a joblib + numpy shared-memory example is also shown in the joblib manual.
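As a rough sketch of that approach, using a plain numpy.memmap file directly rather than joblib's dump/load helpers (the function name, file name, and small sizes are illustrative):

```python
import os
import tempfile
import numpy as np
from joblib import Parallel, delayed

def fill_row(repeat_index, mmap_path, shape):
    # Re-open the memory-mapped file in each worker; all processes
    # see the same underlying buffer backed by the file on disk.
    arr = np.memmap(mmap_path, dtype=np.float64, mode="r+", shape=shape)
    arr[repeat_index, :] = 2.0
    arr.flush()

if __name__ == "__main__":
    shape = (50, 100)  # small sizes for illustration
    path = os.path.join(tempfile.mkdtemp(), "shared.mmap")
    # Create and zero-fill the backing file.
    np.memmap(path, dtype=np.float64, mode="w+", shape=shape).flush()
    Parallel(n_jobs=4)(
        delayed(fill_row)(i, path, shape) for i in range(shape[0])
    )
    result = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
    print(np.mean(result))  # 2.0
```

Since each worker writes only its own row, no lock is needed here; concurrent writes to the same region would again require synchronization.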
