python3 multiprocess shared numpy array (read-only)


Problem Description

I'm not sure if this title is appropriate for my situation: the reason I want to share a numpy array is that it might be one of the potential solutions to my case, but other solutions would also be welcome.

My task: I need to implement an iterative algorithm with multiprocessing, and each of these processes needs a copy of the data (this data is large, read-only, and won't change during the iterative algorithm).

I've written some pseudo code to demonstrate my idea:

import multiprocessing


def worker_func(data, args):
    # Do something with the read-only data and this chunk of args...
    return res

def compute(data, process_num, niter):
    args = init()

    for it in range(niter):
        args_chunk = split_args(args, process_num)
        result = []
        pool = multiprocessing.Pool()
        for i in range(process_num):
            # `data` is pickled and sent to a worker on every call.
            result.append(pool.apply_async(worker_func, (data, args_chunk[i])))
        pool.close()
        pool.join()
        # Aggregate the results and update args for the next iteration.
        for res in result:
            args = update_args(res.get())

if __name__ == "__main__":
    compute(data, 4, 100)

The problem is that in each iteration, I have to pass the data to the subprocesses, which is very time-consuming.

I've come up with two potential solutions:

  1. Share the data (an ndarray) among the processes, which is what the title of this question asks about.
  2. Keep the subprocesses alive, like daemon processes or something similar, waiting for calls. That way, I only need to pass the data once at the very beginning.

So, is there any way to share a read-only numpy array among processes? Or if you have a good implementation of solution 2, that also works.

Thanks.

Recommended Answer

If you absolutely must use Python multiprocessing, then you can use it along with Arrow's Plasma object store to store the object in shared memory and access it from each of the workers. See this example, which does the same thing using a Pandas dataframe instead of a numpy array.

If you don't absolutely need to use Python multiprocessing, you can do this much more easily with Ray. One advantage of Ray is that it works out of the box not just with arrays but also with Python objects that contain arrays.

Under the hood, Ray serializes Python objects using Apache Arrow, which is a zero-copy data layout, and stores the result in Arrow's Plasma object store. This allows worker tasks to have read-only access to the objects without creating their own copies. You can read more about how this works.

Here is a modified version of your example that runs.

import numpy as np
import ray

ray.init()

@ray.remote
def worker_func(data, i):
    # Do work. This function will have read-only access to
    # the data array.
    return 0

data = np.zeros(10**7)
# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)

# Run worker_func 10 times in parallel. This will not create any copies
# of the array. The tasks will run in separate processes.
result_ids = []
for i in range(10):
    result_ids.append(worker_func.remote(data_id, i))

# Get the results.
results = ray.get(result_ids)

Note that if we omitted the line data_id = ray.put(data) and instead called worker_func.remote(data, i), then the data array would be stored in shared memory once per function call, which would be inefficient. By calling ray.put first, we store the object in the object store a single time.
