Share Large, Read-Only Numpy Array Between Multiprocessing Processes


Question


I have a 60GB SciPy Array (Matrix) I must share between 5+ multiprocessing Process objects. I've seen numpy-sharedmem and read this discussion on the SciPy list. There seem to be two approaches--numpy-sharedmem and using a multiprocessing.RawArray() and mapping NumPy dtypes to ctypes. Now, numpy-sharedmem seems to be the way to go, but I've yet to see a good reference example. I don't need any kind of locks, since the array (actually a matrix) will be read-only. Now, due to its size, I'd like to avoid a copy. It sounds like the correct method is to create the only copy of the array as a sharedmem array, and then pass it to the Process objects? A couple of specific questions:


  1. What's the best way to actually pass the sharedmem handles to sub-Process()es? Do I need a queue just to pass one array around? Would a pipe be better? Can I just pass it as an argument to the Process() subclass's __init__ (where I'm assuming it gets pickled)?


In the discussion I linked above, there's mention of numpy-sharedmem not being 64bit-safe? I'm definitely using some structures that aren't 32-bit addressable.


Are there tradeoffs to the RawArray() approach? Is it slower or buggier?


Do I need any ctype-to-dtype mapping for the numpy-sharedmem method?


Does anyone have an example of some open-source code doing this? I'm a very hands-on learner, and it's hard to get this working without a good example to look at.


If there's any additional info I can provide to help clarify this for others, please comment and I'll add. Thanks!


This needs to run on Ubuntu Linux and maybe Mac OS, but portability isn't a huge concern.

Answer


@Velimir Mlaker gave a great answer. I thought I could add a few comments and a tiny example.


(I couldn't find much documentation on sharedmem - these are the results of my own experiments.)

  1. Do you need to pass the handles when the subprocess is starting, or after it has started? If it's just the former, you can use the target and args arguments of Process. This is potentially better than using a global variable.
  2. From the discussion page you linked, it appears that support for 64-bit Linux was added to sharedmem a while back, so it may be a non-issue.
  3. I don't know about this one.
  4. No. See the example below.

Example

#!/usr/bin/env python
from multiprocessing import Process
import sharedmem
import numpy

def do_work(data, start):
    data[start] = 0

def split_work(num):
    n = 20
    width = n // num  # integer division; plain n/num yields a float on Python 3
    shared = sharedmem.empty(n)
    shared[:] = numpy.random.rand(1, n)[0]
    print("values are %s" % shared)

    processes = [Process(target=do_work, args=(shared, i*width)) for i in range(num)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print("values are %s" % shared)
    print("type is %s" % type(shared[0]))

if __name__ == '__main__':
    split_work(4)

Output

values are [ 0.81397784  0.59667692  0.10761908  0.6736734   0.46349645  0.98340718
  0.44056863  0.10701816  0.67167752  0.29158274  0.22242552  0.14273156
  0.34912309  0.43812636  0.58484507  0.81697513  0.57758441  0.4284959
  0.7292129   0.06063283]
values are [ 0.          0.59667692  0.10761908  0.6736734   0.46349645  0.
  0.44056863  0.10701816  0.67167752  0.29158274  0.          0.14273156
  0.34912309  0.43812636  0.58484507  0.          0.57758441  0.4284959
  0.7292129   0.06063283]
type is <class 'numpy.float64'>
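For comparison, here is a minimal sketch of the multiprocessing.RawArray() route the question mentions. This is my own sketch, not part of the original answer: it assumes float64 data, and the do_work/split_work helpers just mirror the structure above.

```python
from multiprocessing import Process, RawArray
import numpy as np

def do_work(raw, start):
    # Re-wrap the shared buffer as a NumPy array inside the child; no copy is made.
    data = np.frombuffer(raw, dtype=np.float64)
    data[start] = 0.0

def split_work(num):
    n = 20
    # 'd' is the ctypes type code for a C double, i.e. numpy.float64.
    raw = RawArray('d', n)
    shared = np.frombuffer(raw, dtype=np.float64)
    shared[:] = np.random.rand(n)
    processes = [Process(target=do_work, args=(raw, i * (n // num))) for i in range(num)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(shared)  # entries 0, 5, 10, 15 have been zeroed by the children

if __name__ == '__main__':
    split_work(4)
```

Unlike sharedmem.empty(), this route does require picking a matching typecode/dtype pair by hand, which is exactly the ctype-to-dtype mapping the question asks about.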

This related question may be useful.
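One more option the answer above does not cover, offered as my own suggestion: because the array is read-only and the target is Linux, plain fork() copy-on-write inheritance can work. Make the array a module-level global before starting the children and simply read it; nothing is serialized or copied up front. (Reference-count updates can still fault in individual pages, and this relies on the fork start method, so sharedmem or RawArray remain the more predictable choices.)

```python
from multiprocessing import Process
import numpy as np

# Stand-in for the 60GB matrix; must exist before the children are started.
big_array = np.random.rand(1000)

def reader(i):
    # Each child inherits big_array through fork(); no pickling, no copy.
    print(i, big_array[:3].sum())

if __name__ == '__main__':
    processes = [Process(target=reader, args=(i,)) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```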

