multiprocessing with large data

Problem description

I am using multiprocessing.Pool() to parallelize some heavy computations.

The target function returns a lot of data (a huge list). I'm running out of RAM.

Without multiprocessing, I'd just change the target function into a generator, by yielding the resulting elements one after another, as they are computed.
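
For example, the non-multiprocessing generator version of the function from the example below would look roughly like this (a minimal sketch, using the same dummy data):

def target_fnc(arg):
    for i in xrange(1000000):
        yield 'dvsdbdfbngd'  # stream each element as it is computed; no list held in RAM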

I understand multiprocessing does not support generators -- it waits for the entire output and returns it at once, right? No yielding. Is there a way to make the Pool workers yield data as soon as they become available, without constructing the entire result array in RAM?

Simple example:

from multiprocessing import Pool

def target_fnc(arg):
    result = []
    for i in xrange(1000000):
        result.append('dvsdbdfbngd')  # <== would like to just use yield!
    return result

def process_args(some_args):
    pool = Pool(16)
    for result in pool.imap_unordered(target_fnc, some_args):
        for element in result:
            yield element

This is Python 2.7.

Recommended answer

This sounds like an ideal use case for a Queue: http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes

Simply feed your results into the queue from the pooled workers and ingest them in the master.
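
A minimal sketch of that idea, adapted from the question's example: it assumes a Manager().Queue() (a plain multiprocessing.Queue cannot be passed as an argument to Pool workers), and the draining loop and timeout value are illustrative rather than part of the original answer:

from multiprocessing import Pool, Manager
from Queue import Empty  # Python 2; in Python 3 this is queue.Empty

def target_fnc(args):
    arg, queue = args
    for i in xrange(1000000):
        queue.put('dvsdbdfbngd')  # push each element as soon as it is computed

def process_args(some_args):
    manager = Manager()
    queue = manager.Queue()       # pass a maxsize here to bound memory (see below)
    pool = Pool(16)
    async_result = pool.map_async(target_fnc, [(arg, queue) for arg in some_args])
    # Drain the queue until every task has finished and nothing is left to read.
    while not (async_result.ready() and queue.empty()):
        try:
            yield queue.get(timeout=0.1)
        except Empty:
            pass
    pool.close()
    pool.join()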

Note that you still may run into memory pressure issues unless you drain the queue nearly as fast as the workers are populating it. You could limit the queue size (the maximum number of objects that will fit in the queue) in which case the pooled workers would block on the queue.put statements until space is available in the queue. This would put a ceiling on memory usage. But if you're doing this, it may be time to reconsider whether you require pooling at all and/or if it might make sense to use fewer workers.
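
To illustrate both points, one possible sketch drops the Pool in favour of plain Process workers and a bounded Queue; the maxsize value, the one-worker-per-argument layout, and the None sentinel convention are assumptions for illustration, not part of the original answer:

from multiprocessing import Process, Queue

def worker(arg, queue):
    for i in xrange(1000000):
        queue.put('dvsdbdfbngd')  # blocks here once the queue already holds maxsize items
    queue.put(None)               # sentinel: this worker is done

def process_args(some_args):
    queue = Queue(maxsize=10000)  # cap on buffered results == ceiling on memory use
    workers = [Process(target=worker, args=(arg, queue)) for arg in some_args]
    for w in workers:
        w.start()
    finished = 0
    while finished < len(workers):
        item = queue.get()
        if item is None:
            finished += 1
        else:
            yield item
    for w in workers:
        w.join()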
