Python Multiprocessing.Pool lazy iteration


Question

I'm wondering about the way that python's Multiprocessing.Pool class works with map, imap, and map_async. My particular problem is that I want to map on an iterator that creates memory-heavy objects, and don't want all these objects to be generated into memory at the same time. I wanted to see if the various map() functions would wring my iterator dry, or intelligently call the next() function only as child processes slowly advanced, so I hacked up some tests as such:

import time
from multiprocessing import Pool

def g():
  for el in xrange(100):
    print el          # show when the generator is actually advanced
    yield el

def f(x):
  time.sleep(1)
  return x*x

if __name__ == '__main__':
  pool = Pool(processes=4)              # start 4 worker processes
  go = g()
  g2 = pool.imap(f, go)
  g2.next()

And so on with map, imap, and map_async. imap is the most flagrant example, however: simply calling next() a single time on g2 prints all the elements from my generator g(), whereas if imap were being 'lazy' I would expect it to call go.next() only once, and therefore print only the first element.

Can someone clear up what is happening, and whether there is some way to have the process pool 'lazily' evaluate the iterator as needed?

Thanks,

Gabe

Answer

Let's look at the end of the program first.

The multiprocessing module uses atexit to call multiprocessing.util._exit_function when your program ends.
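
A minimal sketch of that mechanism (my illustration, not the multiprocessing source):

import atexit

def _cleanup():
    # multiprocessing registers a similar handler,
    # multiprocessing.util._exit_function, which shuts down pools and
    # their helper threads when the interpreter exits
    print 'interpreter exiting, running cleanup'

atexit.register(_cleanup)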

If you remove g2.next(), your program ends quickly.

The _exit_function eventually calls Pool._terminate_pool. The main thread changes the state of pool._task_handler._state from RUN to TERMINATE. Meanwhile the pool._task_handler thread is looping in Pool._handle_tasks and bails out when it reaches the condition

            if thread._state:
                debug('task handler found thread._state != RUN')
                break

(See /usr/lib/python2.6/multiprocessing/pool.py)

This is what stops the task handler from fully consuming your generator, g(). If you look in Pool._handle_tasks you'll see

        for i, task in enumerate(taskseq):
            ...
            try:
                put(task)
            except IOError:
                debug('could not put task on queue')
                break

This is the code which consumes your generator. (taskseq is not exactly your generator, but as taskseq is consumed, so is your generator.)
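
To make that concrete, here is a simplified paraphrase (an illustration based on CPython 2.6's pool.py, not the exact source) of how the map functions wrap your iterable; pulling one task pulls exactly one element:

def make_taskseq(job_id, func, iterable):
    # Simplified paraphrase: each task bundles a job id, an index, the
    # function, its argument tuple, and a kwargs dict. Advancing this
    # generator advances the wrapped iterable one element at a time.
    return ((job_id, i, func, (x,), {}) for i, x in enumerate(iterable))

tasks = make_taskseq(0, lambda x: x * x, iter(xrange(5)))
print tasks.next()   # -> (0, 0, <lambda>, (0,), {})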

In contrast, when you call g2.next() the main thread calls IMapIterator.next, and waits when it reaches self._cond.wait(timeout).
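
The blocking itself is the standard condition-variable pattern; a minimal sketch of the idea (illustrative, not the actual IMapIterator code):

import collections
import threading

items = collections.deque()
cond = threading.Condition()

def next_result():
    # Like IMapIterator.next: block on the condition until the
    # result-handler thread has delivered an item.
    cond.acquire()
    try:
        while not items:
            cond.wait()
        return items.popleft()
    finally:
        cond.release()

def deliver(value):
    # Like the result handler: append an item and wake the waiter.
    cond.acquire()
    try:
        items.append(value)
        cond.notify()
    finally:
        cond.release()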

That the main thread is waiting instead of calling _exit_function is what allows the task handler thread to run normally, which means fully consuming the generator as it puts tasks in the workers' inqueue in the Pool._handle_tasks function.

The bottom line is that all the Pool map functions consume the entire iterable they are given. If you'd like to consume the generator in chunks, you could do this instead:

import multiprocessing as mp
import itertools
import time


def g():
    for el in xrange(50):
        print el
        yield el


def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=4)              # start 4 worker processes
    go = g()
    result = []
    N = 11
    while True:
        # islice pulls at most N items from the generator per pass,
        # so only one chunk of tasks is materialized at a time
        g2 = pool.map(f, itertools.islice(go, N))
        if g2:
            result.extend(g2)
            time.sleep(1)
        else:
            break                            # empty chunk: generator exhausted
    print(result)
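
A small variation on the same idea (my sketch, not part of the original answer; the chunks helper is an illustrative name): factor the slicing into a generator of chunks, which keeps the pool loop declarative while still pulling at most N items at a time.

import itertools

def chunks(iterable, n):
    # Lazily yield successive lists of up to n items from iterable.
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

# usage:
# result = [y for chunk in chunks(g(), 11) for y in pool.map(f, chunk)]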
