Python Multiprocessing.Pool lazy iteration
Question
I'm wondering about the way that python's Multiprocessing.Pool class works with map, imap, and map_async. My particular problem is that I want to map on an iterator that creates memory-heavy objects, and don't want all these objects to be generated into memory at the same time. I wanted to see if the various map() functions would wring my iterator dry, or intelligently call the next() function only as child processes slowly advanced, so I hacked up some tests as such:
import time
from multiprocessing import Pool

def g():
    for el in xrange(100):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    go = g()
    g2 = pool.imap(f, go)
    g2.next()
And so on with map, imap, and map_async. This is the most flagrant example however, as simply calling next() a single time on g2 prints out all my elements from my generator g(), whereas if imap were doing this 'lazily' I would expect it to only call go.next() once, and therefore print out only '1'.
Can someone clear up what is happening, and if there is some way to have the process pool 'lazily' evaluate the iterator as needed?
Thanks,
Gabe
Answer
Let's look at the end of the program first.
The multiprocessing module uses atexit to call multiprocessing.util._exit_function when your program ends.
If you remove g2.next(), your program ends quickly.
The _exit_function eventually calls Pool._terminate_pool. The main thread changes the state of pool._task_handler._state from RUN to TERMINATE. Meanwhile the pool._task_handler thread is looping in Pool._handle_tasks and bails out when it reaches the condition
if thread._state:
    debug('task handler found thread._state != RUN')
    break
(See /usr/lib/python2.6/multiprocessing/pool.py)
This is what stops the task handler from fully consuming your generator, g(). If you look in Pool._handle_tasks you'll see
for i, task in enumerate(taskseq):
    ...
    try:
        put(task)
    except IOError:
        debug('could not put task on queue')
        break
This is the code which consumes your generator. (taskseq is not exactly your generator, but as taskseq is consumed, so is your generator.)
In contrast, when you call g2.next(), the main thread calls IMapIterator.next, and waits when it reaches self._cond.wait(timeout).
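That self._cond.wait(timeout) is the same timeout you can reach from user code: the iterator that imap returns exposes next(timeout). A minimal Python 3 sketch of our own (the 2-second sleep and 0.1-second timeout are illustrative values, not from the original post):

```python
import multiprocessing as mp
import time

def slow(x):
    time.sleep(2)  # make the worker slow enough that no result is ready yet
    return x * x

if __name__ == '__main__':
    with mp.Pool(processes=1) as pool:
        it = pool.imap(slow, [1, 2, 3])
        try:
            # IMapIterator.next(timeout) gives up when self._cond.wait(timeout) expires
            it.next(timeout=0.1)
        except mp.TimeoutError:
            print('no result within 0.1s; the main thread was waiting, not consuming')
```

While the main thread sits in that wait, the task-handler thread keeps running in the background.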
That the main thread is waiting instead of calling _exit_function is what allows the task handler thread to run normally, which means fully consuming the generator as it puts tasks in the workers' inqueue in the Pool._handle_tasks function.
The bottom line is that all the Pool map functions consume the entire iterable they are given. If you'd like to consume the generator in chunks, you could do this instead:
import multiprocessing as mp
import itertools
import time

def g():
    for el in xrange(50):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=4)  # start 4 worker processes
    go = g()
    result = []
    N = 11
    while True:
        g2 = pool.map(f, itertools.islice(go, N))
        if g2:
            result.extend(g2)
            time.sleep(1)
        else:
            break
    print(result)