multiprocessing pool.map call functions in certain order
Problem description
How can I make multiprocessing.pool.map distribute processes in numerical order?
More Info:
I have a program which processes a few thousand data files, making a plot of each one. I'm using multiprocessing.pool.map to distribute each file to a processor, and it works great. Sometimes this takes a long time, and it would be nice to look at the output images while the program is running. This would be a lot easier if the map process distributed the snapshots in order; instead, for the particular run I just executed, the first 8 snapshots analyzed were: 0, 78, 156, 234, 312, 390, 468, 546. Is there a way to make it distribute them in closer to numerical order?
Example:
Here's some sample code which contains the same key elements and shows the same basic result:
import sys
import time
from multiprocessing import Pool

num_proc = 4
num_calls = 20
sleeper = 0.1

def SomeFunc(arg):
    time.sleep(sleeper)
    print("%5d" % arg, end="")
    sys.stdout.flush()  # otherwise doesn't print properly on single line

if __name__ == "__main__":
    proc_pool = Pool(num_proc)
    proc_pool.map(SomeFunc, range(num_calls))
Yields:
0 4 2 6 1 5 3 7 8 10 12 14 13 11 9 15 16 18 17 19
Answer:
From @Hayden: Use the 'chunksize' parameter, def map(self, func, iterable, chunksize=None).
More Info:
The chunksize determines how many iterations are allocated to each processor at a time. My example above, for instance, uses a chunksize of 2, which means that each processor goes off and does its thing for 2 iterations of the function, then comes back for more ('check-in'). The trade-off behind chunksize is that there is overhead for the 'check-in' when the processor has to sync up with the others, which suggests you want a large chunksize. On the other hand, if you have large chunks, then one processor might finish its chunk while another has a long time left to go, so you should use a small chunksize. I guess the additional useful information is how much variation there is in how long each function call can take. If they really should all take the same amount of time, it's way more efficient to use a large chunksize. On the other hand, if some function calls could take twice as long as others, you want a small chunksize so that processors aren't caught waiting.
For my problem, every function call should take very close to the same amount of time (I think), so if I want the processes to be called in order, I'm going to sacrifice efficiency because of the check-in overhead.
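To make the trade-off a bit more concrete, here is a minimal sketch of how a list of 20 calls breaks up into chunks for a few chunk sizes. The helper name split_into_chunks is hypothetical (it is not part of multiprocessing); it only illustrates that a smaller chunksize means more chunks, and so more 'check-ins'.

```python
def split_into_chunks(items, chunksize):
    """Split items into consecutive lists of at most `chunksize` elements.

    Hypothetical helper for illustration only; it mimics the idea of
    chunking, not the library's internal implementation.
    """
    items = list(items)
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


for size in (1, 2, 10):
    chunks = split_into_chunks(range(20), size)
    # One chunk corresponds to one 'check-in' by a worker.
    print("chunksize=%d -> %d chunks, first chunk %s" % (size, len(chunks), chunks[0]))
```

With 20 calls, a chunksize of 1 gives 20 chunks, 2 gives 10, and 10 gives only 2.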
Recommended answer
The reason this occurs is that each process is given a predefined amount of work at the start of the call to map, and that amount depends on the chunksize. We can work out the default chunksize by looking at the source for pool.map:
chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
if extra:
    chunksize += 1
So for a range of 20, and with 4 processes, we will get a chunksize of 2.
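As a quick check of that arithmetic, the same calculation can be pulled out into a standalone function. The name default_chunksize is made up for this sketch; only the divmod logic comes from the source snippet quoted above.

```python
def default_chunksize(num_items, num_workers):
    """Reproduce the default chunksize arithmetic quoted from pool.map.

    Hypothetical standalone helper; the library computes this inline.
    """
    chunksize, extra = divmod(num_items, num_workers * 4)
    if extra:
        chunksize += 1  # round up so every item lands in some chunk
    return chunksize


# 20 items on 4 workers: divmod(20, 16) == (1, 4), so the result is 1 + 1.
print(default_chunksize(20, 4))  # 2
```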
If we modify your code to reflect this we should get similar results to the results you are getting now:
proc_pool.map(SomeFunc, range(num_calls), chunksize=2)
This yields the output:
0 2 6 4 1 7 5 3 8 10 12 14 9 13 15 11 16 18 17 19
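The pattern in that output (0, 2, 4, 6 first, then 1, 3, 5, 7, and so on) can be reproduced with a small idealized simulation. This is only a sketch under strong assumptions: every call takes exactly the same time, and idle workers pick up the next chunk in a fixed order. It does not use multiprocessing at all, and simulated_start_order is a hypothetical name.

```python
from collections import deque


def simulated_start_order(num_items, num_workers, chunksize):
    """Return the order in which items would start, in an idealized model.

    Assumptions (not guaranteed by the real pool): all calls take equal
    time, and idle workers grab the next chunk in worker order.
    """
    # Consecutive chunks of item indices, handed out first-come first-served.
    chunks = deque(
        list(range(start, min(start + chunksize, num_items)))
        for start in range(0, num_items, chunksize)
    )
    workers = [deque() for _ in range(num_workers)]
    order = []
    while True:
        # Every idle worker takes the next available chunk.
        for pending in workers:
            if not pending and chunks:
                pending.extend(chunks.popleft())
        if not any(workers):
            break  # no worker has anything left to do
        # One time step: each busy worker runs the next item of its chunk.
        for pending in workers:
            if pending:
                order.append(pending.popleft())
    return order


print(simulated_start_order(20, 4, 2))
# [0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15, 16, 18, 17, 19]
print(simulated_start_order(20, 4, 1))
# [0, 1, 2, ..., 19], i.e. strictly increasing
```

In a real run the items within each group of four finish in a slightly shuffled order, since the processes are not perfectly synchronized, but the grouping is the same as in the output above.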
Now, setting chunksize=1 will ensure that each process within the pool is only given one task at a time.
proc_pool.map(SomeFunc, range(num_calls), chunksize=1)
This should ensure a reasonably good numerical ordering compared to not specifying a chunksize. For example, a chunksize of 1 yields the output:
0 1 2 3 4 5 6 7 9 10 8 11 13 12 15 14 16 17 19 18