multiprocessing pool.map call functions in certain order

Question

How can I make multiprocessing.pool.map distribute tasks to the processes in numerical order?

More Info:
I have a program that processes a few thousand data files, making a plot of each one. I'm using multiprocessing.pool.map to distribute each file to a processor, and it works great. Sometimes this takes a long time, and it would be nice to look at the output images while the program is running. This would be a lot easier if the map call distributed the snapshots in order; instead, for the particular run I just executed, the first 8 snapshots analyzed were: 0, 78, 156, 234, 312, 390, 468, 546. Is there a way to make it distribute them closer to numerical order?

Example:
Here's some sample code that contains the same key elements and shows the same basic result:

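import sys
import time
from multiprocessing import Pool

num_proc = 4      # worker processes in the pool
num_calls = 20    # number of function calls to distribute
sleeper = 0.1     # seconds each call sleeps to simulate work

def SomeFunc(arg):
    time.sleep(sleeper)
    print("%5d" % arg, end="")
    sys.stdout.flush()     # otherwise doesn't print properly on single line

# The guard keeps worker processes from re-running the pool setup on import.
if __name__ == "__main__":
    proc_pool = Pool(num_proc)
    proc_pool.map(SomeFunc, range(num_calls))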

Yields:

   0  4  2  6   1   5   3   7   8  10  12  14  13  11   9  15  16  18  17  19


Answer:

From @Hayden: Use the 'chunksize' parameter, def map(self, func, iterable, chunksize=None).

More Info:
The chunksize determines how many iterations are allocated to each processor at a time. My example above, for instance, ends up with the default chunksize of 2, which means that each processor goes off and does its thing for 2 iterations of the function, then comes back for more ('check-in'). The trade-off behind chunksize is that each 'check-in' has overhead, because the processor has to sync up with the others, which suggests you want a large chunksize. On the other hand, if you have large chunks, then one processor might finish its chunk while another has a long time left to go, which suggests you want a small chunksize. I guess the other useful piece of information is how much variation there is in how long each function call takes. If the calls really should all take the same amount of time, it's far more efficient to use a large chunksize. If some function calls could take twice as long as others, you want a small chunksize so that processors aren't caught waiting.

For my problem, every function call should take very close to the same amount of time (I think), so if I want the calls to be made in order, I'm going to sacrifice some efficiency to the check-in overhead.
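
To make the chunking described above concrete, here is a minimal sketch of how an iterable might be cut into consecutive chunks of a given chunksize. The helper name split_into_chunks is hypothetical and this is not the library's internal code; the chunksize of 78 in the second call is an assumption inferred from the stride between the first snapshots reported in the question.

```python
def split_into_chunks(items, chunksize):
    """Cut items into consecutive lists of at most chunksize elements.

    Hypothetical helper for illustration; not part of multiprocessing.
    """
    items = list(items)
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


# The 20-call example with a chunksize of 2: ten chunks of two items each.
chunks = split_into_chunks(range(20), 2)
print(chunks[:4])                           # the first four chunks
print([chunk[0] for chunk in chunks[:4]])   # first item of each of those chunks

# Assumed chunksize of 78 (inferred from the stride 0, 78, 156, ...).
big_chunks = split_into_chunks(range(1000), 78)
print([chunk[0] for chunk in big_chunks[:8]])
```

If four workers each take one of the first four chunks, the first items to run are 0, 2, 4, 6, which matches the start of the sample output. The same pattern with chunks of 78 gives 0, 78, 156, and so on up to 546 for the first eight chunks.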

Recommended answer

This happens because, at the start of the call to map, each process is given a predefined amount of work to do, and that amount depends on the chunksize. We can work out the default chunksize by looking at the source for pool.map:

chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
if extra:
  chunksize += 1

So for a range of 20 and 4 processes, divmod(20, 16) gives a quotient of 1 and a remainder of 4; the nonzero remainder bumps the chunksize up to 2.
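
As a worked check of that arithmetic, the sketch below wraps the quoted lines in a standalone function. The name default_chunksize and its two parameters are assumptions for illustration; only the divmod logic comes from the source shown above, and other Python versions may compute the default differently.

```python
def default_chunksize(num_items, num_workers):
    """Reproduce the divmod-based default quoted from pool.map.

    Hypothetical helper: num_items stands in for len(iterable) and
    num_workers for len(self._pool).
    """
    chunksize, extra = divmod(num_items, num_workers * 4)
    if extra:
        chunksize += 1   # round up so every item lands in some chunk
    return chunksize


print(default_chunksize(20, 4))     # the 20-call, 4-process example
print(default_chunksize(2480, 8))   # assumed: about 2,480 files on 8 workers
```

If the real run used 8 worker processes (an assumption; the question does not say), the same formula gives a chunksize of 78 for roughly 2,465 to 2,496 files, which is consistent with the stride of 78 between the first snapshots.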

If we modify your code to pass this chunksize explicitly, we should get results similar to the ones you are getting now:

proc_pool.map(SomeFunc, range(num_calls), chunksize=2)

This yields the output:

0 2 6 4 1 7 5 3 8 10 12 14 9 13 15 11 16 18 17 19

Now, setting chunksize=1 will ensure that each process within the pool is given only one task at a time.

proc_pool.map(SomeFunc, range(num_calls), chunksize=1)

This should give a reasonably good numerical ordering compared with not specifying a chunksize. For example, a chunksize of 1 yields the output:

0 1 2 3 4 5 6 7 9 10 8 11 13 12 15 14 16 17 19 18
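
The grouping in both outputs can be reproduced with a simplified model. The sketch below assumes every call takes exactly the same time and that chunks are handed to the workers in order, a batch at a time; the function name idealized_start_order is hypothetical, and this is only an illustration, not how Pool schedules work internally (a real pool gives the next chunk to whichever worker becomes free).

```python
def idealized_start_order(num_calls, num_proc, chunksize):
    """Group item indices by the time step at which they would start.

    Simplified model: all calls take equal time, and the workers pick up
    chunks in order and walk through them in lockstep.
    """
    chunks = [list(range(i, min(i + chunksize, num_calls)))
              for i in range(0, num_calls, chunksize)]
    steps = []
    # Each batch holds the chunks that the workers process side by side.
    for start in range(0, len(chunks), num_proc):
        batch = chunks[start:start + num_proc]
        for offset in range(chunksize):
            step = [chunk[offset] for chunk in batch if offset < len(chunk)]
            if step:
                steps.append(step)
    return steps


for size in (2, 1):
    print("chunksize=%d:" % size, idealized_start_order(20, 4, size))
```

Within each time step the real output can appear in any order, since the workers run concurrently, but the groups themselves line up with the runs shown above: interleaved evens and odds for a chunksize of 2, and consecutive blocks of four for a chunksize of 1.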
