Pool/starmap的Python多处理行为 [英] Python multiprocessing behavior of Pool / starmap

查看:206
本文介绍了Pool/starmap的Python多处理行为的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个使用多处理库来计算一些东西的程序.大约有1万个输入要计算,每个输入要花费0.2秒至10秒.

I've got a program using the multiprocessing library to compute some stuff. There are about 10K inputs to compute, each of them taking between 0.2 second and 10 seconds.

我当前的方法使用池:

# Inputs
signals = [list(s) for s in itertools.combinations_with_replacement(possible_inputs, 3)]

# Compute
with mp.Pool(processes = N) as p:
    p.starmap(compute_solutions, [(s, t0, tf, folder) for s in signals])
    print ("    | Computation done.")

我已经注意到,在检查300/400个最后输入时,该程序变慢了很多.我的问题是:Poolstarmap()的行为如何?

I've noticed that on the 300 / 400 last inputs to check, the program became a lot slower. My question is: how does the Pool and the starmap() behave?

根据我的观察,我相信如果我有10K输入和N = 4(4个进程),则将2500个第一个输入分配给第一个进程,将2500个分配给第二个进程,...以及每个进程以串行方式处理其输入. 这意味着,如果某些进程先于其他进程清除了队列,则它们不会获得要执行的新任务.

Fro my observation, I believe that if I got 10K inputs and N = 4 (4 processes), then the 2 500 first inputs are assigned to the first process, the 2 500 next to the second, ... and each process treats its inputs in a serial fashion. Which means that if some processes have cleared the Queue before others, they do not get new tasks to perform.

这是正确的吗?

Is this correct?

如果这是正确的,我如何拥有一个可以用以下伪代码表示的更智能的系统:

If this is correct, how can I have a smarter system which could be represented with this pseudo-code:

workers = Initialize N workers
tasks = A list of the tasks to perform

for task in tasks:
    if a worker is free:
        submit task to this worker
    else:
        wait

感谢您的帮助:)

N.B:不同的地图函数之间有什么区别.我相信map()imap_unordered()imapstarmap存在.

N.B: What is the difference between the different map function. I believe map(), imap_unordered(), imap, starmap exists.

它们之间有什么区别,什么时候应该使用其中一个?

What are the differences between them and when should we use one or the other?

推荐答案

这意味着,如果某些进程先清除了队列,则它们将无法执行新任务.

Which means that if some processes have cleared the Queue before others, they do not get new tasks to perform.

这正确吗?

不. multiprocess.Pool()的主要目的是将传递的工作负载分散到其工作人员池中-这就是为什么它带有所有这些映射选项的原因-各种方法之间的唯一区别在于工作负载的实际分配方式以及所产生的收益如何被收集.

No. The main purpose of multiprocess.Pool() is to spread the passed workload to the pool of its workers - that's why it comes with all those mapping options - the only difference between its various methods is on how the workload is actually distributed and how the resulting returns are collected.

在您的情况下,使用[(s, t0, tf, folder) for s in signals]生成的可迭代对象会将其每个元素(最终取决于signals大小)发送给池中的下一个空闲工作程序(称为compute_solutions(s, t0, tf, folder)) ,一次一次(如果传递了chunksize参数,则一次或更多次),直到整个Iterable耗尽为止.这样您就无法控制哪个工作人员执行哪个部分.

In your case, the iterable you're generating with [(s, t0, tf, folder) for s in signals] will have each of its elements (which ultimately depends on the signals size) sent to the next free worker (invoked as compute_solutions(s, t0, tf, folder)) in the pool, one at a time (or more if chunksize parameter is passed), until the whole iterable is exhausted. You do not get to control which worker executes which part, tho.

工作量也可能不会平均分配-一名工人可能会根据资源使用情况,执行速度,各种内部事件而处理比另一个工人更多的条目...

The workload may also not be evenly spread - one worker may process more entries than another in dependence of resource usage, execution speed, various internal events...

但是,使用multiprocessing.Poolmapimapstarmap方法,您会得到均匀分布的错觉,因为它们内部同步每个工人的收益以匹配来源可迭代的(即结果的第一个元素将包含来自被调用函数的结果返回以及可迭代的第一个元素).如果要查看底下的实际情况,可以尝试使用这些方法的 async/unordered 版本.

However, using map, imap and starmap methods of multiprocessing.Pool you get the illusion of even and orderly spread as they internally synchronize the returns from each of the workers to match the source iterable (i.e. the first element of the result will contain the resulting return from the called function with the first element of the iterable). You can try the async/unordered versions of these methods if you want to see what actually happens underneath.

因此,默认情况下,您将获得更智能的系统,但是您始终可以使用

Therefore, you get the smarter system by default, but you can always use multiprocessing.Pool.apply_async() if you want a full control over your pool of workers.

作为旁注,如果您想优化对可迭代对象本身的访问(因为池映射选项将消耗很大一部分),则可以选中

As a side note, if you're looking on optimizing the access to your iterable itself (as the pool map options will consume a large part of it) you can check this answer.

最后,

它们之间有什么区别,什么时候应使用另一种?

代替我在这里引用,而是转到官方文档,因为它们之间的区别得到了很好的解释.

Instead of me quoting here, head over to the official docs as there is quite a good explanation of a difference between those.

这篇关于Pool/starmap的Python多处理行为的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆