What's the difference between Python's multiprocessing and concurrent.futures?

Question

A simple way of implementing multiprocessing in Python is

from multiprocessing import Pool

def calculate(number):
    return number

if __name__ == '__main__':
    pool = Pool()
    result = pool.map(calculate, range(4))

An alternative implementation based on futures is

from concurrent.futures import ProcessPoolExecutor

def calculate(number):
    return number

with ProcessPoolExecutor() as executor:
    result = executor.map(calculate, range(4))

Both alternatives do essentially the same thing, but one striking difference is that we don't have to guard the code with the usual if __name__ == '__main__' clause. Is this because the implementation of futures takes care of this, or is there a different reason?

More broadly, what are the differences between multiprocessing and concurrent.futures? When is one preferred over the other?

My initial assumption that the guard if __name__ == '__main__' is only necessary for multiprocessing was wrong. Apparently, the guard is needed for both implementations on Windows, while it is not necessary on Unix systems.

Answer

You actually should use the if __name__ == "__main__" guard with ProcessPoolExecutor, too: It's using multiprocessing.Process to populate its Pool under the covers, just like multiprocessing.Pool does, so all the same caveats regarding picklability (especially on Windows), etc. apply.
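
For illustration, here is the futures example from the question with the guard added, which makes it safe on Windows as well (a minimal sketch):

from concurrent.futures import ProcessPoolExecutor

def calculate(number):
    return number

if __name__ == '__main__':
    # The guard prevents child processes, which re-import this module
    # on Windows, from recursively spawning new executors.
    with ProcessPoolExecutor() as executor:
        result = list(executor.map(calculate, range(4)))
    print(result)  # [0, 1, 2, 3]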

I believe that ProcessPoolExecutor is meant to eventually replace multiprocessing.Pool, according to this statement made by Jesse Noller (a Python core contributor), when asked why Python has both APIs:

Brian and I need to work on the consolidation we intend(ed) to occur as people got comfortable with the APIs. My eventual goal is to remove anything but the basic multiprocessing.Process/Queue stuff out of MP and into concurrent.* and support threading backends for it.

For now, ProcessPoolExecutor is mostly doing the exact same thing as multiprocessing.Pool with a simpler (and more limited) API. If you can get away with using ProcessPoolExecutor, use that, because I think it's more likely to get enhancements in the long-term. Note that you can use all the helpers from multiprocessing with ProcessPoolExecutor, like Lock, Queue, Manager, etc., so needing those isn't a reason to use multiprocessing.Pool.
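
For example, a multiprocessing.Manager queue can be shared with executor workers, since the manager proxy is picklable (a minimal sketch; the worker function and values are just for illustration):

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

def worker(queue, number):
    # The queue proxy pickles cleanly, so it can be passed to tasks.
    queue.put(number * 2)

if __name__ == '__main__':
    with Manager() as manager:
        queue = manager.Queue()
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(worker, queue, n) for n in range(4)]
            for future in futures:
                future.result()  # propagate any worker exceptions
        while not queue.empty():
            print(queue.get())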

There are some notable differences in their APIs and behavior though:

  1. If a Process in a ProcessPoolExecutor terminates abruptly, a BrokenProcessPool exception is raised, aborting any calls waiting for the pool to do work, and preventing new work from being submitted. If the same thing happens to a multiprocessing.Pool it will silently replace the process that terminated, but the work that was being done in that process will never be completed, which will likely cause the calling code to hang forever waiting for the work to finish.
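
A minimal sketch of that first behavior, using os._exit to simulate a worker dying abruptly:

import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash():
    # Kill the worker process without any cleanup.
    os._exit(1)

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        future = executor.submit(crash)
        try:
            future.result()
        except BrokenProcessPool:
            print('pool is broken; waiting calls abort, no new work accepted')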

  2. If you are running Python 3.6 or lower, support for initializer/initargs is missing from ProcessPoolExecutor; support for this was only added in 3.7.
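
On 3.7 and newer it accepts the same initializer/initargs pair that multiprocessing.Pool does (a minimal sketch; the names are just for illustration):

from concurrent.futures import ProcessPoolExecutor

_config = None

def init_worker(value):
    # Runs once in each worker process as it starts (Python 3.7+).
    global _config
    _config = value

def lookup(key):
    return (_config, key)

if __name__ == '__main__':
    with ProcessPoolExecutor(initializer=init_worker,
                             initargs=('shared setting',)) as executor:
        print(list(executor.map(lookup, range(3))))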

  3. maxtasksperchild is not supported in ProcessPoolExecutor.
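
If you need workers recycled after a fixed number of tasks, for example to contain memory leaks, multiprocessing.Pool is the way to get that (a minimal sketch; note that Python 3.11 later added an equivalent max_tasks_per_child parameter to ProcessPoolExecutor):

from multiprocessing import Pool

def calculate(number):
    return number

if __name__ == '__main__':
    # Each worker process is replaced after completing 10 tasks.
    with Pool(maxtasksperchild=10) as pool:
        result = pool.map(calculate, range(100))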

  4. concurrent.futures doesn't exist in Python 2.7, unless you manually install the backport.

  5. If you're running below Python 3.5, according to this question, multiprocessing.Pool.map outperforms ProcessPoolExecutor.map. Note that the performance difference is very small per work item, so you'll probably only notice a large performance difference if you're using map on a very large iterable. The reason for the performance difference is that multiprocessing.Pool will batch the iterable passed to map into chunks, and then pass the chunks to the worker processes, which reduces the overhead of IPC between the parent and children. ProcessPoolExecutor always (or by default, starting in 3.5) passes one item from the iterable at a time to the children, which can lead to much slower performance with large iterables, due to the increased IPC overhead. The good news is this issue is fixed in Python 3.5, as the chunksize keyword argument has been added to ProcessPoolExecutor.map, which can be used to specify a larger chunk size when you know you're dealing with large iterables. See this bug for more info.
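
A minimal sketch of passing chunksize on Python 3.5+:

from concurrent.futures import ProcessPoolExecutor

def calculate(number):
    return number * 2

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # Batches of 1000 items per IPC round trip instead of one at a
        # time, which cuts the overhead on large iterables.
        result = list(executor.map(calculate, range(100000), chunksize=1000))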
