Python 3 multiprocessing: optimal chunk size


Question


How do I find the optimal chunk size for multiprocessing.Pool instances?

I used this before to create a generator of n sudoku objects:

import multiprocessing

processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
# The third argument is the chunksize; create_sudoku and n are defined elsewhere
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)

To measure the time, I call time.time() before the snippet above, initialize the pool as described, convert the generator into a list (list(sudokus)) to force the items to actually be generated (only for the time measurement; I know this is nonsense in the final program), then take the time again with time.time() and output the difference.
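
For reference, a minimal sketch of that measurement; create_sudoku here is a trivial stand-in (the real implementation is not shown in the question) and n is chosen arbitrarily:

import multiprocessing
import time

def create_sudoku(seed):
    # Trivial stand-in for the real sudoku generator
    return [seed] * 81

if __name__ == "__main__":
    n = 20000
    processes = multiprocessing.cpu_count()
    worker_pool = multiprocessing.Pool(processes)
    start = time.time()
    # list() forces the lazy imap_unordered generator to produce every item
    sudokus = list(worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1))
    print("%.3f ms per object" % ((time.time() - start) / n * 1000))
    worker_pool.close()
    worker_pool.join()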

I observed that a chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded during the first half of the run; toward the end, usage drops to 25% (on an i3 with 2 cores and hyper-threading).

If I use a smaller chunk size of int(n // (processes**2) + 1) instead, I get times of around 0.355 ms, and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but it stays high for a much longer part of the run before it drops to 25%.
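
To see how different the two formulas actually are, here is a quick back-of-the-envelope comparison (the concrete n is an assumption; processes = 4 matches the i3 mentioned above):

processes = 4                           # 2 cores + hyper-threading
n = 10000
print(n // processes + 1)               # 2501 -> only 4 chunks in total, one per worker
print(int(n // (processes ** 2) + 1))   # 626 -> 16 chunks, finer load balancing at the end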

Is there an even better formula to calculate the chunk size, or an otherwise better method to use the CPU most effectively? Please help me improve this multiprocessing pool's effectiveness.

Solution

This answer provides a high-level overview.

Going into details: each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes a chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So, extremely pessimistically, let's estimate the maximum possible cost of an IPC request at 100 μs.

You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making the chunk processing time >10 ms, if my numbers are right. So if each task takes, say, 1 μs to process, you'd want a chunksize of at least 10,000.
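
That rule of thumb translates directly into a formula. A minimal sketch, where the 10 ms target comes from the estimate above and the 1 μs per-task time is the assumed example value:

target_chunk_time = 0.010  # seconds; keeps the ~100 us worst-case IPC cost under 1%
per_task_time = 1e-6       # seconds; measure or estimate this for your workload
chunksize = max(1, int(target_chunk_time / per_task_time))
print(chunksize)           # 10000, matching the figure in the text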

The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing the time to completion. I suppose in most cases a delay of 10 ms is not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
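
One hedged way to combine the two constraints: take the time-based chunksize but cap it so that every worker receives several chunks. The safety factor of 4 is an arbitrary assumption for illustration, not part of the answer:

import math

n = 1000000         # total number of tasks (assumed)
processes = 4
time_based = 10000  # from the 10 ms rule above
# Cap so each worker receives ~4 chunks, limiting the end-of-run straggler effect
tail_cap = math.ceil(n / (processes * 4))
chunksize = max(1, min(time_based, tail_cap))
print(chunksize)    # min(10000, 62500) -> 10000 here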

Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers' capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again, targeting a processing time of ~10 ms seems safe (assuming you don't mind a startup delay of under 10 ms).

Note: on modern Linux/Windows, context switches happen every ~1-20 ms for non-real-time processes - unless, of course, the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.

