为什么multiprocessing.Pool.map_async中的get()操作需要这么长时间? [英] Why does the get() operation in multiprocessing.Pool.map_async take so long?

查看：130 发布时间：2020/5/13 19:41:04 python parallel-processing multiprocessing parallelism-amdahl

本文介绍了为什么multiprocessing.Pool.map_async中的get()操作需要这么长时间?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

import multiprocessing as mp
import numpy as np

pool   = mp.Pool( processes = 4 )
inp    = np.linspace( 0.01, 1.99, 100 )
result = pool.map_async( func, inp ) #Line1 ( func is some Python function which acts on input )
output = result.get()                #Line2

因此，我试图在multiprocessing.Pool()实例上使用 .map_async() 方法来并行化Python中的某些代码.

So, I was trying to parallelize some code in Python, using a .map_async() method on a multiprocessing.Pool() instance.

我注意到，而
Line1大约需要千分之一秒，
Line2大约需要0.3秒.

I noticed that while
Line1 takes around a thousandth of a second,
Line2 takes about .3 seconds.

是否有更好的方法来解决此问题或解决由Line2，
引起的瓶颈或
我在这里做错什么了吗?

Is there a better way to do this or a way to get around the bottleneck caused by Line2,
or
am I doing something wrong here?

_{(我对此很陌生.)}

不要惊慌，许多用户做的都非常相同-支付的费用超过了收到的费用.

这是一个普通的讲座，不是讲一些"有前途的"语法构造函数，而是讲解使用它的实际成本.

Do not panic, many users do the very same - Paid more than received.

This is a common lecture not on using some "promising" syntax-constructor, but on paying the actual costs for using it.

这个故事很长，效果很直接-您原本期望的结果很低，但是必须付出巨大的过程实例化，工作包重新分发以及结果收集的费用，所有这些只是为了做这些事情但是几轮func()通话.

The story is long, the effect was straightforward - you expected a low hanging fruit, but had to pay an immense cost of process-instantiation, work-package re-distribution and for collection of results, all that circus just for doing but a few rounds of func()-calls.

嗯，谁告诉你任何这样的(潜力)提速是免费的吗?

Well, who told you that any such ( potential ) speedup is for free?

让我们定量些，而不是测量实际的代码执行时间，而不是情绪吧?

Let's be quantitative and rather measure the actual code-execution time, instead of emotions, right?

基准化始终是一个公平的举动.
它可以帮助我们(凡人)摆脱公正的期望
并进入定量证据支持的知识:

Benchmarking is always a fair move.
It helps us, mortals, to escape from just expectations
and get ourselves into quantitative records-of-evidence supported knowledge:

from zmq import Stopwatch; aClk = Stopwatch() # this is a handy tool to do so

AS-IS测试:

在继续前进之前，应该先记录下这对:

AS-IS test:

Before moving forwards, one ought record this pair:

>>> aClk.start(); _ = [   func( SEQi ) for SEQi in inp ]; aClk.stop() # [SEQ] 
>>> HowMuchWillWePAY2RUN( func, 4, 100 )                              # [RUN]
>>> HowMuchWillWePAY2MAP( func, 4, 100 )                              # [MAP]

这将在纯 [SERIAL] multiprocessing.Pool()或其他工具)扩展实验，则将[SEQ] -ofs调用转移至未优化的joblib.Parallel()或其他任何工具.

This will set the span among the performance envelopes from a pure-[SERIAL] [SEQ]-of-calls, to an un-optimised joblib.Parallel() or any other, if one wishes to extend the experiment with any other tools, like a said multiprocessing.Pool() or other.

意图:
以衡量{流程| }实例，我们需要一个NOP-work-package负载，该负载几乎什么都不会有"，而要返回后退"，并且不需要支付任何额外的附加费用(无论是用于任何输入参数的传输还是返回任何值)

Intent:
so as to measure the cost of a { process | job }-instantiation, we need a NOP-work-package payload, that will spend almost nothing "there" but return "back" and will not require to pay any additional add-on costs ( be it for any input parameters' transmissions or returning any value )

def a_NOP_FUN( aNeverConsumedPAR ):
    """                                                 __doc__
    The intent of this FUN() is indeed to do nothing at all,
                             so as to be able to benchmark
                             all the process-instantiation
                             add-on overhead costs.
    """
    pass

因此，这里的设置费用与附加费用比较:

So, the setup-overhead add-on costs comparison is here:

#-------------------------------------------------------<function a_NOP_FUN
[SEQ]-pure-[SERIAL] worked within ~   37 ..     44 [us] on this localhost
[MAP]-just-[CONCURENT] tool         2536 ..   7343 [us]
[RUN]-just-[CONCURENT] tool       111162 .. 112609 [us]

在`joblib.Parallel()`任务处理上使用
`joblib.delayed()`的策略:

Using a strategy of
`joblib.delayed()` on `joblib.Parallel()` task-processing:

def HowMuchWillWePAY2RUN( aFun2TEST = a_NOP_FUN, JOBS_TO_SPAWN = 4, RUNS_TO_RUN = 10 ):
    from zmq import Stopwatch; aClk = Stopwatch()
    try:
         aClk.start()
         joblib.Parallel(  n_jobs = JOBS_TO_SPAWN
                          )( joblib.delayed( aFun2TEST )
                                           ( aFunPARAM )
                                       for ( aFunPARAM )
                                       in  range( RUNS_TO_RUN )
                             )
    except:
         pass
    finally:
         try:
             _ = aClk.stop()
         except:
             _ = -1
             pass
    pass;  pMASK = "CLK:: {0:_>24d} [us] @{1: >4d}-JOBs ran{2: >6d} RUNS {3:}"
    print( pMASK.format( _,
                         JOBS_TO_SPAWN,
                         RUNS_TO_RUN,
                         " ".join( repr( aFun2TEST ).split( " ")[:2] )
                         )
            )

在`multiprocessing.Pool()`实例上使用轻量级
`.map_async()`方法的策略:

Using a strategy of a lightweight
`.map_async()` method on a `multiprocessing.Pool()` instance:

def HowMuchWillWePAY2MAP( aFun2TEST = a_NOP_FUN, PROCESSES_TO_SPAWN = 4, RUNS_TO_RUN = 1 ):
    from zmq import Stopwatch; aClk = Stopwatch()
    try:
         import numpy           as np
         import multiprocessing as mp

         pool = mp.Pool( processes = PROCESSES_TO_SPAWN )
         inp  = np.linspace( 0.01, 1.99, 100 )

         aClk.start()
         for i in xrange( RUNS_TO_RUN ):
             pass;    result = pool.map_async( aFun2TEST, inp )
             output = result.get()
         pass
    except:
         pass
    finally:
         try:
             _ = aClk.stop()
         except:
             _ = -1
             pass
    pass;  pMASK = "CLK:: {0:_>24d} [us] @{1: >4d}-PROCs ran{2: >6d} RUNS {3:}"
    print( pMASK.format( _,
                         PROCESSES_TO_SPAWN,
                         RUNS_TO_RUN,
                         " ".join( repr( aFun2TEST ).split( " ")[:2] )
                         )
            )

所以，
第一组痛苦和惊喜
在并发 joblib.Parallel() 的同时，直接按实际的无成本进行操作:

So,
the first set of pain and surprises
comes straight at the actual cost-of-doing-NOTHING in a concurrent pool of joblib.Parallel():

 CLK:: __________________117463 [us] @   4-JOBs ran    10 RUNS <function a_NOP_FUN
 CLK:: __________________111182 [us] @   3-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________110229 [us] @   3-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________110095 [us] @   3-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________111794 [us] @   3-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________110030 [us] @   3-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________110697 [us] @   3-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: _________________4605843 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________336208 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________298816 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________355492 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________320837 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________308365 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________372762 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________304228 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________337537 [us] @ 123-JOBs ran   100 RUNS <function a_NOP_FUN
 CLK:: __________________941775 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
 CLK:: __________________987440 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
 CLK:: _________________1080024 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
 CLK:: _________________1108432 [us] @ 123-JOBs ran 10000 RUNS <function a_NOP_FUN
 CLK:: _________________7525874 [us] @ 123-JOBs ran100000 RUNS <function a_NOP_FUN

因此，这项科学公正，严格的测试从这种最简单的案例开始，已经显示了所有相关的代码执行处理设置开销的基准成本，有史以来最小的 > joblib.Parallel() 惩罚正弦准非.

So, this scientifically fair and rigorous test started from this simplest ever case, already showing the benchmarked costs of all the associated code-execution processing setup-overheads a smallest ever joblib.Parallel() penalty sine-qua-non.

这将我们带入了一个现实算法运行的方向-最好在接下来的测试循环中添加一些越来越大的有效载荷"大小.

This forwards us into a direction, where real-world algorithms do live - best with next adding some larger and larger "payload"-sizes into the testing loop.

使用这种系统的，轻量级的方法，我们可能会继续前进，因为我们还需要对附加成本和{ remote-job-PAR-XFER(s) | remote-job-MEM.alloc(s) | remote-job-CPU-bound-processing | remote-job-fileIO(s) }的其他阿姆达尔定律间接影响进行基准测试.

Using this systematic and lightweight approach, we may go forwards in the story, as we will need to also benchmark the add-on costs and other Amdahl's Law indirect effects of { remote-job-PAR-XFER(s) | remote-job-MEM.alloc(s) | remote-job-CPU-bound-processing | remote-job-fileIO(s) }

这样的功能模板可能有助于重新测试(如您所见，将有很多需要重新运行，而操作系统噪声和一些其他工件将进入实际的使用成本模式) :

A function template like this may help in re-testing ( as you see there will be a lot to re-run, while the O/S noise and some additional artifacts will step into the actual cost-of-use patterns ):

一旦我们支付了前期费用，下一个最常见的错误就是忘记了内存分配的费用.因此，让我们对其进行测试:

Once we have paid the up-front cost, the next most common mistake is to forget the costs of memory allocations. So, lets test it:

def a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR( aNeverConsumedPAR, SIZE1D = 1000 ):
    """                                                 __doc__
    The intent of this FUN() is to do nothing but
                             a MEM-allocation
                             so as to be able to benchmark
                             all the process-instantiation
                             add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports
    aMemALLOC = np.zeros( ( SIZE1D, #       so as to set
                            SIZE1D, #       realistic ceilings
                            SIZE1D, #       as how big the "Big Data"
                            SIZE1D  #       may indeed grow into
                            ),
                          dtype = np.float64,
                          order = 'F'
                          )         # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    return aMemALLOC[2:3,3,4,5]

万一您的平台停止能够分配所请求的内存块，我们就会遇到另一种问题(如果尝试在物理资源中进行并行处理，则会遇到一类隐藏的玻璃天花板不可知论的方式).可以编辑SIZE1D缩放比例，以便至少适合平台RAM的寻址/大小调整功能，但是，实际问题计算的性能范围仍然值得我们关注:

In case your platform will stop to be able to allocate the requested memory-blocks, there we head-bang into another kind of problems ( with a class of hidden glass-ceilings if trying to go-parallel in a physical-resources agnostic manner ). One may edit the SIZE1D scaling, so as to at least fit into the platform RAM addressing / sizing capabilites, yet, the performance envelopes of the real-world problem computing are still of our great interest here:

>>> HowMuchWillWePAY2RUN( a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR, 200, 1000 )

可能会产生
支付费用，介于 0.1 [s]和+9 [s] (!!)
仅用于保持静止状态，但现在也不会忘记一些实际的MEM分配附加费用"有"

查看全文

为什么multiprocessing.Pool.map_async中的get()操作需要这么长时间? [英] Why does the get() operation in multiprocessing.Pool.map_async take so long?

问题描述

推荐答案

不要惊慌，许多用户做的都非常相同-支付的费用超过了收到的费用.

Do not panic, many users do the very same - Paid more than received.

AS-IS测试:

AS-IS test:

在`joblib.Parallel()`任务处理上使用
`joblib.delayed()`的策略:

Using a strategy of
`joblib.delayed()` on `joblib.Parallel()` task-processing:

在`multiprocessing.Pool()`实例上使用轻量级
`.map_async()`方法的策略:

Using a strategy of a lightweight
`.map_async()` method on a `multiprocessing.Pool()` instance:

测试用例C:

相关文章

Python最新文章

热门教程

热门工具

登录关闭

为什么multiprocessing.Pool.map_async中的get()操作需要这么长时间? [英] Why does the get() operation in multiprocessing.Pool.map_async take so long?

问题描述

推荐答案

不要惊慌，许多用户做的都非常相同-支付的费用超过了收到的费用.

Do not panic, many users do the very same - Paid more than received.

AS-IS测试:

AS-IS test:

在joblib.Parallel()任务处理上使用 joblib.delayed()的策略:

Using a strategy ofjoblib.delayed() on joblib.Parallel() task-processing:

在multiprocessing.Pool()实例上使用轻量级 .map_async()方法的策略:

Using a strategy of a lightweight.map_async() method on a multiprocessing.Pool() instance:

测试用例C:

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

在`joblib.Parallel()`任务处理上使用
`joblib.delayed()`的策略:

Using a strategy of
`joblib.delayed()` on `joblib.Parallel()` task-processing:

在`multiprocessing.Pool()`实例上使用轻量级
`.map_async()`方法的策略:

Using a strategy of a lightweight
`.map_async()` method on a `multiprocessing.Pool()` instance:

登录关闭