Poor scaling of multiprocessing Pool.map() on a list of large objects: How to achieve better parallel scaling in python?


Problem description

Let us define:

from multiprocessing import Pool
import numpy as np
def func(x):
    for i in range(1000):
        i**2
    return 1

Notice that func() does some work and always returns the small number 1.

Then I compare an 8-core parallel Pool.map() against the serial, built-in Python map():

n=10**3
a=np.random.random(n).tolist()

with Pool(8) as p:
    %timeit -r1 -n2  p.map(func,a)
%timeit -r1 -n2  list(map(func,a))

This gives:

38.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
200 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)

which shows quite good parallel scaling: I use 8 cores, and 38.4 [ms] is roughly 1/8 of 200 [ms].

Then let us try Pool.map() on lists of some bigger things; for simplicity, I use a list-of-lists this way:

n=10**3
m=10**4
a=np.random.random((n,m)).tolist()

with Pool(8) as p:
    %timeit -r1 -n2  p.map(func,a)
%timeit -r1 -n2  list(map(func,a))

which gives:

292 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
209 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)

You see, the parallel scaling is gone! The parallel run ( 292 [ms] ) is now slower than the serial one ( 209 [ms] ).

We can make it much worse: try making each sub-list that gets passed even bigger:

n=10**3
m=10**5
a=np.random.random((n,m)).tolist()

with Pool(8) as p:
    %timeit -r1 -n2  p.map(func,a)
%timeit -r1 -n2  list(map(func,a))

This gives:

3.29 s ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
179 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)

Wow, with even bigger sub-lists the timing result is completely reversed: we use 8 cores to get a roughly 20x slow-down!

You can also notice that the serial map()'s timing has nothing to do with the sub-list size. So a reasonable explanation would be that Pool.map() is really passing the content of those big sub-lists around the processes, which causes additional copying?

I am not sure. But if so, why doesn't it pass the address of the sub-list? After all, the sub-list is already in memory, and in practice the func() I used is guaranteed not to change/modify the sub-list.

So, what is the correct way to keep parallel scaling when mapping some operation onto a list of large objects in Python?

Recommended answer

Before we start
and dive deeper into any hunt for nanoseconds ( and right, it will soon start, as each [ns] matters once the scaling opens the whole Pandora's Box of the problems ), let's agree on the scales - the easiest and often "cheap" premature tricks may, and often will, derail your dreams once the problem size has grown into realistic scales - the thousands ( seen above in both iterators ) behave way differently for in-cache computing with < 0.5 [ns] data-fetches than once the data has grown beyond the L1/L2/L3-cache sizes, for scales above 1E+5, 1E+6, 1E+9, above [GB]s, where each mis-aligned fetch is WAY more EXPENSIVE than a few 100 [ns].

Q :"...,因为我有8个内核,我想用它们使速度提高8倍"

Q : "... because I have 8 cores, I want to use them to get 8 times faster"

I wish you could, indeed. Yet, sorry for telling the truth straight, the World does not work this way.

See this interactive tool; it will show you both the speedup limits and their principal dependence on the actual production costs of the real-world scaling of the initial problem as it grows from trivial sizes, and these combined effects at scale. Just click it and play with the sliders to see it live, in action.

Q : ( is ) Pool.map() really passing the content of those big sub-lists around processes, which causes additional copying?

Yes,
it must do so, by design,
plus it does that by passing all that data "through" another "expensive" SER/DES processing,
so as to make it happen delivered "there".
The very same would apply vice-versa whenever you would have tried to return "back" some mastodon-sized result(s), which you did not do here above.
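
As a quick probe of that cost ( a hedged sketch, not part of the original answer; multiprocessing uses pickle under the hood by default ), one can measure what a single sub-list already costs just to get SER/DES-processed, before any useful work even starts:

import pickle, timeit
import numpy as np

a = np.random.random( ( 10**3, 10**4 ) ).tolist()      # the same data as in the question

blob    = pickle.dumps( a[0] )                          # one sub-list -> bytes
t_dumps = timeit.timeit( lambda: pickle.dumps( a[0] ), number = 100 ) / 100
t_loads = timeit.timeit( lambda: pickle.loads( blob ),  number = 100 ) / 100

print( f"one sub-list pickles into ~ {len( blob ) / 1E3:.0f} [kB]" )
print( f"~ {t_dumps * 1E3:.1f} [ms] dumps() + ~ {t_loads * 1E3:.1f} [ms] loads() per call,"
       f" paid n-times over, before func() even starts" )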

Q : But, if so, why doesn't it pass the address of the sub-list?

Because the remote ( parameter-receiving ) process is another, fully autonomous process, with its own, separate and protected address-space, we cannot just pass an address-reference "into" it, and we wanted it to be a fully independent, autonomously working python process ( due to a will to use this trick so as to escape from the GIL-lock dancing ), didn't we? Sure we did - this is a central step of our escape from the GIL-Wars ( for a better understanding of the GIL-lock pros and cons, you may like this and this ( Pg.15+ on CPU-bound processing ) ).

             0.1 ns - NOP
             0.3 ns - XOR, ADD, SUB
             0.5 ns - CPU L1 dCACHE reference           (1st introduced in late 80-ies )
             0.9 ns - JMP SHORT
             1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~~~~~ 1   ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
           3~4   ns - CPU L2  CACHE reference           (2020/Q1)
             5   ns - CPU L1 iCACHE Branch mispredict
             7   ns - CPU L2  CACHE reference
            10   ns - DIV
            19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
            71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
           100   ns - MUTEX lock/unlock
           100   ns - own DDR MEMORY reference
           135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
           202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
           325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
        10,000   ns - Compress 1K bytes with a Zippy PROCESS
        20,000   ns - Send     2K bytes over 1 Gbps  NETWORK
       250,000   ns - Read   1 MB sequentially from  MEMORY
       500,000   ns - Round trip within a same DataCenter
?~~~ 2,500,000   ns - Read  10 MB sequentially from  MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
    10,000,000   ns - DISK seek
    10,000,000   ns - Read   1 MB sequentially from  NETWORK
?~~ 25,000,000   ns - Read 100 MB sequentially from  MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
    30,000,000   ns - Read 1 MB sequentially from a  DISK
?~~ 36,000,000   ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
   150,000,000   ns - Send a NETWORK packet CA -> Netherlands
  |   |   |   |
  |   |   | ns|
  |   | us|
  | ms|

Q :"当在大型对象列表上并行映射某些操作时,保持并行缩放的正确方法是什么?" /p>

Q : " what is the correct way to keep parallel scaling when parallel mapping some operations on a list of large things? "

A )
UNDERSTAND THE WAYS TO AVOID OR AT LEAST REDUCE EXPENSES :

Understand all the types of the costs you have to pay and will pay :

  • spend as low a process-instantiation cost as possible ( rather expensive ), best as a one-time cost only ( a rough probe of this one-time cost is sketched right after this list )

On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

  • spend as small an amount of parameter-passing costs as you must ( yes, best avoid repetitively passing those "large things" as parameters )
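
    A rough, hedged probe of the first of these costs ( not from the original answer; numbers will differ a lot between the fork / spawn / forkserver start-methods and between OS-es ) may look like this:

    import time
    from multiprocessing import Pool

    if __name__ == '__main__':
        start = time.perf_counter()
        with Pool( 8 ) as p:                    # spawn 8 workers ...
            p.map( int, range( 8 ) )            # ... and force each of them to actually start working
        print( f"Pool( 8 ) setup + teardown took ~ {time.perf_counter() - start:.2f} [s]" )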

    B )
    UNDERSTAND THE WAYS TO INCREASE THE EFFICIENCY :

    Understand all the efficiency-increasing tricks, even at a cost of code complexity ( a few SLOC-s are easy to show in school-books, yet sacrificing both the efficiency and the performance - and those two losses are your main enemy in any fight for sustainable performance throughout the scaling ( of either the problem size or the iteration depths, or when growing both of them at the same time ) ).

    Some categories of the real-world costs from A ) have dramatically changed the limits of the theoretically achievable speedups to be expected from going into some form of [PARALLEL] process orchestration ( here, making some parts of the code-execution get executed in the spawned sub-processes ). The initial view of this was first formulated by Dr. Gene Amdahl as early as 60+ years ago, and there were recently added two principal extensions: the process instantiation(s)-related setup + termination add-on costs ( extremely important in py2 always & in py3.5+ for MacOS and Windows ) and an atomicity-of-work, which will be discussed below.

    Overhead-strict re-formulation of the Amdahl's Law speedup S :

    S   = speedup which can be achieved with N processors
    s   = a proportion of a calculation, which is [SERIAL]
    1-s = a parallelizable portion, that may run  [PAR]
    N   = a number of processors ( CPU-cores ) actively participating on [PAR] processing
    
                   1
    S =  __________________________; where s, ( 1 - s ), N were defined above
                    ( 1 - s )            pSO:= [PAR]-Setup-Overhead     add-on cost/latency
         s  + pSO + _________ + pTO      pTO:= [PAR]-Terminate-Overhead add-on cost/latency
                        N               
    

    Overhead-strict and resources-aware re-formulation :

                               1                         where s, ( 1 - s ), N
    S =  ______________________________________________ ;      pSO, pTO
                       | ( 1 - s )             |               were defined above
         s  + pSO + max|  _________ , atomicP  |  + pTO        atomicP:= a unit of work,
                       |     N                 |                         further indivisible,
                                                                         a duration of an
                                                                         atomic-process-block
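
    To make the re-formulated expression above tangible, a minimal sketch ( with purely illustrative, not measured, numbers ) of the overhead-strict speedup S may read:

    def overhead_strict_speedup( s, N, pSO = 0.0, pTO = 0.0, atomicP = 0.0 ):
        #   s       ... the [SERIAL] fraction of the computation
        #   N       ... the number of CPU-cores working on the ( 1 - s ) fraction
        #   pSO/pTO ... the [PAR]-setup / [PAR]-terminate overheads ( as fractions of the original runtime )
        #   atomicP ... the further indivisible, atomic unit of [PAR] work ( dtto )
        return 1. / ( s + pSO + max( ( 1. - s ) / N, atomicP ) + pTO )

    #   an illustrative example : a 5 % [SERIAL] part on 8 cores,
    #                             with 10 % setup- and 2 % terminate-overheads
    print( overhead_strict_speedup( s = 0.05, N = 8, pSO = 0.10, pTO = 0.02 ) )
    #   ~ 3.5 x, not 8 x - the overheads, not the core-count, dominate the result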
    


    Prototype on the target CPU/RAM device with your python, scaled >> 1E+6

    Any simplified mock-up example will somehow skew your expectations about how the actual workloads will perform in-vivo. Underestimated RAM-allocations, not seen at small scales, may later surprise at scale, sometimes even throwing the operating system into sluggish states, swapping and thrashing. Some smarter tools ( numba.jit() ) may even analyse the code and short-cut some passages of code that will never be visited or that do not produce any result, so be warned that simplified examples may lead to surprising observations.

    from multiprocessing import Pool
    import numpy as np
    import os
    
    SCALE = int( 1E9 )
    STEP  = int( 1E1 )
    aLIST = np.random.random( ( 10**3, 10**4 ) ).tolist()
    
    #######################################################################################
    #   func() does some SCALE'd amount of work, yet
    #                                                passes almost zero bytes as parameters
    #                                                allocates nothing, but iterator
    #                                                returns one byte,
    #                                                invariant to any expensive inputs
    def func( x ):  
        for i in range( SCALE ):
            i**2
        return 1
    

    A few hints on making the strategy of scaling less overhead-expensive :

    #####################################################################################
    #   more_work_en_block() wraps some SCALE'd amount of work, sub-list specified
    def more_work_en_block( en_block = [ None, ] ):
        return [ func( nth_item ) for nth_item in en_block ]
    

    If you indeed must pass a big list, better pass a larger block and remote-iterate over its parts ( instead of paying the transfer costs for each and every item, many more times than when using sub_blocks; parameters get SER/DES-processed ( ~ the costs of pickle.dumps() + pickle.loads() ) [per-each-call], again at an add-on cost, which decreases the resulting efficiency and worsens the overheads part of the extended, overhead-strict Amdahl's Law ).

    #####################################################################################
    #   some_work_en_block() wraps some SCALE'd amount of work, tuple-specified
    def some_work_en_block( sub_block = ( [ None, ], 0, 1 ) ):
        return more_work_en_block( en_block = sub_block[0][sub_block[1]:sub_block[2]] )
    


    Appropriately tune the number of process instances to spawn :

    aMaxNumOfProcessesThatMakesSenseToSPAWN = len( os.sched_getaffinity( 0 ) ) # never more

    with Pool( aMaxNumOfProcessesThatMakesSenseToSPAWN ) as p:
         for _ in p.imap_unordered( some_work_en_block,    # the tuple-specified wrapper from above
                                    [ ( aLIST,
                                        start,
                                        start + STEP
                                        )
                                      for start in range( 0, len( aLIST ), STEP ) ]
                                    ):
              pass                                          # consume the lazy iterator, so that all
                                                            # work is finished before the Pool closes

    Last but not least, expect immense performance boosts from a smart use of numpy vectorised code, best without repetitive passing of static, pre-copied BLOBs ( copied during the process instantiation(s), and thus paid as a reasonably scaled, here un-avoidable, cost thereof ), used inside the code as read-only data in a vectorised ( CPU-very-efficient ) fashion, without passing the same data via parameter-passing. Some examples of how one can achieve a ~ +500 x speedup may be read here or here, about a ~ +400 x speedup, or about a case of just about a ~ +100 x speedup, with some examples of problem-isolation testing scenarios.
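
    A minimal sketch of that last point ( an illustration under assumptions, not the original answer's code; the names _SHARED and row_work() are made up here ): pre-plant the large, read-only block into each worker exactly once, via the Pool initializer, and pass only tiny integer indices afterwards:

    from multiprocessing import Pool
    import numpy as np

    _SHARED = None                                  # a per-worker, module-level slot

    def _init_worker( big_block ):
        global _SHARED
        _SHARED = big_block                         # paid once per worker, not per call

    def row_work( i ):                              # only a small int travels per call
        return float( np.sum( _SHARED[i] ** 2 ) )   # numpy-vectorised work on a read-only row

    if __name__ == '__main__':
        big = np.random.random( ( 10**3, 10**4 ) )  # keep it a numpy array, not a list-of-lists
        with Pool( 8, initializer = _init_worker, initargs = ( big, ) ) as p:
             results = p.map( row_work, range( big.shape[0] ) )
        print( len( results ) )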

    Anyway, the closer the mock-up code is to your actual workloads, the more sense the benchmarks will make ( at scale & in production ).


    Good luck on exploring the World, as it is,
    not as a dream if it were different,
    not as a wish it were different or that we would like it to be

    :o)

    Facts and Science matter - both + together

    Records of Evidence are the core steps forward to achieve as high a performance as possible,
    not any Product Marketing,
    not any Evangelisation Clans wars,
    not any Blog-posts' chatter

    At least don't say you were not warned

    :o)

