Training sklearn models in parallel with joblib blocks the process


Problem Description


As suggested in this answer, I tried to use joblib to train multiple scikit-learn models in parallel.

import joblib
import numpy
from sklearn import tree, linear_model

classifierParams = {
                "Decision Tree": (tree.DecisionTreeClassifier, {}),''
                "Logistic Regression" : (linear_model.LogisticRegression, {})
}


XTrain = numpy.array([[1,2,3],[4,5,6]])
yTrain = numpy.array([0, 1])

def trainModel(name, clazz, params, XTrain, yTrain):
    print("training ", name)
    model = clazz(**params)
    model.fit(XTrain, yTrain)
    return model


joblib.Parallel(n_jobs=4)(joblib.delayed(trainModel)(name, clazz, params, XTrain, yTrain) for (name, (clazz, params)) in classifierParams.items())

However, the call in the last line takes ages without utilizing the CPU; in fact, it just seems to block and never returns anything. What is my mistake?

A test with a very small amount of data in XTrain suggests that copying of the numpy array across multiple processes is not the reason for the delay.

Solution

Production-grade Machine Learning pipelines have CPU utilisations more like this, almost 24 / 7 / 365:

Check both the CPU% and also other resources' state figures across this node.


What is my mistake?

Having read your profile was a stunning moment, Sir:

I am a computer scientist specializing on algorithms and data analysis by training, and a generalist by nature. My skill set combines a strong scientific background with experience in software architecture and development, especially on solutions for the analysis of big data. I offer consulting and development services and I am looking for challenging projects in the area of data science.

The problem IS deeply determined by respecting elementary Computer Science + algorithmic rules.

The problem IS NOT demanding a strong scientific background, but common sense.

The problem IS NOT any especially Big Data issue, but it requires one to smell how things actually work.


Facts
or
Emotions? ... that's The Question! ( The tragedy of Hamlet, Prince of Denmark )


May I be honest? Let's prefer FACTS, always:

Step #1:
Never hire, or straight away fire, each and every Consultant who does not respect facts ( the answer referred to above did not suggest anything, let alone grant any promises ). Ignoring facts might be a "successful sin" in the PR / MARCOM / Advertisement / media businesses ( in case The Customer tolerates such dishonesty and/or manipulative habits ), but not in a scientifically fair, quantitative domain. This is unforgivable.

Step #2:
Never hire, or straight away fire, each and every Consultant who claims to have experience in software architecture, especially on solutions for ... big data, yet pays zero attention to the accumulated lump sum of all the add-on overhead costs that each of the respective elements of the system architecture will introduce, once the processing starts to be distributed across some pool of hardware and software resources. This is unforgivable.

Step #3:
Never hire, or straight away fire, each and every Consultant who turns passive-aggressive once facts do not fit her/his wishes and who starts to accuse other knowledgeable people, who have already delivered a helping hand, to rather "improve ( their ) communication skills" instead of learning from the mistake(s). Sure, skill may help to express the obvious mistakes in some other way, yet the gigantic mistakes will remain gigantic mistakes, and each and every scientist, being fair to her/his scientific title, should NEVER resort to attacking a helping colleague, but should rather start searching for the root cause of the mistakes, one after the other. This ---

@sascha ... May I suggest you take a little break from stackoverflow to cool off, work a little on your interpersonal communication skills

--- was nothing but a straight and intellectually unacceptable nasty foul to @sascha.


Next, the toys
The architecture, Resources and Process-scheduling facts that matter:

The imperative form of a syntax-constructor ignites an immense amount of activities to start:

joblib.Parallel( n_jobs = <N> )( joblib.delayed( <aFunction> )
                                               ( <anOrderedSetOfFunParameters> )
                                           for ( <anOrderedSetOfIteratorParams> )
                                           in    <anIterator>
                                 )

To at least guess what happens, a scientifically fair approach would be to test several representative cases, benchmark their actual execution, collect quantitatively supported facts, and draw a hypothesis about a model of behaviour and its principal dependencies on the CPU_core-count, on the RAM-size, on the <aFunction>-complexity and the resources-allocation envelopes, etc.

Test case A:

def a_NOP_FUN( aNeverConsumedPAR ):
    """                                                 __doc__
    The intent of this FUN() is indeed to do nothing at all,
                             so as to be able to benchmark
                             all the process-instantiation
                             add-on overhead costs.
    """
    pass

##############################################################
###  A NAIVE TEST BENCH
##############################################################
import joblib                                  # the test bench itself relies on joblib
from zmq import Stopwatch; aClk = Stopwatch()
JOBS_TO_SPAWN =  4         # TUNE:  1,  2,  4,   5,  10, ..
RUNS_TO_RUN   = 10         # TUNE: 10, 20, 50, 100, 200, 500, 1000, ..
try:
     aClk.start()
     joblib.Parallel(  n_jobs = JOBS_TO_SPAWN
                      )( joblib.delayed( a_NOP_FUN )
                                       ( aSoFunPAR )
                                   for ( aSoFunPAR )
                                   in  range( RUNS_TO_RUN )
                         )
except:
     pass
finally:
     try:
         _ = aClk.stop()
     except:
         _ = -1
         pass
print( "CLK:: {0:_>24d} [us] @{1: >3d} run{2: >5d} RUNS".format( _,
                                                                 JOBS_TO_SPAWN,
                                                                 RUNS_TO_RUN
                                                                 )
        )

Collect representatively enough data on this NOP-case over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN ]-cartesian-space DataPoints, so as to generate at least some first-hand experience of the actual system costs of launching intrinsically empty processes' overhead-only workloads, related to the imperatively instructed joblib.Parallel(...)( joblib.delayed(...) )-syntax constructor, which spawns just a few joblib-managed a_NOP_FUN() instances into the system-scheduler.
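
A minimal sketch of such a sweep - assuming joblib, the zmq Stopwatch and the a_NOP_FUN() from the naive test bench above - might look like this ( the grid values are just illustrative knobs to tune ):

import itertools
import joblib
from zmq import Stopwatch

aClk     = Stopwatch()
aResults = {}                                        # { ( n_jobs, runs ): [us] }

for JOBS_TO_SPAWN, RUNS_TO_RUN in itertools.product( (  1,  2,  4,   8 ),
                                                      ( 10, 50, 100, 500 )
                                                      ):
    aClk.start()
    joblib.Parallel( n_jobs = JOBS_TO_SPAWN
                     )( joblib.delayed( a_NOP_FUN )( aSoFunPAR )
                        for aSoFunPAR in range( RUNS_TO_RUN )
                        )
    aResults[( JOBS_TO_SPAWN, RUNS_TO_RUN )] = aClk.stop()

for ( nJobs, nRuns ), us in sorted( aResults.items() ):
    print( "CLK:: {0:_>15d} [us] @{1: >3d} jobs{2: >5d} RUNS".format( us, nJobs, nRuns ) )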

Let's also agree that all real-world problems, Machine Learning models included, are way more complex tools than the just-tested a_NOP_FUN(), while in both cases you have to pay the already benchmarked overhead costs ( even though here they were paid for getting literally zero product ).

Thus scientifically fair, rigorous work will proceed from this simplest-ever case, which already shows the benchmarked costs of all the associated setup-overheads ( the smallest-ever joblib.Parallel() penalty sine-qua-non ), forward into the direction where real-world algorithms live - best by next adding larger and larger "payload"-sizes into the testing loop:

Test-case B:

def a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR( aNeverConsumedPAR ):
    """                                                 __doc__
    The intent of this FUN() is to do nothing but
                             a MEM-allocation
                             so as to be able to benchmark
                             all the process-instantiation
                             add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports
    SIZE1D    = 1000                # here, feel free to be as keen as needed
                                    # ( NOTE: a 4-D float64 cube at SIZE1D = 1000
                                    #         weighs in at ~ 8 [TB] of RAM,
                                    #         so tune SIZE1D down to fit your node )
    aMemALLOC = np.zeros( ( SIZE1D, #       so as to set
                            SIZE1D, #       realistic ceilings
                            SIZE1D, #       as how big the "Big Data"
                            SIZE1D  #       may indeed grow into
                            ),
                          dtype = np.float64,
                          order = 'F'
                          )         # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    return aMemALLOC[2:3,3,4,5]

Again,
collect representatively enough quantitative data about the costs of actual remote-process MEM-allocations, by running a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR() over some reasonably wide landscape of SIZE1D-scaling,
again
over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN ]-cartesian-space DataPoints, so as to touch a new dimension in the performance scaling, under an extended black-box PROCESS-under-TEST experimentation inside the joblib.Parallel() tool, leaving its magic still left un-opened.
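
The allocator above ignores its parameter, so to sweep the SIZE1D-scaling from the outer loop one needs a tiny, hypothetical variant that receives the size explicitly - a sketch, re-using the aClk / JOBS_TO_SPAWN / RUNS_TO_RUN knobs from the naive test bench above:

def a_NOP_FUN_WITH_A_SIZED_MEM_ALLOCATOR( aSizePAR ):
    """                                                 __doc__
    A hypothetical variant of the MEM-allocator above,
    receiving SIZE1D via its ( otherwise ignored ) parameter,
    so that the SIZE1D-scaling can be swept from outside.
    """
    import numpy as np
    aMemALLOC = np.zeros( ( aSizePAR, aSizePAR, aSizePAR, aSizePAR ),
                          dtype = np.float64,
                          order = 'F'
                          )
    aMemALLOC[2,3,4,5] = 8.7654321
    return aMemALLOC[2:3,3,4,5]

for SIZE1D in ( 8, 16, 32, 64, 128 ):                # mind the O( SIZE1D**4 ) RAM footprint
    aClk.start()
    joblib.Parallel( n_jobs = JOBS_TO_SPAWN
                     )( joblib.delayed( a_NOP_FUN_WITH_A_SIZED_MEM_ALLOCATOR )( SIZE1D )
                        for _ in range( RUNS_TO_RUN )
                        )
    print( "SIZE1D = {0: >4d} took {1:_>15d} [us]".format( SIZE1D, aClk.stop() ) )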

Test-case C:

def a_NOP_FUN_WITH_SOME_MEM_DATAFLOW( aNeverConsumedPAR ):
    """                                                 __doc__
    The intent of this FUN() is to do nothing but
                             a MEM-allocation plus some Data MOVs
                             so as to be able to benchmark
                             all the process-instantiation + MEM OPs
                             add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports
    SIZE1D    = 1000                # here, feel free to be as keen as needed
    aMemALLOC = np.ones(  ( SIZE1D, #       so as to set
                            SIZE1D, #       realistic ceilings
                            SIZE1D, #       as how big the "Big Data"
                            SIZE1D  #       may indeed grow into
                            ),
                          dtype = np.float64,
                          order = 'F'
                          )         # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    aMemALLOC[:,:,:,:]*= 0.1234567
    aMemALLOC[:,3,4,:]+= aMemALLOC[4,5,:,:]
    aMemALLOC[2,:,4,:]+= aMemALLOC[:,5,6,:]
    aMemALLOC[3,3,:,:]+= aMemALLOC[:,:,6,7]
    aMemALLOC[:,3,:,5]+= aMemALLOC[4,:,:,7]

    return aMemALLOC[2:3,3,4,5]


Bang, The Architecture related issues start to slowly show up:

One may soon notice that not only the static sizing matters, but also the MEM-transport BANDWIDTH ( hardware-hardwired ) will start to cause problems, as moving data from/to CPU into/from MEM costs well ~ 100 .. 300 [ns], way more than any smart shuffling of a few bytes "inside" the { CPU_core_private | CPU_core_shared | CPU_die_shared }-cache hierarchy-architecture alone ( and any non-local NUMA-transfer exhibits the same order of magnitude of add-on pain ).
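
A crude way to make this visible from plain Python - not a latency measurement, just the per-element cost of one and the same streaming operation on a cache-resident vs. a RAM-resident numpy array ( the absolute figures will differ per machine ):

import time
import numpy as np

def per_element_ns( aVectorLengthInFloat64s, aRepeats = 20 ):
    """Average per-element cost of a streaming read-modify-write pass."""
    x  = np.ones( aVectorLengthInFloat64s, dtype = np.float64 )
    t0 = time.perf_counter()
    for _ in range( aRepeats ):
        x *= 1.0000001                               # touches every element once per pass
    return ( time.perf_counter() - t0 ) / aRepeats / aVectorLengthInFloat64s * 1e9

print( "cache-resident ( ~256 [kB] ): ~{0:6.2f} [ns/element]".format( per_element_ns(       32 * 1024 ) ) )
print( "RAM-resident   ( ~256 [MB] ): ~{0:6.2f} [ns/element]".format( per_element_ns( 32 * 1024 * 1024 ) ) )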

All the above Test-Cases have not asked for much effort from the CPU yet

So let's start to burn the oil!

If all above was fine for starting to smell how the things under the hood actually work, this will grow to become ugly and dirty.

Test-case D:

def a_CPU_1_CORE_BURNER_FUN( aNeverConsumedPAR ):
    """                                                 __doc__
    The intent of this FUN() is to do nothing but
                             add some CPU-load
                             to a MEM-allocation plus some Data MOVs
                             so as to be able to benchmark
                             all the process-instantiation + MEM OPs
                             add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports
    SIZE1D    = 1000                # here, feel free to be as keen as needed
    aMemALLOC = np.ones(  ( SIZE1D, #       so as to set
                            SIZE1D, #       realistic ceilings
                            SIZE1D, #       as how big the "Big Data"
                            SIZE1D  #       may indeed grow into
                            ),
                          dtype = np.float64,
                          order = 'F'
                          )         # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    aMemALLOC[:,:,:,:]*= 0.1234567
    aMemALLOC[:,3,4,:]+= aMemALLOC[4,5,:,:]
    aMemALLOC[2,:,4,:]+= aMemALLOC[:,5,6,:]
    aMemALLOC[3,3,:,:]+= aMemALLOC[:,:,6,7]
    aMemALLOC[:,3,:,5]+= aMemALLOC[4,:,:,7]

    aMemALLOC[:,:,:,:]+= int( [ np.math.factorial( x + int( aMemALLOC[-1,-1,-1,-1] ) )
                                               for x in range( 1005 )
                                ][-1]
                            / [ np.math.factorial( y + int( aMemALLOC[ 1, 1, 1, 1] ) )
                                               for y in range( 1000 )
                                ][-1]
                              )                 # all four indices + int(), so that
                                                # factorial() receives a plain integer

    return aMemALLOC[2:3,3,4,5]

Still nothing extraordinary, compared to the common grade of payloads in the domain of Machine Learning's many-D-spaces, where all dimensions of the { aMlModelSPACE, aSetOfHyperParameterSPACE, aDataSET }-state-space impact the scope of the processing required ( some having O( N ), some other O( N.logN ) complexity ), and where, almost immediately, a well-engineered implementation harnesses more than just one CPU_core even on a single "job" being run.
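
One practical consequence for the code in the question: if each of the n_jobs spawned processes itself fires up a multi-threaded BLAS / OpenMP pool, the CPU gets over-subscribed right away. A sketch of keeping that coordinated - assuming the threadpoolctl package is installed ( the trainModel() name is the one from the question, modified here purely for illustration ):

import os
# must be set before numpy / scipy / sklearn get imported inside a worker process
os.environ["OMP_NUM_THREADS"]      = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"]      = "1"

from threadpoolctl import threadpool_limits         # assumption: threadpoolctl is available

def trainModel( name, clazz, params, XTrain, yTrain ):
    print( "training ", name )
    with threadpool_limits( limits = 1 ):            # one BLAS/OpenMP thread per joblib worker
        model = clazz( **params )
        model.fit( XTrain, yTrain )
    return model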

An indeed nasty smell starts once naive ( read: resources-usage un-coordinated ) CPU-load mixtures get down the road, and when the task-related CPU-loads start to get mixed with naive ( read: resources-usage un-coordinated ) O/S-scheduler processes that happen to fight for common ( resorting to just a naive shared-use policy ) resources - i.e. MEM ( introducing SWAPs as HELL ), CPU ( introducing cache-misses and MEM re-fetches ( yes, with SWAP penalties added ) ), not to speak about paying any kind of more than ~ 15+ [ms] latency-fees, if one forgets and lets a process touch a fileIO-( 5 (!)-orders-of-magnitude slower + shared + being a pure-[SERIAL], by nature )-device. No prayers help here ( SSDs included, just a few orders of magnitude less, but still a hell to share & running the device incredibly fast into its wear + tear grave ).


What happens, if all the spawned processes do not fit into the physical RAM?

Virtual memory paging and swaps start to literally deteriorate the rest of the so far somehow "just"-by-coincidence-( read: weakly-co-ordinated )-[CONCURRENTLY]-scheduled processing ( read: further-decreased individual PROCESS-under-TEST performance ).
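
A rough pre-flight sanity check is possible with psutil ( the per-process footprint below is of course just an illustrative guess, not a measured figure ):

import psutil
import numpy

XTrain = numpy.array( [[1,2,3],[4,5,6]] )            # the question's training data, re-used here

anAvailableRAM            = psutil.virtual_memory().available            # [B] free right now
anEstimatedPerProcessRAM  = 256 * 1024**2 + 2 * XTrain.nbytes            # interpreter + libs + data copies ( a guess )
N_JOBS_THAT_MAY_STILL_FIT = max( 1, int( anAvailableRAM // anEstimatedPerProcessRAM ) )

print( "n_jobs <= {0:d} ought to avoid swapping on this node".format( N_JOBS_THAT_MAY_STILL_FIT ) )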


Things may very soon go and wreak havoc, if not kept under due control & supervision.

Again - facts matter: a light-weight resources-monitor class may help:

aResRECORDER.show_usage_since0() method returns:

ResCONSUMED[T0+   166036.311 (           0.000000)]
user=               2475.15
nice=                  0.36
iowait=                0.29
irq=                   0.00
softirq=               8.32
stolen_from_VM=       26.95
guest_VM_served=       0.00
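
The monitor class itself was not posted; a minimal sketch of such a light-weight recorder - assuming psutil on a Linux box, where the steal / guest counters exist - might look like:

import time
import psutil                                        # assumption: psutil is installed

class ResRECORDER:
    """A light-weight CPU-times recorder, reporting deltas since instantiation."""
    def __init__( self ):
        self.T0    = time.time()
        self.cpuT0 = psutil.cpu_times()              # cumulative [s] since boot

    def show_usage_since0( self ):
        aNow = psutil.cpu_times()
        print( "ResCONSUMED[T0+ {0:12.3f}]".format(  time.time() - self.T0 ) )
        print( "user=            {0:10.2f}".format( aNow.user    - self.cpuT0.user ) )
        print( "nice=            {0:10.2f}".format( aNow.nice    - self.cpuT0.nice ) )
        print( "iowait=          {0:10.2f}".format( aNow.iowait  - self.cpuT0.iowait ) )
        print( "irq=             {0:10.2f}".format( aNow.irq     - self.cpuT0.irq ) )
        print( "softirq=         {0:10.2f}".format( aNow.softirq - self.cpuT0.softirq ) )
        print( "stolen_from_VM=  {0:10.2f}".format( aNow.steal   - self.cpuT0.steal ) )
        print( "guest_VM_served= {0:10.2f}".format( aNow.guest   - self.cpuT0.guest ) )

aResRECORDER = ResRECORDER()
# ... run the PROCESS-under-TEST ...
aResRECORDER.show_usage_since0()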

Similarly, a somewhat more richly constructed resources-monitor may report a wider O/S context, so as to show where additional resource stealing / contention / race-conditions deteriorate the actually achieved process-flow:

>>> psutil.Process( os.getpid()
                    ).memory_full_info()
                                      ( rss          =       9428992,
                                        vms          =     158584832,
                                        shared       =       3297280,
                                        text         =       2322432,
                                        lib          =             0,
                                        data         =       5877760,
                                        dirty        =             0
                                        )
           .virtual_memory()
                          (             total        =   25111490560,
                                        available    =   24661327872,
                                        percent      =             1.8,
                                        used         =    1569603584,
                                        free         =   23541886976,
                                        active       =     579739648,
                                        inactive     =     588615680,
                                        buffers      =             0,
                                        cached       =    1119440896
                                        )
           .swap_memory()
                       (                total        =    8455712768,
                                        used         =     967577600,
                                        free         =    7488135168,
                                        percent      =            11.4,
                                        sin          =  500625227776,
                                        sout         =  370585448448
                                        )

Wed Oct 19 03:26:06 2017
        166.445 ___VMS______________Virtual Memory Size  MB
         10.406 ___RES____Resident Set Size non-swapped  MB
          2.215 ___TRS________Code in Text Resident Set  MB
         14.738 ___DRS________________Data Resident Set  MB
          3.305 ___SHR_______________Potentially Shared  MB
          0.000 ___LIB_______________Shared Memory Size  MB
                __________________Number of dirty pages           0x


Last but not least, why can one easily pay more than one earns in return?

Besides the gradually built record of evidence of how the real-world system-deployment add-on overheads accumulate the costs, the recently re-formulated Amdahl's Law, extended so as to cover both the add-on overhead-costs and the "process-atomicity" of the further indivisible parts' sizing, defines a maximum add-on-costs threshold that might be reasonably paid, if some distributed processing is to provide any computing-process speedup above >= 1.00.
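
A hedged, minimal numeric sketch of that logic ( assuming the overhead-strict form S = 1 / ( ( 1 - p ) + pSO + pTO + p / N ), where p is the parallelisable fraction of the original [SERIAL] runtime and pSO / pTO are the add-on setup / termination overheads, expressed as fractions of that same runtime ):

def overhead_strict_speedup( p, N, pSO, pTO ):
    """Speedup under the ( assumed ) overhead-strict re-formulation of Amdahl's Law."""
    return 1.0 / ( ( 1.0 - p ) + pSO + pTO + p / float( N ) )

# toy figures: 95 % of the work is parallelisable, yet spawning + SER/DES of the
# parameters costs 20 % and collecting the results another 10 % of the original runtime
for N in ( 1, 2, 4, 8, 16, 32 ):
    print( "N = {0: >3d} -> S = {1:5.2f} x".format( N,
                                                    overhead_strict_speedup( p   = 0.95,
                                                                             N   = N,
                                                                             pSO = 0.20,
                                                                             pTO = 0.10 ) ) )

With these toy overheads no amount of processors can push the speedup above 1 / ( 0.05 + 0.30 ) ~ 2.86 x, and for a small N it even stays below 1.00 x, i.e. slower than the pure-[SERIAL] run - exactly the situation that looks as if joblib.Parallel() were "blocking" without delivering anything.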

Dis-obeying the explicit logic of the re-formulated Amdahl's Law causes a process to proceed worse than if it had been processed in a pure-[SERIAL] process-scheduling ( and sometimes the results of poor design and/or operations practices may look as if it were a case where the joblib.Parallel()( joblib.delayed(...) ) method "blocks the process" ).
