Joblib并行多个CPU比单个慢 [英] Joblib Parallel multiple cpu's slower than single

查看:141
本文介绍了Joblib并行多个CPU比单个慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我刚刚开始使用Joblib模块,并且试图了解Parallel函数的工作方式.下面是并行化导致更长的运行时间的示例,但我不明白为什么.我在1 cpu上的运行时间为51秒,而2 cpu上的运行时间为217秒.

I've just started using the Joblib module and I'm trying to understand how the Parallel function works. Below is an example of where parallelizing leads to longer runtimes but I don't understand why. My runtime on 1 cpu was 51 sec vs. 217 secs on 2 cpu.

我的假设是并行运行循环会将列表a和b复制到每个处理器.然后将item_n分配给一个cpu,将item_n + 1分配给另一个cpu,执行该函数,然后将结果写回到一个列表中(按顺序).然后抓住接下来的2个项目,依此类推.我显然缺少了一些东西.

My assumption was that running the loop in parallel would copy lists a and b to each processor. Then dispatch item_n to one cpu and item_n+1 to the other cpu, execute the function and then write the results back to a list (in order). Then grab the next 2 items and so on. I'm obviously missing something.

这是一个不好的例子还是对joblib的使用?我只是简单地将代码的结构错误了吗?

Is this a poor example or use of joblib? Did I simply structure the code wrong?

这里是示例:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed

## Create pairs of points for line segments
a = zip(np.random.rand(5000,2),np.random.rand(5000,2))

b = zip(np.random.rand(300,2),np.random.rand(300,2))

## Check if one line segment contains another. 
def check_paths(path, paths):
    for other_path in paths:
        res='no cross'
        chck = Path(other_path)
        if chck.contains_path(path)==1:
            res= 'cross'
            break
    return res

res = Parallel(n_jobs=2) (delayed(check_paths) (Path(points), a) for points in b)

推荐答案

简而言之:我无法重现您的问题.如果您使用的是Windows,则应在主循环中使用保护器: joblib.Parallel 的文档.我看到的唯一问题是大量的数据复制开销,但是您的数字似乎是不现实的.

In short: I cannot reproduce your problem. If you are on Windows you should use a protector for your main loop: documentation of joblib.Parallel. The only problem I see is much data copying overhead, but your numbers seem unrealistic to be caused by that.

很长一段时间,这是我对您的代码的计时:

In long, here are my timings with your code:

在我的i7 3770k(4核,8个线程)上,对于不同的n_jobs我得到以下结果:

On my i7 3770k (4 cores, 8 threads) I get the following results for different n_jobs:

For-loop: Finished in 33.8521318436 sec
n_jobs=1: Finished in 33.5527760983 sec
n_jobs=2: Finished in 18.9543449879 sec
n_jobs=3: Finished in 13.4856410027 sec
n_jobs=4: Finished in 15.0832719803 sec
n_jobs=5: Finished in 14.7227740288 sec
n_jobs=6: Finished in 15.6106669903 sec

因此使用多个过程会有所收获.但是,尽管我有四个核心,但增益在三个过程中已经达到饱和.因此,我认为执行时间实际上是受内存访问限制的,而不是处理器时间.

So there is a gain in using multiple processes. However although I have four cores the gain already saturates at three processes. So I guess the execution time is actually limited by memory access rather than processor time.

您应该注意到,每个单个循环条目的参数都被复制到执行它的进程中.这意味着您为b中的每个元素复制a.那是无效的.因此,请访问全局a. (Parallel将派生该进程,将所有全局变量复制到新生成的进程中,因此可以访问a).这给了我以下代码(如joblib的文档所建议的那样,带有定时和主循环保护:

You should notice that the arguments for each single loop entry are copied to the process executing it. This means you copy a for each element in b. That is ineffective. So instead access the global a. (Parallel will fork the process, copying all global variables to the newly spawned processes, so a is accessible). This gives me the following code (with timing and main loop guard as the documentation of joblib recommends:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another. 

def check_paths(path):
    for other_path in a:
        res='no cross'
        chck = Path(other_path)
        if chck.contains_path(path)==1:
            res= 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
    b = zip(np.random.rand(300,2),np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1])) (delayed(check_paths) (Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time()-now , "sec"

计时结果:

 n_jobs=1: Finished in 34.2845709324 sec
 n_jobs=2: Finished in 16.6254048347 sec
 n_jobs=3: Finished in 11.219119072 sec
 n_jobs=4: Finished in 8.61683392525 sec
 n_jobs=5: Finished in 8.51907801628 sec
 n_jobs=6: Finished in 8.21842098236 sec
 n_jobs=7: Finished in 8.21816396713 sec
 n_jobs=8: Finished in 7.81841087341 sec

饱和度现在略微移至n_jobs=4,这是预期值.

The saturation now slightly moved to n_jobs=4 which is the value to be expected.

check_paths进行了多个冗余计算,可以轻松消除这些计算.首先,对于other_paths=a中的所有元素,在每个调用中都执行Path(...)行.预先计算.其次,字符串res='no cross'是在每个循环回合中写入的,尽管它只能更改一次(紧接着是断点并返回).将线移到循环的前面.然后代码如下所示:

check_paths does several redundant calculations that can easily be eliminated. Firstly for all elements in other_paths=a the line Path(...) is executed in every call. Precalculate that. Secondly the string res='no cross' is written is each loop turn, although it may only change once (followed by a break and return). Move the line in front of the loop. Then the code looks like this:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another. 

def check_paths(path):
    #global a
    #print(path, a[:10])
    res='no cross'
    for other_path in a:
        if other_path.contains_path(path)==1:
            res= 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
    a = [Path(x) for x in a]

    b = zip(np.random.rand(300,2),np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1])) (delayed(check_paths) (Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time()-now , "sec"

有时间:

n_jobs=1: Finished in 5.33742594719 sec
n_jobs=2: Finished in 2.70858597755 sec
n_jobs=3: Finished in 1.80810618401 sec
n_jobs=4: Finished in 1.40814709663 sec
n_jobs=5: Finished in 1.50854086876 sec
n_jobs=6: Finished in 1.50901818275 sec
n_jobs=7: Finished in 1.51030707359 sec
n_jobs=8: Finished in 1.51062297821 sec

您代码上的一个侧节点,尽管我并没有真正遵循它的目的,因为这与您的问题无关,但contains_path只会返回True if this path completely contains the given path.(请参阅

A side node on your code, although I haven't really followed its purpose as this was unrelated to your question, contains_path will only return True if this path completely contains the given path. (see documentation). Therefore your function will basically always return no cross given the random input.

这篇关于Joblib并行多个CPU比单个慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆