Joblib Parallel: multiple CPUs slower than a single CPU

Problem description

I've just started using the Joblib module and I'm trying to understand how the Parallel function works. Below is an example where parallelizing leads to longer runtimes, but I don't understand why. My runtime on 1 CPU was 51 sec vs. 217 sec on 2 CPUs.

My assumption was that running the loop in parallel would copy lists a and b to each processor. Then dispatch item_n to one cpu and item_n+1 to the other cpu, execute the function and then write the results back to a list (in order). Then grab the next 2 items and so on. I'm obviously missing something.

Is this a poor example or use of joblib? Did I simply structure the code wrong?

Here is the example:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed

## Create pairs of points for line segments
a = zip(np.random.rand(5000,2),np.random.rand(5000,2))

b = zip(np.random.rand(300,2),np.random.rand(300,2))

## Check if one line segment contains another. 
def check_paths(path, paths):
    for other_path in paths:
        res='no cross'
        chck = Path(other_path)
        if chck.contains_path(path)==1:
            res= 'cross'
            break
    return res

res = Parallel(n_jobs=2) (delayed(check_paths) (Path(points), a) for points in b)

Answer

In short: I cannot reproduce your problem. If you are on Windows you should use a guard for your main loop (see the documentation of joblib.Parallel). The only problem I see is a lot of data-copying overhead, but your numbers seem too extreme to be caused by that.
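A minimal sketch of that guard (the function and job count here are only illustrative): everything that starts workers goes below the __main__ check, so a worker process importing the module does not execute it again.

from joblib import Parallel, delayed

def square(x):
    return x * x

if __name__ == '__main__':
    # Without this guard, on Windows each worker process re-imports the module
    # and would try to spawn its own workers instead of just running square().
    results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
    print(results)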

In long, here are my timings with your code:

On my i7 3770k (4 cores, 8 threads) I get the following results for different n_jobs:

For-loop: Finished in 33.8521318436 sec
n_jobs=1: Finished in 33.5527760983 sec
n_jobs=2: Finished in 18.9543449879 sec
n_jobs=3: Finished in 13.4856410027 sec
n_jobs=4: Finished in 15.0832719803 sec
n_jobs=5: Finished in 14.7227740288 sec
n_jobs=6: Finished in 15.6106669903 sec

So there is a gain in using multiple processes. However, although I have four cores, the gain already saturates at three processes. So I guess the execution time is actually limited by memory access rather than processor time.

You should notice that the arguments for each single loop entry are copied to the process executing it. This means you copy a for each element in b. That is inefficient. So instead access the global a. (Parallel will fork the process, copying all global variables to the newly spawned processes, so a is accessible.) This gives me the following code (with timing and the main-loop guard, as the documentation of joblib recommends):

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another. 

def check_paths(path):
    for other_path in a:
        res='no cross'
        chck = Path(other_path)
        if chck.contains_path(path)==1:
            res= 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
    b = zip(np.random.rand(300,2),np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1])) (delayed(check_paths) (Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time()-now , "sec"

Timing results:

 n_jobs=1: Finished in 34.2845709324 sec
 n_jobs=2: Finished in 16.6254048347 sec
 n_jobs=3: Finished in 11.219119072 sec
 n_jobs=4: Finished in 8.61683392525 sec
 n_jobs=5: Finished in 8.51907801628 sec
 n_jobs=6: Finished in 8.21842098236 sec
 n_jobs=7: Finished in 8.21816396713 sec
 n_jobs=8: Finished in 7.81841087341 sec

The saturation point has now moved slightly, to n_jobs=4, which is the value to be expected.

check_paths does several redundant calculations that can easily be eliminated. Firstly, for all elements in other_paths=a, the line Path(...) is executed in every call; precalculate that. Secondly, the string res='no cross' is written in each loop iteration, although it may only change once (followed by a break and return); move that line in front of the loop. Then the code looks like this:

import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another. 

def check_paths(path):
    #global a
    #print(path, a[:10])
    res='no cross'
    for other_path in a:
        if other_path.contains_path(path)==1:
            res= 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2),np.random.rand(5000,2))
    a = [Path(x) for x in a]

    b = zip(np.random.rand(300,2),np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1])) (delayed(check_paths) (Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time()-now , "sec"

Timings:

n_jobs=1: Finished in 5.33742594719 sec
n_jobs=2: Finished in 2.70858597755 sec
n_jobs=3: Finished in 1.80810618401 sec
n_jobs=4: Finished in 1.40814709663 sec
n_jobs=5: Finished in 1.50854086876 sec
n_jobs=6: Finished in 1.50901818275 sec
n_jobs=7: Finished in 1.51030707359 sec
n_jobs=8: Finished in 1.51062297821 sec

A side note on your code, although I haven't really followed its purpose as this was unrelated to your question: contains_path will only return True if this path completely contains the given path (see the documentation). Therefore your function will basically always return no cross given the random input.
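A small illustration of that behaviour (the coordinates below are made up for the example): containment is what counts, not intersection.

from matplotlib.path import Path

outer = Path([(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)])   # big square
inner = Path([(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)])   # small square inside it
crossing = Path([(-1, 2), (5, 2)])                        # segment cutting through the big square

print(outer.contains_path(inner))     # True: the small square lies entirely inside outer
print(outer.contains_path(crossing))  # False: crossing the boundary is not containment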
