Joblib Parallel multiple CPUs slower than a single CPU
Question
我刚刚开始使用Joblib模块,并且试图了解Parallel函数的工作方式.下面是并行化导致更长的运行时间的示例,但我不明白为什么.我在1 cpu上的运行时间为51秒,而2 cpu上的运行时间为217秒.
I've just started using the Joblib module and I'm trying to understand how the Parallel function works. Below is an example of where parallelizing leads to longer runtimes but I don't understand why. My runtime on 1 cpu was 51 sec vs. 217 secs on 2 cpu.
My assumption was that running the loop in parallel would copy lists a and b to each processor, then dispatch item_n to one CPU and item_n+1 to the other, execute the function, write the results back to a list (in order), then grab the next two items, and so on. I'm obviously missing something.
Is this a poor example or use of joblib? Did I simply structure the code wrong?
Here is the example:
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed

## Create pairs of points for line segments
a = zip(np.random.rand(5000,2), np.random.rand(5000,2))
b = zip(np.random.rand(300,2), np.random.rand(300,2))

## Check if one line segment contains another.
def check_paths(path, paths):
    for other_path in paths:
        res = 'no cross'
        chck = Path(other_path)
        if chck.contains_path(path) == 1:
            res = 'cross'
            break
    return res

res = Parallel(n_jobs=2)(delayed(check_paths)(Path(points), a) for points in b)
Answer
In short: I cannot reproduce your problem. If you are on Windows you should use a guard for your main loop, as shown in the documentation of joblib.Parallel. The only problem I see is a lot of data-copying overhead, but your numbers seem too extreme to be caused by that.
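The guard matters because, on Windows, child processes re-import the main module instead of forking it; without the guard, the process-launching code would run again inside every worker. A minimal sketch of the pattern using the standard multiprocessing module (the same rule applies to joblib.Parallel):

```python
import multiprocessing as mp

def square(x):
    # Worker functions must be defined at module level so that
    # spawned child processes can import them.
    return x * x

if __name__ == '__main__':
    # Process-launching code lives under this guard: on Windows the
    # module is re-imported in each child, and only the guard keeps
    # the pool from being created recursively.
    with mp.Pool(2) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```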
At length, here are my timings with your code. On my i7 3770k (4 cores, 8 threads) I get the following results for different values of n_jobs:
For-loop: Finished in 33.8521318436 sec
n_jobs=1: Finished in 33.5527760983 sec
n_jobs=2: Finished in 18.9543449879 sec
n_jobs=3: Finished in 13.4856410027 sec
n_jobs=4: Finished in 15.0832719803 sec
n_jobs=5: Finished in 14.7227740288 sec
n_jobs=6: Finished in 15.6106669903 sec
So there is a gain in using multiple processes. However, although I have four cores, the gain already saturates at three processes. So I guess the execution time is actually limited by memory access rather than processor time.
You should notice that the arguments for each single loop entry are copied to the process executing it. This means you copy a for each element in b. That is inefficient. Instead, access the global a. (Parallel will fork the process, copying all global variables to the newly spawned processes, so a is accessible.) This gives the following code (with timing and the main-loop guard, as the documentation of joblib recommends):
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another.
def check_paths(path):
    for other_path in a:
        res = 'no cross'
        chck = Path(other_path)
        if chck.contains_path(path) == 1:
            res = 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2), np.random.rand(5000,2))
    b = zip(np.random.rand(300,2), np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1]))(delayed(check_paths)(Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time() - now, "sec"
With timings:
n_jobs=1: Finished in 34.2845709324 sec
n_jobs=2: Finished in 16.6254048347 sec
n_jobs=3: Finished in 11.219119072 sec
n_jobs=4: Finished in 8.61683392525 sec
n_jobs=5: Finished in 8.51907801628 sec
n_jobs=6: Finished in 8.21842098236 sec
n_jobs=7: Finished in 8.21816396713 sec
n_jobs=8: Finished in 7.81841087341 sec
The saturation point now moves slightly, to n_jobs=4, which is the expected value.
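Since the saturation point tracks the core count, it is often convenient to let joblib pick the number of workers: n_jobs=-1 means "use all cores", and negative values in general mean cpu_count + 1 + n_jobs. A small sketch of that convention (my own re-implementation for illustration, not joblib's actual code), using the core count reported by os.cpu_count():

```python
import os

def effective_n_jobs(n_jobs, n_cores=None):
    # Re-implementation of joblib's documented convention for
    # negative n_jobs: -1 means all cores, -2 all but one, etc.
    if n_cores is None:
        n_cores = os.cpu_count()
    if n_jobs < 0:
        return max(1, n_cores + 1 + n_jobs)
    return n_jobs

print(effective_n_jobs(-1, n_cores=4))  # 4: all cores
print(effective_n_jobs(-2, n_cores=4))  # 3: all but one
```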
check_paths does several redundant calculations that can easily be eliminated. Firstly, for all elements in other_paths=a, the line Path(...) is executed in every call; precalculate that. Secondly, the string res='no cross' is written in each loop iteration, although it may only change once (followed by a break and return); move that line in front of the loop. The code then looks like this:
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another.
def check_paths(path):
    #global a
    #print(path, a[:10])
    res = 'no cross'
    for other_path in a:
        if other_path.contains_path(path) == 1:
            res = 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2), np.random.rand(5000,2))
    a = [Path(x) for x in a]
    b = zip(np.random.rand(300,2), np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1]))(delayed(check_paths)(Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time() - now, "sec"
With timings:
n_jobs=1: Finished in 5.33742594719 sec
n_jobs=2: Finished in 2.70858597755 sec
n_jobs=3: Finished in 1.80810618401 sec
n_jobs=4: Finished in 1.40814709663 sec
n_jobs=5: Finished in 1.50854086876 sec
n_jobs=6: Finished in 1.50901818275 sec
n_jobs=7: Finished in 1.51030707359 sec
n_jobs=8: Finished in 1.51062297821 sec
A side note on your code, although I haven't really followed its purpose since it is unrelated to your question: contains_path will only return True if this path completely contains the given path (see the documentation). Therefore, given the random input, your function will basically always return no cross.
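If the goal is to detect whether two segments actually cross, matplotlib's Path.intersects_path is the closer fit; for plain two-point segments, the classic orientation test also works without matplotlib. A minimal sketch (detects proper crossings only; touching or collinear segments are deliberately not counted):

```python
def orientation(p, q, r):
    # Sign of the cross product (q - p) x (r - p):
    # +1 counter-clockwise, -1 clockwise, 0 collinear.
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(a1, a2, b1, b2):
    o1 = orientation(a1, a2, b1)
    o2 = orientation(a1, a2, b2)
    o3 = orientation(b1, b2, a1)
    o4 = orientation(b1, b2, a2)
    # Proper crossing: each segment's endpoints lie strictly on
    # opposite sides of the other segment's supporting line.
    return o1 * o2 < 0 and o3 * o4 < 0

print(segments_cross((0, 0), (1, 1), (0, 1), (1, 0)))  # True
print(segments_cross((0, 0), (1, 0), (0, 1), (1, 1)))  # False
```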