Dask For Loop In Parallel
I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions but none fit my condition, which is extremely basic.
First, is this the correct way to run a for-loop in parallel?
%%time
from dask import delayed

list_names = ['a', 'b', 'c', 'd']
keep_return = []

@delayed
def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1

for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)
total = delayed(sum)(keep_return)
total.compute()
This produced
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s
If I run this in serial,
%%time
list_names = ['a', 'b', 'c', 'd']
keep_return = []

def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1

for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)
it is actually faster.
passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s
I have seen it stated that Dask adds a small amount of overhead, but here the parallel version is no faster than the serial one. That seems like more than a small amount of overhead, no?
My actual for loop involves heavier computation where I build a model for various targets.
This computation

for i in range(...):
    pass

is bound by the GIL. You will want to use the multiprocessing or dask.distributed Dask backends rather than the default threading backend. I recommend the following:
total.compute(scheduler='multiprocessing')
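Putting the recommendation together, here is a minimal runnable sketch of the pattern (newer Dask versions spell this scheduler 'processes'; the loop size is shortened from the question's 10**9 purely for brevity):

```python
from dask import delayed

@delayed
def loop_dummy(target):
    # A pure-Python loop holds the GIL the whole time, so the threaded
    # scheduler cannot overlap these tasks; separate processes can.
    for _ in range(1_000_000):
        pass
    return 1

keep_return = [loop_dummy(name) for name in ['a', 'b', 'c', 'd']]
total = delayed(sum)(keep_return)

# Run the four tasks in worker processes, sidestepping the GIL.
result = total.compute(scheduler='processes')
print(result)  # 4
```

Note that the process scheduler must serialize the function and its inputs to the workers, so it only pays off when each task does substantially more work than that serialization cost.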
However, if your actual computation is mostly NumPy/Pandas/scikit-learn/other numeric package code, then the default threading backend is probably the right choice.
More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html